Commit Graph

61904 Commits

Author SHA1 Message Date
Linus Torvalds
534121d289 io_uring-5.5-20191226

Merge tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Removal of now unused busy wqe list (Hillf)

 - Add cond_resched() to io-wq work processing (Hillf)

 - And then the series that I hinted at from last week, which removes
   the sqe from the io_kiocb and keeps all sqe handling on the prep
   side. This guarantees that an opcode can't do the wrong thing and
   read the sqe more than once. This is unchanged from last week, no
   issues have been observed with this in testing. Hence I really think
   we should fold this into 5.5.

* tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block:
  io-wq: add cond_resched() to worker thread
  io-wq: remove unused busy list from io_sqe
  io_uring: pass in 'sqe' to the prep handlers
  io_uring: standardize the prep methods
  io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler
  io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler
  io_uring: move all prep state for IORING_OP_CONNECT to prep handler
  io_uring: add and use struct io_rw for read/writes
  io_uring: use u64_to_user_ptr() consistently
2019-12-27 11:17:08 -08:00
Hillf Danton
fd1c4bc6e9 io-wq: add cond_resched() to worker thread
Reschedule the current IO worker to reduce the risk of it becoming
a CPU hog.

Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-24 09:14:29 -07:00
Hillf Danton
1f424e8bd1 io-wq: remove unused busy list from io_sqe
Commit e61df66c69 ("io-wq: ensure free/busy list browsing see all
items") added a list for io workers in addition to the free and busy
lists, not only making worker walk cleaner, but leaving the busy list
unused. Let's remove it.

Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-23 08:23:54 -07:00
Linus Torvalds
9efa3ed504 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Eric's s_inodes softlockup fixes + Jan's fix for recent regression
  from pipe rework"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: call fsnotify_sb_delete after evict_inodes
  fs: avoid softlockups in s_inodes iterators
  pipe: Fix bogus dereference in iov_iter_alignment()
2019-12-22 17:00:04 -08:00
Linus Torvalds
c601747175 Fixes for 5.5:
 - Minor documentation fixes
 - Fix a file corruption due to read racing with an insert range
 operation.
 - Fix log reservation overflows when allocating large rt extents
 - Fix a buffer log item flags check
 - Don't allow administrators to mount with sunit= options that will
 cause later xfs_repair complaints about the root directory being
 suspicious because the fs geometry appeared inconsistent
 - Fix a non-static helper that should have been static

Merge tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Fix a few bugs that could lead to corrupt files, fsck complaints, and
  filesystem crashes:

   - Minor documentation fixes

   - Fix a file corruption due to read racing with an insert range
     operation.

   - Fix log reservation overflows when allocating large rt extents

   - Fix a buffer log item flags check

   - Don't allow administrators to mount with sunit= options that will
     cause later xfs_repair complaints about the root directory being
     suspicious because the fs geometry appeared inconsistent

   - Fix a non-static helper that should have been static"

* tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Make the symbol 'xfs_rtalloc_log_count' static
  xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
  xfs: split the sunit parameter update into two parts
  xfs: refactor agfl length computation function
  libxfs: resync with the userspace libxfs
  xfs: use bitops interface for buf log item AIL flag check
  xfs: fix log reservation overflows when allocating large rt extents
  xfs: stabilize insert range start boundary to avoid COW writeback race
  xfs: fix Sphinx documentation warning
2019-12-22 10:59:06 -08:00
Linus Torvalds
a396560706 Ext4 bug fixes (including a regression fix) for 5.5

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Ext4 bug fixes, including a regression fix"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: clarify impact of 'commit' mount option
  ext4: fix unused-but-set-variable warning in ext4_add_entry()
  jbd2: fix kernel-doc notation warning
  ext4: use RCU API in debug_print_tree
  ext4: validate the debug_want_extra_isize mount option at parse time
  ext4: reserve revoke credits in __ext4_new_inode
  ext4: unlock on error in ext4_expand_extra_isize()
  ext4: optimize __ext4_check_dir_entry()
  ext4: check for directory entries too close to block end
  ext4: fix ext4_empty_dir() for directories with holes
2019-12-22 10:41:48 -08:00
Jan Stancek
0dd1e3773a pipe: fix empty pipe check in pipe_write()
LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb,
with read side observing empty pipe and sleeping and write
side running out of space and then sleeping as well. In this
scenario there are 5 writers and 1 reader.

Problem is that after pipe_write() reacquires pipe lock, it
re-checks for empty pipe with potentially stale 'head' and
doesn't wake up read side anymore. pipe->tail can advance
beyond 'head', because there are multiple writers.

Use pipe->head for empty pipe check after reacquiring lock
to observe current state.
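
The failure mode described above can be illustrated with a small userspace sketch. The struct and helper names are hypothetical and this is not the kernel's pipe code; it only shows why a `head` value sampled before the lock was dropped can misreport an empty ring once other writers and the reader have made progress.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the ring indices of a pipe. */
struct ring {
    unsigned int head;  /* next slot a writer fills */
    unsigned int tail;  /* next slot the reader drains */
};

/* The ring is empty when the reader has caught up with the writers. */
static bool ring_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}

/* Buggy variant: trusts a head value sampled before the lock was dropped. */
static bool was_empty_stale(const struct ring *r, unsigned int stale_head)
{
    return ring_empty(stale_head, r->tail);
}

/* Fixed variant: re-reads head after the lock has been reacquired. */
static bool was_empty_current(const struct ring *r)
{
    return ring_empty(r->head, r->tail);
}
```

If other writers moved `head` from 5 to 10 and the reader then drained up to `tail == 10`, the stale check reports "not empty" and the reader is never woken, while the current check reports "empty".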

Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour.
         Without patch it hung within a minute.

Fixes: 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup logic")
Reported-by: Rachel Sibley <rasibley@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-22 09:47:47 -08:00
Yunfeng Ye
68d7b2d838 ext4: fix unused-but-set-variable warning in ext4_add_entry()
A warning is reported when compiling with "-Wunused-but-set-variable":

fs/ext4/namei.c: In function ‘ext4_add_entry’:
fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used
[-Wunused-but-set-variable]
  struct ext4_sb_info *sbi;
                       ^~~
Fix this by moving the variable @sbi under CONFIG_UNICODE.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-21 21:00:53 -05:00
Linus Torvalds
f8f04d0859 io_uring-5.5-20191220

Merge tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Here's a set of fixes that should go into 5.5-rc3 for io_uring.

  This is bigger than I'd like it to be, mainly because we're fixing the
  case where an application reuses sqe data right after issue. This
  really must work, or it's confusing. With 5.5 we're flagging us as
  submit stable for the actual data, this must also be the case for
  SQEs.

  Honestly, I'd really like to add another series on top of this, since
  it cleans it up considerably and prevents any SQE reuse by design. I
  posted that here:

    https://lore.kernel.org/io-uring/20191220174742.7449-1-axboe@kernel.dk/T/#u

  and may still send it your way early next week once it's been looked
  at and had some more soak time (does pass all regression tests). With
  that series, we've unified the prep+issue handling, and only the prep
  phase even has access to the SQE.

  Anyway, outside of that, fixes in here for a few other issues that
  have been hit in testing or production"

* tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block:
  io_uring: io_wq_submit_work() should not touch req->rw
  io_uring: don't wait when under-submitting
  io_uring: warn about unhandled opcode
  io_uring: read opcode and user_data from SQE exactly once
  io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
  io_uring: make IORING_OP_CANCEL_ASYNC deferrable
  io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
  io_uring: make HARDLINK imply LINK
  io_uring: any deferred command must have stable sqe data
  io_uring: remove 'sqe' parameter to the OP helpers that take it
  io_uring: fix pre-prepped issue with force_nonblock == true
  io-wq: re-add io_wq_current_is_worker()
  io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG
  io_uring: fix stale comment and a few typos
2019-12-20 13:30:49 -08:00
Jens Axboe
3529d8c2b3 io_uring: pass in 'sqe' to the prep handlers
This moves the prep handlers outside of the opcode handlers, and allows
us to pass in the sqe directly. If the sqe is non-NULL, it means that
the request should be prepared for the first time.

With the opcode handlers not having access to the sqe at all, we are
guaranteed that the prep handler has set up the request fully by the
time we get there. As before, for opcodes that need to copy in more
data than the io_kiocb allows for, the io_async_ctx holds that info. If
a prep handler is invoked with req->io set, it must use that to retain
information for later.

Finally, we can remove io_kiocb->sqe as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 10:04:50 -07:00
Jens Axboe
06b76d44ba io_uring: standardize the prep methods
We currently have a mix of use cases. Most of the newer ones are pretty
uniform, but we have some older ones that use different calling
conventions. This is confusing.

For the opcodes that currently rely on the req->io->sqe copy saving
them from reuse, add a request type struct in the io_kiocb command
union to store the data they need.

Prepare for all opcodes having a standard prep method, so we can call
it in a uniform fashion and outside of the opcode handler. This is in
preparation for passing in the 'sqe' pointer, rather than storing it
in the io_kiocb. Once we have uniform prep handlers, we can leave all
the prep work to that part, and not even pass in the sqe to the opcode
handler. This ensures that we don't reuse sqe data inadvertently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 10:04:22 -07:00
Jens Axboe
26a61679f1 io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler
Add the count field to struct io_timeout, and ensure the prep handler
has read it. Timeout also needs an async context always, set it up
in the prep handler if we don't have one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:55:33 -07:00
Jens Axboe
e47293fdf9 io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler
Add struct io_sr_msg in our io_kiocb per-command union, and ensure that
the send/recvmsg prep handlers have grabbed what they need from the SQE
by the time prep is done.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:55:23 -07:00
Jens Axboe
3fbb51c18f io_uring: move all prep state for IORING_OP_CONNECT to prep handler
Add struct io_connect in our io_kiocb per-command union, and ensure
that io_connect_prep() has grabbed what it needs from the SQE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:52:48 -07:00
Jens Axboe
9adbd45d6d io_uring: add and use struct io_rw for read/writes
Put the kiocb in struct io_rw, and add the addr/len for the request as
well. Use the kiocb->private field for the buffer index for fixed reads
and writes.

Any use of kiocb->ki_filp is flipped to req->file. It's the same thing,
and less confusing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:52:45 -07:00
Chen Wandun
5084bf6b20 xfs: Make the symbol 'xfs_rtalloc_log_count' static
Fix the following sparse warning:

fs/xfs/libxfs/xfs_trans_resv.c:206:1: warning: symbol 'xfs_rtalloc_log_count' was not declared. Should it be static?

Fixes: b1de6fc752 ("xfs: fix log reservation overflows when allocating large rt extents")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-20 08:07:31 -08:00
Jens Axboe
d55e5f5b70 io_uring: use u64_to_user_ptr() consistently
We use it in some spots, but not consistently. Convert the rest over,
which makes it easier to read as well.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 08:36:50 -07:00
Darrick J. Wong
13eaec4b2a xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
and swidth values could cause xfs_repair to fail loudly.  The problem
here is that repair calculates where mkfs should have allocated the
root inode, based on the superblock geometry.  The allocation decisions
depend on sunit, which means that we really can't go updating sunit if
it would lead to a subsequent repair failure on an otherwise correct
filesystem.

Port from xfs_repair some code that computes the location of the root
inode and teach mount to skip the ondisk update if it would cause
problems for repair.  Along the way we'll update the documentation,
provide a function for computing the minimum AGFL size instead of
open-coding it, and cut down some indenting in the mount code.

Note that we allow the mount to proceed (and new allocations will
reflect this new geometry) because we've never screened this kind of
thing before.  We'll have to wait for a new future incompat feature to
enforce correct behavior, alas.

Note that the geometry reporting always uses the superblock values, not
the incore ones, so that is what xfs_info and xfs_growfs will report.

[1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d

Reported-by: Alex Lyakas <alex@zadara.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong
4f5b1b3a8f xfs: split the sunit parameter update into two parts
If the administrator provided a sunit= mount option, we need to validate
the raw parameter, convert the mount option units (512b blocks) into the
internal unit (fs blocks), and then validate that the (now cooked)
parameter doesn't screw anything up on disk.  The incore inode geometry
computation can depend on the new sunit option, but a subsequent patch
will make validating the cooked value depend on the computed inode
geometry, so break the sunit update into two steps.
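
The two-step handling of the sunit option can be sketched in userspace C as follows. The helper names, the constant, and the "must be a whole number of fs blocks" rule in the raw check are assumptions made for illustration; they are not the functions or the exact validation this patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mount option units are 512-byte blocks, as stated in the commit message. */
#define SKETCH_BASIC_BLOCK 512u

/*
 * Step 1: validate the raw mount option.
 * Assumed rule for this sketch: the stripe unit must be non-zero and a
 * whole number of filesystem blocks. fs_block_size must be non-zero.
 */
static bool sketch_sunit_raw_ok(uint32_t sunit_bb, uint32_t fs_block_size)
{
    uint64_t bytes = (uint64_t)sunit_bb * SKETCH_BASIC_BLOCK;

    return sunit_bb != 0 && bytes % fs_block_size == 0;
}

/* Step 2: convert the raw value into the "cooked" value in fs blocks. */
static uint32_t sketch_sunit_to_fsb(uint32_t sunit_bb, uint32_t fs_block_size)
{
    return (uint32_t)(((uint64_t)sunit_bb * SKETCH_BASIC_BLOCK) / fs_block_size);
}
```

Keeping the conversion separate from the raw check leaves room for a later check on the cooked value, which is the reason the commit gives for the split.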

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong
1cac233cfe xfs: refactor agfl length computation function
Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which
case it returns the largest possible minimum length.  This will be used
in an upcoming patch to compute the length of the AGFL at mkfs time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong
af952aeb4a libxfs: resync with the userspace libxfs
Prepare to resync the userspace libxfs with the kernel libxfs.  There
were a few things I missed -- a couple of static inline directory
functions that have to be exported for xfs_repair; a couple of directory
naming functions that make porting much easier if they're /not/ static
inline; and a u16 usage that should have been uint16_t.

None of these things are bugs in their own right; this just makes
porting xfsprogs easier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-12-19 07:53:47 -08:00
Brian Foster
826f7e3413 xfs: use bitops interface for buf log item AIL flag check
The xfs_log_item flags were converted to atomic bitops as of commit
22525c17ed ("xfs: log item flags are racy"). The assert check for
AIL presence in xfs_buf_item_relse() still uses the old value based
check. This likely went unnoticed as XFS_LI_IN_AIL evaluates to 0
and causes the assert to unconditionally pass. Fix up the check.
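
A minimal userspace sketch of why the old check could never fire. The macro and function names are hypothetical; the only assumption carried over from the text above is that the flag is defined as bit number 0 rather than as a mask.

```c
#include <stdbool.h>

/* Hypothetical flag stored as a bit number (0), not as a mask (1 << 0). */
#define SKETCH_LI_IN_AIL 0

/* Value-based check: ANDs with the bit number, so with 0 it is always false. */
static bool in_ail_mask_check(unsigned long flags)
{
    return (flags & SKETCH_LI_IN_AIL) != 0;
}

/* Bitops-style check: turns the bit number into a mask before testing. */
static bool in_ail_bit_check(unsigned long flags)
{
    return (flags & (1UL << SKETCH_LI_IN_AIL)) != 0;
}
```

An assertion of the form "not in the AIL" built on the value-based check passes even when the bit is set, which matches the behaviour the commit message describes.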

Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: 22525c17ed ("xfs: log item flags are racy")
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-19 07:53:47 -08:00
Jens Axboe
fd6c2e4c06 io_uring: io_wq_submit_work() should not touch req->rw
I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.

Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-18 12:19:41 -07:00
Pavel Begunkov
7c504e6520 io_uring: don't wait when under-submitting
There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.

Don't wait/poll if we can't submit all sqes.
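
The rule can be sketched as a small predicate. The function and parameter names are hypothetical and this is not the code in io_uring; it only captures the decision described above.

```c
#include <stdbool.h>

/*
 * Decide whether to wait for completions after a submit attempt.
 * Waiting is only safe when every requested sqe was actually submitted;
 * otherwise the caller could block on requests that were never issued.
 */
static bool sketch_should_wait(unsigned int submitted,
                               unsigned int to_submit,
                               bool wait_requested)
{
    return wait_requested && submitted == to_submit;
}
```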

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-18 10:01:49 -07:00
Eric Sandeen
1edc8eb2e9 fs: call fsnotify_sb_delete after evict_inodes
When a filesystem is unmounted, we currently call fsnotify_sb_delete()
before evict_inodes(), which means that fsnotify_unmount_inodes()
must iterate over all inodes on the superblock looking for any inodes
with watches.  This is inefficient and can lead to livelocks as it
iterates over many unwatched inodes.

At this point, SB_ACTIVE is gone and dropping refcount to zero kicks
the inode out immediately, so anything processed by
fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop.

After that, the call to evict_inodes will evict everything else with a
zero refcount.

This should speed things up overall, and avoid livelocks in
fsnotify_unmount_inodes().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-18 00:03:01 -05:00
Eric Sandeen
04646aebd3 fs: avoid softlockups in s_inodes iterators
Anything that walks all inodes on sb->s_inodes list without rescheduling
risks softlockups.

Previous efforts were made in 2 functions, see:

c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
ac05fbb inode: don't softlockup when evicting inodes

but there hasn't been an audit of all walkers, so do that now.  This
also consistently moves the cond_resched() calls to the bottom of each
loop in cases where it already exists.

One loop remains: remove_dquot_ref(), because I'm not quite sure how
to deal with that one w/o taking the i_lock.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-18 00:03:01 -05:00
Jens Axboe
e781573e2f io_uring: warn about unhandled opcode
Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
d625c6ee49 io_uring: read opcode and user_data from SQE exactly once
If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hole and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.
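
The idea of reading shared fields exactly once can be sketched in userspace C. Both structs are hypothetical, trimmed-down stand-ins rather than the real io_uring definitions; the point is that later code consults the private copy, so changes the application makes to the shared entry after submission have no effect.

```c
#include <stdint.h>

/* Hypothetical submission entry living in memory shared with the application. */
struct sketch_sqe {
    uint8_t  opcode;
    uint64_t user_data;
};

/* Hypothetical private request that keeps its own stable copies. */
struct sketch_req {
    uint8_t  opcode;
    uint64_t user_data;
};

/* Read each shared field once, at the point where the entry is fetched. */
static void sketch_get_sqe(struct sketch_req *req,
                           const volatile struct sketch_sqe *sqe)
{
    req->opcode = sqe->opcode;
    req->user_data = sqe->user_data;
}
```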

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
b29472ee7b io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
fbf23849b1 io_uring: make IORING_OP_CANCEL_ASYNC deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
0969e783e3 io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Pavel Begunkov
ffbb8d6b76 io_uring: make HARDLINK imply LINK
The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.
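
As a sketch, the rule amounts to treating either flag as the start of a link. The macros below are local copies whose bit positions are assumed to match the io_uring UAPI header, and the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag bits; assumed to match the io_uring UAPI header. */
#define SKETCH_IOSQE_IO_LINK     (1U << 2)
#define SKETCH_IOSQE_IO_HARDLINK (1U << 3)

/* A request links to the next one if either LINK or HARDLINK is set. */
static bool sketch_sqe_is_link(uint8_t flags)
{
    return (flags & (SKETCH_IOSQE_IO_LINK | SKETCH_IOSQE_IO_HARDLINK)) != 0;
}
```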

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
8ed8d3c3bc io_uring: any deferred command must have stable sqe data
We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
is directly provided, hence it can't go stale when the request is
deferred. However, it can get reused, if an application reuses
SQE entries.

Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into a prep and finish function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
fc4df999e2 io_uring: remove 'sqe' parameter to the OP helpers that take it
We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to clean up/fix the request prep path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
b7bb4f7da0 io_uring: fix pre-prepped issue with force_nonblock == true
Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocated before
setting up a new one.

Clean up the async context setup in general, as we have a lot of duplicated
code there.

Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d160 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
525b305d61 io-wq: re-add io_wq_current_is_worker()
This reverts commit 8cdda87a44, we now have several use cases for this
helper. Reinstate it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Linus Torvalds
2187f215eb for-5.5-rc2-tag

Merge tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A mix of regression fixes and regular fixes for stable trees:

   - fix swapped error messages for qgroup enable/rescan

   - fixes for NO_HOLES feature with clone range

   - fix deadlock between iget/srcu lock/synchronize srcu while freeing
     an inode

   - fix double lock on subvolume cross-rename

   - tree log fixes
      * fix missing data checksums after replaying a log tree
      * also teach tree-checker about this problem
      * skip log replay on orphaned roots

   - fix maximum devices constraints for RAID1C -3 and -4

   - send: don't print warning on read-only mount regarding orphan
     cleanup

   - error handling fixes"

* tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: send: remove WARN_ON for readonly mount
  btrfs: do not leak reloc root if we fail to read the fs root
  btrfs: skip log replay on orphaned roots
  btrfs: handle ENOENT in btrfs_uuid_tree_iterate
  btrfs: abort transaction after failed inode updates in create_subvol
  Btrfs: fix hole extent items with a zero size after range cloning
  Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
  Btrfs: make tree checker detect checksum items with overlapping ranges
  Btrfs: fix missing data checksums after replaying a log tree
  btrfs: return error pointer from alloc_test_extent_buffer
  btrfs: fix devs_max constraints for raid1c3 and raid1c4
  btrfs: tree-checker: Fix error format string for size_t
  btrfs: don't double lock the subvol_sem for rename exchange
  btrfs: handle error in btrfs_cache_block_group
  btrfs: do not call synchronize_srcu() in inode_tree_del
  Btrfs: fix cloning range with a hole when using the NO_HOLES feature
  btrfs: Fix error messages in qgroup_rescan_init
2019-12-17 13:27:02 -08:00
Darrick J. Wong
b1de6fc752 xfs: fix log reservation overflows when allocating large rt extents
Omar Sandoval reported that a 4G fallocate on the realtime device causes
filesystem shutdowns due to a log reservation overflow that happens when
we log the rtbitmap updates.  Factor rtbitmap/rtsummary updates into the
tr_write and tr_itruncate log reservation calculation.

"The following reproducer results in a transaction log overrun warning
for me:

    mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb
    mount -o rtdev=/dev/vdc /dev/vdb /mnt
    fallocate -l 4G /mnt/foo

Reported-by: Omar Sandoval <osandov@osandov.com>
Tested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-17 11:19:28 -08:00
Linus Torvalds
4340ebd19f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix the guest-nice cpustat values in /proc"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime, proc/stat: Fix incorrect guest nice cpustat value
2019-12-17 11:09:05 -08:00
Linus Torvalds
6afa873170 linux-kselftest-5.5-rc2
This Kselftest fixes update for Linux 5.5-rc2 consists of
 
 -- ftrace and safesetid test fixes from Masami Hiramatsu
 -- Kunit fixes from Brendan Higgins, Iurii Zaikin, and Heidi Fahim
 -- Kselftest framework fixes from SeongJae Park and Michael Ellerman

Merge tag 'linux-kselftest-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan:

 - ftrace and safesetid test fixes from Masami Hiramatsu

 - Kunit fixes from Brendan Higgins, Iurii Zaikin, and Heidi Fahim

 - Kselftest framework fixes from SeongJae Park and Michael Ellerman

* tag 'linux-kselftest-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kselftest: Support old perl versions
  kselftest/runner: Print new line in print of timeout log
  selftests: Fix dangling documentation references to kselftest_module.sh
  Documentation: kunit: add documentation for kunit_tool
  Documentation: kunit: fix typos and gramatical errors
  kunit: testing kunit: Bug fix in test_run_timeout function
  fs/ext4/inode-test: Fix inode test on 32 bit platforms.
  selftests: safesetid: Fix Makefile to set correct test program
  selftests: safesetid: Check the return value of setuid/setgid
  selftests: safesetid: Move link library to LDLIBS
  selftests/ftrace: Fix multiple kprobe testcase
  selftests/ftrace: Do not to use absolute debugfs path
  selftests/ftrace: Fix ftrace test cases to check unsupported
  selftests/ftrace: Fix to check the existence of set_ftrace_filter
2019-12-16 10:06:04 -08:00
Jens Axboe
0b416c3e13 io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG
If we have to punt the recvmsg to async context, we copy all the
context.  But since the iovec used can be either on-stack (if small) or
dynamically allocated, if it's on-stack, then we need to ensure we reset
the iov pointer. If we don't, then we're reusing old stack data, and
that can lead to -EFAULTs if things get overwritten.

Ensure we retain the right pointers for the iov, and free it as well if
we end up having to go beyond UIO_FASTIOV number of vectors.
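
The pitfall described above can be sketched in userspace C. This is only an illustration of the pattern, not the io_uring code: `struct msg_ctx`, `FAST_IOV`, `ctx_prepare()`, `ctx_copy_for_async()` and `ctx_release()` are hypothetical names, and `FAST_IOV` merely stands in for UIO_FASTIOV.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define FAST_IOV 8 /* hypothetical stand-in for UIO_FASTIOV */

/* Hypothetical context: small vectors live inline, large ones on the heap. */
struct msg_ctx {
    struct iovec fast_iov[FAST_IOV];
    struct iovec *iov; /* points at fast_iov or at a heap allocation */
    size_t nr;
};

/* Set up a context for nr vectors; returns 0 on success, -1 on failure. */
static int ctx_prepare(struct msg_ctx *ctx, size_t nr)
{
    ctx->nr = nr;
    if (nr <= FAST_IOV) {
        ctx->iov = ctx->fast_iov;
        return 0;
    }
    ctx->iov = calloc(nr, sizeof(*ctx->iov));
    return ctx->iov ? 0 : -1;
}

/* Copy a context so it can be used later from another (async) context. */
static void ctx_copy_for_async(struct msg_ctx *dst, const struct msg_ctx *src)
{
    assert(dst != src);
    memcpy(dst, src, sizeof(*dst));
    /*
     * After the memcpy, dst->iov still points into src when the inline
     * array was in use, so it must be reset to dst's own array.
     * A heap pointer is carried over and is owned by dst from here on.
     */
    if (src->iov == src->fast_iov)
        dst->iov = dst->fast_iov;
}

/* Free the vector only if it went beyond the inline array. */
static void ctx_release(struct msg_ctx *ctx)
{
    if (ctx->iov != ctx->fast_iov)
        free(ctx->iov);
    ctx->iov = NULL;
}
```

Without the pointer reset in `ctx_copy_for_async()`, the copy would keep reading the original (possibly on-stack) array after it has gone out of scope, which is the kind of stale-data reuse the message describes.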

Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-15 22:12:47 -07:00
Phong Tran
69000d82ee ext4: use RCU API in debug_print_tree
struct ext4_sb_info.system_blks is marked __rcu, but debug_print_tree()
accesses the pointer without taking the RCU read lock or using an RCU
dereference. Sparse warns about the __rcu annotation:

block_validity.c:139:29: warning: incorrect type in argument 1 (different address spaces)
block_validity.c:139:29:    expected struct rb_root const *
block_validity.c:139:29:    got struct rb_root [noderef] <asn:4> *

Link: https://lore.kernel.org/r/20191213153306.30744-1-tranmanphong@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Phong Tran <tranmanphong@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-15 21:41:04 -05:00
Theodore Ts'o
9803387c55 ext4: validate the debug_want_extra_isize mount option at parse time
Instead of setting s_want_extra_isize and then making sure that it is a
valid value afterwards, validate the field before we set it.  This
avoids races and other problems when remounting the file system.
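
A minimal sketch of the validate-before-set idea, in plain C. The names `struct sb_opts` and `set_want_extra_isize()` are hypothetical, and the bounds used here (even, at least 4, no larger than the inode size minus a 128-byte fixed part) are assumptions for illustration rather than a statement of the exact ext4 checks.

```c
#include <assert.h>
#include <errno.h>

/* Assumed size of the fixed part of the inode, for this sketch only. */
#define GOOD_OLD_INODE_SIZE 128

/* Hypothetical stand-in for the relevant superblock fields. */
struct sb_opts {
    unsigned int inode_size;
    unsigned int want_extra_isize;
};

/*
 * Validate the requested value first and store it only if it is valid.
 * On failure the previously stored value is left untouched, so no other
 * code ever observes an invalid setting.
 */
static int set_want_extra_isize(struct sb_opts *o, unsigned int arg)
{
    assert(o->inode_size >= GOOD_OLD_INODE_SIZE);
    if ((arg & 1) || arg < 4 ||
        arg > o->inode_size - GOOD_OLD_INODE_SIZE)
        return -EINVAL;
    o->want_extra_isize = arg;
    return 0;
}
```

The point of the ordering is that a rejected value never reaches the shared field, so there is no window in which a remount can expose a bad setting.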

Link: https://lore.kernel.org/r/20191215063020.GA11512@mit.edu
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-and-tested-by: syzbot+4a39a025912b265cacef@syzkaller.appspotmail.com
2019-12-15 18:05:20 -05:00
Brian Gianforcaro
d195a66e36 io_uring: fix stale comment and a few typos
- Fix a few typos found while reading the code.

- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
  was renamed to 'req', but the comment still holds.

Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-15 14:49:30 -07:00
Linus Torvalds
2e6d304515 Merge branch 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull ksys_mount() and ksys_dup() removal from Dominik Brodowski:
 "This small series replaces all in-kernel calls to the
  userspace-focused ksys_mount() and ksys_dup() with calls to
  kernel-centric functions:

  For each replacement of ksys_mount() with do_mount(), one needs to
  verify that the first and third parameter (char *dev_name, char *type)
  are strings allocated in kernelspace and that the fifth parameter
  (void *data) is either NULL or refers to a full page (only occurrence
  in init/do_mounts.c::do_mount_root()). The second and fourth
  parameters (char *dir_name, unsigned long flags) are passed by
  ksys_mount() to do_mount() unchanged, and therefore do not require
  particular care.

  Moreover, instead of pretending to be userspace, the opening of
  /dev/console as stdin/stdout/stderr can be implemented using in-kernel
  functions as well. Thereby, ksys_dup() can be removed for good"

[ This doesn't get rid of the special "kernel init runs with KERNEL_DS"
  case, but it at least removes _some_ of the users of "treat kernel
  pointers as user pointers for our magical init sequence".

  One day we'll hopefully be rid of it all, and can initialize our
  init_thread addr_limit to USER_DS.    - Linus ]

* 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
  fs: remove ksys_dup()
  init: unify opening /dev/console as stdin/stdout/stderr
  init: use do_mount() instead of ksys_mount()
  initrd: use do_mount() instead of ksys_mount()
  devtmpfs: use do_mount() instead of ksys_mount()
2019-12-15 11:36:12 -08:00
yangerkun
a70fd5ac2e ext4: reserve revoke credits in __ext4_new_inode
It's possible that __ext4_new_inode will release the xattr block, which
triggers a warning because the revoke credits will be 0 if the
handle == NULL. The scripts below reproduce it easily.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374
...
__ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248
ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743
ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254
ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112
ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384
__ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214
ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293
__ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151
ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774
vfs_mkdir+0x386/0x5f0 fs/namei.c:3811
do_mkdirat+0x11c/0x210 fs/namei.c:3834
do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294
...
-------------------------------------

scripts:
mkfs.ext4 /dev/vdb
mount /dev/vdb /mnt
cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done
mkdir dir/dir1 && mv dir/dir1 ./
sh repro.sh && add some user

[root@localhost ~]# cat repro.sh
while [ 1 -eq 1 ]; do
    rm -rf dir
    rm -rf dir1/dir1
    mkdir dir
    for i in {1..8}; do  setfacl -dm "u:test"$i":rx" dir; done
    setfacl -m "u:user_9:rx" dir &
    mkdir dir1/dir1 &
done

Before repro.sh is executed, dir1 has inherited the default acl from
dir, but the xattr blocks of dir and dir1 are not the same, so the
h_refcount of each directory's xattr block is 1. repro.sh can then
trigger the warning in the situation shown below. The last h_refcount
can be cleared during mkdir, and __ext4_new_inode has not reserved
revoke credits, so the warning fires. Fix it by reserving revoke
credits in __ext4_new_inode.

Thread 1                        Thread 2
mkdir dir
set default acl(will create
a xattr block blk1 and the
refcount of ext4_xattr_header
will be 1)
                                ...
                                mkdir dir1/dir1
                                ->....->ext4_init_acl
                                ->__ext4_set_acl(set default acl,
                                  will reuse blk1, and h_refcount
                                  will be 2)

setfacl->ext4_set_acl->...
->ext4_xattr_block_set(will create
new block blk2 to store xattr)

                                ->__ext4_set_acl(set access acl, since
                                  h_refcount of blk1 is 2, will create
                                  blk3 to store xattr)

  ->ext4_xattr_release_block(dec
  h_refcount of blk1 to 1)
                                  ->ext4_xattr_release_block(dec
                                    h_refcount and since it is 0,
                                    will release the block and trigger
                                    the warning)
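
The reference-count sequence in the diagram can be replayed with a small userspace model. This is a sketch only: `struct xattr_block`, `xattr_block_share()` and `xattr_block_release()` are hypothetical names that model nothing but the counter, not the real ext4 xattr code or the journal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a shared xattr block: just its reference count. */
struct xattr_block {
    unsigned int h_refcount;
};

/* A second inode reuses the block, so the count goes up. */
static void xattr_block_share(struct xattr_block *b)
{
    b->h_refcount++;
}

/*
 * Drop one reference. Returns true when this was the last reference,
 * i.e. the caller is the one that has to free the block. In the
 * scenario above, that is the path that needs revoke credits.
 */
static bool xattr_block_release(struct xattr_block *b)
{
    assert(b->h_refcount > 0);
    return --b->h_refcount == 0;
}
```

Replaying the diagram: blk1 starts at 1 (Thread 1), is shared to 2 (Thread 2, mkdir), is released to 1 by Thread 1, and the final release to 0 happens on the mkdir side, which is why the inode-creation path ends up freeing the block.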

Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:47:13 -05:00
Dan Carpenter
7f420d64a0 ext4: unlock on error in ext4_expand_extra_isize()
We need to unlock the xattr before returning on this error path.

Cc: stable@kernel.org # 4.13
Fixes: c03b45b853 ("ext4, project: expand inode extra size if possible")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:31:23 -05:00
Theodore Ts'o
707d1a2f60 ext4: optimize __ext4_check_dir_entry()
Make __ext4_check_dir_entry() a bit easier to understand, and reduce
the object size of the function by over 11%.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20191209004346.38526-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:23:14 -05:00
Jan Kara
109ba779d6 ext4: check for directory entries too close to block end
ext4_check_dir_entry() currently does not catch the case where a
directory entry ends so close to the block end that the header of the
next directory entry would not fit in the remaining space. This can
lead the directory iteration code to access an address beyond the end
of the current buffer head, causing an oops.
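
The missing condition can be sketched as a standalone check. This is an illustration under stated assumptions, not the kernel function: `dirent_end_ok()`, `DIRENT_HDR` and `MIN_DIRENT` are hypothetical names, and the sketch assumes an 8-byte entry header and entries padded to a multiple of 4 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed fixed header size of a directory entry, for this sketch. */
#define DIRENT_HDR 8
/* Smallest possible entry: header plus one name byte, rounded up to 4. */
#define MIN_DIRENT ((size_t)((DIRENT_HDR + 1 + 3) & ~3))

/*
 * Return true if an entry at 'offset' with length 'rec_len' ends at a
 * safe position inside a block of 'block_size' bytes: either exactly at
 * the block end, or early enough that another minimal entry still fits.
 */
static bool dirent_end_ok(size_t offset, size_t rec_len, size_t block_size)
{
    assert(offset <= block_size);
    if (rec_len < MIN_DIRENT || rec_len % 4 != 0)
        return false;
    if (rec_len > block_size - offset)
        return false; /* entry runs past the block end */
    size_t next = offset + rec_len;
    return next == block_size || next <= block_size - MIN_DIRENT;
}
```

The last line is the case the message is about: an entry that stays inside the block but leaves fewer than `MIN_DIRENT` bytes behind it is rejected, so iteration code never reads a partial header at the block end.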

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:22:45 -05:00
Jan Kara
64d4ce8923 ext4: fix ext4_empty_dir() for directories with holes
Function ext4_empty_dir() doesn't correctly handle directories with
holes and crashes on a bh->b_data dereference when bh is NULL.
Reorganize the loop to use the 'offset' variable at all times instead
of comparing pointers to the current direntry with the bh->b_data
pointer. Also add stricter checking of the '.' and '..' directory
entries to avoid entering the loop in a possibly invalid state on
corrupted filesystems.

References: CVE-2019-19037
CC: stable@vger.kernel.org
Fixes: 4e19d6b65f ("ext4: allow directory holes")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:22:45 -05:00