Commit Graph

75672 Commits

Author SHA1 Message Date
Pavel Begunkov
9d170164db io_uring: partially uninline io_put_task()
In most cases io_put_task() is called from the submitter task and go
through a higly optimised fast path, which has to be inlined. The other
branch though is bulkier and we don't care about it as much because it
implies atomics and other heavy calls. Extract it into a helper, which
is expected not to be inlined.

[before] size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  89328   13646       8  102982   19246 ./fs/io_uring.o
[after] size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  89096   13646       8  102750   1915e ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dec213db0e0b8605132da81e0a0be687a4d140cb.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
f892963051 io_uring: cleanup conditional submit locking
Refactor io_ring_submit_[un]lock(), make it accept issue_flags and
remove manual IO_URING_F_UNLOCKED checks. It also allows us to place
lockdep annotations inside instead of sprinkling them in a bunch of
places. There is only one user that doesn't fit now, so hand code
locking in __io_rsrc_put_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55c2c06767676a801252e8094c9ab09912487a4.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
d487b43cd3 io_uring: optimise mutex locking for submit+iopoll
Both submittion and iopolling requires holding uring_lock. IOPOLL can
users do them together in a single syscall, however it would still do 2
pairs of lock/unlock. Optimise this case combining locking into one
lock/unlock pair, which especially nice for low QD.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/034b6c41658648ad3ad3c9485ac8eb546f010bc4.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
773697b610 io_uring: pre-calculate syscall iopolling decision
Syscall should only iopoll for events when it's a IOPOLL ring and is not
SQPOLL. Instead of check both flags every time we can save it in ring
flags so it's easier to use. We don't care much about an extra if there,
however it will be inconvenient to copy-paste this chunk with checks in
future patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7fd2f8fc2606305aa06dd8c0ff8f76a66b39c383.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
f81440d33c io_uring: split off IOPOLL argument verifiction
IOPOLL doesn't use additional arguments like sigsets, but it still
needs some basic verification, which is currently done by
io_get_ext_arg(). This patch adds a separate function for the IOPOLL
path, which is a bit simpler and doesn't do extra. This prepares us for
further patches, which would have hurt inlining in the hot path otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71b23fca412e3374b74be7711cfd42a3d9d5dfe0.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
57859f4d93 io_uring: clean up io_queue_next()
Move fast check out of io_queue_next(), it makes req->flags checks in
__io_submit_flush_completions() a bit clearer and grants us better
comtrol, e.g. can remove now not justified unlikely() in
__io_submit_flush_completions(). Also, we don't care about having this
check in io_free_req() as the function is a slow path and
io_req_find_next() handles it correctly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9e1cc80adbb11b37017d511df4a2c6141a3f08.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
b605a7fabb io_uring: move poll recycling later in compl flushing
There is a new (req->flags & REQ_F_POLLED) check in
__io_submit_flush_completions() for poll recycling, however
io_free_batch_list() is a much better place for it. First, we prefer it
after putting the last req ref just to avoid potential problems in the
future. Also, it'll enable the recycling for IOPOLL and also will place
it closer to all other req->flags bits clean up requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/31dfe1dafda66ba3ce36b301884ec7e162c777d1.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
a538be5be3 io_uring: optimise io_free_batch_list
We do several req->flags checks in the fast path of
io_free_batch_list(). One explicit check of REQ_F_REFCOUNT, and two
other hidden in io_queue_next() and io_dismantle_req(). Moreover, there
is a io_req_put_rsrc_locked() call in between, so there is no hope
req->flags will be preserved in registers.

All those flags if not a slow path than definitely a slower path, so
put them all under a single flags mask check and save several mem
reloads and ifs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0fb493f73f2009aea395c570c2932fecaa4e1244.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
7819a1f6ac io_uring: refactor io_req_find_next
Move the fast path from io_req_find_next() into callers. It prepares us
for further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/10bd0e564472dde0c7f8d90ae317d05356cd565a.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
60053be859 io_uring: remove extra ifs around io_commit_cqring
Now io_commit_cqring() is simple and it tolerates well being called
without a new CQE filled, so kill a bunch of not needed anymore
guards.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/36aed692dff402bba00a444a63a9cd2e97a340ea.1647897811.git.asml.silence@gmail.com
[axboe: fold in followup fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
68ca8fc002 io_uring: small optimisation of tctx_task_work
There should be no completions stashed when we first get into
tctx_task_work(), so move completion flushing checks a bit later
after we had a chance to execute some task works.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6765c804f3c438591b9825ab9c43d22039073c4.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Linus Torvalds
22da5264ab 3 fixes to ksmbd server
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmJkGaQACgkQiiy9cAdy
 T1G7Pgv/SHmxsBGIT+YyW+ppWZGIMMpCld3YbovtN2EWdt3NXwvnGrTTFkXe0acJ
 +63NjNnCoDTKy+XHN0HhMqLsJSep6J63e5rmiuvUxKQIR0n1WzZl7W1YpHKptZog
 wMkObv0UzKgr6gLEwQeMMqvcyxOVmgssAXou4N8p6rDJLFijM2kcVjpB/B9uyUAR
 JG0ss8lhX7+YTcRuI0QqyulHlUTiGwu4/XjO9oWs3bF2faADVIfZXOKGMFfRpjCK
 YDGdh+HieW5y2SngvstdBuVxmZXLjWjWwe9mQCUaF7khZJ0acuGeQh9BCPNopjUD
 0CBYe9JWM2NxTjXqmhspCGUi40+EedZgIMxukRyl7MrBX1wBF0ErSYpGmYSZQqoH
 u2R9Pr8tUwuzVfQ6s7VhWACckKFNwI53lHOhY1ikc2M/NWeLy961Wi9JRykac8cZ
 ElkJkJUdGttntjwilcKp7NuWmHssDzxNH103WAr3GZrhBDgntUHFBpsLqzQmQW3a
 Gp3jkj1e
 =Y9T8
 -----END PGP SIGNATURE-----

Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:

 - cap maximum sector size reported to avoid mount problems

 - reference count fix

 - fix filename rename race

* tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION
  ksmbd: increment reference count of parent fp
  ksmbd: remove filename in ksmbd_file
2022-04-23 17:16:10 -07:00
Linus Torvalds
1f5e98e723 io_uring-5.18-2022-04-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmJjYJUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsZZEADS/dD7pZKxBLcHTCGJAik1/IIv/3ynTrOp
 o86uV0AH+nL6lUyBU7uTTQVFn9Hjh6T10ZfRmcU+1Xb8G4obHTrQJkk5evwCNPng
 9CUW2fwQa+6H4Ui8TU7f1rLcLlm+AUSVmab6h/20X5ldMwzF1JhcE11qtqZw0ti7
 mDPwmxEsx7KMvMy59awA+5IpnXHxe5SvHXuzLsMNwux6dH7VauxE8R+Y+HHVzLWc
 fM2dbEU2Hq5nL23DedMw3ZaHwhQTiWdOQA0386iDB6cJdFv19iw+ApD4KS/qAT2X
 URQ3pmyNOXvOsBosVL4za7VVCoUlA23ZSMoU82p2K3NK4NGfV7S4oIeO5ZR2BK/C
 bIC4c2gutIbYrtdSITBW4z2tj+26BBZS7LaT3Bek/3BL+GjQuM6vK8N4ZhRXPC+l
 vWAwXUnWSyXR4+HWpvm3ewlrSY5CQjfsZgU1PIYybhTf/oo2BxX/HQfRk3XKfLIR
 89gvITTUrC8B4dgPgLs/MF4Ercmoa2//2yL0onBEwdC2b1lRqD/bGM2FAYolzLzf
 W1+BrFj3sRUjexdO7ChrtZvAWo59REAxXdP/3h+NbkIz8sunG2Vpf9NX3cXj30n3
 bZL3SkEbtFxpnXspRSQRnmL6DMLPMa7UC+MxpNBAV0g6aMmuSdxqiIZR10+kIEyR
 PJyRqbRefg==
 =Huq7
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Just two small fixes - one fixing a potential leak for the iovec for
  larger requests added in this cycle, and one fixing a theoretical leak
  with CQE_SKIP and IOPOLL"

* tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block:
  io_uring: fix leaks on IOPOLL and CQE_SKIP
  io_uring: free iovec if file assignment fails
2022-04-23 09:42:13 -07:00
Linus Torvalds
c00c5e1d15 Fix some syzbot-detected bugs, as well as other bugs found by I/O
injection testing.  Change ext4's fallocate to update consistently
 drop set[ug]id bits when an fallocate operation might possibly change
 the user-visible contents of a file.  Also, improve handling of
 potentially invalid values in the the s_overhead_cluster superblock
 field to avoid ext4 returning a negative number of free blocks.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmJinf8ACgkQ8vlZVpUN
 gaOHgQf+MKgUZgteYogLzoP3mF1kSycOGawk4wZ3QHOLz7AvsV2p9J8BWihbS/EK
 dBydfXbTMvCUrjWmpqb5dHECRzxdfxOJ0SPJtibc8DZaJc9ImNFmgSp9kyJ3uRaN
 cPGO6Lz2RXpdumVMPPLwzUJdVyrLi0K6I1NYSocxKgribePzd+xil8S9zRZj8Bpe
 RaeH0EytcRj2CI5qs5mI/mOPBAMsZeczd3HInI3gyCgP2I4ZOfsADne3APx57mcI
 IGKf77nvIwMHeKel3MGYfFPitEs5cZpHUhHplCMtgFsO8H0IR93tqnlaCvTM7VAZ
 Slamgl7pfcXFcLZP+pm0QL/82ub7iw==
 =FIds
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix some syzbot-detected bugs, as well as other bugs found by I/O
  injection testing.

  Change ext4's fallocate to consistently drop set[ug]id bits when an
  fallocate operation might possibly change the user-visible contents of
  a file.

  Also, improve handling of potentially invalid values in the the
  s_overhead_cluster superblock field to avoid ext4 returning a negative
  number of free blocks"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix a potential race while discarding reserved buffers after an abort
  ext4: update the cached overhead value in the superblock
  ext4: force overhead calculation if the s_overhead_cluster makes no sense
  ext4: fix overhead calculation to account for the reserved gdt blocks
  ext4, doc: fix incorrect h_reserved size
  ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
  ext4: fix use-after-free in ext4_search_dir
  ext4: fix bug_on in start_this_handle during umount filesystem
  ext4: fix symlink file size not match to file content
  ext4: fix fallocate to use file_modified to update permissions consistently
2022-04-22 18:18:27 -07:00
Linus Torvalds
88c5060d56 4 fixes to cifs client, 2 for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmJi1i4ACgkQiiy9cAdy
 T1G9qgv/fY5M1iyz7ddwSbFeX2MPL3225C/LuWf1cW/deUuf83/RyjefaI/8GSjG
 DT+0mvpow/vvOWWxS+ZR+O3vp32Q/twO4NhgTU5FQicACU1AmofcY/klVUCm1W9f
 oS6q/dlvWaqlqDLajprPOEuh5ZG01GlkmF0lTvo00ze40PlDfpXZEpmoCHtNPUx9
 VUFvq+Bxh+3kvWaHk7YKN0ENilSZ9UYy1KIClpQXlXU4rRwT9MORiljwXB+FYw9E
 suKsAyUk7ek01TBFUjcRC7OYjlq2t7wcvsOlLLwvMlisgauJZgPHAv1igJF04fVw
 jMIE9RwH5EuPHP5JrVj+w3mkk9XXG4xbXREfmNCC4V/V1a3ZefU3r0lj1VBV4HIi
 p2FpVvEBU+DkFuO8pwA2xv56ykCXooAftk1xX9Rj27ICEL/OPqiXbEdb46wTT+Nj
 cf8L0qumOvV4/IMRre94RJCcwn4IfTW0O4VGCAhk3U+qR1MZzmGtpvjzMVdUOad+
 C7jaDuwd
 =XdOO
 -----END PGP SIGNATURE-----

Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Four fixes, two of them for stable:

   - fcollapse fix

   - reconnect lock fix

   - DFS oops fix

   - minor cleanup patch"

* tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: destage any unwritten data to the server before calling copychunk_write
  cifs: use correct lock type in cifs_reconnect()
  cifs: fix NULL ptr dereference in refresh_mounts()
  cifs: Use kzalloc instead of kmalloc/memset
2022-04-22 13:26:11 -07:00
Linus Torvalds
279b83c673 fs.fixes.v5.18-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYmF+/wAKCRCRxhvAZXjc
 otmDAP47jPBjTS+gdnMy8fP6ymZu2+So3gex8N777x23mTvQ5AEAy9s4tKMb5UA5
 rwQa8vRgkUBAyiB9yjloNOdN65X1cAE=
 =WLsw
 -----END PGP SIGNATURE-----

Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull mount_setattr fix from Christian Brauner:
 "The recent cleanup in e257039f0f ("mount_setattr(): clean the
  control flow and calling conventions") switched the mount attribute
  codepaths from do-while to for loops as they are more idiomatic when
  walking mounts.

  However, we did originally choose do-while constructs because if we
  request a mount or mount tree to be made read-only we need to hold
  writers in the following way: The mount attribute code will grab
  lock_mount_hash() and then call mnt_hold_writers() which will
  _unconditionally_ set MNT_WRITE_HOLD on the mount.

  Any callers that need write access have to call mnt_want_write(). They
  will immediately see that MNT_WRITE_HOLD is set on the mount and the
  caller will then either spin (on non-preempt-rt) or wait on
  lock_mount_hash() (on preempt-rt).

  The fact that MNT_WRITE_HOLD is set unconditionally means that once
  mnt_hold_writers() returns we need to _always_ pair it with
  mnt_unhold_writers() in both the failure and success paths.

  The do-while constructs did take care of this. But Al's change to a
  for loop in the failure path stops on the first mount we failed to
  change mount attributes _without_ going into the loop to call
  mnt_unhold_writers().

  This in turn means that once we failed to make a mount read-only via
  mount_setattr() - i.e. there are already writers on that mount - we
  will block any writers indefinitely. Fix this by ensuring that the for
  loop always unsets MNT_WRITE_HOLD including the first mount we failed
  to change to read-only. Also sprinkle a few comments into the cleanup
  code to remind people about what is happening including myself. After
  all, I didn't catch it during review.

  This is only relevant on mainline and was reported by syzbot. Details
  about the syzbot reports are all in the commit message"

* tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: unset MNT_WRITE_HOLD on failure
2022-04-22 13:17:19 -07:00
Christophe Leroy
5f24d5a579 mm, hugetlb: allow for "high" userspace addresses
This is a fix for commit f6795053da ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.

This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).

Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.

So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area().  To allow that, move those two macros out
of mm/mmap.c into include/linux/sched/mm.h

If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.

For the time being, only ARM64 is affected by this change.

Catalin (ARM64) said
 "We should have fixed hugetlb_get_unmapped_area() as well when we added
  support for 52-bit VA. The reason for commit f6795053da was to
  prevent normal mmap() from returning addresses above 48-bit by default
  as some user-space had hard assumptions about this.

  It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
  but I doubt anyone would notice. It's more likely that the current
  behaviour would cause issues, so I'd rather have them consistent.

  Basically when arm64 gained support for 52-bit addresses we did not
  want user-space calling mmap() to suddenly get such high addresses,
  otherwise we could have inadvertently broken some programs (similar
  behaviour to x86 here). Hence we added commit f6795053da. But we
  missed hugetlbfs which could still get such high mmap() addresses. So
  in theory that's a potential regression that should have bee addressed
  at the same time as commit f6795053da (and before arm64 enabled
  52-bit addresses)"

Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
Fixes: f6795053da ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>	[5.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21 20:01:09 -07:00
Ye Bin
23e3d7f706 jbd2: fix a potential race while discarding reserved buffers after an abort
we got issue as follows:
[   72.796117] EXT4-fs error (device sda): ext4_journal_check_start:83: comm fallocate: Detected aborted journal
[   72.826847] EXT4-fs (sda): Remounting filesystem read-only
fallocate: fallocate failed: Read-only file system
[   74.791830] jbd2_journal_commit_transaction: jh=0xffff9cfefe725d90 bh=0x0000000000000000 end delay
[   74.793597] ------------[ cut here ]------------
[   74.794203] kernel BUG at fs/jbd2/transaction.c:2063!
[   74.794886] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[   74.795533] CPU: 4 PID: 2260 Comm: jbd2/sda-8 Not tainted 5.17.0-rc8-next-20220315-dirty #150
[   74.798327] RIP: 0010:__jbd2_journal_unfile_buffer+0x3e/0x60
[   74.801971] RSP: 0018:ffffa828c24a3cb8 EFLAGS: 00010202
[   74.802694] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   74.803601] RDX: 0000000000000001 RSI: ffff9cfefe725d90 RDI: ffff9cfefe725d90
[   74.804554] RBP: ffff9cfefe725d90 R08: 0000000000000000 R09: ffffa828c24a3b20
[   74.805471] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9cfefe725d90
[   74.806385] R13: ffff9cfefe725d98 R14: 0000000000000000 R15: ffff9cfe833a4d00
[   74.807301] FS:  0000000000000000(0000) GS:ffff9d01afb00000(0000) knlGS:0000000000000000
[   74.808338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   74.809084] CR2: 00007f2b81bf4000 CR3: 0000000100056000 CR4: 00000000000006e0
[   74.810047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   74.810981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   74.811897] Call Trace:
[   74.812241]  <TASK>
[   74.812566]  __jbd2_journal_refile_buffer+0x12f/0x180
[   74.813246]  jbd2_journal_refile_buffer+0x4c/0xa0
[   74.813869]  jbd2_journal_commit_transaction.cold+0xa1/0x148
[   74.817550]  kjournald2+0xf8/0x3e0
[   74.819056]  kthread+0x153/0x1c0
[   74.819963]  ret_from_fork+0x22/0x30

Above issue may happen as follows:
        write                   truncate                   kjournald2
generic_perform_write
 ext4_write_begin
  ext4_walk_page_buffers
   do_journal_get_write_access ->add BJ_Reserved list
 ext4_journalled_write_end
  ext4_walk_page_buffers
   write_end_fn
    ext4_handle_dirty_metadata
                ***************JBD2 ABORT**************
     jbd2_journal_dirty_metadata
 -> return -EROFS, jh in reserved_list
                                                   jbd2_journal_commit_transaction
                                                    while (commit_transaction->t_reserved_list)
                                                      jh = commit_transaction->t_reserved_list;
                        truncate_pagecache_range
                         do_invalidatepage
			  ext4_journalled_invalidatepage
			   jbd2_journal_invalidatepage
			    journal_unmap_buffer
			     __dispose_buffer
			      __jbd2_journal_unfile_buffer
			       jbd2_journal_put_journal_head ->put last ref_count
			        __journal_remove_journal_head
				 bh->b_private = NULL;
				 jh->b_bh = NULL;
				                      jbd2_journal_refile_buffer(journal, jh);
							bh = jh2bh(jh);
							->bh is NULL, later will trigger null-ptr-deref
				 journal_free_journal_head(jh);

After commit 96f1e09745, we no longer hold the j_state_lock while
iterating over the list of reserved handles in
jbd2_journal_commit_transaction().  This potentially allows the
journal_head to be freed by journal_unmap_buffer while the commit
codepath is also trying to free the BJ_Reserved buffers.  Keeping
j_state_lock held while trying extends hold time of the lock
minimally, and solves this issue.

Fixes: 96f1e0974575("jbd2: avoid long hold times of j_state_lock while committing a transaction")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220317142137.1821590-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-04-21 14:21:30 -04:00
Christian Brauner
0014edaedf
fs: unset MNT_WRITE_HOLD on failure
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD
and consequently we always need to pair mnt_hold_writers() with
mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a
do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for
the first mount that was changed. Fix this and make sure that the first mount
will be cleaned up and add some comments to make it more obvious.

Link: https://lore.kernel.org/lkml/0000000000007cc21d05dd0432b8@google.com
Link: https://lore.kernel.org/lkml/00000000000080e10e05dd043247@google.com
Link: https://lore.kernel.org/r/20220420131925.2464685-1-brauner@kernel.org
Fixes: e257039f0f ("mount_setattr(): clean the control flow and calling conventions") [1]
Cc: Hillf Danton <hdanton@sina.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+10a16d1c43580983f6a2@syzkaller.appspotmail.com
Reported-by: syzbot+306090cfa3294f0bbfb3@syzkaller.appspotmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-04-21 17:57:37 +02:00
Ronnie Sahlberg
f5d0f921ea cifs: destage any unwritten data to the server before calling copychunk_write
because the copychunk_write might cover a region of the file that has not yet
been sent to the server and thus fail.

A simple way to reproduce this is:
truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile

the issue is that the 'pwrite 0k 128k' becomes rearranged on the wire with
the 'fcollapse 16k 24k' due to write-back caching.

fcollapse is implemented in cifs.ko as a SMB2 IOCTL(COPYCHUNK_WRITE) call
and it will fail serverside since the file is still 0b in size serverside
until the writes have been destaged.
To avoid this we must ensure that we destage any unwritten data to the
server before calling COPYCHUNK_WRITE.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1997373
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-20 22:54:54 -05:00
Paulo Alcantara
cd70a3e898 cifs: use correct lock type in cifs_reconnect()
TCP_Server_Info::origin_fullpath and TCP_Server_Info::leaf_fullpath
are protected by refpath_lock mutex and not cifs_tcp_ses_lock
spinlock.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-20 22:54:39 -05:00
Paulo Alcantara
41f10081a9 cifs: fix NULL ptr dereference in refresh_mounts()
Either mount(2) or automount might not have server->origin_fullpath
set yet while refresh_cache_worker() is attempting to refresh DFS
referrals.  Add missing NULL check and locking around it.

This fixes bellow crash:

[ 1070.276835] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI
[ 1070.277676] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[ 1070.278219] CPU: 1 PID: 8506 Comm: kworker/u8:1 Not tainted 5.18.0-rc3 #10
[ 1070.278701] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[ 1070.279495] Workqueue: cifs-dfscache refresh_cache_worker [cifs]
[ 1070.280044] RIP: 0010:strcasecmp+0x34/0x150
[ 1070.280359] Code: 00 00 00 fc ff df 41 54 55 48 89 fd 53 48 83 ec 10 eb 03 4c 89 fe 48 89 ef 48 83 c5 01 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <42> 0f b6 04 28 38 d0 7f 08 84 c0 0f 85 bc 00 00 00 0f b6 45 ff 44
[ 1070.281729] RSP: 0018:ffffc90008367958 EFLAGS: 00010246
[ 1070.282114] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
[ 1070.282691] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1070.283273] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff873eda27
[ 1070.283857] R10: ffffc900083679a0 R11: 0000000000000001 R12: ffff88812624c000
[ 1070.284436] R13: dffffc0000000000 R14: ffff88810e6e9a88 R15: ffff888119bb9000
[ 1070.284990] FS:  0000000000000000(0000) GS:ffff888151200000(0000) knlGS:0000000000000000
[ 1070.285625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1070.286100] CR2: 0000561a4d922418 CR3: 000000010aecc000 CR4: 0000000000350ee0
[ 1070.286683] Call Trace:
[ 1070.286890]  <TASK>
[ 1070.287070]  refresh_cache_worker+0x895/0xd20 [cifs]
[ 1070.287475]  ? __refresh_tcon.isra.0+0xfb0/0xfb0 [cifs]
[ 1070.287905]  ? __lock_acquire+0xcd1/0x6960
[ 1070.288247]  ? is_dynamic_key+0x1a0/0x1a0
[ 1070.288591]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 1070.289012]  ? lock_downgrade+0x6f0/0x6f0
[ 1070.289318]  process_one_work+0x7bd/0x12d0
[ 1070.289637]  ? worker_thread+0x160/0xec0
[ 1070.289970]  ? pwq_dec_nr_in_flight+0x230/0x230
[ 1070.290318]  ? _raw_spin_lock_irq+0x5e/0x90
[ 1070.290619]  worker_thread+0x5ac/0xec0
[ 1070.290891]  ? process_one_work+0x12d0/0x12d0
[ 1070.291199]  kthread+0x2a5/0x350
[ 1070.291430]  ? kthread_complete_and_exit+0x20/0x20
[ 1070.291770]  ret_from_fork+0x22/0x30
[ 1070.292050]  </TASK>
[ 1070.292223] Modules linked in: bpfilter cifs cifs_arc4 cifs_md4
[ 1070.292765] ---[ end trace 0000000000000000 ]---
[ 1070.293108] RIP: 0010:strcasecmp+0x34/0x150
[ 1070.293471] Code: 00 00 00 fc ff df 41 54 55 48 89 fd 53 48 83 ec 10 eb 03 4c 89 fe 48 89 ef 48 83 c5 01 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <42> 0f b6 04 28 38 d0 7f 08 84 c0 0f 85 bc 00 00 00 0f b6 45 ff 44
[ 1070.297718] RSP: 0018:ffffc90008367958 EFLAGS: 00010246
[ 1070.298622] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
[ 1070.299428] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1070.300296] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff873eda27
[ 1070.301204] R10: ffffc900083679a0 R11: 0000000000000001 R12: ffff88812624c000
[ 1070.301932] R13: dffffc0000000000 R14: ffff88810e6e9a88 R15: ffff888119bb9000
[ 1070.302645] FS:  0000000000000000(0000) GS:ffff888151200000(0000) knlGS:0000000000000000
[ 1070.303462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1070.304131] CR2: 0000561a4d922418 CR3: 000000010aecc000 CR4: 0000000000350ee0
[ 1070.305004] Kernel panic - not syncing: Fatal exception
[ 1070.305711] Kernel Offset: disabled
[ 1070.305971] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-20 22:54:17 -05:00
Linus Torvalds
10c5f102e2 Changes since last update:
- Fix use-after-free of the on-stack z_erofs_decompressqueue;
 
  - Fix sysfs documentation Sphinx warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYl2A0REceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBEe7AQCh7aNhYEncBnoHvrB276HCP0xmMwmc0gPq
 6UeSWgarpAEArgfgLyo8tFgS+gvXbhB8/P1FqEMwaRZVLWathXID3QU=
 =GFAS
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "One patch to fix a use-after-free race related to the on-stack
  z_erofs_decompressqueue, which happens very rarely but needs to be
  fixed properly soon.

  The other patch fixes some sysfs Sphinx warnings"

* tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors
  erofs: fix use-after-free of on-stack io[]
2022-04-20 12:35:20 -07:00
Linus Torvalds
906f904097 Revert "fs/pipe: use kvcalloc to allocate a pipe_buffer array"
This reverts commit 5a519c8fe4.

It turns out that making the pipe almost arbitrarily large has some
rather unexpected downsides.  The kernel test robot reports a kernel
warning that is due to pipe->max_usage now growing to the point where
the iter_file_splice_write() buffer allocation can no longer be
satisfied as a slab allocation, and the

        int nbufs = pipe->max_usage;
        struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
                                        GFP_KERNEL);

code sequence there will now always fail as a result.

That code could be modified to use kvcalloc() too, but I feel very
uncomfortable making those kinds of changes for a very niche use case
that really should have other options than make these kinds of
fundamental changes to pipe behavior.

Maybe the CRIU process dumping should be multi-threaded, and use
multiple pipes and multiple cores, rather than try to use one larger
pipe to minimize splice() calls.

Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/all/20220420073717.GD16310@xsang-OptiPlex-9020/
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-20 12:07:53 -07:00
Christian Brauner
705191b03d fs: fix acl translation
Last cycle we extended the idmapped mounts infrastructure to support
idmapped mounts of idmapped filesystems (No such filesystem yet exist.).
Since then, the meaning of an idmapped mount is a mount whose idmapping
is different from the filesystems idmapping.

While doing that work we missed to adapt the acl translation helpers.
They still assume that checking for the identity mapping is enough.  But
they need to use the no_idmapping() helper instead.

Note, POSIX ACLs are always translated right at the userspace-kernel
boundary using the caller's current idmapping and the initial idmapping.
The order depends on whether we're coming from or going to userspace.
The filesystem's idmapping doesn't matter at the border.

Consequently, if a non-idmapped mount is passed we need to make sure to
always pass the initial idmapping as the mount's idmapping and not the
filesystem idmapping.  Since it's irrelevant here it would yield invalid
ids and prevent setting acls for filesystems that are mountable in a
userns and support posix acls (tmpfs and fuse).

I verified the regression reported in [1] and verified that this patch
fixes it.  A regression test will be added to xfstests in parallel.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
Fixes: bd303368b7 ("fs: support mapped mounts of mapped filesystems")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org> # 5.17
Cc: <regressions@lists.linux.dev>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-19 10:19:02 -07:00
Haowen Bai
9339faac6d cifs: Use kzalloc instead of kmalloc/memset
Use kzalloc rather than duplicating its implementation, which
makes code simple and easy to understand.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-18 10:22:57 -05:00
Pavel Begunkov
c0713540f6 io_uring: fix leaks on IOPOLL and CQE_SKIP
If all completed requests in io_do_iopoll() were marked with
REQ_F_CQE_SKIP, we'll not only skip CQE posting but also
io_free_batch_list() leaking memory and resources.

Move @nr_events increment before REQ_F_CQE_SKIP check. We'll potentially
return the value greater than the real one, but iopolling will deal with
it and the userspace will re-iopoll if needed. In anyway, I don't think
there are many use cases for REQ_F_CQE_SKIP + IOPOLL.

Fixes: 83a13a4181 ("io_uring: tweak iopoll CQE_SKIP event counting")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5072fc8693fbfd595f89e5d4305bfcfd5d2f0a64.1650186611.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 06:54:11 -06:00
Jens Axboe
323b190ba2 io_uring: free iovec if file assignment fails
We just return failure in this case, but we need to release the iovec
first. If we're doing IO with more than FAST_IOV segments, then the
iovec is allocated and must be freed.

Reported-by: syzbot+96b43810dfe9c3bb95ed@syzkaller.appspotmail.com
Fixes: 584b0180f0 ("io_uring: move read/write file prep state into actual opcode handler")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-16 21:14:00 -06:00
Linus Torvalds
59250f8a7f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 patches.

  Subsystems affected by this patch series: MAINTAINERS, binfmt, and
  mm (tmpfs, secretmem, kasan, kfence, pagealloc, zram, compaction,
  hugetlb, vmalloc, and kmemleak)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
  mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
  revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
  revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
  hugetlb: do not demote poisoned hugetlb pages
  mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
  mm: fix unexpected zeroed page mapping with zram swap
  mm, page_alloc: fix build_zonerefs_node()
  mm, kfence: support kmem_dump_obj() for KFENCE objects
  kasan: fix hw tags enablement when KUNIT tests are disabled
  irq_work: use kasan_record_aux_stack_noalloc() record callstack
  mm/secretmem: fix panic when growing a memfd_secret
  tmpfs: fix regressions from wider use of ZERO_PAGE
  MAINTAINERS: Broadcom internal lists aren't maintainers
2022-04-15 15:57:18 -07:00
Andrew Morton
aeb7923733 revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
Despite Mike's attempted fix (925346c129), regressions reports
continue:

  https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
  https://bugzilla.kernel.org/show_bug.cgi?id=215720
  https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

So revert this patch.

Fixes: 9630f0d60f ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:56 -07:00
Andrew Morton
354e923df0 revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
Commit 925346c129 ("fs/binfmt_elf: fix PT_LOAD p_align values for
loaders") was an attempt to fix regressions due to 9630f0d60f
("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").

But regressionss continue to be reported:

  https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
  https://bugzilla.kernel.org/show_bug.cgi?id=215720
  https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

This patch reverts the fix, so the original can also be reverted.

Fixes: 925346c129 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:56 -07:00
Linus Torvalds
0647b9cc7f io_uring-5.18-2022-04-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmJY2BIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpielEACDqz5oShRkOHGAaqo0jbOy7C9I0hwUXS9Y
 QeIH7umrSUBgDXLIq3jbsgK3d0nroaYHTU8aa64XnyB0lGouBTiw+I8FZfxNlW7w
 HI+AUZn/m0wvTuKcw//cSX2CP2pcVPfmao+JskU6kyrsoj0nkQNx2NNHXbVQ3cFC
 DKlZE7ZulvzfM+xC0aJxIYUWLECzqgZvicn+mqeqVQX9QJ3k/637GTnVu83QSpIC
 0Hw/isuGuaK+0nurwc4Rx9ZojItVYyPPt3a+8ImtGaJlhyeg9bHLMJZbBLvrlqjd
 AS2iVPaQOhFfxt8qe7ETHpIUmkBQZIckavsCCO7sfFVKtGRNA0kQkAYLjXLDLP8T
 1DXn8VQHsGHHcBd2vTZURng32AniOCzQkshLGF8s/yYuoHp7JODDhwu4xO5HC5eN
 rD7SNDcW9mYlEmVtCuoeKCrRknHL+x3ZTlAPTXr3DgjtgB7phBUmLqZU7riwy7vs
 0oGD/uXf5jT7Ujw2fZNF7LFetQVJkCi92p1+IGBO0hXQShZ5IsCITk95V2d+2cVs
 J1r7ZkXiXiWHkHDQQfuKNeVoUXRe10a4+xOeOCMrwuOgKpKIyoN7ofb6Lp4SRE+j
 f6PvGXvfwIRz9McJwg5IWCIEVASysn4s17bgnaMKXpRCgf67Z4HMTcBO1PHRORBD
 ssBXiVxOhA==
 =aTrS
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.18-2022-04-14' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Ensure we check and -EINVAL any use of reserved or struct padding.

   Although we generally always do that, it's missed in two spots for
   resource updates, one for the ring fd registration from this merge
   window, and one for the extended arg. Make sure we have all of them
   handled. (Dylan)

 - A few fixes for the deferred file assignment (me, Pavel)

 - Add a feature flag for the deferred file assignment so apps can tell
   we handle it correctly (me)

 - Fix a small perf regression with the current file position fix in
   this merge window (me)

* tag 'io_uring-5.18-2022-04-14' of git://git.kernel.dk/linux-block:
  io_uring: abort file assignment prior to assigning creds
  io_uring: fix poll error reporting
  io_uring: fix poll file assign deadlock
  io_uring: use right issue_flags for splice/tee
  io_uring: verify pad field is 0 in io_get_ext_arg
  io_uring: verify resv is 0 in ringfd register/unregister
  io_uring: verify that resv2 is 0 in io_uring_rsrc_update2
  io_uring: move io_uring_rsrc_update2 validation
  io_uring: fix assign file locking issue
  io_uring: stop using io_wq_work as an fd placeholder
  io_uring: move apoll->events cache
  io_uring: io_kiocb_update_pos() should not touch file for non -1 offset
  io_uring: flag the fact that linked file assignment is sane
2022-04-15 11:33:20 -07:00
Hongyu Jin
60b3005011 erofs: fix use-after-free of on-stack io[]
The root cause is the race as follows:
Thread #1                              Thread #2(irq ctx)

z_erofs_runqueue()
  struct z_erofs_decompressqueue io_A[];
  submit bio A
  z_erofs_decompress_kickoff(,,1)
                                       z_erofs_decompressqueue_endio(bio A)
                                       z_erofs_decompress_kickoff(,,-1)
                                       spin_lock_irqsave()
                                       atomic_add_return()
  io_wait_event()	-> pending_bios is already 0
  [end of function]
                                       wake_up_locked(io_A[]) // crash

Referenced backtrace in kernel 5.4:

[   10.129422] Unable to handle kernel paging request at virtual address eb0454a4
[   10.364157] CPU: 0 PID: 709 Comm: getprop Tainted: G        WC O      5.4.147-ab09225 #1
[   11.556325] [<c01b33b8>] (__wake_up_common) from [<c01b3300>] (__wake_up_locked+0x40/0x48)
[   11.565487] [<c01b3300>] (__wake_up_locked) from [<c044c8d0>] (z_erofs_vle_unzip_kickoff+0x6c/0xc0)
[   11.575438] [<c044c8d0>] (z_erofs_vle_unzip_kickoff) from [<c044c854>] (z_erofs_vle_read_endio+0x16c/0x17c)
[   11.586082] [<c044c854>] (z_erofs_vle_read_endio) from [<c06a80e8>] (clone_endio+0xb4/0x1d0)
[   11.595428] [<c06a80e8>] (clone_endio) from [<c04a1280>] (blk_update_request+0x150/0x4dc)
[   11.604516] [<c04a1280>] (blk_update_request) from [<c06dea28>] (mmc_blk_cqe_complete_rq+0x144/0x15c)
[   11.614640] [<c06dea28>] (mmc_blk_cqe_complete_rq) from [<c04a5d90>] (blk_done_softirq+0xb0/0xcc)
[   11.624419] [<c04a5d90>] (blk_done_softirq) from [<c010242c>] (__do_softirq+0x184/0x56c)
[   11.633419] [<c010242c>] (__do_softirq) from [<c01051e8>] (irq_exit+0xd4/0x138)
[   11.641640] [<c01051e8>] (irq_exit) from [<c010c314>] (__handle_domain_irq+0x94/0xd0)
[   11.650381] [<c010c314>] (__handle_domain_irq) from [<c04fde70>] (gic_handle_irq+0x50/0xd4)
[   11.659641] [<c04fde70>] (gic_handle_irq) from [<c0101b70>] (__irq_svc+0x70/0xb0)

Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220401115527.4935-1-hongyu.jin.cn@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-04-15 23:51:43 +08:00
Theodore Ts'o
eb7054212e ext4: update the cached overhead value in the superblock
If we (re-)calculate the file system overhead amount and it's
different from the on-disk s_overhead_clusters value, update the
on-disk version since this can take potentially quite a while on
bigalloc file systems.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-04-14 22:39:00 -04:00
Jens Axboe
701521403c io_uring: abort file assignment prior to assigning creds
We need to either restore creds properly if we fail on the file
assignment, or just do the file assignment first instead. Let's do
the latter as it's simpler, should make no difference here for
file assignment.

Link: https://lore.kernel.org/lkml/000000000000a7edb305dca75a50@google.com/
Reported-by: syzbot+60c52ca98513a8760a91@syzkaller.appspotmail.com
Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-14 20:23:40 -06:00
Theodore Ts'o
85d825dbf4 ext4: force overhead calculation if the s_overhead_cluster makes no sense
If the file system does not use bigalloc, calculating the overhead is
cheap, so force the recalculation of the overhead so we don't have to
trust the precalculated overhead in the superblock.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-04-14 22:05:47 -04:00
Namjae Jeon
02655a70b7 ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION
Currently ksmbd is using ->f_bsize from vfs_statfs() as sector size.
If fat/exfat is a local share, ->f_bsize is a cluster size that is too
large to be used as a sector size. Sector sizes larger than 4K cause
problem occurs when mounting an iso file through windows client.

The error message can be obtained using Mount-DiskImage command,
 the error is:
"Mount-DiskImage : The sector size of the physical disk on which the
virtual disk resides is not supported."

This patch reports fixed 4KB sector size if ->s_blocksize is bigger
than 4KB.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-14 20:56:13 -05:00
Namjae Jeon
8510a043d3 ksmbd: increment reference count of parent fp
Add missing increment reference count of parent fp in
ksmbd_lookup_fd_inode().

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-14 20:56:13 -05:00
Namjae Jeon
50f500b7f6 ksmbd: remove filename in ksmbd_file
If the filename is change by underlying rename the server, fp->filename
and real filename can be different. This patch remove the uses of
fp->filename in ksmbd and replace it with d_path().

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-14 20:56:13 -05:00
Theodore Ts'o
10b01ee92d ext4: fix overhead calculation to account for the reserved gdt blocks
The kernel calculation was underestimating the overhead by not taking
into account the reserved gdt blocks.  With this change, the overhead
calculated by the kernel matches the overhead calculation in mke2fs.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-04-14 21:31:27 -04:00
Linus Torvalds
62345e4828 5 fixes to cifs client, 1 for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmJYRzcACgkQiiy9cAdy
 T1EECAwAuCUHV174bwd5XQBeP7aouw6upKPF9jbXXJpQy/NKxwFAGpiKaEiGAGjb
 dA1M+wdBZfeIQQW7pCuFWuWvtagJChEdnAKUamnX+G+l2iWCxsJz64nisc27u6CR
 z/n6YfIT8OPVQLhfU9xgyXmmmDSSJy6YFNgL3Bh+poR/8764HhEkCgeV3P3kD4Vo
 uvj9p6M4K1Ddh4v914QgDhLzk1UsR++zHCE9IuXrxNDgzrQ5Kh5VFn52vaEaNz7g
 hIr790gALsGNnLrapiaD3L2alrGwr3GO3WT5pqAdCgzTWq5bngBBQYl0qSWnSHs+
 OhnFFBtZ1efRX2bp6SeR8r2z5hNj4YiLJCo/cjoJBIU7W4YDOdLo+UVm/sW9xKgA
 tbw5lYWNDMpHygkb+/uSnw807y1cz/3Ip9QbN/ESQP6b9hg/woAXkjgiMVDcew0P
 nNcQEz0LZHUAzidIAHlByTo17QNQdHOhhZYQuA48dH4KJJrSEWXmiG9n9Qp4Q6cX
 6aueD1Lw
 =q5N5
 -----END PGP SIGNATURE-----

Merge tag '5.18-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:

 - two fixes related to unmount

 - symlink overflow fix

 - minor netfs fix

 - improved tracing for crediting (flow control)

* tag '5.18-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: verify that tcon is valid before dereference in cifs_kill_sb
  cifs: potential buffer overflow in handling symlinks
  cifs: Split the smb3_add_credits tracepoint
  cifs: release cached dentries only if mount is complete
  cifs: Check the IOCB_DIRECT flag, not O_DIRECT
2022-04-14 16:00:36 -07:00
NeilBrown
b3d4650d82 VFS: filename_create(): fix incorrect intent.
When asked to create a path ending '/', but which is not to be a
directory (LOOKUP_DIRECTORY not set), filename_create() will never try
to create the file.  If it doesn't exist, -ENOENT is reported.

However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystems
->lookup() function, even though there is no intent to create.  This is
misleading and can cause incorrect behaviour.

If you try

   ln -s foo /path/dir/

where 'dir' is a directory on an NFS filesystem which is not currently
known in the dcache, this will fail with ENOENT.

But as the name is not in the dcache, nfs_lookup gets called with
LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any
lookup, with the expectation that a subsequent call to create the target
will be made, and the lookup can be combined with the creation.  In the
case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never
made.  Instead filename_create() sees that the dentry is not (yet)
positive and returns -ENOENT - even though the directory actually
exists.

So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to
create, and use the absence of these flags to decide if -ENOENT should
be returned.

Note that filename_parentat() is only interested in LOOKUP_REVAL, so we
split that out and store it in 'reval_flag'.  __lookup_hash() then gets
reval_flag combined with whatever create flags were determined to be
needed.

Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-14 15:53:43 -07:00
Linus Torvalds
722985e2f6 for-5.18-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJUfvwACgkQxWXV+ddt
 WDtiuA//csj0CrJq7wyRvgDkbPtCCMyDtL4zfmjP4s++NWaMDOKTxBU8msuGUJgR
 Xribel2zWqlFiOzptd9sGEfxfKfz1p5Rm/gtFj57WVSGV7YtiAGyFGuzn/vpCrgq
 NP5LFY2z5N36VxDXUPKHzvdqczO5W2n9KdaysJCr6FpCz+vVrplFiT5U+X175Sgg
 zWS/XrPIHYbtaEFdb3rUKol6riu7vXW3MxEA9di8K4Xo9TJrp0NtBoGZDsWFQxf7
 vfVwtYsQPsACJxw+MjBcVQ5fNXO5iATL1JfBb9ltN589xouja7bDCb40Fm2gEwWB
 IUatnCq/4MN2S2NdPtEcXQ52W9svT/87z6ZblefSEiQqvQBBJHN131RTM/s8LBG4
 fkHcGV6PsiutSIFycrID0bpXr1Mhvg2aMjxdriLPBtYopaMPh+ivK6LPYoE5MggQ
 /rshWfjiWJhPKXPsng+H7UGbViOiYeG0kUBuaqFx4ARnESpN1gF2dJRYvXYFL/8a
 Q0wmLr1tf3M82VAaAFNOl/BVk8blutCSHcJDLDKxcl3DhVVlY5J8onEPXjneJRkk
 rRB3zoxLlptgfW75CPNyFrpicPdAXzGCoccienIUoEdHSX/W5rNA4L6XpVE5Hv94
 dWR1aVAbCUcuhY1QfBU7H2iYw0RHzqDO3+IRfqJuYsLGciijLbs=
 =APMY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more code and warning fixes.

  There's one feature ioctl removal patch slated for 5.18 that did not
  make it to the main pull request. It's just a one-liner and the ioctl
  has a v2 that's in use for a long time, no point to postpone it to
  5.19.

  Late update:

   - remove balance v1 ioctl, superseded by v2 in 2012

  Fixes:

   - add back cgroup attribution for compressed writes

   - add super block write start/end annotations to asynchronous balance

   - fix root reference count on an error handling path

   - in zoned mode, activate zone at the chunk allocation time to avoid
     ENOSPC due to timing issues

   - fix delayed allocation accounting for direct IO

  Warning fixes:

   - simplify assertion condition in zoned check

   - remove an unused variable"

* tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix btrfs_submit_compressed_write cgroup attribution
  btrfs: fix root ref counts in error handling in btrfs_get_root_ref
  btrfs: zoned: activate block group only for extent allocation
  btrfs: return allocated block group from do_chunk_alloc()
  btrfs: mark resumed async balance as writing
  btrfs: remove support of balance v1 ioctl
  btrfs: release correct delalloc amount in direct IO write path
  btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
  btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
2022-04-14 10:58:27 -07:00
Linus Torvalds
ec9c57a732 fscache fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmJW6J0ACgkQ+7dXa6fL
 C2sXZw//Z5if0BzqI0SjZ7JeozZ4ADgAKYrT0WvYRg7TPfJBThZsa9NEaQfueXBp
 0DlZpM/ToNQXMYV/jRlDjzooPstKt45IdS4wMewTtXZAXDuZNDygOIRTY5S3zKHY
 wW4ZuZ/3AtNejY8YAnxFtykcbVfxMzYkQ+cgX79fd/+ZLgpYImtk4k3fS53pyRhv
 Q6bT7zxIOIGzaPVda/lu4t1vdTXNTNI2b/9lYOPEXMC6OFmHQ94dnqsyW6AKUD6K
 oYldKUFIFcv5CeiH8sTXFrkckNqgkUln5OXdkFLUjRPHE8Yyq57ZyX7ilp4yFdfS
 HE8VpDaDfk11hVpFLngkPu5RJKsRbad5CQ+Jl8+xiEn0fbuR4q7LFvqSVoykh/x5
 hh5veuSc/KvfpRIijNlCff4gry9gnAhHlLbt8xFbpNWPaYFTVdkkuDtoB8Kv/KoV
 HmxyNyQEW2eL7i7JXfSI4EjNPI85RVhVXmccbJJoKHQRvSVCZuWBlR7NUkykzuXg
 CQbnO/SeqilpOpbFvhTUXdOXNQEHbtcKp9yO9hd/f67E7gNKiDOewOmItMt4HfFx
 pjKv48tLR1cnZtL8pNf7nSMKuGe0bofDu46rnra0Z4mcz4d4y9mkIrdoIVT/heae
 E3Ut/1GJ0MkETl+nYXdV9IxTMHWIVGyzt7HyJTziXj03dfVquqA=
 =fUQK
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20220413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache fixes from David Howells:
 "Here's a collection of fscache and cachefiles fixes and misc small
  cleanups. The two main fixes are:

   - Add a missing unmark of the inode in-use mark in an error path.

   - Fix a KASAN slab-out-of-bounds error when setting the xattr on a
     cachefiles volume due to the wrong length being given to memcpy().

  In addition, there's the removal of an unused parameter, removal of an
  unused Kconfig option, conditionalising a bit of procfs-related stuff
  and some doc fixes"

* tag 'fscache-fixes-20220413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: remove FSCACHE_OLD_API Kconfig option
  fscache: Use wrapper fscache_set_cache_state() directly when relinquishing
  fscache: Move fscache_cookies_seq_ops specific code under CONFIG_PROC_FS
  fscache: Remove the cookie parameter from fscache_clear_page_bits()
  docs: filesystems: caching/backend-api.rst: fix an object withdrawn API
  docs: filesystems: caching/backend-api.rst: correct two relinquish APIs use
  cachefiles: Fix KASAN slab-out-of-bounds in cachefiles_set_volume_xattr
  cachefiles: unmark inode in use in error path
2022-04-14 10:51:20 -07:00
Ronnie Sahlberg
8b6c58458e cifs: verify that tcon is valid before dereference in cifs_kill_sb
On umount, cifs_sb->tlink_tree might contain entries that do not represent
a valid tcon.
Check the tcon for error before we dereference it.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-14 00:07:36 -05:00
Harshit Mogalapalli
64c4a37ac0 cifs: potential buffer overflow in handling symlinks
Smatch printed a warning:
	arch/x86/crypto/poly1305_glue.c:198 poly1305_update_arch() error:
	__memcpy() 'dctx->buf' too small (16 vs u32max)

It's caused because Smatch marks 'link_len' as untrusted since it comes
from sscanf(). Add a check to ensure that 'link_len' is not larger than
the size of the 'link_str' buffer.

Fixes: c69c1b6eae ("cifs: implement CIFSParseMFSymlink()")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-13 12:00:49 -05:00
Pavel Begunkov
7179c3ce3d io_uring: fix poll error reporting
We should not return an error code in req->result in
io_poll_check_events(), because it may get mangled and returned as
success. Just return the error code directly, the callers will fail the
request or proceed accordingly.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5f03514ee33324dc811fb93df84aee0f695fb044.1649862516.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13 10:25:37 -06:00
Pavel Begunkov
cce64ef013 io_uring: fix poll file assign deadlock
We pass "unlocked" into io_assign_file() in io_poll_check_events(),
which can lead to double locking.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2476d4ae46554324b599ee4055447b105f20a75a.1649862516.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13 10:25:37 -06:00
Pavel Begunkov
e941976659 io_uring: use right issue_flags for splice/tee
Pass right issue_flags into into io_file_get_fixed() instead of
IO_URING_F_UNLOCKED. It's probably not a problem at the moment but let's
do it safer.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d242daa9df5d776907686977cd29fbceb4a2d8d.1649862516.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13 10:25:36 -06:00
Tadeusz Struk
2da376228a ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
Syzbot found an issue [1] in ext4_fallocate().
The C reproducer [2] calls fallocate(), passing size 0xffeffeff000ul,
and offset 0x1000000ul, which, when added together exceed the
bitmap_maxbytes for the inode. This triggers a BUG in
ext4_ind_remove_space(). According to the comments in this function
the 'end' parameter needs to be one block after the last block to be
removed. In the case when the BUG is triggered it points to the last
block. Modify the ext4_punch_hole() function and add constraint that
caps the length to satisfy the one before laster block requirement.

LINK: [1] https://syzkaller.appspot.com/bug?id=b80bd9cf348aac724a4f4dff251800106d721331
LINK: [2] https://syzkaller.appspot.com/text?tag=ReproC&x=14ba0238700000

Fixes: a4bb6b64e3 ("ext4: enable "punch hole" functionality")
Reported-by: syzbot+7a806094edd5d07ba029@syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220331200515.153214-1-tadeusz.struk@linaro.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-04-12 22:24:35 -04:00