There is an old problem with io-wq cancellation where requests should be
killed and are in io-wq but are not discoverable, e.g. in @next_hashed
or @linked vars of io_worker_handle_work(). It adds some unreliability
to individual request canellation, but also may potentially get
__io_uring_cancel() stuck. For instance:
1) An __io_uring_cancel()'s cancellation round have not found any
request but there are some as desribed.
2) __io_uring_cancel() goes to sleep
3) Then workers wake up and try to execute those hidden requests
that happen to be unbound.
As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT
in advance, so preventing 3) from executing unbound requests. The
workers will initially break looping because of getting a signal as they
are threads of the dying/exec()'ing user task.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
files need to check the opener credentials, since these fds do not
transition state across execve(). Without this, it is possible to
trick another process (which may have different credentials) to write
to its own /proc/$pid/attr/ files, leading to unexpected and possibly
exploitable behaviors.
[1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCs8qYACgkQ+7dXa6fL
C2s3QhAAkRSWoZQEmisGMcRKOhDJQh8Qc7lV4aKXFIa4EaLdeBPvhJhnJqMld2KY
m35g4bSU/RsUjzSCLXVnEiHa9jdFKyK0C/XWshyidzrTDUk0HN6NXsBpp3ztWKlq
iMOvQYnKWKoWr4seIdC1fAKSFcQ3uRlVnDnmm0GtB5ahu5ThNQtqf8nYMSuULZbo
K9SybNUVCrDsORqDu2595gfK63MCOVn72Hj066s8owHrbD8Io52Kf6Q7jP1CkMGL
x6Kl0pwjql6usUsaDEaqmNT3ck7UjlLp5h1EZnt/7SWbgInpNzk6BLP33DwCis+4
rUpu+Zf8TEeOYDU5if8QpVszwsMyoKtkp9AjgjZkvxbedCqHkXJjxrnkk6/H7yJc
4Zvi8sIU52D9PcZO0bD8zP/8eYm/ZTVjMjDt8PvIbTA583oGNWsfRBbvJYi1huxB
i3G0PNVbqH0U3Z78XH4dmrkE1oMxbq2O5fg9ZNCuStxqD2vrZyyo/CcfidElLnCq
vcT+obEI+NYFphMzk7rwSL4pH4OPwPziJfiudKANmUDei8rOejQ8nrw18CVF7neN
Ewj1XiHOdi4JgGq92owpmCmTvle7GG9KNuCvfd4U67S9KOJAPT5UrSD696PrJJN7
YpcBHJMqS9XLXwrGuKD7oDroxEEpvJEunRH+yt3YPa5OQtX3wIA=
=poNo
-----END PGP SIGNATURE-----
Merge tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs fixes from David Howells:
"A couple of fixes to the new netfs lib:
- Pass the AOP flags through from netfs_write_begin() into
grab_cache_page_write_begin().
- Automatically enable in Kconfig netfs lib rather than presenting an
option for manual enablement"
* tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual
netfs: Pass flags through to grab_cache_page_write_begin()
Reporting event->pid should depend on the privileges of the user that
initialized the group, not the privileges of the user reading the
events.
Use an internal group flag FANOTIFY_UNPRIV to record the fact that the
group was initialized by an unprivileged user.
To be on the safe side, the premissions to setup filesystem and mount
marks now require that both the user that initialized the group and
the user setting up the mark have CAP_SYS_ADMIN.
Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiA77_P5vtv7e83g0+9d7B5W9ZTE4GfQEYbWmfT1rA=VA@mail.gmail.com/
Fixes: 7cea2a3c50 ("fanotify: support limited functionality for unprivileged users")
Cc: <Stable@vger.kernel.org> # v5.12+
Link: https://lore.kernel.org/r/20210524135321.2190062-1-amir73il@gmail.com
Reviewed-by: Matthew Bobrowski <repnop@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The RTINHERIT bit can be set on a directory so that newly created
regular files will have the REALTIME bit set to store their data on the
realtime volume. If an extent size hint (and EXTSZINHERIT) are set on
the directory, the hint will also be copied into the new file.
As pointed out in previous patches, for realtime files we require the
extent size hint be an integer multiple of the realtime extent, but we
don't perform the same validation on a directory with both RTINHERIT and
EXTSZINHERIT set, even though the only use-case of that combination is
to propagate extent size hints into new realtime files. This leads to
inode corruption errors when the bad values are propagated.
Because there may be existing filesystems with such a configuration, we
cannot simply amend the inode verifier to trip on these directories and
call it a day because that will cause previously "working" filesystems
to start throwing errors abruptly. Note that it's valid to have
directories with rtinherit set even if there is no realtime volume, in
which case the problem does not manifest because rtinherit is ignored if
there's no realtime device; and it's possible that someone set the flag,
crashed, repaired the filesystem (which clears the hint on the realtime
file) and continued.
Therefore, mitigate this issue in several ways: First, if we try to
write out an inode with both rtinherit/extszinherit set and an unaligned
extent size hint, turn off the hint to correct the error. Second, if
someone tries to misconfigure a directory via the fssetxattr ioctl, fail
the ioctl. Third, reverify both extent size hint values when we
propagate heritable inode attributes from parent to child, to prevent
misconfigurations from spreading.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
While chasing a bug involving invalid extent size hints being propagated
into newly created realtime files, I noticed that the xfs_ioctl_setattr
checks for the extent size hints weren't the same as the ones now
encoded in libxfs and used for validation in repair and mkfs.
Because the checks in libxfs are more stringent than the ones in the
ioctl, it's possible for a live system to set inode flags that
immediately result in corruption warnings. Specifically, it's possible
to set an extent size hint on an rtinherit directory without checking if
the hint is aligned to the realtime extent size, which makes no sense
since that combination is used only to seed new realtime files.
Replace the open-coded and inadequate checks with the libxfs verifier
versions and update the code comments a bit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The new online shrink code exposed a gap in the per-AG reservation
code, which is that we only return ENOSPC to callers if the entire fs
doesn't have enough free blocks. Except for debugging mode, the
reservation init code doesn't ever check that there's enough free space
in that AG to cover the reservation.
Not having enough space is not considered an immediate fatal error that
requires filesystem offlining because (a) it's shouldn't be possible to
wind up in that state through normal file operations and (b) even if
one did, freeing data blocks would recover the situation.
However, online shrink now needs to know if shrinking would not leave
enough space so that it can abort the shrink operation. Hence we need
to promote this assertion into an actual error return.
Observed by running xfs/168 with a 1k block size, though in theory this
could happen with any configuration.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In commit d6995da311 ("hugetlb: use page.private for hugetlb specific
page flags") the use of PagePrivate to indicate a reservation count
should be restored at free time was changed to the hugetlb specific flag
HPageRestoreReserve. Changes to a userfaultfd error path as well as a
VM_BUG_ON() in remove_inode_hugepages() were overlooked.
Users could see incorrect hugetlb reserve counts if they experience an
error with a UFFDIO_COPY operation. Specifically, this would be the
result of an unlikely copy_huge_page_from_user error. There is not an
increased chance of hitting the VM_BUG_ON.
Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com
Fixes: d6995da311 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasry.mina@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCpPQcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptBfD/0XV2+UXT7GTq3zfAgeCRojxzjH8YSh/fu1
uSYA2fME0J7B2lpgxwHmDwgW/JkxkQ9oal+QxoNUJnmTF4CN3c7edQYxaA+QAnb/
XiEY6s5slpBtopJCXQqPlE6dUnn1yjc0wNIm3EjvmmMaFjb6MVfMZWqWn9AANNTd
yiRtk8a7KQuYdBeQMPVQG4v1ue37VTL5B9D9tM3p03W2ngNhtWw2Yxy5k/ePseip
HYhPm+SKcbpmSFS+KN/a4aBLHyW89FRnhBWZF50sBmdUD+HLgz09IyFdSqyo9s2f
wb7h3u3FbzTk3JofcJhYfqoQXkwmYhHrwNGCMjhK/zy+qloCIOu8Nw4jkcH+VYwK
Rf7cFu+CZDRgcIu4Op/W5CPHNPY680Rxd/yBKlG/n4aZ/zxuuOu08992Z5BSaxfw
UpIFMOWMuDvbBRUk71R34ME0o1wNhWL75Rljh97dAMRZLez1h8CmGdktT9g4keuo
71Swq51AQk7fWXW0yQK2kIpbTjazfh6+AEvdF4c/Njss83K7PHCq00xeKI9PeNXN
aQvPBpFifTeN1B1IENH5wEHO8F7e38eU45WHPwgNJUuSEpuBoXQGoLBlf6WXzNUS
sIt6UjGDCFTZddIYwVfVISl7+DLBLCxYRYnw0Mx99x1shUpH+6q0HdpPgOxiKmoH
ZgdG/q8rVg==
=tipG
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"One fix for a regression with poll in this merge window, and another
just hardens the io-wq exit path a bit"
* tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
io_uring: fortify tctx/io_wq cleanup
io_uring: don't modify req->poll for rw
- Fix some math errors in the realtime allocator when extent size hints
are applied.
- Fix unnecessary short writes to realtime files when free space is
fragmented.
- Fix a crash when using scrub tracepoints.
- Restore ioctl uapi definitions that were accidentally removed in
5.13-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCmgSoACgkQ+H93GTRK
tOvInhAAj9JbQLz/BPt14qVdteNDNBpdFdl/SC5Bow5ABOWi8FSOu9N/F32nMy9y
fGPXP2G//sYzfW5jwE+ZPEEq7e+K55rLZHxHsbtWXjCL96t9edEDGFS5p6LmgnRW
aJwE1QxhFHFJZEX1+Y08zmB+fSnwHJ4HzZihYb1f9sTI5cJRh7pvvj26HiqfUk9I
FUT5Oo/Dy8gSSRmNCPWlLWhXJurrSwAYHmoE44vNHoEYHcodVwAK+ZR/Dj/5DQaS
DTYgLeHZQmPijcW7/B/RKcEz96hMQ9afg7vBdgZwhG/oFCRKV2m3rYjjjo/AgKv4
4FkmUoP5+CTT1vt9UKrdyl9uLTMqxiHuU1qvb82DM5XbbsLEFBBa+e3QrWmJqR8D
3lBp6ogtOmbNEJpgLxCdbVl80HOjB+yaIWUB536nauz4USZvCcRGvdYDQ922q6ig
1eT5Q6KCNgO3e6WIQ3W5kNJsM+/gXlNvwwhN/jHCQKy//bgVRnNYbODC6K46mjp/
H8+NWyzQXpFjRWjmQ60LD2/RlyJVbbLbMkPTO1g/vjyZWjgp0fV2wtG6Ag/FlwVF
DERUdp/0m6V+lPRcKbWCasjEc+pis6TDnvn+tCftXthh3fXGcalC3bnmv3yH9rkP
YAejnDvuuLVtgPyVARA9DTQR5ch5CSRuFc6sU9SFL5nv2p59RtU=
=IvUb
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix some math errors in the realtime allocator when extent size hints
are applied.
- Fix unnecessary short writes to realtime files when free space is
fragmented.
- Fix a crash when using scrub tracepoints.
- Restore ioctl uapi definitions that were accidentally removed in
5.13-rc1.
* tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: restore old ioctl definitions
xfs: fix deadlock retry tracepoint arguments
xfs: retry allocations when locality-based search fails
xfs: adjust rt allocation minlen when extszhint > rtextsize
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCn2x8ACgkQiiy9cAdy
T1HjnQv+M87Xx++VVaJzeLQQlKGA/vfkhM7YLEkIwxmbUpt8JURORoK91xVa/RZA
eS/K2tYOilAuuV7VXXw6ng6WNCWE/l+BNT5FHZ4WJt71pE1/tN/NIACtOhBB01GO
r+JhAE08zYLu8vA1Ax1EBtSSBjTLUjDX0fWMfwD4C/BBABw5VZISnkSEj2lC6wT9
vovEalU9amMRrvlhK9Z+MRJRJFzxY4LingiEVlFIdLczCGia5PgSl3NXRY1//rNO
wc//34cCGxBNc5Su5Bvn1kTZT5mdBFR98mLOuD+Dw55LlIlShKDnhZHGQDGPyQGT
ey2w2b+pNAr3rwVNtU6JNmI7AiUllNHiDu5UsyB0ctDWJljzrILd4uPaWofcNXAh
5qPRvuGsqjo3D/10DPshla1pJtmFr8eKXy8o6UVfMYQSHDo1LbqMll7ArGgV3Fxn
B2g5N+ax1+DXZlykKJGhYBBkvGANuUBU/tq810i5BvLhfrc1dx+pJlZAeO5OxCSA
SBUiirq4
=neWC
-----END PGP SIGNATURE-----
Merge tag '5.13-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Seven smb3 fixes: one for stable, three others fix problems found in
testing handle leases, and a compounded request fix"
* tag '5.13-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6:
Fix KASAN identified use-after-free issue.
Defer close only when lease is enabled.
Fix kernel oops when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
cifs: Fix inconsistent indenting
cifs: fix memory leak in smb2_copychunk_range
SMB3: incorrect file id in requests compounded with open
cifs: remove deadstore in cifs_close_all_deferred_files()
Pull siginfo fix from Eric Biederman:
"During the merge window an issue with si_perf and the siginfo ABI came
up. The alpha and sparc siginfo structure layout had changed with the
addition of SIGTRAP TRAP_PERF and the new field si_perf.
The reason only alpha and sparc were affected is that they are the
only architectures that use si_trapno.
Looking deeper it was discovered that si_trapno is used for only a few
select signals on alpha and sparc, and that none of the other
_sigfault fields past si_addr are used at all. Which means technically
no regression on alpha and sparc.
While the alignment concerns might be dismissed the abuse of si_errno
by SIGTRAP TRAP_PERF does have the potential to cause regressions in
existing userspace.
While we still have time before userspace starts using and depending
on the new definition siginfo for SIGTRAP TRAP_PERF this set of
changes cleans up siginfo_t.
- The si_trapno field is demoted from magic alpha and sparc status
and made an ordinary union member of the _sigfault member of
siginfo_t. Without moving it of course.
- si_perf is replaced with si_perf_data and si_perf_type ending the
abuse of si_errno.
- Unnecessary additions to signalfd_siginfo are removed"
* 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
signal: Deliver all of the siginfo perf data in _perf
signal: Factor force_sig_perf out of perf_sigtrap
signal: Implement SIL_FAULT_TRAPNO
siginfo: Move si_trapno inside the union inside _si_fault
When a write fault occurs, we need to take the inode glock of the underlying
inode in exclusive mode. Otherwise, there's no guarantee that the dirty page
will be written back to disk.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Here is a big set of char/misc/other driver fixes for 5.13-rc3.
The majority here is the fallout of the umn.edu re-review of all prior
submissions. That resulted in a bunch of reverts along with the
"correct" changes made, such that there is no regression of any of the
potential fixes that were made by those individuals. I would like to
thank the over 80 different developers who helped with the review and
fixes for this mess.
Other than that, there's a few habanna driver fixes for reported issues,
and some dyndbg fixes for reported problems.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYKZCBg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynhRQCdGk6ri4oluyn/Z/2KAjvXDOmTmvgAn12VP42d
S1Zmh4qRH2OWaLOBg7c2
=qtxj
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here is a big set of char/misc/other driver fixes for 5.13-rc3.
The majority here is the fallout of the umn.edu re-review of all prior
submissions. That resulted in a bunch of reverts along with the
"correct" changes made, such that there is no regression of any of the
potential fixes that were made by those individuals. I would like to
thank the over 80 different developers who helped with the review and
fixes for this mess.
Other than that, there's a few habanna driver fixes for reported
issues, and some dyndbg fixes for reported problems.
All of these have been in linux-next for a while with no reported
problems"
* tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (82 commits)
misc: eeprom: at24: check suspend status before disable regulator
uio_hv_generic: Fix another memory leak in error handling paths
uio_hv_generic: Fix a memory leak in error handling paths
uio/uio_pci_generic: fix return value changed in refactoring
Revert "Revert "ALSA: usx2y: Fix potential NULL pointer dereference""
dyndbg: drop uninformative vpr_info
dyndbg: avoid calling dyndbg_emit_prefix when it has no work
binder: Return EFAULT if we fail BINDER_ENABLE_ONEWAY_SPAM_DETECTION
cdrom: gdrom: initialize global variable at init time
brcmfmac: properly check for bus register errors
Revert "brcmfmac: add a check for the status of usb_register"
video: imsttfb: check for ioremap() failures
Revert "video: imsttfb: fix potential NULL pointer dereferences"
net: liquidio: Add missing null pointer checks
Revert "net: liquidio: fix a NULL pointer dereference"
media: gspca: properly check for errors in po1030_probe()
Revert "media: gspca: Check the return value of write_bridge for timeout"
media: gspca: mt9m111: Check write_bridge for timeout
Revert "media: gspca: mt9m111: Check write_bridge for timeout"
media: dvb: Add check on sp8870_readreg return
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCmN9AACgkQnJ2qBz9k
QNn5ZwgAwnLdgBuILDqJwPaYpXOzvMhjjG8AwBDzhMYhhpt+OOCUevoRm7mDU7J2
t/DlwWGMhpp80ku+x+AURR/ltOfFvw4QAHeIXPWjkoieFKcLOEvAjWWZP6oIFC12
5e/QVXqK58fuRJwveYp4jZ+AXvDMoHJrDXsoTFezjBDIQQgzlIlrMzPavS/6UzUN
mAF2sapE9lcQoRMfU8kktBWPVM/GpFkus2Q48EYFCZ1rp3aRyw/aahTVuvSUZCV0
XiY6f2F7qgFLtomK6UurlxTc7rPsrG+UmNvGWuXf3R81UawegmKQeG5zcaMGrZs1
kHyJQcP9nGYPLDXt/4kW9cY0s8oOKg==
=RbOE
-----END PGP SIGNATURE-----
Merge tag 'quota_for_v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fixes from Jan Kara:
"The most important part in the pull is disablement of the new syscall
quotactl_path() which was added in rc1.
The reason is some people at LWN discussion pointed out dirfd would be
useful for this path based syscall and Christian Brauner agreed.
Without dirfd it may be indeed problematic for containers. So let's
just disable the syscall for now when it doesn't have users yet so
that we have more time to mull over how to best specify the filesystem
we want to work on"
* tag 'quota_for_v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Disable quotactl_path syscall
quota: Use 'hlist_for_each_entry' to simplify code
Commit de144ff423 changes _pnfs_return_layout() to call
pnfs_mark_matching_lsegs_return() passing NULL as the struct
pnfs_layout_range argument. Unfortunately,
pnfs_mark_matching_lsegs_return() doesn't check if we have a value here
before dereferencing it, causing an oops.
I'm able to hit this crash consistently when running connectathon basic
tests on NFS v4.1/v4.2 against Ontap.
Fixes: de144ff423 ("NFSv4: Don't discard segments marked for return in _pnfs_return_layout()")
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Variable 'rd_size' is being initialized however
this value is never read as 'rd_size' is assigned
a new value in for statement. Remove the redundant
assignment.
Clean up clang warning:
fs/nfs/pnfs.c:2681:6: warning: Value stored to 'rd_size' during its
initialization is never read [clang-analyzer-deadcode.DeadStores]
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The "sizeof(struct nfs_fh)" is two bytes too large and could lead to
memory corruption. It should be NFS_MAXFHSIZE because that's the size
of the ->data[] buffer.
I reversed the size of the arguments to put the variable on the left.
Fixes: 16b374ca43 ("NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We set the state of the current process to TASK_KILLABLE via
prepare_to_wait(). Should we use fatal_signal_pending() to detect
the signal here?
Fixes: b4868b44c5 ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE")
Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
These ioctl definitions in xfs_fs.h are part of the userspace ABI and
were mistakenly removed during the 5.13 merge window.
Fixes: 9fefd5db08 ("xfs: convert to fileattr")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
sc->ip is the inode that's being scrubbed, which means that it's not set
for scrub types that don't involve inodes. If one of those scrubbers
(e.g. inode btrees) returns EDEADLOCK, we'll trip over the null pointer.
Fix that by reporting either the file being examined or the file that
was used to call scrub.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If a realtime allocation fails because we can't find a sufficiently
large free extent satisfying locality rules, relax the locality rules
and try again. This reduces the occurrence of short writes to realtime
files when the write size is large and the free space is fragmented.
This was originally discovered by running generic/186 with the realtime
reflink patchset and a 128k cow extent size hint, but the short write
symptoms can manifest with a 128k extent size hint and no reflink, so
apply the fix now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
When BLKRRPART is called concurrently with del_gendisk, the partitions
rescan can create a stale partition that will never be be cleaned up.
Fix this by checking the the disk is up before rescanning partitions
while under bd_mutex.
Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As an artifact of how gendisk lookup used to work in earlier kernels,
GENHD_FL_UP is only cleared very late in del_gendisk, and a global lock
is used to prevent opens from succeeding while del_gendisk is tearing
down the gendisk. Switch to clearing the flag early and under bd_mutex
so that callers can use bd_mutex to stabilize the flag, which removes
the need for the global mutex.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When multiple processes write data to the same block group on a
compressed zoned filesystem, the underlying device could report I/O
errors and data corruption is possible.
This happens because on a zoned file system, compressed data writes
where sent to the device via a REQ_OP_WRITE instead of a
REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel
submission it cannot be guaranteed that the data is always submitted
aligned to the underlying zone's write pointer.
The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a
zoned filesystem is non intrusive on a regular file system or when
submitting to a conventional zone on a zoned filesystem, as it is
guarded by btrfs_use_zone_append.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 9d294a685f ("btrfs: zoned: enable to mount ZONED incompat flag")
CC: stable@vger.kernel.org # 5.12.x: e380adfc21: btrfs: zoned: pass start block to btrfs_use_zone_append
CC: stable@vger.kernel.org # 5.12.x
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_use_zone_append only needs the passed in extent_map's block_start
member, so there's no need to pass in the full extent map.
This also enables the use of btrfs_use_zone_append in places where we only
have a start byte but no extent_map.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't want anyone poking into tctx->io_wq awhile it's being destroyed
by io_wq_put_and_exit(), and even though it shouldn't even happen, if
buggy would be preferable to get a NULL-deref instead of subtle delayed
failure or UAF.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/827b021de17926fd807610b3e53a5a5fa8530856.1621513214.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before this patch, the system ail lists were cleaned up if the logd
process withdrew, but on other withdraws, they were not cleaned up.
This included the cleaning up of the revokes as well.
This patch reorganizes things a bit so that all withdraws (not just logd)
clean up the ail lists, including any pending revokes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2 would deadlock because of the following
sequence during mount:
mount
gfs2_fill_super
gfs2_make_fs_rw <--- Detects IO error with glock
kthread_stop(sdp->sd_quotad_process);
<--- Blocked waiting for quotad to finish
logd
Detects IO error and the need to withdraw
calls gfs2_withdraw
gfs2_make_fs_ro
kthread_stop(sdp->sd_quotad_process);
<--- Blocked waiting for quotad to finish
gfs2_quotad
gfs2_statfs_sync
gfs2_glock_wait <---- Blocked waiting for statfs glock to be granted
glock_work_func
do_xmote <---Detects IO error, can't release glock: blocked on withdraw
glops->go_inval
glock_blocked_by_withdraw
requeue glock work & exit <--- work requeued, blocked by withdraw
This patch makes a special exception for the statfs system inode glock,
which allows the statfs glock UNLOCK to proceed normally. That allows the
quotad daemon to exit during the withdraw, which allows the logd daemon
to exit during the withdraw, which allows the mount to exit.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, in the unlikely event that gfs2_glock_dq encountered
a withdraw, it would do a wait_on_bit to wait for its journal to be
recovered, but it never released the glock's spin_lock, which caused a
scheduling-while-atomic error.
This patch unlocks the lockref spin_lock before waiting for recovery.
Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch 4a378d8a0d added a new check for I_NEW inodes, but unfortunately
it used the wrong variable, i_flags. This caused GFS2 to withdraw when
gfs2_lookup_by_inum needed to refresh an I_NEW inode. This patch switches
to use the correct variable, i_state.
Fixes: 4a378d8a0d ("gfs2: be careful with inode refresh")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a direct I/O write falls entirely and falls back to buffered I/O and the
buffered I/O fails, the write failed with return value 0 instead of the error
number reported by the buffered I/O. Fix that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When smb2 lease parameter is disabled on server. Server grants
batch oplock instead of RHW lease by default on open, inode page cache
needs to be zapped immediatley upon close as cache is not valid.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Removed oplock_break_received flag which was added to achieve
synchronization between oplock handler and open handler by earlier commit.
It is not needed because there is an existing lock open_file_lock to achieve
the same. find_readable_file takes open_file_lock and then traverses the
openFileList. Similarly, cifs_oplock_break while closing the deferred
handle (i.e cifsFileInfo_put) takes open_file_lock and then sends close
to the server.
Added comments for better readability.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When using smb2_copychunk_range() for large ranges we will
run through several iterations of a loop calling SMB2_ioctl()
but never actually free the returned buffer except for the final
iteration.
This leads to memory leaks everytime a large copychunk is requested.
Fixes: 9bf0c9cd43 ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files")
Cc: <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYKT8wgAKCRCRxhvAZXjc
op2mAP9hyc3sp2/HvEuTYDc6LmljPNCqdKeCP1eiX5SZR0yMVwEAwR5xym7YeqYZ
LRj+xjuvOmrSNhcxpZqLpXuPOcY78wM=
=E4a/
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.mount_setattr.v5.13-rc3' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux
Pull mount_setattr fix from Christian Brauner:
"This makes an underlying idmapping assumption more explicit.
We currently don't have any filesystems that support idmapped mounts
which are mountable inside a user namespace, i.e. where s_user_ns !=
init_user_ns. That was a deliberate decision for now as userns root
can just mount the filesystem themselves.
Express this restriction explicitly and enforce it until there's a
real use-case for this. This way we can notice it and will have a
chance to adapt and audit our translation helpers and fstests
appropriately if we need to support such filesystems"
* tag 'fs.idmapped.mount_setattr.v5.13-rc3' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
fs/mount_setattr: tighten permission checks
See MS-SMB2 3.2.4.1.4, file ids in compounded requests should be set to
0xFFFFFFFFFFFFFFFF (we were treating it as u32 not u64 and setting
it incorrectly).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
With the addition of ssi_perf_data and ssi_perf_type struct signalfd_siginfo
is dangerously close to running out of space. All that remains is just
enough space for two additional 64bit fields. A practice of adding all
possible siginfo_t fields into struct singalfd_siginfo can not be supported
as adding the missing fields ssi_lower, ssi_upper, and ssi_pkey would
require two 64bit fields and one 32bit fields. In practice the fields
ssi_perf_data and ssi_perf_type can never be used by signalfd as the signal
that generates them always delivers them synchronously to the thread that
triggers them.
Therefore until someone actually needs the fields ssi_perf_data and
ssi_perf_type in signalfd_siginfo remove them. This leaves a bit more room
for future expansion.
v1: https://lkml.kernel.org/r/20210503203814.25487-12-ebiederm@xmission.com
v2: https://lkml.kernel.org/r/20210505141101.11519-12-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-5-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that si_trapno is part of the union in _si_fault and available on
all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to
return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid.
There is room for future changes to reduce when si_trapno is valid but
this is all that is needed to make si_trapno and the other members of
the the union in _sigfault mutually exclusive.
Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO
and have the same code ignore si_trapno in in all other cases.
v1: https://lkml.kernel.org/r/m1o8dvs7s7.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-6-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-2-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When (ia->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID)) is zero, then
the SELinux implementation of the locked_down hook might report a denial
even though the operation would actually be allowed.
To fix this, make sure that security_locked_down() is called only when
the return value will be taken into account (i.e. when changing one of
the problematic attributes).
Note: this was introduced by commit 5496197f9b ("debugfs: Restrict
debugfs when the kernel is locked down"), but it didn't matter at that
time, as the SELinux support came in later.
Fixes: 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Link: https://lore.kernel.org/r/20210507125304.144394-1-omosnace@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCibywACgkQxWXV+ddt
WDs8QhAAlJ1INZGF01lP2mUhzesVIctIAPGBf/77Zsxmcu0rA6E66RVVsYMgGU54
+FWd+LwuFCtC1364OnDa2DnmYtvHfgR4If7EGowpk3qzZFeZQSLqayOFa5tZLYPG
tJStjY32QTerfZRoxPJ1QPcoWjxNMxYqYw/s68G3tTTSHEYtlH9zNHbLm9ny507x
uPHpxqKXdv3/LYHLt6XUypFqsZkMoDW98oOKvo0MZE/fjcqiDcrvAoYe+y8raFC3
FztlfA2TBmmp/PouDXLCspXAksLpVo9mgTQ0kW4K7152cC0X/zWXYNH01uQ+qTAS
OFNKt2DSRIq5TR56ZmReYvRgq0FNMotYpRpxoePSF/rwL+wnsTl7QI3r/d/h/uxQ
IzBeBv1Wd+1ZJcqnmEGx8Mws3nGswKyl4W65x8yin41djVoHgM4tYu3nGqielu+w
ifEBmU5tUGo05z2HA5kpLjDzc6MwWaCIduQvjH/I4Vgo9fhDo6pQO2dyPC50Nkk5
DQ5jfxiXJ/ZSh5NbWtIkB/OQuwkVL1nDy2jtj3qnK06HDKstK1zui5nccFKFNOiX
wtYjnGqd3+vIGIZniMuu9rbPLtG4CCerq44v1gyS6LSEycNW9/r2cOXRaiQk5pej
CoYMdnmAqzwidtn4FZPRNQ7JgyckKCXQQSGCazN2vvLCXisCUrw=
=ue6o
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- fix fiemap to print extents that could get misreported due to
internal extent splitting and logical merging for fiemap output
- fix RCU stalls during delayed iputs
- fix removed dentries still existing after log is synced"
* tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removed dentries still existing after log is synced
btrfs: return whole extents in fiemap
btrfs: avoid RCU stalls while running delayed iputs
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error
Deadstore detected by Lukas Bulwahn's CodeChecker Tool (ELISA group).
line 741 struct cifsInodeInfo *cinode;
line 747 cinode = CIFS_I(d_inode(cfile->dentry));
could be deleted.
cinode on filesystem should not be deleted when files are closed,
they are representations of some data fields on a physical disk,
thus no further action is required.
The virtual inode on vfs will be handled by vfs automatically,
and the denotation is inode, which is different from the cinode.
Signed-off-by: wenhuizhang <wenhui@gwmail.gwu.edu>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
xfs_bmap_rtalloc doesn't handle realtime extent files with extent size
hints larger than the rt volume's extent size properly, because
xfs_bmap_extsize_align can adjust the offset/length parameters to try to
fit the extent size hint.
Under these conditions, minlen has to be large enough so that any
allocation returned by xfs_rtallocate_extent will be large enough to
cover at least one of the blocks that the caller asked for. If the
allocation is too short, bmapi_write will return no mapping for the
requested range, which causes ENOSPC errors in other parts of the
filesystem.
Therefore, adjust minlen upwards to fix this. This can be found by
running generic/263 (g/127 or g/522) with a realtime extent size hint
that's larger than the rt volume extent size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Merge misc fixes from Andrew Morton:
"13 patches.
Subsystems affected by this patch series: resource, squashfs, hfsplus,
modprobe, and mm (hugetlb, slub, userfaultfd, ksm, pagealloc, kasan,
pagemap, and ioremap)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/ioremap: fix iomap_max_page_shift
docs: admin-guide: update description for kernel.modprobe sysctl
hfsplus: prevent corruption in shrinking truncate
mm/filemap: fix readahead return types
kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled
mm: fix struct page layout on 32-bit systems
ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()"
userfaultfd: release page in error path to avoid BUG_ON
squashfs: fix divide error in calculate_skip()
kernel/resource: fix return code check in __request_free_mem_region
mm, slub: move slub_debug static key enabling outside slab_mutex
mm/hugetlb: fix cow where page writtable in child
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCew+oQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnAAD/0RGU6BTpYX0AjSuHtHPsGxAWlLroe7Yvew
BBXX58uL9LqSYDe+FfCherA7GbyLdXrN9yvbeKVEZH7wmFV0u6dGX/RiK8lpWvfd
pMTSf14QkASkoQ5bMQURdSv73OruzKQN7CisN3btwD2sDcqZqz7RsFuWf5Fuxs4r
UjyQdJpt+sNs1UTvHBjqQcCrAipEVWePH93/jhayx8iBykab4+aNKFtysjqYJdD0
LL5NG5LihP5G2WECD5Q7vDmb9+km3cN5TJLhHSDsmQg4Ln6U9zd4X3bnvEXNtlWk
edNNhKVmS8rtwK2qiZCoVlR4HrSjCCjUg/0h6hyOL8AYNV9vPup/0EuWfRKxLE+3
l3TRTO02/SM8Tjdu27lYtxFYnIkIgRv+w2/ZmURzwnPpIvjwbdfth5DN+10bFnUV
IPKcEvMXhbgdyQ5OtA1oPk3udWesrk836s2W6kqBLSEeqFrb0UbI8A40VXxoAfVQ
Ig5LmuuDAlZzt4fCu3GYhVZS1jj2CXuBsGrsbSVZaJGbMu9MPbmMUoz6XBS3lsY6
gnhYv2paMuOo/hD6q4XeCH4j1jveLXgzenW3fzEP4E0wxfvMkybyWCwfW14a15Q+
Sr8VEEUTc74RfW5pP0ZTvrYGnR+oJwB1RacdbU5WpOrB01A5bWkmb0fRNHfj8vjH
h49oIdqZKw==
=5+hs
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.13-2021-05-14' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just a few minor fixes/changes:
- Fix issue with double free race for linked timeout completions
- Fix reference issue with timeouts
- Remove last few places that make SQPOLL special, since it's just an
io thread now.
- Bump maximum allowed registered buffers, as we don't allocate as
much anymore"
* tag 'io_uring-5.13-2021-05-14' of git://git.kernel.dk/linux-block:
io_uring: increase max number of reg buffers
io_uring: further remove sqpoll limits on opcodes
io_uring: fix ltout double free on completion race
io_uring: fix link timeout refs
- update documentation to fix the broken illustration due to ReST
conversion by accident at that time and complete the big pcluster
introduction;
- fix 1 lcluster-sized pclusters for the big pcluster feature.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYJ8DGBEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBAC0AQDaap8fSTWMLroLLBCcr1MwTqoS6wf44tx8
iq2FFcU/hQD+PqrnCFJW7wjWjMC84weOudRvh2/lu/GKH2a5LgJ5Xgs=
=UTkq
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"This mainly fixes 1 lcluster-sized pclusters for the big pcluster
feature, which can be forcely generated by mkfs as a specific on-disk
case for per-(sub)file compression strategies but missed to handle in
runtime properly.
Also, documentation updates are included to fix the broken
illustration due to the ReST conversion by accident and complete the
big pcluster introduction.
Summary:
- update documentation to fix the broken illustration due to ReST
conversion by accident at that time and complete the big pcluster
introduction
- fix 1 lcluster-sized pclusters for the big pcluster feature"
* tag 'erofs-for-5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix 1 lcluster-sized pcluster for big pcluster
erofs: update documentation about data compression
erofs: fix broken illustration in documentation
- Fix a hang condition (missed wakeups with virtiofs when invalidating
entries)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAmCfBboACgkQHtKRamZ9
iAIiHQ/+LqD0USAXxWQFcDupTATVy0Z/hpUCBWcEKII/ljluUWLLkGUT2/Gy3TXE
0HZmJBWyJyqNRyWtzNZ8hu4FpxSawtYVkqTv0/ODAjrpva9m8p4eVYFp0UpTHn3d
KL/DD+VeLWs1yoPIXgqd2dSwV2YsAJSEYYXcF0CYeHOWH4BVGrOglQBL7kJyra6n
IQsnXGJQMXkOoDMB/5xTI7LgYD0R09OevsHE6Eupxm9SI8ud2qUQlBLde8Eh+7qb
pMhkeNNjG2w461C8215rhGPzCweMMasiBwUz1EHXDpXebZSsDfURwBWMCFbe/H7p
x3u0s3hlJydTZmUnaMeWje+wR1Ku8YXiBeelMobpXi4RzNyebhZ0Fap3fMDbrR8/
5mro6H9blEYGZ1kISHSdvZUfh6uzWiL8hs+uBb/ANICZouValjyVrHuTauwncyQP
PHaKZYo/kh6Hj3j1LYDHbMs69Cbr+E0x/JFnYAxIkZSggYJeXN9+3K9hhUXcQNIf
Lh4p1F/t7DmIXzljFu6qwJl9JmCC+yx4PcSgOqa6vPvm2H6KEH+rMCLHtu+WgaXq
1Gj9EI1sshTXgot8Y1xlPCCTLNqxhV0O30L+EsasmjNCjWwVRi2zz+FjkgFAeDvo
7LZUNVepC9YMffknBNGkfNibfVBn5/DxbGR/9SWygHy8ahECoLc=
=cWwB
-----END PGP SIGNATURE-----
Merge tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A fix for a hang condition due to missed wakeups in the filesystem-dax
core when exercised by virtiofs.
This bug has been there from the beginning, but the condition has
not triggered on other filesystems since they hold a lock over
invalidation events"
* tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Wake up all waiters after invalidating dax entry
dax: Add a wakeup mode parameter to put_unlocked_entry()
dax: Add an enum for specifying dax wakup mode
I believe there are some issues introduced by commit 31651c6071
("hfsplus: avoid deadlock on file truncation")
HFS+ has extent records which always contains 8 extents. In case the
first extent record in catalog file gets full, new ones are allocated from
extents overflow file.
In case shrinking truncate happens to middle of an extent record which
locates in extents overflow file, the logic in hfsplus_file_truncate() was
changed so that call to hfs_brec_remove() is not guarded any more.
Right action would be just freeing the extents that exceed the new size
inside extent record by calling hfsplus_free_extents(), and then check if
the whole extent record should be removed. However since the guard
(blk_cnt > start) is now after the call to hfs_brec_remove(), this has
unfortunate effect that the last matching extent record is removed
unconditionally.
To reproduce this issue, create a file which has at least 10 extents, and
then perform shrinking truncate into middle of the last extent record, so
that the number of remaining extents is not under or divisible by 8. This
causes the last extent record (8 extents) to be removed totally instead of
truncating into middle of it. Thus this causes corruption, and lost data.
Fix for this is simply checking if the new truncated end is below the
start of this extent record, making it safe to remove the full extent
record. However call to hfs_brec_remove() can't be moved to it's previous
place since we're dropping ->tree_lock and it can cause a race condition
and the cached info being invalidated possibly corrupting the node data.
Another issue is related to this one. When entering into the block
(blk_cnt > start) we are not holding the ->tree_lock. We break out from
the loop not holding the lock, but hfs_find_exit() does unlock it. Not
sure if it's possible for someone else to take the lock under our feet,
but it can cause hard to debug errors and premature unlocking. Even if
there's no real risk of it, the locking should still always be kept in
balance. Thus taking the lock now just before the check.
Link: https://lkml.kernel.org/r/20210429165139.3082828-1-jouni.roivas@tuxera.com
Fixes: 31651c6071 ("hfsplus: avoid deadlock on file truncation")
Signed-off-by: Jouni Roivas <jouni.roivas@tuxera.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A readahead request will not allocate more memory than can be represented
by a size_t, even on systems that have HIGHMEM available. Change the
length functions from returning an loff_t to a size_t.
Link: https://lkml.kernel.org/r/20210510201201.1558972-1-willy@infradead.org
Fixes: 32c0a6bcaa ("btrfs: add and use readahead_batch_length")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysbot has reported a "divide error" which has been identified as being
caused by a corrupted file_size value within the file inode. This value
has been corrupted to a much larger value than expected.
Calculate_skip() is passed i_size_read(inode) >> msblk->block_log. Due to
the file_size value corruption this overflows the int argument/variable in
that function, leading to the divide error.
This patch changes the function to use u64. This will accommodate any
unexpectedly large values due to corruption.
The value returned from calculate_skip() is clamped to be never more than
SQUASHFS_CACHED_BLKS - 1, or 7. So file_size corruption does not lead to
an unexpectedly large return result here.
Link: https://lkml.kernel.org/r/20210507152618.9447-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: <syzbot+e8f781243ce16ac2f962@syzkaller.appspotmail.com>
Reported-by: <syzbot+7b98870d4fec9447b951@syzkaller.appspotmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.
Hugh reported issue with F_SEAL_FUTURE_WRITE not applied correctly to
hugetlbfs, which I can easily verify using the memfd_test program, which
seems that the program is hardly run with hugetlbfs pages (as by default
shmem).
Meanwhile I found another probably even more severe issue on that hugetlb
fork won't wr-protect child cow pages, so child can potentially write to
parent private pages. Patch 2 addresses that.
After this series applied, "memfd_test hugetlbfs" should start to pass.
This patch (of 2):
F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
There is a test program for that and it fails constantly.
$ ./memfd_test hugetlbfs
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
mmap() didn't fail as expected
Aborted (core dumped)
I think it's probably because no one is really running the hugetlbfs test.
Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we
do in shmem_mmap(). Generalize a helper for that.
Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
Fixes: ab3948f58f ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series of patches fix some critical bugs such as memory leak in compression
flows, kernel panic when handling errors, and swapon failure due to newly added
condition check.
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmCeTNcACgkQQBSofoJI
UNKc0Q/3X9Tngns5DpnzQw9kbWochucG/7Vf1HjdvV2gE7IxruDPofGtBdhPdSYV
uR+nP9dxhXLQQNfWS2KICzdj0yceKCKq8xpnnNdq9SGjVJmUjCD39ByGJV3GMOGM
fY6dizcywltH7iBQboMZ0Eh3ivPh6ugl5klDo20WzYsH6F/UF7CPSuSJ3K5ezGuY
T6R3NkqG8v1cS6+5u+teDpmdCCHOCBEeizBFQ6XskNDBavbw7KEA0liwKOv6eghB
PdoeqYemg1VfOHAKqP6F3o+eSlsT5Ljs0Zmc5x8h8qRS76JK/hb9REFngrcERaIw
GPDnMPCHniCHMd61z90oqGtMf4PFqLUVnBTdxhxLK5G4+u874dlZLciKWGDIGTLv
eNU2W+8c9s+KdJAZFJbYN5zVoyJUR5SW7RcYTuvZWt8wX38Ch+FZEGqOC8FxWyWU
i1WHXZiGeifIlIeqUOPJP5sbslL2hfK5OqMYJotAeIW/E2RyJWnc+Yo2UwvmvXVU
xPOKFOn9nAAQz2GGgSpvGaWAVfMNqhoLw7/gzwdkabP0EASIzHuW2PCJhp59c1NO
Fb9eUi7yhgd94vDbYftXRBgAhrUmCd0u+/gySyou4vujWtfWcO+AvOEreV7VRFfL
Su0bGkBJ03ThFEFAZnY14RenydGSwpk5Fd9wkRy4Qk7mBv9sRg==
=NeKH
-----END PGP SIGNATURE-----
Merge tag 'f2fs-5.13-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
"This fixes some critical bugs such as memory leak in compression
flows, kernel panic when handling errors, and swapon failure due to
newly added condition check"
* tag 'f2fs-5.13-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: return EINVAL for hole cases in swap file
f2fs: avoid swapon failure by giving a warning first
f2fs: compress: fix to assign cc.cluster_idx correctly
f2fs: compress: fix race condition of overwrite vs truncate
f2fs: compress: fix to free compress page correctly
f2fs: support iflag change given the mask
f2fs: avoid null pointer access when handling IPU error
Since recent changes instead of storing a large array of struct
io_mapped_ubuf, we store pointers to them, that is 4 times slimmer and
we should not to so worry about restricting max number of registererd
buffer slots, increase the limit 4 times.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d3dee1da37f46da416aa96a16bf9e5094e10584d.1620990371.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we move one inode from one directory to another and both the inode
and its previous parent directory were logged before, we are not supposed
to have the dentry for the old parent if we have a power failure after the
log is synced. Only the new dentry is supposed to exist.
Generally this works correctly, however there is a scenario where this is
not currently working, because the old parent of the file/directory that
was moved is not authoritative for a range that includes the dir index and
dir item keys of the old dentry. This case is better explained with the
following example and reproducer:
# The test requires a very specific layout of keys and items in the
# fs/subvolume btree to trigger the bug. So we want to make sure that
# on whatever platform we are, we have the same leaf/node size.
#
# Currently in btrfs the node/leaf size can not be smaller than the page
# size (but it can be greater than the page size). So use the largest
# supported node/leaf size (64K).
$ mkfs.btrfs -f -n 65536 /dev/sdc
$ mount /dev/sdc /mnt
# "testdir" is inode 257.
$ mkdir /mnt/testdir
$ chmod 755 /mnt/testdir
# Create several empty files to have the directory "testdir" with its
# items spread over several leaves (7 in this case).
$ for ((i = 1; i <= 1200; i++)); do
echo -n > /mnt/testdir/file$i
done
# Create our test directory "dira", inode number 1458, which gets all
# its items in leaf 7.
#
# The BTRFS_DIR_ITEM_KEY item for inode 257 ("testdir") that points to
# the entry named "dira" is in leaf 2, while the BTRFS_DIR_INDEX_KEY
# item that points to that entry is in leaf 3.
#
# For this particular filesystem node size (64K), file count and file
# names, we endup with the directory entry items from inode 257 in
# leaves 2 and 3, as previously mentioned - what matters for triggering
# the bug exercised by this test case is that those items are not placed
# in leaf 1, they must be placed in a leaf different from the one
# containing the inode item for inode 257.
#
# The corresponding BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items for
# the parent inode (257) are the following:
#
# item 460 key (257 DIR_ITEM 3724298081) itemoff 48344 itemsize 34
# location key (1458 INODE_ITEM 0) type DIR
# transid 6 data_len 0 name_len 4
# name: dira
#
# and:
#
# item 771 key (257 DIR_INDEX 1202) itemoff 36673 itemsize 34
# location key (1458 INODE_ITEM 0) type DIR
# transid 6 data_len 0 name_len 4
# name: dira
$ mkdir /mnt/testdir/dira
# Make sure everything done so far is durably persisted.
$ sync
# Now do a change to inode 257 ("testdir") that does not result in
# COWing leaves 2 and 3 - the leaves that contain the directory items
# pointing to inode 1458 (directory "dira").
#
# Changing permissions, the owner/group, updating or adding a xattr,
# etc, will not change (COW) leaves 2 and 3. So for the sake of
# simplicity change the permissions of inode 257, which results in
# updating its inode item and therefore change (COW) only leaf 1.
$ chmod 700 /mnt/testdir
# Now fsync directory inode 257.
#
# Since only the first leaf was changed/COWed, we log the inode item of
# inode 257 and only the dentries found in the first leaf, all have a
# key type of BTRFS_DIR_ITEM_KEY, and no keys of type
# BTRFS_DIR_INDEX_KEY, because they sort after the former type and none
# exist in the first leaf.
#
# We also log 3 items that represent ranges for dir items and dir
# indexes for which the log is authoritative:
#
# 1) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
# authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
# in the range [0, 2285968570] (the offset here is the crc32c of the
# dentry's name). The value 2285968570 corresponds to the offset of
# the first key of leaf 2 (which is of type BTRFS_DIR_ITEM_KEY);
#
# 2) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
# authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
# in the range [4293818216, (u64)-1] (the offset here is the crc32c
# of the dentry's name). The value 4293818216 corresponds to the
# offset of the highest key of type BTRFS_DIR_ITEM_KEY plus 1
# (4293818215 + 1), which is located in leaf 2;
#
# 3) a key of type BTRFS_DIR_LOG_INDEX_KEY, with an offset of 1203,
# which indicates the log is authoritative for all keys of type
# BTRFS_DIR_INDEX_KEY that have an offset in the range
# [1203, (u64)-1]. The value 1203 corresponds to the offset of the
# last key of type BTRFS_DIR_INDEX_KEY plus 1 (1202 + 1), which is
# located in leaf 3;
#
# Also, because "testdir" is a directory and inode 1458 ("dira") is a
# child directory, we log inode 1458 too.
$ xfs_io -c "fsync" /mnt/testdir
# Now move "dira", inode 1458, to be a child of the root directory
# (inode 256).
#
# Because this inode was previously logged, when "testdir" was fsynced,
# the log is updated so that the old inode reference, referring to inode
# 257 as the parent, is deleted and the new inode reference, referring
# to inode 256 as the parent, is added to the log.
$ mv /mnt/testdir/dira /mnt
# Now change some file and fsync it. This guarantees the log changes
# made by the previous move/rename operation are persisted. We do not
# need to do any special modification to the file, just any change to
# any file and sync the log.
$ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/testdir/file1
# Simulate a power failure and then mount again the filesystem to
# replay the log tree. We want to verify that we are able to mount the
# filesystem, meaning log replay was successful, and that directory
# inode 1458 ("dira") only has inode 256 (the filesystem's root) as
# its parent (and no longer a child of inode 257).
#
# It used to happen that during log replay we would end up having
# inode 1458 (directory "dira") with 2 hard links, being a child of
# inode 257 ("testdir") and inode 256 (the filesystem's root). This
# resulted in the tree checker detecting the issue and causing the
# mount operation to fail (with -EIO).
#
# This happened because in the log we have the new name/parent for
# inode 1458, which results in adding the new dentry with inode 256
# as the parent, but the previous dentry, under inode 257 was never
# removed - this is because the ranges for dir items and dir indexes
# of inode 257 for which the log is authoritative do not include the
# old dir item and dir index for the dentry of inode 257 referring to
# inode 1458:
#
# - for dir items, the log is authoritative for the ranges
# [0, 2285968570] and [4293818216, (u64)-1]. The dir item at inode 257
# pointing to inode 1458 has a key of (257 DIR_ITEM 3724298081), as
# previously mentioned, so the dir item is not deleted when the log
# replay procedure processes the authoritative ranges, as 3724298081
# is outside both ranges;
#
# - for dir indexes, the log is authoritative for the range
# [1203, (u64)-1], and the dir index item of inode 257 pointing to
# inode 1458 has a key of (257 DIR_INDEX 1202), as previously
# mentioned, so the dir index item is not deleted when the log
# replay procedure processes the authoritative range.
<power failure>
$ mount /dev/sdc /mnt
mount: /mnt: can't read superblock on /dev/sdc.
$ dmesg
(...)
[87849.840509] BTRFS info (device sdc): start tree-log replay
[87849.875719] BTRFS critical (device sdc): corrupt leaf: root=5 block=30539776 slot=554 ino=1458, invalid nlink: has 2 expect no more than 1 for dir
[87849.878084] BTRFS info (device sdc): leaf 30539776 gen 7 total ptrs 557 free space 2092 owner 5
[87849.879516] BTRFS info (device sdc): refs 1 lock_owner 0 current 2099108
[87849.880613] item 0 key (1181 1 0) itemoff 65275 itemsize 160
[87849.881544] inode generation 6 size 0 mode 100644
[87849.882692] item 1 key (1181 12 257) itemoff 65258 itemsize 17
(...)
[87850.562549] item 556 key (1458 12 257) itemoff 16017 itemsize 14
[87850.563349] BTRFS error (device dm-0): block=30539776 write time tree block corruption detected
[87850.564386] ------------[ cut here ]------------
[87850.564920] WARNING: CPU: 3 PID: 2099108 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[87850.566129] Modules linked in: btrfs dm_zero dm_snapshot (...)
[87850.573789] CPU: 3 PID: 2099108 Comm: mount Not tainted 5.12.0-rc8-btrfs-next-86 #1
(...)
[87850.587481] Call Trace:
[87850.587768] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[87850.588354] ? btrfs_bio_fits_in_stripe+0xd8/0x110 [btrfs]
[87850.589003] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[87850.589654] submit_one_bio+0x61/0x70 [btrfs]
[87850.590248] submit_extent_page+0x91/0x2f0 [btrfs]
[87850.590842] write_one_eb+0x175/0x440 [btrfs]
[87850.591370] ? find_extent_buffer_nolock+0x1c0/0x1c0 [btrfs]
[87850.592036] btree_write_cache_pages+0x1e6/0x610 [btrfs]
[87850.592665] ? free_debug_processing+0x1d5/0x240
[87850.593209] do_writepages+0x43/0xf0
[87850.593798] ? __filemap_fdatawrite_range+0xa4/0x100
[87850.594391] __filemap_fdatawrite_range+0xc5/0x100
[87850.595196] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[87850.596202] btrfs_write_and_wait_transaction.isra.0+0x4d/0xd0 [btrfs]
[87850.597377] btrfs_commit_transaction+0x794/0xca0 [btrfs]
[87850.598455] ? _raw_spin_unlock_irqrestore+0x32/0x60
[87850.599305] ? kmem_cache_free+0x15a/0x3d0
[87850.600029] btrfs_recover_log_trees+0x346/0x380 [btrfs]
[87850.601021] ? replay_one_extent+0x7d0/0x7d0 [btrfs]
[87850.601988] open_ctree+0x13c9/0x1698 [btrfs]
[87850.602846] btrfs_mount_root.cold+0x13/0xed [btrfs]
[87850.603771] ? kmem_cache_alloc_trace+0x7c9/0x930
[87850.604576] ? vfs_parse_fs_string+0x5d/0xb0
[87850.605293] ? kfree+0x276/0x3f0
[87850.605857] legacy_get_tree+0x30/0x50
[87850.606540] vfs_get_tree+0x28/0xc0
[87850.607163] fc_mount+0xe/0x40
[87850.607695] vfs_kern_mount.part.0+0x71/0x90
[87850.608440] btrfs_mount+0x13b/0x3e0 [btrfs]
(...)
[87850.629477] ---[ end trace 68802022b99a1ea0 ]---
[87850.630849] BTRFS: error (device sdc) in btrfs_commit_transaction:2381: errno=-5 IO failure (Error while writing out transaction)
[87850.632422] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[87850.633416] BTRFS: error (device sdc) in cleanup_transaction:1978: errno=-5 IO failure
[87850.634553] BTRFS: error (device sdc) in btrfs_replay_log:2431: errno=-5 IO failure (Failed to recover log tree)
[87850.637529] BTRFS error (device sdc): open_ctree failed
In this example the inode we moved was a directory, so it was easy to
detect the problem because directories can only have one hard link and
the tree checker immediately detects that. If the moved inode was a file,
then the log replay would succeed and we would end up having both the
new hard link (/mnt/foo) and the old hard link (/mnt/testdir/foo) present,
but only the new one should be present.
Fix this by forcing re-logging of the old parent directory when logging
the new name during a rename operation. This ensures we end up with a log
that is authoritative for a range covering the keys for the old dentry,
therefore causing the old dentry do be deleted when replaying the log.
A test case for fstests will follow up soon.
Fixes: 64d6b281ba ("btrfs: remove unnecessary check_parent_dirs_for_sync()")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
`xfs_io -c 'fiemap <off> <len>' <file>`
can give surprising results on btrfs that differ from xfs.
btrfs prints out extents trimmed to fit the user input. If the user's
fiemap request has an offset, then rather than returning each whole
extent which intersects that range, we also trim the start extent to not
have start < off.
Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
that returning the whole extent is expected.
Some cases which all yield the same fiemap in xfs, but not btrfs:
dd if=/dev/zero of=$f bs=4k count=1
sudo xfs_io -c 'fiemap 0 1024' $f
0: [0..7]: 26624..26631
sudo xfs_io -c 'fiemap 2048 1024' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 2048 4096' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 3584 512' $f
0: [7..7]: 26631..26631
sudo xfs_io -c 'fiemap 4091 5' $f
0: [7..6]: 26631..26630
I believe this is a consequence of the logic for merging contiguous
extents represented by separate extent items. That logic needs to track
the last offset as it loops through the extent items, which happens to
pick up the start offset on the first iteration, and trim off the
beginning of the full extent. To fix it, start `off` at 0 rather than
`start` so that we keep the iteration/merging intact without cutting off
the start of the extent.
after the fix, all the above commands give:
0: [0..7]: 26624..26631
The merging logic is exercised by fstest generic/483, and I have written
a new fstest for checking we don't have backwards or zero-length fiemaps
for cases like those above.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Generally a delayed iput is added when we might do the final iput, so
usually we'll end up sleeping while processing the delayed iputs
naturally. However there's no guarantee of this, especially for small
files. In production we noticed 5 instances of RCU stalls while testing
a kernel release overnight across 1000 machines, so this is relatively
common:
host count: 5
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208
(t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1
CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1
Call Trace:
<IRQ> dump_stack+0x50/0x70
nmi_cpu_backtrace.cold.6+0x30/0x65
? lapic_can_unplug_cpu.cold.30+0x40/0x40
nmi_trigger_cpumask_backtrace+0xba/0xca
rcu_dump_cpu_stacks+0x99/0xc7
rcu_sched_clock_irq.cold.90+0x1b2/0x3a3
? trigger_load_balance+0x5c/0x200
? tick_sched_do_timer+0x60/0x60
? tick_sched_do_timer+0x60/0x60
update_process_times+0x24/0x50
tick_sched_timer+0x37/0x70
__hrtimer_run_queues+0xfe/0x270
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x5e/0x120
apic_timer_interrupt+0xf/0x20 </IRQ>
RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0
RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029
RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200
RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200
R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870
? kzalloc.constprop.57+0x30/0x30
list_lru_add+0x5a/0x100
inode_lru_list_add+0x20/0x40
iput+0x1c1/0x1f0
run_delayed_iput_locked+0x46/0x90
btrfs_run_delayed_iputs+0x3f/0x60
cleaner_kthread+0xf2/0x120
kthread+0x10b/0x130
Fix this by adding a cond_resched_lock() to the loop processing delayed
iputs so we can avoid these sort of stalls.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 7000babdda ("btrfs: assign proper values to a bool variable in
dev_extent_hole_check_zoned") assigned false to the hole_start parameter
of dev_extent_hole_check_zoned().
The hole_start parameter is not boolean and returns the start location of
the found hole.
Fixes: 7000babdda ("btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
crypt_stat memory itself is allocated when inode is created, in
ecryptfs_alloc_inode, which returns NULL on failure and is handled
by callers, which would prevent us getting to this point. It then
calls ecryptfs_init_crypt_stat which allocates crypt_stat->tfm
checking for and likewise handling allocation failure. Finally,
crypt_stat->flags has ECRYPTFS_STRUCT_INITIALIZED merged into it
in ecryptfs_init_crypt_stat as well.
Simply put, the conditions that the BUG_ON checks for will never
be triggered, as to even get to this function, the relevant conditions
will have already been fulfilled (or the inode allocation would fail in
the first place and thus no call to this function or those above it).
Cc: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20210503115736.2104747-50-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 2c2a7552dd.
Because of recent interactions with developers from @umn.edu, all
commits from them have been recently re-reviewed to ensure if they were
correct or not.
Upon review, this commit was found to be incorrect for the reasons
below, so it must be reverted. It will be fixed up "correctly" in a
later kernel change.
The original commit log for this change was incorrect, no "error
handling code" was added, things will blow up just as badly as before if
any of these cases ever were true. As this BUG_ON() never fired, and
most of these checks are "obviously" never going to be true, let's just
revert to the original code for now until this gets unwound to be done
correctly in the future.
Cc: Aditya Pakki <pakki001@umn.edu>
Fixes: 2c2a7552dd ("ecryptfs: replace BUG_ON with error handling code")
Cc: stable <stable@vger.kernel.org>
Acked-by: Tyler Hicks <code@tyhicks.com>
Link: https://lore.kernel.org/r/20210503115736.2104747-49-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the 1st NONHEAD lcluster of a pcluster isn't CBLKCNT lcluster type
rather than a HEAD or PLAIN type instead, which means its pclustersize
_must_ be 1 lcluster (since its uncompressed size < 2 lclusters),
as illustrated below:
HEAD HEAD / PLAIN lcluster type
____________ ____________
|_:__________|_________:__| file data (uncompressed)
. .
.____________.
|____________| pcluster data (compressed)
Such on-disk case was explained before [1] but missed to be handled
properly in the runtime implementation.
It can be observed if manually generating 1 lcluster-sized pcluster
with 2 lclusters (thus CBLKCNT doesn't exist.) Let's fix it now.
[1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org
Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org
Fixes: cec6e93bea ("erofs: support parsing big pcluster compress indexes")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
We currently don't have any filesystems that support idmapped mounts
which are mountable inside a user namespace. That was a deliberate
decision for now as a userns root can just mount the filesystem
themselves. So enforce this restriction explicitly until there's a real
use-case for this. This way we can notice it and will have a chance to
adapt and audit our translation helpers and fstests appropriately if we
need to support such filesystems.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The final solution can be migrating blocks to form a section-aligned file
internally. Meanwhile, let's ask users to do that when preparing the swap
file initially like:
1) create()
2) ioctl(F2FS_IOC_SET_PIN_FILE)
3) fallocate()
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 36e4d95891 ("f2fs: check if swapfile is section-alligned")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_destroy_compress_ctx(), after f2fs_destroy_compress_ctx(),
cc.cluster_idx will be cleared w/ NULL_CLUSTER, f2fs_cluster_blocks()
may check wrong cluster metadata, fix it.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
pos_fsstress testcase complains a panic as belew:
------------[ cut here ]------------
kernel BUG at fs/f2fs/compress.c:1082!
invalid opcode: 0000 [#1] SMP PTI
CPU: 4 PID: 2753477 Comm: kworker/u16:2 Tainted: G OE 5.12.0-rc1-custom #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: writeback wb_workfn (flush-252:16)
RIP: 0010:prepare_compress_overwrite+0x4c0/0x760 [f2fs]
Call Trace:
f2fs_prepare_compress_overwrite+0x5f/0x80 [f2fs]
f2fs_write_cache_pages+0x468/0x8a0 [f2fs]
f2fs_write_data_pages+0x2a4/0x2f0 [f2fs]
do_writepages+0x38/0xc0
__writeback_single_inode+0x44/0x2a0
writeback_sb_inodes+0x223/0x4d0
__writeback_inodes_wb+0x56/0xf0
wb_writeback+0x1dd/0x290
wb_workfn+0x309/0x500
process_one_work+0x220/0x3c0
worker_thread+0x53/0x420
kthread+0x12f/0x150
ret_from_fork+0x22/0x30
The root cause is truncate() may race with overwrite as below,
so that one reference count left in page can not guarantee the
page attaching in mapping tree all the time, after truncation,
later find_lock_page() may return NULL pointer.
- prepare_compress_overwrite
- f2fs_pagecache_get_page
- unlock_page
- f2fs_setattr
- truncate_setsize
- truncate_inode_page
- delete_from_page_cache
- find_lock_page
Fix this by avoiding referencing updated page.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In error path of f2fs_write_compressed_pages(), it needs to call
f2fs_compress_free_page() to release temporary page.
Fixes: 5e6bbde959 ("f2fs: introduce mempool for {,de}compress intermediate page allocation")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_fileattr_set(),
if (!fa->flags_valid)
mask &= FS_COMMON_FL;
In this case, we can set supported flags by mask only instead of BUG_ON.
/* Flags shared betwen flags/xflags */
(FS_SYNC_FL | FS_IMMUTABLE_FL | FS_APPEND_FL | \
FS_NODUMP_FL | FS_NOATIME_FL | FS_DAX_FL | \
FS_PROJINHERIT_FL)
Fixes: 9b1bb01c8a ("f2fs: convert to fileattr")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCaiuQACgkQxWXV+ddt
WDv3Ww//bDUlNXqAYEoLKePohy1bupiqG8lKYX4s4bGEq0x0cyh4qVER/Q/lU2l2
AMf8t6Pwr/iBOPwfckreLDuFrhacvWq0K4eMkgpf++3P0Mzbj2sIBX0+XnrWluRL
yFCZudJej+cpM55Ve4l6M8zrk1nbzYJLFPRRdOIFe4HonWkhI/zY6RD7kFybQevW
mAxqMgIpUQAjoj5F/EhwXQ9dk6PXSZj+gaOoNrmQmN7mZMqNgSLHBEoJUHrotm1K
rDlEwIRUTtNPV+rcPxcXD1GFiUxU0cZhg0jts252z89Mvaqb2g/YKaHPAR/IVIt5
enf4llZzoEeiMnHuSj9zCg4HxOvCCFV8zZYXlO7/9IqdgLJjQkElZoqTz45obWdE
aoJrHAWWlulS2jPocJfJ/Zti2xBYGLjQASH0kYS+vjVxjKyqz3fuM1Tsasaf9Mcp
+M2m6yMBjJ0nJMTL2CgBksCd0dHwfiBZ/YYClrMSjYlzYSU6ofA2b2hej0OjqZ4X
FmpEmCBK4lySdJI+JlJKikeneOOxKSpT0xGqU+OMmbpwFH3k1N3oseu0hrG8Xreo
RU1xNbekGTwRbCcCA9l5HQ/RYptT7rt/KqkC70UFEvdIijCNcptOGaTAoYvLS14O
T+yu0Cizt7O0Fdg5E+MAS/qaI2yacXxBfIkMDbPxHGUg7+vUteM=
=Phtq
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Handle transaction start error in btrfs_fileattr_set()
This is fix for code introduced by the new fileattr merge"
* tag 'for-5.13-rc1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: handle transaction start error in btrfs_fileattr_set
Add error handling in btrfs_fileattr_set in case of an error while
starting a transaction. This fixes btrfs/232 which otherwise used to
fail with below signature on Power.
btrfs/232 [ 1119.474650] run fstests btrfs/232 at 2021-04-21 02:21:22
<...>
[ 1366.638585] BUG: Unable to handle kernel data access on read at 0xffffffffffffff86
[ 1366.638768] Faulting instruction address: 0xc0000000009a5c88
cpu 0x0: Vector: 380 (Data SLB Access) at [c000000014f177b0]
pc: c0000000009a5c88: btrfs_update_root_times+0x58/0xc0
lr: c0000000009a5c84: btrfs_update_root_times+0x54/0xc0
<...>
pid = 24881, comm = fsstress
btrfs_update_inode+0xa0/0x140
btrfs_fileattr_set+0x5d0/0x6f0
vfs_fileattr_set+0x2a8/0x390
do_vfs_ioctl+0x1290/0x1ac0
sys_ioctl+0x6c/0x120
system_call_exception+0x3d4/0x410
system_call_common+0xec/0x278
Fixes: 97fc297754 ("btrfs: convert to fileattr")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCZnCIACgkQxWXV+ddt
WDuEvhAAmC+Mkrz25GbQnSIp2FKYCCQK34D0rdghml0Bc0cJcDh3yhgIB6ZTHZ7e
Z+UZu84ISK31OHKDzXtX0MINN2wuU4u4kd6PHtYj0wSVl3cX6E/K5j6YcThfI1Ru
vCW5O87V9SCV5NnykIFt3sbYvsPKtF9lhgPQprj4np+wxaSyNlEF2c+zLTI3J7NV
+8OlM4oi8GocZd1aAwGpVM3qUPyQSHEb9oUEp6aV1ERuAs6LIyeGks3Cag6gjPnq
dYz3jV9HyZB5GtX0dmv4LeRFIog1uFi+SIEFl5RpqhB3sXN3n6XHMka4x20FXiWy
PfX9+Nf4bQGx6F9rGsgHNHQP5dVhHAkZcq3E0n0yshIfNe8wDHBRlmk0wbfj4K7I
VYv85SxEYpigG8KzF5gjiar4EqsaJVQcJioMxVE7z9vrW6xlOWD1lf/ViUZnB3wd
IQEyGz2qOe9eqJD+dnyN7QkN9WKGSUr2p1Q/DngCIwFzKWf1qIlETNXrIL+AZ97r
v4G5mMq9dCxs3s8c5SGbdF9qqK8gEuaV3iWQAoKOciuy6fbc553Q90I1v3OhW+by
j2yVoo3nJbBJBuLBNWPDUlwxQF/EHPQ6nh3fvxNRgwksXgRmqywdJb5dQ8hcKgSH
RsvinJhtKo5rTgtgGgmNvmLAjKIieW1lIVG4ha0O/m49HeaohDE=
=GNNs
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"First batch of various fixes, here's a list of notable ones:
- fix unmountable seed device after fstrim
- fix silent data loss in zoned mode due to ordered extent splitting
- fix race leading to unpersisted data and metadata on fsync
- fix deadlock when cloning inline extents and using qgroups"
* tag 'for-5.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize return variable in cleanup_free_space_cache_v1
btrfs: zoned: sanity check zone type
btrfs: fix unmountable seed device after fstrim
btrfs: fix deadlock when cloning inline extents and using qgroups
btrfs: fix race leading to unpersisted data and metadata on fsync
btrfs: do not consider send context as valid when trying to flush qgroups
btrfs: zoned: fix silent data loss after failure splitting ordered extent
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCXTY4ACgkQiiy9cAdy
T1GTwAv8DHmrUOSMPY79kH+vYrg67S8bnCEE2b0ChP0CyU/Gp05CRUTH+C9bstb+
hou6Wx7tvAa++PpgdwBfNcj+Be76PcPLxCQoXglOzckpuAdvyCLgzcOhrPoOYAyR
gbLsfkIXMPGFLnGltZMQl9TCXEIpF6xjizZUyHVSq0UBhIyfBOdsDeoCNbVOuB1Y
t6KtBRTro+P2DIj9zITKYCnKFU3OXae/gBwv+wQU066mNq/IxlMqX84Av728govJ
sEd8CSLwnwX4/NsvADjiy4+M6iLvWtXN9xDr6S5Hyo+1Y/SkU/C1qo8Uc8H74gdT
D9LSoGTBx36X97eSisrE8vOFt3SIwEy9fU/yU87qrDuwwCjBYNvYCxK+VaVUA0vH
1CjScSifrG7MZAAw7h4o6Ug6q4Otobabj+ODq4exkjW+GYb0wQ/DzjqVuoIvqP7F
TLoRisVWpg/MY7FcEU1ZpYdHAFyKKTcdOLlmX6Y40sKue0mwfpcv0rI+isNUqCYX
R3CJ5tPp
=9Egy
-----END PGP SIGNATURE-----
Merge tag '5.13-rc-smb3-part3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small SMB3 chmultichannel related changesets (also for stable)
from the SMB3 test event this week.
The other fixes are still in review/testing"
* tag '5.13-rc-smb3-part3' of git://git.samba.org/sfrench/cifs-2.6:
smb3: if max_channels set to more than one channel request multichannel
smb3: do not attempt multichannel to server which does not support it
smb3: when mounting with multichannel include it in requested capabilities
WARNING: CPU: 0 PID: 10242 at lib/refcount.c:28 refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
io_put_req fs/io_uring.c:2140 [inline]
io_queue_linked_timeout fs/io_uring.c:6300 [inline]
__io_queue_sqe+0xbef/0xec0 fs/io_uring.c:6354
io_submit_sqe fs/io_uring.c:6534 [inline]
io_submit_sqes+0x2bbd/0x7c50 fs/io_uring.c:6660
__do_sys_io_uring_enter fs/io_uring.c:9240 [inline]
__se_sys_io_uring_enter+0x256/0x1d60 fs/io_uring.c:9182
io_link_timeout_fn() should put only one reference of the linked timeout
request, however in case of racing with the master request's completion
first io_req_complete() puts one and then io_put_req_deferred() is
called.
Cc: stable@vger.kernel.org # 5.12+
Fixes: 9ae1f8dd37 ("io_uring: fix inconsistent lock state")
Reported-by: syzbot+a2910119328ce8e7996f@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff51018ff29de5ffa76f09273ef48cb24c720368.1620417627.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Convert sh and sparc to use generic shell scripts to generate the
syscall headers
- refactor .gitignore files
- Update kernel/config_data.gz only when the content of the .config is
really changed, which avoids the unneeded re-link of vmlinux
- move "remove stale files" workarounds to scripts/remove-stale-files
- suppress unused-but-set-variable warnings by default for Clang as well
- fix locale setting LANG=C to LC_ALL=C
- improve 'make distclean'
- always keep intermediate objects from scripts/link-vmlinux.sh
- move IF_ENABLED out of <linux/kconfig.h> to make it self-contained
- misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmCWrucVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRLkQAJ8t7PfMJLSh/VcgDXp3Z7fZ/V2M
RUGbOeRYErR1gylejuip/R19mS5MiBNecU60VrugZyDOMf98+mx61mI/ykpPeX92
sE3VU5MPXEwmv758QUr4gH014TZshMtHHo+tXA+NVUbqFp7RTnkZMDjOXGthYDHG
NhDou4LZ2P0CUKm8vb58SJPqB7ZdYOT9eEQEdHevm18Gx0KProCxRziup7loldy7
ET770okQ23if90ufCSVmnM6Ee6opoKYvXS5lv8V/a4xV/VbicbUclpzIZsHF7L2i
mIfr6dy480ncOaQlfWnX9ACgIeeqiFPOeZbAu7HAtwXzP5vCahgQ9FKVC7KPt+BP
Lf3LgdBrfSP5A7f7FrtkkPmP7pl1j6/Bq3+PhCur9XimtRIsvTOx7m7nuvsY4yHC
/wmBXFZgqE5DGyzpHXz1az8JHWw2AesP9L2f536BhfvRtdXaoOxPtZ/rmO1lfcMV
fWMa9f1em8lXwCiD1dR8UkBrIxItty+qqPffu2S/DlEepbiZrCg1gD827Fy7Mm3n
5rvrzYMOY2YK0yW1jtm+w3NlPlmG91BDUTP8tEcDxrTOIXezwqJf7fw8qIgGIy7W
3WzuBfgSvpT977ByMsB0YYugo2Xie+R1jpOWt7tv6KHM4varNBu0WpVhQhrKQr5o
agJiuvzsf3b+64oP
=935P
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- Convert sh and sparc to use generic shell scripts to generate the
syscall headers
- refactor .gitignore files
- Update kernel/config_data.gz only when the content of the .config
is really changed, which avoids the unneeded re-link of vmlinux
- move "remove stale files" workarounds to scripts/remove-stale-files
- suppress unused-but-set-variable warnings by default for Clang
as well
- fix locale setting LANG=C to LC_ALL=C
- improve 'make distclean'
- always keep intermediate objects from scripts/link-vmlinux.sh
- move IF_ENABLED out of <linux/kconfig.h> to make it self-contained
- misc cleanups
* tag 'kbuild-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (25 commits)
linux/kconfig.h: replace IF_ENABLED() with PTR_IF() in <linux/kernel.h>
kbuild: Don't remove link-vmlinux temporary files on exit/signal
kbuild: remove the unneeded comments for external module builds
kbuild: make distclean remove tag files in sub-directories
kbuild: make distclean work against $(objtree) instead of $(srctree)
kbuild: refactor modname-multi by using suffix-search
kbuild: refactor fdtoverlay rule
kbuild: parameterize the .o part of suffix-search
arch: use cross_compiling to check whether it is a cross build or not
kbuild: remove ARCH=sh64 support from top Makefile
.gitignore: prefix local generated files with a slash
kbuild: replace LANG=C with LC_ALL=C
Makefile: Move -Wno-unused-but-set-variable out of GCC only block
kbuild: add a script to remove stale generated files
kbuild: update config_data.gz only when the content of .config is changed
.gitignore: ignore only top-level modules.builtin
.gitignore: move tags and TAGS close to other tag files
kernel/.gitgnore: remove stale timeconst.h and hz.bc
usr/include: refactor .gitignore
genksyms: fix stale comment
...
Mounting with "multichannel" is obviously implied if user requested
more than one channel on mount (ie mount parm max_channels>1).
Currently both have to be specified. Fix that so that if max_channels
is greater than 1 on mount, enable multichannel rather than silently
falling back to non-multichannel.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Cc: <stable@vger.kernel.org> # v5.11+
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
We were ignoring CAP_MULTI_CHANNEL in the server response - if the
server doesn't support multichannel we should not be attempting it.
See MS-SMB2 section 3.2.5.2
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Cc: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Steve French <stfrench@microsoft.com>
In the SMB3/SMB3.1.1 negotiate protocol request, we are supposed to
advertise CAP_MULTICHANNEL capability when establishing multiple
channels has been requested by the user doing the mount. See MS-SMB2
sections 2.2.3 and 3.2.5.2
Without setting it there is some risk that multichannel could fail
if the server interpreted the field strictly.
Reviewed-By: Tom Talpey <tom@talpey.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Steve French <stfrench@microsoft.com>
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
the problem consistently.
So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.
invalidate_inode_pages2()
invalidate_inode_pages2_range()
invalidate_exceptional_entry2()
dax_invalidate_mapping_entry_sync()
__dax_invalidate_entry() {
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, 0);
...
...
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
...
...
put_unlocked_entry(&xas, entry);
xas_unlock_irq(&xas);
}
Say a fault in in progress and it has locked entry at offset say "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate entry at offset "0x1c". Given
dax entry is locked, all tree instances A, B, C will wait in wait queue.
When dax fault finishes, say A is woken up. It will store NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up next waiter, given
the current code. And that means C continues to wait and is not woken
up.
This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.
Reported-by: Sergio Lopez <slp@redhat.com>
Fixes: ac401cc782 ("dax: New fault locking")
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-4-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As of now put_unlocked_entry() always wakes up next waiter. In next
patches we want to wake up all waiters at one callsite. Hence, add a
parameter to the function.
This patch does not introduce any change of behavior.
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-3-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it very
explicity at the callsite itself. Easier to read code.
This patch should not introduce any change of behavior.
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-2-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCVVnQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps0ND/0SL4zWQJ5fh+NVCyQJFLm0E+ejqWg6Ykmk
EE1Dzhgr9lgxZU19UCXKtN0lF9icWPfoVDxvqsB2luJLc89GciOmla3PaknCgY6N
QZ/GJh/2Kwb9ybVblzKvUNnGSZOZ8gplpAAXu4zlbFXl7xoGBb12kql78fjw84rS
S4IG+nKvTdC6ENVTPwFMj0UREL5nccVJycvsuZgzYsSQ//5i5zViDz7mfdCujAo4
g3rt8rctBqYoF684BG4OVkDp7ivJUFvMW93PVqvx8vw2sAOB11v+sAKvX5cZIsdM
Z01a3C5nY8IQcpXhoI7n6Kgg4VY0ubeiOrlIBssNQWJszquAHPN7s5uiiSFaIKwg
mCyo69Ofmk4wYm2UO0hM8y7x94QvUNKmlcVxb4ls5OEaAKS/v7chnjoovp8s8Me/
2w1BMBB4qPcF99+K2GF9KyT/gKrXDRXkr9ERTtLLPpCf2uIXtFcU+X+Y64cOivhf
ImN1kbN8fQm1ItiEntn5tVd9u9cDnfqTJhzutBolLP33jjarK3TblJ4cUZqN/xAC
uH5k1IXZGHbrE9LuXUJQwFs752m21LElSkfG7OxzlktfJcKxJriM9o/dw0mgEmLv
0i1meb55VMbtYT/dNWZEa2FRVtelFIngfoiLSgH0IHXU7sKgTEpgyLmSu4PrySez
kRVUsF1Lfw==
=Sv+q
-----END PGP SIGNATURE-----
Merge tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- dasd spelling fixes (Bhaskar)
- Limit bio max size on multi-page bvecs to the hardware limit, to
avoid overly large bio's (and hence latencies). Originally queued for
the merge window, but needed a fix and was dropped from the initial
pull (Changheun)
- NVMe pull request (Christoph):
- reset the bdev to ns head when failover (Daniel Wagner)
- remove unsupported command noise (Keith Busch)
- misc passthrough improvements (Kanchan Joshi)
- fix controller ioctl through ns_head (Minwoo Im)
- fix controller timeouts during reset (Tao Chiu)
- rnbd fixes/cleanups (Gioh, Md, Dima)
- Fix iov_iter re-expansion (yangerkun)
* tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block:
block: reexpand iov_iter after read/write
nvmet: remove unsupported command noise
nvme-multipath: reset bdev to ns head when failover
nvme-pci: fix controller reset hang when racing with nvme_timeout
nvme: move the fabrics queue ready check routines to core
nvme: avoid memset for passthrough requests
nvme: add nvme_get_ns helper
nvme: fix controller ioctl through ns_head
bio: limit bio max size
RDMA/rtrs: fix uninitialized symbol 'cnt'
s390: dasd: Mundane spelling fixes
block/rnbd: Remove all likely and unlikely
block/rnbd-clt: Check the return value of the function rtrs_clt_query
block/rnbd: Fix style issues
block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCVVmMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv7yEAC/WV1alcH9XdEqLrc2aDwlaScMmSlrMQhY
ihtDCR9BsX11E3QcUB7D+VYjBo68uKR+ksa1/GN2Xp+vvqmdjQvZindgto/5b6u1
ko0Dradl2zulCAc7QIdjb2tbmL+Q+JOX5wxv14/+2XabEcce3OegWIvIgX+56NFW
ZHg80SQzXUhEtQcAUVCoPeBN+H+xzadgz38VlOI08gOG7/M6tS965GH3tZqTjh2K
P7dLjUn0WcxZ3euAYAsQzNN2O2ObJfpCsQtsG2eSf8DGpanPe4gQjAud1BstDtN0
CJ0+b6DHgzQYOAgPFjm7l0jjs+VnIYIMnoBBxm5EkIoktsj0hHdqTnEugoz4wTnS
T8WgojaU6jYNx+Jj6vciCLk0lb5c3O3nxmw3w84/rtTwtaEChCAbWdAkl4cleNaw
3/Z2bksCVrQWDVskmu4FP7+kGYpjpV+ZiA2+6OGwILTCN+W7vi079NByQAzdLaRb
K/4lEGM7VYEXtq/I7C6VzjtY7gq46TJmpFW+OdQnPIguavp+7vlUl2pLV3oTeGBc
E6c+xltgIN+sbbDc/57EJEvhHQod4A6HYOGwBMyjHrhr/sdQ4xvUaJPNmG9HfqRK
SM3TOlwpHRWFTgbO+6qoJQSMvACQyE/SDqiPi08q75zFVTNCcYM7uYV3fJMsQ9sj
vA+5HAaRKQ==
=YwTw
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.13-2021-05-07' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Mostly fixes for merge window merged code. In detail:
- Error case memory leak fixes (Colin, Zqiang)
- Add the tools/io_uring/ to the list of maintained files (Lukas)
- Set of fixes for the modified buffer registration API (Pavel)
- Sanitize io thread setup on x86 (Stefan)
- Ensure we truncate transfer count for registered buffers (Thadeu)"
* tag 'io_uring-5.13-2021-05-07' of git://git.kernel.dk/linux-block:
x86/process: setup io_threads more like normal user space threads
MAINTAINERS: add io_uring tool to IO_URING
io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers
io_uring: Fix memory leak in io_sqe_buffers_register()
io_uring: Fix premature return from loop and memory leak
io_uring: fix unchecked error in switch_start()
io_uring: allow empty slots for reg buffers
io_uring: add more build check for uapi
io_uring: dont overlap internal and user req flags
io_uring: fix drain with rsrc CQEs
Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when the
server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the tracking
of invalid cached file metadata.
- Support the NFSv4.2 "change_attr_type" attribute and use it to
optimise handling of change attribute updates.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmCVLooACgkQZwvnipYK
APJB5BAAtIJyhx40ooMBzcucDmXd1qovlKsb8ZlvnSI6c7wvHhFPNk9z4zwThnjL
FpVYzJzK6XzAQY/PtgbrPwnSUmW925ngPWYR/hiYe+OGPBnYV+tXP8izCyEkNgMg
45goDOxojGWl7AGTuAJiKcDSdH9PyIrbvt28iwcNSGjslasGSbAoL/836l4OIGr1
Ymxs/NDML11dPco8GIKLGtHd8leFGleDx089VeNsgud8MdaFErp16O5Iz8DdzRKd
W1l2zDMb05j8eDZIfy3w3FyrLkDXA+KgLSADiC8TcpxoadPaQJMeCvoIq8oqVndn
bZBoxduXdLgf54Aec0WnNKFAOyc7pGvZoSNmFouT7EGV73g+g1LQ+ZbEE1bb8fCQ
XHqCVaBt2+47NiTUgdxjXlZRfcn8fYKx0tVxfG3mQVMXUAWfsjmMyQMNgijDRJI2
8Wz3lZMRGMILbR9j4QpP1biVy/2zGNWG/TB5ZZyZMSY4uT+aOpzlqdknb4UsRaSp
f7MfmB7xEWpS4DJr9RIBrJ/hIdnMu1mNInxDPFo5Kl5HNp4TaPm2dPir2ZD2wMZI
daURTX7giUhpE15ZebQDBqWD+mTR0bVDqLLeo131JRmMfMEHugNrr49xe+NkBu/R
QWnFzgkGdQsOeiKRRwEUuhsi74JspqfwzdZzHqcRM5WuXVvBLcA=
=h01b
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift
out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when
the server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the
tracking of invalid cached file metadata.
- Support the NFSv4.2 'change_attr_type' attribute and use it to
optimise handling of change attribute updates"
* tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (85 commits)
xprtrdma: Fix a NULL dereference in frwr_unmap_sync()
sunrpc: Fix misplaced barrier in call_decode
NFSv4.2: Remove ifdef CONFIG_NFSD from NFSv4.2 client SSC code.
xprtrdma: Move fr_mr field to struct rpcrdma_mr
xprtrdma: Move the Work Request union to struct rpcrdma_mr
xprtrdma: Move fr_linv_done field to struct rpcrdma_mr
xprtrdma: Move cqe to struct rpcrdma_mr
xprtrdma: Move fr_cid to struct rpcrdma_mr
xprtrdma: Remove the RPC/RDMA QP event handler
xprtrdma: Don't display r_xprt memory addresses in tracepoints
xprtrdma: Add an rpcrdma_mr_completion_class
xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation
xprtrdma: Avoid Send Queue wrapping
xprtrdma: Do not wake RPC consumer on a failed LocalInv
xprtrdma: Do not recycle MR after FastReg/LocalInv flushes
xprtrdma: Clarify use of barrier in frwr_wc_localinv_done()
xprtrdma: Rename frwr_release_mr()
xprtrdma: rpcrdma_mr_pop() already does list_del_init()
xprtrdma: Delete rpcrdma_recv_buffer_put()
xprtrdma: Fix cwnd update ordering
...