Instead of appending new text attribute data at the offset specified by the
write() system call, only pass the newly written data to the .store()
callback.
Reported-by: Bodo Stroesser <bostroesser@gmail.com>
Tested-by: Bodo Stroesser <bostroesser@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The compiler should be forbidden from any strange optimization for async
writes to user visible data-structures. Without proper protection, the
compiler can cause write-tearing or invent writes that would confuse the
userspace.
However, there are writes to sq_flags which are not protected by
WRITE_ONCE(). Use WRITE_ONCE() for these writes.
This is purely a theoretical issue. Presumably, any compiler is very
unlikely to do such optimizations.
Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using SQPOLL, the submission queue polling thread calls
task_work_run() to run queued work. However, when work is added with
TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
set afterwards and is never cleared.
Consequently, when the submission queue polling thread checks whether
signal_pending(), it may always find a pending signal, if
task_work_add() was ever called before.
The impact of this bug might be different on different kernel versions.
It appears that on 5.14 it would only cause unnecessary calculation and
prevent the polling thread from sleeping. On 5.13, where the bug was
found, it stops the polling thread from finding newly submitted work.
Instead of task_work_run(), use tracehook_notify_signal() that clears
TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
current->task_works to avoid a race in which task_works is cleared but
the TIF_NOTIFY_SIGNAL is set.
Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEOuKcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpk4WD/9XQIUM2WmJrWCdrGJ64hgWyOFMg7Zhzbwn
1swI2yMRcAHVOphP3Iyavxn7JHk+AKRcRgnjsKBdIn1WykkRwoIGfYGhTaO2zD7E
T6r8BH/qLkwJcbhA9KNmhA3gf/zEnJgvc6JOVMtSvAdO6FTSfrh1aQneb/xjr0hx
Yuy4dUtx6AmXjUZ0lh4XCud8IOzOrvjF9cZHVq/Q+DP+m/MnlpZOVws9Ge0XeAgN
t7I9NouiyHNt5Z1SIQITbPvAGjY6/z/j2FOuECxgfdo7VpNUgK+V3byaXLFwLjLf
X7DI9G1pf0fNPOrJktS7WarLVX4NKw8nt1Og5iOhFp7MoBl6SMWWY/2YOESk172g
k0KGnPeoF+NZGnCyWLs6cEk6X0KQODAwWLhiYUuDcP3C/3H9BNNIR3IkfRhhOoJZ
1aTvMjYRa0uD/fRnHydEfMlLl9PyjnmVFD8WLwI70zUO0GpuhlfLIq0rWgtsOG35
s/CJVDH4VoDnMsx5Pg+oXsywiKDhhVLP2y4D35B2ivFrrLy+pAfXI8t7+IVP5Xm4
sMmVqEtZLBvYBxrbsO81cWI5Ddcc4/z+F/I5DfACdi26GZC/Pn1/L3Le+i4zLypz
CEY8vSM9+qgApxElaWrNOPtOFWV/XUll3lRLNIT9QncqtdmDdkofBdSv9lNj+wRt
pVhBn/B6NQ==
=FAeG
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.14-2021-08-07' of git://git.kernel.dk/linux-block
Pull io_uring from Jens Axboe:
"A few io-wq related fixes:
- Fix potential nr_worker race and missing max_workers check from one
path (Hao)
- Fix race between worker exiting and new work queue (me)"
* tag 'io_uring-5.14-2021-08-07' of git://git.kernel.dk/linux-block:
io-wq: fix lack of acct->nr_workers < acct->max_workers judgement
io-wq: fix no lock protection of acct->nr_worker
io-wq: fix race between worker exiting and activating free worker
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmENbm0ACgkQ8vlZVpUN
gaNEEQf7B8GXPqJvRgqhJwCdqmZqz8DB7dfjqT0SB99f1EL3VoeHEvo+yEgMqD3L
cSYRFh4efEHgr51HSZoIPINqcU9hs86SvFmjd6jWIcnY/EJLd0g3e8aEWpJ3S5rR
3avSC4tiDbn34GgDeoR2DFG6RsGbRxDUEEzkrd8h7Hx6q39s3aXdi89lBmBe8rg/
lIVaeivZrZ7SfY/YFEziF0P7KurJNju6lGwqm0xAqu79J9QaabXMF1u5GPjUi2rw
TIaLMSP6O5VQbQwskcTIhJlKSAB4aUIB+fMV5Zi2cCXAKGdzK24xdM5VbzOeKAAX
1EwOE9GEyytpxD1P0zb8vVGsJW3wjQ==
=hQAS
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A regression fix, bug fix, and a comment cleanup for ext4"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix potential htree corruption when growing large_dir directories
ext4: remove conflicting comment from __ext4_forget
ext4: fix potential uninitialized access to retval in kmmpd
Commit b5776e7524 ("ext4: fix potential htree index checksum
corruption) removed a required restart when multiple levels of index
nodes need to be split. Fix this to avoid directory htree corruptions
when using the large_dir feature.
Cc: stable@kernel.org # v5.11
Cc: Благодаренко Артём <artem.blagodarenko@gmail.com>
Fixes: b5776e7524 ("ext4: fix potential htree index checksum corruption)
Reported-by: Denis <denis@voxelsoft.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There should be this judgement before we create an io-worker
Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is an acct->nr_worker visit without lock protection. Think about
the case: two callers call io_wqe_wake_worker(), one is the original
context and the other one is an io-worker(by calling
io_wqe_enqueue(wqe, linked)), on two cpus paralelly, this may cause
nr_worker to be larger than max_worker.
Let's fix it by adding lock for it, and let's do nr_workers++ before
create_io_worker. There may be a edge cause that the first caller fails
to create an io-worker, but the second caller doesn't know it and then
quit creating io-worker as well:
say nr_worker = max_worker - 1
cpu 0 cpu 1
io_wqe_wake_worker() io_wqe_wake_worker()
nr_worker < max_worker
nr_worker++
create_io_worker() nr_worker == max_worker
failed return
return
But the chance of this case is very slim.
Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fix unconditional create_io_worker() call]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We used to follow the rule earlier that the create SD context
always be a multiple of 8. However, with the change:
cifs: refactor create_sd_buf() and and avoid corrupting the buffer
...we recompute the length, and we failed that rule.
Fixing that with this change.
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This program always prints 4096 and hangs before the patch, and always
prints 8192 and exits successfully after:
int main()
{
int pipefd[2];
for (int i = 0; i < 1025; i++)
if (pipe(pipefd) == -1)
return 1;
size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ);
printf("%zd\n", bufsz);
char *buf = calloc(bufsz, 1);
write(pipefd[1], buf, bufsz);
read(pipefd[0], buf, bufsz-1);
write(pipefd[1], buf, 1);
}
Note that you may need to increase your RLIMIT_NOFILE before running the
program.
Fixes: 759c01142a ("pipe: limit the per-user amount of pages allocated in pipes")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/
Link: https://lore.kernel.org/lkml/1628127094.lxxn016tj7.none@localhost/
Signed-off-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nadav correctly reports that we have a race between a worker exiting,
and new work being queued. This can lead to work being queued behind
an existing worker that could be sleeping on an event before it can
run to completion, and hence introducing potential big latency gaps
if we hit this race condition:
cpu0 cpu1
---- ----
io_wqe_worker()
schedule_timeout()
// timed out
io_wqe_enqueue()
io_wqe_wake_worker()
// work_flags & IO_WQ_WORK_CONCURRENT
io_wqe_activate_free_worker()
io_worker_exit()
Fix this by having the exiting worker go through the normal decrement
of a running worker, which will spawn a new one if needed.
The free worker activation is modified to only return success if we
were able to find a sleeping worker - if not, we keep looking through
the list. If we fail, we create a new worker as per usual.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a race in ceph_put_snap_realm. The change to the nref and the
spinlock acquisition are not done atomically, so you could decrement
nref, and before you take the spinlock, the nref is incremented again.
At that point, you end up putting it on the empty list when it
shouldn't be there. Eventually __cleanup_empty_realms runs and frees
it when it's still in-use.
Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
and just drop the spinlock if we can get the rwsem.
Because these objects can also undergo a 0->1 refcount transition, we
must protect that change as well with the spinlock. Increment locklessly
unless the value is at 0, in which case we take the spinlock, increment
and then take it off the empty list if it did the 0->1 transition.
With these changes, I'm removing the dout() messages from these
functions, as well as in __put_snap_realm. They've always been racy, and
it's better to not print values that may be misleading.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46419
Reported-by: Mark Nelson <mnelson@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list. This may result in the
watchdog tainting the kernel with the softlockup flag.
This patch breaks this loop if the caps have been recently (i.e. during
the loop execution). Any new caps added to the list will be handled in
the next run.
Also, allow schedule_delayed() callers to explicitly set the delay value
instead of defaulting to 5s, so we can ensure that it runs soon
afterward if it looks like there is more work.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46284
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Fix a number of coordination bugs relating to cache flushes for
metadata writeback, cache flushes for multi-buffer log writes, and
FUA writes for single-buffer log writes.
* Fix a bug with incorrect replay of attr3 blocks.
* Fix unnecessary stalls when flushing logs to disk.
* Fix spoofing problems when recovering realtime bitmap blocks.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEC2PgACgkQ+H93GTRK
tOtEvg//XQTcqKgO+60lJzhfgfGD8HsYWGcAc0UW8vu0I6gPNstd/PHKBCYkhT66
rp0l8CtZhbo3qj2ZJTIDvVxFeeAUcMhAIgU4gJB6OmW6/VV8NJlArfeyaA+85/lV
lVYD53qBcc0IydDWlRD5oU8T55pqv9hg0W9WkpWrtjxoTlxPX5rDj7yrKEqiQs1M
IUa5X4Qnwo/C2ATD/t2G3PIM7OxdCJ7YjyrZ27VWWRsUJW8DOqXtJX6HBs+VT9cM
mh/IeIy60rmKgf2Ag2ZJCvrKnmqXqJFyGjEDzk6gXoqktQyWnUBLhQoyLh5r9UlA
4ThLGvPwUh5QEFOoo3cpN72X0wUeHcebfh4DgY/G3PeEK4J1CVq1UXLB1a8Si7X4
qf5ZqfUU4dr6v8C2AIqd9S/H6wm8v84hzA2uXca9tsw67rAcLc6N0rHydlLtn+n8
DL4PQYcUmn0LGrhIi2t/4ec80SGBf7ad/iDbr3A0K5NsV5kMl8dReg2yCDl9kHM0
yHFk8zLTKh5fs7fmmJXOORP33YMzstET9L1oKBv9cd9iMlHNUn27o9tpwwa2noM+
v6E+UCKlRTauj/MTxZITdmNzgGEymgu5bpbb77N24OTF9jf48OEW+cr0ZzgrVYtk
wGuj9RFGcwneJoWjVPGURu1xBuC1AX9PbqnR9NQXbqmuwd6BINk=
=pLW3
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"This contains a bunch of bug fixes in XFS.
Dave and I have been busy the last couple of weeks to find and fix as
many log recovery bugs as we can find; here are the results so far. Go
fstests -g recoveryloop! ;)
- Fix a number of coordination bugs relating to cache flushes for
metadata writeback, cache flushes for multi-buffer log writes, and
FUA writes for single-buffer log writes
- Fix a bug with incorrect replay of attr3 blocks
- Fix unnecessary stalls when flushing logs to disk
- Fix spoofing problems when recovering realtime bitmap blocks"
* tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: prevent spoofing of rtbitmap blocks when recovering buffers
xfs: limit iclog tail updates
xfs: need to see iclog flags in tracing
xfs: Enforce attr3 buffer recovery order
xfs: logging the on disk inode LSN can make it go backwards
xfs: avoid unnecessary waits in xfs_log_force_lsn()
xfs: log forces imply data device cache flushes
xfs: factor out forced iclog flushes
xfs: fix ordering violation between cache flushes and tail updates
xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
xfs: external logs need to flush data device
xfs: flush data dev on external log write
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmEEaj8ACgkQiiy9cAdy
T1FdSgv+KaGjM/IPPHDymSoWEtkPp3x3frPEXfJrTSD/M8NVx+9UEnwH8JEZEn+t
PEQyGxOx28qo8ZtNJKmKAAkAm45/gTriteyjk8m2witJdDaK7wyhRYJ3Mmo99ot5
hOV55Jn40N+dK4r2AfWv6WauiDnLw83QlSmm5eN1/om3FXFx49IOjeKGaagpHGSQ
MaaEX0pMcyMop/a1TFOC01M8ergYfWF/3S3zmA5S9Ei4+PefxbhYpdj4RTRZY7rp
+2ZNyAs4Zc88l6weF4XCNck/2qrW6A4a1DCmnD19MJTGK+4FqqIBZ0f5fWV6aFGl
EIi6vsHnt4b2MGkePs/SrpifCMZk/uSM7p9TQvplX0KHhIJ5h55pO0+GFdH0uYlv
KkV2SjepCavsn8pwhVyqMRQ8+/MCQi+vAAP9vmdD2+V0cps8VFBexVSfuWEdwSBd
JuDWeVP1kgR4OQGrCp4r8wWhSCh+FOGsdrwTHTwPWL9GCoUuiTui6Z36JM5fT6Dg
437Hg7RY
=qJCo
-----END PGP SIGNATURE-----
Merge tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three cifs/smb3 fixes, including two for stable, and a fix for an
fallocate problem noticed by Clang"
* tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: add missing parsing of backupuid
smb3: rc uninitialized in one fallocate path
SMB3: fix readpage for large swap cache
Since commit 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.
In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already. Doing
extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil:
"The commit 1b6b26ae70 ('pipe: fix and clarify pipe write wakeup
logic') changed pipe write logic to wakeup readers only if the pipe
was empty at the time of write. However, there are libraries that
relied upon the older behavior for notification scheme similar to
what's described in [1]
One such library 'realm-core'[2] is used by numerous Android
applications. The library uses a similar notification mechanism as GNU
Make but it never drains the pipe until it is full. When Android moved
to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all
applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong. Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup
not for each write(), but for each pipe buffer chunk that was filled
in. The behavior introduced by this change is not that - this is just
"every write will cause a wakeup, whether necessary or not", which
seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.
See commit f467a6a664 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEEEz8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpglXD/9CGREpOf1W5oqOScpTygjehwrRnAYisQv6
Oca/qGHBa61BTN3taAJc4NMwl+IwFBER2kdTcOyz8hNmyAUPyRmFND0mG2vGTzQA
P9+ekiRKCJ1aRLsnyBL0JBbmvdoPMBHz39P165vMWMrVmnlpcPKoYDS0itHtYYNP
VD5Y3A9ACGMDglipDmL+3tsXQo/AoJqRO8WGMUBY2qJ0lasYuCbPpzq0kHzXi6kE
0X64bg6JOZVd3wdyWywKahW3ntsVNLswRUBzLVrnjwE29UuBGWgF+/vwyW/Ob0yS
ojafKvehCYnV8Q7IatASOtbwGLvLKgpJZXf7VUEsYnSD6SnmoZctjMjRdyLhNWut
lD86Y+eWjQM0pUsOVPykfrV2hd9CrhjyRFskcbI0SJRlMOl0Lstl/X17efDWcDmz
1/V8ub3gKA3HF2Gc/QKhPJDClxM7SaWnsAO3Rk+qJ6bT4EiiRg2GewI1C7YNpmGW
ty1fqcQE36JtSWadH4KL/evmX258ROfn3QT1nut2jpNsd1RQ+hHBcjcfeOx6n1GX
ALxT8LnmlVYbAUwQvXJcqFcft8K3JoB5ZXT74lat/CAbIKhfEUeSUiqnQcQ8kJLW
MTKviuZ9eJHO6/E7vw08ARDR0PmpSFqvc6rK9DiIM/kmVDz8OdLMovTqzX/hIzUT
7IfyHzQbwg==
=5FG2
-----END PGP SIGNATURE-----
Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- gendisk freeing fix (Christoph)
- blk-iocost wake ordering fix (Tejun)
- tag allocation error handling fix (John)
- loop locking fix. While this isn't the prettiest fix in the world,
nobody has any good alternatives for 5.14. Something to likely
revisit for 5.15. (Tetsuo)
* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
block: delay freeing the gendisk
blk-iocost: fix operation ordering in iocg_wake_fn()
blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
loop: reintroduce global lock for safe loop_validate_file() traversal
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEEDKIACgkQxWXV+ddt
WDtW+BAAnUD7h3ollIQo4C6hE9WaTG49Tp12Z00Og2m8hn4XyhI2QIaDz6a2CU7n
MLQv16vZUQk5Z/VMtczM+5ZF5Rf0ywlMXnS4Sq5yKWT0YHpnH7q2nMAvg4gql/tJ
Ldov92hnTrFAZX6vvkLVM5lZriY7fop3Lv2vHeAKu4CymAoisAv+SLa5xYkBR6Ig
3S16+lh/rIRgssI7KuDnjp9iTXvnB1J2MbfAOLNfqjXGWUDumu1k7HWQSNYZnHJX
L390/QS3F3K6Trxkf5MSUXOxQROqcGKQVKyAR5ZvyULKly84nDpiINze80yCopq/
7//32pO43xDPb78c7saxSWtjdgX4XsBOdzIoiJZHnc5CTTbCcneLes8zz4fD6AGq
vjZKDLTgiO/sRlkQHZQk1y+7CawrqbKkAG+O7MqF7KGOtQ1WLRGfAkFP732TBFXM
TyoZ7ENh3TiFDdeRmkOonpQ2k3DctW+7z2BmdlsuSXgD8fFbEArfxnO1SnRHrmcr
C8FNeSkks8MTL7uePNUxwlnB8uHuGWCgSuS++q4OkCnzA3AmO6cRlDoMT3RMwVB/
wQxvqF/U6JJx16YOVqwA6ZjuUWVwyBj/WBKlaxgfghz8CUmDC0D4Xb2/S1UVcZi6
bFRph0UKeE5LaduoNZYaAqMOinCXFmetjudPmWO4sWfPrLb1mOY=
=J0Pw
-----END PGP SIGNATURE-----
Merge tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix -Warray-bounds warning, to help external patchset to make it
default treewide
- fix writeable device accounting (syzbot report)
- fix fsync and log replay after a rename and inode eviction
- fix potentially lost error code when submitting multiple bios for
compressed range
* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: calculate number of eb pages properly in csum_tree_block
btrfs: fix rw device counting in __btrfs_free_extra_devids
btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
btrfs: mark compressed range uptodate only if all bio succeed
Merge misc fixes from Andrew Morton:
"7 patches.
Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
migration, and memcg)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
slub: fix unreclaimable slab stat for bulk free
mm/migrate: fix NR_ISOLATED corruption on 64-bit
mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
ocfs2: issue zeroout to EOF blocks
ocfs2: fix zero out valid data
lib/test_string.c: move string selftest in the Runtime Testing menu
For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster. But since ->writepage will ignore EOF
pages, those zeros will not be flushed.
This "looks" ok as commit 6bba4471f0 ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't. The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.
When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again. That made
writeback ignore them forever. That will cause data corruption. Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.
To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more. The fix is to avoid zero EOF blocks with buffer
write.
The following code snippet from qemu-img could trigger the corruption.
656 open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
...
660 fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
660 fallocate(11, 0, 2275868672, 327680) = 0
658 pwrite64(11, "
Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If append-dio feature is enabled, direct-io write and fallocate could
run in parallel to extend file size, fallocate used "orig_isize" to
record i_size before taking "ip_alloc_sem", when
ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already
extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed
out.
Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
Fixes: 6bba4471f0 ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull alpha updates from Matt Turner:
"They're mostly small janitorial fixes but there's also more important
ones:
- drop the alpha-specific x86 binary loader (David Hildenbrand)
- regression fix for at least Marvel platforms (Mike Rapoport)
- fix for a scary-looking typo (Zheng Yongjun)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
alpha: register early reserved memory in memblock
alpha: fix spelling mistakes
alpha: Remove space between * and parameter name
alpha: fp_emul: avoid init/cleanup_module names
alpha: Add syscall_get_return_value()
binfmt: remove support for em86 (alpha only)
alpha: fix typos in a comment
alpha: defconfig: add necessary configs for boot testing
alpha: Send stop IPI to send to online CPUs
alpha: convert comma to semicolon
alpha: remove undef inline in compiler.h
alpha: Kconfig: Replace HTTP links with HTTPS ones
alpha: __udiv_qrnnd should be exported
While reviewing the buffer item recovery code, the thought occurred to
me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
replaying older metadata updates against newer log items. However, we
use the magic number of the ondisk buffer to find the LSN of the ondisk
metadata, which means that if an attacker can control the layout of the
realtime device precisely enough that the start of an rt bitmap block
matches the magic and UUID of some other kind of block, they can control
the purported LSN of that spoofed block and thereby break log replay.
Since realtime bitmap and summary blocks don't have headers at all, we
have no way to tell if a block really should be replayed. The best we
can do is replay unconditionally and hope for the best.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
From the department of "generic/482 keeps on giving", we bring you
another tail update race condition:
iclog:
S1 C1
+-----------------------+-----------------------+
S2 EOIC
Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.
Timeline:
Before S1: Cache flush, log tail = X
At S1: Metadata stable, write start record and checkpoint
At C1: Write commit record, set NEED_FUA
Single iclog checkpoint, so no need for NEED_FLUSH
Log tail still = X, so no need for NEED_FLUSH
After C1,
Before S2: Cache flush, log tail = X
At S2: Metadata stable, write start record and checkpoint
After S2: Log tail moves to X+1
At EOIC: End of iclog, more journal data to write
Releases iclog
Not a commit iclog, so no need for NEED_FLUSH
Writes log tail X+1 into iclog.
At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. THis is sufficient
to violate on disk metadata/journal ordering.
We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and it
doesn't actually make it obvious what underlying problem it is
actually addressing from reading the code.
The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
the first commit record in the log expects it to be.
IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Because I cannot tell if the NEED_FLUSH flag is being set correctly
by the log force and CIL push machinery without it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
From the department of "WTAF? How did we miss that!?"...
When we are recovering a buffer, the first thing we do is check the
buffer magic number and extract the LSN from the buffer. If the LSN
is older than the current LSN, we replay the modification to it. If
the metadata on disk is newer than the transaction in the log, we
skip it. This is a fundamental v5 filesystem metadata recovery
behaviour.
generic/482 failed with an attribute writeback failure during log
recovery. The write verifier caught the corruption before it got
written to disk, and the attr buffer dump looked like:
XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1 ........;...M*..
00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38 ...............8
^^^^^^^^^^^^^^^^^^^^^^^
00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17 .9^QX.D.....D...
00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00 .............$..
00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00 .h..............
00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00 ..<1.$....<2....
00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00 ..<3............
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
.....
The highlighted bytes are the LSN that was replayed into the
buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
that block on disk looks like this:
$ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
hdr.info.hdr.forw = 0
hdr.info.hdr.back = 0
hdr.info.hdr.magic = 0x3bee
hdr.info.crc = 0xb5af0bc6 (correct)
hdr.info.bno = 105448
hdr.info.lsn = 0x100000900
^^^^^^^^^^^
hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
hdr.info.owner = 131203
hdr.count = 2
hdr.usedbytes = 120
hdr.firstused = 3796
hdr.holes = 1
hdr.freemap[0-2] = [base,size]
Note the LSN stamped into the buffer on disk: 1/0x900. The version
on disk is much newer than the log transaction that was being
replayed. That's a bug, and should -never- happen.
So I immediately went to look at xlog_recover_get_buf_lsn() to check
that we handled the LSN correctly. I was wondering if there was a
similar "two commits with the same start LSN skips the second
replay" problem with buffers. I didn't get that far, because I found
a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!
IOWs, attr3 leaf buffers fall through the magic number checks
unrecognised, so trigger the "recover immediately" behaviour instead
of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
buffers and that causes silent on disk corruption of inode attribute
forks and potentially other things....
Git history shows this is *another* zero day bug, this time
introduced in commit 50d5c8d8e9 ("xfs: check LSN ordering for v5
superblocks during recovery") which failed to handle the attr3 leaf
buffers in recovery. And we've failed to handle them ever since...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:
xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.
Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.
This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).
Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.
Hence the inode LSN in the log dinode is -always- invalid.
Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.
Looking back at the change history, the inode logging was broken by
commit 93f958f9c4 ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.
Before 93f958f9c4, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.
A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:
xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan 1 10:00:00 1970
v3.crtime.nsec = 0
After log recovery:
xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...
You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.
The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c4 or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.
I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.
However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.
Fixes: 93f958f9c4 ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Before waiting on a iclog in xfs_log_force_lsn(), we don't check to
see if the iclog has already been completed and the contents on
stable storage. We check for completed iclogs in xfs_log_force(), so
we should do the same thing for xfs_log_force_lsn().
This fixed some random up-to-30s pauses seen in unmounting
filesystems in some tests. A log force ends up waiting on completed
iclog, and that doesn't then get flushed (and hence the log force
get completed) until the background log worker issues a log force
that flushes the iclog in question. Then the unmount unblocks and
continues.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
After fixing the tail_lsn vs cache flush race, generic/482 continued
to fail in a similar way where cache flushes were missing before
iclog FUA writes. Tracing of iclog state changes during the fsstress
workload portion of the test (via xlog_iclog* events) indicated that
iclog writes were coming from two sources - CIL pushes and log
forces (due to fsync/O_SYNC operations). All of the cases where a
recovery problem was triggered indicated that the log force was the
source of the iclog write that was not preceeded by a cache flush.
This was an oversight in the modifications made in commit
eef983ffea ("xfs: journal IO cache flush reductions"). Log forces
for fsync imply a data device cache flush has been issued if an
iclog was flushed to disk and is indicated to the caller via the
log_flushed parameter so they can elide the device cache flush if
the journal issued one.
The change in eef983ffea results in iclogs only issuing a cache
flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
added to the iclogs that the log force code flushes to disk. Hence
log forces are no longer guaranteeing that a cache flush is issued,
hence opening up a potential on-disk ordering failure.
Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
the actual iclogs it forces to the journal are also on stable
storage before it returns to the caller.
This patch introduces the xlog_force_iclog() helper function to
encapsulate the process of taking a reference to an iclog, switching
its state if WANT_SYNC and flushing it to stable storage correctly.
Both xfs_log_force() and xfs_log_force_lsn() are converted to use
it, as is xlog_unmount_write() which has an elaborate method of
doing exactly the same "write this iclog to stable storage"
operation.
Further, if the log force code needs to wait on a iclog in the
WANT_SYNC state, it needs to ensure that iclog also results in a
cache flush being issued. This covers the case where the iclog
contains the commit record of the CIL flush that the log force
triggered, but it hasn't been written yet because there is still an
active reference to the iclog.
Note: this whole cache flush whack-a-mole patch is a result of log
forces still being iclog state centric rather than being CIL
sequence centric. Most of this nasty code will go away in future
when log forces are converted to wait on CIL sequence push
completion rather than iclog completion. With the CIL push algorithm
guaranteeing that the CIL checkpoint is fully on stable storage when
it completes, we no longer need to iterate iclogs and push them to
ensure a CIL sequence push has completed and so all this nasty iclog
iteration and flushing code will go away.
Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
We force iclogs in several places - we need them all to have the
same cache flush semantics, so start by factoring out the iclog
force into a common helper.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
There is a race between the new CIL async data device metadata IO
completion cache flush and the log tail in the iclog the flush
covers being updated. This can be seen by repeating generic/482 in a
loop and eventually log recovery fails with a failures such as this:
XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA @ write 124 and the FUA at write @
125, but log recovery @ 125 failed. The difference was the one log
write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
and so the inode create intent in (1,8) was not replayed and so the
inode cluster was zero on disk when replay of the first inode item
in (1,32) was attempted.
What this meant was that the journal write that occurred at @ 125
did not ensure that metadata completed before the iclog was written
was correctly on stable storage. The tail of the log moved forward,
so IO must have been completed between the two iclog writes. This
means that there is a race condition between the unconditional async
cache flush in the CIL push work and the tail LSN that is written to
the iclog. This happens like so:
CIL push work AIL push work
------------- -------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
.... push inode create buffer
<start IO>
.....
xlog_write(commit record)
.... <IO completes>
log tail moves
xlog_assign_tail_lsn()
start_lsn == commit_lsn
<no iclog preflush!>
xlog_state_release_iclog
__xlog_state_release_iclog()
<writes *new* tail_lsn into iclog>
xlog_sync()
....
submit_bio()
<tail in log moves forward without flushing written metadata>
Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
one stable storage before the iclog bio with the new tail LSN in it
is written to the log.
IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.
To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. THis requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.
Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Fold __xlog_state_release_iclog into its only caller to prepare
make an upcoming fix easier.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The recent journal flush/FUA changes replaced the flushing of the
data device on every iclog write with an up-front async data device
cache flush. Unfortunately, the assumption of which this was based
on has been proven incorrect by the flush vs log tail update
ordering issue. As the fix for that issue uses the
XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache
flush, we now need to (once again) ensure that an iclog write to
external logs that need a cache flush to be issued actually issue a
cache flush to the data device as well as the log device.
Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
We incorrectly flush the log device instead of the data device when
trying to ensure metadata is correctly on disk before writing the
unmount record.
Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Building with -Warray-bounds on systems with 64K pages there's a
warning:
fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
fs/btrfs/disk-io.c:226:34: warning: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Warray-bounds]
226 | kaddr = page_address(buf->pages[i]);
| ~~~~~~~~~~^~~
./include/linux/mm.h:1630:48: note: in definition of macro ‘page_address’
1630 | #define page_address(page) lowmem_page_address(page)
| ^~~~
In file included from fs/btrfs/ctree.h:32,
from fs/btrfs/disk-io.c:23:
fs/btrfs/extent_io.h:98:15: note: while referencing ‘pages’
98 | struct page *pages[1];
| ^~~~~
The compiler has no way to know that in that case the nodesize is exactly
PAGE_SIZE, so the resulting number of pages will be correct (1).
Let's use num_extent_pages that makes the case nodesize == PAGE_SIZE
explicitly 1.
Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lost parsing of backupuid in the switch to new mount API.
Add it back.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: <stable@vger.kernel.org> # v5.11+
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEBWkcACgkQnJ2qBz9k
QNlc2Af/dJBIzZmwPiqW/3vg8/2NihuKnhlkR0ytF5pGswDiZ/3jpNoapz53UeMy
is73PwCqrBYII923Q//+TsiRSGELbmo5nY+xRKlAmg4yovVti+/fgkg2sYdHLfz5
SwMpZjtpqnJ6sfKY6wnN4nXJ0JfGR6Q52wfMWmYQbpQaHLPy1XVUBmKKh+TKwuqy
5S7OhYQ/sml3pdlHhQ5AoG0glgM12DiC5DvqJjwThWmZbsGNfpOw578XC9suCdKJ
6/Wvxm2KiKcltoSb/5LzRTOSIJNtBX7XXwUQewRXnXclEbZYhb5cob/HBkoAU0Nw
4LxVXzxnF3SDwx1thtkgoJ6qUclDWg==
=/q9+
-----END PGP SIGNATURE-----
Merge tag 'fixes_for_v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and reiserfs fixes from Jan Kara:
"A fix for the ext2 conversion to kmap_local() and two reiserfs
hardening fixes"
* tag 'fixes_for_v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: check directory items on read from disk
fs/ext2: Avoid page_address on pages returned by ext2_get_page
reiserfs: add check for root_inode in reiserfs_fill_super
When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.
This error was caught by Syzbot which reported a warning in
close_fs_devices:
WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
Modules linked in:
CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
FS: 00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
btrfs_fill_super fs/btrfs/super.c:1382 [inline]
btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
fc_mount fs/namespace.c:993 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0x196f/0x2be0 fs/namespace.c:3235
do_mount fs/namespace.c:3248 [inline]
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
Because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:
btrfs_mount_root():
btrfs_scan_one_device():
device_list_add(); <---------------- device added
btrfs_open_devices():
open_fs_devices():
btrfs_open_one_device(); <-------- writable device opened,
rw device count ++
btrfs_fill_super():
open_ctree():
btrfs_free_extra_devids():
__btrfs_free_extra_devids(); <--- writable device removed,
rw device count not decremented
fail_tree_roots:
btrfs_close_devices():
close_fs_devices(); <------- rw device count off by 1
As a note, prior to commit cf89af146b ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.
In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.
Additionally, following commit 82372bc816 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.
Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Fixes: cf89af146b ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
CC: stable@vger.kernel.org # 5.10+
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When checking if we need to log the new name of a renamed inode, we are
checking if the inode and its parent inode have been logged before, and if
not we don't log the new name. The check however is buggy, as it directly
compares the logged_trans field of the inodes versus the ID of the current
transaction. The problem is that logged_trans is a transient field, only
stored in memory and never persisted in the inode item, so if an inode
was logged before, evicted and reloaded, its logged_trans field is set to
a value of 0, meaning the check will return false and the new name of the
renamed inode is not logged. If the old parent directory was previously
fsynced and we deleted the logged directory entries corresponding to the
old name, we end up with a log that when replayed will delete the renamed
inode.
The following example triggers the problem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ echo -n "hello world" > /mnt/A/foo
$ sync
# Add some new file to A and fsync directory A.
$ touch /mnt/A/bar
$ xfs_io -c "fsync" /mnt/A
# Now trigger inode eviction. We are only interested in triggering
# eviction for the inode of directory A.
$ echo 2 > /proc/sys/vm/drop_caches
# Move foo from directory A to directory B.
# This deletes the directory entries for foo in A from the log, and
# does not add the new name for foo in directory B to the log, because
# logged_trans of A is 0, which is less than the current transaction ID.
$ mv /mnt/A/foo /mnt/B/foo
# Now make an fsync to anything except A, B or any file inside them,
# like for example create a file at the root directory and fsync this
# new file. This syncs the log that contains all the changes done by
# previous rename operation.
$ touch /mnt/baz
$ xfs_io -c "fsync" /mnt/baz
<power fail>
# Mount the filesystem and replay the log.
$ mount /dev/sdc /mnt
# Check the filesystem content.
$ ls -1R /mnt
/mnt/:
A
B
baz
/mnt/A:
bar
/mnt/B:
$
# File foo is gone, it's neither in A/ nor in B/.
Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
safely checks if an inode was logged before in the current transaction.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In compression write endio sequence, the range which the compressed_bio
writes is marked as uptodate if the last bio of the compressed (sub)bios
is completed successfully. There could be previous bio which may
have failed which is recorded in cb->errors.
Set the writeback range as uptodate only if cb->errors is zero, as opposed
to checking only the last bio's status.
Backporting notes: in all versions up to 4.4 the last argument is always
replaced by "!cb->errors".
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For pure poll requests, it doesn't remove the second poll wait entry
when it's done, neither after vfs_poll() or in the poll completion
handler. We should remove the second poll wait entry.
And we use io_poll_remove_double() rather than io_poll_remove_waitqs()
since the latter has some redundant logic.
Fixes: 88e41cf928 ("io_uring: add multishot mode for IORING_OP_POLL_ADD")
Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210728030322.12307-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some setups, like SCSI, can throw spurious -EAGAIN off the softirq
completion path. Normally we expect this to happen inline as part
of submission, but apparently SCSI has a weird corner case where it
can happen as part of normal completions.
This should be solved by having the -EAGAIN bubble back up the stack
as part of submission, but previous attempts at this failed and we're
not just quite there yet. Instead we currently use REQ_F_REISSUE to
handle this case.
For now, catch it in io_rw_should_reissue() and prevent a reissue
from a bogus path.
Cc: stable@vger.kernel.org
Reported-by: Fabian Ebner <f.ebner@proxmox.com>
Tested-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkdev_get_no_open acquires a reference to the block_device through
the block device inode and then tries to acquire a device model
reference to the gendisk. But at this point the disk migh already
be freed (although the race is free). Fix this by only freeing the
gendisk from the whole device bdevs ->free_inode callback as well.
Fixes: 22ae8ce8b8 ("block: simplify bdev/disk lookup in blkdev_get")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup fix from Tejun Heo:
"Fix leak of filesystem context root which is triggered by LTP.
Not too likely to be a problem in non-testing environments"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup1: fix leaked context root causing sporadic NULL deref in LTP
As a safeguard, if we're going to queue async work, do it from task_work
from the original task. This ensures that we can always sanely create
threads, regards of what the reissue context may be.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clang detected a problem with rc possibly being unitialized
(when length is zero) in a recently added fallocate code path.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
readpage was calculating the offset of the page incorrectly
for the case of large swapcaches.
loff_t offset = (loff_t)page->index << PAGE_SHIFT;
As pointed out by Matthew Wilcox, this needs to use
page_file_offset() to calculate the offset instead.
Pages coming from the swap cache have page->index set
to their index within the swapcache, not within the backing
file. For a sufficiently large swapcache, we could have
overlapping values of page->index within the same backing file.
Suggested by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org> # v5.7+
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We use a bit to manage if we need to add the shared task_work, but
a list + lock for the pending work. Before aborting a current run
of the task_work we check if the list is empty, but we do so without
grabbing the lock that protects it. This can lead to races where
we think we have nothing left to run, where in practice we could be
racing with a task adding new work to the list. If we do hit that
race condition, we could be left with work items that need processing,
but the shared task_work is not active.
Ensure that we grab the lock before checking if the list is empty,
so we know if it's safe to exit the run or not.
Link: https://lore.kernel.org/io-uring/c6bd5987-e9ae-cd02-49d0-1b3ac1ef65b1@tnonline.net/
Cc: stable@vger.kernel.org # 5.11+
Reported-by: Forza <forza@tnonline.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have a fairly specific alpha binary loader in Linux: running x86
(i386, i486) binaries via the em86 [1] emulator. As noted in the Kconfig
option, the same behavior can be achieved via binfmt_misc, for example,
more nowadays used for running qemu-user.
An example on how to get binfmt_misc running with em86 can be found in
Documentation/admin-guide/binfmt-misc.rst
The defconfig does not have CONFIG_BINFMT_EM86=y set. And doing a
make defconfig && make olddefconfig
results in
# CONFIG_BINFMT_EM86 is not set
... as we don't seem to have any supported Linux distirbution for alpha
anymore, there isn't really any "default" user of that feature anymore.
Searching for "CONFIG_BINFMT_EM86=y" reveals mostly discussions from
around 20 years ago, like [2] describing how to get netscape via em86
running via em86, or [3] discussing that running wine or installing
Win 3.11 through em86 would be a nice feature.
The latest binaries available for em86 are from 2000, version 2.2.1 [4] --
which translates to "unsupported"; further, em86 doesn't even work with
glibc-2.x but only with glibc-2.0 [4, 5]. These are clear signs that
there might not be too many em86 users out there, especially users
relying on modern Linux kernels.
Even though the code footprint is relatively small, let's just get rid
of this blast from the past that's effectively unused.
[1] http://ftp.dreamtime.org/pub/linux/Linux-Alpha/em86/v0.4/docs/em86.html
[2] https://static.lwn.net/1998/1119/a/alpha-netscape.html
[3] https://groups.google.com/g/linux.debian.alpha/c/AkGuQHeCe0Y
[4] http://zeniv.linux.org.uk/pub/linux/alpha/em86/v2.2-1/relnotes.2.2.1.html
[5] https://forum.teamspeak.com/archive/index.php/t-1477.html
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Matt Turner <mattst88@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmD8kXkACgkQiiy9cAdy
T1F4nQwAnElfJhx0iBMXWW89Erzr8Hr+aOS7xEtattvT+/YS+2vtza3Q3IOmZEKG
PL5U6q7G9TnaB3ZvTXdZeH3K57QU5DL1KUPhDpoTcylL2uPKyhEG8zreY1X0sur+
nzNNIUaix7iCrWELtA21Wxiwpzf960JBdpYkCjronQdAvWQt76vtj0dZ+5rt6nac
9Zw4wso0OmaCC956MXQz8+Cy3Piy+eqT+IiZQQbpjZ+e4q9JZ5yLVChfq8sSfOHs
30tJVgE/RHTgYX6IxmFqxD1AeZ3+/D3HDlXYBTVRQCY380b/Oo2lI5XkDOJdzRT6
BJBNXdhFwF7J5gsDD2TcQjxXI762mD7OdtXTbha3mQuNQy9HlSNz7Eoqr63yryrU
7zwQOUJPP4V0v8/rlAJkBcXAEdkZXHKFwWsYy9YL4/LQ2JAHZVnUpyKDwXnlB8E2
HrER/Z5kzzl2Qlgafk4Iwaq2hIkhgvyvBXPrwUrAoqXT+gj41qMUNMg5DoBbL+ax
K5Caf1fi
=Ngut
-----END PGP SIGNATURE-----
Merge tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five cifs/smb3 fixes, including a DFS failover fix, two fallocate
fixes, and two trivial coverity cleanups"
* tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix fallocate when trying to allocate a hole.
CIFS: Clarify SMB1 code for POSIX delete file
CIFS: Clarify SMB1 code for POSIX Create
cifs: support share failover when remounting
cifs: only write 64kb at a time when fallocating a small region of a file
Merge misc mm fixes from Andrew Morton:
"15 patches.
VM subsystems affected by this patch series: userfaultfd, kfence,
highmem, pagealloc, memblock, pagecache, secretmem, pagemap, and
hugetlbfs"
* akpm:
hugetlbfs: fix mount mode command line processing
mm: fix the deadlock in finish_fault()
mm: mmap_lock: fix disabling preemption directly
mm/secretmem: wire up ->set_page_dirty
writeback, cgroup: do not reparent dax inodes
writeback, cgroup: remove wb from offline list before releasing refcnt
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
mm: use kmap_local_page in memzero_page
mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
kfence: skip all GFP_ZONEMASK allocations
kfence: move the size check to the beginning of __kfence_alloc()
kfence: defer kfence_test_init to ensure that kunit debugfs is created
selftest: use mmap instead of posix_memalign to allocate memory
userfaultfd: do not untag user pointers
In commit 32021982a3 ("hugetlbfs: Convert to fs_context") processing
of the mount mode string was changed from match_octal() to fsparam_u32.
This changed existing behavior as match_octal does not require octal
values to have a '0' prefix, but fsparam_u32 does.
Use fsparam_u32oct which provides the same behavior as match_octal.
Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
Fixes: 32021982a3 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c60379 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Catch an illegal case to queue async from an unrelated task that got
the ring fd passed to it. This should not be possible to hit, but
better be proactive and catch it explicitly. io-wq is extended to
check for early IO_WQ_WORK_CANCEL being set on a work item as well,
so it can run the request through the normal cancelation path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two reasons why this shouldn't be done:
1) Ring is exiting, and we're canceling requests anyway. Any request
should be canceled anyway. In theory, this could iterate for a
number of times if someone else is also driving the target block
queue into request starvation, however the likelihood of this
happening is miniscule.
2) If the original task decided to pass the ring to another task, then
we don't want to be reissuing from this context as it may be an
unrelated task or context. No assumptions should be made about
the context in which ->release() is run. This can only happen for pure
read/write, and we'll get -EFAULT on them anyway.
Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmD7GZUACgkQxWXV+ddt
WDs+BA/+OHDY2ROYEnysAqF1qaDENVVnUavnDYYa+Uk61KVvx0pm/mHY9SllsuH4
WCIwCwH7LZs11cRYp3vD80t4OdVGBKaDvEfX+znMCQYuoBm6G5eT3n5jhsVFr1jJ
EqUVzUY+S44IWAEhzkVnSAD4xnMsan8b+YnngIFSMEqJlH+on6w8FhyP0QXwInxk
1kfjl8tDMiryKFaekfGX5WaeflEeWGoHNf2xYokzPD/Oq6TCaoLycar1YXH+80FM
05Jl0+jfEWbaHouMNd8bW9nHnSxh30i7gorY17Q6KLOFDCThNiKZuypZsQcCi/df
TbjQDNTZjSsReFvrFeFlEdGv3dFHBGxz1Ns7RFPfVeNgmN0WnOLmzS+4rmfGyi8L
+3TQ6MGqgG0DppPwfB9caDvxYsbN23uA1v5J1B+Dsbo47lFWWIoBQBtDvErAiHEy
KF7B4jIOWrx3ZYwv3pkE3D+D19sKkB9+wLnlwVSF77npKO1up8W0h4mPdMLZaznW
TGBXxwqI4105MSX5UatBpX+HYATpEWG5tmeZz5ERGFNC/piILmY4iVz/c5Vguh9/
iUQwjSudIDWgGxcL7VClqrdF7sucsml6Svb+ZrxckmK7pa97TG2bIlzJDg0eFcle
NBcw8RBcBMUay/Y04cKHLJAj6OOjBiXnxKjjHrhvtaBmOV2SHpc=
=kj2e
-----END PGP SIGNATURE-----
Merge tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and one patch to help some block layer API cleanups:
- skip missing device when running fstrim
- fix unpersisted i_size on fsync after expanding truncate
- fix lock inversion problem when doing qgroup extent tracing
- replace bdgrab/bdput usage, replace gendisk by block_device"
* tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: store a block_device in struct btrfs_ordered_extent
btrfs: fix lock inversion problem when doing qgroup extent tracing
btrfs: check for missing device in btrfs_trim_fs
btrfs: fix unpersisted i_size on fsync after expanding truncate
(marked for stable). Also included a rare WARN condition tweak.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmD67fETHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi1UAB/43vuj0sLO2cAW7HkjvoSqQG6MHruUl
XaeZCUxG6AdgvrpwFxfi7r2k8N4RegoYFKiqEXdnYl6BANEEcZR1KFB6Uy9vEOuo
R1NdmBF7ZY2U1o22SpWFHbdoCOx7KEdsFHU5rTODw4dwAZuj3GtRyJ8uGPz7VatH
0wTLPSIcphFkq5mcdA4hQSes3O4vKmDlVfBreUl+PQg/lxnBPsXx07gLIk3Q0gN1
uKseGr0miSpDHIS1IjYBOMs8AM5VbJKuzcsy5iCE1z/9tI1J5fsPBrZCopCPjajt
1yN8/r7F7Ih9HaZoEU4NXLbEbLe4eX9XEWGOmiZjgry66zxwOCr3rJGa
=Mqd9
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A subtle deadlock on lock_rwsem (marked for stable) and rbd fixes for
a -rc1 regression.
Also included a rare WARN condition tweak"
* tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client:
rbd: resurrect setting of disk->private_data in rbd_init_disk()
ceph: don't WARN if we're still opening a session to an MDS
rbd: don't hold lock_rwsem while running_list is being drained
rbd: always kick acquire on "acquired" and "released" notifications
We do a bforget and return for no journal case, so let's remove this
conflict comment.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Link: https://lore.kernel.org/r/20210714055940.1553705-1-guoqing.jiang@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
if (!ext4_has_feature_mmp(sb)) then retval can be unitialized before
we jump to the wait_to_exit label.
Fixes: 61bb4a1c41 ("ext4: fix possible UAF when remounting r/o a mmp-protected file system")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210713022728.2533770-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove the conditional checking for out_data_len and skipping the fallocate
if it is 0. This is wrong will actually change any legitimate the fallocate
where the entire region is unallocated into a no-op.
Additionally, before allocating the range, if FALLOC_FL_KEEP_SIZE is set then
we need to clamp the length of the fallocate region as to not extend the size of the file.
Fixes: 966a3cb7c7 ("cifs: improve fallocate emulation")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
A previous commit shuffled some code around, and inadvertently used
struct file after fdput() had been called on it. As we can't touch
the file post fdput() dropping our reference, move the fdput() to
after that has been done.
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/YPnqM0fY3nM5RdRI@zeniv-ca.linux.org.uk/
Fixes: f2a48dd09b ("io_uring: refactor io_sq_offload_create()")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for SMB1 CIFSPOSIXDelFile. This changeset
doesn't change the address but makes it slightly clearer.
Addresses-Coverity: 711519 ("Out of bounds write")
Signed-off-by: Steve French <stfrench@microsoft.com>
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for SMB1 CIFSPOSIXCreate. This changeset
doesn't change the address but makes it slightly clearer.
Addresses-Coverity: 711518 ("Out of bounds write")
Signed-off-by: Steve French <stfrench@microsoft.com>
When remouting a DFS share, force a new DFS referral of the path and
if the currently cached targets do not match any of the new targets or
there was no cached targets, then mark it for reconnect.
For example:
$ mount //dom/dfs/link /mnt -o username=foo,password=bar
$ ls /mnt
oldfile.txt
change target share of 'link' in server settings
$ mount /mnt -o remount,username=foo,password=bar
$ ls /mnt
newfile.txt
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
We only allow sending single credit writes through the SMB2_write() synchronous
api so split this into smaller chunks.
Fixes: 966a3cb7c7 ("cifs: improve fallocate emulation")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Store the block device instead of the gendisk in the btrfs_ordered_extent
structure instead of acquiring a reference to it later.
Note: this is from series removing bdgrab/bdput, btrfs is one of the
last users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
NULL value as the transaction handle argument, which makes that function
take the commit_root_sem semaphore, which is necessary when we don't hold
a transaction handle or any other mechanism to prevent a transaction
commit from wiping out commit roots.
However btrfs_qgroup_trace_extent_post() can be called in a context where
we are holding a write lock on an extent buffer from a subvolume tree,
namely from btrfs_truncate_inode_items(), called either during truncate
or unlink operations. In this case we end up with a lock inversion problem
because the commit_root_sem is a higher level lock, always supposed to be
acquired before locking any extent buffer.
Lockdep detects this lock inversion problem since we switched the extent
buffer locks from custom locks to semaphores, and when running btrfs/158
from fstests, it reported the following trace:
[ 9057.626435] ======================================================
[ 9057.627541] WARNING: possible circular locking dependency detected
[ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
[ 9057.628961] ------------------------------------------------------
[ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
[ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.632542]
but task is already holding lock:
[ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.635255]
which lock already depends on the new lock.
[ 9057.636292]
the existing dependency chain (in reverse order) is:
[ 9057.637240]
-> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
[ 9057.638138] down_read+0x46/0x140
[ 9057.638648] btrfs_find_all_roots+0x41/0x80 [btrfs]
[ 9057.639398] btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
[ 9057.640283] btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
[ 9057.641114] btrfs_free_extent+0x35/0xb0 [btrfs]
[ 9057.641819] btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
[ 9057.642643] btrfs_evict_inode+0x454/0x4f0 [btrfs]
[ 9057.643418] evict+0xcf/0x1d0
[ 9057.643895] do_unlinkat+0x1e9/0x300
[ 9057.644525] do_syscall_64+0x3b/0xc0
[ 9057.645110] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 9057.645835]
-> #0 (btrfs-tree-00){++++}-{3:3}:
[ 9057.646600] __lock_acquire+0x130e/0x2210
[ 9057.647248] lock_acquire+0xd7/0x310
[ 9057.647773] down_read_nested+0x4b/0x140
[ 9057.648350] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.649175] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.650010] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.650849] scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.651733] iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.652501] scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.653264] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.654295] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.655111] btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.655831] process_one_work+0x247/0x5a0
[ 9057.656425] worker_thread+0x55/0x3c0
[ 9057.656993] kthread+0x155/0x180
[ 9057.657494] ret_from_fork+0x22/0x30
[ 9057.658030]
other info that might help us debug this:
[ 9057.659064] Possible unsafe locking scenario:
[ 9057.659824] CPU0 CPU1
[ 9057.660402] ---- ----
[ 9057.660988] lock(&fs_info->commit_root_sem);
[ 9057.661581] lock(btrfs-tree-00);
[ 9057.662348] lock(&fs_info->commit_root_sem);
[ 9057.663254] lock(btrfs-tree-00);
[ 9057.663690]
*** DEADLOCK ***
[ 9057.664437] 4 locks held by kworker/u16:4/30781:
[ 9057.665023] #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.666260] #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.667639] #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
[ 9057.669017] #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.670408]
stack backtrace:
[ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
[ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[ 9057.674258] Call Trace:
[ 9057.674588] dump_stack_lvl+0x57/0x72
[ 9057.675083] check_noncircular+0xf3/0x110
[ 9057.675611] __lock_acquire+0x130e/0x2210
[ 9057.676132] lock_acquire+0xd7/0x310
[ 9057.676605] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.677313] ? lock_is_held_type+0xe8/0x140
[ 9057.677849] down_read_nested+0x4b/0x140
[ 9057.678349] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679068] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679760] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.680458] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.681083] ? _raw_spin_unlock+0x29/0x40
[ 9057.681594] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.682336] scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.683058] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.683834] ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
[ 9057.684632] iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.685316] scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.685977] ? ___ratelimit+0xa4/0x110
[ 9057.686460] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.687316] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.688021] btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.688649] ? lock_is_held_type+0xe8/0x140
[ 9057.689180] process_one_work+0x247/0x5a0
[ 9057.689696] worker_thread+0x55/0x3c0
[ 9057.690175] ? process_one_work+0x5a0/0x5a0
[ 9057.690731] kthread+0x155/0x180
[ 9057.691158] ? set_kthread_struct+0x40/0x40
[ 9057.691697] ret_from_fork+0x22/0x30
Fix this by making btrfs_find_all_roots() never attempt to lock the
commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
from btrfs_qgroup_trace_extent_post(), because that would make backref
lookup not use commit roots and acquire read locks on extent buffers, and
therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
from the btrfs_truncate_inode_items() code path which has acquired a write
lock on an extent buffer of the subvolume btree.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have an inode that does not have the full sync flag set, was changed
in the current transaction, then it is logged while logging some other
inode (like its parent directory for example), its i_size is increased by
a truncate operation, the log is synced through an fsync of some other
inode and then finally we explicitly call fsync on our inode, the new
i_size is not persisted.
The following example shows how to trigger it, with comments explaining
how and why the issue happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/foo
$ xfs_io -f -c "pwrite -S 0xab 0 1M" /mnt/bar
$ sync
# Fsync bar, this will be a noop since the file has not yet been
# modified in the current transaction. The goal here is to clear
# BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags.
$ xfs_io -c "fsync" /mnt/bar
# Now rename both files, without changing their parent directory.
$ mv /mnt/bar /mnt/bar2
$ mv /mnt/foo /mnt/foo2
# Increase the size of bar2 with a truncate operation.
$ xfs_io -c "truncate 2M" /mnt/bar2
# Now fsync foo2, this results in logging its parent inode (the root
# directory), and logging the parent results in logging the inode of
# file bar2 (its inode item and the new name). The inode of file bar2
# is logged with an i_size of 0 bytes since it's logged in
# LOG_INODE_EXISTS mode, meaning we are only logging its names (and
# xattrs if it had any) and the i_size of the inode will not be changed
# when the log is replayed.
$ xfs_io -c "fsync" /mnt/foo2
# Now explicitly fsync bar2. This resulted in doing nothing, not
# logging the inode with the new i_size of 2M and the hole from file
# offset 1M to 2M. Because the inode did not have the flag
# BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the
# fsync of file foo2, its last_log_commit field was updated,
# resulting in this explicit of file bar2 not doing anything.
$ xfs_io -c "fsync" /mnt/bar2
# File bar2 content and size before a power failure.
$ od -A d -t x1 /mnt/bar2
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
1048576 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
2097152
<power failure>
# Mount the filesystem to replay the log.
$ mount /dev/sdc /mnt
# Read the file again, should have the same content and size as before
# the power failure happened, but it doesn't, i_size is still at 1M.
$ od -A d -t x1 /mnt/bar2
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
1048576
This started to happen after commit 209ecbb858 ("btrfs: remove stale
comment and logic from btrfs_inode_in_log()"), since btrfs_inode_in_log()
no longer checks if the inode's list of modified extents is not empty.
However, checking that list is not the right way to address this case
and the check was added long time ago in commit 125c4cf9f3
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync")
for a different purpose, to address consecutive ranged fsyncs.
The reason that checking for the list emptiness makes this test pass is
because during an expanding truncate we create an extent map to represent
a hole from the old i_size to the new i_size, and add that extent map to
the list of modified extents in the inode. However if we are low on
available memory and we can not allocate a new extent map, then we don't
treat it as an error and just set the full sync flag on the inode, so that
the next fsync does not rely on the list of modified extents - so checking
for the emptiness of the list to decide if the inode needs to be logged is
not reliable, and results in not logging the inode if it was not possible
to allocate the extent map for the hole.
Fix this by ensuring that if we are only logging that an inode exists
(inode item, names/references and xattrs), we don't update the inode's
last_log_commit even if it does not have the full sync runtime flag set.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 5.13+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmD4LMkACgkQ+7dXa6fL
C2t4vg//eOjVirl2L9NSwW27btAv3tySeX7zN78i4IIy1KesVTV4An/yNYnZ1XDg
yackjIIgRaw/pX+W/dWiQalRtAcYno4vFam874lzBq+Yourk9j+7mU7WDtkofRh4
UtuWAyARGVfLZdkJQp1Eb1YoZj0oro31Qa7eFeHQ/1Wlx6Wo1gdsRAdGfRpukLeC
fTv/DXrpK+cC36ENZjP3VuhfsOAKIxKVqf7rLKiMBbeyAGW7jRs3fSTxgOJAgJN0
opj/WTGZf8vskq+u3uTbAv8dwK9CakqB6QEkDq22kZSVh+LSxKzIB6E/3wSXfhAj
1NAKUVQb4e6caXKk/vxFUhnyq8wWdzWW6ExVzH2DYSdNn442A6eo9g6d6qE7wIGK
ui0sL1p7ilwV/1aBVA7imYM5bSS3HDpVFCW1+HCZpbVxnARDv87su8iOJGErWpz+
5fDqifhm+Wrtrhoo05RnsYoygCf5elnz5iUu4MaOYKno1dFQrHBBJhFsv8PAduUq
nZkr+q8HUFH7EYcUjidJiMR+NSiVpe8bCV4LoaGczY4j/AYc5rOfD1s3kW0tfiqG
zqx+pGy5AXTwrZqzVf28I5T6cBCW1GxLnFLmSWZ4DsQdNGziYbPUTYDCJcJxfT0U
cASaEXQSItfxlkVw+JIr+Hb/QXYEfoNEdwWLvUQ2LrPueJwH2V8=
=Pn/w
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20210721' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
- Fix a tracepoint that causes one of the tracing subsystem query files
to crash if the module is loaded
- Fix afs_writepages() to take account of whether the storage rpc
actually succeeded when updating the cyclic writeback counter
- Fix some error code propagation/handling
- Fix place where afs_writepages() was setting writeback_index to a
file position rather than a page index
* tag 'afs-fixes-20210721' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Remove redundant assignment to ret
afs: Fix setting of writeback_index
afs: check function return
afs: Fix tracepoint string placement with built-in AFS
Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests. Things like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:kernfs_sop_show_path+0x1b/0x60
...or these others:
RIP: 0010:do_mkdirat+0x6a/0xf0
RIP: 0010:d_alloc_parallel+0x98/0x510
RIP: 0010:do_readlinkat+0x86/0x120
There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference. I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.
In commit 71d883c37e ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:
--------------
struct cgroup_subsys *ss;
- struct dentry *dentry;
[...]
- dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
- CGROUP_SUPER_MAGIC, ns);
[...]
- if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
- struct super_block *sb = dentry->d_sb;
- dput(dentry);
+ ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
+ if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+ struct super_block *sb = fc->root->d_sb;
+ dput(fc->root);
deactivate_locked_super(sb);
msleep(10);
return restart_syscall();
}
--------------
In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case. With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.
A fix would be to not leave the stale reference in fc->root as follows:
--------------
dput(fc->root);
+ fc->root = NULL;
deactivate_locked_super(sb);
--------------
...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Variable ret is set to -ENOENT and -ENOMEM but this value is never
read as it is overwritten or not used later on, hence it is a
redundant assignment and can be removed.
Cleans up the following clang-analyzer warning:
fs/afs/dir.c:2014:4: warning: Value stored to 'ret' is never read
[clang-analyzer-deadcode.DeadStores].
fs/afs/dir.c:659:2: warning: Value stored to 'ret' is never read
[clang-analyzer-deadcode.DeadStores].
[DH made the following modifications:
- In afs_rename(), -ENOMEM should be placed in op->error instead of ret,
rather than the assignment being removed entirely. afs_put_operation()
will pick it up from there and return it.
- If afs_sillyrename() fails, its error code should be placed in op->error
rather than in ret also.
]
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/1619691492-83866-1-git-send-email-jiapeng.chong@linux.alibaba.com
Link: https://lore.kernel.org/r/162609465444.3133237.7562832521724298900.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162610729052.3408253.17364333638838151299.stgit@warthog.procyon.org.uk/ # v2
To quote Alexey[1]:
I was adding custom tracepoint to the kernel, grabbed full F34 kernel
.config, disabled modules and booted whole shebang as VM kernel.
Then did
perf record -a -e ...
It crashed:
general protection fault, probably for non-canonical address 0x435f5346592e4243: 0000 [#1] SMP PTI
CPU: 1 PID: 842 Comm: cat Not tainted 5.12.6+ #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:t_show+0x22/0xd0
Then reproducer was narrowed to
# cat /sys/kernel/tracing/printk_formats
Original F34 kernel with modules didn't crash.
So I started to disable options and after disabling AFS everything
started working again.
The root cause is that AFS was placing char arrays content into a
section full of _pointers_ to strings with predictable consequences.
Non canonical address 435f5346592e4243 is "CB.YFS_" which came from
CM_NAME macro.
Steps to reproduce:
CONFIG_AFS=y
CONFIG_TRACING=y
# cat /sys/kernel/tracing/printk_formats
Fix this by the following means:
(1) Add enum->string translation tables in the event header with the AFS
and YFS cache/callback manager operations listed by RPC operation ID.
(2) Modify the afs_cb_call tracepoint to print the string from the
translation table rather than using the string at the afs_call name
pointer.
(3) Switch translation table depending on the service we're being accessed
as (AFS or YFS) in the tracepoint print clause. Will this cause
problems to userspace utilities?
Note that the symbolic representation of the YFS service ID isn't
available to this header, so I've put it in as a number. I'm not sure
if this is the best way to do this.
(4) Remove the name wrangling (CM_NAME) macro and put the names directly
into the afs_call_type structs in cmservice.c.
Fixes: 8e8d7f13b6 ("afs: Add some tracepoints")
Reported-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLAXfvZ+rObEOdc%2F@localhost.localdomain/ [1]
Link: https://lore.kernel.org/r/643721.1623754699@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162430903582.2896199.6098150063997983353.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162609463957.3133237.15916579353149746363.stgit@warthog.procyon.org.uk/ # v1 (repost)
Link: https://lore.kernel.org/r/162610726860.3408253.445207609466288531.stgit@warthog.procyon.org.uk/ # v2
If MDSs aren't available while mounting a filesystem, the session state
will transition from SESSION_OPENING to SESSION_CLOSING. And in that
scenario check_session_state() will be called from delayed_work() and
trigger this WARN.
Avoid this by only WARNing after a session has already been established
(i.e., the s_ttl will be different from 0).
Fixes: 62575e270f ("ceph: check session state after bumping session->s_seq")
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__io_queue_proc() can enqueue both poll entries and still fail
afterwards, so the callers trying to cancel it should also try to remove
the second poll entry (if any).
For example, it may leave the request alive referencing a io_uring
context but not accessible for cancellation:
[ 282.599913][ T1620] task:iou-sqp-23145 state:D stack:28720 pid:23155 ppid: 8844 flags:0x00004004
[ 282.609927][ T1620] Call Trace:
[ 282.613711][ T1620] __schedule+0x93a/0x26f0
[ 282.634647][ T1620] schedule+0xd3/0x270
[ 282.638874][ T1620] io_uring_cancel_generic+0x54d/0x890
[ 282.660346][ T1620] io_sq_thread+0xaac/0x1250
[ 282.696394][ T1620] ret_from_fork+0x1f/0x30
Cc: stable@vger.kernel.org
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Reported-and-tested-by: syzbot+ac957324022b7132accf@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If __io_queue_proc() fails to add a second poll entry, e.g. kmalloc()
failed, but it goes on with a third waitqueue, it may succeed and
overwrite the error status. Count the number of poll entries we added,
so we can set pt->error to zero at the beginning and find out when the
mentioned scenario happens.
Cc: stable@vger.kernel.org
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no reasonable need for a buffer larger than this, and it avoids
int overflow pitfalls.
Fixes: 058504edd0 ("fs/seq_file: fallback to vmalloc allocation")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix shrink eligibility checking when sparse inode clusters enabled.
* Reset '..' directory entries when unlinking directories to prevent
verifier errors if fs is shrinked later.
* Don't report unusable extent size hints to FSGETXATTR.
* Don't warn when extent size hints are unusable because the sysadmin
configured them that way.
* Fix insufficient parameter validation in GROWFSRT ioctl.
* Fix integer overflow when adding rt volumes to filesystem.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmDwajMACgkQ+H93GTRK
tOtPlw//TyFCUf8krAknSc5tF5yI77JPIj19a43frMN/L6G68aDu2eBhIHbpwzAL
LuPGksSqMJyBylwhZXYt83jfar0sGTl48sPqxYBr6YOj+LAmiba2PdlXGQPdWcC3
1DGqvaiFZ3ENRlk0GG0a4xPJK4nW18uujc6L8yxrzA+0VsFirorqvzay7COic0Js
b5eytqqbTsqvUc7+WX+yfWyyH+zWs+VIxBJVT7kirLY8u9Da5L54JdSbTWiXq7K0
8zu7d0oyiDpb0Yb5tylLh9eoG5TVHLNHN65Le7k1dCSw/zaJMFhpc0MsxJ9zVDI5
9NjmyOXP/uFGG/dvyqZUxOKsj2W0DwGeDRF3hxkLTWeiPFGfBYRHiBDCOpOoNIIy
i3hTUCAqlgt+Ehyau8HR68L06V6bD9j991HM3MK2phNRKgC+iCH1poXixjAcaddR
pAG1dF8WkEUQiKn9/oikNRAA8z5+z6NHZIZiEH1DUIGAh39SBVTuD2qSVIqj0BiR
pOy1gwVOFKpwdRps/JQVLPoGP7NHyOxJ2dLAYpWWYiPS2Ch6UvyXiL8aMTVF8DaV
G5Rsu+e0BJV38ass3enOOh1Nok//dIyKNS0iUO9TLdw5dZ6i3+36YeKskf+KLtXQ
m+i3hfAqM+EbyU/jUsykKWAeELV8FZTM2Ckc5utrkhOaZToktJ4=
=dKfy
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"A few fixes for issues in the new online shrink code, additional
corrections for my recent bug-hunt w.r.t. extent size hints on
realtime, and improved input checking of the GROWFSRT ioctl.
IOW, the usual 'I somehow got bored during the merge window and
resumed auditing the farther reaches of xfs':
- Fix shrink eligibility checking when sparse inode clusters enabled
- Reset '..' directory entries when unlinking directories to prevent
verifier errors if fs is shrinked later
- Don't report unusable extent size hints to FSGETXATTR
- Don't warn when extent size hints are unusable because the sysadmin
configured them that way
- Fix insufficient parameter validation in GROWFSRT ioctl
- Fix integer overflow when adding rt volumes to filesystem"
* tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: detect misaligned rtinherit directory extent size hints
xfs: fix an integer overflow error in xfs_growfs_rt
xfs: improve FSGROWFSRT precondition checking
xfs: don't expose misaligned extszinherit hints to userspace
xfs: correct the narrative around misaligned rtinherit/extszinherit dirs
xfs: reset child dir '..' entry when unlinking child
xfs: check for sparse inode clusters that cross new EOAG when shrinking
* Fix KASAN warnings due to integer overflow in SEEK_DATA/SEEK_HOLE.
* Fix assertion errors when using inlinedata files on gfs2.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmDwarkACgkQ+H93GTRK
tOuxKRAAlQTwSUaVlQ4NlJfCyB9U3ZLDv3hfwZor6HrSJtycQ2UPD1VcQe7sCH87
yUL/hL5zgFPXigSGu/E8yebi3rnH2joC526iCbHSs4BaAj78FRgLNctMOUGw1o0Y
rzUxhFbrCXMPnbEHB5AkXO7HsN6Ba1Ch369Fh0NVaIxDx77jX34JWtKHwEduJz8b
+tCgTAjyH0N48jZ9iEiTFx8sI/lhDCXxpQwAOzZos19KZ5RjCUurtWYdboINTQ1j
X78bDrnlo1t2VHUYzDSyt2/HIBPqxwriyBVyDY67NVfJ7cQd0yYOFy0+/7Dfik02
scOg6/tI5b3pkQqum4DA6U27kau08I+JgzyV8GKgXSk+YV6Tjj+qzfIDMhKcZTFS
SDhtcFNyjCrdaFC52E6F19YX/VAHP/asqBlZ8pqxTboiwMwQNoIt/xw4TvqW62bM
aT04YXUIQwGCuLPU1sEUh/7io6IvRoig/OtCuKcIPFJO5mJtag4cT/996LqkMAzZ
j/MZxXEpx/fP6Dpn1DWLvpmQTpVrkag6j82dKXdKita12aYZClfLyPqGhQGREPVm
tjmZxbk/MRLL9joTx0Eil99S1oTra+ohw8ilwC9m+lkh/PrmVeQAYCRTU8Rlq66T
2D8eo1umeO6Y/Bhfhnn8zyY/5XoLwvWu4JHLPqKjSmtWmtUx6Pg=
=Vbt9
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
"A handful of bugfixes for the iomap code.
There's nothing especially exciting here, just fixes for UBSAN (not
KASAN as I erroneously wrote in the tag message) warnings about
undefined behavior in the SEEK_DATA/SEEK_HOLE code, and some
reshuffling of per-page block state info to fix some problems with
gfs2.
- Fix KASAN warnings due to integer overflow in SEEK_DATA/SEEK_HOLE
- Fix assertion errors when using inlinedata files on gfs2"
* tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor
iomap: Don't create iomap_page objects for inline files
iomap: Permit pages without an iop to enter writeback
iomap: remove the length variable in iomap_seek_hole
iomap: remove the length variable in iomap_seek_data
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmDxrnAACgkQiiy9cAdy
T1HlzQv/eGdv+jV5z6iMtrtcoOFOdcOvnj2ZUooGcZ5hETuoNDK5cKSsAOeAYACe
vWxhBpRSgCDDJS6TyV0bJyQ154ygoH8ycze/PYNgEhVVdEbJ9lN0/7w+oNyvCy8l
e8QWUriW+3kcF1fppT+5pdK471Tw3pKDqOwiivWq4Z0ca3Iod2Fo3RcsqX/gFhJu
bRl9dFNQqmW4AMmlU18n0qDstBB5/v4Ak4040+nACEVVInBMZadj13wNcj26ek16
D2l/+3MN1eY7ewLBSNh367/ufxlDLbHwsb6W9k21iilxo/cdMYqHGgLEahTwLDXd
6QWukYlsJFHo71l+WswW8Zz3z+VVc6xVb7HxCNtPGoc+ksthqvvmq9wfw+ONYQ0w
0h+r/7THw5HLQwAwLrTcHEvb9ECpOExItPTsfSR71logRORW4xe1snfoj7xRfCSE
BEWsVDOo/SagrTEMDpmSaNY+ePcUiPaeNM0GRbiB7BBnwKQDenswrCa7QS97YItc
piX2ile0
=aa4P
-----END PGP SIGNATURE-----
Merge tag '5.14-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Eight cifs/smb3 fixes, including three for stable.
Three are DFS related fixes, and two to fix problems pointed out by
static checkers"
* tag '5.14-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not share tcp sessions of dfs connections
SMB3.1.1: fix mount failure to some servers when compression enabled
cifs: added WARN_ON for all the count decrements
cifs: fix missing null session check in mount
cifs: handle reconnect of tcon when there is no cached dfs referral
cifs: fix the out of range assignment to bit fields in parse_server_interfaces
cifs: Do not use the original cruid when following DFS links for multiuser mounts
cifs: use the expiry output of dns_query to schedule next resolution
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDxlXoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmriEACbXk5zuQvG5jRX4k2c5w8/rjz9y1vQTxsy
V/jNVlv+ZS3RrC4+JeRQQRF5+80xkKTsFFNxplT8wIguFSQPJ9VM80N+vDU6n3hi
dqF4J/Em0IsQRWWzG8YdZc0QjPdDhiA9vrlRqfp7vYZQt1+2Lpyg/Me6t1lppTB7
tH8hCFCE04oYR7TzFTmaeKUUIhmhiR2/dC1PR+G1IZWPInD9WbL/9UF2uMInReDN
W7WNPexLRdkU2/89Qtcyfv5wlq9TyrWSU6K9hYQc1EJKUlcd3hRKyD+mAyAseUro
0o5NHNbu0yuZf1vU6oyXIEDjQ8a0LGGdZxlFY8OCuAOgJQhGVm4G+2bzTPOkkABa
f8pSIDnHymqMqjhg3KfODR9p4aKFhsG9KvDzvgaS8PDckXTIn0rr8P+lCQ4jDPD3
rdkRdt01uPoSgaNoElxzwrUV+gQX4Hv5qL4nDxpoGvBEOhgytdnfhNGzHWmZOlWk
M9RdAsMWpbUuht4YJ93IxdXLSAhju1K+45IzutNnjfxeavnGCW9tQLjIGsPrUkPm
NweuMQfYcvrnAqefrOO1gcWdSyrXWN//Ae4iMBLetg8UXXuH/Pd6NXOQ762hV5tC
kka6DnlpJLB3LF5STLubfL3Gul+faEqgLWQgwt5XRoZjHySb7SnDwUFUTMY4z6ie
UkxWeICp9Q==
=7nbe
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.14-2021-07-16' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two small fixes: one fixing the process target of a check, and the
other a minor issue with the drain error handling"
* tag 'io_uring-5.14-2021-07-16' of git://git.kernel.dk/linux-block:
io_uring: fix io_drain_req()
io_uring: use right task for exiting checks
A single patch for this pull request, to remove an unnecessary NULL bio
check (from Xianting).
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYPETNQAKCRDdoc3SxdoY
doquAQCQLoz8fVAceRQ+E3Rp9Edm36cQT/19V7692dSJWkS/JAEAqt5SeABmys9B
PfgpesFN/euQUglw0ehxrGjT4MNXbwk=
=eChI
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
"A single patch to remove an unnecessary NULL bio check (from
Xianting)"
* tag 'zonefs-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: remove redundant null bio check
While verifying the leaf item that we read from the disk, reiserfs
doesn't check the directory items, this could cause a crash when we
read a directory item from the disk that has an invalid deh_location.
This patch adds a check to the directory items read from the disk that
does a bounds check on deh_location for the directory entries. Any
directory entry header with a directory entry offset greater than the
item length is considered invalid.
Link: https://lore.kernel.org/r/20210709152929.766363-1-chouhan.shreyansh630@gmail.com
Reported-by: syzbot+c31a48e6702ccb3d64c9@syzkaller.appspotmail.com
Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 782b76d7ab ("fs/ext2: Replace
kmap() with kmap_local_page()") replaced the kmap/kunmap calls in
ext2_get_page/ext2_put_page with kmap_local_page/kunmap_local for
efficiency reasons. As a necessary side change, the commit also
made ext2_get_page (and ext2_find_entry and ext2_dotdot) return
the mapping address along with the page itself, as it is required
for kunmap_local, and converted uses of page_address on such pages
to use the newly returned address instead. However, uses of
page_address on such pages were missed in ext2_check_page and
ext2_delete_entry, which triggers oopses if kmap_local_page happens
to return an address from high memory. Fix this now by converting
the remaining uses of page_address to use the right address, as
returned by kmap_local_page.
Link: https://lore.kernel.org/r/20210714185448.8707ac239e9f12b3a7f5b9f9@urjc.es
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Javier Pello <javier.pello@urjc.es>
Fixes: 782b76d7ab ("fs/ext2: Replace kmap() with kmap_local_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Make sure that we do not share tcp sessions of dfs mounts when
mounting regular shares that connect to same server. DFS connections
rely on a single instance of tcp in order to do failover properly in
cifs_reconnect().
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
bio_alloc() with __GFP_DIRECT_RECLAIM, which is included in
GFP_NOFS, never fails, see comments in bio_alloc_bioset().
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
- fix the read and write iterators (Bart Van Assche)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmDvEUkLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNf8BAAwdXyOpuKOrcSn9jx1vk/feDLDB+aGeHUrV1KjRDM
rEYLSQXoJAEvvrtG+IRXhBPKBhAKGl0ltESAgGe0VQcVlCad7c7cwS0XWd0TJJqQ
GbylrtWMQTilHTw4HLKAAoenGD4KcZdSn4ty5go+W6hPTjv6w5+ArLWY2YFaeWHT
/flOR+UbL9BAxNDwZVZCGym0p/Dth4SyMbPAH91MXJAiR5xYD3v+nKXlc5fPTvIE
7hCeLEYK5Jxjx3uUffdPOybYIRGSUSVAmFYpvZ8Z9HfXxZ3i4yjFZhUe+1jSuB5O
7f12D0Ak4tCvAMhnXnOvcsxi6ZXP2ry6lro3/fe+rq1EdQBOY6rFQUcYiPMRVyu7
iKHLy/rHTtrmiGJIy8IjoyVBSryDbW1I/KpjVo0Rgo2r+A8i6mEFkID77+BnJ1Zp
sedxmPXcsAGkd57DZeRrWyegCaAKreuhQmvxCcenD6ZR2qj13K7olRzvoyAvM/zB
HF/R/cqgoNIc3tS5UbLB2ObFphzFmvYWpbSw2XawI6HO4F4xZ9dHyOS2Kormh91i
qqgRl61B5CpD3mek+hPz1uw701IRUOWM5KegE9e0w+GLF9s5oSf0UMWATP2KsPSi
9iluJj/FKIiP3Qdiz8+2o+BG+otmL86n8MrVISB86Jmr+q6YtUYeq1Q4ZXBJJ9kX
zRc=
=sDu1
-----END PGP SIGNATURE-----
Merge tag 'configfs-5.13-1' of git://git.infradead.org/users/hch/configfs
Pull configfs fix from Christoph Hellwig:
- fix the read and write iterators (Bart Van Assche)
* tag 'configfs-5.13-1' of git://git.infradead.org/users/hch/configfs:
configfs: fix the read and write iterators
When sending the compression context to some servers, they rejected
the SMB3.1.1 negotiate protocol because they expect the compression
context to have a data length of a multiple of 8.
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We have a few ref counters srv_count, ses_count and
tc_count which we use for ref counting. Added a WARN_ON
during the decrement of each of these counters to make
sure that they don't go below their minimum values.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Although it is unlikely to be have ended up with a null
session pointer calling cifs_try_adding_channels in cifs_mount.
Coverity correctly notes that we are already checking for
it earlier (when we return from do_dfs_failover), so at
a minimum to clarify the code we should make sure we also
check for it when we exit the loop so we don't end up calling
cifs_try_adding_channels or mount_setup_tlink with a null
ses pointer.
Addresses-Coverity: 1505608 ("Derefernce after null check")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
When there is no cached DFS referral of tcon->dfs_path, then reconnect
to same share.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Hi Linus,
Please, pull the following patches that fix many fall-through
warnings when building with Clang and -Wimplicit-fallthrough.
This pull-request also contains the patch for Makefile that enables
-Wimplicit-fallthrough for Clang, globally.
It's also important to notice that since we have adopted the use of
the pseudo-keyword macro fallthrough; we also want to avoid having
more /* fall through */ comments being introduced. Notice that contrary
to GCC, Clang doesn't recognize any comments as implicit fall-through
markings when the -Wimplicit-fallthrough option is enabled. So, in
order to avoid having more comments being introduced, we have to use
the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
will cause a warning in case a code comment is intended to be used
as a fall-through marking. The patch for Makefile also enforces this.
We had almost 4,000 of these issues for Clang in the beginning,
and there might be a couple more out there when building some
architectures with certain configurations. However, with the
recent fixes I think we are in good shape and it is now possible
to enable -Wimplicit-fallthrough for Clang. :)
Thanks!
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmDvPV0ACgkQRwW0y0cG
2zGJ1xAAsbSN8I+ESxFCKi4UwQC71988isb0rHGL6rQPBbEYD2bbI+82aGu6WH8z
2ZVQZa5x3tUoIsqR2GMK5Npn1dFvuNWGZkYzlgjAp+FgzaULCLUc4NW9DEWU/Kbd
UBz8ilXL7tUAlDalUYjXAVVTxMnRTZ+BDCbEmIdRZ5X/pMlD2xHeMGG4kh8/cHjJ
+mqFwCFPZoyLJzPBnm37OEfIr8JxrLZAo1gfmz5sxfmBdYDY3hjV/BVgP8h0bkY9
ITaq0LMEAPMXvU+4DP3DxuqzRr9COYMcddmbaWwsXPd7UAuoXwGLvqx7JE0QqY3g
J80w4/hXqEa4o4/fUAsfjYEQoNTEnhl0iGskBJD2xy5Lj17/m4aOSL44ibr2Ag67
yfJPWi9tooYELEGNobPVHPuqk4ts6amKTOKqHWMBEcTEOIGzaGWPEjRoyaMivfZ3
G3ZbEPEYERcJRizm63UbciLeslGsxb6tdYQEy5VI8Wa7caz9Ci9HT+jjS7Vm2fuq
1zg+vIgwuSxvwxrSV/PDTRcQm9EM/MoRL0D4jbr7gtYD6KNwH8Vo6M21CoaB9gH3
FgMiCT4zy9KiONjVDJKkjtzuk+9OwpI5AAg0/fAQ4Ak9TzUN3Xfg7HpytETCIJ0Z
Ii3Jqwe+3IRTNq94ZTIiX/0hT7i5GsPxUfPzI2vwttCJn9ctUcM=
=XVNN
-----END PGP SIGNATURE-----
Merge tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo Silva:
"This fixes many fall-through warnings when building with Clang and
-Wimplicit-fallthrough, and also enables -Wimplicit-fallthrough for
Clang, globally.
It's also important to notice that since we have adopted the use of
the pseudo-keyword macro fallthrough, we also want to avoid having
more /* fall through */ comments being introduced. Contrary to GCC,
Clang doesn't recognize any comments as implicit fall-through markings
when the -Wimplicit-fallthrough option is enabled.
So, in order to avoid having more comments being introduced, we use
the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
will cause a warning in case a code comment is intended to be used as
a fall-through marking. The patch for Makefile also enforces this.
We had almost 4,000 of these issues for Clang in the beginning, and
there might be a couple more out there when building some
architectures with certain configurations. However, with the recent
fixes I think we are in good shape and it is now possible to enable
the warning for Clang"
* tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
Makefile: Enable -Wimplicit-fallthrough for Clang
powerpc/smp: Fix fall-through warning for Clang
dmaengine: mpc512x: Fix fall-through warning for Clang
usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang
powerpc/powernv: Fix fall-through warning for Clang
MIPS: Fix unreachable code issue
MIPS: Fix fall-through warnings for Clang
ASoC: Mediatek: MT8183: Fix fall-through warning for Clang
power: supply: Fix fall-through warnings for Clang
dmaengine: ti: k3-udma: Fix fall-through warning for Clang
s390: Fix fall-through warnings for Clang
dmaengine: ipu: Fix fall-through warning for Clang
iommu/arm-smmu-v3: Fix fall-through warning for Clang
mmc: jz4740: Fix fall-through warning for Clang
PCI: Fix fall-through warning for Clang
scsi: libsas: Fix fall-through warning for Clang
video: fbdev: Fix fall-through warning for Clang
math-emu: Fix fall-through warning
cpufreq: Fix fall-through warning for Clang
drm/msm: Fix fall-through warning in msm_gem_new_impl()
...
Merge misc fixes from Andrew Morton:
"13 patches.
Subsystems affected by this patch series: mm (kasan, pagealloc, rmap,
hmm, and hugetlb), and hfs"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/hugetlb: fix refs calculation from unaligned @vaddr
hfs: add lock nesting notation to hfs_find_init
hfs: fix high memory mapping in hfs_bnode_read
hfs: add missing clean-up in hfs_fill_super
lib/test_hmm: remove set but unused page variable
mm: fix the try_to_unmap prototype for !CONFIG_MMU
mm/page_alloc: further fix __alloc_pages_bulk() return value
mm/page_alloc: correct return value when failing at preparing
mm/page_alloc: avoid page allocator recursion with pagesets.lock held
Revert "mm/page_alloc: make should_fail_alloc_page() static"
kasan: fix build by including kernel.h
kasan: add memzero init for unaligned size at DEBUG
mm: move helper to check slub_debug_enabled
Syzbot reports a possible recursive lock in [1].
This happens due to missing lock nesting information. From the logs, we
see that a call to hfs_fill_super is made to mount the hfs filesystem.
While searching for the root inode, the lock on the catalog btree is
grabbed. Then, when the parent of the root isn't found, a call to
__hfs_bnode_create is made to create the parent of the root. This
eventually leads to a call to hfs_ext_read_extent which grabs a lock on
the extents btree.
Since the order of locking is catalog btree -> extents btree, this lock
hierarchy does not lead to a deadlock.
To tell lockdep that this locking is safe, we add nesting notation to
distinguish between catalog btrees, extents btrees, and attributes
btrees (for HFS+). This has already been done in hfsplus.
Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1]
Link: https://lkml.kernel.org/r/20210701030756.58760-4-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: syzbot+b718ec84a87b7e73ade4@syzkaller.appspotmail.com
Tested-by: syzbot+b718ec84a87b7e73ade4@syzkaller.appspotmail.com
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "hfs: fix various errors", v2.
This series ultimately aims to address a lockdep warning in
hfs_find_init reported by Syzbot [1].
The work done for this led to the discovery of another bug, and the
Syzkaller repro test also reveals an invalid memory access error after
clearing the lockdep warning. Hence, this series is broken up into
three patches:
1. Add a missing call to hfs_find_exit for an error path in
hfs_fill_super
2. Fix memory mapping in hfs_bnode_read by fixing calls to kmap
3. Add lock nesting notation to tell lockdep that the observed locking
hierarchy is safe
This patch (of 3):
Before exiting hfs_fill_super, the struct hfs_find_data used in
hfs_find_init should be passed to hfs_find_exit to be cleaned up, and to
release the lock held on the btree.
The call to hfs_find_exit is missing from an error path. We add it back
in by consolidating calls to hfs_find_exit for error paths.
Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1]
Link: https://lkml.kernel.org/r/20210701030756.58760-1-desmondcheongzx@gmail.com
Link: https://lkml.kernel.org/r/20210701030756.58760-2-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we encounter a directory that has been configured to pass on an
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review because that is a misconfiguration (that other
parts of the kernel will fix automatically).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
During a realtime grow operation, we run a single transaction for each
rt bitmap block added to the filesystem. This means that each step has
to be careful to increase sb_rblocks appropriately.
Fix the integer overflow error in this calculation that can happen when
the extent size is very large. Found by running growfs to add a rt
volume to a filesystem formatted with a 1g rt extent size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Improve the checking at the start of a realtime grow operation so that
we avoid accidentally set a new extent size that is too large and avoid
adding an rt volume to a filesystem with rmap or reflink because we
don't support rt rmap or reflink yet.
While we're at it, separate the checks so that we're only testing one
aspect at a time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit 603f000b15 changed xfs_ioctl_setattr_check_extsize to reject an
attempt to set an EXTSZINHERIT extent size hint on a directory with
RTINHERIT set if the hint isn't a multiple of the realtime extent size.
However, I have recently discovered that it is possible to change the
realtime extent size when adding a rt device to a filesystem, which
means that the existence of directories with misaligned inherited hints
is not an accident.
As a result, it's possible that someone could have set a valid hint and
added an rt volume with a different rt extent size, which invalidates
the ondisk hints. After such a sequence, FSGETXATTR will report a
misaligned hint, which FSSETXATTR will trip over, causing confusion if
the user was doing the usual GET/SET sequence to change some other
attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned
properly.
Fixes: 603f000b15 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While auditing the realtime growfs code, I realized that the GROWFSRT
ioctl (and by extension xfs_growfs) has always allowed sysadmins to
change the realtime extent size when adding a realtime section to the
filesystem. Since we also have always allowed sysadmins to set
RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
device, this invalidates the premise laid out in the comments added in
commit 603f000b15.
In other words, this is not a case of inadequate metadata validation.
This is a case of nearly forgotten (and apparently untested) but
supported functionality. Update the comments to reflect what we've
learned, and remove the log message about correcting the misalignment.
Fixes: 603f000b15 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
While running xfs/168, I noticed occasional write verifier shutdowns
involving inodes at the very end of the filesystem. Existing inode
btree validation code checks that all inode clusters are fully contained
within the filesystem.
However, due to inadequate checking in the fs shrink code, it's possible
that there could be a sparse inode cluster at the end of the filesystem
where the upper inodes of the cluster are marked as holes and the
corresponding blocks are free. In this case, the last blocks in the AG
are listed in the bnobt. This enables the shrink to proceed but results
in a filesystem that trips the inode verifiers. Fix this by disallowing
the shrink.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.
Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because the out of range assignment to bit fields
are compiler-dependant, the fields could have wrong
value.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213565
cruid should only be used for the initial mount and after this we should use the current
users credentials.
Ignore the original cruid mount argument when creating a new context for a multiuser mount
following a DFS link.
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Cc: stable@vger.kernel.org # 5.11+
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
We recently fixed DNS resolution of the server hostname during reconnect.
However, server IP address may change, even when the old one continues
to server (although sub-optimally).
We should schedule the next DNS resolution based on the TTL of
the DNS record used for the last resolution. This way, we resolve the
server hostname again when a DNS record expires.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch series adds support for the atomic_open
directory-inode op to vboxsf.
Note this is not just an enhancement this also fixes an actual issue
which users are hitting, see the commit message of the
"boxsf: Add support for the atomic_open directory-inode" patch.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmDTLTsUHGhkZWdvZWRl
QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9ytCAf8DjQurYh0B+E5i9pFL1hLgS715rnD
4qu8GT7+DF/9Yj8Mpg7aS/v1GNwjWBE506Fj6bc8E3s637OrflBiqqtFM6a/jcfP
i3RDtfCiTD9jeDT5OPhV4esuQvXnQ63ldXFSHf1TxaNb4Be8OmACibnSvslyC+Eb
YhKtMRH+oKeQfob3rbTJBglkDRe1KUuA2zGPBuYheaLLaYSHrj1xSRCoGY6mJMBJ
pP5FCT/nOsgxD6zej3/aa57put9kZoYVlu1TLnCfkggzuirL+82/pABC3ZYTtsM8
jeby97djOI/fufIlVD1yX7q+kzyVWj3ouparoAKsu5TDSmmIRYnfu/RlNA==
=yFXG
-----END PGP SIGNATURE-----
Merge tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux
Pull vboxsf fixes from Hans de Goede:
"This adds support for the atomic_open directory-inode op to vboxsf.
Note this is not just an enhancement this also fixes an actual issue
which users are hitting, see the commit message of the "boxsf: Add
support for the atomic_open directory-inode" patch"
* tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
vboxsf: Add support for the atomic_open directory-inode op
vboxsf: Add vboxsf_[create|release]_sf_handle() helpers
vboxsf: Make vboxsf_dir_create() return the handle for the created file
vboxsf: Honor excl flag to the dir-inode create op
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDsjSEACgkQxWXV+ddt
WDtnZRAAieSXta8GaJYNF4cKs7xHttIkNl0ljJHsJsKoN5kCxW22RWsf8gAyToT3
XERkJfRksgMH0Th3StJqTxg0fQTSiSi1bcz+wJjMVvQev2gX8dw7O05GLZT5GTzx
zquI57+OGDEpQdEM6YzrUl+tYnO0roibI2LQeMWUXYXJTy6F75zWjBqKcTGcnfGc
d8bOi6ijN4F148zIxvr6ahHrQN9WGwD5OWA1I5RqHBadgwCDWsQIdE6/N1Kdavf5
uW785lJ8a4VqOWyM7Y0kp4madnF9rwZ/CFyoQFJ51oG/NrUf469+bCBFM8VOEwSa
c3ZaqvF8CF3sndSAYiI4MEBFbM2O4hIVl/B9NkjDXDu3VlkRwwHDxZfadvc4BzsG
kfisaw/GbOvOv8ojxBq4ux2nbRIVul096HpZH4UWHs/MCQ5Ct40OP5sG77YZKQgf
o+D65V3NMn1gnp+B8wqyNnraY4hAoBePoK9f3IH+WXF5hlk6gWkbWxmXxCIPvJM4
XTJUcNCXDZtKA9KRgOmcP9fZSu4gyD3hbDRgU5nKkLLSGE+mE4BRmtnq91VnT7FA
5Nxlrjw9Na9LoyXYaoHcCksj207KU6WVgIjK4OFJarLMWlSDYBwAQCX0+voG+ZBq
qa6BuLpq2aJhB6Q4M3MdAQSbhfR6tcI+HENCQlFHa6Je7oY9NVQ=
=v00S
-----END PGP SIGNATURE-----
Merge tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs zoned mode fixes from David Sterba:
- fix deadlock when allocating system chunk
- fix wrong mutex unlock on an error path
- fix extent map splitting for append operation
- update and fix message reporting unusable chunk space
- don't block when background zone reclaim runs with balance in
parallel
* tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
btrfs: don't block if we can't acquire the reclaim lock
btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
btrfs: fix deadlock with concurrent chunk allocations involving system chunks
btrfs: zoned: print unusable percentage when reclaiming block groups
btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
Commit 7fe1e79b59 ("configfs: implement the .read_iter and .write_iter
methods") changed the simple_read_from_buffer() calls into copy_to_iter()
calls and the simple_write_to_buffer() calls into copy_from_iter() calls.
The simple*buffer() methods update the file offset (*ppos) but the read
and write iterators not yet. Make the read and write iterators update the
file offset (iocb->ki_pos).
This patch has been tested as follows:
# modprobe target_core_user
# dd if=/sys/kernel/config/target/dbroot bs=1
/var/target
12+0 records in
12+0 records out
12 bytes copied, 9.5539e-05 s, 126 kB/s
# cd /sys/kernel/config/acpi/table
# mkdir test
# cd test
# dmesg -c >/dev/null; printf 'SSDT\x8\0\0\0abcdefghijklmnopqrstuvwxyz' | dd of=aml bs=1; dmesg -c
34+0 records in
34+0 records out
34 bytes copied, 0.010627 s, 3.2 kB/s
[ 261.056551] ACPI configfs: invalid table length
Reported-by: Yanko Kaneti <yaneti@declera.com>
Cc: Yanko Kaneti <yaneti@declera.com>
Fixes: 7fe1e79b59 ("configfs: implement the .read_iter and .write_iter methods")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fix the following warning:
fs/fcntl.c:373:3: warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough]
fallthrough;
^
include/linux/compiler_attributes.h:210:41: note: expanded from macro 'fallthrough'
# define fallthrough __attribute__((__fallthrough__))
by placing the fallthrough; statement inside ifdeffery.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
In preparation to enable -Wimplicit-fallthrough for Clang, fix
the following warnings by replacing /* fallthrough */ comments,
and its variants, with the new pseudo-keyword macro fallthrough:
fs/xfs/libxfs/xfs_attr.c:487:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:500:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:532:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:594:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:607:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:1410:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:1445:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:1473:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Notice that Clang doesn't recognize /* fallthrough */ comments as
implicit fall-through markings, so in order to globally enable
-Wimplicit-fallthrough for Clang, these comments need to be
replaced with fallthrough; in the whole codebase.
Link: https://github.com/KSPP/linux/issues/115
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
When we use delayed_work for fallback execution of requests, current
will be not of the submitter task, and so checks in io_req_task_submit()
may not behave as expected. Currently, it leaves inline completions not
flushed, so making io_ring_exit_work() to hang. Use the submitter task
for all those checks.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmDpFjwACgkQiiy9cAdy
T1GriwwApJ/dwPeiWeotsw8kiFQjM0S/wu0JByPLJIjZ27LN3D4aGO8uqaB861G4
4Q+w545poIcRmn//2PwhN6LZUnBY6RkPCTra4mbVtMB6c1MU9NsqRe3vqiqKN167
qc2zCTl/z67jamUIwuWHB+ljphecGKBsSpik1ZqQkTbMIz+SXoMfkgr3R3CAXkpY
EyzuLCaIU3PC8MCb749LEG6Xkq+7sT1GIaoW9BPle/ZFKbZ3acvPGzVcA4s4qlDH
/mNZ3cVKxGDv0CxUqa9wkpelfuaooFLmWwah3mvG7YhQ2gxmFlty8/+ppP2LvTpm
ZRLmpeot7PctDnNTSRhrvm+XC3HqzLLrXXtaGvoW4zTKa3RrWFEcr5tppMmiHjJt
gTJhkkINTHv0Ewp677/vu8xnEPLgBcR0t2IEPNjgI0jqPuYrjT6EF0wsSUP3K00O
ODJGT/q/dmtqjezenmOi9MQnGKox/9C2uuAcagWPzHoeesYfK8Sh+q5tc+bfykHz
6QN5jKp/
=7Jd9
-----END PGP SIGNATURE-----
Merge tag '5.14-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"13 cifs/smb3 fixes. Most are to address minor issues pointed out by
Coverity.
Also includes a packet signing enhancement and mount improvement"
* tag '5.14-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
cifs: prevent NULL deref in cifs_compose_mount_options()
SMB3.1.1: Add support for negotiating signing algorithm
cifs: use helpers when parsing uid/gid mount options and validate them
CIFS: Clarify SMB1 code for POSIX Lock
CIFS: Clarify SMB1 code for rename open file
CIFS: Clarify SMB1 code for delete
CIFS: Clarify SMB1 code for SetFileSize
smb3: fix typo in header file
CIFS: Clarify SMB1 code for UnixSetPathInfo
CIFS: Clarify SMB1 code for UnixCreateSymLink
cifs: clarify SMB1 code for UnixCreateHardLink
cifs: make locking consistent around the server session status
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDoXgsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv7XEADFFTtlR/xLAgPoVA5XX25g4m/k/Uo2zY+Y
JXDHX55MIpSHTycnfXESiwf2a6fTkbKCAlcGbkqKxySxyFF36vvcRXF4PnspeRI5
LjZlhUgqqFgWQ5Evl3LcKpG8sLsg1fB6vjD0kI+x9nlNA36Ly6egIgZ+i39tk7zK
m72ez0A5B1sKTn78BNze+akaSSjDmPTvb53E0Jl4eqn4cTKySXQ5JJlG71ke/4lJ
VUNdufrxZMP0EvBTwxjQCg0DwhCy57tYiB98/OrMagfbOxxsF9GxKafvmtpN4fev
rLLnKs4tU68aRK3nVeIbUYfQAquNiNdy9Hoqv4XYzN7NBMnGiyRmmYLN1pbA7OlW
lmc0v2c4ORoeHMTChQlEcx6xKVpGrRS7+S8P+Vz81JPpn+OO1sRhgzjc9/FdEdjy
sTBnQg23V274zL6jZsIL9HZzdI0fv5ybzuG061Tj7P6TUs9E/nn0wGo6m8FWTpSd
B5Gl42SOWlybFETXessXFECd7TrbvT6Dkghg//GnbhtfUNtG4NwBfuIJLVrFGwJr
RGu5YXxSP79YuAdqP0cOlr3O/i/BrdqjmveDSUbpMHANEz6gGTplcCOwOs4QWBCI
HO2ExxEP2NmLOtwXfGabTlt6Wlo3zj1ToHhpOmXCYVfEZonhHxcwV4odWfRRx43A
OTSfC1Tvlg==
=74oU
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.14-2021-07-09' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few fixes that should go into this merge.
One fixes a regression introduced in this release, others are just
generic fixes, mostly related to handling fallback task_work"
* tag 'io_uring-5.14-2021-07-09' of git://git.kernel.dk/linux-block:
io_uring: remove dead non-zero 'poll' check
io_uring: mitigate unlikely iopoll lag
io_uring: fix drain alloc fail return code
io_uring: fix exiting io_req_task_work_add leaks
io_uring: simplify task_work func
io_uring: fix stuck fallback reqs
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDnGVYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv6UEAC78zkseI8TmKaowNfkz/+MkP9eSFb1pVn3
rxpbPOsZompHoZpeWt4oHL+3Rmm3a9iRo/APA2ELas4zvp+Q+6uG7eha2Dc4hUA9
YgeO4z9YfG8wQNZc3x7bncb6ZwqEE5nnbFe/m25SyrAZVLlZ7FKHxfoZDqjhlGFC
eLNiYO6vdvwgCoBMcotyCDttrPfEu6947/5vB1zevv57twdQQaEWGUhvyx1XrlDX
0YD5fmdOjNU2isgxt4xo2Ur2zL6w254/hvj58sV3Z7JfkJpI9DCK+ztKEfzuyEhA
WYz06rDAT1+1KuVLfowaZ+pYiPPOIsL0+QXI83r3nLaE7WGGlfS8Hmz//1FbziYs
ZSZI826kEN+/lKeWTcKOOMhmkYyXEFFuQZS34eg9KI4xwML8v+ILlHmcp+tjebw9
vzNF6f7N2ki+jnyxxyNxeMHxeAMWsqnIRROOhZg6bbs6UVNpDy4qRzpQaDOaJsVe
uSAQ6PTd/etR9KE+ClhLe6X7Rmp/lfZCPe64wqM/3k1qV2KWhE1fwCQO4c5o1MBN
rpk3Ef5PZYP3aakCvZnfcjMWlpZNbq/xMc6vPc+yq32akq1t1KbODVBiR5odcH0C
Gt5N11im50SO06haBt7EOe4JMQLbK5sxG15t4C6mNQZgPegGfaLlVkKpzIkOzUha
OkRofKMcDA==
=gHse
-----END PGP SIGNATURE-----
Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A combination of changes that ended up depending on both the driver
and core branch (and/or the IDE removal), and a few late arriving
fixes. In detail:
- Fix io ticks wrap-around issue (Chunguang)
- nvme-tcp sock locking fix (Maurizio)
- s390-dasd fixes (Kees, Christoph)
- blk_execute_rq polling support (Keith)
- blk-cgroup RCU iteration fix (Yu)
- nbd backend ID addition (Prasanna)
- Partition deletion fix (Yufen)
- Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)
- Removal of now dead block request types due to IDE removal
(Christoph)
- Loop probing and control device cleanups (Christoph)
- Device uevent fix (Christoph)
- Misc cleanups/fixes (Tetsuo, Christoph)"
* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
block: fix the problem of io_ticks becoming smaller
nvme-tcp: can't set sk_user_data without write_lock
loop: remove unused variable in loop_set_status()
block: remove the bdgrab in blk_drop_partitions
block: grab a device refcount in disk_uevent
s390/dasd: Avoid field over-reading memcpy()
dasd: unexport dasd_set_target_state
block: check disk exist before trying to add partition
ubd: remove dead code in ubd_setup_common
nvme: use return value from blk_execute_rq()
block: return errors from blk_execute_rq()
nvme: use blk_execute_rq() for passthrough commands
block: support polling through blk_execute_rq
block: remove REQ_OP_SCSI_{IN,OUT}
block: mark blk_mq_init_queue_data static
loop: rewrite loop_exit using idr_for_each_entry
loop: split loop_lookup
loop: don't allow deleting an unspecified loop device
loop: move loop_ctl_mutex locking into loop_add
...
The optional @ref parameter might contain an NULL node_name, so
prevent dereferencing it in cifs_compose_mount_options().
Addresses-Coverity: 1476408 ("Explicit null dereferenced")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Support for faster packet signing (using GMAC instead of CMAC) can
now be negotiated to some newer servers, including Windows.
See MS-SMB2 section 2.2.3.17.
This patch adds support for sending the new negotiate context
with the first of three supported signing algorithms (AES-CMAC)
and decoding the response. A followon patch will add support
for sending the other two (including AES-GMAC, which is fastest)
and changing the signing algorithm used based on what was
negotiated.
To allow the client to request GMAC signing set module parameter
"enable_negotiate_signing" to 1.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fix for a race xattr list and modification
- Various minor fixes (spelling, return codes, ...)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmDnaSYWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wd36D/9ZJDHP1D1ZTTN6eFbAy55fDV+U
6nx36Ot+DA5BqaPYX1YRBNxshBIdd620W4n07US20SL14k7dzky/y+oJt+ZZY44k
1ev6zqbmWTKdZhITwoaeVWRGYkEUVTvGW4MoOERZSfLhodrRVCfqs9/1a8ol/x1g
AhZFxBRBUWLWlHYNplqdZdIuCo0qGQBWITv/A8of0pXH9cGxRTvYV5V6CbeXrnuD
LgvckvwCZHPtRHgil3R1l/oGo109QNKhEycFBBLf1ydqIyIjOBcpI2ETZmifyv9D
v2dgyu9u1SM2mzMEZmdTfWVBjqhKH2vYiv1otF2BD7DCJ48Y4yb/YQby3MOrZfDI
rI3B8FlfNeXkqzoKjuLjiHe21KXrifPNsuTcunNd8Thnw+0mqsY3OimY4pYpQO2O
JhDlnt8V4QLcUeg2NYGrQHx1qfLSMgc5pYd0ntOWWhJKXtreFYtX2WeorMroIHZQ
B4li4VasaLvg+HSwJsgJ4qrPHl0fMT8eu0yb4KjsJi/d9Zl4FhmRe/QVs43GayAT
zczxywAAN81sP2q/TQPtDgHzI90QTA0E89JkQVa0cGA2/VznM9RmAhrxGIjwGKZE
4owC5RMA8w7eb60aLZ4HEooLwNXbKRd3ayR9Hatbp+Cq3+GBUG77em+2sLtwyqiD
mDWJo6AWRZxBDpG4jg==
=2Wk7
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBIFS updates from Richard Weinberger:
- Fix for a race xattr list and modification
- Various minor fixes (spelling, return codes, ...)
* tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode
ubifs: Fix spelling mistakes
ubifs: Remove ui_mutex in ubifs_xattr_get and change_xattr
ubifs: Fix races between xattr_{set|get} and listxattr operations
ubifs: fix snprintf() checking
ubifs: journal: Fix error return code in ubifs_jnl_write_inode()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmDnU2AACgkQ8vlZVpUN
gaOBIAgApIAIeGbppf7aFjRN4h4wxfRpr7w6lux3GVmz7D+6djRi21X5dT5xq01m
u6DkLAcKrCATIidyP6qHlvBbxxcPt2PX1FcQbruj9WcnSng1Ngl7RW8BEqp/eIRo
Nb7MY0pg8HIJVMEniWQcdEjFWKDL3ksWR9+X3V3nhSzp+0kXFF1ySjk+TWi/ZGSn
T/Q1sEyeUOiVfV75cIW5JbKoJEgvCvrclFvGJLYVcIAYeqJfQKQ0+tlkhDeYnWfQ
nZgh1UU350bO629LGIhbRAkLbAloEb0d57mOQCrATo0JFrAZ52+0ZCkrTXtIyoOF
TUILVf3zsqgdO8HLDkbH1G+lGn9WOA==
=qU+W
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Ext4 regression and bug fixes"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: inline jbd2_journal_[un]register_shrinker()
ext4: fix flags validity checking for EXT4_IOC_CHECKPOINT
ext4: fix possible UAF when remounting r/o a mmp-protected file system
ext4: use ext4_grp_locked_error in mb_find_extent
ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblock
Revert "ext4: consolidate checks for resize of bigalloc into ext4_resize_begin"
Xiubo, two patchsets from Jeff that begin to untangle some heavyweight
blocking locks in the filesystem and a bunch of code cleanups.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmDnVcgTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi+d9CACqbWorDRCksqBB40muthHfgArYAc8A
WZEvrcieymV6P+A3KJj9wtNeRgT8iSdJDweD/5Yl0ZfZUx3i0x78600fe5cls3u3
XiX154G8KZpnAQbuDXnSny+4PiEQMkbfL3Zk++TSClBWb2PqYF/LvEsCfdBIuHYm
BRMTpZ9rGWD+WWnz1iroubhMfmUTdyGzsgA4zjBNr46d2k1gZVviB0TDsEfhC8lP
qio7IABkIWmvVJk9MCwp4JJQMMKuaN9DRddoA2Q/NZzevxHRUWCvW5a6o6vpO1+W
d74Zzf9kbwCy+qbO1YpS0yrpNXP2IBVa0ZPNChOVDluPTmgVyQmrRjnU
=wXsA
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.14-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"We have new filesystem client metrics for reporting I/O sizes from
Xiubo, two patchsets from Jeff that begin to untangle some heavyweight
blocking locks in the filesystem and a bunch of code cleanups"
* tag 'ceph-for-5.14-rc1' of git://github.com/ceph/ceph-client:
ceph: take reference to req->r_parent at point of assignment
ceph: eliminate ceph_async_iput()
ceph: don't take s_mutex in ceph_flush_snaps
ceph: don't take s_mutex in try_flush_caps
ceph: don't take s_mutex or snap_rwsem in ceph_check_caps
ceph: eliminate session->s_gen_ttl_lock
ceph: allow ceph_put_mds_session to take NULL or ERR_PTR
ceph: clean up locking annotation for ceph_get_snap_realm and __lookup_snap_realm
ceph: add some lockdep assertions around snaprealm handling
ceph: decoding error in ceph_update_snap_realm should return -EIO
ceph: add IO size metrics support
ceph: update and rename __update_latency helper to __update_stdev
ceph: simplify the metrics struct
libceph: fix doc warnings in cls_lock_client.c
libceph: remove unnecessary ret variable in ceph_auth_init()
libceph: fix some spelling mistakes
libceph: kill ceph_none_authorizer::reply_buf
ceph: make ceph_queue_cap_snap static
ceph: make ceph_netfs_read_ops static
ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
Highlights include:
Stable fixes:
- Two sunrpc fixes for deadlocks involving privileged rpc_wait_queues
Bugfixes
- SUNRPC: Avoid a KASAN slab-out-of-bounds bug in xdr_set_page_base()
- SUNRPC: prevent port reuse on transports which don't request it.
- NFSv3: Fix memory leak in posix_acl_create()
- NFS: Various fixes to attribute revalidation timeouts
- NFSv4: Fix handling of non-atomic change attribute updates
- NFSv4: If a server is down, don't cause mounts to other servers to
hang as well
- pNFS: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
- NFS: Fix mount failures due to incorrect setting of the has_sec_mnt_opts
filesystem flag
- NFS: Ensure nfs_readpage returns promptly when an internal error occurs
- NFS: Fix fscache read from NFS after cache error
- pNFS: Various bugfixes around the LAYOUTGET operation
Features
- Multiple patches to add support for fcntl() leases over NFSv4.
- A sysfs interface to display more information about the various
transport connections used by the RPC client
- A sysfs interface to allow a suitably privileged user to offline a
transport that may no longer point to a valid server
- A sysfs interface to allow a suitably privileged user to change the
server IP address used by the RPC client
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmDnPrkACgkQZwvnipYK
APKGAw/9EjdGoic6VpShQyb5uxaRoDd4uwWgFLBOfPhIWC7qNMtkj49wHIOEUm7e
YfdF5RlCdmshaoMxjY84wjl8NTMwHbPahgooDd4+UsZUs2qxZ8dBsr0itfbFsJv8
BpaCYKQt6XGQngGrWfC7SiCETnMej2YsmjDfHvhD58TxnRfPWexHUvx9xi9uGRCS
sIWRA2QMNs7LwdShkkRotagodRLhu/zo4g0lon5lI8D/SRg6o8RoO4YP6oKH1FN4
OyVzy1aWZGocgwCMUtNeuigJSRyDa+bJTfJ2c27uw5g18s0XWZ3j2DxD5I+HCEuE
B4rhg+ujtPIifYLHf2Aj3nlxdBePZ5L67a2MOOUo+wSD+nPmNMZF1eIT/3Jsg/HA
Z8gqcBiTIkBfVGJxWWbrbHfxPXQiK1IRGQx9acyhLCN9M6Kv5bbkn4R4dnronvJR
g6O968fgC5uvl60CXdc8NCpWtSitXB/nH8pn7MbJ8JBGq7QIYNkS0d4E8ePhYwxk
sRYJt21O+ryjodfQDHaUxodzCKGcpRoknpirMmgoAp4zdkva4ltViNsQvHa7jFh8
HIuhU6Aia1xVYpUMDEXf2WMXCT9yLa2TyMDuS5KDfb69wBkQJWeKNkebf+1k03wQ
saEmdoP4aEEujimkA7rqyOlI8XhsudKvBd3HXg+w9+xIt4yoie0=
=NaOI
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Multiple patches to add support for fcntl() leases over NFSv4.
- A sysfs interface to display more information about the various
transport connections used by the RPC client
- A sysfs interface to allow a suitably privileged user to offline a
transport that may no longer point to a valid server
- A sysfs interface to allow a suitably privileged user to change the
server IP address used by the RPC client
Stable fixes:
- Two sunrpc fixes for deadlocks involving privileged rpc_wait_queues
Bugfixes:
- SUNRPC: Avoid a KASAN slab-out-of-bounds bug in xdr_set_page_base()
- SUNRPC: prevent port reuse on transports which don't request it.
- NFSv3: Fix memory leak in posix_acl_create()
- NFS: Various fixes to attribute revalidation timeouts
- NFSv4: Fix handling of non-atomic change attribute updates
- NFSv4: If a server is down, don't cause mounts to other servers to
hang as well
- pNFS: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
- NFS: Fix mount failures due to incorrect setting of the
has_sec_mnt_opts filesystem flag
- NFS: Ensure nfs_readpage returns promptly when an internal error
occurs
- NFS: Fix fscache read from NFS after cache error
- pNFS: Various bugfixes around the LAYOUTGET operation"
* tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
NFSv4/pNFS: Return an error if _nfs4_pnfs_v3_ds_connect can't load NFSv3
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
NFSv4/pnfs: Clean up layout get on open
NFSv4/pnfs: Fix layoutget behaviour after invalidation
NFSv4/pnfs: Fix the layout barrier update
NFS: Fix fscache read from NFS after cache error
NFS: Ensure nfs_readpage returns promptly when internal error occurs
sunrpc: remove an offlined xprt using sysfs
sunrpc: provide showing transport's state info in the sysfs directory
sunrpc: display xprt's queuelen of assigned tasks via sysfs
sunrpc: provide multipath info in the sysfs directory
NFSv4.1 identify and mark RPC tasks that can move between transports
sunrpc: provide transport info in the sysfs directory
SUNRPC: take a xprt offline using sysfs
sunrpc: add dst_attr attributes to the sysfs xprt directory
SUNRPC for TCP display xprt's source port in sysfs xprt_info
SUNRPC query transport's source port
SUNRPC display xprt's main value in sysfs's xprt_info
SUNRPC mark the first transport
sunrpc: add add sysfs directory per xprt under each xprt_switch
...
In this round, we've improved the compression support especially for Android
such as allowing compression for mmap files, replacing the immutable bit with
internal bit to prohibits data writes explicitly, and adding a mount option,
"compress_cache", to improve random reads. And, we added "readonly" feature to
compact the partition w/ compression enabled, which will be useful for Android
RO partitions.
Enhancement:
- support compression for mmap file
- use an f2fs flag instead of IMMUTABLE bit for compression
- support RO feature w/ extent_cache
- fully support swapfile with file pinning
- improve atgc tunability
- add nocompress extensions to unselect files for compression
Bug fix:
- fix false alaram on iget failure during GC
- fix race condition on global pointers when there are multiple f2fs instances
- add MODULE_SOFTDEP for initramfs
As usual, we've also cleaned up some places for better code readability.
(e.g., sysfs/feature, debugging messages, slab cache name, and docs)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmDmAKIACgkQQBSofoJI
UNIGGA//XOcQ+BF6t/os55UZFZc2E5F9VDNkKKBw9WIXp2g2/5D8hl+dT7qMFzMA
ndZiRl0wEElce1ZqSzrt8TRcbwoDLgguUdD7B70CR2NMZ36kdVDet3BW01SnwhXy
JoiLBkdB5VUdK6QMnP7+xE3tvKhEk3FDohpOoaOtmYyPpk8Q7C8n0PX2KADgmepl
6aNi1aJGy+hf6UyhcHlgfKLpndyX6gGJVmoLN+yJF2D9tLALaLHcZgJUBG9ZajNH
yLNwiBG7ybcs1TrWan0887Ne3GVXQpWqrBLou3s4b8abTLpYdFBnIPrN00jmu3Hp
I/uenkBS4HI5AZgKFrF1i2QiKlBL/+PmQ1fJtraVJj6auhnwHqZGIdTb8c3AbeSJ
pKhC3w8W8K6Fs2Awuyc4qlWDdNquAeM0wf5qlD0yLgZO6/KgpXbBDCQIOhhBn2hW
6h6jZ4KJwEuTXEZB39PJ83YXPpzoxoZW4hVZXlAcsgX2s3Ke1R0BrjZ+ff6JeG78
Oj13PCWFbwd+UjS3I2W1R9taBDeuYkyMF7CKl+BtAzkYD1bO+KfO4sfSoR/XLSVx
OwB36LFzklNJeJ1koJZw4n/czJQYZTOQgThDPFDNL1m8zpyB5p8MNL7f84cOJ2Yo
aU7mpTX/i56bGsO3sr4OyO37MlZTBkyNjMI/sBgouPXYlN3YmHs=
=1IHL
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've improved the compression support especially for
Android such as allowing compression for mmap files, replacing the
immutable bit with internal bit to prohibits data writes explicitly,
and adding a mount option, "compress_cache", to improve random reads.
And, we added "readonly" feature to compact the partition w/
compression enabled, which will be useful for Android RO partitions.
Enhancements:
- support compression for mmap file
- use an f2fs flag instead of IMMUTABLE bit for compression
- support RO feature w/ extent_cache
- fully support swapfile with file pinning
- improve atgc tunability
- add nocompress extensions to unselect files for compression
Bug fixes:
- fix false alaram on iget failure during GC
- fix race condition on global pointers when there are multiple f2fs
instances
- add MODULE_SOFTDEP for initramfs
As usual, we've also cleaned up some places for better code
readability (e.g., sysfs/feature, debugging messages, slab cache
name, and docs)"
* tag 'f2fs-for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (32 commits)
f2fs: drop dirty node pages when cp is in error status
f2fs: initialize page->private when using for our internal use
f2fs: compress: add nocompress extensions support
MAINTAINERS: f2fs: update my email address
f2fs: remove false alarm on iget failure during GC
f2fs: enable extent cache for compression files in read-only
f2fs: fix to avoid adding tab before doc section
f2fs: introduce f2fs_casefolded_name slab cache
f2fs: swap: support migrating swapfile in aligned write mode
f2fs: swap: remove dead codes
f2fs: compress: add compress_inode to cache compressed blocks
f2fs: clean up /sys/fs/f2fs/<disk>/features
f2fs: add pin_file in feature list
f2fs: Advertise encrypted casefolding in sysfs
f2fs: Show casefolding support only when supported
f2fs: support RO feature
f2fs: logging neatening
f2fs: introduce FI_COMPRESS_RELEASED instead of using IMMUTABLE bit
f2fs: compress: remove unneeded preallocation
f2fs: atgc: export entries for better tunability via sysfs
...
Colin reports that Coverity complains about checking for poll being
non-zero after having dereferenced it multiple times. This is a valid
complaint, and actually a leftover from back when this code was based
on the aio poll code.
Kill the redundant check.
Link: https://lore.kernel.org/io-uring/fe70c532-e2a7-3722-58a1-0fa4e5c5ff2c@canonical.com/
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the nice helpers to initialize and the uid/gid/cred_uid when passed as mount arguments.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We have requests like IORING_OP_FILES_UPDATE that don't go through
->iopoll_list but get completed in place under ->uring_lock, and so
after dropping the lock io_iopoll_check() should expect that some CQEs
might have get completed in a meanwhile.
Currently such events won't be accounted in @nr_events, and the loop
will continue to poll even if there is enough of CQEs. It shouldn't be a
problem as it's not likely to happen and so, but not nice either. Just
return earlier in this case, it should be enough.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ef932cc66a34e3771bbae04b2953a8058e9d05.1625747741.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we fail to return an error if the NFSv3 module failed to load
when we're trying to connect to a pNFS data server.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After we grab the lock in nfs4_pnfs_ds_connect(), there is no check for
whether or not ds->ds_clp has already been initialised, so we can end up
adding the same transports multiple times.
Fixes: fc821d5920 ("pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cache the layout in the arguments so we don't have to keep looking it up
from the inode.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the layout gets invalidated, we should wait for any outstanding
layoutget requests for that layout to complete, and we should resend
them only after re-establishing the layout stateid.
Fixes: d29b468da4 ("pNFS/NFSv4: Improve rejection of out-of-order layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we have multiple outstanding layoutget requests, the current code to
update the layout barrier assumes that the outstanding layout stateids
are updated in order. That's not necessarily the case.
Instead of using the value of lo->plh_outstanding as a guesstimate for
the window of values we need to accept, just wait to update the window
until we're processing the last one. The intention here is just to
ensure that we don't process 2^31 seqid updates without also updating
the barrier.
Fixes: 1bcf34fdac ("pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Earlier commits refactored some NFS read code and removed
nfs_readpage_async(), but neglected to properly fixup
nfs_readpage_from_fscache_complete(). The code path is
only hit when something unusual occurs with the cachefiles
backing filesystem, such as an IO error or while a cookie
is being invalidated.
Mark page with PG_checked if fscache IO completes in error,
unlock the page, and let the VM decide to re-issue based on
PG_uptodate. When the VM reissues the readpage, PG_checked
allows us to skip over fscache and read from the server.
Link: https://marc.info/?l=linux-nfs&m=162498209518739
Fixes: 1e83b173b2 ("NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
A previous refactoring of nfs_readpage() might end up calling
wait_on_page_locked_killable() even if readpage_async_filler() failed
with an internal error and pg_error was non-zero (for example, if
nfs_create_request() failed). In the case of an internal error,
skip over wait_on_page_locked_killable() as this is only needed
when the read is sent and an error occurs during completion handling.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>