- Fix a few typos found while reading the code.
- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull ksys_mount() and ksys_dup() removal from Dominik Brodowski:
"This small series replaces all in-kernel calls to the
userspace-focused ksys_mount() and ksys_dup() with calls to
kernel-centric functions:
For each replacement of ksys_mount() with do_mount(), one needs to
verify that the first and third parameter (char *dev_name, char *type)
are strings allocated in kernelspace and that the fifth parameter
(void *data) is either NULL or refers to a full page (only occurence
in init/do_mounts.c::do_mount_root()). The second and fourth
parameters (char *dir_name, unsigned long flags) are passed by
ksys_mount() to do_mount() unchanged, and therefore do not require
particular care.
Moreover, instead of pretending to be userspace, the opening of
/dev/console as stdin/stdout/stderr can be implemented using in-kernel
functions as well. Thereby, ksys_dup() can be removed for good"
[ This doesn't get rid of the special "kernel init runs with KERNEL_DS"
case, but it at least removes _some_ of the users of "treat kernel
pointers as user pointers for our magical init sequence".
One day we'll hopefully be rid of it all, and can initialize our
init_thread addr_limit to USER_DS. - Linus ]
* 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
fs: remove ksys_dup()
init: unify opening /dev/console as stdin/stdout/stderr
init: use do_mount() instead of ksys_mount()
initrd: use do_mount() instead of ksys_mount()
devtmpfs: use do_mount() instead of ksys_mount()
It's possible that __ext4_new_inode will release the xattr block, so
it will trigger a warning since there is revoke credits will be 0 if
the handle == NULL. The below scripts can reproduce it easily.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374
...
__ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248
ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743
ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254
ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112
ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384
__ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214
ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293
__ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151
ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774
vfs_mkdir+0x386/0x5f0 fs/namei.c:3811
do_mkdirat+0x11c/0x210 fs/namei.c:3834
do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294
...
-------------------------------------
scripts:
mkfs.ext4 /dev/vdb
mount /dev/vdb /mnt
cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done
mkdir dir/dir1 && mv dir/dir1 ./
sh repro.sh && add some user
[root@localhost ~]# cat repro.sh
while [ 1 -eq 1 ]; do
rm -rf dir
rm -rf dir1/dir1
mkdir dir
for i in {1..8}; do setfacl -dm "u:test"$i":rx" dir; done
setfacl -m "u:user_9:rx" dir &
mkdir dir1/dir1 &
done
Before exec repro.sh, dir1 has inherit the default acl from dir, and
xattr block of dir1 dir is not the same, so the h_refcount of these
two dir's xattr block will be 1. Then repro.sh can trigger the warning
with the situation show as below. The last h_refcount can be clear
with mkdir, and __ext4_new_inode has not reserved revoke credits, so
the warning will happened, fix it by reserve revoke credits in
__ext4_new_inode.
Thread 1 Thread 2
mkdir dir
set default acl(will create
a xattr block blk1 and the
refcount of ext4_xattr_header
will be 1)
...
mkdir dir1/dir1
->....->ext4_init_acl
->__ext4_set_acl(set default acl,
will reuse blk1, and h_refcount
will be 2)
setfacl->ext4_set_acl->...
->ext4_xattr_block_set(will create
new block blk2 to store xattr)
->__ext4_set_acl(set access acl, since
h_refcount of blk1 is 2, will create
blk3 to store xattr)
->ext4_xattr_release_block(dec
h_refcount of blk1 to 1)
->ext4_xattr_release_block(dec
h_refcount and since it is 0,
will release the block and trigger
the warning)
Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make __ext4_check_dir_entry() a bit easier to understand, and reduce
the object size of the function by over 11%.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20191209004346.38526-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_check_dir_entry() currently does not catch a case when a directory
entry ends so close to the block end that the header of the next
directory entry would not fit in the remaining space. This can lead to
directory iteration code trying to access address beyond end of current
buffer head leading to oops.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Function ext4_empty_dir() doesn't correctly handle directories with
holes and crashes on bh->b_data dereference when bh is NULL. Reorganize
the loop to use 'offset' variable all the times instead of comparing
pointers to current direntry with bh->b_data pointer. Also add more
strict checking of '.' and '..' directory entries to avoid entering loop
in possibly invalid state on corrupted filesystems.
References: CVE-2019-19037
CC: stable@vger.kernel.org
Fixes: 4e19d6b65f ("ext4: allow directory holes")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl30eFcACgkQiiy9cAdy
T1FrCAv+Lwq/7pwyzoS0F1DWXJfs+3a2KCR2QSnZE9Mok8AobvRnNS9dZcXZdp0U
RnSFfkv9ia5vXegO0899nM6n0jtS5/OCex9RITVBtvIk8HLKrpcmYP+gS3III5Qq
oMF11JZCWDI2HHJA6xEV677rgFjOiEDjtjaZrXOS2TClnBCU3ZDRghul46DKACbA
xQDr0ifgOcDdxKBSTERhGvKA6xOf+gPP73SKB5Kg6OaPWL9FjhGvN/1ic2LVmFHF
cc7rbXhl1jTnKfw22qrS5XsElBSQdbM7X23CJ9ik8zXpF8gLBjGefx894xyFLELb
efJcHC1vroBc11HfLaLmAoQRp7leDIP4Icyq2TqJC6+mWsxJ0G96ofn7tGR9z/FK
SlcggWkrC8frJQQoYMYylQcknXK9+WahyyNswMXFiUYU7XSojyx4MoBtvRavOI5l
JKJtoiZd3OPtXOjvnKzNqSFcHC3YNQDE9GmIOXHgRuhFcnO0UQOKvbwXik2HCXBp
7A1dxRob
=65tY
-----END PGP SIGNATURE-----
Merge tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cfis fixes from Steve French:
"Three small smb3 fixes: this addresses two recent issues reported in
additional testing during rc1, a refcount underflow and a problem with
an intermittent crash in SMB2_open_init"
* tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Close cached root handle only if it has a lease
SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding path
smb3: fix refcount underflow warning on unmount when no directory leases
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXfNhGQAKCRDh3BK/laaZ
PGSEAP9Nyv3XCN2wdqMLdrgn07B3Pk9w2Unf3Y5amKOxNXqyQwEAy2/E6DCiGjSa
WRheJoTgDSeqUQNY6GFHsCIgLWOCHgs=
=WH5O
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix some bugs and documentation"
* tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
docs: filesystems: overlayfs: Fix restview warnings
docs: filesystems: overlayfs: Rename overlayfs.txt to .rst
ovl: relax WARN_ON() on rename to self
ovl: fix corner case of non-unique st_dev;st_ino
ovl: don't use a temp buf for encoding real fh
ovl: make sure that real fid is 32bit aligned in memory
ovl: fix lookup failure on multi lower squashfs
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y5kIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv0MEADY9LOC2JkG2aNP53gCXGu74rFQNstJfusr
xPpkMs5I7hxoeVp/bkVAvR6BbpfPDsyGMhSMURELgMFzkuw+03z2IbxVuexdGOEu
X1+6EkunAq/r331fXL+cXUKJM3ThlkUBbNTuV8z5oWWSM1EfoBAWYfh2u4L4oBcc
yIHRH8by9qipx+fhBWPU/ZWeCcMVt5Ju2b6PHAE/GWwteZh00PsTwq0oqQcVcm/5
4Ennr0OELb7LaZWblAIIZe96KeJuCPzGfbRV/y12Ne/t6STH/sWA1tMUzgT22Eju
27uXtwAJtoLsep4V66a4VYObs/hXmEP281wIsXEnIkX0YCvbAiM+J6qVHGjYSMGf
mRrzfF6nDC1vaMduE4waoO3VFiDFQU/qfQRa21ZP50dfQOAXTpEz8kJNG/PjN1VB
gmz9xsujDU1QPD7IRiTreiPPHE5AocUzlUYuYOIMt7VSbFZIMHW5S/pML93avkJt
nx6g3gOP0JKkjaCEvIKY4VLljDT8eDZ/WScDsSedSbPZikMzEo8DSx4U6nbGPqzy
qSljgQniDcrH8GdRaJFDXgLkaB8pu83NH7zUH+xioUZjAHq/XEzKQFYqysf1DbtU
SEuSnUnLXBvbwfb3Z2VpQ/oz8G3a0nD9M7oudt1sN19oTCJKxYSbOUgNUrj9JsyQ
QlAVrPKRkQ==
=fbsf
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that
don't sever if the result is < 0.
This is mostly for linked timeouts, where if we ask for a pure
timeout we always get -ETIME. This makes links useless for that case,
hence allow a case where it works.
- Five minor optimizations to fix and improve cases that regressed
since v5.4.
- An SQTHREAD locking fix.
- A sendmsg/recvmsg iov assignment fix.
- Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and
subsequently ensuring that works for io_uring.
- Fix a case where for an invalid opcode we might return -EBADF instead
of -EINVAL, if the ->fd of that sqe was set to an invalid fd value.
* tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block:
io_uring: ensure we return -EINVAL on unknown opcode
io_uring: add sockets to list of files that support non-blocking issue
net: make socket read/write_iter() honor IOCB_NOWAIT
io_uring: only hash regular files for async work execution
io_uring: run next sqe inline if possible
io_uring: don't dynamically allocate poll data
io_uring: deferred send/recvmsg should assign iov
io_uring: sqthread should grab ctx->uring_lock for submissions
io-wq: briefly spin for new work after finishing work
io-wq: remove worker->wait waitqueue
io_uring: allow unbreakable links
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j
j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O
X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz
urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw
RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP
8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat
1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN
UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz
ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd
5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH
aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt
y6FM3lApOjk3OyaTJQ==
=YU4A
-----END PGP SIGNATURE-----
Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull FIELD_SIZEOF conversion from Kees Cook:
"A mostly mechanical treewide conversion from FIELD_SIZEOF() to
sizeof_field(). This avoids the redundancy of having 2 macros
(actually 3) doing the same thing, and consolidates on sizeof_field().
While "field" is not an accurate name, it is the common name used in
the kernel, and doesn't result in any unintended innuendo.
As there are still users of FIELD_SIZEOF() in -next, I will clean up
those during this coming development cycle and send the final old
macro removal patch at that time"
* tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use sizeof_field() macro
MIPS: OCTEON: Replace SIZEOF_FIELD() macro
We log warning if root::orphan_cleanup_state is not set to
ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
mounted as readonly we skip the orphan item cleanup during the lookup
and root::orphan_cleanup_state remains at the init state 0 instead of
ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
warning as below.
WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
Call Trace:
::
_btrfs_ioctl_send+0x7b/0x110 [btrfs]
btrfs_ioctl+0x150a/0x2b00 [btrfs]
::
do_vfs_ioctl+0xa9/0x620
? __fget+0xac/0xe0
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x49/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reproducer:
mkfs.btrfs -fq /dev/sdb
mount /dev/sdb /btrfs
btrfs subvolume create /btrfs/sv1
btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
umount /btrfs
mount -o ro /dev/sdb /btrfs
btrfs send /btrfs/ss1 -f /tmp/f
The warning exists because having orphan inodes could confuse send and
cause it to fail or produce incorrect streams. The two cases that would
cause such send failures, which are already fixed are:
1) Inodes that were unlinked - these are orphanized and remain with a
link count of 0. These caused send operations to fail because it
expected to always find at least one path for an inode. However this
is no longer a problem since send is now able to deal with such
inodes since commit 46b2f4590a ("Btrfs: fix send failure when root
has deleted files still open") and treats them as having been
completely removed (the state after an orphan cleanup is performed).
2) Inodes that were in the process of being truncated. These resulted in
send not knowing about the truncation and potentially issue write
operations full of zeroes for the range from the new file size to the
old file size. This is no longer a problem because we no longer
create orphan items for truncation since commit f7e9e8fc79 ("Btrfs:
stop creating orphan items for truncate").
As such before these commits, the WARN_ON here provided a clue in case
something went wrong. Instead of being a warning against the
root::orphan_cleanup_state value, it could have been more accurate by
checking if there were actually any orphan items, and then issue a
warning only if any exists, but that would be more expensive to check.
Since orphanized inodes no longer cause problems for send, just remove
the warning.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
CC: stable@vger.kernel.org # 4.19+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root. We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached. Fix this by simply skipping
replaying the log. We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Normally when cloning a file range if we find an implicit hole at the end
of the range we assume it is because the NO_HOLES feature is enabled.
However that is not always the case. One well known case [1] is when we
have a power failure after mixing buffered and direct IO writes against
the same file.
In such cases we need to punch a hole in the destination file, and if
the NO_HOLES feature is not enabled, we need to insert explicit file
extent items to represent the hole. After commit 690a5dbfc5
("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
extents"), we started to insert file extent items representing the hole
with an item size of 0, which is invalid and should be 53 bytes (the size
of a btrfs_file_extent_item structure), resulting in all sorts of
corruptions and invalid memory accesses. This is detected by the tree
checker when we attempt to write a leaf to disk.
The problem can be sporadically triggered by test case generic/561 from
fstests. That test case does not exercise power failure and creates a new
filesystem when it starts, so it does not use a filesystem created by any
previous test that tests power failure. However the test does both
buffered and direct IO writes (through fsstress) and it's precisely that
which is creating the implicit holes in files. That happens even before
the commit mentioned earlier. I need to investigate why we get those
implicit holes to check if there is a real problem or not. For now this
change fixes the regression of introducing file extent items with an item
size of 0 bytes.
Fix the issue by calling btrfs_punch_hole_range() without passing a
btrfs_clone_extent_info structure, which ensures file extent items are
inserted to represent the hole with a correct item size. We were passing
a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
which was causing the insertion of file extent items with an item size
of 0.
[1] https://www.spinics.net/lists/linux-btrfs/msg75350.html
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:
1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
tree mod log;
2) The task which got assigned the sequence number 1 calls
btrfs_put_tree_mod_seq();
3) Sequence number 1 is deleted from the list of sequence numbers;
4) The current minimum sequence number is computed to be the sequence
number 2;
5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
a pointer to one of its elements from the red black tree through
a call to tree_mod_log_search();
6) The task with sequence number 1 iterates the red black tree of tree
modification elements and deletes (and frees) all elements with a
sequence number less then or equals to 2 (the computed minimum sequence
number) - it ends up only leaving elements with sequence numbers of 3
and 4;
7) The task with sequence number 2 now uses the pointer to its element,
already freed by the other task, at __tree_mod_log_rewind(), resulting
in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
a trace like the following:
[16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1
[16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[16804.548666] RIP: 0010:rb_next+0x16/0x50
(...)
[16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
[16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
[16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
[16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
[16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
[16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
[16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
[16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
[16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16804.557583] Call Trace:
[16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs]
[16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
[16804.560700] find_parent_nodes+0x388/0x1120 [btrfs]
[16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs]
[16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.563112] ? __might_fault+0x11/0x90
[16804.563706] do_vfs_ioctl+0x45a/0x700
[16804.564299] ksys_ioctl+0x70/0x80
[16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20
[16804.565461] __x64_sys_ioctl+0x16/0x20
[16804.566020] do_syscall_64+0x5c/0x250
[16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[16804.567153] RIP: 0033:0x7f4b1ba2add7
(...)
[16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
[16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
[16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
[16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
[16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
(...)
[16804.575623] ---[ end trace 87317359aad4ba50 ]---
Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having checksum items, either on the checksums tree or in a log tree, that
represent ranges that overlap each other is a sign of a corruption. Such
case confuses the checksum lookup code and can result in not being able to
find checksums or find stale checksums.
So add a check for such case.
This is motivated by a recent fix for a case where a log tree had checksum
items covering ranges that overlap each other due to extent cloning, and
resulted in missing checksums after replaying the log tree. It also helps
detect past issues such as stale and outdated checksums due to overlapping,
commit 27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and
outdated checksums").
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value 0 for devs_max means to spread the allocated chunks over all
available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
4 copies respectively.
Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
Signed-off-by: David Sterba <dsterba@suse.com>
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(),
which returns type size_t, so we need %zu instead of %lu.
This fixes a build warning on 32-bit ARM:
../fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
230 | "invalid item size, have %u expect [%lu, %u)",
| ~~^
| long unsigned int
| %u
Fixes: 153a6d2999 ("btrfs: tree-checker: Check item size before reading file extent type")
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andreas Färber <afaerber@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad. Just lock once if either of the ino's are subvols.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group. If we fail to allocate our ctl we'll just
panic, which is not good. Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem. Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.
Tested with inject-error.py from bcc.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.
Process A
unlink
-> final iput
-> inode_tree_del
-> synchronize_srcu(subvol_srcu)
Process B
btrfs_lookup <- srcu_read_lock() acquired here
-> btrfs_iget
-> find inode that has I_FREEING set
-> __wait_on_freeing_inode()
We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing. However because the free'ing process is doing a
synchronize_srcu() we deadlock.
Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.
A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.
Fixes: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy ioctl")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ]
0 1MB 5MB 6MB
Range to clone: [1MB, 6MB)
Inode B:
extent B1 extent B2 extent B3 extent B4
[ ---------- ] [ --------- ] [ ---------- ] [ ---------- ]
0 1MB 1MB 2MB 2MB 5MB 5MB 6MB
Target range: [1MB, 6MB) (same as source, to make it easier to explain)
The following can happen:
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
1MB;
4) We then attempt to insert a file extent item at inode B with a file
offset of 5MB, which is the value of clone_info->file_offset. This
fails with error -EEXIST because there's already an extent at that
offset (extent B4);
5) We abort the current transaction with -EEXIST and return that error
to user space as well.
Another example, for extent corruption:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ]
0 1MB 11MB 12MB
Inode B:
extent B1 extent B2
[ ----------- ] [ --------- ] [ ----------------------------- ]
0 1MB 1MB 5MB 5MB 12MB
Target range: [1MB, 12MB) (same as source, to make it easier to explain)
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
4MB;
4) We then insert a file extent item at inode B with a file offset of 11MB
which is the value of clone_info->file_offset, and a length of 4MB (the
value of 'clone_len'). So we get 2 extents items with ranges that
overlap and an extent length of 4MB, larger then the extent A2 from
inode A (1MB length);
5) After that we end the transaction, balance the btree dirty pages and
then start another or join the previous transaction. It might happen
that the transaction which inserted the incorrect extent was committed
by another task so we end up with extent corruption if a power failure
happens.
So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The branch of qgroup_rescan_init which is executed from the mount
path prints wrong errors messages. The textual print out in case
BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not
set are transposed. Fix it by exchanging their place.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
SMB2_tdis() checks if a root handle is valid in order to decide
whether it needs to close the handle or not. However if another
thread has reference for the handle, it may end up with putting
the reference twice. The extra reference that we want to put
during the tree disconnect is the reference that has a directory
lease. So, track the fact that we have a directory lease and
close the handle only in that case.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ran into an intermittent crash in
SMB2_open_init+0x2f6/0x970
due to oparms.cifs_sb not being initialized when called from:
smb2_compound_op+0x45d/0x1690
Zero the whole oparms struct in the compounding path before setting up the
oparms so we don't risk any uninitialized fields.
Fixes: fdef665ba4 ("smb3: fix mode passed in on create for modetosid mount option")
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
from Xiubo, a patch to add some observability into cap waiters from
Jeff and a couple of cleanups.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3yiJMTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5k9CACmM3fJGrTUuOLgXAxxllCfiV6UQLoY
nuTo/bx0DmG603n+Ze8+Z0iz7hDc1Gw2XUeLkJcAE/xSetgZXO/MvJ0Ionq5Ac/k
CrqS6ucIa1bPxbE1QMTHswHjkajKwBpAZ5+khdLNLuXJxy3c9HDCGOT4VZav7Yc9
99W4kIdzOKdYLpZHAedMK97IJIrD5WhYTAFW4rNPY0GL6OPD1V0uiS9v7xUWIxnZ
Uusnu+zY8miQlLVx/V9DyLh/6G5X7XyQO1nkSQcVXZOOG7+qnkq6jDhQW8adgOSZ
wUFigTxxhSTIcntWg01TaCRNoi1N3/P8Z9/rD27zBHPbl93ANH+lUkCh
=NicF
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A fix to avoid a corner case when scheduling cap reclaim in batches
from Xiubo, a patch to add some observability into cap waiters from
Jeff and a couple of cleanups"
* tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client:
ceph: add more debug info when decoding mdsmap
ceph: switch to global cap helper
ceph: trigger the reclaim work once there has enough pending caps
ceph: show tasks waiting on caps in debugfs caps file
ceph: convert int fields in ceph_mount_options to unsigned int
ksys_dup() is used only at one place in the kernel, namely to duplicate
fd 0 of /dev/console to stdout and stderr. The same functionality can be
achieved by using functions already available within the kernel namespace.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
In prepare_namespace(), do_mount() can be used instead of ksys_mount()
as the first and third argument are const strings in the kernel, the
second and fourth argument are passed through anyway, and the fifth
argument is NULL.
In do_mount_root(), ksys_mount() is called with the first and third
argument being already kernelspace strings, which do not need to be
copied over from userspace to kernelspace (again). The second and
fourth arguments are passed through to do_mount() anyway. The fifth
argument, while already residing in kernelspace, needs to be put into
a page of its own. Then, do_mount() can be used instead of
ksys_mount().
Once this is done, there are no in-kernel users to ksys_mount() left,
which can therefore be removed.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3xW+cACgkQ+7dXa6fL
C2t+UQ/9ETwAYJ2XgfwatNhAHX0+Tf4XMlgEDybf5a+ERRLbkSQEqRjTImopk5m4
T7yKea0ctsVb1avdUVdayh2KuJhjmO627u3w48kd/uf2ut4IyOyaraatYKYOJ+XL
upr8WxAMXUHgnwsb38avjH0dwq7+eDyX8FNLHiDY4Sk3mTAVHbafL6V0ujp0ms2o
VmHLX6ihwECehUqrNEyy1pLX2nuFmd+MeLnoi/EWLa47/X0te21G8u8UdPWHGgHn
cZp8kqWSEoaeChO7x6XOoJZCY6N/7o+hogsrU/N5YrB6FXPpFWGdwFpej3YcyFso
QqffyXX5jVAyUg9I/v3WCrtDmTQ9xVG4kxhmBMVId6bBk3ZVpGaHuLE3ENLbr1w/
Z1k26BI41nJwrqV0C2bPoMalpDprP2WIuWsBIpYTqaICpc53KJ72c6mLhcDEt+oc
YCSd9X0Bv5O7tZS9rLI8mt0JtCr9Uw+VfnK+uOuUTPv2YcqQwK1/xZ0/9hd2P8XS
L5GKcEeuk3indgmefjwsz5YJcmzeCttaOQTlmaN/Hz3MbkK2CC+spMG4OZWzUXL4
axZmCIo/kKrv64bLyVZul6BjkhawyvfWzCs2NM6Ni63akW3Yb0S2WRdgJhvGILml
48YgqcG7R1y8LRPYiJzcuAPnbumuu6ZSxVuyZN0FU0qc4lt5mPM=
=ldOp
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fixes for AFS plus one patch to make debugging easier:
- Fix how addresses are matched to server records. This is currently
incorrect which means cache invalidation callbacks from the server
don't necessarily get delivered correctly. This causes stale data
and metadata to be seen under some circumstances.
- Make the dynamic root superblock R/W so that rpm/dnf can reapply
the SELinux label to it when upgrading the Fedora filesystem-afs
package. If the filesystem is R/O, this fails and the upgrade
fails.
It might be better in future to allow setxattr from an LSM to
bypass the R/O protections, if only for pseudo-filesystems.
- Fix the parsing of mountpoint strings. The mountpoint object has to
have a terminal dot, whereas the source/device string passed to
mount should not. This confuses type-forcing suffix detection
leading to the wrong volume variant being mounted.
- Make lookups in the dynamic root superblock for creation events
(such as mkdir) fail with EOPNOTSUPP rather than something like
EEXIST. The dynamic root only allows implicit creation by the
->lookup() method - and only if the target cell exists.
- Fix the looking up of an AFS superblock to include the cell in the
matching key - otherwise all volumes with the same ID number are
treated as the same thing, irrespective of which cell they're in.
- Show the volume name of each volume in the volume records displayed
in /proc/net/afs/<cell>/volumes. This proved useful in debugging as
it provides a way to map the volume IDs to names, where the names
are what appear in /proc/mounts"
* tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Show volume name in /proc/net/afs/<cell>/volumes
afs: Fix missing cell comparison in afs_test_super()
afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP
afs: Fix mountpoint parsing
afs: Fix SELinux setting security label on /afs
afs: Fix afs_find_server lookups for ipv4 peers
If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.
Change io_op_needs_file() to have the following return values:
0 - does not need a file
1 - does need a file
< 0 - error value
and use this to pass back the right value for this invalid case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
generic/522 (fsx) occasionally fails with a file corruption due to
an insert range operation. The primary characteristic of the
corruption is a misplaced insert range operation that differs from
the requested target offset. The reason for this behavior is a race
between the extent shift sequence of an insert range and a COW
writeback completion that causes a front merge with the first extent
in the shift.
The shift preparation function flushes and unmaps from the target
offset of the operation to the end of the file to ensure no
modifications can be made and page cache is invalidated before file
data is shifted. An insert range operation then splits the extent at
the target offset, if necessary, and begins to shift the start
offset of each extent starting from the end of the file to the start
offset. The shift sequence operates at extent level and so depends
on the preparation sequence to guarantee no changes can be made to
the target range during the shift. If the block immediately prior to
the target offset was dirty and shared, however, it can undergo
writeback and move from the COW fork to the data fork at any point
during the shift. If the block is contiguous with the block at the
start offset of the insert range, it can front merge and alter the
start offset of the extent. Once the shift sequence reaches the
target offset, it shifts based on the latest start offset and
silently changes the target offset of the operation and corrupts the
file.
To address this problem, update the shift preparation code to
stabilize the start boundary along with the full range of the
insert. Also update the existing corruption check to fail if any
extent is shifted with a start offset behind the target offset of
the insert range. This prevents insert from racing with COW
writeback completion and fails loudly in the event of an unexpected
extent shift.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Fix improper return value of listxattr() with no xattr;
- Keep up documentation with latest code.
-----BEGIN PGP SIGNATURE-----
iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXfELlBYcZ2FveGlhbmcy
NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEtUABAN164UwGU9QKEsqgZQcmbz23qXSJ
QDR8r/ch2LxzXKkVAQDXCNU+ol6jkiapLcTvsXEjBk8sUxsCEVnmZ36jru+TBA==
=kRp9
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Mainly address a regression reported by David recently observed
together with overlayfs due to the improper return value of
listxattr() without xattr. Update outdated expressions in document as
well.
Summary:
- Fix improper return value of listxattr() with no xattr
- Keep up documentation with latest code"
* tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: update documentation
erofs: zero out when listxattr is called with no xattr
There's no need to separately check for signals while inside the locked
region, since we're going to do "wait_event_interruptible()" right
afterwards anyway, and the error handling is much simpler there.
The check for whether we had already read anything was also redundant,
since we no longer do the odd merging of reads when there are pending
writers.
But perhaps more importantly, this adds commentary about why we still
need to wake up possible writers even though we didn't read any data,
and why we can skip all the finishing touches now if we get a signal (or
had a signal pending) while waiting for more data.
[ This is a split-out cleanup from my "make pipe IO use exclusive wait
queues" thing, which I can't apply because it triggers a nasty bug in
the GNU make jobserver - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Show the name of each volume in /proc/net/afs/<cell>/volumes to make it
easier to work out the name corresponding to a volume ID. This makes it
easier to work out which mounts in /proc/mounts correspond to which volume
ID.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Fix missing cell comparison in afs_test_super(). Without this, any pair
volumes that have the same volume ID will share a superblock, no matter the
cell, unless they're in different network namespaces.
Normally, most users will only deal with a single cell and so they won't
see this. Even if they do look into a second cell, they won't see a
problem unless they happen to hit a volume with the same ID as one they've
already got mounted.
Before the patch:
# ls /afs/grand.central.org/archive
linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/
# ls /afs/kth.se/
linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/
# cat /proc/mounts | grep afs
none /afs afs rw,relatime,dyn,autocell 0 0
#grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0
#grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0
#grand.central.org:root.archive /afs/kth.se afs ro,relatime 0 0
After the patch:
# ls /afs/grand.central.org/archive
linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/
# ls /afs/kth.se/
admin/ common/ install/ OldFiles/ service/ system/
bakrestores/ home/ misc/ pkg/ src/ wsadmin/
# cat /proc/mounts | grep afs
none /afs afs rw,relatime,dyn,autocell 0 0
#grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0
#grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0
#kth.se:root.cell /afs/kth.se afs ro,relatime 0 0
Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Carsten Jacobi <jacobi@de.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
cc: Todd DeSantis <atd@us.ibm.com>
Fix the lookup method on the dynamic root directory such that creation
calls, such as mkdir, open(O_CREAT), symlink, etc. fail with EOPNOTSUPP
rather than failing with some odd error (such as EEXIST).
lookup() itself tries to create automount directories when it is invoked.
These are cached locally in RAM and not committed to storage.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
Each AFS mountpoint has strings that define the target to be mounted. This
is required to end in a dot that is supposed to be stripped off. The
string can include suffixes of ".readonly" or ".backup" - which are
supposed to come before the terminal dot. To add to the confusion, the "fs
lsmount" afs utility does not show the terminal dot when displaying the
string.
The kernel mount source string parser, however, assumes that the terminal
dot marks the suffix and that the suffix is always "" and is thus ignored.
In most cases, there is no suffix and this is not a problem - but if there
is a suffix, it is lost and this affects the ability to mount the correct
volume.
The command line mount command, on the other hand, is expected not to
include a terminal dot - so the problem doesn't arise there.
Fix this by making sure that the dot exists and then stripping it when
passing the string to the mount configuration.
Fixes: bec5eb6141 ("AFS: Implement an autocell mount capability [ver #2]")
Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
The value being used for guest_nice should be CPUTIME_GUEST_NICE
and not CPUTIME_USER.
Fixes: 26dae145a7 ("procfs: Use all-in-one vtime aware kcpustat accessor")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191205020344.14940-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In chasing a performance issue between using IORING_OP_RECVMSG and
IORING_OP_READV on sockets, tracing showed that we always punt the
socket reads to async offload. This is due to io_file_supports_async()
not checking for S_ISSOCK on the inode. Since sockets supports the
O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list
of file types that we can do a non-blocking issue to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hash regular files to avoid having multiple threads hammer on the
inode mutex, but it should not be needed on other types of files
(like sockets).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One major use case of linked commands is the ability to run the next
link inline, if at all possible. This is done correctly for async
offload, but somewhere along the line we lost the ability to do so when
we were able to complete a request without having to punt it. Ensure
that we do so correctly.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This essentially reverts commit e944475e69. For high poll ops
workloads, like TAO, the dynamic allocation of the wait_queue
entry for IORING_OP_POLL_ADD adds considerable extra overhead.
Go back to embedding the wait_queue_entry, but keep the usage of
wait->private for the pointer stashing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't just assign it from the main call path, that can miss the case
when we're called from issue deferral.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use the mutex to guard against registered file updates, for instance.
Ensure we're safe in accessing that state against concurrent updates.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To avoid going to sleep only to get woken shortly thereafter, spin
briefly for new work upon completion of work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only have one cases of using the waitqueue to wake the worker, the
rest are using wake_up_process(). Since we can save some cycles not
fiddling with the waitqueue io_wqe_worker(), switch the work activation
to task wakeup and get rid of the now unused wait_queue_head_t in
struct io_worker.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.
For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.
Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ovl_rename(), if new upper is hardlinked to old upper underneath
overlayfs before upper dirs are locked, user will get an ESTALE error
and a WARN_ON will be printed.
Changes to underlying layers while overlayfs is mounted may result in
unexpected behavior, but it shouldn't crash the kernel and it shouldn't
trigger WARN_ON() either, so relax this WARN_ON().
Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com
Fixes: 804032fabb ("ovl: don't check rename to self")
Cc: <stable@vger.kernel.org> # v4.9+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On non-samefs overlay without xino, non pure upper inodes should use a
pseudo_dev assigned to each unique lower fs and pure upper inodes use the
real upper st_dev.
It is fine for an overlay pure upper inode to use the same st_dev;st_ino
values as the real upper inode, because the content of those two different
filesystem objects is always the same.
In this case, however:
- two filesystems, A and B
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B
Non pure upper overlay inode, whose origin is in layer 1 will have the same
st_dev;st_ino values as the real lower inode. This may result with a false
positive results of 'diff' between the real lower and copied up overlay
inode.
Fix this by using the upper st_dev;st_ino values in this case. This breaks
the property of constant st_dev;st_ino across copy up of this case. This
breakage will be fixed by a later patch.
Fixes: 5148626b80 ("ovl: allocate anon bdev per unique lower fs")
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We can allocate maximum fh size and encode into it directly.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Seprate on-disk encoding from in-memory and on-wire resresentation
of overlay file handle.
In-memory and on-wire we only ever pass around pointers to struct
ovl_fh, which encapsulates at offset 3 the on-disk format struct
ovl_fb. struct ovl_fb encapsulates at offset 21 the real file handle.
That makes sure that the real file handle is always 32bit aligned
in-memory when passed down to the underlying filesystem.
On-disk format remains the same and store/load are done into
correctly aligned buffer.
New nfs exported file handles are exported with aligned real fid.
Old nfs file handles are copied to an aligned buffer before being
decoded.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In the past, overlayfs required that lower fs have non null uuid in
order to support nfs export and decode copy up origin file handles.
Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of
lower fs") relaxed this requirement for nfs export support, as long
as uuid (even if null) is unique among all lower fs.
However, said commit unintentionally also relaxed the non null uuid
requirement for decoding copy up origin file handles, regardless of
the unique uuid requirement.
Amend this mistake by disabling decoding of copy up origin file handle
from lower fs with a conflicting uuid.
We still encode copy up origin file handles from those fs, because
file handles like those already exist in the wild and because they
might provide useful information in the future.
There is an unhandled corner case described by Miklos this way:
- two filesystems, A and B, both have null uuid
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B
In this case bad_uuid won't be set for B, because the check only
involves the list of lower fs. Hence we'll try to decode a layer 2
origin on layer 1 and fail.
We will deal with this corner case later.
Reported-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/lkml/20191106234301.283006-1-colin.king@canonical.com/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid ...")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Show the laggy state.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__ceph_is_any_caps is a duplicate helper.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The nr in ceph_reclaim_caps_nr() is very possibly larger than 1,
so we may miss it and the reclaim work couldn't triggered as expected.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add some visibility of tasks that are waiting for caps to the "caps"
debugfs file. Display the tgid of the waiting task, inode number, and
the caps the task needs and wants.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Most of these values should never be negative, so convert them to
unsigned values. Add some sanity checking to the parsed values, and
clean up some unneeded casts.
Note that while caps_max should never be negative, this patch leaves
it signed, since this value ends up later being compared to a signed
counter. Just ensure that userland never passes in a negative value
for caps_max.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().
This patch is generated using following script:
EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"
git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do
if [[ "$file" =~ $EXCLUDE_FILES ]]; then
continue
fi
sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done
Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.
Original error:
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Iurii Zaikin <yzaikin@google.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Because the BLAKE2B code went through a different tree, it was not
available at the time the btrfs part was merged. Now that the Kconfig
symbol exists, add it to the list.
Signed-off-by: David Sterba <dsterba@suse.com>
Make the AFS dynamic root superblock R/W so that SELinux can set the
security label on it. Without this, upgrades to, say, the Fedora
filesystem-afs RPM fail if afs is mounted on it because the SELinux label
can't be (re-)applied.
It might be better to make it possible to bypass the R/O check for LSM
label application through setxattr.
Fixes: 4d673da145 ("afs: Support the AFS dynamic root")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: selinux@vger.kernel.org
cc: linux-security-module@vger.kernel.org
afs_find_server tries to find a server that has an address that
matches the transport address of an rxrpc peer. The code assumes
that the transport address is always ipv6, with ipv4 represented
as ipv4 mapped addresses, but that's not the case. If the transport
family is AF_INET, srx->transport.sin6.sin6_addr.s6_addr32[] will
be beyond the actual ipv4 address and will always be 0, and all
ipv4 addresses will be seen as matching.
As a result, the first ipv4 address seen on any server will be
considered a match, and the server returned may be the wrong one.
One of the consequences is that callbacks received over ipv4 will
only be correctly applied for the server that happens to have the
first ipv4 address on the fs_addresses4 list. Callbacks over ipv4
from all other servers are dropped, causing the client to serve stale
data.
This is fixed by looking at the transport family, and comparing ipv4
addresses based on a sockaddr_in structure rather than a sockaddr_in6.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3sPUUACgkQiiy9cAdy
T1GumAwAhh0Fk2uEV01REMgA6MgQ2hrdGE5HariSTzGifCk8cxMnq1H1u9yxtic8
uvEJQaUmTLWrN2C+xqD2JqPmJyrPOtnL0PLCLQk2/RsPCsDgYnmdKoAehInPh17g
J8MoKPp1/1wYhbOl7CeF0xo2rEchoh/PcPCXpt8qj+M+kBgQkI64UQ/6iY/mV9Zl
n7WJJFDyz3D1+SaJPaVxMpNxZcMpFbGqVJYTWP4v3pL2E8wEhyWjAryLCJAFFGf7
Y2FwOSFuifMN/qC9t83W5KkRT9I/zRQ2g5qK1tC24LiTjQ3cqkCy1SSqpKQyvKwz
P/oRX0HsuIbr1KFzN55kg831m/V7/1B/5bf9AivfhjsAoSyp2yyVQgPeV+nQkO0r
iQdNatohC9HlwXmrypS+GhLXnj8xLnCR4+Aj7hGSuiVLHnCOfnGjQxI40BFWaBli
1RG9agkploMYvcjcgSgDGVFFWTeHgSQKI1DQTL2Nx4py1zj7Rv/kEgwkZ3zdEf9h
PPl37hBM
=gey9
-----END PGP SIGNATURE-----
Merge tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Nine cifs/smb3 fixes:
- one fix for stable (oops during oplock break)
- two timestamp fixes including important one for updating mtime at
close to avoid stale metadata caching issue on dirty files (also
improves perf by using SMB2_CLOSE_FLAG_POSTQUERY_ATTRIB over the
wire)
- two fixes for "modefromsid" mount option for file create (now
allows mode bits to be set more atomically and accurately on create
by adding "sd_context" on create when modefromsid specified on
mount)
- two fixes for multichannel found in testing this week against
different servers
- two small cleanup patches"
* tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: improve check for when we send the security descriptor context on create
smb3: fix mode passed in on create for modetosid mount option
cifs: fix possible uninitialized access and race on iface_list
cifs: Fix lookup of SMB connections on multichannel
smb3: query attributes on file close
smb3: remove unused flag passed into close functions
cifs: remove redundant assignment to pointer pneg_ctxt
fs: cifs: Fix atime update check vs mtime
CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
Pull misc vfs cleanups from Al Viro:
"No common topic, just three cleanups".
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make __d_alloc() static
fs/namespace: add __user to open_tree and move_mount syscalls
fs/fnctl: fix missing __user in fcntl_rw_hint()
- Fix a UAF when reporting writeback errors
- Fix a race condition when handling page uptodate on a blocksize <
pagesize file that is also fragmented
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3pJVwACgkQ+H93GTRK
tOue8RAAklPs+kV4l1JQudPkd/lXIPCqyRKBEjNDQRfj27v5s7h/3fMqMsV6/RIs
zTkrRc/N5Uu6SMnvMIHvTx1nAk6ZnTiN3gUHenoUQgQuxZ4jtGhIvHmyiOGgI27j
Xd2g+aE8wdhZDnD2b+4AsnJmtvffX0g3DNDZ4KJn1d/Ejs2qQIHgeaoyj56Fz8vh
4dqQxSg5RvL25k3qqHSVogsfxSRJxvYJefcARUL25a58UzE6DLJqeb3v41RVLc4l
lDOX/o4lavqo1lmptWjsTn4bqqyeRfNWp2M2EGE4a411aEaNTkFB7xzQ9Y1v04kK
VHq6XK3vieAv6VOuqZ3ySx6ihQcb9dQWXiMkJ9HDVDv4rmaLlrHECM19A5toZN/6
0Xy6xLDqbimckyWDZLBUnkfcoCsI+uWTdNIBxkjEYFdDuL33ovUlKu+Cn63vYzoQ
aCWnNA4NdKOBYps9aKCQV35IU1ODmOYWqkxkpSaVYKApi8Q6+2HUXOf+fXFBiV91
zO0/nWq8RU7fGQxk8YJtMO2E0lGMaMAt03vgp+pMkd0aIo0NWFLV0q3tko5VlWNv
v8BGkphgrOOqfsCFh8wua5x8bIXOBBElOSUjU0wBrAHb0ZXkQ3zi0nUt+sj7RFUX
a+ojaRrXvuyZZArOvQbrHEC0HWchVmvfiyXRAPbxbLi1BiLfPW8=
=+LzI
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
"Fix a race condition and a use-after-free error:
- Fix a UAF when reporting writeback errors
- Fix a race condition when handling page uptodate on fragmented file
with blocksize < pagesize"
* tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: stop using ioend after it's been freed in iomap_finish_ioend()
iomap: fix sub-page uptodate handling
- Fix a crash in the log setup code when log mounting fails
- Fix a hang when allocating space on the realtime device
- Fix a block leak when freeing space on the realtime device
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3n5OkACgkQ+H93GTRK
tOtUvw/+Lxnwx1AaYT7hSMS4p1buJnAV7aM6Ofm+F+GkRISWlxSEYLIjrj3NFxeI
Fv4jnPWMGHtBzW2c4OXv9zhW7vQeuU0Z72sgxA+Wqccf53sfR5Qum+Tya+Bo8X0X
LPeDEuA+k4UUUHvSLqscPFPZYbXgwp3dJ2i7JeuH0vx3BhFeTjV1nlOeo1z3QBzN
LLVjuBg7Zms+TPorrOgS67LAWfzqCAiQDauvyODB5EW+UElK6pvpOklFzc7TUPr2
PIvmjEE8UNN2y9QuEEKJ26t43ejAzG016yGMxyW74i5wU33R7PHkdgG1pWwWRZzr
yvNClg2CMtLu+Bcx8Lc9X23mW0DqPkUYDohWCe/Tytvz5kaKCEq3WlqwaG5EdMOg
gSJlivpoOTduQ26V4fPvD1/fjGpJLWyfHPs9p43wie+K/NuUcTIvr4BGGQszjS/n
5Zr630g6Tq5VrBMl0f1P2NuEbeQEvmbWNTW2TIvvHZTgMd8mZdvX28IXj0dAhBb/
2U5o1NF8F6VeRGYoZFJI70RIfiLYzOQEmsA5hAyJfUQQ18u8zDJuPV4isO4/XjQf
d32E36cvP3CKVQYn7hAiMD5O8jOFckYL287qd4uDYutmleHEcfzc0H4pTW+66IHp
IuzkPCgvOkd4h4qnhtHSDoSlKd7kc1Ai3hBhCdA9zd6IUAVsZ6Y=
=Ps4B
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Fix a couple of resource management errors and a hang:
- fix a crash in the log setup code when log mounting fails
- fix a hang when allocating space on the realtime device
- fix a block leak when freeing space on the realtime device"
* tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix mount failure crash on invalid iclog memory access
xfs: don't check for AG deadlock for realtime files in bunmapi
xfs: fix realtime file data space leak
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.
The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.
When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.
The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by
using UID 0.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJd6qoUAAoJEM9EDqnrzg2+SHQQAI8osluG1xDle0Ur0y/XQrWR
z/1+mA/ZJpkaC2KPJ3F/B93ZR7TSSb6xB/u/EoxfqVQDoVpodP3PzcvSosRsePOk
OYo67xit7YRcg2nQF5kEjR+wYbW/T1j55oQzrWxLvYr+FhlDZLJyn0xuSaGvvuKQ
kWqNwPQpIZwNR1ZJ6Yjif86kR4sWF5htoy976x5ScvoeOb08dNHQn2je5oXH/eKH
zwWBVYTeZTAIVCs9YV2UM4gi5/0pysjSL58jP7+ckLj79ozBoyhc9cRB4ez0cFyc
4+4dW9zZ1GAfvmbsFzvCfKb2Syz4JkStGJQGST+cgH9ldp70R8AdRjzYfZGXa2af
9I/jRgrVBsU/jo++a1npMy2j44+2GvhoValzKePwiCGTOB/f80XsmB9p9qci8JCv
ucVzJwbhjxPKphUpnW8Gg7F2gWr2ULhv+wKRmAb3tF+bIFPjn7KjyzFfUAS3FY1s
iwgci0Mw9NLLlvX511N0wiUGo6V9A9r7XsZQjScmm/3ybUhMyJAYoe81OO60Xwnv
2s+V0Tv9ah4b+EF0J0qtQ7GzsoKDBu+ZWqGieiOXDWTVixY2gV6CetnR7veeSeQh
s9OeqY8qaSYiV9KtBNZp56IS4PuADDgxnRB1pXTUUPgapuElEtvYC1BUovidMMmh
kLQEpYdSGrkLRah4hKsg
=9AOz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs update from Mike Marshall:
"orangefs: posix open permission checking...
Orangefs has no open, and orangefs checks file permissions on each
file access. Posix requires that file permissions be checked on open
and nowhere else. Orangefs-through-the-kernel needs to seem posix
compliant.
The VFS opens files, even if the filesystem provides no method. We can
see if a file was successfully opened for read and or for write by
looking at file->f_mode.
When writes are flowing from the page cache, file is no longer
available. We can trust the VFS to have checked file->f_mode before
writing to the page cache.
The mode of a file might change between when it is opened and IO
commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by using
UID 0"
[ This is "posixish", but not a great solution in the long run, since a
proper secure network server shouldn't really trust the client like this.
But proper and secure POSIX behavior requires an open method and a
resulting cookie for IO of some kind, or similar. - Linus ]
* tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: posix open permission checking...
Possibly most interesting is Trond's fixes for some callback races that
were due to my incomplete understanding of rpc client shutdown.
Unfortunately at the last minute I've started noticing a new
intermittent failure to send callbacks. As the logic seems basically
correct, I'm leaving Trond's patches in for now, and hope to find a fix
in the next week so I don't have to revert those patches.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl3r3AAVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rjkP/3L6DZs0Uv0BYbGq5Gmit0uoPSQk
8BT7oQhbagCh+ULRYWCnK6cz82wejR4Gzq4PLyl5x5Vcc5x+bLoPI9YgiRZlIbZu
ZvSg93E6SITLfq5xRlDC0MlIVZkI+HoIfyYgv1aYiWvQ3834bcx4DxVm9h7cNpT3
x37anEFi1lv3n9fct3obOrs3AvCS76XyA6VVhcSLJ77amKQ+O7LI0crqUc6cuX2i
CkTwTSDwyCrzkx3dZ2xDPDTbLecxw+Ce4adaby5v3GEQo3TOCmEWX92D3dvzfMmv
ICU07FsVOILnIT/fmC91b1+JWVRLjUUBw5EPmDduwSP/yw4YnIEODFEP/wAUAmMJ
vJ9hi9c1rThQ9n8h08RIwA2snhnpXRxKCWhpIRY6WM8DhHL9Y9AuVPYTKxhQOjPK
l3wbOGcMW63NrTOPHHN7hTB0vDLgPKIXYVIrMvZTd/P7CghDDEbhT1gDvx/IL3Uq
WrHKbJtK7rbx9i2bh5f6fH0DRrv7lxbD0ffunRRa3twPAe6zsG9WPjsbZZraZzEg
O7/o3wZu2N7MpL5bXPfzB+5ylOTxvNWew07NJjA4BIOfwin3bw/71YfB0Vnoairv
PhmbN2Dj4/t82ld0JU5GJWojpUfH4ARXM2Li9WO99wzx+KrxScsqGPnRMFe9dC7b
Q7ltP1p0gUbkJ88Z
=b2zA
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"This is a relatively quiet cycle for nfsd, mainly various bugfixes.
Possibly most interesting is Trond's fixes for some callback races
that were due to my incomplete understanding of rpc client shutdown.
Unfortunately at the last minute I've started noticing a new
intermittent failure to send callbacks. As the logic seems basically
correct, I'm leaving Trond's patches in for now, and hope to find a
fix in the next week so I don't have to revert those patches"
* tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
nfsd: depend on CRYPTO_MD5 for legacy client tracking
NFSD fixing possible null pointer derefering in copy offload
nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
nfsd: Ensure CLONE persists data and metadata changes to the target file
SUNRPC: Fix backchannel latency metrics
nfsd: restore NFSv3 ACL support
nfsd: v4 support requires CRYPTO_SHA256
nfsd: Fix cld_net->cn_tfm initialization
lockd: remove __KERNEL__ ifdefs
sunrpc: remove __KERNEL__ ifdefs
race in exportfs_decode_fh()
nfsd: Drop LIST_HEAD where the variable it declares is never used.
nfsd: document callback_wq serialization of callback code
nfsd: mark cb path down on unknown errors
nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
nfsd: minor 4.1 callback cleanup
SUNRPC: Fix svcauth_gss_proxy_init()
SUNRPC: Trace gssproxy upcall results
sunrpc: fix crash when cache_head become valid before update
nfsd: remove private bin2hex implementation
...
Highlights include:
Features:
- NFSv4.2 now supports cross device offloaded copy (i.e. offloaded copy
of a file from one source server to a different target server).
- New RDMA tracepoints for debugging congestion control and Local Invalidate
WRs.
Bugfixes and cleanups
- Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
layoutreturn
- Handle bad/dead sessions correctly in nfs41_sequence_process()
- Various bugfixes to the delegation return operation.
- Various bugfixes pertaining to delegations that have been revoked.
- Cleanups to the NFS timespec code to avoid unnecessary conversions
between timespec and timespec64.
- Fix unstable RDMA connections after a reconnect
- Close race between waking an RDMA sender and posting a receive
- Wake pending RDMA tasks if connection fails
- Fix MR list corruption, and clean up MR usage
- Fix another RPCSEC_GSS issue with MIC buffer space
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl3qp8QACgkQZwvnipYK
APIZfBAAuFhLUA2Ua9OQOPDJDkQ1IFDBfYGG48aqVu3GIXS9LkEvTavLm/P9ocm+
ijGsUv2iw4x9H4S7OGuzLQm5zmTNsQAlPXD+3+xQS7cjPjh5HCyIAEgpov+JEGae
CeZoSvhtdBd0xB71t2zAKEdHkqc47Jxz3Db0FX22zTTnDvdhArfggisZUt4Xq5Qb
cPcs8R1E5yBZqJFHKObOUP4itVYsXte/VFhtWpjRFqzaZ/t7xNpPVOBH8cli7aI9
E6DqdbIjUreyn62FVWYIeGhwvsKdxv+Slc5ZOEbD45jUryovyCAZxhqDmcAg/0q0
uykplL0cv8MeiZ68wmlxdir/n36hWduiGqa0UKMg2+BAbdudGKJ7xPhkGYP2uZqo
zoZGjd+Hl8AunMBUaT7YAxWOzuIXeMP338szTL6sSBPxT75WmmNJAh3J4b22G7Bl
eGrcJcckDBnvfRCia40l8g9NLHmVKqS9qNKxSWMlMlBmwd1HE0oEE1ddCx9bGHKe
srf0S14RPQBRF6r+Nv0cx5S+CiptDtGiILR+cn5ZDra5YYCPX5kkJ6VEqw/m4yNE
AKjjj5gim+jWYdBOTMU3u5KNNqFx37xnOCdC+5DvhMNWRHf2O/I5JSKtuKaZht+5
PEuwcYfQvaZGp3fCEh38zzOX2qWUhRMbXUqSv5F0DbuWK7OAABQ=
=VZFk
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.2 now supports cross device offloaded copy (i.e. offloaded
copy of a file from one source server to a different target
server).
- New RDMA tracepoints for debugging congestion control and Local
Invalidate WRs.
Bugfixes and cleanups
- Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
layoutreturn
- Handle bad/dead sessions correctly in nfs41_sequence_process()
- Various bugfixes to the delegation return operation.
- Various bugfixes pertaining to delegations that have been revoked.
- Cleanups to the NFS timespec code to avoid unnecessary conversions
between timespec and timespec64.
- Fix unstable RDMA connections after a reconnect
- Close race between waking an RDMA sender and posting a receive
- Wake pending RDMA tasks if connection fails
- Fix MR list corruption, and clean up MR usage
- Fix another RPCSEC_GSS issue with MIC buffer space"
* tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
SUNRPC: Capture completion of all RPC tasks
SUNRPC: Fix another issue with MIC buffer space
NFS4: Trace lock reclaims
NFS4: Trace state recovery operation
NFSv4.2 fix memory leak in nfs42_ssc_open
NFSv4.2 fix kfree in __nfs42_copy_file_range
NFS: remove duplicated include from nfs4file.c
NFSv4: Make _nfs42_proc_copy_notify() static
NFS: Fallocate should use the nfs4_fattr_bitmap
NFS: Return -ETXTBSY when attempting to write to a swapfile
fs: nfs: sysfs: Remove NULL check before kfree
NFS: remove unneeded semicolon
NFSv4: add declaration of current_stateid
NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list
SUNRPC: Avoid RPC delays when exiting suspend
NFS: Add a tracepoint in nfs_fh_to_dentry()
NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
...
We had cases in the previous patch where we were sending the security
descriptor context on SMB3 open (file create) in cases when we hadn't
mounted with with "modefromsid" mount option.
Add check for that mount flag before calling ad_sd_context in
open init.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
pipe_wait() may be simple, but since it relies on the pipe lock, it
means that we have to do the wakeup while holding the lock. That's
unfortunate, because the very first thing the waked entity will want to
do is to get the pipe lock for itself.
So get rid of the pipe_wait() usage by simply releasing the pipe lock,
doing the wakeup (if required) and then using wait_event_interruptible()
to wait on the right condition instead.
wait_event_interruptible() handles races on its own by comparing the
wakeup condition before and after adding itself to the wait queue, so
you can use an optimistic unlocked condition for it.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This code is ancient, and goes back to when we only had a single page
for the pipe buffers. The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).
At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.
However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense. And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").
But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the read side version of the previous commit: it simplifies the
logic to only wake up waiting writers when necessary, and makes sure to
use a synchronous wakeup. This time not so much for GNU make jobserver
reasons (that pipe never fills up), but simply to get the writer going
quickly again.
A bit less verbose commentary this time, if only because I assume that
the write side commentary isn't going to be ignored if you touch this
code.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pipe rework ends up having been extra painful, partly becaused of
actual bugs with ordering and caching of the pipe state, but also
because of subtle performance issues.
In particular, the pipe rework caused the kernel build to inexplicably
slow down.
The reason turns out to be that the GNU make jobserver (which limits the
parallelism of the build) uses a pipe to implement a "token" system: a
parallel submake will read a character from the pipe to get the job
token before starting a new job, and will write a character back to the
pipe when it is done. The overall job limit is thus easily controlled
by just writing the appropriate number of initial token characters into
the pipe.
But to work well, that really means that the old behavior of write
wakeups being synchronous (WF_SYNC) is very important - when the pipe
writer wakes up a reader, we want the reader to actually get scheduled
immediately. Otherwise you lose the parallelism of the build.
The pipe rework lost that synchronous wakeup on write, and we had
clearly all forgotten the reasons and rules for it.
This rewrites the pipe write wakeup logic to do the required Wsync
wakeups, but also clarifies the logic and avoids extraneous wakeups.
It also ends up addign a number of comments about what oit does and why,
so that we hopefully don't end up forgetting about this next time we
change this code.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on. That avoids the races with the event you're waiting for.
The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.
Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.
So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The legacy client tracking infrastructure of nfsd makes use of MD5 to
derive a client's recovery directory name. As the nfsd module doesn't
declare any dependency on CRYPTO_MD5, though, it may fail to allocate
the hash if the kernel was compiled without it. As a result, generation
of client recovery directories will fail with the following error:
NFSD: unable to generate recoverydir name
The explicit dependency on CRYPTO_MD5 was removed as redundant back in
6aaa67b5f3 (NFSD: Remove redundant "select" clauses in fs/Kconfig
2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5.
This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit
df486a2590 (NFS: Fix the selection of security flavours in Kconfig) at
a later point.
Fix the issue by adding back an explicit dependency on CRYPTO_MD5.
Fixes: df486a2590 (NFS: Fix the selection of security flavours in Kconfig)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Static checker revealed possible error path leading to possible
NULL pointer dereferencing.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e0639dc580: ("NFSD introduce async copy feature")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix the iteration end check in fuse_dev_splice_write(). The iterator
position can only be compared with == or != since wrappage may be involved.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similarly to commit 8f868d68d3 ("pipe: Fix missing mask update after
pipe_wait()") this fixes a case where the pipe rewrite ended up caching
the pipe state incorrectly over a pipe lock drop event.
It wasn't quite as obvious, because you needed to splice data from a
pipe to a file, which is a fairly unusual operation, but it's completely
wrong.
Make sure we load the pipe head/tail/size information only after we've
waited for there to be data in the pipe.
While in that file, also make one of the splice helper functions use the
canonical arghument order for pipe_empty(). That's syntactic - pipe
emptiness is just that head and tail are equal, and thus mixing up head
and tail doesn't really matter. It's still wrong, though.
Reported-by: David Sterba <dsterba@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using the special SID to store the mode bits in an ACE (See
http://technet.microsoft.com/en-us/library/hh509017(v=ws.10).aspx)
which is enabled with mount parm "modefromsid" we were not
passing in the mode via SMB3 create (although chmod was enabled).
SMB3 create allows a security descriptor context to be passed
in (which is more atomic and thus preferable to setting the mode
bits after create via a setinfo).
This patch enables setting the mode bits on create when using
modefromsid mount option. In addition it fixes an endian
error in the definition of the Control field flags in the SMB3
security descriptor. It also makes the ACE type of the special
SID better match the documentation (and behavior of servers
which use this to store mode bits in SMB3 ACLs).
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3puL4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjOJD/4vV2ENSYcwZ7Qezgn9tx5NEdADN6jC0DXe
tJDnA0sDLfLTlVM9I1U2/wD6Wu31cvV2dHKmxO+WWWRP2M0orI00UA3I6BssVptY
42e/YJd13IM+iVrhfhm+hmcAHvQUmWHTPZg/7F7OZPkEtNcbCjn6s+HTyzu9gqtt
/kbJVDJ75TR9dEWCPY/A4jUcJYLw2leoHI4u2eqYRsTFNJHUQoaFUHDyNHyj7UNI
kIUi2UixJv+a4pkFvyLPKhJcLr6DBC2TeUnfP2Re5oX1Z/XkDR7iX+L5dUwRHHbT
4M7aJJ+mHDm8Z4IwwBqsSDRU5wUpzuplwBQBY/EQilUZAyOALVzUQABlRa0zxH8t
0D4U6LDDOopYeK0by/FGdD88S6Z0LKm0HJETbdPaGd9fqSlhBn19iGeBKYk0cCIu
Uecm9DKF+QLfKhTbN6h8nPyR3bfkqyUptQJ1UnheWo9f6L32bfp19sdI9LdtRUnS
k661QApajaYQu/641u9CDZiI+DvI+57Wm9I4eCF3GXqOYWPAxSgWP7rJVgS+mfA3
wo99/r11SruaA1Wqf+bOoN/wjsQCiUPFa8po8sh05ER4MH5Brb5EFlg04ZlQZkR9
jOdiL8ISM9t86eyKwtR4PanE7/pPEYUjEWm4ntCC6ViTwCvgV3d6s2We6QPw1j2l
nQ9p/c2bhg==
=7ZFw
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block
Pull more block and io_uring updates from Jens Axboe:
"I wasn't expecting this to be so big, and if I was, I would have used
separate branches for this. Going forward I'll be doing separate
branches for the current tree, just like for the next kernel version
tree. In any case, this contains:
- Series from Christoph that fixes an inherent race condition with
zoned devices and revalidation.
- null_blk zone size fix (Damien)
- Fix for a regression in this merge window that caused busy spins by
sending empty disk uevents (Eric)
- Fix for a regression in this merge window for bfq stats (Hou)
- Fix for io_uring creds allocation failure handling (me)
- io_uring -ERESTARTSYS send/recvmsg fix (me)
- Series that fixes the need for applications to retain state across
async request punts for io_uring. This one is a bit larger than I
would have hoped, but I think it's important we get this fixed for
5.5.
- connect(2) improvement for io_uring, handling EINPROGRESS instead
of having applications needing to poll for it (me)
- Have io_uring use a hash for poll requests instead of an rbtree.
This turned out to work much better in practice, so I think we
should make the switch now. For some workloads, even with a fair
amount of cancellations, the insertion sort is just too expensive.
(me)
- Various little io_uring fixes (me, Jackie, Pavel, LimingWu)
- Fix for brd unaligned IO, and a warning for the future (Ming)
- Fix for a bio integrity data leak (Justin)
- bvec_iter_advance() improvement (Pavel)
- Xen blkback page unmap fix (SeongJae)
The major items in here are all well tested, and on the liburing side
we continue to add regression and feature test cases. We're up to 50
topic cases now, each with anywhere from 1 to more than 10 cases in
each"
* tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits)
block: fix memleak of bio integrity data
io_uring: fix a typo in a comment
bfq-iosched: Ensure bio->bi_blkg is valid before using it
io_uring: hook all linked requests via link_list
io_uring: fix error handling in io_queue_link_head
io_uring: use hash table for poll command lookups
io-wq: clear node->next on list deletion
io_uring: ensure deferred timeouts copy necessary data
io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
brd: warn on un-aligned buffer
brd: remove max_hw_sectors queue limit
xen/blkback: Avoid unmapping unmapped grant pages
io_uring: handle connect -EINPROGRESS like -EAGAIN
block: set the zone size in blk_revalidate_disk_zones atomically
block: don't handle bio based drivers in blk_revalidate_disk_zones
block: allocate the zone bitmaps lazily
block: replace seq_zones_bitmap with conv_zones_bitmap
block: simplify blkdev_nr_zones
block: remove the empty line at the end of blk-zoned.c
...
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
"Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
Basically, the problem is that negative pinned dentries require
careful treatment - unless ->d_lock is locked or parent is held at
least shared, another thread can make them positive right under us.
Most of the uses turned out to be safe - the main surprises as far as
filesystems are concerned were
- race in dget_parent() fastpath, that might end up with the caller
observing the returned dentry _negative_, due to insufficient
barriers. It is positive in memory, but we could end up seeing the
wrong value of ->d_inode in CPU cache. Fixed.
- manual checks that result of lookup_one_len_unlocked() is positive
(and rejection of negatives). Again, insufficient barriers (we
might end up with inconsistent observed values of ->d_inode and
->d_flags). Fixed by switching to a new primitive that does the
checks itself and returns ERR_PTR(-ENOENT) instead of a negative
dentry. That way we get rid of boilerplate converting negatives
into ERR_PTR(-ENOENT) in the callers and have a single place to
deal with the barrier-related mess - inside fs/namei.c rather than
in every caller out there.
The guts of pathname resolution *do* need to be careful - the race
found by Ritesh is real, as well as several similar races.
Fortunately, it turns out that we can take care of that with fairly
local changes in there.
The tree-wide audit had not been fun, and I hate the idea of repeating
it. I think the right approach would be to annotate the places where
we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
catch regressions. But I'm still not sure what would be the least
invasive way of doing that and it's clearly the next cycle fodder"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/namei.c: fix missing barriers when checking positivity
fix dget_parent() fastpath race
new helper: lookup_positive_unlocked()
fs/namei.c: pull positivity check into follow_managed()
Pull autofs updates from Al Viro:
"autofs misuses checks for ->d_subdirs emptiness; the cursors are in
the same lists, resulting in false negatives. It's not needed anyway,
since autofs maintains counter in struct autofs_info, containing 0 for
removed ones, 1 for live symlinks and 1 + number of children for live
directories, which is precisely what we need for those checks.
This series switches to use of that counter and untangles the crap
around its uses (it needs not be atomic and there's a bunch of
completely pointless "defensive" checks).
This fell out of dcache_readdir work; the main point is to get rid of
->d_subdirs abuses in there. I've more followup cleanups, but I hadn't
run those by Ian yet, so they can go next cycle"
* 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
autofs: don't bother with atomics for ino->count
autofs_dir_rmdir(): check ino->count for deciding whether it's empty...
autofs: get rid of pointless checks around ->count handling
autofs_clear_leaf_automount_flags(): use ino->count instead of ->d_subdirs
Merge two fixes for the pipe rework from David Howells:
"Here are a couple of patches to fix bugs syzbot found in the pipe
changes:
- An assertion check will sometimes trip when polling a pipe because
the ring size and indices used are approximate and may be being
changed simultaneously.
An equivalent approximate calculation was done previously, but
without the assertion check, so I've just dropped the check. To
make it accurate, the pipe mutex would need to be taken or the spin
lock could be used - but usage of the spinlock would need to be
rolled out into splice, iov_iter and other places for that.
- The index mask and the max_usage values cannot be cached across
pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
This can cause pipe_write() to break"
* pipe-rework:
pipe: Fix missing mask update after pipe_wait()
pipe: Remove assertion from pipe_poll()
Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter
function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them.
Without this, pipe_write() may subsequently miscalculate the array
indices and pipe fullness, leading to an oops like the following:
BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
...
CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
...
Call Trace:
pipe_write+0xc25/0xe10 fs/pipe.c:481
call_write_iter include/linux/fs.h:1895 [inline]
new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
__vfs_write+0x94/0x110 fs/read_write.c:496
vfs_write+0x18a/0x520 fs/read_write.c:558
ksys_write+0x105/0x220 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__x64_sys_write+0x6e/0xb0 fs/read_write.c:620
do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is not a problem for pipe_read() as the mask is recalculated on
each pass of the loop, after pipe_wait() has been called.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@kernel.org>
[ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size. However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ
to intervene and muck up the calculation, thereby causing the oops.
Fix this by simply removing the assertion and accepting that the
calculation might be approximate.
Note that the previous code also had a similar issue, though there was
no assertion check, since the occupancy counter and the ring size were
not read with a lock held, so it's possible that the poll check might
have malfunctioned then too.
Also wake up all the waiters so that they can reissue their checks if
there was a competing read or write.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bob's extensive filesystem withdrawal and recovery testing:
- Don't write log headers after file system withdraw
- clean up iopen glock mess in gfs2_create_inode
- Close timing window with GLF_INVALIDATE_IN_PROGRESS
- Abort gfs2_freeze if io error is seen
- Don't loop forever in gfs2_freeze if withdrawn
- fix infinite loop in gfs2_ail1_flush on io error
- Introduce function gfs2_withdrawn
- fix glock reference problem in gfs2_trans_remove_revoke
Filesystems with a block size smaller than the page size:
- Fix end-of-file handling in gfs2_page_mkwrite
- Improve mmap write vs. punch_hole consistency
Other:
- Remove active journal side effect from gfs2_write_log_header
- Multi-block allocations in gfs2_page_mkwrite
Minor cleanups and coding style fixes:
- Remove duplicate call from gfs2_create_inode
- make gfs2_log_shutdown static
- make gfs2_fs_parameters static
- Some whitespace cleanups
- removed unnecessary semicolon
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJd6VQ1AAoJENW/n+sDE2U6tUsP/2Zd64dVA+2KzpfvklAUNpx/
TENbvRqGVvhofxuScY0oAvcg0KZN03K/L6ZCa71e7RI7T3VM/FG3rKNCkm084a9d
r+Rxcz+eSkce5qKVNva6CtQiczGAj27iNiho0Y9IgtzEZ5KEVJuY3cV8Z2qzHrQM
Rel2aVpUtaLhbFOj3jLGMt/HHSw8RTTzNqtJJCwRys/tVF1WPVqyNg4PD9a7zMT7
z96tqrlQQxFT4SGiZVJwQHFGuQZEnbr2ahNRmivmGtnNnawLxpEdFuFrSAsC73UB
wHO0Dq+7vYVTyQ7HugrqdkXxqyQr5ta06Pep7uj8ZvhoLWZvPuBf2SccpIn9Ufvo
9iLFY5Z9cHg6wpsW+YMG75Mz0A6WbPRIScVog0fxaKW+z0vMZOmsT6hT6llAlCzn
oj1igqVEOIBTS+4uMDIOJKvMixo4NTdgsLFQyftUxNHiCw5iGbqkV7ux31YjqX22
A830zv8lba44BsixGtuPEy/0Dnka7rMfRp0cflCzKESSLIdXtSjUSEtS6g0fJISS
qKNmnnkHpvjBGMG4lOqJihJdQ+2IiIMLWdNgxkvWgt6F7xjl3gcdREjJF0dmFsd+
GfDUkUZ/70T9UPsaTR2V0GBvEleq4PglALcet9Eela7tKOEziNf3L+prRjMr3909
y4uJIMH9/no1knKBxtkJ
=b1m/
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Andreas Gruenbacher:
"Bob's extensive filesystem withdrawal and recovery testing:
- don't write log headers after file system withdraw
- clean up iopen glock mess in gfs2_create_inode
- close timing window with GLF_INVALIDATE_IN_PROGRESS
- abort gfs2_freeze if io error is seen
- don't loop forever in gfs2_freeze if withdrawn
- fix infinite loop in gfs2_ail1_flush on io error
- introduce function gfs2_withdrawn
- fix glock reference problem in gfs2_trans_remove_revoke
Filesystems with a block size smaller than the page size:
- fix end-of-file handling in gfs2_page_mkwrite
- improve mmap write vs. punch_hole consistency
Other:
- remove active journal side effect from gfs2_write_log_header
- multi-block allocations in gfs2_page_mkwrite
Minor cleanups and coding style fixes:
- remove duplicate call from gfs2_create_inode
- make gfs2_log_shutdown static
- make gfs2_fs_parameters static
- some whitespace cleanups
- removed unnecessary semicolon"
* tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't write log headers after file system withdraw
gfs2: Remove duplicate call from gfs2_create_inode
gfs2: clean up iopen glock mess in gfs2_create_inode
gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
gfs2: Abort gfs2_freeze if io error is seen
gfs2: Don't loop forever in gfs2_freeze if withdrawn
gfs2: fix infinite loop in gfs2_ail1_flush on io error
gfs2: Introduce function gfs2_withdrawn
gfs2: fix glock reference problem in gfs2_trans_remove_revoke
gfs2: make gfs2_log_shutdown static
gfs2: Remove active journal side effect from gfs2_write_log_header
gfs2: Fix end-of-file handling in gfs2_page_mkwrite
gfs2: Multi-block allocations in gfs2_page_mkwrite
gfs2: Improve mmap write vs. punch_hole consistency
gfs2: make gfs2_fs_parameters static
gfs2: Some whitespace cleanups
gfs2: removed unnecessary semicolon
mappings are handled and a conversion to the new mount API (slightly
complicated by the fact that we had a common option parsing framework
that called out into rbd and the filesystem instead of them calling
into it). Also included a few scattered fixes and a MAINTAINERS update
for rbd, adding Dongsheng as a reviewer.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3oDqwTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8/OCACAhPEoSkG8J0XOgP0NlFQGRvikugKq
wlUfpNhkJOVmyM1t9LgiHHirTa7/kA76wPo/iHtnvjIZuZoaX3+NoZX5DwgKVCo1
SCQdXR4ohVPiYxUpK+z/fDXxpYHhaO2SAww+RRHSDxnlN5CHqFBcBhRBPfhraZT5
dwiQt7++UOnp/hfk1Dqg5EogmSdLxqWyjClKf2lliZkzbU9YXmGapqQsur6sBk+e
cLRmRBmMw4cDAKLL1taCympN0AxNMcePs1njvdwQ7XabNWrT061yFyt1ZNwAV/Nu
0nCyh/9IwQcsR0EvK7FCdUEJPy88Reufd+GleS4nkEZpbxQBzo0aGow0
=Egtk
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The two highlights are a set of improvements to how rbd read-only
mappings are handled and a conversion to the new mount API (slightly
complicated by the fact that we had a common option parsing framework
that called out into rbd and the filesystem instead of them calling
into it).
Also included a few scattered fixes and a MAINTAINERS update for rbd,
adding Dongsheng as a reviewer"
* tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client:
libceph, rbd, ceph: convert to use the new mount API
rbd: ask for a weaker incompat mask for read-only mappings
rbd: don't query snapshot features
rbd: remove snapshot existence validation code
rbd: don't establish watch for read-only mappings
rbd: don't acquire exclusive lock for read-only mappings
rbd: disallow read-write partitions on images mapped read-only
rbd: treat images mapped read-only seriously
rbd: introduce RBD_DEV_FLAG_READONLY
rbd: introduce rbd_is_snap()
ceph: don't leave ino field in ceph_mds_request_head uninitialized
ceph: tone down loglevel on ceph_mdsc_build_path warning
rbd: update MAINTAINERS info
ceph: fix geting random mds from mdsmap
rbd: fix spelling mistake "requeueing" -> "requeuing"
ceph: make several helper accessors take const pointers
libceph: drop unnecessary check from dispatch() in mon_client.c
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXedyjQAKCRDh3BK/laaZ
PCR7AQCf+bb3/so1bygFeblBTT4UZbYZRXz2nZNSA5tgJvafWwD+I4MlqR+tEixb
gZEusbtAVrtm3hJrBc+1fA1wacGhmAg=
=Jg/0
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
- Fix a regression introduced in the last release
- Fix a number of issues with validating data coming from userspace
- Some cleanups in virtiofs
* tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix Kconfig indentation
fuse: fix leak of fuse_io_priv
virtiofs: Use completions while waiting for queue to be drained
virtiofs: Do not send forget request "struct list_head" element
virtiofs: Use a common function to send forget
virtiofs: Fix old-style declaration
fuse: verify nlink
fuse: verify write return
fuse: verify attributes
This patch fixes the following KASAN report. The @ioend has been
freed by dio_put(), but the iomap_finish_ioend() still trys to access
its data.
[20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
[20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184
[20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
[20563.653887] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
[20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
[20563.669547] Call trace:
[20563.671993] dump_backtrace+0x0/0x370
[20563.675648] show_stack+0x1c/0x28
[20563.678958] dump_stack+0x138/0x1b0
[20563.682455] print_address_description.isra.9+0x60/0x378
[20563.687759] __kasan_report+0x1a4/0x2a8
[20563.691587] kasan_report+0xc/0x18
[20563.694985] __asan_report_load8_noabort+0x18/0x20
[20563.699769] iomap_finish_ioend+0x58c/0x5c0
[20563.703944] iomap_finish_ioends+0x110/0x270
[20563.708396] xfs_end_ioend+0x168/0x598 [xfs]
[20563.712823] xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.716834] process_one_work+0x7f0/0x1ac8
[20563.720922] worker_thread+0x334/0xae0
[20563.724664] kthread+0x2c4/0x348
[20563.727889] ret_from_fork+0x10/0x18
[20563.732941] Allocated by task 83403:
[20563.736512] save_stack+0x24/0xb0
[20563.739820] __kasan_kmalloc.isra.9+0xc4/0xe0
[20563.744169] kasan_slab_alloc+0x14/0x20
[20563.747998] slab_post_alloc_hook+0x50/0xa8
[20563.752173] kmem_cache_alloc+0x154/0x330
[20563.756185] mempool_alloc_slab+0x20/0x28
[20563.760186] mempool_alloc+0xf4/0x2a8
[20563.763845] bio_alloc_bioset+0x2d0/0x448
[20563.767849] iomap_writepage_map+0x4b8/0x1740
[20563.772198] iomap_do_writepage+0x200/0x8d0
[20563.776380] write_cache_pages+0x8a4/0xed8
[20563.780469] iomap_writepages+0x4c/0xb0
[20563.784463] xfs_vm_writepages+0xf8/0x148 [xfs]
[20563.788989] do_writepages+0xc8/0x218
[20563.792658] __writeback_single_inode+0x168/0x18f8
[20563.797441] writeback_sb_inodes+0x370/0xd30
[20563.801703] wb_writeback+0x2d4/0x1270
[20563.805446] wb_workfn+0x344/0x1178
[20563.808928] process_one_work+0x7f0/0x1ac8
[20563.813016] worker_thread+0x334/0xae0
[20563.816757] kthread+0x2c4/0x348
[20563.819979] ret_from_fork+0x10/0x18
[20563.825028] Freed by task 22184:
[20563.828251] save_stack+0x24/0xb0
[20563.831559] __kasan_slab_free+0x10c/0x180
[20563.835648] kasan_slab_free+0x10/0x18
[20563.839389] slab_free_freelist_hook+0xb4/0x1c0
[20563.843912] kmem_cache_free+0x8c/0x3e8
[20563.847745] mempool_free_slab+0x20/0x28
[20563.851660] mempool_free+0xd4/0x2f8
[20563.855231] bio_free+0x33c/0x518
[20563.858537] bio_put+0xb8/0x100
[20563.861672] iomap_finish_ioend+0x168/0x5c0
[20563.865847] iomap_finish_ioends+0x110/0x270
[20563.870328] xfs_end_ioend+0x168/0x598 [xfs]
[20563.874751] xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.878755] process_one_work+0x7f0/0x1ac8
[20563.882844] worker_thread+0x334/0xae0
[20563.886584] kthread+0x2c4/0x348
[20563.889804] ret_from_fork+0x10/0x18
[20563.894855] The buggy address belongs to the object at fffffc0c54a36900
which belongs to the cache bio-1 of size 248
[20563.906844] The buggy address is located 40 bytes inside of
248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
[20563.918485] The buggy address belongs to the page:
[20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
[20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
[20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
[20563.948300] page dumped because: kasan: bad access detected
[20563.955345] Memory state around the buggy address:
[20563.960129] fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.967342] fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[20563.981766] ^
[20563.986288] fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.993501] fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20564.000713] ==================================================================
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
Signed-off-by: Zorro Lang <zlang@redhat.com>
Fixes: 9cd0ed63ca ("iomap: enhance writeback error message")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.
Link them all through list_list. Also, it seems to be simpler and more
consistent IMHO.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only it
doesn't cancel links, but also may execute wrong sequence of actions.
Stop consuming sqes, and let the user handle errors.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ELF reads done by the kernel have very complicated error detection code
which better live in one place.
Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
ep_call_nested() in order to pass the correct subclass argument to
spin_lock_irqsave_nested(). However, ep_call_nested() adds unnecessary
checks for epoll depth and loops that are already verified when doing
EPOLL_CTL_ADD. This mirrors a conversion that was done for
!CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a ("epoll: remove
ep_call_nested() from ep_eventpoll_poll()")
Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ / /' -i */Kconfig
[adobriyan@gmail.com: add two spaces where necessary]
Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
List iteration takes more code than anything else which means embedded
list_head should be the first element of the structure.
Space savings:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
Function old new delta
close_pdeo 228 227 -1
proc_reg_release 86 82 -4
proc_entry_rundown 143 139 -4
proc_reg_open 298 289 -9
Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pointer to next '/' encodes length of path element and next start
position. Subtraction and increment are redundant.
Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently gluing PDE into global /proc tree is done under lock, but
changing ->nlink is not. Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.
Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc87 instead of having the timeout deferral use its
own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iface[0] was accessed regardless of the count value and without
locking.
* check count before accessing any ifaces
* make copy of iface list (it's a simple POD array) and use it without
locking.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
With the addition of SMB session channels, we introduced new TCP
server pointers that have no sessions or tcons associated with them.
In this case, when we started looking for TCP connections, we might
end up picking session channel rather than the master connection,
hence failing to get either a session or a tcon.
In order to fix that, this patch introduces a new "is_channel" field
to TCP_Server_Info structure so we can skip session channels during
lookup of connections.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio completions can race when a page spans more than one file system
block. Add a spinlock to synchronize marking the page uptodate.
Fixes: 9dc55f1389 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.
The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.
When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.
The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
As David reported [1], ENODATA returns when attempting
to modify files by using EROFS as an overlayfs lower layer.
The root cause is that listxattr could return unexpected
-ENODATA by mistake for inodes without xattr. That breaks
listxattr return value convention and it can cause copy
up failure when used with overlayfs.
Resolve by zeroing out if no xattr is found for listxattr.
[1] https://lore.kernel.org/r/CAEvUa7nxnby+rxK-KRMA46=exeOMApkDMAV08AjMkkPnTPV4CQ@mail.gmail.com
Link: https://lore.kernel.org/r/20191201084040.29275-1-hsiangkao@aol.com
Fixes: cadf1ccf1b ("staging: erofs: add error handling for xattr submodule")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
syzbot (via KASAN) reports a use-after-free in the error path of
xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
handle the case of a fully initialized ->l_iclog linked list.
Instead, it assumes that the list is partially constructed and NULL
terminated.
This bug manifested because there was no possible error scenario
after iclog list setup when the original code was added. Subsequent
code and associated error conditions were added some time later,
while the original error handling code was never updated. Fix up the
error loop to terminate either on a NULL iclog or reaching the end
of the list.
Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since timestamps on files on most servers can be updated at
close, and since timestamps on our dentries default to one
second we can have stale timestamps in some common cases
(e.g. open, write, close, stat, wait one second, stat - will
show different mtime for the first and second stat).
The SMB2/SMB3 protocol allows querying timestamps at close
so add the code to request timestamp and attr information
(which is cheap for the server to provide) to be returned
when a file is closed (it is not needed for the many
paths that call SMB2_close that are from compounded
query infos and close nor is it needed for some of
the cases where a directory close immediately follows a
directory open.
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
- Make iomap_dio_rw callers explicitly tell us if they want us to wait
- Port the xfs writeback code to iomap to complete the buffered io
library functions
- Refactor the unshare code to share common pieces
- Add support for performing copy on write with buffered writes
- Other minor fixes
- Fix unchecked return in iomap_bmap
- Fix a type casting bug in a ternary statement in iomap_dio_bio_actor
- Improve tracepoints for easier diagnostic ability
- Fix pipe page leakage in directio reads
- Clean up iter usage in directio paths
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3kAhYACgkQ+H93GTRK
tOviwQ/9FyTPgZKDrJMOxiEIyx1ye+wWasaUSmt1/jaXBhX8B12WkjfV5Nk970pw
WOdiu5rnHy+Aje35luZmwPpX1H8CPvnVUTCFaYdZe9r695+XBmpg0coPr9T3Lsg/
dMRPB2LaM1gZxaQukBTWoj2mwDanepz6Zu5dGcwEuZE26jlnFf6+yQibFlOIjFmk
ioCkZLwkNdQlF+dutOykIilkP6imFyTU1lYueAZTmtJoM7JfXcqVmsn7A4R2pi8U
2CF/ySTCuKu8Mwee8BTzWMcNqvYc/bshF2bQeiBNnKB+K3GGM9s5oNL5DGQTc709
Cjhw6aq8ZTedT6AEw07zY5X1NhkCe8dQImz1qQPjCuoW2a5bVww8hvw8otUAyxOM
/ScwHqOH/TxzuTmYHoO+2DxZYjSEFTci26dpVenxWrFt82LcMfG/DvEThzTcT/qq
EdH7oLHbgXpvpwMkOYU9usCCkosraWKNaN4aoX9R5VhhWA4t8CcQNpXBkMDEI3ck
xywGGiwau3aUy2ELtjgfjxRUY28rZUkoABkyR9fAc4BYbS4RDqoHgF9pY/z5H/2B
xwDLzCA9ww6wLjeXEWt9ZBwmkNkdsFp/XeKrx2WX81LrCRoqFc3viV13FWrfuEe6
de9jHgV1HZqkmm5pChRMJ3n5Qyah9anXXnTRvar8UedHEbxoQuE=
=QMWG
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap cleanups from Darrick Wong:
"Aome more new iomap code for 5.5.
There's not much this time -- just removing some local variables that
don't need to exist in the iomap directio code"
* tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: remove unneeded variable in iomap_dio_rw()
iomap: Do not create fake iter in iomap_dio_bio_actor()
Pull timer updates from Ingo Molnar:
"The main changes in the timer code in this cycle were:
- Clockevent updates:
- timer-of framework cleanups. (Geert Uytterhoeven)
- Use timer-of for the renesas-ostm and the device name to prevent
name collision in case of multiple timers. (Geert Uytterhoeven)
- Check if there is an error after calling of_clk_get in asm9260
(Chuhong Yuan)
- ABI fix: Zero out high order bits of nanoseconds on compat
syscalls. This got broken a year ago, with apparently no side
effects so far.
Since the kernel would use random data otherwise I don't think we'd
have other options but to fix the bug, even if there was a side
effect to applications (Dmitry Safonov)
- Optimize ns_to_timespec64() on 32-bit systems: move away from
div_s64_rem() which can be slow, to div_u64_rem() which is faster
(Arnd Bergmann)
- Annotate KCSAN-reported false positive data races in
hrtimer_is_queued() users by moving timer->state handling over to
the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
(Eric Dumazet)
- Misc cleanups and small fixes"
[ I undid the "ABI fix" and updated the comments instead. The reason
there were apparently no side effects is that the fix was a no-op.
The updated comment is to say _why_ it was a no-op. - Linus ]
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Zero the upper 32-bits in __kernel_timespec on 32-bit
time: Rename tsk->real_start_time to ->start_boottime
hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
time: Fix spelling mistake in comment
time: Optimize ns_to_timespec64()
hrtimer: Annotate lockless access to timer->state
clocksource/drivers/asm9260: Add a check for of_clk_get
clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
clocksource/drivers/renesas-ostm: Convert to timer_of
clocksource/drivers/timer-of: Use unique device name instead of timer
clocksource/drivers/timer-of: Convert last full_name to %pOF
Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally, this makes it much
easier to use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit b18fdf71e0 ("io_uring: simplify io_req_link_next()"),
the io_wq_current_is_worker function is no longer needed, clean it
up.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just like commit f67676d160 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just like commit f67676d160 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.
Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 5b094d6dac ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
in the wrong order. However, this check isn't applicable for realtime
files. In most cases, it just makes us do unnecessary commits. However,
without the fix from the previous commit ("xfs: fix realtime file data
space leak"), if the last and second-to-last extents also happen to have
different "AG numbers", then the break actually causes __xfs_bunmapi()
to return without making any progress, which sends
xfs_itruncate_extents_flags() into an infinite loop.
Fixes: 5b094d6dac ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Realtime files in XFS allocate extents in rextsize units. However, the
written/unwritten state of those extents is still tracked in blocksize
units. Therefore, a realtime file can be split up into written and
unwritten extents that are not necessarily aligned to the realtime
extent size. __xfs_bunmapi() has some logic to handle these various
corner cases. Consider how it handles the following case:
1. The last extent is unwritten.
2. The last extent is smaller than the realtime extent size.
3. startblock of the last extent is not aligned to the realtime extent
size, but startblock + blockcount is.
In this case, __xfs_bunmapi() calls xfs_bmap_add_extent_unwritten_real()
to set the second-to-last extent to unwritten. This should merge the
last and second-to-last extents, so __xfs_bunmapi() moves on to the
second-to-last extent.
However, if the size of the last and second-to-last extents combined is
greater than MAXEXTLEN, xfs_bmap_add_extent_unwritten_real() does not
merge the two extents. When that happens, __xfs_bunmapi() skips past the
last extent without unmapping it, thus leaking the space.
Fix it by only unwriting the minimum amount needed to align the last
extent to the realtime extent size, which is guaranteed to merge with
the last extent.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 6917d06899 ("block: merge invalidate_partitions into
rescan_partitions") caused a regression where systemd-udevd spins
forever using max CPU starting at boot time.
It's caused by a behavior change where a KOBJ_CHANGE uevent is now sent
in a case where previously it wasn't.
Restore the old behavior.
Fixes: 6917d06899 ("block: merge invalidate_partitions into rescan_partitions")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
UBI:
- Fix a regression around producing a anchor PEB for fastmap.
Due to a change in our locking fastmap was unable to produce
fresh anchors an re-used the existing one a way to often.
UBIFS:
- Fixes for endianness. A few places blindly assumed little endian.
- Fix for a memory leak in the orphan code.
- Fix for a possible crash during a commit.
- Revert a wrong bugfix.
JFFS2:
- Revert a bad bugfix in (false positive from a code checking
tool).
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl3kG34WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wU70D/9K3QdG/fciafNRkVFFKnXVNXLv
rxbfrRVCZeEo2Phm8daVQAvERWKi2Mw5jW1fNbOtDKvnPkC9+3vYmmHCienJ6VEt
LhQ4Ey0MX8dQID1Caji49i5pPD3PBMX0m+YrxGt0uuzbr/nNWeiIUf/nyIvU95o7
ZmmX7wWmKHUeLKC8TGpeolBVAf4xKqzp+U7NJxKpF5j6dn6hTi1J51GwsYw6aSXU
lo5KvPWHq11ENWxJhtyWfO4MMyzTAtvbL1xEf0NRXX+aK4dUUyu0dcxzAqNzR6HU
ER7gsIyEBKLZpgTqLpy6B9tWcnEjHHHN4tjxhD8gRGlOVBdXSlN+X45I8gjYOact
Jkx1eF+T7E865YP+8vwoHo8eMoThxz54SnON8Dk6WaioAOGvf48w4gNnAilOb8Fa
FZNk20pQdR2KyGc6PQ2I82vDYX1c2GmlGXOFLsEM8KxGWHjSCpQC1P946h17QSPe
OxCptqu36czclm6zk76Ycw5/BWkbzViOcToVuQS61HLH7P/BzF1BnEM9rgwf+tD0
4midOomxrZXmUrbC/HO4mARRdGtgonBHl1yJ8JjWhWEL9KPVLAhWJ4zHR8Zo84LJ
BodZTzJPJ96igf+hLMAG/xFCnpGylXaz/G+W+5wUHnNJyKVsbhYIEmBscJYE60iM
9/00Ce8r9lEpetUnKA==
=LbMI
-----END PGP SIGNATURE-----
Merge tag 'upstream-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI/UBIFS/JFFS2 updates from Richard Weinberger:
"This pull request contains mostly fixes for UBI, UBIFS and JFFS2:
UBI:
- Fix a regression around producing a anchor PEB for fastmap.
Due to a change in our locking fastmap was unable to produce fresh
anchors an re-used the existing one a way to often.
UBIFS:
- Fixes for endianness. A few places blindly assumed little endian.
- Fix for a memory leak in the orphan code.
- Fix for a possible crash during a commit.
- Revert a wrong bugfix.
JFFS2:
- Revert a bad bugfix (false positive from a code checking tool)"
* tag 'upstream-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
Revert "jffs2: Fix possible null-pointer dereferences in jffs2_add_frag_to_fragtree()"
ubi: Fix producing anchor PEBs
ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
ubifs: do_kill_orphans: Fix a memory leak bug
Revert "ubifs: Fix memory leak bug in alloc_ubifs_info() error path"
ubifs: Fix type of sup->hash_algo
ubifs: Fixed missed le64_to_cpu() in journal
ubifs: Force prandom result to __le32
ubifs: Remove obsolete TODO from dfs_file_write()
ubi: Fix warning static is not at beginning of declaration
ubi: Print skip_check in ubi_dump_vol_info()
close was relayered to allow passing in an async flag which
is no longer needed in this path. Remove the unneeded parameter
"flags" passed in on close.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
The pointer pneg_ctxt is being initialized with a value that is never
read and it is being updated later with a new value. The assignment
is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fill out the build string
- Prevent inode fork extent count overflows
- Refactor the allocator to reduce long tail latency
- Rework incore log locking a little to reduce spinning
- Break up the xfs_iomap_begin functions into smaller more cohesive
parts
- Fix allocation alignment being dropped too early when the allocation
request is for more blocks than an AG is large
- Other small cleanups
- Clean up file buftarg retrieval helpers
- Hoist the resvsp and unresvsp ioctls to the vfs
- Remove the undocumented biosize mount option, since it has never been
mentioned as existing or supported on linux
- Clean up some of the mount option printing and parsing
- Enhance attr leaf verifier to check block structure
- Check dirent and attr names for invalid characters before passing them
to the vfs
- Refactor open-coded bmbt walking
- Fix a few places where we return EIO instead of EFSCORRUPTED after
failing metadata sanity checks
- Fix a synchronization problem between fallocate and aio dio corrupting
the file length
- Clean up various loose ends in the iomap and bmap code
- Convert to the new mount api
- Make sure we always log something when returning EFSCORRUPTED
- Fix some problems where long running scrub loops could trigger soft
lockup warnings and/or fail to exit due to fatal signals pending
- Fix various Coverity complaints
- Remove most of the function pointers from the directory code to reduce
indirection penalties
- Ensure that dquots are attached to the inode when performing unwritten
extent conversion after io
- Deuglify incore projid and crtime types
- Fix another AGI/AGF locking order deadlock when renaming
- Clean up some quota typedefs
- Remove the FSSETDM ioctls which haven't done anything in 20 years
- Fix some memory leaks when mounting the log fails
- Fix an underflow when updating an xattr leaf freemap
- Remove some trivial wrappers
- Report metadata corruption as an error, not a (potentially) fatal
assertion
- Clean up the dir/attr buffer mapping code
- Allow fatal signals to kill scrub during parent pointer checks
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3fNjcACgkQ+H93GTRK
tOv/8w//Y0Oa9Paiy8+iPTChs3/PqeKp307Fj5KONG52haMCakEJFT5+/wpkIAJw
uUmKiPolwN1ivviIUmIS14ThTJ7NV1jq0G0h/0tC25i/3hoJrGWdzqYJMlvhlqgE
taHrjCwPTDkhRJ0D5QCrkkHPU7lSdquO5TWxltaqYLhyLIt8SkklD6dN1dHWEPnk
k0j3TL+VqVJDYyEj1bLwJ0QUb2C3J8ygWnlviF/WxsSeJtJpGoeLEaYXhhsUK0Dt
aHg70OM6zzFzrJJAtJeBXpgaFsG/Pqbcw4wUWSxEMWjVSJwCSKLuZ5F+p6NcqoEj
HeLQkaGePoO61YCInk2JKLHIyx7ohqMOt7+Dm0mdbe1pvcKwV9ZcdkqKa8L/Fm6v
bUP6a2hEpsGy7vLnkYxwYACTLPbGX3uLw8MUr6ZpJ+SpfVLktU4ycpr8dCkJkp6a
0qOpEeHsBDy74NkMOUa7Qrju7lJ2GiL70qqBwaPe+ubcUa3U/3WAsSekSzXgUwn8
Fap4r8wn7cUbxymAvO06RlU8YymuulAlyjwdo9gOL/Su/5POldss6dy1YuUtyq19
CD6NtkHqEUMsTc2cI+H65H44aEeckB1j0D2Grm2uMchAh0GcTSFVNF6jony++B8k
s2sL2dEw9/9vr0uc1TSVF5ezxaONuyaCXdYXUkkdyq3iNvfpRCg=
=aACq
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
"For this release, we changed quite a few things.
Highlights:
- Fixed some long tail latency problems in the block allocator
- Removed some long deprecated (and for the past several years no-op)
mount options and ioctls
- Strengthened the extended attribute and directory verifiers
- Audited and fixed all the places where we could return EFSCORRUPTED
without logging anything
- Refactored the old SGI space allocation ioctls to make the
equivalent fallocate calls
- Fixed a race between fallocate and directio
- Fixed an integer overflow when files have more than a few
billion(!) extents
- Fixed a longstanding bug where quota accounting could be incorrect
when performing unwritten extent conversion on a freshly mounted fs
- Fixed various complaints in scrub about soft lockups and
unresponsiveness to signals
- De-vtable'd the directory handling code, which should make it
faster
- Converted to the new mount api, for better or for worse
- Cleaned up some memory leaks
and quite a lot of other smaller fixes and cleanups.
A more detailed summary:
- Fill out the build string
- Prevent inode fork extent count overflows
- Refactor the allocator to reduce long tail latency
- Rework incore log locking a little to reduce spinning
- Break up the xfs_iomap_begin functions into smaller more cohesive
parts
- Fix allocation alignment being dropped too early when the
allocation request is for more blocks than an AG is large
- Other small cleanups
- Clean up file buftarg retrieval helpers
- Hoist the resvsp and unresvsp ioctls to the vfs
- Remove the undocumented biosize mount option, since it has never
been mentioned as existing or supported on linux
- Clean up some of the mount option printing and parsing
- Enhance attr leaf verifier to check block structure
- Check dirent and attr names for invalid characters before passing
them to the vfs
- Refactor open-coded bmbt walking
- Fix a few places where we return EIO instead of EFSCORRUPTED after
failing metadata sanity checks
- Fix a synchronization problem between fallocate and aio dio
corrupting the file length
- Clean up various loose ends in the iomap and bmap code
- Convert to the new mount api
- Make sure we always log something when returning EFSCORRUPTED
- Fix some problems where long running scrub loops could trigger soft
lockup warnings and/or fail to exit due to fatal signals pending
- Fix various Coverity complaints
- Remove most of the function pointers from the directory code to
reduce indirection penalties
- Ensure that dquots are attached to the inode when performing
unwritten extent conversion after io
- Deuglify incore projid and crtime types
- Fix another AGI/AGF locking order deadlock when renaming
- Clean up some quota typedefs
- Remove the FSSETDM ioctls which haven't done anything in 20 years
- Fix some memory leaks when mounting the log fails
- Fix an underflow when updating an xattr leaf freemap
- Remove some trivial wrappers
- Report metadata corruption as an error, not a (potentially) fatal
assertion
- Clean up the dir/attr buffer mapping code
- Allow fatal signals to kill scrub during parent pointer checks"
* tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (198 commits)
xfs: allow parent directory scans to be interrupted with fatal signals
xfs: remove the mappedbno argument to xfs_da_get_buf
xfs: remove the mappedbno argument to xfs_da_read_buf
xfs: split xfs_da3_node_read
xfs: remove the mappedbno argument to xfs_dir3_leafn_read
xfs: remove the mappedbno argument to xfs_dir3_leaf_read
xfs: remove the mappedbno argument to xfs_attr3_leaf_read
xfs: remove the mappedbno argument to xfs_da_reada_buf
xfs: improve the xfs_dabuf_map calling conventions
xfs: refactor xfs_dabuf_map
xfs: simplify mappedbno handling in xfs_da_{get,read}_buf
xfs: report corruption only as a regular error
xfs: Remove kmem_zone_free() wrapper
xfs: Remove kmem_zone_destroy() wrapper
xfs: Remove slab init wrappers
xfs: fix attr leaf header freemap.size underflow
xfs: fix some memory leaks in log recovery
xfs: fix another missing include
xfs: remove XFS_IOC_FSSETDM and XFS_IOC_FSSETDM_BY_HANDLE
xfs: remove duplicated include from xfs_dir2_data.c
...
According to the comment in the code and commit log, some apps
expect atime >= mtime; but the introduced code results in
atime==mtime. Fix the comparison to guard against atime<mtime.
Fixes: 9b9c5bea0b ("cifs: do not return atime less than mtime")
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: stfrench@microsoft.com
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently when the client creates a cifsFileInfo structure for
a newly opened file, it allocates a list of byte-range locks
with a pointer to the new cfile and attaches this list to the
inode's lock list. The latter happens before initializing all
other fields, e.g. cfile->tlink. Thus a partially initialized
cifsFileInfo structure becomes available to other threads that
walk through the inode's lock list. One example of such a thread
may be an oplock break worker thread that tries to push all
cached byte-range locks. This causes NULL-pointer dereference
in smb2_push_mandatory_locks() when accessing cfile->tlink:
[598428.945633] BUG: kernel NULL pointer dereference, address: 0000000000000038
...
[598428.945749] Workqueue: cifsoplockd cifs_oplock_break [cifs]
[598428.945793] RIP: 0010:smb2_push_mandatory_locks+0xd6/0x5a0 [cifs]
...
[598428.945834] Call Trace:
[598428.945870] ? cifs_revalidate_mapping+0x45/0x90 [cifs]
[598428.945901] cifs_oplock_break+0x13d/0x450 [cifs]
[598428.945909] process_one_work+0x1db/0x380
[598428.945914] worker_thread+0x4d/0x400
[598428.945921] kthread+0x104/0x140
[598428.945925] ? process_one_work+0x380/0x380
[598428.945931] ? kthread_park+0x80/0x80
[598428.945937] ret_from_fork+0x35/0x40
Fix this by reordering initialization steps of the cifsFileInfo
structure: initialize all the fields first and then add the new
byte-range lock list to the inode's lock list.
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Various kerneldoc script enhancements.
- More RST conversions; those are slowing down as we run out of things to
convert, but we're a ways from done still.
- Dan's "maintainer profile entry" work landed at last. Now we just need
to get maintainers to fill in the profiles...
- A reworking of the parallel build setup to work better with a variety of
systems (and to not take over huge systems entirely in particular).
- The MAINTAINERS file is now converted to RST during the build.
Hopefully nobody ever tries to print this thing, or they will need to
load a lot of paper.
- A script and documentation making it easy for maintainers to add Link:
tags at commit time.
Also included is the removal of a bunch of spurious CR characters.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl3j5B0PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YtBcH/jIN2cO8/0YW2rjVT+1G6ytSdFUKx5WJ/lpf
5uBeCvuCeYhtCB6+BgnXvjykJ7jDW11/NJNjWqz/gsvD5l5FJK1rXarI/oz2Klyi
kcPtDmBF/ki4wz9qXzEpa0vg8LXdjeys50S1vE75qCzxZoPP7YjuRbPnLrlIJukv
JbDVi4p9kxgeHfRB4+BHOe5rFwA3mMmaxKNIX34Y+UUO2KZ0g/yUi1bAaQwQAdt+
PsORmkVQ8Puh3K9xRIr7dYlcWBlBiPqzYdvDgTVxSjrxdK6wjYjSgVk2VjC5MBUN
mTSTWgyfsIcD/76/s8tq7ZRl2fw+SkCSkFo79Rb/hJwDTb7Vnng=
=LPBr
-----END PGP SIGNATURE-----
Merge tag 'docs-5.5a' of git://git.lwn.net/linux
Pull Documentation updates from Jonathan Corbet:
"Here are the main documentation changes for 5.5:
- Various kerneldoc script enhancements.
- More RST conversions; those are slowing down as we run out of
things to convert, but we're a ways from done still.
- Dan's "maintainer profile entry" work landed at last. Now we just
need to get maintainers to fill in the profiles...
- A reworking of the parallel build setup to work better with a
variety of systems (and to not take over huge systems entirely in
particular).
- The MAINTAINERS file is now converted to RST during the build.
Hopefully nobody ever tries to print this thing, or they will need
to load a lot of paper.
- A script and documentation making it easy for maintainers to add
Link: tags at commit time.
Also included is the removal of a bunch of spurious CR characters"
* tag 'docs-5.5a' of git://git.lwn.net/linux: (91 commits)
docs: remove a bunch of stray CRs
docs: fix up the maintainer profile document
libnvdimm, MAINTAINERS: Maintainer Entry Profile
Maintainer Handbook: Maintainer Entry Profile
MAINTAINERS: Reclaim the P: tag for Maintainer Entry Profile
docs, parallelism: Rearrange how jobserver reservations are made
docs, parallelism: Do not leak blocking mode to other readers
docs, parallelism: Fix failure path and add comment
Documentation: Remove bootmem_debug from kernel-parameters.txt
Documentation: security: core.rst: fix warnings
Documentation/process/howto/kokr: Update for 4.x -> 5.x versioning
Documentation/translation: Use Korean for Korean translation title
docs/memory-barriers.txt: Remove remaining references to mmiowb()
docs/memory-barriers.txt/kokr: Update I/O section to be clearer about CPU vs thread
docs/memory-barriers.txt/kokr: Fix style, spacing and grammar in I/O section
Documentation/kokr: Kill all references to mmiowb()
docs/memory-barriers.txt/kokr: Rewrite "KERNEL I/O BARRIER EFFECTS" section
docs: Add initial documentation for devfreq
Documentation: Document how to get links with git am
docs: Add request_irq() documentation
...
Merge updates from Andrew Morton:
"Incoming:
- a small number of updates to scripts/, ocfs2 and fs/buffer.c
- most of MM
I still have quite a lot of material (mostly not MM) staged after
linux-next due to -next dependencies. I'll send those across next week
as the preprequisites get merged up"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
mm/page_io.c: annotate refault stalls from swap_readpage
mm/Kconfig: fix trivial help text punctuation
mm/Kconfig: fix indentation
mm/memory_hotplug.c: remove __online_page_set_limits()
mm: fix typos in comments when calling __SetPageUptodate()
mm: fix struct member name in function comments
mm/shmem.c: cast the type of unmap_start to u64
mm: shmem: use proper gfp flags for shmem_writepage()
mm/shmem.c: make array 'values' static const, makes object smaller
userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
userfaultfd: wrap the common dst_vma check into an inlined function
userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
userfaultfd: use vma_pagesize for all huge page size calculation
mm/madvise.c: use PAGE_ALIGN[ED] for range checking
mm/madvise.c: replace with page_size() in madvise_inject_error()
mm/mmap.c: make vma_merge() comment more easy to understand
mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
autonuma: reduce cache footprint when scanning page tables
autonuma: fix watermark checking in migrate_balanced_pgdat()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3h2LsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvrOD/43g08VvkLexb6DvO8kTkyxPSUsPZml292H
yowv1Rij9x/7Y9MTOmcnEAV9QofRPY9nkmEq1KR+iqGQnLZsKFLrDVxQSgkhU7HJ
2wRE2DCPYTfIyIfS0TF2DdVjh3Q90tPKZbVkiOxvy9R+yS6hmTur0t731OERsMAh
QgkY8zJP04XonXNwZSch9QYdpf9XGu+W83Pc0ecmootimJHVhFV8J9111dessod0
h82Epbbbc8bg/CBWCA4fDm4EZ8doMHdAHYpw60WDXPoLF9zISvg5QG+pVjKrLkVx
pqgTt5Kwd/EkqurRw3sH+8A6p0ACLrUiKffX2cXk8ScdTXMGD5stmdF5cMLSikFY
yJgGHOTBHdjV4T2gB1TQ0rlnnb1VtHoJm5XvPviRaSQt7irzh4EWwhYiL7JlTAC3
etCbzMPn8Sm+Ns+/zObuOmVTQvyE/+PngToxDRpgzDd4pSbMPR502iL3gvSbMDjw
BCfGKFRkBYAB1SSjMwi+l9hiX2F+jTnHSvrm69HUz5K2zvWl4hsd2wClExuwoCQ+
UFXULCJZFCKbAcgEVh2OQX9JVg7fUv5GhIymEPBWFDAODwtX1XJK6IxmQhRh8owV
AxBFnNdpgRcBzyy+c+2cM4JOVcm8bV1s6eYP0UyV+EieD5OcBdq7GH5YBFzOztJM
SLMbjQQ/7w==
=hnf4
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191129' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"I wasn't going to send this one off so soon, but unfortunately one of
the fixes from the previous pull broke the build on some archs. So I'm
sending this sooner rather than later. This contains:
- Add highmem.h include for io_uring, because of the kmap() additions
from last round. For some reason the build bot didn't spot this
even though it sat for days.
- Three minor ';' removals
- Add support for the Beurer CD-on-a-chip device
- Make io_uring work on MMU-less archs"
* tag 'for-linus-20191129' of git://git.kernel.dk/linux-block:
io_uring: fix missing kmap() declaration on powerpc
ataflop: Remove unneeded semicolon
block: sunvdc: Remove unneeded semicolon
drbd: Remove unneeded semicolon
io_uring: add mapping support for NOMMU archs
sr_vendor: support Beurer GL50 evo CD-on-a-chip devices.
cdrom: respect device capabilities during opening action
This is a series of cleanups for the y2038 work, mostly intended
for namespace cleaning: the kernel defines the traditional
time_t, timeval and timespec types that often lead to y2038-unsafe
code. Even though the unsafe usage is mostly gone from the kernel,
having the types and associated functions around means that we
can still grow new users, and that we may be missing conversions
to safe types that actually matter.
There are still a number of driver specific patches needed to
get the last users of these types removed, those have been
submitted to the respective maintainers.
Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJd3D+wAAoJEJpsee/mABjZfdcQAJvl6e+4ddKoDMIVJqVCE25N
meFRgA7S8jy6BefEVeUgI8TxK+amGO36szMBUEnZxSSxq9u+gd13m5bEK6Xq/ov7
4KTAiA3Irm/W5FBTktu1zc5ROIra1Xj7jLdubf8wEC3viSXIXB3+68Y28iBN7D2O
k9kSpwINC5lWeC8guZy2I+2yc4ywUEXao9nVh8C/J+FQtU02TcdLtZop9OhpAa8u
U19VVH3WHkQI7ZfLvBTUiYK6tlYTiYCnpr8l6sm850CnVv1fzBW+DzmVhPJ6FdFd
4m5staC0sQ6gVqtjVMBOtT5CdzREse6hpwbKo2GRWFroO5W9tljMOJJXHvv/f6kz
DxrpUmj37JuRbqAbr8KDmQqPo6M2CRkxFxjol1yh5ER63u1xMwLm/PQITZIMDvPO
jrFc2C2SdM2E9bKP/RMCVoKSoRwxCJ5IwJ2AF237rrU0sx/zB2xsrOGssx5CWEgc
3bbk6tDQujJJubnCfgRy1tTxpLZOHEEKw8YhFLLbR2LCtA9pA/0rfLLad16cjA5e
5jIHxfsFc23zgpzrJeB7kAF/9xgu1tlA5BotOs3VBE89LtWOA9nK5dbPXng6qlUe
er3xLCfS38ovhUw6DusQpaYLuaYuLM7DKO4iav9kuTMcY9GkbPk7vDD3KPGh2goy
hY5cSM8+kT1q/THLnUBH
=Bdbv
-----END PGP SIGNATURE-----
Merge tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 cleanups from Arnd Bergmann:
"y2038 syscall implementation cleanups
This is a series of cleanups for the y2038 work, mostly intended for
namespace cleaning: the kernel defines the traditional time_t, timeval
and timespec types that often lead to y2038-unsafe code. Even though
the unsafe usage is mostly gone from the kernel, having the types and
associated functions around means that we can still grow new users,
and that we may be missing conversions to safe types that actually
matter.
There are still a number of driver specific patches needed to get the
last users of these types removed, those have been submitted to the
respective maintainers"
Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/
* tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
y2038: alarm: fix half-second cut-off
y2038: ipc: fix x32 ABI breakage
y2038: fix typo in powerpc vdso "LOPART"
y2038: allow disabling time32 system calls
y2038: itimer: change implementation to timespec64
y2038: move itimer reset into itimer.c
y2038: use compat_{get,set}_itimer on alpha
y2038: itimer: compat handling to itimer.c
y2038: time: avoid timespec usage in settimeofday()
y2038: timerfd: Use timespec64 internally
y2038: elfcore: Use __kernel_old_timeval for process times
y2038: make ns_to_compat_timeval use __kernel_old_timeval
y2038: socket: use __kernel_old_timespec instead of timespec
y2038: socket: remove timespec reference in timestamping
y2038: syscalls: change remaining timeval to __kernel_old_timeval
y2038: rusage: use __kernel_old_timeval
y2038: uapi: change __kernel_time_t to __kernel_old_time_t
y2038: stat: avoid 'time_t' in 'struct stat'
y2038: ipc: remove __kernel_time_t reference from headers
y2038: vdso: powerpc: avoid timespec references
...
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
A while ago Andy noticed
(http://lkml.kernel.org/r/CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com)
that UFFD_FEATURE_EVENT_FORK used by an unprivileged user may have
security implications.
As the first step of the solution the following patch limits the availably
of UFFD_FEATURE_EVENT_FORK only for those having CAP_SYS_PTRACE.
The usage of CAP_SYS_PTRACE ensures compatibility with CRIU.
Yet, if there are other users of non-cooperative userfaultfd that run
without CAP_SYS_PTRACE, they would be broken :(
Current implementation of UFFD_FEATURE_EVENT_FORK modifies the file
descriptor table from the read() implementation of uffd, which may have
security implications for unprivileged use of the userfaultfd.
Limit availability of UFFD_FEATURE_EVENT_FORK only for callers that have
CAP_SYS_PTRACE.
Link: http://lkml.kernel.org/r/1572967777-8812-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Nosh Minwalla <nosh@google.com>
Cc: Pavel Emelyanov <ovzxemul@gmail.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the registration is repeated without VM_UFFD_MISSING or VM_UFFD_WP they
need to be cleared. Currently setting UFFDIO_REGISTER_MODE_WP returns
-EINVAL, so this patch is a noop until the UFFDIO_REGISTER_MODE_WP support
is applied.
Link: http://lkml.kernel.org/r/20191004232834.GP13922@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first parameter hstate in function hugetlb_fault_mutex_hash() is not
used anymore.
This patch removes it.
[akpm@linux-foundation.org: various build fixes]
[cai@lca.pw: fix a GCC compilation warning]
Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With hugetlbfs, a common pattern for mapping anonymous huge pages is to
create a temporary file first. Currently libraries like libhugetlbfs
and seastar create these with a standard mkstemp+unlink trick, but it
would be more robust to be able to simply pass the O_TMPFILE flag to
open(). O_TMPFILE is already supported by several file systems like
ext4 and xfs. The implementation simply uses the existi= ng d_tmpfile
utility function to instantiate the dcache entry for the file.
Tested manually by successfully creating a temporary file by opening it
with (O_TMPFILE|O_RDWR) on mounted hugetlbfs and successfully mapping 2M
huge pages with it. Without the patch, trying to open a file with
O_TMPFILE results in -ENOSUP.
Link: http://lkml.kernel.org/r/bc9383eff6e1374d79f3a92257ae829ba1e6ae60.1573285189.git.p.sarna@tlen.pl
Signed-off-by: Piotr Sarna <p.sarna@tlen.pl>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is assumed that the hugetlbfs_vfsmount[] array will contain either a
valid vfsmount pointer or NULL for each hstate after initialization.
Changes made while converting to use fs_context broke this assumption.
While fixing the hugetlbfs_vfsmount issue, it was discovered that
init_hugetlbfs_fs never did correctly clean up when encountering a vfs
mount error.
It was found during code inspection. A small memory allocation failure
would be the most likely cause of taking a error path with the bug.
This is unlikely to happen as this is early init code.
Link: http://lkml.kernel.org/r/94b6244d-2c24-e269-b12c-e3ba694b242d@oracle.com
Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Fixes: 32021982a3 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
to determine the number of u32's in an array of unsigned longs.
Suppress warning by adding parentheses.
While looking at the above issue, noticed that the 'address' parameter
to hugetlb_fault_mutex_hash is no longer used. So, remove it from the
definition and all callers.
No functional change.
Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Ilie Halip <ilie.halip@gmail.com>
Cc: David Bolvansky <david.bolvansky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.
See also commit 5a9d929d6e ("iomap: report collisions between directio
and buffered writes to userspace").
Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.
Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The declarations of __block_write_begin_int and guard_bio_eod are needed
from internal.h so include it to fix the following sparse warnings:
fs/buffer.c:1930:5: warning: symbol '__block_write_begin_int' was not declared. Should it be static?
fs/buffer.c:2994:6: warning: symbol 'guard_bio_eod' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191011170039.16100-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use true/false for bool return type of has_bh_in_lru().
Link: http://lkml.kernel.org/r/20191029040529.GA7625@saurav
Signed-off-by: Saurav Girepunje <saurav.girepunje@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a static code checker warning:
fs/ocfs2/acl.c:331
ocfs2_acl_chmod() warn: passing zero to 'PTR_ERR'
Link: http://lkml.kernel.org/r/1dee278b-6c96-eec2-ce76-fe6e07c6e20f@linux.alibaba.com
Fixes: 5ee0fbd50f ("ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang")
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl3dbM4UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMagw/+MiOlIHFykjK0NGOyUbXR7AwjdrHz
1Fgkh7VZCAHMfCcdlljIIkYe7P+ybfdK51E1QLBiTPGh353JJRAvrjFbDcIyT4kf
u7AVddUeT0QQefFs39ZFWTV0mvJjDBfjFkmiL3cdY9ulx4ZX8V426qjyl8KIwTHe
YkQF3pYMmO28G1SfZu298zTmrFRA10FezxCUbBRZTTE9FcD7mnGQRB7w/wY9t0H1
ebIDgDA0EBh3oznGD3qxD63b5ULdSrImTzvSmEfhzZZsoZB4XGO5MOnLRqgBwFbT
qfRSRfHVl+1JtDHCU43zOUlzScAsff5mvfVNdDMLAFtGGSjDGxgxKNyLwp8+5wmH
GzvB99QQECvCN+gbaedm6adWBGzi7vpoCcgfqY0UPLYvCqsNFbZw4U/iu5ONKWy/
cEGVUGzHf0pWofugDIJfGQuRt6iS2XT9Ode6+QMx8OiC3auZcluhTuJSxMQIhFZ1
5XmoHOQddBwlalmIx8fsIHSo6xsAjNFwTOikEFVzUB+ECR2Urs8eHvQj+d94xz6e
q9LrNkt/eVIDiI+PeP+UBD5IJlfmRSoyUd/mIDFMfmqMBubBSQl70Eyt3dUdt1Bz
0PBs6xjYztpgk3Q7s35TMn8EvDcEksG0WEoFM5fEohYPLWQf8tJ5BbyYJJqc0sod
ExlXu1lPfg9ppKc=
=fc7R
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20191126' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Audit is back for v5.5, albeit with only two patches:
- Allow for the auditing of suspicious O_CREAT usage via the new
AUDIT_ANOM_CREAT record.
- Remove a redundant if-conditional check found during code analysis.
It's a minor change, but when the pull request is only two patches
long, you need filler in the pull request email"
[ Heh on the pull request filler. I wish more people tried to write
better pull request messages, even if maybe it's not worth it for the
trivial cases ;^) - Linus ]
* tag 'audit-pr-20191126' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: remove redundant condition check in kauditd_thread()
audit: Report suspicious O_CREAT usage
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3O0OoACgkQ+7dXa6fL
C2tAwA//VH9Y81azemXFdflDF90sSH3TCASlKHVYHbBNAkH/QP5F00G4BEM4nNqH
F3x7qcU9vzfGdumF1pc90Yt6XSYlsQEGF+xMyMw/VS2wKs40yv+b/doVbzOWbN9C
NfrklgHeuuBk+JzU2llDisVqKRTLt4SmDpYu1ZdcchUQFZCCl3BpgdSEC+xXrHay
+KlRPVNMSd2kXMCDuSWrr71lVNdCTdf3nNC5p1i780+VrgpIBIG/jmiNdCcd7PLH
1aesPlr8UZY3+bmRtqe587fVRAhT2qA2xibKtyf9R0hrDtUKR4NSnpPmaeIjb26e
LhVntcChhYxQqzy/T4ScTDNVjpSlwi6QMo5DwAwzNGf2nf/v5/CZ+vGYDVdXRFHj
tgH1+8eDpHsi7jJp6E4cmZjiolsUx/ePDDTrQ4qbdDMO7fmIV6YQKFAMTLJepLBY
qnJVqoBq3qn40zv6tVZmKgWiXQ65jEkBItZhEUmcQRBiSbBDPweIdEzx/mwzkX7U
1gShGdut6YP4GX7BnOhkiQmzucS85mgkUfG43+mBfYXb+4zNTEjhhkqhEduz2SQP
xnjHxEM+MTGCj3PozIpJxNKzMTEceYY7cAUdNEMDQcHog7OCnIdGBIc7BPnsN8yA
CPzntwP4mmLfK3weq3PIGC6d9xfc9PpmiR9docxQOvE6sk2Ifeo=
=FKC7
-----END PGP SIGNATURE-----
Merge tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull pipe rework from David Howells:
"This is my set of preparatory patches for building a general
notification queue on top of pipes. It makes a number of significant
changes:
- It removes the nr_exclusive argument from __wake_up_sync_key() as
this is always 1. This prepares for the next step:
- Adds wake_up_interruptible_sync_poll_locked() so that poll can be
woken up from a function that's holding the poll waitqueue
spinlock.
- Change the pipe buffer ring to be managed in terms of unbounded
head and tail indices rather than bounded index and length. This
means that reading the pipe only needs to modify one index, not
two.
- A selection of helper functions are provided to query the state of
the pipe buffer, plus a couple to apply updates to the pipe
indices.
- The pipe ring is allowed to have kernel-reserved slots. This allows
many notification messages to be spliced in by the kernel without
allowing userspace to pin too many pages if it writes to the same
pipe.
- Advance the head and tail indices inside the pipe waitqueue lock
and use wake_up_interruptible_sync_poll_locked() to poke poll
without having to take the lock twice.
- Rearrange pipe_write() to preallocate the buffer it is going to
write into and then drop the spinlock. This allows kernel
notifications to then be added the ring whilst it is filling the
buffer it allocated. The read side is stalled because the pipe
mutex is still held.
- Don't wake up readers on a pipe if there was already data in it
when we added more.
- Don't wake up writers on a pipe if the ring wasn't full before we
removed a buffer"
* tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
pipe: Remove sync on wake_ups
pipe: Increase the writer-wakeup threshold to reduce context-switch count
pipe: Check for ring full inside of the spinlock in pipe_write()
pipe: Remove redundant wakeup from pipe_write()
pipe: Rearrange sequence in pipe_write() to preallocate slot
pipe: Conditionalise wakeup in pipe_read()
pipe: Advance tail pointer inside of wait spinlock in pipe_read()
pipe: Allow pipes to have kernel-reserved slots
pipe: Use head and tail pointers for the ring, not cursor and length
Add wake_up_interruptible_sync_poll_locked()
Remove the nr_exclusive argument from __wake_up_sync_key()
pipe: Reduce #inclusion of pipe_fs_i.h
vfs_rmdir and vfs_unlink can return -EBUSY if the
target is a mountpoint. This currently gets passed to
nfserrno() by nfsd_unlink(), and that results in a WARNing,
which is not user-friendly.
Possibly the best NFSv4 error is NFS4ERR_FILE_OPEN, because
there is a sense in which the object is currently in use
by some other task. The Linux NFSv4 client will map this
back to EBUSY, which is an added benefit.
For NFSv3, the best we can do is probably NFS3ERR_ACCES, which isn't
true, but is not less true than the other options.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The NFSv4.2 CLONE operation has implicit persistence requirements on the
target file, since there is no protocol requirement that the client issue
a separate operation to persist data.
For that reason, we should call vfs_fsync_range() on the destination file
after a successful call to vfs_clone_file_range().
Fixes: ffa0160a10 ("nfsd: implement the NFSv4.2 CLONE operation")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl3hAmYACgkQnJ2qBz9k
QNna+AgAs/6eWmQqEwij9+E761bELuMXIp8Ej9OxrBgE9Uls0VsMOsG1xfFoue+J
sZme9XSUQ+STqPQzTi6FH4x1n5kL3VnZzjRDv9qpsRCfMzX1MbibczNvXuCCgGuP
cKr3gPFX+GSDwQPEJX7tzF5zLZ1t1XOc4TD/H6v/X+2Nr/N1lnYeAaNcfFiGMyg1
Pk9JK2/qUoqWASCx6zxHG/lfjmaSSIrOwaKFACNHH0/6Luc1M7Oee3M9T/jS2IEo
8/WTS4o9CwOUFaEPemuUBv4yHyBpbJwsM+emwNKrNDSWbI5vP3ECOO7c4c+1/abS
FcwPvkHyAQnpSgmeYlhfTjZfA/A6+w==
=m7WG
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Three fsnotify cleanups"
* tag 'fsnotify_for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: Add git tree reference to MAINTAINERS
fsnotify/fdinfo: exportfs_encode_inode_fh() takes pointer as 4th argument
fsnotify: move declaration of fsnotify_mark_connector_cachep to fsnotify.h
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl3hAFIACgkQnJ2qBz9k
QNkV/gf+Kwn7xHg76YXd15lZYBzhgj/ABAYEEAAVY49OOCK5+XVmmAufHesMZ2lU
Solt8PvbQ8d5786bWpaYXgrTU3JW37c6x1MDUPDLQ8goXWzx7pZWvD+Yup558rDa
H1aoqvFKLgpeVVqkUdvvv2CDbgZyOgGlkDqWeS+c5pZd1NPFZzUAoU26slvQ5h4f
t41mbavOIm5DChQ5UjwRNw+pb09GXaHrPBRJwa1XuJYJWAansBcQIsxiiqt/43Gn
AzwUGrsz4vrPBk+Kcd0SGb8vinFVQr19gBFKFeN3rPFUEUn6T0FPBqaYeiNTNE37
AqASYKlIuhcSf0Wdvx6vxwSHsFl5VA==
=NGxV
-----END PGP SIGNATURE-----
Merge tag 'for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, quota, reiserfs cleanups and fixes from Jan Kara:
- Refactor the quota on/off kernel internal interfaces (mostly for
ubifs quota support as ubifs does not want to have inodes holding
quota information)
- A few other small quota fixes and cleanups
- Various small ext2 fixes and cleanups
- Reiserfs xattr fix and one cleanup
* tag 'for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
ext2: code cleanup for descriptor_loc()
fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
ext2: fix improper function comment
ext2: code cleanup for ext2_try_to_allocate()
ext2: skip unnecessary operations in ext2_try_to_allocate()
ext2: Simplify initialization in ext2_try_to_allocate()
ext2: code cleanup by calling ext2_group_last_block_no()
ext2: introduce new helper ext2_group_last_block_no()
reiserfs: replace open-coded atomic_dec_and_mutex_lock()
ext2: check err when partial != NULL
quota: Handle quotas without quota inodes in dquot_get_state()
quota: Make dquot_disable() work without quota inodes
quota: Drop dquot_enable()
fs: Use dquot_load_quota_inode() from filesystems
quota: Rename vfs_load_quota_inode() to dquot_load_quota_inode()
quota: Simplify dquot_resume()
quota: Factor out setup of quota inode
quota: Check that quota is not dirty before release
quota: fix livelock in dquot_writeback_dquots
ext2: don't set *count in the case of failure in ext2_try_to_allocate()
...
- Introduce superblock checksum support;
- Set iowait when waiting I/O for sync decompression path;
- Several code cleanups.
-----BEGIN PGP SIGNATURE-----
iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXd/dwRYcZ2FveGlhbmcy
NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEjd4BANBpFIxoenCeu9A8XVG4TVIobyoz
g/eci9QZr+rgPGl/AP93QFEAPmjwQi6iNQREwvXQ96EE9rMX0Legt3tFeHPRCw==
=xdz3
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"No major kernel updates for this round since I'm fully diving into
LZMA algorithm internals now to provide high CR XZ algorihm support.
That needs more work and time for me to get a better compression time.
Summary:
- Introduce superblock checksum support
- Set iowait when waiting I/O for sync decompression path
- Several code cleanups"
* tag 'erofs-for-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: remove unnecessary output in erofs_show_options()
erofs: drop all vle annotations for runtime names
erofs: support superblock checksum
erofs: set iowait for sync decompression
erofs: clean up decompress queue stuffs
erofs: get rid of __stagingpage_alloc helper
erofs: remove dead code since managed cache is now built-in
erofs: clean up collection handling routines
-----BEGIN PGP SIGNATURE-----
iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3fB2gACgkQiiy9cAdy
T1G4rgwAwTVvUYOmUVxJt1OxJDAnVxLXOH08FmrSL08nacrUCb6fPRw0z0Eb65SB
4/Yc3xIH+bJInNpACKZVY0ghPlDYC4u7AtLUZqm4+ruDVSjoWmF4s38WsNCrZywm
uAWq4p+j5A62uGdgdSzVxdUJkSLGRTc+7ib+y/TZ2U4BLjb6AY2etUtOpnA6lQ4R
UHvjf7leHw55pvPbNnpZ/2zT18KpPH8twCj0f3lH4yt8Akw0p/va6tRcfsmE2plh
V3dxiPmS86otXFPyHpu9B9plrxJCcGuYUL+R8pPpbLD3MZMsdK7J+sGNdayOoLXl
bpuoTvl3LThEwcg7bFeB/jHXGx95KPm3vh99lrZvbM9Snya+ueSfudrhmoqCEiSo
5ri1tWH7GlPr9nEaxp/aYB2LZENpbYvwn0yxoUdgK+R9EN5F3utUwK54CgYPVz9Y
1zp0LksafWZu7NrIPOuoq+/ATJ1WgbDYrhotk9fwaZoc2EsUfJd0I31XGECisjdh
fcHNxbA5
=EIrP
-----END PGP SIGNATURE-----
Merge tag '5.5-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"Various smb3 fixes (including 12 for stable) and also features
(addition of multichannel support)"
* tag '5.5-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: (41 commits)
CIFS: fix a white space issue in cifs_get_inode_info()
cifs: update internal module version number
cifs: Always update signing key of first channel
cifs: Fix retrieval of DFS referrals in cifs_mount()
cifs: Fix potential softlockups while refreshing DFS cache
cifs: Fix lookup of root ses in DFS referral cache
cifs: Fix use-after-free bug in cifs_reconnect()
cifs: dump channel info in DebugData
smb3: dump in_send and num_waiters stats counters by default
cifs: try harder to open new channels
CIFS: Properly process SMB3 lease breaks
cifs: move cifsFileInfo_put logic into a work-queue
cifs: try opening channels after mounting
CIFS: refactor cifs_get_inode_info()
cifs: switch servers depending on binding state
cifs: add server param
cifs: add multichannel mount options and data structs
cifs: sort interface list by speed
CIFS: Fix SMB2 oplock break processing
cifs: don't use 'pre:' for MODULE_SOFTDEP
...
In this round, we've introduced fairly small number of patches as below.
Enhancement:
- improve the in-place-update IO flow
- allocate segment to guarantee no GC for pinned files
Bug fix:
- fix updatetime in lazytime mode
- potential memory leak in f2fs_listxattr
- record parent inode number in rename2 correctly
- fix deadlock in f2fs_gc along with atomic writes
- avoid needless data migration in GC
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl3e1XkACgkQQBSofoJI
UNJ0GhAAhVIX4J91CLnVSh0ik1XCaI6h/dFeS6kbDd8oxzQm/qt64b59aZqgy7Rk
iblGWfj8uPP5yO60pqb5uN4a0hybptVZSEldbhF0Xv0zUeVoT7C1ksTMrdUd1p7d
YkO8G+V4QBBrtpKG1KKKEncrvcdx4n9QHxGsRh4z5vXZH7sEmH7+N8OE88MaPjdZ
UWqYk0S0GoZBhPe7c8pQuD/PM+WJJH4Lewgw5kK21eAjOKI+yZKb+bY2tGjo5dA1
nzYO72CRMV4VEKsnxTZ/LCB2kCXeexaGuiVPyHjCmgAh990cLjsCWIbJ8EJu7uAa
vAo6/EMfgfPkPt5Y7uWGR4EeNT7AFhUoMuoQ9zdXzecY48D4Gz58o87Q+OFY3ipZ
W2OSf92pEJyfumE5o8wN435gaRYUjjCo1SMoIQABNav411XrBVoRwjvkV3DyA6af
Bs1bafz2hR/E1q0uoZvLWC5waiHy9605OkKMs/y8IRsn6yhRep/tv3KLk2Dz3fOO
LxenhuVO9bQDCheEcH15qIljxTuyfTyUOa9UrFXOwn4mK61J8A/Gs+SiqW0y28oA
feSw7cLPxK0OlYQgql24JfJN/Xt523WmCSfXfe7TCUDTDkBpmsdhFwHYZyCLzqt+
FyBhf2DF/BGzKMT28oc7StO43mIvOc1Wk+jfJFW+hld5ncAJxCE=
=qyrd
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've introduced fairly small number of patches as below.
Enhancements:
- improve the in-place-update IO flow
- allocate segment to guarantee no GC for pinned files
Bug fixes:
- fix updatetime in lazytime mode
- potential memory leak in f2fs_listxattr
- record parent inode number in rename2 correctly
- fix deadlock in f2fs_gc along with atomic writes
- avoid needless data migration in GC"
* tag 'f2fs-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: stop GC when the victim becomes fully valid
f2fs: expose main_blkaddr in sysfs
f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
f2fs: show f2fs instance in printk_ratelimited
f2fs: fix potential overflow
f2fs: fix to update dir's i_pino during cross_rename
f2fs: support aligned pinned file
f2fs: avoid kernel panic on corruption test
f2fs: fix wrong description in document
f2fs: cache global IPU bio
f2fs: fix to avoid memory leakage in f2fs_listxattr
f2fs: check total_segments from devices in raw_super
f2fs: update multi-dev metadata in resize_fs
f2fs: mark recovery flag correctly in read_raw_super_block()
f2fs: fix to update time in lazytime mode
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3W9UYACgkQ+7dXa6fL
C2uO1g//T0NvHmrZr6zHbOWlaKkGXRfhgCLENzuTCtcFt29NV+Jv8D/FNuT29d7f
x8PzrZUBDYfpjP8OSHEivZgU5+a4jpI5pSd2mnGnT3523Oxbm2lIz/SXgpkwQdkM
4OcK+voCP1MvHbmO7jyOfQm7SX/aOtF90GQrUog5JqwPD0ddSpk/swNNRKZ6xK8/
cAq9/DHPfYHfLh8wc2FWmhdPA5Px+eKt9kGmMMpwsCulR4yMM7vBCIVNSccSjpYR
QJzmFpi/TfOiRMc0oZnXLv6UUG+7hHeRfRosoQA8YGGsjwXfs2I7MfAziJx68tfu
v7beYI+u7tc4EhgK9jXxbgPCJF6nFR/zrbzfC3Zlh7Hr3NCGGzH4gwDu+pn4lwfV
bpUyA6VwmEt+CVzjj6LB5l4vyRSk+TtVQSwfQvCA7xqvdq0J0/fux6OQ1DwJonQe
6dOpPUd4ip5tW28S9U7MzbbpMv1kP5hMoAu87gBdygUDhPojM7Ut0mFwZDMOunXg
vW6oQwKyyFPeJhRcPMKxS7opWQz5889tDe8GPNI/1ociOYCcCkmwZaaG+hYAJ8+N
8bK2OCi4yNGBa/qMPbwATY+TVKdYcceq/7THxcI3vZk8nVrda73UQHfwUXsjwUzf
SNNcoMrsEV9BiPSG7gAkCf3VI2gXj+USm3w7SBSjZDM5qR/lRHg=
=YjAm
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20191121' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"Minor cleanups and fix:
- Minor fix to make some debugging statements display information
from the correct iov_iter.
- Rename some members and variables to make things more obvious or
consistent.
- Provide a helper to wrap increments of the usage count on the
afs_read struct.
- Use scnprintf() to print into a stack buffer rather than sprintf().
- Remove some set but unused variables"
* tag 'afs-next-20191121' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Remove set but not used variable 'ret'
afs: Remove set but not used variables 'before', 'after'
afs: xattr: use scnprintf
afs: Introduce an afs_get_read() refcount helper
afs: Rename desc -> req in afs_fetch_data()
afs: Switch the naming of call->iter and call->_iter
afs: Use call->_iter not &call->iter in debugging statements
* Direct I/O via iomap (required the iomap-for-next branch from Darrick
as a prereq).
* Support for using dioread-nolock where the block size < page size.
* Support for encryption for file systems where the block size < page size.
* Rework of journal credits handling so a revoke-heavy workload will
not cause the journal to run out of space.
* Replace bit-spinlocks with spinlocks in jbd2
Also included were some bug fixes and cleanups, mostly to clean up
corner cases from fuzzed file systems and error path handling.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl3dHxoACgkQ8vlZVpUN
gaMZswf5AbtQhTEJDXO7Pc1ull38GIGFgAv7uAth0TymLC3h1/FEYWW0crEPFsDr
1Eei55UUVOYrMMUKQ4P7wlLX0cIh3XDPMWnRFuqBoV5/ZOsH/ZSbkY//TG2Xze/v
9wXIH/RKQnzbRtXffJ1+DnvmXJk+HFm1R1gjl0nfyUXGrnlSfqJxhLSczyd6bJJq
ehi/tso5UC/4EQsAIdWp7VWsAdaHcZ7ogHqDoy8dXpM1equ408iml7VlKr8R+Nr7
5ANpCISXChSlLLYm0NYN5vhO8upF5uDxWLdCtxVPL5kFdM2m/ELjXw9h9C+78l7C
EWJGlGlxvx07Px+e+bfStEsoixpWBg==
=0eko
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"This merge window saw the the following new featuers added to ext4:
- Direct I/O via iomap (required the iomap-for-next branch from
Darrick as a prereq).
- Support for using dioread-nolock where the block size < page size.
- Support for encryption for file systems where the block size < page
size.
- Rework of journal credits handling so a revoke-heavy workload will
not cause the journal to run out of space.
- Replace bit-spinlocks with spinlocks in jbd2
Also included were some bug fixes and cleanups, mostly to clean up
corner cases from fuzzed file systems and error path handling"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits)
ext4: work around deleting a file with i_nlink == 0 safely
ext4: add more paranoia checking in ext4_expand_extra_isize handling
jbd2: make jbd2_handle_buffer_credits() handle reserved handles
ext4: fix a bug in ext4_wait_for_tail_page_commit
ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
ext4: code cleanup for get_next_id
ext4: fix leak of quota reservations
ext4: remove unused variable warning in parse_options()
ext4: Enable encryption for subpage-sized blocks
fs/buffer.c: support fscrypt in block_read_full_page()
ext4: Add error handling for io_end_vec struct allocation
jbd2: Fine tune estimate of necessary descriptor blocks
jbd2: Provide trace event for handle restarts
ext4: Reserve revoke credits for freed blocks
jbd2: Make credit checking more strict
jbd2: Rename h_buffer_credits to h_total_credits
jbd2: Reserve space for revoke descriptor blocks
jbd2: Drop jbd2_space_needed()
jbd2: Account descriptor blocks into t_outstanding_credits
jbd2: Factor out common parts of stopping and restarting a handle
...
- Fix another place in the splice code where a pipe could ask a
filesystem for a longer read than the pipe actually has free buffer
space.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3XKBsACgkQ+H93GTRK
tOubhg/9FhWIX+Ijj0a1MQj3QDv0YbLPv2uTVLYyRK+1X5Xwc7u0OmmbWYrQQQUJ
6aOaictBXXTIJtKqVcmXBXboUskKygnWKlVssKAI/BYR9ope/EMrMV5+4bz0mKt/
ufcDXRiJX2hbL8GicmJxK/qmp3bEAG0n/e0+0LcuYynatYXJQ19eOUeritVZ3EYA
Vhbgvrtuq8ws9SqvJ4Dm7fLsPT0AOHocor34tXS+gRq8KYIA2Ru7uDyE2uJJujF7
qAlmZVvun+RcSBG0mYdk6GDgLGibn76t+d0xnqd/OzSeh+6bHYc1ni6FJD8kjVzf
nGHFJfbEOCzDEees9ESBkgnH1TiCN4wCs7JQ4BictAPotBCtkbcNLvrO+b77zgwp
DAu6uE/DiRmIu9zqfy8TyH1E78pW7qTpI9yFDUQLaItvAbbqYIaiRLr4Hjd2Hf3E
MLYOSGZkKn/xZaqedk+rGwYJJmmvzCZblfrQ3hCtLGjmNQgfTRXidlgyicl1Ta12
iG9UkCFdbkZmdW9zWyYmZrfxLB6QzOZpgbdUwyMk0Rfc1sKJKYoNE+hkt8fjmS71
0hBPYSopDsYkGEH1719Nsfvyj0101McFklvEqS2hU9/rLyaF0Mdq5catVbzyBPhk
RwnvXdFd/sFgbR+1xLiEm51aXAe/kmveNNQBqZkuBgXITt7oh7Q=
=ripU
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.5-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull splice fix from Darrick Wong:
"Fix another place in the splice code where a pipe could ask a
filesystem for a longer read than the pipe actually has free buffer
space"
* tag 'vfs-5.5-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
splice: only read in as much information as there is pipe buffer space
- Make iomap_dio_rw callers explicitly tell us if they want us to wait
- Port the xfs writeback code to iomap to complete the buffered io
library functions
- Refactor the unshare code to share common pieces
- Add support for performing copy on write with buffered writes
- Other minor fixes
- Fix unchecked return in iomap_bmap
- Fix a type casting bug in a ternary statement in iomap_dio_bio_actor
- Improve tracepoints for easier diagnostic ability
- Fix pipe page leakage in directio reads
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3YDqQACgkQ+H93GTRK
tOsbPg/+KrPkhe60RnUO4PQbSLTRLsz5R7OK5ubfxsKlHdyy/8vu12yPdY03LcHn
QoyQz+5gPjM58UvysIonlyRO0O30apl8UZ9PAINDMi7os6NP87illw4HtHL1cZjB
JfIQKVJrLNtocZnAgVL74d6pUs7MH32SPw0r+/qbfz/JFcHdf/Sz8fSpb0sdS/oK
QwT73TcaCcNa2C4twvhtO6+kMkzlTkJknqSZMqrthScqRRVeOnyGTjLdKqUamSqp
uj0iAKm+bBpCcuMdcHd7EOgQyVGwCQKUndaLKojK/V1iqiK+3KsLnIoJj6HwlU27
Q+pDoThv2V8m/Y940Gq0wzTNtkwdCirNaeKXwXX2ytlyPX5W45ZxgzGUQy4YYXGM
ObHRmeJXdka6kH7yzWlPiZGdZixpagFLzFUHWjMXAD8Fb4YxNKM4FsJDY3K3uQi6
y7EKV5O4q3qBW3ieL6l+wl9NkdcppSywRjRxhBZIhH4T2n7RMDLdBkNCiKZrWqIW
1QcrIC1NvwbXPzSNaZ40dgjB6mJGTv+P9AJLSQpmlR1Y6txLsJ1AeVzucZOnXcCo
TEiBfDFOZ9jvXUsaqGSdM99HxsePKn+VgEeTA6TFxTWJ9QcEgio1Pym+RkN6KrIU
zsbJbgk9bp3YT940xKt90fAHWm/iKUWrrRyidVCK3kJd72e41UE=
=FRl4
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"In this release, we hoisted as much of XFS' writeback code into iomap
as was practicable, refactored the unshare file data function, added
the ability to perform buffered io copy on write, and tweaked various
parts of the directio implementation as needed to port ext4's directio
code (that will be a separate pull).
Summary:
- Make iomap_dio_rw callers explicitly tell us if they want us to
wait
- Port the xfs writeback code to iomap to complete the buffered io
library functions
- Refactor the unshare code to share common pieces
- Add support for performing copy on write with buffered writes
- Other minor fixes
- Fix unchecked return in iomap_bmap
- Fix a type casting bug in a ternary statement in
iomap_dio_bio_actor
- Improve tracepoints for easier diagnostic ability
- Fix pipe page leakage in directio reads"
* tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
iomap: Fix pipe page leakage during splicing
iomap: trace iomap_appply results
iomap: fix return value of iomap_dio_bio_actor on 32bit systems
iomap: iomap_bmap should check iomap_apply return value
iomap: Fix overflow in iomap_page_mkwrite
fs/iomap: remove redundant check in iomap_dio_rw()
iomap: use a srcmap for a read-modify-write I/O
iomap: renumber IOMAP_HOLE to 0
iomap: use write_begin to read pages to unshare
iomap: move the zeroing case out of iomap_read_page_sync
iomap: ignore non-shared or non-data blocks in xfs_file_dirty
iomap: always use AOP_FLAG_NOFS in iomap_write_begin
iomap: remove the unused iomap argument to __iomap_write_end
iomap: better document the IOMAP_F_* flags
iomap: enhance writeback error message
iomap: pass a struct page to iomap_finish_page_writeback
iomap: cleanup iomap_ioend_compare
iomap: move struct iomap_page out of iomap.h
iomap: warn on inline maps in iomap_writepage_map
iomap: lift the xfs writeback code to iomap
...
Christophe reports that current master fails building on powerpc with
this error:
CC fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
kunmap(iter->bvec->bv_page);
^
which is caused by a missing highmem.h include. Fix it by including
it.
Fixes: 311ae9e159 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit f2538f9993. The patch
stopped JFFS2 from being able to mount an existing filesystem with the
following errors:
jffs2: error: (77) jffs2_build_inode_fragtree: Add node to tree failed -22
jffs2: error: (77) jffs2_do_read_inode_internal: Failed to build final fragtree for inode #5377: error -22
Fixes: f2538f9993 ("jffs2: Fix possible null-pointer dereferences...")
Cc: stable@vger.kernel.org
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3f/C0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprUjD/4t5ituIXmfPDulW4178LchRbAdC2FquGeU
Okft4laeQsJ8ucTBm3RAziVDNnkeJ0mLy5JTMpdPwKW+1F/d8jPuDeTlmmRyKGyW
vt2NlI9fKaOt8cU+LPp3MiGIamleZFOEPsRHU03HYASFcRKQwAswgjS2DZnR7oTK
66XGB9CdwREh1mM63wERYX07+t+ylocTClcLQDCi99H4HI5ScsJXB5aSqpombycy
XujZGlEAUT6WLp0HAwihhAnQIu/fEw+tXMWLvozphIP9xZGpk+a2nbd2WxTCdLZy
MoksoCpl76RmYPuZ+TC1L+tHf2KqGHa/mo64L/6v15QS8hROw3ZTMdJI0mEC/90g
5JwWuBqWOcRiMcV1Uwkz5OYL3KYQOWebS+FBbIc+2jqBlXJeN+1R/Ik5HP8xtGaE
5p317/q09h32piZN0c+BmkXCdnAYZeBez0EjSXf6j3PW3LAqmLOjxc0yA2OmwFWj
iudOfBOg48HtHkMaNrg2f+0THMP8zHqUVoRh96BKCSjnOXmbxGFt3ACHvb+VS0CX
RmlVOcaq+SzL1EsT805Hb6N3x7mK2l9CoanGfLxYXQZlOzD6TcRdlBzs9JCgqiAs
3jk5nAq9wtYvigAkVarMw/LAJNatEO4sy0AHyWJwAcfOsVJquNvhhokT8VxOcVZs
pMxUhUYKZA==
=HqiO
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/io_uring-post-20191128' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"As mentioned in the first pull request, there was a later batch as
well. This contains fixes to the stuff that already went in, cleanups,
and a few later additions. In particular, this contains:
- Cleanups/fixes/unification of the submission and completion path
(Pavel,me)
- Linked timeouts improvements (Pavel,me)
- Error path fixes (me)
- Fix lookup window where cancellations wouldn't work (me)
- Improve DRAIN support (Pavel)
- Fix backlog flushing -EBUSY on submit (me)
- Add support for connect(2) (me)
- Fix for non-iter based fixed IO (Pavel)
- creds inheritance for async workers (me)
- Disable cmsg/ancillary data for sendmsg/recvmsg (me)
- Shrink io_kiocb to 3 cachelines (me)
- NUMA fix for io-wq (Jann)"
* tag 'for-5.5/io_uring-post-20191128' of git://git.kernel.dk/linux-block: (42 commits)
io_uring: make poll->wait dynamically allocated
io-wq: shrink io_wq_work a bit
io-wq: fix handling of NUMA node IDs
io_uring: use kzalloc instead of kcalloc for single-element allocations
io_uring: cleanup io_import_fixed()
io_uring: inline struct sqe_submit
io_uring: store timeout's sqe->off in proper place
net: disallow ancillary data for __sys_{send,recv}msg_file()
net: separate out the msghdr copy from ___sys_{send,recv}msg()
io_uring: remove superfluous check for sqe->off in io_accept()
io_uring: async workers should inherit the user creds
io-wq: have io_wq_create() take a 'data' argument
io_uring: fix dead-hung for non-iter fixed rw
io_uring: add support for IORING_OP_CONNECT
net: add __sys_connect_file() helper
io_uring: only return -EBUSY for submit on non-flushed backlog
io_uring: only !null ptr to io_issue_sqe()
io_uring: simplify io_req_link_next()
io_uring: pass only !null to io_req_find_next()
io_uring: remove io_free_req_find_next()
...
That is a bit weird scenario but I find it interesting to run fio loads
using LKL linux, where MMU is disabled. Probably other real archs which
run uClinux can also benefit from this patch.
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
[ Numerous string handling, leak and regression fixes; rbd conversion
was particularly broken and had to be redone almost from scratch. ]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Here is the "big" set of driver core patches for 5.5-rc1
There's a few minor cleanups and fixes in here, but the majority of the
patches in here fall into two buckets:
- debugfs api cleanups and fixes
- driver core device link support for boot dependancy issues
The debugfs api cleanups are working to slowly refactor the debugfs apis
so that it is even harder to use incorrectly. That work has been
happening for the past few kernel releases and will continue over time,
it's a long-term project/goal
The driver core device link support missed 5.4 by just a bit, so it's
been sitting and baking for many months now. It's from Saravana Kannan
to help resolve the problems that DT-based systems have at boot time
with dependancy graphs and kernel modules. Turns out that no one has
actually tried to build a generic arm64 kernel with loads of modules and
have it "just work" for a variety of platforms (like a distro kernel)
The big problem turned out to be a lack of depandancy information
between different areas of DT entries, and the work here resolves that
problem and now allows devices to boot properly, and quicker than a
monolith kernel.
All of these patches have been in linux-next for a long time with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXd6m6Q8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yntJQCcCqg6RQ7LTdHuZv1ETeefXlsfk00An1Jtean6
42bWGx52bGFvAcpjWy8R
=P7hq
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core patches for 5.5-rc1
There's a few minor cleanups and fixes in here, but the majority of
the patches in here fall into two buckets:
- debugfs api cleanups and fixes
- driver core device link support for boot dependancy issues
The debugfs api cleanups are working to slowly refactor the debugfs
apis so that it is even harder to use incorrectly. That work has been
happening for the past few kernel releases and will continue over
time, it's a long-term project/goal
The driver core device link support missed 5.4 by just a bit, so it's
been sitting and baking for many months now. It's from Saravana Kannan
to help resolve the problems that DT-based systems have at boot time
with dependancy graphs and kernel modules. Turns out that no one has
actually tried to build a generic arm64 kernel with loads of modules
and have it "just work" for a variety of platforms (like a distro
kernel). The big problem turned out to be a lack of dependency
information between different areas of DT entries, and the work here
resolves that problem and now allows devices to boot properly, and
quicker than a monolith kernel.
All of these patches have been in linux-next for a long time with no
reported issues"
* tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (68 commits)
tracing: Remove unnecessary DEBUG_FS dependency
of: property: Add device link support for interrupt-parent, dmas and -gpio(s)
debugfs: Fix !DEBUG_FS debugfs_create_automount
of: property: Add device link support for "iommu-map"
of: property: Fix the semantics of of_is_ancestor_of()
i2c: of: Populate fwnode in of_i2c_get_board_info()
drivers: base: Fix Kconfig indentation
firmware_loader: Fix labels with comma for builtin firmware
driver core: Allow device link operations inside sync_state()
driver core: platform: Declare ret variable only once
cpu-topology: declare parse_acpi_topology in <linux/arch_topology.h>
crypto: hisilicon: no need to check return value of debugfs_create functions
driver core: platform: use the correct callback type for bus_find_device
firmware_class: make firmware caching configurable
driver core: Clarify documentation for fwnode_operations.add_links()
mailbox: tegra: Fix superfluous IRQ error message
net: caif: Fix debugfs on 64-bit platforms
mac80211: Use debugfs_create_xul() helper
media: c8sectpfe: no need to check return value of debugfs_create functions
of: property: Add device link support for iommus, mboxes and io-channels
...
We accidentally messed up the indenting on this if statement.
Fixes: 16c696a6c300 ("CIFS: refactor cifs_get_inode_info()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Allow a fatal signal to interrupt us when we're scanning a directory to
verify a parent pointer.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
exit_aio() is sometimes stuck in wait_for_completion() after aio is issued
with direct IO and the task receives a signal.
The reason is failure to call ->ki_complete() due to a leaked reference to
fuse_io_priv. This happens in fuse_async_req_send() if
fuse_simple_background() returns an error (e.g. -EINTR).
In this case the error value is propagated via io->err, so return success
to not confuse callers.
This issue is tracked as a virtio-fs issue:
https://gitlab.com/virtio-fs/qemu/issues/14
Reported-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Fixes: 45ac96ed7c ("fuse: convert direct_io to simple api")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- A comprehensive rewrite of the robust/PI futex code's exit handling
to fix various exit races. (Thomas Gleixner et al)
- Rework the generic REFCOUNT_FULL implementation using
atomic_fetch_* operations so that the performance impact of the
cmpxchg() loops is mitigated for common refcount operations.
With these performance improvements the generic implementation of
refcount_t should be good enough for everybody - and this got
confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
REFCOUNT_FULL entirely, leaving the generic implementation enabled
unconditionally. (Will Deacon)
- Other misc changes, fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
lkdtm: Remove references to CONFIG_REFCOUNT_FULL
locking/refcount: Remove unused 'refcount_error_report()' function
locking/refcount: Consolidate implementations of refcount_t
locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
locking/refcount: Move saturation warnings out of line
locking/refcount: Improve performance of generic REFCOUNT_FULL code
locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
locking/refcount: Remove unused refcount_*_checked() variants
locking/refcount: Ensure integer operands are treated as signed
locking/refcount: Define constants for saturation and max refcount values
futex: Prevent exit livelock
futex: Provide distinct return value when owner is exiting
futex: Add mutex around futex exit
futex: Provide state handling for exec() as well
futex: Sanitize exit state handling
futex: Mark the begin of futex exit explicitly
futex: Set task::futex_state to DEAD right after handling futex exit
futex: Split futex_mm_release() for exit/exec
exit/exec: Seperate mm_release()
futex: Replace PF_EXITPIDONE with a state
...
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- Dynamic tick (nohz) updates, perhaps most notably changes to force
the tick on when needed due to lengthy in-kernel execution on CPUs
on which RCU is waiting.
- Linux-kernel memory consistency model updates.
- Replace rcu_swap_protected() with rcu_prepace_pointer().
- Torture-test updates.
- Documentation updates.
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
rcu: Suppress levelspread uninitialized messages
rcu: Fix uninitialized variable in nocb_gp_wait()
rcu: Update descriptions for rcu_future_grace_period tracepoint
rcu: Update descriptions for rcu_nocb_wake tracepoint
rcu: Remove obsolete descriptions for rcu_barrier tracepoint
rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
workqueue: Convert for_each_wq to use built-in list check
rcu: Several rcu_segcblist functions can be static
rcu: Remove unused function hlist_bl_del_init_rcu()
Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
...
Pull scheduler updates from Ingo Molnar:
"The biggest changes in this cycle were:
- Make kcpustat vtime aware (Frederic Weisbecker)
- Rework the CFS load_balance() logic (Vincent Guittot)
- Misc cleanups, smaller enhancements, fixes.
The load-balancing rework is the most intrusive change: it replaces
the old heuristics that have become less meaningful after the
introduction of the PELT metrics, with a grounds-up load-balancing
algorithm.
As such it's not really an iterative series, but replaces the old
load-balancing logic with the new one. We hope there are no
performance regressions left - but statistically it's highly probable
that there *is* going to be some workload that is hurting from these
chnages. If so then we'd prefer to have a look at that workload and
fix its scheduling, instead of reverting the changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
rackmeter: Use vtime aware kcpustat accessor
leds: Use all-in-one vtime aware kcpustat accessor
cpufreq: Use vtime aware kcpustat accessors for user time
procfs: Use all-in-one vtime aware kcpustat accessor
sched/vtime: Bring up complete kcpustat accessor
sched/cputime: Support other fields on kcpustat_field()
sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
sched/fair: Add comments for group_type and balancing at SD_NUMA level
sched/fair: Fix rework of find_idlest_group()
sched/uclamp: Fix overzealous type replacement
sched/Kconfig: Fix spelling mistake in user-visible help text
sched/core: Further clarify sched_class::set_next_task()
sched/fair: Use mul_u32_u32()
sched/core: Simplify sched_class::pick_next_task()
sched/core: Optimize pick_next_task()
sched/core: Make pick_next_task_idle() more consistent
sched/fair: Better document newidle_balance()
leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
...
In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we're using 40 bytes for the io_wq_work structure, and 16 of
those is the doubly link list node. We don't need doubly linked lists,
we always add to tail to keep things ordered, and any other use case
is list traversal with deletion. For the deletion case, we can easily
support any node deletion by keeping track of the previous entry.
This shrinks io_wq_work to 32 bytes, and subsequently io_kiock from
io_uring to 216 to 208 bytes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are several things that can go wrong in the current code on NUMA
systems, especially if not all nodes are online all the time:
- If the identifiers of the online nodes do not form a single contiguous
block starting at zero, wq->wqes will be too small, and OOB memory
accesses will occur e.g. in the loop in io_wq_create().
- If a node comes online between the call to num_online_nodes() and the
for_each_node() loop in io_wq_create(), an OOB write will occur.
- If a node comes online between io_wq_create() and io_wq_enqueue(), a
lookup is performed for an element that doesn't exist, and an OOB read
will probably occur.
Fix it by:
- using nr_node_ids instead of num_online_nodes() for the allocation size;
nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
highest node ID that could possibly come online at some point, even if
those nodes' identifiers are not a contiguous block
- creating workers for all possible CPUs, not just all online ones
This is basically what the normal workqueue code also does, as far as I can
tell.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These allocations are single-element allocations, so don't use the array
allocation wrapper for them.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clean io_import_fixed() call site and make it return proper type.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point left in keeping struct sqe_submit. Inline it
into struct io_kiocb, so any req->submit.field is now just req->field
- moves initialisation of ring_file into io_get_req()
- removes duplicated req->sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Timeouts' sequence offset (i.e. sqe->off) is stored in
req->submit.sequence under a false name. Keep it in timeout.data
instead. The unused space for sequence will be reclaimed in the
following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 0be0ee7181.
I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.
While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads. Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service. In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.
There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor. Which doesn't work well for
them.
The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side). It may be that it's just this that causes problems, but we
clearly weren't ready yet.
Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:
/dev/fuse (fuse_dev_operations)
/dev/mcelog (mce_chrdev_ops)
/dev/mei0 (mei_fops)
/dev/net/tun (tun_fops)
/dev/nvme0 (nvme_dev_fops)
/dev/tpm0 (tpm_fops)
/proc/self/ns/mnt (ns_file_operations)
/dev/snd/pcm* (snd_pcm_f_ops[])
and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.
Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.
We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'start' variable indicates the start of a filemap and is set to the
iocb's position, which we have already cached as 'pos', upon function
entry.
'pos' is used as a cursor indicating the current position and updated
later in iomap_dio_rw(), but not before the last use of 'start'.
Remove 'start' as it's synonym for 'pos' before we're entering the loop
calling iomapp_apply().
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
iomap_dio_bio_actor() copies iter to a local variable and then limits it
to a file extent we have mapped. When IO is submitted,
iomap_dio_bio_actor() advances the original iter while the copied iter
is advanced inside bio_iov_iter_get_pages(). This logic is non-obvious
especially because both iters still point to same shared structures
(such as pipe info) so if iov_iter_advance() changes anything in the
shared structure, this scheme breaks. Let's just truncate and reexpand
the original iter as needed instead of playing games with copying iters
and keeping them in sync.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl3bpjoACgkQUqAMR0iA
lPJJDA/+IJT4YCRp2TwV2jvIs0QzvXZrzEsxgCLibLE85mYTJgoQBD3W1bH2eyjp
T/9U0Zh5PGr/84cHd4qiMxzo+5Olz930weG59NcO4RJBSr671aRYs5tJqwaQAZDR
wlwaob5S28vUmjPxKulvxv6V3FdI79ZE9xrCOCSTQvz4iCLsGOu+Dn/qtF64pImX
M/EXzPMBrByiQ8RTM4Ege8JoBqiCZPDG9GR3KPXIXQwEeQgIoeYxwRYakxSmSzz8
W8NduFCbWavg/yHhghHikMiyOZeQzAt+V9k9WjOBTle3TGJegRhvjgI7508q3tXe
jQTMGATBOPkIgFaZz7eEn/iBa3jZUIIOzDY93RYBmd26aBvwKLOma/Vkg5oGYl0u
ZK+CMe+/xXl7brQxQ6JNsQhbSTjT+746LvLJlCvPbbPK9R0HeKNhsdKpGY3ugnmz
VAnOFIAvWUHO7qx+J+EnOo5iiPpcwXZj4AjrwVrs/x5zVhzwQ+4DSU6rbNn0O1Ak
ELrBqCQkQzh5kqK93jgMHeWQ9EOUp1Lj6PJhTeVnOx2x8tCOi6iTQFFrfdUPlZ6K
2DajgrFhti4LvwVsohZlzZuKRm5EuwReLRSOn7PU5qoSm5rcouqMkdlYG/viwyhf
mTVzEfrfemrIQOqWmzPrWEXlMj2mq8oJm4JkC+jJ/+HsfK4UU8I=
=QCEy
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Allow to print symbolic error names via new %pe modifier.
- Use pr_warn() instead of the remaining pr_warning() calls. Fix
formatting of the related lines.
- Add VSPRINTF entry to MAINTAINERS.
* tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (32 commits)
checkpatch: don't warn about new vsprintf pointer extension '%pe'
MAINTAINERS: Add VSPRINTF
tools lib api: Renaming pr_warning to pr_warn
ASoC: samsung: Use pr_warn instead of pr_warning
lib: cpu_rmap: Use pr_warn instead of pr_warning
trace: Use pr_warn instead of pr_warning
dma-debug: Use pr_warn instead of pr_warning
vgacon: Use pr_warn instead of pr_warning
fs: afs: Use pr_warn instead of pr_warning
sh/intc: Use pr_warn instead of pr_warning
scsi: Use pr_warn instead of pr_warning
platform/x86: intel_oaktrail: Use pr_warn instead of pr_warning
platform/x86: asus-laptop: Use pr_warn instead of pr_warning
platform/x86: eeepc-laptop: Use pr_warn instead of pr_warning
oprofile: Use pr_warn instead of pr_warning
of: Use pr_warn instead of pr_warning
macintosh: Use pr_warn instead of pr_warning
idsn: Use pr_warn instead of pr_warning
ide: Use pr_warn instead of pr_warning
crypto: n2: Use pr_warn instead of pr_warning
...
Pull cgroup updates from Tejun Heo:
"There are several notable changes here:
- Single thread migrating itself has been optimized so that it
doesn't need threadgroup rwsem anymore.
- Freezer optimization to avoid unnecessary frozen state changes.
- cgroup ID unification so that cgroup fs ino is the only unique ID
used for the cgroup and can be used to directly look up live
cgroups through filehandle interface on 64bit ino archs. On 32bit
archs, cgroup fs ino is still the only ID in use but it is only
unique when combined with gen.
- selftest and other changes"
* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
writeback: fix -Wformat compilation warnings
docs: cgroup: mm: Fix spelling of "list"
cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
cgroup: use cgrp->kn->id as the cgroup ID
kernfs: use 64bit inos if ino_t is 64bit
kernfs: implement custom exportfs ops and fid type
kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
netprio: use css ID instead of cgroup ID
writeback: use ino_t for inodes in tracepoints
kernfs: fix ino wrap-around detection
kselftests: cgroup: Avoid the reuse of fd after it is deallocated
cgroup: freezer: don't change task and cgroups status unnecessarily
cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
cgroup: remove cgroup_enable_task_cg_lists() optimization
cgroup: pids: use atomic64_t for pids->limit
selftests: cgroup: Run test_core under interfering stress
selftests: cgroup: Add task migration tests
...
This field contains a pointer to addrlen and checking to see if it's set
returns -EINVAL if the caller sets addr & addrlen pointers.
Fixes: 17f2fe35d0 ("io_uring: add support for IORING_OP_ACCEPT")
Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently pass in 4 arguments outside of the bounded size. In
preparation for adding one more argument, let's bundle them up in
a struct to make it more readable.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Read/write requests to devices without implemented read/write_iter
using fixed buffers can cause general protection fault, which totally
hangs a machine.
io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter()
accesses it as iovec, dereferencing random address.
kmap() page by page in this case
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.
Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We return -EBUSY on submit when we have a CQ ring overflow backlog, but
that can be a bit problematic if the application is using pure userspace
poll of the CQ ring. For that case, if the ring briefly overflowed and
we have pending entries in the backlog, the submit flushes the backlog
successfully but still returns -EBUSY. If we're able to fully flush the
CQ ring backlog, let the submission proceed.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's
side. And propagate it.
- kiocb_done() is only called from io_read() and io_write(), which are
only called from io_issue_sqe(), so it's @nxt != NULL
- io_put_req_find_next() is called either with explicitly non-null local
nxt, or from one of the functions in io_issue_sqe() switch (or their
callees).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
"if (nxt)" is always true, as it was checked in the while's condition.
io_wq_current_is_worker() is unnecessary, as non-async callers don't
pass nxt, so io_queue_async_work() will be called for them anyway.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_req_find_next() and io_req_link_next() to accept only non-null
nxt, and handle it in callers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is only one one-liner user of io_free_req_find_next(). Inline it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The number of SQEs to submit is specified by a user, so io_get_sqring()
in most of the cases succeeds. Hint compilers about that.
Checking ASM genereted by gcc 9.2.0 for x64, there is one branch
misprediction.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_submit_sqe() is issuing requests, so call it as
such. Moreover, it ends by calling io_iopoll_req_issued().
Rename it and make terminology clearer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's an issue with the shadow drain logic in that we drop the
completion lock after deciding to defer a request, then re-grab it later
and assume that the state is still the same. In the mean time, someone
else completing a request could have found and issued it. This can cause
a stall in the queue, by having a shadow request inserted that nobody is
going to drain.
Additionally, if we fail allocating the shadow request, we simply ignore
the drain.
Instead of using a shadow request, defer the next request/link instead.
This also has the following advantages:
- removes semi-duplicated code
- doesn't allocate memory for shadows
- works better if only the head marked for drain
- doesn't need complex synchronisation
On the flip side, it removes the shadow->seq ==
last_drain_in_in_link->seq optimization. That shouldn't be a common
case, and can always be added back, if needed.
Fixes: 4fe2c96315 ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.
Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently try and start the next link when we put the request, and
only if we were going to free it. This means that the optimization to
continue executing requests from the same context often fails, as we're
not putting the final reference.
Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the
next request more efficiently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.
Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.
Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT
1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)
2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.
Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If io_req_defer() failed, it needs to cancel a dependant link.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently have a race where if setup is really slow, we can be
calling io_wq_destroy() before we're done setting up. This will cause
the caller to get stuck waiting for the manager to set things up, but
the manager already exited.
Fix this by doing a sync setup of the manager. This also fixes the case
where if we failed creating workers, we'd also get stuck.
In practice this race window was really small, as we already wait for
the manager to start. Hence someone would have to call io_wq_destroy()
after the task has started, but before it started the first loop. The
reported test case forked tons of these, which is why it became an
issue.
Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
Fixes: 771b53d033 ("io-wq: small threadpool implementation for io_uring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the conversion to io-wq, we no longer use that flag. Kill it.
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.
Fixes: 2665abfd75 ("io_uring: add support for linked SQE timeouts")
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a few reasons for this:
- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union
This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have a linked request, this enables us to pass it back directly
without having to go through async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This kselftest update for Linux 5.5-rc1 adds KUnit, a lightweight unit
testing and mocking framework for the Linux kernel from Brendan Higgins.
KUnit is not an end-to-end testing framework. It is currently supported
on UML and sub-systems can write unit tests and run them in UML env.
KUnit documentation is included in this update.
In addition, this Kunit update adds 3 new kunit tests:
- kunit test for proc sysctl from Iurii Zaikin
- kunit test for the 'list' doubly linked list from David Gow
- ext4 kunit test for decoding extended timestamps from Iurii Zaikin
In the future KUnit will be linked to Kselftest framework to provide
a way to trigger KUnit tests from user-space.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl3YfBQACgkQCwJExA0N
QxzTjxAAiFhaDMhlpLhn1DpIUNvfKrIgDjJgajQAyMMs6TJK3OrD6J4WbpVD7wGo
aqF9l6o64sY18JAo3s00j6AcAmVwNH7qzEEuzIQPjJvQ8C4sCWL3esEP4JHgFb2F
snlSn5KjSsdC1D9N7uQIhgW76xPSyDrTwWQpglvmB9TwmJVBIl9zhu+unp73ufFJ
N+ieDg8A6W/wDGYSq5JBSkJbuI0gL+daNwUYzxEEZIskndhpovOc82WAldECRm6x
TfI0u39zTbrEO0DHgmYpyGbTN8TB2mXjH5HMjwg+KbHfKVTKKGvTK7XFs8mWGQpO
n2meypZuwuIsRPOPcAVs+Gt2dc0jFODJVIV1EzA0WSv6TEdPqyhM/d13tHdCqjm9
ic5wQ/hQQNEB1Dvg5ereXBaGGaoqP95y61ZpCS9vCXFXH+28E/B63Ebfs2IBIuqS
Jv2KcoxabyZq3uGdjnn+mD7IM8rkvscRP4Ba31nXRgJIYDHAzqe7APN7y3on4NGx
1q7lBlA3XZZ8qgo0zpLST20ck/qaL3tk4k8E1f8emh6CuyrCWtazgrWkMIlyEX0O
8nre3uEAF9xUzB4+gZK4YmelN9Bld3Uv7Ippt1zTCiQ0FkEABQIMUrTZygy7Wfg6
6qi4dk8frWW4Kt63gOXsMxr9FWTqDk+Ys4GPAVDVm1d0dzERn8k=
=0zEe
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest KUnit support gtom Shuah Khan:
"This adds KUnit, a lightweight unit testing and mocking framework for
the Linux kernel from Brendan Higgins.
KUnit is not an end-to-end testing framework. It is currently
supported on UML and sub-systems can write unit tests and run them in
UML env. KUnit documentation is included in this update.
In addition, this Kunit update adds 3 new kunit tests:
- proc sysctl test from Iurii Zaikin
- the 'list' doubly linked list test from David Gow
- ext4 tests for decoding extended timestamps from Iurii Zaikin
In the future KUnit will be linked to Kselftest framework to provide a
way to trigger KUnit tests from user-space"
* tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (23 commits)
lib/list-test: add a test for the 'list' doubly linked list
ext4: add kunit test for decoding extended timestamps
Documentation: kunit: Fix verification command
kunit: Fix '--build_dir' option
kunit: fix failure to build without printk
MAINTAINERS: add proc sysctl KUnit test to PROC SYSCTL section
kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
MAINTAINERS: add entry for KUnit the unit testing framework
Documentation: kunit: add documentation for KUnit
kunit: defconfig: add defconfigs for building KUnit tests
kunit: tool: add Python wrappers for running KUnit tests
kunit: test: add tests for KUnit managed resources
kunit: test: add the concept of assertions
kunit: test: add tests for kunit test abort
kunit: test: add support for test abort
objtool: add kunit_try_catch_throw to the noreturn list
kunit: test: add initial tests
lib: enable building KUnit in lib/
kunit: test: add the concept of expectations
kunit: test: add assertion printing library
...
Expose the fs-verity bit through statx().
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXdtWqhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+C9AQCCf8C2KP6DynoGQb9KRYYreJk8js8G
IgtlhazJ3j1RJAD/VijFbdwbxGCmiR1Y6BhKq5eaCYD1El68wSwkKuNO3ww=
=7WpU
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity updates from Eric Biggers:
"Expose the fs-verity bit through statx()"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
docs: fs-verity: mention statx() support
f2fs: support STATX_ATTR_VERITY
ext4: support STATX_ATTR_VERITY
statx: define STATX_ATTR_VERITY
docs: fs-verity: document first supported kernel version
- Add the IV_INO_LBLK_64 encryption policy flag which modifies the
encryption to be optimized for UFS inline encryption hardware.
- For AES-128-CBC, use the crypto API's implementation of ESSIV (which
was added in 5.4) rather than doing ESSIV manually.
- A few other cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXdtVMxQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8MVAP44iRzj8ZXu62BhqNOYYcF60s/58QfZ
Jo1VdmvO/8MNrAD+P/jW5sqzcB5BLdNzS7pLKGIzsC55uMyp/79xyKK8wQc=
=XKWV
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
- Add the IV_INO_LBLK_64 encryption policy flag which modifies the
encryption to be optimized for UFS inline encryption hardware.
- For AES-128-CBC, use the crypto API's implementation of ESSIV (which
was added in 5.4) rather than doing ESSIV manually.
- A few other cleanups.
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
f2fs: add support for IV_INO_LBLK_64 encryption policies
ext4: add support for IV_INO_LBLK_64 encryption policies
fscrypt: add support for IV_INO_LBLK_64 policies
fscrypt: avoid data race on fscrypt_mode::logged_impl_name
docs: ioctl-number: document fscrypt ioctl numbers
fscrypt: zeroize fscrypt_info before freeing
fscrypt: remove struct fscrypt_ctx
fscrypt: invoke crypto API for ESSIV handling
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3YC+sACgkQxWXV+ddt
WDtcJg/+JESwug/t+8+JWoajhwcMtLcQqqdPIWcfGq+9S7kIcKeoNtboO9Axj8d7
rxubmkk9zHEVgQtA5TWW9kdnNBjTnslHvCvD6m7g6AJGtiNdOofbc/2EaEmDP05f
6N7ZBOQ+l8jwgeZqRO2H2/Xh+Le9pjchhljcRgJ0eVVgPklSh5e4KrWkqOb4UwCm
iZCHXPB8AeLxTfGL1yy5wklh4nKZF3DMDaT7+20YyKzNtJjmbS/2WPivr9uWd25D
QHUeHk+yOJoRICAX4/8xFk+x7FLJTutfAsvwJM90a5+HR2/msyC7ahzIYguPBtaX
Aa5hlP79ug6UhtB9RIUULjZiBYQS38qhlEmAE7RA4/sYFzbm6AkSQTnOxBennivN
2+ev64xeDEKNx//pWM52OHyWuDJbGBeBMdr4I8LtYEoyBgQC5kDnecodV81q9c79
ROTBmjP9gQFxkk3OoP44haky6t3HPqN/jCfFOvo0VxbEyVluRS1BZQ5+IjZIgSFU
Zg5L9OHA93+PsZsNnas9Zd07YuMBXK3X4g//veWmY/eWursC6cn3IGhsGlxl0X1a
g07jiZCLElmKV66UTdKqn+ByV1n0TreAjSn1Wr3NtYuBiqTAKh0xrbMlbE6n1db7
z+yKBRlcGBVvQQpI0AZB29aeeleNvDGDLuz+hE8K4rFFUCUb1zs=
=NomJ
-----END PGP SIGNATURE-----
Merge tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull AFFS updates from David Sterba:
"A minor bugfix and cleanup for AFFS"
* tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
affs: fix a memory leak in affs_remount
affs: Replace binary semaphores with mutexes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3YCRAACgkQxWXV+ddt
WDuTuQ/7BOibDKqInm2SsL8xMuZqXjxGXUcHDPio5MbzNJ3wpV0j1KqWWsuK8hi0
HAhSI3fu3NG7RQYh3nuRO0CaZy3ENiqKoffrSpg7k5DJG0B7Lm/G/970fmOYUp6a
j6PMNcrKaw+1J3yuljSd20+n6j/hdmfn847ZsSY+7JmZ4zGMJ5GMv3IRipdLFUzR
tmjWmmCI05sF4/8cI6jzUVq588uSFTO1bGXFugmoO0ztpameudCnYniJI0tDBFSV
pqk6lqoOPNcaC9nATuA5KKOpUJ9nSscP/St3DV4D6LaZKkT/M5zs12lXPMJx/pKn
oCHt/A/wBdbDOoy1uHVMWQ9cz9PyVFtU7eSKizcFjoqnHO6fzlnRr9fsmIZKtTw9
H6nXVmP1S+xJg/zTBxCXHVfZR2dqADUsHWztN1LM8Pen/l9+UMwBeMhq9f9Jz68I
kF7zWlfLEtNh8naEYf34LkGVMtCNY4PHFsSztPg/jbfsH34xMvetKvPR2s8lejhp
42YqPHgEh2+8mmVcq65+jl+bPOp/5bdToRtuPiszWiKZSXt/5xplP+5lkSEet0J6
4aNZ8NRAiZ98br45jdTMUVSo6YtI27SS+GdVOUHPQtPI/kWi9XHx+l3E9MVOUtrd
lQ1Z9tPinEnJH4kntiCz2sKdzNKE01IagV4wFylz1Ct+ZqF9jNs=
=JzIp
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes:
- new block group profiles: RAID1 with 3- and 4- copies
- RAID1 in btrfs has always 2 copies, now add support for 3 and 4
- this is an incompat feature (named RAID1C34)
- recommended use of RAID1C3 is replacement of RAID6 profile on
metadata, this brings a more reliable resiliency against 2
device loss/damage
- support for new checksums
- per-filesystem, set at mkfs time
- fast hash (crc32c successor): xxhash, 64bit digest
- strong hashes (both 256bit): sha256 (slower, FIPS), blake2b
(faster)
- the blake2b module goes via the crypto tree, btrfs.ko has a
soft dependency
- speed up lseek, don't take inode locks unnecessarily, this can
speed up parallel SEEK_CUR/SEEK_SET/SEEK_END by 80%
- send:
- allow clone operations within the same file
- limit maximum number of sent clone references to avoid slow
backref walking
- error message improvements: device scan prints process name and PID
Core changes:
- cleanups
- remove unique workqueue helpers, used to provide a way to avoid
deadlocks in the workqueue code, now done in a simpler way
- remove lots of indirect function calls in compression code
- extent IO tree code moved out of extent_io.c
- cleanup backup superblock handling at mount time
- transaction life cycle documentation and cleanups
- locking code cleanups, annotations and documentation
- add more cold, const, pure function attributes
- removal of unused or redundant struct members or variables
- new tree-checker sanity tests
- try to detect missing INODE_ITEM, cross-reference checks of
DIR_ITEM, DIR_INDEX, INODE_REF, and XATTR_* items
- remove own bio scheduling code (used to avoid checksum submissions
being stuck behind other IO), replaced by cgroup controller-based
code to allow better control and avoid priority inversions in cases
where the custom and cgroup scheduling disagreed
Fixes:
- avoid getting stuck during cyclic writebacks
- fix trimming of ranges crossing block group boundaries
- fix rename exchange on subvolumes, all involved subvolumes need to
be recorded in the transaction"
* tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
btrfs: drop bdev argument from submit_extent_page
btrfs: remove extent_map::bdev
btrfs: drop bio_set_dev where not needed
btrfs: get bdev directly from fs_devices in submit_extent_page
btrfs: record all roots for rename exchange on a subvol
Btrfs: fix block group remaining RO forever after error during device replace
btrfs: scrub: Don't check free space before marking a block group RO
btrfs: change btrfs_fs_devices::rotating to bool
btrfs: change btrfs_fs_devices::seeding to bool
btrfs: rename btrfs_block_group_cache
btrfs: block-group: Reuse the item key from caller of read_one_block_group()
btrfs: block-group: Refactor btrfs_read_block_groups()
btrfs: document extent buffer locking
btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
btrfs: set blocking_writers directly, no increment or decrement
btrfs: merge blocking_writers branches in btrfs_tree_read_lock
btrfs: drop incompat bit for raid1c34 after last block group is gone
btrfs: add incompat for raid1 with 3, 4 copies
btrfs: add support for 4-copy replication (raid1c4)
btrfs: add support for 3-copy replication (raid1c3)
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YA5sQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplFxEACM7CwrWsullPX6b3j62NW6VepU5JQdzwVW
S+bmLpb8Z2I4wzEnaVuWAY5hEhGaS9NFtQLdBG0W0YOzH7sweNmL38dZfCE4+oFj
ZwytpXQQhAQUwkgANJCpNfzDymHduPsTz7RYqRr1plmhna1KC/dnhuMwg8lVOBf5
myWjqcCHxxoQn6KFqcX9/Azz29ZrgzV28lOnZdiw9yoTjraBmS/ymx4woaa3pc2v
UNw0Cgx53vHENJzEL9FNSxc0ENZq/bQhpDolnc2AlPGy9+vPg4afMitJb60KTT7r
HpDcLGkYAIKLrfk8DUmFW8lZhWsxTchXvK2+zwQV7nXMcdUgGN/G3HTIdvWEHFv8
oGbPB8cfdA2vNC9QAybwWEum/S0H/GfYsBVplNCUCdFXE7yj1cbKD5dPfCyIvmPz
BjgMae5vH/KoH+vNdZ8NL5oFz2eFC3rLxa/Ss78pcEoBdiiV3WQHPv9MBmn/OQ/v
CeUAM7omyWpbv3lcByNzIOkeeO3m6Ne28EpEMc2pzLnDPu2btvSyetdO488DE+7O
MNfApZULVX91W7jWnhM5GR+1SJTdEXZnoxnFV+J/j4deog5vUR7Dt1VkujpUILfL
7jMl3erF6C53wNrc465z8iLRp1ZM+aTpwatXXRfucNXeomExKK9zF+/+O1ACckUB
jWDCR9NTcw==
=e5Lx
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block
Pull disk revalidation updates from Jens Axboe:
"This continues the work that Jan Kara started to thoroughly cleanup
and consolidate how we handle rescans and revalidations"
* tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block:
block: move clearing bd_invalidated into check_disk_size_change
block: remove (__)blkdev_reread_part as an exported API
block: fix bdev_disk_changed for non-partitioned devices
block: move rescan_partitions to fs/block_dev.c
block: merge invalidate_partitions into rescan_partitions
block: refactor rescan_partitions
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YAiAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsJRD/wNfUGWVdIckw7iiFNuuipKBEy0Nd2VLt0B
I+pVW/YjDsG2oxWXWPs5Nxc7ca2A8EzRXcWP0xEjBfOCcBh/9mULi1flkLRoWKcq
v/OuTVif3ATvgJcwNkbMcoi0bYA/VwKi2dWC6ALhDDmZhyMTLeE362oIeOUNNnl6
GM8CGZHaRfmBzcH5t+WnxiS6rBlt5iwFJ35EvZo3GMXGGiLGlryxEXPAwZrf4haA
Z4atNinKcNXhb80LWHo23aK3bpnaumwKP4BPuLEyvnjS4iU8SeYTXy+w5yq1BE+h
HBP5s3no/mPiBAG8b6EZXqOJUGlN596AQfNLu7vCR78tmImZF0jKRFsHEAaKXf+B
1yRgZi7J+gV0qzK/Ufulg43vItk5/sTzEuV9YLfCpKTr14MFcWw908BAqaI5Kk1K
e8uGqnb2KbZOLTW4QdPvpWg3eYtqEoluSoZUQ5elHxqQZ4MSZ1lK78FF1TeaW/pw
sYH+v6rsWoVjEcFSwGoaaOMravzU4MKtavNAZrTJwKZx7qCqkwmi3R1k8WF6KsSV
rTRAzUC1wpTdSOm1MYPMMKM/h5+BJRSJ/RjljOF4fXLnvpD5q0lequCWjrrEzc6c
HPRKIgSBq7S620A19QD8UxwvZJ8bOivESqr0bux29v1Vpf7vJBrRMng8nLUrXfJs
jdma5mK1UA==
=/G9l
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block
Pull zoned block device update from Jens Axboe:
"Enhancements and improvements to the zoned device support"
* tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block:
scsi: sd_zbc: Remove set but not used variable 'buflen'
block: rework zone reporting
scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()
null_blk: Add zone_nr_conv to features
null_blk: clean up report zones
null_blk: clean up the block device operations
block: Remove partition support for zoned block devices
block: Simplify report zones execution
block: cleanup the !zoned case in blk_revalidate_disk_zones
block: Enhance blk_revalidate_disk_zones()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxrEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuH5D/9qQKfIIuQDUNO4Xx+dIHimTDCrfiEOeO9e
CRaMuSj+yMxLDMwfX8RnDmR17H3ZVoiIY1CT24U9ZkA5iDjeAH4xmzkH30US7LR7
/64YVZTxB0OrWppRK8RiIhaJJZDQ6+HPUQsn6PRaLVuFHi2unMoTQnj/ZQKz03QA
Pl8Xx7qBtH1JwYCzQ21f/uryAcNg9eWabRLN2f1uiOXLmvRxOfh6Z/iaezlaZlmL
qeJdcdLjjvOgOPwEOfNjfS6pd+XBz3gdEhn0l+11nHITxWZmVBwsWTKyUQlCmKnl
yuCWDVyx5d6zCnlrLYG0l2Fn2lr9SwAkdkq3YAKV03hA/6s6P9q9bm31VvOf828x
7gmr4YVz68y7H9bM0QAHCvDpjll0aIEUw6XFzSOCDtZ9B6/pppYQWzMU71J05eyF
8DOKv2M2EVNLUjf6u0RDyolnWGU0kIjt5ryWE3OsGcezAVa2wYstgUJTKbrn1YgT
j+4KTpaI+sg8GKDFauvxcSa6gwoRp6jweFNW+7vC090/shXmrGmVLOnQZKRuHho/
O4W8y/1/deM8CCIAETpiNxA8RV5U/EZygrFGDFc7yzTtVDGHY356M/B4Bmm2qkVu
K3WgeZp8Fc0lH0QF6Pp9ZlBkZEpGNCAPVsPkXIsxQXbctftkn3KY//uIubfpFEB1
PpHSicvkww==
=HYYq
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Due to more granular branches, this one is small and will be followed
with other core branches that add specific features. I meant to just
have a core and drivers branch, but external dependencies we ended up
adding a few more that are also core.
The changes are:
- Fixes and improvements for the zoned device support (Ajay, Damien)
- sed-opal table writing and datastore UID (Revanth)
- blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun)
- Improvements to the block stats tracking (Pavel)
- Fix for overruning sysfs buffer for large number of CPUs (Ming)
- Optimization for small IO (Ming, Christoph)
- Fix typo in RWH lifetime hint (Eugene)
- Dead code removal and documentation (Bart)
- Reduction in memory usage for queue and tag set (Bart)
- Kerneldoc header documentation (André)
- Device/partition revalidation fixes (Jan)
- Stats tracking for flush requests (Konstantin)
- Various other little fixes here and there (et al)"
* tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits)
Revert "block: split bio if the only bvec's length is > SZ_4K"
block: add iostat counters for flush requests
block,bfq: Skip tracing hooks if possible
block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64'
blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1
block: Don't disable interrupts in trigger_softirq()
sbitmap: Delete sbitmap_any_bit_clear()
blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()
block: split bio if the only bvec's length is > SZ_4K
block: still try to split bio if the bvec crosses pages
blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT
blk-cgroup: reimplement basic IO stats using cgroup rstat
blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()
blk-throtl: stop using blkg->stat_bytes and ->stat_ios
bfq-iosched: stop using blkg->stat_bytes and ->stat_ios
bfq-iosched: relocate bfqg_*rwstat*() helpers
block: add zone open, close and finish ioctl support
block: add zone open, close and finish operations
block: Simplify REQ_OP_ZONE_RESET_ALL handling
block: Remove REQ_OP_ZONE_RESET plugging
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI
43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt
cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz
hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7
GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU
l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY
wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL
E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy
s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO
TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl
kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy
ZfLNkACXqQ==
=YZ9s
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"A lot of stuff has been going on this cycle, with improving the
support for networked IO (and hence unbounded request completion
times) being one of the major themes. There's been a set of fixes done
this week, I'll send those out as well once we're certain we're fully
happy with them.
This contains:
- Unification of the "normal" submit path and the SQPOLL path (Pavel)
- Support for sparse (and bigger) file sets, and updating of those
file sets without needing to unregister/register again.
- Independently sized CQ ring, instead of just making it always 2x
the SQ ring size. This makes it more flexible for networked
applications.
- Support for overflowed CQ ring, never dropping events but providing
backpressure on submits.
- Add support for absolute timeouts, not just relative ones.
- Support for generic cancellations. This divorces io_uring from
workqueues as well, which additionally gets us one step closer to
generic async system call support.
- With cancellations, we can support grabbing the process file table
as well, just like we do mm context. This allows support for system
calls that create file descriptors, like accept4() support that's
built on top of that.
- Support for io_uring tracing (Dmitrii)
- Support for linked timeouts. These abort an operation if it isn't
completed by the time noted in the linke timeout.
- Speedup tracking of poll requests
- Various cleanups making the coder easier to follow (Jackie, Pavel,
Bob, YueHaibing, me)
- Update MAINTAINERS with new io_uring list"
* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: make POLL_ADD/POLL_REMOVE scale better
io-wq: remove now redundant struct io_wq_nulls_list
io_uring: Fix getting file for non-fd opcodes
io_uring: introduce req_need_defer()
io_uring: clean up io_uring_cancel_files()
io-wq: ensure free/busy list browsing see all items
io-wq: ensure we have a stable view of ->cur_work for cancellations
io_wq: add get/put_work handlers to io_wq_create()
io_uring: check for validity of ->rings in teardown
io_uring: fix potential deadlock in io_poll_wake()
io_uring: use correct "is IO worker" helper
io_uring: fix -ENOENT issue with linked timer with short timeout
io_uring: don't do flush cancel under inflight_lock
io_uring: flag SQPOLL busy condition to userspace
io_uring: make ASYNC_CANCEL work with poll and timeout
io_uring: provide fallback request for OOM situations
io_uring: convert accept4() -ERESTARTSYS into -EINTR
io_uring: fix error clear of ->file_table in io_sqe_files_register()
io_uring: separate the io_free_req and io_free_req_find_next interface
io_uring: keep io_put_req only responsible for release and put req
...
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).
However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.
The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file. Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.
The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it. With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.
Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()". We've done that for all the obvious and common
situations, we may need a few more. Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):
"And I appreciate if people could help at least somehow with "getting
rid of mixed case entirely" (i.e. always lock f_pos_lock on
!FMODE_STREAM), because this transition starts to diverge from my
particular use-case too far. To me it makes sense to do that
transition as follows:
- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert left users that
truly don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this
will cover pipes and sockets), or maybe convert pipes and sockets
to FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use
no_llseek or noop_llseek, but do not use nonseekable_open or
alloc_file_pseudo. This might find files that have stream semantic
but are opened differently;
- extend stream_open.cocci to analyze file_operations whose
.read/.write do not use ppos at all (independently of how file was
opened);
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
!FMODE_STREAM;
- gather bug reports for deadlocked read/write and convert missed
cases to FMODE_STREAM, probably extending stream_open.cocci along
the road to catch similar cases
i.e. always take f_pos_lock unless a file is explicitly marked as
being stream, and try to find and cover all files that are streams"
We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.
The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".
The second test was correct, although a bit subtle and worth just
re-iterating here. There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).
The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.
But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations. Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.
Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We must stop GC, once the segment becomes fully valid. Otherwise, it can
produce another dirty segments by moving valid blocks in the segment partially.
Ramon hit no free segment panic sometimes and saw this case happens when
validating reliable file pinning feature.
Signed-off-by: Ramon Pantin <pantin@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Expose in /sys/fs/f2fs/<blockdev>/main_blkaddr the block address where the
main area starts. This allows user mode programs to determine:
- That pinned files that are made exclusively of fully allocated 2MB
segments will never be unpinned by the file system.
- Where the main area starts. This is required by programs that want to
verify if a file is made exclusively of 2MB f2fs segments, the alignment
boundary for segments starts at this address. Testing for 2MB alignment
relative to the start of the device is incorrect, because for some
filesystems main_blkaddr is not at a 2MB boundary relative to the start
of the device.
The entry will be used when validating reliable pinning file feature proposed
by "f2fs: support aligned pinned file".
Signed-off-by: Ramon Pantin <pantin@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Setting softlimit larger than hardlimit seems meaningless
for disk quota but currently it is allowed. In this case,
there may be a bit of comfusion for users when they run
df comamnd to directory which has project quota.
For example, we set 20M softlimit and 10M hardlimit of
block usage limit for project quota of test_dir(project id 123).
[root@hades f2fs]# repquota -P -a
*** Report for project quotas on device /dev/nvme0n1p8
Block grace time: 7days; Inode grace time: 7days
Block limits File limits
Project used soft hard grace used soft hard grace
----------------------------------------------------------------------
0 -- 4 0 0 1 0 0
123 +- 10248 20480 10240 2 0 0
The result of df command as below:
[root@hades f2fs]# df -h /mnt/f2fs/test
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p8 20M 11M 10M 51% /mnt/f2fs
Even though it looks like there is another 10M free space to use,
if we write new data to diretory test(inherit project id),
the write will fail with errno(-EDQUOT).
After this patch, the df result looks like below.
[root@hades f2fs]# df -h /mnt/f2fs/test
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p8 10M 10M 0 100% /mnt/f2fs
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In commit 3975b097e5 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.
However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.
That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not. And if they are, we'll
always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.
Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call. And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects
"f_pos", which is obviously per-file descriptor - so it only affects
cases where you literally use the same file descriptor to both read
and write.
So a regular pipe that has separate reading and writing file
descriptors doesn't really have this situation even though it's the
obvious case of "reader empties what a bit writer concurrently fills"
But we want to make pipes as being stream-line anyway, because we
don't want the unnecessary overhead of locking, and because a named
pipe can be (ab-)used by reading and writing to the same file
descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea. But pipes and sockets are likely to be the two main
cases.
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update signing key of first channel whenever generating the master
sigining/encryption/decryption keys rather than only in cifs_mount().
This also fixes reconnect when re-establishing smb sessions to other
servers.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We used to skip reconnects on all SMB2_IOCTL commands due to SMB3+
FSCTL_VALIDATE_NEGOTIATE_INFO - which made sense since we're still
establishing a SMB session.
However, when refresh_cache_worker() calls smb2_get_dfs_refer() and
we're under reconnect, SMB2_ioctl() will not be able to get a proper
status error (e.g. -EHOSTDOWN in case we failed to reconnect) but an
-EAGAIN from cifs_send_recv() thus looping forever in
refresh_cache_worker().
Fixes: e99c63e4d8 ("SMB3: Fix deadlock in validate negotiate hits reconnect")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Suggested-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We don't care about module aliasing validation in
cifs_compose_mount_options(..., is_smb3) when finding the root SMB
session of an DFS namespace in order to refresh DFS referral cache.
The following issue has been observed when mounting with '-t smb3' and
then specifying 'vers=2.0':
...
Nov 08 15:27:08 tw kernel: address conversion returned 0 for FS0.WIN.LOCAL
Nov 08 15:27:08 tw kernel: [kworke] ==> dns_query((null),FS0.WIN.LOCAL,13,(null))
Nov 08 15:27:08 tw kernel: [kworke] call request_key(,FS0.WIN.LOCAL,)
Nov 08 15:27:08 tw kernel: [kworke] ==> dns_resolver_cmp(FS0.WIN.LOCAL,FS0.WIN.LOCAL)
Nov 08 15:27:08 tw kernel: [kworke] <== dns_resolver_cmp() = 1
Nov 08 15:27:08 tw kernel: [kworke] <== dns_query() = 13
Nov 08 15:27:08 tw kernel: fs/cifs/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: FS0.WIN.LOCAL to 192.168.30.26
===> Nov 08 15:27:08 tw kernel: CIFS VFS: vers=2.0 not permitted when mounting with smb3
Nov 08 15:27:08 tw kernel: fs/cifs/dfs_cache.c: CIFS VFS: leaving refresh_tcon (xid = 26) rc = -22
...
Fixes: 5072010ccf ("cifs: Fix DFS cache refresher for DFS links")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We currently just pass junk in this field unless we're retransmitting a
create, but in later patches, we'll need a mechanism to pass a delegated
inode number on an initial create request. Prepare for this by ensuring
this field is zeroed out.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When this occurs, it usually means that we raced with a rename, and
there is no need to warn in that case. Only printk if we pass the
rename sequence check but still ended up with pos < 0.
Either way, this doesn't warrant a KERN_ERR message. Change it to
KERN_WARNING.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For example, if we have 5 mds in the mdsmap and the states are:
m_info[5] --> [-1, 1, -1, 1, 1]
If we get a random number 1, then we should get the mds index 3 as
expected, but actually we will get index 2, which the state is -1.
The issue is that the for loop increment will advance past any "up"
MDS that was found during the while loop search.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>