Commit Graph

62062 Commits

Author SHA1 Message Date
Theodore Ts'o
9803387c55 ext4: validate the debug_want_extra_isize mount option at parse time
Instead of setting s_want_extra_size and then making sure that it is a
valid value afterwards, validate the field before we set it.  This
avoids races and other problems when remounting the file system.

Link: https://lore.kernel.org/r/20191215063020.GA11512@mit.edu
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-and-tested-by: syzbot+4a39a025912b265cacef@syzkaller.appspotmail.com
2019-12-15 18:05:20 -05:00
Brian Gianforcaro
d195a66e36 io_uring: fix stale comment and a few typos
- Fix a few typos found while reading the code.

- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
  was renamed to 'req', but the comment still holds.

Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-15 14:49:30 -07:00
Linus Torvalds
2e6d304515 Merge branch 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull ksys_mount() and ksys_dup() removal from Dominik Brodowski:
 "This small series replaces all in-kernel calls to the
  userspace-focused ksys_mount() and ksys_dup() with calls to
  kernel-centric functions:

  For each replacement of ksys_mount() with do_mount(), one needs to
  verify that the first and third parameter (char *dev_name, char *type)
  are strings allocated in kernelspace and that the fifth parameter
  (void *data) is either NULL or refers to a full page (only occurence
  in init/do_mounts.c::do_mount_root()). The second and fourth
  parameters (char *dir_name, unsigned long flags) are passed by
  ksys_mount() to do_mount() unchanged, and therefore do not require
  particular care.

  Moreover, instead of pretending to be userspace, the opening of
  /dev/console as stdin/stdout/stderr can be implemented using in-kernel
  functions as well. Thereby, ksys_dup() can be removed for good"

[ This doesn't get rid of the special "kernel init runs with KERNEL_DS"
  case, but it at least removes _some_ of the users of "treat kernel
  pointers as user pointers for our magical init sequence".

  One day we'll hopefully be rid of it all, and can initialize our
  init_thread addr_limit to USER_DS.    - Linus ]

* 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
  fs: remove ksys_dup()
  init: unify opening /dev/console as stdin/stdout/stderr
  init: use do_mount() instead of ksys_mount()
  initrd: use do_mount() instead of ksys_mount()
  devtmpfs: use do_mount() instead of ksys_mount()
2019-12-15 11:36:12 -08:00
yangerkun
a70fd5ac2e ext4: reserve revoke credits in __ext4_new_inode
It's possible that __ext4_new_inode will release the xattr block, so
it will trigger a warning since there is revoke credits will be 0 if
the handle == NULL. The below scripts can reproduce it easily.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374
...
__ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248
ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743
ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254
ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112
ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384
__ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214
ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293
__ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151
ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774
vfs_mkdir+0x386/0x5f0 fs/namei.c:3811
do_mkdirat+0x11c/0x210 fs/namei.c:3834
do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294
...
-------------------------------------

scripts:
mkfs.ext4 /dev/vdb
mount /dev/vdb /mnt
cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done
mkdir dir/dir1 && mv dir/dir1 ./
sh repro.sh && add some user

[root@localhost ~]# cat repro.sh
while [ 1 -eq 1 ]; do
    rm -rf dir
    rm -rf dir1/dir1
    mkdir dir
    for i in {1..8}; do  setfacl -dm "u:test"$i":rx" dir; done
    setfacl -m "u:user_9:rx" dir &
    mkdir dir1/dir1 &
done

Before exec repro.sh, dir1 has inherit the default acl from dir, and
xattr block of dir1 dir is not the same, so the h_refcount of these
two dir's xattr block will be 1. Then repro.sh can trigger the warning
with the situation show as below. The last h_refcount can be clear
with mkdir, and __ext4_new_inode has not reserved revoke credits, so
the warning will happened, fix it by reserve revoke credits in
__ext4_new_inode.

Thread 1                        Thread 2
mkdir dir
set default acl(will create
a xattr block blk1 and the
refcount of ext4_xattr_header
will be 1)
				...
                                mkdir dir1/dir1
				->....->ext4_init_acl
				->__ext4_set_acl(set default acl,
			          will reuse blk1, and h_refcount
				  will be 2)

setfacl->ext4_set_acl->...
->ext4_xattr_block_set(will create
new block blk2 to store xattr)

				->__ext4_set_acl(set access acl, since
				  h_refcount of blk1 is 2, will create
				  blk3 to store xattr)

  ->ext4_xattr_release_block(dec
  h_refcount of blk1 to 1)
				  ->ext4_xattr_release_block(dec
				    h_refcount and since it is 0,
				    will release the block and trigger
				    the warning)

Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:47:13 -05:00
Dan Carpenter
7f420d64a0 ext4: unlock on error in ext4_expand_extra_isize()
We need to unlock the xattr before returning on this error path.

Cc: stable@kernel.org # 4.13
Fixes: c03b45b853 ("ext4, project: expand inode extra size if possible")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:31:23 -05:00
Theodore Ts'o
707d1a2f60 ext4: optimize __ext4_check_dir_entry()
Make __ext4_check_dir_entry() a bit easier to understand, and reduce
the object size of the function by over 11%.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20191209004346.38526-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:23:14 -05:00
Jan Kara
109ba779d6 ext4: check for directory entries too close to block end
ext4_check_dir_entry() currently does not catch a case when a directory
entry ends so close to the block end that the header of the next
directory entry would not fit in the remaining space. This can lead to
directory iteration code trying to access address beyond end of current
buffer head leading to oops.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:22:45 -05:00
Jan Kara
64d4ce8923 ext4: fix ext4_empty_dir() for directories with holes
Function ext4_empty_dir() doesn't correctly handle directories with
holes and crashes on bh->b_data dereference when bh is NULL. Reorganize
the loop to use 'offset' variable all the times instead of comparing
pointers to current direntry with bh->b_data pointer. Also add more
strict checking of '.' and '..' directory entries to avoid entering loop
in possibly invalid state on corrupted filesystems.

References: CVE-2019-19037
CC: stable@vger.kernel.org
Fixes: 4e19d6b65f ("ext4: allow directory holes")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-14 17:22:45 -05:00
Linus Torvalds
103a022d6b 3 small smb3 fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl30eFcACgkQiiy9cAdy
 T1FrCAv+Lwq/7pwyzoS0F1DWXJfs+3a2KCR2QSnZE9Mok8AobvRnNS9dZcXZdp0U
 RnSFfkv9ia5vXegO0899nM6n0jtS5/OCex9RITVBtvIk8HLKrpcmYP+gS3III5Qq
 oMF11JZCWDI2HHJA6xEV677rgFjOiEDjtjaZrXOS2TClnBCU3ZDRghul46DKACbA
 xQDr0ifgOcDdxKBSTERhGvKA6xOf+gPP73SKB5Kg6OaPWL9FjhGvN/1ic2LVmFHF
 cc7rbXhl1jTnKfw22qrS5XsElBSQdbM7X23CJ9ik8zXpF8gLBjGefx894xyFLELb
 efJcHC1vroBc11HfLaLmAoQRp7leDIP4Icyq2TqJC6+mWsxJ0G96ofn7tGR9z/FK
 SlcggWkrC8frJQQoYMYylQcknXK9+WahyyNswMXFiUYU7XSojyx4MoBtvRavOI5l
 JKJtoiZd3OPtXOjvnKzNqSFcHC3YNQDE9GmIOXHgRuhFcnO0UQOKvbwXik2HCXBp
 7A1dxRob
 =65tY
 -----END PGP SIGNATURE-----

Merge tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cfis fixes from Steve French:
 "Three small smb3 fixes: this addresses two recent issues reported in
  additional testing during rc1, a refcount underflow and a problem with
  an intermittent crash in SMB2_open_init"

* tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Close cached root handle only if it has a lease
  SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding path
  smb3: fix refcount underflow warning on unmount when no directory leases
2019-12-14 11:44:14 -08:00
Linus Torvalds
81c64b0bd0 overlayfs fixes for 5.5-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXfNhGQAKCRDh3BK/laaZ
 PGSEAP9Nyv3XCN2wdqMLdrgn07B3Pk9w2Unf3Y5amKOxNXqyQwEAy2/E6DCiGjSa
 WRheJoTgDSeqUQNY6GFHsCIgLWOCHgs=
 =WH5O
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "Fix some bugs and documentation"

* tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  docs: filesystems: overlayfs: Fix restview warnings
  docs: filesystems: overlayfs: Rename overlayfs.txt to .rst
  ovl: relax WARN_ON() on rename to self
  ovl: fix corner case of non-unique st_dev;st_ino
  ovl: don't use a temp buf for encoding real fh
  ovl: make sure that real fid is 32bit aligned in memory
  ovl: fix lookup failure on multi lower squashfs
2019-12-14 11:13:54 -08:00
Linus Torvalds
5bd831a469 io_uring-5.5-20191212
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y5kIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv0MEADY9LOC2JkG2aNP53gCXGu74rFQNstJfusr
 xPpkMs5I7hxoeVp/bkVAvR6BbpfPDsyGMhSMURELgMFzkuw+03z2IbxVuexdGOEu
 X1+6EkunAq/r331fXL+cXUKJM3ThlkUBbNTuV8z5oWWSM1EfoBAWYfh2u4L4oBcc
 yIHRH8by9qipx+fhBWPU/ZWeCcMVt5Ju2b6PHAE/GWwteZh00PsTwq0oqQcVcm/5
 4Ennr0OELb7LaZWblAIIZe96KeJuCPzGfbRV/y12Ne/t6STH/sWA1tMUzgT22Eju
 27uXtwAJtoLsep4V66a4VYObs/hXmEP281wIsXEnIkX0YCvbAiM+J6qVHGjYSMGf
 mRrzfF6nDC1vaMduE4waoO3VFiDFQU/qfQRa21ZP50dfQOAXTpEz8kJNG/PjN1VB
 gmz9xsujDU1QPD7IRiTreiPPHE5AocUzlUYuYOIMt7VSbFZIMHW5S/pML93avkJt
 nx6g3gOP0JKkjaCEvIKY4VLljDT8eDZ/WScDsSedSbPZikMzEo8DSx4U6nbGPqzy
 qSljgQniDcrH8GdRaJFDXgLkaB8pu83NH7zUH+xioUZjAHq/XEzKQFYqysf1DbtU
 SEuSnUnLXBvbwfb3Z2VpQ/oz8G3a0nD9M7oudt1sN19oTCJKxYSbOUgNUrj9JsyQ
 QlAVrPKRkQ==
 =fbsf
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that
   don't sever if the result is < 0.

   This is mostly for linked timeouts, where if we ask for a pure
   timeout we always get -ETIME. This makes links useless for that case,
   hence allow a case where it works.

 - Five minor optimizations to fix and improve cases that regressed
   since v5.4.

 - An SQTHREAD locking fix.

 - A sendmsg/recvmsg iov assignment fix.

 - Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and
   subsequently ensuring that works for io_uring.

 - Fix a case where for an invalid opcode we might return -EBADF instead
   of -EINVAL, if the ->fd of that sqe was set to an invalid fd value.

* tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block:
  io_uring: ensure we return -EINVAL on unknown opcode
  io_uring: add sockets to list of files that support non-blocking issue
  net: make socket read/write_iter() honor IOCB_NOWAIT
  io_uring: only hash regular files for async work execution
  io_uring: run next sqe inline if possible
  io_uring: don't dynamically allocate poll data
  io_uring: deferred send/recvmsg should assign iov
  io_uring: sqthread should grab ctx->uring_lock for submissions
  io-wq: briefly spin for new work after finishing work
  io-wq: remove worker->wait waitqueue
  io_uring: allow unbreakable links
2019-12-13 14:24:54 -08:00
Linus Torvalds
22ff311af9 treewide conversion from FIELD_SIZEOF() to sizeof_field()
-----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j
 j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O
 X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz
 urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw
 RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP
 8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat
 1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN
 UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz
 ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd
 5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH
 aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt
 y6FM3lApOjk3OyaTJQ==
 =YU4A
 -----END PGP SIGNATURE-----

Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull FIELD_SIZEOF conversion from Kees Cook:
 "A mostly mechanical treewide conversion from FIELD_SIZEOF() to
  sizeof_field(). This avoids the redundancy of having 2 macros
  (actually 3) doing the same thing, and consolidates on sizeof_field().
  While "field" is not an accurate name, it is the common name used in
  the kernel, and doesn't result in any unintended innuendo.

  As there are still users of FIELD_SIZEOF() in -next, I will clean up
  those during this coming development cycle and send the final old
  macro removal patch at that time"

* tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use sizeof_field() macro
  MIPS: OCTEON: Replace SIZEOF_FIELD() macro
2019-12-13 14:02:12 -08:00
Anand Jain
fbd542971a btrfs: send: remove WARN_ON for readonly mount
We log warning if root::orphan_cleanup_state is not set to
ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
mounted as readonly we skip the orphan item cleanup during the lookup
and root::orphan_cleanup_state remains at the init state 0 instead of
ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
warning as below.

  WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);

WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
Call Trace:
::
_btrfs_ioctl_send+0x7b/0x110 [btrfs]
btrfs_ioctl+0x150a/0x2b00 [btrfs]
::
do_vfs_ioctl+0xa9/0x620
? __fget+0xac/0xe0
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x49/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reproducer:
  mkfs.btrfs -fq /dev/sdb
  mount /dev/sdb /btrfs
  btrfs subvolume create /btrfs/sv1
  btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
  umount /btrfs
  mount -o ro /dev/sdb /btrfs
  btrfs send /btrfs/ss1 -f /tmp/f

The warning exists because having orphan inodes could confuse send and
cause it to fail or produce incorrect streams.  The two cases that would
cause such send failures, which are already fixed are:

1) Inodes that were unlinked - these are orphanized and remain with a
   link count of 0. These caused send operations to fail because it
   expected to always find at least one path for an inode. However this
   is no longer a problem since send is now able to deal with such
   inodes since commit 46b2f4590a ("Btrfs: fix send failure when root
   has deleted files still open") and treats them as having been
   completely removed (the state after an orphan cleanup is performed).

2) Inodes that were in the process of being truncated. These resulted in
   send not knowing about the truncation and potentially issue write
   operations full of zeroes for the range from the new file size to the
   old file size. This is no longer a problem because we no longer
   create orphan items for truncation since commit f7e9e8fc79 ("Btrfs:
   stop creating orphan items for truncate").

As such before these commits, the WARN_ON here provided a clue in case
something went wrong. Instead of being a warning against the
root::orphan_cleanup_state value, it could have been more accurate by
checking if there were actually any orphan items, and then issue a
warning only if any exists, but that would be more expensive to check.
Since orphanized inodes no longer cause problems for send, just remove
the warning.

Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
CC: stable@vger.kernel.org # 4.19+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:46 +01:00
Josef Bacik
ca1aa2818a btrfs: do not leak reloc root if we fail to read the fs root
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots.  But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root.  Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:45 +01:00
Josef Bacik
9bc574de59 btrfs: skip log replay on orphaned roots
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root.  We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached.  Fix this by simply skipping
replaying the log.  We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:45 +01:00
Josef Bacik
714cd3e8cb btrfs: handle ENOENT in btrfs_uuid_tree_iterate
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item().  However we've
done a btrfs_release_path() at this point and no longer have a valid
path.  So increment the key and go back and do a normal search.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:45 +01:00
Josef Bacik
c7e54b5102 btrfs: abort transaction after failed inode updates in create_subvol
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:44 +01:00
Filipe Manana
147271e35b Btrfs: fix hole extent items with a zero size after range cloning
Normally when cloning a file range if we find an implicit hole at the end
of the range we assume it is because the NO_HOLES feature is enabled.
However that is not always the case. One well known case [1] is when we
have a power failure after mixing buffered and direct IO writes against
the same file.

In such cases we need to punch a hole in the destination file, and if
the NO_HOLES feature is not enabled, we need to insert explicit file
extent items to represent the hole. After commit 690a5dbfc5
("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
extents"), we started to insert file extent items representing the hole
with an item size of 0, which is invalid and should be 53 bytes (the size
of a btrfs_file_extent_item structure), resulting in all sorts of
corruptions and invalid memory accesses. This is detected by the tree
checker when we attempt to write a leaf to disk.

The problem can be sporadically triggered by test case generic/561 from
fstests. That test case does not exercise power failure and creates a new
filesystem when it starts, so it does not use a filesystem created by any
previous test that tests power failure. However the test does both
buffered and direct IO writes (through fsstress) and it's precisely that
which is creating the implicit holes in files. That happens even before
the commit mentioned earlier. I need to investigate why we get those
implicit holes to check if there is a real problem or not. For now this
change fixes the regression of introducing file extent items with an item
size of 0 bytes.

Fix the issue by calling btrfs_punch_hole_range() without passing a
btrfs_clone_extent_info structure, which ensures file extent items are
inserted to represent the hole with a correct item size. We were passing
a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
which was causing the insertion of file extent items with an item size
of 0.

[1] https://www.spinics.net/lists/linux-btrfs/msg75350.html

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:28 +01:00
Filipe Manana
6609fee889 Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:

1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
   tree mod log;

2) The task which got assigned the sequence number 1 calls
   btrfs_put_tree_mod_seq();

3) Sequence number 1 is deleted from the list of sequence numbers;

4) The current minimum sequence number is computed to be the sequence
   number 2;

5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
   a pointer to one of its elements from the red black tree through
   a call to tree_mod_log_search();

6) The task with sequence number 1 iterates the red black tree of tree
   modification elements and deletes (and frees) all elements with a
   sequence number less then or equals to 2 (the computed minimum sequence
   number) - it ends up only leaving elements with sequence numbers of 3
   and 4;

7) The task with sequence number 2 now uses the pointer to its element,
   already freed by the other task, at __tree_mod_log_rewind(), resulting
   in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
   a trace like the following:

  [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G        W         5.4.0-rc8-btrfs-next-51 #1
  [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [16804.548666] RIP: 0010:rb_next+0x16/0x50
  (...)
  [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
  [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
  [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
  [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
  [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
  [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
  [16804.554399] FS:  00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
  [16804.555039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
  [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [16804.557583] Call Trace:
  [16804.558207]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [16804.558835]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [16804.559468]  resolve_indirect_refs+0x1eb/0xc70 [btrfs]
  [16804.560087]  ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
  [16804.560700]  find_parent_nodes+0x388/0x1120 [btrfs]
  [16804.561310]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [16804.561916]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.562518]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.563112]  ? __might_fault+0x11/0x90
  [16804.563706]  do_vfs_ioctl+0x45a/0x700
  [16804.564299]  ksys_ioctl+0x70/0x80
  [16804.564885]  ? trace_hardirqs_off_thunk+0x1a/0x20
  [16804.565461]  __x64_sys_ioctl+0x16/0x20
  [16804.566020]  do_syscall_64+0x5c/0x250
  [16804.566580]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [16804.567153] RIP: 0033:0x7f4b1ba2add7
  (...)
  [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
  [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
  [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
  [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
  [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
  (...)
  [16804.575623] ---[ end trace 87317359aad4ba50 ]---

Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.

Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:25 +01:00
Filipe Manana
ad1d8c4399 Btrfs: make tree checker detect checksum items with overlapping ranges
Having checksum items, either on the checksums tree or in a log tree, that
represent ranges that overlap each other is a sign of a corruption. Such
case confuses the checksum lookup code and can result in not being able to
find checksums or find stale checksums.

So add a check for such case.

This is motivated by a recent fix for a case where a log tree had checksum
items covering ranges that overlap each other due to extent cloning, and
resulted in missing checksums after replaying the log tree. It also helps
detect past issues such as stale and outdated checksums due to overlapping,
commit 27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and
outdated checksums").

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:25 +01:00
Filipe Manana
40e046acbd Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.

Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:

   [ bytenr 13893632, offset 64Kb, len 64Kb  ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 13893632, offset 0, len 256Kb    ]
   256Kb                                     512Kb

When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.

Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:

   (...)
   item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
           range start 13631488 end 14155776 length 524288
   item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
           range start 13959168 end 14024704 length 65536

The first one covers the range from the second one, they overlap.

So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.

However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.

After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:

   [ bytenr 13893632, offset 64K, len 64Kb   ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 14155776, offset 0, len 64Kb     ]
   256Kb                                     320Kb

   [ bytenr 13893632, offset 64Kb, len 192Kb ]
   320Kb                                     512Kb

After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:

1) When replaying the file extent item for file offset 320Kb, we
   lookup for the checksums for the extent range from 13959168
   (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
   to btrfs_lookup_csums_range();

2) btrfs_lookup_csums_range() finds the checksum item that starts
   precisely at offset 13959168 (item 7 in the log tree, shown before);

3) However that checksum item only covers 64Kb of data, and not 192Kb
   of data;

4) As a result only the checksums for the first 64Kb of data referenced
   by the file extent item are found and copied to the fs/subvolume tree.
   The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
   the corresponding data checksums found and copied to the fs/subvolume
   tree.

5) After replaying the log userspace will not be able to read the file
   range from 384Kb to 512Kb, because the checksums are missing and
   resulting in an -EIO error.

The following steps reproduce this scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar
  $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar

  $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  <power failure>

  $ mount /dev/sdc /mnt/sdc
  $ md5sum /mnt/sdc/foobar
  md5sum: /mnt/sdc/foobar: Input/output error

  $ dmesg | tail
  [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
  [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
  [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
  [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
  [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
  [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
  [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
  [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
  [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
  [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1

Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.

This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:24 +01:00
Dan Carpenter
b6293c821e btrfs: return error pointer from alloc_test_extent_buffer
Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.

Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:24 +01:00
David Sterba
cf93e15eca btrfs: fix devs_max constraints for raid1c3 and raid1c4
The value 0 for devs_max means to spread the allocated chunks over all
available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
4 copies respectively.

Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:23 +01:00
Andreas Färber
994bf9cd78 btrfs: tree-checker: Fix error format string for size_t
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(),
which returns type size_t, so we need %zu instead of %lu.

This fixes a build warning on 32-bit ARM:

  ../fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
  ../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
    230 |     "invalid item size, have %u expect [%lu, %u)",
        |                                         ~~^
        |                                           long unsigned int
        |                                         %u

Fixes: 153a6d2999 ("btrfs: tree-checker: Check item size before reading file extent type")
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andreas Färber <afaerber@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:23 +01:00
Josef Bacik
943eb3bf25 btrfs: don't double lock the subvol_sem for rename exchange
If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad.  Just lock once if either of the ino's are subvols.

Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:23 +01:00
Josef Bacik
db8fe64f9c btrfs: handle error in btrfs_cache_block_group
We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group.  If we fail to allocate our ctl we'll just
panic, which is not good.  Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem.  Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.

Tested with inject-error.py from bcc.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:22 +01:00
Josef Bacik
f72ff01df9 btrfs: do not call synchronize_srcu() in inode_tree_del
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.

Process A
unlink
 -> final iput
   -> inode_tree_del
     -> synchronize_srcu(subvol_srcu)

Process B
btrfs_lookup  <- srcu_read_lock() acquired here
  -> btrfs_iget
    -> find inode that has I_FREEING set
      -> __wait_on_freeing_inode()

We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing.  However because the free'ing process is doing a
synchronize_srcu() we deadlock.

Fix this by dropping the synchronize_srcu() in inode_tree_del().  We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.

A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.

Fixes: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy ioctl")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:08 +01:00
Filipe Manana
fcb970581d Btrfs: fix cloning range with a hole when using the NO_HOLES feature
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:

  Inode A:

    extent A1                                          extent A2
  [ ----------- ]  [ hole, implicit, 4MB length ]  [ ------------- ]
  0            1MB                                 5MB            6MB

  Range to clone: [1MB, 6MB)

  Inode B:

    extent B1       extent B2        extent B3         extent B4
  [ ---------- ]  [ --------- ]    [ ---------- ]    [ ---------- ]
  0           1MB 1MB        2MB   2MB        5MB    5MB         6MB

  Target range: [1MB, 6MB) (same as source, to make it easier to explain)

The following can happen:

1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();

2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
   set 'drop_end' to 2MB, meaning it was able to drop only extent B2;

3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
   1MB;

4) We then attempt to insert a file extent item at inode B with a file
   offset of 5MB, which is the value of clone_info->file_offset. This
   fails with error -EEXIST because there's already an extent at that
   offset (extent B4);

5) We abort the current transaction with -EEXIST and return that error
   to user space as well.

Another example, for extent corruption:

  Inode A:

    extent A1                                           extent A2
  [ ----------- ]   [ hole, implicit, 10MB length ]  [ ------------- ]
  0            1MB                                  11MB            12MB

  Inode B:

    extent B1         extent B2
  [ ----------- ]   [ --------- ]    [ ----------------------------- ]
  0            1MB 1MB         5MB  5MB                             12MB

  Target range: [1MB, 12MB) (same as source, to make it easier to explain)

1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();

2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
   set 'drop_end' to 5MB, meaning it was able to drop only extent B2;

3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
   4MB;

4) We then insert a file extent item at inode B with a file offset of 11MB
   which is the value of clone_info->file_offset, and a length of 4MB (the
   value of 'clone_len'). So we get 2 extents items with ranges that
   overlap and an extent length of 4MB, larger then the extent A2 from
   inode A (1MB length);

5) After that we end the transaction, balance the btree dirty pages and
   then start another or join the previous transaction. It might happen
   that the transaction which inserted the incorrect extent was committed
   by another task so we end up with extent corruption if a power failure
   happens.

So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.

Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 13:29:22 +01:00
Nikolay Borisov
37d02592f1 btrfs: Fix error messages in qgroup_rescan_init
The branch of qgroup_rescan_init which is executed from the mount
path prints wrong errors messages. The textual print out in case
BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not
set are transposed. Fix it by exchanging their place.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 13:29:12 +01:00
Pavel Shilovsky
d919131935 CIFS: Close cached root handle only if it has a lease
SMB2_tdis() checks if a root handle is valid in order to decide
whether it needs to close the handle or not. However if another
thread has reference for the handle, it may end up with putting
the reference twice. The extra reference that we want to put
during the tree disconnect is the reference that has a directory
lease. So, track the fact that we have a directory lease and
close the handle only in that case.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-13 00:49:57 -06:00
Steve French
e0fc5b1153 SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding path
Ran into an intermittent crash in
	SMB2_open_init+0x2f6/0x970
due to oparms.cifs_sb not being initialized when called from:
	smb2_compound_op+0x45d/0x1690
Zero the whole oparms struct in the compounding path before setting up the
oparms so we don't risk any uninitialized fields.

Fixes: fdef665ba4 ("smb3: fix mode passed in on create for modetosid mount option")

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-13 00:49:38 -06:00
Linus Torvalds
37d4e84f76 A fix to avoid a corner case when scheduling cap reclaim in batches
from Xiubo, a patch to add some observability into cap waiters from
 Jeff and a couple of cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3yiJMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi5k9CACmM3fJGrTUuOLgXAxxllCfiV6UQLoY
 nuTo/bx0DmG603n+Ze8+Z0iz7hDc1Gw2XUeLkJcAE/xSetgZXO/MvJ0Ionq5Ac/k
 CrqS6ucIa1bPxbE1QMTHswHjkajKwBpAZ5+khdLNLuXJxy3c9HDCGOT4VZav7Yc9
 99W4kIdzOKdYLpZHAedMK97IJIrD5WhYTAFW4rNPY0GL6OPD1V0uiS9v7xUWIxnZ
 Uusnu+zY8miQlLVx/V9DyLh/6G5X7XyQO1nkSQcVXZOOG7+qnkq6jDhQW8adgOSZ
 wUFigTxxhSTIcntWg01TaCRNoi1N3/P8Z9/rD27zBHPbl93ANH+lUkCh
 =NicF
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix to avoid a corner case when scheduling cap reclaim in batches
  from Xiubo, a patch to add some observability into cap waiters from
  Jeff and a couple of cleanups"

* tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client:
  ceph: add more debug info when decoding mdsmap
  ceph: switch to global cap helper
  ceph: trigger the reclaim work once there has enough pending caps
  ceph: show tasks waiting on caps in debugfs caps file
  ceph: convert int fields in ceph_mount_options to unsigned int
2019-12-12 10:56:37 -08:00
Dominik Brodowski
8243186f0c fs: remove ksys_dup()
ksys_dup() is used only at one place in the kernel, namely to duplicate
fd 0 of /dev/console to stdout and stderr. The same functionality can be
achieved by using functions already available within the kernel namespace.

Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2019-12-12 19:00:36 +01:00
Dominik Brodowski
cccaa5e335 init: use do_mount() instead of ksys_mount()
In prepare_namespace(), do_mount() can be used instead of ksys_mount()
as the first and third argument are const strings in the kernel, the
second and fourth argument are passed through anyway, and the fifth
argument is NULL.

In do_mount_root(), ksys_mount() is called with the first and third
argument being already kernelspace strings, which do not need to be
copied over from userspace to kernelspace (again). The second and
fourth arguments are passed through to do_mount() anyway. The fifth
argument, while already residing in kernelspace, needs to be put into
a page of its own. Then, do_mount() can be used instead of
ksys_mount().

Once this is done, there are no in-kernel users to ksys_mount() left,
which can therefore be removed.

Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2019-12-12 14:50:05 +01:00
Linus Torvalds
ae4b064e2a AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3xW+cACgkQ+7dXa6fL
 C2t+UQ/9ETwAYJ2XgfwatNhAHX0+Tf4XMlgEDybf5a+ERRLbkSQEqRjTImopk5m4
 T7yKea0ctsVb1avdUVdayh2KuJhjmO627u3w48kd/uf2ut4IyOyaraatYKYOJ+XL
 upr8WxAMXUHgnwsb38avjH0dwq7+eDyX8FNLHiDY4Sk3mTAVHbafL6V0ujp0ms2o
 VmHLX6ihwECehUqrNEyy1pLX2nuFmd+MeLnoi/EWLa47/X0te21G8u8UdPWHGgHn
 cZp8kqWSEoaeChO7x6XOoJZCY6N/7o+hogsrU/N5YrB6FXPpFWGdwFpej3YcyFso
 QqffyXX5jVAyUg9I/v3WCrtDmTQ9xVG4kxhmBMVId6bBk3ZVpGaHuLE3ENLbr1w/
 Z1k26BI41nJwrqV0C2bPoMalpDprP2WIuWsBIpYTqaICpc53KJ72c6mLhcDEt+oc
 YCSd9X0Bv5O7tZS9rLI8mt0JtCr9Uw+VfnK+uOuUTPv2YcqQwK1/xZ0/9hd2P8XS
 L5GKcEeuk3indgmefjwsz5YJcmzeCttaOQTlmaN/Hz3MbkK2CC+spMG4OZWzUXL4
 axZmCIo/kKrv64bLyVZul6BjkhawyvfWzCs2NM6Ni63akW3Yb0S2WRdgJhvGILml
 48YgqcG7R1y8LRPYiJzcuAPnbumuu6ZSxVuyZN0FU0qc4lt5mPM=
 =ldOp
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
 "Fixes for AFS plus one patch to make debugging easier:

   - Fix how addresses are matched to server records. This is currently
     incorrect which means cache invalidation callbacks from the server
     don't necessarily get delivered correctly. This causes stale data
     and metadata to be seen under some circumstances.

   - Make the dynamic root superblock R/W so that rpm/dnf can reapply
     the SELinux label to it when upgrading the Fedora filesystem-afs
     package. If the filesystem is R/O, this fails and the upgrade
     fails.

     It might be better in future to allow setxattr from an LSM to
     bypass the R/O protections, if only for pseudo-filesystems.

   - Fix the parsing of mountpoint strings. The mountpoint object has to
     have a terminal dot, whereas the source/device string passed to
     mount should not. This confuses type-forcing suffix detection
     leading to the wrong volume variant being mounted.

   - Make lookups in the dynamic root superblock for creation events
     (such as mkdir) fail with EOPNOTSUPP rather than something like
     EEXIST. The dynamic root only allows implicit creation by the
     ->lookup() method - and only if the target cell exists.

   - Fix the looking up of an AFS superblock to include the cell in the
     matching key - otherwise all volumes with the same ID number are
     treated as the same thing, irrespective of which cell they're in.

   - Show the volume name of each volume in the volume records displayed
     in /proc/net/afs/<cell>/volumes. This proved useful in debugging as
     it provides a way to map the volume IDs to names, where the names
     are what appear in /proc/mounts"

* tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Show volume name in /proc/net/afs/<cell>/volumes
  afs: Fix missing cell comparison in afs_test_super()
  afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP
  afs: Fix mountpoint parsing
  afs: Fix SELinux setting security label on /afs
  afs: Fix afs_find_server lookups for ipv4 peers
2019-12-11 18:10:40 -08:00
Jens Axboe
9e3aa61ae3 io_uring: ensure we return -EINVAL on unknown opcode
If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.

Change io_op_needs_file() to have the following return values:

0   - does not need a file
1   - does need a file
< 0 - error value

and use this to pass back the right value for this invalid case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-11 16:02:32 -07:00
Brian Foster
d0c2204135 xfs: stabilize insert range start boundary to avoid COW writeback race
generic/522 (fsx) occasionally fails with a file corruption due to
an insert range operation. The primary characteristic of the
corruption is a misplaced insert range operation that differs from
the requested target offset. The reason for this behavior is a race
between the extent shift sequence of an insert range and a COW
writeback completion that causes a front merge with the first extent
in the shift.

The shift preparation function flushes and unmaps from the target
offset of the operation to the end of the file to ensure no
modifications can be made and page cache is invalidated before file
data is shifted. An insert range operation then splits the extent at
the target offset, if necessary, and begins to shift the start
offset of each extent starting from the end of the file to the start
offset. The shift sequence operates at extent level and so depends
on the preparation sequence to guarantee no changes can be made to
the target range during the shift. If the block immediately prior to
the target offset was dirty and shared, however, it can undergo
writeback and move from the COW fork to the data fork at any point
during the shift. If the block is contiguous with the block at the
start offset of the insert range, it can front merge and alter the
start offset of the extent. Once the shift sequence reaches the
target offset, it shifts based on the latest start offset and
silently changes the target offset of the operation and corrupts the
file.

To address this problem, update the shift preparation code to
stabilize the start boundary along with the full range of the
insert. Also update the existing corruption check to fail if any
extent is shifted with a start offset behind the target offset of
the insert range. This prevents insert from racing with COW
writeback completion and fails loudly in the event of an unexpected
extent shift.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-11 13:18:42 -08:00
Linus Torvalds
687dec9b94 Changes since last update:
- Fix improper return value of listxattr() with no xattr;
 
 - Keep up documentation with latest code.
 -----BEGIN PGP SIGNATURE-----
 
 iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXfELlBYcZ2FveGlhbmcy
 NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEtUABAN164UwGU9QKEsqgZQcmbz23qXSJ
 QDR8r/ch2LxzXKkVAQDXCNU+ol6jkiapLcTvsXEjBk8sUxsCEVnmZ36jru+TBA==
 =kRp9
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "Mainly address a regression reported by David recently observed
  together with overlayfs due to the improper return value of
  listxattr() without xattr. Update outdated expressions in document as
  well.

  Summary:

   - Fix improper return value of listxattr() with no xattr

   - Keep up documentation with latest code"

* tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: update documentation
  erofs: zero out when listxattr is called with no xattr
2019-12-11 12:25:32 -08:00
Linus Torvalds
d1c6a2aa02 pipe: simplify signal handling in pipe_read() and add comments
There's no need to separately check for signals while inside the locked
region, since we're going to do "wait_event_interruptible()" right
afterwards anyway, and the error handling is much simpler there.

The check for whether we had already read anything was also redundant,
since we no longer do the odd merging of reads when there are pending
writers.

But perhaps more importantly, this adds commentary about why we still
need to wake up possible writers even though we didn't read any data,
and why we can skip all the finishing touches now if we get a signal (or
had a signal pending) while waiting for more data.

[ This is a split-out cleanup from my "make pipe IO use exclusive wait
  queues" thing, which I can't apply because it triggers a nasty bug in
  the GNU make jobserver   - Linus ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-11 11:46:19 -08:00
David Howells
50559800b7 afs: Show volume name in /proc/net/afs/<cell>/volumes
Show the name of each volume in /proc/net/afs/<cell>/volumes to make it
easier to work out the name corresponding to a volume ID.  This makes it
easier to work out which mounts in /proc/mounts correspond to which volume
ID.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
2019-12-11 17:48:20 +00:00
David Howells
106bc79843 afs: Fix missing cell comparison in afs_test_super()
Fix missing cell comparison in afs_test_super().  Without this, any pair
volumes that have the same volume ID will share a superblock, no matter the
cell, unless they're in different network namespaces.

Normally, most users will only deal with a single cell and so they won't
see this.  Even if they do look into a second cell, they won't see a
problem unless they happen to hit a volume with the same ID as one they've
already got mounted.

Before the patch:

    # ls /afs/grand.central.org/archive
    linuxdev/  mailman/  moin/  mysql/  pipermail/  stage/  twiki/
    # ls /afs/kth.se/
    linuxdev/  mailman/  moin/  mysql/  pipermail/  stage/  twiki/
    # cat /proc/mounts | grep afs
    none /afs afs rw,relatime,dyn,autocell 0 0
    #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0
    #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0
    #grand.central.org:root.archive /afs/kth.se afs ro,relatime 0 0

After the patch:

    # ls /afs/grand.central.org/archive
    linuxdev/  mailman/  moin/  mysql/  pipermail/  stage/  twiki/
    # ls /afs/kth.se/
    admin/        common/  install/  OldFiles/  service/  system/
    bakrestores/  home/    misc/     pkg/       src/      wsadmin/
    # cat /proc/mounts | grep afs
    none /afs afs rw,relatime,dyn,autocell 0 0
    #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0
    #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0
    #kth.se:root.cell /afs/kth.se afs ro,relatime 0 0

Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Carsten Jacobi <jacobi@de.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
cc: Todd DeSantis <atd@us.ibm.com>
2019-12-11 17:47:51 +00:00
David Howells
1da4bd9f9d afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP
Fix the lookup method on the dynamic root directory such that creation
calls, such as mkdir, open(O_CREAT), symlink, etc. fail with EOPNOTSUPP
rather than failing with some odd error (such as EEXIST).

lookup() itself tries to create automount directories when it is invoked.
These are cached locally in RAM and not committed to storage.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
2019-12-11 17:47:51 +00:00
David Howells
158d583353 afs: Fix mountpoint parsing
Each AFS mountpoint has strings that define the target to be mounted.  This
is required to end in a dot that is supposed to be stripped off.  The
string can include suffixes of ".readonly" or ".backup" - which are
supposed to come before the terminal dot.  To add to the confusion, the "fs
lsmount" afs utility does not show the terminal dot when displaying the
string.

The kernel mount source string parser, however, assumes that the terminal
dot marks the suffix and that the suffix is always "" and is thus ignored.
In most cases, there is no suffix and this is not a problem - but if there
is a suffix, it is lost and this affects the ability to mount the correct
volume.

The command line mount command, on the other hand, is expected not to
include a terminal dot - so the problem doesn't arise there.

Fix this by making sure that the dot exists and then stripping it when
passing the string to the mount configuration.

Fixes: bec5eb6141 ("AFS: Implement an autocell mount capability [ver #2]")
Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
2019-12-11 16:56:54 +00:00
Flavio Leitner
346da4d2c7 sched/cputime, proc/stat: Fix incorrect guest nice cpustat value
The value being used for guest_nice should be CPUTIME_GUEST_NICE
and not CPUTIME_USER.

Fixes: 26dae145a7 ("procfs: Use all-in-one vtime aware kcpustat accessor")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191205020344.14940-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-11 07:09:58 +01:00
Jens Axboe
10d5934557 io_uring: add sockets to list of files that support non-blocking issue
In chasing a performance issue between using IORING_OP_RECVMSG and
IORING_OP_READV on sockets, tracing showed that we always punt the
socket reads to async offload. This is due to io_file_supports_async()
not checking for S_ISSOCK on the inode. Since sockets supports the
O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list
of file types that we can do a non-blocking issue to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
53108d476a io_uring: only hash regular files for async work execution
We hash regular files to avoid having multiple threads hammer on the
inode mutex, but it should not be needed on other types of files
(like sockets).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
4a0a7a1874 io_uring: run next sqe inline if possible
One major use case of linked commands is the ability to run the next
link inline, if at all possible. This is done correctly for async
offload, but somewhere along the line we lost the ability to do so when
we were able to complete a request without having to punt it. Ensure
that we do so correctly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
392edb45b2 io_uring: don't dynamically allocate poll data
This essentially reverts commit e944475e69. For high poll ops
workloads, like TAO, the dynamic allocation of the wait_queue
entry for IORING_OP_POLL_ADD adds considerable extra overhead.
Go back to embedding the wait_queue_entry, but keep the usage of
wait->private for the pointer stashing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
d96885658d io_uring: deferred send/recvmsg should assign iov
Don't just assign it from the main call path, that can miss the case
when we're called from issue deferral.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
8a4955ff1c io_uring: sqthread should grab ctx->uring_lock for submissions
We use the mutex to guard against registered file updates, for instance.
Ensure we're safe in accessing that state against concurrent updates.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
e995d5123e io-wq: briefly spin for new work after finishing work
To avoid going to sleep only to get woken shortly thereafter, spin
briefly for new work upon completion of work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:22 -07:00
Jens Axboe
506d95ff5d io-wq: remove worker->wait waitqueue
We only have one cases of using the waitqueue to wake the worker, the
rest are using wake_up_process(). Since we can save some cycles not
fiddling with the waitqueue io_wqe_worker(), switch the work activation
to task wakeup and get rid of the now unused wait_queue_head_t in
struct io_worker.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:22 -07:00
Jens Axboe
4e88d6e779 io_uring: allow unbreakable links
Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.

For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.

Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:06 -07:00
Amir Goldstein
6889ee5a53 ovl: relax WARN_ON() on rename to self
In ovl_rename(), if new upper is hardlinked to old upper underneath
overlayfs before upper dirs are locked, user will get an ESTALE error
and a WARN_ON will be printed.

Changes to underlying layers while overlayfs is mounted may result in
unexpected behavior, but it shouldn't crash the kernel and it shouldn't
trigger WARN_ON() either, so relax this WARN_ON().

Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com
Fixes: 804032fabb ("ovl: don't check rename to self")
Cc: <stable@vger.kernel.org> # v4.9+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-12-10 16:00:55 +01:00
Amir Goldstein
9c6d8f13e9 ovl: fix corner case of non-unique st_dev;st_ino
On non-samefs overlay without xino, non pure upper inodes should use a
pseudo_dev assigned to each unique lower fs and pure upper inodes use the
real upper st_dev.

It is fine for an overlay pure upper inode to use the same st_dev;st_ino
values as the real upper inode, because the content of those two different
filesystem objects is always the same.

In this case, however:
 - two filesystems, A and B
 - upper layer is on A
 - lower layer 1 is also on A
 - lower layer 2 is on B

Non pure upper overlay inode, whose origin is in layer 1 will have the same
st_dev;st_ino values as the real lower inode. This may result with a false
positive results of 'diff' between the real lower and copied up overlay
inode.

Fix this by using the upper st_dev;st_ino values in this case.  This breaks
the property of constant st_dev;st_ino across copy up of this case. This
breakage will be fixed by a later patch.

Fixes: 5148626b80 ("ovl: allocate anon bdev per unique lower fs")
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-12-10 16:00:55 +01:00
Amir Goldstein
ec7bbb53d3 ovl: don't use a temp buf for encoding real fh
We can allocate maximum fh size and encode into it directly.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-12-10 16:00:55 +01:00
Amir Goldstein
cbe7fba8ed ovl: make sure that real fid is 32bit aligned in memory
Seprate on-disk encoding from in-memory and on-wire resresentation
of overlay file handle.

In-memory and on-wire we only ever pass around pointers to struct
ovl_fh, which encapsulates at offset 3 the on-disk format struct
ovl_fb. struct ovl_fb encapsulates at offset 21 the real file handle.
That makes sure that the real file handle is always 32bit aligned
in-memory when passed down to the underlying filesystem.

On-disk format remains the same and store/load are done into
correctly aligned buffer.

New nfs exported file handles are exported with aligned real fid.
Old nfs file handles are copied to an aligned buffer before being
decoded.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-12-10 16:00:55 +01:00
Amir Goldstein
7e63c87fc2 ovl: fix lookup failure on multi lower squashfs
In the past, overlayfs required that lower fs have non null uuid in
order to support nfs export and decode copy up origin file handles.

Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of
lower fs") relaxed this requirement for nfs export support, as long
as uuid (even if null) is unique among all lower fs.

However, said commit unintentionally also relaxed the non null uuid
requirement for decoding copy up origin file handles, regardless of
the unique uuid requirement.

Amend this mistake by disabling decoding of copy up origin file handle
from lower fs with a conflicting uuid.

We still encode copy up origin file handles from those fs, because
file handles like those already exist in the wild and because they
might provide useful information in the future.

There is an unhandled corner case described by Miklos this way:
- two filesystems, A and B, both have null uuid
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B

In this case bad_uuid won't be set for B, because the check only
involves the list of lower fs.  Hence we'll try to decode a layer 2
origin on layer 1 and fail.

We will deal with this corner case later.

Reported-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/lkml/20191106234301.283006-1-colin.king@canonical.com/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid ...")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-12-10 16:00:55 +01:00
Steve French
281393894a smb3: fix refcount underflow warning on unmount when no directory leases
Fix refcount underflow warning when unmounting to servers which didn't grant
directory leases.

[  301.680095] refcount_t: underflow; use-after-free.
[  301.680192] WARNING: CPU: 1 PID: 3569 at lib/refcount.c:28
refcount_warn_saturate+0xb4/0xf3
...
[  301.682139] Call Trace:
[  301.682240]  close_shroot+0x97/0xda [cifs]
[  301.682351]  SMB2_tdis+0x7c/0x176 [cifs]
[  301.682456]  ? _get_xid+0x58/0x91 [cifs]
[  301.682563]  cifs_put_tcon.part.0+0x99/0x202 [cifs]
[  301.682637]  ? ida_free+0x99/0x10a
[  301.682727]  ? cifs_umount+0x3d/0x9d [cifs]
[  301.682829]  cifs_put_tlink+0x3a/0x50 [cifs]
[  301.682929]  cifs_umount+0x44/0x9d [cifs]

Fixes: 72e73c78c4 ("cifs: close the shared root handle on tree disconnect")

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reported-and-tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
2019-12-09 19:47:10 -06:00
Xiubo Li
da08e1e1d7 ceph: add more debug info when decoding mdsmap
Show the laggy state.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09 20:55:10 +01:00
Xiubo Li
bd84fbcb31 ceph: switch to global cap helper
__ceph_is_any_caps is a duplicate helper.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09 20:55:10 +01:00
Xiubo Li
bba1560bd4 ceph: trigger the reclaim work once there has enough pending caps
The nr in ceph_reclaim_caps_nr() is very possibly larger than 1,
so we may miss it and the reclaim work couldn't triggered as expected.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09 20:55:10 +01:00
Jeff Layton
3a3430affc ceph: show tasks waiting on caps in debugfs caps file
Add some visibility of tasks that are waiting for caps to the "caps"
debugfs file. Display the tgid of the waiting task, inode number, and
the caps the task needs and wants.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09 20:55:10 +01:00
Jeff Layton
ad8c28a9eb ceph: convert int fields in ceph_mount_options to unsigned int
Most of these values should never be negative, so convert them to
unsigned values. Add some sanity checking to the parsed values, and
clean up some unneeded casts.

Note that while caps_max should never be negative, this patch leaves
it signed, since this value ends up later being compared to a signed
counter. Just ensure that userland never passes in a negative value
for caps_max.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09 20:55:10 +01:00
Pankaj Bharadiya
c593642c8b treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

	if [[ "$file" =~ $EXCLUDE_FILES ]]; then
		continue
	fi
	sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
2019-12-09 10:36:44 -08:00
Iurii Zaikin
39101b2265 fs/ext4/inode-test: Fix inode test on 32 bit platforms.
Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.

Original error:
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Iurii Zaikin <yzaikin@google.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-12-09 11:15:44 -07:00
David Sterba
78f926f72e btrfs: add Kconfig dependency for BLAKE2B
Because the BLAKE2B code went through a different tree, it was not
available at the time the btrfs part was merged. Now that the Kconfig
symbol exists, add it to the list.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-09 17:56:06 +01:00
David Howells
bcbccaf2ed afs: Fix SELinux setting security label on /afs
Make the AFS dynamic root superblock R/W so that SELinux can set the
security label on it.  Without this, upgrades to, say, the Fedora
filesystem-afs RPM fail if afs is mounted on it because the SELinux label
can't be (re-)applied.

It might be better to make it possible to bypass the R/O check for LSM
label application through setxattr.

Fixes: 4d673da145 ("afs: Support the AFS dynamic root")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: selinux@vger.kernel.org
cc: linux-security-module@vger.kernel.org
2019-12-09 16:37:36 +00:00
Marc Dionne
9bd0160d12 afs: Fix afs_find_server lookups for ipv4 peers
afs_find_server tries to find a server that has an address that
matches the transport address of an rxrpc peer.  The code assumes
that the transport address is always ipv6, with ipv4 represented
as ipv4 mapped addresses, but that's not the case.  If the transport
family is AF_INET, srx->transport.sin6.sin6_addr.s6_addr32[] will
be beyond the actual ipv4 address and will always be 0, and all
ipv4 addresses will be seen as matching.

As a result, the first ipv4 address seen on any server will be
considered a match, and the server returned may be the wrong one.

One of the consequences is that callbacks received over ipv4 will
only be correctly applied for the server that happens to have the
first ipv4 address on the fs_addresses4 list.  Callbacks over ipv4
from all other servers are dropped, causing the client to serve stale
data.

This is fixed by looking at the transport family, and comparing ipv4
addresses based on a sockaddr_in structure rather than a sockaddr_in6.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-12-09 15:04:43 +00:00
Linus Torvalds
a78f7cdddb 9 cifs/smb3 fixes: two timestamp fixes, one oops fix (during oplock break) for stable, two fixes found in multichannel testing, two fixes for file create when using modeforsid mount parm
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3sPUUACgkQiiy9cAdy
 T1GumAwAhh0Fk2uEV01REMgA6MgQ2hrdGE5HariSTzGifCk8cxMnq1H1u9yxtic8
 uvEJQaUmTLWrN2C+xqD2JqPmJyrPOtnL0PLCLQk2/RsPCsDgYnmdKoAehInPh17g
 J8MoKPp1/1wYhbOl7CeF0xo2rEchoh/PcPCXpt8qj+M+kBgQkI64UQ/6iY/mV9Zl
 n7WJJFDyz3D1+SaJPaVxMpNxZcMpFbGqVJYTWP4v3pL2E8wEhyWjAryLCJAFFGf7
 Y2FwOSFuifMN/qC9t83W5KkRT9I/zRQ2g5qK1tC24LiTjQ3cqkCy1SSqpKQyvKwz
 P/oRX0HsuIbr1KFzN55kg831m/V7/1B/5bf9AivfhjsAoSyp2yyVQgPeV+nQkO0r
 iQdNatohC9HlwXmrypS+GhLXnj8xLnCR4+Aj7hGSuiVLHnCOfnGjQxI40BFWaBli
 1RG9agkploMYvcjcgSgDGVFFWTeHgSQKI1DQTL2Nx4py1zj7Rv/kEgwkZ3zdEf9h
 PPl37hBM
 =gey9
 -----END PGP SIGNATURE-----

Merge tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Nine cifs/smb3 fixes:

   - one fix for stable (oops during oplock break)

   - two timestamp fixes including important one for updating mtime at
     close to avoid stale metadata caching issue on dirty files (also
     improves perf by using SMB2_CLOSE_FLAG_POSTQUERY_ATTRIB over the
     wire)

   - two fixes for "modefromsid" mount option for file create (now
     allows mode bits to be set more atomically and accurately on create
     by adding "sd_context" on create when modefromsid specified on
     mount)

   - two fixes for multichannel found in testing this week against
     different servers

   - two small cleanup patches"

* tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: improve check for when we send the security descriptor context on create
  smb3: fix mode passed in on create for modetosid mount option
  cifs: fix possible uninitialized access and race on iface_list
  cifs: Fix lookup of SMB connections on multichannel
  smb3: query attributes on file close
  smb3: remove unused flag passed into close functions
  cifs: remove redundant assignment to pointer pneg_ctxt
  fs: cifs: Fix atime update check vs mtime
  CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
2019-12-08 12:12:18 -08:00
Linus Torvalds
5bf9a06a5f Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
 "No common topic, just three cleanups".

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make __d_alloc() static
  fs/namespace: add __user to open_tree and move_mount syscalls
  fs/fnctl: fix missing __user in fcntl_rw_hint()
2019-12-08 11:08:28 -08:00
Linus Torvalds
95207d554b Fixes for 5.5-rc1:
- Fix a UAF when reporting writeback errors
 - Fix a race condition when handling page uptodate on a blocksize <
   pagesize file that is also fragmented
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3pJVwACgkQ+H93GTRK
 tOue8RAAklPs+kV4l1JQudPkd/lXIPCqyRKBEjNDQRfj27v5s7h/3fMqMsV6/RIs
 zTkrRc/N5Uu6SMnvMIHvTx1nAk6ZnTiN3gUHenoUQgQuxZ4jtGhIvHmyiOGgI27j
 Xd2g+aE8wdhZDnD2b+4AsnJmtvffX0g3DNDZ4KJn1d/Ejs2qQIHgeaoyj56Fz8vh
 4dqQxSg5RvL25k3qqHSVogsfxSRJxvYJefcARUL25a58UzE6DLJqeb3v41RVLc4l
 lDOX/o4lavqo1lmptWjsTn4bqqyeRfNWp2M2EGE4a411aEaNTkFB7xzQ9Y1v04kK
 VHq6XK3vieAv6VOuqZ3ySx6ihQcb9dQWXiMkJ9HDVDv4rmaLlrHECM19A5toZN/6
 0Xy6xLDqbimckyWDZLBUnkfcoCsI+uWTdNIBxkjEYFdDuL33ovUlKu+Cn63vYzoQ
 aCWnNA4NdKOBYps9aKCQV35IU1ODmOYWqkxkpSaVYKApi8Q6+2HUXOf+fXFBiV91
 zO0/nWq8RU7fGQxk8YJtMO2E0lGMaMAt03vgp+pMkd0aIo0NWFLV0q3tko5VlWNv
 v8BGkphgrOOqfsCFh8wua5x8bIXOBBElOSUjU0wBrAHb0ZXkQ3zi0nUt+sj7RFUX
 a+ojaRrXvuyZZArOvQbrHEC0HWchVmvfiyXRAPbxbLi1BiLfPW8=
 =+LzI
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fixes from Darrick Wong:
 "Fix a race condition and a use-after-free error:

   - Fix a UAF when reporting writeback errors

   - Fix a race condition when handling page uptodate on fragmented file
     with blocksize < pagesize"

* tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: stop using ioend after it's been freed in iomap_finish_ioend()
  iomap: fix sub-page uptodate handling
2019-12-07 17:07:18 -08:00
Linus Torvalds
50caca9d7f Fixes for 5.5-rc1:
- Fix a crash in the log setup code when log mounting fails
 - Fix a hang when allocating space on the realtime device
 - Fix a block leak when freeing space on the realtime device
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3n5OkACgkQ+H93GTRK
 tOtUvw/+Lxnwx1AaYT7hSMS4p1buJnAV7aM6Ofm+F+GkRISWlxSEYLIjrj3NFxeI
 Fv4jnPWMGHtBzW2c4OXv9zhW7vQeuU0Z72sgxA+Wqccf53sfR5Qum+Tya+Bo8X0X
 LPeDEuA+k4UUUHvSLqscPFPZYbXgwp3dJ2i7JeuH0vx3BhFeTjV1nlOeo1z3QBzN
 LLVjuBg7Zms+TPorrOgS67LAWfzqCAiQDauvyODB5EW+UElK6pvpOklFzc7TUPr2
 PIvmjEE8UNN2y9QuEEKJ26t43ejAzG016yGMxyW74i5wU33R7PHkdgG1pWwWRZzr
 yvNClg2CMtLu+Bcx8Lc9X23mW0DqPkUYDohWCe/Tytvz5kaKCEq3WlqwaG5EdMOg
 gSJlivpoOTduQ26V4fPvD1/fjGpJLWyfHPs9p43wie+K/NuUcTIvr4BGGQszjS/n
 5Zr630g6Tq5VrBMl0f1P2NuEbeQEvmbWNTW2TIvvHZTgMd8mZdvX28IXj0dAhBb/
 2U5o1NF8F6VeRGYoZFJI70RIfiLYzOQEmsA5hAyJfUQQ18u8zDJuPV4isO4/XjQf
 d32E36cvP3CKVQYn7hAiMD5O8jOFckYL287qd4uDYutmleHEcfzc0H4pTW+66IHp
 IuzkPCgvOkd4h4qnhtHSDoSlKd7kc1Ai3hBhCdA9zd6IUAVsZ6Y=
 =Ps4B
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Fix a couple of resource management errors and a hang:

   - fix a crash in the log setup code when log mounting fails

   - fix a hang when allocating space on the realtime device

   - fix a block leak when freeing space on the realtime device"

* tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix mount failure crash on invalid iclog memory access
  xfs: don't check for AG deadlock for realtime files in bunmapi
  xfs: fix realtime file data space leak
2019-12-07 17:05:33 -08:00
Linus Torvalds
316933cf74 orangefs: posix open permission checking...
Orangefs has no open, and orangefs checks file permissions
 on each file access. Posix requires that file permissions
 be checked on open and nowhere else. Orangefs-through-the-kernel
 needs to seem posix compliant.
 
 The VFS opens files, even if the filesystem provides no
 method. We can see if a file was successfully opened for
 read and or for write by looking at file->f_mode.
 
 When writes are flowing from the page cache, file is no
 longer available. We can trust the VFS to have checked
 file->f_mode before writing to the page cache.
 
 The mode of a file might change between when it is opened
 and IO commences, or it might be created with an arbitrary mode.
 
 We'll make sure we don't hit EACCES during the IO stage by
 using UID 0.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJd6qoUAAoJEM9EDqnrzg2+SHQQAI8osluG1xDle0Ur0y/XQrWR
 z/1+mA/ZJpkaC2KPJ3F/B93ZR7TSSb6xB/u/EoxfqVQDoVpodP3PzcvSosRsePOk
 OYo67xit7YRcg2nQF5kEjR+wYbW/T1j55oQzrWxLvYr+FhlDZLJyn0xuSaGvvuKQ
 kWqNwPQpIZwNR1ZJ6Yjif86kR4sWF5htoy976x5ScvoeOb08dNHQn2je5oXH/eKH
 zwWBVYTeZTAIVCs9YV2UM4gi5/0pysjSL58jP7+ckLj79ozBoyhc9cRB4ez0cFyc
 4+4dW9zZ1GAfvmbsFzvCfKb2Syz4JkStGJQGST+cgH9ldp70R8AdRjzYfZGXa2af
 9I/jRgrVBsU/jo++a1npMy2j44+2GvhoValzKePwiCGTOB/f80XsmB9p9qci8JCv
 ucVzJwbhjxPKphUpnW8Gg7F2gWr2ULhv+wKRmAb3tF+bIFPjn7KjyzFfUAS3FY1s
 iwgci0Mw9NLLlvX511N0wiUGo6V9A9r7XsZQjScmm/3ybUhMyJAYoe81OO60Xwnv
 2s+V0Tv9ah4b+EF0J0qtQ7GzsoKDBu+ZWqGieiOXDWTVixY2gV6CetnR7veeSeQh
 s9OeqY8qaSYiV9KtBNZp56IS4PuADDgxnRB1pXTUUPgapuElEtvYC1BUovidMMmh
 kLQEpYdSGrkLRah4hKsg
 =9AOz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs update from Mike Marshall:
 "orangefs: posix open permission checking...

  Orangefs has no open, and orangefs checks file permissions on each
  file access. Posix requires that file permissions be checked on open
  and nowhere else. Orangefs-through-the-kernel needs to seem posix
  compliant.

  The VFS opens files, even if the filesystem provides no method. We can
  see if a file was successfully opened for read and or for write by
  looking at file->f_mode.

  When writes are flowing from the page cache, file is no longer
  available. We can trust the VFS to have checked file->f_mode before
  writing to the page cache.

  The mode of a file might change between when it is opened and IO
  commences, or it might be created with an arbitrary mode.

  We'll make sure we don't hit EACCES during the IO stage by using
  UID 0"

[ This is "posixish", but not a great solution in the long run, since a
  proper secure network server shouldn't really trust the client like this.
  But proper and secure POSIX behavior requires an open method and a
  resulting cookie for IO of some kind, or similar.    - Linus ]

* tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: posix open permission checking...
2019-12-07 16:59:25 -08:00
Linus Torvalds
911d137ab0 This is a relatively quiet cycle for nfsd, mainly various bugfixes.
Possibly most interesting is Trond's fixes for some callback races that
 were due to my incomplete understanding of rpc client shutdown.
 Unfortunately at the last minute I've started noticing a new
 intermittent failure to send callbacks.  As the logic seems basically
 correct, I'm leaving Trond's patches in for now, and hope to find a fix
 in the next week so I don't have to revert those patches.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl3r3AAVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rjkP/3L6DZs0Uv0BYbGq5Gmit0uoPSQk
 8BT7oQhbagCh+ULRYWCnK6cz82wejR4Gzq4PLyl5x5Vcc5x+bLoPI9YgiRZlIbZu
 ZvSg93E6SITLfq5xRlDC0MlIVZkI+HoIfyYgv1aYiWvQ3834bcx4DxVm9h7cNpT3
 x37anEFi1lv3n9fct3obOrs3AvCS76XyA6VVhcSLJ77amKQ+O7LI0crqUc6cuX2i
 CkTwTSDwyCrzkx3dZ2xDPDTbLecxw+Ce4adaby5v3GEQo3TOCmEWX92D3dvzfMmv
 ICU07FsVOILnIT/fmC91b1+JWVRLjUUBw5EPmDduwSP/yw4YnIEODFEP/wAUAmMJ
 vJ9hi9c1rThQ9n8h08RIwA2snhnpXRxKCWhpIRY6WM8DhHL9Y9AuVPYTKxhQOjPK
 l3wbOGcMW63NrTOPHHN7hTB0vDLgPKIXYVIrMvZTd/P7CghDDEbhT1gDvx/IL3Uq
 WrHKbJtK7rbx9i2bh5f6fH0DRrv7lxbD0ffunRRa3twPAe6zsG9WPjsbZZraZzEg
 O7/o3wZu2N7MpL5bXPfzB+5ylOTxvNWew07NJjA4BIOfwin3bw/71YfB0Vnoairv
 PhmbN2Dj4/t82ld0JU5GJWojpUfH4ARXM2Li9WO99wzx+KrxScsqGPnRMFe9dC7b
 Q7ltP1p0gUbkJ88Z
 =b2zA
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "This is a relatively quiet cycle for nfsd, mainly various bugfixes.

  Possibly most interesting is Trond's fixes for some callback races
  that were due to my incomplete understanding of rpc client shutdown.
  Unfortunately at the last minute I've started noticing a new
  intermittent failure to send callbacks. As the logic seems basically
  correct, I'm leaving Trond's patches in for now, and hope to find a
  fix in the next week so I don't have to revert those patches"

* tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
  nfsd: depend on CRYPTO_MD5 for legacy client tracking
  NFSD fixing possible null pointer derefering in copy offload
  nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
  nfsd: Ensure CLONE persists data and metadata changes to the target file
  SUNRPC: Fix backchannel latency metrics
  nfsd: restore NFSv3 ACL support
  nfsd: v4 support requires CRYPTO_SHA256
  nfsd: Fix cld_net->cn_tfm initialization
  lockd: remove __KERNEL__ ifdefs
  sunrpc: remove __KERNEL__ ifdefs
  race in exportfs_decode_fh()
  nfsd: Drop LIST_HEAD where the variable it declares is never used.
  nfsd: document callback_wq serialization of callback code
  nfsd: mark cb path down on unknown errors
  nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
  nfsd: minor 4.1 callback cleanup
  SUNRPC: Fix svcauth_gss_proxy_init()
  SUNRPC: Trace gssproxy upcall results
  sunrpc: fix crash when cache_head become valid before update
  nfsd: remove private bin2hex implementation
  ...
2019-12-07 16:56:00 -08:00
Linus Torvalds
fb9bf40cf0 NFS client updates for Linux 5.5
Highlights include:
 
 Features:
 - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded copy
   of a file from one source server to a different target server).
 - New RDMA tracepoints for debugging congestion control and Local Invalidate
   WRs.
 
 Bugfixes and cleanups
 - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
   layoutreturn
 - Handle bad/dead sessions correctly in nfs41_sequence_process()
 - Various bugfixes to the delegation return operation.
 - Various bugfixes pertaining to delegations that have been revoked.
 - Cleanups to the NFS timespec code to avoid unnecessary conversions
   between timespec and timespec64.
 - Fix unstable RDMA connections after a reconnect
 - Close race between waking an RDMA sender and posting a receive
 - Wake pending RDMA tasks if connection fails
 - Fix MR list corruption, and clean up MR usage
 - Fix another RPCSEC_GSS issue with MIC buffer space
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl3qp8QACgkQZwvnipYK
 APIZfBAAuFhLUA2Ua9OQOPDJDkQ1IFDBfYGG48aqVu3GIXS9LkEvTavLm/P9ocm+
 ijGsUv2iw4x9H4S7OGuzLQm5zmTNsQAlPXD+3+xQS7cjPjh5HCyIAEgpov+JEGae
 CeZoSvhtdBd0xB71t2zAKEdHkqc47Jxz3Db0FX22zTTnDvdhArfggisZUt4Xq5Qb
 cPcs8R1E5yBZqJFHKObOUP4itVYsXte/VFhtWpjRFqzaZ/t7xNpPVOBH8cli7aI9
 E6DqdbIjUreyn62FVWYIeGhwvsKdxv+Slc5ZOEbD45jUryovyCAZxhqDmcAg/0q0
 uykplL0cv8MeiZ68wmlxdir/n36hWduiGqa0UKMg2+BAbdudGKJ7xPhkGYP2uZqo
 zoZGjd+Hl8AunMBUaT7YAxWOzuIXeMP338szTL6sSBPxT75WmmNJAh3J4b22G7Bl
 eGrcJcckDBnvfRCia40l8g9NLHmVKqS9qNKxSWMlMlBmwd1HE0oEE1ddCx9bGHKe
 srf0S14RPQBRF6r+Nv0cx5S+CiptDtGiILR+cn5ZDra5YYCPX5kkJ6VEqw/m4yNE
 AKjjj5gim+jWYdBOTMU3u5KNNqFx37xnOCdC+5DvhMNWRHf2O/I5JSKtuKaZht+5
 PEuwcYfQvaZGp3fCEh38zzOX2qWUhRMbXUqSv5F0DbuWK7OAABQ=
 =VZFk
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded
     copy of a file from one source server to a different target
     server).

   - New RDMA tracepoints for debugging congestion control and Local
     Invalidate WRs.

  Bugfixes and cleanups

   - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
     layoutreturn

   - Handle bad/dead sessions correctly in nfs41_sequence_process()

   - Various bugfixes to the delegation return operation.

   - Various bugfixes pertaining to delegations that have been revoked.

   - Cleanups to the NFS timespec code to avoid unnecessary conversions
     between timespec and timespec64.

   - Fix unstable RDMA connections after a reconnect

   - Close race between waking an RDMA sender and posting a receive

   - Wake pending RDMA tasks if connection fails

   - Fix MR list corruption, and clean up MR usage

   - Fix another RPCSEC_GSS issue with MIC buffer space"

* tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  SUNRPC: Capture completion of all RPC tasks
  SUNRPC: Fix another issue with MIC buffer space
  NFS4: Trace lock reclaims
  NFS4: Trace state recovery operation
  NFSv4.2 fix memory leak in nfs42_ssc_open
  NFSv4.2 fix kfree in __nfs42_copy_file_range
  NFS: remove duplicated include from nfs4file.c
  NFSv4: Make _nfs42_proc_copy_notify() static
  NFS: Fallocate should use the nfs4_fattr_bitmap
  NFS: Return -ETXTBSY when attempting to write to a swapfile
  fs: nfs: sysfs: Remove NULL check before kfree
  NFS: remove unneeded semicolon
  NFSv4: add declaration of current_stateid
  NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
  NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
  nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list
  SUNRPC: Avoid RPC delays when exiting suspend
  NFS: Add a tracepoint in nfs_fh_to_dentry()
  NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
  NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
  ...
2019-12-07 16:50:55 -08:00
Steve French
231e2a0ba5 smb3: improve check for when we send the security descriptor context on create
We had cases in the previous patch where we were sending the security
descriptor context on SMB3 open (file create) in cases when we hadn't
mounted with with "modefromsid" mount option.

Add check for that mount flag before calling ad_sd_context in
open init.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-07 17:38:22 -06:00
Linus Torvalds
85190d15f4 pipe: don't use 'pipe_wait() for basic pipe IO
pipe_wait() may be simple, but since it relies on the pipe lock, it
means that we have to do the wakeup while holding the lock.  That's
unfortunate, because the very first thing the waked entity will want to
do is to get the pipe lock for itself.

So get rid of the pipe_wait() usage by simply releasing the pipe lock,
doing the wakeup (if required) and then using wait_event_interruptible()
to wait on the right condition instead.

wait_event_interruptible() handles races on its own by comparing the
wakeup condition before and after adding itself to the wait queue, so
you can use an optimistic unlocked condition for it.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 13:53:09 -08:00
Linus Torvalds
a28c8b9db8 pipe: remove 'waiting_writers' merging logic
This code is ancient, and goes back to when we only had a single page
for the pipe buffers.  The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).

At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.

However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense.  And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").

But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 13:21:01 -08:00
Linus Torvalds
f467a6a664 pipe: fix and clarify pipe read wakeup logic
This is the read side version of the previous commit: it simplifies the
logic to only wake up waiting writers when necessary, and makes sure to
use a synchronous wakeup.  This time not so much for GNU make jobserver
reasons (that pipe never fills up), but simply to get the writer going
quickly again.

A bit less verbose commentary this time, if only because I assume that
the write side commentary isn't going to be ignored if you touch this
code.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 12:54:26 -08:00
Linus Torvalds
1b6b26ae70 pipe: fix and clarify pipe write wakeup logic
The pipe rework ends up having been extra painful, partly becaused of
actual bugs with ordering and caching of the pipe state, but also
because of subtle performance issues.

In particular, the pipe rework caused the kernel build to inexplicably
slow down.

The reason turns out to be that the GNU make jobserver (which limits the
parallelism of the build) uses a pipe to implement a "token" system: a
parallel submake will read a character from the pipe to get the job
token before starting a new job, and will write a character back to the
pipe when it is done.  The overall job limit is thus easily controlled
by just writing the appropriate number of initial token characters into
the pipe.

But to work well, that really means that the old behavior of write
wakeups being synchronous (WF_SYNC) is very important - when the pipe
writer wakes up a reader, we want the reader to actually get scheduled
immediately.  Otherwise you lose the parallelism of the build.

The pipe rework lost that synchronous wakeup on write, and we had
clearly all forgotten the reasons and rules for it.

This rewrites the pipe write wakeup logic to do the required Wsync
wakeups, but also clarifies the logic and avoids extraneous wakeups.

It also ends up addign a number of comments about what oit does and why,
so that we hopefully don't end up forgetting about this next time we
change this code.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 12:14:28 -08:00
Linus Torvalds
ad910e36da pipe: fix poll/select race introduced by the pipe rework
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on.  That avoids the races with the event you're waiting for.

The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.

Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.

So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.

Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 10:41:17 -08:00
Patrick Steinhardt
38a2204f52 nfsd: depend on CRYPTO_MD5 for legacy client tracking
The legacy client tracking infrastructure of nfsd makes use of MD5 to
derive a client's recovery directory name. As the nfsd module doesn't
declare any dependency on CRYPTO_MD5, though, it may fail to allocate
the hash if the kernel was compiled without it. As a result, generation
of client recovery directories will fail with the following error:

    NFSD: unable to generate recoverydir name

The explicit dependency on CRYPTO_MD5 was removed as redundant back in
6aaa67b5f3 (NFSD: Remove redundant "select" clauses in fs/Kconfig
2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5.
This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit
df486a2590 (NFS: Fix the selection of security flavours in Kconfig) at
a later point.

Fix the issue by adding back an explicit dependency on CRYPTO_MD5.

Fixes: df486a2590 (NFS: Fix the selection of security flavours in Kconfig)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-07 11:28:52 -05:00
Olga Kornievskaia
18f428d4e2 NFSD fixing possible null pointer derefering in copy offload
Static checker revealed possible error path leading to possible
NULL pointer dereferencing.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e0639dc580: ("NFSD introduce async copy feature")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-07 11:28:46 -05:00
David Howells
76f6777c9c pipe: Fix iteration end check in fuse_dev_splice_write()
Fix the iteration end check in fuse_dev_splice_write().  The iterator
position can only be compared with == or != since wrappage may be involved.

Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-06 13:57:04 -08:00
Linus Torvalds
ec057595cb pipe: fix incorrect caching of pipe state over pipe_wait()
Similarly to commit 8f868d68d3 ("pipe: Fix missing mask update after
pipe_wait()") this fixes a case where the pipe rewrite ended up caching
the pipe state incorrectly over a pipe lock drop event.

It wasn't quite as obvious, because you needed to splice data from a
pipe to a file, which is a fairly unusual operation, but it's completely
wrong.

Make sure we load the pipe head/tail/size information only after we've
waited for there to be data in the pipe.

While in that file, also make one of the splice helper functions use the
canonical arghument order for pipe_empty().  That's syntactic - pipe
emptiness is just that head and tail are equal, and thus mixing up head
and tail doesn't really matter.  It's still wrong, though.

Reported-by: David Sterba <dsterba@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-06 12:40:35 -08:00
Steve French
fdef665ba4 smb3: fix mode passed in on create for modetosid mount option
When using the special SID to store the mode bits in an ACE (See
http://technet.microsoft.com/en-us/library/hh509017(v=ws.10).aspx)
which is enabled with mount parm "modefromsid" we were not
passing in the mode via SMB3 create (although chmod was enabled).
SMB3 create allows a security descriptor context to be passed
in (which is more atomic and thus preferable to setting the mode
bits after create via a setinfo).

This patch enables setting the mode bits on create when using
modefromsid mount option.  In addition it fixes an endian
error in the definition of the Control field flags in the SMB3
security descriptor. It also makes the ACE type of the special
SID better match the documentation (and behavior of servers
which use this to store mode bits in SMB3 ACLs).

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-06 14:15:52 -06:00
Linus Torvalds
9feb1af97e for-linus-20191205
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3puL4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOJD/4vV2ENSYcwZ7Qezgn9tx5NEdADN6jC0DXe
 tJDnA0sDLfLTlVM9I1U2/wD6Wu31cvV2dHKmxO+WWWRP2M0orI00UA3I6BssVptY
 42e/YJd13IM+iVrhfhm+hmcAHvQUmWHTPZg/7F7OZPkEtNcbCjn6s+HTyzu9gqtt
 /kbJVDJ75TR9dEWCPY/A4jUcJYLw2leoHI4u2eqYRsTFNJHUQoaFUHDyNHyj7UNI
 kIUi2UixJv+a4pkFvyLPKhJcLr6DBC2TeUnfP2Re5oX1Z/XkDR7iX+L5dUwRHHbT
 4M7aJJ+mHDm8Z4IwwBqsSDRU5wUpzuplwBQBY/EQilUZAyOALVzUQABlRa0zxH8t
 0D4U6LDDOopYeK0by/FGdD88S6Z0LKm0HJETbdPaGd9fqSlhBn19iGeBKYk0cCIu
 Uecm9DKF+QLfKhTbN6h8nPyR3bfkqyUptQJ1UnheWo9f6L32bfp19sdI9LdtRUnS
 k661QApajaYQu/641u9CDZiI+DvI+57Wm9I4eCF3GXqOYWPAxSgWP7rJVgS+mfA3
 wo99/r11SruaA1Wqf+bOoN/wjsQCiUPFa8po8sh05ER4MH5Brb5EFlg04ZlQZkR9
 jOdiL8ISM9t86eyKwtR4PanE7/pPEYUjEWm4ntCC6ViTwCvgV3d6s2We6QPw1j2l
 nQ9p/c2bhg==
 =7ZFw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block

Pull more block and io_uring updates from Jens Axboe:
 "I wasn't expecting this to be so big, and if I was, I would have used
  separate branches for this. Going forward I'll be doing separate
  branches for the current tree, just like for the next kernel version
  tree. In any case, this contains:

   - Series from Christoph that fixes an inherent race condition with
     zoned devices and revalidation.

   - null_blk zone size fix (Damien)

   - Fix for a regression in this merge window that caused busy spins by
     sending empty disk uevents (Eric)

   - Fix for a regression in this merge window for bfq stats (Hou)

   - Fix for io_uring creds allocation failure handling (me)

   - io_uring -ERESTARTSYS send/recvmsg fix (me)

   - Series that fixes the need for applications to retain state across
     async request punts for io_uring. This one is a bit larger than I
     would have hoped, but I think it's important we get this fixed for
     5.5.

   - connect(2) improvement for io_uring, handling EINPROGRESS instead
     of having applications needing to poll for it (me)

   - Have io_uring use a hash for poll requests instead of an rbtree.
     This turned out to work much better in practice, so I think we
     should make the switch now. For some workloads, even with a fair
     amount of cancellations, the insertion sort is just too expensive.
     (me)

   - Various little io_uring fixes (me, Jackie, Pavel, LimingWu)

   - Fix for brd unaligned IO, and a warning for the future (Ming)

   - Fix for a bio integrity data leak (Justin)

   - bvec_iter_advance() improvement (Pavel)

   - Xen blkback page unmap fix (SeongJae)

  The major items in here are all well tested, and on the liburing side
  we continue to add regression and feature test cases. We're up to 50
  topic cases now, each with anywhere from 1 to more than 10 cases in
  each"

* tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits)
  block: fix memleak of bio integrity data
  io_uring: fix a typo in a comment
  bfq-iosched: Ensure bio->bi_blkg is valid before using it
  io_uring: hook all linked requests via link_list
  io_uring: fix error handling in io_queue_link_head
  io_uring: use hash table for poll command lookups
  io-wq: clear node->next on list deletion
  io_uring: ensure deferred timeouts copy necessary data
  io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
  null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
  brd: warn on un-aligned buffer
  brd: remove max_hw_sectors queue limit
  xen/blkback: Avoid unmapping unmapped grant pages
  io_uring: handle connect -EINPROGRESS like -EAGAIN
  block: set the zone size in blk_revalidate_disk_zones atomically
  block: don't handle bio based drivers in blk_revalidate_disk_zones
  block: allocate the zone bitmaps lazily
  block: replace seq_zones_bitmap with conv_zones_bitmap
  block: simplify blkdev_nr_zones
  block: remove the empty line at the end of blk-zoned.c
  ...
2019-12-06 10:08:59 -08:00
Linus Torvalds
0aecba6173 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
 "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
  Basically, the problem is that negative pinned dentries require
  careful treatment - unless ->d_lock is locked or parent is held at
  least shared, another thread can make them positive right under us.

  Most of the uses turned out to be safe - the main surprises as far as
  filesystems are concerned were

   - race in dget_parent() fastpath, that might end up with the caller
     observing the returned dentry _negative_, due to insufficient
     barriers. It is positive in memory, but we could end up seeing the
     wrong value of ->d_inode in CPU cache. Fixed.

   - manual checks that result of lookup_one_len_unlocked() is positive
     (and rejection of negatives). Again, insufficient barriers (we
     might end up with inconsistent observed values of ->d_inode and
     ->d_flags). Fixed by switching to a new primitive that does the
     checks itself and returns ERR_PTR(-ENOENT) instead of a negative
     dentry. That way we get rid of boilerplate converting negatives
     into ERR_PTR(-ENOENT) in the callers and have a single place to
     deal with the barrier-related mess - inside fs/namei.c rather than
     in every caller out there.

  The guts of pathname resolution *do* need to be careful - the race
  found by Ritesh is real, as well as several similar races.
  Fortunately, it turns out that we can take care of that with fairly
  local changes in there.

  The tree-wide audit had not been fun, and I hate the idea of repeating
  it. I think the right approach would be to annotate the places where
  we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
  catch regressions. But I'm still not sure what would be the least
  invasive way of doing that and it's clearly the next cycle fodder"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namei.c: fix missing barriers when checking positivity
  fix dget_parent() fastpath race
  new helper: lookup_positive_unlocked()
  fs/namei.c: pull positivity check into follow_managed()
2019-12-06 09:06:58 -08:00
Linus Torvalds
b0d4beaa5a Merge branch 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull autofs updates from Al Viro:
 "autofs misuses checks for ->d_subdirs emptiness; the cursors are in
  the same lists, resulting in false negatives. It's not needed anyway,
  since autofs maintains counter in struct autofs_info, containing 0 for
  removed ones, 1 for live symlinks and 1 + number of children for live
  directories, which is precisely what we need for those checks.

  This series switches to use of that counter and untangles the crap
  around its uses (it needs not be atomic and there's a bunch of
  completely pointless "defensive" checks).

  This fell out of dcache_readdir work; the main point is to get rid of
  ->d_subdirs abuses in there. I've more followup cleanups, but I hadn't
  run those by Ian yet, so they can go next cycle"

* 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  autofs: don't bother with atomics for ino->count
  autofs_dir_rmdir(): check ino->count for deciding whether it's empty...
  autofs: get rid of pointless checks around ->count handling
  autofs_clear_leaf_automount_flags(): use ino->count instead of ->d_subdirs
2019-12-05 17:11:48 -08:00
Linus Torvalds
da73fcd8cf Merge branch 'pipe-rework' (patches from David Howells)
Merge two fixes for the pipe rework from David Howells:
 "Here are a couple of patches to fix bugs syzbot found in the pipe
  changes:

   - An assertion check will sometimes trip when polling a pipe because
     the ring size and indices used are approximate and may be being
     changed simultaneously.

     An equivalent approximate calculation was done previously, but
     without the assertion check, so I've just dropped the check. To
     make it accurate, the pipe mutex would need to be taken or the spin
     lock could be used - but usage of the spinlock would need to be
     rolled out into splice, iov_iter and other places for that.

   - The index mask and the max_usage values cannot be cached across
     pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
     This can cause pipe_write() to break"

* pipe-rework:
  pipe: Fix missing mask update after pipe_wait()
  pipe: Remove assertion from pipe_poll()
2019-12-05 16:35:53 -08:00
David Howells
8f868d68d3 pipe: Fix missing mask update after pipe_wait()
Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter
function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them.
Without this, pipe_write() may subsequently miscalculate the array
indices and pipe fullness, leading to an oops like the following:

  BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
  Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
  ...
  CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
  ...
  Call Trace:
    pipe_write+0xc25/0xe10 fs/pipe.c:481
    call_write_iter include/linux/fs.h:1895 [inline]
    new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
    __vfs_write+0x94/0x110 fs/read_write.c:496
    vfs_write+0x18a/0x520 fs/read_write.c:558
    ksys_write+0x105/0x220 fs/read_write.c:611
    __do_sys_write fs/read_write.c:623 [inline]
    __se_sys_write fs/read_write.c:620 [inline]
    __x64_sys_write+0x6e/0xb0 fs/read_write.c:620
    do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is not a problem for pipe_read() as the mask is recalculated on
each pass of the loop, after pipe_wait() has been called.

Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@kernel.org>
[ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05 15:56:20 -08:00
David Howells
8c7b8c34ae pipe: Remove assertion from pipe_poll()
An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size.  However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ
to intervene and muck up the calculation, thereby causing the oops.

Fix this by simply removing the assertion and accepting that the
calculation might be approximate.

Note that the previous code also had a similar issue, though there was
no assertion check, since the occupancy counter and the ring size were
not read with a lock held, so it's possible that the poll check might
have malfunctioned then too.

Also wake up all the waiters so that they can reissue their checks if
there was a competing read or write.

Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05 15:33:50 -08:00
Linus Torvalds
3f1266ec70 GFS2 changes for this merge window:
Bob's extensive filesystem withdrawal and recovery testing:
 - Don't write log headers after file system withdraw
 - clean up iopen glock mess in gfs2_create_inode
 - Close timing window with GLF_INVALIDATE_IN_PROGRESS
 - Abort gfs2_freeze if io error is seen
 - Don't loop forever in gfs2_freeze if withdrawn
 - fix infinite loop in gfs2_ail1_flush on io error
 - Introduce function gfs2_withdrawn
 - fix glock reference problem in gfs2_trans_remove_revoke
 
 Filesystems with a block size smaller than the page size:
 - Fix end-of-file handling in gfs2_page_mkwrite
 - Improve mmap write vs. punch_hole consistency
 
 Other:
 - Remove active journal side effect from gfs2_write_log_header
 - Multi-block allocations in gfs2_page_mkwrite
 
 Minor cleanups and coding style fixes:
 - Remove duplicate call from gfs2_create_inode
 - make gfs2_log_shutdown static
 - make gfs2_fs_parameters static
 - Some whitespace cleanups
 - removed unnecessary semicolon
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJd6VQ1AAoJENW/n+sDE2U6tUsP/2Zd64dVA+2KzpfvklAUNpx/
 TENbvRqGVvhofxuScY0oAvcg0KZN03K/L6ZCa71e7RI7T3VM/FG3rKNCkm084a9d
 r+Rxcz+eSkce5qKVNva6CtQiczGAj27iNiho0Y9IgtzEZ5KEVJuY3cV8Z2qzHrQM
 Rel2aVpUtaLhbFOj3jLGMt/HHSw8RTTzNqtJJCwRys/tVF1WPVqyNg4PD9a7zMT7
 z96tqrlQQxFT4SGiZVJwQHFGuQZEnbr2ahNRmivmGtnNnawLxpEdFuFrSAsC73UB
 wHO0Dq+7vYVTyQ7HugrqdkXxqyQr5ta06Pep7uj8ZvhoLWZvPuBf2SccpIn9Ufvo
 9iLFY5Z9cHg6wpsW+YMG75Mz0A6WbPRIScVog0fxaKW+z0vMZOmsT6hT6llAlCzn
 oj1igqVEOIBTS+4uMDIOJKvMixo4NTdgsLFQyftUxNHiCw5iGbqkV7ux31YjqX22
 A830zv8lba44BsixGtuPEy/0Dnka7rMfRp0cflCzKESSLIdXtSjUSEtS6g0fJISS
 qKNmnnkHpvjBGMG4lOqJihJdQ+2IiIMLWdNgxkvWgt6F7xjl3gcdREjJF0dmFsd+
 GfDUkUZ/70T9UPsaTR2V0GBvEleq4PglALcet9Eela7tKOEziNf3L+prRjMr3909
 y4uJIMH9/no1knKBxtkJ
 =b1m/
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Andreas Gruenbacher:
 "Bob's extensive filesystem withdrawal and recovery testing:
   - don't write log headers after file system withdraw
   - clean up iopen glock mess in gfs2_create_inode
   - close timing window with GLF_INVALIDATE_IN_PROGRESS
   - abort gfs2_freeze if io error is seen
   - don't loop forever in gfs2_freeze if withdrawn
   - fix infinite loop in gfs2_ail1_flush on io error
   - introduce function gfs2_withdrawn
   - fix glock reference problem in gfs2_trans_remove_revoke

  Filesystems with a block size smaller than the page size:
   - fix end-of-file handling in gfs2_page_mkwrite
   - improve mmap write vs. punch_hole consistency

  Other:
   - remove active journal side effect from gfs2_write_log_header
   - multi-block allocations in gfs2_page_mkwrite

  Minor cleanups and coding style fixes:
   - remove duplicate call from gfs2_create_inode
   - make gfs2_log_shutdown static
   - make gfs2_fs_parameters static
   - some whitespace cleanups
   - removed unnecessary semicolon"

* tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't write log headers after file system withdraw
  gfs2: Remove duplicate call from gfs2_create_inode
  gfs2: clean up iopen glock mess in gfs2_create_inode
  gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
  gfs2: Abort gfs2_freeze if io error is seen
  gfs2: Don't loop forever in gfs2_freeze if withdrawn
  gfs2: fix infinite loop in gfs2_ail1_flush on io error
  gfs2: Introduce function gfs2_withdrawn
  gfs2: fix glock reference problem in gfs2_trans_remove_revoke
  gfs2: make gfs2_log_shutdown static
  gfs2: Remove active journal side effect from gfs2_write_log_header
  gfs2: Fix end-of-file handling in gfs2_page_mkwrite
  gfs2: Multi-block allocations in gfs2_page_mkwrite
  gfs2: Improve mmap write vs. punch_hole consistency
  gfs2: make gfs2_fs_parameters static
  gfs2: Some whitespace cleanups
  gfs2: removed unnecessary semicolon
2019-12-05 13:20:11 -08:00
Linus Torvalds
a231582359 The two highlights are a set of improvements to how rbd read-only
mappings are handled and a conversion to the new mount API (slightly
 complicated by the fact that we had a common option parsing framework
 that called out into rbd and the filesystem instead of them calling
 into it).  Also included a few scattered fixes and a MAINTAINERS update
 for rbd, adding Dongsheng as a reviewer.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3oDqwTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8/OCACAhPEoSkG8J0XOgP0NlFQGRvikugKq
 wlUfpNhkJOVmyM1t9LgiHHirTa7/kA76wPo/iHtnvjIZuZoaX3+NoZX5DwgKVCo1
 SCQdXR4ohVPiYxUpK+z/fDXxpYHhaO2SAww+RRHSDxnlN5CHqFBcBhRBPfhraZT5
 dwiQt7++UOnp/hfk1Dqg5EogmSdLxqWyjClKf2lliZkzbU9YXmGapqQsur6sBk+e
 cLRmRBmMw4cDAKLL1taCympN0AxNMcePs1njvdwQ7XabNWrT061yFyt1ZNwAV/Nu
 0nCyh/9IwQcsR0EvK7FCdUEJPy88Reufd+GleS4nkEZpbxQBzo0aGow0
 =Egtk
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The two highlights are a set of improvements to how rbd read-only
  mappings are handled and a conversion to the new mount API (slightly
  complicated by the fact that we had a common option parsing framework
  that called out into rbd and the filesystem instead of them calling
  into it).

  Also included a few scattered fixes and a MAINTAINERS update for rbd,
  adding Dongsheng as a reviewer"

* tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client:
  libceph, rbd, ceph: convert to use the new mount API
  rbd: ask for a weaker incompat mask for read-only mappings
  rbd: don't query snapshot features
  rbd: remove snapshot existence validation code
  rbd: don't establish watch for read-only mappings
  rbd: don't acquire exclusive lock for read-only mappings
  rbd: disallow read-write partitions on images mapped read-only
  rbd: treat images mapped read-only seriously
  rbd: introduce RBD_DEV_FLAG_READONLY
  rbd: introduce rbd_is_snap()
  ceph: don't leave ino field in ceph_mds_request_head uninitialized
  ceph: tone down loglevel on ceph_mdsc_build_path warning
  rbd: update MAINTAINERS info
  ceph: fix geting random mds from mdsmap
  rbd: fix spelling mistake "requeueing" -> "requeuing"
  ceph: make several helper accessors take const pointers
  libceph: drop unnecessary check from dispatch() in mon_client.c
2019-12-05 13:06:51 -08:00
Linus Torvalds
7ce4fab819 fuse update for 5.5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXedyjQAKCRDh3BK/laaZ
 PCR7AQCf+bb3/so1bygFeblBTT4UZbYZRXz2nZNSA5tgJvafWwD+I4MlqR+tEixb
 gZEusbtAVrtm3hJrBc+1fA1wacGhmAg=
 =Jg/0
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse update from Miklos Szeredi:

 - Fix a regression introduced in the last release

 - Fix a number of issues with validating data coming from userspace

 - Some cleanups in virtiofs

* tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix Kconfig indentation
  fuse: fix leak of fuse_io_priv
  virtiofs: Use completions while waiting for queue to be drained
  virtiofs: Do not send forget request "struct list_head" element
  virtiofs: Use a common function to send forget
  virtiofs: Fix old-style declaration
  fuse: verify nlink
  fuse: verify write return
  fuse: verify attributes
2019-12-05 12:44:22 -08:00
Zorro Lang
c275779ff2 iomap: stop using ioend after it's been freed in iomap_finish_ioend()
This patch fixes the following KASAN report. The @ioend has been
freed by dio_put(), but the iomap_finish_ioend() still trys to access
its data.

[20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
[20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184

[20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
[20563.653887] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
[20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
[20563.669547] Call trace:
[20563.671993]  dump_backtrace+0x0/0x370
[20563.675648]  show_stack+0x1c/0x28
[20563.678958]  dump_stack+0x138/0x1b0
[20563.682455]  print_address_description.isra.9+0x60/0x378
[20563.687759]  __kasan_report+0x1a4/0x2a8
[20563.691587]  kasan_report+0xc/0x18
[20563.694985]  __asan_report_load8_noabort+0x18/0x20
[20563.699769]  iomap_finish_ioend+0x58c/0x5c0
[20563.703944]  iomap_finish_ioends+0x110/0x270
[20563.708396]  xfs_end_ioend+0x168/0x598 [xfs]
[20563.712823]  xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.716834]  process_one_work+0x7f0/0x1ac8
[20563.720922]  worker_thread+0x334/0xae0
[20563.724664]  kthread+0x2c4/0x348
[20563.727889]  ret_from_fork+0x10/0x18

[20563.732941] Allocated by task 83403:
[20563.736512]  save_stack+0x24/0xb0
[20563.739820]  __kasan_kmalloc.isra.9+0xc4/0xe0
[20563.744169]  kasan_slab_alloc+0x14/0x20
[20563.747998]  slab_post_alloc_hook+0x50/0xa8
[20563.752173]  kmem_cache_alloc+0x154/0x330
[20563.756185]  mempool_alloc_slab+0x20/0x28
[20563.760186]  mempool_alloc+0xf4/0x2a8
[20563.763845]  bio_alloc_bioset+0x2d0/0x448
[20563.767849]  iomap_writepage_map+0x4b8/0x1740
[20563.772198]  iomap_do_writepage+0x200/0x8d0
[20563.776380]  write_cache_pages+0x8a4/0xed8
[20563.780469]  iomap_writepages+0x4c/0xb0
[20563.784463]  xfs_vm_writepages+0xf8/0x148 [xfs]
[20563.788989]  do_writepages+0xc8/0x218
[20563.792658]  __writeback_single_inode+0x168/0x18f8
[20563.797441]  writeback_sb_inodes+0x370/0xd30
[20563.801703]  wb_writeback+0x2d4/0x1270
[20563.805446]  wb_workfn+0x344/0x1178
[20563.808928]  process_one_work+0x7f0/0x1ac8
[20563.813016]  worker_thread+0x334/0xae0
[20563.816757]  kthread+0x2c4/0x348
[20563.819979]  ret_from_fork+0x10/0x18

[20563.825028] Freed by task 22184:
[20563.828251]  save_stack+0x24/0xb0
[20563.831559]  __kasan_slab_free+0x10c/0x180
[20563.835648]  kasan_slab_free+0x10/0x18
[20563.839389]  slab_free_freelist_hook+0xb4/0x1c0
[20563.843912]  kmem_cache_free+0x8c/0x3e8
[20563.847745]  mempool_free_slab+0x20/0x28
[20563.851660]  mempool_free+0xd4/0x2f8
[20563.855231]  bio_free+0x33c/0x518
[20563.858537]  bio_put+0xb8/0x100
[20563.861672]  iomap_finish_ioend+0x168/0x5c0
[20563.865847]  iomap_finish_ioends+0x110/0x270
[20563.870328]  xfs_end_ioend+0x168/0x598 [xfs]
[20563.874751]  xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.878755]  process_one_work+0x7f0/0x1ac8
[20563.882844]  worker_thread+0x334/0xae0
[20563.886584]  kthread+0x2c4/0x348
[20563.889804]  ret_from_fork+0x10/0x18

[20563.894855] The buggy address belongs to the object at fffffc0c54a36900
                which belongs to the cache bio-1 of size 248
[20563.906844] The buggy address is located 40 bytes inside of
                248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
[20563.918485] The buggy address belongs to the page:
[20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
[20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
[20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
[20563.948300] page dumped because: kasan: bad access detected

[20563.955345] Memory state around the buggy address:
[20563.960129]  fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.967342]  fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[20563.981766]                                   ^
[20563.986288]  fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.993501]  fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20564.000713] ==================================================================

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
Signed-off-by: Zorro Lang <zlang@redhat.com>
Fixes: 9cd0ed63ca ("iomap: enhance writeback error message")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-12-05 07:41:16 -08:00
LimingWu
0b4295b5e2 io_uring: fix a typo in a comment
thatn -> than.

Signed-off-by: Liming Wu <19092205@suning.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 07:59:37 -07:00
Pavel Begunkov
4493233edc io_uring: hook all linked requests via link_list
Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.

Link them all through list_list. Also, it seems to be simpler and more
consistent IMHO.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 06:54:52 -07:00
Pavel Begunkov
2e6e1fde32 io_uring: fix error handling in io_queue_link_head
In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only it
doesn't cancel links, but also may execute wrong sequence of actions.

Stop consuming sqes, and let the user handle errors.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 06:54:51 -07:00
Alexey Dobriyan
658c033565 fs/binfmt_elf.c: extract elf_read() function
ELF reads done by the kernel have very complicated error detection code
which better live in one place.

Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Alexey Dobriyan
81696d5d54 fs/binfmt_elf.c: delete unused "interp_map_addr" argument
Link: http://lkml.kernel.org/r/20191005165049.GA26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Heiher
339ddb53d3 fs/epoll: remove unnecessary wakeups of nested epoll
Take the case where we have:

        t0
         | (ew)
        e0
         | (et)
        e1
         | (lt)
        s0

t0: thread 0
e0: epoll fd 0
e1: epoll fd 1
s0: socket fd 0
ew: epoll_wait
et: edge-trigger
lt: level-trigger

We remove unnecessary wakeups to prevent the nested epoll that working in edge-
triggered mode to waking up continuously.

Test code:
 #include <unistd.h>
 #include <sys/epoll.h>
 #include <sys/socket.h>

 int main(int argc, char *argv[])
 {
 	int sfd[2];
 	int efd[2];
 	struct epoll_event e;

 	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
 		goto out;

 	efd[0] = epoll_create(1);
 	if (efd[0] < 0)
 		goto out;

 	efd[1] = epoll_create(1);
 	if (efd[1] < 0)
 		goto out;

 	e.events = EPOLLIN;
 	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
 		goto out;

 	e.events = EPOLLIN | EPOLLET;
 	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
 		goto out;

 	if (write(sfd[1], "w", 1) != 1)
 		goto out;

 	if (epoll_wait(efd[0], &e, 1, 0) != 1)
 		goto out;

 	if (epoll_wait(efd[0], &e, 1, 0) != 0)
 		goto out;

 	close(efd[0]);
 	close(efd[1]);
 	close(sfd[0]);
 	close(sfd[1]);

 	return 0;

 out:
 	return -1;
 }

More tests:
 https://github.com/heiher/epoll-wakeup

Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc
Signed-off-by: hev <r@hev.cc>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Jason Baron
f6520c5208 epoll: simplify ep_poll_safewake() for CONFIG_DEBUG_LOCK_ALLOC
Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
ep_call_nested() in order to pass the correct subclass argument to
spin_lock_irqsave_nested().  However, ep_call_nested() adds unnecessary
checks for epoll depth and loops that are already verified when doing
EPOLL_CTL_ADD.  This mirrors a conversion that was done for
!CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a ("epoll: remove
ep_call_nested() from ep_eventpoll_poll()")

Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Krzysztof Kozlowski
3d82191c22 fs/proc/Kconfig: fix indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
        $ sed -e 's/^        /	/' -i */Kconfig

[adobriyan@gmail.com: add two spaces where necessary]
Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
70a731c0e3 fs/proc/internal.h: shuffle "struct pde_opener"
List iteration takes more code than anything else which means embedded
list_head should be the first element of the structure.

Space savings:

	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
	Function                                     old     new   delta
	close_pdeo                                   228     227      -1
	proc_reg_release                              86      82      -4
	proc_entry_rundown                           143     139      -4
	proc_reg_open                                298     289      -9

Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
5f6354eaa5 fs/proc/generic.c: delete useless "len" variable
Pointer to next '/' encodes length of path element and next start
position.  Subtraction and increment are redundant.

Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
e06689bf57 proc: change ->nlink under proc_subdir_lock
Currently gluing PDE into global /proc tree is done under lock, but
changing ->nlink is not.  Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.

Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Jens Axboe
78076bb64a io_uring: use hash table for poll command lookups
We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 20:12:58 -07:00
Jens Axboe
08bdcc35f0 io-wq: clear node->next on list deletion
If someone removes a node from a list, and then later adds it back to
a list, we can have invalid data in ->next. This can cause all sorts
of issues. One such use case is the IORING_OP_POLL_ADD command, which
will do just that if we race and get woken twice without any pending
events. This is a pretty rare case, but can happen under extreme loads.
Dan reports that he saw the following crash:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
Oops: 0002 [#1] SMP
CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
RIP: 0010:io_wqe_enqueue+0x3e/0xd0
Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <IRQ>
 io_poll_wake+0x12f/0x2a0
 __wake_up_common+0x86/0x120
 __wake_up_common_lock+0x7a/0xc0
 sock_def_readable+0x3c/0x70
 tcp_rcv_established+0x557/0x630
 tcp_v6_do_rcv+0x118/0x3c0
 tcp_v6_rcv+0x97e/0x9d0
 ip6_protocol_deliver_rcu+0xe3/0x440
 ip6_input+0x3d/0xc0
 ? ip6_protocol_deliver_rcu+0x440/0x440
 ipv6_rcv+0x56/0xd0
 ? ip6_rcv_finish_core.isra.18+0x80/0x80
 __netif_receive_skb_one_core+0x50/0x70
 netif_receive_skb_internal+0x2f/0xa0
 napi_gro_receive+0x125/0x150
 mlx5e_handle_rx_cqe+0x1d9/0x5a0
 ? mlx5e_poll_tx_cq+0x305/0x560
 mlx5e_poll_rx_cq+0x49f/0x9c5
 mlx5e_napi_poll+0xee/0x640
 ? smp_reschedule_interrupt+0x16/0xd0
 ? reschedule_interrupt+0xf/0x20
 net_rx_action+0x286/0x3d0
 __do_softirq+0xca/0x297
 irq_exit+0x96/0xa0
 do_IRQ+0x54/0xe0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0033:0x7fdc627a2e3a
Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10

when running a networked workload with about 5000 sockets being polled
for. Fix this by clearing node->next when the node is being removed from
the list.

Fixes: 6206f0e180 ("io-wq: shrink io_wq_work a bit")
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 17:26:57 -07:00
Jens Axboe
2d28390aff io_uring: ensure deferred timeouts copy necessary data
If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc87 instead of having the timeout deferral use its
own.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 11:12:08 -07:00
Aurelien Aptel
9a7d5a9e6d cifs: fix possible uninitialized access and race on iface_list
iface[0] was accessed regardless of the count value and without
locking.

* check count before accessing any ifaces
* make copy of iface list (it's a simple POD array) and use it without
  locking.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2019-12-04 11:51:18 -06:00
Paulo Alcantara (SUSE)
3345bb44ba cifs: Fix lookup of SMB connections on multichannel
With the addition of SMB session channels, we introduced new TCP
server pointers that have no sessions or tcons associated with them.

In this case, when we started looking for TCP connections, we might
end up picking session channel rather than the master connection,
hence failing to get either a session or a tcon.

In order to fix that, this patch introduces a new "is_channel" field
to TCP_Server_Info structure so we can skip session channels during
lookup of connections.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-04 11:50:32 -06:00
Jens Axboe
901e59bba9 io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 10:34:03 -07:00
Christoph Hellwig
1cea335d1d iomap: fix sub-page uptodate handling
bio completions can race when a page spans more than one file system
block.  Add a spinlock to synchronize marking the page uptodate.

Fixes: 9dc55f1389 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-04 09:33:52 -08:00
Mike Marshall
f9bbb68233 orangefs: posix open permission checking...
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.

The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.

When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.

The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.

We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-12-04 08:52:55 -05:00
Gao Xiang
926d165017 erofs: zero out when listxattr is called with no xattr
As David reported [1], ENODATA returns when attempting
to modify files by using EROFS as an overlayfs lower layer.

The root cause is that listxattr could return unexpected
-ENODATA by mistake for inodes without xattr. That breaks
listxattr return value convention and it can cause copy
up failure when used with overlayfs.

Resolve by zeroing out if no xattr is found for listxattr.

[1] https://lore.kernel.org/r/CAEvUa7nxnby+rxK-KRMA46=exeOMApkDMAV08AjMkkPnTPV4CQ@mail.gmail.com
Link: https://lore.kernel.org/r/20191201084040.29275-1-hsiangkao@aol.com
Fixes: cadf1ccf1b ("staging: erofs: add error handling for xattr submodule")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
2019-12-04 21:15:04 +08:00
Brian Foster
798a9cada4 xfs: fix mount failure crash on invalid iclog memory access
syzbot (via KASAN) reports a use-after-free in the error path of
xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
handle the case of a fully initialized ->l_iclog linked list.
Instead, it assumes that the list is partially constructed and NULL
terminated.

This bug manifested because there was no possible error scenario
after iclog list setup when the original code was added.  Subsequent
code and associated error conditions were added some time later,
while the original error handling code was never updated. Fix up the
error loop to terminate either on a NULL iclog or reaching the end
of the list.

Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-03 14:53:07 -08:00
Steve French
43f8a6a74e smb3: query attributes on file close
Since timestamps on files on most servers can be updated at
close, and since timestamps on our dentries default to one
second we can have stale timestamps in some common cases
(e.g. open, write, close, stat, wait one second, stat - will
show different mtime for the first and second stat).

The SMB2/SMB3 protocol allows querying timestamps at close
so add the code to request timestamp and attr information
(which is cheap for the server to provide) to be returned
when a file is closed (it is not needed for the many
paths that call SMB2_close that are from compounded
query infos and close nor is it needed for some of
the cases where a directory close immediately follows a
directory open.

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-03 15:48:02 -06:00
Linus Torvalds
2a31aca500 New code for 5.5:
- Make iomap_dio_rw callers explicitly tell us if they want us to wait
 - Port the xfs writeback code to iomap to complete the buffered io
   library functions
 - Refactor the unshare code to share common pieces
 - Add support for performing copy on write with buffered writes
 - Other minor fixes
 - Fix unchecked return in iomap_bmap
 - Fix a type casting bug in a ternary statement in iomap_dio_bio_actor
 - Improve tracepoints for easier diagnostic ability
 - Fix pipe page leakage in directio reads
 - Clean up iter usage in directio paths
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3kAhYACgkQ+H93GTRK
 tOviwQ/9FyTPgZKDrJMOxiEIyx1ye+wWasaUSmt1/jaXBhX8B12WkjfV5Nk970pw
 WOdiu5rnHy+Aje35luZmwPpX1H8CPvnVUTCFaYdZe9r695+XBmpg0coPr9T3Lsg/
 dMRPB2LaM1gZxaQukBTWoj2mwDanepz6Zu5dGcwEuZE26jlnFf6+yQibFlOIjFmk
 ioCkZLwkNdQlF+dutOykIilkP6imFyTU1lYueAZTmtJoM7JfXcqVmsn7A4R2pi8U
 2CF/ySTCuKu8Mwee8BTzWMcNqvYc/bshF2bQeiBNnKB+K3GGM9s5oNL5DGQTc709
 Cjhw6aq8ZTedT6AEw07zY5X1NhkCe8dQImz1qQPjCuoW2a5bVww8hvw8otUAyxOM
 /ScwHqOH/TxzuTmYHoO+2DxZYjSEFTci26dpVenxWrFt82LcMfG/DvEThzTcT/qq
 EdH7oLHbgXpvpwMkOYU9usCCkosraWKNaN4aoX9R5VhhWA4t8CcQNpXBkMDEI3ck
 xywGGiwau3aUy2ELtjgfjxRUY28rZUkoABkyR9fAc4BYbS4RDqoHgF9pY/z5H/2B
 xwDLzCA9ww6wLjeXEWt9ZBwmkNkdsFp/XeKrx2WX81LrCRoqFc3viV13FWrfuEe6
 de9jHgV1HZqkmm5pChRMJ3n5Qyah9anXXnTRvar8UedHEbxoQuE=
 =QMWG
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap cleanups from Darrick Wong:
 "Aome more new iomap code for 5.5.

  There's not much this time -- just removing some local variables that
  don't need to exist in the iomap directio code"

* tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: remove unneeded variable in iomap_dio_rw()
  iomap: Do not create fake iter in iomap_dio_bio_actor()
2019-12-03 13:18:34 -08:00
Linus Torvalds
043cf46825 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "The main changes in the timer code in this cycle were:

   - Clockevent updates:

      - timer-of framework cleanups. (Geert Uytterhoeven)

      - Use timer-of for the renesas-ostm and the device name to prevent
        name collision in case of multiple timers. (Geert Uytterhoeven)

      - Check if there is an error after calling of_clk_get in asm9260
        (Chuhong Yuan)

   - ABI fix: Zero out high order bits of nanoseconds on compat
     syscalls. This got broken a year ago, with apparently no side
     effects so far.

     Since the kernel would use random data otherwise I don't think we'd
     have other options but to fix the bug, even if there was a side
     effect to applications (Dmitry Safonov)

   - Optimize ns_to_timespec64() on 32-bit systems: move away from
     div_s64_rem() which can be slow, to div_u64_rem() which is faster
     (Arnd Bergmann)

   - Annotate KCSAN-reported false positive data races in
     hrtimer_is_queued() users by moving timer->state handling over to
     the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
     (Eric Dumazet)

   - Misc cleanups and small fixes"

[ I undid the "ABI fix" and updated the comments instead. The reason
  there were apparently no side effects is that the fix was a no-op.

  The updated comment is to say _why_ it was a no-op.    - Linus ]

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Zero the upper 32-bits in __kernel_timespec on 32-bit
  time: Rename tsk->real_start_time to ->start_boottime
  hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
  time: Fix spelling mistake in comment
  time: Optimize ns_to_timespec64()
  hrtimer: Annotate lockless access to timer->state
  clocksource/drivers/asm9260: Add a check for of_clk_get
  clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
  clocksource/drivers/renesas-ostm: Convert to timer_of
  clocksource/drivers/timer-of: Use unique device name instead of timer
  clocksource/drivers/timer-of: Convert last full_name to %pOF
2019-12-03 12:20:25 -08:00
Jens Axboe
87f80d623c io_uring: handle connect -EINPROGRESS like -EAGAIN
Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally, this makes it much
easier to use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 11:23:54 -07:00
Jackie Liu
8cdda87a44 io_uring: remove io_wq_current_is_worker
Since commit b18fdf71e0 ("io_uring: simplify io_req_link_next()"),
the io_wq_current_is_worker function is no longer needed, clean it
up.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jackie Liu
22efde5998 io_uring: remove parameter ctx of io_submit_state_start
Parameter ctx we have never used, clean it up.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jens Axboe
da8c969069 io_uring: mark us with IORING_FEAT_SUBMIT_STABLE
If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jens Axboe
f499a021ea io_uring: ensure async punted connect requests copy data
Just like commit f67676d160 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:30 -07:00
Jens Axboe
03b1230ca1 io_uring: ensure async punted sendmsg/recvmsg requests copy data
Just like commit f67676d160 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:03:35 -07:00
Jens Axboe
f67676d160 io_uring: ensure async punted read/write requests copy iovec
Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.

Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 21:33:25 -07:00
Omar Sandoval
69ffe5960d xfs: don't check for AG deadlock for realtime files in bunmapi
Commit 5b094d6dac ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
in the wrong order. However, this check isn't applicable for realtime
files. In most cases, it just makes us do unnecessary commits. However,
without the fix from the previous commit ("xfs: fix realtime file data
space leak"), if the last and second-to-last extents also happen to have
different "AG numbers", then the break actually causes __xfs_bunmapi()
to return without making any progress, which sends
xfs_itruncate_extents_flags() into an infinite loop.

Fixes: 5b094d6dac ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-02 17:58:51 -08:00
Omar Sandoval
0c4da70c83 xfs: fix realtime file data space leak
Realtime files in XFS allocate extents in rextsize units. However, the
written/unwritten state of those extents is still tracked in blocksize
units. Therefore, a realtime file can be split up into written and
unwritten extents that are not necessarily aligned to the realtime
extent size. __xfs_bunmapi() has some logic to handle these various
corner cases. Consider how it handles the following case:

1. The last extent is unwritten.
2. The last extent is smaller than the realtime extent size.
3. startblock of the last extent is not aligned to the realtime extent
   size, but startblock + blockcount is.

In this case, __xfs_bunmapi() calls xfs_bmap_add_extent_unwritten_real()
to set the second-to-last extent to unwritten. This should merge the
last and second-to-last extents, so __xfs_bunmapi() moves on to the
second-to-last extent.

However, if the size of the last and second-to-last extents combined is
greater than MAXEXTLEN, xfs_bmap_add_extent_unwritten_real() does not
merge the two extents. When that happens, __xfs_bunmapi() skips past the
last extent without unmapping it, thus leaking the space.

Fix it by only unwriting the minimum amount needed to align the last
extent to the realtime extent size, which is guaranteed to merge with
the last extent.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-02 17:58:50 -08:00
Jens Axboe
1a6b74fc87 io_uring: add general async offload context
Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:33 -07:00
Eric Biggers
490547ca2d block: don't send uevent for empty disk when not invalidating
Commit 6917d06899 ("block: merge invalidate_partitions into
rescan_partitions") caused a regression where systemd-udevd spins
forever using max CPU starting at boot time.

It's caused by a behavior change where a KOBJ_CHANGE uevent is now sent
in a case where previously it wasn't.

Restore the old behavior.

Fixes: 6917d06899 ("block: merge invalidate_partitions into rescan_partitions")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:30 -07:00
Jens Axboe
441cdbd544 io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR
We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:10 -07:00
Linus Torvalds
e3a251e366 This pull request contains mostly fixes for UBI, UBIFS and JFFS2:
UBI:
 - Fix a regression around producing a anchor PEB for fastmap.
   Due to a change in our locking fastmap was unable to produce
   fresh anchors an re-used the existing one a way to often.
 
 UBIFS:
 - Fixes for endianness. A few places blindly assumed little endian.
 - Fix for a memory leak in the orphan code.
 - Fix for a possible crash during a commit.
 - Revert a wrong bugfix.
 
 JFFS2:
 - Revert a bad bugfix in (false positive from a code checking
   tool).
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl3kG34WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wU70D/9K3QdG/fciafNRkVFFKnXVNXLv
 rxbfrRVCZeEo2Phm8daVQAvERWKi2Mw5jW1fNbOtDKvnPkC9+3vYmmHCienJ6VEt
 LhQ4Ey0MX8dQID1Caji49i5pPD3PBMX0m+YrxGt0uuzbr/nNWeiIUf/nyIvU95o7
 ZmmX7wWmKHUeLKC8TGpeolBVAf4xKqzp+U7NJxKpF5j6dn6hTi1J51GwsYw6aSXU
 lo5KvPWHq11ENWxJhtyWfO4MMyzTAtvbL1xEf0NRXX+aK4dUUyu0dcxzAqNzR6HU
 ER7gsIyEBKLZpgTqLpy6B9tWcnEjHHHN4tjxhD8gRGlOVBdXSlN+X45I8gjYOact
 Jkx1eF+T7E865YP+8vwoHo8eMoThxz54SnON8Dk6WaioAOGvf48w4gNnAilOb8Fa
 FZNk20pQdR2KyGc6PQ2I82vDYX1c2GmlGXOFLsEM8KxGWHjSCpQC1P946h17QSPe
 OxCptqu36czclm6zk76Ycw5/BWkbzViOcToVuQS61HLH7P/BzF1BnEM9rgwf+tD0
 4midOomxrZXmUrbC/HO4mARRdGtgonBHl1yJ8JjWhWEL9KPVLAhWJ4zHR8Zo84LJ
 BodZTzJPJ96igf+hLMAG/xFCnpGylXaz/G+W+5wUHnNJyKVsbhYIEmBscJYE60iM
 9/00Ce8r9lEpetUnKA==
 =LbMI
 -----END PGP SIGNATURE-----

Merge tag 'upstream-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBI/UBIFS/JFFS2 updates from Richard Weinberger:
 "This pull request contains mostly fixes for UBI, UBIFS and JFFS2:

  UBI:

   - Fix a regression around producing a anchor PEB for fastmap.

     Due to a change in our locking fastmap was unable to produce fresh
     anchors an re-used the existing one a way to often.

  UBIFS:

   - Fixes for endianness. A few places blindly assumed little endian.

   - Fix for a memory leak in the orphan code.

   - Fix for a possible crash during a commit.

   - Revert a wrong bugfix.

  JFFS2:

   - Revert a bad bugfix (false positive from a code checking tool)"

* tag 'upstream-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  Revert "jffs2: Fix possible null-pointer dereferences in jffs2_add_frag_to_fragtree()"
  ubi: Fix producing anchor PEBs
  ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
  ubifs: do_kill_orphans: Fix a memory leak bug
  Revert "ubifs: Fix memory leak bug in alloc_ubifs_info() error path"
  ubifs: Fix type of sup->hash_algo
  ubifs: Fixed missed le64_to_cpu() in journal
  ubifs: Force prandom result to __le32
  ubifs: Remove obsolete TODO from dfs_file_write()
  ubi: Fix warning static is not at beginning of declaration
  ubi: Print skip_check in ubi_dump_vol_info()
2019-12-02 17:06:34 -08:00
Steve French
9e8fae2597 smb3: remove unused flag passed into close functions
close was relayered to allow passing in an async flag which
is no longer needed in this path.  Remove the unneeded parameter
"flags" passed in on close.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-12-02 18:07:17 -06:00
Colin Ian King
a9f76cf827 cifs: remove redundant assignment to pointer pneg_ctxt
The pointer pneg_ctxt is being initialized with a value that is never
read and it is being updated later with a new value.  The assignment
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-02 16:55:08 -06:00
Linus Torvalds
97eeb4d9d7 New code for 5.5:
- Fill out the build string
 - Prevent inode fork extent count overflows
 - Refactor the allocator to reduce long tail latency
 - Rework incore log locking a little to reduce spinning
 - Break up the xfs_iomap_begin functions into smaller more cohesive
 parts
 - Fix allocation alignment being dropped too early when the allocation
 request is for more blocks than an AG is large
 - Other small cleanups
 - Clean up file buftarg retrieval helpers
 - Hoist the resvsp and unresvsp ioctls to the vfs
 - Remove the undocumented biosize mount option, since it has never been
   mentioned as existing or supported on linux
 - Clean up some of the mount option printing and parsing
 - Enhance attr leaf verifier to check block structure
 - Check dirent and attr names for invalid characters before passing them
 to the vfs
 - Refactor open-coded bmbt walking
 - Fix a few places where we return EIO instead of EFSCORRUPTED after
 failing metadata sanity checks
 - Fix a synchronization problem between fallocate and aio dio corrupting
 the file length
 - Clean up various loose ends in the iomap and bmap code
 - Convert to the new mount api
 - Make sure we always log something when returning EFSCORRUPTED
 - Fix some problems where long running scrub loops could trigger soft
 lockup warnings and/or fail to exit due to fatal signals pending
 - Fix various Coverity complaints
 - Remove most of the function pointers from the directory code to reduce
 indirection penalties
 - Ensure that dquots are attached to the inode when performing unwritten
 extent conversion after io
 - Deuglify incore projid and crtime types
 - Fix another AGI/AGF locking order deadlock when renaming
 - Clean up some quota typedefs
 - Remove the FSSETDM ioctls which haven't done anything in 20 years
 - Fix some memory leaks when mounting the log fails
 - Fix an underflow when updating an xattr leaf freemap
 - Remove some trivial wrappers
 - Report metadata corruption as an error, not a (potentially) fatal
 assertion
 - Clean up the dir/attr buffer mapping code
 - Allow fatal signals to kill scrub during parent pointer checks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3fNjcACgkQ+H93GTRK
 tOv/8w//Y0Oa9Paiy8+iPTChs3/PqeKp307Fj5KONG52haMCakEJFT5+/wpkIAJw
 uUmKiPolwN1ivviIUmIS14ThTJ7NV1jq0G0h/0tC25i/3hoJrGWdzqYJMlvhlqgE
 taHrjCwPTDkhRJ0D5QCrkkHPU7lSdquO5TWxltaqYLhyLIt8SkklD6dN1dHWEPnk
 k0j3TL+VqVJDYyEj1bLwJ0QUb2C3J8ygWnlviF/WxsSeJtJpGoeLEaYXhhsUK0Dt
 aHg70OM6zzFzrJJAtJeBXpgaFsG/Pqbcw4wUWSxEMWjVSJwCSKLuZ5F+p6NcqoEj
 HeLQkaGePoO61YCInk2JKLHIyx7ohqMOt7+Dm0mdbe1pvcKwV9ZcdkqKa8L/Fm6v
 bUP6a2hEpsGy7vLnkYxwYACTLPbGX3uLw8MUr6ZpJ+SpfVLktU4ycpr8dCkJkp6a
 0qOpEeHsBDy74NkMOUa7Qrju7lJ2GiL70qqBwaPe+ubcUa3U/3WAsSekSzXgUwn8
 Fap4r8wn7cUbxymAvO06RlU8YymuulAlyjwdo9gOL/Su/5POldss6dy1YuUtyq19
 CD6NtkHqEUMsTc2cI+H65H44aEeckB1j0D2Grm2uMchAh0GcTSFVNF6jony++B8k
 s2sL2dEw9/9vr0uc1TSVF5ezxaONuyaCXdYXUkkdyq3iNvfpRCg=
 =aACq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "For this release, we changed quite a few things.

  Highlights:

   - Fixed some long tail latency problems in the block allocator

   - Removed some long deprecated (and for the past several years no-op)
     mount options and ioctls

   - Strengthened the extended attribute and directory verifiers

   - Audited and fixed all the places where we could return EFSCORRUPTED
     without logging anything

   - Refactored the old SGI space allocation ioctls to make the
     equivalent fallocate calls

   - Fixed a race between fallocate and directio

   - Fixed an integer overflow when files have more than a few
     billion(!) extents

   - Fixed a longstanding bug where quota accounting could be incorrect
     when performing unwritten extent conversion on a freshly mounted fs

   - Fixed various complaints in scrub about soft lockups and
     unresponsiveness to signals

   - De-vtable'd the directory handling code, which should make it
     faster

   - Converted to the new mount api, for better or for worse

   - Cleaned up some memory leaks

  and quite a lot of other smaller fixes and cleanups.

  A more detailed summary:

   - Fill out the build string

   - Prevent inode fork extent count overflows

   - Refactor the allocator to reduce long tail latency

   - Rework incore log locking a little to reduce spinning

   - Break up the xfs_iomap_begin functions into smaller more cohesive
     parts

   - Fix allocation alignment being dropped too early when the
     allocation request is for more blocks than an AG is large

   - Other small cleanups

   - Clean up file buftarg retrieval helpers

   - Hoist the resvsp and unresvsp ioctls to the vfs

   - Remove the undocumented biosize mount option, since it has never
     been mentioned as existing or supported on linux

   - Clean up some of the mount option printing and parsing

   - Enhance attr leaf verifier to check block structure

   - Check dirent and attr names for invalid characters before passing
     them to the vfs

   - Refactor open-coded bmbt walking

   - Fix a few places where we return EIO instead of EFSCORRUPTED after
     failing metadata sanity checks

   - Fix a synchronization problem between fallocate and aio dio
     corrupting the file length

   - Clean up various loose ends in the iomap and bmap code

   - Convert to the new mount api

   - Make sure we always log something when returning EFSCORRUPTED

   - Fix some problems where long running scrub loops could trigger soft
     lockup warnings and/or fail to exit due to fatal signals pending

   - Fix various Coverity complaints

   - Remove most of the function pointers from the directory code to
     reduce indirection penalties

   - Ensure that dquots are attached to the inode when performing
     unwritten extent conversion after io

   - Deuglify incore projid and crtime types

   - Fix another AGI/AGF locking order deadlock when renaming

   - Clean up some quota typedefs

   - Remove the FSSETDM ioctls which haven't done anything in 20 years

   - Fix some memory leaks when mounting the log fails

   - Fix an underflow when updating an xattr leaf freemap

   - Remove some trivial wrappers

   - Report metadata corruption as an error, not a (potentially) fatal
     assertion

   - Clean up the dir/attr buffer mapping code

   - Allow fatal signals to kill scrub during parent pointer checks"

* tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (198 commits)
  xfs: allow parent directory scans to be interrupted with fatal signals
  xfs: remove the mappedbno argument to xfs_da_get_buf
  xfs: remove the mappedbno argument to xfs_da_read_buf
  xfs: split xfs_da3_node_read
  xfs: remove the mappedbno argument to xfs_dir3_leafn_read
  xfs: remove the mappedbno argument to xfs_dir3_leaf_read
  xfs: remove the mappedbno argument to xfs_attr3_leaf_read
  xfs: remove the mappedbno argument to xfs_da_reada_buf
  xfs: improve the xfs_dabuf_map calling conventions
  xfs: refactor xfs_dabuf_map
  xfs: simplify mappedbno handling in xfs_da_{get,read}_buf
  xfs: report corruption only as a regular error
  xfs: Remove kmem_zone_free() wrapper
  xfs: Remove kmem_zone_destroy() wrapper
  xfs: Remove slab init wrappers
  xfs: fix attr leaf header freemap.size underflow
  xfs: fix some memory leaks in log recovery
  xfs: fix another missing include
  xfs: remove XFS_IOC_FSSETDM and XFS_IOC_FSSETDM_BY_HANDLE
  xfs: remove duplicated include from xfs_dir2_data.c
  ...
2019-12-02 14:46:22 -08:00
Deepa Dinamani
69738cfdfa fs: cifs: Fix atime update check vs mtime
According to the comment in the code and commit log, some apps
expect atime >= mtime; but the introduced code results in
atime==mtime.  Fix the comparison to guard against atime<mtime.

Fixes: 9b9c5bea0b ("cifs: do not return atime less than mtime")
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: stfrench@microsoft.com
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-02 15:15:35 -06:00
Pavel Shilovsky
6f582b273e CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
Currently when the client creates a cifsFileInfo structure for
a newly opened file, it allocates a list of byte-range locks
with a pointer to the new cfile and attaches this list to the
inode's lock list. The latter happens before initializing all
other fields, e.g. cfile->tlink. Thus a partially initialized
cifsFileInfo structure becomes available to other threads that
walk through the inode's lock list. One example of such a thread
may be an oplock break worker thread that tries to push all
cached byte-range locks. This causes NULL-pointer dereference
in smb2_push_mandatory_locks() when accessing cfile->tlink:

[598428.945633] BUG: kernel NULL pointer dereference, address: 0000000000000038
...
[598428.945749] Workqueue: cifsoplockd cifs_oplock_break [cifs]
[598428.945793] RIP: 0010:smb2_push_mandatory_locks+0xd6/0x5a0 [cifs]
...
[598428.945834] Call Trace:
[598428.945870]  ? cifs_revalidate_mapping+0x45/0x90 [cifs]
[598428.945901]  cifs_oplock_break+0x13d/0x450 [cifs]
[598428.945909]  process_one_work+0x1db/0x380
[598428.945914]  worker_thread+0x4d/0x400
[598428.945921]  kthread+0x104/0x140
[598428.945925]  ? process_one_work+0x380/0x380
[598428.945931]  ? kthread_park+0x80/0x80
[598428.945937]  ret_from_fork+0x35/0x40

Fix this by reordering initialization steps of the cifsFileInfo
structure: initialize all the fields first and then add the new
byte-range lock list to the inode's lock list.

Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-02 15:15:00 -06:00
Linus Torvalds
937d6eefc7 Here's the main documentation changes for 5.5:
- Various kerneldoc script enhancements.
 
  - More RST conversions; those are slowing down as we run out of things to
    convert, but we're a ways from done still.
 
  - Dan's "maintainer profile entry" work landed at last.  Now we just need
    to get maintainers to fill in the profiles...
 
  - A reworking of the parallel build setup to work better with a variety of
    systems (and to not take over huge systems entirely in particular).
 
  - The MAINTAINERS file is now converted to RST during the build.
    Hopefully nobody ever tries to print this thing, or they will need to
    load a lot of paper.
 
  - A script and documentation making it easy for maintainers to add Link:
    tags at commit time.
 
 Also included is the removal of a bunch of spurious CR characters.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl3j5B0PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YtBcH/jIN2cO8/0YW2rjVT+1G6ytSdFUKx5WJ/lpf
 5uBeCvuCeYhtCB6+BgnXvjykJ7jDW11/NJNjWqz/gsvD5l5FJK1rXarI/oz2Klyi
 kcPtDmBF/ki4wz9qXzEpa0vg8LXdjeys50S1vE75qCzxZoPP7YjuRbPnLrlIJukv
 JbDVi4p9kxgeHfRB4+BHOe5rFwA3mMmaxKNIX34Y+UUO2KZ0g/yUi1bAaQwQAdt+
 PsORmkVQ8Puh3K9xRIr7dYlcWBlBiPqzYdvDgTVxSjrxdK6wjYjSgVk2VjC5MBUN
 mTSTWgyfsIcD/76/s8tq7ZRl2fw+SkCSkFo79Rb/hJwDTb7Vnng=
 =LPBr
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.5a' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "Here are the main documentation changes for 5.5:

   - Various kerneldoc script enhancements.

   - More RST conversions; those are slowing down as we run out of
     things to convert, but we're a ways from done still.

   - Dan's "maintainer profile entry" work landed at last. Now we just
     need to get maintainers to fill in the profiles...

   - A reworking of the parallel build setup to work better with a
     variety of systems (and to not take over huge systems entirely in
     particular).

   - The MAINTAINERS file is now converted to RST during the build.
     Hopefully nobody ever tries to print this thing, or they will need
     to load a lot of paper.

   - A script and documentation making it easy for maintainers to add
     Link: tags at commit time.

  Also included is the removal of a bunch of spurious CR characters"

* tag 'docs-5.5a' of git://git.lwn.net/linux: (91 commits)
  docs: remove a bunch of stray CRs
  docs: fix up the maintainer profile document
  libnvdimm, MAINTAINERS: Maintainer Entry Profile
  Maintainer Handbook: Maintainer Entry Profile
  MAINTAINERS: Reclaim the P: tag for Maintainer Entry Profile
  docs, parallelism: Rearrange how jobserver reservations are made
  docs, parallelism: Do not leak blocking mode to other readers
  docs, parallelism: Fix failure path and add comment
  Documentation: Remove bootmem_debug from kernel-parameters.txt
  Documentation: security: core.rst: fix warnings
  Documentation/process/howto/kokr: Update for 4.x -> 5.x versioning
  Documentation/translation: Use Korean for Korean translation title
  docs/memory-barriers.txt: Remove remaining references to mmiowb()
  docs/memory-barriers.txt/kokr: Update I/O section to be clearer about CPU vs thread
  docs/memory-barriers.txt/kokr: Fix style, spacing and grammar in I/O section
  Documentation/kokr: Kill all references to mmiowb()
  docs/memory-barriers.txt/kokr: Rewrite "KERNEL I/O BARRIER EFFECTS" section
  docs: Add initial documentation for devfreq
  Documentation: Document how to get links with git am
  docs: Add request_irq() documentation
  ...
2019-12-02 11:51:02 -08:00
Linus Torvalds
8328dd2f39 pstore bug fix
- add missing "static" (Ben Dooks)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3dUowWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpZXD/9OHY0SaBLvZYmApI7l6dV4eN1I
 OcyrC9jtjQDuSaTQcwZPIDFvJAtPM32wbhI6SJztaF+tuWhQzeOnRBaupfYoWS0e
 cbaUCG/mqrCbmGOjMxcqQbkAY0uyOuRUJItrX5WZ4YspcrMjdHi3sTj4ItHaaXvK
 SjSxhITM/SQlgnK/4oysC3sdMdDac9OEdtDomltYc2qxou4b4nPWuc4epTTHLXBQ
 b+LjZXQBbdAQN7dcgWB4lgzg1eUPqkkFNUDw2hgcqZN1RHKzXGwruE+E4PsgY2Fw
 F/j+cSLu/dExVPzEGYcMV3ZS09ix9LHmO95LOy2N2TPD9Ax1KJbRb58P+1a1SDwb
 kwzMxyJlcUmqKRBfTTnVW4+g3v5qvrmducGqdmJrhOoNxCtwvRID1jRsm/DGIhdk
 /ZJDpHl1lg7VECcwXjTP+0+FYNGJPrb4MePXAMXbi/2/Zt1FDtf2z+Ubo220YMJM
 s82YMn8yVuBrXGXxjL8KmGvvRVFKCtGabvvdMTqhmIcp938CxI6hVfSU/B4SKciO
 N8R6Wl78pDDibXipTRCC3MXUNBIZi9qWfXLLmcrx0JRCjgfwSL7UWE/xrPDSr+6u
 tIDvOD08Ym1BXTtvEZoqZKuEViW04nVVukXi/rQXeyipk6tg+HfONfi2RRTCDzV5
 DbMj9pJB9373N0+h5w==
 =JD6z
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore bug fix from Kees Cook:

 - add missing "static" (Ben Dooks)

* tag 'pstore-v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Make pstore_choose_compression() static
2019-12-02 11:26:40 -08:00
Jens Axboe
0b8c0ec7ee io_uring: use current task creds instead of allocating a new one
syzbot reports:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds
as we're not modifying it, we just need a reference on the current task
creds. This avoids the failure case as well, and propagates the const
throughout the stack.

Fixes: 181e448d87 ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 08:50:00 -07:00
Linus Torvalds
596cf45cbf Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "Incoming:

   - a small number of updates to scripts/, ocfs2 and fs/buffer.c

   - most of MM

  I still have quite a lot of material (mostly not MM) staged after
  linux-next due to -next dependencies. I'll send those across next week
  as the preprequisites get merged up"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
  mm/page_io.c: annotate refault stalls from swap_readpage
  mm/Kconfig: fix trivial help text punctuation
  mm/Kconfig: fix indentation
  mm/memory_hotplug.c: remove __online_page_set_limits()
  mm: fix typos in comments when calling __SetPageUptodate()
  mm: fix struct member name in function comments
  mm/shmem.c: cast the type of unmap_start to u64
  mm: shmem: use proper gfp flags for shmem_writepage()
  mm/shmem.c: make array 'values' static const, makes object smaller
  userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
  fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
  userfaultfd: wrap the common dst_vma check into an inlined function
  userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
  userfaultfd: use vma_pagesize for all huge page size calculation
  mm/madvise.c: use PAGE_ALIGN[ED] for range checking
  mm/madvise.c: replace with page_size() in madvise_inject_error()
  mm/mmap.c: make vma_merge() comment more easy to understand
  mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
  autonuma: reduce cache footprint when scanning page tables
  autonuma: fix watermark checking in migrate_balanced_pgdat()
  ...
2019-12-01 20:36:41 -08:00
Linus Torvalds
31764f1b6d for-linus-20191129
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3h2LsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvrOD/43g08VvkLexb6DvO8kTkyxPSUsPZml292H
 yowv1Rij9x/7Y9MTOmcnEAV9QofRPY9nkmEq1KR+iqGQnLZsKFLrDVxQSgkhU7HJ
 2wRE2DCPYTfIyIfS0TF2DdVjh3Q90tPKZbVkiOxvy9R+yS6hmTur0t731OERsMAh
 QgkY8zJP04XonXNwZSch9QYdpf9XGu+W83Pc0ecmootimJHVhFV8J9111dessod0
 h82Epbbbc8bg/CBWCA4fDm4EZ8doMHdAHYpw60WDXPoLF9zISvg5QG+pVjKrLkVx
 pqgTt5Kwd/EkqurRw3sH+8A6p0ACLrUiKffX2cXk8ScdTXMGD5stmdF5cMLSikFY
 yJgGHOTBHdjV4T2gB1TQ0rlnnb1VtHoJm5XvPviRaSQt7irzh4EWwhYiL7JlTAC3
 etCbzMPn8Sm+Ns+/zObuOmVTQvyE/+PngToxDRpgzDd4pSbMPR502iL3gvSbMDjw
 BCfGKFRkBYAB1SSjMwi+l9hiX2F+jTnHSvrm69HUz5K2zvWl4hsd2wClExuwoCQ+
 UFXULCJZFCKbAcgEVh2OQX9JVg7fUv5GhIymEPBWFDAODwtX1XJK6IxmQhRh8owV
 AxBFnNdpgRcBzyy+c+2cM4JOVcm8bV1s6eYP0UyV+EieD5OcBdq7GH5YBFzOztJM
 SLMbjQQ/7w==
 =hnf4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191129' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "I wasn't going to send this one off so soon, but unfortunately one of
  the fixes from the previous pull broke the build on some archs. So I'm
  sending this sooner rather than later. This contains:

   - Add highmem.h include for io_uring, because of the kmap() additions
     from last round. For some reason the build bot didn't spot this
     even though it sat for days.

   - Three minor ';' removals

   - Add support for the Beurer CD-on-a-chip device

   - Make io_uring work on MMU-less archs"

* tag 'for-linus-20191129' of git://git.kernel.dk/linux-block:
  io_uring: fix missing kmap() declaration on powerpc
  ataflop: Remove unneeded semicolon
  block: sunvdc: Remove unneeded semicolon
  drbd: Remove unneeded semicolon
  io_uring: add mapping support for NOMMU archs
  sr_vendor: support Beurer GL50 evo CD-on-a-chip devices.
  cdrom: respect device capabilities during opening action
2019-12-01 18:26:56 -08:00
Linus Torvalds
ceb3074745 y2038: syscall implementation cleanups
This is a series of cleanups for the y2038 work, mostly intended
 for namespace cleaning: the kernel defines the traditional
 time_t, timeval and timespec types that often lead to y2038-unsafe
 code. Even though the unsafe usage is mostly gone from the kernel,
 having the types and associated functions around means that we
 can still grow new users, and that we may be missing conversions
 to safe types that actually matter.
 
 There are still a number of driver specific patches needed to
 get the last users of these types removed, those have been
 submitted to the respective maintainers.
 
 Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJd3D+wAAoJEJpsee/mABjZfdcQAJvl6e+4ddKoDMIVJqVCE25N
 meFRgA7S8jy6BefEVeUgI8TxK+amGO36szMBUEnZxSSxq9u+gd13m5bEK6Xq/ov7
 4KTAiA3Irm/W5FBTktu1zc5ROIra1Xj7jLdubf8wEC3viSXIXB3+68Y28iBN7D2O
 k9kSpwINC5lWeC8guZy2I+2yc4ywUEXao9nVh8C/J+FQtU02TcdLtZop9OhpAa8u
 U19VVH3WHkQI7ZfLvBTUiYK6tlYTiYCnpr8l6sm850CnVv1fzBW+DzmVhPJ6FdFd
 4m5staC0sQ6gVqtjVMBOtT5CdzREse6hpwbKo2GRWFroO5W9tljMOJJXHvv/f6kz
 DxrpUmj37JuRbqAbr8KDmQqPo6M2CRkxFxjol1yh5ER63u1xMwLm/PQITZIMDvPO
 jrFc2C2SdM2E9bKP/RMCVoKSoRwxCJ5IwJ2AF237rrU0sx/zB2xsrOGssx5CWEgc
 3bbk6tDQujJJubnCfgRy1tTxpLZOHEEKw8YhFLLbR2LCtA9pA/0rfLLad16cjA5e
 5jIHxfsFc23zgpzrJeB7kAF/9xgu1tlA5BotOs3VBE89LtWOA9nK5dbPXng6qlUe
 er3xLCfS38ovhUw6DusQpaYLuaYuLM7DKO4iav9kuTMcY9GkbPk7vDD3KPGh2goy
 hY5cSM8+kT1q/THLnUBH
 =Bdbv
 -----END PGP SIGNATURE-----

Merge tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull y2038 cleanups from Arnd Bergmann:
 "y2038 syscall implementation cleanups

  This is a series of cleanups for the y2038 work, mostly intended for
  namespace cleaning: the kernel defines the traditional time_t, timeval
  and timespec types that often lead to y2038-unsafe code. Even though
  the unsafe usage is mostly gone from the kernel, having the types and
  associated functions around means that we can still grow new users,
  and that we may be missing conversions to safe types that actually
  matter.

  There are still a number of driver specific patches needed to get the
  last users of these types removed, those have been submitted to the
  respective maintainers"

Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/

* tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
  y2038: alarm: fix half-second cut-off
  y2038: ipc: fix x32 ABI breakage
  y2038: fix typo in powerpc vdso "LOPART"
  y2038: allow disabling time32 system calls
  y2038: itimer: change implementation to timespec64
  y2038: move itimer reset into itimer.c
  y2038: use compat_{get,set}_itimer on alpha
  y2038: itimer: compat handling to itimer.c
  y2038: time: avoid timespec usage in settimeofday()
  y2038: timerfd: Use timespec64 internally
  y2038: elfcore: Use __kernel_old_timeval for process times
  y2038: make ns_to_compat_timeval use __kernel_old_timeval
  y2038: socket: use __kernel_old_timespec instead of timespec
  y2038: socket: remove timespec reference in timestamping
  y2038: syscalls: change remaining timeval to __kernel_old_timeval
  y2038: rusage: use __kernel_old_timeval
  y2038: uapi: change __kernel_time_t to __kernel_old_time_t
  y2038: stat: avoid 'time_t' in 'struct stat'
  y2038: ipc: remove __kernel_time_t reference from headers
  y2038: vdso: powerpc: avoid timespec references
  ...
2019-12-01 14:00:59 -08:00
Linus Torvalds
0da522107e compat_ioctl: remove most of fs/compat_ioctl.c
As part of the cleanup of some remaining y2038 issues, I came to
 fs/compat_ioctl.c, which still has a couple of commands that need support
 for time64_t.
 
 In completely unrelated work, I spent time on cleaning up parts of this
 file in the past, moving things out into drivers instead.
 
 After Al Viro reviewed an earlier version of this series and did a lot
 more of that cleanup, I decided to try to completely eliminate the rest
 of it and move it all into drivers.
 
 This series incorporates some of Al's work and many patches of my own,
 but in the end stops short of actually removing the last part, which is
 the scsi ioctl handlers. I have patches for those as well, but they need
 more testing or possibly a rewrite.
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
 RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
 +ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
 lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
 6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
 dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
 VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
 YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
 m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
 TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
 e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
 bN65armTm7bFFR32Avnu
 =lgCl
 -----END PGP SIGNATURE-----

Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
 "As part of the cleanup of some remaining y2038 issues, I came to
  fs/compat_ioctl.c, which still has a couple of commands that need
  support for time64_t.

  In completely unrelated work, I spent time on cleaning up parts of
  this file in the past, moving things out into drivers instead.

  After Al Viro reviewed an earlier version of this series and did a lot
  more of that cleanup, I decided to try to completely eliminate the
  rest of it and move it all into drivers.

  This series incorporates some of Al's work and many patches of my own,
  but in the end stops short of actually removing the last part, which
  is the scsi ioctl handlers. I have patches for those as well, but they
  need more testing or possibly a rewrite"

* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
  scsi: sd: enable compat ioctls for sed-opal
  pktcdvd: add compat_ioctl handler
  compat_ioctl: move SG_GET_REQUEST_TABLE handling
  compat_ioctl: ppp: move simple commands into ppp_generic.c
  compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
  compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
  compat_ioctl: unify copy-in of ppp filters
  tty: handle compat PPP ioctls
  compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
  compat_ioctl: handle SIOCOUTQNSD
  af_unix: add compat_ioctl support
  compat_ioctl: reimplement SG_IO handling
  compat_ioctl: move WDIOC handling into wdt drivers
  fs: compat_ioctl: move FITRIM emulation into file systems
  gfs2: add compat_ioctl support
  compat_ioctl: remove unused convert_in_user macro
  compat_ioctl: remove last RAID handling code
  compat_ioctl: remove /dev/raw ioctl translation
  compat_ioctl: remove PCI ioctl translation
  compat_ioctl: remove joystick ioctl translation
  ...
2019-12-01 13:46:15 -08:00
Mike Rapoport
3c1c24d91f userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
A while ago Andy noticed
(http://lkml.kernel.org/r/CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com)
that UFFD_FEATURE_EVENT_FORK used by an unprivileged user may have
security implications.

As the first step of the solution the following patch limits the availably
of UFFD_FEATURE_EVENT_FORK only for those having CAP_SYS_PTRACE.

The usage of CAP_SYS_PTRACE ensures compatibility with CRIU.

Yet, if there are other users of non-cooperative userfaultfd that run
without CAP_SYS_PTRACE, they would be broken :(

Current implementation of UFFD_FEATURE_EVENT_FORK modifies the file
descriptor table from the read() implementation of uffd, which may have
security implications for unprivileged use of the userfaultfd.

Limit availability of UFFD_FEATURE_EVENT_FORK only for callers that have
CAP_SYS_PTRACE.

Link: http://lkml.kernel.org/r/1572967777-8812-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Nosh Minwalla <nosh@google.com>
Cc: Pavel Emelyanov <ovzxemul@gmail.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:10 -08:00
Andrea Arcangeli
9d4678eb17 fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
If the registration is repeated without VM_UFFD_MISSING or VM_UFFD_WP they
need to be cleared.  Currently setting UFFDIO_REGISTER_MODE_WP returns
-EINVAL, so this patch is a noop until the UFFDIO_REGISTER_MODE_WP support
is applied.

Link: http://lkml.kernel.org/r/20191004232834.GP13922@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:10 -08:00
Wei Yang
188b04a7d9 hugetlb: remove unused hstate in hugetlb_fault_mutex_hash()
The first parameter hstate in function hugetlb_fault_mutex_hash() is not
used anymore.

This patch removes it.

[akpm@linux-foundation.org: various build fixes]
[cai@lca.pw: fix a GCC compilation warning]
 Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:08 -08:00
Piotr Sarna
1ab5b82f54 hugetlbfs: add O_TMPFILE support
With hugetlbfs, a common pattern for mapping anonymous huge pages is to
create a temporary file first.  Currently libraries like libhugetlbfs
and seastar create these with a standard mkstemp+unlink trick, but it
would be more robust to be able to simply pass the O_TMPFILE flag to
open().  O_TMPFILE is already supported by several file systems like
ext4 and xfs.  The implementation simply uses the existi= ng d_tmpfile
utility function to instantiate the dcache entry for the file.

Tested manually by successfully creating a temporary file by opening it
with (O_TMPFILE|O_RDWR) on mounted hugetlbfs and successfully mapping 2M
huge pages with it.  Without the patch, trying to open a file with
O_TMPFILE results in -ENOSUP.

Link: http://lkml.kernel.org/r/bc9383eff6e1374d79f3a92257ae829ba1e6ae60.1573285189.git.p.sarna@tlen.pl
Signed-off-by: Piotr Sarna <p.sarna@tlen.pl>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:08 -08:00
Mike Kravetz
8fc312b32b mm/hugetlbfs: fix error handling when setting up mounts
It is assumed that the hugetlbfs_vfsmount[] array will contain either a
valid vfsmount pointer or NULL for each hstate after initialization.
Changes made while converting to use fs_context broke this assumption.

While fixing the hugetlbfs_vfsmount issue, it was discovered that
init_hugetlbfs_fs never did correctly clean up when encountering a vfs
mount error.

It was found during code inspection.  A small memory allocation failure
would be the most likely cause of taking a error path with the bug.
This is unlikely to happen as this is early init code.

Link: http://lkml.kernel.org/r/94b6244d-2c24-e269-b12c-e3ba694b242d@oracle.com
Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Fixes: 32021982a3 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:08 -08:00
Mike Kravetz
552546366a hugetlbfs: hugetlb_fault_mutex_hash() cleanup
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
to determine the number of u32's in an array of unsigned longs.
Suppress warning by adding parentheses.

While looking at the above issue, noticed that the 'address' parameter
to hugetlb_fault_mutex_hash is no longer used.  So, remove it from the
definition and all callers.

No functional change.

Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Ilie Halip <ilie.halip@gmail.com>
Cc: David Bolvansky <david.bolvansky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:08 -08:00
Konstantin Khlebnikov
a92853b674 fs/direct-io.c: keep dio_warn_stale_pagecache() when CONFIG_BLOCK=n
This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.

See also commit 5a9d929d6e ("iomap: report collisions between directio
and buffered writes to userspace").

Direct I/O is supported by non-disk filesystems, for example NFS.  Thus
generic code needs this even in kernel without CONFIG_BLOCK.

Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:18 -08:00
Ben Dooks
2b211dc04c fs/buffer.c: include internal.h for missing declarations
The declarations of __block_write_begin_int and guard_bio_eod are needed
from internal.h so include it to fix the following sparse warnings:

  fs/buffer.c:1930:5: warning: symbol '__block_write_begin_int' was not declared. Should it be static?
  fs/buffer.c:2994:6: warning: symbol 'guard_bio_eod' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191011170039.16100-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:17 -08:00
Saurav Girepunje
1d70667973 fs/buffer.c: fix use true/false for bool type
Use true/false for bool return type of has_bh_in_lru().

Link: http://lkml.kernel.org/r/20191029040529.GA7625@saurav
Signed-off-by: Saurav Girepunje <saurav.girepunje@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:17 -08:00
Ding Xiang
188c523e1c ocfs2: fix passing zero to 'PTR_ERR' warning
Fix a static code checker warning:
fs/ocfs2/acl.c:331
	ocfs2_acl_chmod() warn: passing zero to 'PTR_ERR'

Link: http://lkml.kernel.org/r/1dee278b-6c96-eec2-ce76-fe6e07c6e20f@linux.alibaba.com
Fixes: 5ee0fbd50f ("ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang")
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:17 -08:00
Linus Torvalds
3b805ca177 audit/stable-5.5 PR 20191126
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl3dbM4UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMagw/+MiOlIHFykjK0NGOyUbXR7AwjdrHz
 1Fgkh7VZCAHMfCcdlljIIkYe7P+ybfdK51E1QLBiTPGh353JJRAvrjFbDcIyT4kf
 u7AVddUeT0QQefFs39ZFWTV0mvJjDBfjFkmiL3cdY9ulx4ZX8V426qjyl8KIwTHe
 YkQF3pYMmO28G1SfZu298zTmrFRA10FezxCUbBRZTTE9FcD7mnGQRB7w/wY9t0H1
 ebIDgDA0EBh3oznGD3qxD63b5ULdSrImTzvSmEfhzZZsoZB4XGO5MOnLRqgBwFbT
 qfRSRfHVl+1JtDHCU43zOUlzScAsff5mvfVNdDMLAFtGGSjDGxgxKNyLwp8+5wmH
 GzvB99QQECvCN+gbaedm6adWBGzi7vpoCcgfqY0UPLYvCqsNFbZw4U/iu5ONKWy/
 cEGVUGzHf0pWofugDIJfGQuRt6iS2XT9Ode6+QMx8OiC3auZcluhTuJSxMQIhFZ1
 5XmoHOQddBwlalmIx8fsIHSo6xsAjNFwTOikEFVzUB+ECR2Urs8eHvQj+d94xz6e
 q9LrNkt/eVIDiI+PeP+UBD5IJlfmRSoyUd/mIDFMfmqMBubBSQl70Eyt3dUdt1Bz
 0PBs6xjYztpgk3Q7s35TMn8EvDcEksG0WEoFM5fEohYPLWQf8tJ5BbyYJJqc0sod
 ExlXu1lPfg9ppKc=
 =fc7R
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20191126' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Audit is back for v5.5, albeit with only two patches:

   - Allow for the auditing of suspicious O_CREAT usage via the new
     AUDIT_ANOM_CREAT record.

   - Remove a redundant if-conditional check found during code analysis.
     It's a minor change, but when the pull request is only two patches
     long, you need filler in the pull request email"

[ Heh on the pull request filler. I wish more people tried to write
  better pull request messages, even if maybe it's not worth it for the
  trivial cases ;^)   - Linus ]

* tag 'audit-pr-20191126' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: remove redundant condition check in kauditd_thread()
  audit: Report suspicious O_CREAT usage
2019-11-30 17:01:48 -08:00
Linus Torvalds
6a965666b7 Pipework for general notification queue
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3O0OoACgkQ+7dXa6fL
 C2tAwA//VH9Y81azemXFdflDF90sSH3TCASlKHVYHbBNAkH/QP5F00G4BEM4nNqH
 F3x7qcU9vzfGdumF1pc90Yt6XSYlsQEGF+xMyMw/VS2wKs40yv+b/doVbzOWbN9C
 NfrklgHeuuBk+JzU2llDisVqKRTLt4SmDpYu1ZdcchUQFZCCl3BpgdSEC+xXrHay
 +KlRPVNMSd2kXMCDuSWrr71lVNdCTdf3nNC5p1i780+VrgpIBIG/jmiNdCcd7PLH
 1aesPlr8UZY3+bmRtqe587fVRAhT2qA2xibKtyf9R0hrDtUKR4NSnpPmaeIjb26e
 LhVntcChhYxQqzy/T4ScTDNVjpSlwi6QMo5DwAwzNGf2nf/v5/CZ+vGYDVdXRFHj
 tgH1+8eDpHsi7jJp6E4cmZjiolsUx/ePDDTrQ4qbdDMO7fmIV6YQKFAMTLJepLBY
 qnJVqoBq3qn40zv6tVZmKgWiXQ65jEkBItZhEUmcQRBiSbBDPweIdEzx/mwzkX7U
 1gShGdut6YP4GX7BnOhkiQmzucS85mgkUfG43+mBfYXb+4zNTEjhhkqhEduz2SQP
 xnjHxEM+MTGCj3PozIpJxNKzMTEceYY7cAUdNEMDQcHog7OCnIdGBIc7BPnsN8yA
 CPzntwP4mmLfK3weq3PIGC6d9xfc9PpmiR9docxQOvE6sk2Ifeo=
 =FKC7
 -----END PGP SIGNATURE-----

Merge tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull pipe rework from David Howells:
 "This is my set of preparatory patches for building a general
  notification queue on top of pipes. It makes a number of significant
  changes:

   - It removes the nr_exclusive argument from __wake_up_sync_key() as
     this is always 1. This prepares for the next step:

   - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
     woken up from a function that's holding the poll waitqueue
     spinlock.

   - Change the pipe buffer ring to be managed in terms of unbounded
     head and tail indices rather than bounded index and length. This
     means that reading the pipe only needs to modify one index, not
     two.

   - A selection of helper functions are provided to query the state of
     the pipe buffer, plus a couple to apply updates to the pipe
     indices.

   - The pipe ring is allowed to have kernel-reserved slots. This allows
     many notification messages to be spliced in by the kernel without
     allowing userspace to pin too many pages if it writes to the same
     pipe.

   - Advance the head and tail indices inside the pipe waitqueue lock
     and use wake_up_interruptible_sync_poll_locked() to poke poll
     without having to take the lock twice.

   - Rearrange pipe_write() to preallocate the buffer it is going to
     write into and then drop the spinlock. This allows kernel
     notifications to then be added the ring whilst it is filling the
     buffer it allocated. The read side is stalled because the pipe
     mutex is still held.

   - Don't wake up readers on a pipe if there was already data in it
     when we added more.

   - Don't wake up writers on a pipe if the ring wasn't full before we
     removed a buffer"

* tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  pipe: Remove sync on wake_ups
  pipe: Increase the writer-wakeup threshold to reduce context-switch count
  pipe: Check for ring full inside of the spinlock in pipe_write()
  pipe: Remove redundant wakeup from pipe_write()
  pipe: Rearrange sequence in pipe_write() to preallocate slot
  pipe: Conditionalise wakeup in pipe_read()
  pipe: Advance tail pointer inside of wait spinlock in pipe_read()
  pipe: Allow pipes to have kernel-reserved slots
  pipe: Use head and tail pointers for the ring, not cursor and length
  Add wake_up_interruptible_sync_poll_locked()
  Remove the nr_exclusive argument from __wake_up_sync_key()
  pipe: Reduce #inclusion of pipe_fs_i.h
2019-11-30 14:12:13 -08:00
NeilBrown
466e16f092 nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
vfs_rmdir and vfs_unlink can return -EBUSY if the
target is a mountpoint.  This currently gets passed to
nfserrno() by nfsd_unlink(), and that results in a WARNing,
which is not user-friendly.

Possibly the best NFSv4 error is NFS4ERR_FILE_OPEN, because
there is a sense in which the object is currently in use
by some other task.  The Linux NFSv4 client will map this
back to EBUSY, which is an added benefit.

For NFSv3, the best we can do is probably NFS3ERR_ACCES, which isn't
true, but is not less true than the other options.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-30 14:59:52 -05:00
Trond Myklebust
a25e3726b3 nfsd: Ensure CLONE persists data and metadata changes to the target file
The NFSv4.2 CLONE operation has implicit persistence requirements on the
target file, since there is no protocol requirement that the client issue
a separate operation to persist data.
For that reason, we should call vfs_fsync_range() on the destination file
after a successful call to vfs_clone_file_range().

Fixes: ffa0160a10 ("nfsd: implement the NFSv4.2 CLONE operation")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-30 14:55:38 -05:00
Linus Torvalds
32ef955363 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl3hAmYACgkQnJ2qBz9k
 QNna+AgAs/6eWmQqEwij9+E761bELuMXIp8Ej9OxrBgE9Uls0VsMOsG1xfFoue+J
 sZme9XSUQ+STqPQzTi6FH4x1n5kL3VnZzjRDv9qpsRCfMzX1MbibczNvXuCCgGuP
 cKr3gPFX+GSDwQPEJX7tzF5zLZ1t1XOc4TD/H6v/X+2Nr/N1lnYeAaNcfFiGMyg1
 Pk9JK2/qUoqWASCx6zxHG/lfjmaSSIrOwaKFACNHH0/6Luc1M7Oee3M9T/jS2IEo
 8/WTS4o9CwOUFaEPemuUBv4yHyBpbJwsM+emwNKrNDSWbI5vP3ECOO7c4c+1/abS
 FcwPvkHyAQnpSgmeYlhfTjZfA/A6+w==
 =m7WG
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "Three fsnotify cleanups"

* tag 'fsnotify_for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Add git tree reference to MAINTAINERS
  fsnotify/fdinfo: exportfs_encode_inode_fh() takes pointer as 4th argument
  fsnotify: move declaration of fsnotify_mark_connector_cachep to fsnotify.h
2019-11-30 11:34:33 -08:00
Linus Torvalds
b8072d5b3c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl3hAFIACgkQnJ2qBz9k
 QNkV/gf+Kwn7xHg76YXd15lZYBzhgj/ABAYEEAAVY49OOCK5+XVmmAufHesMZ2lU
 Solt8PvbQ8d5786bWpaYXgrTU3JW37c6x1MDUPDLQ8goXWzx7pZWvD+Yup558rDa
 H1aoqvFKLgpeVVqkUdvvv2CDbgZyOgGlkDqWeS+c5pZd1NPFZzUAoU26slvQ5h4f
 t41mbavOIm5DChQ5UjwRNw+pb09GXaHrPBRJwa1XuJYJWAansBcQIsxiiqt/43Gn
 AzwUGrsz4vrPBk+Kcd0SGb8vinFVQr19gBFKFeN3rPFUEUn6T0FPBqaYeiNTNE37
 AqASYKlIuhcSf0Wdvx6vxwSHsFl5VA==
 =NGxV
 -----END PGP SIGNATURE-----

Merge tag 'for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext2, quota, reiserfs cleanups and fixes from Jan Kara:

 - Refactor the quota on/off kernel internal interfaces (mostly for
   ubifs quota support as ubifs does not want to have inodes holding
   quota information)

 - A few other small quota fixes and cleanups

 - Various small ext2 fixes and cleanups

 - Reiserfs xattr fix and one cleanup

* tag 'for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
  ext2: code cleanup for descriptor_loc()
  fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
  ext2: fix improper function comment
  ext2: code cleanup for ext2_try_to_allocate()
  ext2: skip unnecessary operations in ext2_try_to_allocate()
  ext2: Simplify initialization in ext2_try_to_allocate()
  ext2: code cleanup by calling ext2_group_last_block_no()
  ext2: introduce new helper ext2_group_last_block_no()
  reiserfs: replace open-coded atomic_dec_and_mutex_lock()
  ext2: check err when partial != NULL
  quota: Handle quotas without quota inodes in dquot_get_state()
  quota: Make dquot_disable() work without quota inodes
  quota: Drop dquot_enable()
  fs: Use dquot_load_quota_inode() from filesystems
  quota: Rename vfs_load_quota_inode() to dquot_load_quota_inode()
  quota: Simplify dquot_resume()
  quota: Factor out setup of quota inode
  quota: Check that quota is not dirty before release
  quota: fix livelock in dquot_writeback_dquots
  ext2: don't set *count in the case of failure in ext2_try_to_allocate()
  ...
2019-11-30 11:16:07 -08:00
Linus Torvalds
e2d73c302b Updates since last change:
- Introduce superblock checksum support;
 - Set iowait when waiting I/O for sync decompression path;
 - Several code cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXd/dwRYcZ2FveGlhbmcy
 NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEjd4BANBpFIxoenCeu9A8XVG4TVIobyoz
 g/eci9QZr+rgPGl/AP93QFEAPmjwQi6iNQREwvXQ96EE9rMX0Legt3tFeHPRCw==
 =xdz3
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "No major kernel updates for this round since I'm fully diving into
  LZMA algorithm internals now to provide high CR XZ algorihm support.
  That needs more work and time for me to get a better compression time.

  Summary:

   - Introduce superblock checksum support

   - Set iowait when waiting I/O for sync decompression path

   - Several code cleanups"

* tag 'erofs-for-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: remove unnecessary output in erofs_show_options()
  erofs: drop all vle annotations for runtime names
  erofs: support superblock checksum
  erofs: set iowait for sync decompression
  erofs: clean up decompress queue stuffs
  erofs: get rid of __stagingpage_alloc helper
  erofs: remove dead code since managed cache is now built-in
  erofs: clean up collection handling routines
2019-11-30 11:13:33 -08:00
Linus Torvalds
21b26d2679 Various smb3 fixes (including 12 for stable) and also features (addition of multichannel support)
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3fB2gACgkQiiy9cAdy
 T1G4rgwAwTVvUYOmUVxJt1OxJDAnVxLXOH08FmrSL08nacrUCb6fPRw0z0Eb65SB
 4/Yc3xIH+bJInNpACKZVY0ghPlDYC4u7AtLUZqm4+ruDVSjoWmF4s38WsNCrZywm
 uAWq4p+j5A62uGdgdSzVxdUJkSLGRTc+7ib+y/TZ2U4BLjb6AY2etUtOpnA6lQ4R
 UHvjf7leHw55pvPbNnpZ/2zT18KpPH8twCj0f3lH4yt8Akw0p/va6tRcfsmE2plh
 V3dxiPmS86otXFPyHpu9B9plrxJCcGuYUL+R8pPpbLD3MZMsdK7J+sGNdayOoLXl
 bpuoTvl3LThEwcg7bFeB/jHXGx95KPm3vh99lrZvbM9Snya+ueSfudrhmoqCEiSo
 5ri1tWH7GlPr9nEaxp/aYB2LZENpbYvwn0yxoUdgK+R9EN5F3utUwK54CgYPVz9Y
 1zp0LksafWZu7NrIPOuoq+/ATJ1WgbDYrhotk9fwaZoc2EsUfJd0I31XGECisjdh
 fcHNxbA5
 =EIrP
 -----END PGP SIGNATURE-----

Merge tag '5.5-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:
 "Various smb3 fixes (including 12 for stable) and also features
  (addition of multichannel support)"

* tag '5.5-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: (41 commits)
  CIFS: fix a white space issue in cifs_get_inode_info()
  cifs: update internal module version number
  cifs: Always update signing key of first channel
  cifs: Fix retrieval of DFS referrals in cifs_mount()
  cifs: Fix potential softlockups while refreshing DFS cache
  cifs: Fix lookup of root ses in DFS referral cache
  cifs: Fix use-after-free bug in cifs_reconnect()
  cifs: dump channel info in DebugData
  smb3: dump in_send and num_waiters stats counters by default
  cifs: try harder to open new channels
  CIFS: Properly process SMB3 lease breaks
  cifs: move cifsFileInfo_put logic into a work-queue
  cifs: try opening channels after mounting
  CIFS: refactor cifs_get_inode_info()
  cifs: switch servers depending on binding state
  cifs: add server param
  cifs: add multichannel mount options and data structs
  cifs: sort interface list by speed
  CIFS: Fix SMB2 oplock break processing
  cifs: don't use 'pre:' for MODULE_SOFTDEP
  ...
2019-11-30 11:10:39 -08:00
Linus Torvalds
8f45533e9d f2fs-for-5.5-rc1
In this round, we've introduced fairly small number of patches as below.
 
 Enhancement:
  - improve the in-place-update IO flow
  - allocate segment to guarantee no GC for pinned files
 
 Bug fix:
  - fix updatetime in lazytime mode
  - potential memory leak in f2fs_listxattr
  - record parent inode number in rename2 correctly
  - fix deadlock in f2fs_gc along with atomic writes
  - avoid needless data migration in GC
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl3e1XkACgkQQBSofoJI
 UNJ0GhAAhVIX4J91CLnVSh0ik1XCaI6h/dFeS6kbDd8oxzQm/qt64b59aZqgy7Rk
 iblGWfj8uPP5yO60pqb5uN4a0hybptVZSEldbhF0Xv0zUeVoT7C1ksTMrdUd1p7d
 YkO8G+V4QBBrtpKG1KKKEncrvcdx4n9QHxGsRh4z5vXZH7sEmH7+N8OE88MaPjdZ
 UWqYk0S0GoZBhPe7c8pQuD/PM+WJJH4Lewgw5kK21eAjOKI+yZKb+bY2tGjo5dA1
 nzYO72CRMV4VEKsnxTZ/LCB2kCXeexaGuiVPyHjCmgAh990cLjsCWIbJ8EJu7uAa
 vAo6/EMfgfPkPt5Y7uWGR4EeNT7AFhUoMuoQ9zdXzecY48D4Gz58o87Q+OFY3ipZ
 W2OSf92pEJyfumE5o8wN435gaRYUjjCo1SMoIQABNav411XrBVoRwjvkV3DyA6af
 Bs1bafz2hR/E1q0uoZvLWC5waiHy9605OkKMs/y8IRsn6yhRep/tv3KLk2Dz3fOO
 LxenhuVO9bQDCheEcH15qIljxTuyfTyUOa9UrFXOwn4mK61J8A/Gs+SiqW0y28oA
 feSw7cLPxK0OlYQgql24JfJN/Xt523WmCSfXfe7TCUDTDkBpmsdhFwHYZyCLzqt+
 FyBhf2DF/BGzKMT28oc7StO43mIvOc1Wk+jfJFW+hld5ncAJxCE=
 =qyrd
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've introduced fairly small number of patches as below.

  Enhancements:
   - improve the in-place-update IO flow
   - allocate segment to guarantee no GC for pinned files

  Bug fixes:
   - fix updatetime in lazytime mode
   - potential memory leak in f2fs_listxattr
   - record parent inode number in rename2 correctly
   - fix deadlock in f2fs_gc along with atomic writes
   - avoid needless data migration in GC"

* tag 'f2fs-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: stop GC when the victim becomes fully valid
  f2fs: expose main_blkaddr in sysfs
  f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
  f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
  f2fs: show f2fs instance in printk_ratelimited
  f2fs: fix potential overflow
  f2fs: fix to update dir's i_pino during cross_rename
  f2fs: support aligned pinned file
  f2fs: avoid kernel panic on corruption test
  f2fs: fix wrong description in document
  f2fs: cache global IPU bio
  f2fs: fix to avoid memory leakage in f2fs_listxattr
  f2fs: check total_segments from devices in raw_super
  f2fs: update multi-dev metadata in resize_fs
  f2fs: mark recovery flag correctly in read_raw_super_block()
  f2fs: fix to update time in lazytime mode
2019-11-30 11:02:30 -08:00
Linus Torvalds
4a55d362ff AFS development
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3W9UYACgkQ+7dXa6fL
 C2uO1g//T0NvHmrZr6zHbOWlaKkGXRfhgCLENzuTCtcFt29NV+Jv8D/FNuT29d7f
 x8PzrZUBDYfpjP8OSHEivZgU5+a4jpI5pSd2mnGnT3523Oxbm2lIz/SXgpkwQdkM
 4OcK+voCP1MvHbmO7jyOfQm7SX/aOtF90GQrUog5JqwPD0ddSpk/swNNRKZ6xK8/
 cAq9/DHPfYHfLh8wc2FWmhdPA5Px+eKt9kGmMMpwsCulR4yMM7vBCIVNSccSjpYR
 QJzmFpi/TfOiRMc0oZnXLv6UUG+7hHeRfRosoQA8YGGsjwXfs2I7MfAziJx68tfu
 v7beYI+u7tc4EhgK9jXxbgPCJF6nFR/zrbzfC3Zlh7Hr3NCGGzH4gwDu+pn4lwfV
 bpUyA6VwmEt+CVzjj6LB5l4vyRSk+TtVQSwfQvCA7xqvdq0J0/fux6OQ1DwJonQe
 6dOpPUd4ip5tW28S9U7MzbbpMv1kP5hMoAu87gBdygUDhPojM7Ut0mFwZDMOunXg
 vW6oQwKyyFPeJhRcPMKxS7opWQz5889tDe8GPNI/1ociOYCcCkmwZaaG+hYAJ8+N
 8bK2OCi4yNGBa/qMPbwATY+TVKdYcceq/7THxcI3vZk8nVrda73UQHfwUXsjwUzf
 SNNcoMrsEV9BiPSG7gAkCf3VI2gXj+USm3w7SBSjZDM5qR/lRHg=
 =YjAm
 -----END PGP SIGNATURE-----

Merge tag 'afs-next-20191121' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "Minor cleanups and fix:

   - Minor fix to make some debugging statements display information
     from the correct iov_iter.

   - Rename some members and variables to make things more obvious or
     consistent.

   - Provide a helper to wrap increments of the usage count on the
     afs_read struct.

   - Use scnprintf() to print into a stack buffer rather than sprintf().

   - Remove some set but unused variables"

* tag 'afs-next-20191121' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Remove set but not used variable 'ret'
  afs: Remove set but not used variables 'before', 'after'
  afs: xattr: use scnprintf
  afs: Introduce an afs_get_read() refcount helper
  afs: Rename desc -> req in afs_fetch_data()
  afs: Switch the naming of call->iter and call->_iter
  afs: Use call->_iter not &call->iter in debugging statements
2019-11-30 10:57:22 -08:00
Linus Torvalds
50b8b3f85a This merge window saw the the following new featuers added to ext4:
* Direct I/O via iomap (required the iomap-for-next branch from Darrick
    as a prereq).
  * Support for using dioread-nolock where the block size < page size.
  * Support for encryption for file systems where the block size < page size.
  * Rework of journal credits handling so a revoke-heavy workload will
    not cause the journal to run out of space.
  * Replace bit-spinlocks with spinlocks in jbd2
 
 Also included were some bug fixes and cleanups, mostly to clean up
 corner cases from fuzzed file systems and error path handling.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl3dHxoACgkQ8vlZVpUN
 gaMZswf5AbtQhTEJDXO7Pc1ull38GIGFgAv7uAth0TymLC3h1/FEYWW0crEPFsDr
 1Eei55UUVOYrMMUKQ4P7wlLX0cIh3XDPMWnRFuqBoV5/ZOsH/ZSbkY//TG2Xze/v
 9wXIH/RKQnzbRtXffJ1+DnvmXJk+HFm1R1gjl0nfyUXGrnlSfqJxhLSczyd6bJJq
 ehi/tso5UC/4EQsAIdWp7VWsAdaHcZ7ogHqDoy8dXpM1equ408iml7VlKr8R+Nr7
 5ANpCISXChSlLLYm0NYN5vhO8upF5uDxWLdCtxVPL5kFdM2m/ELjXw9h9C+78l7C
 EWJGlGlxvx07Px+e+bfStEsoixpWBg==
 =0eko
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "This merge window saw the the following new featuers added to ext4:

   - Direct I/O via iomap (required the iomap-for-next branch from
     Darrick as a prereq).

   - Support for using dioread-nolock where the block size < page size.

   - Support for encryption for file systems where the block size < page
     size.

   - Rework of journal credits handling so a revoke-heavy workload will
     not cause the journal to run out of space.

   - Replace bit-spinlocks with spinlocks in jbd2

  Also included were some bug fixes and cleanups, mostly to clean up
  corner cases from fuzzed file systems and error path handling"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits)
  ext4: work around deleting a file with i_nlink == 0 safely
  ext4: add more paranoia checking in ext4_expand_extra_isize handling
  jbd2: make jbd2_handle_buffer_credits() handle reserved handles
  ext4: fix a bug in ext4_wait_for_tail_page_commit
  ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
  ext4: code cleanup for get_next_id
  ext4: fix leak of quota reservations
  ext4: remove unused variable warning in parse_options()
  ext4: Enable encryption for subpage-sized blocks
  fs/buffer.c: support fscrypt in block_read_full_page()
  ext4: Add error handling for io_end_vec struct allocation
  jbd2: Fine tune estimate of necessary descriptor blocks
  jbd2: Provide trace event for handle restarts
  ext4: Reserve revoke credits for freed blocks
  jbd2: Make credit checking more strict
  jbd2: Rename h_buffer_credits to h_total_credits
  jbd2: Reserve space for revoke descriptor blocks
  jbd2: Drop jbd2_space_needed()
  jbd2: Account descriptor blocks into t_outstanding_credits
  jbd2: Factor out common parts of stopping and restarting a handle
  ...
2019-11-30 10:53:02 -08:00
Linus Torvalds
f112a2fd1f New code for 5.5:
- Fix another place in the splice code where a pipe could ask a
 filesystem for a longer read than the pipe actually has free buffer
 space.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3XKBsACgkQ+H93GTRK
 tOubhg/9FhWIX+Ijj0a1MQj3QDv0YbLPv2uTVLYyRK+1X5Xwc7u0OmmbWYrQQQUJ
 6aOaictBXXTIJtKqVcmXBXboUskKygnWKlVssKAI/BYR9ope/EMrMV5+4bz0mKt/
 ufcDXRiJX2hbL8GicmJxK/qmp3bEAG0n/e0+0LcuYynatYXJQ19eOUeritVZ3EYA
 Vhbgvrtuq8ws9SqvJ4Dm7fLsPT0AOHocor34tXS+gRq8KYIA2Ru7uDyE2uJJujF7
 qAlmZVvun+RcSBG0mYdk6GDgLGibn76t+d0xnqd/OzSeh+6bHYc1ni6FJD8kjVzf
 nGHFJfbEOCzDEees9ESBkgnH1TiCN4wCs7JQ4BictAPotBCtkbcNLvrO+b77zgwp
 DAu6uE/DiRmIu9zqfy8TyH1E78pW7qTpI9yFDUQLaItvAbbqYIaiRLr4Hjd2Hf3E
 MLYOSGZkKn/xZaqedk+rGwYJJmmvzCZblfrQ3hCtLGjmNQgfTRXidlgyicl1Ta12
 iG9UkCFdbkZmdW9zWyYmZrfxLB6QzOZpgbdUwyMk0Rfc1sKJKYoNE+hkt8fjmS71
 0hBPYSopDsYkGEH1719Nsfvyj0101McFklvEqS2hU9/rLyaF0Mdq5catVbzyBPhk
 RwnvXdFd/sFgbR+1xLiEm51aXAe/kmveNNQBqZkuBgXITt7oh7Q=
 =ripU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.5-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull splice fix from Darrick Wong:
 "Fix another place in the splice code where a pipe could ask a
  filesystem for a longer read than the pipe actually has free buffer
  space"

* tag 'vfs-5.5-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  splice: only read in as much information as there is pipe buffer space
2019-11-30 10:48:24 -08:00
Linus Torvalds
3b266a52d8 New code for 5.5:
- Make iomap_dio_rw callers explicitly tell us if they want us to wait
 - Port the xfs writeback code to iomap to complete the buffered io
   library functions
 - Refactor the unshare code to share common pieces
 - Add support for performing copy on write with buffered writes
 - Other minor fixes
 - Fix unchecked return in iomap_bmap
 - Fix a type casting bug in a ternary statement in iomap_dio_bio_actor
 - Improve tracepoints for easier diagnostic ability
 - Fix pipe page leakage in directio reads
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3YDqQACgkQ+H93GTRK
 tOsbPg/+KrPkhe60RnUO4PQbSLTRLsz5R7OK5ubfxsKlHdyy/8vu12yPdY03LcHn
 QoyQz+5gPjM58UvysIonlyRO0O30apl8UZ9PAINDMi7os6NP87illw4HtHL1cZjB
 JfIQKVJrLNtocZnAgVL74d6pUs7MH32SPw0r+/qbfz/JFcHdf/Sz8fSpb0sdS/oK
 QwT73TcaCcNa2C4twvhtO6+kMkzlTkJknqSZMqrthScqRRVeOnyGTjLdKqUamSqp
 uj0iAKm+bBpCcuMdcHd7EOgQyVGwCQKUndaLKojK/V1iqiK+3KsLnIoJj6HwlU27
 Q+pDoThv2V8m/Y940Gq0wzTNtkwdCirNaeKXwXX2ytlyPX5W45ZxgzGUQy4YYXGM
 ObHRmeJXdka6kH7yzWlPiZGdZixpagFLzFUHWjMXAD8Fb4YxNKM4FsJDY3K3uQi6
 y7EKV5O4q3qBW3ieL6l+wl9NkdcppSywRjRxhBZIhH4T2n7RMDLdBkNCiKZrWqIW
 1QcrIC1NvwbXPzSNaZ40dgjB6mJGTv+P9AJLSQpmlR1Y6txLsJ1AeVzucZOnXcCo
 TEiBfDFOZ9jvXUsaqGSdM99HxsePKn+VgEeTA6TFxTWJ9QcEgio1Pym+RkN6KrIU
 zsbJbgk9bp3YT940xKt90fAHWm/iKUWrrRyidVCK3kJd72e41UE=
 =FRl4
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "In this release, we hoisted as much of XFS' writeback code into iomap
  as was practicable, refactored the unshare file data function, added
  the ability to perform buffered io copy on write, and tweaked various
  parts of the directio implementation as needed to port ext4's directio
  code (that will be a separate pull).

  Summary:

   - Make iomap_dio_rw callers explicitly tell us if they want us to
     wait

   - Port the xfs writeback code to iomap to complete the buffered io
     library functions

   - Refactor the unshare code to share common pieces

   - Add support for performing copy on write with buffered writes

   - Other minor fixes

   - Fix unchecked return in iomap_bmap

   - Fix a type casting bug in a ternary statement in
     iomap_dio_bio_actor

   - Improve tracepoints for easier diagnostic ability

   - Fix pipe page leakage in directio reads"

* tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
  iomap: Fix pipe page leakage during splicing
  iomap: trace iomap_appply results
  iomap: fix return value of iomap_dio_bio_actor on 32bit systems
  iomap: iomap_bmap should check iomap_apply return value
  iomap: Fix overflow in iomap_page_mkwrite
  fs/iomap: remove redundant check in iomap_dio_rw()
  iomap: use a srcmap for a read-modify-write I/O
  iomap: renumber IOMAP_HOLE to 0
  iomap: use write_begin to read pages to unshare
  iomap: move the zeroing case out of iomap_read_page_sync
  iomap: ignore non-shared or non-data blocks in xfs_file_dirty
  iomap: always use AOP_FLAG_NOFS in iomap_write_begin
  iomap: remove the unused iomap argument to __iomap_write_end
  iomap: better document the IOMAP_F_* flags
  iomap: enhance writeback error message
  iomap: pass a struct page to iomap_finish_page_writeback
  iomap: cleanup iomap_ioend_compare
  iomap: move struct iomap_page out of iomap.h
  iomap: warn on inline maps in iomap_writepage_map
  iomap: lift the xfs writeback code to iomap
  ...
2019-11-30 10:44:49 -08:00
Jens Axboe
aa4c396775 io_uring: fix missing kmap() declaration on powerpc
Christophe reports that current master fails building on powerpc with
this error:

   CC      fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                      ^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                    ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
     kunmap(iter->bvec->bv_page);
     ^

which is caused by a missing highmem.h include. Fix it by including
it.

Fixes: 311ae9e159 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-29 10:14:00 -07:00
Joel Stanley
6e78c01fde Revert "jffs2: Fix possible null-pointer dereferences in jffs2_add_frag_to_fragtree()"
This reverts commit f2538f9993. The patch
stopped JFFS2 from being able to mount an existing filesystem with the
following errors:

 jffs2: error: (77) jffs2_build_inode_fragtree: Add node to tree failed -22
 jffs2: error: (77) jffs2_do_read_inode_internal: Failed to build final fragtree for inode #5377: error -22

Fixes: f2538f9993 ("jffs2: Fix possible null-pointer dereferences...")
Cc: stable@vger.kernel.org
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-29 11:29:58 +01:00
Linus Torvalds
05bd375b6b for-5.5/io_uring-post-20191128
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3f/C0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprUjD/4t5ituIXmfPDulW4178LchRbAdC2FquGeU
 Okft4laeQsJ8ucTBm3RAziVDNnkeJ0mLy5JTMpdPwKW+1F/d8jPuDeTlmmRyKGyW
 vt2NlI9fKaOt8cU+LPp3MiGIamleZFOEPsRHU03HYASFcRKQwAswgjS2DZnR7oTK
 66XGB9CdwREh1mM63wERYX07+t+ylocTClcLQDCi99H4HI5ScsJXB5aSqpombycy
 XujZGlEAUT6WLp0HAwihhAnQIu/fEw+tXMWLvozphIP9xZGpk+a2nbd2WxTCdLZy
 MoksoCpl76RmYPuZ+TC1L+tHf2KqGHa/mo64L/6v15QS8hROw3ZTMdJI0mEC/90g
 5JwWuBqWOcRiMcV1Uwkz5OYL3KYQOWebS+FBbIc+2jqBlXJeN+1R/Ik5HP8xtGaE
 5p317/q09h32piZN0c+BmkXCdnAYZeBez0EjSXf6j3PW3LAqmLOjxc0yA2OmwFWj
 iudOfBOg48HtHkMaNrg2f+0THMP8zHqUVoRh96BKCSjnOXmbxGFt3ACHvb+VS0CX
 RmlVOcaq+SzL1EsT805Hb6N3x7mK2l9CoanGfLxYXQZlOzD6TcRdlBzs9JCgqiAs
 3jk5nAq9wtYvigAkVarMw/LAJNatEO4sy0AHyWJwAcfOsVJquNvhhokT8VxOcVZs
 pMxUhUYKZA==
 =HqiO
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/io_uring-post-20191128' of git://git.kernel.dk/linux-block

Pull more io_uring updates from Jens Axboe:
 "As mentioned in the first pull request, there was a later batch as
  well. This contains fixes to the stuff that already went in, cleanups,
  and a few later additions. In particular, this contains:

   - Cleanups/fixes/unification of the submission and completion path
     (Pavel,me)

   - Linked timeouts improvements (Pavel,me)

   - Error path fixes (me)

   - Fix lookup window where cancellations wouldn't work (me)

   - Improve DRAIN support (Pavel)

   - Fix backlog flushing -EBUSY on submit (me)

   - Add support for connect(2) (me)

   - Fix for non-iter based fixed IO (Pavel)

   - creds inheritance for async workers (me)

   - Disable cmsg/ancillary data for sendmsg/recvmsg (me)

   - Shrink io_kiocb to 3 cachelines (me)

   - NUMA fix for io-wq (Jann)"

* tag 'for-5.5/io_uring-post-20191128' of git://git.kernel.dk/linux-block: (42 commits)
  io_uring: make poll->wait dynamically allocated
  io-wq: shrink io_wq_work a bit
  io-wq: fix handling of NUMA node IDs
  io_uring: use kzalloc instead of kcalloc for single-element allocations
  io_uring: cleanup io_import_fixed()
  io_uring: inline struct sqe_submit
  io_uring: store timeout's sqe->off in proper place
  net: disallow ancillary data for __sys_{send,recv}msg_file()
  net: separate out the msghdr copy from ___sys_{send,recv}msg()
  io_uring: remove superfluous check for sqe->off in io_accept()
  io_uring: async workers should inherit the user creds
  io-wq: have io_wq_create() take a 'data' argument
  io_uring: fix dead-hung for non-iter fixed rw
  io_uring: add support for IORING_OP_CONNECT
  net: add __sys_connect_file() helper
  io_uring: only return -EBUSY for submit on non-flushed backlog
  io_uring: only !null ptr to io_issue_sqe()
  io_uring: simplify io_req_link_next()
  io_uring: pass only !null to io_req_find_next()
  io_uring: remove io_free_req_find_next()
  ...
2019-11-28 10:43:39 -08:00
Roman Penyaev
6c5c240e41 io_uring: add mapping support for NOMMU archs
That is a bit weird scenario but I find it interesting to run fio loads
using LKL linux, where MMU is disabled.  Probably other real archs which
run uClinux can also benefit from this patch.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-28 10:08:02 -07:00
David Howells
82995cc6c5 libceph, rbd, ceph: convert to use the new mount API
Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
  was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-27 22:28:37 +01:00
Linus Torvalds
9a3d7fd275 Driver core patches for 5.5-rc1
Here is the "big" set of driver core patches for 5.5-rc1
 
 There's a few minor cleanups and fixes in here, but the majority of the
 patches in here fall into two buckets:
   - debugfs api cleanups and fixes
   - driver core device link support for boot dependancy issues
 
 The debugfs api cleanups are working to slowly refactor the debugfs apis
 so that it is even harder to use incorrectly.  That work has been
 happening for the past few kernel releases and will continue over time,
 it's a long-term project/goal
 
 The driver core device link support missed 5.4 by just a bit, so it's
 been sitting and baking for many months now.  It's from Saravana Kannan
 to help resolve the problems that DT-based systems have at boot time
 with dependancy graphs and kernel modules.  Turns out that no one has
 actually tried to build a generic arm64 kernel with loads of modules and
 have it "just work" for a variety of platforms (like a distro kernel)
 The big problem turned out to be a lack of depandancy information
 between different areas of DT entries, and the work here resolves that
 problem and now allows devices to boot properly, and quicker than a
 monolith kernel.
 
 All of these patches have been in linux-next for a long time with no
 reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXd6m6Q8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yntJQCcCqg6RQ7LTdHuZv1ETeefXlsfk00An1Jtean6
 42bWGx52bGFvAcpjWy8R
 =P7hq
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.5-rc1

  There's a few minor cleanups and fixes in here, but the majority of
  the patches in here fall into two buckets:

   - debugfs api cleanups and fixes

   - driver core device link support for boot dependancy issues

  The debugfs api cleanups are working to slowly refactor the debugfs
  apis so that it is even harder to use incorrectly. That work has been
  happening for the past few kernel releases and will continue over
  time, it's a long-term project/goal

  The driver core device link support missed 5.4 by just a bit, so it's
  been sitting and baking for many months now. It's from Saravana Kannan
  to help resolve the problems that DT-based systems have at boot time
  with dependancy graphs and kernel modules. Turns out that no one has
  actually tried to build a generic arm64 kernel with loads of modules
  and have it "just work" for a variety of platforms (like a distro
  kernel). The big problem turned out to be a lack of dependency
  information between different areas of DT entries, and the work here
  resolves that problem and now allows devices to boot properly, and
  quicker than a monolith kernel.

  All of these patches have been in linux-next for a long time with no
  reported issues"

* tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (68 commits)
  tracing: Remove unnecessary DEBUG_FS dependency
  of: property: Add device link support for interrupt-parent, dmas and -gpio(s)
  debugfs: Fix !DEBUG_FS debugfs_create_automount
  of: property: Add device link support for "iommu-map"
  of: property: Fix the semantics of of_is_ancestor_of()
  i2c: of: Populate fwnode in of_i2c_get_board_info()
  drivers: base: Fix Kconfig indentation
  firmware_loader: Fix labels with comma for builtin firmware
  driver core: Allow device link operations inside sync_state()
  driver core: platform: Declare ret variable only once
  cpu-topology: declare parse_acpi_topology in <linux/arch_topology.h>
  crypto: hisilicon: no need to check return value of debugfs_create functions
  driver core: platform: use the correct callback type for bus_find_device
  firmware_class: make firmware caching configurable
  driver core: Clarify documentation for fwnode_operations.add_links()
  mailbox: tegra: Fix superfluous IRQ error message
  net: caif: Fix debugfs on 64-bit platforms
  mac80211: Use debugfs_create_xul() helper
  media: c8sectpfe: no need to check return value of debugfs_create functions
  of: property: Add device link support for iommus, mboxes and io-channels
  ...
2019-11-27 11:06:20 -08:00
Dan Carpenter via samba-technical
68464b88cc CIFS: fix a white space issue in cifs_get_inode_info()
We accidentally messed up the indenting on this if statement.

Fixes: 16c696a6c300 ("CIFS: refactor cifs_get_inode_info()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-27 11:31:49 -06:00
Darrick J. Wong
8feb4732ff xfs: allow parent directory scans to be interrupted with fatal signals
Allow a fatal signal to interrupt us when we're scanning a directory to
verify a parent pointer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-11-27 08:23:14 -08:00
Krzysztof Kozlowski
8d66fcb748 fuse: fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-27 09:35:20 +01:00
Miklos Szeredi
f1ebdeffc6 fuse: fix leak of fuse_io_priv
exit_aio() is sometimes stuck in wait_for_completion() after aio is issued
with direct IO and the task receives a signal.

The reason is failure to call ->ki_complete() due to a leaked reference to
fuse_io_priv.  This happens in fuse_async_req_send() if
fuse_simple_background() returns an error (e.g. -EINTR).

In this case the error value is propagated via io->err, so return success
to not confuse callers.

This issue is tracked as a virtio-fs issue:
https://gitlab.com/virtio-fs/qemu/issues/14

Reported-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Fixes: 45ac96ed7c ("fuse: convert direct_io to simple api")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-27 09:33:49 +01:00
Linus Torvalds
168829ad09 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - A comprehensive rewrite of the robust/PI futex code's exit handling
     to fix various exit races. (Thomas Gleixner et al)

   - Rework the generic REFCOUNT_FULL implementation using
     atomic_fetch_* operations so that the performance impact of the
     cmpxchg() loops is mitigated for common refcount operations.

     With these performance improvements the generic implementation of
     refcount_t should be good enough for everybody - and this got
     confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
     REFCOUNT_FULL entirely, leaving the generic implementation enabled
     unconditionally. (Will Deacon)

   - Other misc changes, fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  lkdtm: Remove references to CONFIG_REFCOUNT_FULL
  locking/refcount: Remove unused 'refcount_error_report()' function
  locking/refcount: Consolidate implementations of refcount_t
  locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
  locking/refcount: Move saturation warnings out of line
  locking/refcount: Improve performance of generic REFCOUNT_FULL code
  locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
  locking/refcount: Remove unused refcount_*_checked() variants
  locking/refcount: Ensure integer operands are treated as signed
  locking/refcount: Define constants for saturation and max refcount values
  futex: Prevent exit livelock
  futex: Provide distinct return value when owner is exiting
  futex: Add mutex around futex exit
  futex: Provide state handling for exec() as well
  futex: Sanitize exit state handling
  futex: Mark the begin of futex exit explicitly
  futex: Set task::futex_state to DEAD right after handling futex exit
  futex: Split futex_mm_release() for exit/exec
  exit/exec: Seperate mm_release()
  futex: Replace PF_EXITPIDONE with a state
  ...
2019-11-26 16:02:40 -08:00
Linus Torvalds
1ae78780ed Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Dynamic tick (nohz) updates, perhaps most notably changes to force
     the tick on when needed due to lengthy in-kernel execution on CPUs
     on which RCU is waiting.

   - Linux-kernel memory consistency model updates.

   - Replace rcu_swap_protected() with rcu_prepace_pointer().

   - Torture-test updates.

   - Documentation updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
  bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
  fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
  drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
  drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
  x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
  rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
  rcu: Suppress levelspread uninitialized messages
  rcu: Fix uninitialized variable in nocb_gp_wait()
  rcu: Update descriptions for rcu_future_grace_period tracepoint
  rcu: Update descriptions for rcu_nocb_wake tracepoint
  rcu: Remove obsolete descriptions for rcu_barrier tracepoint
  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
  workqueue: Convert for_each_wq to use built-in list check
  rcu: Several rcu_segcblist functions can be static
  rcu: Remove unused function hlist_bl_del_init_rcu()
  Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
  ...
2019-11-26 15:42:43 -08:00
Linus Torvalds
77a05940ee Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The biggest changes in this cycle were:

   - Make kcpustat vtime aware (Frederic Weisbecker)

   - Rework the CFS load_balance() logic (Vincent Guittot)

   - Misc cleanups, smaller enhancements, fixes.

  The load-balancing rework is the most intrusive change: it replaces
  the old heuristics that have become less meaningful after the
  introduction of the PELT metrics, with a grounds-up load-balancing
  algorithm.

  As such it's not really an iterative series, but replaces the old
  load-balancing logic with the new one. We hope there are no
  performance regressions left - but statistically it's highly probable
  that there *is* going to be some workload that is hurting from these
  chnages. If so then we'd prefer to have a look at that workload and
  fix its scheduling, instead of reverting the changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  rackmeter: Use vtime aware kcpustat accessor
  leds: Use all-in-one vtime aware kcpustat accessor
  cpufreq: Use vtime aware kcpustat accessors for user time
  procfs: Use all-in-one vtime aware kcpustat accessor
  sched/vtime: Bring up complete kcpustat accessor
  sched/cputime: Support other fields on kcpustat_field()
  sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
  sched/fair: Add comments for group_type and balancing at SD_NUMA level
  sched/fair: Fix rework of find_idlest_group()
  sched/uclamp: Fix overzealous type replacement
  sched/Kconfig: Fix spelling mistake in user-visible help text
  sched/core: Further clarify sched_class::set_next_task()
  sched/fair: Use mul_u32_u32()
  sched/core: Simplify sched_class::pick_next_task()
  sched/core: Optimize pick_next_task()
  sched/core: Make pick_next_task_idle() more consistent
  sched/fair: Better document newidle_balance()
  leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
  cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
  procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
  ...
2019-11-26 15:23:14 -08:00
Jens Axboe
e944475e69 io_uring: make poll->wait dynamically allocated
In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Jens Axboe
6206f0e180 io-wq: shrink io_wq_work a bit
Currently we're using 40 bytes for the io_wq_work structure, and 16 of
those is the doubly link list node. We don't need doubly linked lists,
we always add to tail to keep things ordered, and any other use case
is list traversal with deletion. For the deletion case, we can easily
support any node deletion by keeping track of the previous entry.

This shrinks io_wq_work to 32 bytes, and subsequently io_kiock from
io_uring to 216 to 208 bytes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Jann Horn
3fc50ab559 io-wq: fix handling of NUMA node IDs
There are several things that can go wrong in the current code on NUMA
systems, especially if not all nodes are online all the time:

 - If the identifiers of the online nodes do not form a single contiguous
   block starting at zero, wq->wqes will be too small, and OOB memory
   accesses will occur e.g. in the loop in io_wq_create().
 - If a node comes online between the call to num_online_nodes() and the
   for_each_node() loop in io_wq_create(), an OOB write will occur.
 - If a node comes online between io_wq_create() and io_wq_enqueue(), a
   lookup is performed for an element that doesn't exist, and an OOB read
   will probably occur.

Fix it by:

 - using nr_node_ids instead of num_online_nodes() for the allocation size;
   nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
   highest node ID that could possibly come online at some point, even if
   those nodes' identifiers are not a contiguous block
 - creating workers for all possible CPUs, not just all online ones

This is basically what the normal workqueue code also does, as far as I can
tell.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Jann Horn
ad6e005ca6 io_uring: use kzalloc instead of kcalloc for single-element allocations
These allocations are single-element allocations, so don't use the array
allocation wrapper for them.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Pavel Begunkov
7d00916555 io_uring: cleanup io_import_fixed()
Clean io_import_fixed() call site and make it return proper type.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Pavel Begunkov
cf6fd4bd55 io_uring: inline struct sqe_submit
There is no point left in keeping struct sqe_submit. Inline it
into struct io_kiocb, so any req->submit.field is now just req->field

- moves initialisation of ring_file into io_get_req()
- removes duplicated req->sequence.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Pavel Begunkov
cc42e0ac17 io_uring: store timeout's sqe->off in proper place
Timeouts' sequence offset (i.e. sqe->off) is stored in
req->submit.sequence under a false name. Keep it in timeout.data
instead. The unused space for sequence will be reclaimed in the
following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Linus Torvalds
2be7d348fe Revert "vfs: properly and reliably lock f_pos in fdget_pos()"
This reverts commit 0be0ee7181.

I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.

While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads.  Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service.  In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.

There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor.  Which doesn't work well for
them.

The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side).  It may be that it's just this that causes problems, but we
clearly weren't ready yet.

Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:

  /dev/fuse		(fuse_dev_operations)
  /dev/mcelog		(mce_chrdev_ops)
  /dev/mei0		(mei_fops)
  /dev/net/tun		(tun_fops)
  /dev/nvme0		(nvme_dev_fops)
  /dev/tpm0		(tpm_fops)
  /proc/self/ns/mnt	(ns_file_operations)
  /dev/snd/pcm*		(snd_pcm_f_ops[])

and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.

Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.

We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).

Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-26 11:34:06 -08:00
Johannes Thumshirn
88cfd30e18 iomap: remove unneeded variable in iomap_dio_rw()
The 'start' variable indicates the start of a filemap and is set to the
iocb's position, which we have already cached as 'pos', upon function
entry.

'pos' is used as a cursor indicating the current position and updated
later in iomap_dio_rw(), but not before the last use of 'start'.

Remove 'start' as it's synonym for 'pos' before we're entering the loop
calling iomapp_apply().

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-26 09:28:47 -08:00
Jan Kara
f550ee9b85 iomap: Do not create fake iter in iomap_dio_bio_actor()
iomap_dio_bio_actor() copies iter to a local variable and then limits it
to a file extent we have mapped. When IO is submitted,
iomap_dio_bio_actor() advances the original iter while the copied iter
is advanced inside bio_iov_iter_get_pages(). This logic is non-obvious
especially because both iters still point to same shared structures
(such as pipe info) so if iov_iter_advance() changes anything in the
shared structure, this scheme breaks. Let's just truncate and reexpand
the original iter as needed instead of playing games with copying iters
and keeping them in sync.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-26 09:28:47 -08:00
Linus Torvalds
436b2a8039 Printk changes for 5.5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl3bpjoACgkQUqAMR0iA
 lPJJDA/+IJT4YCRp2TwV2jvIs0QzvXZrzEsxgCLibLE85mYTJgoQBD3W1bH2eyjp
 T/9U0Zh5PGr/84cHd4qiMxzo+5Olz930weG59NcO4RJBSr671aRYs5tJqwaQAZDR
 wlwaob5S28vUmjPxKulvxv6V3FdI79ZE9xrCOCSTQvz4iCLsGOu+Dn/qtF64pImX
 M/EXzPMBrByiQ8RTM4Ege8JoBqiCZPDG9GR3KPXIXQwEeQgIoeYxwRYakxSmSzz8
 W8NduFCbWavg/yHhghHikMiyOZeQzAt+V9k9WjOBTle3TGJegRhvjgI7508q3tXe
 jQTMGATBOPkIgFaZz7eEn/iBa3jZUIIOzDY93RYBmd26aBvwKLOma/Vkg5oGYl0u
 ZK+CMe+/xXl7brQxQ6JNsQhbSTjT+746LvLJlCvPbbPK9R0HeKNhsdKpGY3ugnmz
 VAnOFIAvWUHO7qx+J+EnOo5iiPpcwXZj4AjrwVrs/x5zVhzwQ+4DSU6rbNn0O1Ak
 ELrBqCQkQzh5kqK93jgMHeWQ9EOUp1Lj6PJhTeVnOx2x8tCOi6iTQFFrfdUPlZ6K
 2DajgrFhti4LvwVsohZlzZuKRm5EuwReLRSOn7PU5qoSm5rcouqMkdlYG/viwyhf
 mTVzEfrfemrIQOqWmzPrWEXlMj2mq8oJm4JkC+jJ/+HsfK4UU8I=
 =QCEy
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow to print symbolic error names via new %pe modifier.

 - Use pr_warn() instead of the remaining pr_warning() calls. Fix
   formatting of the related lines.

 - Add VSPRINTF entry to MAINTAINERS.

* tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (32 commits)
  checkpatch: don't warn about new vsprintf pointer extension '%pe'
  MAINTAINERS: Add VSPRINTF
  tools lib api: Renaming pr_warning to pr_warn
  ASoC: samsung: Use pr_warn instead of pr_warning
  lib: cpu_rmap: Use pr_warn instead of pr_warning
  trace: Use pr_warn instead of pr_warning
  dma-debug: Use pr_warn instead of pr_warning
  vgacon: Use pr_warn instead of pr_warning
  fs: afs: Use pr_warn instead of pr_warning
  sh/intc: Use pr_warn instead of pr_warning
  scsi: Use pr_warn instead of pr_warning
  platform/x86: intel_oaktrail: Use pr_warn instead of pr_warning
  platform/x86: asus-laptop: Use pr_warn instead of pr_warning
  platform/x86: eeepc-laptop: Use pr_warn instead of pr_warning
  oprofile: Use pr_warn instead of pr_warning
  of: Use pr_warn instead of pr_warning
  macintosh: Use pr_warn instead of pr_warning
  idsn: Use pr_warn instead of pr_warning
  ide: Use pr_warn instead of pr_warning
  crypto: n2: Use pr_warn instead of pr_warning
  ...
2019-11-25 19:40:40 -08:00
Linus Torvalds
1b96a41b42 Merge branch 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "There are several notable changes here:

   - Single thread migrating itself has been optimized so that it
     doesn't need threadgroup rwsem anymore.

   - Freezer optimization to avoid unnecessary frozen state changes.

   - cgroup ID unification so that cgroup fs ino is the only unique ID
     used for the cgroup and can be used to directly look up live
     cgroups through filehandle interface on 64bit ino archs. On 32bit
     archs, cgroup fs ino is still the only ID in use but it is only
     unique when combined with gen.

   - selftest and other changes"

* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
  writeback: fix -Wformat compilation warnings
  docs: cgroup: mm: Fix spelling of "list"
  cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
  cgroup: use cgrp->kn->id as the cgroup ID
  kernfs: use 64bit inos if ino_t is 64bit
  kernfs: implement custom exportfs ops and fid type
  kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
  kernfs: convert kernfs_node->id from union kernfs_node_id to u64
  kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
  kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
  netprio: use css ID instead of cgroup ID
  writeback: use ino_t for inodes in tracepoints
  kernfs: fix ino wrap-around detection
  kselftests: cgroup: Avoid the reuse of fd after it is deallocated
  cgroup: freezer: don't change task and cgroups status unnecessarily
  cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
  cgroup: remove cgroup_enable_task_cg_lists() optimization
  cgroup: pids: use atomic64_t for pids->limit
  selftests: cgroup: Run test_core under interfering stress
  selftests: cgroup: Add task migration tests
  ...
2019-11-25 19:23:46 -08:00
Hrvoje Zeba
8042d6ce8c io_uring: remove superfluous check for sqe->off in io_accept()
This field contains a pointer to addrlen and checking to see if it's set
returns -EINVAL if the caller sets addr & addrlen pointers.

Fixes: 17f2fe35d0 ("io_uring: add support for IORING_OP_ACCEPT")
Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
181e448d87 io_uring: async workers should inherit the user creds
If we don't inherit the original task creds, then we can confuse users
like fuse that pass creds in the request header. See link below on
identical aio issue.

Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
576a347b7a io-wq: have io_wq_create() take a 'data' argument
We currently pass in 4 arguments outside of the bounded size. In
preparation for adding one more argument, let's bundle them up in
a struct to make it more readable.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
311ae9e159 io_uring: fix dead-hung for non-iter fixed rw
Read/write requests to devices without implemented read/write_iter
using fixed buffers can cause general protection fault, which totally
hangs a machine.

io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter()
accesses it as iovec, dereferencing random address.

kmap() page by page in this case

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
f8e85cf255 io_uring: add support for IORING_OP_CONNECT
This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.

Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
c4a2ed72c9 io_uring: only return -EBUSY for submit on non-flushed backlog
We return -EBUSY on submit when we have a CQ ring overflow backlog, but
that can be a bit problematic if the application is using pure userspace
poll of the CQ ring. For that case, if the ring briefly overflowed and
we have pending entries in the backlog, the submit flushes the backlog
successfully but still returns -EBUSY. If we're able to fully flush the
CQ ring backlog, let the submission proceed.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
f9bd67f69a io_uring: only !null ptr to io_issue_sqe()
Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's
side. And propagate it.

- kiocb_done() is only called from io_read() and io_write(), which are
only called from io_issue_sqe(), so it's @nxt != NULL

- io_put_req_find_next() is called either with explicitly non-null local
nxt, or from one of the functions in io_issue_sqe() switch (or their
callees).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
b18fdf71e0 io_uring: simplify io_req_link_next()
"if (nxt)" is always true, as it was checked in the while's condition.
io_wq_current_is_worker() is unnecessary, as non-async callers don't
pass nxt, so io_queue_async_work() will be called for them anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
944e58bfed io_uring: pass only !null to io_req_find_next()
Make io_req_find_next() and io_req_link_next() to accept only non-null
nxt, and handle it in callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
70cf9f3270 io_uring: remove io_free_req_find_next()
There is only one one-liner user of io_free_req_find_next(). Inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
9835d6fafb io_uring: add likely/unlikely in io_get_sqring()
The number of SQEs to submit is specified by a user, so io_get_sqring()
in most of the cases succeeds. Hint compilers about that.

Checking ASM genereted by gcc 9.2.0 for x64, there is one branch
misprediction.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Pavel Begunkov
d732447fed io_uring: rename __io_submit_sqe()
__io_submit_sqe() is issuing requests, so call it as
such. Moreover, it ends by calling io_iopoll_req_issued().

Rename it and make terminology clearer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Jens Axboe
915967f69c io_uring: improve trace_io_uring_defer() trace point
We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Pavel Begunkov
1b4a51b6d0 io_uring: drain next sqe instead of shadowing
There's an issue with the shadow drain logic in that we drop the
completion lock after deciding to defer a request, then re-grab it later
and assume that the state is still the same. In the mean time, someone
else completing a request could have found and issued it. This can cause
a stall in the queue, by having a shadow request inserted that nobody is
going to drain.

Additionally, if we fail allocating the shadow request, we simply ignore
the drain.

Instead of using a shadow request, defer the next request/link instead.
This also has the following advantages:

- removes semi-duplicated code
- doesn't allocate memory for shadows
- works better if only the head marked for drain
- doesn't need complex synchronisation

On the flip side, it removes the shadow->seq ==
last_drain_in_in_link->seq optimization. That shouldn't be a common
case, and can always be added back, if needed.

Fixes: 4fe2c96315 ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Jens Axboe
b76da70fc3 io_uring: close lookup gap for dependent next work
When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.

Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Jens Axboe
4d7dd46297 io_uring: allow finding next link independent of req reference count
We currently try and start the next link when we put the request, and
only if we were going to free it. This means that the optimization to
continue executing requests from the same context often fails, as we're
not putting the final reference.

Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the
next request more efficiently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Jens Axboe
eb065d301e io_uring: io_allocate_scq_urings() should return a sane state
We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.

Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
bbad27b2f6 io_uring: Always REQ_F_FREE_SQE for allocated sqe
Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Jens Axboe
5d960724b0 io_uring: io_fail_links() should only consider first linked timeout
We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
09fbb0a83e io_uring: Fix leaking linked timeouts
let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)

2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.

Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
f70193d6d8 io_uring: remove redundant check
Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
d3b35796b1 io_uring: break links for failed defer
If io_req_defer() failed, it needs to cancel a dependant link.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Dan Carpenter
b2e9c7d64b io-wq: remove extra space characters
These lines are indented an extra space character.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
b60fda6000 io-wq: wait for io_wq_create() to setup necessary workers
We currently have a race where if setup is really slow, we can be
calling io_wq_destroy() before we're done setting up. This will cause
the caller to get stuck waiting for the manager to set things up, but
the manager already exited.

Fix this by doing a sync setup of the manager. This also fixes the case
where if we failed creating workers, we'd also get stuck.

In practice this race window was really small, as we already wait for
the manager to start. Hence someone would have to call io_wq_destroy()
after the task has started, but before it started the first loop. The
reported test case forked tons of these, which is why it became an
issue.

Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
Fixes: 771b53d033 ("io-wq: small threadpool implementation for io_uring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
fba38c272a io_uring: request cancellations should break links
We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
b0dd8a4126 io_uring: correct poll cancel and linked timeout expiration completion
Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
e0e328c4b3 io_uring: remove dead REQ_F_SEQ_PREV flag
With the conversion to io-wq, we no longer use that flag. Kill it.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
94ae5e77a9 io_uring: fix sequencing issues with linked timeouts
We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.

Fixes: 2665abfd75 ("io_uring: add support for linked SQE timeouts")
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
ad8a48acc2 io_uring: make req->timeout be dynamically allocated
There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:01 -07:00
Jens Axboe
978db57e2c io_uring: make io_double_put_req() use normal completion path
If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:48:31 -07:00
Jens Axboe
0e0702dac2 io_uring: cleanup return values from the queueing functions
__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:48:31 -07:00
Jens Axboe
95a5bbae05 io_uring: io_async_cancel() should pass in 'nxt' request pointer
If we have a linked request, this enables us to pass it back directly
without having to go through async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:48:31 -07:00
Linus Torvalds
e25645b181 linux-kselftest-5.5-rc1-kunit
This kselftest update for Linux 5.5-rc1 adds KUnit, a lightweight unit
 testing and mocking framework for the Linux kernel from Brendan Higgins.
 
 KUnit is not an end-to-end testing framework. It is currently supported
 on UML and sub-systems can write unit tests and run them in UML env.
 KUnit documentation is included in this update.
 
 In addition, this Kunit update adds 3 new kunit tests:
 
 - kunit test for proc sysctl from Iurii Zaikin
 - kunit test for the 'list' doubly linked list from David Gow
 - ext4 kunit test for decoding extended timestamps from Iurii Zaikin
 
 In the future KUnit will be linked to Kselftest framework to provide
 a way to trigger KUnit tests from user-space.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl3YfBQACgkQCwJExA0N
 QxzTjxAAiFhaDMhlpLhn1DpIUNvfKrIgDjJgajQAyMMs6TJK3OrD6J4WbpVD7wGo
 aqF9l6o64sY18JAo3s00j6AcAmVwNH7qzEEuzIQPjJvQ8C4sCWL3esEP4JHgFb2F
 snlSn5KjSsdC1D9N7uQIhgW76xPSyDrTwWQpglvmB9TwmJVBIl9zhu+unp73ufFJ
 N+ieDg8A6W/wDGYSq5JBSkJbuI0gL+daNwUYzxEEZIskndhpovOc82WAldECRm6x
 TfI0u39zTbrEO0DHgmYpyGbTN8TB2mXjH5HMjwg+KbHfKVTKKGvTK7XFs8mWGQpO
 n2meypZuwuIsRPOPcAVs+Gt2dc0jFODJVIV1EzA0WSv6TEdPqyhM/d13tHdCqjm9
 ic5wQ/hQQNEB1Dvg5ereXBaGGaoqP95y61ZpCS9vCXFXH+28E/B63Ebfs2IBIuqS
 Jv2KcoxabyZq3uGdjnn+mD7IM8rkvscRP4Ba31nXRgJIYDHAzqe7APN7y3on4NGx
 1q7lBlA3XZZ8qgo0zpLST20ck/qaL3tk4k8E1f8emh6CuyrCWtazgrWkMIlyEX0O
 8nre3uEAF9xUzB4+gZK4YmelN9Bld3Uv7Ippt1zTCiQ0FkEABQIMUrTZygy7Wfg6
 6qi4dk8frWW4Kt63gOXsMxr9FWTqDk+Ys4GPAVDVm1d0dzERn8k=
 =0zEe
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest KUnit support gtom Shuah Khan:
 "This adds KUnit, a lightweight unit testing and mocking framework for
  the Linux kernel from Brendan Higgins.

  KUnit is not an end-to-end testing framework. It is currently
  supported on UML and sub-systems can write unit tests and run them in
  UML env. KUnit documentation is included in this update.

  In addition, this Kunit update adds 3 new kunit tests:

   - proc sysctl test from Iurii Zaikin

   - the 'list' doubly linked list test from David Gow

   - ext4 tests for decoding extended timestamps from Iurii Zaikin

  In the future KUnit will be linked to Kselftest framework to provide a
  way to trigger KUnit tests from user-space"

* tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (23 commits)
  lib/list-test: add a test for the 'list' doubly linked list
  ext4: add kunit test for decoding extended timestamps
  Documentation: kunit: Fix verification command
  kunit: Fix '--build_dir' option
  kunit: fix failure to build without printk
  MAINTAINERS: add proc sysctl KUnit test to PROC SYSCTL section
  kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
  MAINTAINERS: add entry for KUnit the unit testing framework
  Documentation: kunit: add documentation for KUnit
  kunit: defconfig: add defconfigs for building KUnit tests
  kunit: tool: add Python wrappers for running KUnit tests
  kunit: test: add tests for KUnit managed resources
  kunit: test: add the concept of assertions
  kunit: test: add tests for kunit test abort
  kunit: test: add support for test abort
  objtool: add kunit_try_catch_throw to the noreturn list
  kunit: test: add initial tests
  lib: enable building KUnit in lib/
  kunit: test: add the concept of expectations
  kunit: test: add assertion printing library
  ...
2019-11-25 15:01:30 -08:00
Linus Torvalds
1c1ff4836f fsverity updates for 5.5
Expose the fs-verity bit through statx().
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXdtWqhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+C9AQCCf8C2KP6DynoGQb9KRYYreJk8js8G
 IgtlhazJ3j1RJAD/VijFbdwbxGCmiR1Y6BhKq5eaCYD1El68wSwkKuNO3ww=
 =7WpU
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "Expose the fs-verity bit through statx()"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  docs: fs-verity: mention statx() support
  f2fs: support STATX_ATTR_VERITY
  ext4: support STATX_ATTR_VERITY
  statx: define STATX_ATTR_VERITY
  docs: fs-verity: document first supported kernel version
2019-11-25 12:21:23 -08:00
Linus Torvalds
ea4b71bc0b fscrypt updates for 5.5
- Add the IV_INO_LBLK_64 encryption policy flag which modifies the
   encryption to be optimized for UFS inline encryption hardware.
 
 - For AES-128-CBC, use the crypto API's implementation of ESSIV (which
   was added in 5.4) rather than doing ESSIV manually.
 
 - A few other cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXdtVMxQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8MVAP44iRzj8ZXu62BhqNOYYcF60s/58QfZ
 Jo1VdmvO/8MNrAD+P/jW5sqzcB5BLdNzS7pLKGIzsC55uMyp/79xyKK8wQc=
 =XKWV
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:

 - Add the IV_INO_LBLK_64 encryption policy flag which modifies the
   encryption to be optimized for UFS inline encryption hardware.

 - For AES-128-CBC, use the crypto API's implementation of ESSIV (which
   was added in 5.4) rather than doing ESSIV manually.

 - A few other cleanups.

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  f2fs: add support for IV_INO_LBLK_64 encryption policies
  ext4: add support for IV_INO_LBLK_64 encryption policies
  fscrypt: add support for IV_INO_LBLK_64 policies
  fscrypt: avoid data race on fscrypt_mode::logged_impl_name
  docs: ioctl-number: document fscrypt ioctl numbers
  fscrypt: zeroize fscrypt_info before freeing
  fscrypt: remove struct fscrypt_ctx
  fscrypt: invoke crypto API for ESSIV handling
2019-11-25 12:19:28 -08:00
Linus Torvalds
ae36607b66 affs-for-5.5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3YC+sACgkQxWXV+ddt
 WDtcJg/+JESwug/t+8+JWoajhwcMtLcQqqdPIWcfGq+9S7kIcKeoNtboO9Axj8d7
 rxubmkk9zHEVgQtA5TWW9kdnNBjTnslHvCvD6m7g6AJGtiNdOofbc/2EaEmDP05f
 6N7ZBOQ+l8jwgeZqRO2H2/Xh+Le9pjchhljcRgJ0eVVgPklSh5e4KrWkqOb4UwCm
 iZCHXPB8AeLxTfGL1yy5wklh4nKZF3DMDaT7+20YyKzNtJjmbS/2WPivr9uWd25D
 QHUeHk+yOJoRICAX4/8xFk+x7FLJTutfAsvwJM90a5+HR2/msyC7ahzIYguPBtaX
 Aa5hlP79ug6UhtB9RIUULjZiBYQS38qhlEmAE7RA4/sYFzbm6AkSQTnOxBennivN
 2+ev64xeDEKNx//pWM52OHyWuDJbGBeBMdr4I8LtYEoyBgQC5kDnecodV81q9c79
 ROTBmjP9gQFxkk3OoP44haky6t3HPqN/jCfFOvo0VxbEyVluRS1BZQ5+IjZIgSFU
 Zg5L9OHA93+PsZsNnas9Zd07YuMBXK3X4g//veWmY/eWursC6cn3IGhsGlxl0X1a
 g07jiZCLElmKV66UTdKqn+ByV1n0TreAjSn1Wr3NtYuBiqTAKh0xrbMlbE6n1db7
 z+yKBRlcGBVvQQpI0AZB29aeeleNvDGDLuz+hE8K4rFFUCUb1zs=
 =NomJ
 -----END PGP SIGNATURE-----

Merge tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull AFFS updates from David Sterba:
 "A minor bugfix and cleanup for AFFS"

* tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  affs: fix a memory leak in affs_remount
  affs: Replace binary semaphores with mutexes
2019-11-25 12:17:58 -08:00
Linus Torvalds
97d0bf96a0 for-5.5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3YCRAACgkQxWXV+ddt
 WDuTuQ/7BOibDKqInm2SsL8xMuZqXjxGXUcHDPio5MbzNJ3wpV0j1KqWWsuK8hi0
 HAhSI3fu3NG7RQYh3nuRO0CaZy3ENiqKoffrSpg7k5DJG0B7Lm/G/970fmOYUp6a
 j6PMNcrKaw+1J3yuljSd20+n6j/hdmfn847ZsSY+7JmZ4zGMJ5GMv3IRipdLFUzR
 tmjWmmCI05sF4/8cI6jzUVq588uSFTO1bGXFugmoO0ztpameudCnYniJI0tDBFSV
 pqk6lqoOPNcaC9nATuA5KKOpUJ9nSscP/St3DV4D6LaZKkT/M5zs12lXPMJx/pKn
 oCHt/A/wBdbDOoy1uHVMWQ9cz9PyVFtU7eSKizcFjoqnHO6fzlnRr9fsmIZKtTw9
 H6nXVmP1S+xJg/zTBxCXHVfZR2dqADUsHWztN1LM8Pen/l9+UMwBeMhq9f9Jz68I
 kF7zWlfLEtNh8naEYf34LkGVMtCNY4PHFsSztPg/jbfsH34xMvetKvPR2s8lejhp
 42YqPHgEh2+8mmVcq65+jl+bPOp/5bdToRtuPiszWiKZSXt/5xplP+5lkSEet0J6
 4aNZ8NRAiZ98br45jdTMUVSo6YtI27SS+GdVOUHPQtPI/kWi9XHx+l3E9MVOUtrd
 lQ1Z9tPinEnJH4kntiCz2sKdzNKE01IagV4wFylz1Ct+ZqF9jNs=
 =JzIp
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "User visible changes:
   - new block group profiles: RAID1 with 3- and 4- copies
       - RAID1 in btrfs has always 2 copies, now add support for 3 and 4
       - this is an incompat feature (named RAID1C34)
       - recommended use of RAID1C3 is replacement of RAID6 profile on
         metadata, this brings a more reliable resiliency against 2
         device loss/damage

   - support for new checksums
       - per-filesystem, set at mkfs time
       - fast hash (crc32c successor): xxhash, 64bit digest
       - strong hashes (both 256bit): sha256 (slower, FIPS), blake2b
         (faster)
       - the blake2b module goes via the crypto tree, btrfs.ko has a
         soft dependency

   - speed up lseek, don't take inode locks unnecessarily, this can
     speed up parallel SEEK_CUR/SEEK_SET/SEEK_END by 80%

   - send:
       - allow clone operations within the same file
       - limit maximum number of sent clone references to avoid slow
         backref walking

   - error message improvements: device scan prints process name and PID

  Core changes:
   - cleanups
       - remove unique workqueue helpers, used to provide a way to avoid
         deadlocks in the workqueue code, now done in a simpler way
       - remove lots of indirect function calls in compression code
       - extent IO tree code moved out of extent_io.c
       - cleanup backup superblock handling at mount time
       - transaction life cycle documentation and cleanups
       - locking code cleanups, annotations and documentation
       - add more cold, const, pure function attributes
       - removal of unused or redundant struct members or variables

   - new tree-checker sanity tests
       - try to detect missing INODE_ITEM, cross-reference checks of
         DIR_ITEM, DIR_INDEX, INODE_REF, and XATTR_* items

   - remove own bio scheduling code (used to avoid checksum submissions
     being stuck behind other IO), replaced by cgroup controller-based
     code to allow better control and avoid priority inversions in cases
     where the custom and cgroup scheduling disagreed

  Fixes:
   - avoid getting stuck during cyclic writebacks

   - fix trimming of ranges crossing block group boundaries

   - fix rename exchange on subvolumes, all involved subvolumes need to
     be recorded in the transaction"

* tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
  btrfs: drop bdev argument from submit_extent_page
  btrfs: remove extent_map::bdev
  btrfs: drop bio_set_dev where not needed
  btrfs: get bdev directly from fs_devices in submit_extent_page
  btrfs: record all roots for rename exchange on a subvol
  Btrfs: fix block group remaining RO forever after error during device replace
  btrfs: scrub: Don't check free space before marking a block group RO
  btrfs: change btrfs_fs_devices::rotating to bool
  btrfs: change btrfs_fs_devices::seeding to bool
  btrfs: rename btrfs_block_group_cache
  btrfs: block-group: Reuse the item key from caller of read_one_block_group()
  btrfs: block-group: Refactor btrfs_read_block_groups()
  btrfs: document extent buffer locking
  btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
  btrfs: set blocking_writers directly, no increment or decrement
  btrfs: merge blocking_writers branches in btrfs_tree_read_lock
  btrfs: drop incompat bit for raid1c34 after last block group is gone
  btrfs: add incompat for raid1 with 3, 4 copies
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add support for 3-copy replication (raid1c3)
  ...
2019-11-25 12:01:49 -08:00
Linus Torvalds
7e5192b93c for-5.5/disk-revalidate-20191122
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YA5sQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplFxEACM7CwrWsullPX6b3j62NW6VepU5JQdzwVW
 S+bmLpb8Z2I4wzEnaVuWAY5hEhGaS9NFtQLdBG0W0YOzH7sweNmL38dZfCE4+oFj
 ZwytpXQQhAQUwkgANJCpNfzDymHduPsTz7RYqRr1plmhna1KC/dnhuMwg8lVOBf5
 myWjqcCHxxoQn6KFqcX9/Azz29ZrgzV28lOnZdiw9yoTjraBmS/ymx4woaa3pc2v
 UNw0Cgx53vHENJzEL9FNSxc0ENZq/bQhpDolnc2AlPGy9+vPg4afMitJb60KTT7r
 HpDcLGkYAIKLrfk8DUmFW8lZhWsxTchXvK2+zwQV7nXMcdUgGN/G3HTIdvWEHFv8
 oGbPB8cfdA2vNC9QAybwWEum/S0H/GfYsBVplNCUCdFXE7yj1cbKD5dPfCyIvmPz
 BjgMae5vH/KoH+vNdZ8NL5oFz2eFC3rLxa/Ss78pcEoBdiiV3WQHPv9MBmn/OQ/v
 CeUAM7omyWpbv3lcByNzIOkeeO3m6Ne28EpEMc2pzLnDPu2btvSyetdO488DE+7O
 MNfApZULVX91W7jWnhM5GR+1SJTdEXZnoxnFV+J/j4deog5vUR7Dt1VkujpUILfL
 7jMl3erF6C53wNrc465z8iLRp1ZM+aTpwatXXRfucNXeomExKK9zF+/+O1ACckUB
 jWDCR9NTcw==
 =e5Lx
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block

Pull disk revalidation updates from Jens Axboe:
 "This continues the work that Jan Kara started to thoroughly cleanup
  and consolidate how we handle rescans and revalidations"

* tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block:
  block: move clearing bd_invalidated into check_disk_size_change
  block: remove (__)blkdev_reread_part as an exported API
  block: fix bdev_disk_changed for non-partitioned devices
  block: move rescan_partitions to fs/block_dev.c
  block: merge invalidate_partitions into rescan_partitions
  block: refactor rescan_partitions
2019-11-25 11:37:01 -08:00
Linus Torvalds
464a47f45d for-5.5/zoned-20191122
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YAiAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsJRD/wNfUGWVdIckw7iiFNuuipKBEy0Nd2VLt0B
 I+pVW/YjDsG2oxWXWPs5Nxc7ca2A8EzRXcWP0xEjBfOCcBh/9mULi1flkLRoWKcq
 v/OuTVif3ATvgJcwNkbMcoi0bYA/VwKi2dWC6ALhDDmZhyMTLeE362oIeOUNNnl6
 GM8CGZHaRfmBzcH5t+WnxiS6rBlt5iwFJ35EvZo3GMXGGiLGlryxEXPAwZrf4haA
 Z4atNinKcNXhb80LWHo23aK3bpnaumwKP4BPuLEyvnjS4iU8SeYTXy+w5yq1BE+h
 HBP5s3no/mPiBAG8b6EZXqOJUGlN596AQfNLu7vCR78tmImZF0jKRFsHEAaKXf+B
 1yRgZi7J+gV0qzK/Ufulg43vItk5/sTzEuV9YLfCpKTr14MFcWw908BAqaI5Kk1K
 e8uGqnb2KbZOLTW4QdPvpWg3eYtqEoluSoZUQ5elHxqQZ4MSZ1lK78FF1TeaW/pw
 sYH+v6rsWoVjEcFSwGoaaOMravzU4MKtavNAZrTJwKZx7qCqkwmi3R1k8WF6KsSV
 rTRAzUC1wpTdSOm1MYPMMKM/h5+BJRSJ/RjljOF4fXLnvpD5q0lequCWjrrEzc6c
 HPRKIgSBq7S620A19QD8UxwvZJ8bOivESqr0bux29v1Vpf7vJBrRMng8nLUrXfJs
 jdma5mK1UA==
 =/G9l
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block

Pull zoned block device update from Jens Axboe:
 "Enhancements and improvements to the zoned device support"

* tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block:
  scsi: sd_zbc: Remove set but not used variable 'buflen'
  block: rework zone reporting
  scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()
  null_blk: Add zone_nr_conv to features
  null_blk: clean up report zones
  null_blk: clean up the block device operations
  block: Remove partition support for zoned block devices
  block: Simplify report zones execution
  block: cleanup the !zoned case in blk_revalidate_disk_zones
  block: Enhance blk_revalidate_disk_zones()
2019-11-25 11:22:37 -08:00
Linus Torvalds
ff6814b078 for-5.5/block-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxrEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuH5D/9qQKfIIuQDUNO4Xx+dIHimTDCrfiEOeO9e
 CRaMuSj+yMxLDMwfX8RnDmR17H3ZVoiIY1CT24U9ZkA5iDjeAH4xmzkH30US7LR7
 /64YVZTxB0OrWppRK8RiIhaJJZDQ6+HPUQsn6PRaLVuFHi2unMoTQnj/ZQKz03QA
 Pl8Xx7qBtH1JwYCzQ21f/uryAcNg9eWabRLN2f1uiOXLmvRxOfh6Z/iaezlaZlmL
 qeJdcdLjjvOgOPwEOfNjfS6pd+XBz3gdEhn0l+11nHITxWZmVBwsWTKyUQlCmKnl
 yuCWDVyx5d6zCnlrLYG0l2Fn2lr9SwAkdkq3YAKV03hA/6s6P9q9bm31VvOf828x
 7gmr4YVz68y7H9bM0QAHCvDpjll0aIEUw6XFzSOCDtZ9B6/pppYQWzMU71J05eyF
 8DOKv2M2EVNLUjf6u0RDyolnWGU0kIjt5ryWE3OsGcezAVa2wYstgUJTKbrn1YgT
 j+4KTpaI+sg8GKDFauvxcSa6gwoRp6jweFNW+7vC090/shXmrGmVLOnQZKRuHho/
 O4W8y/1/deM8CCIAETpiNxA8RV5U/EZygrFGDFc7yzTtVDGHY356M/B4Bmm2qkVu
 K3WgeZp8Fc0lH0QF6Pp9ZlBkZEpGNCAPVsPkXIsxQXbctftkn3KY//uIubfpFEB1
 PpHSicvkww==
 =HYYq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Due to more granular branches, this one is small and will be followed
  with other core branches that add specific features. I meant to just
  have a core and drivers branch, but external dependencies we ended up
  adding a few more that are also core.

  The changes are:

   - Fixes and improvements for the zoned device support (Ajay, Damien)

   - sed-opal table writing and datastore UID (Revanth)

   - blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun)

   - Improvements to the block stats tracking (Pavel)

   - Fix for overruning sysfs buffer for large number of CPUs (Ming)

   - Optimization for small IO (Ming, Christoph)

   - Fix typo in RWH lifetime hint (Eugene)

   - Dead code removal and documentation (Bart)

   - Reduction in memory usage for queue and tag set (Bart)

   - Kerneldoc header documentation (André)

   - Device/partition revalidation fixes (Jan)

   - Stats tracking for flush requests (Konstantin)

   - Various other little fixes here and there (et al)"

* tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits)
  Revert "block: split bio if the only bvec's length is > SZ_4K"
  block: add iostat counters for flush requests
  block,bfq: Skip tracing hooks if possible
  block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64'
  blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1
  block: Don't disable interrupts in trigger_softirq()
  sbitmap: Delete sbitmap_any_bit_clear()
  blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()
  block: split bio if the only bvec's length is > SZ_4K
  block: still try to split bio if the bvec crosses pages
  blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT
  blk-cgroup: reimplement basic IO stats using cgroup rstat
  blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()
  blk-throtl: stop using blkg->stat_bytes and ->stat_ios
  bfq-iosched: stop using blkg->stat_bytes and ->stat_ios
  bfq-iosched: relocate bfqg_*rwstat*() helpers
  block: add zone open, close and finish ioctl support
  block: add zone open, close and finish operations
  block: Simplify REQ_OP_ZONE_RESET_ALL handling
  block: Remove REQ_OP_ZONE_RESET plugging
  ...
2019-11-25 10:59:41 -08:00
Linus Torvalds
fb4b3d3fd0 for-5.5/io_uring-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI
 43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt
 cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz
 hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7
 GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU
 l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY
 wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL
 E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy
 s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO
 TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl
 kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy
 ZfLNkACXqQ==
 =YZ9s
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A lot of stuff has been going on this cycle, with improving the
  support for networked IO (and hence unbounded request completion
  times) being one of the major themes. There's been a set of fixes done
  this week, I'll send those out as well once we're certain we're fully
  happy with them.

  This contains:

   - Unification of the "normal" submit path and the SQPOLL path (Pavel)

   - Support for sparse (and bigger) file sets, and updating of those
     file sets without needing to unregister/register again.

   - Independently sized CQ ring, instead of just making it always 2x
     the SQ ring size. This makes it more flexible for networked
     applications.

   - Support for overflowed CQ ring, never dropping events but providing
     backpressure on submits.

   - Add support for absolute timeouts, not just relative ones.

   - Support for generic cancellations. This divorces io_uring from
     workqueues as well, which additionally gets us one step closer to
     generic async system call support.

   - With cancellations, we can support grabbing the process file table
     as well, just like we do mm context. This allows support for system
     calls that create file descriptors, like accept4() support that's
     built on top of that.

   - Support for io_uring tracing (Dmitrii)

   - Support for linked timeouts. These abort an operation if it isn't
     completed by the time noted in the linke timeout.

   - Speedup tracking of poll requests

   - Various cleanups making the coder easier to follow (Jackie, Pavel,
     Bob, YueHaibing, me)

   - Update MAINTAINERS with new io_uring list"

* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
  io_uring: make POLL_ADD/POLL_REMOVE scale better
  io-wq: remove now redundant struct io_wq_nulls_list
  io_uring: Fix getting file for non-fd opcodes
  io_uring: introduce req_need_defer()
  io_uring: clean up io_uring_cancel_files()
  io-wq: ensure free/busy list browsing see all items
  io-wq: ensure we have a stable view of ->cur_work for cancellations
  io_wq: add get/put_work handlers to io_wq_create()
  io_uring: check for validity of ->rings in teardown
  io_uring: fix potential deadlock in io_poll_wake()
  io_uring: use correct "is IO worker" helper
  io_uring: fix -ENOENT issue with linked timer with short timeout
  io_uring: don't do flush cancel under inflight_lock
  io_uring: flag SQPOLL busy condition to userspace
  io_uring: make ASYNC_CANCEL work with poll and timeout
  io_uring: provide fallback request for OOM situations
  io_uring: convert accept4() -ERESTARTSYS into -EINTR
  io_uring: fix error clear of ->file_table in io_sqe_files_register()
  io_uring: separate the io_free_req and io_free_req_find_next interface
  io_uring: keep io_put_req only responsible for release and put req
  ...
2019-11-25 10:40:27 -08:00
Linus Torvalds
0be0ee7181 vfs: properly and reliably lock f_pos in fdget_pos()
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file.  Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it.  With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()".  We've done that for all the obvious and common
situations, we may need a few more.  Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

 "And I appreciate if people could help at least somehow with "getting
  rid of mixed case entirely" (i.e. always lock f_pos_lock on
  !FMODE_STREAM), because this transition starts to diverge from my
  particular use-case too far. To me it makes sense to do that
  transition as follows:

   - convert nonseekable_open -> stream_open via stream_open.cocci;
   - audit other nonseekable_open calls and convert left users that
     truly don't depend on position to stream_open;
   - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
     will cover pipes and sockets), or maybe convert pipes and sockets
     to FMODE_STREAM manually;
   - extend stream_open.cocci to analyze file_operations that use
     no_llseek or noop_llseek, but do not use nonseekable_open or
     alloc_file_pseudo. This might find files that have stream semantic
     but are opened differently;
   - extend stream_open.cocci to analyze file_operations whose
     .read/.write do not use ppos at all (independently of how file was
     opened);
   - ...
   - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
     !FMODE_STREAM;
   - gather bug reports for deadlocked read/write and convert missed
     cases to FMODE_STREAM, probably extending stream_open.cocci along
     the road to catch similar cases

  i.e. always take f_pos_lock unless a file is explicitly marked as
  being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here.  There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations.  Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-25 10:01:53 -08:00
Jaegeuk Kim
803e74be04 f2fs: stop GC when the victim becomes fully valid
We must stop GC, once the segment becomes fully valid. Otherwise, it can
produce another dirty segments by moving valid blocks in the segment partially.

Ramon hit no free segment panic sometimes and saw this case happens when
validating reliable file pinning feature.

Signed-off-by: Ramon Pantin <pantin@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-25 10:01:28 -08:00
Jaegeuk Kim
a4db59ac90 f2fs: expose main_blkaddr in sysfs
Expose in /sys/fs/f2fs/<blockdev>/main_blkaddr the block address where the
main area starts. This allows user mode programs to determine:

- That pinned files that are made exclusively of fully allocated 2MB
  segments will never be unpinned by the file system.

- Where the main area starts. This is required by programs that want to
  verify if a file is made exclusively of 2MB f2fs segments, the alignment
  boundary for segments starts at this address. Testing for 2MB alignment
  relative to the start of the device is incorrect, because for some
  filesystems main_blkaddr is not at a 2MB boundary relative to the start
  of the device.

The entry will be used when validating reliable pinning file feature proposed
by "f2fs: support aligned pinned file".

Signed-off-by: Ramon Pantin <pantin@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-25 10:01:27 -08:00
Chengguang Xu
909110c060 f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
Setting softlimit larger than hardlimit seems meaningless
for disk quota but currently it is allowed. In this case,
there may be a bit of comfusion for users when they run
df comamnd to directory which has project quota.

For example, we set 20M softlimit and 10M hardlimit of
block usage limit for project quota of test_dir(project id 123).

[root@hades f2fs]# repquota -P -a
*** Report for project quotas on device /dev/nvme0n1p8
Block grace time: 7days; Inode grace time: 7days
Block limits File limits
Project used soft hard grace used soft hard grace
----------------------------------------------------------------------
0 -- 4 0 0 1 0 0
123 +- 10248 20480 10240 2 0 0

The result of df command as below:

[root@hades f2fs]# df -h /mnt/f2fs/test
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p8 20M 11M 10M 51% /mnt/f2fs

Even though it looks like there is another 10M free space to use,
if we write new data to diretory test(inherit project id),
the write will fail with errno(-EDQUOT).

After this patch, the df result looks like below.

[root@hades f2fs]# df -h /mnt/f2fs/test
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p8 10M 10M 0 100% /mnt/f2fs

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-25 10:01:12 -08:00
Linus Torvalds
d8e464ecc1 vfs: mark pipes and sockets as stream-like file descriptors
In commit 3975b097e5 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.

However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.

That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not.  And if they are, we'll
always do them atomically.

This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.

Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call.  And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.

[ Note that the locking is per-file descriptor - because it protects
  "f_pos", which is obviously per-file descriptor - so it only affects
  cases where you literally use the same file descriptor to both read
  and write.

  So a regular pipe that has separate reading and writing file
  descriptors doesn't really have this situation even though it's the
  obvious case of "reader empties what a bit writer concurrently fills"

  But we want to make pipes as being stream-line anyway, because we
  don't want the unnecessary overhead of locking, and because a named
  pipe can be (ab-)used by reading and writing to the same file
  descriptor. ]

There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea.  But pipes and sockets are likely to be the two main
cases.

Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-25 09:12:11 -08:00
Steve French
1656a07a89 cifs: update internal module version number
To 2.24

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25 10:00:02 -06:00
Paulo Alcantara (SUSE)
ff6b6f3f91 cifs: Always update signing key of first channel
Update signing key of first channel whenever generating the master
sigining/encryption/decryption keys rather than only in cifs_mount().

This also fixes reconnect when re-establishing smb sessions to other
servers.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25 09:59:28 -06:00
Paulo Alcantara (SUSE)
5bb30a4dd6 cifs: Fix retrieval of DFS referrals in cifs_mount()
Make sure that DFS referrals are sent to newly resolved root targets
as in a multi tier DFS setup.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Link: https://lkml.kernel.org/r/05aa2995-e85e-0ff4-d003-5bb08bd17a22@canonical.com
Cc: stable@vger.kernel.org
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25 09:36:49 -06:00
Paulo Alcantara (SUSE)
84a1f5b1cc cifs: Fix potential softlockups while refreshing DFS cache
We used to skip reconnects on all SMB2_IOCTL commands due to SMB3+
FSCTL_VALIDATE_NEGOTIATE_INFO - which made sense since we're still
establishing a SMB session.

However, when refresh_cache_worker() calls smb2_get_dfs_refer() and
we're under reconnect, SMB2_ioctl() will not be able to get a proper
status error (e.g. -EHOSTDOWN in case we failed to reconnect) but an
-EAGAIN from cifs_send_recv() thus looping forever in
refresh_cache_worker().

Fixes: e99c63e4d8 ("SMB3: Fix deadlock in validate negotiate hits reconnect")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Suggested-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25 09:33:04 -06:00
Paulo Alcantara (SUSE)
df3df923b3 cifs: Fix lookup of root ses in DFS referral cache
We don't care about module aliasing validation in
cifs_compose_mount_options(..., is_smb3) when finding the root SMB
session of an DFS namespace in order to refresh DFS referral cache.

The following issue has been observed when mounting with '-t smb3' and
then specifying 'vers=2.0':

...
Nov 08 15:27:08 tw kernel: address conversion returned 0 for FS0.WIN.LOCAL
Nov 08 15:27:08 tw kernel: [kworke] ==> dns_query((null),FS0.WIN.LOCAL,13,(null))
Nov 08 15:27:08 tw kernel: [kworke] call request_key(,FS0.WIN.LOCAL,)
Nov 08 15:27:08 tw kernel: [kworke] ==> dns_resolver_cmp(FS0.WIN.LOCAL,FS0.WIN.LOCAL)
Nov 08 15:27:08 tw kernel: [kworke] <== dns_resolver_cmp() = 1
Nov 08 15:27:08 tw kernel: [kworke] <== dns_query() = 13
Nov 08 15:27:08 tw kernel: fs/cifs/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: FS0.WIN.LOCAL to 192.168.30.26
===> Nov 08 15:27:08 tw kernel: CIFS VFS: vers=2.0 not permitted when mounting with smb3
Nov 08 15:27:08 tw kernel: fs/cifs/dfs_cache.c: CIFS VFS: leaving refresh_tcon (xid = 26) rc = -22
...

Fixes: 5072010ccf ("cifs: Fix DFS cache refresher for DFS links")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25 09:25:32 -06:00
Paulo Alcantara (SUSE)
8354d88efd cifs: Fix use-after-free bug in cifs_reconnect()
Ensure we grab an active reference in cifs superblock while doing
failover to prevent automounts (DFS links) of expiring and then
destroying the superblock pointer.

This patch fixes the following KASAN report:

[  464.301462] BUG: KASAN: use-after-free in
cifs_reconnect+0x6ab/0x1350
[  464.303052] Read of size 8 at addr ffff888155e580d0 by task
cifsd/1107

[  464.304682] CPU: 3 PID: 1107 Comm: cifsd Not tainted 5.4.0-rc4+ #13
[  464.305552] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.12.1-0-ga5cab58-rebuilt.opensuse.org 04/01/2014
[  464.307146] Call Trace:
[  464.307875]  dump_stack+0x5b/0x90
[  464.308631]  print_address_description.constprop.0+0x16/0x200
[  464.309478]  ? cifs_reconnect+0x6ab/0x1350
[  464.310253]  ? cifs_reconnect+0x6ab/0x1350
[  464.311040]  __kasan_report.cold+0x1a/0x41
[  464.311811]  ? cifs_reconnect+0x6ab/0x1350
[  464.312563]  kasan_report+0xe/0x20
[  464.313300]  cifs_reconnect+0x6ab/0x1350
[  464.314062]  ? extract_hostname.part.0+0x90/0x90
[  464.314829]  ? printk+0xad/0xde
[  464.315525]  ? _raw_spin_lock+0x7c/0xd0
[  464.316252]  ? _raw_read_lock_irq+0x40/0x40
[  464.316961]  ? ___ratelimit+0xed/0x182
[  464.317655]  cifs_readv_from_socket+0x289/0x3b0
[  464.318386]  cifs_read_from_socket+0x98/0xd0
[  464.319078]  ? cifs_readv_from_socket+0x3b0/0x3b0
[  464.319782]  ? try_to_wake_up+0x43c/0xa90
[  464.320463]  ? cifs_small_buf_get+0x4b/0x60
[  464.321173]  ? allocate_buffers+0x98/0x1a0
[  464.321856]  cifs_demultiplex_thread+0x218/0x14a0
[  464.322558]  ? cifs_handle_standard+0x270/0x270
[  464.323237]  ? __switch_to_asm+0x40/0x70
[  464.323893]  ? __switch_to_asm+0x34/0x70
[  464.324554]  ? __switch_to_asm+0x40/0x70
[  464.325226]  ? __switch_to_asm+0x40/0x70
[  464.325863]  ? __switch_to_asm+0x34/0x70
[  464.326505]  ? __switch_to_asm+0x40/0x70
[  464.327161]  ? __switch_to_asm+0x34/0x70
[  464.327784]  ? finish_task_switch+0xa1/0x330
[  464.328414]  ? __switch_to+0x363/0x640
[  464.329044]  ? __schedule+0x575/0xaf0
[  464.329655]  ? _raw_spin_lock_irqsave+0x82/0xe0
[  464.330301]  kthread+0x1a3/0x1f0
[  464.330884]  ? cifs_handle_standard+0x270/0x270
[  464.331624]  ? kthread_create_on_node+0xd0/0xd0
[  464.332347]  ret_from_fork+0x35/0x40

[  464.333577] Allocated by task 1110:
[  464.334381]  save_stack+0x1b/0x80
[  464.335123]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[  464.335848]  cifs_smb3_do_mount+0xd4/0xb00
[  464.336619]  legacy_get_tree+0x6b/0xa0
[  464.337235]  vfs_get_tree+0x41/0x110
[  464.337975]  fc_mount+0xa/0x40
[  464.338557]  vfs_kern_mount.part.0+0x6c/0x80
[  464.339227]  cifs_dfs_d_automount+0x336/0xd29
[  464.339846]  follow_managed+0x1b1/0x450
[  464.340449]  lookup_fast+0x231/0x4a0
[  464.341039]  path_openat+0x240/0x1fd0
[  464.341634]  do_filp_open+0x126/0x1c0
[  464.342277]  do_sys_open+0x1eb/0x2c0
[  464.342957]  do_syscall_64+0x5e/0x190
[  464.343555]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  464.344772] Freed by task 0:
[  464.345347]  save_stack+0x1b/0x80
[  464.345966]  __kasan_slab_free+0x12c/0x170
[  464.346576]  kfree+0xa6/0x270
[  464.347211]  rcu_core+0x39c/0xc80
[  464.347800]  __do_softirq+0x10d/0x3da

[  464.348919] The buggy address belongs to the object at
ffff888155e58000
                which belongs to the cache kmalloc-256 of size 256
[  464.350222] The buggy address is located 208 bytes inside of
                256-byte region [ffff888155e58000, ffff888155e58100)
[  464.351575] The buggy address belongs to the page:
[  464.352333] page:ffffea0005579600 refcount:1 mapcount:0
mapping:ffff88815a803400 index:0x0 compound_mapcount: 0
[  464.353583] flags: 0x200000000010200(slab|head)
[  464.354209] raw: 0200000000010200 ffffea0005576200 0000000400000004
ffff88815a803400
[  464.355353] raw: 0000000000000000 0000000080100010 00000001ffffffff
0000000000000000
[  464.356458] page dumped because: kasan: bad access detected

[  464.367005] Memory state around the buggy address:
[  464.367787]  ffff888155e57f80: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  464.368877]  ffff888155e58000: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[  464.369967] >ffff888155e58080: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[  464.371111]                                                  ^
[  464.371775]  ffff888155e58100: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  464.372893]  ffff888155e58180: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  464.373983] ==================================================================

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25 09:23:10 -06:00
Ingo Molnar
83bae01182 Merge branch 'timers/urgent' into timers/core, to pick up fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25 15:43:15 +01:00
Jeff Layton
2def865a81 ceph: don't leave ino field in ceph_mds_request_head uninitialized
We currently just pass junk in this field unless we're retransmitting a
create, but in later patches, we'll need a mechanism to pass a delegated
inode number on an initial create request. Prepare for this by ensuring
this field is zeroed out.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-25 11:44:02 +01:00
Jeff Layton
f5946bcc5e ceph: tone down loglevel on ceph_mdsc_build_path warning
When this occurs, it usually means that we raced with a rename, and
there is no need to warn in that case.  Only printk if we pass the
rename sequence check but still ended up with pos < 0.

Either way, this doesn't warrant a KERN_ERR message. Change it to
KERN_WARNING.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-25 11:44:02 +01:00
Xiubo Li
74d6f03019 ceph: fix geting random mds from mdsmap
For example, if we have 5 mds in the mdsmap and the states are:
m_info[5] --> [-1, 1, -1, 1, 1]

If we get a random number 1, then we should get the mds index 3 as
expected, but actually we will get index 2, which the state is -1.

The issue is that the for loop increment will advance past any "up"
MDS that was found during the while loop search.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-25 11:44:02 +01:00