note that conditions surrounding accesses to dname in audit_watch_handle_event()
and audit_mark_handle_event() guarantee that dname won't have been NULL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that in fnsotify_move() and fsnotify_link() we are guaranteed
that dentry->d_name won't change during the fsnotify() evaluation
(by having the parent directory locked exclusive), so we don't
need to fetch dentry->d_name.name in the callers. In fsnotify_dirent()
the same stability of dentry->d_name is also true, but it's a bit
more convoluted - there is one callchain (devpts_pty_new() ->
fsnotify_create() -> fsnotify_dirent()) where the parent is _not_
locked, but on devpts ->d_name of everything is unchanging; it
has neither explicit nor implicit renames.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
note that in the second (RENAME_EXCHANGE) call of fsnotify_move() in
vfs_rename() the old_dentry->d_name is guaranteed to be unchanged
throughout the evaluation of fsnotify_move() (by the fact that the
parent directory is locked exclusive), so we don't need to fetch
old_dentry->d_name.name in the caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's contrary to the normal semantics, only sysv and adfs try to
do that (on any other filesystem you'll get -ENAMETOOLONG instead
of quiet truncation) and nobody actually uses that - it got
accidentally broken 5 years ago and nobody noticed. Time to
bury it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) IS_ERR(p) && PTR_ERR(p) == -E... is spelled p == ERR_PTR(-E...)
2) yes, you can open-code do-while and sometimes there's even
a good reason to do so. Not in this case, though.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No modular uses since introducion of alloc_file_pseudo(),
and the only non-modular user not in alloc_file_pseudo()
had actually been wrong - should've been d_alloc_anon().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
autofs_d_release() can overlap with lockless ->d_manage(),
ending up with autofs_dentry_ino() freed under the latter.
Make freeing autofs_info instances RCU-delayed...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For lockless accesses to dentries we don't have pinned we rely
(among other things) upon having an RCU delay between dropping
the last reference and actually freeing the memory.
On the other hand, for things like pipes and sockets we neither
do that kind of lockless access, nor want to deal with the
overhead of an RCU delay every time a socket gets closed.
So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags
made sure it would happen. We tried to avoid setting it unless
we knew we need it. Unfortunately, that had led to recurring
class of bugs, in which we missed the need to set it.
We only really need it for dentries that are created by
d_alloc_pseudo(), so let's not bother with trying to be smart -
just make having an RCU delay the default. The ones that do
*not* get it set the replacement flag (DCACHE_NORCU) and we'd
better use that sparingly. d_alloc_pseudo() is the only
such user right now.
FWIW, the race that finally prompted that switch had been
between __lock_parent() of immediate subdirectory of what's
currently the root of a disconnected tree (e.g. from
open-by-handle in progress) racing with d_splice_alias()
elsewhere picking another alias for the same inode, either
on outright corrupted fs image, or (in case of open-by-handle
on NFS) that subdirectory having been just moved on server.
It's not easy to hit, so the sky is not falling, but that's
not the first race on similar missed cases and the logics
for settinf DCACHE_RCUACCESS has gotten ridiculously
convoluted.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
What happens there is that we are replacing file->path.mnt of
a file we'd just opened with a clone and we need the write
count contribution to be transferred from original mount to
new one. That's it. We do *NOT* want any kind of freeze
protection for the duration of switchover.
IOW, we should just use __mnt_{want,drop}_write() for that
switchover; no need to bother with mnt_{want,drop}_write()
there.
Tested-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge misc fixes from Andrew Morton:
"22 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
fs: fs_parser: fix printk format warning
checkpatch: add %pt as a valid vsprintf extension
mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
drivers/block/zram/zram_drv.c: fix idle/writeback string compare
mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
mm/memory_hotplug.c: fix notification in offline error path
ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK
fs/proc/kcore.c: make kcore_modules static
include/linux/list.h: fix list_is_first() kernel-doc
mm/debug.c: fix __dump_page when mapping->host is not set
mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
include/linux/hugetlb.h: convert to use vm_fault_t
iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
mm: add support for kmem caches in DMA32 zone
ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
mm/hotplug: fix offline undo_isolate_page_range()
fs/open.c: allow opening only regular files during execve()
mailmap: add Changbin Du
mm/debug.c: add a cast to u64 for atomic64_read()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyeQn8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnqwD/0bqoixqUEicnpvCE8V6eze3HYHK0T8jWtr
32hZXWMihtZpDBq4LXWWJOjHevOP2+NN0uvJDtwhvJAaJM+Xfg/Yh2iPWHYn40rI
tjtVoszBA+w50EyCG8u+JjmYPxdgmwIfowkGiYf7ZJbY8LQqXQQCVzwjjJjbmBAZ
XrbJRPl6HFNGMA4cHoL+beHK5kgKwi+V0LMRNjoigE9J129Co6fyjJRw1cC+IHvP
DPb/Lncjzzuy59fIGXSfRcbs43vHQncLS2DdzsISkTgKlnB52rh7XPlvp2JxvN+N
ReTblAeq2CJAQoSijmPh2/qwhiRm7OWmw54dkE6gRveJUFmjV9u+Pyf1c68kMz83
kGOQqobYuzL95UJYJTxQV4988bqqrnboimjARUGosagcYy0vQHNUnEODlWToZCqO
uGwGfPWALi9CNkfJm5rSH0VcXUytmzm0BHg+haal9LKfHOdgeBQcnex3O1RiBBI2
PLW1sF4VGgpLQuGFwNZM3yVpXhQl7QO8cbN7/qD2xby1Rn/8d/Zk0yCKqONNq9tt
jmQiVvA47DiuOUQWVQduB0qaYn/vYv0uvw6BLMUzPfX9wSG/j1COSGBtl0XmrU5D
a8woZwWyYbu/diqB9QdbWTEoqKfPWQY1NQSafH3FYAkuFVQtdrIFdALdjbwf16Rt
jkWltGv1Fw==
=3chO
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190329' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Small set of fixes that should go into this series. This contains:
- compat signal mask fix for io_uring (Arnd)
- EAGAIN corner case for direct vs buffered writes for io_uring
(Roman)
- NVMe pull request from Christoph with various little fixes
- sbitmap ws_active fix, which caused a perf regression for shared
tags (me)
- sbitmap bit ordering fix (Ming)
- libata on-stack DMA fix (Raymond)"
* tag 'for-linus-20190329' of git://git.kernel.dk/linux-block:
nvmet: fix error flow during ns enable
nvmet: fix building bvec from sg list
nvme-multipath: relax ANA state check
nvme-tcp: fix an endianess miss-annotation
libata: fix using DMA buffers on stack
io_uring: offload write to async worker in case of -EAGAIN
sbitmap: order READ/WRITE freed instance and setting clear bit
blk-mq: fix sbitmap ws_active for shared tags
io_uring: fix big-endian compat signal mask handling
blk-mq: update comment for blk_mq_hctx_has_pending()
blk-mq: use blk_mq_put_driver_tag() to put tag
a small use-after-free fix.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlyeRsgTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8i6B/9wP90ZLGzdAZDIlfWKXjGB1PUrFdeN
WCA5p68Hl7yh1RbY6cvbZcTF5Bo3DhjxjxTFjXHPXLxsARlxbCXon9R6Lo2lDgA4
Bk/W8dcR3onU3nspifG91Him/WnImWB80pyVgZog2PTiwsZJ0rRknXXbRU9ARCpk
8vjg19O4wHwXgtMXAN3vxjQ7v8T8wk8vDb08efPcmMPLDYMaTUL1z2JoqyRfMTbo
OpZoXSjHXqVFfz0mJ5EN7+92eK39oDcQIDSuuqePDCI09ZmrcQd/xSvG5tBfPoXr
1mR3ojkKRURW5RKGClbSoAt90vIuYJH5Cncmemzsr6m4FETH6XthGbJl
=twzl
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to avoid choking on multipage bvecs in the messenger and a
small use-after-free fix"
* tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client:
ceph: fix use-after-free on symlink traversal
libceph: fix breakage caused by multipage bvecs
- Fix a bunch of static checker complaints about uninitialized variables
and insufficient range checks.
- Avoid a crash when incore extent map data are corrupt.
- Disallow FITRIM when we haven't recovered the log and know the
metadata are stale.
- Fix a data corruption when doing unaligned overlapping dio writes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlyaSCwACgkQ+H93GTRK
tOsBSRAAoD6npxZjzApGk7y0y2d+8+/f3BBXdyOHhzg8G/VTcVW+ZQsVibeXEYYm
d02iu3RCQ3AsJVN3Z2FUgAUkf+2duS6QWJH6hL29+fn9aeHb8CYDtlZU9uW6Mf2K
DKuWR3v3aesXEKzL8DVbJa825UWy3fyfggQWvRUvMD+uO/Td2gZEpUSQeBLAUFMZ
4Yj0q1zjWVfi3lcsQDY+gsL3+8hGBD4YldyoX8eUCI78/WMeXzwP4WECNnSBfmM7
Ke63AniGKeAkAMX0PtwiOTITjD6c2Msa9jbriSdUSkX1xnnq5CDbqQHJ7sEefyYT
ff8INci0hL/8kZx63CjrpNZQ5hB5+rIusz2tScmJ/hBnGtAMLg8Duq98ZmQSlSOy
fVV1L+roDGRHO+SEaF4xko2dwMu4iSJmGW50PrXjCJdCgZ7tBaL87k5GQ/W1A0KX
EFje3OPBbGYKHdPdk0TqRoIs2qgOuAYERlLZWcgLLscnOp7XwhgSrvwThV7I7TNB
eu8+xEH7H3V+BHa+OuLgLDFklj1UhyQR8DLKXs/j+DyhD1f5xh6sXVnVhNAZdhbU
OLlgjKT9BkfIsNOgWcjg9SO2EoU/Oi3InDkNz8mSebFpixEG+bvXyguzB+Y2IgA8
8btKHyLOnxJJ1Zb4dnZLFgVWV3QMUip4AlFBXSkzOefDznjGPms=
=iNqS
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.1-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a few fixes for some corruption bugs and uninitialized
variable problems. The few patches here have gone through a few days
worth of fstest runs with no new problems observed.
Changes since last update:
- Fix a bunch of static checker complaints about uninitialized
variables and insufficient range checks.
- Avoid a crash when incore extent map data are corrupt.
- Disallow FITRIM when we haven't recovered the log and know the
metadata are stale.
- Fix a data corruption when doing unaligned overlapping dio writes"
* tag 'xfs-5.1-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: serialize unaligned dio writes against all other dio writes
xfs: prohibit fstrim in norecovery mode
xfs: always init bma in xfs_bmapi_write
xfs: fix btree scrub checking with regards to root-in-inode
xfs: dabtree scrub needs to range-check level
xfs: don't trip over uninitialized buffer on extent read of corrupted inode
Fix printk format warning (seen on i386 builds) by using ptrdiff format
specifier (%t):
fs/fs_parser.c:413:6: warning: format `%lu' expects argument of type `long unsigned int', but argument 3 has type `int' [-Wformat=]
Link: http://lkml.kernel.org/r/19432668-ffd3-fbb2-af4f-1c8e48f6cc81@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
we always grab cluster locks in order of increasing inode number.
Unfortunately, we forget to swap the inode record buffer head pointers
when we've done this, which leads to incorrect bookkeepping when we're
trying to make the two inodes have the same refcount tree.
This has the effect of causing filesystem shutdowns if you're trying to
reflink data from inode 100 into inode 97, where inode 100 already has a
refcount tree attached and inode 97 doesn't. The reflink code decides
to copy the refcount tree pointer from 100 to 97, but uses inode 97's
inode record to open the tree root (which it doesn't have) and blows up.
This issue causes filesystem shutdowns and metadata corruption!
Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia
Fixes: 29ac8e856c ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot is hitting lockdep warning [1] due to trying to open a fifo
during an execve() operation. But we don't need to open non regular
files during an execve() operation, for all files which we will need are
the executable file itself and the interpreter programs like /bin/sh and
ld-linux.so.2 .
Since the manpage for execve(2) says that execve() returns EACCES when
the file or a script interpreter is not a regular file, and the manpage
for uselib(2) says that uselib() can return EACCES, and we use
FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
regular file is requested with FMODE_EXEC set.
Since this deadlock followed by khungtaskd warnings is trivially
reproducible by a local unprivileged user, and syzbot's frequent crash
due to this deadlock defers finding other bugs, let's workaround this
deadlock until we get a chance to find a better solution.
[1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce
Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Reported-by: syzbot <syzbot+e93a80c1bb7c5c56e522461c149f8bf55eab1b2b@syzkaller.appspotmail.com>
Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The marshalling of AFS.StoreData, AFS.StoreData64 and YFS.StoreData64 calls
generated by ->setattr() ops for the purpose of expanding a file is
incorrect due to older documentation incorrectly describing the way the RPC
'FileLength' parameter is meant to work.
The older documentation says that this is the length the file is meant to
end up at the end of the operation; however, it was never implemented this
way in any of the servers, but rather the file is truncated down to this
before the write operation is effected, and never expanded to it (and,
indeed, it was renamed to 'TruncPos' in 2014).
Fix this by setting the position parameter to the new file length and doing
a zero-lengh write there.
The bug causes Xwayland to SIGBUS due to unexpected non-expansion of a file
it then mmaps. This can be tested by giving the following test program a
filename in an AFS directory:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
int main(int argc, char *argv[])
{
char *p;
int fd;
if (argc != 2) {
fprintf(stderr,
"Format: test-trunc-mmap <file>\n");
exit(2);
}
fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC);
if (fd < 0) {
perror(argv[1]);
exit(1);
}
if (ftruncate(fd, 0x140008) == -1) {
perror("ftruncate");
exit(1);
}
p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
exit(1);
}
p[0] = 'a';
if (munmap(p, 4096) < 0) {
perror("munmap");
exit(1);
}
if (close(fd) < 0) {
perror("close");
exit(1);
}
exit(0);
}
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Reported-by: Jonathan Billings <jsbillin@umich.edu>
Tested-by: Jonathan Billings <jsbillin@umich.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Highlights include:
Stable fixes:
- Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()
- fix mount/umount race in nlmclnt.
- NFSv4.1 don't free interrupted slot on open
Bugfixes:
- Don't let RPC_SOFTCONN tasks time out if the transport is connected
- Fix a typo in nfs_init_timeout_values()
- Fix layoutstats handling during read failovers
- fix uninitialized variable warning
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcmobMAAoJEA4mA3inWBJc/7cP+wR19SPLnbPAFnA09LyT2wDu
wZI/y4KYcqGX4kW+ZfhvtR91Zy+UzF685NlbY+kH74JH9Wp9o9DJHW6DC//oxAM5
bzMKH4FIY5IEYN6R554QzHHvIzDzJADgdmjwaSjZyYiNQMJ5xnYClkAWBqU4zG4c
luTLcYg2cHYic/2bYCVI/SvSSH4Rq93MhttxWgmP0yUm2l3ed+r+ZydQiAyxBFRv
0DN8dM7gltHnbOapKVxttmdNpK7EIDlTdUFupiwZMvsm5OCGcLm09DUUE0oE0d+s
bZflhWNtV/0P7zjx0SZTfd3/XKo5PRIzAB2sx4KsqzbnC5kR9fl3royZ0CUgPJYa
n7Bb9PJd8AJV+0FK5cyH3KQwL5UokpU7g1pD7MNxUuIM8iDbpZcOfsiKN/ZWVInJ
E/eot9/D4kaDvTWQ+EmCzb7bI6yjVo6B27KFVC+ZNunfP1hFz+CrybUHpbraMw+7
okvE9x+qCeeHRKTNGhcFTAEjGFPQX6nomS6MyFUXUriKSy29Fiq9kUem1qFFsPxk
c79pYQdu/TUX3sUxjVsOaOr1sS+VJZOrUzGe2/IAZKM86Mu0fQ8W4PTKhqv/ZG+4
oxC4ukHI39cDYcjyUMnpOGgZ3k1w7UcttVKy0fcsfHQJCTfa5kfd+s9mPpCBV3JG
GN9QQkWPLud8uoR/85rR
=d5ft
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.1-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()
- fix mount/umount race in nlmclnt.
- NFSv4.1 don't free interrupted slot on open
Bugfixes:
- Don't let RPC_SOFTCONN tasks time out if the transport is connected
- Fix a typo in nfs_init_timeout_values()
- Fix layoutstats handling during read failovers
- fix uninitialized variable warning"
* tag 'nfs-for-5.1-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: fix uninitialized variable warning
pNFS/flexfiles: Fix layoutstats handling during read failovers
NFS: Fix a typo in nfs_init_timeout_values()
SUNRPC: Don't let RPC_SOFTCONN tasks time out if the transport is connected
NFSv4.1 don't free interrupted slot on open
NFS: fix mount/umount race in nlmclnt.
NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()
XFS applies more strict serialization constraints to unaligned
direct writes to accommodate things like direct I/O layer zeroing,
unwritten extent conversion, etc. Unaligned submissions acquire the
exclusive iolock and wait for in-flight dio to complete to ensure
multiple submissions do not race on the same block and cause data
corruption.
This generally works in the case of an aligned dio followed by an
unaligned dio, but the serialization is lost if I/Os occur in the
opposite order. If an unaligned write is submitted first and
immediately followed by an overlapping, aligned write, the latter
submits without the typical unaligned serialization barriers because
there is no indication of an unaligned dio still in-flight. This can
lead to unpredictable results.
To provide proper unaligned dio serialization, require that such
direct writes are always the only dio allowed in-flight at one time
for a particular inode. We already acquire the exclusive iolock and
drain pending dio before submitting the unaligned dio. Wait once
more after the dio submission to hold the iolock across the I/O and
prevent further submissions until the unaligned I/O completes. This
is heavy handed, but consistent with the current pre-submission
serialization for unaligned direct writes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In case of direct write -EAGAIN will be returned if page cache was
previously populated. To avoid immediate completion of a request
with -EAGAIN error write has to be offloaded to the async worker,
like io_read() does.
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On big-endian architectures, the signal masks are differnet
between 32-bit and 64-bit tasks, so we have to use a different
function for reading them from user space.
io_cqring_wait() initially got this wrong, and always interprets
this as a native structure. This is ok on x86 and most arm64,
but not on s390, ppc64be, mips64be, sparc64 and parisc.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The xfs fstrim implementation uses the free space btrees to find free
space that can be discarded. If we haven't recovered the log, the bnobt
will be stale and we absolutely *cannot* use stale metadata to zap the
underlying storage.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Andreas reported that he was seeing the tdbtorture test fail in some
cases with -EDEADLCK when it wasn't before. Some debugging showed that
deadlock detection was sometimes discovering the caller's lock request
itself in a dependency chain.
While we remove the request from the blocked_lock_hash prior to
reattempting to acquire it, any locks that are blocked on that request
will still be present in the hash and will still have their fl_blocker
pointer set to the current request.
This causes posix_locks_deadlock to find a deadlock dependency chain
when it shouldn't, as a lock request cannot block itself.
We are going to end up waking all of those blocked locks anyway when we
go to reinsert the request back into the blocked_lock_hash, so just do
it prior to checking for deadlocks. This ensures that any lock blocked
on the current request will no longer be part of any blocked request
chain.
URL: https://bugzilla.kernel.org/show_bug.cgi?id=202975
Fixes: 5946c4319e ("fs/locks: allow a lock request to block other requests.")
Cc: stable@vger.kernel.org
Reported-by: Andreas Schneider <asn@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A set of x86 fixes:
- Prevent potential NULL pointer dereferences in the HPET and HyperV
code
- Exclude the GART aperture from /proc/kcore to prevent kernel
crashes on access
- Use the correct macros for Cyrix I/O on Geode processors
- Remove yet another kernel address printk leak
- Announce microcode reload completion as requested by quite some
people. Microcode loading has become popular recently.
- Some 'Make Clang' happy fixlets
- A few cleanups for recently added code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/gart: Exclude GART aperture from kcore
x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
x86/mm/pti: Make local symbols static
x86/cpu/cyrix: Remove {get,set}Cx86_old macros used for Cyrix processors
x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
x86/microcode: Announce reload operation's completion
x86/hyperv: Prevent potential NULL pointer dereference
x86/hpet: Prevent potential NULL pointer dereference
x86/lib: Fix indentation issue, remove extra tab
x86/boot: Restrict header scope to make Clang happy
x86/mm: Don't leak kernel addresses
x86/cpufeature: Fix various quality problems in the <asm/cpu_device_hd.h> header
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlyWPM0ACgkQiiy9cAdy
T1FKCAwAsrZ2WlTdSp5/1Ogwa18vrS5dMHnMipOaZytG6HxBDXPGDzohNQxHbLQK
ShxQIrcPoVmB6WVrzpEYrPGamDETogp+ennVHwngTNDP7TN/U/oSVzBSJ/ZW32uO
w6LSXm3upVNluQLalLhy95xRUhZrt/FkCGp8BkTduR9VObfDtSHouCvsdMl0gXL1
qHZM1LguJsB0ziWNQpeXvFar63NO5bJEBtvP+sc+sGjuoRskE6Bz68GCg3t+Mbp+
73M/iVil8HWMHZELs1JwPslakbp1xDAcz7fjcizrVXcMaW8xqT68rylORC4aY3p3
bNqDTCPRjufE0ktjSi8ld8g9W3W/TbyizBVUDUHYqDzRomx5sx74vz57yCdqyKZX
/MENHMcrQ8OcB1OXKKTtgbh3dEcwSbrKHKuO+0XMz4We1S5Py8qRRTVxissqxWx6
aSeXYDC2XLlsPO9/OY6h08pQ21i7qO8wbuCbmpunaMgnk1vurF0B5Xt8s27FSoEK
M32p50xP
=Q90v
-----END PGP SIGNATURE-----
Merge tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
- two fixes for stable for guest mount problems with smb3.1.1
- two fixes for crediting (SMB3 flow control) on resent requests
- a byte range lock leak fix
- two fixes for incorrect rc mappings
* tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
SMB3: Fix SMB3.1.1 guest mounts to Samba
cifs: Fix slab-out-of-bounds when tracing SMB tcon
cifs: allow guest mounts to work for smb3.11
fix incorrect error code mapping for OBJECTID_NOT_FOUND
cifs: fix that return -EINVAL when do dedupe operation
CIFS: Fix an issue with re-sending rdata when transport returning -EAGAIN
CIFS: Fix an issue with re-sending wdata when transport returning -EAGAIN
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyWVysQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpn5lD/0bEg76kbuwOUy5+FDqOpF0MNOU7xZcYcsI
YkkaKkUi2YQL6NJlkU7AhtPwep+J2sgSnDW9Ho9WIXbsnsO6UF79uIdcix6zJGIl
WnZZ3BLgWeciCfrzFpn3FFZnm/AKJSPWPmllUFvmUYT9GdRgN4ZnHBsS1HTlJ1m5
5HhwLtaYOsZ75NxWBRqWspmtFe+XZ/CrjGgmvIF8FjSuIP2q0RrOmCF1XAA82umd
ehiU1ZtQ+v4FHxmJWjzMWhrCj2y0gmPb+DotIqefFjVnd/G+LrFGMD1fsLoQVFDy
L5VzSOGj1E4KXfDpIeGnz/08dpqXmOkvsSaNnv1U7vA7SCkbodR/BA1EKJrvk5v7
MGkkcQDaU/WzC41RCyVQNWAWjzNLKbruXQ+1HqCx5eh7uthvMQMXDvGf4Jgeq+/E
vGzrEKZ6qI78Vy0mXSy4dfFbFaNTjCkE2jbIG7BQx5zdtnS9/VPXNkpZxPrGLM+P
/fTsLXghU9lKn6WHVtLpQsfJr0OMjyC9JA23pTX2G9MtBhDcyuRs+uCeQgG6cIkl
F15LGuOY7YGYxRsegdinFaoldnHersUDx19c+uFdrB0k0A/A6KeGHuZx7aJPkW1L
M89FkyJr2ZBgc26PvKz6j1Hwl2MKJC5h8TpPES/QnulWh4FbqqH3a501Qa1AQuxC
1me95iy74w==
=l4lx
-----END PGP SIGNATURE-----
Merge tag 'io_uring-20190323' of git://git.kernel.dk/linux-block
Pull io_uring fixes and improvements from Jens Axboe:
"The first five in this series are heavily inspired by the work Al did
on the aio side to fix the races there.
The last two re-introduce a feature that was in io_uring before it got
merged, but which I pulled since we didn't have a good way to have
BVEC iters that already have a stable reference. These aren't
necessarily related to block, it's just how io_uring pins fixed
buffers"
* tag 'io_uring-20190323' of git://git.kernel.dk/linux-block:
block: add BIO_NO_PAGE_REF flag
iov_iter: add ITER_BVEC_FLAG_NO_REF flag
io_uring: mark me as the maintainer
io_uring: retry bulk slab allocs as single allocs
io_uring: fix poll races
io_uring: fix fget/fput handling
io_uring: add prepped flag
io_uring: make io_read/write return an integer
io_uring: use regular request ref counts
The ext4 fstrim implementation uses the block bitmaps to find free space
that can be discarded. If we haven't replayed the journal, the bitmaps
will be stale and we absolutely *cannot* use stale metadata to zap the
underlying storage.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
During a read failover, we may end up changing the value of
the pgio_mirror_idx, so make sure that we record the layout
stats before that update.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Specifying a retrans=0 mount parameter to a NFS/TCP mount, is
inadvertently causing the NFS client to rewrite any specified
timeout parameter to the default of 60 seconds.
Fixes: a956beda19 ("NFS: Allow the mount option retrans=0")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently, we are releasing the indirect buffer where we are done with
it in ext4_ind_remove_space(), so we can see the brelse() and
BUFFER_TRACE() everywhere. It seems fragile and hard to read, and we
may probably forget to release the buffer some day. This patch cleans
up the code by putting of the code which releases the buffers to the
end of the function.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
All indirect buffers get by ext4_find_shared() should be released no
mater the branch should be freed or not. But now, we forget to release
the lower depth indirect buffers when removing space from the same
higher depth indirect block. It will lead to buffer leak and futher
more, it may lead to quota information corruption when using old quota,
consider the following case.
- Create and mount an empty ext4 filesystem without extent and quota
features,
- quotacheck and enable the user & group quota,
- Create some files and write some data to them, and then punch hole
to some files of them, it may trigger the buffer leak problem
mentioned above.
- Disable quota and run quotacheck again, it will create two new
aquota files and write the checked quota information to them, which
probably may reuse the freed indirect block(the buffer and page
cache was not freed) as data block.
- Enable quota again, it will invoke
vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
buffers and pagecache. Unfortunately, because of the buffer of quota
data block is still referenced, quota code cannot read the up to date
quota info from the device and lead to quota information corruption.
This problem can be reproduced by xfstests generic/231 on ext3 file
system or ext4 file system without extent and quota features.
This patch fix this problem by releasing the missing indirect buffers,
in ext4_ind_remove_space().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range via
/proc/kcore results in a kernel crash.
vmcore used to have the same issue, until it was fixed with commit
2a3e83c6f9 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
when attempting to read the aperture region, and so it won't read from the
actual memory.
Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in the
previous vmcore fix. Just with some minor adjustment, rename some functions
for more general usage, and simplify the hook infrastructure a bit as there
is no module usage yet.
Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Dave Young <dyoung@redhat.com>
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
Workaround problem with Samba responses to SMB3.1.1
null user (guest) mounts. The server doesn't set the
expected flag in the session setup response so we have
to do a similar check to what is done in smb3_validate_negotiate
where we also check if the user is a null user (but not sec=krb5
since username might not be passed in on mount for Kerberos case).
Note that the commit below tightened the conditions and forced signing
for the SMB2-TreeConnect commands as per MS-SMB2.
However, this should only apply to normal user sessions and not for
cases where there is no user (even if server forgets to set the flag
in the response) since we don't have anything useful to sign with.
This is especially important now that the more secure SMB3.1.1 protocol
is in the default dialect list.
An earlier patch ("cifs: allow guest mounts to work for smb3.11") fixed
the guest mounts to Windows.
Fixes: 6188f28bf6 ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch fixes the following KASAN report:
[ 779.044746] BUG: KASAN: slab-out-of-bounds in string+0xab/0x180
[ 779.044750] Read of size 1 at addr ffff88814f327968 by task trace-cmd/2812
[ 779.044756] CPU: 1 PID: 2812 Comm: trace-cmd Not tainted 5.1.0-rc1+ #62
[ 779.044760] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
[ 779.044761] Call Trace:
[ 779.044769] dump_stack+0x5b/0x90
[ 779.044775] ? string+0xab/0x180
[ 779.044781] print_address_description+0x6c/0x23c
[ 779.044787] ? string+0xab/0x180
[ 779.044792] ? string+0xab/0x180
[ 779.044797] kasan_report.cold.3+0x1a/0x32
[ 779.044803] ? string+0xab/0x180
[ 779.044809] string+0xab/0x180
[ 779.044816] ? widen_string+0x160/0x160
[ 779.044822] ? vsnprintf+0x5bf/0x7f0
[ 779.044829] vsnprintf+0x4e7/0x7f0
[ 779.044836] ? pointer+0x4a0/0x4a0
[ 779.044841] ? seq_buf_vprintf+0x79/0xc0
[ 779.044848] seq_buf_vprintf+0x62/0xc0
[ 779.044855] trace_seq_printf+0x113/0x210
[ 779.044861] ? trace_seq_puts+0x110/0x110
[ 779.044867] ? trace_raw_output_prep+0xd8/0x110
[ 779.044876] trace_raw_output_smb3_tcon_class+0x9f/0xc0
[ 779.044882] print_trace_line+0x377/0x890
[ 779.044888] ? tracing_buffers_read+0x300/0x300
[ 779.044893] ? ring_buffer_read+0x58/0x70
[ 779.044899] s_show+0x6e/0x140
[ 779.044906] seq_read+0x505/0x6a0
[ 779.044913] vfs_read+0xaf/0x1b0
[ 779.044919] ksys_read+0xa1/0x130
[ 779.044925] ? kernel_write+0xa0/0xa0
[ 779.044931] ? __do_page_fault+0x3d5/0x620
[ 779.044938] do_syscall_64+0x63/0x150
[ 779.044944] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 779.044949] RIP: 0033:0x7f62c2c2db31
[ 779.044955] Code: fe ff ff 48 8d 3d 17 9e 09 00 48 83 ec 08 e8 96 02
02 00 66 0f 1f 44 00 00 8b 05 fa fc 2c 00 48 63 ff 85 c0 75 13 31 c0
0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48
89
[ 779.044958] RSP: 002b:00007ffd6e116678 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 779.044964] RAX: ffffffffffffffda RBX: 0000560a38be9260 RCX: 00007f62c2c2db31
[ 779.044966] RDX: 0000000000002000 RSI: 00007ffd6e116710 RDI: 0000000000000003
[ 779.044966] RDX: 0000000000002000 RSI: 00007ffd6e116710 RDI: 0000000000000003
[ 779.044969] RBP: 00007f62c2ef5420 R08: 0000000000000000 R09: 0000000000000003
[ 779.044972] R10: ffffffffffffffa8 R11: 0000000000000246 R12: 00007ffd6e116710
[ 779.044975] R13: 0000000000002000 R14: 0000000000000d68 R15: 0000000000002000
[ 779.044981] Allocated by task 1257:
[ 779.044987] __kasan_kmalloc.constprop.5+0xc1/0xd0
[ 779.044992] kmem_cache_alloc+0xad/0x1a0
[ 779.044997] getname_flags+0x6c/0x2a0
[ 779.045003] user_path_at_empty+0x1d/0x40
[ 779.045008] do_faccessat+0x12a/0x330
[ 779.045012] do_syscall_64+0x63/0x150
[ 779.045017] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 779.045019] Freed by task 1257:
[ 779.045023] __kasan_slab_free+0x12e/0x180
[ 779.045029] kmem_cache_free+0x85/0x1b0
[ 779.045034] filename_lookup.part.70+0x176/0x250
[ 779.045039] do_faccessat+0x12a/0x330
[ 779.045043] do_syscall_64+0x63/0x150
[ 779.045048] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 779.045052] The buggy address belongs to the object at ffff88814f326600
which belongs to the cache names_cache of size 4096
[ 779.045057] The buggy address is located 872 bytes to the right of
4096-byte region [ffff88814f326600, ffff88814f327600)
[ 779.045058] The buggy address belongs to the page:
[ 779.045062] page:ffffea00053cc800 count:1 mapcount:0 mapping:ffff88815b191b40 index:0x0 compound_mapcount: 0
[ 779.045067] flags: 0x200000000010200(slab|head)
[ 779.045075] raw: 0200000000010200 dead000000000100 dead000000000200 ffff88815b191b40
[ 779.045081] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
[ 779.045083] page dumped because: kasan: bad access detected
[ 779.045085] Memory state around the buggy address:
[ 779.045089] ffff88814f327800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 779.045093] ffff88814f327880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 779.045097] >ffff88814f327900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 779.045099] ^
[ 779.045103] ffff88814f327980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 779.045107] ffff88814f327a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 779.045109] ==================================================================
[ 779.045110] Disabling lock debugging due to kernel taint
Correctly assign tree name str for smb3_tcon event.
Signed-off-by: Paulo Alcantara (SUSE) <paulo@paulo.ac>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix Guest/Anonymous sessions so that they work with SMB 3.11.
The commit noted below tightened the conditions and forced signing for
the SMB2-TreeConnect commands as per MS-SMB2.
However, this should only apply to normal user sessions and not for
Guest/Anonumous sessions.
Fixes: 6188f28bf6 ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
It was mapped to EIO which can be confusing when user space
queries for an object GUID for an object for which the server
file system doesn't support (or hasn't saved one).
As Amir Goldstein suggested this is similar to ENOATTR
(equivalently ENODATA in Linux errno definitions) so
changing NT STATUS code mapping for OBJECTID_NOT_FOUND
to ENODATA.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Amir Goldstein <amir73il@gmail.com>
dedupe_file_range operations is combiled into remap_file_range.
But it's always skipped for dedupe operations in function
cifs_remap_file_range.
Example to test:
Before this patch:
# dd if=/dev/zero of=cifs/file bs=1M count=1
# xfs_io -c "dedupe cifs/file 4k 64k 4k" cifs/file
XFS_IOC_FILE_EXTENT_SAME: Invalid argument
After this patch:
# dd if=/dev/zero of=cifs/file bs=1M count=1
# xfs_io -c "dedupe cifs/file 4k 64k 4k" cifs/file
XFS_IOC_FILE_EXTENT_SAME: Operation not supported
Influence for xfstests:
generic/091
generic/112
generic/127
generic/263
These tests report this error "do_copy_range:: Invalid
argument" instead of "FIDEDUPERANGE: Invalid argument".
Because there are still two bugs cause these test failed.
https://bugzilla.kernel.org/show_bug.cgi?id=202935https://bugzilla.kernel.org/show_bug.cgi?id=202785
Signed-off-by: Xiaoli Feng <fengxiaoli0714@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When sending a rdata, transport may return -EAGAIN. In this case
we should re-obtain credits because the session may have been
reconnected.
Change in v2: adjust_credits before re-sending
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
When sending a wdata, transport may return -EAGAIN. In this case
we should re-obtain credits because the session may have been
reconnected.
Change in v2: adjust_credits before re-sending
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlyTUdgACgkQnJ2qBz9k
QNm8IAgAs38MqUpxZircs/li5fLhFUDr1bELH8gsdwbmBrQST/X5giAk1JFLsga3
2zaWnpjiQAw0K0vfUNYxu5c2V6eo+5gbEL3wwZ2Q4/ORilv36Sbh8KT8nfsMESfz
YKwu27Eek+KDk2y6cRuJTWACPB9ohVoxWiomcerOhJy40/56ctCngDczP6r+dXuy
MGH6GA3zT8IixX1vNv4qzoiDX7fbWAlWaH6Ni66EgGtVhsdRkhsmv70ZLzkMzIqr
oaEXCxRzUk1sy47HDzqEABeFcR0esGzj41NklZ32mqTOst/T1s9PM0Ao3grAX1x/
jUBkV0bHkN/HyNy6bjfSi6ioHnCIkA==
=Fn5R
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"One inotify and one fanotify fix"
* tag 'fsnotify_for_v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: Allow copying of file handle to userspace
inotify: Fix fsnotify_mark refcount leak in inotify_update_existing_watch()
Back in commit a89ca6f24f ("Btrfs: fix fsync after truncate when
no_holes feature is enabled") I added an assertion that is triggered when
an inline extent is found to assert that the length of the (uncompressed)
data the extent represents is the same as the i_size of the inode, since
that is true most of the time I couldn't find or didn't remembered about
any exception at that time. Later on the assertion was expanded twice to
deal with a case of a compressed inline extent representing a range that
matches the sector size followed by an expanding truncate, and another
case where fallocate can update the i_size of the inode without adding
or updating existing extents (if the fallocate range falls entirely within
the first block of the file). These two expansion/fixes of the assertion
were done by commit 7ed586d0a8 ("Btrfs: fix assertion on fsync of
regular file when using no-holes feature") and commit 6399fb5a0b
("Btrfs: fix assertion failure during fsync in no-holes mode").
These however missed the case where an falloc expands the i_size of an
inode to exactly the sector size and inline extent exists, for example:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mount /dev/sdc /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
wrote 1096/1096 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
$ xfs_io -c "falloc 1096 3000" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
Segmentation fault
$ dmesg
[701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
[701253.602962] ------------[ cut here ]------------
[701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
[701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G W 5.0.0-rc8-btrfs-next-45 #1
[701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
(...)
[701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
[701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
[701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
[701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
[701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
[701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
[701253.607769] FS: 00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
[701253.608163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
[701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[701253.609608] Call Trace:
[701253.609994] btrfs_log_inode+0xdfb/0xe40 [btrfs]
[701253.610383] btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
[701253.610770] ? do_raw_spin_unlock+0x49/0xc0
[701253.611150] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[701253.611537] btrfs_sync_file+0x3b2/0x440 [btrfs]
[701253.612010] ? do_sysinfo+0xb0/0xf0
[701253.612552] do_fsync+0x38/0x60
[701253.612988] __x64_sys_fsync+0x10/0x20
[701253.613360] do_syscall_64+0x60/0x1b0
[701253.613733] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[701253.614103] RIP: 0033:0x7f14da4e66d0
(...)
[701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
[701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
[701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
[701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
[701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
(...)
[701253.619941] ---[ end trace e088d74f132b6da5 ]---
Updating the assertion again to allow for this particular case would result
in a meaningless assertion, plus there is currently no risk of logging
content that would result in any corruption after a log replay if the size
of the data encoded in an inline extent is greater than the inode's i_size
(which is not currently possibe either with or without compression),
therefore just remove the assertion.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>