SMB2_tdis() checks if a root handle is valid in order to decide
whether it needs to close the handle or not. However if another
thread has reference for the handle, it may end up with putting
the reference twice. The extra reference that we want to put
during the tree disconnect is the reference that has a directory
lease. So, track the fact that we have a directory lease and
close the handle only in that case.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ran into an intermittent crash in
SMB2_open_init+0x2f6/0x970
due to oparms.cifs_sb not being initialized when called from:
smb2_compound_op+0x45d/0x1690
Zero the whole oparms struct in the compounding path before setting up the
oparms so we don't risk any uninitialized fields.
Fixes: fdef665ba4 ("smb3: fix mode passed in on create for modetosid mount option")
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3sPUUACgkQiiy9cAdy
T1GumAwAhh0Fk2uEV01REMgA6MgQ2hrdGE5HariSTzGifCk8cxMnq1H1u9yxtic8
uvEJQaUmTLWrN2C+xqD2JqPmJyrPOtnL0PLCLQk2/RsPCsDgYnmdKoAehInPh17g
J8MoKPp1/1wYhbOl7CeF0xo2rEchoh/PcPCXpt8qj+M+kBgQkI64UQ/6iY/mV9Zl
n7WJJFDyz3D1+SaJPaVxMpNxZcMpFbGqVJYTWP4v3pL2E8wEhyWjAryLCJAFFGf7
Y2FwOSFuifMN/qC9t83W5KkRT9I/zRQ2g5qK1tC24LiTjQ3cqkCy1SSqpKQyvKwz
P/oRX0HsuIbr1KFzN55kg831m/V7/1B/5bf9AivfhjsAoSyp2yyVQgPeV+nQkO0r
iQdNatohC9HlwXmrypS+GhLXnj8xLnCR4+Aj7hGSuiVLHnCOfnGjQxI40BFWaBli
1RG9agkploMYvcjcgSgDGVFFWTeHgSQKI1DQTL2Nx4py1zj7Rv/kEgwkZ3zdEf9h
PPl37hBM
=gey9
-----END PGP SIGNATURE-----
Merge tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Nine cifs/smb3 fixes:
- one fix for stable (oops during oplock break)
- two timestamp fixes including important one for updating mtime at
close to avoid stale metadata caching issue on dirty files (also
improves perf by using SMB2_CLOSE_FLAG_POSTQUERY_ATTRIB over the
wire)
- two fixes for "modefromsid" mount option for file create (now
allows mode bits to be set more atomically and accurately on create
by adding "sd_context" on create when modefromsid specified on
mount)
- two fixes for multichannel found in testing this week against
different servers
- two small cleanup patches"
* tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: improve check for when we send the security descriptor context on create
smb3: fix mode passed in on create for modetosid mount option
cifs: fix possible uninitialized access and race on iface_list
cifs: Fix lookup of SMB connections on multichannel
smb3: query attributes on file close
smb3: remove unused flag passed into close functions
cifs: remove redundant assignment to pointer pneg_ctxt
fs: cifs: Fix atime update check vs mtime
CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
Pull misc vfs cleanups from Al Viro:
"No common topic, just three cleanups".
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make __d_alloc() static
fs/namespace: add __user to open_tree and move_mount syscalls
fs/fnctl: fix missing __user in fcntl_rw_hint()
- Fix a UAF when reporting writeback errors
- Fix a race condition when handling page uptodate on a blocksize <
pagesize file that is also fragmented
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3pJVwACgkQ+H93GTRK
tOue8RAAklPs+kV4l1JQudPkd/lXIPCqyRKBEjNDQRfj27v5s7h/3fMqMsV6/RIs
zTkrRc/N5Uu6SMnvMIHvTx1nAk6ZnTiN3gUHenoUQgQuxZ4jtGhIvHmyiOGgI27j
Xd2g+aE8wdhZDnD2b+4AsnJmtvffX0g3DNDZ4KJn1d/Ejs2qQIHgeaoyj56Fz8vh
4dqQxSg5RvL25k3qqHSVogsfxSRJxvYJefcARUL25a58UzE6DLJqeb3v41RVLc4l
lDOX/o4lavqo1lmptWjsTn4bqqyeRfNWp2M2EGE4a411aEaNTkFB7xzQ9Y1v04kK
VHq6XK3vieAv6VOuqZ3ySx6ihQcb9dQWXiMkJ9HDVDv4rmaLlrHECM19A5toZN/6
0Xy6xLDqbimckyWDZLBUnkfcoCsI+uWTdNIBxkjEYFdDuL33ovUlKu+Cn63vYzoQ
aCWnNA4NdKOBYps9aKCQV35IU1ODmOYWqkxkpSaVYKApi8Q6+2HUXOf+fXFBiV91
zO0/nWq8RU7fGQxk8YJtMO2E0lGMaMAt03vgp+pMkd0aIo0NWFLV0q3tko5VlWNv
v8BGkphgrOOqfsCFh8wua5x8bIXOBBElOSUjU0wBrAHb0ZXkQ3zi0nUt+sj7RFUX
a+ojaRrXvuyZZArOvQbrHEC0HWchVmvfiyXRAPbxbLi1BiLfPW8=
=+LzI
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
"Fix a race condition and a use-after-free error:
- Fix a UAF when reporting writeback errors
- Fix a race condition when handling page uptodate on fragmented file
with blocksize < pagesize"
* tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: stop using ioend after it's been freed in iomap_finish_ioend()
iomap: fix sub-page uptodate handling
- Fix a crash in the log setup code when log mounting fails
- Fix a hang when allocating space on the realtime device
- Fix a block leak when freeing space on the realtime device
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3n5OkACgkQ+H93GTRK
tOtUvw/+Lxnwx1AaYT7hSMS4p1buJnAV7aM6Ofm+F+GkRISWlxSEYLIjrj3NFxeI
Fv4jnPWMGHtBzW2c4OXv9zhW7vQeuU0Z72sgxA+Wqccf53sfR5Qum+Tya+Bo8X0X
LPeDEuA+k4UUUHvSLqscPFPZYbXgwp3dJ2i7JeuH0vx3BhFeTjV1nlOeo1z3QBzN
LLVjuBg7Zms+TPorrOgS67LAWfzqCAiQDauvyODB5EW+UElK6pvpOklFzc7TUPr2
PIvmjEE8UNN2y9QuEEKJ26t43ejAzG016yGMxyW74i5wU33R7PHkdgG1pWwWRZzr
yvNClg2CMtLu+Bcx8Lc9X23mW0DqPkUYDohWCe/Tytvz5kaKCEq3WlqwaG5EdMOg
gSJlivpoOTduQ26V4fPvD1/fjGpJLWyfHPs9p43wie+K/NuUcTIvr4BGGQszjS/n
5Zr630g6Tq5VrBMl0f1P2NuEbeQEvmbWNTW2TIvvHZTgMd8mZdvX28IXj0dAhBb/
2U5o1NF8F6VeRGYoZFJI70RIfiLYzOQEmsA5hAyJfUQQ18u8zDJuPV4isO4/XjQf
d32E36cvP3CKVQYn7hAiMD5O8jOFckYL287qd4uDYutmleHEcfzc0H4pTW+66IHp
IuzkPCgvOkd4h4qnhtHSDoSlKd7kc1Ai3hBhCdA9zd6IUAVsZ6Y=
=Ps4B
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Fix a couple of resource management errors and a hang:
- fix a crash in the log setup code when log mounting fails
- fix a hang when allocating space on the realtime device
- fix a block leak when freeing space on the realtime device"
* tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix mount failure crash on invalid iclog memory access
xfs: don't check for AG deadlock for realtime files in bunmapi
xfs: fix realtime file data space leak
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.
The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.
When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.
The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by
using UID 0.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJd6qoUAAoJEM9EDqnrzg2+SHQQAI8osluG1xDle0Ur0y/XQrWR
z/1+mA/ZJpkaC2KPJ3F/B93ZR7TSSb6xB/u/EoxfqVQDoVpodP3PzcvSosRsePOk
OYo67xit7YRcg2nQF5kEjR+wYbW/T1j55oQzrWxLvYr+FhlDZLJyn0xuSaGvvuKQ
kWqNwPQpIZwNR1ZJ6Yjif86kR4sWF5htoy976x5ScvoeOb08dNHQn2je5oXH/eKH
zwWBVYTeZTAIVCs9YV2UM4gi5/0pysjSL58jP7+ckLj79ozBoyhc9cRB4ez0cFyc
4+4dW9zZ1GAfvmbsFzvCfKb2Syz4JkStGJQGST+cgH9ldp70R8AdRjzYfZGXa2af
9I/jRgrVBsU/jo++a1npMy2j44+2GvhoValzKePwiCGTOB/f80XsmB9p9qci8JCv
ucVzJwbhjxPKphUpnW8Gg7F2gWr2ULhv+wKRmAb3tF+bIFPjn7KjyzFfUAS3FY1s
iwgci0Mw9NLLlvX511N0wiUGo6V9A9r7XsZQjScmm/3ybUhMyJAYoe81OO60Xwnv
2s+V0Tv9ah4b+EF0J0qtQ7GzsoKDBu+ZWqGieiOXDWTVixY2gV6CetnR7veeSeQh
s9OeqY8qaSYiV9KtBNZp56IS4PuADDgxnRB1pXTUUPgapuElEtvYC1BUovidMMmh
kLQEpYdSGrkLRah4hKsg
=9AOz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs update from Mike Marshall:
"orangefs: posix open permission checking...
Orangefs has no open, and orangefs checks file permissions on each
file access. Posix requires that file permissions be checked on open
and nowhere else. Orangefs-through-the-kernel needs to seem posix
compliant.
The VFS opens files, even if the filesystem provides no method. We can
see if a file was successfully opened for read and or for write by
looking at file->f_mode.
When writes are flowing from the page cache, file is no longer
available. We can trust the VFS to have checked file->f_mode before
writing to the page cache.
The mode of a file might change between when it is opened and IO
commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by using
UID 0"
[ This is "posixish", but not a great solution in the long run, since a
proper secure network server shouldn't really trust the client like this.
But proper and secure POSIX behavior requires an open method and a
resulting cookie for IO of some kind, or similar. - Linus ]
* tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: posix open permission checking...
Possibly most interesting is Trond's fixes for some callback races that
were due to my incomplete understanding of rpc client shutdown.
Unfortunately at the last minute I've started noticing a new
intermittent failure to send callbacks. As the logic seems basically
correct, I'm leaving Trond's patches in for now, and hope to find a fix
in the next week so I don't have to revert those patches.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl3r3AAVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rjkP/3L6DZs0Uv0BYbGq5Gmit0uoPSQk
8BT7oQhbagCh+ULRYWCnK6cz82wejR4Gzq4PLyl5x5Vcc5x+bLoPI9YgiRZlIbZu
ZvSg93E6SITLfq5xRlDC0MlIVZkI+HoIfyYgv1aYiWvQ3834bcx4DxVm9h7cNpT3
x37anEFi1lv3n9fct3obOrs3AvCS76XyA6VVhcSLJ77amKQ+O7LI0crqUc6cuX2i
CkTwTSDwyCrzkx3dZ2xDPDTbLecxw+Ce4adaby5v3GEQo3TOCmEWX92D3dvzfMmv
ICU07FsVOILnIT/fmC91b1+JWVRLjUUBw5EPmDduwSP/yw4YnIEODFEP/wAUAmMJ
vJ9hi9c1rThQ9n8h08RIwA2snhnpXRxKCWhpIRY6WM8DhHL9Y9AuVPYTKxhQOjPK
l3wbOGcMW63NrTOPHHN7hTB0vDLgPKIXYVIrMvZTd/P7CghDDEbhT1gDvx/IL3Uq
WrHKbJtK7rbx9i2bh5f6fH0DRrv7lxbD0ffunRRa3twPAe6zsG9WPjsbZZraZzEg
O7/o3wZu2N7MpL5bXPfzB+5ylOTxvNWew07NJjA4BIOfwin3bw/71YfB0Vnoairv
PhmbN2Dj4/t82ld0JU5GJWojpUfH4ARXM2Li9WO99wzx+KrxScsqGPnRMFe9dC7b
Q7ltP1p0gUbkJ88Z
=b2zA
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"This is a relatively quiet cycle for nfsd, mainly various bugfixes.
Possibly most interesting is Trond's fixes for some callback races
that were due to my incomplete understanding of rpc client shutdown.
Unfortunately at the last minute I've started noticing a new
intermittent failure to send callbacks. As the logic seems basically
correct, I'm leaving Trond's patches in for now, and hope to find a
fix in the next week so I don't have to revert those patches"
* tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
nfsd: depend on CRYPTO_MD5 for legacy client tracking
NFSD fixing possible null pointer derefering in copy offload
nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
nfsd: Ensure CLONE persists data and metadata changes to the target file
SUNRPC: Fix backchannel latency metrics
nfsd: restore NFSv3 ACL support
nfsd: v4 support requires CRYPTO_SHA256
nfsd: Fix cld_net->cn_tfm initialization
lockd: remove __KERNEL__ ifdefs
sunrpc: remove __KERNEL__ ifdefs
race in exportfs_decode_fh()
nfsd: Drop LIST_HEAD where the variable it declares is never used.
nfsd: document callback_wq serialization of callback code
nfsd: mark cb path down on unknown errors
nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
nfsd: minor 4.1 callback cleanup
SUNRPC: Fix svcauth_gss_proxy_init()
SUNRPC: Trace gssproxy upcall results
sunrpc: fix crash when cache_head become valid before update
nfsd: remove private bin2hex implementation
...
Highlights include:
Features:
- NFSv4.2 now supports cross device offloaded copy (i.e. offloaded copy
of a file from one source server to a different target server).
- New RDMA tracepoints for debugging congestion control and Local Invalidate
WRs.
Bugfixes and cleanups
- Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
layoutreturn
- Handle bad/dead sessions correctly in nfs41_sequence_process()
- Various bugfixes to the delegation return operation.
- Various bugfixes pertaining to delegations that have been revoked.
- Cleanups to the NFS timespec code to avoid unnecessary conversions
between timespec and timespec64.
- Fix unstable RDMA connections after a reconnect
- Close race between waking an RDMA sender and posting a receive
- Wake pending RDMA tasks if connection fails
- Fix MR list corruption, and clean up MR usage
- Fix another RPCSEC_GSS issue with MIC buffer space
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl3qp8QACgkQZwvnipYK
APIZfBAAuFhLUA2Ua9OQOPDJDkQ1IFDBfYGG48aqVu3GIXS9LkEvTavLm/P9ocm+
ijGsUv2iw4x9H4S7OGuzLQm5zmTNsQAlPXD+3+xQS7cjPjh5HCyIAEgpov+JEGae
CeZoSvhtdBd0xB71t2zAKEdHkqc47Jxz3Db0FX22zTTnDvdhArfggisZUt4Xq5Qb
cPcs8R1E5yBZqJFHKObOUP4itVYsXte/VFhtWpjRFqzaZ/t7xNpPVOBH8cli7aI9
E6DqdbIjUreyn62FVWYIeGhwvsKdxv+Slc5ZOEbD45jUryovyCAZxhqDmcAg/0q0
uykplL0cv8MeiZ68wmlxdir/n36hWduiGqa0UKMg2+BAbdudGKJ7xPhkGYP2uZqo
zoZGjd+Hl8AunMBUaT7YAxWOzuIXeMP338szTL6sSBPxT75WmmNJAh3J4b22G7Bl
eGrcJcckDBnvfRCia40l8g9NLHmVKqS9qNKxSWMlMlBmwd1HE0oEE1ddCx9bGHKe
srf0S14RPQBRF6r+Nv0cx5S+CiptDtGiILR+cn5ZDra5YYCPX5kkJ6VEqw/m4yNE
AKjjj5gim+jWYdBOTMU3u5KNNqFx37xnOCdC+5DvhMNWRHf2O/I5JSKtuKaZht+5
PEuwcYfQvaZGp3fCEh38zzOX2qWUhRMbXUqSv5F0DbuWK7OAABQ=
=VZFk
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.2 now supports cross device offloaded copy (i.e. offloaded
copy of a file from one source server to a different target
server).
- New RDMA tracepoints for debugging congestion control and Local
Invalidate WRs.
Bugfixes and cleanups
- Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
layoutreturn
- Handle bad/dead sessions correctly in nfs41_sequence_process()
- Various bugfixes to the delegation return operation.
- Various bugfixes pertaining to delegations that have been revoked.
- Cleanups to the NFS timespec code to avoid unnecessary conversions
between timespec and timespec64.
- Fix unstable RDMA connections after a reconnect
- Close race between waking an RDMA sender and posting a receive
- Wake pending RDMA tasks if connection fails
- Fix MR list corruption, and clean up MR usage
- Fix another RPCSEC_GSS issue with MIC buffer space"
* tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
SUNRPC: Capture completion of all RPC tasks
SUNRPC: Fix another issue with MIC buffer space
NFS4: Trace lock reclaims
NFS4: Trace state recovery operation
NFSv4.2 fix memory leak in nfs42_ssc_open
NFSv4.2 fix kfree in __nfs42_copy_file_range
NFS: remove duplicated include from nfs4file.c
NFSv4: Make _nfs42_proc_copy_notify() static
NFS: Fallocate should use the nfs4_fattr_bitmap
NFS: Return -ETXTBSY when attempting to write to a swapfile
fs: nfs: sysfs: Remove NULL check before kfree
NFS: remove unneeded semicolon
NFSv4: add declaration of current_stateid
NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list
SUNRPC: Avoid RPC delays when exiting suspend
NFS: Add a tracepoint in nfs_fh_to_dentry()
NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
...
We had cases in the previous patch where we were sending the security
descriptor context on SMB3 open (file create) in cases when we hadn't
mounted with with "modefromsid" mount option.
Add check for that mount flag before calling ad_sd_context in
open init.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
pipe_wait() may be simple, but since it relies on the pipe lock, it
means that we have to do the wakeup while holding the lock. That's
unfortunate, because the very first thing the waked entity will want to
do is to get the pipe lock for itself.
So get rid of the pipe_wait() usage by simply releasing the pipe lock,
doing the wakeup (if required) and then using wait_event_interruptible()
to wait on the right condition instead.
wait_event_interruptible() handles races on its own by comparing the
wakeup condition before and after adding itself to the wait queue, so
you can use an optimistic unlocked condition for it.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This code is ancient, and goes back to when we only had a single page
for the pipe buffers. The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).
At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.
However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense. And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").
But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the read side version of the previous commit: it simplifies the
logic to only wake up waiting writers when necessary, and makes sure to
use a synchronous wakeup. This time not so much for GNU make jobserver
reasons (that pipe never fills up), but simply to get the writer going
quickly again.
A bit less verbose commentary this time, if only because I assume that
the write side commentary isn't going to be ignored if you touch this
code.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pipe rework ends up having been extra painful, partly becaused of
actual bugs with ordering and caching of the pipe state, but also
because of subtle performance issues.
In particular, the pipe rework caused the kernel build to inexplicably
slow down.
The reason turns out to be that the GNU make jobserver (which limits the
parallelism of the build) uses a pipe to implement a "token" system: a
parallel submake will read a character from the pipe to get the job
token before starting a new job, and will write a character back to the
pipe when it is done. The overall job limit is thus easily controlled
by just writing the appropriate number of initial token characters into
the pipe.
But to work well, that really means that the old behavior of write
wakeups being synchronous (WF_SYNC) is very important - when the pipe
writer wakes up a reader, we want the reader to actually get scheduled
immediately. Otherwise you lose the parallelism of the build.
The pipe rework lost that synchronous wakeup on write, and we had
clearly all forgotten the reasons and rules for it.
This rewrites the pipe write wakeup logic to do the required Wsync
wakeups, but also clarifies the logic and avoids extraneous wakeups.
It also ends up addign a number of comments about what oit does and why,
so that we hopefully don't end up forgetting about this next time we
change this code.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on. That avoids the races with the event you're waiting for.
The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.
Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.
So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The legacy client tracking infrastructure of nfsd makes use of MD5 to
derive a client's recovery directory name. As the nfsd module doesn't
declare any dependency on CRYPTO_MD5, though, it may fail to allocate
the hash if the kernel was compiled without it. As a result, generation
of client recovery directories will fail with the following error:
NFSD: unable to generate recoverydir name
The explicit dependency on CRYPTO_MD5 was removed as redundant back in
6aaa67b5f3 (NFSD: Remove redundant "select" clauses in fs/Kconfig
2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5.
This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit
df486a2590 (NFS: Fix the selection of security flavours in Kconfig) at
a later point.
Fix the issue by adding back an explicit dependency on CRYPTO_MD5.
Fixes: df486a2590 (NFS: Fix the selection of security flavours in Kconfig)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Static checker revealed possible error path leading to possible
NULL pointer dereferencing.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e0639dc580: ("NFSD introduce async copy feature")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix the iteration end check in fuse_dev_splice_write(). The iterator
position can only be compared with == or != since wrappage may be involved.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similarly to commit 8f868d68d3 ("pipe: Fix missing mask update after
pipe_wait()") this fixes a case where the pipe rewrite ended up caching
the pipe state incorrectly over a pipe lock drop event.
It wasn't quite as obvious, because you needed to splice data from a
pipe to a file, which is a fairly unusual operation, but it's completely
wrong.
Make sure we load the pipe head/tail/size information only after we've
waited for there to be data in the pipe.
While in that file, also make one of the splice helper functions use the
canonical arghument order for pipe_empty(). That's syntactic - pipe
emptiness is just that head and tail are equal, and thus mixing up head
and tail doesn't really matter. It's still wrong, though.
Reported-by: David Sterba <dsterba@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using the special SID to store the mode bits in an ACE (See
http://technet.microsoft.com/en-us/library/hh509017(v=ws.10).aspx)
which is enabled with mount parm "modefromsid" we were not
passing in the mode via SMB3 create (although chmod was enabled).
SMB3 create allows a security descriptor context to be passed
in (which is more atomic and thus preferable to setting the mode
bits after create via a setinfo).
This patch enables setting the mode bits on create when using
modefromsid mount option. In addition it fixes an endian
error in the definition of the Control field flags in the SMB3
security descriptor. It also makes the ACE type of the special
SID better match the documentation (and behavior of servers
which use this to store mode bits in SMB3 ACLs).
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3puL4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjOJD/4vV2ENSYcwZ7Qezgn9tx5NEdADN6jC0DXe
tJDnA0sDLfLTlVM9I1U2/wD6Wu31cvV2dHKmxO+WWWRP2M0orI00UA3I6BssVptY
42e/YJd13IM+iVrhfhm+hmcAHvQUmWHTPZg/7F7OZPkEtNcbCjn6s+HTyzu9gqtt
/kbJVDJ75TR9dEWCPY/A4jUcJYLw2leoHI4u2eqYRsTFNJHUQoaFUHDyNHyj7UNI
kIUi2UixJv+a4pkFvyLPKhJcLr6DBC2TeUnfP2Re5oX1Z/XkDR7iX+L5dUwRHHbT
4M7aJJ+mHDm8Z4IwwBqsSDRU5wUpzuplwBQBY/EQilUZAyOALVzUQABlRa0zxH8t
0D4U6LDDOopYeK0by/FGdD88S6Z0LKm0HJETbdPaGd9fqSlhBn19iGeBKYk0cCIu
Uecm9DKF+QLfKhTbN6h8nPyR3bfkqyUptQJ1UnheWo9f6L32bfp19sdI9LdtRUnS
k661QApajaYQu/641u9CDZiI+DvI+57Wm9I4eCF3GXqOYWPAxSgWP7rJVgS+mfA3
wo99/r11SruaA1Wqf+bOoN/wjsQCiUPFa8po8sh05ER4MH5Brb5EFlg04ZlQZkR9
jOdiL8ISM9t86eyKwtR4PanE7/pPEYUjEWm4ntCC6ViTwCvgV3d6s2We6QPw1j2l
nQ9p/c2bhg==
=7ZFw
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block
Pull more block and io_uring updates from Jens Axboe:
"I wasn't expecting this to be so big, and if I was, I would have used
separate branches for this. Going forward I'll be doing separate
branches for the current tree, just like for the next kernel version
tree. In any case, this contains:
- Series from Christoph that fixes an inherent race condition with
zoned devices and revalidation.
- null_blk zone size fix (Damien)
- Fix for a regression in this merge window that caused busy spins by
sending empty disk uevents (Eric)
- Fix for a regression in this merge window for bfq stats (Hou)
- Fix for io_uring creds allocation failure handling (me)
- io_uring -ERESTARTSYS send/recvmsg fix (me)
- Series that fixes the need for applications to retain state across
async request punts for io_uring. This one is a bit larger than I
would have hoped, but I think it's important we get this fixed for
5.5.
- connect(2) improvement for io_uring, handling EINPROGRESS instead
of having applications needing to poll for it (me)
- Have io_uring use a hash for poll requests instead of an rbtree.
This turned out to work much better in practice, so I think we
should make the switch now. For some workloads, even with a fair
amount of cancellations, the insertion sort is just too expensive.
(me)
- Various little io_uring fixes (me, Jackie, Pavel, LimingWu)
- Fix for brd unaligned IO, and a warning for the future (Ming)
- Fix for a bio integrity data leak (Justin)
- bvec_iter_advance() improvement (Pavel)
- Xen blkback page unmap fix (SeongJae)
The major items in here are all well tested, and on the liburing side
we continue to add regression and feature test cases. We're up to 50
topic cases now, each with anywhere from 1 to more than 10 cases in
each"
* tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits)
block: fix memleak of bio integrity data
io_uring: fix a typo in a comment
bfq-iosched: Ensure bio->bi_blkg is valid before using it
io_uring: hook all linked requests via link_list
io_uring: fix error handling in io_queue_link_head
io_uring: use hash table for poll command lookups
io-wq: clear node->next on list deletion
io_uring: ensure deferred timeouts copy necessary data
io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
brd: warn on un-aligned buffer
brd: remove max_hw_sectors queue limit
xen/blkback: Avoid unmapping unmapped grant pages
io_uring: handle connect -EINPROGRESS like -EAGAIN
block: set the zone size in blk_revalidate_disk_zones atomically
block: don't handle bio based drivers in blk_revalidate_disk_zones
block: allocate the zone bitmaps lazily
block: replace seq_zones_bitmap with conv_zones_bitmap
block: simplify blkdev_nr_zones
block: remove the empty line at the end of blk-zoned.c
...
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
"Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
Basically, the problem is that negative pinned dentries require
careful treatment - unless ->d_lock is locked or parent is held at
least shared, another thread can make them positive right under us.
Most of the uses turned out to be safe - the main surprises as far as
filesystems are concerned were
- race in dget_parent() fastpath, that might end up with the caller
observing the returned dentry _negative_, due to insufficient
barriers. It is positive in memory, but we could end up seeing the
wrong value of ->d_inode in CPU cache. Fixed.
- manual checks that result of lookup_one_len_unlocked() is positive
(and rejection of negatives). Again, insufficient barriers (we
might end up with inconsistent observed values of ->d_inode and
->d_flags). Fixed by switching to a new primitive that does the
checks itself and returns ERR_PTR(-ENOENT) instead of a negative
dentry. That way we get rid of boilerplate converting negatives
into ERR_PTR(-ENOENT) in the callers and have a single place to
deal with the barrier-related mess - inside fs/namei.c rather than
in every caller out there.
The guts of pathname resolution *do* need to be careful - the race
found by Ritesh is real, as well as several similar races.
Fortunately, it turns out that we can take care of that with fairly
local changes in there.
The tree-wide audit had not been fun, and I hate the idea of repeating
it. I think the right approach would be to annotate the places where
we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
catch regressions. But I'm still not sure what would be the least
invasive way of doing that and it's clearly the next cycle fodder"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/namei.c: fix missing barriers when checking positivity
fix dget_parent() fastpath race
new helper: lookup_positive_unlocked()
fs/namei.c: pull positivity check into follow_managed()
Pull autofs updates from Al Viro:
"autofs misuses checks for ->d_subdirs emptiness; the cursors are in
the same lists, resulting in false negatives. It's not needed anyway,
since autofs maintains counter in struct autofs_info, containing 0 for
removed ones, 1 for live symlinks and 1 + number of children for live
directories, which is precisely what we need for those checks.
This series switches to use of that counter and untangles the crap
around its uses (it needs not be atomic and there's a bunch of
completely pointless "defensive" checks).
This fell out of dcache_readdir work; the main point is to get rid of
->d_subdirs abuses in there. I've more followup cleanups, but I hadn't
run those by Ian yet, so they can go next cycle"
* 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
autofs: don't bother with atomics for ino->count
autofs_dir_rmdir(): check ino->count for deciding whether it's empty...
autofs: get rid of pointless checks around ->count handling
autofs_clear_leaf_automount_flags(): use ino->count instead of ->d_subdirs
Merge two fixes for the pipe rework from David Howells:
"Here are a couple of patches to fix bugs syzbot found in the pipe
changes:
- An assertion check will sometimes trip when polling a pipe because
the ring size and indices used are approximate and may be being
changed simultaneously.
An equivalent approximate calculation was done previously, but
without the assertion check, so I've just dropped the check. To
make it accurate, the pipe mutex would need to be taken or the spin
lock could be used - but usage of the spinlock would need to be
rolled out into splice, iov_iter and other places for that.
- The index mask and the max_usage values cannot be cached across
pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
This can cause pipe_write() to break"
* pipe-rework:
pipe: Fix missing mask update after pipe_wait()
pipe: Remove assertion from pipe_poll()
Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter
function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them.
Without this, pipe_write() may subsequently miscalculate the array
indices and pipe fullness, leading to an oops like the following:
BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
...
CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
...
Call Trace:
pipe_write+0xc25/0xe10 fs/pipe.c:481
call_write_iter include/linux/fs.h:1895 [inline]
new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
__vfs_write+0x94/0x110 fs/read_write.c:496
vfs_write+0x18a/0x520 fs/read_write.c:558
ksys_write+0x105/0x220 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__x64_sys_write+0x6e/0xb0 fs/read_write.c:620
do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is not a problem for pipe_read() as the mask is recalculated on
each pass of the loop, after pipe_wait() has been called.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@kernel.org>
[ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size. However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ
to intervene and muck up the calculation, thereby causing the oops.
Fix this by simply removing the assertion and accepting that the
calculation might be approximate.
Note that the previous code also had a similar issue, though there was
no assertion check, since the occupancy counter and the ring size were
not read with a lock held, so it's possible that the poll check might
have malfunctioned then too.
Also wake up all the waiters so that they can reissue their checks if
there was a competing read or write.
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bob's extensive filesystem withdrawal and recovery testing:
- Don't write log headers after file system withdraw
- clean up iopen glock mess in gfs2_create_inode
- Close timing window with GLF_INVALIDATE_IN_PROGRESS
- Abort gfs2_freeze if io error is seen
- Don't loop forever in gfs2_freeze if withdrawn
- fix infinite loop in gfs2_ail1_flush on io error
- Introduce function gfs2_withdrawn
- fix glock reference problem in gfs2_trans_remove_revoke
Filesystems with a block size smaller than the page size:
- Fix end-of-file handling in gfs2_page_mkwrite
- Improve mmap write vs. punch_hole consistency
Other:
- Remove active journal side effect from gfs2_write_log_header
- Multi-block allocations in gfs2_page_mkwrite
Minor cleanups and coding style fixes:
- Remove duplicate call from gfs2_create_inode
- make gfs2_log_shutdown static
- make gfs2_fs_parameters static
- Some whitespace cleanups
- removed unnecessary semicolon
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJd6VQ1AAoJENW/n+sDE2U6tUsP/2Zd64dVA+2KzpfvklAUNpx/
TENbvRqGVvhofxuScY0oAvcg0KZN03K/L6ZCa71e7RI7T3VM/FG3rKNCkm084a9d
r+Rxcz+eSkce5qKVNva6CtQiczGAj27iNiho0Y9IgtzEZ5KEVJuY3cV8Z2qzHrQM
Rel2aVpUtaLhbFOj3jLGMt/HHSw8RTTzNqtJJCwRys/tVF1WPVqyNg4PD9a7zMT7
z96tqrlQQxFT4SGiZVJwQHFGuQZEnbr2ahNRmivmGtnNnawLxpEdFuFrSAsC73UB
wHO0Dq+7vYVTyQ7HugrqdkXxqyQr5ta06Pep7uj8ZvhoLWZvPuBf2SccpIn9Ufvo
9iLFY5Z9cHg6wpsW+YMG75Mz0A6WbPRIScVog0fxaKW+z0vMZOmsT6hT6llAlCzn
oj1igqVEOIBTS+4uMDIOJKvMixo4NTdgsLFQyftUxNHiCw5iGbqkV7ux31YjqX22
A830zv8lba44BsixGtuPEy/0Dnka7rMfRp0cflCzKESSLIdXtSjUSEtS6g0fJISS
qKNmnnkHpvjBGMG4lOqJihJdQ+2IiIMLWdNgxkvWgt6F7xjl3gcdREjJF0dmFsd+
GfDUkUZ/70T9UPsaTR2V0GBvEleq4PglALcet9Eela7tKOEziNf3L+prRjMr3909
y4uJIMH9/no1knKBxtkJ
=b1m/
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Andreas Gruenbacher:
"Bob's extensive filesystem withdrawal and recovery testing:
- don't write log headers after file system withdraw
- clean up iopen glock mess in gfs2_create_inode
- close timing window with GLF_INVALIDATE_IN_PROGRESS
- abort gfs2_freeze if io error is seen
- don't loop forever in gfs2_freeze if withdrawn
- fix infinite loop in gfs2_ail1_flush on io error
- introduce function gfs2_withdrawn
- fix glock reference problem in gfs2_trans_remove_revoke
Filesystems with a block size smaller than the page size:
- fix end-of-file handling in gfs2_page_mkwrite
- improve mmap write vs. punch_hole consistency
Other:
- remove active journal side effect from gfs2_write_log_header
- multi-block allocations in gfs2_page_mkwrite
Minor cleanups and coding style fixes:
- remove duplicate call from gfs2_create_inode
- make gfs2_log_shutdown static
- make gfs2_fs_parameters static
- some whitespace cleanups
- removed unnecessary semicolon"
* tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't write log headers after file system withdraw
gfs2: Remove duplicate call from gfs2_create_inode
gfs2: clean up iopen glock mess in gfs2_create_inode
gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
gfs2: Abort gfs2_freeze if io error is seen
gfs2: Don't loop forever in gfs2_freeze if withdrawn
gfs2: fix infinite loop in gfs2_ail1_flush on io error
gfs2: Introduce function gfs2_withdrawn
gfs2: fix glock reference problem in gfs2_trans_remove_revoke
gfs2: make gfs2_log_shutdown static
gfs2: Remove active journal side effect from gfs2_write_log_header
gfs2: Fix end-of-file handling in gfs2_page_mkwrite
gfs2: Multi-block allocations in gfs2_page_mkwrite
gfs2: Improve mmap write vs. punch_hole consistency
gfs2: make gfs2_fs_parameters static
gfs2: Some whitespace cleanups
gfs2: removed unnecessary semicolon
mappings are handled and a conversion to the new mount API (slightly
complicated by the fact that we had a common option parsing framework
that called out into rbd and the filesystem instead of them calling
into it). Also included a few scattered fixes and a MAINTAINERS update
for rbd, adding Dongsheng as a reviewer.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3oDqwTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8/OCACAhPEoSkG8J0XOgP0NlFQGRvikugKq
wlUfpNhkJOVmyM1t9LgiHHirTa7/kA76wPo/iHtnvjIZuZoaX3+NoZX5DwgKVCo1
SCQdXR4ohVPiYxUpK+z/fDXxpYHhaO2SAww+RRHSDxnlN5CHqFBcBhRBPfhraZT5
dwiQt7++UOnp/hfk1Dqg5EogmSdLxqWyjClKf2lliZkzbU9YXmGapqQsur6sBk+e
cLRmRBmMw4cDAKLL1taCympN0AxNMcePs1njvdwQ7XabNWrT061yFyt1ZNwAV/Nu
0nCyh/9IwQcsR0EvK7FCdUEJPy88Reufd+GleS4nkEZpbxQBzo0aGow0
=Egtk
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The two highlights are a set of improvements to how rbd read-only
mappings are handled and a conversion to the new mount API (slightly
complicated by the fact that we had a common option parsing framework
that called out into rbd and the filesystem instead of them calling
into it).
Also included a few scattered fixes and a MAINTAINERS update for rbd,
adding Dongsheng as a reviewer"
* tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client:
libceph, rbd, ceph: convert to use the new mount API
rbd: ask for a weaker incompat mask for read-only mappings
rbd: don't query snapshot features
rbd: remove snapshot existence validation code
rbd: don't establish watch for read-only mappings
rbd: don't acquire exclusive lock for read-only mappings
rbd: disallow read-write partitions on images mapped read-only
rbd: treat images mapped read-only seriously
rbd: introduce RBD_DEV_FLAG_READONLY
rbd: introduce rbd_is_snap()
ceph: don't leave ino field in ceph_mds_request_head uninitialized
ceph: tone down loglevel on ceph_mdsc_build_path warning
rbd: update MAINTAINERS info
ceph: fix geting random mds from mdsmap
rbd: fix spelling mistake "requeueing" -> "requeuing"
ceph: make several helper accessors take const pointers
libceph: drop unnecessary check from dispatch() in mon_client.c
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXedyjQAKCRDh3BK/laaZ
PCR7AQCf+bb3/so1bygFeblBTT4UZbYZRXz2nZNSA5tgJvafWwD+I4MlqR+tEixb
gZEusbtAVrtm3hJrBc+1fA1wacGhmAg=
=Jg/0
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
- Fix a regression introduced in the last release
- Fix a number of issues with validating data coming from userspace
- Some cleanups in virtiofs
* tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix Kconfig indentation
fuse: fix leak of fuse_io_priv
virtiofs: Use completions while waiting for queue to be drained
virtiofs: Do not send forget request "struct list_head" element
virtiofs: Use a common function to send forget
virtiofs: Fix old-style declaration
fuse: verify nlink
fuse: verify write return
fuse: verify attributes
This patch fixes the following KASAN report. The @ioend has been
freed by dio_put(), but the iomap_finish_ioend() still trys to access
its data.
[20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
[20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184
[20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
[20563.653887] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
[20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
[20563.669547] Call trace:
[20563.671993] dump_backtrace+0x0/0x370
[20563.675648] show_stack+0x1c/0x28
[20563.678958] dump_stack+0x138/0x1b0
[20563.682455] print_address_description.isra.9+0x60/0x378
[20563.687759] __kasan_report+0x1a4/0x2a8
[20563.691587] kasan_report+0xc/0x18
[20563.694985] __asan_report_load8_noabort+0x18/0x20
[20563.699769] iomap_finish_ioend+0x58c/0x5c0
[20563.703944] iomap_finish_ioends+0x110/0x270
[20563.708396] xfs_end_ioend+0x168/0x598 [xfs]
[20563.712823] xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.716834] process_one_work+0x7f0/0x1ac8
[20563.720922] worker_thread+0x334/0xae0
[20563.724664] kthread+0x2c4/0x348
[20563.727889] ret_from_fork+0x10/0x18
[20563.732941] Allocated by task 83403:
[20563.736512] save_stack+0x24/0xb0
[20563.739820] __kasan_kmalloc.isra.9+0xc4/0xe0
[20563.744169] kasan_slab_alloc+0x14/0x20
[20563.747998] slab_post_alloc_hook+0x50/0xa8
[20563.752173] kmem_cache_alloc+0x154/0x330
[20563.756185] mempool_alloc_slab+0x20/0x28
[20563.760186] mempool_alloc+0xf4/0x2a8
[20563.763845] bio_alloc_bioset+0x2d0/0x448
[20563.767849] iomap_writepage_map+0x4b8/0x1740
[20563.772198] iomap_do_writepage+0x200/0x8d0
[20563.776380] write_cache_pages+0x8a4/0xed8
[20563.780469] iomap_writepages+0x4c/0xb0
[20563.784463] xfs_vm_writepages+0xf8/0x148 [xfs]
[20563.788989] do_writepages+0xc8/0x218
[20563.792658] __writeback_single_inode+0x168/0x18f8
[20563.797441] writeback_sb_inodes+0x370/0xd30
[20563.801703] wb_writeback+0x2d4/0x1270
[20563.805446] wb_workfn+0x344/0x1178
[20563.808928] process_one_work+0x7f0/0x1ac8
[20563.813016] worker_thread+0x334/0xae0
[20563.816757] kthread+0x2c4/0x348
[20563.819979] ret_from_fork+0x10/0x18
[20563.825028] Freed by task 22184:
[20563.828251] save_stack+0x24/0xb0
[20563.831559] __kasan_slab_free+0x10c/0x180
[20563.835648] kasan_slab_free+0x10/0x18
[20563.839389] slab_free_freelist_hook+0xb4/0x1c0
[20563.843912] kmem_cache_free+0x8c/0x3e8
[20563.847745] mempool_free_slab+0x20/0x28
[20563.851660] mempool_free+0xd4/0x2f8
[20563.855231] bio_free+0x33c/0x518
[20563.858537] bio_put+0xb8/0x100
[20563.861672] iomap_finish_ioend+0x168/0x5c0
[20563.865847] iomap_finish_ioends+0x110/0x270
[20563.870328] xfs_end_ioend+0x168/0x598 [xfs]
[20563.874751] xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.878755] process_one_work+0x7f0/0x1ac8
[20563.882844] worker_thread+0x334/0xae0
[20563.886584] kthread+0x2c4/0x348
[20563.889804] ret_from_fork+0x10/0x18
[20563.894855] The buggy address belongs to the object at fffffc0c54a36900
which belongs to the cache bio-1 of size 248
[20563.906844] The buggy address is located 40 bytes inside of
248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
[20563.918485] The buggy address belongs to the page:
[20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
[20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
[20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
[20563.948300] page dumped because: kasan: bad access detected
[20563.955345] Memory state around the buggy address:
[20563.960129] fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.967342] fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[20563.981766] ^
[20563.986288] fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.993501] fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20564.000713] ==================================================================
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
Signed-off-by: Zorro Lang <zlang@redhat.com>
Fixes: 9cd0ed63ca ("iomap: enhance writeback error message")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.
Link them all through list_list. Also, it seems to be simpler and more
consistent IMHO.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only it
doesn't cancel links, but also may execute wrong sequence of actions.
Stop consuming sqes, and let the user handle errors.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ELF reads done by the kernel have very complicated error detection code
which better live in one place.
Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
ep_call_nested() in order to pass the correct subclass argument to
spin_lock_irqsave_nested(). However, ep_call_nested() adds unnecessary
checks for epoll depth and loops that are already verified when doing
EPOLL_CTL_ADD. This mirrors a conversion that was done for
!CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a ("epoll: remove
ep_call_nested() from ep_eventpoll_poll()")
Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ / /' -i */Kconfig
[adobriyan@gmail.com: add two spaces where necessary]
Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
List iteration takes more code than anything else which means embedded
list_head should be the first element of the structure.
Space savings:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
Function old new delta
close_pdeo 228 227 -1
proc_reg_release 86 82 -4
proc_entry_rundown 143 139 -4
proc_reg_open 298 289 -9
Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pointer to next '/' encodes length of path element and next start
position. Subtraction and increment are redundant.
Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently gluing PDE into global /proc tree is done under lock, but
changing ->nlink is not. Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.
Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc87 instead of having the timeout deferral use its
own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iface[0] was accessed regardless of the count value and without
locking.
* check count before accessing any ifaces
* make copy of iface list (it's a simple POD array) and use it without
locking.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
With the addition of SMB session channels, we introduced new TCP
server pointers that have no sessions or tcons associated with them.
In this case, when we started looking for TCP connections, we might
end up picking session channel rather than the master connection,
hence failing to get either a session or a tcon.
In order to fix that, this patch introduces a new "is_channel" field
to TCP_Server_Info structure so we can skip session channels during
lookup of connections.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio completions can race when a page spans more than one file system
block. Add a spinlock to synchronize marking the page uptodate.
Fixes: 9dc55f1389 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.
The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.
When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.
The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?
Signed-off-by: Mike Marshall <hubcap@omnibond.com>