Commit Graph

60754 Commits

Author SHA1 Message Date
Al Viro
1f08529c84 ceph: add missing check in d_revalidate snapdir handling
We should not play with dcache without parent locked...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:55 +01:00
Al Viro
aa8dd81673 ceph: fix RCU case handling in ceph_d_revalidate()
For RCU case ->d_revalidate() is called with rcu_read_lock() and
without pinning the dentry passed to it.  Which means that it
can't rely upon ->d_inode remaining stable; that's the reason
for d_inode_rcu(), actually.

Make sure we don't reload ->d_inode there.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:54 +01:00
Luis Henriques
ea60ed6fcf ceph: fix use-after-free in __ceph_remove_cap()
KASAN reports a use-after-free when running xfstest generic/531, with the
following trace:

[  293.903362]  kasan_report+0xe/0x20
[  293.903365]  rb_erase+0x1f/0x790
[  293.903370]  __ceph_remove_cap+0x201/0x370
[  293.903375]  __ceph_remove_caps+0x4b/0x70
[  293.903380]  ceph_evict_inode+0x4e/0x360
[  293.903386]  evict+0x169/0x290
[  293.903390]  __dentry_kill+0x16f/0x250
[  293.903394]  dput+0x1c6/0x440
[  293.903398]  __fput+0x184/0x330
[  293.903404]  task_work_run+0xb9/0xe0
[  293.903410]  exit_to_usermode_loop+0xd3/0xe0
[  293.903413]  do_syscall_64+0x1a0/0x1c0
[  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because __ceph_remove_cap() may queue a cap release
(__ceph_queue_cap_release) which can be scheduled before that cap is
removed from the inode list with

	rb_erase(&cap->ci_node, &ci->i_caps);

And, when this finally happens, the use-after-free will occur.

This can be fixed by removing the cap from the inode list before being
removed from the session list, and thus eliminating the risk of an UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:51 +01:00
Linus Torvalds
c9a2e4a829 Seven cifs/smb3 fixes, including three for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGyBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl208asACgkQiiy9cAdy
 T1EVJAv41lwLL01FQtWsoMB8yn46q7iboc0eqVJ4CEnZj25Fe3VgokKm1+2N6/15
 g8tQeLnE8kzjQyAxEgI260PHRF6Dg0RSL7zwxnJOFpspWLb9+U4UEryCDao/v4Ao
 LYr88F6XeYDGojyuGKokT/jlsxWfdPwaYhpOPabjO76f6Jv7cNfIa9EKzbEAtq3e
 a9to2AVnJ02FLGTP3QM1ThBtCu3JFFMI07u2Rp8ANiJYtM5Mg2piOkO1PQ3hGLIM
 /3+1E7B12KgG7D3GA7ZuP1HLjyo+A91BhZVTzlBDq1VHTd8ccJDJklg28oXoMrMh
 dgW5XRcS/s60LsRzagpNMC0cPlAdBAd9N9uZaoXmQhNAg0I7CJ24H3tglwKejoUR
 m/duI1C89Fbo1kgjwfAztxhfvair2SYLu8DGijvYH/C5syNUR7V99uy1th+nmOgl
 vpBM9CbkM6mtSxFHtgcFuv4rwIf61EQaGoXiL/e3RLH4zc/HaUpBo2rDh0LPmwFB
 koXIw6E=
 =fTOk
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Seven cifs/smb3 fixes, including three for stable"

* tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
  CIFS: Fix use after free of file info structures
  CIFS: Fix retry mid list corruption on reconnects
  cifs: Fix missed free operations
  CIFS: avoid using MID 0xFFFF
  cifs: clarify comment about timestamp granularity for old servers
  cifs: Handle -EINPROGRESS only when noblockcnt is set
2019-10-27 06:41:52 -04:00
Linus Torvalds
acf913b7fb for-linus-2019-10-26
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl20ST0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprl3D/wJo392Nd3fDa1+t9XpV3M5T/HKcC7yau/G
 98r41DF8TzjlwWFpsbolbBibeUkwpkZr32rAyEN7UScw6d5I+qlGF/ATzZijWGvd
 a714VhMq3fs8siGLDtNxdclYlSXzF5wnhbniPdF1QJsEHK7wAbqXV6/xf5FubHke
 SjYX9HOt0PGfkdDWmdMLRlaaedUZG5RPWAZV4dtg8LWMcL9yu9FZtxbyYyhFXf94
 Wd3MxvDp4/aeudO+CLkI73H0LOl5f0iuIOAOTEDE39X0zQA3ezwYiBh8wRtrBh1Y
 DJus7LI0G56DdUOengyptFkR2qvjmQpJr6mFKzVlp9bRJhwuvzcJ1MWj23oDDFOx
 xW1m74rQcP2COZ2BIvzufKBrBJn8Gq5BXDjQAHgQMNn0FQbEnBnCQ+vA3V4NkzMs
 EuUgk2P/LAwBpldkhFUg78UZUKCxiV72cTihLZJHN2FW5cf652Hf9jaA5Vnrzye5
 gJvXmjWojDmEu77lP51UdKSk4Jhhf+Mr/FM3tNoCyteXpceSIrDE8gOMwhkj3oM4
 eXVgHKSyxrD1npcE2/C7wwOMP9V0FBCFb5ulDbuvRyhZRQeSXtugbzU5Pn9VqgiK
 e7uhXA7AqrKS0bdHdXXlxs0of1gvq42zYcD8A9KFJhUxU07yUKXmIwXO4hApAMsk
 VLxsoPdIMA==
 =PhEL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block

Pull block and io_uring fixes from Jens Axboe:
 "A bit bigger than usual at this point in time, mostly due to some good
  bug hunting work by Pavel that resulted in three io_uring fixes from
  him and two from me. Anyway, this pull request contains:

   - Revert of the submit-and-wait optimization for io_uring, it can't
     always be done safely. It depends on commands always making
     progress on their own, which isn't necessarily the case outside of
     strict file IO. (me)

   - Series of two patches from me and three from Pavel, fixing issues
     with shared data and sequencing for io_uring.

   - Lastly, two timeout sequence fixes for io_uring (zhangyi)

   - Two nbd patches fixing races (Josef)

   - libahci regulator_get_optional() fix (Mark)"

* tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block:
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  io_uring: Fix race for sqes with userspace
  io_uring: Fix broken links with offloading
  io_uring: Fix corrupted user_data
  io_uring: correct timeout req sequence when inserting a new entry
  io_uring : correct timeout req sequence when waiting timeout
  io_uring: revert "io_uring: optimize submit_and_wait API"
2019-10-26 14:59:51 -04:00
Linus Torvalds
485fc4b69c dax fix 5.4-rc5
- Fix a performance regression that followed from a fix to the
   conversion of the fsdax implementation to the xarray. v5.3 users
   report that they stop seeing huge page mappings on an application +
   filesystem layout that was seeing huge pages previously on v5.2.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJds5qfAAoJEB7SkWpmfYgCD6gP/1PuAbdRnTt/pEUB5GLtOR21
 hszHP5vFiWaiDN4yUw7IknYmOWlPVPoHg8SdSwKLM75EGj9JCfQGUcUTZIREB15h
 6MnAavVr3u697ZGfnqg8y2Nhg6hNKbqzRnzCEACAXe+O2GnH2TuJRauyS6Zr33St
 dXiJ+XQbTGoUbR0+CEDIdv0DnkUFaaE/m1+O5DbApcP56Gt/4E1PvlIPJjMog72/
 WKVk4/Cv9+gc7JZMjR+GjjXJafiNaXT1btPGzF6zeHe9XrAnkxvfjm9BSh+qeCxj
 19jOKKFs+bY5O21jczEIvgE4cpPQEytYhZijPwMXFt2o3betOH4ACMDxVMuNJKag
 Q3yd8QJRqtP5xALAxeby7HoD6gdVLLcH0qjTdMt+LHvSde5OH3l8qOfWjK5CoIdx
 uiUC+ugg8FDZLTsliI5qPaDCSYhswoyLlvIF3AIqllB7Y1lUbihw5h+gXr9aLGmu
 1SUA6ipPw1gUW5ikYlKTu5Lzxa2VN8vHBv03W1mjuj2nOggRWp8cK8CmwKpnUuOp
 9xD1WB0dl23MRNRZOKQvkrg8kOL2vuowwJYQ6Kl3Tc2vWvqYfNRQW6dSwgx3gF38
 rQ0ZQddet4EMATrxbwMalU5hVDV2w5LAUz0+bNyovb7S0oZ20wLHm5o5iImppV1U
 +K2aRqwaPi/ytcwRF+iF
 =LioO
 -----END PGP SIGNATURE-----

Merge tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax fix from Dan Williams:
 "Fix a performance regression that followed from a fix to the
  conversion of the fsdax implementation to the xarray. v5.3 users
  report that they stop seeing huge page mappings on an application +
  filesystem layout that was seeing huge pages previously on v5.2"

* tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  fs/dax: Fix pmd vs pte conflict detection
2019-10-26 06:26:04 -04:00
Jens Axboe
2b2ed9750f io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.

For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 10:58:53 -06:00
Jens Axboe
498ccd9eda io_uring: used cached copies of sq->dropped and cq->overflow
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 10:58:45 -06:00
Pavel Begunkov
935d1e4590 io_uring: Fix race for sqes with userspace
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe

Reorder them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:02:01 -06:00
Pavel Begunkov
fb5ccc9878 io_uring: Fix broken links with offloading
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.

The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().

Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:01:59 -06:00
Pavel Begunkov
84d55dc5b9 io_uring: Fix corrupted user_data
There is a bug, where failed linked requests are returned not with
specified @user_data, but with garbage from a kernel stack.

The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:01:58 -06:00
Dave Wysochanski
d46b0da7a3 cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
There's a deadlock that is possible and can easily be seen with
a test where multiple readers open/read/close of the same file
and a disruption occurs causing reconnect.  The deadlock is due
a reader thread inside cifs_strict_readv calling down_read and
obtaining lock_sem, and then after reconnect inside
cifs_reopen_file calling down_read a second time.  If in
between the two down_read calls, a down_write comes from
another process, deadlock occurs.

        CPU0                    CPU1
        ----                    ----
cifs_strict_readv()
 down_read(&cifsi->lock_sem);
                               _cifsFileInfo_put
                                  OR
                               cifs_new_fileinfo
                                down_write(&cifsi->lock_sem);
cifs_reopen_file()
 down_read(&cifsi->lock_sem);

Fix the above by changing all down_write(lock_sem) calls to
down_write_trylock(lock_sem)/msleep() loop, which in turn
makes the second down_read call benign since it will never
block behind the writer while holding lock_sem.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed--by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-10-24 21:35:04 -05:00
Pavel Shilovsky
1a67c41596 CIFS: Fix use after free of file info structures
Currently the code assumes that if a file info entry belongs
to lists of open file handles of an inode and a tcon then
it has non-zero reference. The recent changes broke that
assumption when putting the last reference of the file info.
There may be a situation when a file is being deleted but
nothing prevents another thread to reference it again
and start using it. This happens because we do not hold
the inode list lock while checking the number of references
of the file info structure. Fix this by doing the proper
locking when doing the check.

Fixes: 487317c994 ("cifs: add spinlock for the openFileList to cifsInodeInfo")
Fixes: cb248819d2 ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
Cc: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-24 21:32:35 -05:00
Pavel Shilovsky
abe57073d0 CIFS: Fix retry mid list corruption on reconnects
When the client hits reconnect it iterates over the mid
pending queue marking entries for retry and moving them
to a temporary list to issue callbacks later without holding
GlobalMid_Lock. In the same time there is no guarantee that
mids can't be removed from the temporary list or even
freed completely by another thread. It may cause a temporary
list corruption:

[  430.454897] list_del corruption. prev->next should be ffff98d3a8f316c0, but was 2e885cb266355469
[  430.464668] ------------[ cut here ]------------
[  430.466569] kernel BUG at lib/list_debug.c:51!
[  430.468476] invalid opcode: 0000 [#1] SMP PTI
[  430.470286] CPU: 0 PID: 13267 Comm: cifsd Kdump: loaded Not tainted 5.4.0-rc3+ #19
[  430.473472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  430.475872] RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
...
[  430.510426] Call Trace:
[  430.511500]  cifs_reconnect+0x25e/0x610 [cifs]
[  430.513350]  cifs_readv_from_socket+0x220/0x250 [cifs]
[  430.515464]  cifs_read_from_socket+0x4a/0x70 [cifs]
[  430.517452]  ? try_to_wake_up+0x212/0x650
[  430.519122]  ? cifs_small_buf_get+0x16/0x30 [cifs]
[  430.521086]  ? allocate_buffers+0x66/0x120 [cifs]
[  430.523019]  cifs_demultiplex_thread+0xdc/0xc30 [cifs]
[  430.525116]  kthread+0xfb/0x130
[  430.526421]  ? cifs_handle_standard+0x190/0x190 [cifs]
[  430.528514]  ? kthread_park+0x90/0x90
[  430.530019]  ret_from_fork+0x35/0x40

Fix this by obtaining extra references for mids being retried
and marking them as MID_DELETED which indicates that such a mid
has been dequeued from the pending list.

Also move mid cleanup logic from DeleteMidQEntry to
_cifs_mid_q_entry_release which is called when the last reference
to a particular mid is put. This allows to avoid any use-after-free
of response buffers.

The patch needs to be backported to stable kernels. A stable tag
is not mentioned below because the patch doesn't apply cleanly
to any actively maintained stable kernel.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-and-tested-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-24 21:32:32 -05:00
Linus Torvalds
65b15b7f4b Fix a memory leak introduced in -rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdsbi7AAoJENW/n+sDE2U6CNMQAL9iTPDLkNEDvNNrnj3v5cq3
 WS3z6seqt+IQT9PYO5/vWsjCXOM0KLU4MC9gUUVYm+8xy6t10ViOUdltx7zRXpJP
 yh8nIHiAAYNAE2sMbZ1JsmXZPE0fZd8/kP8fhAiyYE7KYpAoCJyZGNPQsZwmI6JU
 oa4c6sJ9diYkUSiYJ5A+DN9kv2ANjDQ96ShcfoTFKSijurasIslqybLqdg0r0bgc
 NpF+xIh1z66dEisAE+Hj+l/FXHeWU+c/wCpRV8atoRNpYX0f7X8u7PBQOcrvVtcd
 Lyq3UwSCqqO234Dly5o9RotdnvI66iLXj7Hf8Zx5l9DhR6qTbqpG30fEGiv/vIrC
 Zvcavg6zQXYcPrXllmqUIZv3P/reVdUc4MQMToO5lRuWpLtbdMnUtulVGBMusLSF
 jthxasj6u+EcM0Xe/mhgDBpI/GDQrjH2+pGGjQk6VI1lkLBWZdEQphUAiuthc6GR
 9FA6oPX/4Gp1ry+6CUJePegHYW6BW9X2Qcmqqz/QHOCabkAnX37ay5z/9tbb47JT
 Jl0EATxUQag4kIco7M04JFviXMONfsXpA1QyDj/VxfOiKx0rDB054DJT1Ge+5qvR
 9x7ymUWIbPb7NR9jMfIfzgb5iwlVQdQJv46urryKEKNZaKubVWY8j1iaYpr4cOny
 1g40ngooghJF3LuwEJA4
 =sfLj
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.4-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
 "Fix a memory leak introduced in -rc1"

* tag 'gfs2-v5.4-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix memory leak when gfs2meta's fs_context is freed
2019-10-24 15:31:55 -04:00
Andrew Price
30aecae86e gfs2: Fix memory leak when gfs2meta's fs_context is freed
gfs2 and gfs2meta share an ->init_fs_context function which allocates an
args structure stored in fc->fs_private. gfs2 registers a ->free
function to free this memory when the fs_context is cleaned up, but
there was not one registered for gfs2meta, causing a leak.

Register a ->free function for gfs2meta. The existing gfs2_fc_free
function does what we need.

Reported-by: syzbot+c2fdfd2b783754878fb6@syzkaller.appspotmail.com
Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-24 16:20:43 +02:00
zhangyi (F)
a1f58ba46f io_uring: correct timeout req sequence when inserting a new entry
The sequence number of the timeout req (req->sequence) indicate the
expected completion request. Because of each timeout req consume a
sequence number, so the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same number (also
incorrect) if we insert a new entry before the last one, such as submit
such two timeout reqs on a new ring instance below.

                    req->sequence
 req_1 (count = 2):       2
 req_2 (count = 1):       2

Then, if we submit a nop req, req_2 will still timeout even the nop req
finished. This patch fix this problem by adjust the sequence number of
each reordered reqs when inserting a new entry.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23 22:09:56 -06:00
zhangyi (F)
ef03681ae8 io_uring : correct timeout req sequence when waiting timeout
The sequence number of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
will consumes a slot in the cq_ring and cq_tail pointer will be
increased, otherwise other timeout reqs may return in advance without
waiting for enough wait_nr.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23 22:09:56 -06:00
Jens Axboe
bc808bced3 io_uring: revert "io_uring: optimize submit_and_wait API"
There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.

This reverts two commits:

bf7ec93c64
c576666863

Fixes: c576666863 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23 22:09:56 -06:00
Linus Torvalds
54955e3bfd for-5.4-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2vBPgACgkQxWXV+ddt
 WDvaGQ/+JbV05ML7QjufNFKjzojNPg8dOwvLqF58askMEAqzyqOrzu1YsIOjchqv
 eZPP+wxh6j6i8LFnbaMYRhSVMlgpK9XOR3tNStp73QlSox6/6VKu7XR4wFXAoip3
 l4XI8UqO5XqDG55UTtjKyo2VNq8vgq1gRUWD6hPtfzDP6WYj4JZuXOd+dT6PbP32
 VHd91RoB8Z8qmypLF9Sju6IBWNhKl91TjlRqVdfLywaK8azqaxE1UttPf5DVbE6+
 MlTuXhuxkNc4ddzj//oJ1s3ZP/nXtFmIZ75+Sd6P5DfqRNeIfjkKqXvffsjqoYVI
 1Wv9sDiezxRB1RZjfInhtqvdvsqcsrXrM7x6BRVqI7IZDUaH5em8IoozQT62ze/4
 MJBrRIbVodx2I7EVDMNGkx2TaIDAfnW4Z3UC+YSHLoy6jht1+SA5KLQDR1G/4NeR
 dWht5wOXwhp8P3DoaczTUpk0DLtAtygj04fH8CG277EILLCpuWwW8iRkPttkrIM2
 HrtRKrKFJNyEpq9vHSFVvpQJkzgzqBFs1UnqnEuYh6qNgWCrS4PJDu9geQ8aPGr2
 pA5aEqg5b+jjWzIYDP93PF0u7kF4mFVAozn6xMf95FKmM1OupYYQg5BRe7n/DfDu
 yh3Ms0Mmd9+snpNJ9lDr40cobHC5CDvfvO2SRERyBeXxARainN0=
 =Ft+G
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fixes of error handling cleanup of metadata accounting with qgroups
   enabled

 - fix swapped values for qgroup tracepoints

 - fix race when handling full sync flag

 - don't start unused worker thread, functionality removed already

* tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: check for the full sync flag while holding the inode lock during fsync
  Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
  btrfs: tracepoints: Fix bad entry members of qgroup events
  btrfs: tracepoints: Fix wrong parameter order for qgroup events
  btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
  btrfs: don't needlessly create extent-refs kernel thread
  btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
  Btrfs: add missing extents release on file extent cluster relocation error
2019-10-23 06:14:29 -04:00
Dan Williams
6370740e5f fs/dax: Fix pmd vs pte conflict detection
Users reported a v5.3 performance regression and inability to establish
huge page mappings. A revised version of the ndctl "dax.sh" huge page
unit test identifies commit 23c84eb783 "dax: Fix missed wakeup with
PMD faults" as the source.

Update get_unlocked_entry() to check for NULL entries before checking
the entry order, otherwise NULL is misinterpreted as a present pte
conflict. The 'order' check needs to happen before the locked check as
an unlocked entry at the wrong order must fallback to lookup the correct
order.

Reported-by: Jeff Smits <jeff.smits@intel.com>
Reported-by: Doug Nelson <doug.nelson@intel.com>
Cc: <stable@vger.kernel.org>
Fixes: 23c84eb783 ("dax: Fix missed wakeup with PMD faults")
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Link: https://lore.kernel.org/r/157167532455.3945484.11971474077040503994.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-10-22 22:53:02 -07:00
Chuhong Yuan
783bf7b8b6 cifs: Fix missed free operations
cifs_setattr_nounix has two paths which miss free operations
for xid and fullpath.
Use goto cifs_setattr_exit like other paths to fix them.

CC: Stable <stable@vger.kernel.org>
Fixes: aa081859b1 ("cifs: flush before set-info if we have writeable handles")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-10-20 19:19:49 -05:00
Roberto Bergantinos Corpas
03d9a9fe3f CIFS: avoid using MID 0xFFFF
According to MS-CIFS specification MID 0xFFFF should not be used by the
CIFS client, but we actually do. Besides, this has proven to cause races
leading to oops between SendReceive2/cifs_demultiplex_thread. On SMB1,
MID is a 2 byte value easy to reach in CurrentMid which may conflict with
an oplock break notification request coming from server

Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2019-10-20 19:19:49 -05:00
Steve French
553292a634 cifs: clarify comment about timestamp granularity for old servers
It could be confusing why we set granularity to 1 seconds rather
than 2 seconds (1 second is the max the VFS allows) for these
mounts to very old servers ...

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-20 19:19:49 -05:00
Paulo Alcantara (SUSE)
d532cc7efd cifs: Handle -EINPROGRESS only when noblockcnt is set
We only want to avoid blocking in connect when mounting SMB root
filesystems, otherwise bail out from generic_ip_connect() so cifs.ko
can perform any reconnect failover appropriately.

This fixes DFS failover/reconnection tests in upstream buildbot.

Fixes: 8eecd1c2e5 ("cifs: Add support for root file systems")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-20 19:19:49 -05:00
Linus Torvalds
998d75510e Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Rather a lot of fixes, almost all affecting mm/"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  scripts/gdb: fix debugging modules on s390
  kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
  mm/thp: allow dropping THP from page cache
  mm/vmscan.c: support removing arbitrary sized pages from mapping
  mm/thp: fix node page state in split_huge_page_to_list()
  proc/meminfo: fix output alignment
  mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
  mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
  mm: include <linux/huge_mm.h> for is_vma_temporary_stack
  zram: fix race between backing_dev_show and backing_dev_store
  mm/memcontrol: update lruvec counters in mem_cgroup_move_account
  ocfs2: fix panic due to ocfs2_wq is null
  hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
  mm: memblock: do not enforce current limit for memblock_phys* family
  mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
  mm/gup: fix a misnamed "write" argument, and a related bug
  mm/gup_benchmark: add a missing "w" to getopt string
  ocfs2: fix error handling in ocfs2_setattr()
  mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
  mm/memunmap: don't access uninitialized memmap in memunmap_pages()
  ...
2019-10-19 06:53:59 -04:00
Kirill A. Shutemov
2be5fbf9a9 proc/meminfo: fix output alignment
Patch series "Fixes for THP in page cache", v2.

This patch (of 5):

Add extra space for FileHugePages and FilePmdMapped, so the output is
aligned with other rows.

Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Yi Li
b918c43021 ocfs2: fix panic due to ocfs2_wq is null
mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters
an error.  ocfs2_initialize_super() returns before allocating ocfs2_wq.
ocfs2_dismount_volume() triggers the following panic.

  Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
  ------------[ cut here ]------------
  Oops: 0002 [#1] SMP NOPTI
  CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G  E
        4.14.148-200.ckv.x86_64 #1
  Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
  task: ffff967af0520000 task.stack: ffffa5f05484000
  RIP: 0010:mutex_lock+0x19/0x20
  Call Trace:
    flush_workqueue+0x81/0x460
    ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
    ocfs2_dismount_volume+0x84/0x400 [ocfs2]
    ocfs2_fill_super+0xa4/0x1270 [ocfs2]
    ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
    mount_bdev+0x17f/0x1c0
    mount_fs+0x3a/0x160

Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
Signed-off-by: Yi Li <yilikernel@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Chengguang Xu
ce750f43f5 ocfs2: fix error handling in ocfs2_setattr()
Should set transfer_to[USRQUOTA/GRPQUOTA] to NULL on error case before
jumping to do dqput().

Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
David Hildenbrand
aad5f69bc1 fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c
There are three places where we access uninitialized memmaps, namely:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup

We have initialized memmaps either when the section is online or when the
page was initialized to the ZONE_DEVICE.  Uninitialized memmaps contain
garbage and in the worst case trigger kernel BUGs, especially with
CONFIG_PAGE_POISONING.

For example, not onlining a DIMM during boot and calling /proc/kpagecount
with CONFIG_PAGE_POISONING:

  :/# cat /proc/kpagecount > tmp.test
  BUG: unable to handle page fault for address: fffffffffffffffe
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
  Oops: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
  RIP: 0010:kpagecount_read+0xce/0x1e0
  Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
  RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
  RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
  RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
  R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
  FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
  Call Trace:
   proc_reg_read+0x3c/0x60
   vfs_read+0xc5/0x180
   ksys_read+0x68/0xe0
   do_syscall_64+0x5c/0xa0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

For now, let's drop support for ZONE_DEVICE from the three pseudo files
in order to fix this.  To distinguish offline memory (with garbage
memmap) from ZONE_DEVICE memory with properly initialized memmaps, we
would have to check get_dev_pagemap() and pfn_zone_device_reserved()
right now.  The usage of both (especially, special casing devmem) is
frowned upon and needs to be reworked.

The fundamental issue we have is:

	if (pfn_to_online_page(pfn)) {
		/* memmap initialized */
	} else if (pfn_valid(pfn)) {
		/*
		 * ???
		 * a) offline memory. memmap garbage.
		 * b) devmem: memmap initialized to ZONE_DEVICE.
		 * c) devmem: reserved for driver. memmap garbage.
		 * (d) devmem: memmap currently initializing - garbage)
		 */
	}

We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
in place as that function is also used from memory failure.  We now no
longer dump information about pages that are not in use anymore -
offline.

Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
Cc: Pankaj gupta <pagupta@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: <stable@vger.kernel.org>	[4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:31 -04:00
Linus Torvalds
d418d07005 for-linus-2019-10-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2qbF0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptsuEADEKL8pta74uy50pl0t8l9fZ++U+wdIeEIW
 9uumpOEPnI2GpkG1sOyKWK6tl8InQLw6pAquP9MoT2BHXqFHk7NIgtvk67lwQeoc
 dRwklVfvOLAdnKzyfODqE9Fh9BgczZIuOLzgdtNqrPKqgJfFRCwN94Kj/r2tYuy7
 v+riK3A49u12dOLtjU6ciNgZ0m1iUX9s0+PFYVUXtJHU/1OYToQaKP+sgWiue0Ca
 VJP/L4MLYD0a7tfd92WAK7xWLsYWTDw1Gg20hXH/tV+IIDQ5+OXhu2s6PuqI7c0y
 cZqWHQHBDkZMQvT8+V+YqZtEa+xwVCom51prJEPasmdq3fGx+2sDC1HQiySao1ML
 wfFxZvFvY9fm6M7p2xsSNEcOmamrx1aLLyNSbjIvAqLUDYJWWS56BHsKyTU5Z+Jp
 RA9dpq8iR6ISaIAcFf0IB0pJSv1HEeHyo/ixlALqezBFJaMdhWy/M+dEbWKtix9M
 s19ozcpe+omN9+O0anlLtzKNgj2Xnjiwuu8mhVcqn6uG/p6GUOup+lNvTW/fig3I
 JBH8kObjYXL181V9rYVqFutnuqcf2HYqMvV2vzAmg4LYnPVUmU7HMj8zEpxc4N+f
 Evd77j0wXmY9S+4JERxaqQZuvKBEIkvM1rkk3N4NbNghfa7QL4aW+I9cWtuelPC2
 E+DK7if0Gg==
 =rvkw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Keith that address deadlocks, double resets,
   memory leaks, and other regression.

 - Fixup elv_support_iosched() for bio based devices (Damien)

 - Fixup for the ahci PCS quirk (Dan)

 - Socket O_NONBLOCK handling fix for io_uring (me)

 - Timeout sequence io_uring fixes (yangerkun)

 - MD warning fix for parameter default_layout (Song)

 - blkcg activation fixes (Tejun)

 - blk-rq-qos node deletion fix (Tejun)

* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
  nvme-pci: Set the prp2 correctly when using more than 4k page
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  md/raid0: fix warning message for parameter default_layout
  libata/ahci: Fix PCS quirk application
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  io_uring: consider the overflow of sequence for timeout req
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  block: Fix elv_support_iosched()
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails
2019-10-18 22:29:36 -04:00
Linus Torvalds
b9959c7a34 filldir[64]: remove WARN_ON_ONCE() for bad directory entries
This was always meant to be a temporary thing, just for testing and to
see if it actually ever triggered.

The only thing that reported it was syzbot doing disk image fuzzing, and
then that warning is expected.  So let's just remove it before -rc4,
because the extra sanity testing should probably go to -stable, but we
don't want the warning to do so.

Reported-by: syzbot+3031f712c7ad5dd4d926@syzkaller.appspotmail.com
Fixes: 8a23eb804c ("Make filldir[64]() verify the directory entry filename is valid")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-18 18:41:16 -04:00
Linus Torvalds
6b95cf9b8b A future-proofing decoding fix from Jeff intended for stable and
a patch for a mostly benign race from Dongsheng.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl2qAEkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi4mZB/9HMfEfZ8JC9keyVaJpAyvV8ufTR4qs
 4b8NNc0MDM01z1Z23G0o89b5M0WEDcGslh25plCifxyNIMa+L/lYKl8CTr7CLVQS
 qCEtNgJ7ibfM26v7rfHOlk6Nnd07/OmjcioaHu/R3bqEQmXpcWQg+aX9C6mPh/2f
 yzZTKZdKhTZfUyQQctuhNo9G+wD8K86DYT1XRbubPNQ3VtXKPuNH9rLhvLCZzbVA
 6FHW05A4mwSv80MsLgN6qLSKxv/+LjV/voHepH4HygqUKw2+1lwi9BC/4k7sprQs
 1jFONZ0p1sv/LdWwJYUyCpwj6d3NliXM0uvYxfyzKveWWCxb3l3gaWUS
 =7scd
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.4-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A future-proofing decoding fix from Jeff intended for stable and a
  patch for a mostly benign race from Dongsheng"

* tag 'ceph-for-5.4-rc4' of git://github.com/ceph/ceph-client:
  rbd: cancel lock_dwork if the wait is interrupted
  ceph: just skip unrecognized info in ceph_reply_info_extra
2019-10-18 18:30:09 -04:00
yangerkun
8b07a65ad3 io_uring: fix logic error in io_timeout
If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.

Fixes: 5da0fb1ab3 ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-17 15:49:15 -06:00
Jens Axboe
491381ce07 io_uring: fix up O_NONBLOCK handling for sockets
We've got two issues with the non-regular file handling for non-blocking
IO:

1) We don't want to re-do a short read in full for a non-regular file,
   as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
   we need to punt to async context even if the file is opened as
   non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.

Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-17 15:49:11 -06:00
Linus Torvalds
6e8ba0098e Changes since last update:
- Fix a timestamp signedness problem in the new bulkstat ioctl.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl2l7R8ACgkQ+H93GTRK
 tOtFnw/9GyBxDqODfne2puAb3nISogzVDVRxzo2Ip4fH+djQ8XF0nZgaVmdK45xm
 lD0VNCMT7tb7YA9rIKb+G8WdZDfq5XUs2QOR8RCqg2rCy5Kxww8B0e83ziJ/hz8C
 ebmfuBpNoV3Ul8I4vPN2Qb2nN2wHo8LqMi3GgK3nmv92xuL7Fignh3AxForHSFRz
 VUQF+canS5IQo2KkbKsP92X1r+nwTG/8GL3WNicm4BgeFHcZbMKJn32OSpZHUSGb
 RPEpWD1TAROhjOejFpEqEXmU8x8o0BtErgo3dBgGQTntrq9gOPgOuoot3cyweHZK
 JEtxglnIIbJs0B0sru9ZuN7RkoXZYJjDaTOU8uLst8dwVaoX2kw8siWjQimadKI4
 aI323yHpTTjuzZa2PQRU682fNyVNwcDfQnOss28xQG1wG9Pjxo39AZqhpgR1GtX4
 kgYBVr6oJ/5HrZ5KFghBnjE1+jx4F4axAhgKqmbsg712mZcr1EM4TlFfd/yX6+Sd
 ihHlgSeIiZk+2fthnUwRvPuGSYIimtUAXY5wgRzXnqmCFxy3rPafLo41pPkCcYV4
 wCHdh6YR1H0/wJ9K2iUMxKKrSdu4PcdaprlD7aX7j3u727gybqOqIhaK6GdlxQg8
 zyUZIGQEyR6uTpbtC4xNttZ3OcR5w5u7JjIn6TR3xIws4TkVMwU=
 =WVUx
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.4-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "The single fix converts the seconds field in the recently added XFS
  bulkstat structure to a signed 64-bit quantity.

  The structure layout doesn't change and so far there are no users of
  the ioctl to break because we only publish xfs ioctl interfaces
  through the XFS userspace development libraries, and we're still
  working on a 5.3 release"

* tag 'xfs-5.4-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: change the seconds fields in xfs_bulkstat to signed
2019-10-17 14:19:52 -07:00
Filipe Manana
ba0b084ac3 Btrfs: check for the full sync flag while holding the inode lock during fsync
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.

A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.

So just move the check such that it's done after locking the inode and
before starting writeback again.

Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-17 20:36:02 +02:00
Filipe Manana
c7967fc149 Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
If we fail to reserve metadata for delalloc operations we end up releasing
the previously reserved qgroup amount twice, once explicitly under the
'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
value of 'true' for its 'qgroup_free' argument, which results in
btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
a double free.

Also if we fail to reserve the necessary qgroup amount, we jump to the
label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
reserve any qgroup amount. So we freed some amount we never reserved.

So fix this by removing the call to btrfs_inode_rsv_release() in the
failure path, since it's not necessary at all as we haven't changed the
inode's block reserve in any way at this point.

Fixes: c8eaeac7b7 ("btrfs: reserve delalloc metadata differently")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-17 20:13:44 +02:00
Qu Wenruo
fd2b007eae btrfs: tracepoints: Fix wrong parameter order for qgroup events
[BUG]
For btrfs:qgroup_meta_reserve event, the trace event can output garbage:

  qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2

The diff should always be alinged to sector size (4k), so there is
definitely something wrong.

[CAUSE]
For the wrong @diff, it's caused by wrong parameter order.
The correct parameters are:

  struct btrfs_root, s64 diff, int type.

However the parameters used are:

  struct btrfs_root, int type, s64 diff.

Fixes: 4ee0d8832c ("btrfs: qgroup: Update trace events for metadata reservation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-17 14:09:31 +02:00
Qu Wenruo
8702ba9396 btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.

PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.

[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.

The most obvious one is:
In btrfs_buffered_write():
	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);

We always free qgroup PREALLOC meta space.

While in btrfs_truncate_block():
	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));

We only free qgroup PREALLOC meta space when something went wrong.

[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.

The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
  In btrfs_delalloc_reserve_metadata()

- Free the unused metadata space
  Normally in:
  btrfs_delalloc_release_extents()
  |- btrfs_inode_rsv_release()
     Here we do calculation on whether we should release or not.

E.g. for 64K buffered write, the metadata rsv works like:

/* The first page */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=0
total:		num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
   rsv */

/* The 2nd page */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=calc_inode_reservations()
total:		not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
   rsv, so we free what we have reserved */

/* The 3rd~16th pages */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=calc_inode_reservations()
total:		not changed (still space for one outstanding extent)

This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().

The good news is:
- The callers are not that hot
  The hottest caller is in btrfs_buffered_write(), which is already
  fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
  btrfs_delalloc_release_extents parameter"). Thus it's not that
  easy to cause false EDQUOT.

- The trans commit in advance for qgroup would hide the bug
  Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
  commit more aggressive"), when btrfs qgroup metadata free space is slow,
  it will try to commit transaction and free the wrongly converted
  PERTRANS space, so it's not that easy to hit such bug.

[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-15 18:50:07 +02:00
Darrick J. Wong
5e0cd1ef64 xfs: change the seconds fields in xfs_bulkstat to signed
64-bit time is a signed quantity in the kernel, so the bulkstat
structure should reflect that.  Note that the structure size stays
the same and that we have not yet published userspace headers for this
new ioctl so there are no users to break.

Fixes: 7035f9724f ("xfs: introduce new v5 bulkstat structure")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-10-15 08:46:07 -07:00
Jeff Layton
1d3f87233e ceph: just skip unrecognized info in ceph_reply_info_extra
In the future, we're going to want to extend the ceph_reply_info_extra
for create replies. Currently though, the kernel code doesn't accept an
extra blob that is larger than the expected data.

Change the code to skip over any unrecognized fields at the end of the
extra blob, rather than returning -EIO.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-15 17:43:10 +02:00
yangerkun
5da0fb1ab3 io_uring: consider the overflow of sequence for timeout req
Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:

1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.

2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.

This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-15 08:55:50 -06:00
David Sterba
80ed4548d0 btrfs: don't needlessly create extent-refs kernel thread
The patch 32b593bfcb ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.

Fixes: 32b593bfcb ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-15 15:43:29 +02:00
Randy Dunlap
b46ec1da5e fs/fs-writeback.c: fix kernel-doc warning
Fix kernel-doc warning in fs/fs-writeback.c:

  fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id'

Link: http://lkml.kernel.org/r/756645ac-0ce8-d47e-d30a-04d9e4923a4f@infradead.org
Fixes: d62241c7a4 ("writeback, memcg: Implement cgroup_writeback_by_id()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14 15:04:01 -07:00
Randy Dunlap
8e88bfba77 fs/libfs.c: fix kernel-doc warning
Fix kernel-doc warning in fs/libfs.c:

  fs/libfs.c:496: warning: Excess function parameter 'available' description in 'simple_write_end'

Link: http://lkml.kernel.org/r/5fc9d70b-e377-0ec9-066a-970d49579041@infradead.org
Fixes: ad2a722f19 ("libfs: Open code simple_commit_write into only user")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Boaz Harrosh <boazh@netapp.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14 15:04:01 -07:00
Randy Dunlap
c70d868f27 fs/direct-io.c: fix kernel-doc warning
Fix kernel-doc warning in fs/direct-io.c:

  fs/direct-io.c:258: warning: Excess function parameter 'offset' description in 'dio_complete'

Also, don't mark this function as having kernel-doc notation since it is
not exported.

Link: http://lkml.kernel.org/r/97908511-4328-4a56-17fe-f43a1d7aa470@infradead.org
Fixes: 6d544bb4d9 ("dio: centralize completion in dio_complete()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Zach Brown <zab@zabbo.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14 15:04:01 -07:00
Linus Torvalds
d4615e5a46 A few tracing fixes:
- Removed locked down from tracefs itself and moved it to the trace
    directory. Having the open functions there do the lockdown checks.
 
  - Fixed a few races with opening an instance file and the instance being
    deleted (Discovered during the locked down updates). Kept separate
    from the clean up code such that they can be backported to stable
    easier.
 
  - Cleaned up and consolidated the checks done when opening a trace
    file, as there were multiple checks that need to be done, and it
    did not make sense having them done in each open instance.
 
  - Fixed a regression in the record mcount code.
 
  - Small hw_lat detector tracer fixes.
 
  - A trace_pipe read fix due to not initializing trace_seq.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXaNhphQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quDIAP4v08ARNdIh+r+c4AOBm3xsOuE/d9GB
 I56ydnskm+x2JQD6Ap9ivXe9yDBIErFeHNtCoq7pM8YDI4YoYIB30N0GfwM=
 =7oAu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few tracing fixes:

   - Remove lockdown from tracefs itself and moved it to the trace
     directory. Have the open functions there do the lockdown checks.

   - Fix a few races with opening an instance file and the instance
     being deleted (Discovered during the lockdown updates). Kept
     separate from the clean up code such that they can be backported to
     stable easier.

   - Clean up and consolidated the checks done when opening a trace
     file, as there were multiple checks that need to be done, and it
     did not make sense having them done in each open instance.

   - Fix a regression in the record mcount code.

   - Small hw_lat detector tracer fixes.

   - A trace_pipe read fix due to not initializing trace_seq"

* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
  tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
  tracing/hwlat: Report total time spent in all NMIs during the sample
  recordmcount: Fix nop_mcount() function
  tracing: Do not create tracefs files if tracefs lockdown is in effect
  tracing: Add locked_down checks to the open calls of files created for tracefs
  tracing: Add tracing_check_open_get_tr()
  tracing: Have trace events system open call tracing_open_generic_tr()
  tracing: Get trace_array reference for available_tracers files
  ftrace: Get a reference counter for the trace_array on filter files
  tracefs: Revert ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
2019-10-13 14:47:10 -07:00
Linus Torvalds
b27528b027 for-linus-20191012
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2iBvAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvokEACaUozc0Y2L99CqJcRtoH8cwIAimjKTFSmi
 UKijO/w11uc9xCIOospkM8iseRbpP/oCIwfYyVROi4LniPB1TSyQJoLF/MELeqIi
 lOSg8RrXzC22h5+QG3BR5VO55me3MwIXmJLMMWXbh7FAuUg+wepVwHwzZBurd1Pz
 C1wG8NU758AmqJdSGclUbdLLwTBkR+fXoi2EHfdbsBMqGmdZ5VRzCLx70grDcxq2
 S9gt+2+diME6o6p1wko+m9pdhUdEy7epPE50K4SA+DU06I77OZpTdSAPilPyQHei
 +3HDslzjCyos5/Ay+7hWl42jvyZ+x6GI5VcDLMTcMZsmxLy/WObDB4zcAixJrDV0
 c7XWnwCCs2wMXYMsu0ASbkU32mcQ/Ik8CrX0W3/tP1aZBW2LZnOHvXYbybzpbgbo
 QY47t3jz21uzF5kplBGxqaY1sPmLxp5V0vQmj9eN+Y7T+hiR/UgWKO/5B4CGNJe0
 BQkXNl5cZAtqxmSE8AxlpOla7z6beZUljdhFc7K8cMDdhoatRgm2MBh2G9BB1ayo
 VwHc1qLcvInpQrGiC4l5H/F8jrEs/PglknFSs1wlsav1gXA69J0cHA7LY6Czweaw
 mkiKr7jA6duBBn6MRUThF5sBMFWYuxp3B/BA4wfDJsSfCmyhWN+OZlsDaEVrzk7o
 /XK2Ew9R2Q==
 =PUeL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191012' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Single small fix for a regression in the sequence logic for linked
  commands"

* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
  io_uring: fix sequence logic for timeout requests
2019-10-13 08:15:35 -07:00
Steven Rostedt (VMware)
bf8e602186 tracing: Do not create tracefs files if tracefs lockdown is in effect
If on boot up, lockdown is activated for tracefs, don't even bother creating
the files. This can also prevent instances from being created if lockdown is
in effect.

Link: http://lkml.kernel.org/r/CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:07 -04:00