mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-14 00:24:15 +08:00

61613 Commits

Author SHA1 Message Date
David Sterba
d6156218be btrfs: make locking assertion helpers static inline
I've noticed that none of the btrfs_assert_*lock* debugging helpers is
inlined, even though they're short and mostly a value update. Making them
inline shaves 67 bytes from the text size, reduces stack consumption and
perhaps also slightly improves performance by avoiding unnecessary
calls.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
Omar Sandoval
c9eb55db84 btrfs: get rid of pointless wtag variable in async-thread.c
Commit ac0c7cf8be ("btrfs: fix crash when tracepoint arguments are
freed by wq callbacks") added a void pointer, wtag, which is passed into
trace_btrfs_all_work_done() instead of the freed work item. This is
silly for a few reasons:

1. The freed work item still has the same address.
2. work is still in scope after it's freed, so assigning wtag doesn't
   stop anyone from using it.
3. The tracepoint has always taken a void * argument, so assigning wtag
   doesn't actually make things any more type-safe. (Note that the
   original bug in commit bc074524e1 ("btrfs: prefix fsid to all trace
   events") was that the void * was implicitly cast when it was passed
   to btrfs_work_owner() in the trace point itself).

Instead, let's add some clearer warnings as comments.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
Omar Sandoval
a0cac0ec96 btrfs: get rid of unique workqueue helper functions
Commit 9e0af23764 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.

However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.

While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
57d4f0b863 btrfs: don't prematurely free work in scrub_missing_raid56_worker()
Currently, scrub_missing_raid56_worker() puts and potentially frees
sblock (which embeds the work item) and then submits a bio through
scrub_wr_submit(). This is another potential instance of the bug in
"btrfs: don't prematurely free work in run_ordered_work()". Fix it by
dropping the reference after we submit the bio.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
e732fe95e4 btrfs: don't prematurely free work in reada_start_machine_worker()
Currently, reada_start_machine_worker() frees the reada_machine_work and
then calls __reada_start_machine() to do readahead. This is another
potential instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

There _might_ already be a deadlock here: reada_start_machine_worker()
can depend on itself through stacked filesystems (__reada_start_machine()
-> reada_start_machine_dev() -> reada_tree_block_flagged() ->
read_extent_buffer_pages() -> submit_one_bio() ->
btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
submit_bio() onto a loop device can trigger readahead on the lower
filesystem).

Either way, let's fix it by freeing the work at the end.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
9be490f1e1 btrfs: don't prematurely free work in end_workqueue_fn()
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through an end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.

This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
c495dcd6fb btrfs: don't prematurely free work in run_ordered_work()
We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.
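
The ordering described above can be sketched in plain userspace C. This is
only an illustration under stated assumptions: struct work_item, item_new(),
item_free() and run_ordered() are hypothetical stand-ins, not the btrfs
async-thread code, and the free_order array exists purely to make the order
of frees observable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a work item on an ordered list. */
struct work_item {
	struct work_item *next;
	int id;
};

/* Records the order in which items are freed, for illustration only. */
int free_order[16];
int free_count;

struct work_item *item_new(int id, struct work_item *next)
{
	struct work_item *w = malloc(sizeof(*w));

	if (!w)
		abort();
	w->id = id;
	w->next = next;
	return w;
}

void item_free(struct work_item *w)
{
	if (free_count < 16)
		free_order[free_count] = w->id;
	free_count++;
	free(w);
}

/*
 * Run the ordered step for every item on the list. "self" is the item whose
 * work function is executing on this thread. Its memory is released only
 * after the loop, so the allocator cannot hand the same address to a new
 * work item while this thread may still block on behalf of later items.
 * Returns the number of items processed.
 */
int run_ordered(struct work_item *head, struct work_item *self)
{
	bool free_self = false;
	int processed = 0;

	while (head) {
		struct work_item *w = head;

		head = w->next;
		processed++;		/* the ordered callback would run here */
		if (w == self)
			free_self = true;	/* defer until the loop is done */
		else
			item_free(w);
	}
	if (free_self)
		item_free(self);
	return processed;
}
```

With a three-item list whose head is also "self", the other two items are
freed first and "self" is freed last, after all ordered work has run.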

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
cdc6f1668e btrfs: get rid of unnecessary memset() of work item
Commit fc97fab0ea ("btrfs: Replace fs_info->qgroup_rescan_worker
workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be
initialized with btrfs_init_work(), but it left behind an unnecessary
memset(). Get rid of the memset().

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
b3f167aa6c btrfs: move the failrec tree stuff into extent-io-tree.h
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
083e75e7e6 btrfs: export find_delalloc_range
This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
9c7d3a5483 btrfs: move extent_io_tree defs to their own header
extent_io.c/h are huge, encompassing a bunch of different things.  The
extent_io_tree code can live on its own, so separate this out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
6f0d04f8e7 btrfs: separate out the extent io init function
We are moving extent_io_tree into its own file, so separate out the
extent_state init stuff from extent_io_tree_init().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
33ca832fef btrfs: separate out the extent leak code
We check both extent buffer and extent state leaks in the same function.
Separate these two checks out so we can move them around.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
34ffafdba1 btrfs: ctree: Remove stray comment of setting up path lock
The following comment shows up in btrfs_search_slot() without making much
sense:

	/*
	 * setup the path here so we can release it under lock
	 * contention with the cow code
	 */
	if (cow) {
		/* code touching path->lock[] is far away from here */
	}

This comment hasn't been cleaned up after the relevant code has been
removed.

The original code was introduced in commit 65b51a009e
("btrfs_search_slot: reduce lock contention by cowing in two stages"):

  +
  +               /*
  +                * setup the path here so we can release it under lock
  +                * contention with the cow code
  +                */
  +               p->nodes[level] = b;
  +               if (!p->skip_locking)
  +                       p->locks[level] = 1;
  +

But in current code, we have different timing for modifying path lock,
so just remove the comment.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
abe9339d69 btrfs: ctree: Reduce one indent level for btrfs_search_old_slot()
Similar to what was done for btrfs_search_slot() in the previous patch,
make a shortcut for the level 0 case, which allows reducing the
indentation of the remaining case.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
f624d97608 btrfs: ctree: Reduce one indent level for btrfs_search_slot()
In btrfs_search_slot(), we have something like:

	if (level != 0) {
		/* Do search inside tree nodes */
	} else {
		/* Do search inside tree leaves */
		goto done;
	}

This caused extra indent for tree node search code.  Change it to
something like:

	if (level == 0) {
		/* Do search inside tree leaves */
		goto done;
	}
	/* Do search inside tree nodes */

This gives us more room to maneuver, which is especially useful as the
tree node search code is more complex than the leaf search code.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
71bf92a9b8 btrfs: tree-checker: Add check for INODE_REF
For INODE_REF we will check:
- Objectid (ino) against previous key
  To detect missing INODE_ITEM.

- No overflow/padding in the data payload
  Much like DIR_ITEM, but with fewer members to check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
c18679ebd8 btrfs: tree-checker: Try to detect missing INODE_ITEM
For the following items, key->objectid is inode number:
- DIR_ITEM
- DIR_INDEX
- XATTR_ITEM
- EXTENT_DATA
- INODE_REF

So in the subvolume tree, each such item must share its objectid with the
previous item, e.g.:

 (257 INODE_ITEM 0)
 (257 DIR_INDEX xxx)
 (257 DIR_ITEM xxx)
 (258 INODE_ITEM 0)
 (258 INODE_REF 0)
 (258 XATTR_ITEM 0)
 (258 EXTENT_DATA 0)

But if we have a sequence like the following, then something is definitely
wrong; normally an INODE_ITEM is missing:

 (257 INODE_ITEM 0)
 (257 DIR_INDEX xxx)
 (257 DIR_ITEM xxx)
 (258 XATTR_ITEM 0)  <<< objectid suddenly changed to 258
 (258 EXTENT_DATA 0)

So just by checking the previous key for the above inode-based key types,
we can detect a missing inode item.
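
A minimal sketch of that previous-key check, assuming a flat array of keys in
leaf order. The enum values, struct key layout and function names below are
illustrative only and do not match the on-disk btrfs definitions; skipping
the first slot (which has no predecessor in the array) is also an assumption
of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative key types; not the real btrfs key type values. */
enum key_type {
	INODE_ITEM,
	INODE_REF,
	XATTR_ITEM,
	DIR_ITEM,
	DIR_INDEX,
	EXTENT_DATA,
};

struct key {
	uint64_t objectid;
	enum key_type type;
	uint64_t offset;
};

/* Key types whose objectid is an inode number. */
bool is_inode_based(enum key_type type)
{
	switch (type) {
	case INODE_REF:
	case XATTR_ITEM:
	case DIR_ITEM:
	case DIR_INDEX:
	case EXTENT_DATA:
		return true;
	default:
		return false;
	}
}

/*
 * Return the index of the first inode-based item whose previous item has a
 * different objectid, or -1 if every such item follows an item of the same
 * inode.
 */
int find_missing_inode_item(const struct key *keys, size_t nr)
{
	for (size_t i = 1; i < nr; i++) {
		if (!is_inode_based(keys[i].type))
			continue;
		if (keys[i - 1].objectid != keys[i].objectid)
			return (int)i;
	}
	return -1;
}
```

Run against the two sequences above, the first yields -1 and the second
yields the index of the (258 XATTR_ITEM 0) entry.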

For INODE_REF key type, the check will be added along with INODE_REF
checker.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Filipe Manana
b9fae2ebee Btrfs: make btrfs_wait_extents() static
It's not used outside of transaction.c

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:45 +01:00
Nikolay Borisov
35b814f3c5 btrfs: Add assert to catch nested transaction commit
A recent patch to btrfs showed that there was at least one case where a
nested transaction was committed. A nested transaction in this case means
that code holding a transaction handle calls some function which in turn
obtains a copy of the same transaction handle. In such cases the correct
thing to do is for the callee to call btrfs_end_transaction(), which
contains appropriate checks so as to not commit the transaction; committing
it would leave the caller with a stale transaction handle.

To catch such cases add an assert in btrfs_commit_transaction ensuring
btrfs_trans_handle::use_count is always 1.
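
As a rough userspace model of that rule (all names here are hypothetical and
the real handle carries far more state), a shared handle can be pictured as a
use count that nested users bump and drop, with commit insisting that only
the outermost user remains:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified transaction handle. */
struct trans_handle {
	int use_count;
};

/* A nested user obtains the same handle and takes another reference. */
struct trans_handle *trans_join(struct trans_handle *h)
{
	h->use_count++;
	return h;
}

/*
 * Drop one reference. Returns true only when the last user is done, i.e.
 * when the transaction is really ended rather than just un-nested.
 */
bool trans_end(struct trans_handle *h)
{
	if (h->use_count > 1) {
		h->use_count--;
		return false;
	}
	h->use_count = 0;
	return true;
}

/* Committing is only valid from the outermost user of the handle. */
void trans_commit(struct trans_handle *h)
{
	assert(h->use_count == 1);
	h->use_count = 0;
}
```

In this model a nested callee that called trans_commit() instead of
trans_end() would trip the assertion instead of silently invalidating the
caller's handle.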

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:45 +01:00
Goldwyn Rodrigues
9cf35f6735 btrfs: simplify inode locking for RWF_NOWAIT
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like the trylock-then-lock-for-real
scheme. This causes extra contention on the lock and can be measured,
e.g., by the AIM7 benchmark.  So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.
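
The resulting locking shape can be sketched as follows. This is a toy model
only: struct toy_lock is a hypothetical spinning lock built on C11
atomic_flag, standing in for the inode rwsem, and lock_for_write() is an
illustrative name rather than a btrfs function.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode lock. */
struct toy_lock {
	atomic_flag held;
};

bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set(&l->held);
}

void toy_lock_blocking(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin; a real lock would sleep here */
}

void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Only the nowait path uses trylock; everyone else takes the lock directly
 * instead of trying first and then locking for real.
 */
int lock_for_write(struct toy_lock *l, bool nowait)
{
	if (nowait)
		return toy_trylock(l) ? 0 : -EAGAIN;
	toy_lock_blocking(l);
	return 0;
}
```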

Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:45 +01:00
Zhihao Cheng
6abf572621 ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
When running stress-test test_2 from mtd-utils on a ubi device, we
sometimes get the following oops message:

  BUG: unable to handle page fault for address: ffffffff00000140
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 280a067 P4D 280a067 PUD 0
  Oops: 0000 [#1] SMP
  CPU: 0 PID: 60 Comm: kworker/u16:1 Kdump: loaded Not tainted 5.2.0 #13
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0
  -0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  Workqueue: writeback wb_workfn (flush-ubifs_0_0)
  RIP: 0010:rb_next_postorder+0x2e/0xb0
  Code: 80 db 03 01 48 85 ff 0f 84 97 00 00 00 48 8b 17 48 83 05 bc 80 db
  03 01 48 83 e2 fc 0f 84 82 00 00 00 48 83 05 b2 80 db 03 01 <48> 3b 7a
  10 48 89 d0 74 02 f3 c3 48 8b 52 08 48 83 05 a3 80 db 03
  RSP: 0018:ffffc90000887758 EFLAGS: 00010202
  RAX: ffff888129ae4700 RBX: ffff888138b08400 RCX: 0000000080800001
  RDX: ffffffff00000130 RSI: 0000000080800024 RDI: ffff888138b08400
  RBP: ffff888138b08400 R08: ffffea0004a6b920 R09: 0000000000000000
  R10: ffffc90000887740 R11: 0000000000000001 R12: ffff888128d48000
  R13: 0000000000000800 R14: 000000000000011e R15: 00000000000007c8
  FS:  0000000000000000(0000) GS:ffff88813ba00000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffff00000140 CR3: 000000013789d000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    destroy_old_idx+0x5d/0xa0 [ubifs]
    ubifs_tnc_start_commit+0x4fe/0x1380 [ubifs]
    do_commit+0x3eb/0x830 [ubifs]
    ubifs_run_commit+0xdc/0x1c0 [ubifs]

The above oops is caused by a slab-out-of-bounds access in the do-while
loop of layout_in_gaps(), which is called indirectly by
ubifs_tnc_start_commit(). That loop places index nodes into the gaps
created by obsolete index nodes in non-empty index LEBs until the
remaining index nodes fit entirely into pre-allocated empty LEBs.
@c->gap_lebs points to a memory area (an integer array) that records the
LEB numbers used by the 'in-the-gaps' method. Whenever a suitable index
LEB is found, the corresponding lnum is appended to the memory area
pointed to by @c->gap_lebs. That area, of size
((@c->lst.idx_lebs + 1) * sizeof(int)), is allocated before the do-while
loop and is not resized inside the loop. But @c->lst.idx_lebs can be
increased by ubifs_change_lp() (called by
layout_leb_in_gaps->ubifs_find_dirty_idx_leb->get_idx_gc_leb) during the
loop. So an out-of-bounds access sometimes happens when the number of
iterations of the do-while loop exceeds the original value of
@c->lst.idx_lebs. See the details in
https://bugzilla.kernel.org/show_bug.cgi?id=204229.
This patch fixes the out-of-bounds access in layout_in_gaps().
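
One possible way to guard against this class of bug is to track the array's
capacity explicitly and grow it when the loop runs longer than the initial
estimate. The sketch below is a userspace illustration under that
assumption; struct gap_list and its helpers are hypothetical and are not
claimed to be what the patch itself does.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable list of LEB numbers. */
struct gap_list {
	int *lebs;
	size_t cap;	/* number of slots allocated */
	size_t len;	/* number of slots in use */
};

/* Size the list from the index LEB count known before the loop starts. */
int gap_list_init(struct gap_list *g, size_t idx_lebs)
{
	g->cap = idx_lebs + 1;
	g->len = 0;
	g->lebs = calloc(g->cap, sizeof(*g->lebs));
	return g->lebs ? 0 : -1;
}

/*
 * Append one LEB number. If the loop has outgrown the initial estimate,
 * double the array instead of writing past its end.
 */
int gap_list_push(struct gap_list *g, int lnum)
{
	if (g->len == g->cap) {
		size_t ncap = g->cap * 2;
		int *n = realloc(g->lebs, ncap * sizeof(*n));

		if (!n)
			return -1;
		g->lebs = n;
		g->cap = ncap;
	}
	g->lebs[g->len++] = lnum;
	return 0;
}
```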

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:54 +01:00
Zhihao Cheng
10256f0009 ubifs: do_kill_orphans: Fix a memory leak bug
If there is more than one valid snod on the sleb->nodes list,
do_kill_orphans will allocate ino more than once without releasing the
previous ino's memory, which results in a memory leak.
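
The leak pattern and one way to avoid it can be sketched in userspace C. The
names below (tracked_alloc(), tracked_free(), scan_nodes()) are hypothetical,
and the live_allocs counter exists only so the balance of allocations and
frees can be checked.

```c
#include <stddef.h>
#include <stdlib.h>

/* Number of buffers currently allocated, for illustration only. */
int live_allocs;

void *tracked_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		live_allocs++;
	return p;
}

void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/*
 * Visit nr_nodes nodes, each of which needs a scratch buffer. The buffer
 * from the previous iteration is released before a new one is allocated,
 * and the last one is released after the loop.
 */
int scan_nodes(int nr_nodes)
{
	void *ino = NULL;

	for (int i = 0; i < nr_nodes; i++) {
		tracked_free(ino);	/* without this, each pass leaks */
		ino = tracked_alloc(64);
		if (!ino)
			return -1;
		/* ... the node would be processed using ino here ... */
	}
	tracked_free(ino);
	return 0;
}
```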

Fixes: ee1438ce5d ("ubifs: Check link count of inodes when...")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:54 +01:00
Richard Weinberger
91cbf01178 Revert "ubifs: Fix memory leak bug in alloc_ubifs_info() error path"
This reverts commit 9163e0184b.

At the point when ubifs_fill_super() runs, we already have a reference
to the super block. So upon deactivate_locked_super() c will get
free()'ed via ->kill_sb().

Cc: Wenwen Wang <wenwen@cs.uga.edu>
Fixes: 9163e0184b ("ubifs: Fix memory leak bug in alloc_ubifs_info() error path")
Reported-by: https://twitter.com/grsecurity/status/1180609139359277056
Signed-off-by: Richard Weinberger <richard@nod.at>
Tested-by: Romain Izard <romain.izard.pro@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:53 +01:00
Ben Dooks (Codethink)
7cc7720f06 ubifs: Fix type of sup->hash_algo
The sup->hash_algo is a __le16, and whilst 0xffff is
the same in __le16 and u16, it would be better to use
cpu_to_le16() anyway (which should deal with constants)
and silence the following sparse warning:

fs/ubifs/sb.c:187:32: warning: incorrect type in assignment (different base types)
fs/ubifs/sb.c:187:32:    expected restricted __le16 [usertype] hash_algo
fs/ubifs/sb.c:187:32:    got int

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:53 +01:00
Ben Dooks (Codethink)
df22b5b3ec ubifs: Fixed missed le64_to_cpu() in journal
In the ubifs_jnl_write_inode() function, it calls ubifs_iget()
with xent->inum. The xent->inum is __le64, but ubifs_iget()
takes a native CPU-endian value.

I think that this should be changed to passing le64_to_cpu(xent->inum)
to fix the following sparse warning:

fs/ubifs/journal.c:902:58: warning: incorrect type in argument 2 (different base types)
fs/ubifs/journal.c:902:58:    expected unsigned long inum
fs/ubifs/journal.c:902:58:    got restricted __le64 [usertype] inum
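
To illustrate why the conversion matters, here is a small userspace sketch.
It assumes a hypothetical le64_t wrapper type holding the raw little-endian
bytes; wrapping the bytes in a struct makes the C compiler reject passing the
on-media value where a native integer is expected, loosely similar to what
sparse enforces for __le64. The helper names are illustrative, not the
kernel's.

```c
#include <stdint.h>

/* Hypothetical on-media 64-bit little-endian value. */
typedef struct {
	uint8_t b[8];
} le64_t;

/* Encode a native integer as little-endian bytes, on any host. */
le64_t cpu_to_le64_sketch(uint64_t v)
{
	le64_t out;

	for (int i = 0; i < 8; i++) {
		out.b[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
	return out;
}

/* Decode little-endian bytes into a native integer, on any host. */
uint64_t le64_to_cpu_sketch(le64_t v)
{
	uint64_t out = 0;

	for (int i = 7; i >= 0; i--)
		out = (out << 8) | v.b[i];
	return out;
}
```

A lookup function taking a native inode number would then be called with
le64_to_cpu_sketch(raw) rather than with raw itself.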

Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:21:51 +01:00
Ben Dooks (Codethink)
3cfa4412df ubifs: Force prandom result to __le32
In set_dent_cookie() the result of prandom_u32() is
assigned to an __le32 type. Make this a forced conversion
to remove the following sparse warning:

fs/ubifs/journal.c:506:30: warning: incorrect type in assignment (different base types)
fs/ubifs/journal.c:506:30:    expected restricted __le32 [usertype] cookie
fs/ubifs/journal.c:506:30:    got unsigned int

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:15:29 +01:00
Richard Weinberger
b27b281f9c ubifs: Remove obsolete TODO from dfs_file_write()
AFAICT this kind of problem is no longer possible since
debugfs gained file removal protection via
e9117a5a4b ("debugfs: implement per-file removal protection").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:15:28 +01:00
Brian Foster
2a2b5932db xfs: fix attr leaf header freemap.size underflow
The leaf format xattr addition helper xfs_attr3_leaf_add_work()
adjusts the block freemap in a couple places. The first update drops
the size of the freemap that the caller had already selected to
place the xattr name/value data. Before the function returns, it
also checks whether the entries array has encroached on a freemap
range by virtue of the new entry addition. This is necessary because
the entries array grows from the start of the block (but end of the
block header) towards the end of the block while the name/value data
grows from the end of the block in the opposite direction. If the
associated freemap is already empty, however, size is zero and the
subtraction underflows the field and causes corruption.

This is reproduced rarely by generic/070. The observed behavior is
that a smaller sized freemap is aligned to the end of the entries
list, several subsequent xattr additions land in larger freemaps and
the entries list expands into the smaller freemap until it is fully
consumed and then underflows. Note that it is not otherwise a
corruption for the entries array to consume an empty freemap because
the nameval list (i.e. the firstused pointer in the xattr header)
starts beyond the end of the corrupted freemap.

Update the freemap size modification to account for the fact that
the freemap entry can be empty and thus stale.
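
The underflow and a clamped update can be shown with a small sketch. The
struct freemap layout and the helper names are assumptions made for
illustration; they mirror the 16-bit base/size idea described above but are
not the XFS definitions.

```c
#include <stdint.h>

/* Hypothetical free-space range inside a block: [base, base + size). */
struct freemap {
	uint16_t base;
	uint16_t size;
};

/* Unchecked subtraction: wraps around when size is smaller than bytes. */
uint16_t naive_shrink(uint16_t size, uint16_t bytes)
{
	return (uint16_t)(size - bytes);
}

/*
 * The entries array grew by "bytes" into the start of this range. Advance
 * the base and shrink the size, but never below zero, since the range may
 * already be empty.
 */
void freemap_consume(struct freemap *m, uint16_t bytes)
{
	m->base = (uint16_t)(m->base + bytes);
	m->size = (uint16_t)(m->size - (m->size < bytes ? m->size : bytes));
}
```

With an empty range and an 8-byte entry, the naive subtraction produces
65528, while the clamped version leaves the size at zero.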

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-15 21:15:29 -08:00
Darrick J. Wong
050552cbe0 xfs: fix some memory leaks in log recovery
Fix a few places where we xlog_alloc_buffer a buffer, hit an error, and
then bail out without freeing the buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-11-15 21:15:29 -08:00
Linus Torvalds
b226c9e1f4 for-linus-20191115

Merge tag 'for-linus-20191115' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes that should make it into this release. This contains:

   - io_uring:
        - The timeout command assumes sequence == 0 means that we want
          one completion, but this kind of overloading is unfortunate as
          it prevents users from doing a pure time based wait. Since
          this operation was introduced in this cycle, let's correct it
          now, while we can. (me)
        - One-liner to fix an issue with dependent links and fixed
          buffer reads. The actual IO completed fine, but the link got
          severed since we stored the wrong expected value. (me)
        - Add TIMEOUT to list of opcodes that don't need a file. (Pavel)

   - rsxx missing workqueue destroy calls. Old bug. (Chuhong)

   - Fix blk-iocost active list check (Jiufei)

   - Fix impossible-to-hit overflow merge condition, that still hit some
     folks very rarely (Junichi)

   - Fix bfq hang issue from 5.3. This didn't get marked for stable, but
     will go into stable post this merge (Paolo)"

* tag 'for-linus-20191115' of git://git.kernel.dk/linux-block:
  rsxx: add missed destroy_workqueue calls in remove
  iocost: check active_list of all the ancestors in iocg_activate()
  block, bfq: deschedule empty bfq_queues not referred by any process
  io_uring: ensure registered buffer import returns the IO length
  io_uring: Fix getting file for timeout
  block: check bi_size overflow before merge
  io_uring: make timeout sequence == 0 mean no sequence
2019-11-15 13:02:34 -08:00
Linus Torvalds
875fef493f Two fixes for the buffered reads and O_DIRECT writes serialization
patch that went into -rc1 and a fixup for a bogus warning on older
 gcc versions.

Merge tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two fixes for the buffered reads and O_DIRECT writes serialization
  patch that went into -rc1 and a fixup for a bogus warning on older gcc
  versions"

* tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client:
  rbd: silence bogus uninitialized warning in rbd_object_map_update_finish()
  ceph: increment/decrement dio counter on async requests
  ceph: take the inode lock before acquiring cap refs
2019-11-15 10:30:24 -08:00
David Howells
a28f239e29 afs: Fix race in commit bulk status fetch
When a lookup is done, the afs filesystem will perform a bulk status-fetch
operation on the requested vnode (file) plus the next 49 other vnodes from
the directory list (in AFS, directory contents are downloaded as blobs and
parsed locally).  When the results are received, it will speculatively
populate the inode cache from the extra data.

However, if the lookup races with another lookup on the same directory, but
for a different file - one that's in the 49 extra fetches, then if the bulk
status-fetch operation finishes first, it will try and update the inode
from the other lookup.

If this other inode is still in the throes of being created, however, this
will cause an assertion failure in afs_apply_status():

	BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags));

on or about fs/afs/inode.c:175 because it expects data to be there already
that it can compare to.

Fix this by skipping the update if the inode is being created as the
creator will presumably set up the inode with the same information.

Fixes: 39db9815da ("afs: Fix application of the results of a inline bulk status fetch")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15 10:28:02 -08:00
Linus Torvalds
b4c0800e42 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Assorted fixes all over the place; some of that is -stable fodder,
  some regressions from the last window"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  ecryptfs: fix unlink and rmdir in face of underlying fs modifications
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  cgroup: don't put ERR_PTR() into fc->root
  autofs: fix a leak in autofs_expire_indirect()
  aio: Fix io_pgetevents() struct __compat_aio_sigset layout
  fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15 08:44:08 -08:00
David Howells
3c0edea9b2 pipe: Remove sync on wake_ups
2019-11-15 16:22:54 +00:00
David Howells
cefa80ced5 pipe: Increase the writer-wakeup threshold to reduce context-switch count
Increase the threshold at which the reader sends a wake event to the
writers in the queue such that the queue must be half empty before the wake
is issued, rather than the wake being issued when just a single slot is
available.

This reduces the number of context switches in the tests significantly,
without altering the amount of work achieved.  With my pipe-bench program,
there's a 20% reduction versus an unpatched kernel.
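
The two wake policies can be compared with a small sketch. It assumes a ring
described by free-running head and tail counters and a slot count; struct
ring and the function names are hypothetical and the real pipe code differs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring state: head and tail are free-running counters. */
struct ring {
	uint32_t head;	/* next slot to be written */
	uint32_t tail;	/* next slot to be read */
	uint32_t size;	/* total number of slots */
};

/* Occupied slots; unsigned subtraction stays correct across wraparound. */
uint32_t ring_used(const struct ring *r)
{
	return (uint32_t)(r->head - r->tail);
}

/* Old policy: wake writers as soon as a single slot is free. */
bool wake_on_any_slot(const struct ring *r)
{
	return ring_used(r) < r->size;
}

/* New policy: wake writers only once the ring is at least half empty. */
bool wake_on_half_empty(const struct ring *r)
{
	return ring_used(r) <= r->size / 2;
}
```

For a 16-slot ring that was full, reading one slot triggers a wake under the
old policy but not under the new one, which waits until eight slots are free.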

Suggested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
8df441294d pipe: Check for ring full inside of the spinlock in pipe_write()
Make pipe_write() check to see if the ring has become full between it
taking the pipe mutex, checking the ring status and then taking the
spinlock.

This can happen if a notification is written into the pipe as that happens
without the pipe mutex.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
7e25a73f1a pipe: Remove redundant wakeup from pipe_write()
Remove a redundant wakeup from pipe_write().

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
a194dfe6e6 pipe: Rearrange sequence in pipe_write() to preallocate slot
Rearrange the sequence in pipe_write() so that the allocation of the new
buffer, the allocation of a ring slot and the attachment to the ring is
done under the pipe wait spinlock and then the lock is dropped and the
buffer can be filled.

The data copy needs to be done with the spinlock unheld and irqs enabled,
so the lock needs to be dropped first.  However, the reader can't progress
as we're holding pipe->mutex.

We also need to drop the lock as that would impact others looking at the
pipe waitqueue, such as poll(), the consumer and a future kernel message
writer.

We just abandon the preallocated slot if we get a copy error.  Future
writes may continue it and a future read will eventually recycle it.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
8446487feb pipe: Conditionalise wakeup in pipe_read()
Only do a wakeup in pipe_read() if we made space in a completely full
buffer.  The producer shouldn't be waiting on pipe->wait otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
b667b86734 pipe: Advance tail pointer inside of wait spinlock in pipe_read()
Advance the pipe ring tail pointer inside of wait spinlock in pipe_read()
so that the pipe can be written into with kernel notifications from
contexts where pipe->mutex cannot be taken.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
6718b6f855 pipe: Allow pipes to have kernel-reserved slots
Split pipe->ring_size into two numbers:

 (1) pipe->ring_size - indicates the hard size of the pipe ring.

 (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
     userspace orchestrated events can fill.

This allows for a pipe that is both writable by the general kernel
notification facility and by userspace, allowing plenty of ring space for
notifications to be added whilst preventing userspace from being able to
pin too much unswappable kernel space.
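
A rough user-space model of the split, for illustration only. The `demo_pipe` structure, its `occupancy` field and both helper functions are hypothetical; only the names `ring_size` and `max_usage` come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_pipe {
	unsigned int ring_size;	/* hard size of the pipe ring */
	unsigned int max_usage;	/* slots that userspace-driven writes may fill */
	unsigned int occupancy;	/* slots currently in use (hypothetical field) */
};

/* Userspace writes are capped by max_usage. */
static bool user_write_allowed(const struct demo_pipe *p)
{
	assert(p->max_usage <= p->ring_size);
	return p->occupancy < p->max_usage;
}

/* Kernel notifications may use the ring up to its hard size. */
static bool kernel_post_allowed(const struct demo_pipe *p)
{
	return p->occupancy < p->ring_size;
}
```

In this model, a pipe with `ring_size = 32` and `max_usage = 16` refuses further userspace writes at 16 used slots but still has 16 slots left for notifications.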

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
Arnd Bergmann
bde9e963af y2038: timerfd: Use timespec64 internally
timerfd_show() uses a 'struct itimerspec' internally, but that is
deprecated because of the time_t overflow and a conflict with the glibc
type of the same name that is now incompatible in user space.

Use a pair of timespec64 variables instead as a simple replacement.

This removes the last use of itimerspec from the kernel, allowing the
removal of the definition from the uapi headers along with timespec and
timeval later.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
e2bb80d55d y2038: elfcore: Use __kernel_old_timeval for process times
We store elapsed time for a crashed process in struct elf_prstatus using
'timeval' structures. Once glibc starts using 64-bit time_t, this becomes
incompatible with the kernel's idea of timeval since the structure layout
no longer matches on 32-bit architectures.

This changes the definition of the elf_prstatus structure to use
__kernel_old_timeval instead, which is hardcoded to the currently used
binary layout. There is no risk of overflow in y2038 though, because
the time values are all relative times, and can store up to 68 years
of process elapsed time.
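
The 68-year figure can be checked with a short calculation. The sketch below assumes the seconds field is a signed 32-bit value (as it would be on 32-bit architectures) and uses a 365.25-day year; the helper name is made up for illustration.

```c
#include <stdint.h>

/* Seconds in a 365.25-day year (an assumption made for this estimate). */
#define SECONDS_PER_YEAR (365.25 * 24 * 60 * 60)

/* Largest span, in years, that a signed 32-bit seconds counter can hold. */
static double max_years_s32_seconds(void)
{
	return (double)INT32_MAX / SECONDS_PER_YEAR;
}
```

The result is slightly above 68, which matches the statement that a relative process time would need to exceed roughly 68 years before the field overflows.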

There is a risk of applications breaking at build time when they
use the new kernel headers and expect the type to be exactly 'timeval'
rather than a structure that has the same fields as before. Those
applications have to be modified to deal with 64-bit time_t anyway.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
75d319c06e y2038: syscalls: change remaining timeval to __kernel_old_timeval
All of the remaining syscalls that pass a timeval (gettimeofday, utime,
futimesat) can trivially be changed to pass a __kernel_old_timeval
instead, which has a compatible layout, but avoids ambiguity with
the timeval type in user space.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
3ca47e958a y2038: remove CONFIG_64BIT_TIME
The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:27 +01:00
yangerkun
565333a155 ext4: fix a bug in ext4_wait_for_tail_page_commit
No need to wait for any commit once the page is fully truncated.
Besides, it may confuse e.g. a concurrent ext4_writepage() when the page
is still dirty (the dirty bit will be cleared by truncate_pagecache() in
ext4_setattr()) but its buffers have been freed, which then triggers the
bug shown below:

[   26.057508] ------------[ cut here ]------------
[   26.058531] kernel BUG at fs/ext4/inode.c:2134!
...
[   26.088130] Call trace:
[   26.088695]  ext4_writepage+0x914/0xb28
[   26.089541]  writeout.isra.4+0x1b4/0x2b8
[   26.090409]  move_to_new_page+0x3b0/0x568
[   26.091338]  __unmap_and_move+0x648/0x988
[   26.092241]  unmap_and_move+0x48c/0xbb8
[   26.093096]  migrate_pages+0x220/0xb28
[   26.093945]  kernel_mbind+0x828/0xa18
[   26.094791]  __arm64_sys_mbind+0xc8/0x138
[   26.095716]  el0_svc_common+0x190/0x490
[   26.096571]  el0_svc_handler+0x60/0xd0
[   26.097423]  el0_svc+0x8/0xc

Run the following program (generated by syzkaller) in parallel, with ext3.

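#define _GNU_SOURCE		/* for O_NOATIME */
#include <assert.h>
#include <fcntl.h>
#include <numaif.h>		/* mbind(), MPOL_MF_MOVE; link with -lnuma */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd, fd1, ret;
	void *addr;
	size_t length = 4096;
	int flags;
	off_t offset = 0;
	char *str = "12345";

	/* O_CREAT requires an explicit mode argument */
	fd = open("a", O_RDWR | O_CREAT, 0644);
	assert(fd >= 0);

	/* Truncate to 4k */
	ret = ftruncate(fd, length);
	assert(ret == 0);

	/* Journal data mode */
	flags = 0xc00f;
	ret = ioctl(fd, _IOW('f', 2, long), &flags);
	assert(ret == 0);

	/* Truncate to 0 */
	fd1 = open("a", O_TRUNC | O_NOATIME);
	assert(fd1 >= 0);

	addr = mmap(NULL, length, PROT_WRITE | PROT_READ,
					MAP_SHARED, fd, offset);
	assert(addr != (void *)-1);

	memcpy(addr, str, 5);
	mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE);
	return 0;
}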

The bug is triggered once we see the ordering below.

reproduce1                         reproduce2

...                            |   ...
truncate to 4k                 |
change to journal data mode    |
                               |   memcpy(set page dirty)
truncate to 0:                 |
ext4_setattr:                  |
...                            |
ext4_wait_for_tail_page_commit |
                               |   mbind(trigger bug)
truncate_pagecache(clean dirty)|   ...
...                            |

mbind will call ext4_writepage() since the page is still dirty, and then
report the bug since the buffers have been freed. Fix it by returning
directly once offset equals 0, which means the page has been fully
truncated.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 22:22:17 -05:00
Gao Xiang
5500221ea1 ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
Similar to [1] [2], bio_alloc with the __GFP_DIRECT_RECLAIM flag
guarantees bio allocation under some given restrictions, as
stated in block/bio.c and fs/direct-io.c. So here it's ok to
not check for a NULL value from bio_alloc().

[1] https://lore.kernel.org/r/20191030035518.65477-1-gaoxiang25@huawei.com
[2] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20191031092315.139267-1-gaoxiang25@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 22:19:11 -05:00
Chengguang Xu
ebc11f7b1f ext4: code cleanup for get_next_id
Now the checks in ext4_get_next_id() and dquot_get_next_id()
are almost the same, so just call dquot_get_next_id() instead
of ext4_get_next_id().

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Link: https://lore.kernel.org/r/20191006103028.31299-1-cgxu519@mykernel.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 22:15:43 -05:00
Jan Kara
f4c2d372b8 ext4: fix leak of quota reservations
Commit 8fcc3a5806 ("ext4: rework reserved cluster accounting when
invalidating pages") moved freeing of delayed allocation reservations
from dirty page invalidation time to time when we evict corresponding
status extent from extent status tree. For inodes which don't have any
blocks allocated this may actually happen only in ext4_clear_blocks()
which is after we've dropped references to quota structures from the
inode. Thus the quota reservation leaked. Fix the problem by clearing
quota information from the inode only after evicting extent status tree
in ext4_clear_inode().

Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 8fcc3a5806 ("ext4: rework reserved cluster accounting when invalidating pages")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 19:25:47 -05:00
Olof Johansson
1e1a76ed9a ext4: remove unused variable warning in parse_options()
Commit c33fbe8f67 ("ext4: Enable blocksize < pagesize for
dioread_nolock") removed the only user of 'sbi' outside of the ifdef,
so it caused a new warning:

fs/ext4/super.c:2068:23: warning: unused variable 'sbi' [-Wunused-variable]

Fixes: c33fbe8f67 ("ext4: Enable blocksize < pagesize for dioread_nolock")
Signed-off-by: Olof Johansson <olof@lixom.net>
Link: https://lore.kernel.org/r/20191111022523.34256-1-olof@lixom.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2019-11-14 17:59:05 -05:00
Chandan Rajendra
196624e192 ext4: Enable encryption for subpage-sized blocks
Now that we have the code to support encryption for subpage-sized
blocks, this commit removes the conditional check in filesystem mount
code.

The commit also changes the support statement in
Documentation/filesystems/fscrypt.rst to reflect the fact that
encryption on filesystems with blocksize less than page size now works.

[EB: Tested with 'gce-xfstests -c ext4/encrypt_1k -g auto', using the
new "encrypt_1k" config I created.  All tests pass except for those that
already fail or are excluded with the encrypt or 1k configs, and 2 tests
that try to create 1023-byte symlinks which fails since encrypted
symlinks are limited to blocksize-3 bytes.  Also ran the dedicated
encryption tests using 'kvm-xfstests -c ext4/1k -g encrypt'; all pass,
including the on-disk ciphertext verification tests.]

Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191023033312.361355-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 16:40:45 -05:00
Eric Biggers
31fb992ce6 fs/buffer.c: support fscrypt in block_read_full_page()
After each filesystem block (as represented by a buffer_head) has been
read from disk by block_read_full_page(), decrypt it if needed.  The
decryption is done on the fscrypt_read_workqueue.

This is the final change needed to support ext4 encryption with
blocksize != PAGE_SIZE, and it's a fairly small change now that
CONFIG_FS_ENCRYPTION is a bool and fs/crypto/ exposes functions to
decrypt individual blocks and to enqueue work on the fscrypt workqueue.

Don't try to add fs-verity support yet, as the fs/verity/ support layer
isn't ready for sub-page blocks yet.  Just add fscrypt support for now.

Almost all the new code is compiled away when CONFIG_FS_ENCRYPTION=n.

Cc: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191023033312.361355-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 16:40:45 -05:00
Jens Axboe
eac406c61c io_uring: make POLL_ADD/POLL_REMOVE scale better
One of the obvious use cases for these commands is networking, where
it's not uncommon to have tons of sockets open and polled for. The
current implementation uses a list for insertion and lookup, which works
fine for file based use cases where the count is usually low, but it
breaks down somewhat for higher numbers of files / sockets. A test case
with 30k sockets being polled for and cancelled takes:

real    0m6.968s
user    0m0.002s
sys     0m6.936s

with the patch it takes:

real    0m0.233s
user    0m0.010s
sys     0m0.176s

If you go to 50k sockets, it gets even more abysmal with the current
code:

real    0m40.602s
user    0m0.010s
sys     0m40.555s

with the patch it takes:

real    0m0.398s
user    0m0.000s
sys     0m0.341s

The change is pretty straightforward: just replace the cancel_list with
a red/black tree.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 12:09:58 -07:00
Jeff Layton
6a81749ebe ceph: increment/decrement dio counter on async requests
Ceph can in some cases issue an async DIO request, in which case we can
end up calling ceph_end_io_direct before the I/O is actually complete.
That may allow buffered operations to proceed while DIO requests are
still in flight.

Fix this by incrementing the i_dio_count when issuing an async DIO
request, and decrement it when tearing down the aio_req.

Fixes: 321fe13c93 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14 18:44:51 +01:00
Jeff Layton
a81bc3102b ceph: take the inode lock before acquiring cap refs
Most of the time, we (or the vfs layer) take the inode_lock and then
acquires caps, but ceph_read_iter does the opposite, and that can lead
to a deadlock.

When there are multiple clients treading over the same data, we can end
up in a situation where a reader takes caps and then tries to acquire
the inode_lock. Another task holds the inode_lock and issues a request
to the MDS which needs to revoke the caps, but that can't happen until
the inode_lock is unwedged.

Fix this by having ceph_read_iter take the inode_lock earlier, before
attempting to acquire caps.

Fixes: 321fe13c93 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Link: https://tracker.ceph.com/issues/36348
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14 18:44:51 +01:00
Jens Axboe
021d1cdda3 io-wq: remove now redundant struct io_wq_nulls_list
Since we don't iterate these lists anymore after commit:

e61df66c69 ("io-wq: ensure free/busy list browsing see all items")

we don't need to retain the nulls value we use for them. That means it's
pretty pointless to wrap the hlist_nulls_head in a structure, so get rid
of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 08:02:19 -07:00
Christoph Hellwig
979c690d9a block: move clearing bd_invalidated into check_disk_size_change
Both callers of check_disk_size_change clear bd_invalidated directly
after the call, so move the clearing into check_disk_size_change
itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:44:01 -07:00
Christoph Hellwig
f0b870df80 block: remove (__)blkdev_reread_part as an exported API
In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>	[dasd]
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:59 -07:00
Christoph Hellwig
142fe8f4bb block: fix bdev_disk_changed for non-partitioned devices
We still have to set the capacity to 0 if invalidating, or call
revalidate_disk if not, even if the disk has no partitions.  Fix
that by merging rescan_partitions into bdev_disk_changed and just
stubbing out blk_add_partitions and blk_drop_partitions for
non-partitioned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:53 -07:00
Christoph Hellwig
a1548b6744 block: move rescan_partitions to fs/block_dev.c
Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:21 -07:00
Christoph Hellwig
6917d06899 block: merge invalidate_partitions into rescan_partitions
A lot of the logic in invalidate_partitions and rescan_partitions is
shared.  Merge the two functions to simplify things.  There is a small
behavior change in that we now send the kevent change notice also if we
were not invalidating but no partitions were found, which seems like
the right thing to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:42:41 -07:00
Pavel Begunkov
a320e9fa1e io_uring: Fix getting file for non-fd opcodes
For timeout requests and a bunch of others, io_uring tries to grab a file
with the specified fd, which is usually stdin/fd=0.
Update io_op_needs_file().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Bob Liu
9d858b2148 io_uring: introduce req_need_defer()
Makes the code easier to read.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Bob Liu
2f6d9b9d63 io_uring: clean up io_uring_cancel_files()
We don't use the return value anymore, drop it. Also drop the
unnecessary double cancel_req value check.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Jens Axboe
e61df66c69 io-wq: ensure free/busy list browsing see all items
We have two lists for workers in io-wq, a busy and a free list. For
certain operations we want to browse all workers, and we currently do
that by browsing the two separate lists. But since these lists are RCU
protected, we can potentially miss workers if they move between the two
lists while we're browsing them.

Add a third list, all_list, that simply holds all workers. A worker is
added to that list when it starts, and removed when it exits. This makes
the worker iteration cleaner, too.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:40:57 -07:00
Darrick J. Wong
f368b29ba9 xfs: fix another missing include
Fix missing include of xfs_filestream.h in xfs_filestream.c so that we
actually check the function declarations against the definitions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-11-13 18:22:41 -08:00
Christoph Hellwig
8234532fd4 xfs: remove XFS_IOC_FSSETDM and XFS_IOC_FSSETDM_BY_HANDLE
These ioctls set DMAPI specific flags in the on-disk inode, but there is
no way to actually ever query those flags.  The only known user is
xfsrestore with the -D option, which is documented to be only useful
inside a DMAPI environment, which isn't supported by upstream XFS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:41 -08:00
YueHaibing
eb0d21637f xfs: remove duplicated include from xfs_dir2_data.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:41 -08:00
Eric Sandeen
a55cefccaa xfs: remove unused structure members & simple typedefs
Remove some unused typedef'd simple types, and some unused
structure members.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:41 -08:00
Eric Sandeen
35dab307c8 xfs: remove unused typedef definitions
Remove some typedefs for type_t's that are no longer referred to
by their typedef'd types.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:40 -08:00
Pavel Reichl
1cc95e6f0d xfs: Replace function declaration by actual definition
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typo in subject line]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:40 -08:00
Pavel Reichl
d0bdfb1069 xfs: remove the xfs_qoff_logitem_t typedef
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:28 -08:00
Pavel Reichl
fd8b81dbbb xfs: remove the xfs_dq_logitem_t typedef
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:26 -08:00
Pavel Reichl
c072fbefe4 xfs: remove the xfs_quotainfo_t typedef
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:23 -08:00
Jens Axboe
5e559561a8 io_uring: ensure registered buffer import returns the IO length
A test case was reported where two linked reads with registered buffers
failed the second link always. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.

Fix this by making io_import_fixed() correctly return the mapped length.
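
The return convention at issue can be sketched in plain C. This is a simplified model under stated assumptions, not the io_uring code: `demo_import()`, `demo_import_buggy()` and `link_may_run()` are hypothetical helpers, and the model assumes the expected result of a linked request is taken from the import's return value.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Convention described above: -errno on failure, mapped length on success. */
static long demo_import(const void *buf, size_t len)
{
	if (buf == NULL)
		return -EFAULT;
	return (long)len;
}

/* The broken variant: reports 0 on success instead of the length. */
static long demo_import_buggy(const void *buf, size_t len)
{
	(void)len;
	if (buf == NULL)
		return -EFAULT;
	return 0;
}

/* A dependent link proceeds only if the I/O result matches the expectation. */
static bool link_may_run(long expected, long io_result)
{
	return io_result == expected;
}
```

In this model a 16-byte read that completes fully matches the expectation from `demo_import()`, but never matches the 0 reported by the buggy variant, so the dependent link would always be failed.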

Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 16:15:14 -07:00
Pavel Begunkov
5683e5406e io_uring: Fix getting file for timeout
For timeout requests io_uring tries to grab a file with the specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:25:57 -07:00
Jens Axboe
36c2f9223e io-wq: ensure we have a stable view of ->cur_work for cancellations
worker->cur_work is currently protected by the lock of the wqe that the
worker belongs to. When we send a signal to a worker, we need a stable
view of ->cur_work, so we need to hold that lock. But this doesn't work
so well, since we have the opposite order potentially on queueing work.
If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
the signal lock, and that sometimes needs to insert work items.

Add a specific worker lock that protects the current work item. Then we
can guarantee that the task we're sending a signal to is currently
processing the exact work we think it is.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 13:51:54 -07:00
Eric Biggers
924e319416 f2fs: support STATX_ATTR_VERITY
Set the STATX_ATTR_VERITY bit when the statx() system call is used on a
verity file on f2fs.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-13 12:15:34 -08:00
Eric Biggers
1f60719552 ext4: support STATX_ATTR_VERITY
Set the STATX_ATTR_VERITY bit when the statx() system call is used on a
verity file on ext4.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-13 12:15:34 -08:00
Linus Torvalds
afd7a71872 for-5.4-rc7-tag

Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fix for an older bug that has started to show up during testing
  (because of an updated test for rename exchange).

  It's an in-memory corruption caused by local variable leaking out of
  the function scope"

* tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log context list corruption after rename exchange operation
2019-11-13 12:06:10 -08:00
Pavel Reichl
aefe69a45d xfs: remove the xfs_disk_dquot_t and xfs_dquot_t
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix some of the comments]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Arnd Bergmann
e8777b27ca xfs: avoid time_t in user api
The ioctl definitions for XFS_IOC_SWAPEXT, XFS_IOC_FSBULKSTAT and
XFS_IOC_FSBULKSTAT_SINGLE are part of libxfs and based on time_t.

The definition for time_t differs between current kernels and coming
32-bit libc variants that define it as 64-bit. For most ioctls, that
means the kernel has to be able to handle two different command codes
based on the different structure sizes.

The same solution could be applied for XFS_IOC_SWAPEXT, but it would
not work for XFS_IOC_FSBULKSTAT and XFS_IOC_FSBULKSTAT_SINGLE because
the structure with the time_t is passed through an indirect pointer,
and the command number itself is based on struct xfs_fsop_bulkreq,
which does not differ based on time_t.
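
The reason the command number cannot distinguish the two layouts can be shown with stand-in structures. The sketch below is Linux-specific and purely illustrative: the `demo_*` types, the command numbers 200 and 201, and `direct_codes_differ()` are hypothetical and are not the XFS definitions. It relies only on the fact that `_IOWR()` encodes the size of its argument type into the command code.

```c
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOWR(), _IOC_SIZE() */

/* Hypothetical timestamp layouts with 32-bit and 64-bit seconds. */
struct demo_time32 { int32_t sec; int32_t nsec; };
struct demo_time64 { int64_t sec; int32_t nsec; };

/* Direct case: the timestamp is embedded in the ioctl argument itself. */
struct demo_direct32 { struct demo_time32 t; };
struct demo_direct64 { struct demo_time64 t; };

/* Indirect case: the argument only points at a buffer of records. */
struct demo_req { void *buf; int count; };

#define DEMO_DIRECT32 _IOWR('X', 200, struct demo_direct32)
#define DEMO_DIRECT64 _IOWR('X', 200, struct demo_direct64)
#define DEMO_INDIRECT _IOWR('X', 201, struct demo_req)

/* Nonzero when the two direct layouts yield different command codes. */
static int direct_codes_differ(void)
{
	return DEMO_DIRECT32 != DEMO_DIRECT64;
}
```

The two direct commands differ because their argument sizes differ, so a kernel could handle both. `DEMO_INDIRECT` encodes only `sizeof(struct demo_req)`, which is the same whatever layout the pointed-to records use.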

This means any solution that can be applied requires a change of the
ABI definition in the xfs_fs.h header file, as well as doing the same
change in any user application that contains a copy of this header.

The usual solution would be to define a replacement structure and
use conditional compilation for the ioctl command codes to use
one or the other, such as

 #define XFS_IOC_FSBULKSTAT_OLD _IOWR('X', 101, struct xfs_fsop_bulkreq)
 #define XFS_IOC_FSBULKSTAT_NEW _IOWR('X', 129, struct xfs_fsop_bulkreq)
 #define XFS_IOC_FSBULKSTAT ((sizeof(time_t) == sizeof(__kernel_long_t)) ? \
			     XFS_IOC_FSBULKSTAT_OLD : XFS_IOC_FSBULKSTAT_NEW)

After this, the kernel would be able to implement both
XFS_IOC_FSBULKSTAT_OLD and XFS_IOC_FSBULKSTAT_NEW handlers on
32-bit architectures with the correct ABI for either definition
of time_t.

However, as long as two observations are true, a much simpler solution
can be used:

1. xfsprogs is the only user space project that has a copy of this header
2. xfsprogs already has a replacement for all three affected ioctl commands,
   based on the xfs_bulkstat structure to pass 64-bit timestamps
   regardless of the architecture

Based on those assumptions, changing xfs_bstime to use __kernel_long_t
instead of time_t in both the kernel and in xfsprogs preserves the current
ABI for any libc definition of time_t and solves the problem of passing
64-bit timestamps to 32-bit user space.

If either of the two assumptions is invalid, more discussion is needed
for coming up with a way to fix as much of the affected user space
code as possible.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
kaixuxia
93597ae8da xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()
When target_ip exists in xfs_rename(), the xfs_dir_replace() call may
need to hold the AGF lock to allocate more blocks, and then the
xfs_droplink() call takes the AGI lock to drop target_ip onto the
unlinked list, so we get the lock order AGF->AGI. This would break the
ordering constraint on AGI and AGF locking - inode allocation locks
the AGI, then can allocate a new extent for new inodes, locking the
AGF after the AGI.

In this patch we first check whether the replace operation needs more
blocks. If so, acquire the AGI lock first to preserve the
locking order (AGI/AGF). Actually, the locking order problem only
occurs when we are locking the AGI/AGF of the same AG. For multiple
AGs the AGI lock will be released after the transaction is committed.

Signed-off-by: kaixuxia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: reword the comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
048a35d2f0 xfs: don't reset the "inode core" in xfs_iread
We have the exact same memset in xfs_inode_alloc, which is always called
just before xfs_iread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
de7a866fd4 xfs: merge the projid fields in struct xfs_icdinode
There is no point in splitting the fields like this in a purely
in-memory structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
8d2d878db8 xfs: use a struct timespec64 for the in-core crtime
struct xfs_icdinode is purely an in-memory data structure, so don't use
a log on-disk structure for it.  This simplifies the code a bit, and
also reduces our include hell slightly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a minor indenting problem in xfs_trans_ichgtime]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
d8d11fc703 xfs: devirtualize ->m_dirnameops
Instead of causing a relatively expensive indirect call for each
hashing and comparison of a file name in a directory, just use an
inline function and a simple branch on the ASCII CI bit.
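
The general pattern can be illustrated in user-space C. This is a sketch of the technique, not the XFS code: the hash functions, the `demo_name_ops` structure and `demo_hashname()` are hypothetical, and the toy hash is chosen only to keep the example short.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy case-sensitive hash (hypothetical, not the XFS hash). */
static unsigned int hash_exact(const char *name, size_t len)
{
	unsigned int h = 0;

	for (size_t i = 0; i < len; i++)
		h = h * 31 + (unsigned char)name[i];
	return h;
}

/* Toy ASCII case-insensitive hash: fold to lower case first. */
static unsigned int hash_ci(const char *name, size_t len)
{
	unsigned int h = 0;

	for (size_t i = 0; i < len; i++)
		h = h * 31 + (unsigned int)tolower((unsigned char)name[i]);
	return h;
}

/* Before: each hash goes through a function pointer (indirect call). */
struct demo_name_ops {
	unsigned int (*hashname)(const char *name, size_t len);
};

/* After: an inline helper with a simple branch on a flag. */
static inline unsigned int demo_hashname(bool ascii_ci, const char *name,
					 size_t len)
{
	if (ascii_ci)
		return hash_ci(name, len);
	return hash_exact(name, len);
}
```

Both forms compute the same values; the inline version simply lets the compiler see the call targets directly instead of dispatching through a pointer.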

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix unused variable warning]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
537dabcfdb xfs: remove the unused m_chsize field
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Darrick J. Wong
a71895c5da xfs: convert open coded corruption check to use XFS_IS_CORRUPT
Convert the last of the open coded corruption check and report idioms to
use the XFS_IS_CORRUPT macro.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-13 11:08:01 -08:00
Jens Axboe
7d7230652e io_wq: add get/put_work handlers to io_wq_create()
For cancellation, we need to ensure that the work item stays valid for
as long as ->cur_work is valid. Right now we can't safely dereference
the work item even under the wqe->lock, because while the ->cur_work
pointer will remain valid, the work could be completing and be freed
in parallel.

Only invoke ->get/put_work() on items we know that the caller queued
themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
when we're queueing a flush item, for instance.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 11:37:54 -07:00
Jens Axboe
15dff286d0 io_uring: check for validity of ->rings in teardown
Normally the rings are always valid; the exception is if we failed to
allocate the rings at setup time. syzbot reports this:

RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
  io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
  io_uring_create fs/io_uring.c:4600 [inline]
  io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
  __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
  __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
  __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is exactly the case of failing to allocate the SQ/CQ rings, and
then entering shutdown. Check if the rings are valid before trying to
access them at shutdown time.

Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
Fixes: 1d7bb1d50f ("io_uring: add support for backlogged CQ ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 09:11:36 -07:00
Peter Zijlstra
cf25e24db6 time: Rename tsk->real_start_time to ->start_boottime
Since it stores CLOCK_BOOTTIME, not, as the name suggests,
CLOCK_REALTIME, let's rename ->real_start_time to ->start_boottime.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:09:49 +01:00
Christoph Hellwig
d41003513e block: rework zone reporting
Avoid the need to allocate a potentially large array of struct blk_zone
in the block layer by switching the ->report_zones method interface to
a callback model. Now the caller simply supplies a callback that is
executed on each reported zone, and private data for it.
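
As a rough user-space illustration of the callback model described above (not the block layer code itself), the sketch below walks a set of made-up zones and hands each one to a caller-supplied callback together with private data. All names (`sketch_zone`, `sketch_zone_cb`, `sketch_report_zones`, `sketch_sum_len`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified zone descriptor; not struct blk_zone. */
struct sketch_zone {
	uint64_t start;
	uint64_t len;
};

/* Callback invoked once per reported zone, with the caller's private data. */
typedef int (*sketch_zone_cb)(const struct sketch_zone *zone,
			      unsigned int idx, void *data);

/*
 * Report nr_zones equally sized zones one at a time.  No array of zones
 * is allocated; each zone lives on the stack for one callback invocation.
 * Returns the number of zones reported, or the first non-zero callback
 * return value.
 */
static int sketch_report_zones(uint64_t zone_len, unsigned int nr_zones,
			       sketch_zone_cb cb, void *data)
{
	assert(cb);

	for (unsigned int i = 0; i < nr_zones; i++) {
		struct sketch_zone z = { .start = i * zone_len, .len = zone_len };
		int ret = cb(&z, i, data);

		if (ret)
			return ret;
	}
	return (int)nr_zones;
}

/* Example callback: accumulate the total length into *data. */
static int sketch_sum_len(const struct sketch_zone *zone, unsigned int idx,
			  void *data)
{
	(void)idx;
	*(uint64_t *)data += zone->len;
	return 0;
}
```

A caller would pass `sketch_sum_len` and a pointer to a `uint64_t` accumulator; the reporting side never needs to know how many zones the caller wants to keep.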

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:12:07 -07:00
Darrick J. Wong
f9e0370648 xfs: kill the XFS_WANT_CORRUPT_* macros
The XFS_WANT_CORRUPT_* macros conceal subtle side effects such as the
creation of local variables and redirections of the code flow.  This is
pretty ugly, so replace them with explicit XFS_IS_CORRUPT tests that
remove both of those ugly points.  The change was performed with the
following coccinelle script:

@@
expression mp, test;
identifier label;
@@

- XFS_WANT_CORRUPTED_GOTO(mp, test, label);
+ if (XFS_IS_CORRUPT(mp, !test)) { error = -EFSCORRUPTED; goto label; }

@@
expression mp, test;
@@

- XFS_WANT_CORRUPTED_RETURN(mp, test);
+ if (XFS_IS_CORRUPT(mp, !test)) return -EFSCORRUPTED;

@@
expression mp, lval, rval;
@@

- XFS_IS_CORRUPT(mp, !(lval == rval))
+ XFS_IS_CORRUPT(mp, lval != rval)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 && e2))
+ XFS_IS_CORRUPT(mp, !e1 || !e2)

@@
expression e1, e2;
@@

- !(e1 == e2)
+ e1 != e2

@@
expression e1, e2, e3, e4, e5, e6;
@@

- !(e1 == e2 && e3 == e4) || e5 != e6
+ e1 != e2 || e3 != e4 || e5 != e6

@@
expression e1, e2, e3, e4, e5, e6;
@@

- !(e1 == e2 || (e3 <= e4 && e5 <= e6))
+ e1 != e2 && (e3 > e4 || e5 > e6)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 <= e2))
+ XFS_IS_CORRUPT(mp, e1 > e2)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 < e2))
+ XFS_IS_CORRUPT(mp, e1 >= e2)

@@
expression mp, e1;
@@

- XFS_IS_CORRUPT(mp, !!e1)
+ XFS_IS_CORRUPT(mp, e1)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 || e2))
+ XFS_IS_CORRUPT(mp, !e1 && !e2)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 == e2) && !(e3 == e4))
+ XFS_IS_CORRUPT(mp, e1 != e2 && e3 != e4)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 <= e2) || !(e3 >= e4))
+ XFS_IS_CORRUPT(mp, e1 > e2 || e3 < e4)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 == e2) && !(e3 <= e4))
+ XFS_IS_CORRUPT(mp, e1 != e2 && e3 > e4)
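
The simplification rules above are applications of De Morgan's laws to integer comparisons. As a sanity check of one of the less obvious rules, the following stand-alone sketch (not part of the patch; `sketch_count_mismatches` is a hypothetical name) compares the "before" and "after" forms over every combination of six small integer operands.

```c
#include <assert.h>

/*
 * Count the inputs on which
 *     !(e1 == e2 || (e3 <= e4 && e5 <= e6))
 * and
 *     e1 != e2 && (e3 > e4 || e5 > e6)
 * disagree, with each operand ranging over 0..2 (3^6 = 729 cases).
 */
static int sketch_count_mismatches(void)
{
	int mismatches = 0;

	for (int n = 0; n < 729; n++) {
		/* Decode n as six base-3 digits. */
		int e1 = n % 3, e2 = n / 3 % 3, e3 = n / 9 % 3;
		int e4 = n / 27 % 3, e5 = n / 81 % 3, e6 = n / 243 % 3;

		int before = !(e1 == e2 || (e3 <= e4 && e5 <= e6));
		int after = e1 != e2 && (e3 > e4 || e5 > e6);

		if (before != after)
			mismatches++;
	}
	return mismatches;
}

static void sketch_demorgan_demo(void)
{
	/* The two forms agree on every input in the range. */
	assert(sketch_count_mismatches() == 0);
}
```

The equivalence relies on the operands being integers, where `!(a <= b)` is the same as `a > b`.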

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-12 17:19:02 -08:00
Darrick J. Wong
1ec28615d2 xfs: add a XFS_IS_CORRUPT macro
Add a new macro, XFS_IS_CORRUPT, which we will use to integrate some
corruption reporting when the corruption test expression is true.  This
will be used in the next patch to remove the ugly XFS_WANT_CORRUPT*
macros.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-12 17:19:02 -08:00
Jens Axboe
7c9e7f0fe0 io_uring: fix potential deadlock in io_poll_wake()
We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point, we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.

IO completion for a timeout link will need to grab the completion lock,
but that's not safe from this context. Put the completion under the
completion_lock in io_poll_wake(), and mark the request as entering
the completion with the completion_lock already held.

Fixes: 2665abfd75 ("io_uring: add support for linked SQE timeouts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 12:26:34 -07:00
Tejun Heo
40430452fd kernfs: use 64bit inos if ino_t is 64bit
Each kernfs_node is identified with a 64bit ID.  The low 32bit is
exposed as ino and the high gen.  While this already allows using inos
as keys by looking up with wildcard generation number of 0, it's
adding unnecessary complications for 64bit ino archs which can
directly use kernfs_node IDs as inos to uniquely identify each cgroup
instance.

This patch exposes IDs directly as inos on 64bit ino archs.  The
conversion is mostly straight-forward.

* 32bit ino archs behave the same as before.  64bit ino archs now use
  the whole 64bit ID as ino and the generation number is fixed at 1.

* 64bit inos still use the same idr allocator which guarantees that the
  lower 32bits identify the current live instance uniquely and the
  high 32bits are incremented whenever the low bits wrap.  As the
  upper 32bits are no longer used as gen and we don't wanna start ino
  allocation with 33rd bit set, the initial value for highbits
  allocation is changed to 0 on 64bit ino archs.

* blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
  the issuing cgroup.  Userland builds FILEID_INO32_GEN fids from
  these numbers to look up the cgroups.  To remain compatible with the
  behavior, always output (LOW32,HIGH32) which will be constructed
  back to the original 64bit ID by __kernfs_fh_to_dentry().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
33c5ac9175 kernfs: implement custom exportfs ops and fid type
The current kernfs exportfs implementation uses the generic_fh_*()
helpers and FILEID_INO32_GEN[_PARENT] which limits ino to 32bits.
Let's implement custom exportfs operations and fid type to remove the
restriction.

* FILEID_KERNFS is a single u64 value whose content is
  kernfs_node->id.  This is the only native fid type.

* For backward compatibility with blk_log_action() path which exposes
  (ino,gen) pairs which userland assembles into FILEID_INO32_GEN keys,
  combine the generic keys into 64bit IDs in the same order.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
fe0f726c9f kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
kernfs_find_and_get_node_by_ino() looks up the kernfs_node matching the
specified ino.  On top of that, kernfs_get_node_by_id() and
kernfs_fh_get_inode() implement full ID matching by testing the rest
of ID.

On the surface, confusingly, the two are slightly different in that the
latter uses 0 gen as wildcard while the former doesn't - does it mean
that the latter can't uniquely identify inodes w/ 0 gen?  In practice,
this is a distinction without a difference because generation number
starts at 1.  There are no actual IDs with 0 gen, so it can always
safely be used as a wildcard.

Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
and removing now unnecessary kernfs_get_node_by_id().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
67c0496e87 kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value.  I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to.  Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper.  Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen.  This makes ID handling less cumbersome and will
allow using 64bit inos on supported archs.
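
The layout described above (ino in the lower 32 bits, gen in the upper 32 bits) can be sketched in user space as follows. This is an illustration only; the helper names are hypothetical and are not the kernfs accessors themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Lower 32 bits of the 64-bit ID hold the ino. */
static inline uint32_t sketch_id_ino(uint64_t id)
{
	return (uint32_t)id;
}

/* Upper 32 bits of the 64-bit ID hold the generation. */
static inline uint32_t sketch_id_gen(uint64_t id)
{
	return (uint32_t)(id >> 32);
}

/* Combine an (ino, gen) pair back into a single 64-bit ID. */
static inline uint64_t sketch_make_id(uint32_t ino, uint32_t gen)
{
	return ((uint64_t)gen << 32) | ino;
}

static void sketch_id_demo(void)
{
	uint64_t id = sketch_make_id(0xdeadbeef, 1);

	assert(id == 0x1deadbeefULL);
	assert(sketch_id_ino(id) == 0xdeadbeef);
	assert(sketch_id_gen(id) == 1);
}
```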

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo
880df13161 kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
kernfs node can be created in two separate steps - allocation and
activation.  This is used to make kernfs nodes visible only after the
internal states attached to the node are fully initialized.
kernfs_find_and_get_node_by_id() currently allows lookups of nodes
which aren't activated yet and thus can expose nodes which are
still being prepped by kernfs users.

Fix it by disallowing lookups of nodes which aren't activated yet.

kernfs_find_and_get_node_by_ino()

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo
b680b08171 kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
kernfs_find_and_get_node_by_ino() uses RCU protection.  It's currently
a bit buggy because it can look up a node which hasn't been activated
yet and thus may end up exposing a node that the kernfs user is still
prepping.

While it can be fixed by pushing it further in the current direction,
it's already complicated and isn't clear whether the complexity is
justified.  The main use of kernfs_find_and_get_node_by_ino() is for
exportfs operations.  They aren't super hot and all the follow-up
operations (e.g. mapping to path) use normal locking anyway.

Let's switch to a dumber locking scheme and protect the lookup with
kernfs_idr_lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo
e23f568aa6 kernfs: fix ino wrap-around detection
When the 32bit ino wraps around, kernfs increments the generation
number to distinguish reused ino instances.  The wrap-around detection
tests whether the allocated ino is lower than the cursor, but the
cursor is pointing to the next ino to allocate, so the condition never
triggers.

Fix it by remembering the last ino and comparing against that.
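
A toy model of the fix, assuming nothing about the real idr allocator: keep the most recently allocated ino and bump the generation whenever a newly allocated ino is lower than it. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_ino_state {
	uint32_t last_ino;	/* most recently allocated ino */
	uint32_t generation;	/* bumped on every detected wrap-around */
};

/* Record a freshly allocated ino and detect wrap-around against the last one. */
static void sketch_note_alloc(struct sketch_ino_state *st, uint32_t ino)
{
	assert(st);

	if (ino < st->last_ino)
		st->generation++;
	st->last_ino = ino;
}
```

Feeding this the sequence 5, 6, 2 leaves the generation untouched for the first two allocations and increments it once when 2 follows 6.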

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 4a3ef68aca ("kernfs: implement i_generation")
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org # v4.14+
2019-11-12 08:18:03 -08:00
Jens Axboe
960e432dfa io_uring: use correct "is IO worker" helper
Since we switched to io-wq, the dependent link optimization for when to
pass back work inline has been broken. Fix this by providing a suitable
io-wq helper for io_uring to use to detect when to do this.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 08:02:26 -07:00
Jens Axboe
93bd25bb69 io_uring: make timeout sequence == 0 mean no sequence
Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just
a pure timeout.

If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request, let it purely be driven by the timeout specified.

Reported-by: 李通洲 <carter.li@eoitek.com>
Reviewed-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 00:18:51 -07:00
Jens Axboe
76a46e066e io_uring: fix -ENOENT issue with linked timer with short timeout
If you prep a read (for example) that needs to get punted to async
context with a timer, if the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.

The issue is that we prep and start the timer before we start the read.
Hence the timer can trigger before the read is even started, and the end
result is then that the timer completes with -ENOENT, while the read
starts instead of being cancelled by the timer.

Fix this by splitting the linked timer into two parts:

1) Prep and validate the linked timer
2) Start timer

The read is then started between steps 1 and 2, so we know that the
timer will always have a consistent view of the read request state.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:22 -07:00
Jens Axboe
768134d4f4 io_uring: don't do flush cancel under inflight_lock
We can't safely cancel under the inflight lock. If the work hasn't been
started yet, then io_wq_cancel_work() simply marks the work as cancelled
and invokes the work handler. But if the work completion needs to grab
the inflight lock because it's grabbing user files, then we'll deadlock
trying to finish the work as we already hold that lock.

Instead grab a reference to the request, if it isn't already zero. If
it's zero, then we know it's going through completion anyway, and we
can safely ignore it. If it's not zero, then we can drop the lock and
attempt to cancel from there.
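
The "grab a reference if it isn't already zero" step is a common pattern. A minimal user-space sketch with C11 atomics, assuming a plain integer reference count (the helper name is hypothetical and this is not the kernel's refcount implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is still non-zero.  A zero count
 * means the object is already on its way through completion, so the
 * caller must leave it alone.
 */
static bool sketch_ref_inc_not_zero(atomic_int *refs)
{
	assert(refs);

	int old = atomic_load(refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
	}
	return false;
}
```

Once the reference is held, the lock can be dropped and the cancel attempted without the object disappearing underneath the caller.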

This also fixes a missing finish_wait() at the end of
io_uring_cancel_files().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:17 -07:00
Jens Axboe
c1edbf5f08 io_uring: flag SQPOLL busy condition to userspace
Now that we have backpressure, for SQPOLL, we have one more condition
that warrants flagging that the application needs to enter the kernel:
we failed to submit IO due to backpressure. Make sure we catch that
and flag it appropriately.

If we run into backpressure issues with the SQPOLL thread, flag it
as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
cause the application to enter the kernel, and that will flush the
backlog and clear the condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:11 -07:00
Jens Axboe
47f467686e io_uring: make ASYNC_CANCEL work with poll and timeout
It's a little confusing that we have multiple types of command
cancellation opcodes now that we have a generic one. Make the generic
one work with POLL_ADD and TIMEOUT commands as well, that makes for an
easier to use API for the application. The fact that they currently
don't is a bit confusing.

Add a helper that takes care of it, so we can use it from both
IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:05 -07:00
Jens Axboe
0ddf92e848 io_uring: provide fallback request for OOM situations
One thing that really sucks for userspace APIs is if the kernel passes
back -ENOMEM/-EAGAIN for resource shortages. The application really has
no idea of what to do in those cases. Should it try and reap
completions? Probably a good idea. Will it solve the issue? Who knows.

This patch adds a simple fallback mechanism if we fail to allocate
memory for a request. If we fail allocating memory from the slab for a
request, we punt to a pre-allocated request. There's just one of these
per io_ring_ctx, but the important part is if we ever return -EBUSY to
the application, the application knows that it can wait for events and
make forward progress when events have completed. This is the important
part.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:32:55 -07:00
Jan Stancek
e9f930ac88 iomap: fix return value of iomap_dio_bio_actor on 32bit systems
Naresh reported LTP diotest4 failing for 32bit x86 and arm -next
kernels on ext4. Same problem exists in 5.4-rc7 on xfs.

The failure comes down to:
  openat(AT_FDCWD, "testdata-4.5918", O_RDWR|O_DIRECT) = 4
  mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f7b000
  read(4, 0xb7f7b000, 4096)              = 0 // expects -EFAULT

Problem is conversion at iomap_dio_bio_actor() return. Ternary
operator has a return type and an attempt is made to convert each
of operands to the type of the other. In this case "ret" (int)
is converted to type of "copied" (unsigned long). Both have size
of 4 bytes:
    size_t copied = 0;
    int ret = -14;
    long long actor_ret = copied ? copied : ret;

    On x86_64: actor_ret == -14;
    On x86   : actor_ret == 4294967282

Replace ternary operator with 2 return statements to avoid this
unwanted conversion.
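
The conversion can be reproduced in user space on any host by using fixed-width types in place of size_t. The sketch below is illustrative only and is not the iomap code: uint32_t stands in for a 32-bit size_t, uint64_t for a 64-bit one, and all function names are hypothetical. It assumes a 32-bit int and the usual wrap-around behaviour of gcc and clang when converting an out-of-range unsigned value to a signed type.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy form with a 32-bit size_t: "ret" (int) is converted to the
 * unsigned type of "copied" before the result is widened to long long.
 */
static long long sketch_ret_ternary32(uint32_t copied, int ret)
{
	return copied ? copied : ret;
}

/*
 * Same expression with a 64-bit size_t: "ret" is sign-extended to 64
 * bits first, so converting back to long long recovers the negative value.
 */
static long long sketch_ret_ternary64(uint64_t copied, int ret)
{
	return copied ? copied : ret;
}

/* Fixed form: two return statements, so no common type is computed. */
static long long sketch_ret_fixed32(uint32_t copied, int ret)
{
	if (copied)
		return copied;
	return ret;
}

static void sketch_ternary_demo(void)
{
	assert(sketch_ret_ternary32(0, -14) == 4294967282LL);
	assert(sketch_ret_ternary64(0, -14) == -14);
	assert(sketch_ret_fixed32(0, -14) == -14);
}
```

Compilers may warn about the signedness change in the two ternary helpers; that warning is exactly the behaviour being demonstrated.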

Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-11 12:58:24 -08:00
Darrick J. Wong
2713fefa5d xfs: attach dquots before performing xfs_swap_extents
Make sure we attach dquots to both inodes before swapping their extents.
This was found via manual code inspection by looking for places where we
could call xfs_trans_mod_dquot without dquots attached to inodes, and
confirmed by instrumenting the kernel and running xfs/328.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11 12:42:52 -08:00
Darrick J. Wong
2815a16d7f xfs: attach dquots and reserve quota blocks during unwritten conversion
In xfs_iomap_write_unwritten, we need to ensure that dquots are attached
to the inode and quota blocks reserved so that we capture in the quota
counters any blocks allocated to handle a bmbt split.  This can happen
on the first unwritten extent conversion to a preallocated sparse file
on a fresh mount.

This was found by running generic/311 with quotas enabled.  The bug
seems to have been introduced in "[XFS] rework iocore infrastructure,
remove some code and make it more" from ~2002?

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11 12:42:52 -08:00
Darrick J. Wong
27d9ee577d xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock
Coverity points out that xfs_btree_islastblock doesn't check the return
value of xfs_btree_check_block.  Since the question "Does the cursor
point to the last block in this level?" only makes sense if the caller
previously performed a lookup or seek operation, the block should
already have been checked.

Therefore, check the return value in an ASSERT and turn the whole thing
into a static inline predicate.

Coverity-id: 114069
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11 12:42:52 -08:00
Filipe Manana
e6c617102c Btrfs: fix log context list corruption after rename exchange operation
During rename exchange we might have successfully logged the new name in the
source root's log tree, in which case we leave our log context (allocated
on stack) in the root's list of log contexts. However we might fail to
log the new name in the destination root, in which case we fall back to
a transaction commit later and never sync the log of the source root,
which causes the source root log context to remain in the list of log
contexts. This later causes invalid memory accesses because the context
was allocated on stack and after rename exchange finishes the stack gets
reused and overwritten for other purposes.

The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
detect this and report something like the following:

  [  691.489929] ------------[ cut here ]------------
  [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
  [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
  (...)
  [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
  [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
  (...)
  [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
  [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
  [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
  [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
  [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
  [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
  [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
  [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
  [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [  691.490029] Call Trace:
  [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
  [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
  [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
  [  691.490111]  ? dget_parent+0xb8/0x460
  [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
  [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
  [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
  [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
  [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
  [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
  [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
  [  691.490204]  ? __do_sys_newstat+0x88/0xd0
  [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
  [  691.490218]  ? do_fsync+0x38/0x60
  [  691.490220]  do_fsync+0x38/0x60
  [  691.490224]  __x64_sys_fdatasync+0x32/0x40
  [  691.490228]  do_syscall_64+0x9f/0x540
  [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [  691.490235] RIP: 0033:0x7f4b253ad5f0
  (...)
  [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
  [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
  [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
  [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
  [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
  [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0

This started happening recently when running some test cases from fstests
like btrfs/004 for example, because support for rename exchange was added
last week to fsstress from fstests.

So fix this by deleting the log context for the source root from the list
if we have logged the new name in the source root.

Reported-by: Su Yue <Damenly_Su@gmx.com>
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Tested-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-11 19:46:02 +01:00
Konstantin Khlebnikov
6fcbcec9cf fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
Quota statistics are counted as 64-bit per-cpu counters. Reading sums the
per-cpu fractions as a signed 64-bit int, filters negative values and then
reports the lower half as a signed 32-bit int.

The result may look like:

fs.quota.allocated_dquots = 22327
fs.quota.cache_hits = -489852115
fs.quota.drops = -487288718
fs.quota.free_dquots = 22083
fs.quota.lookups = -486883485
fs.quota.reads = 22327
fs.quota.syncs = 335064
fs.quota.writes = 3088689

Values bigger than 2^31-1 are reported as negative.

All counters except "allocated_dquots" and "free_dquots" are monotonic,
thus they should be reported as is without filtering negative values.

The kernel doesn't have a generic helper for 64-bit sysctls yet, so
let's at least use unsigned long.
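
The truncation can be illustrated in user space. The sketch below is not the quota code; the function names are hypothetical and the counter value 3805115181 is chosen purely for illustration, as one value whose low 32 bits read back as the cache_hits figure shown above. It assumes two's-complement wrap-around on narrowing conversions, as gcc and clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour: clamp negative sums, then report the low half as int. */
static int32_t sketch_report_int32(int64_t sum)
{
	if (sum < 0)
		sum = 0;
	return (int32_t)(uint32_t)sum;
}

/* New behaviour: report the sum as unsigned long, without filtering. */
static unsigned long sketch_report_ulong(int64_t sum)
{
	return (unsigned long)sum;
}

static void sketch_quota_demo(void)
{
	/* 3805115181 > 2^31-1, so the 32-bit signed view goes negative. */
	assert(sketch_report_int32(3805115181LL) == -489852115);
	assert(sketch_report_ulong(3805115181LL) == 3805115181UL);
}
```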

Link: https://lore.kernel.org/r/157337934693.2078.9842146413181153727.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-11 11:06:27 +01:00
Ingo Molnar
6d5a763c30 Linux 5.4-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3IqJQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOiUH+gOEDwid5OODaFAd
 CggXugdFIlBZefKqGVNW5sjgX8pxFWHXuEMC8iNb6QXtQZdFrI6LFf9hhUDmzQtm
 6y1LPxxEiTZjObMEsBNylb7tyzgujFHcAlp0Zro3w/HLCqmYTSP3FF46i2u6KZfL
 XhkpM4X7R7qxlfpdhlfESv/ElRGocZe6SwXfC7pcPo5flFcmkdu9ijqhNd/6CZ/h
 Nf9rTsD/wEDVUelFbgVN+LJzlaB0tsyc4Zbof07n8OsFZjhdEOop8gfM/kTBLcyY
 6bh66SfDScdsNnC/l8csbPjSZRx+i+nQs67DyhGNnsSAFgHBZdC4Tb/2mDCwhCLR
 dUvuYZc=
 =1N6F
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc7' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:34:59 +01:00
Jens Axboe
8e3cca1270 io_uring: convert accept4() -ERESTARTSYS into -EINTR
If we cancel a pending accept operation with a signal, we get
-ERESTARTSYS returned. Turn that into -EINTR for userspace, we should
not be returning -ERESTARTSYS.

Fixes: 17f2fe35d0 ("io_uring: add support for IORING_OP_ACCEPT")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
46568e9be7 io_uring: fix error clear of ->file_table in io_sqe_files_register()
syzbot reports that when using failslab and friends, we can get a double
free in io_sqe_files_unregister():

BUG: KASAN: double-free or invalid-free in
io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185

CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x197/0x210 lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
  __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3426 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3757
  io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
  io_ring_ctx_free fs/io_uring.c:3998 [inline]
  io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
  io_uring_release+0x42/0x50 fs/io_uring.c:4068
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x904/0x2e60 kernel/exit.c:817
  do_group_exit+0x135/0x360 kernel/exit.c:921
  __do_sys_exit_group kernel/exit.c:932 [inline]
  __se_sys_exit_group kernel/exit.c:930 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43f2c8
Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000

This happens if we fail allocating the file tables. For that case we do
free the file table correctly, but we forget to set it to NULL. This
means that ring teardown will see it as being non-NULL, and attempt to
free it again.

Fix this by clearing the file_table pointer if we free the table.
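
The free-then-clear pattern behind this fix can be sketched in plain userspace C. The struct and function names below are hypothetical stand-ins and the failure is injected by a flag; the real code lives in fs/io_uring.c and uses kernel allocators.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the ring context; only the table pointer matters. */
struct sketch_ctx {
	void **file_table;
};

/*
 * Registration error path: if a later step fails after the table was
 * allocated, free it AND clear the pointer.
 */
static int sketch_files_register(struct sketch_ctx *ctx, size_t nr,
				 int later_step_fails)
{
	ctx->file_table = calloc(nr, sizeof(*ctx->file_table));
	if (!ctx->file_table)
		return -1;

	if (later_step_fails) {
		free(ctx->file_table);
		ctx->file_table = NULL;	/* without this, teardown frees it again */
		return -1;
	}
	return 0;
}

/* Teardown treats a non-NULL pointer as "still owned". */
static void sketch_files_unregister(struct sketch_ctx *ctx)
{
	if (!ctx->file_table)
		return;
	free(ctx->file_table);
	ctx->file_table = NULL;
}
```

With the pointer cleared on the error path, a later teardown call becomes a no-op instead of a second free.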

Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
Fixes: 65e19f54d2 ("io_uring: support for larger fixed file sets")
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
c69f8dbe24 io_uring: separate the io_free_req and io_free_req_find_next interface
Mirror the existing distinction between io_put_req and
io_put_req_find_next by splitting io_free_req the same way. No
functional changes.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
ec9c02ad4c io_uring: keep io_put_req only responsible for release and put req
We already have io_put_req_find_next to find the next req of the link,
so we should not use the io_put_req function to find it. The two should
be functions of the same level.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
a197f664a0 io_uring: remove passed in 'ctx' function parameter ctx if possible
In many functions the central object is req, and req->ctx is already
set at initialization time, so there is no need to pass in the ctx
from the caller.

Cleanup, no functional change.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
206aefde4f io_uring: reduce/pack size of io_ring_ctx
With the recent flurry of additions and changes to io_uring, the
layout of io_ring_ctx has become a bit stale. We're right now at
704 bytes in size on my x86-64 build, or 11 cachelines. This
patch does two things:

- We have two completion structs embedded, that we only use for
  quiesce of the ctx (or shutdown) and for sqthread init cases.
  That's 2x32 bytes right there, let's dynamically allocate them.

- Reorder the struct a bit with an eye on cachelines, use cases,
  and holes.

With this patch, we're down to 512 bytes, or 8 cachelines.
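
The cacheline arithmetic and the effect of member ordering can be illustrated with a small sketch. The 64-byte line size is an assumption matching the x86-64 figures quoted above, and the two structs are hypothetical examples, not the real io_ring_ctx layout.

```c
#include <stddef.h>

#define SKETCH_CACHELINE 64	/* assumed x86-64 cache line size */

/* Number of cache lines a structure of the given size spans. */
static size_t sketch_cachelines(size_t bytes)
{
	return (bytes + SKETCH_CACHELINE - 1) / SKETCH_CACHELINE;
}

/* Small members interleaved with pointers leave padding holes. */
struct sketch_holes {
	char  flag_a;
	void *ptr_a;
	char  flag_b;
	void *ptr_b;
};

/* Same members, grouped by size so the small ones share one slot. */
struct sketch_packed {
	void *ptr_a;
	void *ptr_b;
	char  flag_a;
	char  flag_b;
};
```

On a typical LP64 build the first struct occupies 32 bytes and the second 24, although the exact numbers depend on the ABI. In the kernel tree, a tool such as pahole is commonly used to find such holes.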

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Christoph Hellwig
23220fe260 xfs: always pass a valid hdr to xfs_dir3_leaf_check_int
Move the code for extracting the incore header to the only caller that
didn't already do that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
ae42976de7 xfs: merge xfs_dir2_data_freescan and xfs_dir2_data_freescan_int
There is no real need for the xfs_dir2_data_freescan wrapper, so rename
xfs_dir2_data_freescan_int to xfs_dir2_data_freescan and let the
callers dereference the mount pointer from the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
957ee13e20 xfs: remove the now unused dir ops infrastructure
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
59b8b46505 xfs: devirtualize ->data_get_ftype and ->data_put_ftype
Replace the ->data_get_ftype and ->data_put_ftype dir ops methods with
directly called xfs_dir2_data_get_ftype and xfs_dir2_data_put_ftype
helpers that take care of the differences between the directory format
with and without the file type field.
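
The devirtualization idea used throughout this series (a direct helper that branches on a feature flag instead of an indirect call through an ops table) can be sketched as follows. All names here are hypothetical and the entry layout is simplified: the file type is assumed to be stored in the byte right after the name, and namelen is assumed to be at most SKETCH_NAME_MAX.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FT_UNKNOWN 0
#define SKETCH_NAME_MAX   16

/* Hypothetical mount state: is the file type field enabled? */
struct sketch_mount {
	bool has_ftype;
};

/* Simplified entry: name bytes, optionally followed by one ftype byte. */
struct sketch_entry {
	uint8_t namelen;
	uint8_t name[SKETCH_NAME_MAX + 1];
};

/* Direct call: one branch on the feature flag, no function pointer. */
static uint8_t sketch_get_ftype(const struct sketch_mount *mp,
				const struct sketch_entry *ep)
{
	if (!mp->has_ftype)
		return SKETCH_FT_UNKNOWN;
	return ep->name[ep->namelen];
}

static void sketch_put_ftype(const struct sketch_mount *mp,
			     struct sketch_entry *ep, uint8_t ftype)
{
	if (mp->has_ftype)
		ep->name[ep->namelen] = ftype;
}
```

Because the helper is an ordinary function, the compiler can inline it, which is not generally possible for a call through a method table.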

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
1848b607a9 xfs: devirtualize ->data_bestfree_p
Replace the ->data_bestfree_p dir ops method with a directly called
xfs_dir2_data_bestfree_p helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
711c7dbf5f xfs: cleanup xfs_dir2_data_entsize
Remove the XFS_DIR2_DATA_ENTSIZE and XFS_DIR3_DATA_ENTSIZE and open
code them in their only caller, which now becomes so simple that
we can turn it into an inline function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
d73e1cee8a xfs: move the dir2 data block fixed offsets to struct xfs_da_geometry
Move the data block fixed offsets towards our structure for dir/attr
geometry parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
7e8ae7bd1c xfs: devirtualize ->data_entry_tag_p
Replace the ->data_entry_tag_p dir ops method with a directly called
xfs_dir2_data_entry_tag_p helper that takes care of the differences
between the directory format with and without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
fdbb8c5b80 xfs: devirtualize ->data_entsize
Replace the ->data_entsize dir ops method with a directly called
xfs_dir2_data_entsize helper that takes care of the differences between
the directory format with and without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
5c072127d3 xfs: replace xfs_dir3_data_endp with xfs_dir3_data_end_offset
All the callers really want an offset into the buffer, so adapt
the helper to return that instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
9eedae1089 xfs: remove the now unused ->data_entry_p method
Now that all users use the data_entry_offset field this method is
unused and can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
48a71399e7 xfs: cleanup __xfs_dir3_data_check
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.
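
The iteration style used by this and the following cleanups can be sketched on a toy buffer. The record layout (one length byte followed by that many payload bytes) and the function name are hypothetical and far simpler than a real directory data block.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record layout: one length byte followed by that many
 * payload bytes. Returns the number of complete records in buf[0..end).
 */
static unsigned int sketch_count_records(const uint8_t *buf, size_t end)
{
	size_t offset = 0;
	unsigned int count = 0;

	while (offset < end) {
		/* Pointer arithmetic only to locate the current record. */
		const uint8_t *rec = buf + offset;
		size_t len = 1 + (size_t)rec[0];

		if (len > end - offset)
			break;	/* truncated record, stop */
		count++;
		offset += len;
	}
	return count;
}
```

Keeping the loop state as an offset makes the bounds check a plain integer comparison against the end of the buffer.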

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
62479f5734 xfs: cleanup xfs_dir2_data_freescan_int
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
8073af5153 xfs: cleanup xfs_dir2_block_to_sf
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
4a1a8b2f5f xfs: cleanup xchk_directory_data_bestfree
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
4c037dd5fd xfs: cleanup xchk_dir_rec
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
2f4369a862 xfs: cleanup xfs_dir2_leaf_getdents
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
263dde869b xfs: cleanup xfs_dir2_block_getdents
Use an offset as the main means for iteration, and only do pointer
arithmetic to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
ee641d5af5 xfs: remove the ->data_unused_p method
Replace the two users of the ->data_unused_p dir ops method with a
direct calculation using ->data_entry_offset, and clean them up a bit.
xfs_dir2_sf_to_block already had an offset variable containing the
value of ->data_entry_offset, which we are now reusing to make it
clear that the initial freespace entry is at the same place that
we later fill in the 1 entry, and in xfs_dir3_data_init the function
is cleaned up a bit to keep the initialization of fields of a given
structure close to each other, and to avoid a local variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
da3ca0df8b xfs: remove the ->data_dot_entry_p and ->data_dotdot_entry_p methods
The only user of the ->data_dot_entry_p and ->data_dotdot_entry_p
methods is the xfs_dir2_sf_to_block function that builds block format
directories from a short form directory.  It already uses pointer
arithmetic with an offset variable to do so for the real entries in
the directory, so switch the generation of the . and .. entries to
the same scheme, and clean up some of the later pointer arithmetic
to use bp->b_addr directly as well and avoid some casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
2eb68a5d36 xfs: remove the data_dotdot_offset field in struct xfs_dir_ops
The data_dotdot_offset value is always equal to data_entry_offset plus
the fixed size of the "." entry.  Right now calculating that fixed size
requires an indirect call, but by the end of this series it will be
an inline function that can be constant folded.
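
The relationship stated above can be sketched with an inline entry-size helper. The constants are illustrative assumptions (a 9-byte fixed part made of an 8-byte inode number and a 1-byte name length, a 2-byte trailing tag, 8-byte alignment), and the function names are hypothetical rather than the ones introduced by the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constants, assumed rather than taken from the patch. */
#define SKETCH_ENTRY_FIXED 9	/* 8-byte inode number + 1-byte name length */
#define SKETCH_TAG_SIZE    2	/* trailing 2-byte tag */
#define SKETCH_ALIGN       8	/* entries are padded to 8 bytes */

/* Size of one data entry, rounded up to the alignment. */
static inline size_t sketch_entsize(bool has_ftype, size_t namelen)
{
	size_t len = SKETCH_ENTRY_FIXED + namelen + SKETCH_TAG_SIZE;

	if (has_ftype)
		len += 1;	/* one extra byte for the file type */
	return (len + SKETCH_ALIGN - 1) & ~(size_t)(SKETCH_ALIGN - 1);
}

/* ".." follows "." directly, and "." has a one-byte name. */
static inline size_t sketch_dotdot_offset(size_t data_entry_offset,
					  bool has_ftype)
{
	return data_entry_offset + sketch_entsize(has_ftype, 1);
}
```

Since the name length of "." is a compile-time constant, a compiler can fold the whole size calculation once the helper is inline, which is the point the message makes about avoiding the indirect call.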

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
1682310474 xfs: remove the data_dot_offset field in struct xfs_dir_ops
The data_dot_offset value is always equal to data_entry_offset given
that "." is always the first entry in the directory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
c81484e2b9 xfs: remove the unused ->data_first_entry_p method
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
4501ed2a3a xfs: devirtualize ->sf_get_ftype and ->sf_put_ftype
Replace the ->sf_get_ftype and ->sf_put_ftype dir ops methods with
directly called xfs_dir2_sf_get_ftype and xfs_dir2_sf_put_ftype helpers
that take care of the differences between the directory format with and
without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
93b1e96a42 xfs: devirtualize ->sf_get_ino and ->sf_put_ino
Replace the ->sf_get_ino and ->sf_put_ino dir ops methods with directly
called xfs_dir2_sf_get_ino and xfs_dir2_sf_put_ino helpers that take care
of the difference between the directory format with and without the file
type field.  Also move xfs_dir2_sf_get_parent_ino and
xfs_dir2_sf_put_parent_ino to xfs_dir2_sf.c with the rest of the
low-level short form entry handling and use XFS_MAXINUMBER instead of
open-coded constants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
50f6bb6b7a xfs: devirtualize ->sf_entsize and ->sf_nextentry
Just check for file-type enabled directories directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
84915e1bdd xfs: devirtualize ->sf_get_parent_ino and ->sf_put_parent_ino
The parent inode handling is the same for all directory format variants,
just use direct calls instead of going through a pointless indirect
call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
3d92c93b70 xfs: devirtualize ->db_to_fdb and ->db_to_fdindex
Now that the max bests value is in struct xfs_da_geometry both instances
of ->db_to_fdb and ->db_to_fdindex are identical.  Replace them with
local xfs_dir2_db_to_fdb and xfs_dir2_db_to_fdindex functions in
xfs_dir2_node.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
5893e4feb0 xfs: move the max dir2 free bests count to struct xfs_da_geometry
Move the max free bests count towards our structure for dir/attr
geometry parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
ed1d612fbe xfs: move the dir2 free header size to struct xfs_da_geometry
Move the free header size towards our structure for dir/attr geometry
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
a84f3d5cb0 xfs: add a bests pointer to struct xfs_dir3_icfree_hdr
All but two callers of the ->free_bests_p dir operation already have a
struct xfs_dir3_icfree_hdr from a previous call to
xfs_dir2_free_hdr_from_disk at hand.  Add a pointer to the bests to
struct xfs_dir3_icfree_hdr to clean up this pattern.  To optimize this
pattern, pass the struct xfs_dir3_icfree_hdr to xfs_dir2_free_log_bests
instead of recalculating the pointer there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
195b0a44ab xfs: make the xfs_dir3_icfree_hdr available to xfs_dir2_node_addname_int
Return the xfs_dir3_icfree_hdr used by the helpers called from
xfs_dir2_node_addname_int to the main function to prepare for the
next round of changes where we'll use the ichdr in xfs_dir3_icfree_hdr
to avoid extra operations to find the bests pointers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
200dada700 xfs: devirtualize ->free_hdr_to_disk
Replace the ->free_hdr_to_disk dir ops method with a directly called
xfs_dir2_free_hdr_to_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
5ba30919a6 xfs: devirtualize ->free_hdr_from_disk
Replace the ->free_hdr_from_disk dir ops method with a directly called
xfs_dir_free_hdr_from_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
478c7835cb xfs: move the max dir2 leaf entries count to struct xfs_da_geometry
Move the max leaf entries count towards our structure for dir/attr
geometry parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
545910bcc8 xfs: move the dir2 leaf header size to struct xfs_da_geometry
Move the leaf header size towards our structure for dir/attr geometry
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
787b0893ad xfs: add an entries pointer to struct xfs_dir3_icleaf_hdr
All callers of the ->node_tree_p dir operation already have a struct
xfs_dir3_icleaf_hdr from a previous call to xfs_da_leaf_hdr_from_disk at
hand, or just need slight changes to the calling conventions to do so.
Add a pointer to the entries to struct xfs_dir3_icleaf_hdr to clean up
this pattern.  To make this possible the xfs_dir3_leaf_log_ents function
grows a new argument to pass the xfs_dir3_icleaf_hdr that all callers
already have, and xfs_dir2_leaf_lookup_int returns the
xfs_dir3_icleaf_hdr to the callers so that they can later use it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
163fbbb356 xfs: devirtualize ->leaf_hdr_to_disk
Replace the ->leaf_hdr_to_disk dir ops method with a directly called
xfs_dir_leaf_hdr_to_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
518425560a xfs: devirtualize ->leaf_hdr_from_disk
Replace the ->leaf_hdr_from_disk dir ops method with a directly called
xfs_dir2_leaf_hdr_from_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
3b34441309 xfs: move the node header size to struct xfs_da_geometry
Move the node header size field to struct xfs_da_geometry, and remove
the now unused non-directory dir ops infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
51908ca75f xfs: add a btree entries pointer to struct xfs_da3_icnode_hdr
All but two callers of the ->node_tree_p dir operation already have a
xfs_da3_icnode_hdr from a previous call to xfs_da3_node_hdr_from_disk at
hand.  Add a pointer to the btree entries to struct xfs_da3_icnode_hdr
to clean up this pattern.  The two remaining callers now expand the
whole header as well, but that isn't very expensive and not in a super
hot path anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
e1c8af1e02 xfs: devirtualize ->node_hdr_to_disk
Replace the ->node_hdr_to_disk dir ops method with a directly called
xfs_da_node_hdr_to_disk helper that takes care of the v4 vs v5
difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
f475dc4dc7 xfs: devirtualize ->node_hdr_from_disk
Replace the ->node_hdr_from_disk dir ops method with a directly called
xfs_da_node_hdr_from_disk helper that takes care of the v4 vs v5
difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
649d9d98c6 xfs: refactor btree node scrubbing
Break up xchk_da_btree_entry and handle looking up leaf node entries
in the attr / dir callbacks, so that only the generic node handling
is left in the common core code.  Note that the checks for the crc
enabled blocks are removed, as the scrubbing code already remaps the
magic numbers earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
b16be56187 xfs: use unsigned int for all size values in struct xfs_da_geometry
None of these can ever be negative, so use unsigned types.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
a39f089a25 xfs: move incore structures out of xfs_da_format.h
Move the abstract in-memory version of various btree block headers
out of xfs_da_format.h as they aren't on-disk formats.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Dan Carpenter
7f6bcf7c29 xfs: remove a stray tab in xfs_remount_rw()
The extra tab makes the code slightly confusing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:18 -08:00
Darrick J. Wong
895e196fb6 xfs: convert EIO to EFSCORRUPTED when log contents are invalid
Convert EIO to EFSCORRUPTED in the logging code when we can determine
that the log contents are invalid.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 16:54:18 -08:00
Ritesh Harjani
4d06bfb97e ext4: Add error handling for io_end_vec struct allocation
This patch adds error handling for memory allocation failures of
io_end_vec. This was missing in the original patch series, which
enables dioread_nolock for blocksize < pagesize.

Fixes: c8cc88163f ("ext4: Add support for blocksize < pagesize in dioread_nolock")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191106093809.10673-1-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-10 19:52:03 -05:00
Linus Torvalds
a5871fcba4 configfs regression fix for 5.4-rc
- fix a regression from this merge window in the configfs
    symlink handling (Honggang Li)

Merge tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs

Pull configfs regression fix from Christoph Hellwig:
 "Fix a regression from this merge window in the configfs symlink
  handling (Honggang Li)"

* tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs:
  configfs: calculate the depth of parent item
2019-11-10 12:59:34 -08:00
Linus Torvalds
79a64063a8 small fix for an smb3 reconnect bug

Merge tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "Small fix for an smb3 reconnect bug (also marked for stable)"

* tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  SMB3: Fix persistent handles reconnect
2019-11-10 11:43:18 -08:00
Darrick J. Wong
2fe4f92834 xfs: refactor "does this fork map blocks" predicate
Replace the open-coded checks for whether or not an inode fork maps
blocks with a macro that will implant the code for us.  This helps us
declutter the bmap code a bit.

Note that I had to use a macro instead of a static inline function
because of C header dependency problems between xfs_inode.h and
xfs_inode_fork.h.
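
A predicate macro of this shape might look like the sketch below. The enum, struct, and macro names are hypothetical stand-ins for the XFS ones. A macro only needs the struct definition to be visible where it is expanded, not where it is defined, which is how it sidesteps the header ordering problem mentioned above.

```c
/* Hypothetical fork formats, standing in for the XFS_DINODE_FMT_* values. */
enum sketch_fmt {
	SKETCH_FMT_DEV,
	SKETCH_FMT_LOCAL,
	SKETCH_FMT_EXTENTS,
	SKETCH_FMT_BTREE,
};

#define SKETCH_DATA_FORK 0
#define SKETCH_ATTR_FORK 1

/* Expanded at the call site, so the struct may be defined later. */
#define SKETCH_IFORK_FORMAT(ip, w) \
	((w) == SKETCH_DATA_FORK ? (ip)->data_fmt : (ip)->attr_fmt)

/* True if the fork maps blocks, i.e. is in extents or btree format. */
#define sketch_ifork_has_extents(ip, w) \
	(SKETCH_IFORK_FORMAT((ip), (w)) == SKETCH_FMT_EXTENTS || \
	 SKETCH_IFORK_FORMAT((ip), (w)) == SKETCH_FMT_BTREE)

struct sketch_inode {
	enum sketch_fmt data_fmt;
	enum sketch_fmt attr_fmt;
};
```

The trade-off is the usual one for macros: arguments are evaluated more than once, so callers should pass side-effect-free expressions.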

Conversion was performed with the following Coccinelle script:

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE
+ xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE
+ !xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS
+ xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS
+ !xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- (xfs_ifork_has_extents(ip, w))
+ xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- (!xfs_ifork_has_extents(ip, w))
+ !xfs_ifork_has_extents(ip, w)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:22:51 -08:00
Darrick J. Wong
5113f8ec37 xfs: clean up weird while loop in xfs_alloc_ag_vextent_near
Refactor the weird while loop out of existence.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:21:58 -08:00
Darrick J. Wong
f755979355 xfs: annotate functions that trip static checker locking checks
Add some lock annotations to helper functions that seem to have
unbalanced locking that confuses the static analyzers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:21:58 -08:00
Joe Perches
cf085a1b5d xfs: Correct comment tyops -> typos
Just fix the typos checkpatch notices...

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 10:21:57 -08:00
Al Viro
762c69685f ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
We need to get the underlying dentry of parent; sure, absent the races
it is the parent of underlying dentry, but there's nothing to prevent
losing a timeslice to preemption in the middle of evaluation of
lower_dentry->d_parent->d_inode, having another process move lower_dentry
around and have its (ex)parent not pinned anymore and freed on memory
pressure.  Then we regain CPU and try to fetch ->d_inode from memory
that is freed by that point.

dentry->d_parent *is* stable here - it's an argument of ->lookup() and
we are guaranteed that it won't be moved anywhere until we feed it
to d_add/d_splice_alias.  So we safely go that way to get to its
underlying dentry.

Cc: stable@vger.kernel.org # since 2009 or so
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:45 -05:00
Al Viro
e72b9dd6a5 ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
lower_dentry can't go from positive to negative (we have it pinned),
but it *can* go from negative to positive.  So fetching ->d_inode
into a local variable, doing a blocking allocation, checking that
now ->d_inode is non-NULL and feeding the value we'd fetched
earlier to a function that won't accept NULL is not a good idea.
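
The hazard can be simulated in a single thread, with a helper standing in for the blocking allocation during which another task makes the dentry positive. Every name below is hypothetical; real kernel code would typically read the field once with READ_ONCE() after the blocking step.

```c
#include <stddef.h>

struct sketch_inode {
	int ino;
};

struct sketch_dentry {
	struct sketch_inode *inode;	/* NULL while the dentry is negative */
};

static struct sketch_inode sketch_new_inode = { 42 };

/* Stands in for a blocking allocation; the dentry turns positive meanwhile. */
static void sketch_blocking_alloc(struct sketch_dentry *d)
{
	d->inode = &sketch_new_inode;
}

/* Racy pattern: fetch early, block, test the live field, use the stale copy. */
static struct sketch_inode *sketch_lookup_racy(struct sketch_dentry *d)
{
	struct sketch_inode *early = d->inode;

	sketch_blocking_alloc(d);
	if (!d->inode)
		return NULL;
	return early;	/* may be NULL even though the check passed */
}

/* Safer pattern: fetch once, after blocking, and test that same value. */
static struct sketch_inode *sketch_lookup_fixed(struct sketch_dentry *d)
{
	struct sketch_inode *inode;

	sketch_blocking_alloc(d);
	inode = d->inode;
	if (!inode)
		return NULL;
	return inode;
}
```

In the racy variant the value that is checked and the value that is used come from two different reads, which is exactly what the message warns against.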

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:44 -05:00
Al Viro
bcf0d9d4b7 ecryptfs: fix unlink and rmdir in face of underlying fs modifications
A problem similar to the one caught in commit 74dd7c97ea ("ecryptfs_rename():
verify that lower dentries are still OK after lock_rename()") exists for
unlink/rmdir as well.

Instead of playing with dget_parent() of underlying dentry of victim
and hoping it's the same as underlying dentry of our directory,
do the following:
        * find the underlying dentry of victim
        * find the underlying directory of victim's parent (stable
since the victim is ecryptfs dentry and inode of its parent is
held exclusive by the caller).
        * lock the inode of dentry underlying the victim's parent
        * check that underlying dentry of victim is still hashed and
has the right parent - it can be moved, but it can't be moved to/from
the directory we are holding exclusive.  So while ->d_parent itself
might not be stable, the result of comparison is.

If the check passes, everything is fine - underlying directory is locked,
underlying victim is still a child of that directory and we can go ahead
and feed them to vfs_unlink().  As in the current mainline we need to
pin the underlying dentry of victim, so that it wouldn't go negative under
us, but that's the only temporary reference that needs to be grabbed there.
Underlying dentry of parent won't go away (it's pinned by the parent,
which is held by caller), so there's no need to grab it.

The same problem (with the same solution) exists for rmdir.  Moreover,
rename gets simpler and more robust with the same "don't bother with
dget_parent()" approach.

Fixes: 74dd7c97ea "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:44 -05:00
Al Viro
a2ece08888 exportfs_decode_fh(): negative pinned may become positive without the parent locked
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:56:05 -05:00
Linus Torvalds
00aff68362 for-5.4-rc6-tag

Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few regressions and fixes for stable.

  Regressions:

   - fix a race leading to metadata space leak after task received a
     signal

   - un-deprecate 2 ioctls, marked as deprecated by mistake

  Fixes:

   - fix limit check for number of devices during chunk allocation

   - fix a race due to double evaluation of i_size_read inside max()
     macro, can cause a crash

   - remove wrong device id check in tree-checker"

* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
  btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
  Btrfs: fix race leading to metadata space leak after task received signal
  btrfs: tree-checker: Fix wrong check on max devid
  btrfs: Consider system chunk array size for new SYSTEM chunks
2019-11-09 08:51:37 -08:00
Linus Torvalds
5cb8418cb5 for-linus-2019-11-08

Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Two NVMe device removal crash fixes, and a compat fixup for an
   ioctl that was introduced in this release (Anton, Charles, Max - via
   Keith)

 - Missing error path mutex unlock for drbd (Dan)

 - cgroup writeback fixup on dead memcg (Tejun)

 - blkcg online stats print fix (Tejun)

* tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
  cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
  block: drbd: remove a stray unlock in __drbd_send_protocol()
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
2019-11-08 18:15:55 -08:00
Tejun Heo
65de03e251 cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead.  This is to avoid continuing to issue IOs on the stale
wb after memcg - blkcg association has changed (i.e. when blkcg got
disabled / enabled higher up in the hierarchy).

Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups.  When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.

c3aab9a0bd ("mm/filemap.c: don't initiate writeback if mapping
has no dirty pages") alleviated the issue significantly as it now only
triggers when the inode has dirty pages.  However, the condition can
still be triggered before the inode is switched to a different cgroup
and the logic simply doesn't make sense.

Skip the immediate switching if the associated memcg is dying.

This is a simplified version of the following two patches:

 * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
 * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz

Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: e8a7abf5a5 ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 13:37:24 -07:00
Linus Torvalds
0689acfad3 Some late-breaking dentry handling fixes from Al and Jeff, a patch to
further restrict copy_file_range() to avoid potential data corruption
 from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.  Everything but
 the fscache fix is marked for stable.

Merge tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Some late-breaking dentry handling fixes from Al and Jeff, a patch to
  further restrict copy_file_range() to avoid potential data corruption
  from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.

  Everything but the fscache fix is marked for stable"

* tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client:
  ceph: return -EINVAL if given fsc mount option on kernel w/o support
  ceph: don't allow copy_file_range when stripe_count != 1
  ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open
  ceph: add missing check in d_revalidate snapdir handling
  ceph: fix RCU case handling in ceph_d_revalidate()
  ceph: fix use-after-free in __ceph_remove_cap()
2019-11-08 12:31:27 -08:00
Darrick J. Wong
2b91b28e29 iomap: iomap_bmap should check iomap_apply return value
Coverity caught this fairly minor bug, but we should check the return
value of iomap_apply regardless.

Coverity-id: 1437065
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-08 07:44:12 -08:00
Darrick J. Wong
d6abecb825 xfs: range check ri_cnt when recovering log items
Range check the region counter when we're reassembling regions from log
items during log recovery.  In the old days ASSERT would halt the
kernel, but this isn't true any more so we have to make an explicit
error return.

Coverity-id: 1132508
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:54 -08:00
Darrick J. Wong
120254608f xfs: "optimize" buffer item log segment bitmap setting
Optimize the setting of full words of bits in xfs_buf_item_log_segment.
The optimization is purely within the bug triage process.  No functional
changes.

Coverity-id: 1446793
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:54 -08:00
Darrick J. Wong
f5be08446e xfs: null out bma->prev if no previous extent
Coverity complains that we don't check the return value of
xfs_iext_peek_prev_extent like we do nearly all of the time.  If there
is no previous extent then just null out bma->prev like we do elsewhere
in the bmap code.

Coverity-id: 1424057
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:54 -08:00
Darrick J. Wong
5f213ddbcb xfs: fix missing header includes
Some of the xfs source files are missing header includes, so add them
back.  Sparse complains about non-static functions that don't have a
forward declaration anywhere.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:53 -08:00
Darrick J. Wong
5d1116d4c6 xfs: periodically yield scrub threads to the scheduler
Christoph Hellwig complained about the following soft lockup warning
when running scrub after generic/175 when preemption is disabled and
slub debugging is enabled:

watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
Modules linked in:
irq event stamp: 41692326
hardirqs last  enabled at (41692325): [<ffffffff8232c3b7>] _raw_0
hardirqs last disabled at (41692326): [<ffffffff81001c5a>] trace0
softirqs last  enabled at (41684994): [<ffffffff8260031f>] __do_e
softirqs last disabled at (41684987): [<ffffffff81127d8c>] irq_e0
CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
FS:  00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
Call Trace:
 free_debug_processing+0x1dd/0x240
 __slab_free+0x231/0x410
 kmem_cache_free+0x30e/0x360
 xchk_ag_btcur_free+0x76/0xb0
 xchk_ag_free+0x10/0x80
 xchk_bmap_iextent_xref.isra.14+0xd9/0x120
 xchk_bmap_iextent+0x187/0x210
 xchk_bmap+0x2e0/0x3b0
 xfs_scrub_metadata+0x2e7/0x500
 xfs_ioc_scrub_metadata+0x4a/0xa0
 xfs_file_ioctl+0x58a/0xcd0
 do_vfs_ioctl+0xa0/0x6f0
 ksys_ioctl+0x5b/0x90
 __x64_sys_ioctl+0x11/0x20
 do_syscall_64+0x4b/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

If preemption is disabled, all metadata buffers needed to perform the
scrub are already in memory, and there are a lot of records to check,
it's possible that the scrub thread will run for an extended period of
time without sleeping for IO or any other reason.  Then the watchdog
timer or the RCU stall timeout can trigger, producing the backtrace
above.

To fix this problem, call cond_resched() from the scrub thread so that
we back out to the scheduler whenever necessary.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:53 -08:00
Jens Axboe
912c0a8591 Merge branch 'for-linus' into for-5.5/block
Pull on for-linus to resolve what otherwise would have been a conflict
with the cgroups rstat patchset from Tejun.

* for-linus: (942 commits)
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  io_uring: ensure we clear io_kiocb->result before each issue
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  io_uring: don't touch ctx in setup after ring fd install
  io_uring: Fix leaked shadow_req
  Linux 5.4-rc5
  riscv: cleanup do_trap_break
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
  ...
2019-11-07 12:27:19 -07:00
Chao Yu
1f0d5c911b f2fs: fix potential overflow
We expect a 64-bit result from the statement below. However, on a
32-bit machine, the left shift on a pgoff_t variable may overflow.
Fix it by forcing a type cast.

page->index << PAGE_SHIFT;
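
A minimal sketch of the overflow, written in Python rather than kernel C.
It models a 32-bit pgoff_t by masking to 32 bits and assumes PAGE_SHIFT
is 12 (4 KiB pages). The helper names are illustrative only and are not
part of the patch.

```python
PAGE_SHIFT = 12          # assumption: 4 KiB pages
MASK32 = (1 << 32) - 1   # models a 32-bit unsigned pgoff_t
MASK64 = (1 << 64) - 1   # models a 64-bit loff_t


def shift_as_pgoff(index):
    """Shift done in 32-bit arithmetic: bits above bit 31 are lost."""
    return (index << PAGE_SHIFT) & MASK32


def shift_as_loff(index):
    """Operand widened to 64 bits before the shift, as a cast would do."""
    return (index << PAGE_SHIFT) & MASK64


index = 1 << 20  # page index whose byte offset is exactly 4 GiB
print(hex(shift_as_pgoff(index)))  # truncated result
print(hex(shift_as_loff(index)))   # full 64-bit result
```

For small page indexes both helpers agree; they diverge once the byte
offset no longer fits in 32 bits.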

Fixes: 26de9b1171 ("f2fs: avoid unnecessary updating inode during fsync")
Fixes: 0a2aa8fbb9 ("f2fs: refactor __exchange_data_block for speed up")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 11:17:39 -08:00
Chao Yu
2a60637f06 f2fs: fix to update dir's i_pino during cross_rename
As Eric reported:

RENAME_EXCHANGE support was just added to fsstress in xfstests:

	commit 65dfd40a97b6bbbd2a22538977bab355c5bc0f06
	Author: kaixuxia <xiakaixu1987@gmail.com>
	Date:   Thu Oct 31 14:41:48 2019 +0800

	    fsstress: add EXCHANGE renameat2 support

This is causing xfstest generic/579 to fail due to fsck.f2fs reporting errors.
I'm not sure what the problem is, but it still happens even with all the
fs-verity stuff in the test commented out, so that the test just runs fsstress.

generic/579 23s ... 	[10:02:25]
[    7.745370] run fstests generic/579 at 2019-11-04 10:02:25
_check_generic_filesystem: filesystem on /dev/vdc is inconsistent
(see /results/f2fs/results-default/generic/579.full for details)
 [10:02:47]
Ran: generic/579
Failures: generic/579
Failed 1 of 1 tests
Xunit report: /results/f2fs/results-default/result.xml

Here's the contents of 579.full:

_check_generic_filesystem: filesystem on /dev/vdc is inconsistent
*** fsck.f2fs output ***
[ASSERT] (__chk_dots_dentries:1378)  --> Bad inode number[0x24] for '..', parent parent ino is [0xd10]

The root cause is that we forgot to update the directory's i_pino during
cross_rename. Fix it.

Fixes: 32f9bc25cb ("f2fs: support ->rename2()")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 11:15:39 -08:00
Jens Axboe
5f8fd2d3e0 io_uring: properly mark async work as bounded vs unbounded
Now that io-wq supports separating the two request lifetime types, mark
the following IO as having unbounded runtimes:

- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 11:57:17 -07:00
Jens Axboe
c5def4ab84 io-wq: add support for bounded vs unbounded work
io_uring supports request types that basically have two different
lifetimes:

1) Bounded completion time. These are requests like disk reads or writes,
   which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
   have no idea how long they will take to complete. Another example is
   POLL commands.

This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 11:41:35 -07:00
Jens Axboe
91d666ea43 io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful()
We hold the wqe lock at this point (which is also annotated), so there's
no need to use the careful variant of list_empty().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09 11:45:32 -07:00
Jens Axboe
1d7bb1d50f io_uring: add support for backlogged CQ ring
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.

After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.

Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.
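
The behaviour described above can be sketched as a toy model in Python.
This is not the kernel implementation: the class and method names are
made up for illustration, and requests are assumed to complete
immediately on submission to keep the example short.

```python
import errno
from collections import deque


class ToyRing:
    """Toy completion ring with an overflow backlog (illustrative only)."""

    def __init__(self, cq_entries):
        self.cq_entries = cq_entries
        self.cq = deque()        # completions visible to the application
        self.backlog = deque()   # completions that did not fit in the ring

    def _flush_backlog(self):
        # Move backlogged completions into the ring while there is room.
        while self.backlog and len(self.cq) < self.cq_entries:
            self.cq.append(self.backlog.popleft())

    def complete(self, user_data):
        # Never drop an event: park it in the backlog if the ring is full.
        self._flush_backlog()
        if not self.backlog and len(self.cq) < self.cq_entries:
            self.cq.append(user_data)
        else:
            self.backlog.append(user_data)

    def submit(self, user_data):
        # Back pressure: refuse new work while the backlog is non-empty,
        # after first moving whatever backlogged events fit into the ring.
        if self.backlog:
            self._flush_backlog()
            return -errno.EBUSY
        # Toy simplification: the request completes right away.
        self.complete(user_data)
        return 0

    def reap(self):
        # The application consumes one completion, or None if the ring is empty.
        return self.cq.popleft() if self.cq else None
```

With a two-entry ring, a third completion lands in the backlog and the
next submit() returns -EBUSY until the application has reaped events
and the backlog has drained.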

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09 11:45:29 -07:00
Jens Axboe
78e19bbef3 io_uring: pass in io_kiocb to fill/add CQ handlers
This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straightforward, the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 06:57:27 -07:00
Jens Axboe
84f97dc233 io_uring: make io_cqring_events() take 'ctx' as argument
The rings can be derived from the ctx, and we need the ctx there for
a future change.

No functional changes in this patch.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 06:57:21 -07:00
Jens Axboe
2665abfd75 io_uring: add support for linked SQE timeouts
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.

This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specified can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 19:12:40 -07:00
Jens Axboe
e977d6d34f io_uring: abstract out io_async_cancel_one() helper
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 12:31:31 -07:00
Jaegeuk Kim
f5a53edcf0 f2fs: support aligned pinned file
This patch supports 2MB-aligned pinned files, which can guarantee no GC at all
by allocating fully valid 2MB segments.

Check free segments by has_not_enough_free_secs() with large budget.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 10:40:59 -08:00
Jaegeuk Kim
bc005a4d53 f2fs: avoid kernel panic on corruption test
xfstests/generic/475 complains about a kernel warning/panic while testing a corrupted disk.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 10:40:59 -08:00
Jeff Layton
ff29fde84d ceph: return -EINVAL if given fsc mount option on kernel w/o support
If someone requests fscache on the mount, and the kernel doesn't
support it, it should fail the mount.

[ Drop ceph prefix -- it's provided by pr_err. ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-07 18:03:23 +01:00
Andreas Gruenbacher
add66fcbd3 iomap: Fix overflow in iomap_page_mkwrite
On architectures where loff_t is wider than pgoff_t, the expression
((page->index + 1) << PAGE_SHIFT) can overflow.  Rewrite to use the page
offset, which we already compute here anyway.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-07 07:28:18 -08:00
Ajay Joshi
6c1b1da58f block: add zone open, close and finish operations
Zoned block devices (ZBC and ZAC devices) allow an explicit control
over the condition (state) of zones. The operations allowed are:
* Open a zone: Transition to open condition to indicate that a zone will
  actively be written
* Close a zone: Transition to closed condition to release the drive
  resources used for writing to a zone
* Finish a zone: Transition an open or closed zone to the full
  condition to prevent write operations

To enable this control for in-kernel zoned block device users, define
the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
and REQ_OP_ZONE_FINISH as well as the generic function
blkdev_zone_mgmt() for submitting these operations on a range of zones.
This results in blkdev_reset_zones() removal and replacement with this
new zone management function. Users of blkdev_reset_zones() (f2fs and
dm-zoned) are updated accordingly.

Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 06:31:48 -07:00
Chengguang Xu
355b9aae86 ext2: fix improper function comment
Just fix an improper function comment.

Link: https://lore.kernel.org/r/20191104114036.9893-5-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:27 +01:00
Chengguang Xu
44dd616133 ext2: code cleanup for ext2_try_to_allocate()
Code cleanup by removing duplicated code.

Link: https://lore.kernel.org/r/20191104114036.9893-4-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:27 +01:00
Chengguang Xu
ac3820f822 ext2: skip unnecessary operations in ext2_try_to_allocate()
Move 'repeat' tag to proper place so that we can
skip unnecessary operations in ext2_try_to_allocate().

Link: https://lore.kernel.org/r/20191104114036.9893-3-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:27 +01:00
Jan Kara
cf4eb321b3 ext2: Simplify initialization in ext2_try_to_allocate()
Somewhat simplify the logic initializing search start and end in
ext2_try_to_allocate(). No functional change.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:22 +01:00
Pavel Shilovsky
d243af7ab9 SMB3: Fix persistent handles reconnect
When the client hits a network reconnect, it re-opens every open
file with a create context to reconnect a persistent handle. All
create context types should be 8-byte aligned but the padding
was missed for that one. As a result, some servers don't allow
us to reconnect handles and return an error. The problem occurs
when the problematic context is not at the end of the create
request packet. Fix this by adding proper padding at the end
of the reconnect persistent handle context.
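
The alignment rule can be sketched in Python as follows. This is only an
illustration of 8-byte padding, not the SMB3 client code: the function
names and the context sizes used with it are hypothetical, and the
packet is assumed to be 8-byte aligned before the context is appended.

```python
def pad_to_8(length):
    """Return how many padding bytes bring `length` up to a multiple of 8."""
    return -length % 8


def append_context(packet, context):
    """Append a context plus zero padding so the next one starts aligned.

    Assumes `packet` already ends on an 8-byte boundary.
    """
    return packet + context + b"\x00" * pad_to_8(len(context))
```

Without the trailing padding, any context placed after an unpadded one
would start at a misaligned offset, which matches the case described
above where the problematic context is not the last one in the request.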

Cc: Stable <stable@vger.kernel.org> # 4.19.x
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-06 21:32:18 -06:00
Pavel Begunkov
267bc90442 io_uring: use inlined struct sqe_submit
req->submit is always up-to-date, use it directly

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:23:02 -07:00
Pavel Begunkov
50585b9a07 io_uring: Use submit info inlined into req
Stack allocated struct sqe_submit is passed down to the submission path
along with a request (a.k.a. struct io_kiocb), and will be copied into
req->submit for async requests.

As space for it is already allocated, fill req->submit in the first
place instead of using on-stack one. As a result:

1. req->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. don't need to copy in case of async
3. allows simplifying the code by not carrying it as an argument all
the way down
4. allows reducing the number of function arguments / potentially
improves spilling

The downside is that the stack is most probably cached, which is not true
for memory just allocated for a request. Another concern is cache
pollution. Though, a request would be touched and fetched along with
req->submit at some point anyway, so shouldn't be a problem.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:23:00 -07:00
Pavel Begunkov
196be95cd5 io_uring: allocate io_kiocb upfront
Let io_submit_sqes() allocate io_kiocb before fetching an sqe.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:21:37 -07:00
Eric Biggers
0eee17e332 f2fs: add support for IV_INO_LBLK_64 encryption policies
f2fs inode numbers are stable across filesystem resizing, and f2fs inode
and file logical block numbers are always 32-bit.  So f2fs can always
support IV_INO_LBLK_64 encryption policies.  Wire up the needed
fscrypt_operations to declare support.

Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:42 -08:00
Eric Biggers
b925acb8f8 ext4: add support for IV_INO_LBLK_64 encryption policies
IV_INO_LBLK_64 encryption policies have special requirements from the
filesystem beyond those of the existing encryption policies:

- Inode numbers must never change, even if the filesystem is resized.
- Inode numbers must be <= 32 bits.
- File logical block numbers must be <= 32 bits.

ext4 has 32-bit inode and file logical block numbers.  However,
resize2fs can re-number inodes when shrinking an ext4 filesystem.

However, typically the people who would want to use this format don't
care about filesystem shrinking.  They'd be fine with a solution that
just prevents the filesystem from being shrunk.

Therefore, add a new feature flag EXT4_FEATURE_COMPAT_STABLE_INODES that
will do exactly that.  Then wire up the fscrypt_operations to expose
this flag to fs/crypto/, so that it allows IV_INO_LBLK_64 policies when
this flag is set.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:42 -08:00
Eric Biggers
b103fb7653 fscrypt: add support for IV_INO_LBLK_64 policies
Inline encryption hardware compliant with the UFS v2.1 standard or with
the upcoming version of the eMMC standard has the following properties:

(1) Per I/O request, the encryption key is specified by a previously
    loaded keyslot.  There might be only a small number of keyslots.

(2) Per I/O request, the starting IV is specified by a 64-bit "data unit
    number" (DUN).  IV bits 64-127 are assumed to be 0.  The hardware
    automatically increments the DUN for each "data unit" of
    configurable size in the request, e.g. for each filesystem block.

Property (1) makes it inefficient to use the traditional fscrypt
per-file keys.  Property (2) precludes the use of the existing
DIRECT_KEY fscrypt policy flag, which needs at least 192 IV bits.

Therefore, add a new fscrypt policy flag IV_INO_LBLK_64 which causes the
encryption to be modified as follows:

- The encryption keys are derived from the master key, encryption mode
  number, and filesystem UUID.

- The IVs are chosen as (inode_number << 32) | file_logical_block_num.
  For filenames encryption, file_logical_block_num is 0.

Since the file nonces aren't used in the key derivation, many files may
share the same encryption key.  This is much more efficient on the
target hardware.  Including the inode number in the IVs and mixing the
filesystem UUID into the keys ensures that data in different files is
nevertheless still encrypted differently.

Additionally, limiting the inode and block numbers to 32 bits and
placing the block number in the low bits maintains compatibility with
the 64-bit DUN convention (property (2) above).
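
As a rough illustration of the IV layout described above, here is a
Python sketch. The function name is made up, and the byte order
(little-endian, low 64 bits first) is an assumption of this example,
not something stated in this message.

```python
import struct


def iv_ino_lblk_64(inode_number, file_logical_block_num):
    """Build a 16-byte IV: (ino << 32) | lblk in 64 bits, then 64 zero bits."""
    if not 0 <= inode_number < 1 << 32:
        raise ValueError("inode number must fit in 32 bits")
    if not 0 <= file_logical_block_num < 1 << 32:
        raise ValueError("file logical block number must fit in 32 bits")
    dun = (inode_number << 32) | file_logical_block_num
    # Assumption: little-endian packing; IV bits 64-127 are left as zero.
    return struct.pack("<QQ", dun, 0)
```

Because the block number sits in the low bits, consecutive blocks of one
file yield consecutive 64-bit values, which is what lets hardware that
auto-increments the DUN follow along within a request.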

Since this scheme assumes that inode numbers are stable (which may
preclude filesystem shrinking) and that inode and file logical block
numbers are at most 32-bit, IV_INO_LBLK_64 will only be allowed on
filesystems that meet these constraints.  These are acceptable
limitations for the cases where this format would actually be used.

Note that IV_INO_LBLK_64 is an on-disk format, not an implementation.
This patch just adds support for it using the existing filesystem layer
encryption.  A later patch will add support for inline encryption.

Reviewed-by: Paul Crowley <paulcrowley@google.com>
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:36 -08:00
Eric Biggers
ff73c2c016 fscrypt: avoid data race on fscrypt_mode::logged_impl_name
The access to logged_impl_name is technically a data race, which tools
like KCSAN could complain about in the future.  See:
https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

Fix by using xchg(), which also ensures that only one thread does the
logging.
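
The log-once idea can be sketched in Python. Python has no xchg(), so
this example emulates an atomic exchange with a lock; the AtomicFlag
class and log_impl_once() are hypothetical names used only for this
illustration.

```python
import threading


class AtomicFlag:
    """Emulates an atomic exchange on an int by using a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def xchg(self, new):
        # Store `new` and return the previous value as one indivisible step.
        with self._lock:
            old, self._value = self._value, new
            return old


logged_impl_name = AtomicFlag()
messages = []


def log_impl_once(name):
    # Only the caller that observes the old value 0 does the logging.
    if logged_impl_name.xchg(1) == 0:
        messages.append(name)


threads = [
    threading.Thread(target=log_impl_once, args=("impl-%d" % i,))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(messages))  # exactly one thread logged
```

A plain "check the flag, then set it" sequence would leave a window in
which two threads both see the flag unset; the exchange closes that
window because the read and the write happen together.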

This also required switching from bool to int, to avoid a build error on
the RISC-V architecture which doesn't implement xchg on bytes.
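
As a rough user-space analogue of the pattern (C11 atomic_exchange()
standing in for the kernel's xchg(); the helper name is hypothetical),
exactly one caller sees the old value 0 and does the logging:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* An int rather than a bool, mirroring the note about byte-sized xchg. */
static atomic_int logged_impl_name;

/*
 * Hypothetical helper: returns true for exactly one caller, even when
 * several threads race, because the exchange is a single atomic step.
 */
static bool should_log_impl_name(void)
{
	/* atomic_exchange() stores 1 and returns the previous value. */
	return atomic_exchange(&logged_impl_name, 1) == 0;
}
```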

Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:33:15 -08:00
Pavel Begunkov
e5eb6366ac io_uring: io_queue_link*() right after submit
After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simpler and doesn't keep
an extra variable across the loop.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 11:20:11 -07:00
Pavel Begunkov
ae9428ca61 io_uring: Merge io_submit_sqes and io_ring_submit
io_submit_sqes() and io_ring_submit() are doing the same stuff with
a little difference. Deduplicate them.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 11:20:07 -07:00
Honggang Li
e2f238f7d5 configfs: calculate the depth of parent item
When creating a symbolic link, create_link should calculate the depth
of the parent item. However, both the first and second parameters
of configfs_get_target_path had been set to the target, so a broken
symbolic link was created.

$ targetcli ls /
o- / ............................................................. [...]
  o- backstores .................................................. [...]
  | o- block ...................................... [Storage Objects: 0]
  | o- fileio ..................................... [Storage Objects: 2]
  | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated]
  | | | o- alua ....................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated]
  | |   o- alua ....................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | o- pscsi ...................................... [Storage Objects: 0]
  | o- ramdisk .................................... [Storage Objects: 0]
  o- iscsi ................................................ [Targets: 0]
  o- loopback ............................................. [Targets: 0]
  o- srpt ................................................. [Targets: 2]
  | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls]
  | | o- acls ................................................ [ACLs: 2]
  | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2]
  | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2]
  | | |   o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | |   o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | o- luns ................................................ [LUNs: 2]
  | |   o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)]
  | |   o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)]
  | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls]
  |   o- acls ................................................ [ACLs: 0]
  |   o- luns ................................................ [LUNs: 0]
  o- vhost ................................................ [Targets: 0]

Fixes: e9c03af21c ("configfs: calculate the symlink target only once")
Signed-off-by: Honggang Li <honli@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-06 18:36:01 +01:00
Shuning Zhang
e74540b285 ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
When the extent tree is modified, it should be protected by inode
cluster lock and ip_alloc_sem.

The extent tree is accessed and modified in the
ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.

The following is a case.  The function ocfs2_fiemap is accessing the
extent tree, which is modified at the same time.

  kernel BUG at fs/ocfs2/extent_map.c:475!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
  CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
  Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
  task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
  RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  Call Trace:
    ocfs2_fiemap+0x1e3/0x430 [ocfs2]
    do_vfs_ioctl+0x155/0x510
    SyS_ioctl+0x81/0xa0
    system_call_fastpath+0x18/0xd8
  Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
  RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  ---[ end trace c8aa0c8180e869dc ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

This issue can be reproduced every week in a production environment.

This issue is related to the usage mode.  If others use ocfs2 in this
mode, the kernel will panic frequently.

[akpm@linux-foundation.org: coding style fixes]
[Fix new warning due to unused function by removing said function - Linus ]
Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-06 08:47:08 -08:00
Colin Ian King
0279c71fe0 xfs: remove redundant assignment to variable error
Variable error is being initialized with a value that is never read
and is being re-assigned a couple of statements later on. The
assignment is redundant and hence can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-06 08:07:46 -08:00
Darrick J. Wong
8ef34723ef xfs: add missing early termination checks to record scrubbing functions
Scrubbing directories, quotas, and fs counters all involve iterating
some collection of metadata items.  The per-item scrub functions for
these three are missing some of the components they need to be able to
check for a fatal signal and terminate early.

Per-item scrub functions need to call xchk_should_terminate to look for
fatal signals, and they need to check the scrub context's corruption
flag because there's no point in continuing a scan once we've decided
the data structure is bad.  Add both of these where missing.
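
A simplified user-space sketch of the pattern follows. The struct and
function names are hypothetical stand-ins, not the XFS scrub API, and the
"negative record is corrupt" rule exists only to make the example
concrete.

```c
#include <stdbool.h>
#include <stddef.h>

struct scrub_ctx {
	bool fatal_signal;	/* stand-in for a pending fatal signal */
	bool corrupt;		/* set once the structure is known to be bad */
};

/* Hypothetical per-item check: a negative record marks corruption. */
static void scrub_item(struct scrub_ctx *sc, int item)
{
	if (item < 0)
		sc->corrupt = true;
}

/* Returns how many items were examined before the scan stopped. */
static size_t scrub_items(struct scrub_ctx *sc, const int *items, size_t n)
{
	size_t examined = 0;

	for (size_t i = 0; i < n; i++) {
		/* Stop early when asked to terminate. */
		if (sc->fatal_signal)
			break;
		scrub_item(sc, items[i]);
		examined++;
		/* No point continuing once corruption has been found. */
		if (sc->corrupt)
			break;
	}
	return examined;
}
```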

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-06 08:07:41 -08:00
Chengguang Xu
90f3741c2b ext2: code cleanup by calling ext2_group_last_block_no()
Call common helper ext2_group_last_block_no() to
calculate group last block number.

Link: https://lore.kernel.org/r/20191104114036.9893-2-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-06 16:25:15 +01:00
Chengguang Xu
eb9e47fc23 ext2: introduce new helper ext2_group_last_block_no()
Introduce new helper ext2_group_last_block_no() to calculate the
last block number of a specific block group, so that we can replace
open coded logic by calling this common helper.
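
For illustration, the calculation can be sketched in user-space C as
below. The struct fs_geometry type and its fields are hypothetical
simplifications of the superblock data, and the special case for the
final group is an assumption about how a short last group is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the superblock fields involved. */
struct fs_geometry {
	uint32_t first_data_block;	/* block number where group 0 starts */
	uint32_t blocks_per_group;
	uint32_t blocks_count;		/* total blocks in the filesystem */
	uint32_t groups_count;
};

static uint32_t group_first_block_no(const struct fs_geometry *g,
				     uint32_t group)
{
	return g->first_data_block + group * g->blocks_per_group;
}

static uint32_t group_last_block_no(const struct fs_geometry *g,
				    uint32_t group)
{
	assert(group < g->groups_count);

	/* The final group may hold fewer than blocks_per_group blocks. */
	if (group == g->groups_count - 1)
		return g->blocks_count - 1;

	return group_first_block_no(g, group) + g->blocks_per_group - 1;
}
```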

Link: https://lore.kernel.org/r/20191104114036.9893-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-06 16:25:12 +01:00
Jan Kara
dae82c7fd0 Pull series refactoring quota enabling and disabling code. 2019-11-06 10:52:10 +01:00
Jens Axboe
3aa5fa0305 io_uring: kill dead REQ_F_LINK_DONE flag
We had no more use for this flag after the conversion to io-wq, kill it
off.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 20:34:32 -07:00
Jens Axboe
f1f40853c0 io_uring: fixup a few spots where link failure isn't flagged
If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if
REQ_F_LINK is set. Any failure in the chain should break the chain.
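
The rule can be sketched as below. The flag values and the struct are
hypothetical, and treating a negative return value as failure is an
assumption made only for this example, since what counts as failure
depends on the request type.

```c
#include <stdint.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define REQ_F_LINK	(1U << 0)
#define REQ_F_FAIL_LINK	(1U << 1)

struct req {
	uint32_t flags;
};

/* Mark the chain as failed when a linked request completes with an error. */
static void req_note_result(struct req *req, int ret)
{
	if (ret < 0 && (req->flags & REQ_F_LINK))
		req->flags |= REQ_F_FAIL_LINK;
}
```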

We were missing a few spots where this should be done. It might be nice
to generalize this somewhat at some point, as long as we factor in the
fact that failure looks different for each request type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 20:33:16 -07:00
Jens Axboe
89723d0bd6 io_uring: enable optimized link handling for IORING_OP_POLL_ADD
As introduced by commit:

ba816ad61f ("io_uring: run dependent links inline if possible")

enable inline dependent link running for poll commands.
io_poll_complete_work() is the most important change, as it allows a
linked sequence of { POLL, READ } (for example) to proceed inline
instead of needing to get punted to another async context. The
submission side only potentially matters for sqthread, but may as well
include that bit.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 15:32:58 -07:00
Theodore Ts'o
8d0d47ea16 Merge branch 'mb/dio' into master 2019-11-05 16:21:09 -05:00
Theodore Ts'o
a6d4040846 Merge branch 'jk/jbd2-revoke-overflow' 2019-11-05 16:02:20 -05:00
Jan Kara
19014d6971 jbd2: Fine tune estimate of necessary descriptor blocks
Currently we reserve j_max_transaction_buffers / 32 for transaction
descriptor blocks. Now that revoke descriptors are accounted for
separately this estimate is unnecessarily high and we can actually
compute much tighter estimate. In the common case of 32k journal blocks
and 4k blocksize this actually reduces the amount of reserved descriptor
blocks from 256 to ~25 which allows us to fit more real data into a
transaction.
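
To make the arithmetic concrete: the figure of 256 corresponds to 8192
transaction buffers (8192 / 32). A tighter estimate divides the buffers
by the number of tags that fit into one descriptor block. The sketch
below is not the jbd2 code; the function names are hypothetical, and the
per-block overhead and tag size are parameters the caller has to supply.

```c
#include <assert.h>

/* Old rule from the text above: one descriptor block per 32 buffers. */
static unsigned int old_descriptor_estimate(unsigned int max_buffers)
{
	return max_buffers / 32;
}

/*
 * Hypothetical tighter estimate: how many descriptor blocks are needed
 * to hold one tag per buffer, given the usable space in each block.
 */
static unsigned int tight_descriptor_estimate(unsigned int max_buffers,
					      unsigned int block_size,
					      unsigned int overhead_bytes,
					      unsigned int tag_bytes)
{
	unsigned int tags_per_block;

	assert(tag_bytes > 0 && block_size > overhead_bytes);
	tags_per_block = (block_size - overhead_bytes) / tag_bytes;
	assert(tags_per_block > 0);

	/* Round up: a partially filled descriptor block still costs a block. */
	return (max_buffers + tags_per_block - 1) / tags_per_block;
}
```

With assumed values of 48 bytes of per-block overhead and 12-byte tags
(illustrative numbers, not taken from this commit), a 4k block holds 337
tags, so 8192 buffers need 25 descriptor blocks.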

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-25-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:49 -05:00
Jan Kara
0094f981bb jbd2: Provide trace event for handle restarts
Provide trace event for handle restarts to ease debugging.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-24-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:49 -05:00
Jan Kara
83448bdfb5 ext4: Reserve revoke credits for freed blocks
So far we have reserved only relatively high fixed amount of revoke
credits for each transaction. We over-reserved by large amount for most
cases but when freeing large directories or files with data journalling,
the fixed amount is not enough. In fact the worst case estimate is
inconveniently large (maximum extent size) for freeing of one extent.

We fix this by doing proper estimate of the amount of blocks that need
to be revoked when removing blocks from the inode due to truncate or
hole punching and otherwise reserve just a small amount of revoke
credits for each transaction to accommodate freeing of xattrs block or
so.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:49 -05:00
Jan Kara
d090707eda jbd2: Make credit checking more strict
Make checking of available credits in jbd2_journal_dirty_metadata() more
strict. There should be always enough credits in the handle to write all
potential revoke descriptors. Also we warn in case there are not enough
credits since this is a bug in the filesystem.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-22-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:49 -05:00
Jan Kara
933f1c1e0b jbd2: Rename h_buffer_credits to h_total_credits
The credit counter now contains both buffer and revoke descriptor block
credits. Rename the counter to h_total_credits to reflect that. No
functional change.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-21-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
fdc3ef882a jbd2: Reserve space for revoke descriptor blocks
Extend functions for starting, extending, and restarting transaction
handles to take number of revoke records handle must be able to
accommodate. These functions then make sure transaction has enough
credits to be able to store resulting revoke descriptor blocks. Also
revoke code tracks number of revoke records created by a handle to catch
situation where some place didn't reserve enough space for revoke
records. Similarly to standard transaction credits, space for unused
reserved revoke records is released when the handle is stopped.

On the ext4 side we currently take a simplistic approach of reserving
space for 1024 revoke records for any transaction. This grows amount of
credits reserved for each handle only by a few and is enough for any
normal workload so that we don't hit warnings in jbd2. We will refine
the logic in following commits.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-20-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
77444ac4f9 jbd2: Drop jbd2_space_needed()
The function is now just a trivial wrapper returning
journal->j_max_transaction_buffers. Drop it.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-19-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
9f356e5a4f jbd2: Account descriptor blocks into t_outstanding_credits
Currently, journal descriptor blocks were not accounted in
transaction->t_outstanding_credits and we were just leaving some slack
space in the journal for them (in jbd2_log_space_left() and
jbd2_space_needed()). This is making proper accounting (and reservation
we want to add) of descriptor blocks difficult so switch to accounting
descriptor blocks in transaction->t_outstanding_credits and just reserve
the same amount of credits in t_outstanding_credits for journal
descriptor blocks when creating transaction.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-18-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
ec8b6f600e jbd2: Factor out common parts of stopping and restarting a handle
jbd2__journal_restart() has quite some code that is common with
jbd2_journal_stop(). Factor this functionality into stop_this_handle()
helper and use it from both functions. Note that this also drops
t_handle_lock protection from jbd2__journal_restart() as
jbd2_journal_stop() does the same thing without it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-17-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
5559b2d81b jbd2: Drop pointless wakeup from jbd2_journal_stop()
When we drop last handle from a transaction and journal->j_barrier_count
> 0, jbd2_journal_stop() wakes up journal->j_wait_transaction_locked
wait queue. This looks pointless - wait for outstanding handles always
happens on journal->j_wait_updates waitqueue.
journal->j_wait_transaction_locked is used to wait for transaction state
changes and by start_this_handle() for waiting until
journal->j_barrier_count drops to 0. The first case is clearly
irrelevant here since only jbd2 thread changes transaction state. The
second case looks related but jbd2_journal_unlock_updates() is
responsible for the wakeup in this case. So just drop the wakeup.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-16-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
150549ed2f jbd2: Drop pointless check from jbd2_journal_stop()
If a transaction is larger than journal->j_max_transaction_buffers, that
is a bug and not a trigger for transaction commit. Also the very next
attempt to start new handle will start transaction commit anyway. So
just remove the pointless check. Arguably, we could start transaction
commit whenever the transaction size is *close* to
journal->j_max_transaction_buffers. This has a potential to reduce
latency of the next jbd2_journal_start() at the cost of somewhat smaller
transactions. However for this to have any effect, it would mean that
there isn't someone already waiting in jbd2_journal_start() which means
metadata load for the fs is pretty light anyway so probably this
optimization is not worth it.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-15-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
dfaf5ffda2 jbd2: Reorganize jbd2_journal_stop()
Move code in jbd2_journal_stop() around a bit. It removes some
unnecessary code duplication and will make factoring out parts common
with jbd2__journal_restart() easier.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-14-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
015c603306 jbd2: Fix statistics for the number of logged blocks
jbd2 statistics counting number of blocks logged in a transaction was
wrong. It didn't count the commit block and more importantly it didn't
count revoke descriptor blocks. Make sure these get properly counted.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
9797a90248 ocfs2: Use accessor function for h_buffer_credits
Use the jbd2 accessor function for h_buffer_credits.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-12-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
a9a8344ee1 ext4, jbd2: Provide accessor function for handle credits
Provide accessor function to get number of credits available in a handle
and use it from ext4. Later, computation of available credits won't be
so straightforward.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-11-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
a413036791 ext4: Provide function to handle transaction restarts
Provide ext4_journal_ensure_credits_fn() function to ensure transaction
has given amount of credits and call helper function to prepare for
restarting a transaction. This allows removing some boilerplate code
from various places, add proper error handling for the case where
transaction extension or restart fails, and reduces following changes
needed for proper revoke record reservation tracking.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Jan Kara
f2890730f8 ext4: Avoid unnecessary revokes in ext4_alloc_branch()
Error cleanup path in ext4_alloc_branch() calls ext4_forget() on freshly
allocated indirect blocks with 'metadata' set to 1. This results in
generating revoke records for these blocks. However this is unnecessary
as the freed blocks are only allocated in the current transaction and
thus they will never be journalled. Make this cleanup path similar to
e.g. cleanup in ext4_splice_branch() and use ext4_free_blocks() to
handle block forgetting by passing EXT4_FREE_BLOCKS_FORGET and not
EXT4_FREE_BLOCKS_METADATA to ext4_free_blocks(). This also allows
allocating transaction not to reserve any credits for revoke records.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:47 -05:00
Jan Kara
6cb367c2d1 ext4: Use ext4_journal_extend() instead of jbd2_journal_extend()
Use ext4 helper ext4_journal_extend() instead of opencoding it in
ext4_try_to_expand_extra_isize().

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:47 -05:00
Jan Kara
321238fbfb ext4: Fix ext4_should_journal_data() for EA inodes
Similarly to directories, EA inodes do only journalled modifications to
their data. Change ext4_should_journal_data() to return true for them so
that we don't have to special-case them during truncate.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:47 -05:00
Jan Kara
65db869c75 ext4: Fix credit estimate for final inode freeing
Estimate for the number of credits needed for final freeing of inode in
ext4_evict_inode() was too small. We may modify 4 blocks (inode & sb for
orphan deletion, bitmap & group descriptor for inode freeing) and not
just 3.

[ Fixed minor whitespace nit. -- TYT ]

Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:31 -05:00
Jens Axboe
6f72653e76 io-wq: use proper nesting IRQ disabling spinlocks for cancel
We don't know what context we'll be called in for cancel, it could very
well be with IRQs disabled already. Use the IRQ saving variants of the
locking primitives.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 13:53:53 -07:00
Matthew Bobrowski
378f32bab3 ext4: introduce direct I/O write using iomap infrastructure
This patch introduces a new direct I/O write path which makes use of
the iomap infrastructure.

All direct I/O writes are now passed from the ->write_iter() callback
through to the new direct I/O handler ext4_dio_write_iter(). This
function is responsible for calling into the iomap infrastructure via
iomap_dio_rw().

Code snippets from the existing direct I/O write code within
ext4_file_write_iter() such as, checking whether the I/O request is
unaligned asynchronous I/O, or whether the write will result in an
overwrite have effectively been moved out and into the new direct I/O
->write_iter() handler.
The block mapping flags that are eventually passed down to
ext4_map_blocks() from the *_get_block_*() suite of routines have been
taken out and introduced within ext4_iomap_alloc().

For inode extension cases, ext4_handle_inode_extension() is
effectively the function responsible for performing such metadata
updates. This is called after iomap_dio_rw() has returned so that we
can safely determine whether we need to potentially truncate any
allocated blocks that may have been prepared for this direct I/O
write. We don't perform the inode extension, or truncate operations
from the ->end_io() handler as we don't have the original I/O 'length'
available there. The ->end_io() however is responsible for converting
allocated unwritten extents to written extents.

In the instance of a short write, we fall back and complete the
remainder of the I/O using buffered I/O via
ext4_buffered_write_iter().

The existing buffer_head direct I/O implementation has been removed as
it's now redundant.

[ Fix up ext4_dio_write_iter() per Jan's comments at
  https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ]

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 15:53:28 -05:00
Jan Kara
9b88f9fb0d ext4: Do not iput inode under running transaction
When ext4_mkdir(), ext4_symlink(), ext4_create(), or ext4_mknod() fail
to add entry into directory, it ends up dropping freshly created inode
under the running transaction and thus inode truncation happens under
that transaction. That breaks assumptions that evict() does not get
called from a transaction context and at least in ext4_symlink() case it
can result in inode eviction deadlocking in inode_wait_for_writeback()
when flush worker finds symlink inode, starts to write it back and
blocks on starting a transaction. So change the code in ext4_mkdir() and
ext4_add_nondir() to drop inode reference only after the transaction is
stopped. We also have to add inode to the orphan list in that case as
otherwise the inode would get leaked in case we crash before inode
deletion is committed.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 12:13:25 -05:00
Jan Kara
a9e26328ad ext4: Move marking of handle as sync to ext4_add_nondir()
Every caller of ext4_add_nondir() marks handle as sync if directory has
DIRSYNC set. Move this marking to ext4_add_nondir() to reduce some
duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 12:13:25 -05:00
Jan Kara
b90bfdf581 jbd2: Completely fill journal descriptor blocks
With 32-bit block numbers, we don't allocate the array for journal
buffer heads large enough for corresponding descriptor tags to fill the
descriptor block. Thus we end up writing out half-full descriptor blocks
to the journal unnecessarily growing the transaction. Fix the logic to
allocate the array large enough.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 12:13:25 -05:00
Jan Kara
0db4588945 jbd2: Fixup stale comment in commit code
jbd2_journal_next_log_block() does not look at
transaction->t_outstanding_credits. Remove the misleading comment.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 12:13:25 -05:00
Matthew Bobrowski
3eaf9cc62f ext4: update ext4_sync_file() to not use __generic_file_fsync()
When the filesystem is created without a journal, we eventually call
into __generic_file_fsync() in order to write out all the modified
in-core data to the permanent storage device. This function happens to
try and obtain an inode_lock() while synchronizing the file's buffer
and its associated metadata.

Generally, this is fine, however it becomes a problem when there is
higher level code that has already obtained an inode_lock() as this
leads to a recursive lock situation. This case is especially true when
porting across direct I/O to iomap infrastructure as we obtain an
inode_lock() early on in the I/O within ext4_dio_write_iter() and hold
it until the I/O has been completed. Consequently, to not run into
this specific issue, we move away from calling into
__generic_file_fsync() and perform the necessary synchronization tasks
within ext4_sync_file().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/3495f35ef67f2021b567e28e6f59222e583689b8.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
0b9f230b94 ext4: move inode extension check out from ext4_iomap_alloc()
Lift the inode extension/orphan list handling code out from
ext4_iomap_alloc() and apply it within the ext4_dax_write_iter().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/fd5c84db25d5d0da87d97ed4c36fd844f57da759.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
569342dc24 ext4: move inode extension/truncate code out from ->iomap_end() callback
In preparation for implementing the iomap direct I/O modifications,
the inode extension/truncate code needs to be moved out from the
ext4_iomap_end() callback. For direct I/O, if the current code
remained, it would behave incorrectly. Updating the inode size prior
to converting unwritten extents would potentially allow a racing
direct I/O read to find unwritten extents before being converted
correctly.

The inode extension/truncate code now resides within a new helper
ext4_handle_inode_extension(). This function has been designed so that
it can accommodate both DAX and direct I/O extension/truncate
operations.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d41ffa26e20b15b12895812c3cad7c91a6a59bc6.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
b1b4705d54 ext4: introduce direct I/O read using iomap infrastructure
This patch introduces a new direct I/O read path which makes use of
the iomap infrastructure.

The new function ext4_do_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the read operation
performed on the inode is not supported, which is checked via
ext4_dio_supported(), then we simply fall back and complete the I/O
using buffered I/O.

Existing direct I/O read code path has been removed, as it is now
redundant.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f98a6f73fadddbfbad0fc5ed04f712ca0b799f37.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
09edf4d381 ext4: introduce new callback for IOMAP_REPORT
As part of the ext4_iomap_begin() cleanups that precede this patch, we
also split up the IOMAP_REPORT branch into a completely separate
->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
rationale for this change is to reduce the overall clutter within
ext4_iomap_begin().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
f063db5ee9 ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper
In preparation for porting across the ext4 direct I/O path over to the
iomap infrastructure, split up the IOMAP_WRITE branch that's currently
within ext4_iomap_begin() into a separate helper
ext4_iomap_alloc(). This way, when we add in the necessary code for
direct I/O, we don't end up with ext4_iomap_begin() becoming a
monstrous twisty maze.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/50eef383add1ea529651640574111076c55aca9f.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
c8fdfe2941 ext4: move set iomap routines into a separate helper ext4_set_iomap()
Separate the iomap field population code that is currently within
ext4_iomap_begin() into a separate helper ext4_set_iomap(). The intent
of this function is self-explanatory; however, the rationale behind
taking this step is to reduce the overall clutter that we currently
have within the ext4_iomap_begin() callback.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1ea34da65eecffcddffb2386668ae06134e8deaf.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Matthew Bobrowski
2e9b51d782 ext4: iomap that extends beyond EOF should be marked dirty
This patch addresses what Dave Chinner had discovered and fixed within
commit 7684e2c438. This change does not have any user-visible
impact for ext4 as none of the current users of ext4_iomap_begin()
that extend files depend on IOMAP_F_DIRTY.

When doing a direct IO that spans the current EOF, and there are
written blocks beyond EOF that extend beyond the current write, the
only metadata update that needs to be done is a file size extension.

However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
there is IO completion metadata updates required, and hence we may
fail to correctly sync file size extensions made in IO completion when
O_DSYNC writes are being used and the hardware supports FUA.

Hence when setting IOMAP_F_DIRTY, we need to also take into account
whether the iomap spans the current EOF. If it does, then we need to
mark it dirty so that IO completion will call generic_write_sync() to
flush the inode size update to stable storage correctly.
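
The rule described above can be sketched in user space as follows. This
is only an illustration, not the ext4 code: sketch_iomap_flags() and
SKETCH_F_DIRTY are hypothetical names standing in for the flag-setting
logic and IOMAP_F_DIRTY, and the inode_datasync_dirty argument is an
assumed stand-in for whatever existing dirty check the filesystem
already performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for IOMAP_F_DIRTY; the value is arbitrary. */
#define SKETCH_F_DIRTY 0x1u

/*
 * Return the flags for a mapping covering [offset, offset + length).
 * The mapping is marked dirty if the inode already has metadata that
 * must be synced, or if the mapping extends beyond the current EOF
 * (in which case I/O completion has to persist the new file size).
 */
static unsigned int sketch_iomap_flags(bool inode_datasync_dirty,
				       uint64_t offset, uint64_t length,
				       uint64_t i_size)
{
	unsigned int flags = 0;

	assert(length > 0);

	if (inode_datasync_dirty || offset + length > i_size)
		flags |= SKETCH_F_DIRTY;

	return flags;
}
```

With an i_size of 8192, a 4096-byte mapping at offset 0 stays clean,
while an 8192-byte mapping at offset 4096 crosses EOF and is marked
dirty.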

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8b43ee9ee94bee5328da56ba0909b7d2229ef150.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:39 -05:00
Matthew Bobrowski
548feebec7 ext4: update direct I/O read lock pattern for IOCB_NOWAIT
This patch updates the lock pattern in ext4_direct_IO_read() to not
block on inode lock in cases of IOCB_NOWAIT direct I/O reads. The
locking condition implemented here is similar to that of 942491c9e6
("xfs: fix AIM7 regression").

Fixes: 16c5468859 ("ext4: Allow parallel DIO reads")
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c5d5e759f91747359fbd2c6f9a36240cf75ad79f.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:39 -05:00
Matthew Bobrowski
53e5cca567 ext4: reorder map.m_flags checks within ext4_iomap_begin()
For the direct I/O changes that follow in this patch series, we need
to accommodate for the case where the block mapping flags passed
through to ext4_map_blocks() result in m_flags having both
EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In order for any
allocated unwritten extents to be converted correctly in the
->end_io() handler, the iomap->type must be set to IOMAP_UNWRITTEN for
cases where the EXT4_MAP_UNWRITTEN bit has been set within
m_flags. Hence the reason why we need to reshuffle this conditional
statement around.

This change is a no-op for DAX as the block mapping flags passed
through to ext4_map_blocks() i.e. EXT4_GET_BLOCKS_CREATE_ZERO never
results in both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at
once.
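
The ordering requirement can be illustrated with a small stand-alone
sketch. The SKETCH_* names below are hypothetical stand-ins for the
EXT4_MAP_* bits and IOMAP_* types, and returning a hole when neither bit
is set is an assumption made only to keep the example total.

```c
#include <assert.h>

/* Hypothetical stand-ins for EXT4_MAP_MAPPED / EXT4_MAP_UNWRITTEN. */
#define SKETCH_MAP_MAPPED	(1u << 0)
#define SKETCH_MAP_UNWRITTEN	(1u << 1)

/* Hypothetical stand-ins for the iomap types. */
enum sketch_iomap_type {
	SKETCH_IOMAP_HOLE,
	SKETCH_IOMAP_MAPPED,
	SKETCH_IOMAP_UNWRITTEN,
};

/*
 * Test the unwritten bit first: when both bits are set, the extent must
 * still be reported as unwritten so that it gets converted at I/O
 * completion. Checking the mapped bit first would hide that case.
 */
static enum sketch_iomap_type sketch_type_from_flags(unsigned int m_flags)
{
	assert((m_flags & ~(SKETCH_MAP_MAPPED | SKETCH_MAP_UNWRITTEN)) == 0);

	if (m_flags & SKETCH_MAP_UNWRITTEN)
		return SKETCH_IOMAP_UNWRITTEN;
	if (m_flags & SKETCH_MAP_MAPPED)
		return SKETCH_IOMAP_MAPPED;
	return SKETCH_IOMAP_HOLE;
}
```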

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1309ad80d31a637b2deed55a85283d582a54a26a.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:39 -05:00
Theodore Ts'o
f21bdbba0a Merge branch 'iomap-for-next' into mb/dio 2019-11-05 11:31:32 -05:00
Darrick J. Wong
9842b56cd4 xfs: make the assertion message functions take a mount parameter
Make the assfail and asswarn functions take a struct xfs_mount so that
we can start tying debugging and corruption messages to a particular
mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-05 08:28:27 -08:00
Darrick J. Wong
110f09cb70 xfs: add missing assert in xfs_fsmap_owner_from_rmap
The fsmap handler shouldn't fail silently if the rmap code ever feeds it
a special owner number that isn't known to the fsmap handler.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-05 08:28:27 -08:00
Darrick J. Wong
ee4fb16cbe xfs: decrease indenting problems in xfs_dabuf_map
Refactor the code that complains when a dir/attr mapping doesn't exist
but the caller requires a mapping.  This small restructuring helps us to
reduce the indenting level.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-05 08:28:27 -08:00
Ian Kent
50f8300904 xfs: fold xfs_mount_alloc() into xfs_init_fs_context()
After switching to use the mount-api the only remaining caller of
xfs_mount_alloc() is xfs_init_fs_context(), so fold xfs_mount_alloc()
into it.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:27 -08:00
Ian Kent
8757c38f2c xfs: move xfs_fc_parse_param() above xfs_fc_get_tree()
Grouping the options parsing and mount handling functions above the
struct fs_context_operations but below the struct super_operations
should improve (some) the grouping of the super operations while also
improving the grouping of the options parsing and mount handling code.

Lastly move xfs_fc_parse_param() and related functions down to above
xfs_fc_get_tree() and its related functions.

But leave the options enum, struct fs_parameter_spec and the struct
fs_parameter_description declarations at the top since that's the
logical place for them.

This is a straight code move, there aren't any functional changes.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:27 -08:00
Ian Kent
2f8d66b3cd xfs: move xfs_fc_get_tree() above xfs_fc_reconfigure()
Grouping the options parsing and mount handling functions above the
struct fs_context_operations but below the struct super_operations
should improve (some) the grouping of the super operations while also
improving the grouping of the options parsing and mount handling code.

Now move xfs_fc_get_tree() and friends; also take the opportunity to
change STATIC to static for the xfs_fs_put_super() function.
This is a straight code move, there aren't any functional changes.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:27 -08:00
Ian Kent
63cd1e9b02 xfs: move xfs_fc_reconfigure() above xfs_fc_free()
Grouping the options parsing and mount handling functions above the
struct fs_context_operations but below the struct super_operations
should improve (some) the grouping of the super operations while also
improving the grouping of the options parsing and mount handling code.

Start by moving xfs_fc_reconfigure() and friends.
This is a straight code move, there aren't any functional changes.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
73e5fff98b xfs: switch to use the new mount-api
Define the struct fs_parameter_spec table that's used by the new
mount-api for options parsing.

Create the various fs context operations methods and define the
fs_context_operations struct.

Create the fs context initialization method and update the struct
file_system_type to utilize it. The initialization function is
responsible for working storage initialization, allocation and
initialization of file system private information storage and for
setting the operations in the fs context.

Also set struct file_system_type .parameters to the newly defined
struct fs_parameter_spec options parsing table for use by the fs
context methods and remove unused code.

[darrick: add a comment pointing out the one place where mp->m_super is
null]

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
7c89fcb278 xfs: dont set sb in xfs_mount_alloc()
When changing to use the new mount api the super block won't be
available when the xfs_mount struct is allocated so move setting the
super block in xfs_mount to xfs_fs_fill_super().

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
9a861816a0 xfs: move xfs_parseargs() validation to a helper
Move the validation code of xfs_parseargs() into a helper for later
use within the mount context methods.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
48a06e1b57 xfs: refactor xfs_parseargs()
Refactor xfs_parseargs(), move the entire token case block to a separate
function in an attempt to highlight the code that actually changes in
converting to use the new mount api.

Also change the break in the switch to a return in the factored out
xfs_fc_parse_param() function.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
846410ccd1 xfs: avoid redundant checks when options is empty
When options passed to xfs_parseargs() is NULL the checks performed
after taking the branch are made with the initial values of dsunit,
dswidth and iosizelog. But all the checks do nothing in this case
so return immediately instead.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
c0a6791667 xfs: refactor suffix_kstrtoint()
The mount-api doesn't have a "human unit" parse type yet so the options
that have values like "10k" etc. still need to be converted by the fs.

But the value comes to the fs as a string (not a substring_t type) so
there's a need to change the conversion function to take a character
string instead.

When xfs is switched to use the new mount-api match_kstrtoint() will no
longer be used and will be removed.
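
For readers unfamiliar with this kind of option value, here is a rough
user-space sketch of parsing a plain character string such as "10k".
It is not the kernel's suffix_kstrtoint(); parse_size_suffix() is a
hypothetical name, and the accepted suffixes (k, m, g as powers of
1024, in either case) are an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse a decimal number with an optional k/m/g suffix from a plain
 * NUL-terminated string. Returns 0 and stores the scaled value in *res,
 * or a negative errno on malformed input or overflow.
 */
static int parse_size_suffix(const char *s, int *res)
{
	char *end;
	long val;
	int shift = 0;

	assert(res != NULL);

	if (s == NULL || *s == '\0')
		return -EINVAL;

	errno = 0;
	val = strtol(s, &end, 10);
	if (end == s || errno == ERANGE)
		return -EINVAL;

	switch (*end) {
	case 'k': case 'K': shift = 10; end++; break;
	case 'm': case 'M': shift = 20; end++; break;
	case 'g': case 'G': shift = 30; end++; break;
	default: break;
	}

	/* Anything left after the optional suffix is an error. */
	if (*end != '\0')
		return -EINVAL;

	/* Reject negatives and values that would overflow an int. */
	if (val < 0 || val > (long)(INT_MAX >> shift))
		return -ERANGE;

	*res = (int)(val << shift);
	return 0;
}
```

Taking a const char * rather than a substring type is the point of the
refactoring described above; the suffix handling itself stays with the
filesystem.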

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
2c6eba3177 xfs: add xfs_remount_ro() helper
Factor the remount read only code into a helper to simplify the
subsequent change from the super block method .remount_fs to the
mount-api fs_context_operations method .reconfigure.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:26 -08:00
Ian Kent
82332b6da2 xfs: add xfs_remount_rw() helper
Factor the remount read write code into a helper to simplify the
subsequent change from the super block method .remount_fs to the
mount-api fs_context_operations method .reconfigure.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:25 -08:00
Ian Kent
a943f372c2 xfs: merge freeing of mp names and mp
In all cases when struct xfs_mount (mp) fields m_rtname and m_logname
are freed mp is also freed, so merge these into a single function
xfs_mount_free()

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:25 -08:00
Ian Kent
7b77b46a61 xfs: use kmem functions for struct xfs_mount
The remount function uses the kmem functions for allocating and freeing
struct xfs_mount; for consistency, use the kmem functions everywhere for
struct xfs_mount.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:25 -08:00
Ian Kent
3d9d60d9ad xfs: dont use XFS_IS_QUOTA_RUNNING() for option check
When CONFIG_XFS_QUOTA is not defined any quota option is invalid.

Using the macro XFS_IS_QUOTA_RUNNING() as a check if any quota option
has been given is a little misleading so use a simple m_qflags != 0
check to make the intended use more explicit.

Also change to use the IS_ENABLED() macro for the kernel config check.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:25 -08:00
Ian Kent
e1d3d21885 xfs: use super s_id instead of struct xfs_mount m_fsname
Eliminate struct xfs_mount field m_fsname by using the super block s_id
field directly.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:25 -08:00
Ian Kent
f676c75086 xfs: remove unused struct xfs_mount field m_fsname_len
The struct xfs_mount field m_fsname_len is not used anywhere, remove it.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05 08:28:25 -08:00
Luis Henriques
a3a0819388 ceph: don't allow copy_file_range when stripe_count != 1
copy_file_range tries to use the OSD 'copy-from' operation, which simply
performs a full object copy.  Unfortunately, the implementation of this
system call assumes that stripe_count is always set to 1 and doesn't take
into account that the data may be striped across an object set.  If the
file layout has stripe_count different from 1, then the destination file
data will be corrupted.

For example:

Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and
stripe_size of 2 MiB; the first half of the file will be filled with 'A's
and the second half will be filled with 'B's:

               0      4M     8M       Obj1     Obj2
               +------+------+       +----+   +----+
        file:  | AAAA | BBBB |       | AA |   | AA |
               +------+------+       |----|   |----|
                                     | BB |   | BB |
                                     +----+   +----+

If we copy_file_range this file into a new file (which needs to have the
same file layout!), then it will start by copying the object starting at
file offset 0 (Obj1).  And then it will copy the object starting at file
offset 4M -- which is Obj1 again.

Unfortunately, the solution for this is to not allow remote object copies
to be performed when the file layout stripe_count is not 1 and simply
fall back to the default (VFS) copy_file_range implementation.
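
The object arithmetic behind the example can be sketched as below. This
is a stand-alone illustration of striping arithmetic under the layout
parameters named in the example (stripe unit, stripe count, object
size); struct sketch_layout and sketch_object_for_offset() are
hypothetical names, not the ceph client's API. Objects are numbered
from 0 here, so Obj1 in the diagram is object 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout description; field names are illustrative. */
struct sketch_layout {
	uint64_t stripe_unit;	/* bytes per stripe unit */
	uint64_t stripe_count;	/* objects per object set */
	uint64_t object_size;	/* bytes per object */
};

/* Map a file offset to the index of the object that stores it. */
static uint64_t sketch_object_for_offset(const struct sketch_layout *l,
					 uint64_t off)
{
	uint64_t stripes_per_object, blockno, stripeno, stripepos, objsetno;

	assert(l->stripe_unit > 0 && l->stripe_count > 0);
	assert(l->object_size >= l->stripe_unit);
	assert(l->object_size % l->stripe_unit == 0);

	stripes_per_object = l->object_size / l->stripe_unit;
	blockno = off / l->stripe_unit;		/* which stripe unit */
	stripeno = blockno / l->stripe_count;	/* which stripe (row) */
	stripepos = blockno % l->stripe_count;	/* column in the row */
	objsetno = stripeno / stripes_per_object;

	return objsetno * l->stripe_count + stripepos;
}
```

With a 2 MiB stripe unit, stripe_count of 2 and 4 MiB objects, file
offsets 0 and 4 MiB both land in object 0, while 2 MiB and 6 MiB land
in object 1, which is why copying "the object at offset 4M" copies the
first object a second time.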

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-05 15:42:58 +01:00
Jeff Layton
5bb5e6ee6f ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open
If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
that it already passed d_revalidate so we *know* that it's negative (or
at least was very recently). Just return -ENOENT in that case.

This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
call atomic_open with the parent's i_rwsem shared, but calling
d_splice_alias on a hashed dentry requires the exclusive lock.

If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
open, and another client were to race in and create the file before we
issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
the dentry with the new inode with insufficient locks.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-05 15:42:44 +01:00
Nikitas Angelinas
a9913d7eaf reiserfs: replace open-coded atomic_dec_and_mutex_lock()
Replace the open-coded logic of atomic_dec_and_mutex_lock() in
reiserfs_file_release().
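
For context, the semantics of such a helper can be sketched in user
space with C11 atomics and a POSIX mutex. sketch_dec_and_mutex_lock()
is a hypothetical analogue written for illustration; it is not the
kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decrement *cnt. If the count drops to zero, return true with the
 * mutex held; otherwise return false without holding the mutex.
 */
static bool sketch_dec_and_mutex_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
	int old;

	assert(cnt != NULL && lock != NULL);

	/* Fast path: decrement lock-free unless this could reach zero. */
	old = atomic_load(cnt);
	while (old != 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
		/* On failure, old was reloaded; re-check and retry. */
	}

	/* Slow path: the final decrement must happen under the lock. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) != 1) {
		pthread_mutex_unlock(lock);
		return false;
	}
	return true;
}
```

Callers typically do their last-user cleanup while the mutex is held
and then unlock it; open-coding the same sequence at each call site is
what this commit removes.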

Link: https://lore.kernel.org/r/20191103094431.GA18576-nikitas.angelinas@gmail.com
Signed-off-by: Nikitas Angelinas <nikitas.angelinas@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-05 12:25:22 +01:00
Chengguang Xu
e705f4b8aa ext2: check err when partial != NULL
Checking err when partial == NULL is meaningless because
partial == NULL means getting branch successfully without
error.

CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191105045100.7104-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-05 10:42:18 +01:00
Darrick J. Wong
a5155b870d xfs: always log corruption errors
Make sure we log something to dmesg whenever we return -EFSCORRUPTED up
the call stack.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-04 13:55:54 -08:00
Darrick J. Wong
d243b89a61 xfs: constify the buffer pointer arguments to error functions
Some of the xfs error message functions take a pointer to a buffer that
will be dumped to the system log.  The logging functions don't change
the contents, so constify all the parameters.  This enables the next
patch to ensure that we log bad metadata when we encounter it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-04 13:55:34 -08:00
Darrick J. Wong
e91ec882af xfs: relax shortform directory size checks
Each of the four functions that operate on shortform directories checks
that the directory's di_size is at least as large as the shortform
directory header.  This is now checked by the inode fork verifiers
(di_size is used to allocate if_bytes, and if_bytes is checked against
the header structure size) so we can turn these checks into ASSERTions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-04 13:54:58 -08:00
David Sterba
a5009d3a31 btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal but we actually do use them for
'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5
should have been just the async flag for subvolume creation.

The deprecation has been added in this development cycle, remove it
until it's time.

Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-04 21:42:01 +01:00
Josef Bacik
d98da49977 btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic

  kernel BUG at mm/page-writeback.c:2659!
  RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
  Call Trace:
   __process_pages_contig+0x25a/0x350
   ? extent_clear_unlock_delalloc+0x43/0x70
   submit_compressed_extents+0x359/0x4d0
   normal_work_helper+0x15a/0x330
   process_one_work+0x1f5/0x3f0
   worker_thread+0x2d/0x3d0
   ? rescuer_thread+0x340/0x340
   kthread+0x111/0x130
   ? kthread_create_on_node+0x60/0x60
   ret_from_fork+0x1f/0x30

This is happening because the page is not locked when doing
clear_page_dirty_for_io.  Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.

This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do

actual_end = min_t(u64, i_size_read(inode), end + 1);

which is problematic because i_size_read() can evaluate to different
values in between checking and assigning.  So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.

I confirmed this was what was happening by installing a debug kernel
that had

  actual_end = min_t(u64, i_size_read(inode), end + 1);
  if (actual_end > end + 1) {
	  printk(KERN_ERR "KABOOM\n");
	  actual_end = end + 1;
  }

and installing it onto 500 boxes of the tier that had been seeing the
problem regularly.  Last night I got my debug message and no panic,
confirming what I expected.

[ dsterba: the assembly confirms a tiny race window:

    mov    0x20(%rsp),%rax
    cmp    %rax,0x48(%r15)           # read
    movl   $0x0,0x18(%rsp)
    mov    %rax,%r12
    mov    %r14,%rax
    cmovbe 0x48(%r15),%r12           # eval

  Where r15 is inode and 0x48 is offset of i_size.

  The original fix was to revert 62b3762271 that would do an
  intermediate assignment and this would also avoid the double
  evaluation but is not future-proof, should the compiler merge the
  stores and call i_size_read anyway.

  There's a patch adding READ_ONCE to i_size_read but that's not being
  applied at the moment and we need to fix the bug. Instead, emulate
  READ_ONCE by two barrier()s, which is what effectively happens. The
  assembly confirms single evaluation:

    mov    0x48(%rbp),%rax          # read once
    mov    0x20(%rsp),%rcx
    mov    $0x20,%edx
    cmp    %rax,%rcx
    cmovbe %rcx,%rax
    mov    %rax,(%rsp)
    mov    %rax,%rcx
    mov    %r14,%rax

  Where 0x48(%rbp) is inode->i_size stored to %rax.
]
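
The single-evaluation idea can be shown in user space with a C11 atomic
load in place of the kernel's i_size_read() plus barrier()s. This is a
different technique from the one in the patch and is given only as an
illustration; sketch_actual_end() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Compute min(i_size, end + 1) while reading the shared size exactly
 * once. Because the value is copied into a local first, a concurrent
 * size increase between the comparison and the assignment cannot push
 * the result past end + 1.
 */
static uint64_t sketch_actual_end(_Atomic uint64_t *i_size, uint64_t end)
{
	uint64_t size, limit;

	assert(end != UINT64_MAX);	/* end + 1 must not wrap */

	size = atomic_load_explicit(i_size, memory_order_relaxed);
	limit = end + 1;

	return size < limit ? size : limit;
}
```

With the numbers from the report, a size of 24576 and a locked range
ending at 20479 yield 20480, never the larger value.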

Fixes: 62b3762271 ("btrfs: Remove isize local variable in compress_file_range")
CC: stable@vger.kernel.org # v5.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-04 21:41:49 +01:00
Jens Axboe
51c3ff62ca io_uring: add completion trace event
We currently don't have a completion event trace, add one of those. And
to better be able to match up submissions and completions, add user_data
to the submission trace as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 07:07:52 -07:00
Jan Kara
a0828b6ccb quota: Handle quotas without quota inodes in dquot_get_state()
Make dquot_get_state() gracefully handle a situation when there are no
quota files present even though quotas are enabled.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 11:21:40 +01:00
Jan Kara
2ec1f3011f quota: Make dquot_disable() work without quota inodes
Quota on and quota off are protected by s_umount semaphore held in
exclusive mode since commit 7d6cd73d33 "quota: Hold s_umount in
exclusive mode when enabling / disabling quotas". This makes it
impossible for dquot_disable() to race with other enabling or disabling
of quotas. Simplify the cleanup done by dquot_disable() based on this
fact and also remove some stale comments. As a bonus this cleanup makes
dquot_disable() properly handle a case when there are no quota inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 11:21:35 +01:00
Jan Kara
069a916636 quota: Drop dquot_enable()
Now dquot_enable() has only two internal callers and both of them just
need to update quota flags and don't need most of checks. Just drop
dquot_enable() and fold necessary functionality into the two calling
places.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 09:58:06 +01:00
Jan Kara
7212b95e61 fs: Use dquot_load_quota_inode() from filesystems
Use dquot_load_quota_inode from filesystems instead of dquot_enable().
In all three cases we want to load quota inode and never use the
function to update quota flags.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 09:58:05 +01:00
Jan Kara
dc19432ae1 quota: Rename vfs_load_quota_inode() to dquot_load_quota_inode()
Rename vfs_load_quota_inode() to dquot_load_quota_inode() to be
consistent with naming of other functions used for enabling quota
accounting from filesystems. Also export the function and add some
sanity checks to assure filesystems are calling the function properly.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 09:58:05 +01:00
Jan Kara
ae45f07d47 quota: Simplify dquot_resume()
We already have quota inode loaded when resuming quotas. Use
vfs_load_quota() to avoid some pointless churn with the quota inode.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 09:58:05 +01:00
Jan Kara
c7d3d28360 quota: Factor out setup of quota inode
Factor out setting up of quota inode and eventual error cleanup from
vfs_load_quota_inode(). This will simplify situation for filesystems
that don't have any quota inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-04 09:58:01 +01:00
Christoph Hellwig
c34d570d15 xfs: cleanup use of the XFS_ALLOC_ flags
Always set XFS_ALLOC_USERDATA for data fork allocations, and check it
in xfs_alloc_is_userdata instead of the current obfuscated check.
Also remove the xfs_alloc_is_userdata and xfs_alloc_allow_busy_reuse
helpers to make the code a little easier to understand.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:31 -08:00
Christoph Hellwig
fd638f1de1 xfs: move extent zeroing to xfs_bmapi_allocate
Move the extent zeroing case there for the XFS_BMAPI_ZERO flag outside
the low-level allocator and into xfs_bmapi_allocate, where it still
is in transaction context, but outside the very lowlevel code where
it doesn't belong.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
be6cacbeea xfs: refactor xfs_bmapi_allocate
Avoid duplicate userdata and data fork checks by restructuring the code
so we only have a helper for userdata allocations that combines these
checks in a straightforward way.  That also helps to obsolete the
comments explaining what the code does as it is now clearly obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
e696663a97 xfs: simplify the xfs_iomap_write_direct calling
Move the EOF alignment and checking for the next allocated extent into
the callers to avoid the need to pass the byte based offset and count
as well as looking at the incoming imap.  The added benefit is that
the caller can unlock the incoming ilock and the function doesn't have
funny unbalanced locking contexts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
307cdb54b8 xfs: don't log the inode in xfs_fs_map_blocks if it
Even if we are asked for a write layout there is no point in logging
the inode unless we actually modified it in some way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
88cdb7147b xfs: slightly tweak an assert in xfs_fs_map_blocks
We should never see delalloc blocks for a pNFS layout, write or not.
Adjust the assert to check for that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
57c49444d7 xfs: remove the extsize argument to xfs_eof_alignment
And move the code dependent on it to the one caller that cares
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
49bbf8c761 xfs: mark xfs_eof_alignment static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Christoph Hellwig
ae7e403fa5 xfs: simplify xfs_iomap_eof_align_last_fsb
By open coding xfs_bmap_last_extent instead of calling it through a
double indirection we don't need to handle an error return that
can't happen given that we are guaranteed to have the extent list in
memory already.  Also simplify the calling conventions a little and
move the extent list assert from the only caller into the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Jan Kara
cba22d86e0 bdev: Refresh bdev size for disks without partitioning
Currently, block device size is not updated on second and further open
for block devices where partition scan is disabled. This is particularly
annoying for example for DVD drives as that means block device size does
not get updated once the media is inserted into a drive if the device is
already open when inserting the media. This is actually always the case
for example when pktcdvd is in use.

Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-03 07:53:50 -07:00
Jan Kara
731dc48683 bdev: Factor out bdev revalidation into a common helper
Factor out code handling revalidation of bdev on disk change into a
common helper.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-03 07:53:49 -07:00
Greg Kroah-Hartman
9927c6fa3e debugfs: remove return value of debugfs_create_atomic_t()
No one checks the return value of debugfs_create_atomic_t(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.

Link: https://lore.kernel.org/r/20191016130332.GA28240@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-03 14:03:01 +01:00
Linus Torvalds
56cfd2507d a small smb3 memleak fix
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl290UsACgkQiiy9cAdy
 T1GVTAv+Mga+91Nw8Nte0Ix3ynuitDsqjtj6jIJs2FHoOI8cO1RhplU16elxS1OQ
 y3AekBU/go2aWWraTPtGiZZReIPm0gyku11lK8zox3zEE9buFFR0dHvZgxll2gG8
 IHJNgn76avvs+gI4XLeITzpwcv8Xt+z9VN1A0vujDSfSg3TeMEIyr6ofnFSgo9jx
 2SRmCAMcgBameUlZWkc4fdz66GLguXhnYAZ7paX1mMLPuEsEmvHquU691+sKqDej
 Q2GarzDR3JVusNIiuJtlwJlUprKAQuGuF0h6B9raZ0saoyR3MFr2bUkxNqDMPj4T
 9BTeRItnPWcxh+q7bfvJi9LiHTP2tevoXZhqafd17hYRj3noXyw0FRLsKmYDccW2
 Q4+PjOiv/Qyxg8g6l/Bw87VowYrzvVPxfcFMt8fC+tijX9XhdbzF/kSwD83jy/Vm
 u14Eps2UEdaO7qiNZDRNSk1DyFePwCUq55ZMx27MbYfqu8RHXV5NvSJw/P7WQEF7
 rAB7Cvy6
 =Oh9N
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "A small smb3 memleak fix"

* tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  fix memory leak in large read decrypt offload
2019-11-02 14:34:00 -07:00
Greg Kroah-Hartman
c7c1168909 debugfs: remove return value of debugfs_create_x8()
No one checks the return value of debugfs_create_x8(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.

Link: https://lore.kernel.org/r/20191011132931.1186197-5-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-02 18:09:12 +01:00
YueHaibing
364b05fd06 io-wq: use kfree_rcu() to simplify the code
The callback function of call_rcu() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu() + callback function.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-02 07:59:46 -06:00
Linus Torvalds
372bf6c1c8 NFS Client Bugfixes for Linux 5.4-rc6
Stable bugfixes:
 - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
 
 Other fixes:
 - The TCP back channel mustn't disappear while requests are outstanding
 - The RDMA back channel mustn't disappear while requests are outstanding
 - Destroy the back channel when we destroy the host transport
 - Don't allow a cached open with a revoked delegation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl28mZ4ACgkQ18tUv7Cl
 QOvAAxAA5h15NJ8OTOpr/kyWQAHEYWIEWQyTNYT8Gi/TI/YCfauET6kxAKg7ETk+
 Bx6+A7Q5OlKSqFZXf2UMMIHPnvH7Ar06uIof6DLjWxA9N/q57SNW0YszQCUhvWjp
 4Ry7d9A2yNrCizEk/vTmY5zfFIo2S88AvwkQ6gLCEyfIn5REWQyQnS4SS3vL7MuS
 E/iWBsXFV2C9/yON+zzZnC0ewMIPbtWA3/CsVI2zDKLsHKvBYOx6n7TbfTtMtc/p
 mCdgLzdEEZlN0KOCzzMp7ziME8y1qUGX6eo17Jrr0ZBZIM7o5d7sYMA88Apu6dMg
 MQM5iEEAf//kmHLWzQ6yg8XZEZrXFMOiGKYu0kWt4pVlqi78gemEfE6T8dClsB5m
 P7jqn/4ntDlJPWPciImsVEzDOHlzlZBIXwZs1JjufeO/hIB570cs80PjeDQP0id8
 Cx3yR7Hr1ldKzwSwpWrtIaqEPljJJw0jhxQ7YYlvfbA1Ogd4ek4QU4RuvyW11CRr
 d/5Oaa7xvjGs63klu8HpZG4NHFby3PPmZT6t4xUcH75MR14S15YCGtaUvPJ7ORck
 C6tZJAkVtJqSoAjsGuBtTc5qiOdo+hMRghutW4Pd6UpnPYJKwQUtTqp3MbfM4+bA
 S19tGG6qn+7cl7bW4NJpNy8HfbC7BdS+m7maHpQfGIdMsxKVFoI=
 =ttcU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "This contains two delegation fixes (with the RCU lock leak fix marked
  for stable), and three patches to fix destroying the sunrpc back
  channel.

  Stable bugfixes:

   - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()

  Other fixes:

   - The TCP back channel mustn't disappear while requests are
     outstanding

   - The RDMA back channel mustn't disappear while requests are
     outstanding

   - Destroy the back channel when we destroy the host transport

   - Don't allow a cached open with a revoked delegation"

* tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
  NFSv4: Don't allow a cached open with a revoked delegation
  SUNRPC: Destroy the back channel when we destroy the host transport
  SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
  SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
2019-11-01 17:37:44 -07:00
Linus Torvalds
0821de2896 for-linus-20191101
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl28lRQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjtbEADbrMXdsdPV2CApxSiZIaWO1mR78yy/btxp
 cHsJ+avaPGxhNukSsose2KWm656SriH/OfQspqtvzDpslbu40V41+vSqSknqGRPr
 8jW5efZIAy6dq0FjbtBnmIV6PhC5d4F/nAEQbsnVRn8RSr3OwQcm8/smpSFA8urI
 oHVU8jiyLsQiSbDvjf2KPhPYhWBHO0W5SyGo29HY8pSzQpsMzGkQ6TcHL4EzwPZP
 WPtGglr14v8rMyhNMxUdHZ9eHCMq7uufFPuyJXzesE/qyM+H8p2pwwxyfflHGZil
 w2vxLJRu8d4UIkHEkNbC0bydXJ+eCtRMBZON1ZGdrZwQ58L9AbBPBZmxKb0LkmHb
 4tc/yQm/0kSUUXwFtDoUoIBFjjy36Pl5BsLt4n5fofsl04myhm5CLqZ8oWxyU0vO
 sCinJwk1+eQO/tbQVDfven+MroNlYVPCnXhIe/12/wEba3EJ7Ab4X5p0lJoJ1oY7
 9dQyY6+BaHd4wV9p0domOP5y7dJnXM9k46EF0/5YoNjoqaH5MWPMq355VH2xNjdw
 5HzRcZfvOAlXASrnXuQAAQAdR2b+s/iFZaNKA7bTZxjNPvYE0zySCMeQeNXmfVKe
 CrDuwViWukwIzETDZHYqMWJxOV4nyOeL3jTo7rQp5A5TEWwBiJKQ4aGBif2eqc+L
 Mk41ziQGuQ==
 =+rar
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191101' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Two small nvme fixes, one is a fabrics connection fix, the other one
   a cleanup made possible by that fix (Anton, via Keith)

 - Fix requeue handling in um ubd (Anton)

 - Fix spin_lock_irq() nesting in blk-iocost (Dan)

 - Three small io_uring fixes:
     - Install io_uring fd after done with ctx (me)
     - Clear ->result before every poll issue (me)
     - Fix leak of shadow request on error (Pavel)

* tag 'for-linus-20191101' of git://git.kernel.dk/linux-block:
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  io_uring: ensure we clear io_kiocb->result before each issue
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  io_uring: don't touch ctx in setup after ring fd install
  io_uring: Fix leaked shadow_req
2019-11-01 17:33:12 -07:00
Theodore Ts'o
0d0a60c92f Merge branch 'rh/dioread-nolock-1k' into dev
2019-11-01 14:49:52 -04:00
Trond Myklebust
79cc55422c NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
A typo in nfs4_refresh_delegation_stateid() means we're leaking an
RCU lock, and always returning a value of 'false'. As the function
description states, we were always supposed to return 'true' if a
matching delegation was found.

Fixes: 12f275cdd1 ("NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-11-01 11:03:56 -04:00
Trond Myklebust
be3df3dd4c NFSv4: Don't allow a cached open with a revoked delegation
If the delegation is marked as being revoked, we must not use it
for cached opens.

Fixes: 869f9dfa4d ("NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-11-01 10:59:26 -04:00
Jackie Liu
e9ffa5c2b7 io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait
We don't use -ERESTARTSYS to tell the application layer to restart the
system call, but instead return -EINTR. We can set -EINTR directly when
woken up by a signal, which saves an assignment operation and a
comparison operation.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01 08:36:36 -06:00
Jens Axboe
62755e35df io_uring: support for generic async request cancel
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
cancel requests that have been punted to async context and are now
in-flight. This works for regular read/write requests to files, as
long as they haven't been started yet. For socket based IO (or things
like accept4(2)), we can cancel work that is already running as well.

To cancel a request, the sqe must have ->addr set to the user_data of
the request it wishes to cancel. If the request is cancelled
successfully, the original request is completed with -ECANCELED
and the cancel request is completed with a result of 0. If the
request was already running, the original may or may not complete
in error. The cancel request will complete with -EALREADY for that
case. And finally, if the request to cancel wasn't found, the cancel
request is completed with -ENOENT.
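
The three outcomes described above can be sketched as a small userspace
model. This is only an illustration: struct req, enum req_state and
async_cancel() are hypothetical names and not the kernel's data
structures. Only the result codes (0, -ECANCELED, -EALREADY, -ENOENT) and
the matching on user_data come from the description above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical request states for this model only. */
enum req_state { REQ_QUEUED, REQ_RUNNING, REQ_DONE };

struct req {
	unsigned long long user_data;	/* key the cancel request matches on */
	enum req_state state;
	int res;			/* completion result of this request */
};

/*
 * Model of the cancel semantics: 'target' plays the role of sqe->addr.
 * Returns the result the cancel request itself would complete with.
 */
static int async_cancel(struct req *reqs, size_t n, unsigned long long target)
{
	for (size_t i = 0; i < n; i++) {
		struct req *r = &reqs[i];

		if (r->user_data != target || r->state == REQ_DONE)
			continue;
		if (r->state == REQ_QUEUED) {
			/* Not started yet: complete the original as cancelled. */
			r->res = -ECANCELED;
			r->state = REQ_DONE;
			return 0;
		}
		/* Already running: the original may or may not fail. */
		return -EALREADY;
	}
	/* No in-flight request with that user_data. */
	return -ENOENT;
}
```

In this model a second cancel of the same user_data returns -ENOENT,
because the first one already completed the request.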

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01 08:35:31 -06:00
Dmitry Monakhov
df4bb5d128 quota: Check that quota is not dirty before release
There is a race window where quota was redirtied once we drop dq_list_lock inside dqput(),
but before we grab dquot->dq_lock inside dquot_release().

TASK1                                                       TASK2 (chowner)
->dqput()
  we_slept:
    spin_lock(&dq_list_lock)
    if (dquot_dirty(dquot)) {
          spin_unlock(&dq_list_lock);
          dquot->dq_sb->dq_op->write_dquot(dquot);
          goto we_slept
    if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
          spin_unlock(&dq_list_lock);
          dquot->dq_sb->dq_op->release_dquot(dquot);
                                                            dqget()
							    mark_dquot_dirty()
							    dqput()
          goto we_slept;
        }
So a dirty dquot will be released by TASK1, but on the next we_slept loop
we detect this and call ->write_dquot() for it.
XFSTEST: 440a80d4cb

Link: https://lore.kernel.org/r/20191031103920.3919-2-dmonakhov@openvz.org
CC: stable@vger.kernel.org
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-31 19:07:42 +01:00
Dmitry Monakhov
6ff33d99fc quota: fix livelock in dquot_writeback_dquots
Write only quotas which are dirty at entry.

XFSTEST: b10ad23566

Link: https://lore.kernel.org/r/20191031103920.3919-1-dmonakhov@openvz.org
CC: stable@vger.kernel.org
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-31 19:07:11 +01:00
Dave Chinner
249bd9087a xfs: properly serialise fallocate against AIO+DIO
AIO+DIO can extend the file size on IO completion, and it holds
no inode locks while the IO is in flight. Therefore, a race
condition exists in file size updates if we do something like this:

aio-thread			fallocate-thread

lock inode
submit IO beyond inode->i_size
unlock inode
.....
				lock inode
				break layouts
				if (off + len > inode->i_size)
					new_size = off + len
				.....
				inode_dio_wait()
				<blocks>
.....
completes
inode->i_size updated
inode_dio_done()
....
				<wakes>
				<does stuff no longer beyond EOF>
				if (new_size)
					xfs_vn_setattr(inode, new_size)


Yup, that attempt to extend the file size in the fallocate code
turns into a truncate - it removes whatever the aio write
allocated and put to disk, and reduces the inode size back down to
where the fallocate operation ends.

Fundamentally, xfs_file_fallocate() is not compatible with racing
AIO+DIO completions, so we need to move the inode_dio_wait() call
up to where we lock the inode and break the layouts.

Secondly, storing the inode size and then using it unchecked without
holding the ILOCK is not safe; we can only do such a thing if we've
locked out and drained all IO and other modification operations,
which we don't do initially in xfs_file_fallocate.

It should be noted that some of the fallocate operations are
compound operations - they are made up of multiple manipulations
that may zero data, and so we may need to flush and invalidate the
file multiple times during an operation. However, we only need to
lock out IO and other space manipulation operations once, as that
lockout is maintained until the entire fallocate operation has been
completed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-31 09:17:55 -07:00
David Howells
8cefc107ca pipe: Use head and tail pointers for the ring, not cursor and length
Convert pipes to use head and tail pointers for the buffer ring rather than
pointer and length as the latter requires two atomic ops to update (or a
combined op) whereas the former only requires one.

 (1) The head pointer is the point at which production occurs and points to
     the slot in which the next buffer will be placed.  This is equivalent
     to pipe->curbuf + pipe->nrbufs.

     The head pointer belongs to the write-side.

 (2) The tail pointer is the point at which consumption occurs.  It points
     to the next slot to be consumed.  This is equivalent to pipe->curbuf.

     The tail pointer belongs to the read-side.

 (3) head and tail are allowed to run to UINT_MAX and wrap naturally.  They
     are only masked off when the array is being accessed, e.g.:

	pipe->bufs[head & mask]

     This means that it is not necessary to have a dead slot in the ring as
     head == tail isn't ambiguous.

 (4) The ring is empty if "head == tail".

     A helper, pipe_empty(), is provided for this.

 (5) The occupancy of the ring is "head - tail".

     A helper, pipe_occupancy(), is provided for this.

 (6) The number of free slots in the ring is "pipe->ring_size - occupancy".

     A helper, pipe_space_for_user() is provided to indicate how many slots
     userspace may use.

 (7) The ring is full if "head - tail >= pipe->ring_size".

     A helper, pipe_full(), is provided for this.
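
The rules in (1)-(7) can be sketched in userspace as follows. This is a
minimal illustration, not the pipe code itself: struct ring, RING_SIZE and
the ring_*() helpers are hypothetical names, and the ring size is assumed
to be a power of two so that masking stays consistent when head and tail
wrap.

```c
#include <stdbool.h>

#define RING_SIZE 8u			/* assumed power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
	unsigned int head;	/* next slot to fill; owned by the writer */
	unsigned int tail;	/* next slot to consume; owned by the reader */
	int bufs[RING_SIZE];
};

static bool ring_empty(const struct ring *r)
{
	return r->head == r->tail;
}

static unsigned int ring_occupancy(const struct ring *r)
{
	/* Unsigned subtraction stays correct across wrap-around. */
	return r->head - r->tail;
}

static bool ring_full(const struct ring *r)
{
	return ring_occupancy(r) >= RING_SIZE;
}

static bool ring_push(struct ring *r, int v)
{
	if (ring_full(r))
		return false;
	r->bufs[r->head & RING_MASK] = v;	/* mask only on access */
	r->head++;				/* single update publishes the slot */
	return true;
}

static bool ring_pop(struct ring *r, int *out)
{
	if (ring_empty(r))
		return false;
	*out = r->bufs[r->tail & RING_MASK];
	r->tail++;
	return true;
}
```

Because the indices are never reduced modulo the ring size, all RING_SIZE
slots can be in use at once, and a ring whose head and tail start just
below UINT_MAX behaves the same as one that starts at zero.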

Signed-off-by: David Howells <dhowells@redhat.com>
2019-10-31 15:12:34 +00:00
Chengguang Xu
30ef0e4085 ext2: don't set *count in the case of failure in ext2_try_to_allocate()
Currently we set *count to num (value 0) when block allocation
fails in ext2_try_to_allocate(). Without a reservation, we reuse
*count (value 0) to retry block allocation, and the wrong *count
causes at most 1 block to be allocated even though the block group
has sufficient free blocks. Ultimately, this probably causes
significant fragmentation.

Link: https://lore.kernel.org/r/20191026090721.23794-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-31 15:40:44 +01:00
Jeff Mahoney
60e4cf67a5 reiserfs: fix extended attributes on the root directory
Since commit d0a5b995a3 (vfs: Add IOP_XATTR inode operations flag)
extended attributes haven't worked on the root directory in reiserfs.

This is due to reiserfs conditionally setting the sb->s_xattrs handler
array depending on whether it located or created the internal privroot
directory.  It necessarily does this after the root inode is already
read in.  The IOP_XATTR flag is set during inode initialization, so
it never gets set on the root directory.

This commit unconditionally assigns sb->s_xattrs and clears IOP_XATTR on
internal inodes.  The old return values due to the conditional assignment
are handled via open_xa_root, which now returns EOPNOTSUPP as the VFS
would have done.

Link: https://lore.kernel.org/r/20191024143127.17509-1-jeffm@suse.com
CC: stable@vger.kernel.org
Fixes: d0a5b995a3 ("vfs: Add IOP_XATTR inode operations flag")
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-31 15:39:24 +01:00
Ingo Molnar
43e0ae7ae0 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM changes from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Dynamic tick (nohz) updates, perhaps most notably changes to
    force the tick on when needed due to lengthy in-kernel execution
    on CPUs on which RCU is waiting.

  - Replace rcu_swap_protected() with rcu_replace_pointer().

  - Torture-test updates.

  - Linux-kernel memory consistency model updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-31 09:33:19 +01:00
Jens Axboe
6873e0bd6a io_uring: ensure we clear io_kiocb->result before each issue
We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.

Reset ->result whenever we re-prep a request for polled submission.

Cc: stable@vger.kernel.org
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-30 14:45:22 -06:00
Paul E. McKenney
62860da708 fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
This commit replaces the use of rcu_swap_protected() with the more
intuitively appealing rcu_replace_pointer() as a step towards removing
rcu_swap_protected().

Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Cc: <linux-afs@lists.infradead.org>
Cc: <linux-kernel@vger.kernel.org>
2019-10-30 08:44:27 -07:00
Jens Axboe
975c99a570 io_uring: io_wq_create() returns an error pointer, not NULL
syzbot reported an issue where we crash at setup time if failslab is
used. The issue is that io_wq_create() returns an error pointer on
failure, not NULL. Hence io_uring thought the io-wq was setup just
fine, but in reality it's a garbage error pointer.

Use IS_ERR() instead of a NULL check, and assign ret appropriately.
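
Why a NULL check misses this failure can be shown with a userspace sketch
of the error-pointer convention. The helpers err_ptr(), ptr_err() and
is_err() below are simplified stand-ins modelled on the kernel's ERR_PTR(),
PTR_ERR() and IS_ERR(); struct wq, wq_create() and setup() are hypothetical
and are not the io-wq code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top 4095 values of the address space. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct wq {
	int nr_workers;
};

/* Hypothetical constructor: reports failure via an error pointer, never NULL. */
static struct wq *wq_create(int nr_workers)
{
	struct wq *wq;

	if (nr_workers <= 0)
		return err_ptr(-EINVAL);
	wq = malloc(sizeof(*wq));
	if (!wq)
		return err_ptr(-ENOMEM);
	wq->nr_workers = nr_workers;
	return wq;
}

/* Caller side: test with is_err() and propagate the encoded errno. */
static int setup(int nr_workers, struct wq **out)
{
	struct wq *wq = wq_create(nr_workers);

	if (is_err(wq))
		return (int)ptr_err(wq);
	*out = wq;
	return 0;
}
```

A caller that wrote "if (!wq)" instead would treat the failed wq_create(0)
result as a valid object, since an error pointer is non-NULL.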

Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-30 08:42:56 -06:00
Linus Torvalds
b66b449872 Fix remounting (broken in -rc1).
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJduXHFAAoJENW/n+sDE2U6FKkP/j9xxzkjHAQwCaim5DIBNJfo
 1GDpDxrN6Voa4hC82p/8fUwFGedo72Ex/TyyVCOnJ9/iNRSugpIdEAWM+NCNs/Sm
 UHbNkKPxCB0rRiO20tvHZsbhytbM2zLEKuDuNwaI5BN+6RvW6AaAeMMSAv245tMh
 0iu5U8l5B/oPRx+EcirSia8Ep+MiMp8+6MrLoQLVYnrs9wsfTrD8LltGPLqLhbPV
 JPkJaeIdjdM5Ii34z5nE8yyAEYr7pjfL7CNim1NY/cM7gRsX4jTdd2xGMecIaUdV
 TIBWn+NzqIguqLdPHf7xDinwNxF6/F2ppQpwW3zSS0VpQpJI1oApi0Xw8ESiylZ5
 uMkR/+x9XpeSjvyvKonk0D0WlnfLfDy/frSxhw8emsJ1o8TSqL2TgHRqVIfU8+ZD
 fR9zGw8tXjmWHZkC1gkyPx6xWiMAVBrTsgHlRIRWukXtKCf/qlCgoq0HwSfFINkt
 z3AYfpHcAZ/zVpYTfy7x3X4jeM71aToTEQuoJPNk3AQ+b5uKH9fD48XYyx1EQlZl
 jwvldyLcPfU2YVe2KTfsglEGY3XqN4cn3B6Ysm7kX81uh7dJ4oeKX1w0yC47/umS
 1cNbx/4h7R+6fTZGhduOH1znDfw92iel8vz6npOe508mq+isj49kGRPJ9wurAMce
 lcSV5NZgDRpikbIq6fZw
 =hVf6
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.4-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
 "Fix remounting (broken in -rc1)."

* tag 'gfs2-v5.4-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix initialisation of args for remount
2019-10-30 14:05:40 +01:00
Andrew Price
d5798141fd gfs2: Fix initialisation of args for remount
When gfs2 was converted to use fs_context, the initialisation of the
mount args structure to the currently active args was lost with the
removal of gfs2_remount_fs(), so the checks of the new args on remount
became checks against the default values instead of the current ones.
This caused unexpected remount behaviour and test failures (xfstests
generic/294, generic/306 and generic/452).

Reinstate the args initialisation, this time in gfs2_init_fs_context()
and conditional upon fc->purpose, as that's the only time we get control
before the mount args are parsed in the remount process.

Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-30 12:16:53 +01:00
Jens Axboe
842f96124c io_uring: fix race with canceling timeouts
If we get -1 from hrtimer_try_to_cancel(), we know that the timer
is running. Hence leave all completion to the timeout handler. If
we don't, we can corrupt the list and miss a completion.

Fixes: 11365043e5 ("io_uring: add support for canceling timeout requests")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 15:43:30 -06:00
Al Viro
1f08529c84 ceph: add missing check in d_revalidate snapdir handling
We should not play with dcache without parent locked...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:55 +01:00
Al Viro
aa8dd81673 ceph: fix RCU case handling in ceph_d_revalidate()
For RCU case ->d_revalidate() is called with rcu_read_lock() and
without pinning the dentry passed to it.  Which means that it
can't rely upon ->d_inode remaining stable; that's the reason
for d_inode_rcu(), actually.

Make sure we don't reload ->d_inode there.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:54 +01:00
Luis Henriques
ea60ed6fcf ceph: fix use-after-free in __ceph_remove_cap()
KASAN reports a use-after-free when running xfstest generic/531, with the
following trace:

[  293.903362]  kasan_report+0xe/0x20
[  293.903365]  rb_erase+0x1f/0x790
[  293.903370]  __ceph_remove_cap+0x201/0x370
[  293.903375]  __ceph_remove_caps+0x4b/0x70
[  293.903380]  ceph_evict_inode+0x4e/0x360
[  293.903386]  evict+0x169/0x290
[  293.903390]  __dentry_kill+0x16f/0x250
[  293.903394]  dput+0x1c6/0x440
[  293.903398]  __fput+0x184/0x330
[  293.903404]  task_work_run+0xb9/0xe0
[  293.903410]  exit_to_usermode_loop+0xd3/0xe0
[  293.903413]  do_syscall_64+0x1a0/0x1c0
[  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because __ceph_remove_cap() may queue a cap release
(__ceph_queue_cap_release) which can be scheduled before that cap is
removed from the inode list with

	rb_erase(&cap->ci_node, &ci->i_caps);

And, when this finally happens, the use-after-free will occur.

This can be fixed by removing the cap from the inode list before being
removed from the session list, and thus eliminating the risk of an UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:51 +01:00
Jens Axboe
65e19f54d2 io_uring: support for larger fixed file sets
There have been a few requests for supporting more fixed files than 1024.
This isn't really tricky to do, we just need to split up the file table
into multiple tables and index appropriately. As we do so, reduce the
max single file table to 512. This enables us to do single page allocs
always for the tables, which is an improvement over the situation prior.

This patch adds support for up to 64K files, which should be enough for
everyone.
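
The "index appropriately" step can be sketched as below. The constants and
names (TABLE_SHIFT, struct slot, locate(), tables_needed()) are
illustrative assumptions rather than the identifiers used in the patch.
Only the 512-entry table size and the 64K limit are taken from the text
above. Assuming 8-byte pointers and 4 KiB pages, 512 pointers fill exactly
one page, which would be consistent with the single page allocation
mentioned.

```c
#define TABLE_SHIFT 9u				/* 2^9 = 512 entries per table */
#define TABLE_SIZE  (1u << TABLE_SHIFT)
#define TABLE_MASK  (TABLE_SIZE - 1u)
#define MAX_FILES   65536u			/* "up to 64K files" */

struct slot {
	unsigned int table;	/* which second-level table */
	unsigned int offset;	/* entry within that table */
};

/* Split a flat fixed-file index into (table, offset). */
static struct slot locate(unsigned int index)
{
	struct slot s = { index >> TABLE_SHIFT, index & TABLE_MASK };

	return s;
}

/* Number of tables needed to hold nr_files entries, rounded up. */
static unsigned int tables_needed(unsigned int nr_files)
{
	return (nr_files + TABLE_SIZE - 1u) / TABLE_SIZE;
}
```

With these assumed constants, index 512 is the first entry of table 1, and
the full 64K range needs 128 tables.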

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Jens Axboe
b7620121dc io_uring: protect fixed file indexing with array_index_nospec()
We index the file tables with a user given value. After we check
it's within our limits, use array_index_nospec() to prevent any
spectre attacks here.
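
The check-then-clamp pattern can be sketched in userspace as below. The
mask arithmetic follows the idea behind the kernel's generic fallback for
array_index_nospec(), but index_mask(), clamp_index() and lookup() are
hypothetical names, and this sketch assumes an arithmetic right shift of
negative values and a size no larger than LONG_MAX. It is not a real
mitigation outside the kernel, where architecture-specific versions and
optimizer barriers are also involved.

```c
#include <limits.h>

/*
 * All ones if index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and arithmetic right shift of negative values.
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* In-range indexes pass through; out-of-range indexes collapse to 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}

/* Bounds check first, then clamp before the dependent load. */
static int lookup(const int *table, unsigned long size,
		  unsigned long index, int *out)
{
	if (index >= size)
		return -1;
	index = clamp_index(index, size);
	*out = table[index];
	return 0;
}
```

The clamp does not replace the bounds check. Its purpose is that a
mispredicted branch cannot feed an out-of-range index into the load.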

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Jens Axboe
17f2fe35d0 io_uring: add support for IORING_OP_ACCEPT
This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00