Commit Graph

58391 Commits

Author SHA1 Message Date
David Sterba
907877664e btrfs: get fs_info from trans in btrfs_set_log_full_commit
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:41 +02:00
David Sterba
4884b8e8eb btrfs: get fs_info from trans in btrfs_need_log_full_commit
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:41 +02:00
David Sterba
9b7a2440ae btrfs: get fs_info from trans in btrfs_create_tree
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:41 +02:00
David Sterba
6b27940843 btrfs: get fs_info from trans in update_block_group
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:41 +02:00
David Sterba
5742d15fa7 btrfs: get fs_info from trans in btrfs_write_dirty_block_groups
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:40 +02:00
David Sterba
bbebb3e0ba btrfs: get fs_info from trans in btrfs_setup_space_cache
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:40 +02:00
David Sterba
39db232dae btrfs: get fs_info from trans in write_one_cache_group
We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:40 +02:00
Nikolay Borisov
f9756261c2 btrfs: Remove redundant inode argument from btrfs_add_ordered_sum
Ordered csums are keyed off of a btrfs_ordered_extent, which already has
a reference to the inode. This implies that an explicit inode argument
is redundant. So remove it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:40 +02:00
Qu Wenruo
8d47a0d8f7 btrfs: Do mandatory tree block check before submitting bio
There are at least 2 reports about a memory bit flip sneaking into
on-disk data.

Currently we only have a relaxed check triggered at
btrfs_mark_buffer_dirty() time, as it's not mandatory and only for
CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build, it doesn't help users to
detect such problem.

This patch will address the hole by triggering comprehensive check on
tree blocks before writing it back to disk.

The design points are:

- Timing of the check: Tree block write hook
  This timing is chosen to reduce the overhead.
  The comprehensive check should be as expensive as a checksum
  calculation.
  Doing full check at btrfs_mark_buffer_dirty() is too expensive for end
  user.

- Loose empty leaf check
  Originally for an empty leaf, tree-checker will report error if it's
  not a tree root.

  The problem for such check at write time is:
  * False alert for tree root created in current transaction
    In that case, the commit root still needs to be written to disk.
    And since current root can differ from commit root, then it will
    cause false alert.
    This happens for log tree.

  * False alert for relocated tree block
    Relocated tree block can be written to disk due to memory pressure,
    in that case an empty csum tree root can be written to disk and
    cause false alert, since csum root node hasn't been updated.

  Previous patch of removing comprehensive empty leaf owner check has
  paved the way for this patch.

The example error output will be something like:

  BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
  BTRFS error (device dm-3): block=1350630375424 write time tree block corruption detected
  BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
  BTRFS info (device dm-3): forced readonly
  BTRFS warning (device dm-3): Skipping commit of aborted transaction.
  BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
  BTRFS info (device dm-3): delayed_refs has NO entry

Reported-by: Leonard Lausen <leonard@lausen.nl>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:40 +02:00
Qu Wenruo
ff2ac107fa btrfs: tree-checker: Remove comprehensive root owner check
Commit 1ba98d086f ("Btrfs: detect corruption when non-root leaf has
zero item") introduced comprehensive root owner checker.

However it's pretty expensive tree search to locate the owner root,
especially when it get reused by mandatory read and write time
tree-checker.

This patch will remove that check, and completely rely on owner based
empty leaf check, which is much faster and still works fine for most
case.

And since we skip the old root owner check, now write time tree check
can be merged with btrfs_check_leaf_full().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:39 +02:00
Robbie Ko
39ad317315 Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve
When doing fallocate, we first add the range to the reserve_list and
then reserve the quota.  If quota reservation fails, we'll release all
reserved parts of reserve_list.

However, cur_offset is not updated to indicate that this range is
already been inserted into the list.  Therefore, the same range is freed
twice.  Once at list_for_each_entry loop, and once at the end of the
function.  This will result in WARN_ON on bytes_may_use when we free the
remaining space.

At the end, under the 'out' label we have a call to:

   btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);

The start offset, third argument, should be cur_offset.

Everything from alloc_start to cur_offset was freed by the
list_for_each_entry_safe_loop.

Fixes: 18513091af ("btrfs: update btrfs_space_info's bytes_may_use timely")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:39 +02:00
David Sterba
178507595c btrfs: get fs_info from eb in read_one_dev
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:39 +02:00
David Sterba
9690ac0987 btrfs: get fs_info from eb in read_one_chunk
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:39 +02:00
David Sterba
ddaf1d5aef btrfs: get fs_info from eb in btrfs_check_chunk_valid
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:39 +02:00
David Sterba
6ec0896c4c btrfs: get fs_info from eb in should_balance_chunk
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:39 +02:00
David Sterba
813fd1dcab btrfs: get fs_info from eb in btrfs_check_node
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
David Sterba
cfdaad5e5f btrfs: get fs_info from eb in btrfs_check_leaf_relaxed
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
David Sterba
1c4360ee05 btrfs: get fs_info from eb in btrfs_check_leaf_full
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
Nikolay Borisov
929be17a9b btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit
Instead of always calling the allocator to search for a free extent,
that satisfies the input criteria, switch btrfs_trim_free_extents to
using find_first_clear_extent_bit. With this change it's no longer
necessary to read the device tree in order to figure out holes in
the devices.

Now the code always searches in-memory data structure to figure out the
space range which contains the requested which should result in speed
improvements.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
Nikolay Borisov
45bfcfc168 btrfs: Implement find_first_clear_extent_bit
This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
Nikolay Borisov
8811133d8a btrfs: Optimize unallocated chunks discard
Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.

Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.

This applies to the single mount period of the filesystem as the
information is not stored permanently.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
Nikolay Borisov
e74e3993bc btrfs: Factor out in_range macro
This is used in more than one places so let's factor it out in ctree.h.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:37 +02:00
Nikolay Borisov
60dfdf25bd btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
Now that these functions no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:37 +02:00
Jeff Mahoney
1c11b63eff btrfs: replace pending/pinned chunks lists with io tree
The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.

The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.

The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.

This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device.  Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:37 +02:00
Nikolay Borisov
68c94e55e1 btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree
Following the introduction of the alloc_state tree, some of the callees
of btrfs_mapping_tree_free will have to interact with the btrfs_device
of the constituent devices. Enable this by moving the code responsible
for freeing devices after the last user (btrfs_mapping_tree_free).
Otherwise the kernel could crash due to use-after-free.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:37 +02:00
Nikolay Borisov
8e75fd893b btrfs: Stop using call_rcu for device freeing
btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.

This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:37 +02:00
Nikolay Borisov
4ca7365606 btrfs: Implement set_extent_bits_nowait
It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
930b090729 btrfs: Introduce new bits for device allocation tree
Rather than hijacking the existing defines let's just define new bits,
with more descriptive names. Instead of using yet more (currently at 18)
bits for the new flags, use the fact those flags will be specific to
the device allocation tree so define them using existing EXTENT_* flags.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
39e264a40d btrfs: Populate ->orig_block_len during read_one_chunk
Chunks read from disk currently don't get their ->orig_block_len member
set, in contrast when a new chunk is allocated, the respective
extent_map's ->orig_block_len is assigned the size of the stripe of this
chunk.

Let's apply the same strategy for chunks which are read from
disk, not only does this codify the invariant that ->orig_block_len
always contains the size of the stripe for a chunk (when the em belongs
to the mapping tree). But it's also a preparatory patch for further work
around tracking chunk allocation in an extent tree rather than
pinned/pending lists.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
41e7acd38c btrfs: Rename and export clear_btree_io_tree
This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
61d0d0d2cb btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
During device shrink pinned/pending chunks (i.e. those which have been
deleted/created respectively, in the current transaction and haven't
touched disk) need to be accounted when doing device shrink. Presently
this happens after the main relocation loop in btrfs_shrink_device,
which could lead to making another go in the body of the function.

Since there is no hard requirement to perform pinned/pending chunks
handling after the relocation loop, move the code before it. This leads
to simplifying the code flow around - i.e. no need to use 'goto again'.

A notable side effect of this change is that modification of the
device's size requires a transaction to be started and committed before
the relocation loop starts. This is necessary to ensure that relocation
process sees the shrunk device size.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
bbbf7243d6 btrfs: combine device update operations during transaction commit
We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used.  We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times.  The
fs_devices->resized_list does more or less the same thing, but with the
disk size.  They are called consecutively during commit and have more or
less the same purpose.

We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating.  Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
c2d1b3aae3 btrfs: Honour FITRIM range constraints during free space trim
Up until now trimming the freespace was done irrespective of what the
arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
will be entirely ignored. Fix it by correctly handling those paramter.
This requires breaking if the found freespace extent is after the end of
the passed range as well as completing trim after trimming
fstrim_range::len bytes.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:35 +02:00
Robbie Ko
040ee6120c Btrfs: send, improve clone range
Improve clone_range in two scenarios.

1. Remove the limit of inode size when find clone inodes We can do
   partial clone, so there is no need to limit the size of the candidate
   inode.  When clone a range, we clone the legal range only by bytenr,
   offset, len, inode size.

2. In the scenarios of rewrite or clone_range, data_offset rarely
   matches exactly, so the chance of a clone is missed.

e.g.
    1. Write a 1M file
        dd if=/dev/zero of=1M bs=1M count=1

    2. Clone 1M file
       cp --reflink 1M clone

    3. Rewrite 4k on the clone file
       dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc

    The disk layout is as follows:
    item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
	extent data disk byte 1103101952 nr 1048576
	extent data offset 0 nr 1048576 ram 1048576
	extent compression(none)
    ...
    item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
	extent data disk byte 1104150528 nr 4096
	extent data offset 0 nr 4096 ram 4096
	extent compression(none)
    item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
	extent data disk byte 1103101952 nr 1048576
	extent data offset 4096 nr 1044480 ram 1048576
	extent compression(none)

When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.

Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:35 +02:00
Anand Jain
8b4d1efc9e btrfs: prop: open code btrfs_set_prop in inherit_prop
When an inode inherits property from its parent, we call btrfs_set_prop().
btrfs_set_prop() does an elaborate checks, which is not required in the
context of inheriting a property. Instead just open-code only the required
items from btrfs_set_prop() and then call btrfs_setxattr() directly. So
now the only user of btrfs_set_prop() is gone, (except for the wraper
function btrfs_set_prop_trans()).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:35 +02:00
Anand Jain
ae0bc86310 btrfs: drop unused parameter in mount_subvol
@device_name in mount_subvol() is not used, drop it.  Also see:
5bedc48a8f ("btrfs: drop unused parameters from mount_subvol").

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:35 +02:00
David Sterba
39e57f495b btrfs: tree-checker: get fs_info from eb in check_inode_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:35 +02:00
David Sterba
412a23127c btrfs: tree-checker: get fs_info from eb in check_dev_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:35 +02:00
David Sterba
5617ed80cb btrfs: tree-checker: get fs_info from eb in dev_item_err
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:34 +02:00
David Sterba
d001e4a3fe btrfs: tree-checker: get fs_info from eb in chunk_err
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:34 +02:00
David Sterba
e2ccd361ef btrfs: tree-checker: get fs_info from eb in check_leaf
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:34 +02:00
David Sterba
0076bc89a7 btrfs: tree-checker: get fs_info from eb in check_leaf_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:34 +02:00
David Sterba
ae2a19d8ad btrfs: tree-checker: get fs_info from eb in check_extent_data_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:34 +02:00
David Sterba
af60ce2b93 btrfs: tree-checker: get fs_info from eb in check_block_group_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:34 +02:00
David Sterba
4806bd886a btrfs: tree-checker: get fs_info from eb in block_group_err
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:33 +02:00
David Sterba
ce4252c049 btrfs: tree-checker: get fs_info from eb in check_dir_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:33 +02:00
David Sterba
d98ced688f btrfs: tree-checker: get fs_info from eb in dir_item_err
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:33 +02:00
David Sterba
68128ce756 btrfs: tree-checker: get fs_info from eb in check_csum_item
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:33 +02:00
David Sterba
1fd715ffdd btrfs: tree-checker: get fs_info from eb in file_extent_err
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:33 +02:00
David Sterba
86a6be3abe btrfs: tree-checker: get fs_info from eb in generic_err
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:33 +02:00
Qu Wenruo
6bf9e4bd6a btrfs: inode: Verify inode mode to avoid NULL pointer dereference
[BUG]
When accessing a file on a crafted image, btrfs can crash in block layer:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  PGD 136501067 P4D 136501067 PUD 124519067 PMD 0
  CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252
  RIP: 0010:end_bio_extent_readpage+0x144/0x700
  Call Trace:
   <IRQ>
   blk_update_request+0x8f/0x350
   blk_mq_end_request+0x1a/0x120
   blk_done_softirq+0x99/0xc0
   __do_softirq+0xc7/0x467
   irq_exit+0xd1/0xe0
   call_function_single_interrupt+0xf/0x20
   </IRQ>
  RIP: 0010:default_idle+0x1e/0x170

[CAUSE]
The crafted image has a tricky corruption, the INODE_ITEM has a
different type against its parent dir:

        item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160
                generation 13 transid 13 size 1048576 nbytes 1048576
                block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0
                sequence 9 flags 0x0(none)

This mode number 0120000 means it's a symlink.

But the dir item think it's still a regular file:

        item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32
                location key (268 INODE_ITEM 0) type FILE
                transid 13 data_len 0 name_len 2
                name: f4
        item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32
                location key (268 INODE_ITEM 0) type FILE
                transid 13 data_len 0 name_len 2
                name: f4

For symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it
empty, as symlink is only designed to have inlined extent, all handled
by tree block read.  Thus no need to trigger btrfs_submit_bio_hook() for
inline file extent.

However end_bio_extent_readpage() expects tree->ops populated, as it's
reading regular data extent.  This causes NULL pointer dereference.

[FIX]
This patch fixes the problem in two ways:

- Verify inode mode against its dir item when looking up inode
  So in btrfs_lookup_dentry() if we find inode mode mismatch with dir
  item, we error out so that corrupted inode will not be accessed.

- Verify inode mode when getting extent mapping
  Only regular file should have regular or preallocated extent.
  If we found regular/preallocated file extent for symlink or
  the rest, we error out before submitting the read bio.

With this fix that crafted image can be rejected gracefully:

  BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1

Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:32 +02:00
Qu Wenruo
496245cac5 btrfs: tree-checker: Verify inode item
There is a report in kernel bugzilla about mismatch file type in dir
item and inode item.

This inspires us to check inode mode in inode item.

This patch will check the following members:

- inode key objectid
  Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.

- inode key offset
  Should be 0

- inode item generation
- inode item transid
  No newer than sb generation + 1.
  The +1 is for log tree.

- inode item mode
  No unknown bits.
  No invalid S_IF* bit.
  NOTE: S_IFMT check is not enough, need to check every know type.

- inode item nlink
  Dir should have no more link than 1.

- inode item flags

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:32 +02:00
Qu Wenruo
80e46cf22b btrfs: tree-checker: Enhance chunk checker to validate chunk profile
Btrfs-progs already have a comprehensive type checker, to ensure there
is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk
profile bits.

Do the same work for kernel.

Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:32 +02:00
Qu Wenruo
ab4ba2e133 btrfs: tree-checker: Verify dev item
[BUG]
For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then
kernel will just panic:
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
  #PF error: [normal kernel read fault]
  PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
  Oops: 0000 [#1] SMP PTI
  CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
  RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
  Call Trace:
   open_ctree+0x160d/0x2149
   btrfs_mount_root+0x5b2/0x680

[CAUSE]
If device extent verification finds a deivce with 0 total_bytes, then it
assumes it's a seed dummy, then search for seed devices.

But in this case, there is no seed device at all, causing NULL pointer.

[FIX]
Since this is caused by fuzzed image, let's go the tree-check way, just
add a new verification for device item.

Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:32 +02:00
Qu Wenruo
075cb3c78f btrfs: tree-checker: Check chunk item at tree block read time
Since we have btrfs_check_chunk_valid() in tree-checker, let's do
chunk item verification in tree-checker too.

Since the tree-checker is run at endio time, if one chunk leaf fails
chunk verification, we can still retry the other copy, making btrfs more
robust to fuzzed image as we may still get a good chunk item.

Also since we have done chunk verification in tree block read time, skip
the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading
chunk items from leaf.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:32 +02:00
Qu Wenruo
bf871c3b43 btrfs: tree-checker: Make btrfs_check_chunk_valid() return EUCLEAN instead of EIO
To follow the standard behavior of tree-checker.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:32 +02:00
Qu Wenruo
f114024376 btrfs: tree-checker: Make chunk item checker messages more readable
Old error message would be something like:
  BTRFS error (device dm-3): invalid chunk num_stipres: 0

New error message would be:
  Btrfs critical (device dm-3): corrupt superblock syschunk array: chunk_start=2097152, invalid chunk num_stripes: 0
Or
  Btrfs critical (device dm-3): corrupt leaf: root=3 block=8388608 slot=3 chunk_start=2097152, invalid chunk num_stripes: 0

And for certain error message, also output expected value.

The error message levels are changed from error to critical.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:31 +02:00
Qu Wenruo
82fc28fbed btrfs: Move btrfs_check_chunk_valid() to tree-check.[ch] and export it
By function, chunk item verification is more suitable to be done inside
tree-checker.

So move btrfs_check_chunk_valid() to tree-checker.c and export it.

And since it's now moved to tree-checker, also add a better comment for
what this function is doing.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:31 +02:00
David Sterba
90b1377daa btrfs: qgroup: remove obsolete fs_info members
The commit fcebe4562d ("Btrfs: rework qgroup accounting") reworked
qgroups and added some new structures. Another rework of qgroup
mechanics e69bcee376 ("btrfs: qgroup: Cleanup the old
ref_node-oriented mechanism.") stopped using them and left uncleaned.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:31 +02:00
David Sterba
e064d5e9f0 btrfs: get fs_info from eb in btrfs_verify_level_key
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:31 +02:00
David Sterba
5ab12d1ff8 btrfs: get fs_info from eb in btree_read_extent_buffer_pages
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:31 +02:00
David Sterba
d0d20b0f5c btrfs: get fs_info from eb in read_node_slot
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:31 +02:00
David Sterba
e902baac65 btrfs: get fs_info from eb in btrfs_leaf_free_space
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:30 +02:00
David Sterba
6a884d7d52 btrfs: get fs_info from eb in clean_tree_block
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:30 +02:00
David Sterba
ed874f0db8 btrfs: get fs_info from eb in tree_mod_log_eb_copy
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:30 +02:00
David Sterba
b0c9b3b05d btrfs: get fs_info from eb in check_tree_block_fsid
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:30 +02:00
David Sterba
bcdc428cfe btrfs: get fs_info from eb in btrfs_exclude_logged_extents
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:30 +02:00
David Sterba
8f881e8c18 btrfs: get fs_info from eb in leaf_data_end
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:30 +02:00
David Sterba
0ab0206328 btrfs: get fs_info from eb in write_one_eb
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
David Sterba
20a1fbf97e btrfs: get fs_info from eb in repair_eb_io_failure
We can read fs_info from extent buffer and can drop it from the
parameters. As all callsites are updated, add the btrfs_ prefix as the
function is exported.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
David Sterba
9df76fb544 btrfs: get fs_info from eb in lock_extent_buffer_for_io
We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
Phillip Potter
7d157c3d48 btrfs: use common file type conversion
Deduplicate the btrfs file type conversion implementation - file systems
that use the same file types as defined by POSIX do not need to define
their own versions and can use the common helper functions decared in
fs_types.h and implemented in fs_types.c

Common implementation can be found via commit:
bbe7449e25 "fs: common implementation of file type"

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
Goldwyn Rodrigues
7984ae52bb btrfs: Perform locking/unlocking in btrfs_remap_file_range()
Move code to make it more readable, so as locking and unlocking is
done in the same function. The generic checks that are now performed in
the locked section are unaffected.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
Arnd Bergmann
290342f661 btrfs: use BUG() instead of BUG_ON(1)
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:

fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
                BUG_ON(1);
                ^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
 #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
                                   ^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
 #  define unlikely(x)   (__branch_check__(x, 0, __builtin_constant_p(x)))
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
                             max_chunk_size);
                             ^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
 #define min(x, y)       __careful_cmp(x, y, <)
                                         ^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
                __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
                              ^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
                typeof(y) unique_y = (y);               \
                                      ^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
                BUG_ON(1);
                ^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
 #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
                               ^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
        u64 max_chunk_size;
                          ^
                           = 0

Change it to BUG() so clang can see that this code path can never
continue.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:28 +02:00
David Sterba
247462a5ac btrfs: move tree block wait and write helpers to tree-log
The wrapper names better describe what's happening so they're not
deleted though they're trivial, but at least moved closer to their place
of use.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:28 +02:00
David Sterba
d4eb671a08 btrfs: remove stale definition of BUFFER_LRU_MAX
Long time ago (2008), the extent buffers were organized in a LRU list
and switched to rb-tree in 6af118ce51 ("Btrfs: Index extent
buffers in an rbtree"). There was one stale macro definition left.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:28 +02:00
David Sterba
e4fa7469eb btrfs: tests: unify messages when tests start
- make the messages more visually consistent and use same format
  "running ... test", any error or other warning can be easily spotted

- move some message to the test entry function

- add message to the inode tests

Example output:

[    8.187391] Btrfs loaded, crc32c=crc32c-generic, assert=on, integrity-checker=on, ref-verify=on
[    8.189476] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
[    8.190761] BTRFS: selftest: running btrfs free space cache tests
[    8.192245] BTRFS: selftest: running extent only tests
[    8.193573] BTRFS: selftest: running bitmap only tests
[    8.194876] BTRFS: selftest: running bitmap and extent tests
[    8.196166] BTRFS: selftest: running space stealing from bitmap to extent tests
[    8.198026] BTRFS: selftest: running extent buffer operation tests
[    8.199328] BTRFS: selftest: running btrfs_split_item tests
[    8.200653] BTRFS: selftest: running extent I/O tests
[    8.201808] BTRFS: selftest: running find delalloc tests
[    8.320733] BTRFS: selftest: running extent buffer bitmap tests
[    8.340795] BTRFS: selftest: running inode tests
[    8.341766] BTRFS: selftest: running btrfs_get_extent tests
[    8.342981] BTRFS: selftest: running hole first btrfs_get_extent test
[    8.344342] BTRFS: selftest: running outstanding_extents tests
[    8.345575] BTRFS: selftest: running qgroup tests
[    8.346537] BTRFS: selftest: running qgroup add/remove tests
[    8.347725] BTRFS: selftest: running qgroup multiple refs test
[    8.354982] BTRFS: selftest: running free space tree tests
[    8.372175] BTRFS: selftest: sectorsize: 4096  nodesize: 8192
[    8.373539] BTRFS: selftest: running btrfs free space cache tests
[    8.374989] BTRFS: selftest: running extent only tests
[    8.376236] BTRFS: selftest: running bitmap only tests
[    8.377483] BTRFS: selftest: running bitmap and extent tests
[    8.378854] BTRFS: selftest: running space stealing from bitmap to extent tests
...

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:28 +02:00
David Sterba
752dbe48e2 btrfs: tests: drop messages when some tests finish
The messages like 'extent I/O tests finished' are redundant, if the test
fails it's quite obvious in the log and hang is also noticeable. No
other then extent_io and free space tree tests print that so make it
consistent.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:28 +02:00
David Sterba
3173fd926c btrfs: tests: fix comments about tested extent map ranges
Comments about ranges did not match the code, the correct calculation is
to use start and start+len as the interval boundaries.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:28 +02:00
David Sterba
43f7cddc6e btrfs: tests: use SZ_ constants everywhere
There are a few unconverted constants that are not powers of two and
haven't been converted.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:27 +02:00
David Sterba
6c30474680 btrfs: tests: use standard error message after extent map allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:27 +02:00
David Sterba
ccfada1f65 btrfs: tests: return error from all extent map test cases
The way the extent map tests handle errors does not conform to the rest
of the suite, where the first failure is reported and then it stops.
Do the same now that we have the errors returned from all the functions.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:27 +02:00
David Sterba
7c6f670052 btrfs: tests: return errors from extent map test case 4
Replace asserts with error messages and return errors.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:27 +02:00
David Sterba
992dce7494 btrfs: tests: return errors from extent map test case 3
Replace asserts with error messages and return errors.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:27 +02:00
David Sterba
e71f2e17e8 btrfs: tests: return errors from extent map test case 2
Replace asserts with error messages and return errors.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:27 +02:00
David Sterba
d7de4b0864 btrfs: tests: return errors from extent map test case 1
Replace asserts with error messages and return errors.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:26 +02:00
David Sterba
488f673023 btrfs: tests: return errors from extent map tests
The individual testcases for extent maps do not return an error on
allocation failures. This is not a big problem as the allocation don't
fail in general but there are functional tests handled with ASSERTS.
This makes tests dependent on them and it's not reliable.

This patch adds the allocation failure handling and allows for the
conversion of the asserts to proper error handling and reporting.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:26 +02:00
David Sterba
7b9586bc2b btrfs: tests: properly initialize fs_info of extent buffer
The fs_info is supposed to be valid, even though it's not used right
now and the test does not crash.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:26 +02:00
David Sterba
3199366da7 btrfs: tests: use standard error message after block group allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:26 +02:00
David Sterba
6a060db85d btrfs: tests: use standard error message after inode allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:26 +02:00
David Sterba
770e0cc040 btrfs: tests: use standard error message after path allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:26 +02:00
David Sterba
9e3d9f8462 btrfs: tests: use standard error message after extent buffer allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:25 +02:00
David Sterba
52ab7bca35 btrfs: tests: use standard error message after root allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:25 +02:00
David Sterba
37b2a7bc1e btrfs: tests: use standard error message after fs_info allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:25 +02:00
David Sterba
703de4266f btrfs: tests: add table of most common errors
Allocation of main objects like fs_info or extent buffers is in each
test so let's simplify and unify the error messages to a table and add a
convenience helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:25 +02:00
David Sterba
efd31fce54 btrfs: tests: print file:line for error messages
For better diagnostics print the file name and line to locate the
errors. Sample output:

[    9.052924] BTRFS: selftest: fs/btrfs/tests/extent-io-tests.c:283 offset bits do not match

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:25 +02:00
David Sterba
d33d105b85 btrfs: tests: don't leak fs_info in extent_io bitmap tests
The fs_info is not freed at the end of the function and leaks. The
function is called twice so there can be up to 2x sizeof(struct
btrfs_fs_info) of leaked memory.  Fortunatelly this affects only testing
builds, the size could be 16k with several debugging features enabled.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:25 +02:00
David Sterba
d46a05edac btrfs: tests: handle fs_info allocation failure in extent_io tests
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Qu Wenruo
75391f0d41 btrfs: disk-io: Show the timing of corrupted tree block explicitly
Just add one extra line to show when the corruption is detected.
Currently only read time detection is possible.

The planned distinguish line would be:

  read time:
    <detailed report>
    block=XXXXX read time tree block corruption detected

  write time:
    <detailed report>
    block=XXXXX write time tree block corruption detected

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Josef Bacik
ff612ba784 btrfs: fix panic during relocation after ENOSPC before writeback happens
We've been seeing the following sporadically throughout our fleet

panic: kernel BUG at fs/btrfs/relocation.c:4584!
netversion: 5.0-0
Backtrace:
 #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
 #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
 #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
 #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
 #4 [ffffc90003adb9c0] do_trap at ffffffff81019114
 #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
 #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
    [exception RIP: btrfs_reloc_cow_block+692]
    RIP: ffffffff8143b614  RSP: ffffc90003adbb68  RFLAGS: 00010246
    RAX: fffffffffffffff7  RBX: ffff8806b9c32000  RCX: ffff8806aad00690
    RDX: ffff880850b295e0  RSI: ffff8806b9c32000  RDI: ffff88084f205bd0
    RBP: ffff880849415000   R8: ffffc90003adbbe0   R9: ffff88085ac90000
    R10: ffff8805f7369140  R11: 0000000000000000  R12: ffff880850b295e0
    R13: ffff88084f205bd0  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
 #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
 #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c

The way relocation moves data extents is by creating a reloc inode and
preallocating extents in this inode and then copying the data into these
preallocated extents.  Once we've done this for all of our extents,
we'll write out these dirty pages, which marks the extent written, and
goes into btrfs_reloc_cow_block().  From here we get our current
reloc_control, which _should_ match the reloc_control for the current
block group we're relocating.

However if we get an ENOSPC in this path at some point we'll bail out,
never initiating writeback on this inode.  Not a huge deal, unless we
happen to be doing relocation on a different block group, and this block
group is now rc->stage == UPDATE_DATA_PTRS.  This trips the BUG_ON() in
btrfs_reloc_cow_block(), because we expect to be done modifying the data
inode.  We are in fact done modifying the metadata for the data inode
we're currently using, but not the one from the failed block group, and
thus we BUG_ON().

(This happens when writeback finishes for extents from the previous
group, when we are at btrfs_finish_ordered_io() which updates the data
reloc tree (inode item, drops/adds extent items, etc).)

Fix this by writing out the reloc data inode always, and then breaking
out of the loop after that point to keep from tripping this BUG_ON()
later.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add note from Filipe ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Nikolay Borisov
6a8d2136ca btrfs: Use less confusing condition for uptodate parameter to btrfs_writepage_endio_finish_ordered
The uptodate parameter of btrfs_writepage_endio_finish_ordered is used
to signal whether an error has occured while writing the given page.
0 signals an error, which is propagated to callees and 1 signifies
success. In end_compressed_bio_write the ->bi_status is checked and
based on it either BLK_STS_OK (0) or BLK_STS_NOTSUPP (1) are used. While
from functional point of view this is ok it's a for the poor reader of
the code, since the block layer values are conflated with the semantics
of the parameter.

Just use plain 0 or 1. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Qu Wenruo
a2a72fbd11 btrfs: extent_io: Handle errors better in extent_writepages()
We can only get <=0 from extent_write_cache_pages, add an ASSERT() for
it just in case.

Then instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.

If there is no error so far, then call flush_write_bio() and return the
result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Qu Wenruo
2e3c25136a btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()
This function needs some extra checks on locked pages and eb.  For error
handling we need to unlock locked pages and the eb.

There is a rare >0 return value branch, where all pages get locked
while write bio is not flushed.

Thankfully it's handled by the only caller, btree_write_cache_pages(),
as later write_one_eb() call will trigger submit_one_bio().  So there
shouldn't be any problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Qu Wenruo
02c6db4f73 btrfs: extent_io: Handle errors better in extent_write_locked_range()
We can only get @ret <= 0.  Add an ASSERT() for it just in case.

Then, instead of submitting the write bio even we got some error, check
the return value first.

If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.

If there is no error so far, then call flush_write_bio() and return the
result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:23 +02:00
Qu Wenruo
e06808be8a btrfs: extent_io: Kill dead condition in extent_write_cache_pages()
Since __extent_writepage() will no longer return >0 value,
(ret == AOP_WRITEPAGE_ACTIVATE) will never be true.

Kill that dead branch.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:23 +02:00
Qu Wenruo
2b952eea81 btrfs: extent_io: Handle errors better in btree_write_cache_pages()
In btree_write_cache_pages(), we can only get @ret <= 0.
Add an ASSERT() for it just in case.

Then instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.

If there is no error so far, then call flush_write_bio() and return the
result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:23 +02:00
Qu Wenruo
3065976b04 btrfs: extent_io: Handle errors better in extent_write_full_page()
Since now flush_write_bio() could return error, kill the BUG_ON() first.
Then don't call flush_write_bio() unconditionally, instead we check the
return value from __extent_writepage() first.

If __extent_writepage() fails, we do cleanup, and return error without
submitting the possible corrupted or half-baked bio.

If __extent_writepage() successes, then we call flush_write_bio() and
return the result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:23 +02:00
Qu Wenruo
f4340622e0 btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up
We have a BUG_ON() in flush_write_bio() to handle the return value of
submit_one_bio().

Move the BUG_ON() one level up to all its callers.

This patch will introduce temporary variable, @flush_ret to keep code
change minimal in this patch. That variable will be cleaned up when
enhancing the error handling later.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:23 +02:00
Qu Wenruo
63489055e4 btrfs: Always output error message when key/level verification fails
We have internal report of strange transaction abort due to EUCLEAN
without any error message.

Since error message inside verify_level_key() is only enabled for
CONFIG_BTRFS_DEBUG, the error message won't be printed on most builds.

This patch will make the error message mandatory, so when problem
happens we know what's causing the problem.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:23 +02:00
Qu Wenruo
448de471cd btrfs: Check the first key and level for cached extent buffer
[BUG]
When reading a file from a fuzzed image, kernel can panic like:

  BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
  assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3500!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
  Call Trace:
   btrfs_lookup_csum+0x52/0x150 [btrfs]
   __btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
   btrfs_submit_bio_hook+0x103/0x170 [btrfs]
   submit_one_bio+0x59/0x80 [btrfs]
   extent_read_full_page+0x58/0x80 [btrfs]
   generic_file_read_iter+0x2f6/0x9d0
   __vfs_read+0x14d/0x1a0
   vfs_read+0x8d/0x140
   ksys_read+0x52/0xc0
   do_syscall_64+0x60/0x210
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
The fuzzed image has a corrupted leaf whose first key doesn't match its
parent:

  checksum tree key (CSUM_TREE ROOT_ITEM 0)
  node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
  fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
  chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
  	...
          key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19

  leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
  leaf 29761536 flags 0x1(WRITTEN) backref revision 1
  fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
  chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
          item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
                  range start 8798638964736 end 8798641262592 length 2297856

When reading the above tree block, we have extent_buffer->refs = 2 in
the context:

- initial one from __alloc_extent_buffer()
  alloc_extent_buffer()
  |- __alloc_extent_buffer()
     |- atomic_set(&eb->refs, 1)

- one being added to fs_info->buffer_radix
  alloc_extent_buffer()
  |- check_buffer_tree_ref()
     |- atomic_inc(&eb->refs)

So if even we call free_extent_buffer() in read_tree_block or other
similar situation, we only decrease the refs by 1, it doesn't reach 0
and won't be freed right now.

The staled eb and its corrupted content will still be kept cached.

Furthermore, we have several extra cases where we either don't do first
key check or the check is not proper for all callers:

- scrub
  We just don't have first key in this context.

- shared tree block
  One tree block can be shared by several snapshot/subvolume trees.
  In that case, the first key check for one subvolume doesn't apply to
  another.

So for the above reasons, a corrupted extent buffer can sneak into the
buffer cache.

[FIX]
Call verify_level_key in read_block_for_search to do another
verification. For that purpose the function is exported.

Due to above reasons, although we can free corrupted extent buffer from
cache, we still need the check in read_block_for_search(), for scrub and
shared tree blocks.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:22 +02:00
Nikolay Borisov
537f38f019 btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
If a an eb fails to be read for whatever reason - it's corrupted on disk
and parent transid/key validations fail or IO for eb pages fail then
this buffer must be removed from the buffer cache. Currently the code
calls free_extent_buffer if an error occurs. Unfortunately this doesn't
achieve the desired behavior since btrfs_find_create_tree_block returns
with eb->refs == 2.

On the other hand free_extent_buffer will only decrement the refs once
leaving it added to the buffer cache radix tree.  This enables later
code to look up the buffer from the cache and utilize it potentially
leading to a crash.

The correct way to free the buffer is call free_extent_buffer_stale.
This function will correctly call atomic_dec explicitly for the buffer
and subsequently call release_extent_buffer which will decrement the
final reference thus correctly remove the invalid buffer from buffer
cache. This change affects only newly allocated buffers since they have
eb->refs == 2.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Reported-by: Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:22 +02:00
Qu Wenruo
80fbc341dc btrfs: Make btrfs_(set|clear)_header_flag return void
From the introduction of btrfs_(set|clear)_header_flag, there is no
usage of its return value.  So just make it return void.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:22 +02:00
Qu Wenruo
10995c0491 btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan
Commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots()") expands the life span of root->reloc_root.

This breaks certain checs of fs_info->reloc_ctl.  Before that commit, if
we have a root with valid reloc_root, then it's ensured to have
fs_info->reloc_ctl.

But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl,
such check is unreliable and can cause the following NULL pointer
dereference:

  BUG: unable to handle kernel NULL pointer dereference at 00000000000005c1
  IP: btrfs_reloc_pre_snapshot+0x20/0x50 [btrfs]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 0 PID: 10379 Comm: snapperd Not tainted
  Call Trace:
   create_pending_snapshot+0xd7/0xfc0 [btrfs]
   create_pending_snapshots+0x8e/0xb0 [btrfs]
   btrfs_commit_transaction+0x2ac/0x8f0 [btrfs]
   btrfs_mksubvol+0x561/0x570 [btrfs]
   btrfs_ioctl_snap_create_transid+0x189/0x190 [btrfs]
   btrfs_ioctl_snap_create_v2+0x102/0x150 [btrfs]
   btrfs_ioctl+0x5c9/0x1e60 [btrfs]
   do_vfs_ioctl+0x90/0x5f0
   SyS_ioctl+0x74/0x80
   do_syscall_64+0x7b/0x150
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  RIP: 0033:0x7fd7cdab8467

Fix it by explicitly checking fs_info->reloc_ctl other than using the
implied root->reloc_root.

Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:22 +02:00
Nikolay Borisov
d51f51bb6f btrfs: Remove unused -EIO assignment in end_bio_extent_readpage
In case we hit the error case for a metadata buffer in
end_bio_extent_readpage then 'ret' won't really be checked before it's
written again to. This means the -EIO in this case will never be
checked, just remove it.

Fixes-coverity-id: 1442513
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:22 +02:00
Nikolay Borisov
e65ef21ed8 btrfs: Exploit the fact that pages passed to extent_readpages are always contiguous
Currently extent_readpages (called from btrfs_readpages) will always
call __extent_readpages which tries to create contiguous range of pages
and call __do_contiguous_readpages when such contiguous range is
created.

It turns out this is unnecessary due to the fact that generic MM code
always calls filesystem's ->readpages callback (btrfs_readpages in
this case) with already contiguous pages. Armed with this knowledge it's
possible to simplify extent_readpages by eliminating the call to
__extent_readpages and directly calling contiguous_readpages.

The only edge case that needs to be handled is when
add_to_page_cache_lru fails. This is easy as all that is needed is to
submit whatever is the number of pages successfully added to the lru.
This can happen when the page is already in the range, so it does not
need to be read again, and we can't do anything else in case of other
errors.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:22 +02:00
David Sterba
ed1b4ed79d btrfs: switch extent_buffer::lock_nested to bool
The member is tracking simple status of the lock, we can use bool for
that and make some room for further space reduction in the structure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
c79adfc085 btrfs: use assertion helpers for extent buffer write lock counters
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::write_locks become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
e3f1538867 btrfs: add assertion helpers for extent buffer write lock counters
The write_locks are a simple counter to track locking balance and used
to assert tree locks.  Add helpers to make it conditionally work only in
DEBUG builds.  Will be used in followup patches.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
5c9c799ab7 btrfs: use assertion helpers for extent buffer read lock counters
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::read_locks become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
58a2ddaedb btrfs: add assertion helpers for extent buffer read lock counters
The read_locks are a simple counter to track locking balance and used to
assert tree locks.  Add helpers to make it conditionally work only in
DEBUG builds.  Will be used in followup patches.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
afd495a826 btrfs: use assertion helpers for spinning readers
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_readers become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
225948dedc btrfs: add assertion helpers for spinning readers
Add helpers for conditional DEBUG build to assert that the extent buffer
spinning_readers constraints are met. Will be used in followup patches.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
David Sterba
843ccf9f46 btrfs: use assertion helpers for spinning writers
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_writers become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
David Sterba
e4e9fd0f32 btrfs: add assertion helpers for spinning writers
Add helpers for conditional DEBUG build to assert that the extent buffer
spinning_writers constraints are met. Will be used in followup patches.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
Nikolay Borisov
8882679ea5 btrfs: Remove EXTENT_IOBITS
This flag just became synonymous to EXTENT_LOCKED, so just remove it and
used EXTENT_LOCKED directly. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
Nikolay Borisov
4e586ca3c3 btrfs: Remove EXTENT_WRITEBACK
This flag was introduced in a52d9a8033 ("Btrfs: Extent based page
cache code.") and subsequently it's usage effectively was removed by
1edbb734b4 ("Btrfs: reduce CPU usage in the extent_state tree") and
f2a97a9dbd ("btrfs: remove all unused functions"). Just remove it,
no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
Nathan Chancellor
e8baf7abcf btrfs: Turn an 'else if' into an 'else' in btrfs_uuid_tree_add
When building with -Wsometimes-uninitialized, Clang warns:

fs/btrfs/uuid-tree.c:129:13: warning: variable 'eb' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
fs/btrfs/uuid-tree.c:129:13: warning: variable 'offset' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]

Clang can't tell that all cases are covered with this final else if.
Just turn it into an else so that it is clear.

Link: https://github.com/ClangBuiltLinux/linux/issues/385
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Anand Jain
262c96a3c3 btrfs: refactor btrfs_set_prop and add btrfs_set_prop_trans
btrfs_set_prop() takes transaction pointer as the first argument,
however in ioctl.c for the purpose of setting the compression property,
we call btrfs_set_prop() with NULL transaction pointer. Down in
the call chain  btrfs_setxattr() starts transaction to update the
attribute and also to update the inode.

So for clarity, create btrfs_set_prop_trans() with no transaction
pointer as argument, in preparation to start transaction here instead of
doing it down the call chain at btrfs_setxattr().

Also now the btrfs_set_prop() is a static function.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Anand Jain
419a6f30fd btrfs: rename fs_info argument to fs_private
fs_info is commonly used to represent struct fs_info *, rename
to fs_private to avoid confusion.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Anand Jain
3dcf96c7b9 btrfs: drop redundant forward declaration in props.c
Drop forward declaration of the functions:

- prop_compression_validate
- prop_compression_apply
- prop_compression_extract

No functional changes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Anand Jain
7715da84f7 btrfs: merge _btrfs_set_prop helpers
btrfs_set_prop() is a redirect to __btrfs_set_prop() with the
transaction handle equal to NULL.  __btrfs_set_prop() in turn passes
this to do_setxattr() which then transaction is actually created.

Instead merge  __btrfs_set_prop() to btrfs_set_prop(), and update the
caller with NULL argument.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Johannes Thumshirn
443c8e2a83 btrfs: reduce kmap_atomic time for checksumming
Since commit c40a3d38af ("Btrfs: Compute and look up csums based on
sectorsized blocks") we do a kmap_atomic() on the contents of a bvec.
The code before c40a3d38af had the kmap region just around the
checksumming too.

kmap_atomic() in turn does a preempt_disable() and pagefault_disable(),
so we shouldn't map the data for too long. Reduce the time the bvec's
page is mapped to when we actually need it.

Performance wise it doesn't seem to make a huge difference with a 2 vcpu VM
on a /dev/zram device:

       vanilla      patched      delta
write  17.4MiB/s    17.8MiB/s	+0.4MiB/s (+2%)
read   40.6MiB/s    41.5MiB/s   +0.9MiB/s (+2%)

The following fio job profile was used in the comparision:

[global]
ioengine=libaio
direct=1
sync=1
norandommap
time_based
runtime=10m
size=100m
group_reporting
numjobs=2

[test]
filename=/mnt/test/fio
rw=randrw
rwmixread=70

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Qu Wenruo
a1d198478e btrfs: tracepoints: Add trace events for extent_io_tree
Although btrfs heavily relies on extent_io_tree, we don't really have
any good trace events for them.

This patch will add the folowing trace events:
- trace_btrfs_set_extent_bit()
- trace_btrfs_clear_extent_bit()
- trace_btrfs_convert_extent_bit()

Since selftests could create temporary extent_io_tree without fs_info,
modify TP_fast_assign_fsid() to accept NULL as fs_info.  NULL fs_info
will lead to all zero fsid.

The output would be:
  btrfs_set_extent_bit: <FDID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=4096 set_bits=LOCKED
  btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22040576 len=4096 set_bits=LOCKED
  btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22044672 len=4096 set_bits=LOCKED
  btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22048768 len=4096 set_bits=LOCKED
  btrfs_clear_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=16384 clear_bits=LOCKED
  ^^^ Extent buffer 22036480 read from disk, the locking progress

  btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30425088 len=16384 set_bits=DIRTY
  btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30441472 len=16384 set_bits=DIRTY
  ^^^ 2 new tree blocks allocated in one transaction

  btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30523392 len=16384 set_bits=DIRTY
  btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30556160 len=16384 set_bits=DIRTY
  ^^^ 2 old tree blocks get pinned down

There is one point which need attention:
1) Those trace events can be pretty heavy:
   The following workload would generate over 400 trace events.

	mkfs.btrfs -f $dev
	start_trace
	mount $dev $mnt -o enospc_debug
	sync
	touch $mnt/file1
	touch $mnt/file2
	touch $mnt/file3
	xfs_io -f -c "pwrite 0 16k" $mnt/file4
	umount $mnt
	end_trace

   It's not recommended to use them in real world environment.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename enums ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Qu Wenruo
43eb5f2975 btrfs: Introduce extent_io_tree::owner to distinguish different io_trees
Btrfs has the following different extent_io_trees used:

- fs_info::free_extents[2]
- btrfs_inode::io_tree - for both normal inodes and the btree inode
- btrfs_inode::io_failure_tree
- btrfs_transaction::dirty_pages
- btrfs_root::dirty_log_pages

If we want to trace changes in those trees, it will be pretty hard to
distinguish them.

Instead of using hard-to-read pointer address, this patch will introduce
a new member extent_io_tree::owner to track the owner.

This modification needs all the callers of extent_io_tree_init() to
accept a new parameter @owner.

This patch provides the basis for later trace events.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
David Sterba
7b4397386f btrfs: switch extent_io_tree::track_uptodate to bool
This patch is split from the following one "btrfs: Introduce
extent_io_tree::owner to distinguish different io_trees" from Qu, so the
different changes are not mixed together.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
Qu Wenruo
c258d6e364 btrfs: Introduce fs_info to extent_io_tree
This patch will add a new member fs_info to extent_io_tree.

This provides the basis for later trace events to distinguish the output
between different btrfs filesystems. While this increases the size of
the structure, we want to know the source of the trace events and
passing the fs_info as an argument to all contexts is not possible.

The selftests are now allowed to set it to NULL as they don't use the
tracepoints.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
Filipe Manana
3b1da515c6 Btrfs: remove no longer used 'sync' member from transaction handle
Commit db2462a6ad ("btrfs: don't run delayed refs in the end transaction
logic") removed its last use, so now it does absolutely nothing, therefore
remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
Dennis Zhou
b2423496a6 btrfs: zstd: remove indirect calls for local functions
While calling functions inside zstd, we don't need to use the
indirection provided by the workspace_manager. Forward declarations are
added to maintain the function order of btrfs_compress_op.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
David Sterba
6c3abeda77 btrfs: scrub: return EAGAIN when fs is closing
The error code used here is wrong as it's not invalid to try to start
scrub when umount has begun.  Returning EAGAIN is more user friendly as
it's recoverable.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:17 +02:00
Goldwyn Rodrigues
8de60fe942 btrfs: Initialize inode::i_mapping once in btrfs_symlink
inode->i_op is initialized multiple times. Perform it once. This was
left by 4779cc0424 ("Btrfs: get rid of btrfs_symlink_aops").

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:17 +02:00
Qu Wenruo
7ac1e464c4 btrfs: Don't panic when we can't find a root key
When we failed to find a root key in btrfs_update_root(), we just panic.

That's definitely not cool, fix it by outputting an unique error
message, aborting current transaction and return -EUCLEAN. This should
not normally happen as the root has been used by the callers in some
way.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:17 +02:00
Johannes Thumshirn
c53839fc32 btrfs: warn if extent buffer mapping crosses a page boundary in csum_tree_block
Since commit d2e174d5d3 ("btrfs: document extent mapping assumptions in
checksum") we have a comment in place why map_private_extent_buffer()
can't return 1 in the csum_tree_block() case.

Make this a bit more explicit and WARN_ON() in case this this assumption
breaks.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:17 +02:00
Johannes Thumshirn
2996e1f8bc btrfs: factor our read/write stage off csum_tree_block into its callers
Currently csum_tree_block() does two things, first it as it's name
suggests it calculates the checksum for a tree-block. But it also writes
this checksum to disk or reads an extent_buffer from disk and compares the
checksum with the calculated checksum, depending on the verify argument.

Furthermore one of the two callers passes in '1' for the verify argument,
the other one passes in '0'.

For clarity and less layering violations, factor out the second stage in
csum_tree_block()'s callers.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:17 +02:00
Paul Moore
35a196bef4 proc: prevent changes to overridden credentials
Prevent userspace from changing the the /proc/PID/attr values if the
task's credentials are currently overriden.  This not only makes sense
conceptually, it also prevents some really bizarre error cases caused
when trying to commit credentials to a task with overridden
credentials.

Cc: <stable@vger.kernel.org>
Reported-by: "chengjian (D)" <cj.chengjian@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: James Morris <james.morris@microsoft.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
2019-04-29 09:51:21 -04:00
Thomas Gleixner
6924f5feba btrfs: ref-verify: Simplify stack trace retrieval
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.338890064@linutronix.de
2019-04-29 12:37:51 +02:00
Thomas Gleixner
e988e5ec18 proc: Simplify task stack retrieval
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.589304463@linutronix.de
2019-04-29 12:37:48 +02:00
Alexander Lochmann
f69e749a49 Abort file_remove_privs() for non-reg. files
file_remove_privs() might be called for non-regular files, e.g.
blkdev inode. There is no reason to do its job on things
like blkdev inodes, pipes, or cdevs. Hence, abort if
file does not refer to a regular inode.

AV: more to the point, for devices there might be any number of
inodes refering to given device.  Which one to strip the permissions
from, even if that made any sense in the first place?  All of them
will be observed with contents modified, after all.

Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf
Spinczyk)

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alexander Lochmann <alexander.lochmann@tu-dortmund.de>
Signed-off-by: Horst Schirmeier <horst.schirmeier@tu-dortmund.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-28 21:46:57 -04:00
Al Viro
ee948837d7 [fix] get rid of checking for absent device name in vfs_get_tree()
It has no business being there, it's checked by relevant ->get_tree()
as it is *and* it returns the wrong error for no reason whatsoever.

Fixes: f3a09c9201 "introduce fs_context methods"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-28 21:34:21 -04:00
Jan Kara
b1da6a5187 fsnotify: Fix NULL ptr deref in fanotify_get_fsid()
fanotify_get_fsid() is reading mark->connector->fsid under srcu. It can
happen that it sees mark not fully initialized or mark that is already
detached from the object list. In these cases mark->connector
can be NULL leading to NULL ptr dereference. Fix the problem by
being careful when reading mark->connector and check it for being NULL.
Also use WRITE_ONCE when writing the mark just to prevent compiler from
doing something stupid.

Reported-by: syzbot+15927486a4f1bfcbaf91@syzkaller.appspotmail.com
Fixes: 77115225ac ("fanotify: cache fsid in fsnotify_mark_connector")
Signed-off-by: Jan Kara <jack@suse.cz>
2019-04-28 22:14:50 +02:00
Linus Torvalds
975a0f400f for-linus-20190428
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzFsCIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppR7EACv4YzzybTBSa4eluFJA/Ll7HwUALsYj0Sp
 8V5djQb948WXrAnX0uWmE9Hoz3NVbG40bwLASzt+qXEFd/pTYIsSm7yOLd84DEEi
 iV756z7iYjCxotvSbXC8EOmK4AxV/mFPWvqq/o45iDEZgfP3OnEHdQqInvTN/eUn
 0SSWrlAsKMrmi+KrZk5twBgfi4mW5dJJ1DZKjvym4b1Ek/pytpjK6AOCwlveAQpL
 EshhvApoie9Hwfih3Ukeyl4HhAbDU10ZgmM6H6GIkwwrZrQDUhbZpLfzXghyz5kM
 dnWqqjpzp8QeAGUIe02E5ITfJqyyJ/rCKQfX5yA9lzZc6sHfgToTlIDUdyAjxKs/
 kelu/2lmsfA5x7r/l7dH0Fh3p91r4r7UnN1eSSkZcZStGfw52t51sxpdUPePZl9I
 z3v1jUSAU2USg2hxV/jMgGGB9yMWItFutXri3TZOdRM586PaGATneO+a2toMLelL
 aHoL7n91b6olpNJE3p4IkV2z09Lk1rQ8BfsYj9h5BQiHg5ONds5e3zcjIbUSoQqc
 jhLC+PBDmOozDfR3/haPoBOUqH6hD9z3m9lphIqLIgq3pIdrJX/UeQH2sSGxiTlT
 w/rFOV90Aa7klzwRbaA1Uvt4Q2g3mutcRb7Zs+vnmaKJD/xTOF3IL0G5kRJEfvmu
 sJS4bgmf0g==
 =p7B3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190428' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A set of io_uring fixes that should go into this release. In
  particular, this contains:

   - The mutex lock vs ctx ref count fix (me)

   - Removal of a dead variable (me)

   - Two race fixes (Stefan)

   - Ring head/tail condition fix for poll full SQ detection (Stefan)"

* tag 'for-linus-20190428' of git://git.kernel.dk/linux-block:
  io_uring: remove 'state' argument from io_{read,write} path
  io_uring: fix poll full SQ detection
  io_uring: fix race condition when sq threads goes sleeping
  io_uring: fix race condition reading SQ entries
  io_uring: fail io_uring_register(2) on a dying io_uring instance
2019-04-28 10:06:32 -07:00
Christoph Hellwig
73ce6abae5 iomap: convert to SPDX identifier
Use SPDX-License-Identifier instead of GPLv2 boilerplate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-28 08:34:02 -07:00
Linus Torvalds
ce944935ee Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "9 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  fs/proc/proc_sysctl.c: Fix a NULL pointer dereference
  mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flag
  mm/page_alloc.c: avoid potential NULL pointer dereference
  mm, page_alloc: always use a captured page regardless of compaction result
  mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model
  lib/test_vmalloc.c: do not create cpumask_t variable on stack
  lib/Kconfig.debug: fix build error without CONFIG_BLOCK
  zram: pass down the bvec we need to read into in the work struct
  mm/memory_hotplug.c: drop memory device reference after find_memory_block()
2019-04-26 18:15:33 -07:00
Brian Foster
1749d1ea89 xfs: add missing error check in xfs_prepare_shift()
xfs_prepare_shift() fails to check the error return from
xfs_flush_unmap_range(). If the latter fails, that could lead to an
insert/collapse range operation over a delalloc range, which is not
supported.

Add an error check and return appropriately. This is reproduced
rarely by generic/475.

Fixes: 7f9f71be84 ("xfs: extent shifting doesn't fully invalidate page cache")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:56 -07:00
Darrick J. Wong
47cd97b5b2 xfs: scrub should check incore counters against ondisk headers
In theory, the incore per-AG structure counters should match the ones on
disk, so check that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:56 -07:00
Darrick J. Wong
9a1f3049f4 xfs: allow scrubbers to pause background reclaim
The forthcoming summary counter patch races with regular filesystem
activity to compute rough expected values for the counters.  This design
was chosen to avoid having to freeze the entire filesystem to check the
counters, but while that's running we'd prefer to minimize background
reclamation activity to reduce the perturbations to the incore free
block count.  Therefore, provide a way for scrubbers to disable
background posteof and cowblock reclamation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:56 -07:00
Darrick J. Wong
ed30dcbd90 xfs: rename the speculative block allocation reclaim toggle functions
"reclaim" is used throughout the icache code to mean reclamation of
incore inode structures.  It's also used for two helper functions that
toggle background deletion of speculative preallocations.  Separate
the second of the two uses to make things less confusing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:55 -07:00
Darrick J. Wong
9fe82b8c42 xfs: track delayed allocation reservations across the filesystem
Add a percpu counter to track the number of blocks directly reserved for
delayed allocations on the data device.  This counter (in contrast to
i_delayed_blks) does not track allocated CoW staging extents or anything
going on with the realtime device.  It will be used in the upcoming
summary counter scrub function to check the free block counts without
having to freeze the filesystem or walk all the inodes to find the
delayed allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:55 -07:00
Darrick J. Wong
f60be90fc9 xfs: fix broken bhold behavior in xrep_roll_ag_trans
In xrep_roll_ag_trans, the transaction roll will always set sc->tp to
the new transaction, even if committing the old one fails.  A bare
transaction roll leaves the buffer(s) locked but not joined to the new
transaction, so it's not necessary to release the hold if the roll
fails.  Remove the incorrect xfs_trans_bhold_release calls.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-26 12:28:55 -07:00
Linus Torvalds
e9e1a2e7b4 There tracing fixes:
- Use "nosteal" for ring buffer splice pages
  - Memory leak fix in error path of trace_pid_write()
  - Fix preempt_enable_no_resched() (use preempt_enable()) in ring buffer code
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXMMoghQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmB1AQDfpVxYxcmxibBBAM6fZyILYpKqDWmy
 ut6gHZ+GHhQT4AEAwSRsC6V4yO3d5dJFpkcQXUj1v+Ip9XU+dv//s8O6tAI=
 =LsG/
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Three tracing fixes:

   - Use "nosteal" for ring buffer splice pages

   - Memory leak fix in error path of trace_pid_write()

   - Fix preempt_enable_no_resched() (use preempt_enable()) in ring
     buffer code"

* tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  trace: Fix preempt_enable_no_resched() abuse
  tracing: Fix a memory leak by early error exit in trace_pid_write()
  tracing: Fix buffer_ref pipe ops
2019-04-26 11:09:55 -07:00
Linus Torvalds
d0473f978e for-5.1-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzC5YIACgkQxWXV+ddt
 WDvYuxAAprsX3QjNDurMnm32TKTzY1a4yjYNQZXxNUKYvDAt4sglUMIfY3kP4uVy
 te9GtbHf0XDFNX58LBdBo8tnA8H/TbUr2qOlmh1hRqF89VVUCHpWBkhtOpWtYNVP
 jyL+tModOQjzhjV3RD2m4qHL0Q3oJpoiC7o+kuLGP46NIzfHt/tD1iDIW0j3QJRL
 f1rviXFiheNsXoeuDv/Shj6jMy6tGFa9Ys6tVmWcOxHBHVTBu+GsJaG86p4X39Sj
 ffOUPF3Btug8Q5ALppX+tbWocScITJs//mJJq4FjdAt8Qn5gnJM3h87GEPj+RZJf
 EwDMyd9uOwk7/HKMTtasJ6LDMsycriZ4r4cPOh/bqKz+dAsMA2V+FsV+MDSzhJlw
 w0MLZc5ELzKY1jV2jm4KR1PClj+tQ4n8Jl2P/b87TP1rsJcjpDpOgUwXDBgJCXQg
 LqZ/quQgfg3Zpcp+lPlsVN7dgwAwNYGlcKmMDXOnzZYRT2nNS6c4yx3EpSOXf6BI
 t557BdfP/Kz23hLNuawdO33XOTxhutVd4gyghmz2VOwz8XbYKw9MoJgmODWzenM7
 QqbmQvoKx82hHFLc1WQDyZxk9mhmTetQTdE4rFb291oxFBn6cGglx2omHylfJ8LU
 P27vH68QgihkWs/WXrkktPzhwZVlOTlf+cCZ+h5PEw0z8k9kqWY=
 =efvr
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One patch to fix a crash in io submission path, due to memory
  allocation errors.

  In short, the multipage bio work that landed in 5.1 caused larger bios
  that in turn require larger temporary memory for checksums. The patch
  is a workaround, we're going to rework the allocation so it does not
  require the vmalloc fallback.

  It took a while to identify that it's caused by patches in 5.1 and not
  a patchset that did some changes in error handling in the code. I've
  tested it on various memory/cpu combinations, it could hit OOM but
  does not crash.

  The timestamp of the patch is less than a day due to updates in the
  changelog, tests were running meanwhile"

* tag 'for-5.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Switch memory allocations in async csum calculation path to kvmalloc
2019-04-26 09:46:46 -07:00
Linus Torvalds
58130235bf three small SMB3 fixes: 2 leaks and a rename bug
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlzCLfsACgkQiiy9cAdy
 T1G8KQwAjscNkN7r4i1aA4R9XU1+2qvUkykxjqN4/WTk2HCmjeJm5Y3RpNa6lqo1
 ik6+vk/nE7a4s2L3+RB40F0UzbiRC7b8A2p0Mxq+Qv2oWrGvhnZ/QhFCXmNeRNE8
 2qwr7xVsloNh7/JY4r/4WXTXtBzGke2voOSc5XILrRrdHYfoHYG+ytWc1C6DAwbh
 hqrVaMnN9LBNf7UOKHHSeykE/OOg6J2MtGartB7ujHdPXwlWrlifVfcJcvzXzEOQ
 O76rSV3pojQF0S5lHMIxbOoqqbw5WrzK+qF+/Vi7Y7UuVgCIeuPwya3xAp63m0/z
 TZHsyNX+Y2xVUSfBbtz5vdDwteh4ZG0lx/CbEiK6S5m/5RgzEbUbAdFhk5UOFyQs
 3o854S3u8uUrerRRFOREHmoGJl3NjVSOycFJNTuTuIDXdIMnZw9lciGpQ7STp9uy
 DB36VYIXcNsq18Stow+5ctO9tMgWI4UUt8Lk+/NvpsteY460rBUOOVcmXyyQsPeH
 PBubWS/b
 =GJRQ
 -----END PGP SIGNATURE-----

Merge tag '5.1-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small SMB3 fixes (all for stable as well): two leaks and a
  rename bug"

* tag '5.1-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix page reference leak with readv/writev
  cifs: do not attempt cifs operation on smb2+ rename error
  cifs: fix memory leak in SMB2_read
2019-04-26 09:45:39 -07:00
YueHaibing
89189557b4 fs/proc/proc_sysctl.c: Fix a NULL pointer dereference
Syzkaller report this:

  sysctl could not get directory: /net//bridge -12
  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP KASAN PTI
  CPU: 1 PID: 7027 Comm: syz-executor.0 Tainted: G         C        5.1.0-rc3+ #8
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  RIP: 0010:__write_once_size include/linux/compiler.h:220 [inline]
  RIP: 0010:__rb_change_child include/linux/rbtree_augmented.h:144 [inline]
  RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:186 [inline]
  RIP: 0010:rb_erase+0x5f4/0x19f0 lib/rbtree.c:459
  Code: 00 0f 85 60 13 00 00 48 89 1a 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 75 0c 00 00 4d 85 ed 4c 89 2e 74 ce 4c 89 ea 48
  RSP: 0018:ffff8881bb507778 EFLAGS: 00010206
  RAX: dffffc0000000000 RBX: ffff8881f224b5b8 RCX: ffffffff818f3f6a
  RDX: 000000000000000a RSI: 0000000000000050 RDI: ffff8881f224b568
  RBP: 0000000000000000 R08: ffffed10376a0ef4 R09: ffffed10376a0ef4
  R10: 0000000000000001 R11: ffffed10376a0ef4 R12: ffff8881f224b558
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  FS:  00007f3e7ce13700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fd60fbe9398 CR3: 00000001cb55c001 CR4: 00000000007606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   erase_entry fs/proc/proc_sysctl.c:178 [inline]
   erase_header+0xe3/0x160 fs/proc/proc_sysctl.c:207
   start_unregistering fs/proc/proc_sysctl.c:331 [inline]
   drop_sysctl_table+0x558/0x880 fs/proc/proc_sysctl.c:1631
   get_subdir fs/proc/proc_sysctl.c:1022 [inline]
   __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
   br_netfilter_init+0x68/0x1000 [br_netfilter]
   do_one_initcall+0xbc/0x47d init/main.c:901
   do_init_module+0x1b5/0x547 kernel/module.c:3456
   load_module+0x6405/0x8c10 kernel/module.c:3804
   __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
   do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  Modules linked in: br_netfilter(+) backlight comedi(C) hid_sensor_hub max3100 ti_ads8688 udc_core fddi snd_mona leds_gpio rc_streamzap mtd pata_netcell nf_log_common rc_winfast udp_tunnel snd_usbmidi_lib snd_usb_toneport snd_usb_line6 snd_rawmidi snd_seq_device snd_hwdep videobuf2_v4l2 videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops rc_gadmei_rm008z 8250_of smm665 hid_tmff hid_saitek hwmon_vid rc_ati_tv_wonder_hd_600 rc_core pata_pdc202xx_old dn_rtmsg as3722 ad714x_i2c ad714x snd_soc_cs4265 hid_kensington panel_ilitek_ili9322 drm drm_panel_orientation_quirks ipack cdc_phonet usbcore phonet hid_jabra hid extcon_arizona can_dev industrialio_triggered_buffer kfifo_buf industrialio adm1031 i2c_mux_ltc4306 i2c_mux ipmi_msghandler mlxsw_core snd_soc_cs35l34 snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer ac97_bus snd_compress snd soundcore gpio_da9055 uio ecdh_generic mdio_thunder of_mdio fixed_phy libphy mdio_cavium iptable_security iptable_raw iptable_mangle
   iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ide_pci_generic piix aes_x86_64 crypto_simd cryptd ide_core glue_helper input_leds psmouse intel_agp intel_gtt serio_raw ata_generic i2c_piix4 agpgart pata_acpi parport_pc parport floppy rtc_cmos sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: br_netfilter]
  Dumping ftrace buffer:
     (ftrace buffer empty)
  ---[ end trace 68741688d5fbfe85 ]---

commit 23da958803 ("fs/proc/proc_sysctl.c: fix NULL pointer
dereference in put_links") forgot to handle start_unregistering() case,
while header->parent is NULL, it calls erase_header() and as seen in the
above syzkaller call trace, accessing &header->parent->root will trigger
a NULL pointer dereference.

As that commit explained, there is also no need to call
start_unregistering() if header->parent is NULL.

Link: http://lkml.kernel.org/r/20190409153622.28112-1-yuehaibing@huawei.com
Fixes: 23da958803 ("fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links")
Fixes: 0e47c99d7f ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-26 09:18:05 -07:00
Jann Horn
b987222654 tracing: Fix buffer_ref pipe ops
This fixes multiple issues in buffer_pipe_buf_ops:

 - The ->steal() handler must not return zero unless the pipe buffer has
   the only reference to the page. But generic_pipe_buf_steal() assumes
   that every reference to the pipe is tracked by the page's refcount,
   which isn't true for these buffers - buffer_pipe_buf_get(), which
   duplicates a buffer, doesn't touch the page's refcount.
   Fix it by using generic_pipe_buf_nosteal(), which refuses every
   attempted theft. It should be easy to actually support ->steal, but the
   only current users of pipe_buf_steal() are the virtio console and FUSE,
   and they also only use it as an optimization. So it's probably not worth
   the effort.
 - The ->get() and ->release() handlers can be invoked concurrently on pipe
   buffers backed by the same struct buffer_ref. Make them safe against
   concurrency by using refcount_t.
 - The pointers stored in ->private were only zeroed out when the last
   reference to the buffer_ref was dropped. As far as I know, this
   shouldn't be necessary anyway, but if we do it, let's always do it.

Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26 11:44:39 -04:00
Andrea Parri
998267900c kernfs: fix barrier usage in __kernfs_new_node()
smp_mb__before_atomic() can not be applied to atomic_set().  Remove the
barrier and rely on RELEASE synchronization.

Fixes: ba16b2846a ("kernfs: add an API to get kernfs node from inode number")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 21:41:35 +02:00
Linus Torvalds
8113a85f87 dentry name handling fixes from Jeff and a memory leak fix from Zheng.
Both are old issues, marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlzB8RoTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzix4BCACfaFPdToRlxFa0oYP8yeoznkVNjldX
 yr00KWc1xNZf0G92REk3zZP1+OfL0GIUbbBwe82uT0QLKcHG7l9ioM5qPwZo5uLv
 DiV6iEN7jf84H0YmzB2iVJAZp6T85veFjcXOJ2sq0AtoqmUpdcDqm9Ov40oCvxUs
 aKHsxUQnrOkwF5mtQmOG0CZsl9+pq0c/OpCnc2i6f78KgpWKMzH0gsL4mTM7GLzL
 LxtbvHws3rxG5HtnWcJgPnv1D0bdcEGgckXWTK9NGYZmWYTuZRwROSrx8lNhkant
 +1bQPrpzyB6W8rN0wSrWlPkmp71Qqo43Eop13r5V34bU04zSr2bI+PUh
 =8A02
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.1-rc7' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "dentry name handling fixes from Jeff and a memory leak fix from Zheng.

  Both are old issues, marked for stable"

* tag 'ceph-for-5.1-rc7' of git://github.com/ceph/ceph-client:
  ceph: fix ci->i_head_snapc leak
  ceph: handle the case where a dentry has been renamed on outstanding req
  ceph: ensure d_name stability in ceph_dentry_hash()
  ceph: only use d_name directly when parent is locked
2019-04-25 10:48:50 -07:00
Nikolay Borisov
a3d46aea46 btrfs: Switch memory allocations in async csum calculation path to kvmalloc
Recent multi-page biovec rework allowed creation of bios that can span
large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs'
submission path currently allocates a contiguous array to store the
checksums for every bio submitted. This means we can request up to
(128mb / BTRFS_SECTOR_SIZE) * 4 bytes + 32bytes of memory from kmalloc.
On busy systems with possibly fragmented memory said kmalloc can fail
which will trigger BUG_ON due to improper error handling IO submission
context in btrfs.

Until error handling is improved or bios in btrfs limited to a more
manageable size (e.g. 1m) let's use kvmalloc to fallback to vmalloc for
such large allocations. There is no hard requirement that the memory
allocated for checksums during IO submission has to be contiguous, but
this is a simple fix that does not require several non-contiguous
allocations.

For small writes this is unlikely to have any visible effect since
kmalloc will still satisfy allocation requests as usual. For larger
requests the code will just fallback to vmalloc.

We've performed evaluation on several workload types and there was no
significant difference kmalloc vs kvmalloc.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-25 14:17:38 +02:00
Ronald Tschalär
9abb24990a debugfs: update documented return values of debugfs helpers
Since commit ff9fb72bc0 ("debugfs: return error values, not NULL")
these helper functions do not return NULL anymore (with the exception
of debugfs_create_u32_array()).

Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Signed-off-by: Ronald Tschalär <ronald@innovation.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 11:56:50 +02:00
Eric Biggers
877b5691f2 crypto: shash - remove shash_desc::flags
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.

With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP.  These would all need to be fixed if any shash algorithm
actually started sleeping.  For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API.  However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.

Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk.  It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.

Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-25 15:38:12 +08:00
Jérôme Glisse
13f5938d82 cifs: fix page reference leak with readv/writev
CIFS can leak pages reference gotten through GUP (get_user_pages*()
through iov_iter_get_pages()). This happen if cifs_send_async_read()
or cifs_write_from_iter() calls fail from within __cifs_readv() and
__cifs_writev() respectively. This patch move page unreference to
cifs_aio_ctx_release() which will happens on all code paths this is
all simpler to follow for correctness.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-24 12:33:59 -05:00
Frank Sorenson
652727bbe1 cifs: do not attempt cifs operation on smb2+ rename error
A path-based rename returning EBUSY will incorrectly try opening
the file with a cifs (NT Create AndX) operation on an smb2+ mount,
which causes the server to force a session close.

If the mount is smb2+, skip the fallback.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-04-24 12:33:59 -05:00
Ronnie Sahlberg
05fd5c2c61 cifs: fix memory leak in SMB2_read
Commit 088aaf17aa introduced a leak where
if SMB2_read() returned an error we would return without freeing the
request buffer.

Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-04-24 12:33:59 -05:00
Linus Torvalds
12a54b150f Fix miscellaneous nfsd bugs, in NFSv4.1 callbacks, NFSv4.1
lock-notification callbacks, NFSv3 readdir encoding, and the
 cache/upcall code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcv3EfAAoJECebzXlCjuG+j8IQAIMJFCpMTSu3lw0X9BjJFwe/
 TELdvH4eXm+xtnz0Rbp4RTmlNgt/ox1ySp+sPQqcDutJcU6ImtJS1JY5BmoSU80p
 2ZTRzVOusSS1dP6EDwnYQlR3rI3ObCdDYtkjfboqDd4UJM0hvuCrdQols3DRZF1/
 gBMUDNk8Sch64JMs8EcpItmNWgZ1QDmL6vbKsYDB8HWfBcFkNOs/Q1OecSD7F9Dh
 k+GojPb4pXRCgN8heeFr2H2mXxJSDOPyyybBSQOtbv58rT7IXTpVvAt+WHpjy917
 ljOSD/EuCY+IvX6oY8xPrly9a3t4Mh4eQQFj1EZGSlNucqDgsi1B421QcxGZSkF4
 WZkTRC5BppairSx4dnW96Nx0BbCysj2dvZ0MOnUa59YRwRbyrnHKCv221SzFBidI
 0slxNBYW7Rw2ViFJ6fBPjsjH0XWM9Hq4wC84kj8XXNeDDh2KzXPtUtUzp3iDpKQg
 jP0UiEYnt6OXWvHOH0hfh52Mds48iIUSRmKL4huDT8Ijnt9KBZxnTunClJt6g1SH
 MsIttRsZ2fLo/M5SlWzvQ7XNoKNn/s49Z3SZjqYEKqL9aMwuMoMcIgolqlh5oFPu
 j3cqprDNXuij+2pl35IuQRbk57G2DZ1maShJ5n8I7l0uBe6DMX7T+4d4rHy75Tu9
 otqz4WYU2Yjnnmv5Kp/H
 =9hKY
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.1-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fix miscellaneous nfsd bugs, in NFSv4.1 callbacks, NFSv4.1
  lock-notification callbacks, NFSv3 readdir encoding, and the
  cache/upcall code"

* tag 'nfsd-5.1-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: wake blocked file lock waiters before sending callback
  nfsd: wake waiters blocked on file_lock before deleting it
  nfsd: Don't release the callback slot unless it was actually held
  nfsd/nfsd3_proc_readdir: fix buffer count and page pointers
  sunrpc: don't mark uninitialised items as VALID.
2019-04-23 13:40:55 -07:00
Yan, Zheng
37659182bf ceph: fix ci->i_head_snapc leak
We missed two places that i_wrbuffer_ref_head, i_wr_ref, i_dirty_caps
and i_flushing_caps may change. When they are all zeros, we should free
i_head_snapc.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/38224
Reported-and-tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-04-23 21:37:54 +02:00
Jeff Layton
4b82228700 ceph: handle the case where a dentry has been renamed on outstanding req
It's possible for us to issue a lookup to revalidate a dentry
concurrently with a rename. If done in the right order, then we could
end up processing dentry info in the reply that no longer reflects the
state of the dentry.

If req->r_dentry->d_name differs from the one in the trace, then just
ignore the trace in the reply. We only need to do this however if the
parent's i_rwsem is not held.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-04-23 21:37:54 +02:00
Jeff Layton
76a495d666 ceph: ensure d_name stability in ceph_dentry_hash()
Take the d_lock here to ensure that d_name doesn't change.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-04-23 21:37:54 +02:00
Jeff Layton
1bcb344086 ceph: only use d_name directly when parent is locked
Ben reported tripping the BUG_ON in create_request_message during some
performance testing. Analysis of the vmcore showed that the length of
the r_dentry->d_name string changed after we allocated the buffer, but
before we encoded it.

build_dentry_path returns pointers to d_name in the common case of
non-snapped dentries, but this optimization isn't safe unless the parent
directory is locked. When it isn't, have the code make a copy of the
d_name while holding the d_lock.

Cc: stable@vger.kernel.org
Reported-by: Ben England <bengland@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-04-23 21:37:54 +02:00
Darrick J. Wong
3de5eab3fd xfs: unlock inode when xfs_ioctl_setattr_get_trans can't get transaction
We passed an inode into xfs_ioctl_setattr_get_trans with join_flags
indicating which locks are held on that inode.  If we can't allocate a
transaction then we need to unlock the inode before we bail out, like
all the other error paths do.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
078f4a7d31 xfs: kill the xfs_dqtrx_t typedef
There's only a few uses left, so just kill the typedef while we're at
it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
394aafdc15 xfs: widen inode delalloc block counter to 64-bits
Widen the incore inode's i_delayed_blks counter to be a 64-bit integer.
This is necessary to fix an integer overflow problem that can be
reproduced easily now that we use the counter to track blocks that are
assigned to the inode in memory but not on disk.  This includes actual
delalloc reservations as well as real extents in the COW fork that
are waiting to be remapped into the data fork.

These 'delayed mapping' blocks can easily exceed 2^32 blocks if one
creates a very large sparse file of size approximately 2^33 bytes with
one byte written every 2^23 bytes, sets a very large COW extent size
hint of 2^23 blocks, reflinks the first file into a second file, and
then writes a single byte every 2^23 blocks in the original file.

When this happens, we'll try to create approximately 1024 2^23 extent
reservations in the COW fork, which will overflow the counter and cause
problems.

Note that on x64 we end up filling a 4-byte gap in the structure so this
doesn't increase the incore size.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
903b1fc273 xfs: widen quota block counters to 64-bit integers
Widen the incore quota transaction delta structure to treat block
counters as 64-bit integers.  This is a necessary addition so that we
can widen the i_delayed_blks counter to be a 64-bit integer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
1fdeaea4d9 xfs: abort unaligned nowait directio early
Dave Chinner noticed that xfs_file_dio_aio_write returns EAGAIN without
dropping the IOLOCK when its deciding not to wait, which means that we
leak the IOLOCK there.  Since we now make unaligned directio always
wait, we have the opportunity to bail out before trying to take the
lock, which should reduce the overhead of this never-gonna-work case
considerably while also solving the dropped lock problem.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Brian Foster
362f5e745a xfs: assert that we don't enter agfl freeing with a non-permanent transaction
Block allocation requires a permanent transaction for deferred AGFL
frees.  Add an assert in the block allocation path to make explicit and
obvious to future callers the requirement of a transaction with a
permanent reservation.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: split this out from the previous patch per hch request]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-23 08:36:23 -07:00
Jens Axboe
8358e3a826 io_uring: remove 'state' argument from io_{read,write} path
Since commit 09bb839434 we don't use the state argument for any sort
of on-stack caching in the io read and write path. Remove the stale
and unused argument from them, and bubble it up to __io_submit_sqe()
and down to io_prep_rw().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-23 08:17:58 -06:00
Brian Foster
945c941fcd xfs: make tr_growdata a permanent transaction
The growdata transaction is used by growfs operations to increase
the data size of the filesystem. Part of this sequence involves
extending the size of the last preexisting AG in the fs, if
necessary. This is implemented by freeing the newly available
physical range to the AG.

tr_growdata is not a permanent transaction, however, and block
allocation transactions must be permanent to handle deferred frees
of AGFL blocks. If the grow operation extends an existing AG that
requires AGFL fixing, assert failures occur due to a populated dfops
list on a non-permanent transaction and the AGFL free does not
occur. This is reproduced (rarely) by xfs/104.

Change tr_growdata to a permanent transaction with a default log
count. This increases initial transaction reservation size, but
growfs is an infrequent and non-performance critical operation and
so should have minimal impact.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment to the assert]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-22 16:28:45 -07:00
Jeff Layton
f456458e4d nfsd: wake blocked file lock waiters before sending callback
When a blocked NFS lock is "awoken" we send a callback to the server and
then wake any hosts waiting on it. If a client attempts to get a lock
and then drops off the net, we could end up waiting for a long time
until we end up waking locks blocked on that request.

So, wake any other waiting lock requests before sending the callback.
Do this by calling locks_delete_block in a new "prepare" phase for
CB_NOTIFY_LOCK callbacks.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=203363
Fixes: 16306a61d3 ("fs/locks: always delete_block after waiting.")
Reported-by: Slawomir Pryczek <slawek1211@gmail.com>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-22 15:38:41 -04:00
Jeff Layton
6aaafc43a4 nfsd: wake waiters blocked on file_lock before deleting it
After a blocked nfsd file_lock request is deleted, knfsd will send a
callback to the client and then free the request. Commit 16306a61d3
("fs/locks: always delete_block after waiting.") changed it such that
locks_delete_block is always called on a request after it is awoken,
but that patch missed fixing up blocked nfsd request handling.

Call locks_delete_block on the block to wake up any locks still blocked
on the nfsd lock request before freeing it. Some of its callers already
do this however, so just remove those calls.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=203363
Fixes: 16306a61d3 ("fs/locks: always delete_block after waiting.")
Reported-by: Slawomir Pryczek <slawek1211@gmail.com>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-22 15:31:54 -04:00
Stefan Bühler
fb775faa9e io_uring: fix poll full SQ detection
io_uring_poll shouldn't signal EPOLLOUT | EPOLLWRNORM if the queue is
full; the old check would always signal EPOLLOUT | EPOLLWRNORM (unless
there were U32_MAX - 1 entries in the SQ queue).

Signed-off-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 11:00:58 -06:00
Stefan Bühler
0d7bae69c5 io_uring: fix race condition when sq threads goes sleeping
Reading the SQ tail needs to come after setting IORING_SQ_NEED_WAKEUP in
flags; there is no cheap barrier for ordering a store before a load, a
full memory barrier is required.

Userspace needs a full memory barrier between updating SQ tail and
checking for the IORING_SQ_NEED_WAKEUP too.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 11:00:56 -06:00
Stefan Bühler
e523a29c4f io_uring: fix race condition reading SQ entries
A read memory barrier is required between reading SQ tail and reading
the actual data belonging to the SQ entry.

Userspace needs a matching write barrier between writing SQ entries and
updating SQ tail (using smp_store_release to update tail will do).

Signed-off-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 11:00:55 -06:00
Jens Axboe
35fa71a030 io_uring: fail io_uring_register(2) on a dying io_uring instance
If we have multiple threads doing io_uring_register(2) on an io_uring
fd, then we can potentially try and kill the percpu reference while
someone else has already killed it.

Prevent this race by failing io_uring_register(2) if the ref is marked
dying. This is safe since we're inside the io_uring mutex.

Fixes: b19062a567 ("io_uring: fix possible deadlock between io_uring_{enter,register}")
Reported-by: syzbot <syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 10:37:07 -06:00
Jens Axboe
5c61ee2cd5 Linux 5.1-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAly8rGYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGmZMH/1IRB0E1Qmzz8yzw
 wj79UuRGYPqxDDSWW+wNc8sU4Ic7iYirn9APHAztCdQqsjmzU/OVLfSa3JhdBe5w
 THo7pbGKBqEDcWnKfNk/21jXFNLZ1vr9BoQv2DGU2MMhHAyo/NZbalo2YVtpQPmM
 OCRth5n+LzvH7rGrX7RYgWu24G9l3NMfgtaDAXBNXesCGFAjVRrdkU5CBAaabvtU
 4GWh/nnutndOOLdByL3x+VZ3H3fIBnbNjcIGCglvvqzk7h3hrfGEl4UCULldTxcM
 IFsfMUhSw1ENy7F6DHGbKIG90cdCJcrQ8J/ziEzjj/KLGALluutfFhVvr6YCM2J6
 2RgU8CY=
 =CfY1
 -----END PGP SIGNATURE-----

Merge tag 'v5.1-rc6' into for-5.2/block

Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
comment, and is trivial. The other one is a conflict due to a later fix
in the bio multi-page work, and needs a bit more care.

* tag 'v5.1-rc6': (770 commits)
  Linux 5.1-rc6
  block: make sure that bvec length can't be overflow
  block: kill all_q_node in request_queue
  x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
  coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
  mm/kmemleak.c: fix unused-function warning
  init: initialize jump labels before command line option parsing
  kernel/watchdog_hld.c: hard lockup message should end with a newline
  kcov: improve CONFIG_ARCH_HAS_KCOV help text
  mm: fix inactive list balancing between NUMA nodes and cgroups
  mm/hotplug: treat CMA pages as unmovable
  proc: fixup proc-pid-vm test
  proc: fix map_files test on F29
  mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
  mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
  mm: swapoff: shmem_unuse() stop eviction without igrab()
  mm: swapoff: take notice of completion sooner
  mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
  mm: swapoff: shmem_find_swap_entries() filter out other types
  slab: store tagged freelist for off-slab slabmgmt
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 09:47:36 -06:00
Greg Kroah-Hartman
3a26172437 Merge 5.1-rc6 into char-misc-next
We want the fixes, and this resolves a merge error in the fastrpc
driver.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-21 23:14:47 +02:00
Linus Torvalds
38a2ca2cac for-linus-20190420
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAly7HukQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpla6D/9y7YyAyKDgv/pVQgAlDYaGSXufvrK5iK/f
 uFdSWPvGuWMbx+xy/4hfSX1pV9ZRv1aRJeFOkL/qVyr+4izKrgevwj+6Kl3/mCUO
 dhiqF76bnaGXNQC6YDn1IgZp2Za+WGpeNlEhwcg20Ve11U7DVBhcL/n/6NYphtUG
 V7ZFoVw+yjOO9GvkUeHx24HIQdC0JrABMoXYldl/tX3H9WjB3d1ncZDS45TuemXJ
 lwm/S4nyaNaDzLnO7Hv51u3tCFpuaJbcgBdKuZB/oSWhU68D26/6peW+8qAvN+ec
 htibFrK6KPRQCLNMEaV2njZEyprkL/BZJz4YukwmWB8GAtsuquy3Q3wJPounGmm5
 7fCG/T1asqkurwhVcHOC07R6+d8AT5ARyJn3QYFmoYCIoSwObu6xhZHHAv7Ct1Xn
 lrU4it0WkYbTXVI1l4CaRUtshCIQTZwr2EsgppjAsBc1+V2KgtbxR1wkQq2q9tQZ
 Fa/2KTv9Y1+7FUOf09LEvTbuUgZn4I6u4E07QwY4miFsQSEUufirfHZ5t62lIgA9
 3YzUrlVQSP1PbG8IP4aCSX2o+dxhL1Js6ukdZAxM6w9RtjLqWI3zTImSSMJobjna
 SF53kkpv1xuJYT+Z1YmNGbMauzLs/HhCB9ww56TUuQYW/rTDASqFc48l7+vsfPrZ
 sTEkShVGOw==
 =d9Ws
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190420' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A set of small fixes that should go into this series. This contains:

   - Removal of unused queue member (Hou)

   - Overflow bvec fix (Ming)

   - Various little io_uring tweaks (me)
       - kthread parking
       - Only call cpu_possible() for verified CPU
       - Drop unused 'file' argument to io_file_put()
       - io_uring_enter vs io_uring_register deadlock fix
       - CQ overflow fix

   - BFQ internal depth update fix (me)"

* tag 'for-linus-20190420' of git://git.kernel.dk/linux-block:
  block: make sure that bvec length can't be overflow
  block: kill all_q_node in request_queue
  io_uring: fix CQ overflow condition
  io_uring: fix possible deadlock between io_uring_{enter,register}
  io_uring: drop io_file_put() 'file' argument
  bfq: update internal depth state when queue depth changes
  io_uring: only test SQPOLL cpu after we've verified it
  io_uring: park SQPOLL thread if it's percpu
2019-04-20 12:20:58 -07:00
Andrea Arcangeli
04f5866e41 coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19 09:46:05 -07:00
David Howells
5dd50aaeb1
Make anon_inodes unconditional
Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-04-19 14:03:11 +02:00
Linus Torvalds
2a852fd1ac AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXLGSOfu3V2unywtrAQKrLA/+MJfgZC9ejNsXqjnnqq48LtBf4B3FhYrO
 qdPZMKPuy5XxSWBIIGpXtgK7PspkpAD1NCe7qpvibQoa1SJqwa2gc46aN7+gzp3e
 HETkb4UEXNV5yXSjEA+YOQVa0MiLG2yL6satlqhN4KqoDB5Rj9KoaPFwT79K4SIU
 1iME9Q//tA2M7SU8B1FWqyx5kd350A38KoXmHOux7x8K1ZyjEB2DbyKqZBtn6Y/T
 nfH9ZZioep1JbPFwOuG5NYQH0hP2jUj8ogDnrzzWzn1rQQTWs6wVfigwAh8tc/Dv
 eDKA8rV9WtZT97ExOG6cQOe0UZsM3NFhnb4P/oSCpl+jtr9WHO3pDmJPLToCr0BA
 8xXW7vA+Ejx15dNz4Snnz9yj3FS0UeURSFyhn3g2AFQMK4v9Dd3ZZKOgjR9Czi98
 f1UNh768LKY+siOHmoOhgZNlrI8bgamCmXOucPayGzrQ/HVHqAwR6PiUGCzaH+vk
 +LfDC/Z/ezSqrfQiH/Ji5lLkQkhmYiUTcePn5xdi+kZGWqHwl9qGplVAFUVafs5O
 Hz+H2F7i8F6Oc4v5hzCHp5aDUP2oEKmruaXJimN+O9SOkUPi72aG8STRpBFnBqf7
 77WxNVG/XDFL6I8HukkgRMRePFDXi+2NwGOjB9k8PpAqyA1Vi6sSkCXVkwgAhLRS
 NCaOR2iDaqE=
 =cqBe
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:

 - Stop using the deprecated get_seconds().

 - Don't make tracepoint strings const as the section they go in isn't
   read-only.

 - Differentiate failure due to unmarshalling from other failure cases.
   We shouldn't abort with RXGEN_CC/SS_UNMARSHAL if it's not due to
   unmarshalling.

 - Add a missing unlock_page().

 - Fix the interaction between receiving a notification from a server
   that it has invalidated all outstanding callback promises and a
   client call that we're in the middle of making that will get a new
   promise.

* tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix in-progess ops to ignore server-level callback invalidation
  afs: Unlock pages for __pagevec_release()
  afs: Differentiate abort due to unmarshalling from other errors
  afs: Avoid section confusion in CM_NAME
  afs: avoid deprecated get_seconds()
2019-04-18 08:10:22 -07:00
Linus Torvalds
e53f31bffe five small SMB3 fixes, all also for stable - an important fix for an oplock (lease) bug, a handle leak, and 3 bugs spotted by KASAN
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAly2PPMACgkQiiy9cAdy
 T1EbQQwArLGfousJE2mkQZIdAlefjeyCLJdFxOLktA3dd3EbmgOtxg8FSBt5W04X
 UqQ6i2cGL46Uecp+EaOTy2o0o3lc7eRGA83+0CYY+JLNLSS9g+RNgK2kIkcnj3h8
 N/fovUg3yhdLO0tmh13FcwBDAYPqjSd4iija1icBXSkkZdp592ims3zJaGunHMGn
 s237jaLPzx7xxqNaGDD/MtQtpPDzZzPlVCszTnO3anGaSHB1R7HNhR9twWTF/mmz
 bfhLiOyquzzt53XyaTmtpnzMua5X+Rhii2Pq315Pb+71JQ5TBYWRIOx/8QKwdGS+
 9EiiiEgqYfwmcDl1KPnm1ppTF2++mYkFBFq43cgosagDaL8YXdqtCIrsVkTxXnmv
 +bKhiXpriPzsy5INuRgfWmODrbl2QLRY1qbS4I/lA0+uZxFu7palwmSBkMOgcorn
 0TnejxsajTtYBpkzYKOuiRN7z4ukvR4FJd6xaxqSkyX+Jlda4wnwPTRVyBxOcL06
 EaAxdi7X
 =vsJs
 -----END PGP SIGNATURE-----

Merge tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb3 fixes from Steve French:
 "Five small SMB3 fixes, all also for stable - an important fix for an
  oplock (lease) bug, a handle leak, and three bugs spotted by KASAN"

* tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: keep FileInfo handle live during oplock break
  cifs: fix handle leak in smb2_query_symlink()
  cifs: Fix lease buffer length error
  cifs: Fix use-after-free in SMB2_read
  cifs: Fix use-after-free in SMB2_write
2019-04-17 13:36:45 -07:00
Jens Axboe
74f464e970 io_uring: fix CQ overflow condition
This is a leftover from when the rings initially were not free flowing,
and hence a test for tail + 1 == head would indicate full. Since we now
let them wrap instead of mask them with the size, we need to check if
they drift more than the ring size from each other.

This fixes a case where we'd overwrite CQ ring entries, if the user
failed to reap completions. Both cases would ultimately result in lost
completions as the application violated the depth it asked for. The only
difference is that before this fix we'd return invalid entries for the
overflowed completions, instead of properly flagging it in the
cq_ring->overflow variable.

Reported-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-17 11:41:49 -06:00
Linus Torvalds
2a3a028fc6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle init flow failures properly in iwlwifi driver, from Shahar S
    Matityahu.

 2) mac80211 TXQs need to be unscheduled on powersave start, from Felix
    Fietkau.

 3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau.

 4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.

 5) Avoid checksum complete with XDP in mlx5, also from Saeed.

 6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.

 7) Partial sent TLS record leak fix from Jakub Kicinski.

 8) Reject zero size iova range in vhost, from Jason Wang.

 9) Allow pending work to complete before clcsock release from Karsten
    Graul.

10) Fix XDP handling max MTU in thunderx, from Matteo Croce.

11) A lot of protocols look at the sa_family field of a sockaddr before
    validating it's length is large enough, from Tetsuo Handa.

12) Don't write to free'd pointer in qede ptp error path, from Colin Ian
    King.

13) Have to recompile IP options in ipv4_link_failure because it can be
    invoked from ARP, from Stephen Suryaputra.

14) Doorbell handling fixes in qed from Denis Bolotin.

15) Revert net-sysfs kobject register leak fix, it causes new problems.
    From Wang Hai.

16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.

17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay
    Aleksandrov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
  socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
  tcp: tcp_grow_window() needs to respect tcp_space()
  ocelot: Clean up stats update deferred work
  ocelot: Don't sleep in atomic context (irqs_disabled())
  net: bridge: fix netlink export of vlan_stats_per_port option
  qed: fix spelling mistake "faspath" -> "fastpath"
  tipc: set sysctl_tipc_rmem and named_timeout right range
  tipc: fix link established but not in session
  net: Fix missing meta data in skb with vlan packet
  net: atm: Fix potential Spectre v1 vulnerabilities
  net/core: work around section mismatch warning for ptp_classifier
  net: bridge: fix per-port af_packet sockets
  bnx2x: fix spelling mistake "dicline" -> "decline"
  route: Avoid crash from dereferencing NULL rt->from
  MAINTAINERS: normalize Woojung Huh's email address
  bonding: fix event handling for stacked bonds
  Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
  rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
  qed: Fix the DORQ's attentions handling
  qed: Fix missing DORQ attentions
  ...
2019-04-17 09:57:45 -07:00
Darrick J. Wong
3994fc4895 xfs: merge adjacent io completions of the same type
It's possible for pagecache writeback to split up a large amount of work
into smaller pieces for throttling purposes or to reduce the amount of
time a writeback operation is pending.  Whatever the reason, XFS can end
up with a bunch of IO completions that call for the same operation to be
performed on a contiguous extent mapping.  Since mappings are extent
based in XFS, we'd prefer to run fewer transactions when we can.

When we're processing an ioend on the list of io completions, check to
see if the next items on the list are both adjacent and of the same
type.  If so, we can merge the completions to reduce transaction
overhead.

On fast storage this doesn't seem to make much of a difference in
performance, though the number of transactions for an overnight xfstests
run seems to drop by ~5%.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:58 -07:00
Darrick J. Wong
2840824370 xfs: remove unused m_data_workqueue
Now that we're no longer using m_data_workqueue, remove it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:58 -07:00
Darrick J. Wong
cb357bf3d1 xfs: implement per-inode writeback completion queues
When scheduling writeback of dirty file data in the page cache, XFS uses
IO completion workqueue items to ensure that filesystem metadata only
updates after the write completes successfully.  This is essential for
converting unwritten extents to real extents at the right time and
performing COW remappings.

Unfortunately, XFS queues each IO completion work item to an unbounded
workqueue, which means that the kernel can spawn dozens of threads to
try to handle the items quickly.  These threads need to take the ILOCK
to update file metadata, which results in heavy ILOCK contention if a
large number of the work items target a single file, which is
inefficient.

Worse yet, the writeback completion threads get stuck waiting for the
ILOCK while holding transaction reservations, which can use up all
available log reservation space.  When that happens, metadata updates to
other parts of the filesystem grind to a halt, even if the filesystem
could otherwise have handled it.

Even worse, if one of the things grinding to a halt happens to be a
thread in the middle of a defer-ops finish holding the same ILOCK and
trying to obtain more log reservation having exhausted the permanent
reservation, we now have an ABBA deadlock - writeback completion has a
transaction reserved and wants the ILOCK, and someone else has the ILOCK
and wants a transaction reservation.

Therefore, we create a per-inode writeback io completion queue + work
item.  When writeback finishes, it can add the ioend to the per-inode
queue and let the single worker item process that queue.  This
dramatically cuts down on the number of kworkers and ILOCK contention in
the system, and seems to have eliminated an occasional deadlock I was
seeing while running generic/476.

Testing with a program that simulates a heavy random-write workload to a
single file demonstrates that the number of kworkers drops from
approximately 120 threads per file to 1, without dramatically changing
write bandwidth or pagecache access latency.

Note that we leave the xfs-conv workqueue's max_active alone because we
still want to be able to run ioend processing for as many inodes as the
system can handle.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
4fb7951fde xfs: scrub should only cross-reference with healthy btrees
Skip cross-referencing with a btree if the health report tells us that
it's known to be bad.  This should reduce the dmesg spew considerably.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
4860a05d24 xfs: scrub/repair should update filesystem metadata health
Now that we have the ability to track sick metadata in-core, make scrub
and repair update those health assessments after doing work.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
160b5a7845 xfs: hoist the already_fixed variable to the scrub context
Now that we no longer memset the scrub context, we can move the
already_fixed variable into the scrub context's state flags instead of
passing around pointers to separate stack variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
f8c2a2257c xfs: collapse scrub bool state flags into a single unsigned int
Combine all the boolean state flags in struct xfs_scrub into a single
unsigned int, because we're going to be adding more state flags soon.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
9d71e15586 xfs: refactor scrub context initialization
It's a little silly how the memset in scrub context initialization
forces us to declare stack variables to preserve context variables
across a retry.  Since the teardown functions already null out most of
the ephemeral state (buffer pointers, btree cursors, etc.), just skip
the memset and move the initialization as needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Aurelien Aptel
b98749cac4 CIFS: keep FileInfo handle live during oplock break
In the oplock break handler, writing pending changes from pages puts
the FileInfo handle. If the refcount reaches zero it closes the handle
and waits for any oplock break handler to return, thus causing a deadlock.

To prevent this situation:

* We add a wait flag to cifsFileInfo_put() to decide whether we should
  wait for running/pending oplock break handlers

* We keep an additionnal reference of the SMB FileInfo handle so that
  for the rest of the handler putting the handle won't close it.
  - The ref is bumped everytime we queue the handler via the
    cifs_queue_oplock_break() helper.
  - The ref is decremented at the end of the handler

This bug was triggered by xfstest 464.

Also important fix to address the various reports of
oops in smb2_push_mandatory_locks

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2019-04-16 09:38:38 -05:00
Ronnie Sahlberg
e6d0fb7b34 cifs: fix handle leak in smb2_query_symlink()
If we enter smb2_query_symlink() for something that is not a symlink
and where the SMB2_open() would succeed we would never end up
closing this handle and would thus leak a handle on the server.

Fix this by immediately calling SMB2_close() on successfull open.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:26 -05:00
ZhangXiaoxu
b57a55e220 cifs: Fix lease buffer length error
There is a KASAN slab-out-of-bounds:
BUG: KASAN: slab-out-of-bounds in _copy_from_iter_full+0x783/0xaa0
Read of size 80 at addr ffff88810c35e180 by task mount.cifs/539

CPU: 1 PID: 539 Comm: mount.cifs Not tainted 4.19 #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
            rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
 dump_stack+0xdd/0x12a
 print_address_description+0xa7/0x540
 kasan_report+0x1ff/0x550
 check_memory_region+0x2f1/0x310
 memcpy+0x2f/0x80
 _copy_from_iter_full+0x783/0xaa0
 tcp_sendmsg_locked+0x1840/0x4140
 tcp_sendmsg+0x37/0x60
 inet_sendmsg+0x18c/0x490
 sock_sendmsg+0xae/0x130
 smb_send_kvec+0x29c/0x520
 __smb_send_rqst+0x3ef/0xc60
 smb_send_rqst+0x25a/0x2e0
 compound_send_recv+0x9e8/0x2af0
 cifs_send_recv+0x24/0x30
 SMB2_open+0x35e/0x1620
 open_shroot+0x27b/0x490
 smb2_open_op_close+0x4e1/0x590
 smb2_query_path_info+0x2ac/0x650
 cifs_get_inode_info+0x1058/0x28f0
 cifs_root_iget+0x3bb/0xf80
 cifs_smb3_do_mount+0xe00/0x14c0
 cifs_do_mount+0x15/0x20
 mount_fs+0x5e/0x290
 vfs_kern_mount+0x88/0x460
 do_mount+0x398/0x31e0
 ksys_mount+0xc6/0x150
 __x64_sys_mount+0xea/0x190
 do_syscall_64+0x122/0x590
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

It can be reproduced by the following step:
  1. samba configured with: server max protocol = SMB2_10
  2. mount -o vers=default

When parse the mount version parameter, the 'ops' and 'vals'
was setted to smb30,  if negotiate result is smb21, just
update the 'ops' to smb21, but the 'vals' is still smb30.
When add lease context, the iov_base is allocated with smb21
ops, but the iov_len is initiallited with the smb30. Because
the iov_len is longer than iov_base, when send the message,
copy array out of bounds.

we need to keep the 'ops' and 'vals' consistent.

Fixes: 9764c02fcb ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)")
Fixes: d5c7076b77 ("smb3: add smb3.1.1 to default dialect list")

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:23 -05:00
ZhangXiaoxu
088aaf17aa cifs: Fix use-after-free in SMB2_read
There is a KASAN use-after-free:
BUG: KASAN: use-after-free in SMB2_read+0x1136/0x1190
Read of size 8 at addr ffff8880b4e45e50 by task ln/1009

Should not release the 'req' because it will use in the trace.

Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> 4.18+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:21 -05:00
ZhangXiaoxu
6a3eb33606 cifs: Fix use-after-free in SMB2_write
There is a KASAN use-after-free:
BUG: KASAN: use-after-free in SMB2_write+0x1342/0x1580
Read of size 8 at addr ffff8880b6a8e450 by task ln/4196

Should not release the 'req' because it will use in the trace.

Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> 4.18+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:18 -05:00
Linus Torvalds
5512320c9f fsdax fix 5.1-rc6
- Avoid a crash scenario with architectures like powerpc that require
   'pgtable_deposit' for the zero page.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJctNryAAoJEB7SkWpmfYgCzcMP/37LJbb4SYNwnDIW4BF33ril
 ZwtPeJJVTR56Ojo+Dy1v9084zeyhUHHewz0Oqx15dm6k/N5SS19yKNFKQDOK+4OC
 zbaWD5UOtllU3RQ2ORUOUoqNGF278+h4VVVQMntVaHhdt5f120tgHXxmKoB5Z5zH
 Gcy0vZNHoJ5lVYfKjKYG0b0/dWWOD1ZEjTkZjTa4DjhVSQcFauN8DxJ4hSyumYqs
 HDnZZt44RTTUS5W3BTlhuaSEcZaDOznmyj1HmKXNg3ghxguKACho4xhA7xFKqT8O
 03WZxDBFnOXZb3yfKpHB6RclkJgrtmD5U5GStzl5SobLPb2E/TPQzCRhZ/kcFPZ8
 RE2JkgdGl8gqCDRqRsC/tbF3dETO66vxUyf5utNv0ttBk7qLMwTGTKm3VQz7Xvu2
 SLkwv6Rlw4UT6ML8nd2kNhf8xRkaLl6j1B6zWDy7wEoFPXWW+My0PPpsJZcbTeza
 eib2ood7AlPHRU0/mW2ZrGHGabbS6kNGeQlod9U5sikkE7ZA/LwzyFl4b/uCqYNP
 NKGQdz0iHVcq8lFPXEmZ7vP2krd6uUWIv9KaiwmjBBMf9w3ZAzS85c7HFAZD0zgC
 tTHm6stMhpdS3ndyIxMBf0sL7AB/Q9BH7jJwDK/P8QObovezW2zZ4CPx/gQYJ2XU
 LTeCJmQh3xcCpI3f/eka
 =ijtJ
 -----END PGP SIGNATURE-----

Merge tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull fsdax fix from Dan Williams:
 "A single filesystem-dax fix. It has been lingering in -next for a long
  while and there are no other fsdax fixes on the horizon:

   - Avoid a crash scenario with architectures like powerpc that require
     'pgtable_deposit' for the zero page"

* tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  fs/dax: Deposit pagetable even when installing zero page
2019-04-15 15:10:20 -07:00
Jens Axboe
b19062a567 io_uring: fix possible deadlock between io_uring_{enter,register}
If we have multiple threads, one doing io_uring_enter() while the other
is doing io_uring_register(), we can run into a deadlock between the
two. io_uring_register() must wait for existing users of the io_uring
instance to exit. But it does so while holding the io_uring mutex.
Callers of io_uring_enter() may need this mutex to make progress (and
eventually exit). If we wait for users to exit in io_uring_register(),
we can't do so with the io_uring mutex held without potentially risking
a deadlock.

Drop the io_uring mutex while waiting for existing callers to exit. This
is safe and guaranteed to make forward progress, since we already killed
the percpu ref before doing so. Hence later callers of io_uring_enter()
will be rejected.

Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-15 10:49:38 -06:00
Darrick J. Wong
89d139d5ad xfs: report inode health via bulkstat
Use space in the bulkstat ioctl structure to report any problems
observed with the inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:58 -07:00
Darrick J. Wong
1302c6a24f xfs: report AG health via AG geometry ioctl
Use the AG geometry info ioctl to report health status too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
c23232d409 xfs: report fs and rt health via geometry structure
Use our newly expanded geometry structure to report the overall fs and
realtime health status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
7cd5006bdb xfs: add a new ioctl to describe allocation group geometry
Add a new ioctl to describe an allocation group's geometry.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Dave Chinner
1b6d968de2 xfs: bump XFS_IOC_FSGEOMETRY to v5 structures
Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we
can't just add a new field to it. Hence we need to bump the definition
to V5 and and treat the V4 ioctl and structure similar to v1 to v3.

While doing this, clean up all the definitions associated with the
XFS_IOC_FSGEOMETRY ioctl.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: forward port to 5.1, expand structure size to 256 bytes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
519841c207 xfs: clear BAD_SUMMARY if unmounting an unhealthy filesystem
If we know the filesystem metadata isn't healthy during unmount, we want
to encourage the administrator to run xfs_repair right away.  We can't
do this if BAD_SUMMARY will cause an unclean log unmount to force
summary recalculation, so turn it off if the fs is bad.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
39353ff6e9 xfs: replace the BAD_SUMMARY mount flag with the equivalent health code
Replace the BAD_SUMMARY mount flag with calls to the equivalent health
tracking code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
6772c1f112 xfs: track metadata health status
Add the necessary in-core metadata fields to keep track of which parts
of the filesystem have been observed and which parts were observed to be
unhealthy, and print a warning at unmount time if we have unfixed
problems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Wang Shilong
2bf9d264ef xfs,fstrim: fix to return correct minlen
This patch tries to address two problems:

1) return @minlen we used to trim to
user space.

2) return EINVAL if granularity is larger than
avg size, even most of cases, granularity is small(4K),
but if devices return a lager granularity for some reaons
(testing, bugs etc), fstrim should return failure directly.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:57 -07:00
Brian Foster
1ca89fbc48 xfs: don't account extra agfl blocks as available
The block allocation AG selection code has parameters that allow a
caller to perform multiple allocations from a single AG and
transaction (under certain conditions). The parameters specify the
total block allocation count required by the transaction and the AG
selection code selects and locks an AG that will be able to satisfy
the overall requirement. If the available block accounting
calculation turns out to be inaccurate and a subsequent allocation
call fails with -ENOSPC, the resulting transaction cancel leads to
filesystem shutdown because the transaction is dirty.

This exact problem can be reproduced with a highly parallel space
consumer and fsstress workload running long enough to a large
filesystem against -ENOSPC conditions. A bmbt block allocation
request made for inode extent to bmap format conversion after an
extent allocation is expected to be satisfied by the same AG and the
same transaction as the extent allocation. The bmbt block allocation
fails, however, because the block availability of the AG has changed
since the AG was selected (outside of the blocks used for the extent
itself).

The inconsistent block availability calculation is caused by the
deferred block freeing behavior of the AGFL. This immediately
removes extra blocks from the AGFL to free up AGFL slots, but rather
than immediately freeing such blocks as was done in the past, the
block free is deferred such that said blocks are not available for
allocation until the current transaction commits. The AG selection
logic currently considers all AGFL blocks as available and executes
shortly before any extra AGFL blocks are freed. This means the block
availability of the current AG can change before the first
allocation even occurs, but in practice a failure is more likely to
manifest via a subsequent allocation because extent allocation
usually has a contiguity requirement larger than a single block that
can't be satisfied from the AGFL.

In general, XFS prefers operational robustness to absolute
allocation efficiency. In other words, we prefer to return -ENOSPC
slightly earlier at the expense of not being able to allocate every
last block in an AG to avoid this kind of problem. As such, update
the AG block availability calculation to consider extra AGFL blocks
as unavailable since they are immediately removed following the
calculation and will not become available until the current
transaction commits.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:57 -07:00
Brian Foster
22fedd80b6 xfs: shutdown after buf release in iflush cluster abort path
If xfs_iflush_cluster() fails due to corruption, the error path
issues a shutdown and simulates an I/O completion to release the
buffer. This code has a couple small problems. First, the shutdown
sequence can issue a synchronous log force, which is unsafe to do
with buffer locks held. Second, the simulated I/O completion does not
guarantee the buffer is async and thus is unlocked and released.

For example, if the last operation on the buffer was a read off disk
prior to the corruption event, XBF_ASYNC is not set and the buffer
is left locked and held upon return. This results in a memory leak
as shown by the following message on module unload:

 BUG xfs_buf (...): Objects remaining in xfs_buf on __kmem_cache_shutdown()

Fix both of these problems by setting XBF_ASYNC on the buffer prior
to the simulated I/O error and performing the shutdown immediately
after ioend processing when the buffer has been released.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:56 -07:00
Brian Foster
545aa41f5c xfs: wake commit waiters on CIL abort before log item abort
XFS shutdown deadlocks have been reproduced by fstest generic/475.
The deadlock signature involves log I/O completion running error
handling to abort logged items and waiting for an inode cluster
buffer lock in the buffer item unpin handler. The buffer lock is
held by xfsaild attempting to flush an inode. The buffer happens to
be pinned and so xfs_iflush() triggers an async log force to begin
work required to get it unpinned. The log force is blocked waiting
on the commit completion, which never occurs and thus leaves the
filesystem deadlocked.

The root problem is that aborted log I/O completion pots commit
completion behind callback completion, which is unexpected for async
log forces. Under normal running conditions, an async log force
returns to the caller once the CIL ctx has been formatted/submitted
and the commit completion event triggered at the tail end of
xlog_cil_push(). If the filesystem has shutdown, however, we rely on
xlog_cil_committed() to trigger the completion event and it happens
to do so after running log item unpin callbacks. This makes it
unsafe to invoke an async log force from contexts that hold locks
that might also be required in log completion processing.

To address this problem, wake commit completion waiters before
aborting log items in the log I/O completion handler. This ensures
that an async log force will not deadlock on held locks if the
filesystem happens to shutdown. Note that it is still unsafe to
issue a sync log force while holding such locks because a sync log
force explicitly waits on the force completion, which occurs after
log I/O completion processing.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:56 -07:00
Brian Foster
4d09807f20 xfs: fix use after free in buf log item unlock assert
The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer
is unlocked when either non-stale or aborted. This assert occurs
after the bli refcount has been dropped and the log item potentially
freed. The aborted check is thus a potential use after free. This
problem has been reproduced with KASAN enabled via generic/475.

Fix up xfs_buf_item_unlock() to query aborted state before the bli
reference is dropped to prevent a potential use after free.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:56 -07:00
Linus Torvalds
6b3a707736 Merge branch 'page-refs' (page ref overflow)
Merge page ref overflow branch.

Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).

Admittedly it's not exactly easy.  To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers.  Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).

Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication.  So let's just do that.

* branch page-refs:
  fs: prevent page refcount overflow in pipe_buf_get
  mm: prevent get_user_pages() from overflowing page refcount
  mm: add 'try_get_page()' helper function
  mm: make page ref count overflow check tighter and more explicit
2019-04-14 15:09:40 -07:00
Thomas Gleixner
accddc41b9 latency_top: Remove the ULONG_MAX stack trace hackery
No architecture terminates the stack trace with ULONG_MAX anymore. The
consumer terminates on the first zero entry or at the number of entries, so
no functional change.

Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Link: https://lkml.kernel.org/r/20190410103644.853527514@linutronix.de
2019-04-14 19:58:31 +02:00
Matthew Wilcox
15fab63e1e fs: prevent page refcount overflow in pipe_buf_get
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount.  All
callers converted to handle a failure.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-14 10:00:04 -07:00
Jens Axboe
3d6770fbd9 io_uring: drop io_file_put() 'file' argument
Since the fget/fput handling was reworked in commit 09bb839434, we
never call io_file_put() with state == NULL (and hence file != NULL)
anymore. Remove that case.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13 19:08:22 -06:00
Jens Axboe
917257daa0 io_uring: only test SQPOLL cpu after we've verified it
We currently call cpu_possible() even if we don't use the CPU. Move the
test under the SQ_AFF branch, which is the only place where we'll use
the value. Do the cpu_possible() test AFTER we've limited it to a max
of NR_CPUS. This avoids triggering the following warning:

WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn

if CONFIG_DEBUG_PER_CPU_MAPS is enabled.

While in there, also move the SQ thread idle period assignment inside
SETUP_SQPOLL, as we don't use it otherwise either.

Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13 19:08:22 -06:00
Jens Axboe
0605863246 io_uring: park SQPOLL thread if it's percpu
kthread expects this, or we can throw a warning on exit:

WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399
__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x72b kernel/panic.c:214
  __warn.cold+0x20/0x46 kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89
c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab
24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c
RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293
RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11
RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007
RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10
  __kthread_bind kernel/kthread.c:412 [inline]
  kthread_unpark+0x123/0x160 kernel/kthread.c:480
  kthread_stop+0xfa/0x6c0 kernel/kthread.c:556
  io_sq_thread_stop fs/io_uring.c:2057 [inline]
  io_sq_thread_stop fs/io_uring.c:2052 [inline]
  io_finish_async+0xab/0x180 fs/io_uring.c:2064
  io_ring_ctx_free fs/io_uring.c:2534 [inline]
  io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591
  io_uring_release+0x42/0x50 fs/io_uring.c:2599
  __fput+0x2e5/0x8d0 fs/file_table.c:278
  ____fput+0x16/0x20 fs/file_table.c:309
  task_work_run+0x14a/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x90a/0x2fa0 kernel/exit.c:876
  do_group_exit+0x135/0x370 kernel/exit.c:980
  __do_sys_exit_group kernel/exit.c:991 [inline]
  __se_sys_exit_group kernel/exit.c:989 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13 19:08:22 -06:00
Linus Torvalds
4443f8e6ac for-linus-20190412
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyw354QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiMyEAC4THUReCrTuv9oFRNg5uILVYIq51nP8dw7
 XamC7A92jPXd6vl/QVjmvLwT34/Y2XvX0t62RBsk849CEjgGYTeF1/qI3tMkpN7c
 huupab3aYM/Rrv4i1KSPQu6iIto3DYqfmREaGJJ1Ikbu/CKDuUGyEo+Z4wrKUPon
 GWnE8QMS2fdc764eVzKKqB+GryaEiHmeD1N4NnPs+nla14ysueUvJUikkTt/Laef
 h7nOmz9mrqE6u1xVHNpo0TlW0oJdLfaDIL9ghwHFJXqvriTh8Tg2tEHpXI6vSTTt
 StnPbTA1s1uhHs4rWYl8J0UXSZnRRp0Ep8jCvqEb9CJ23uHCNyGEoy/R7q+x2quf
 T+ruolMXY7IIJP30ZMHar374YfajJdw7EH/565nlbLnjSBXhqjmc07kQ7mIYvpg6
 JgureSdDwOOHpfrJgVq5es48ndt5HBYUBPzkvVGTgkeSJkMydkkM1qZeYEnai105
 8EnUFusRUnYZtb73HBPjKS7i0BZZvZlI1oKYHabiMtajqcKyvwDP2tTmhqXYLDLY
 9uloW0u2B0lddfzCb9hTYZOroNWfifo4vuSU5DHvnJoKvf4z3auDxaFD9N8fGn6S
 aZsRjMCpFqFd0YEnZPbsctgPg2Licrs02uPntlzBTJ0ByH20pX4OepYrvgQk3vao
 tOQ1jRYMKw==
 =cISy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Set of fixes that should go into this round. This pull is larger than
  I'd like at this time, but there's really no specific reason for that.
  Some are fixes for issues that went into this merge window, others are
  not. Anyway, this contains:

   - Hardware queue limiting for virtio-blk/scsi (Dongli)

   - Multi-page bvec fixes for lightnvm pblk

   - Multi-bio dio error fix (Jason)

   - Remove the cache hint from the io_uring tool side, since we didn't
     move forward with that (me)

   - Make io_uring SETUP_SQPOLL root restricted (me)

   - Fix leak of page in error handling for pc requests (Jérôme)

   - Fix BFQ regression introduced in this merge window (Paolo)

   - Fix break logic for bio segment iteration (Ming)

   - Fix NVMe cancel request error handling (Ming)

   - NVMe pull request with two fixes (Christoph):
       - fix the initial CSN for nvme-fc (James)
       - handle log page offsets properly in the target (Keith)"

* tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
  block: fix the return errno for direct IO
  nvmet: fix discover log page when offsets are used
  nvme-fc: correct csn initialization and increments on error
  block: do not leak memory in bio_copy_user_iov()
  lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
  nvme: cancel request synchronously
  blk-mq: introduce blk_mq_complete_request_sync()
  scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
  virtio-blk: limit number of hw queues by nr_cpu_ids
  block, bfq: fix use after free in bfq_bfqq_expire
  io_uring: restrict IORING_SETUP_SQPOLL to root
  tools/io_uring: remove IOCQE_FLAG_CACHEHIT
  block: don't use for-inside-for in bio_for_each_segment_all
2019-04-13 16:23:16 -07:00
Linus Torvalds
b60bc0665e NFS client bugfixes for Linux 5.1
Highlights include:
 
 Stable fixes:
 - Fix a deadlock in close() due to incorrect draining of RDMA queues
 
 Bugfixes:
 - Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
   as it is causing stack overflows
 - Fix a regression where NFSv4 getacl and fs_locations stopped working
 - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
 - Fix xfstests failures due to incorrect copy_file_range() return values
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcsfeVAAoJEA4mA3inWBJcPjAQAIPERRVWjg7xRz6CJzt2yoM1
 ApPj965DCnC9bGcGAH2U+TbCWJOi3lJwaZOPTL0ut/Tcv9PpKETRqk+rrjUcFRy1
 1b1HH16GivprOmHgCRyqo5Qj2ZiaGNpY3tJfxl/6eIiSpHKPZLa4zY+q2KfK/YNI
 SOVyNU0Gq08p4AiKr3CG5VVZGdNgRMrnzBYJqeTh1zZ7erWE2nJoE+pmvcLhZR0w
 uxshbTWbJT21KLEI+PXTyGtFkz5jNaKy4Ts07MRBJdQjDv73MUW8CcqFZicSjtqx
 zdKYa1VH9pEOjFOs57xGELSnYRdB00Vgd9/b6MqKyWH8iJzXFbgjEusMWiU45aeF
 NLg9ySSU8LeY93SxV66CHG57NIgHqwZu6P+lO3efRzuHgEGceDsz0WwDF2KNIZlm
 /vOmbk0I+woneFUeNDWAXD9/ETUJ8RCNk1/b1UlbkUL7aD5WSLDp1bKPifk/WA6E
 Mtgwmqz1Vso3cIPglWcAgsfEAYJZSJVDMfRIhm2dy7vVU0nfW12I00G8BShgr8f7
 mxAxd/V+1/Q9ftPENgC9z5LWKYQjfjksnYRHXW1m5c92Yoe9TF0yiNyDmT5hBR6w
 MvUN2j3yeQBqk6JHZxtH/mmdSRD0o5kxvFrEqMj1PpP8X8DpWupQA8SZKnHq0wlj
 8Q7LRum+wmhbiKCmZ+1F
 =vRPB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fix:

   - Fix a deadlock in close() due to incorrect draining of RDMA queues

  Bugfixes:

   - Revert "SUNRPC: Micro-optimise when the task is known not to be
     sleeping" as it is causing stack overflows

   - Fix a regression where NFSv4 getacl and fs_locations stopped
     working

   - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.

   - Fix xfstests failures due to incorrect copy_file_range() return
     values"

* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
  NFSv4.1 fix incorrect return value in copy_file_range
  xprtrdma: Fix helper that drains the transport
  NFS: Fix handling of reply page vector
  NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
2019-04-13 14:47:06 -07:00
David Howells
eeba1e9cf3 afs: Fix in-progess ops to ignore server-level callback invalidation
The in-kernel afs filesystem client counts the number of server-level
callback invalidation events (CB.InitCallBackState* RPC operations) that it
receives from the server.  This is stored in cb_s_break in various
structures, including afs_server and afs_vnode.

If an inode is examined by afs_validate(), say, the afs_server copy is
compared, along with other break counters, to those in afs_vnode, and if
one or more of the counters do not match, it is considered that the
server's callback promise is broken.  At points where this happens,
AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
refetched from the server.

afs_validate() issues an FS.FetchStatus operation to get updated metadata -
and based on the updated data_version may invalidate the pagecache too.

However, the break counters are also used to determine whether to note a
new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
and whether to cache the permit data included in the YFSFetchStatus record
by the server.


The problem comes when the server sends us a CB.InitCallBackState op.  The
first such instance doesn't cause cb_s_break to be incremented, but rather
causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
after last use and all the volumes have been automatically unmounted and
the server has forgotten about the client[*], this *will* likely cause an
increment.

 [*] There are other circumstances too, such as the server restarting or
     needing to make space in its callback table.

Note that the server won't send us a CB.InitCallBackState op until we talk
to it again.

So what happens is:

 (1) A mount for a new volume is attempted, a inode is created for the root
     vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
     immediately, as we don't have a nominated server to talk to yet - and
     we may iterate through a few to find one.

 (2) Before the operation happens, afs_fetch_status(), say, notes in the
     cursor (fc.cb_break) the break counter sum from the vnode, volume and
     server counters, but the server->cb_s_break is currently 0.

 (3) We send FS.FetchStatus to the server.  The server sends us back
     CB.InitCallBackState.  We increment server->cb_s_break.

 (4) Our FS.FetchStatus completes.  The reply includes a callback record.

 (5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
     the callback promise was broken by checking the break counter sum from
     step (2) against the current sum.

     This fails because of step (3), so we don't set the callback record
     and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.

This does not preclude the syscall from progressing, and we don't loop here
rechecking the status, but rather assume it's good enough for one round
only and will need to be rechecked next time.

 (6) afs_validate() it triggered on the vnode, probably called from
     d_revalidate() checking the parent directory.

 (7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
     update vnode->cb_s_break and assumes the vnode to be invalid.

 (8) afs_validate() needs to calls afs_fetch_status().  Go back to step (2)
     and repeat, every time the vnode is validated.

This primarily affects volume root dir vnodes.  Everything subsequent to
those inherit an already incremented cb_s_break upon mounting.


The issue is that we assume that the callback record and the cached permit
information in a reply from the server can't be trusted after getting a
server break - but this is wrong since the server makes sure things are
done in the right order, holding up our ops if necessary[*].

 [*] There is an extremely unlikely scenario where a reply from before the
     CB.InitCallBackState could get its delivery deferred till after - at
     which point we think we have a promise when we don't.  This, however,
     requires unlucky mass packet loss to one call.

AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
a server we've never contacted before, but this should be unnecessary.
It's also further insulated from the problem on an initial mount by
querying the server first with FS.GetCapabilities, which triggers the
CB.InitCallBackState.


Fix this by

 (1) Remove AFS_SERVER_FL_NEW.

 (2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
     calculation.

 (3) In afs_cb_is_broken(), don't include cb_s_break in the check.


Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13 08:37:37 +01:00
Marc Dionne
21bd68f196 afs: Unlock pages for __pagevec_release()
__pagevec_release() complains loudly if any page in the vector is still
locked.  The pages need to be locked for generic_error_remove_page(), but
that function doesn't actually unlock them.

Unlock the pages afterwards.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jonathan Billings <jsbillin@umich.edu>
2019-04-13 08:37:37 +01:00
David Howells
8022c4b95c afs: Differentiate abort due to unmarshalling from other errors
Differentiate an abort due to an unmarshalling error from an abort due to
other errors, such as ENETUNREACH.  It doesn't make sense to set abort code
RXGEN_*_UNMARSHAL in such a case, so use RX_USER_ABORT instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13 08:37:37 +01:00
Andi Kleen
d2abfa86ff afs: Avoid section confusion in CM_NAME
__tracepoint_str cannot be const because the tracepoint_str
section is not read-only. Remove the stray const.

Cc: dhowells@redhat.com
Cc: viro@zeniv.linux.org.uk
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2019-04-13 08:37:36 +01:00
Arnd Bergmann
ba25b81e3a afs: avoid deprecated get_seconds()
get_seconds() has a limited range on 32-bit architectures and is
deprecated because of that. While AFS uses the same limits for
its inode timestamps on the wire protocol, let's just use the
simpler current_time() as we do for other file systems.

This will still zero out the 'tv_nsec' field of the timestamps
internally.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13 08:37:36 +01:00
Marc Dionne
f7f1dd3162 afs: Check for rxrpc call completion in wait loop
Check the state of the rxrpc call backing an afs call in each iteration of
the call wait loop in case the rxrpc call has already been terminated at
the rxrpc layer.

Interrupt the wait loop and mark the afs call as complete if the rxrpc
layer call is complete.

There were cases where rxrpc errors were not passed up to afs, which could
result in this loop waiting forever for an afs call to transition to
AFS_CALL_COMPLETE while the rx call was already complete.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
Marc Dionne
4611da30d6 rxrpc: Make rxrpc_kernel_check_life() indicate if call completed
Make rxrpc_kernel_check_life() pass back the life counter through the
argument list and return true if the call has not yet completed.

Suggested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
Jason Yan
a89afe58f1 block: fix the return errno for direct IO
If the last bio returned is not dio->bio, the status of the bio will
not assigned to dio->bio if it is error. This will cause the whole IO
status wrong.

    ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
          <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
          <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
          <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
    ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
    ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
    ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0

We always have to assign bio->bi_status to dio->bio.bi_status because we
will only check dio->bio.bi_status when we return the whole IO to
the upper layer.

Fixes: 542ff7bf18 ("block: new direct I/O implementation")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-11 21:22:21 -06:00
Linus Torvalds
2d06b23581 for-5.1-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlyveOkACgkQxWXV+ddt
 WDvb3g//RTIy+8o1PwPihGT9z0wWaFTxJF9Ea3SDzgMVsOaWzZIM29Q0XCwGyR5s
 xdwJlNrum4eJwcLRuvvZtZ6e9h6vdgmNxi7ULem0r2ik58rvgZf3TQp/t78qMMLo
 Z8Qp/jQPHOOhwGURFsUd2TwCYQ+3yyEhzoObDDQ5OAyeCgneLe2hLNvyMMF7YNkl
 joaWY5iAwDY61UaRxggwx8OI7TkhCA/ZA27zyjc6oomCQglIM/KmdvmjHpiBs0j/
 1Ij4SDmSo9nqGES/SfubW/l3fpg42hoQWBMuI3/WLr3CBKN08wuRe+BKoDmVyoex
 eVTy3+AnBp6KsjyOmN9h5Am+r1lyToJ1ZpsjkKQzo5SRlYC/SA0oFlxhHKygoft6
 tEEJf+hbySiod/ZX0KItS6Myo1xsHsX8LnidAO+7pK0S5e5D59QPXtw6oJIp0JX/
 kAKrng6bX3+7bSrF8h62nKqpFq/NUTjF8zxB6gwMwnrtroU36r8AFE4+bcyX5Z5g
 +JoJcZ9VFsFcA4GCTzYWTYQ2RfCU6Vnbvh4wTLHiR+IDvbkNFxMUE4z5O+fwqhkl
 6nv8G2EgK3ORZS/4mNAjmanB71gPwbovhTQhju8LQGEmlmwBrF1mQ+YhrBlMrMv9
 XspCzNktqzbGj70tMPbPSB2A5H9oi+5Oabzq2MPFUVBkA9ztpSA=
 =fYo9
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix parsing of compression algorithm when set as a inode property,
   this could end up with eg. 'zst' or 'zli' in the value

 - don't allow trim on a filesystem with unreplayed log, this could
   cause data loss if there are pending updates to the block groups that
   would not be subject to trim after replay

* tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: prop: fix vanished compression property after failed set
  btrfs: prop: fix zstd compression parameter validation
  Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
2019-04-11 14:19:02 -07:00
Olga Kornievskaia
0769663b4f NFSv4.1 fix incorrect return value in copy_file_range
According to the NFSv4.2 spec if the input and output file is the
same file, operation should fail with EINVAL. However, linux
copy_file_range() system call has no such restrictions. Therefore,
in such case let's return EOPNOTSUPP and allow VFS to fallback
to doing do_splice_direct(). Also when copy_file_range is called
on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support
COPY functionality), we also need to return EOPNOTSUPP and
fallback to a regular copy.

Fixes xfstest generic/075, generic/091, generic/112, generic/263
for all NFSv4.x versions.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-04-11 15:23:48 -04:00
Chuck Lever
29e7ca715f NFS: Fix handling of reply page vector
NFSv4 GETACL and FS_LOCATIONS requests stopped working in v5.1-rc.

These two need the extra padding to be added directly to the reply
length.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-04-11 15:23:48 -04:00
Tetsuo Handa
7c2bd9a398 NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
syzbot is reporting uninitialized value at rpc_sockaddr2uaddr() [1]. This
is because syzbot is setting AF_INET6 to "struct sockaddr_in"->sin_family
(which is embedded into user-visible "struct nfs_mount_data" structure)
despite nfs23_validate_mount_data() cannot pass sizeof(struct sockaddr_in6)
bytes of AF_INET6 address to rpc_sockaddr2uaddr().

Since "struct nfs_mount_data" structure is user-visible, we can't change
"struct nfs_mount_data" to use "struct sockaddr_storage". Therefore,
assuming that everybody is using AF_INET family when passing address via
"struct nfs_mount_data"->addr, reject if its sin_family is not AF_INET.

[1] https://syzkaller.appspot.com/bug?id=599993614e7cbbf66bc2656a919ab2a95fb5d75c

Reported-by: syzbot <syzbot+047a11c361b872896a4f@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-04-11 15:23:48 -04:00
Al Viro
ad7999cd70 Merge branch 'fixes' into work.icache 2019-04-10 14:12:44 -04:00
Linus Torvalds
972acfb494 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc fixes from Al Viro:
 "A few regression fixes from this cycle"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: use kmem_cache_free() instead of kfree()
  iov_iter: Fix build error without CONFIG_CRYPTO
  aio: Fix an error code in __io_submit_one()
2019-04-09 16:20:59 -10:00
Al Viro
ce285c267a autofs: fix use-after-free in lockless ->d_manage()
autofs_d_release() can overlap with lockless ->d_manage(),
ending up with autofs_dentry_ino() freed under the latter.
Make freeing autofs_info instances RCU-delayed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-09 19:18:19 -04:00