Commit Graph

64612 Commits

Author SHA1 Message Date
Qu Wenruo
3be4d8efe3 btrfs: block-group: rename write_one_cache_group()
The name of this function contains the word "cache", which is left from
the times where btrfs_block_group was called btrfs_block_group_cache.

Now this "cache" doesn't match anything, and we have better namings for
functions like read/insert/remove_block_group_item().

Rename it to update_block_group_item().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:31 +02:00
Qu Wenruo
97f4728af8 btrfs: block-group: refactor how we insert a block group item
Currently the block group item insert is pretty straight forward, fill
the block group item structure and insert it into extent tree.

However the incoming skinny block group feature is going to change this,
so this patch will refactor insertion into a new function,
insert_block_group_item(), to make the incoming feature easier to add.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:31 +02:00
Qu Wenruo
7357623a7f btrfs: block-group: refactor how we delete one block group item
When deleting a block group item, it's pretty straight forward, just
delete the item pointed by the key.  However it will not be that
straight-forward for incoming skinny block group item.

So refactor the block group item deletion into a new function,
remove_block_group_item(), also to make the already lengthy
btrfs_remove_block_group() a little shorter.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:31 +02:00
Qu Wenruo
9afc66498a btrfs: block-group: refactor how we read one block group item
Structure btrfs_block_group has the following members which are
currently read from on-disk block group item and key:

- length - from item key
- used
- flags - from block group item

However for incoming skinny block group tree, we are going to read those
members from different sources.

This patch will refactor such read by:

- Don't initialize btrfs_block_group::length at allocation
  Caller should initialize them manually.
  Also to avoid possible (well, only two callers) missing
  initialization, add extra ASSERT() in btrfs_add_block_group_cache().

- Refactor length/used/flags initialization into one function
  The new function, fill_one_block_group() will handle the
  initialization of such members.

- Use btrfs_block_group::length to replace key::offset
  Since skinny block group item would have a different meaning for its
  key offset.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:30 +02:00
Qu Wenruo
83fe9e12b0 btrfs: block-group: don't set the wrong READA flag for btrfs_read_block_groups()
Regular block group items in extent tree are scattered inside the huge
tree, thus forward readahead makes no sense.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:30 +02:00
Marcos Paulo de Souza
89efda52e6 btrfs: send: emit file capabilities after chown
Whenever a chown is executed, all capabilities of the file being touched
are lost.  When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:

  $ mount /dev/sda fs1
  $ mount /dev/sdb fs2

  $ touch fs1/foo.bar
  $ setcap cap_sys_nice+ep fs1/foo.bar
  $ btrfs subvolume snapshot -r fs1 fs1/snap_init
  $ btrfs send fs1/snap_init | btrfs receive fs2

  $ chgrp adm fs1/foo.bar
  $ setcap cap_sys_nice+ep fs1/foo.bar

  $ btrfs subvolume snapshot -r fs1 fs1/snap_complete
  $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental

  $ btrfs send fs1/snap_complete | btrfs receive fs2
  $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2

At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar

To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.

This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.

Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:30 +02:00
Filipe Manana
89490303a4 btrfs: scrub, only lookup for csums if we are dealing with a data extent
When scrubbing a stripe, whenever we find an extent we lookup for its
checksums in the checksum tree. However we do it even for metadata extents
which don't have checksum items stored in the checksum tree, that is
only for data extents.

So make the lookup for checksums only if we are processing with a data
extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:30 +02:00
Filipe Manana
684b752b09 btrfs: move the block group freeze/unfreeze helpers into block-group.c
The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
used to be named btrfs_get_block_group_trimming() and
btrfs_put_block_group_trimming() respectively.

At the time they were added to free-space-cache.c, by commit e33e17ee10
("btrfs: add missing discards when unpinning extents with -o discard")
because all the trimming related functions were in free-space-cache.c.

Now that the helpers were renamed and are used in scrub context as well,
move them to block-group.c, a much more logical location for them.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:30 +02:00
Filipe Manana
6b7304af62 btrfs: rename member 'trimming' of block group to a more generic name
Back in 2014, commit 04216820fe ("Btrfs: fix race between fs trimming
and block group remove/allocation"), I added the 'trimming' member to the
block group structure. Its purpose was to prevent races between trimming
and block group deletion/allocation by pinning the block group in a way
that prevents its logical address and device extents from being reused
while trimming is in progress for a block group, so that if another task
deletes the block group and then another task allocates a new block group
that gets the same logical address and device extents while the trimming
task is still in progress.

After the previous fix for scrub (patch "btrfs: fix a race between scrub
and block group removal/allocation"), scrub now also has the same needs that
trimming has, so the member name 'trimming' no longer makes sense.
Since there is already a 'pinned' member in the block group that refers
to space reservations (pinned bytes), rename the member to 'frozen',
add a comment on top of it to describe its general purpose and rename
the helpers to increment and decrement the counter as well, to match
the new member name.

The next patch in the series will move the helpers into a more suitable
file (from free-space-cache.c to block-group.c).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:29 +02:00
Filipe Manana
2473d24f2b btrfs: fix a race between scrub and block group removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.

Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:

1) Balance sets block group X to read-only mode and starts relocating it.
   Block group X is a metadata block group, has a raid1 profile (two
   device extents, each one in a different device) and a logical address
   of 19424870400;

2) Scrub is running and finds device extent E, which belongs to block
   group X. It enters scrub_stripe() to find all extents allocated to
   block group X, the search is done using the extent tree;

3) Balance finishes relocating block group X and removes block group X;

4) Balance starts relocating another block group and when trying to
   commit the current transaction as part of the preparation step
   (prepare_to_relocate()), it blocks because scrub is running;

5) The scrub task finds the metadata extent at the logical address
   19425001472 and marks the pages of the extent to be read by a bio
   (struct scrub_bio). The extent item's flags, which have the bit
   BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
   scrub_page). It is these flags in the scrub pages that tells the
   bio's end io function (scrub_bio_end_io_worker) which type of extent
   it is dealing with. At this point we end up with 4 pages in a bio
   which is ready for submission (the metadata extent has a size of
   16Kb, so that gives 4 pages on x86);

6) At the next iteration of scrub_stripe(), scrub checks that there is a
   pause request from the relocation task trying to commit a transaction,
   therefore it submits the pending bio and pauses, waiting for the
   transaction commit to complete before resuming;

7) The relocation task commits the transaction. The device extent E, that
   was used by our block group X, is now available for allocation, since
   the commit root for the device tree was swapped by the transaction
   commit;

8) Another task doing a direct IO write allocates a new data block group Y
   which ends using device extent E. This new block group Y also ends up
   getting the same logical address that block group X had: 19424870400.
   This happens because block group X was the block group with the highest
   logical address and, when allocating Y, find_next_chunk() returns the
   end offset of the current last block group to be used as the logical
   address for the new block group, which is

        18351128576 + 1073741824 = 19424870400

   So our new block group Y has the same logical address and device extent
   that block group X had. However Y is a data block group, while X was
   a metadata one, and Y has a raid0 profile, while X had a raid1 profile;

9) After allocating block group Y, the direct IO submits a bio to write
   to device extent E;

10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
    extent E, which now correspond to the data written by the task that
    did a direct IO write. Then at the end io function associated with
    the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
    which calls scrub_checksum(). This later function checks the flags
    of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
    is set in the flags, so it assumes it has a metadata extent and
    then calls scrub_checksum_tree_block(). That functions returns an
    error, since interpreting data as a metadata extent causes the
    checksum verification to fail.

    So this makes scrub_checksum() call scrub_handle_errored_block(),
    which determines 'failed_mirror_index' to be 1, since the device
    extent E was allocated as the second mirror of block group X.

    It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
    'sblocks_for_recheck', and all the memory is initialized to zeroes by
    kcalloc().

    After that it calls scrub_setup_recheck_block(), which is responsible
    for filling each of those structures. However, when that function
    calls btrfs_map_sblock() against the logical address of the metadata
    extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
    the current block group Y. However block group Y has a raid0 profile
    and not a raid1 profile like X had, so the following call returns 1:

       scrub_nr_raid_mirrors(bbio)

    And as a result scrub_setup_recheck_block() only initializes the
    first (index 0) scrub_block structure in 'sblocks_for_recheck'.

    Then scrub_recheck_block() is called by scrub_handle_errored_block()
    with the second (index 1) scrub_block structure as the argument,
    because 'failed_mirror_index' was previously set to 1.
    This scrub_block was not initialized by scrub_setup_recheck_block(),
    so it has zero pages, its 'page_count' member is 0 and its 'pagev'
    page array has all members pointing to NULL.

    Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
    we have a NULL pointer dereference when accessing the flags of the first
    page, as pavev[0] is NULL:

    static void scrub_recheck_block_checksum(struct scrub_block *sblock)
    {
        (...)
        if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
            scrub_checksum_data(sblock);
        (...)
    }

    Producing a stack trace like the following:

    [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
    [542998.010238] #PF: supervisor read access in kernel mode
    [542998.010878] #PF: error_code(0x0000) - not-present page
    [542998.011516] PGD 0 P4D 0
    [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G    B   W         5.6.0-rc7-btrfs-next-58 #1
    [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
    [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
    [542998.018474] Code: 4c 89 e6 ...
    [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
    [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
    [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
    [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
    [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
    [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
    [542998.027237] FS:  0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
    [542998.028327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
    [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [542998.032380] Call Trace:
    [542998.032752]  scrub_recheck_block+0x162/0x400 [btrfs]
    [542998.033500]  ? __alloc_pages_nodemask+0x31e/0x460
    [542998.034228]  scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
    [542998.035170]  scrub_bio_end_io_worker+0x100/0x520 [btrfs]
    [542998.035991]  btrfs_work_helper+0xaa/0x720 [btrfs]
    [542998.036735]  process_one_work+0x26d/0x6a0
    [542998.037275]  worker_thread+0x4f/0x3e0
    [542998.037740]  ? process_one_work+0x6a0/0x6a0
    [542998.038378]  kthread+0x103/0x140
    [542998.038789]  ? kthread_create_worker_on_cpu+0x70/0x70
    [542998.039419]  ret_from_fork+0x3a/0x50
    [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
    [542998.047288] CR2: 0000000000000028
    [542998.047724] ---[ end trace bde186e176c7f96a ]---

This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).

Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.

A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:29 +02:00
David Sterba
31344b2fce btrfs: remove more obsolete v0 extent ref declarations
The extent references v0 have been superseded long time go, there are
some unused declarations of access helpers. We can safely remove them
now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct
btrfs_extent_item_v0 is still part of a backward compatibility check in
relocation.c and thus not removed.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:29 +02:00
YueHaibing
943aeb0dae btrfs: remove unused function btrfs_dev_extent_chunk_tree_uuid
There's no callers in-tree anymore since
commit d24ee97b96 ("btrfs: use new helpers to set uuids in eb")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:29 +02:00
Qu Wenruo
cbab8ade58 btrfs: qgroup: mark qgroup inconsistent if we're inherting snapshot to a new qgroup
[BUG]
For the following operation, qgroup is guaranteed to be screwed up due
to snapshot adding to a new qgroup:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt
  # btrfs qgroup en $mnt
  # btrfs subv create $mnt/src
  # xfs_io -f -c "pwrite 0 1m" $mnt/src/file
  # sync
  # btrfs qgroup create 1/0 $mnt/src
  # btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot
  # btrfs qgroup show -prce $mnt/src
  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5          16.00KiB     16.00KiB         none         none ---     ---
  0/257         1.02MiB     16.00KiB         none         none ---     ---
  0/258         1.02MiB     16.00KiB         none         none 1/0     ---
  1/0             0.00B        0.00B         none         none ---     0/258
	        ^^^^^^^^^^^^^^^^^^^^

[CAUSE]
The problem is in btrfs_qgroup_inherit(), we don't have good enough
check to determine if the new relation would break the existing
accounting.

Unlike btrfs_add_qgroup_relation(), which has proper check to determine
if we can do quick update without a rescan, in btrfs_qgroup_inherit() we
can even assign a snapshot to multiple qgroups.

[FIX]
Fix it by manually marking qgroup inconsistent for snapshot inheritance.

For subvolume creation, since all its extents are exclusively owned, we
don't need to rescan.

In theory, we should call relation check like quick_update_accounting()
when doing qgroup inheritance and inform user about qgroup accounting
inconsistency.

But we don't have good mechanism to relay that back to the user in the
snapshot creation context, thus we can only silently mark the qgroup
inconsistent.

Anyway, user shouldn't use qgroup inheritance during snapshot creation,
and should add qgroup relationship after snapshot creation by 'btrfs
qgroup assign', which has a much better UI to inform user about qgroup
inconsistent and kick in rescan automatically.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:29 +02:00
Robbie Ko
a619b3c7ab btrfs: speedup dead root detection during orphan cleanup
When mounting, we handle deleted subvolume and orphan items.  First,
find add orphan roots, then add them to fs_root radix tree.  Second, in
tree-root, process each orphan item, skip if it is dead root.

The original algorithm is based on the list of dead_roots, one by one to
visit and check whether the objectid is consistent, the time complexity
is O (n ^ 2).  When processing 50000 deleted subvols, it takes about
120s.

Because btrfs_find_orphan_roots has already ran before us, and added
deleted subvol to fs_roots radix tree.

The fs root will only be removed from the fs_roots radix tree after the
cleaner process is started, and the cleaner will only start execution
after the mount is complete.

btrfs_orphan_cleanup can be called during the whole filesystem mount
lifetime, but only "tree root" will be used in this section of code, and
only mount time will be brought into tree root.

So we can quickly check whether the orphan item is dead root through the
fs_roots radix tree.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:29 +02:00
YueHaibing
eec5b6e097 btrfs: remove unused function heads_to_leaves
There's no callers in-tree anymore since commit 64403612b7 ("btrfs:
rework btrfs_check_space_for_delayed_refs")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:28 +02:00
David Sterba
fb8521caa8 btrfs: add more codes to decoder table
I've grepped logs for 'errno=.*unknown' and found -95, -117 and -122,
now added to the table. The wording is adjusted so it makes sense in
context of filesystem.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:28 +02:00
David Sterba
d54f814434 btrfs: sort error decoder entries
Add the raw errnos and sort them accordingly.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:28 +02:00
Anand Jain
7f551d9690 btrfs: free alien device after device add
When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.

By having an alien device in fs_devices, we have two issues so far

1. missing device does not not show as missing in the userland

2. degraded mount will fail

Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).

A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.

And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.

This patch fixes both issues above by removing the alien device.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:28 +02:00
Anand Jain
998a067196 btrfs: include non-missing as a qualifier for the latest_bdev
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number.  For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.

But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device  generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].

[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid

Consider a working filesystem with raid1:

  $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
  $ mount /dev/sda /mnt-raid1
  $ umount /mnt-raid1

While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt-single
  $ btrfs dev add -f /dev/sda /mnt-single

Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.

  $ mount -o degraded /dev/sdb /mnt-raid1

[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

  kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
  kernel: BTRFS error (device sdb): failed to read devices
  kernel: BTRFS error (device sdb): open_ctree failed

Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:28 +02:00
Eric Biggers
fd08001f17 btrfs: use crypto_shash_digest() instead of open coding
Use crypto_shash_digest() instead of crypto_shash_init() +
crypto_shash_update() + crypto_shash_final().  This is more efficient.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Anand Jain
1ed802c972 btrfs: drop useless goto in open_fs_devices
There is no need of goto out in open_fs_devices() as there is nothing
special done there.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Filipe Manana
0bc2d3c08e btrfs: remove useless check for copy_items() return value
At btrfs_log_prealloc_extents() we are checking if copy_items() returns a
value greater than 0. That used to happen in the past to signal the caller
that the path given to it was released and reused for other searches, but
as of commit 0e56315ca1 ("Btrfs: fix missing hole after hole punching
and fsync when using NO_HOLES"), the copy_items() function does not have
that behaviour anymore and always returns 0 or a negative value. So just
remove that check at btrfs_log_prealloc_extents(), which the previously
mentioned commit forgot to remove.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Omar Sandoval
77d5d68931 btrfs: unify buffered and direct I/O read repair
Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Omar Sandoval
5c047a699a btrfs: get rid of endio_repair_workers
This was originally added in commit 8b110e393c ("Btrfs: implement
repair function when direct read fails") to avoid a deadlock. In that
commit, the direct I/O read endio executes on the endio_workers
workqueue, submits a repair bio, and waits for it to complete. The
repair bio endio must execute on a different workqueue, otherwise it
could block on the endio_workers workqueue becoming available, which
won't happen because the original endio is blocked on the repair bio.

As of the previous commit, the original endio doesn't wait for the
repair bio, so this separate workqueue is unnecessary.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Omar Sandoval
fd9d6670ed btrfs: simplify direct I/O read repair
Direct I/O read repair was originally implemented in commit 8b110e393c
("Btrfs: implement repair function when direct read fails"). This
implementation is unnecessarily complicated. There is major code
duplication between __btrfs_subio_endio_read() (checks checksums and
handles I/O errors for files with checksums),
__btrfs_correct_data_nocsum() (handles I/O errors for files without
checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
for retries of files with checksums), and btrfs_retry_endio_nocsum()
(handles I/O errors for retries of files without checksum). If it sounds
like these should be one function, that's because they should.
Additionally, these functions are very hard to follow due to their
excessive use of goto.

This commit replaces the original implementation. After the previous
commit getting rid of orig_bio, we can reuse the same endio callback for
repair I/O and the original I/O, we just need to track the file offset
and original iterator in the repair bio. We can also unify the handling
of files with and without checksums and simplify the control flow. We
also no longer have to wait for each repair I/O to complete one by one.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:26 +02:00
Omar Sandoval
769b4f2497 btrfs: get rid of one layer of bios in direct I/O
In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
path:

1. The bio created by the generic direct I/O code (dio_bio).
2. A clone of dio_bio we create in btrfs_submit_direct() to represent
   the entire direct I/O range (orig_bio).
3. A partial clone of orig_bio limited to the size of a RAID stripe that
   we create in btrfs_submit_direct_hook().
4. Clones of each of those split bios for each RAID stripe that we
   create in btrfs_map_bio().

As of the previous commit, the second layer (orig_bio) is no longer
needed for anything: we can split dio_bio instead, and complete dio_bio
directly when all of the cloned bios complete. This lets us clean up a
bunch of cruft, including dip->subio_endio and dip->errors (we can use
dio_bio->bi_status instead). It also enables the next big cleanup of
direct I/O read repair.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:26 +02:00
Omar Sandoval
85879573fc btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
The next commit will get rid of btrfs_dio_private->orig_bio. The only
thing we really need it for is containing all of the checksums, but we
can easily put the checksum array in btrfs_dio_private and have the
submitted bios reference the array. We can also look the checksums up
while we're setting up instead of the current awkward logic that looks
them up for orig_bio when the first split bio is submitted.

(Interestingly, btrfs_dio_private did contain the
checksums before commit 23ea8e5a07 ("Btrfs: load checksum data once
when submitting a direct read io"), but it didn't look them up up
front.)

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:26 +02:00
Omar Sandoval
e3b318d14d btrfs: convert btrfs_dio_private->pending_bios to refcount_t
This is really a reference count now, so convert it to refcount_t and
rename it to refs.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:26 +02:00
Omar Sandoval
2390a6daf9 btrfs: remove unused btrfs_dio_private::private
We haven't used this since commit 9be3395bcd ("Btrfs: use a btrfs
bioset instead of abusing bio internals").

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:26 +02:00
Omar Sandoval
ce06d3ec2b btrfs: make btrfs_check_repairable() static
Since its introduction in commit 2fe6303e7c ("Btrfs: split
bio_readpage_error into several functions"), btrfs_check_repairable()
has only been used from extent_io.c where it is defined.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
47df7765a8 btrfs: rename __readpage_endio_check to check_data_csum
__readpage_endio_check() is also used from the direct I/O read code, so
give it a more descriptive name.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
fb30f4707d btrfs: clarify btrfs_lookup_bio_sums documentation
Fix a couple of issues in the btrfs_lookup_bio_sums documentation:

* The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
  the declaration in the code to make that clear, too.
* dst must be large enough to hold nblocks * csum_size, not just
  csum_size.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
f337bd7478 btrfs: don't do repair validation for checksum errors
The purpose of the validation step is to distinguish between good and
bad sectors in a failed multi-sector read. If a multi-sector read
succeeded but some of those sectors had checksum errors, we don't need
to validate anything; we know the sectors with bad checksums need to be
repaired.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
c7333972b9 btrfs: look at full bi_io_vec for repair decision
Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.

This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:

	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
	if (failed_bio_pages > 1)
		do validation

However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.

Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).

Fixes: 8a2ee44a37 ("btrfs: look at bi_size for repair decisions")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
c36cac28cb btrfs: fix double __endio_write_update_ordered in direct I/O
In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
we complete the ordered extent range. However, we don't mark that the
range doesn't need to be cleaned up from btrfs_direct_IO() until later.
Therefore, if we fail to allocate the btrfs_dio_private, we complete the
ordered extent range twice. We could fix this by updating
unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
that creating the btrfs_dio_private and submitting the bios are
separate, and once the btrfs_dio_private is created, cleanup always
happens through the btrfs_dio_private.

The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
is really subtle. We have the following:

  1. btrfs_direct_IO sets those two to the same value.

  2. When we call __blockdev_direct_IO unless
     btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
     modify unsubmitted_oe_range_start so that start < end. Cleanup
     won't happen.

  3. We come into btrfs_submit_direct - if it dip allocation fails we'd
     return with oe_range_end now modified so cleanup will happen.

  4. If we manage to allocate the dip we reset the unsubmitted range
     members to be equal so that cleanup happens from
     btrfs_endio_direct_write.

This 4-step logic is not really obvious, especially given it's scattered
across 3 functions.

Fixes: f28a492878 ("Btrfs: fix leaking of ordered extents after direct IO write error")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
[ add range start/end logic explanation from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
6d3113a193 btrfs: fix error handling when submitting direct I/O bio
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.

Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.

After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.

Fixes: e65e153554 ("btrfs: fix panic caused by direct IO")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:24 +02:00
Filipe Manana
534cf531cc btrfs: simplify error handling of clean_pinned_extents()
At clean_pinned_extents(), whether we end up returning success or failure,
we pretty much have to do the same things:

1) unlock unused_bg_unpin_mutex
2) decrement reference count on the previous transaction

We also call btrfs_dec_block_group_ro() in case of failure, but that is
better done in its caller, btrfs_delete_unused_bgs(), since its the
caller that calls inc_block_group_ro(), so it should be responsible for
the decrement operation, as it is in case any of the other functions it
calls fail.

So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents()
into  btrfs_delete_unused_bgs() and unify the error and success return
paths for clean_pinned_extents(), reducing duplicated code and making it
simpler.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:24 +02:00
Qu Wenruo
e3b8336117 btrfs: remove the redundant parameter level in btrfs_bin_search()
All callers pass the eb::level so we can get read it directly inside the
btrfs_bin_search and key_search.

This is inspired by the work of Marek in U-boot.

CC: Marek Behun <marek.behun@nic.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:24 +02:00
Nikolay Borisov
b335eab890 btrfs: make btrfs_read_disk_super return struct btrfs_disk_super
Instead of returning both the page and the super block structure, make
btrfs_read_disk_super just return a pointer to struct btrfs_disk_super.
As a result the function signature is simplified. Also,
read_cache_page_gfp can never return NULL so check its return value only
for IS_ERR.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:24 +02:00
Nikolay Borisov
a7571232b2 btrfs: use list_for_each_entry_safe in free_reloc_roots
The function always works on a local copy of the reloc root list, which
cannot be modified outside of it so using list_for_each_entry is fine.
Additionally the macro handles empty lists so drop list_empty checks of
callers. No semantic changes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:23 +02:00
David Sterba
7c09c03091 btrfs: don't force read-only after error in drop snapshot
Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
forced read-only. This is not a transaction abort and the filesystem is
otherwise ok, so the error should be just propagated to the callers.

This is caused by unnecessary call to btrfs_handle_fs_error for all
errors, except EAGAIN. This does not make sense as the standard
transaction abort mechanism is in btrfs_drop_snapshot so all relevant
failures are handled.

Originally in commit cb1b69f450 ("Btrfs: forced readonly when
btrfs_drop_snapshot() fails") there was no return value at all, so the
btrfs_std_error made some sense but once the error handling and
propagation has been implemented we don't need it anymore.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:23 +02:00
Filipe Manana
2d9faa5a8a btrfs: remove pointless assertion on reclaim_size counter
The reclaim_size counter of a space_info object is unsigned. So its value
can never be negative, it's pointless to have an assertion that checks
its value is >= 0, therefore remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:23 +02:00
Zheng Wei
72f4f078de btrfs: tree-checker: remove duplicate definition of 'inode_item_err'
Remove the duplicate definition of 'inode_item_err' in the file
tree-checker.c that got there by accident in c23c77b097 ("btrfs:
tree-checker: Refactor inode key check into seperate function").

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Zheng Wei <wei.zheng@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:23 +02:00
Josef Bacik
9c343784c4 btrfs: force chunk allocation if our global rsv is larger than metadata
Nikolay noticed a bunch of test failures with my global rsv steal
patches.  At first he thought they were introduced by them, but they've
been failing for a while with 64k nodes.

The problem is with 64k nodes we have a global reserve that calculates
out to 13MiB on a freshly made file system, which only has 8MiB of
metadata space.  Because of changes I previously made we no longer
account for the global reserve in the overcommit logic, which means we
correctly allow overcommit to happen even though we are already
overcommitted.

However in some corner cases, for example btrfs/170, we will allocate
the entire file system up with data chunks before we have enough space
pressure to allocate a metadata chunk.  Then once the fs is full we
ENOSPC out because we cannot overcommit and the global reserve is taking
up all of the available space.

The most ideal way to deal with this is to change our space reservation
stuff to take into account the height of the tree's that we're
modifying, so that our global reserve calculation does not end up so
obscenely large.

However that is a huge undertaking.  Instead fix this by forcing a chunk
allocation if the global reserve is larger than the total metadata
space.  This gives us essentially the same behavior that happened
before, we get a chunk allocated and these tests can pass.

This is meant to be a stop-gap measure until we can tackle the "tree
height only" project.

Fixes: 0096420adb ("btrfs: do not account global reserve in can_overcommit")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:23 +02:00
Josef Bacik
42a72cb753 btrfs: run btrfs_try_granting_tickets if a priority ticket fails
With normal tickets we could have a large reservation at the front of
the list that is unable to be satisfied, but a smaller ticket later on
that can be satisfied.  The way we handle this is to run
btrfs_try_granting_tickets() in maybe_fail_all_tickets().

However no such protection exists for priority tickets.  Fix this by
handling it in handle_reserve_ticket().  If we've returned after
attempting to flush space in a priority related way, we'll still be on
the priority list and need to be removed.

We rely on the flushing to free up space and wake the ticket, but if
there is not enough space to reclaim _but_ there's enough space in the
space_info to handle subsequent reservations then we would have gotten
an ENOSPC erroneously.

Address this by catching where we are still on the list, meaning we were
a priority ticket, and removing ourselves and then running
btrfs_try_granting_tickets().  This will handle this particular corner
case.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:23 +02:00
Josef Bacik
666daa9f97 btrfs: only check priority tickets for priority flushing
In debugging a generic/320 failure on ppc64, Nikolay noticed that
sometimes we'd ENOSPC out with plenty of space to reclaim if we had
committed the transaction.  He further discovered that this was because
there was a priority ticket that was small enough to fit in the free
space currently in the space_info.

Consider the following scenario.  There is no more space to reclaim in
the fs without committing the transaction.  Assume there's 1MiB of space
free in the space info, but there are pending normal tickets with 2MiB
reservations.

Now a priority ticket comes in with a .5MiB reservation.  Because we
have normal tickets pending we add ourselves to the priority list,
despite the fact that we could satisfy this reservation.

The flushing machinery now gets to the point where it wants to commit
the transaction, but because there's a .5MiB ticket on the priority list
and we have 1MiB of free space we assume the ticket will be granted
soon, so we bail without committing the transaction.

Meanwhile the priority flushing does not commit the transaction, and
eventually fails with an ENOSPC.  Then all other tickets are failed with
ENOSPC because we were never able to actually commit the transaction.

The fix for this is we should have simply granted the priority flusher
his reservation, because there was space to make the reservation.
Priority flushers by definition take priority, so they are allowed to
make their reservations before any previous normal tickets.  By not
adding this priority ticket to the list the normal flushing mechanisms
will then commit the transaction and everything will continue normally.

We still need to serialize ourselves with other priority tickets, so if
there are any tickets on the priority list then we need to add ourselves
to that list in order to maintain the serialization between priority
tickets.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:22 +02:00
Josef Bacik
bb4f58a747 btrfs: account for trans_block_rsv in may_commit_transaction
On ppc64le with 64k page size (respectively 64k block size) generic/320
was failing and debug output showed we were getting a premature ENOSPC
with a bunch of space in btrfs_fs_info::trans_block_rsv.

This meant there were still open transaction handles holding space, yet
the flusher didn't commit the transaction because it deemed the freed
space won't be enough to satisfy the current reserve ticket. Fix this
by accounting for space in trans_block_rsv when deciding whether the
current transaction should be committed or not.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:22 +02:00
Josef Bacik
e6549c2aab btrfs: allow to use up to 90% of the global block rsv for unlink
We previously had a limit of stealing 50% of the global reserve for
unlink.  This was from a time when the global reserve was used for the
delayed refs as well.  However now those reservations are kept separate,
so the global reserve can be depleted much more to allow us to make
progress for space restoring operations like unlink.  Change the minimum
amount of space required to be left in the global reserve to 10%.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:22 +02:00
Josef Bacik
7f9fe61440 btrfs: improve global reserve stealing logic
For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start an
ordinary transaction and if it fails it will fall back to reserving the
required amount by stealing from the global reserve. This is problematic
because of all the same reasons we had with previous iterations of the
ENOSPC handling, thundering herd.  We get a bunch of failures all at
once, everybody tries to allocate from the global reserve, some win and
some lose, we get an ENSOPC.

Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. To fix this we need to integrate this
logic into the normal ENOSPC infrastructure.  We still go through all of
the normal flushing work, and at the moment we begin to fail all the
tickets we try to satisfy any tickets that are allowed to steal by
stealing from the global reserve.  If this works we start the flushing
system over again just like we would with a normal ticket satisfaction.
This serializes our global reserve stealing, so we don't have the
thundering herd problem.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:22 +02:00
Qu Wenruo
876de781b0 btrfs: backref: distinguish reloc and non-reloc use of indirect resolution
For relocation tree detection, relocation backref cache uses
btrfs_should_ignore_reloc_root() which uses relocation-specific checks
like checking the DEAD_RELOC_ROOT bit.

However for general purpose backref cache, we can rely on that check, as
it's possible that relocation is also running.

For generic purposed backref cache, we detect reloc root by
SHARED_BLOCK_REF item.  Only reloc root node has its parent bytenr
pointing back to itself.

And in that case, backref cache will mark the reloc root node useless,
dropping any child orphan nodes.

So only call btrfs_should_ignore_reloc_root() if the backref cache is
for relocation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:22 +02:00
Qu Wenruo
1b23ea180b btrfs: reloc: move error handling of build_backref_tree() to backref.c
The error cleanup will be extracted as a new function,
btrfs_backref_error_cleanup(), and moved to backref.c and exported for
later usage.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:21 +02:00
Qu Wenruo
fc997ed05a btrfs: backref: rename and move finish_upper_links()
This the the 2nd major part of generic backref cache. Move it to
backref.c so we can reuse it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:21 +02:00
Qu Wenruo
1b60d2ec98 btrfs: backref: rename and move handle_one_tree_block()
This function is the major part of backref cache build process, move it
to backref.c so we can reuse it later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:21 +02:00
Qu Wenruo
d36e7f0e8f btrfs: reloc: open code read_fs_root() for handle_indirect_tree_backref()
The backref code is going to be moved to backref.c, and read_fs_root()
is just a simple wrapper, open-code it to prepare to the incoming code
move.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:21 +02:00
Qu Wenruo
55465730bc btrfs: backref: rename and move should_ignore_root()
This function is mostly single purpose to relocation backref cache, but
since we're moving the main part of backref cache to backref.c, we need
to export such function.

And to avoid confusion, rename the function to
btrfs_should_ignore_reloc_root() make the name a little more clear.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:21 +02:00
Qu Wenruo
982c92cbd5 btrfs: backref: rename and move backref_tree_panic()
Also change the parameter, since all callers can easily grab an fs_info,
there is no need for all the pointer chasing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:21 +02:00
Qu Wenruo
13fe1bdb22 btrfs: backref: rename and move backref_cache_cleanup()
Since we're releasing all existing nodes/edges, other than cleanup the
mess after error, "release" is a more proper naming here.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:20 +02:00
Qu Wenruo
023acb07bc btrfs: backref: rename and move remove_backref_node()
Also add comment explaining the cleanup progress, to differ it from
btrfs_backref_drop_node().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:20 +02:00
Qu Wenruo
b0fe7078d6 btrfs: backref: rename and move drop_backref_node()
With extra comment for drop_backref_node() as it has some similarity
with remove_backref_node(), thus we need extra comment explaining the
difference.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:20 +02:00
Qu Wenruo
741188d3a5 btrfs: backref: rename and move free_backref_(node|edge)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:20 +02:00
Qu Wenruo
f39911e552 btrfs: backref: rename and move link_backref_edge()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:20 +02:00
Qu Wenruo
47254d07f3 btrfs: backref: rename and move alloc_backref_edge()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:19 +02:00
Qu Wenruo
b1818dab9b btrfs: backref: rename and move alloc_backref_node()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:19 +02:00
Qu Wenruo
584fb12187 btrfs: backref: rename and move backref_cache_init()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:19 +02:00
Qu Wenruo
e9a28dc52a btrfs: rename tree_entry to rb_simple_node and export it
Structure tree_entry provides a very simple rb_tree which only uses
bytenr as search index.

That tree_entry is used in 3 structures: backref_node, mapping_node and
tree_block.

Since we're going to make backref_node independnt from relocation, it's
a good time to extract the tree_entry into rb_simple_node, and export it
into misc.h.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:19 +02:00
Qu Wenruo
7053544146 btrfs: backref: move btrfs_backref_(node|edge|cache) structures to backref.h
These 3 structures are the main part of btrfs backref cache, move them
to backref.h to build the basis for later reuse.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:19 +02:00
Qu Wenruo
a26195a523 btrfs: reloc: add btrfs_ prefix for backref_node/edge/cache
Those three structures are the main elements of backref cache. Add the
"btrfs_" prefix for later export.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:18 +02:00
Qu Wenruo
29db137b6b btrfs: reloc: refactor useless nodes handling into its own function
This patch will also add some comment for the cleanup.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:18 +02:00
Qu Wenruo
1f87292466 btrfs: reloc: refactor finishing part of upper linkage into finish_upper_links()
After handle_one_tree_backref(), all newly added (not cached) edges and
nodes have the following features:

- Only backref_edge::list[LOWER] is linked.
  This means, we can only iterate from botton to top, not the other
  direction.

- Newly added nodes are not added to cache rb_tree yet

So to finish the backref cache, we still need to finish the links and
add all nodes into backref cache rb_tree.

This patch will refactor the existing code into finish_upper_links(),
add more comments of each branch, and why we need to do all the work.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:18 +02:00
Qu Wenruo
e7d571c7b0 btrfs: reloc: remove the open-coded goto loop for breadth-first search
build_backref_tree() uses "goto again;" to implement a breadth-first
search to build backref cache.

This patch will extract most of its work into a wrapper,
handle_one_tree_block(), and use a do {} while() loop to implement the
same thing.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:18 +02:00
Qu Wenruo
0304f2d8cc btrfs: reloc: pass essential members for alloc_backref_node()
Bytenr and level are essential parameters for backref_node, thus it
makes sense to initialize them at allocation time.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:18 +02:00
Qu Wenruo
2a979612d5 btrfs: reloc: use wrapper to replace open-coded edge linking
Since backref_edge is used to connect upper and lower backref nodes, and
needs to access both nodes, some code can look pretty nasty:

		list_add_tail(&edge->list[LOWER], &cur->upper);

The above code will link @cur to the LOWER side of the edge, while both
"LOWER" and "upper" words show up.  This can sometimes be very confusing
for reader to grasp.

This patch introduces a new wrapper, link_backref_edge(), to handle the
linking behavior.  Which also has extra ASSERT() to ensure caller won't
pass wrong nodes.

Also, this updates the comment of related lists of backref_node and
backref_edge, to make it more clear that each list points to what.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:18 +02:00
Qu Wenruo
4d81ea8bb4 btrfs: reloc: refactor indirect tree backref processing into its own function
The processing of indirect tree backref (TREE_BLOCK_REF) is the most
complex work.

We need to grab the fs root, do a tree search to locate all its parent
nodes, link all needed edges, and put all uncached edges to pending edge
list.

This is definitely worth a helper function.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:17 +02:00
Qu Wenruo
4007ea87d9 btrfs: reloc: refactor direct tree backref processing into its own function
For BTRFS_SHARED_BLOCK_REF_KEY, its processing is straightforward, as we
now the parent node bytenr directly.

If the parent is already cached, or a root, call it a day.
If the parent is not cached, add it pending list.

This patch will just refactor this part into its own function,
handle_direct_tree_backref() and add some comment explaining the
@ref_key parameter.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:17 +02:00
Qu Wenruo
2433bea592 btrfs: reloc: make reloc root search-specific for relocation backref cache
find_reloc_root() searches reloc_control::reloc_root_tree to find the
reloc root.  This behavior is only useful for relocation backref cache.

For the incoming more generic purpose backref cache, we don't care
about who owns the reloc root, but only care if it's a reloc root.

So this patch makes the following modifications to make the reloc root
search more specific to relocation backref:

- Add backref_node::is_reloc_root
  This will be an extra indicator for generic purposed backref cache.
  User doesn't need to read root key from backref_node::root to
  determine if it's a reloc root.
  Also for reloc tree root, it's useless and will be queued to useless
  list.

- Add backref_cache::is_reloc
  This will allow backref cache code to do different behavior for
  generic purpose backref cache and relocation backref cache.

- Pass fs_info to find_reloc_root()

- Export find_reloc_root()
  So backref.c can utilize this function.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:17 +02:00
Qu Wenruo
33a0f1f716 btrfs: reloc: add backref_cache::fs_info member
Add this member so that we can grab fs_info without the help from
reloc_control.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:17 +02:00
Qu Wenruo
8478028933 btrfs: reloc: add backref_cache::pending_edge and backref_cache::useless_node
These two new members will act the same as the existing local lists,
@useless and @list in build_backref_tree().

Currently build_backref_tree() is only executed serially, thus moving
such local list into backref_cache is still safe.

Also since we're here, use list_first_entry() to replace a lot of
list_entry() calls after !list_empty().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:17 +02:00
Qu Wenruo
9569cc203d btrfs: reloc: rename mark_block_processed and __mark_block_processed
These two functions are weirdly named, mark_block_processed() in fact
just marks a range dirty unconditionally, while __mark_block_processed()
does extra check before doing the marking.

This patch will open code old mark_block_processed, and rename
__mark_block_processed() to remove the "__" prefix.

Since we're here, also kill the forward declaration, which could also
kill in_block_group() with in_range() macro.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:16 +02:00
Qu Wenruo
71f572a9e8 btrfs: reloc: use btrfs_backref_iter infrastructure
In the core function of relocation, build_backref_tree, it needs to
iterate all backref items of one tree block.

Use btrfs_backref_iter infrastructure to do the loop and make the code
more readable.

The backref items look would be much more easier to read:

	ret = btrfs_backref_iter_start(iter, cur->bytenr);
	for (; ret == 0; ret = btrfs_backref_iter_next(iter)) {
		/* The really important work */
	}

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:16 +02:00
Qu Wenruo
c39c2ddc67 btrfs: backref: implement btrfs_backref_iter_next()
This function will go to the next inline/keyed backref for
btrfs_backref_iter infrastructure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:16 +02:00
Qu Wenruo
a37f232b7b btrfs: backref: introduce the skeleton of btrfs_backref_iter
Due to the complex nature of btrfs extent tree, when we want to iterate
all backrefs of one extent, this involves quite a lot of work, like
searching the EXTENT_ITEM/METADATA_ITEM, iteration through inline and keyed
backrefs.

Normally this would result in a complex code, something like:

  btrfs_search_slot()
  /* Ensure we are at EXTENT_ITEM/METADATA_ITEM */
  while (1) {	/* Loop for extent tree items */
	while (ptr < end) { /* Loop for inlined items */
		/* Real work here */
	}
  next:
  	ret = btrfs_next_item()
	/* Ensure we're still at keyed item for specified bytenr */
  }

The idea of btrfs_backref_iter is to avoid such complex and hard to
read code structure, but something like the following:

  iter = btrfs_backref_iter_alloc();
  ret = btrfs_backref_iter_start(iter, bytenr);
  if (ret < 0)
	goto out;
  for (; ; ret = btrfs_backref_iter_next(iter)) {
	/* Real work here */
  }
  out:
  btrfs_backref_iter_free(iter);

This patch is just the skeleton + btrfs_backref_iter_start() code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:16 +02:00
Jules Irenge
78d933c79c btrfs: add missing annotation for btrfs_tree_lock()
Sparse reports a warning at btrfs_tree_lock()

warning: context imbalance in btrfs_tree_lock() - wrong count at exit

The root cause is the missing annotation at btrfs_tree_lock()
Add the missing __acquires(&eb->lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:16 +02:00
Jules Irenge
c142c6a449 btrfs: add missing annotation for btrfs_lock_cluster()
Sparse reports a warning at btrfs_lock_cluster()

warning: context imbalance in btrfs_lock_cluster()
	- wrong count

The root cause is the missing annotation at btrfs_lock_cluster()
Add the missing __acquires(&cluster->refill_lock) annotation.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:16 +02:00
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Linus Torvalds
caffb99b69 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna
    Bhowmik.

 2) Nexthop attributes aren't being checked properly because of
    mis-initialized iterator, from David Ahern.

 3) Revert iop_idents_reserve() change as it caused performance
    regressions and was just working around what is really a UBSAN bug
    in the compiler. From Yuqi Jin.

 4) Read MAC address properly from ROM in bmac driver (double iteration
    proceeds past end of address array), from Jeremy Kerr.

 5) Add Microsoft Surface device IDs to r8152, from Marc Payne.

 6) Prevent reference to freed SKB in __netif_receive_skb_core(), from
    Boris Sukholitko.

 7) Fix ACK discard behavior in rxrpc, from David Howells.

 8) Preserve flow hash across packet scrubbing in wireguard, from Jason
    A. Donenfeld.

 9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric
    Dumazet.

10) Fix encryption error checking in kTLS code, from Vadim Fedorenko.

11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki.

12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet.

13) Fix use after free in mlxsw driver, from Jiri Pirko.

14) Order kTLS key destruction properly in mlx5 driver, from Tariq
    Toukan.

15) Check devm_platform_ioremap_resource() return value properly in
    several drivers, from Tiezhu Yang.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
  net: smsc911x: Fix runtime PM imbalance on error
  net/mlx4_core: fix a memory leak bug.
  net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend
  net: phy: mscc: fix initialization of the MACsec protocol mode
  net: stmmac: don't attach interface until resume finishes
  net: Fix return value about devm_platform_ioremap_resource()
  net/mlx5: Fix error flow in case of function_setup failure
  net/mlx5e: CT: Correctly get flow rule
  net/mlx5e: Update netdev txq on completions during closure
  net/mlx5: Annotate mutex destroy for root ns
  net/mlx5: Don't maintain a case of del_sw_func being null
  net/mlx5: Fix cleaning unmanaged flow tables
  net/mlx5: Fix memory leak in mlx5_events_init
  net/mlx5e: Fix inner tirs handling
  net/mlx5e: kTLS, Destroy key object after destroying the TIS
  net/mlx5e: Fix allowed tc redirect merged eswitch offload cases
  net/mlx5: Avoid processing commands before cmdif is ready
  net/mlx5: Fix a race when moving command interface to events mode
  net/mlx5: Add command entry handling completion
  rxrpc: Fix a memory leak in rxkad_verify_response()
  ...
2020-05-23 17:16:18 -07:00
David Howells
8a1d24e1cc rxrpc: Fix a warning
Fix a warning due to an uninitialised variable.

le included from ../fs/afs/fs_probe.c:11:
../fs/afs/fs_probe.c: In function 'afs_fileserver_probe_result':
../fs/afs/internal.h:1453:2: warning: 'rtt_us' may be used uninitialized in this function [-Wmaybe-uninitialized]
 1453 |  printk("[%-6.6s] "FMT"\n", current->comm ,##__VA_ARGS__)
      |  ^~~~~~
../fs/afs/fs_probe.c:35:15: note: 'rtt_us' was declared here

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-23 00:31:39 +01:00
David S. Miller
4629ed2e48 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7FQH4ACgkQ+7dXa6fL
 C2u05BAAlilXCYGZU23/yQmMxxkmcG+jW+oDV9ySNDl6iJT+OtKxHEGReDD4D2f0
 rhaBivgGcOnZy89AGjNrSVROGlwSBXl7ArJcjsfsx8AuNzxHUHQKWlW/k8n87qEt
 NCTze7f65IT6NowgYAFgJn5kIpY/9iKuNiCf6NGL3Z35wqxPvwNs6AQSGM495uvB
 el/ddkr8QzzjI9Ejsgzj94x4DAOjk4T4WzfWMAgyr1OEqz6vKNKkCwSKPySOsQAK
 72JRaGhWA9rfAOkA7nAZpnjHdfFYnkFBOVQzmswOJYRYe3D/QY5D9PUlGIQ5OSjL
 yV5YOi/+AUrSif79NfEYXga0r/NFJMFqBg2zo/eiSrhfZZFZMDcagnGhzpGjbYF1
 IaeIu4q/MQOQybi8m1GJhvFfPOhdKRn731jlsUvEoxK0TonSu/u64eus+qelQxOd
 uiIcu/kLxfPZSznUd8cXZ+Pffce0uBIRWq0nRQZ703TyHY+/gYo7ZGHr/FZNKaK4
 lRNP4Nu3goLQCI40R7y7USnpX+kWfd4mYC9zl+VBSXG1JymYbOezXYrNATBCqpo6
 9VoYtqDdo8ESksFUBqM7fGRDZ20nah6KdRGmnrPU+rpODHZEZmNN7D/rayU31wua
 VIbVw1WluvSVnQ8+b1BwJwvhJQ4CazyGcnbxDqx1zd07EjUiiJI=
 =jJvy
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200520' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fix retransmission timeout and ACK discard

Here are a couple of fixes and an extra tracepoint for AF_RXRPC:

 (1) Calculate the RTO pretty much as TCP does, rather than making
     something up, including an initial 4s timeout (which causes return
     probes from the fileserver to fail if a packet goes missing), and add
     backoff.

 (2) Fix the discarding of out-of-order received ACKs.  We mustn't let the
     hard-ACK point regress, nor do we want to do unnecessary
     retransmission because the soft-ACK list regresses.  This is not
     trivial, however, due to some loose wording in various old protocol
     specs, the ACK field that should be used for this sometimes has the
     wrong information in it.

 (3) Add a tracepoint to log a discarded ACK.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:53:30 -07:00
Linus Torvalds
444565650a io_uring-5.7-2020-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7IByoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpigmEACWJoK7zk5OK2RhavpzOsb2SDu8nz/YAbUe
 +R6tXAjwe4Z7lVVa+FW/fmN9/mQcjyYRIbbG564IFs5fe6hPUoOjUzHqvGTOFLHd
 Fw8mjVKgWjWAE5GdoX6ATauLhVwwjnImej1PNfO/J5y29o0SQksP8MbM0eGuuNx1
 piqxBj0/3h3YyPn1GeJmqxwwcsFhzHDqk7fbkfbQokZk+7SPiKpqWgJBa7AKSlNC
 N0WTluT4UOummQZw1RFynPfA4cCuX6XHVgWAa9h7vrJHXigvuMWqLaHG+MBFqeKu
 xD6PPnaCnMwcLRe4T2sJvtjxmNSdyr15Q2kGkIi/RhohSIn4u/y8jEA6wTprCP48
 rDi30dn1o2LwUj2S1NO3YCOV8jIKWUguztEvKiAXmjf4KDZIDd4/OwrFsJdb4vg9
 EuK86SEwXbvFHf9nu1M7pHlGThKfQi0CiK6C6M7Qb/kOthio72wwZ46gGkwLDk5z
 DZWHymHBhQw/z1c20loX7pBvFIzLzbuYUThf23UegPzXVqqQfBkqs4BGFcOGuqy6
 yfEYF/MAX/O/TQgm2dDQHrhl05AevLu/UQXMXZ8Ha6OrmlC4C2qu3Te/iZO8FUew
 YIx5H5XmBh93McjpmJ8VCn7CjE+y/ufNTMdvm8WzCyAIfH40gfcyLangpre26QoJ
 CCAARffXrQ==
 =ZYUy
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.7-2020-05-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A small collection of small fixes that should go into this release:

   - Two fixes for async request preparation (Pavel)

   - Busy clear fix for SQPOLL (Xiaoguang)

   - Don't use kiocb->private for O_DIRECT buf index, some file systems
     use it (Bijan)

   - Kill dead check in io_splice()

   - Ensure sqo_wait is initialized early

   - Cancel task_work if we fail adding to original process

   - Only add (IO)pollable requests to iopoll list, fixing a regression
     in this merge window"

* tag 'io_uring-5.7-2020-05-22' of git://git.kernel.dk/linux-block:
  io_uring: reset -EBUSY error when io sq thread is waken up
  io_uring: don't add non-IO requests to iopoll pending list
  io_uring: don't use kiocb.private to store buf_index
  io_uring: cancel work if task_work_add() fails
  io_uring: remove dead check in io_splice()
  io_uring: fix FORCE_ASYNC req preparation
  io_uring: don't prepare DRAIN reqs twice
  io_uring: initialize ctx->sqo_wait earlier
2020-05-22 11:12:30 -07:00
Christoph Hellwig
9398554fb3 block: remove the error_sector argument to blkdev_issue_flush
The argument isn't used by any caller, and drivers don't fill out
bi_sector for flush requests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-22 08:45:46 -06:00
Namjae Jeon
907fa89325 exfat: add the dummy mount options to be backward compatible with staging/exfat
As Ubuntu and Fedora release new version used kernel version equal to or
higher than v5.4, They started to support kernel exfat filesystem.

Linus reported a mount error with new version of exfat on Fedora:

        exfat: Unknown parameter 'namecase'

This is because there is a difference in mount option between old
staging/exfat and new exfat.  And utf8, debug, and codepage options as
well as namecase have been removed from new exfat.

This patch add the dummy mount options as deprecated option to be
backward compatible with old one.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-21 16:40:11 -07:00
Linus Torvalds
57f1b0cf2d Fix regression in ext4's FIEMAP handling introduced in v5.7-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Gh44ACgkQ8vlZVpUN
 gaM9TAgAkthbnWUb3uT7/Nx9PHtT5X5wZthMRCGpa0wlSvy51gwhi/8kVxw214Pn
 Z0Rlcopbx6gmWplbvVUCiHCgR/QMASaL3mQwmLTjTs1+fweNedrgPwTg6u7ZNaJe
 pXgUMdr/FSnAQdnQElAll7GdfN9+FpPzmsaXzu9uQUYtaPKDx4dv0GKzLgyxRRJn
 2OL4uUFPk0Q+hw8zGnloav6+rx9uw/Sees8tAUZgj5E2AjnqvKUrxB+UN481vk5T
 TUyhCK9S8SX+eWoL53dqL8QoTa9v5ovyrK/UNbLX8M8UPa5O8mIVNqES11htKzLu
 h9EhtiJCaAqEH5K/BgCh+qMgABLF6g==
 =hK/Y
 -----END PGP SIGNATURE-----

Merge tag 'fiemap-regression-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix regression in ext4's FIEMAP handling introduced in v5.7-rc1"

* tag 'fiemap-regression-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix fiemap size checks for bitmap files
  ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
2020-05-21 11:37:20 -07:00
Christoph Hellwig
3783daeb1d block: remove ioctl_by_bdev
No callers left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:22:20 -06:00
Linus Torvalds
fea371e259 This pull request contains the following bug fixes for UBI and UBIFS:
- Correctly set next cursor for detailed_erase_block_info debugfs file
 - Don't use crypto_shash_descsize() for digest size in UBIFS
 - Remove broken lazytime support from UBIFS
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl7Fh08WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wW2WD/428LjXh+24Y3rekfnCRXG5w+es
 yITAfhOmNuzn2vjS1UvCD0HsoBaS/LYbjuaceoyfXF9BG5mcrRTjFH7dVEEWFGDZ
 YeRvBFkyt4xBEJtrY/6MW35KPRtnCp4Jau9HR9M5RCcQ5xzOeGtw0r/JMdZe56Av
 zc2mLnZag1x5NyS4TvS30nCgj5pxVbO2bdAkyULJwBfPYs0C3TKeIul/4vjRi+57
 PjyIUSR7CxpsOJde0tMjDvf23ewn1IUEW+YnewP1qk36ijRw1M6C90ERr4CU9BM5
 YTEfjsxAheCItSf8r+BC70gaPBQPADtvHzPFqs9yNMSsLHYdOkkvqT8Bpwisj76d
 1zL45DjZZ8UxC3HfSMFPl/dYDWvfddpffNwrimeltoAzzejI/Wk8AX0VqH1IQ3Z1
 zDbz0ixP21ADATvrHUxr7UsoeEU9havGV+2sg+4wSU1aLtKIZUTjceizjkTN+9oB
 ntHLuv6cS2iop22iSbJGClOv2TjpBlGQNwMDQ7TdD1a0QqxTSPRiguMmf/mDpQa/
 MgQGAO6xS5NKRNiEbifniiCugLqpUQBHBPyn+q+4unmfK5sPzzLdpb3vpc0XNmbm
 WgwfuMZdfmK0jO27P1/MRG6LUGxXKh5arsi6JrUJVIsdxzV3bdc2xBjkUFOOS/tH
 W7fn4QS+WmbPVm09Jg==
 =eCh7
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBI and UBIFS fixes from Richard Weinberger:

 - Correctly set next cursor for detailed_erase_block_info debugfs file

 - Don't use crypto_shash_descsize() for digest size in UBIFS

 - Remove broken lazytime support from UBIFS

* tag 'for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubi: Fix seq_file usage in detailed_erase_block_info debugfs file
  ubifs: fix wrong use of crypto_shash_descsize()
  ubifs: remove broken lazytime support
2020-05-20 13:07:01 -07:00
Linus Torvalds
8e2b7f634a overlayfs fixes for 5.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXsU6SgAKCRDh3BK/laaZ
 PABRAP9MCZz/CLH2sEqHqH9KQHScNc4uf4bReiCU1hrLs7PbYwD/Y+vbRMffki7I
 B/gt0Dg4kGxG5CV+ckeZK0+p2NWUUgQ=
 =PPLW
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "Fix two bugs introduced in this cycle and one introduced in v5.5"

* tag 'ovl-fixes-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: potential crash in ovl_fid_to_fh()
  ovl: clear ATTR_OPEN from attr->ia_valid
  ovl: clear ATTR_FILE from attr->ia_valid
2020-05-20 11:28:35 -07:00
Tetsuo Handa
566d136289 pipe: Fix pipe_full() test in opipe_prep().
syzbot is reporting that splice()ing from non-empty read side to
already-full write side causes unkillable task, for opipe_prep() is by
error not inverting pipe_full() test.

  CPU: 0 PID: 9460 Comm: syz-executor.5 Not tainted 5.6.0-rc3-next-20200228-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:rol32 include/linux/bitops.h:105 [inline]
  RIP: 0010:iterate_chain_key kernel/locking/lockdep.c:369 [inline]
  RIP: 0010:__lock_acquire+0x6a3/0x5270 kernel/locking/lockdep.c:4178
  Call Trace:
     lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4720
     __mutex_lock_common kernel/locking/mutex.c:956 [inline]
     __mutex_lock+0x156/0x13c0 kernel/locking/mutex.c:1103
     pipe_lock_nested fs/pipe.c:66 [inline]
     pipe_double_lock+0x1a0/0x1e0 fs/pipe.c:104
     splice_pipe_to_pipe fs/splice.c:1562 [inline]
     do_splice+0x35f/0x1520 fs/splice.c:1141
     __do_sys_splice fs/splice.c:1447 [inline]
     __se_sys_splice fs/splice.c:1427 [inline]
     __x64_sys_splice+0x2b5/0x320 fs/splice.c:1427
     do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:295
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot+b48daca8639150bc5e73@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=9386d051e11e09973d5a4cf79af5e8cedf79386d
Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-20 10:54:29 -07:00
Christoph Hellwig
c928f642c2 fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool
where true means success and false means failure.

[AV: braino fix folded in]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:14:10 -04:00
Christoph Hellwig
b8d9e7f241 fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
76887c2567 fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
f6dd975583 pipe: merge anon_pipe_buf*_ops
All the op vectors are exactly the same, they are just used to encode
packet or nomerge behavior.  There already is a flag for the packet
behavior, so just add a new one to allow for merging.  Inverting it vs
the previous nomerge special casing actually allows for much nicer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
00c285d0d0 fs: simplify do_splice_from
No need for a local function pointer when we can trivial branch on the
->splice_write presence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
2bc010600d fs: simplify do_splice_to
No need for a local function pointer when we can trivial branch on the
->splice_read presence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:25 -04:00
Xiaoguang Wang
6b668c9b7f io_uring: don't submit sqes when ctx->refs is dying
When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() will wait
for sq thread to idle by busy loop:

    while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
        cond_resched();

Above loop isn't very CPU friendly, it may introduce a short cpu burst on
the current cpu.

If ctx->refs is dying, we forbid sq_thread from submitting any further
SQEs. Instead they just get discarded when we exit.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20 08:41:26 -06:00
Xiaoguang Wang
d4ae271dfa io_uring: reset -EBUSY error when io sq thread is waken up
In io_sq_thread(), currently if we get an -EBUSY error and go to sleep,
we will won't clear it again, which will result in io_sq_thread() will
never have a chance to submit sqes again. Below test program test.c
can reveal this bug:

int main(int argc, char *argv[])
{
        struct io_uring ring;
        int i, fd, ret;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec *iovecs;
        void *buf;
        struct io_uring_params p;

        if (argc < 2) {
                printf("%s: file\n", argv[0]);
                return 1;
        }

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;
        ret = io_uring_queue_init_params(4, &ring, &p);
        if (ret < 0) {
                fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                return 1;
        }

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        iovecs = calloc(10, sizeof(struct iovec));
        for (i = 0; i < 10; i++) {
                if (posix_memalign(&buf, 4096, 4096))
                        return 1;
                iovecs[i].iov_base = buf;
                iovecs[i].iov_len = 4096;
        }

        ret = io_uring_register_files(&ring, &fd, 1);
        if (ret < 0) {
                fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
                return ret;
        }

        for (i = 0; i < 10; i++) {
                sqe = io_uring_get_sqe(&ring);
                if (!sqe)
                        break;

                io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
                sqe->flags |= IOSQE_FIXED_FILE;

                ret = io_uring_submit(&ring);
                sleep(1);
                printf("submit %d\n", i);
        }

        for (i = 0; i < 10; i++) {
                io_uring_wait_cqe(&ring, &cqe);
                printf("receive: %d\n", i);
                if (cqe->res != 4096) {
                        fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
                        ret = 1;
                }
                io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
}
sudo ./test testfile
above command will hang on the tenth request, to fix this bug, when io
sq_thread is waken up, we reset the variable 'ret' to be zero.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20 07:26:47 -06:00
Jens Axboe
b532576ed3 io_uring: don't add non-IO requests to iopoll pending list
We normally disable any commands that aren't specifically poll commands
for a ring that is setup for polling, but we do allow buffer provide and
remove commands to support buffer selection for polled IO. Once a
request is issued, we add it to the poll list to poll for completion. But
we should not do that for non-IO commands, as those request complete
inline immediately and aren't pollable. If we do, we can leave requests
on the iopoll list after they are freed.

Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 21:20:27 -06:00
Linus Torvalds
115a54162a Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fix from Al Viro:
 "Stable fodder fix: copy_fdtable() would get screwed on 64bit boxen
  with sysctl_nr_open raised to 512M or higher, which became possible
  since 2.6.25.

  Nobody sane would set the things up that way, but..."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix multiplication overflow in copy_fdtable()
2020-05-19 16:33:26 -07:00
Al Viro
4e89b72104 fix multiplication overflow in copy_fdtable()
cpy and set really should be size_t; we won't get an overflow on that,
since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *),
so nr that would've managed to overflow size_t on that multiplication
won't get anywhere near copy_fdtable() - we'll fail with EMFILE
before that.

Cc: stable@kernel.org # v2.6.25+
Fixes: 9cfe015aa4 (get rid of NR_OPEN and introduce a sysctl_nr_open)
Reported-by: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-19 18:29:36 -04:00
Bijan Mottahedeh
4f4eeba87c io_uring: don't use kiocb.private to store buf_index
kiocb.private is used in iomap_dio_rw() so store buf_index separately.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>

Move 'buf_index' to a hole in io_kiocb.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 16:19:49 -06:00
Christoph Hellwig
959f758451 ext4: fix fiemap size checks for bitmap files
Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others.  This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-19 15:03:37 -04:00
Ritesh Harjani
9f44eda195 ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.

The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.

This patch fixes that.

Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-19 15:03:37 -04:00
Christoph Hellwig
ef8385128d xfs: cleanup xfs_idestroy_fork
Move freeing the dynamically allocated attr and COW fork, as well
as zeroing the pointers where actually needed into the callers, and
just pass the xfs_ifork structure to xfs_idestroy_fork.  Also simplify
the kmem_free calls by not checking for NULL first.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:59 -07:00
Christoph Hellwig
f7e67b20ec xfs: move the fork format fields into struct xfs_ifork
Both the data and attr fork have a format that is stored in the legacy
idinode.  Move it into the xfs_ifork structure instead, where it uses
up padding.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
daf83964a3 xfs: move the per-fork nextents fields into struct xfs_ifork
There are there are three extents counters per inode, one for each of
the forks.  Two are in the legacy icdinode and one is directly in
struct xfs_inode.  Switch to a single counter in the xfs_ifork structure
where it uses up padding at the end of the structure.  This simplifies
various bits of code that just wants the number of extents counter and
can now directly dereference it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
b2c20045b6 xfs: remove xfs_ifree_local_data
xfs_ifree only need to free inline data in the data fork, as we've
already taken care of the attr fork before (and in fact freed the
fork structure).  Just open code the freeing of the inline data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
09c38edd54 xfs: remove the XFS_DFORK_Q macro
Just checking di_forkoff directly is a little easier to follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Darrick J. Wong
5fd68bdb5a xfs: clean up xchk_bmap_check_rmaps usage of XFS_IFORK_Q
XFS_IFORK_Q is supposed to be a predicate, not a function returning a
value.  Its usage is in xchk_bmap_check_rmaps is incorrect, but that
function only cares about whether or not the "size" of the data is zero
or not.  Convert that logic to use a proper boolean, and teach the
caller to skip the call entirely if the end result would be that we'd do
nothing anyway.  This avoids a crash later in this series.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[hch: generalized the NULL ifor check]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
4b516ff4e7 xfs: remove the NULL fork handling in xfs_bmapi_read
Now that we fully verify the inode forks before they are added to the
inode cache, the crash reported in

  https://bugzilla.kernel.org/show_bug.cgi?id=204031

can't happen anymore, as we'll never let an inode that has inconsistent
nextents counts vs the presence of an in-core attr fork leak into the
inactivate code path.  So remove the work around to try to handle the
case, and just return an error and warn if the fork is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
1a1c57b282 xfs: remove the special COW fork handling in xfs_bmapi_read
We don't call xfs_bmapi_read for the COW fork anymore, so remove the
special casing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
0f45a1b20c xfs: improve local fork verification
Call the data/attr local fork verifiers as soon as we are ready for them.
This keeps them close to the code setting up the forks, and avoids a
few branches later on.  Also open code xfs_inode_verify_forks in the
only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:58 -07:00
Christoph Hellwig
7c7ba21863 xfs: refactor xfs_inode_verify_forks
The split between xfs_inode_verify_forks and the two helpers
implementing the actual functionality is a little strange.  Reshuffle
it so that xfs_inode_verify_forks verifies if the data and attr forks
are actually in local format and only call the low-level helpers if
that is the case.  Handle the actual error reporting in the low-level
handlers to streamline the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
1934c8bd81 xfs: remove xfs_ifork_ops
xfs_ifork_ops add up to two indirect calls per inode read and flush,
despite just having a single instance in the kernel.  In xfsprogs
phase6 in xfs_repair overrides the verify_dir method to deal with inodes
that do not have a valid parent, but that can be fixed pretty easily
by ensuring they always have a valid looking parent.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
bb8a66af4f xfs: remove xfs_iread
There is not much point in the xfs_iread function, as it has a single
caller and not a whole lot of code.  Move it into the only caller,
and trim down the overdocumentation to just documenting the important
"why" instead of a lot of redundant "what".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
7f02901235 xfs: don't reset i_delayed_blks in xfs_iread
i_delayed_blks is set to 0 in xfs_inode_alloc and can't have anything
assigned to it until the inode is visible to the VFS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
2d6051d496 xfs: call xfs_dinode_verify from xfs_inode_from_disk
Keep the code dealing with the dinode together, and also ensure we verify
the dinode in the owner change log recovery case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
0bce8173fd xfs: handle unallocated inodes in xfs_inode_from_disk
Handle inodes with a 0 di_mode in xfs_inode_from_disk, instead of partially
duplicating inode reading in xfs_iread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
9229d18e80 xfs: split xfs_iformat_fork
xfs_iformat_fork is a weird catchall.  Split it into one helper for
the data fork and one for the attr fork, and then call both helper
as well as the COW fork initialization from xfs_inode_from_disk.  Order
the COW fork initialization after the attr fork initialization given
that it can't fail to simplify the error handling.

Note that the newly split helpers are moved down the file in
xfs_inode_fork.c to avoid the need for forward declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
cb7d585944 xfs: call xfs_iformat_fork from xfs_inode_from_disk
We always need to fill out the fork structures when reading the inode,
so call xfs_iformat_fork from the tail of xfs_inode_from_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Christoph Hellwig
b90c2a9c8b xfs: xfs_bmapi_read doesn't take a fork id as the last argument
The last argument to xfs_bmapi_raad contains XFS_BMAPI_* flags, not the
fork.  Given that XFS_DATA_FORK evaluates to 0 no real harm is done,
but let's fix this anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:57 -07:00
Kaixu Xia
14506f7a91 xfs: fix the warning message in xfs_validate_sb_common()
Fix this error message to complain about project and group quota flag
bits instead of "PUOTA" and "QUOTA".

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19 09:40:56 -07:00
Darrick J. Wong
765d3c393c xfs: don't allow SWAPEXT if we'd screw up quota accounting
Since the old SWAPEXT ioctl doesn't know how to adjust quota ids,
bail out of the ids don't match and quotas are enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-19 09:40:56 -07:00
Darrick J. Wong
78bba5c812 xfs: use ordered buffers to initialize dquot buffers during quotacheck
While QAing the new xfs_repair quotacheck code, I uncovered a quota
corruption bug resulting from a bad interaction between dquot buffer
initialization and quotacheck.  The bug can be reproduced with the
following sequence:

# mkfs.xfs -f /dev/sdf
# mount /dev/sdf /opt -o usrquota
# su nobody -s /bin/bash -c 'touch /opt/barf'
# sync
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
                        Inodes
User ID      Used   Soft   Hard Warn/Grace
---------- ---------------------------------
root            3      0      0  00 [------]
nobody          1      0      0  00 [------]

# xfs_io -x -c 'shutdown' /opt
# umount /opt
# mount /dev/sdf /opt -o usrquota
# touch /opt/man2
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
                        Inodes
User ID      Used   Soft   Hard Warn/Grace
---------- ---------------------------------
root            1      0      0  00 [------]
nobody          1      0      0  00 [------]

# umount /opt

Notice how the initial quotacheck set the root dquot icount to 3
(rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
xfs_quota reports that the root dquot has only 1 icount.  We haven't
deleted anything from the filesystem, which means that quota is now
under-counting.  This behavior is not limited to icount or the root
dquot, but this is the shortest reproducer.

I traced the cause of this discrepancy to the way that we handle ondisk
dquot updates during quotacheck vs. regular fs activity.  Normally, when
we allocate a disk block for a dquot, we log the buffer as a regular
(dquot) buffer.  Subsequent updates to the dquots backed by that block
are done via separate dquot log item updates, which means that they
depend on the logged buffer update being written to disk before the
dquot items.  Because individual dquots have their own LSN fields, that
initial dquot buffer must always be recovered.

However, the story changes for quotacheck, which can cause dquot block
allocations but persists the final dquot counter values via a delwri
list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
means that the initial dquot buffer can be replayed over the (newer)
contents that were delwritten at the end of quotacheck.  In effect, this
re-initializes the dquot counters after they've been updated.  If the
log does not contain any other dquot items to recover, the obsolete
dquot contents will not be corrected by log recovery.

Because quotacheck uses a transaction to log the setting of the CHKD
flags in the superblock, we skip quotacheck during the second mount
call, which allows the incorrect icount to remain.

Fix this by changing the ondisk dquot initialization function to use
ordered buffers to write out fresh dquot blocks if it detects that we're
running quotacheck.  If the system goes down before quotacheck can
complete, the CHKD flags will not be set in the superblock and the next
mount will run quotacheck again, which can fix uninitialized dquot
buffers.  This requires amending the defer code to maintaine ordered
buffer state across defer rolls for the sake of the dquot allocation
code.

For regular operations we preserve the current behavior since the dquot
items require properly initialized ondisk dquot records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-19 09:40:56 -07:00
Eric Biggers
e3b1078bed fscrypt: add support for IV_INO_LBLK_32 policies
The eMMC inline crypto standard will only specify 32 DUN bits (a.k.a. IV
bits), unlike UFS's 64.  IV_INO_LBLK_64 is therefore not applicable, but
an encryption format which uses one key per policy and permits the
moving of encrypted file contents (as f2fs's garbage collector requires)
is still desirable.

To support such hardware, add a new encryption format IV_INO_LBLK_32
that makes the best use of the 32 bits: the IV is set to
'SipHash-2-4(inode_number) + file_logical_block_number mod 2^32', where
the SipHash key is derived from the fscrypt master key.  We hash only
the inode number and not also the block number, because we need to
maintain contiguity of DUNs to merge bios.

Unlike with IV_INO_LBLK_64, with this format IV reuse is possible; this
is unavoidable given the size of the DUN.  This means this format should
only be used where the requirements of the first paragraph apply.
However, the hash spreads out the IVs in the whole usable range, and the
use of a keyed hash makes it difficult for an attacker to determine
which files use which IVs.

Besides the above differences, this flag works like IV_INO_LBLK_64 in
that on ext4 it is only allowed if the stable_inodes feature has been
enabled to prevent inode numbers and the filesystem UUID from changing.

Link: https://lore.kernel.org/r/20200515204141.251098-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-19 09:34:18 -07:00
Brian Foster
f28cef9e4d xfs: don't fail verifier on empty attr3 leaf block
The attr fork can transition from shortform to leaf format while
empty if the first xattr doesn't fit in shortform. While this empty
leaf block state is intended to be transient, it is technically not
due to the transactional implementation of the xattr set operation.

We historically have a couple of bandaids to work around this
problem. The first is to hold the buffer after the format conversion
to prevent premature writeback of the empty leaf buffer and the
second is to bypass the xattr count check in the verifier during
recovery. The latter assumes that the xattr set is also in the log
and will be recovered into the buffer soon after the empty leaf
buffer is reconstructed. This is not guaranteed, however.

If the filesystem crashes after the format conversion but before the
xattr set that induced it, only the format conversion may exist in
the log. When recovered, this creates a latent corrupted state on
the inode as any subsequent attempts to read the buffer fail due to
verifier failure. This includes further attempts to set xattrs on
the inode or attempts to destroy the attr fork, which prevents the
inode from ever being removed from the unlinked list.

To avoid this condition, accept that an empty attr leaf block is a
valid state and remove the count check from the verifier. This means
that on rare occasions an attr fork might exist in an unexpected
state, but is otherwise consistent and functional. Note that we
retain the logic to avoid racing with metadata writeback to reduce
the window where this can occur.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-19 09:05:24 -07:00
Eric Biggers
0ca2ddb0cd fscrypt: make test_dummy_encryption use v2 by default
Since v1 encryption policies are deprecated, make test_dummy_encryption
test v2 policies by default.

Note that this causes ext4/023 and ext4/028 to start failing due to
known bugs in those tests (see previous commit).

Link: https://lore.kernel.org/r/20200512233251.118314-5-ebiggers@kernel.org
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-18 20:21:48 -07:00
Eric Biggers
ed318a6cc0 fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.

Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.

To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2".  The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.

To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.

To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount.  (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)

Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'.  On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.

Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-18 20:21:48 -07:00
Linus Torvalds
45088963ca Description for this pull request:
- Fix potential memory leak in exfat_find.
 - Set exfat's splice_write to iter_file_splice_write to fix the splice
   failure on direct-opened file
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl7CCAkYHG5hbWphZS5q
 ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIX3AQAM7cV9GZecl6YfQu5AIeFbHT
 uvSnvuW5O5JS9qdra4knSTthHYJ8eUucjcPlxUtHhs4oznm+erjZc9A0tRwDQyjy
 EjoZZGEBOphWFLCY28K9LdJZD89JhNh9v5XUD9dId3XFnznaRjvZRHlbCVzqAWG1
 DUcRedNEderpkg0FySEBIx6EHhKX6+YgkKOWlGG8r8bqdRrgZbjyAyduRdKlyX31
 7XIeS4qFMDWLrqcbJdmL9pljx4VH2MswNIXK6kA2pydMwItGhod2yRWzFMYPeTDm
 fTRDKzHvfA3J30h3wMI5FJu/ikfuVqsmp8i5rND7v/eRP13uuxZCSI2MfnUzHEj2
 ciWxGfr5kFGg/1eAjNtOy3AnS5wsaEQ0ixYFGgKb8ENvToyT4cHa+9X2y0PrVnRu
 bOyqJTBwlSisqp3DiK8aAhklHHbX1/CheGOLMj1B48H42eREUHFn/yPYroOb+Ot/
 CiRH4feACSCMRGn8HdlgnguOs4zwZIWtLQWpfqhu4CJSNFa3IW6PSl53U1vPzuXG
 v2Cdxn6D1gCqxsFbSmzmMJVkNfILrY7sLSU9lqrXWCQ4T6I8FpBxIvU8CCi1boQD
 7hpdXstL/0xhb/gTFQL2uJ2MasQdSzVQgl6dmGK5riJkqwgaWz4FDro+IF3JxdQT
 qtUZ5nd6e33pl6PwK3nt
 =JN5f
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat fixes from Namjae Jeon:

 - Fix potential memory leak in exfat_find

 - Set exfat's splice_write to iter_file_splice_write to fix a splice
   failure on direct-opened files

* tag 'for-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: fix possible memory leak in exfat_find()
  exfat: use iter_file_splice_write
2020-05-18 10:33:13 -07:00
David Howells
9d1be4f4dc afs: Don't unlock fetched data pages until the op completes successfully
Don't call req->page_done() on each page as we finish filling it with
the data coming from the network.  Whilst this might speed up the
application a bit, it's a problem if there's a network failure and the
operation has to be reissued.

If this happens, an oops occurs because afs_readpages_page_done() clears
the pointer to each page it unlocks and when a retry happens, the
pointers to the pages it wants to fill are now NULL (and the pages have
been unlocked anyway).

Instead, wait till the operation completes successfully and only then
release all the pages after clearing any terminal gap (the server can
give us less data than we requested as we're allowed to ask for more
than is available).

KASAN produces a bug like the following, and even without KASAN, it can
oops and panic.

    BUG: KASAN: wild-memory-access in _copy_to_iter+0x323/0x5f4
    Write of size 1404 at addr 0005088000000000 by task md5sum/5235

    CPU: 0 PID: 5235 Comm: md5sum Not tainted 5.7.0-rc3-fscache+ #250
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Call Trace:
     memcpy+0x39/0x58
     _copy_to_iter+0x323/0x5f4
     __skb_datagram_iter+0x89/0x2a6
     skb_copy_datagram_iter+0x129/0x135
     rxrpc_recvmsg_data.isra.0+0x615/0xd42
     rxrpc_kernel_recv_data+0x1e9/0x3ae
     afs_extract_data+0x139/0x33a
     yfs_deliver_fs_fetch_data64+0x47a/0x91b
     afs_deliver_to_call+0x304/0x709
     afs_wait_for_call_to_complete+0x1cc/0x4ad
     yfs_fs_fetch_data+0x279/0x288
     afs_fetch_data+0x1e1/0x38d
     afs_readpages+0x593/0x72e
     read_pages+0xf5/0x21e
     __do_page_cache_readahead+0x128/0x23f
     ondemand_readahead+0x36e/0x37f
     generic_file_buffered_read+0x234/0x680
     new_sync_read+0x109/0x17e
     vfs_read+0xe6/0x138
     ksys_read+0xd8/0x14d
     do_syscall_64+0x6e/0x8a
     entry_SYSCALL_64_after_hwframe+0x49/0xb3

Fixes: 196ee9cd2d ("afs: Make afs_fs_fetch_data() take a list of pages")
Fixes: 30062bd13e ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-18 10:29:17 -07:00
Jens Axboe
e3aabf9554 io_uring: cancel work if task_work_add() fails
We currently move it to the io_wqe_manager for execution, but we cannot
safely do so as we may lack some of the state to execute it out of
context. As we cancel work anyway when the ring/task exits, just mark
this request as canceled and io_async_task_func() will do the right
thing.

Fixes: aa96bf8a9e ("io_uring: use io-wq manager as backup task if task is exiting")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-18 11:14:22 -06:00
Wei Yongjun
94182167ec exfat: fix possible memory leak in exfat_find()
'es' is malloced from exfat_get_dentry_set() in exfat_find() and should
be freed before leaving from the error handling cases, otherwise it will
cause memory leak.

Fixes: 5f2aa07507 ("exfat: add inode operations")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-05-18 11:51:44 +09:00
Eric Sandeen
0357794830 exfat: use iter_file_splice_write
Doing copy_file_range() on exfat with a file opened for direct IO leads
to an -EFAULT:

# xfs_io -f -d -c "truncate 32768" \
       -c "copy_range -d 16384 -l 16384 -f 0" /mnt/test/junk
copy_range: Bad address

and the reason seems to be that we go through:

default_file_splice_write
 splice_from_pipe
  __splice_from_pipe
   write_pipe_buf
    __kernel_write
     new_sync_write
      generic_file_write_iter
       generic_file_direct_write
        exfat_direct_IO
         do_blockdev_direct_IO
          iov_iter_get_pages

and land in iterate_all_kinds(), which does "return -EFAULT" for our kvec
iter.

Setting exfat's splice_write to iter_file_splice_write fixes this and lets
fsx (which originally detected the problem) run to success from
the xfstests harness.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-05-18 11:51:40 +09:00
Jens Axboe
310672552f io_uring: async task poll trigger cleanup
If the request is still hashed in io_async_task_func(), then it cannot
have been canceled and it's pointless to check. So save that check.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 17:43:31 -06:00
Eric Biggers
3c3c32f85b ubifs: fix wrong use of crypto_shash_descsize()
crypto_shash_descsize() returns the size of the shash_desc context
needed to compute the hash, not the size of the hash itself.

crypto_shash_digestsize() would be correct, or alternatively using
c->hash_len and c->hmac_desc_len which already store the correct values.
But actually it's simpler to just use stack arrays, so do that instead.

Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Fixes: da8ef65f95 ("ubifs: Authenticate replayed journal")
Cc: <stable@vger.kernel.org> # v4.20+
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-05-17 23:38:21 +02:00
Jens Axboe
948a774945 io_uring: remove dead check in io_splice()
We checked for 'force_nonblock' higher up, so it's definitely false
at this point. Kill the check, it's a remnant of when we tried to do
inline splice without always punting to async context.

Fixes: 2fb3e82284 ("io_uring: punt splice async because of inode mutex")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:21:38 -06:00
Pavel Begunkov
f2a8d5c7a2 io_uring: add tee(2) support
Add IORING_OP_TEE implementing tee(2) support. Almost identical to
splice bits, but without offsets.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
9dafdfc2f0 splice: export do_tee()
export do_tee() for use in io_uring

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
c11368a57b io_uring: don't repeat valid flag list
req->flags stores all sqe->flags. After checking that sqe->flags are
valid set if IOSQE* flags, no need to double check it, just forward them
all.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
9f13c35b33 io_uring: rename io_file_put()
io_file_put() deals with flushing state's file refs, adding "state" to
its name makes it a bit clearer. Also, avoid double check of
state->file in __io_file_get() in some cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
0cdaf760f4 io_uring: remove req->needs_fixed_files
A submission is "async" IIF it's done by SQPOLL thread. Instead of
passing @async flag into io_submit_sqes(), deduce it from ctx->flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Jens Axboe
3bfa5bcb26 io_uring: cleanup io_poll_remove_one() logic
We only need apoll in the one section, do the juggling with the work
restoration there. This removes a special case further down as well.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:01 -06:00
Linus Torvalds
b48397cb75 Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve fix from Eric Biederman:
 "While working on my exec cleanups I found a bug in exec that I
  introduced by accident a couple of years ago. I apparently missed the
  fact that bprm->file can change.

  Now I have a very personal motive to clean up exec and make it more
  approachable.

  The change is just moving woud_dump to where it acts on the final
  bprm->file not the initial bprm->file. I have been careful and tested
  and verify this fix works"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  exec: Move would_dump into flush_old_exec
2020-05-17 12:23:37 -07:00
Eric W. Biederman
f87d1c9559 exec: Move would_dump into flush_old_exec
I goofed when I added mm->user_ns support to would_dump.  I missed the
fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
binfmt_script bprm->file is reassigned.  Which made the move of
would_dump from setup_new_exec to __do_execve_file before exec_binprm
incorrect as it can result in would_dump running on the script instead
of the interpreter of the script.

The net result is that the code stopped making unreadable interpreters
undumpable.  Which allows them to be ptraced and written to disk
without special permissions.  Oops.

The move was necessary because the call in set_new_exec was after
bprm->mm was no longer valid.

To correct this mistake move the misplaced would_dump from
__do_execve_file into flos_old_exec, before exec_mmap is called.

I tested and confirmed that without this fix I can attach with gdb to
a script with an unreadable interpreter, and with this fix I can not.

Cc: stable@vger.kernel.org
Fixes: f84df2a6f2 ("exec: Ensure mm->user_ns contains the execed files")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-17 10:48:24 -05:00
Pavel Begunkov
bd2ab18a1d io_uring: fix FORCE_ASYNC req preparation
As for other not inlined requests, alloc req->io for FORCE_ASYNC reqs,
so they can be prepared properly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 09:22:09 -06:00
Pavel Begunkov
650b548129 io_uring: don't prepare DRAIN reqs twice
If req->io is not NULL, it's already prepared. Don't do it again,
it's dangerous.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 09:22:09 -06:00
Jens Axboe
583863ed91 io_uring: initialize ctx->sqo_wait earlier
Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated,
instead of deferring it to the offload setup. This fixes a syzbot
reported lockdep complaint, which is really due to trying to wake_up
on an uninitialized wait queue:

RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319
RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b
RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260
R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 assign_lock_key kernel/locking/lockdep.c:913 [inline]
 register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
 __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234
 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
 __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122
 io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160
 io_poll_remove_all fs/io_uring.c:4357 [inline]
 io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305
 io_uring_create fs/io_uring.c:7843 [inline]
 io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x441319
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9

Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 09:20:00 -06:00
Linus Torvalds
5a9ffb954a three small cifs/smb3 fixes, one for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7Ao9QACgkQiiy9cAdy
 T1GjGgv+L2zqdaHOFaFWFsQejY5DjQ7U7EjwMvCLBoM1RgTIPosQCdwo8EqNkPm/
 fHtHVyG7I2vHjv9zmcxPPphasHOl/WwDZf8VP9u+cRCH+/2NRTZziCqW1kFpi4ET
 q88K5DWD6FMZVZZxP+mlJKLws3Za+I0wujx3VylbRfX20mniFLNFGQyNA3TCCw8k
 gEiv4TUE9dhzX+PULLIL3/63ZIYay3IfwN3GTuLIdOMlINGj1DxfrXX1VTVeiNKb
 uuii5Sb5XbHt0+ZylU787Sbvr7t61GZXDjBwKV12o/P2kcRc2BklekkDbs20FZJZ
 g5rkmcUYY0atDEak6MYr931QE8LgotsQL9aH9n5Mb7Hra1t9/lfjoDov+KoA83S8
 7dnzBaDzrlsbij15DADYrg0ygJe/zcUHq88ETc8UH4dQUrnhJQSn5Tf+xASP8H0N
 SJgpGxIQs30rMUKstFz1/xD1yJ30M55kxhU5Lchre5WuGorGR57c7Kv/j2O2leR8
 iiddqtPE
 =aPOp
 -----END PGP SIGNATURE-----

Merge tag '5.7-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small cifs/smb3 fixes, one for stable"

* tag '5.7-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix leaked reference on requeued write
  cifs: Fix null pointer check in cifs_read
  CIFS: Spelling s/EACCESS/EACCES/
2020-05-16 21:43:11 -07:00
Linus Torvalds
18e70f3a76 io_uring-5.7-2020-05-15
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6/cDAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo1DEADMUm23Ha/X5YSXujM55hzV6MKQQ+BjNwl1
 WIU0qdDkzmoiqSpcengjHPo/2CZvjlavXGaLDWKqfz1pojld9iBhCVgFaOm03Aiq
 yKbG5gu5UcKNGtlE7cObhkaX8WFDtyZBuoyx+MB6F2eOdO2sbuOEV4o4bRB0w6Gt
 hn91zHtQRS5WPBxxLGbj8xDOWrrhbiHhfM2QkUX9kd10D9UcOdHsdPt2hPs2L6OF
 sKkjwcVEd8oR4A5l1pKEwrbm1+7Gn1H0fR8HuBtaZ3ZU65ugaPB6cj6Brn8GDBuT
 JZ4ThKPqb/6jYVpxWVKCMYGLydSDZYxsQWJP7FgICqeqQgb6q+wHYScgJTPx8v4/
 rJgfgWSPbxVf/kKhjFj2JYo8QDcR5R7IUlO4TxcRaIockPdRdc4kTc7EFtzodUoN
 /5wWAuVuIRBicu+KBshrB3RNTT7NCcYfSom+eqMPIvWTvw1vIpK/5ULV6WZjqFbV
 bPt3mUjEYV3AfQ9l39pkoclQ3YWWq718l1iuyHdD+1u/41D85UTfGHa/m9NibO5g
 TP6Jj3FWLeD0UCpUiTEuUfYfdXxnP7R5SNnF+F15fucuHtRILh7rhTCqGbG4crLw
 0a1fOx0wmf1JUxQmQtJLQTQ4LdUfwqBbSekKrviClc8HizvFroQyxaFMIjgRXn8X
 F37nPAzswA==
 =kqDx
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.7-2020-05-15' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two small fixes that should go into this release:

   - Check and handle zero length splice (Pavel)

   - Fix a regression in this merge window for fixed files used with
     polled block IO"

* tag 'io_uring-5.7-2020-05-15' of git://git.kernel.dk/linux-block:
  io_uring: polled fixed file must go through free iteration
  io_uring: fix zero len do_splice()
2020-05-16 13:17:41 -07:00
Linus Torvalds
12bf0b632e NFS client bugfixes for Linux 5.7
Highlights include:
 
 Stable fixes:
 - nfs: fix NULL deference in nfs4_get_valid_delegation
 
 Bugfixes:
 - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
 - Fix several fscache cookie issues
 - Fix a fscache queuing race that can trigger a BUG_ON
 - NFS: Fix 2 use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
 - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
 - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
 - SUNRPC: Signalled ASYNC tasks need to exit
 - NFSv3: fix rpc receive buffer size for MOUNT call
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6/AkgACgkQZwvnipYK
 APICLQ/9ENY+mQmVMSbtw2VGlBphS+GLN44k70NgmjAaI9n8f3ILZyLl0/iHEPFU
 ZaCWhPWEs76olLfCWwoQFzISIUTHWVUow8mJn7gTh4mq8n9UFv8zgUxtJ3HTfEHt
 rtujG9xsK0Oa351UPJWE5yC7PvFzMohcc1vVD8WQeGJQ3sMBVHTuOuPatIv3vK6s
 8MflKTbtz/gUkcbWMCk9ljMPXr6/Ksgu9GZDnDFAYZBfFkwx//RNmq6K+z1Ru15s
 tkmPPZGNMCfKblrnUXUmPt78wxSExmWrXroSMNas2fyeOXgPL0ogNx8vfdFcFsxs
 sHpMkF97+npntNj1y5om6GrdU4SRYUFqXT7pNqV4wuGguOfFELIXrJIQuCDaKrGD
 ApEoo9UDisGCLdqs738ascZFHZiTQoy6drbpR8moalqhYkTI7Al/pPxHnozGDYsJ
 +wElaFXZX2hlPc6ih1q54RcB+D4qswDC9QudArKc9hJEKPv+SsmiVhBG/f+X+Jca
 M19UJGWZvRtY8L+0yJdG22O9Hwo0zSK917gtOZNgkwtkkKgzjj2kcNHTK2jGBEuR
 pqEIQCreUH8Le0WR9cPeJeYc2/HeCEWDHrf/q+gFRClaZMe+0Sfu3z3pBH3v0T9q
 qoZ1VvCDUhggsZyJZebcPTCL+ghqOXFJajHLlcA6BSdGJ/iXq2M=
 =lqFu
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - nfs: fix NULL deference in nfs4_get_valid_delegation

  Bugfixes:
   - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
   - Fix several fscache cookie issues
   - Fix a fscache queuing race that can trigger a BUG_ON
   - NFS: Fix two use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
   - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
   - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
   - SUNRPC: Signalled ASYNC tasks need to exit
   - NFSv3: fix rpc receive buffer size for MOUNT call"

* tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv3: fix rpc receive buffer size for MOUNT call
  SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
  NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs
  NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
  SUNRPC: Signalled ASYNC tasks need to exit
  nfs: fix NULL deference in nfs4_get_valid_delegation
  SUNRPC: fix use-after-free in rpc_free_client_work()
  cachefiles: Fix race between read_waiter and read_copier involving op->to_do
  NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
  NFS: Fix fscache super_cookie allocation
  NFS: Fix fscache super_cookie index_key from changing after umount
  cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()
2020-05-15 14:03:13 -07:00
Eric Biggers
cdeb21da17 fscrypt: add fscrypt_add_test_dummy_key()
Currently, the test_dummy_encryption mount option (which is used for
encryption I/O testing with xfstests) uses v1 encryption policies, and
it relies on userspace inserting a test key into the session keyring.

We need test_dummy_encryption to support v2 encryption policies too.
Requiring userspace to add the test key doesn't work well with v2
policies, since v2 policies only support the filesystem keyring (not the
session keyring), and keys in the filesystem keyring are lost when the
filesystem is unmounted.  Hooking all test code that unmounts and
re-mounts the filesystem would be difficult.

Instead, let's make the filesystem automatically add the test key to its
keyring when test_dummy_encryption is enabled.

That puts the responsibility for choosing the test key on the kernel.
We could just hard-code a key.  But out of paranoia, let's first try
using a per-boot random key, to prevent this code from being misused.
A per-boot key will work as long as no one expects dummy-encrypted files
to remain accessible after a reboot.  (gce-xfstests doesn't.)

Therefore, this patch adds a function fscrypt_add_test_dummy_key() which
implements the above.  The next patch will use it.

Link: https://lore.kernel.org/r/20200512233251.118314-3-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-15 13:51:45 -07:00
David S. Miller
da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Jens Axboe
6a4d07cde5 io_uring: file registration list and lock optimization
There's no point in using list_del_init() on entries that are going
away, and the associated lock is always used in process context so
let's not use the IRQ disabling+saving variant of the spinlock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 14:37:14 -06:00
Stefano Garzarella
7e55a19cf6 io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags
This new flag should be set/clear from the application to
disable/enable eventfd notifications when a request is completed
and queued to the CQ ring.

Before this patch, notifications were always sent if an eventfd is
registered, so IORING_CQ_EVENTFD_DISABLED is not set during the
initialization.

It will be up to the application to set the flag after initialization
if no notifications are required at the beginning.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 12:16:59 -06:00
Stefano Garzarella
0d9b5b3af1 io_uring: add 'cq_flags' field for the CQ ring
This patch adds the new 'cq_flags' field that should be written by
the application and read by the kernel.

This new field is available to the userspace application through
'cq_off.flags'.
We are using 4-bytes previously reserved and set to zero. This means
that if the application finds this field to zero, then the new
functionality is not supported.

In the next patch we will introduce the first flag available.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 12:16:59 -06:00
Jens Axboe
18bceab101 io_uring: allow POLL_ADD with double poll_wait() users
Some file descriptors use separate waitqueues for their f_ops->poll()
handler, most commonly one for read and one for write. The io_uring
poll implementation doesn't work with that, as the 2nd poll_wait()
call will cause the io_uring poll request to -EINVAL.

This affects (at least) tty devices and /dev/random as well. This is a
big problem for event loops where some file descriptors work, and others
don't.

With this fix, io_uring handles multiple waitqueues.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 11:56:54 -06:00
Jens Axboe
4a38aed2a0 io_uring: batch reap of dead file registrations
We currently embed and queue a work item per fixed_file_ref_node that
we update, but if the workload does a lot of these, then the associated
kworker-events overhead can become quite noticeable.

Since we rarely need to wait on these, batch them at 1 second intervals
instead. If we do need to wait for them, we just flush the pending
delayed work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 11:56:18 -06:00
Sami Tolvanen
628d06a48f scs: Add page accounting for shadow call stack allocations
This change adds accounting for the memory allocated for shadow stacks.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15 16:35:49 +01:00
David S. Miller
d00f26b623 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional
   helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 20:31:21 -07:00
Jens Axboe
0f158b4cf2 io_uring: name sq thread and ref completions
We used to have three completions, now we just have two. With the two,
let's not allocate them dynamically, just embed then in the ctx and
name them appropriately.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 17:18:39 -06:00
Adam McCoy
a481379960 cifs: fix leaked reference on requeued write
Failed async writes that are requeued may not clean up a refcount
on the file, which can result in a leaked open. This scenario arises
very reliably when using persistent handles and a reconnect occurs
while writing.

cifs_writev_requeue only releases the reference if the write fails
(rc != 0). The server->ops->async_writev operation will take its own
reference, so the initial reference can always be released.

Signed-off-by: Adam McCoy <adam@forsedomani.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-05-14 17:47:01 -05:00
Olga Kornievskaia
8eed292bc8 NFSv3: fix rpc receive buffer size for MOUNT call
Prior to commit e3d3ab64dd66 ("SUNRPC: Use au_rslack when
computing reply buffer size"), there was enough slack in the reply
buffer to commodate filehandles of size 60bytes. However, the real
problem was that the reply buffer size for the MOUNT operation was
not correctly calculated. Received buffer size used the filehandle
size for NFSv2 (32bytes) which is much smaller than the allowed
filehandle size for the v3 mounts.

Fix the reply buffer size (decode arguments size) for the MNT command.

Fixes: 2c94b8eca1 ("SUNRPC: Use au_rslack when computing reply buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-14 18:42:44 -04:00
Roman Penyaev
65759097d8 epoll: call final ep_events_available() check under the lock
There is a possible race when ep_scan_ready_list() leaves ->rdllist and
->obflist empty for a short period of time although some events are
pending.  It is quite likely that ep_events_available() observes empty
lists and goes to sleep.

Since commit 339ddb53d3 ("fs/epoll: remove unnecessary wakeups of
nested epoll") we are conservative in wakeups (there is only one place
for wakeup and this is ep_poll_callback()), thus ep_events_available()
must always observe correct state of two lists.

The easiest and correct way is to do the final check under the lock.
This does not impact the performance, since lock is taken anyway for
adding a wait entry to the wait queue.

The discussion of the problem can be found here:

   https://lore.kernel.org/linux-fsdevel/a2f22c3c-c25a-4bda-8339-a7bdaf17849e@akamai.com/

In this patch barrierless __set_current_state() is used.  This is safe
since waitqueue_active() is called under the same lock on wakeup side.

Short-circuit for fatal signals (i.e.  fatal_signal_pending() check) is
moved to the line just before actual events harvesting routine.  This is
fully compliant to what is said in the comment of the patch where the
actual fatal_signal_pending() check was added: c257a340ed ("fs, epoll:
short circuit fetching events if thread has been killed").

Fixes: 339ddb53d3 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Reported-by: Jason Baron <jbaron@akamai.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200505145609.1865152-1-rpenyaev@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-14 10:00:35 -07:00
Steve French
9bd21d4b1a cifs: Fix null pointer check in cifs_read
Coverity scan noted a redundant null check

Coverity-id: 728517
Reported-by: Coverity <scan-admin@coverity.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
2020-05-14 10:30:03 -05:00
Miklos Szeredi
c8ffd8bcdd vfs: add faccessat2 syscall
POSIX defines faccessat() as having a fourth "flags" argument, while the
linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

Add a new faccessat(2) syscall with the added flags argument and implement
both flags.

The value of AT_EACCESS is defined in glibc headers to be the same as
AT_REMOVEDIR.  Use this value for the kernel interface as well, together
with the explanatory comment.

Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
be useful and is trivial to implement.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
55923e4d7d vfs: don't parse "silent" option
Parsing "silent" and clearing SB_SILENT makes zero sense.

Parsing "silent" and setting SB_SILENT would make a bit more sense, but
apparently nobody cares.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
caaef1ba8c vfs: don't parse "posixacl" option
Unlike the others, this is _not_ a standard option accepted by mount(8).

In fact SB_POSIXACL is an internal flag, and accepting MS_POSIXACL on the
mount(2) interface is possibly a bug.

The only filesystem that apparently wants to handle the "posixacl" option
is 9p, but it has special handling of that option besides setting
SB_POSIXACL.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
9193ae87a8 vfs: don't parse forbidden flags
Makes little sense to keep this blacklist synced with what mount(8) parses
and what it doesn't.  E.g. it has various forms of "*atime" options, but
not "atime"...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
80340fe360 statx: add mount_root
Determining whether a path or file descriptor refers to a mountpoint (or
more precisely a mount root) is not trivial using current tools.

Add a flag to statx that indicates whether the path or fd refers to the
root of a mount or not.

Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
fa2fcf4f1d statx: add mount ID
Systemd is hacking around to get it and it's trivial to add to statx, so...

Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
761e28fa27 statx: don't clear STATX_ATIME on SB_RDONLY
IS_NOATIME(inode) is defined as __IS_FLG(inode, SB_RDONLY|SB_NOATIME), so
generic_fillattr() will clear STATX_ATIME from the result_mask if the super
block is marked read only.

This was probably not the intention, so fix to only clear STATX_ATIME if
the fs doesn't support atime at all.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
581701b7ef uapi: deprecate STATX_ALL
Constants of the *_ALL type can be actively harmful due to the fact that
developers will usually fail to consider the possible effects of future
changes to the definition.

Deprecate STATX_ALL in the uapi, while no damage has been done yet.

We could keep something like this around in the kernel, but there's
actually no point, since all filesystems should be explicitly checking
flags that they support and not rely on the VFS masking unknown ones out: a
flag could be known to the VFS, yet not known to the filesystem.

Cc: David Howells <dhowells@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
44a3b87444 utimensat: AT_EMPTY_PATH support
This makes it possible to use utimensat on an O_PATH file (including
symlinks).

It supersedes the nonstandard utimensat(fd, NULL, ...) form.

Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
9470451505 vfs: split out access_override_creds()
Split out a helper that overrides the credentials in preparation for
actually doing the access check.

This prepares for the next patch that optionally disables the creds
override.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
9f6c61f96f proc/mounts: add cursor
If mounts are deleted after a read(2) call on /proc/self/mounts (or its
kin), the subsequent read(2) could miss a mount that comes after the
deleted one in the list.  This is because the file position is interpreted
as the number mount entries from the start of the list.

E.g. first read gets entries #0 to #9; the seq file index will be 10.  Then
entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10,
etc...  The next read will continue from entry #10, and #9 is missed.

Solve this by adding a cursor entry for each open instance.  Taking the
global namespace_sem for write seems excessive, since we are only dealing
with a per-namespace list.  Instead add a per-namespace spinlock and use
that together with namespace_sem taken for read to protect against
concurrent modification of the mount list.  This may reduce parallelism of
is_local_mountpoint(), but it's hardly a big contention point.  We could
also use RCU freeing of cursors to make traversal not need additional
locks, if that turns out to be neceesary.

Only move the cursor once for each read (cursor is not added on open) to
minimize cacheline invalidation.  When EOF is reached, the cursor is taken
off the list, in order to prevent an excessive number of cursors due to
inactive open file descriptors.

Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
530f32fc37 aio: fix async fsync creds
Avi Kivity reports that on fuse filesystems running in a user namespace
asyncronous fsync fails with EOVERFLOW.

The reason is that f_ops->fsync() is called with the creds of the kthread
performing aio work instead of the creds of the process originally
submitting IOCB_CMD_FSYNC.

Fuse sends the creds of the caller in the request header and it needs to
translate the uid and gid into the server's user namespace.  Since the
kthread is running in init_user_ns, the translation will fail and the
operation returns an error.

It can be argued that fsync doesn't actually need any creds, but just
zeroing out those fields in the header (as with requests that currently
don't take creds) is a backward compatibility risk.

Instead of working around this issue in fuse, solve the core of the problem
by calling the filesystem with the proper creds.

Reported-by: Avi Kivity <avi@scylladb.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Fixes: c9582eb0ff ("fuse: Fail all requests with invalid uids or gids")
Cc: stable@vger.kernel.org  # 4.18+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14 16:44:24 +02:00
Miklos Szeredi
a3c751a50f vfs: allow unprivileged whiteout creation
Whiteouts, unlike real device node should not require privileges to create.

The general concern with device nodes is that opening them can have side
effects.  The kernel already avoids zero major (see
Documentation/admin-guide/devices.txt).  To be on the safe side the patch
explicitly forbids registering a char device with 0/0 number (see
cdev_add()).

This guarantees that a non-O_PATH open on a whiteout will fail with ENODEV;
i.e. it won't have any side effect.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:23 +02:00
Nishad Kamdar
508578f2f5 xfs: Use the correct style for SPDX License Identifier
This patch corrects the SPDX License Identifier style in header files
related to XFS File System support. For C header files
Documentation/process/license-rules.rst mandates C-like comments.
(opposed to C source files where C++ style should be used).

Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 15:32:45 -07:00
Gustavo A. R. Silva
ee4064e56c xfs: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 15:32:45 -07:00
Zheng Bin
237aac4624 xfs: ensure f_bfree returned by statfs() is non-negative
Construct an img like this:

dd if=/dev/zero of=xfs.img bs=1M count=20
mkfs.xfs -d agcount=1 xfs.img
xfs_db -x xfs.img
sb 0
write fdblocks 0
agf 0
write freeblks 0
write longest 0
quit

mount it, df -h /mnt(xfs mount point), will show this:
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       17M  -64Z  -32K 100% /mnt

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 15:32:45 -07:00
Jens Axboe
9d9e88a24c io_uring: polled fixed file must go through free iteration
When we changed the file registration handling, it became important to
iterate the bulk request freeing list for fixed files as well, or we
miss dropping the fixed file reference. If not, we're leaking references,
and we'll get a kworker stuck waiting for file references to disappear.

This also means we can remove the special casing of fixed vs non-fixed
files, we need to iterate for both and we can just rely on
__io_req_aux_free() doing io_put_file() instead of doing it manually.

Fixes: 0558955373 ("io_uring: refactor file register/unregister/update handling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13 13:00:00 -06:00
Ira Weiny
2c567af418 fs: Introduce DCACHE_DONTCACHE
DCACHE_DONTCACHE indicates a dentry should not be cached on final
dput().

Also add a helper function to mark DCACHE_DONTCACHE on all dentries
pointing to a specific inode when that inode is being set I_DONTCACHE.

This facilitates dropping dentry references to inodes sooner which
require eviction to swap S_DAX mode.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 08:44:35 -07:00
Ira Weiny
dae2f8ed79 fs: Lift XFS_IDONTCACHE to the VFS layer
DAX effective mode (S_DAX) changes requires inode eviction.

XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the
inode if no other additional references are taken.  We lift this flag to
the VFS layer and change the behavior slightly by allowing the flag to
remain even if multiple references are taken.

This will expedite the eviction of inodes to change S_DAX.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 08:44:35 -07:00
Trond Myklebust
4fa7ef69e2 NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs
When we're doing pnfs then the credential being used for the RPC call
is not necessarily the same as the one used in the open context, so
don't use RPC_TASK_CRED_NOREF.

Fixes: 6129650720 ("NFSv4: Avoid referencing the cred unnecessarily during NFSv4 I/O")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-13 09:55:36 -04:00
Christian Brauner
303cc571d1
nsproxy: attach to namespaces via pidfds
For quite a while we have been thinking about using pidfds to attach to
namespaces. This patchset has existed for about a year already but we've
wanted to wait to see how the general api would be received and adopted.
Now that more and more programs in userspace have started using pidfds
for process management it's time to send this one out.

This patch makes it possible to use pidfds to attach to the namespaces
of another process, i.e. they can be passed as the first argument to the
setns() syscall. When only a single namespace type is specified the
semantics are equivalent to passing an nsfd. That means
setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
when a pidfd is passed, multiple namespace flags can be specified in the
second setns() argument and setns() will attach the caller to all the
specified namespaces all at once or to none of them. Specifying 0 is not
valid together with a pidfd.

Here are just two obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various use-cases
where callers setns to a subset of namespaces to retain privilege, perform
an action and then re-attach another subset of namespaces.

If the need arises, as Eric suggested, we can extend this patchset to
assume even more context than just attaching all namespaces. His suggestion
specifically was about assuming the process' root directory when
setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
keep it flexible in terms of supporting subsets of namespaces but let's
wait until we have users asking for even more context to be assumed. At
that point we can add an extension.

The obvious example where this is useful is a standard container
manager interacting with a running container: pushing and pulling files
or directories, injecting mounts, attaching/execing any kind of process,
managing network devices all these operations require attaching to all
or at least multiple namespaces at the same time. Given that nowadays
most containers are spawned with all namespaces enabled we're currently
looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
nsfds, another 7 to actually perform the namespace switch. With time
namespaces we're looking at about 16 syscalls.
(We could amortize the first 7 or 8 syscalls for opening the nsfds by
 stashing them in each container's monitor process but that would mean
 we need to send around those file descriptors through unix sockets
 everytime we want to interact with the container or keep on-disk
 state. Even in scenarios where a caller wants to join a particular
 namespace in a particular order callers still profit from batching
 other namespaces. That mostly applies to the user namespace but
 all container runtimes I found join the user namespace first no matter
 if it privileges or deprivileges the container similar to how unshare
 behaves.)
With pidfds this becomes a single syscall no matter how many namespaces
are supposed to be attached to.

A decently designed, large-scale container manager usually isn't the
parent of any of the containers it spawns so the containers don't die
when it crashes or needs to update or reinitialize. This means that
for the manager to interact with containers through pids is inherently
racy especially on systems where the maximum pid number is not
significicantly bumped. This is even more problematic since we often spawn
and manage thousands or ten-thousands of containers. Interacting with a
container through a pid thus can become risky quite quickly. Especially
since we allow for an administrator to enable advanced features such as
syscall interception where we're performing syscalls in lieu of the
container. In all of those cases we use pidfds if they are available and
we pass them around as stable references. Using them to setns() to the
target process' namespaces is as reliable as using nsfds. Either the
target process is already dead and we get ESRCH or we manage to attach
to its namespaces but we can't accidently attach to another process'
namespaces. So pidfds lend themselves to be used with this api.
The other main advantage is that with this change the pidfd becomes the
only relevant token for most container interactions and it's the only
token we need to create and send around.

Apart from significiantly reducing the number of syscalls from double
digit to single digit which is a decent reason post-spectre/meltdown
this also allows to switch to a set of namespaces atomically, i.e.
either attaching to all the specified namespaces succeeds or we fail. If
we fail we haven't changed a single namespace. There are currently three
namespaces that can fail (other than for ENOMEM which really is not
very interesting since we then have other problems anyway) for
non-trivial reasons, user, mount, and pid namespaces. We can fail to
attach to a pid namespace if it is not our current active pid namespace
or a descendant of it. We can fail to attach to a user namespace because
we are multi-threaded or because our current mount namespace shares
filesystem state with other tasks, or because we're trying to setns()
to the same user namespace, i.e. the target task has the same user
namespace as we do. We can fail to attach to a mount namespace because
it shares filesystem state with other tasks or because we fail to lookup
the new root for the new mount namespace. In most non-pathological
scenarios these issues can be somewhat mitigated. But there are cases where
we're half-attached to some namespace and failing to attach to another one.
I've talked about some of these problem during the hallway track (something
only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
in 2018(?). Even if all these issues could be avoided with super careful
userspace coding it would be nicer to have this done in-kernel. Pidfds seem
to lend themselves nicely for this.

The other neat thing about this is that setns() becomes an actual
counterpart to the namespace bits of unshare().

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
2020-05-13 11:41:22 +02:00
Dan Carpenter
9aafc1b018 ovl: potential crash in ovl_fid_to_fh()
The "buflen" value comes from the user and there is a potential that it
could be zero.  In do_handle_to_path() we know that "handle->handle_bytes"
is non-zero and we do:

	handle_dwords = handle->handle_bytes >> 2;

So values 1-3 become zero.  Then in ovl_fh_to_dentry() we do:

	int len = fh_len << 2;

So now len is in the "0,4-128" range and a multiple of 4.  But if
"buflen" is zero it will try to copy negative bytes when we do the
memcpy in ovl_fid_to_fh().

	memcpy(&fh->fb, fid, buflen - OVL_FH_WIRE_OFFSET);

And that will lead to a crash.  Thanks to Amir Goldstein for his help
with this patch.

Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 11:10:57 +02:00
Johannes Thumshirn
02ef12a663 zonefs: use REQ_OP_ZONE_APPEND for sync DIO
Synchronous direct I/O to a sequential write only zone can be issued using
the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
BIOs can potentially result in reordering, we cannot support asynchronous
IO via this interface.

We also can only dispatch up to queue_max_zone_append_sectors() via the
new zone-append method and have to return a short write back to user-space
in case an IO larger than queue_max_zone_append_sectors() has been issued.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 20:36:28 -06:00
Ming Lei
e6249cdd46 block: add blk_io_schedule() for avoiding task hung in sync dio
Sync dio could be big, or may take long time in discard or in case of
IO failure.

We have prevented task hung in submit_bio_wait() and blk_execute_rq(),
so apply the same trick for prevent task hung from happening in sync dio.

Add helper of blk_io_schedule() and use io_schedule_timeout() to prevent
task hung warning.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Salman Qazi <sqazi@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 20:32:42 -06:00
Eric Biggers
9cd6b593cf fs-verity: remove unnecessary extern keywords
Remove the unnecessary 'extern' keywords from function declarations.
This makes it so that we don't have a mix of both styles, so it won't be
ambiguous what to use in new fs-verity patches.  This also makes the
code shorter and matches the 'checkpatch --strict' expectation.

Link: https://lore.kernel.org/r/20200511192118.71427-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:44:00 -07:00
Eric Biggers
6377a38bd3 fs-verity: fix all kerneldoc warnings
Fix all kerneldoc warnings in fs/verity/ and include/linux/fsverity.h.
Most of these were due to missing documentation for function parameters.

Detected with:

    scripts/kernel-doc -v -none fs/verity/*.{c,h} include/linux/fsverity.h

This cleanup makes it possible to check new patches for kerneldoc
warnings without having to filter out all the existing ones.

Link: https://lore.kernel.org/r/20200511192118.71427-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:43:59 -07:00
Eric Biggers
607009020a fscrypt: remove unnecessary extern keywords
Remove the unnecessary 'extern' keywords from function declarations.
This makes it so that we don't have a mix of both styles, so it won't be
ambiguous what to use in new fscrypt patches.  This also makes the code
shorter and matches the 'checkpatch --strict' expectation.

Link: https://lore.kernel.org/r/20200511191358.53096-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:37:17 -07:00
Eric Biggers
d2fe97545a fscrypt: fix all kerneldoc warnings
Fix all kerneldoc warnings in fs/crypto/ and include/linux/fscrypt.h.
Most of these were due to missing documentation for function parameters.

Detected with:

    scripts/kernel-doc -v -none fs/crypto/*.{c,h} include/linux/fscrypt.h

This cleanup makes it possible to check new patches for kerneldoc
warnings without having to filter out all the existing ones.

For consistency, also adjust some function "brief descriptions" to
include the parentheses and to wrap at 80 characters.  (The latter
matches the checkpatch expectation.)

Link: https://lore.kernel.org/r/20200511191358.53096-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 16:37:17 -07:00
Linus Torvalds
e719340f46 Various gfs2 fixes
Fixes for bugs prior to v5.7-rc1:
 - Fix random block reads when reading fragmented journals (v5.2).
 - Fix a possible random memory access in gfs2_walk_metadata (v5.3).
 
 Fixes for v5.7-rc1:
 - Fix several overlooked gfs2_qa_get / gfs2_qa_put imbalances.
 - Fix several bugs in the new filesystem withdraw logic.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl66togUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTruhxAAttUMbVZaxny7zB+gXc7fqvM3T6BE
 m613knneGkkQIRQqzLXKictUkWiTItkNaM7HFwO9MJfDZO1xMett941RpIlW4oa5
 d42EWRKEwAZZOx+Yz9tE9G/fPoh0Rz16Svl/0EJ6NG5QfTyyuSQoH+MbRibVlYy2
 XVnfMKZAEyOsIJ8lu3xRzjLTwkRK/8X+QpF/syanEq9oaFMYtB7j1TOgimVUMV3m
 5va4+PXARx1/Dsgn/21zgsZgQ4IW7ZYXzjxZuX9CwbKaszz+f77pyxkea5fDvVFo
 16OaFXtl+dzBJ4vIdZr9OfQTvMfSCxWiXgjxj+6W152qXEQkyKDWGETH7A3yVZ4n
 9G3N+Cdpp09gM8tmI9140uTDNXLg8M34fTtHntqckPKpNZ9IvzoXTp3ebSe92pwJ
 +5K1//ifcTqbnHCwTCPPYEtIRGbm/I0en0H9A3tqFmKDNdarnVuZ5QHJVrSF9x8g
 z+Go3NJlhevq64OGLXd8UlODRevpGPjQWdrcFjeuLhtcqbUVjcERoEcaBsJoKdus
 NYn+yT5CqzMqLZzXLIAfm9TfCry9/NF7D/7acsZZ05BEyz+WOwHVMTcTdsAfT6Ft
 1ytU7tufdM/Zw/8t6lI89rC/XcDwAm/vEpQLd27xUvMKOKaEQYKs8geh9du4Q+fN
 yaQOvgDhmPVIKwI=
 =4Sbr
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.7-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:
 "Various gfs2 fixes.

  Fixes for bugs prior to v5.7:
   - Fix random block reads when reading fragmented journals (v5.2)
   - Fix a possible random memory access in gfs2_walk_metadata (v5.3)

  Fixes for v5.7:
   - Fix several overlooked gfs2_qa_get / gfs2_qa_put imbalances
   - Fix several bugs in the new filesystem withdraw logic"

* tag 'gfs2-v5.7-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  Revert "gfs2: Don't demote a glock until its revokes are written"
  gfs2: If go_sync returns error, withdraw but skip invalidate
  gfs2: Grab glock reference sooner in gfs2_add_revoke
  gfs2: don't call quota_unhold if quotas are not locked
  gfs2: move privileged user check to gfs2_quota_lock_check
  gfs2: remove check for quotas on in gfs2_quota_check
  gfs2: Change BUG_ON to an assert_withdraw in gfs2_quota_change
  gfs2: Fix problems regarding gfs2_qa_get and _put
  gfs2: More gfs2_find_jhead fixes
  gfs2: Another gfs2_walk_metadata fix
  gfs2: Fix use-after-free in gfs2_logd after withdraw
  gfs2: Fix BUG during unmount after file system withdraw
  gfs2: Fix error exit in do_xmote
  gfs2: fix withdraw sequence deadlock
2020-05-12 10:32:32 -07:00
Kees Cook
7a0ad54684 pstore: Refactor pstorefs record list removal
The "unlink" handling should perform list removal (which can also make
sure records don't get double-erased), and the "evict" handling should
be responsible only for memory freeing.

Link: https://lore.kernel.org/lkml/20200506152114.50375-8-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:15:29 -07:00
Kees Cook
6248a0666c pstore: Add proper unregister lock checking
The pstore backend lock wasn't being used during pstore_unregister().
Add sanity check and locking.

Link: https://lore.kernel.org/lkml/20200506152114.50375-7-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:15:11 -07:00
Kees Cook
db23491c77 pstore: Convert "records_list" locking to mutex
The pstorefs internal list lock doesn't need to be a spinlock and will
create problems when trying to access the list in the subsequent patch
that will walk the pstorefs records during pstore_unregister(). Change
this to a mutex to avoid may_sleep() warnings when unregistering devices.

Link: https://lore.kernel.org/lkml/20200506152114.50375-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:14:18 -07:00
Kees Cook
47af61ffb1 pstore: Rename "allpstore" to "records_list"
The name "allpstore" doesn't carry much meaning, so rename it to what it
actually is: the list of all records present in the filesystem. The lock
is also renamed accordingly.

Link: https://lore.kernel.org/lkml/20200506152114.50375-5-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:14:05 -07:00
Kees Cook
cab12fd049 pstore: Convert "psinfo" locking to mutex
Currently pstore can only have a single backend attached at a time, and it
tracks the active backend via "psinfo", under a lock. The locking for this
does not need to be a spinlock, and in order to avoid may_sleep() issues
during future changes to pstore_unregister(), switch to a mutex instead.

Link: https://lore.kernel.org/lkml/20200506152114.50375-4-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:13:47 -07:00
Kees Cook
c30b20cd96 pstore: Rename "pstore_lock" to "psinfo_lock"
The name "pstore_lock" sounds very global, but it is only supposed to be
used for managing changes to "psinfo", so rename it accordingly.

Link: https://lore.kernel.org/lkml/20200506152114.50375-3-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:13:29 -07:00
Kees Cook
e7c1c00cf3 pstore: Drop useless try_module_get() for backend
There is no reason to be doing a module get/put in pstore_register(),
since the module calling pstore_register() cannot be unloaded since it
hasn't finished its initialization. Remove it so there is no confusion
about how registration ordering works.

Link: https://lore.kernel.org/lkml/20200506152114.50375-2-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 09:12:31 -07:00
Trond Myklebust
f304a809a9 NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
We are not guaranteed that the credential will remain pinned.

Fixes: 6129650720 ("NFSv4: Avoid referencing the cred unnecessarily during NFSv4 I/O")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-11 14:06:51 -04:00
Trond Myklebust
2b666a110b Merge tag 'fscache-fixes-20200508-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
(1) The reorganisation of bmap() use accidentally caused the return value
     of cachefiles_read_or_alloc_pages() to get corrupted.

 (2) The NFS superblock index key accidentally got changed to include a
     number of kernel pointers - meaning that the key isn't matchable after
     a reboot.

 (3) A redundant check in nfs_fscache_get_super_cookie().

 (4) The NFS change_attr sometimes set in the auxiliary data for the
     caching of an file and sometimes not, which causes the cache to get
     discarded when it shouldn't.

 (5) There's a race between cachefiles_read_waiter() and
     cachefiles_read_copier() that causes an occasional assertion failure.
2020-05-11 14:06:50 -04:00
J. Bruce Fields
29fe839976 nfs: fix NULL deference in nfs4_get_valid_delegation
We add the new state to the nfsi->open_states list, making it
potentially visible to other threads, before we've finished initializing
it.

That wasn't a problem when all the readers were also taking the i_lock
(as we do here), but since we switched to RCU, there's now a possibility
that a reader could see the partially initialized state.

Symptoms observed were a crash when another thread called
nfs4_get_valid_delegation() on a NULL inode, resulting in an oops like:

	BUG: unable to handle page fault for address: ffffffffffffffb0 ...
	RIP: 0010:nfs4_get_valid_delegation+0x6/0x30 [nfsv4] ...
	Call Trace:
	 nfs4_open_prepare+0x80/0x1c0 [nfsv4]
	 __rpc_execute+0x75/0x390 [sunrpc]
	 ? finish_task_switch+0x75/0x260
	 rpc_async_schedule+0x29/0x40 [sunrpc]
	 process_one_work+0x1ad/0x370
	 worker_thread+0x30/0x390
	 ? create_worker+0x1a0/0x1a0
	 kthread+0x10c/0x130
	 ? kthread_park+0x80/0x80
	 ret_from_fork+0x22/0x30

Fixes: 9ae075fdd1 "NFSv4: Convert open state lookup to use RCU"
Reviewed-by: Seiichi Ikarashi <s.ikarashi@fujitsu.com>
Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-11 14:05:58 -04:00
David Howells
c410bf0193 rxrpc: Fix the excessive initial retransmission timeout
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
sufficiently sampled.  This can cause problems with some fileservers with
calls to the cache manager in the afs filesystem being dropped from the
fileserver because a packet goes missing and the retransmission timeout is
greater than the call expiry timeout.

Fix this by:

 (1) Copying the RTT/RTO calculation code from Linux's TCP implementation
     and altering it to fit rxrpc.

 (2) Altering the various users of the RTT to make use of the new SRTT
     value.

 (3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO
     value instead (which is needed in jiffies), along with a backoff.

Notes:

 (1) rxrpc provides RTT samples by matching the serial numbers on outgoing
     DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
     against the reference serial number in incoming REQUESTED ACK and
     PING-RESPONSE ACK packets.

 (2) Each packet that is transmitted on an rxrpc connection gets a new
     per-connection serial number, even for retransmissions, so an ACK can
     be cross-referenced to a specific trigger packet.  This allows RTT
     information to be drawn from retransmitted DATA packets also.

 (3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
     on an rxrpc_call because many RPC calls won't live long enough to
     generate more than one sample.

 (4) The calculated SRTT value is in units of 8ths of a microsecond rather
     than nanoseconds.

The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.

Fixes: 17926a7932 ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-11 16:42:28 +01:00
Xiaoming Ni
8469508951 io_uring: remove duplicate semicolon at the end of line
Remove duplicate semicolon at the end of line in io_file_from_index()

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-11 09:04:05 -06:00
Linus Torvalds
0a85ed6e7f block-5.7-2020-05-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl63WVAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkXWD/9qJgqQpPkigCCwwPHZ+phthw6gHeAgBxPH
 Cw6P9QB4QCdacZjQA6QH3zdxaDsCCitQRioWPgxngs1326TKYNzBi7U3eTEwiK12
 cnRybLnkzei4yzYVUSJk637oOoQh3CiJLvYcJBppGFi7crpbvlQv68M2hu05vhwL
 R/91H62X/5UaUlc1cJV63OBk8euWzF6XNbCQQrR4ayDvz+BsV5Fs72vYa1gx7qIt
 as/67oTT6y4U4pd74nT4OGkxDIXbXfn2eTbh5sMNc4ilBkqMyNbf8aOHdWqXZIBd
 18RKpNl6h/fiDMJ0jsGliReONLjfRBcJla68Kn1AFONMcyxcXidjptOwLOt2fYWf
 YMguCVMhfgxVBslzLWoQ9AWSiNVh36ycORWlCOrnRaOaQCb9OaLZ2fwibfZ0JsMd
 0259Z5vA7MIUoobCc5akXOYHbpByA9FSYkKudgTYLpdjkn05kxQyA12GgJjW3sVw
 ZRjoUuDuZDDUct6JcLWdrlONT8st05g+qf6PCoD+Jac8HtbpqHfKJJUtYecUat75
 4hGKhuvTzpuVY0wNHo3sgqKfsejQODTN6UhejNI11Zs/nx6O0ze/qoDuWZHncnKl
 158le+K5rNS8SUNbDBTMWp3OX4SJm/Gsf30fOWkkt6z1iaEfKc5sCxBHvSOeBEvH
 M9pzy56Vtw==
 =73nU
 -----END PGP SIGNATURE-----

Merge tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - a small series fixing a use-after-free of bdi name (Christoph,Yufen)

 - NVMe fix for a regression with the smaller CQ update (Alexey)

 - NVMe fix for a hang at namespace scanning error recovery (Sagi)

 - fix race with blk-iocost iocg->abs_vdebt updates (Tejun)

* tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block:
  nvme: fix possible hang when ns scanning fails during error recovery
  nvme-pci: fix "slimmer CQ head update"
  bdi: add a ->dev_name field to struct backing_dev_info
  bdi: use bdi_dev_name() to get device name
  bdi: move bdi_dev_name out of line
  vboxsf: don't use the source name in the bdi name
  iocost: protect iocg->abs_vdebt with iocg->waitq.lock
2020-05-10 11:16:07 -07:00
Yonghong Song
138d0be35b net: bpf: Add netlink and ipv6_route bpf_iter targets
This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
for /proc/net/{netlink,ipv6_route}.

The net namespace for these targets are the current net
namespace at file open stage, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open stage.

Since module is not supported for now, ipv6_route is
supported only if the IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once module
is properly supported for bpf_iter.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Christoph Hellwig
af00423a3d hfs: stop using ioctl_by_bdev
Instead just call the CDROM layer functionality directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig
1cd925d583 bdi: remove the name field in struct backing_dev_info
The name is only printed for a not registered bdi in writeback.  Use the
device name there as is more useful anyway for the unlike case that the
warning triggers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig
aef33c2ff8 bdi: simplify bdi_alloc
Merge the _node vs normal version and drop the superflous gfp_t argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Jens Axboe
873f1c8df7 Merge branch 'block-5.7' into for-5.8/block
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
the blk-iocost changes, but we also need the base of the bdi
use-after-free as well as we build on top of it.

* block-5.7:
  nvme: fix possible hang when ns scanning fails during error recovery
  nvme-pci: fix "slimmer CQ head update"
  bdi: add a ->dev_name field to struct backing_dev_info
  bdi: use bdi_dev_name() to get device name
  bdi: move bdi_dev_name out of line
  vboxsf: don't use the source name in the bdi name
  iocost: protect iocg->abs_vdebt with iocg->waitq.lock
  block: remove the bd_openers checks in blk_drop_partitions
  nvme: prevent double free in nvme_alloc_ns() error handling
  null_blk: Cleanup zoned device initialization
  null_blk: Fix zoned command handling
  block: remove unused header
  blk-iocost: Fix error on iocost_ioc_vrate_adj
  bdev: Reduce time holding bd_mutex in sync in blkdev_close()
  buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:13:58 -06:00
Yufen Yu
d51cfc53ad bdi: use bdi_dev_name() to get device name
Use the common interface bdi_dev_name() to get device name.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Add missing <linux/backing-dev.h> include BFQ

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:07:39 -06:00
Al Viro
502fd722fe btrfs_ioctl_send(): don't bother with access_ok()
we do copy_from_user() on that range anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-09 15:59:10 -04:00
Al Viro
f06d3a7e6e fat_dir_ioctl(): hadn't needed that access_ok() for more than a decade...
address is passed only to put_user() and copy_to_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-09 15:57:04 -04:00
Al Viro
37d59a5148 dlmfs_file_write(): get rid of pointless access_ok()
address passed only to copy_from_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-09 15:54:58 -04:00
Linus Torvalds
1d3962ae3b io_uring-5.7-2020-05-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl62HvYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptEAEACbuLfgFok0Vw8j7KNW0WNNKlS2o6nXQlW5
 cl95JsqYdSL+toiDPQnJFtdoaxMhzL90kbWZzvPTBj+yTpLzRX0YnwFqXwFfmrga
 gd/7SOM5C97F1LCPL+luhbgp5HUq+ZVH882KjMiOVLvjjAb4SeKSexQGoxeKvtcV
 Pg3xm+zsbKKvclRDEqhnZB1X93WFAIrufuKBuV5xMZar7lkeRS9zwBUHySXa00xF
 i7lbvDqtNn3itgNQd7VGSNCF5u4JxCUm73SumY3nDMFXBfvSNk0nUpFBpTYLjb7G
 0XY71tfWrBlbk1sssqr1Dbs+pRuxJRj9FgtfNAMid7gcK0L9k6n7v08cFxkIz4Sv
 XPHisD6QCOz7pZ5JwfdAp9Ea5g9z+QsN0G1Owr18fSgWwlgvhJ9rdd4H0Of7rWVj
 mGyF5f+ZqoLD2UhaEmLgjQoSvzPlb6rsAUL9SxgpZkg/mk5l0j5tk32JS5bJL8h5
 RTj0oeyqoVGKqnRy8heV/0z6TqcEtuNn/nOsht8adCgIUVpk95bkjTGBM900IK/X
 HhdJMqPlTEDXQic+ZxVYNHDTZFhq4UOVJkoDfEwIN971LZfUaiz8XZ6uG5m4rFqj
 iRmLN5XJNVNK52hNT1dLQyeQ4j3a5OnVGsvjZ33QLy2P6rCZd7yU6jKfsoL8JDEU
 uAzkaWqLjA==
 =YeXV
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix finish_wait() balancing in file cancelation (Xiaoguang)

 - Ensure early cleanup of resources in ring map failure (Xiaoguang)

 - Ensure IORING_OP_SLICE does the right file mode checks (Pavel)

 - Remove file opening from openat/openat2/statx, it's not needed and
   messes with O_PATH

* tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block:
  io_uring: don't use 'fd' for openat/openat2/statx
  splice: move f_mode checks to do_{splice,tee}()
  io_uring: handle -EFAULT properly in io_uring_setup()
  io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
2020-05-09 12:02:09 -07:00
Pavel Begunkov
c96874265c io_uring: fix zero len do_splice()
do_splice() doesn't expect len to be 0. Just always return 0 in this
case as splice(2) does.

Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 10:16:10 -06:00
Christian Brauner
f2a8d52e0a
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.

The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
2020-05-09 13:57:12 +02:00
Xiaoguang Wang
7d01bd745a io_uring: remove obsolete 'state' parameter
The "struct io_submit_state *state" parameter is not used, remove it.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-08 21:29:47 -06:00
Jens Axboe
904fbcb115 io_uring: remove 'fd is io_uring' from close path
The attempt protecting us from closing the ring itself wasn't really
complete, and we actually don't need it. The referencing of requests
themselve, and the references they hold on the ring, ensures that the
life time of the ring is sane. With the check removed, we can also
remove the need to have the close operation fget() the file.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-08 21:27:24 -06:00
Lei Xue
7bb0c53384 cachefiles: Fix race between read_waiter and read_copier involving op->to_do
There is a potential race in fscache operation enqueuing for reading and
copying multiple pages from cachefiles to netfs.  The problem can be seen
easily on a heavy loaded system (for example many processes reading files
continually on an NFS share covered by fscache triggered this problem within
a few minutes).

The race is due to cachefiles_read_waiter() adding the op to the monitor
to_do list and then then drop the object->work_lock spinlock before
completing fscache_enqueue_operation().  Once the lock is dropped,
cachefiles_read_copier() grabs the op, completes processing it, and
makes it through fscache_retrieval_complete() which sets the op->state to
the final state of FSCACHE_OP_ST_COMPLETE(4).  When cachefiles_read_waiter()
finally gets through the remainder of fscache_enqueue_operation()
it sees the invalid state, and hits the ASSERTCMP and the following
oops is seen:
[ 2259.612361] FS-Cache:
[ 2259.614785] FS-Cache: Assertion failed
[ 2259.618639] FS-Cache: 4 == 5 is false
[ 2259.622456] ------------[ cut here ]------------
[ 2259.627190] kernel BUG at fs/fscache/operation.c:70!
...
[ 2259.791675] RIP: 0010:[<ffffffffc061b4cf>]  [<ffffffffc061b4cf>] fscache_enqueue_operation+0xff/0x170 [fscache]
[ 2259.802059] RSP: 0000:ffffa0263d543be0  EFLAGS: 00010046
[ 2259.807521] RAX: 0000000000000019 RBX: ffffa01a4d390480 RCX: 0000000000000006
[ 2259.814847] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffa0263d553890
[ 2259.822176] RBP: ffffa0263d543be8 R08: 0000000000000000 R09: ffffa0263c2d8708
[ 2259.829502] R10: 0000000000001e7f R11: 0000000000000000 R12: ffffa01a4d390480
[ 2259.844483] R13: ffff9fa9546c5920 R14: ffffa0263d543c80 R15: ffffa0293ff9bf10
[ 2259.859554] FS:  00007f4b6efbd700(0000) GS:ffffa0263d540000(0000) knlGS:0000000000000000
[ 2259.875571] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2259.889117] CR2: 00007f49e1624ff0 CR3: 0000012b38b38000 CR4: 00000000007607e0
[ 2259.904015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2259.918764] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2259.933449] PKRU: 55555554
[ 2259.943654] Call Trace:
[ 2259.953592]  <IRQ>
[ 2259.955577]  [<ffffffffc03a7c12>] cachefiles_read_waiter+0x92/0xf0 [cachefiles]
[ 2259.978039]  [<ffffffffa34d3942>] __wake_up_common+0x82/0x120
[ 2259.991392]  [<ffffffffa34d3a63>] __wake_up_common_lock+0x83/0xc0
[ 2260.004930]  [<ffffffffa34d3510>] ? task_rq_unlock+0x20/0x20
[ 2260.017863]  [<ffffffffa34d3ab3>] __wake_up+0x13/0x20
[ 2260.030230]  [<ffffffffa34c72a0>] __wake_up_bit+0x50/0x70
[ 2260.042535]  [<ffffffffa35bdcdb>] unlock_page+0x2b/0x30
[ 2260.054495]  [<ffffffffa35bdd09>] page_endio+0x29/0x90
[ 2260.066184]  [<ffffffffa368fc81>] mpage_end_io+0x51/0x80

CPU1
cachefiles_read_waiter()
 20 static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
 21                                   int sync, void *_key)
 22 {
...
 61         spin_lock(&object->work_lock);
 62         list_add_tail(&monitor->op_link, &op->to_do);
 63         spin_unlock(&object->work_lock);
<begin race window>
 64
 65         fscache_enqueue_retrieval(op);
182 static inline void fscache_enqueue_retrieval(struct fscache_retrieval *op)
183 {
184         fscache_enqueue_operation(&op->op);
185 }
 58 void fscache_enqueue_operation(struct fscache_operation *op)
 59 {
 60         struct fscache_cookie *cookie = op->object->cookie;
 61
 62         _enter("{OBJ%x OP%x,%u}",
 63                op->object->debug_id, op->debug_id, atomic_read(&op->usage));
 64
 65         ASSERT(list_empty(&op->pend_link));
 66         ASSERT(op->processor != NULL);
 67         ASSERT(fscache_object_is_available(op->object));
 68         ASSERTCMP(atomic_read(&op->usage), >, 0);
<end race window>

CPU2
cachefiles_read_copier()
168         while (!list_empty(&op->to_do)) {
...
202                 fscache_end_io(op, monitor->netfs_page, error);
203                 put_page(monitor->netfs_page);
204                 fscache_retrieval_complete(op, 1);

CPU1
 58 void fscache_enqueue_operation(struct fscache_operation *op)
 59 {
...
 69         ASSERTIFCMP(op->state != FSCACHE_OP_ST_IN_PROGRESS,
 70                     op->state, ==,  FSCACHE_OP_ST_CANCELLED);

Signed-off-by: Lei Xue <carmark.dlut@gmail.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08 23:01:10 +01:00
Dave Wysochanski
50eaa652b5 NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
Commit 402cb8dda9 ("fscache: Attach the index key and aux data to
the cookie") added the aux_data and aux_data_len to parameters to
fscache_acquire_cookie(), and updated the callers in the NFS client.
In the process it modified the aux_data to include the change_attr,
but missed adding change_attr to a couple places where aux_data was
used.  Specifically, when opening a file and the change_attr is not
added, the following attempt to lookup an object will fail inside
cachefiles_check_object_xattr() = -116 due to
nfs_fscache_inode_check_aux() failing memcmp on auxdata and returning
FSCACHE_CHECKAUX_OBSOLETE.

Fix this by adding nfs_fscache_update_auxdata() to set the auxdata
from all relevant fields in the inode, including the change_attr.

Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08 22:20:24 +01:00
Dave Wysochanski
1575161273 NFS: Fix fscache super_cookie allocation
Commit f2aedb713c ("NFS: Add fs_context support.") reworked
NFS mount code paths for fs_context support which included
super_block initialization.  In the process there was an extra
return left in the code and so we never call
nfs_fscache_get_super_cookie even if 'fsc' is given on as mount
option.  In addition, there is an extra check inside
nfs_fscache_get_super_cookie for the NFS_OPTION_FSCACHE which
is unnecessary since the only caller nfs_get_cache_cookie
checks this flag.

Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08 22:20:24 +01:00
Dave Wysochanski
d9bfced1fb NFS: Fix fscache super_cookie index_key from changing after umount
Commit 402cb8dda9 ("fscache: Attach the index key and aux data to
the cookie") added the index_key and index_key_len parameters to
fscache_acquire_cookie(), and updated the callers in the NFS client.
One of the callers was inside nfs_fscache_get_super_cookie()
and was changed to use the full struct nfs_fscache_key as the
index_key.  However, a couple members of this structure contain
pointers and thus will change each time the same NFS share is
remounted.  Since index_key is used for fscache_cookie->key_hash
and this subsequently is used to compare cookies, the effectiveness
of fscache with NFS is reduced to the point at which a umount
occurs.   Any subsequent remount of the same share will cause a
unique NFS super_block index_key and key_hash to be generated for
the same data, rendering any prior fscache data unable to be
found.  A simple reproducer demonstrates the problem.

1. Mount share with 'fsc', create a file, drop page cache
systemctl start cachefilesd
mount -o vers=3,fsc 127.0.0.1:/export /mnt
dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1
echo 3 > /proc/sys/vm/drop_caches

2. Read file into page cache and fscache, then unmount
dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1
umount /mnt

3. Remount and re-read which should come from fscache
mount -o vers=3,fsc 127.0.0.1:/export /mnt
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1

4. Check for READ ops in mountstats - there should be none
grep READ: /proc/self/mountstats

Looking at the history and the removed function, nfs_super_get_key(),
we should only use nfs_fscache_key.key plus any uniquifier, for
the fscache index_key.

Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08 22:20:24 +01:00
Bob Peterson
b14c94908b Revert "gfs2: Don't demote a glock until its revokes are written"
This reverts commit df5db5f9ee.

This patch fixes a regression: patch df5db5f9ee allowed function
run_queue() to bypass its call to do_xmote() if revokes were queued for
the glock. That's wrong because its call to do_xmote() is what is
responsible for calling the go_sync() glops functions to sync both
the ail list and any revokes queued for it. By bypassing the call,
gfs2 could get into a stand-off where the glock could not be demoted
until its revokes are written back, but the revokes would not be
written back because do_xmote() was never called.

It "sort of" works, however, because there are other mechanisms like
the log flush daemon (logd) that can sync the ail items and revokes,
if it deems it necessary. The problem is: without file system pressure,
it might never deem it necessary.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08 15:01:25 -05:00
Bob Peterson
b11e1a84f3 gfs2: If go_sync returns error, withdraw but skip invalidate
Before this patch, if the go_sync operation returned an error during
the do_xmote process (such as unable to sync metadata to the journal)
the code did goto out. That kept the glock locked, so it could not be
given away, which correctly avoids file system corruption. However,
it never set the withdraw bit or requeueing the glock work. So it would
hang forever, unable to ever demote the glock.

This patch changes to goto to a new label, skip_inval, so that errors
from go_sync are treated the same way as errors from go_inval:
The delayed withdraw bit is set and the work is requeued. That way,
the logd should eventually figure out there's a problem and withdraw
properly there.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08 15:00:07 -05:00
Linus Torvalds
eb24fdd8e6 Fixes for an endianness handling bug that prevented mounts on
big-endian arches, a spammy log message and a couple error paths.
 Also included a MAINTAINERS update.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl61ktUTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3yKB/9s0kZ7fLYtGzqtuoIjualsaM0lsBBS
 rWAN4BkIVsxp3eOd5Hdb+ngIY5ykLLcUd+4gKqUNHkB7/1upDq9ZURKlyTwel5Wy
 889YEYESCVQQxPVY9KNvafaPeuR++2r9Thlp9hWyczrtvXtz80sFIrtO9TwDrj1P
 ZXPN3lxppGlxQiVNQfKIw2Cs78OxaNu9BthXZ7jN2OGaMQ0NU6sZ4LRXz8rbY+od
 AbfLEfwz4dPHQ/44k3rQg2IWNuOxRK+CNayxhuN0KWzock3MzGVYoYkPx0wNLiDx
 rntMscBqh3kppILZPEIeIA5Nv0yDAf4tf2hcUDf7GoJT/L/f9v7Q2SHa
 =75Ca
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Fixes for an endianness handling bug that prevented mounts on
  big-endian arches, a spammy log message and a couple error paths.

  Also included a MAINTAINERS update"

* tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client:
  ceph: demote quotarealm lookup warning to a debug message
  MAINTAINERS: remove myself as ceph co-maintainer
  ceph: fix double unlock in handle_cap_export()
  ceph: fix special error code in ceph_try_get_caps()
  ceph: fix endianness bug when handling MDS session feature bits
2020-05-08 10:27:00 -07:00
Andreas Gruenbacher
f4e2f5e1a5 gfs2: Grab glock reference sooner in gfs2_add_revoke
This patch rearranges gfs2_add_revoke so that the extra glock
reference is added earlier on in the function to avoid races in which
the glock is freed before the new reference is taken.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08 18:49:04 +02:00
Bob Peterson
c9cb9e3819 gfs2: don't call quota_unhold if quotas are not locked
Before this patch, function gfs2_quota_unlock checked if quotas are
turned off, and if so, it branched to label out, which called
gfs2_quota_unhold. With the new system of gfs2_qa_get and put, we
no longer want to call gfs2_quota_unhold or we won't balance our
gets and puts.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08 18:49:04 +02:00
Bob Peterson
4ed0c30811 gfs2: move privileged user check to gfs2_quota_lock_check
Before this patch, function gfs2_quota_lock checked if it was called
from a privileged user, and if so, it bypassed the quota check:
superuser can operate outside the quotas.
That's the wrong place for the check because the lock/unlock functions
are separate from the lock_check function, and you can do lock and
unlock without actually checking the quotas.

This patch moves the check to gfs2_quota_lock_check.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08 18:47:58 +02:00
Bob Peterson
e6ce26e571 gfs2: remove check for quotas on in gfs2_quota_check
This patch removes a check from gfs2_quota_check for whether quotas
are enabled by the superblock. There is a test just prior for the
GIF_QD_LOCKED bit in the inode, and that can only be set by functions
that already check that quotas are enabled in the superblock.
Therefore, the check is redundant.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08 18:47:39 +02:00
Bob Peterson
f9615fe311 gfs2: Change BUG_ON to an assert_withdraw in gfs2_quota_change
Before this patch, gfs2_quota_change() would BUG_ON if the
qa_ref counter was not a positive number. This patch changes it to
be a withdraw instead. That way we can debug things more easily.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08 18:45:12 +02:00
Bob Peterson
2297ab6144 gfs2: Fix problems regarding gfs2_qa_get and _put
This patch fixes a couple of places in which gfs2_qa_get and gfs2_qa_put are
not balanced: we now keep references around whenever a file is open for writing
(see gfs2_open_common and gfs2_release), so we need to put all references we
grab in function gfs2_create_inode.  This was broken in the successful case and
on one error path.

This also means that we don't have a reference to put in gfs2_evict_inode.

In addition, gfs2_qa_put was called for the wrong inode in gfs2_link.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08 18:45:11 +02:00
Luis Henriques
12ae44a40a ceph: demote quotarealm lookup warning to a debug message
A misconfigured cephx can easily result in having the kernel client
flooding the logs with:

  ceph: Can't lookup inode 1 (err: -13)

Change this message to debug level.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/44546
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-08 18:44:40 +02:00
Linus Torvalds
c61529f6f5 Driver core fixes for 5.7-rc5
Here are a number of small driver core fixes for 5.7-rc5 to resolve a
 bunch of reported issues with the current tree.
 
 Biggest here are the reverts and patches from John Stultz to resolve a
 bunch of deferred probe regressions we have been seeing in 5.7-rc right
 now.
 
 Along with those are some other smaller fixes:
 	- coredump crash fix
 	- devlink fix for when permissive mode was enabled
 	- amba and platform device dma_parms fixes
 	- component error silenced for when deferred probe happens
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXrVnyg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylWBgCfbwjUbsDsHsrsVgWfOakIaoPUQ8IAmwetMKvS
 ny1Kq7Cia+2y2e+7fDyo
 =UKEM
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are a number of small driver core fixes for 5.7-rc5 to resolve a
  bunch of reported issues with the current tree.

  Biggest here are the reverts and patches from John Stultz to resolve a
  bunch of deferred probe regressions we have been seeing in 5.7-rc
  right now.

  Along with those are some other smaller fixes:

   - coredump crash fix

   - devlink fix for when permissive mode was enabled

   - amba and platform device dma_parms fixes

   - component error silenced for when deferred probe happens

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  regulator: Revert "Use driver_deferred_probe_timeout for regulator_init_complete_work"
  driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires
  driver core: Use dev_warn() instead of dev_WARN() for deferred_probe_timeout warnings
  driver core: Revert default driver_deferred_probe_timeout value to 0
  component: Silence bind error on -EPROBE_DEFER
  driver core: Fix handling of fw_devlink=permissive
  coredump: fix crash when umh is disabled
  amba: Initialize dma_parms for amba devices
  driver core: platform: Initialize dma_parms for platform devices
2020-05-08 09:06:34 -07:00
Chen Zhou
3d60548b21 xfs: remove duplicate headers
Remove duplicate headers which are included twice.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-08 08:51:34 -07:00
Brian Foster
43dc0aa84e xfs: fix unused variable warning in buffer completion on !DEBUG
The random buffer write failure errortag patch introduced a local
mount pointer variable for the test macro, but the macro is compiled
out on !DEBUG kernels. This results in an unused variable warning.

Access the mount structure through the buffer pointer and remove the
local mount pointer to address the warning.

Fixes: 7376d74547 ("xfs: random buffer write failure errortag")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-08 08:50:52 -07:00
Darrick J. Wong
6ea670ade2 xfs: remove unnecessary includes from xfs_log_recover.c
Remove unnecessary includes from the log recovery code.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-05-08 08:50:01 -07:00
Darrick J. Wong
17d29bf271 xfs: move log recovery buffer cancellation code to xfs_buf_item_recover.c
Move the helpers that handle incore buffer cancellation records to
xfs_buf_item_recover.c since they're not directly related to the main
log recovery machinery.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-08 08:50:01 -07:00
Darrick J. Wong
cc560a5a95 xfs: hoist setting of XFS_LI_RECOVERED to caller
The only purpose of XFS_LI_RECOVERED is to prevent log recovery from
trying to replay recovered intents more than once.  Therefore, we can
move the bit setting up to the ->iop_recover caller.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-08 08:50:01 -07:00
Darrick J. Wong
96b60f8267 xfs: refactor intent item iop_recover calls
Now that we've made the recovered item tests all the same, we can hoist
the test and the ail locking code to the ->iop_recover caller and call
the recovery function directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-08 08:50:01 -07:00
Darrick J. Wong
889eb55dd6 xfs: refactor intent item RECOVERED flag into the log item
Rename XFS_{EFI,BUI,RUI,CUI}_RECOVERED to XFS_LI_RECOVERED so that we
track recovery status in the log item, then get rid of the now unused
flags fields in each of those log item types.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-08 08:50:01 -07:00
Darrick J. Wong
86a3717413 xfs: refactor adding recovered intent items to the log
During recovery, every intent that we recover from the log has to be
added to the AIL.  Replace the open-coded addition with a helper.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-05-08 08:50:00 -07:00
Darrick J. Wong
154c733a33 xfs: refactor releasing finished intents during log recovery
Replace the open-coded AIL item walking with a proper helper when we're
trying to release an intent item that has been finished.  We add a new
->iop_match method to decide if an intent item matches a supplied ID.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-08 08:50:00 -07:00