Commit Graph

76926 Commits

Author SHA1 Message Date
Linus Torvalds
780d8ce716 v5.19 pull request
Small collection of incremental improvement patches:
 
 - Minor code cleanup patches, comment improvements, etc from static tools
 
 - Clean the some of the kernel caps, reducing the historical stealth uAPI
   leftovers
 
 - Bug fixes and minor changes for rdmavt, hns, rxe, irdma
 
 - Remove unimplemented cruft from rxe
 
 - Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer
 
 - flush_workqueue(system_unbound_wq) removal
 
 - Ensure rxe waits for objects to be unused before allowing the core to
   free them
 
 - Several rc quality bug fixes for hfi1
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYo+NxgAKCRCFwuHvBreF
 YbqSAQDJ+QolaATUvOQUPLbuLopUCJLe95VS15Kl3SNXiVUUFAEA8DLL1s6+WShd
 AgypUxGHipx5BAytrn45/WiwuDeEbQ8=
 =jgTl
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Small collection of incremental improvement patches:

   - Minor code cleanup patches, comment improvements, etc from static
     tools

   - Clean the some of the kernel caps, reducing the historical stealth
     uAPI leftovers

   - Bug fixes and minor changes for rdmavt, hns, rxe, irdma

   - Remove unimplemented cruft from rxe

   - Reorganize UMR QP code in mlx5 to avoid going through the IB verbs
     layer

   - flush_workqueue(system_unbound_wq) removal

   - Ensure rxe waits for objects to be unused before allowing the core
     to free them

   - Several rc quality bug fixes for hfi1"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits)
  RDMA/rtrs-clt: Fix one kernel-doc comment
  RDMA/hfi1: Remove all traces of diagpkt support
  RDMA/hfi1: Consolidate software versions
  RDMA/hfi1: Remove pointless driver version
  RDMA/hfi1: Fix potential integer multiplication overflow errors
  RDMA/hfi1: Prevent panic when SDMA is disabled
  RDMA/hfi1: Prevent use of lock before it is initialized
  RDMA/rxe: Fix an error handling path in rxe_get_mcg()
  IB/core: Fix typo in comment
  RDMA/core: Fix typo in comment
  IB/hf1: Fix typo in comment
  IB/qib: Fix typo in comment
  IB/iser: Fix typo in comment
  RDMA/mlx4: Avoid flush_scheduled_work() usage
  IB/isert: Avoid flush_scheduled_work() usage
  RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr()
  RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq()
  RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx()
  RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx()
  RDMA/irdma: Add SW mechanism to generate completions on error
  ...
2022-05-26 21:08:40 -07:00
Linus Torvalds
6d29d7fe4f NFSD 5.19 Release Notes
We introduce "courteous server" in this release. Previously NFSD
 would purge open and lock state for an unresponsive client after
 one lease period (typically 90 seconds). Now, after one lease
 period, another client can open and lock those files and the
 unresponsive client's lease is purged; otherwise if the unrespon-
 sive client's open and lock state is uncontended, the server retains
 that open and lock state for up to 24 hours, allowing the client's
 workload to resume after a lengthy network partition.
 
 A longstanding issue with NFSv4 file creation is also addressed.
 Previously a file creation can fail internally, returning an error
 to the client, but leave the newly created file in place as an
 artifact. The file creation code path has been reorganized so that
 internal failures and race conditions are less likely to result in
 an unwanted file creation.
 
 A fault injector has been added to help exercise paths that are run
 during kernel metadata cache invalidation. These caches contain
 information maintained by user space about exported filesystems.
 Many of our test workloads do not trigger cache invalidation.
 
 There is one patch that is needed to support PREEMPT_RT and a fix
 for an ancient "sleep while spin-locked" splat that seems to have
 become easier to hit since v5.18-rc3.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKPliAACgkQM2qzM29m
 f5dB3BAAorPa2L8xu5P1Ge1oTNogNSOVRkLPDzEkfEwK07ZM2qvz78eMZGkMziJ/
 strorvBWl3SWBlVtTePgNpJUjgYQ75MRRwaX7Qh2WuHeRKm1JlZm0/NId3+zKgbh
 N40QI20jdswWcNDuhidxVFFWurd09GlcM4z1cu8gZLbfthkiUOjZoPiLkXeNcvhk
 7wC9GiueWxHefYQQDAKh1nQS/L0GG1EkzJdJo7WUVAldZ9qVY9LpmJVMRqrBBbta
 XrFYfpeY1zFFDY4Qolyz5PUJSeQuDj9PctlhoZ6B1hp56PD/6yaqVhYXiPxtlALj
 tITtktfiekULZkgfvfvyzssCv+wkbYiaEBZcSSCauR7dkGOmBmajO+cf7vpsERgE
 fbCU8DWGk78SMeehdCrO+26cV37VP+8c2t2Txq/rG5Eq4ZoCi++Hj5poRboFLqb+
 oom+0Ee0LfcAKXkxH5gWTPTblHo49GzGitPZtRzTgZ9uFnVwvEaJ4+t0ij0J8JpL
 HuVtWrg5/REhqpEvOSwF0sRmkYWLTu7KdueGn/iZ8xUi7GHEue01NsVkClohKJcR
 WOjWrbNCNF/LJaG88MX0z5u7IO7s9bOHphd7PJ92vR+4YsehW3uRhk+rNi2ZBqQz
 hzULfu8BiaicV9fdB/hDcMmKQD6U6due2AVVPtxTf5XY+CHQNRY=
 =phE1
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "We introduce 'courteous server' in this release. Previously NFSD would
  purge open and lock state for an unresponsive client after one lease
  period (typically 90 seconds). Now, after one lease period, another
  client can open and lock those files and the unresponsive client's
  lease is purged; otherwise if the unresponsive client's open and lock
  state is uncontended, the server retains that open and lock state for
  up to 24 hours, allowing the client's workload to resume after a
  lengthy network partition.

  A longstanding issue with NFSv4 file creation is also addressed.
  Previously a file creation can fail internally, returning an error to
  the client, but leave the newly created file in place as an artifact.
  The file creation code path has been reorganized so that internal
  failures and race conditions are less likely to result in an unwanted
  file creation.

  A fault injector has been added to help exercise paths that are run
  during kernel metadata cache invalidation. These caches contain
  information maintained by user space about exported filesystems. Many
  of our test workloads do not trigger cache invalidation.

  There is one patch that is needed to support PREEMPT_RT and a fix for
  an ancient 'sleep while spin-locked' splat that seems to have become
  easier to hit since v5.18-rc3"

* tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits)
  NFSD: nfsd_file_put() can sleep
  NFSD: Add documenting comment for nfsd4_release_lockowner()
  NFSD: Modernize nfsd4_release_lockowner()
  NFSD: Fix possible sleep during nfsd4_release_lockowner()
  nfsd: destroy percpu stats counters after reply cache shutdown
  nfsd: Fix null-ptr-deref in nfsd_fill_super()
  nfsd: Unregister the cld notifier when laundry_wq create failed
  SUNRPC: Use RMW bitops in single-threaded hot paths
  NFSD: Clean up the show_nf_flags() macro
  NFSD: Trace filecache opens
  NFSD: Move documenting comment for nfsd4_process_open2()
  NFSD: Fix whitespace
  NFSD: Remove dprintk call sites from tail of nfsd4_open()
  NFSD: Instantiate a struct file when creating a regular NFSv4 file
  NFSD: Clean up nfsd_open_verified()
  NFSD: Remove do_nfsd_create()
  NFSD: Refactor NFSv4 OPEN(CREATE)
  NFSD: Refactor NFSv3 CREATE
  NFSD: Refactor nfsd_create_setattr()
  NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create()
  ...
2022-05-26 20:52:24 -07:00
Darrick J. Wong
efc2efeba1 xfs: move xfs_attr_use_log_assist usage out of libxfs
The LARP patchset added an awkward coupling point between libxfs and
what would be libxlog, if the XFS log were actually its own library.
Move the code that sets up logged xattr updates out of libxfs and into
xfs_xattr.c so that libxfs no longer has to know about xlog_* functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:34:04 +10:00
Darrick J. Wong
d9c61ccb3b xfs: move xfs_attr_use_log_assist out of xfs_log.c
The LARP patchset added an awkward coupling point between libxfs and
what would be libxlog, if the XFS log were actually its own library.
Move the code that enables logged xattr updates out of "lib"xlog and into
xfs_xattr.c so that it no longer has to know about xlog_* functions.

While we're at it, give xfs_xattr.c its own header file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:33:29 +10:00
Darrick J. Wong
202865cc21 xfs: warn about LARP once per mount
Since LARP is an experimental debug-only feature, we should try to warn
about it being in use once per mount, not once per reboot.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:32:07 +10:00
Darrick J. Wong
df5660cf63 xfs: implement per-mount warnings for scrub and shrink usage
Currently, we don't have a consistent story around logging when an
EXPERIMENTAL feature gets turned on at runtime -- online fsck and shrink
log a message once per day across all mounts, and the recently merged
LARP mode only ever does it once per insmod cycle or reboot.

Because EXPERIMENTAL tags are supposed to go away eventually, convert
the existing daily warnings into state flags that travel with the mount,
and warn once per mount.  Making this an opstate flag means that we'll
be able to capture the experimental usage in the ftrace output too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:31:34 +10:00
Darrick J. Wong
374037966d xfs: don't log every time we clear the log incompat flags
There's no need to spam the logs every time we clear the log incompat
flags -- if someone is periodically using one of these features, they'll
be cleared every time the log tries to clean itself, which can get
pretty chatty:

$ dmesg | grep -i clear
[ 5363.894711] XFS (sdd): Clearing log incompat feature flags.
[ 5365.157516] XFS (sdd): Clearing log incompat feature flags.
[ 5369.388543] XFS (sdd): Clearing log incompat feature flags.
[ 5371.281246] XFS (sdd): Clearing log incompat feature flags.

These aren't high value messages either -- nothing's gone wrong, and
nobody's trying anything tricky.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:29:51 +10:00
Darrick J. Wong
910bbdf2f4 xfs: convert buf_cancel_table allocation to kmalloc_array
While we're messing around with how recovery allocates and frees the
buffer cancellation table, convert the allocation to use kmalloc_array
instead of the old kmem_alloc APIs, and make it handle a null return,
even though that's not likely.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:27:19 +10:00
Darrick J. Wong
8db074bd84 xfs: don't leak xfs_buf_cancel structures when recovery fails
If log recovery fails, we free the memory used by the buffer
cancellation buckets, but we don't actually traverse each bucket list to
free the individual xfs_buf_cancel objects.  This leads to a memory
leak, as reported by kmemleak in xfs/051:

unreferenced object 0xffff888103629560 (size 32):
  comm "mount", pid 687045, jiffies 4296935916 (age 10.752s)
  hex dump (first 32 bytes):
    08 d3 0a 01 00 00 00 00 08 00 00 00 01 00 00 00  ................
    d0 f5 0b 92 81 88 ff ff 80 64 64 25 81 88 ff ff  .........dd%....
  backtrace:
    [<ffffffffa0317c83>] kmem_alloc+0x73/0x140 [xfs]
    [<ffffffffa03234a9>] xlog_recover_buf_commit_pass1+0x139/0x200 [xfs]
    [<ffffffffa032dc27>] xlog_recover_commit_trans+0x307/0x350 [xfs]
    [<ffffffffa032df15>] xlog_recovery_process_trans+0xa5/0xe0 [xfs]
    [<ffffffffa032e12d>] xlog_recover_process_data+0x8d/0x140 [xfs]
    [<ffffffffa032e49d>] xlog_do_recovery_pass+0x19d/0x740 [xfs]
    [<ffffffffa032f22d>] xlog_do_log_recovery+0x6d/0x150 [xfs]
    [<ffffffffa032f343>] xlog_do_recover+0x33/0x1d0 [xfs]
    [<ffffffffa032faba>] xlog_recover+0xda/0x190 [xfs]
    [<ffffffffa03194bc>] xfs_log_mount+0x14c/0x360 [xfs]
    [<ffffffffa030bfed>] xfs_mountfs+0x50d/0xa60 [xfs]
    [<ffffffffa03124b5>] xfs_fs_fill_super+0x6a5/0x950 [xfs]
    [<ffffffff812b92a5>] get_tree_bdev+0x175/0x280
    [<ffffffff812b7c3a>] vfs_get_tree+0x1a/0x80
    [<ffffffff812e366f>] path_mount+0x6ff/0xaa0
    [<ffffffff812e3b13>] __x64_sys_mount+0x103/0x140

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:26:38 +10:00
Darrick J. Wong
2723234923 xfs: refactor buffer cancellation table allocation
Move the code that allocates and frees the buffer cancellation tables
used by log recovery into the file that actually uses the tables.  This
is a precursor to some cleanups and a memory leak fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:26:17 +10:00
Darrick J. Wong
a54f78def7 xfs: don't leak btree cursor when insrec fails after a split
The recent patch to improve btree cycle checking caused a regression
when I rebased the in-memory btree branch atop the 5.19 for-next branch,
because in-memory short-pointer btrees do not have AG numbers.  This
produced the following complaint from kmemleak:

unreferenced object 0xffff88803d47dde8 (size 264):
  comm "xfs_io", pid 4889, jiffies 4294906764 (age 24.072s)
  hex dump (first 32 bytes):
    90 4d 0b 0f 80 88 ff ff 00 a0 bd 05 80 88 ff ff  .M..............
    e0 44 3a a0 ff ff ff ff 00 df 08 06 80 88 ff ff  .D:.............
  backtrace:
    [<ffffffffa0388059>] xfbtree_dup_cursor+0x49/0xc0 [xfs]
    [<ffffffffa029887b>] xfs_btree_dup_cursor+0x3b/0x200 [xfs]
    [<ffffffffa029af5d>] __xfs_btree_split+0x6ad/0x820 [xfs]
    [<ffffffffa029b130>] xfs_btree_split+0x60/0x110 [xfs]
    [<ffffffffa029f6da>] xfs_btree_make_block_unfull+0x19a/0x1f0 [xfs]
    [<ffffffffa029fada>] xfs_btree_insrec+0x3aa/0x810 [xfs]
    [<ffffffffa029fff3>] xfs_btree_insert+0xb3/0x240 [xfs]
    [<ffffffffa02cb729>] xfs_rmap_insert+0x99/0x200 [xfs]
    [<ffffffffa02cf142>] xfs_rmap_map_shared+0x192/0x5f0 [xfs]
    [<ffffffffa02cf60b>] xfs_rmap_map_raw+0x6b/0x90 [xfs]
    [<ffffffffa0384a85>] xrep_rmap_stash+0xd5/0x1d0 [xfs]
    [<ffffffffa0384dc0>] xrep_rmap_visit_bmbt+0xa0/0xf0 [xfs]
    [<ffffffffa0384fb6>] xrep_rmap_scan_iext+0x56/0xa0 [xfs]
    [<ffffffffa03850d8>] xrep_rmap_scan_ifork+0xd8/0x160 [xfs]
    [<ffffffffa0385195>] xrep_rmap_scan_inode+0x35/0x80 [xfs]
    [<ffffffffa03852ee>] xrep_rmap_find_rmaps+0x10e/0x270 [xfs]

I noticed that xfs_btree_insrec has a bunch of debug code that return
out of the function immediately, without freeing the "new" btree cursor
that can be returned when _make_block_unfull calls xfs_btree_split.  Fix
the error return in this function to free the btree cursor.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:22:56 +10:00
Darrick J. Wong
86d40f1e49 xfs: purge dquots after inode walk fails during quotacheck
xfs/434 and xfs/436 have been reporting occasional memory leaks of
xfs_dquot objects.  These tests themselves were the messenger, not the
culprit, since they unload the xfs module, which trips the slub
debugging code while tearing down all the xfs slab caches:

=============================================================================
BUG xfs_dquot (Tainted: G        W        ): Objects remaining in xfs_dquot on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

Slab 0xffffea000606de00 objects=30 used=5 fp=0xffff888181b78a78 flags=0x17ff80000010200(slab|head|node=0|zone=2|lastcpupid=0xfff)
CPU: 0 PID: 3953166 Comm: modprobe Tainted: G        W         5.18.0-rc6-djwx #rc6 d5824be9e46a2393677bda868f9b154d917ca6a7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014

Since we don't generally rmmod the xfs module between fstests, this
means that xfs/434 is really just the canary in the coal mine --
something leaked a dquot, but we don't know who.  After days of pounding
on fstests with kmemleak enabled, I finally got it to spit this out:

unreferenced object 0xffff8880465654c0 (size 536):
  comm "u10:4", pid 88, jiffies 4294935810 (age 29.512s)
  hex dump (first 32 bytes):
    60 4a 56 46 80 88 ff ff 58 ea e4 5c 80 88 ff ff  `JVF....X..\....
    00 e0 52 49 80 88 ff ff 01 00 01 00 00 00 00 00  ..RI............
  backtrace:
    [<ffffffffa0740f6c>] xfs_dquot_alloc+0x2c/0x530 [xfs]
    [<ffffffffa07443df>] xfs_qm_dqread+0x6f/0x330 [xfs]
    [<ffffffffa07462a2>] xfs_qm_dqget+0x132/0x4e0 [xfs]
    [<ffffffffa0756bb0>] xfs_qm_quotacheck_dqadjust+0xa0/0x3e0 [xfs]
    [<ffffffffa075724d>] xfs_qm_dqusage_adjust+0x35d/0x4f0 [xfs]
    [<ffffffffa06c9068>] xfs_iwalk_ag_recs+0x348/0x5d0 [xfs]
    [<ffffffffa06c95d3>] xfs_iwalk_run_callbacks+0x273/0x540 [xfs]
    [<ffffffffa06c9e8d>] xfs_iwalk_ag+0x5ed/0x890 [xfs]
    [<ffffffffa06ca22f>] xfs_iwalk_ag_work+0xff/0x170 [xfs]
    [<ffffffffa06d22c9>] xfs_pwork_work+0x79/0x130 [xfs]
    [<ffffffff81170bb2>] process_one_work+0x672/0x1040
    [<ffffffff81171b1b>] worker_thread+0x59b/0xec0
    [<ffffffff8118711e>] kthread+0x29e/0x340
    [<ffffffff810032bf>] ret_from_fork+0x1f/0x30

Now we know that quotacheck is at fault, but even this report was
canaryish -- it was triggered by xfs/494, which doesn't actually mount
any filesystems.  (kmemleak can be a little slow to notice leaks, even
with fstests repeatedly whacking it to look for them.)  Looking at the
*previous* fstest, however, showed that the test run before xfs/494 was
xfs/117.  The tipoff to the problem is in this excerpt from dmesg:

XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0xdb/0x7b0 [xfs], inode 0x119 dinode
XFS (sda4): Unmount and run xfs_repair
XFS (sda4): First 128 bytes of corrupted metadata buffer:
00000000: 49 4e 81 a4 03 02 00 00 00 00 00 00 00 00 00 00  IN..............
00000010: 00 00 00 01 00 00 00 00 00 90 57 54 54 1a 4c 68  ..........WTT.Lh
00000020: 81 f9 7d e1 6d ee 16 00 34 bd 7d e1 6d ee 16 00  ..}.m...4.}.m...
00000030: 34 bd 7d e1 6d ee 16 00 00 00 00 00 00 00 00 00  4.}.m...........
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 02 00 00 00 00 00 00 00 00 96 80 f3 ab  ................
00000060: ff ff ff ff da 57 7b 11 00 00 00 00 00 00 00 03  .....W{.........
00000070: 00 00 00 01 00 00 00 10 00 00 00 00 00 00 00 08  ................
XFS (sda4): Quotacheck: Unsuccessful (Error -117): Disabling quotas.

The dinode verifier decided that the inode was corrupt, which causes
iget to return with EFSCORRUPTED.  Since this happened during
quotacheck, it is obvious that the kernel aborted the inode walk on
account of the corruption error and disabled quotas.  Unfortunately, we
neglect to purge the dquot cache before doing that, which is how the
dquots leaked.

The problems started 10 years ago in commit b84a3a, when the dquot lists
were converted to a radix tree, but the error handling behavior was not
correctly preserved -- in that commit, if the bulkstat failed and
usrquota was enabled, the bulkstat failure code would be overwritten by
the result of flushing all the dquots to disk.  As long as that
succeeds, we'd continue the quota mount as if everything were ok, but
instead we're now operating with a corrupt inode and incorrect quota
usage counts.  I didn't notice this bug in 2019 when I wrote commit
ebd126a, which changed quotacheck to skip the dqflush when the scan
doesn't complete due to inode walk failures.

Introduced-by: b84a3a9675 ("xfs: remove the per-filesystem list of dquots")
Fixes: ebd126a651 ("xfs: convert quotacheck to use the new iwalk functions")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:21:43 +10:00
Dave Chinner
56486f3071 xfs: assert in xfs_btree_del_cursor should take into account error
xfs/538 on a 1kB block filesystem failed with this assert:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_ino.allocated == 0 || xfs_is_shutdown(cur->bc_mp), file: fs/xfs/libxfs/xfs_btree.c, line: 448

The problem was that an allocation failed unexpectedly in
xfs_bmbt_alloc_block() after roughly 150,000 minlen allocation error
injections, resulting in an EFSCORRUPTED error being returned to
xfs_bmapi_write(). The error occurred on extent-to-btree format
conversion allocating the new root block:

 RIP: 0010:xfs_bmbt_alloc_block+0x177/0x210
 Call Trace:
  <TASK>
  xfs_btree_new_iroot+0xdf/0x520
  xfs_btree_make_block_unfull+0x10d/0x1c0
  xfs_btree_insrec+0x364/0x790
  xfs_btree_insert+0xaa/0x210
  xfs_bmap_add_extent_hole_real+0x1fe/0x9a0
  xfs_bmapi_allocate+0x34c/0x420
  xfs_bmapi_write+0x53c/0x9c0
  xfs_alloc_file_space+0xee/0x320
  xfs_file_fallocate+0x36b/0x450
  vfs_fallocate+0x148/0x340
  __x64_sys_fallocate+0x3c/0x70
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa

Why the allocation failed at this point is unknown, but is likely
that we ran the transaction out of reserved space and filesystem out
of space with bmbt blocks because of all the minlen allocations
being done causing worst case fragmentation of a large allocation.

Regardless of the cause, we've then called xfs_bmapi_finish() which
calls xfs_btree_del_cursor(cur, error) to tear down the cursor.

So we have a failed operation, error != 0, cur->bc_ino.allocated > 0
and the filesystem is still up. The assert fails to take into
account that allocation can fail with an error and the transaction
teardown will shut the filesystem down if necessary. i.e. the
assert needs to check "|| error != 0" as well, because at this point
shutdown is pending because the current transaction is dirty....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:21:09 +10:00
Dave Chinner
5b55cbc2d7 xfs: don't assert fail on perag references on teardown
Not fatal, the assert is there to catch developer attention. I'm
seeing this occasionally during recoveryloop testing after a
shutdown, and I don't want this to stop an overnight recoveryloop
run as it is currently doing.

Convert the ASSERT to a XFS_IS_CORRUPT() check so it will dump a
corruption report into the log and cause a test failure that way,
but it won't stop the machine dead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:21:04 +10:00
Dave Chinner
5672225e8f xfs: avoid unnecessary runtime sibling pointer endian conversions
Commit dc04db2aa7 has caused a small aim7 regression, showing a
small increase in CPU usage in __xfs_btree_check_sblock() as a
result of the extra checking.

This is likely due to the endian conversion of the sibling poitners
being unconditional instead of relying on the compiler to endian
convert the NULL pointer at compile time and avoiding the runtime
conversion for this common case.

Rework the checks so that endian conversion of the sibling pointers
is only done if they are not null as the original code did.

.... and these need to be "inline" because the compiler completely
fails to inline them automatically like it should be doing.

$ size fs/xfs/libxfs/xfs_btree.o*
   text	   data	    bss	    dec	    hex	filename
  51874	    240	      0	  52114	   cb92 fs/xfs/libxfs/xfs_btree.o.orig
  51562	    240	      0	  51802	   ca5a fs/xfs/libxfs/xfs_btree.o.inline

Just when you think the tools have advanced sufficiently we don't
have to care about stuff like this anymore, along comes a reminder
that *our tools still suck*.

Fixes: dc04db2aa7 ("xfs: detect self referencing btree sibling pointers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-27 10:20:45 +10:00
Linus Torvalds
44d35720c9 sysctl changes for v5.19-rc1
For two kernel releases now kernel/sysctl.c has been being cleaned up
 slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and
 all this caused merge conflicts with one susbystem or another.
 
 This tree was put together to help try to avoid conflicts with these cleanups
 going on different trees at time. So nothing exciting on this pull request,
 just cleanups.
 
 I actually had this sysctl-next tree up since v5.18 but I missed sending a
 pull request for it on time during the last merge window. And so these changes
 have been being soaking up on sysctl-next and so linux-next for a while.
 The last change was merged May 4th.
 
 Most of the compile issues were reported by 0day and fixed.
 
 To help avoid a conflict with bpf folks at Daniel Borkmann's request
 I merged bpf-next/pr/bpf-sysctl into sysctl-next to get the effor which
 moves the BPF sysctls from kernel/sysctl.c to BPF core.
 
 Possible merge conflicts and known resolutions as per linux-next:
 
 bfp:
 https://lkml.kernel.org/r/20220414112812.652190b5@canb.auug.org.au
 
 rcu:
 https://lkml.kernel.org/r/20220420153746.4790d532@canb.auug.org.au
 
 powerpc:
 https://lkml.kernel.org/r/20220520154055.7f964b76@canb.auug.org.au
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmKOq8ASHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinDAkQAJVo5YVM9f74UwYp4PQhTpjxJBCjRoZD
 z1u9bp5rMj2ujTC8Fr7VmzKaHrb8+r1C1WvCvZtIzemYNB4lZUrHpVDYfXuXiPRB
 ihPmEjhlPO5PFBx6cVCpI3cu9bEhG00rLc1QXnABx/pXwNPcOTJAGZJVamZvqubk
 chjgZrb7N+adHPfvS55v1+zpwdeKfpp5U3zuu5qlT/nn0GS0HCVzOj5fj4oC4wtJ
 IqfUubo+FX50Ga58yQABWNrjaPD9Crykz5ohVazy3ElQl0hJ4VsK65ct3blqc2vz
 1Bb8kPpWuv6aZ5nr1lCVE8qvF4ZIL33ySvpg5BSdWLQEDrBbSpzvJe9Yn7wgR+eq
 y7fhpO24+zRM82EoDMEvyxX9u1n1RsvoXRtf3ds9BGf63MUxk8a1cgjlU6vuyO2U
 JhDmfM1xzdKvPoY4COOnHzcAiIqzItTqKd09N5y0cahmYstROU8lvp9huhTAHqk1
 SjQMbLIZG7OnX8ZeQcR1EB8sq/IOPZT48ejj0iJmQ8FyMaep71MOQLYyLPAq4lgh
 JHXm8P6QdB57jfJbqAeNSyZoK0qdxOUR/83Zcah7Jjns6vkju1DNatEsaEEI2y2M
 4n7/rkHeZ3TyFHBUX4e9FomKvGLsAalDBRiqsuxLSOPMU8rGrNLAslOAtKwvp90X
 4ht3M2VP098l
 =btwh
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "For two kernel releases now kernel/sysctl.c has been being cleaned up
  slowly, since the tables were grossly long, sprinkled with tons of
  #ifdefs and all this caused merge conflicts with one susbystem or
  another.

  This tree was put together to help try to avoid conflicts with these
  cleanups going on different trees at time. So nothing exciting on this
  pull request, just cleanups.

  Thanks a lot to the Uniontech and Huawei folks for doing some of this
  nasty work"

* tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits)
  sched: Fix build warning without CONFIG_SYSCTL
  reboot: Fix build warning without CONFIG_SYSCTL
  kernel/kexec_core: move kexec_core sysctls into its own file
  sysctl: minor cleanup in new_dir()
  ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n
  fs/proc: Introduce list_for_each_table_entry for proc sysctl
  mm: fix unused variable kernel warning when SYSCTL=n
  latencytop: move sysctl to its own file
  ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y
  ftrace: Fix build warning
  ftrace: move sysctl_ftrace_enabled to ftrace.c
  kernel/do_mount_initrd: move real_root_dev sysctls to its own file
  kernel/delayacct: move delayacct sysctls to its own file
  kernel/acct: move acct sysctls to its own file
  kernel/panic: move panic sysctls to its own file
  kernel/lockdep: move lockdep sysctls to its own file
  mm: move page-writeback sysctls to their own file
  mm: move oom_kill sysctls to their own file
  kernel/reboot: move reboot sysctls to its own file
  sched: Move energy_aware sysctls to topology.c
  ...
2022-05-26 16:57:20 -07:00
Linus Torvalds
98931dd95f Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
 
 Johannes Weiner has arranged for zswap memory use to be tracked and
 managed on a per-cgroup basis.
 
 Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
 enablement of the recent huge page vmemmap optimization feature.
 
 Baolin Wang contributes a series to fix some issues around hugetlb
 pagetable invalidation.
 
 Zhenwei Pi has fixed some interactions between hwpoisoned pages and
 virtualization.
 
 Tong Tiangen has enabled the use of the presently x86-only
 page_table_check debugging feature on arm64 and riscv.
 
 David Vernet has done some fixup work on the memcg selftests.
 
 Peter Xu has taught userfaultfd to handle write protection faults against
 shmem- and hugetlbfs-backed files.
 
 More DAMON development from SeongJae Park - adding online tuning of the
 feature and support for monitoring of fixed virtual address ranges.  Also
 easier discovery of which monitoring operations are available.
 
 Nadav Amit has done some optimization of TLB flushing during mprotect().
 
 Neil Brown continues to labor away at improving our swap-over-NFS support.
 
 David Hildenbrand has some fixes to anon page COWing versus
 get_user_pages().
 
 Peng Liu fixed some errors in the core hugetlb code.
 
 Joao Martins has reduced the amount of memory consumed by device-dax's
 compound devmaps.
 
 Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
 
 Muchun Song has found and fixed some errors in the TLB flushing of
 transparent hugepages.
 
 Roman Gushchin has done more work on the memcg selftests.
 
 And, of course, many smaller fixes and cleanups.  Notably, the customary
 million cleanup serieses from Miaohe Lin.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
 jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
 PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
 =nFe6
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
2022-05-26 12:32:41 -07:00
Chuck Lever
08af54b3e5 NFSD: nfsd_file_put() can sleep
Now that there are no more callers of nfsd_file_put() that might
hold a spin lock, ensure the lockdep infrastructure can catch
newly introduced calls to nfsd_file_put() made while a spinlock
is held.

Link: https://lore.kernel.org/linux-nfs/ece7fd1d-5fb3-5155-54ba-347cfc19bd9a@oracle.com/T/#mf1855552570cf9a9c80d1e49d91438cd9085aada
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-26 10:50:51 -04:00
Chuck Lever
043862b09c NFSD: Add documenting comment for nfsd4_release_lockowner()
And return explicit nfserr values that match what is documented in the
new comment / API contract.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-26 10:50:50 -04:00
Chuck Lever
bd8fdb6e54 NFSD: Modernize nfsd4_release_lockowner()
Refactor: Use existing helpers that other lock operations use. This
change removes several automatic variables, so re-organize the
variable declarations for readability.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-26 10:50:50 -04:00
Chuck Lever
ce3c4ad7f4 NFSD: Fix possible sleep during nfsd4_release_lockowner()
nfsd4_release_lockowner() holds clp->cl_lock when it calls
check_for_locks(). However, check_for_locks() calls nfsd_file_get()
/ nfsd_file_put() to access the backing inode's flc_posix list, and
nfsd_file_put() can sleep if the inode was recently removed.

Let's instead rely on the stateowner's reference count to gate
whether the release is permitted. This should be a reliable
indication of locks-in-use since file lock operations and
->lm_get_owner take appropriate references, which are released
appropriately when file locks are removed.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
2022-05-26 10:50:49 -04:00
Linus Torvalds
babf0bb978 xfs: Changes for 5.19-rc1
This update includes:
 - support for printk message indexing.
 - large extent counts to provide support for up to 2^47 data extents and 2^32
   attribute extents, allowing us to scale beyond 4 billion data extents
   to billions of xattrs per inode.
 - conversion of various flags fields to be consistently declared as
   unsigned bit fields.
 - improvements to realtime extent accounting and converts them to per-cpu
   counters to match all the other block and inode accounting.
 - reworks core log formatting code to reduce iterations, have a shorter, cleaner
   fast path and generally be easier to understand and maintain.
 - improvements to rmap btree searches that reduce overhead by up
   to 30% resulting in xfs_scrub runtime reductions of 15%.
 - improvements to reflink that remove the size limitations in remapping operations
   and greatly reduce the size of transaction reservations.
 - reworks the minimum log size calculations to allow us to change transaction
   reservations without changing the minimum supported log size.
 - removal of quota warning support as it has never been used on Linux.
 - intent whiteouts to allow us to cancel intents that are completed entirely
   in memory rather than having use CPU and disk bandwidth formatting and writing
   them into the journal when it is not necessary. This makes rmap, reflink and
   extent freeing slightly more efficient, but provides massive improvements
   for....
 - Logged Attribute Replay feature support. This is a fundamental change to the
   way we modify attributes, laying the foundation for future integration of
   attribute modifications as part of other atomic transactional operations the
   filesystem performs.
 - Lots of cleanups and fixes for the logged attribute replay functionality.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmKO2lIUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h0cYRAAutdpA5BZzfgpqnRbmjkOzCmhp6xj
 mSB6A8iBvlhtfY8p0IFFSbTT6jnf+EWfnsjy/jopojhhz5vCqYKfhGM6P9KBHxfz
 amxfmWZd3XWcnc8Ay9hcjLIa7QLQr8PXh3zJhjiYm8PvsrtNzsiEKrh6lxG6pe0w
 vQiq062ColCdN5DcuFVtfScsynCrzZCbUWFGm3y27NF00JpLdm8aBO57/ZaSFVdA
 UKKsogoPUNkRIbmf81IjTWTx2f0syNQyjrK+CX0sxGb6nzcoU/dT8qQ5t/U5gPTc
 cGpHE6vyBLdNA6BlnrFBoVAQ/M8n+ixnYy7XytZuTL5Izo80N+Vo+U5d1nLvC+fn
 ZLKAxbtpudqjy2O393Nv0cqEkT/xPUy2x3IvNL1rKXlQmNWt+KFGuiNrE+y2W4WT
 1bfbnmUJi0Knde4MD43iImwwaocXXdtVkED9f68aknZLCihqGEoi1EmU1Sr4+Wbj
 D8lXZe4BZfGVCHoA2sDtgJsATAG5rdBu/Y6lJcEfUSblvwF2Ufh0r9ehieDrnGmq
 asCTuXmIX/AzUQDa7JjgAzo2sgdhI+nOIPWJeKDVHRdpFjq+7xV573Iqa77Brik9
 DNxAMATh5bZc+9paDib8Za55yE7NJO1cM/UJkwwqn3rvbV5hYki0XZvlKZQsJGig
 ur5otF9Sdz+AcmE=
 =yUEM
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Dave Chinner:
 "This is a big update with lots of new code. The summary below them
  all, so I'll just touch on teh higlights. The two main new features
  are Large Extent Counts and Logged Attribute Replay - these are two
  new foundational features that we are building more complex future
  features on top of.

  For upcoming functionality, we need to be able to store hundreds of
  millions of xattrs per inode. The Large Extent Count feature removes
  the limits that prevent this scale of xattr storage, and while we were
  modifying the on disk extent count format we also increased the number
  of data extents we support per inode from 2^32 to 2^47.

  We also need to be able to modify xattrs as part of larger atomic
  transactions rather than as standalone transactions. The Logged
  Attribute Replay feature introduces the infrastructure that allows us
  to use intents to record the attribute modifications in the journal
  before we start them, hence allowing other atomic transactions to log
  attribute modification intents and then defer the actual modification
  to later. If we then crash, log recovery then guarantees that the
  attribute is replayed in the context of the atomic transaction that
  logged the intent.

  A significant chunk of the commits in this merge are for the base
  attribute replay functionality along with fixes, improvements and
  cleanups related to this new functioanlity. Allison deserves a big
  round of thanks for her ongoing work to get this functionality into
  XFS.

  There are also many other smaller changes and improvements, so overall
  this is one of the bigger XFS merge requests in some time.

  I will be following up next week with another smaller pull request -
  we already have another round of fixes and improvements to the logged
  attribute replay functionality just about ready to go. They'll soak
  and test over the next week, and I'll send a pull request for them
  near the end of the merge window.

  Summary:

   - support for printk message indexing.

   - large extent counts to provide support for up to 2^47 data extents
     and 2^32 attribute extents, allowing us to scale beyond 4 billion
     data extents to billions of xattrs per inode.

   - conversion of various flags fields to be consistently declared as
     unsigned bit fields.

   - improvements to realtime extent accounting and converts them to
     per-cpu counters to match all the other block and inode accounting.

   - reworks core log formatting code to reduce iterations, have a
     shorter, cleaner fast path and generally be easier to understand
     and maintain.

   - improvements to rmap btree searches that reduce overhead by up to
     30% resulting in xfs_scrub runtime reductions of 15%.

   - improvements to reflink that remove the size limitations in
     remapping operations and greatly reduce the size of transaction
     reservations.

   - reworks the minimum log size calculations to allow us to change
     transaction reservations without changing the minimum supported log
     size.

   - removal of quota warning support as it has never been used on
     Linux.

   - intent whiteouts to allow us to cancel intents that are completed
     entirely in memory rather than having use CPU and disk bandwidth
     formatting and writing them into the journal when it is not
     necessary. This makes rmap, reflink and extent freeing slightly
     more efficient, but provides massive improvements for....

   - Logged Attribute Replay feature support. This is a fundamental
     change to the way we modify attributes, laying the foundation for
     future integration of attribute modifications as part of other
     atomic transactional operations the filesystem performs.

   - Lots of cleanups and fixes for the logged attribute replay
     functionality"

* tag 'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (124 commits)
  xfs: can't use kmem_zalloc() for attribute buffers
  xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify
  xfs: ATTR_REPLACE algorithm with LARP enabled needs rework
  xfs: use XFS_DA_OP flags in deferred attr ops
  xfs: remove xfs_attri_remove_iter
  xfs: switch attr remove to xfs_attri_set_iter
  xfs: introduce attr remove initial states into xfs_attr_set_iter
  xfs: xfs_attr_set_iter() does not need to return EAGAIN
  xfs: clean up final attr removal in xfs_attr_set_iter
  xfs: remote xattr removal in xfs_attr_set_iter() is conditional
  xfs: XFS_DAS_LEAF_REPLACE state only needed if !LARP
  xfs: split remote attr setting out from replace path
  xfs: consolidate leaf/node states in xfs_attr_set_iter
  xfs: kill XFS_DAC_LEAF_ADDNAME_INIT
  xfs: separate out initial attr_set states
  xfs: don't set quota warning values
  xfs: remove warning counters from struct xfs_dquot_res
  xfs: remove quota warning limit from struct xfs_quota_limits
  xfs: rework deferred attribute operation setup
  xfs: make xattri_leaf_bp more useful
  ...
2022-05-25 19:34:40 -07:00
Linus Torvalds
e375780b63 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmKOIC0ACgkQnJ2qBz9k
 QNmmqwf+PlZrxoXoDxxw5LdXnIXj6qwN5p/5mKDmKt7CPU8Vt5Reb8GA3b2OcUj2
 XaqQLOpEVrGW9nVKgKzUIujJtK9Sa4IlHSuwYGN3ZTYnsh0rT7VhIyfVNn2Zngo9
 juDHaGrE+g2c8hz3eUGrnkIeiHy/Ny0QEHLjxaXzYYpx3XInzGSmMS3/4/I8tFyr
 G/g1KasTTeBMR3aVh0pt4TvT/p7E/BJL3fFVrsQyeFBFrxisUennUtmK9ngcU7CH
 Y7hEl8CYMNXfm06ZH6Dt1oX9BzFjU9x18kOYAVhpuhzIA3VViL1iWPbyK/8xl1eZ
 PIRsOdDyVWtlcZdkmmlHc9Bnrj4AFA==
 =e7PC
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "The biggest part of this is support for fsnotify inode marks that
  don't pin inodes in memory but rather get evicted together with the
  inode (they are useful if userspace needs to exclude receipt of events
  from potentially large subtrees using fanotify ignore marks).

  There is also a fix for more consistent handling of events sent to
  parent and a fix of sparse(1) complaints"

* tag 'fsnotify_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: fix incorrect fmode_t casts
  fsnotify: consistent behavior for parent not watching children
  fsnotify: introduce mark type iterator
  fanotify: enable "evictable" inode marks
  fanotify: use fsnotify group lock helpers
  fanotify: implement "evictable" inode marks
  fanotify: factor out helper fanotify_mark_update_flags()
  fanotify: create helper fanotify_mark_user_flags()
  fsnotify: allow adding an inode mark without pinning inode
  dnotify: use fsnotify group lock helpers
  nfsd: use fsnotify group lock helpers
  audit: use fsnotify group lock helpers
  inotify: use fsnotify group lock helpers
  fsnotify: create helpers for group mark_mutex lock
  fsnotify: make allow_dups a property of the group
  fsnotify: pass flags argument to fsnotify_alloc_group()
  fsnotify: fix wrong lockdep annotations
  inotify: move control flags from mask to mark flags
  inotify: show inotify mask flags in proc fdinfo
2022-05-25 19:29:54 -07:00
Linus Torvalds
8b728edc5b \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmKOH70ACgkQnJ2qBz9k
 QNkRBwgAxhAQmdNGzsMmE9wlF/Dj9VyGioug9thdUBMfEL6xXr1/Gf0e+fMLFFSr
 h2niti2jcuAuYSMqJNuss1Qdo8rWuGbKvjEyqzDfgJ1VDy6oGD14i1qPbKUv1/LN
 AkH4kClQpsltPYYOq9gkagOwRtHaMPCRc1HreqMGk/rDqCe26vZ4xkp2zr328Dt7
 JVdiHNnLj5d03xt3LWgchtqptdaRHJm8ymvICNcIxV3Ievxs+p21BhA/2m/v22/i
 sG/Kg9g7+bQ5sTI+DQHCjodylm9PciFviTRPc3xOkjXDLQ/Md63RlNoHja6uCrPD
 OxvH/L9rZvJ/yfSrg7ztDxcB1vGT/w==
 =UWYT
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull writeback and ext2 cleanups from Jan Kara:
 "One small ext2 cleanup and one writeback spelling fix"

* tag 'fs_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: fix typo in comment
  fs: ext2: Fix duplicate included linux/dax.h
2022-05-25 19:24:06 -07:00
Linus Torvalds
62e5873ec9 Misc hardening changes for 5.19-rc1
Hi Linus,
 
 Please, pull the following hardening changes that I've been collecting
 in my tree during the last development cycle. All of them have been
 baking in linux-next.
 
 Replace open-coded instances with size_t saturating arithmetic helpers:
 
 - virt: acrn: Prefer array_size and struct_size over open coded arithmetic (Len Baker)
 - afs: Prefer struct_size over open coded arithmetic (Len Baker)
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmKNUvYACgkQRwW0y0cG
 2zGm1g//YZYWBEzqOncd1K4/TBzbMytxE7oQYqUiS4e3V5D1eiT8924BLr7iuWMd
 c64Lv0trjyHCjdqBxGLMz5cv5EZd206xS3bqIQ8gqcKHOK/z7ob/c7YJxdUcXaWK
 oyGY65Arg9XnuRIJvNeiwdJNpPGsSyI8BSRC6gf6NFYke/GNIgv3rxmVBSbH8V7W
 QVBWf3AWR8+ZlXJFHWEfIAMRwFtumAhak3ZpN/4SDWM6cw3PIm1TFEO5Jx/rWaNn
 LZth+91uJU6UeSrhrME/QQ9Q1kgmLUMqzW3U/HpMky7Ugi0ffWIHc83XYDOgrFRJ
 Br5KLL4c5SHMJfPXqHMsbRNG/bI6fJfd2A0vUDIQMdy4kX88MAzTsUzSiGe6gAAc
 a3RQV4qUkBvC53glZm4pb0IGY4j3vO1tf6Hd9BdxH4knpfQzVNTAwHXUxDq8kDBu
 Bir3B60qAktszxdp/QukmfoTRpVPSV732TMgIllVPFtFD/KYCiYFhSILPouh5JWJ
 3TP8ND1hKrV3P9FVirlJxdFLcADu8dyrmwxKdxZVa6ek9xtr0gt111FYcTeZG+eS
 PaJ7N8G+OkvYktv14/+0DBgIQdeHQ/kp6D4eTjQZ6y0k0b7uAI3HRv7fxWh8eyOw
 RmaE8jQCHokUnerouBUIGEwAmyFhN3atogqIHczB4ebBNbEeOGg=
 =22O8
 -----END PGP SIGNATURE-----

Merge tag 'size_t-saturating-helpers-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull misc hardening updates from Gustavo Silva:
 "Replace a few open-coded instances with size_t saturating arithmetic
  helpers"

* tag 'size_t-saturating-helpers-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  virt: acrn: Prefer array_size and struct_size over open coded arithmetic
  afs: Prefer struct_size over open coded arithmetic
2022-05-25 13:56:57 -07:00
Junxiao Bi via Ocfs2-devel
863e0d81b6 ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
When user_dlm_destroy_lock failed, it didn't clean up the flags it set
before exit.  For USER_LOCK_IN_TEARDOWN, if this function fails because of
lock is still in used, next time when unlink invokes this function, it
will return succeed, and then unlink will remove inode and dentry if lock
is not in used(file closed), but the dlm lock is still linked in dlm lock
resource, then when bast come in, it will trigger a panic due to
user-after-free.  See the following panic call trace.  To fix this,
USER_LOCK_IN_TEARDOWN should be reverted if fail.  And also error should
be returned if USER_LOCK_IN_TEARDOWN is set to let user know that unlink
fail.

For the case of ocfs2_dlm_unlock failure, besides USER_LOCK_IN_TEARDOWN,
USER_LOCK_BUSY is also required to be cleared.  Even though spin lock is
released in between, but USER_LOCK_IN_TEARDOWN is still set, for
USER_LOCK_BUSY, if before every place that waits on this flag,
USER_LOCK_IN_TEARDOWN is checked to bail out, that will make sure no flow
waits on the busy flag set by user_dlm_destroy_lock(), then we can
simplely revert USER_LOCK_BUSY when ocfs2_dlm_unlock fails.  Fix
user_dlm_cluster_lock() which is the only function not following this.

[  941.336392] (python,26174,16):dlmfs_unlink:562 ERROR: unlink
004fb0000060000b5a90b8c847b72e1, error -16 from destroy
[  989.757536] ------------[ cut here ]------------
[  989.757709] kernel BUG at fs/ocfs2/dlmfs/userdlm.c:173!
[  989.757876] invalid opcode: 0000 [#1] SMP
[  989.758027] Modules linked in: ksplice_2zhuk2jr_ib_ipoib_new(O)
ksplice_2zhuk2jr(O) mptctl mptbase xen_netback xen_blkback xen_gntalloc
xen_gntdev xen_evtchn cdc_ether usbnet mii ocfs2 jbd2 rpcsec_gss_krb5
auth_rpcgss nfsv4 nfsv3 nfs_acl nfs fscache lockd grace ocfs2_dlmfs
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bnx2fc
fcoe libfcoe libfc scsi_transport_fc sunrpc ipmi_devintf bridge stp llc
rds_rdma rds bonding ib_sdp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm falcon_lsm_serviceable(PE) falcon_nf_netcontain(PE)
mlx4_vnic falcon_kal(E) falcon_lsm_pinned_13402(E) mlx4_ib ib_sa ib_mad
ib_core ib_addr xenfs xen_privcmd dm_multipath iTCO_wdt iTCO_vendor_support
pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core ipmi_ssif i2c_core ipmi_si
ipmi_msghandler
[  989.760686]  ioatdma sg ext3 jbd mbcache sd_mod ahci libahci ixgbe dca ptp
pps_core vxlan udp_tunnel ip6_udp_tunnel megaraid_sas mlx4_core crc32c_intel
be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 mdio
libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi wmi
dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
ksplice_2zhuk2jr_ib_ipoib_old]
[  989.761987] CPU: 10 PID: 19102 Comm: dlm_thread Tainted: P           OE
4.1.12-124.57.1.el6uek.x86_64 #2
[  989.762290] Hardware name: Oracle Corporation ORACLE SERVER
X5-2/ASM,MOTHERBOARD,1U, BIOS 30350100 06/17/2021
[  989.762599] task: ffff880178af6200 ti: ffff88017f7c8000 task.ti:
ffff88017f7c8000
[  989.762848] RIP: e030:[<ffffffffc07d4316>]  [<ffffffffc07d4316>]
__user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs]
[  989.763185] RSP: e02b:ffff88017f7cbcb8  EFLAGS: 00010246
[  989.763353] RAX: 0000000000000000 RBX: ffff880174d48008 RCX:
0000000000000003
[  989.763565] RDX: 0000000000120012 RSI: 0000000000000003 RDI:
ffff880174d48170
[  989.763778] RBP: ffff88017f7cbcc8 R08: ffff88021f4293b0 R09:
0000000000000000
[  989.763991] R10: ffff880179c8c000 R11: 0000000000000003 R12:
ffff880174d48008
[  989.764204] R13: 0000000000000003 R14: ffff880179c8c000 R15:
ffff88021db7a000
[  989.764422] FS:  0000000000000000(0000) GS:ffff880247480000(0000)
knlGS:ffff880247480000
[  989.764685] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  989.764865] CR2: ffff8000007f6800 CR3: 0000000001ae0000 CR4:
0000000000042660
[  989.765081] Stack:
[  989.765167]  0000000000000003 ffff880174d48040 ffff88017f7cbd18
ffffffffc07d455f
[  989.765442]  ffff88017f7cbd88 ffffffff816fb639 ffff88017f7cbd38
ffff8800361b5600
[  989.765717]  ffff88021db7a000 ffff88021f429380 0000000000000003
ffffffffc0453020
[  989.765991] Call Trace:
[  989.766093]  [<ffffffffc07d455f>] user_bast+0x5f/0xf0 [ocfs2_dlmfs]
[  989.766287]  [<ffffffff816fb639>] ? schedule_timeout+0x169/0x2d0
[  989.766475]  [<ffffffffc0453020>] ? o2dlm_lock_ast_wrapper+0x20/0x20
[ocfs2_stack_o2cb]
[  989.766738]  [<ffffffffc045303a>] o2dlm_blocking_ast_wrapper+0x1a/0x20
[ocfs2_stack_o2cb]
[  989.767010]  [<ffffffffc0864ec6>] dlm_do_local_bast+0x46/0xe0 [ocfs2_dlm]
[  989.767217]  [<ffffffffc084f5cc>] ? dlm_lockres_calc_usage+0x4c/0x60
[ocfs2_dlm]
[  989.767466]  [<ffffffffc08501f1>] dlm_thread+0xa31/0x1140 [ocfs2_dlm]
[  989.767662]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.767834]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.768006]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.768178]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.768349]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.768521]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.768693]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.768893]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.769067]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.769241]  [<ffffffff810ce4d0>] ? wait_woken+0x90/0x90
[  989.769411]  [<ffffffffc084f7c0>] ? dlm_kick_thread+0x80/0x80 [ocfs2_dlm]
[  989.769617]  [<ffffffff810a8bbb>] kthread+0xcb/0xf0
[  989.769774]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.769945]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.770117]  [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180
[  989.770321]  [<ffffffff816fdaa1>] ret_from_fork+0x61/0x90
[  989.770492]  [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180
[  989.770689] Code: d0 00 00 00 f0 45 7d c0 bf 00 20 00 00 48 89 83 c0 00 00
00 48 89 83 c8 00 00 00 e8 55 c1 8c c0 83 4b 04 10 48 83 c4 08 5b 5d c3 <0f>
0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41 54 53 48 83
[  989.771892] RIP  [<ffffffffc07d4316>]
__user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs]
[  989.772174]  RSP <ffff88017f7cbcb8>
[  989.772704] ---[ end trace ebd1e38cebcc93a8 ]---
[  989.772907] Kernel panic - not syncing: Fatal exception
[  989.773173] Kernel Offset: disabled

Link: https://lkml.kernel.org/r/20220518235224.87100-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-25 13:05:42 -07:00
Junxiao Bi
0b6d14e3db ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock
The following function is the only place that checks USER_LOCK_ATTACHED. 
This flag is set when lock request is granted through user_ast() and only
the following function will clear it.

Checking of this flag here is to make sure ocfs2_dlm_unlock is not issued
if this lock is never granted.  For example, lock file is created and then
get removed, open file never happens.

Clearing the flag here is not necessary because this is the only function
that checks it, if another flow is executing user_dlm_destroy_lock(), it
will bail out at the beginning because of USER_LOCK_IN_TEARDOWN and never
check USER_LOCK_ATTACHED.  Drop the clear, so we don't need take care of
it for the following error handling patch.

int user_dlm_destroy_lock(struct user_lock_res *lockres)
{
    ...

    status = 0;
    if (!(lockres->l_flags & USER_LOCK_ATTACHED)) {
        spin_unlock(&lockres->l_lock);
        goto bail;
    }

    lockres->l_flags &= ~USER_LOCK_ATTACHED;
    lockres->l_flags |= USER_LOCK_BUSY;
    spin_unlock(&lockres->l_lock);

	status = ocfs2_dlm_unlock(conn, &lockres->l_lksb, DLM_LKF_VALBLK);
    if (status) {
        user_log_dlm_error("ocfs2_dlm_unlock", status, lockres);
        goto bail;
    }
	...
}

V1 discussion with Joseph:
https://lore.kernel.org/all/7b620c53-0c45-da2c-829e-26195cbe7d4e@linux.alibaba.com/T/

Link: https://lkml.kernel.org/r/20220518235224.87100-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-25 13:05:42 -07:00
Linus Torvalds
7e062cda7d Networking changes for 5.19.
Core
 ----
 
  - Support TCPv6 segmentation offload with super-segments larger than
    64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
 
  - Generalize skb freeing deferral to per-cpu lists, instead of
    per-socket lists.
 
  - Add a netdev statistic for packets dropped due to L2 address
    mismatch (rx_otherhost_dropped).
 
  - Continue work annotating skb drop reasons.
 
  - Accept alternative netdev names (ALT_IFNAME) in more netlink
    requests.
 
  - Add VLAN support for AF_PACKET SOCK_RAW GSO.
 
  - Allow receiving skb mark from the socket as a cmsg.
 
  - Enable memcg accounting for veth queues, sysctl tables and IPv6.
 
 BPF
 ---
 
  - Add libbpf support for User Statically-Defined Tracing (USDTs).
 
  - Speed up symbol resolution for kprobes multi-link attachments.
 
  - Support storing typed pointers to referenced and unreferenced
    objects in BPF maps.
 
  - Add support for BPF link iterator.
 
  - Introduce access to remote CPU map elements in BPF per-cpu map.
 
  - Allow middle-of-the-road settings for the
    kernel.unprivileged_bpf_disabled sysctl.
 
  - Implement basic types of dynamic pointers e.g. to allow for
    dynamically sized ringbuf reservations without extra memory copies.
 
 Protocols
 ---------
 
  - Retire port only listening_hash table, add a second bind table
    hashed by port and address. Avoid linear list walk when binding
    to very popular ports (e.g. 443).
 
  - Add bridge FDB bulk flush filtering support allowing user space
    to remove all FDB entries matching a condition.
 
  - Introduce accept_unsolicited_na sysctl for IPv6 to implement
    router-side changes for RFC9131.
 
  - Support for MPTCP path manager in user space.
 
  - Add MPTCP support for fallback to regular TCP for connections
    that have never connected additional subflows or transmitted
    out-of-sequence data (partial support for RFC8684 fallback).
 
  - Avoid races in MPTCP-level window tracking, stabilize and improve
    throughput.
 
  - Support lockless operation of GRE tunnels with seq numbers enabled.
 
  - WiFi support for host based BSS color collision detection.
 
  - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
 
  - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
 
  - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
 
  - Allow matching on the number of VLAN tags via tc-flower.
 
  - Add tracepoint for tcp_set_ca_state().
 
 Driver API
 ----------
 
  - Improve error reporting from classifier and action offload.
 
  - Add support for listing line cards in switches (devlink).
 
  - Add helpers for reporting page pool statistics with ethtool -S.
 
  - Add support for reading clock cycles when using PTP virtual clocks,
    instead of having the driver convert to time before reporting.
    This makes it possible to report time from different vclocks.
 
  - Support configuring low-latency Tx descriptor push via ethtool.
 
  - Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
    - Sunplus SP7021 SoC (sp7021_emac)
    - Add support for Renesas RZ/V2M (in ravb)
    - Add support for MediaTek mt7986 switches (in mtk_eth_soc)
 
  - Ethernet PHYs:
    - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
    - TI DP83TD510 PHY
    - Microchip LAN8742/LAN88xx PHYs
 
  - WiFi:
    - Driver for pureLiFi X, XL, XC devices (plfxlc)
    - Driver for Silicon Labs devices (wfx)
    - Support for WCN6750 (in ath11k)
    - Support Realtek 8852ce devices (in rtw89)
 
  - Mobile:
    - MediaTek T700 modems (Intel 5G 5000 M.2 cards)
 
  - CAN:
   - ctucanfd: add support for CTU CAN FD open-source IP core
     from Czech Technical University in Prague
 
 Drivers
 -------
 
  - Delete a number of old drivers still using virt_to_bus().
 
  - Ethernet NICs:
    - intel: support TSO on tunnels MPLS
    - broadcom: support multi-buffer XDP
    - nfp: support VF rate limiting
    - sfc: use hardware tx timestamps for more than PTP
    - mlx5: multi-port eswitch support
    - hyper-v: add support for XDP_REDIRECT
    - atlantic: XDP support (including multi-buffer)
    - macb: improve real-time perf by deferring Tx processing to NAPI
 
  - High-speed Ethernet switches:
    - mlxsw: implement basic line card information querying
    - prestera: add support for traffic policing on ingress and egress
 
  - Embedded Ethernet switches:
    - lan966x: add support for packet DMA (FDMA)
    - lan966x: add support for PTP programmable pins
    - ti: cpsw_new: enable bc/mc storm prevention
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - Wake-on-WLAN support for QCA6390 and WCN6855
    - device recovery (firmware restart) support
    - support setting Specific Absorption Rate (SAR) for WCN6855
    - read country code from SMBIOS for WCN6855/QCA6390
    - enable keep-alive during WoWLAN suspend
    - implement remain-on-channel support
 
  - MediaTek WiFi (mt76):
    - support Wireless Ethernet Dispatch offloading packet movement
      between the Ethernet switch and WiFi interfaces
    - non-standard VHT MCS10-11 support
    - mt7921 AP mode support
    - mt7921 IPv6 NS offload support
 
  - Ethernet PHYs:
    - micrel: ksz9031/ksz9131: cabletest support
    - lan87xx: SQI support for T1 PHYs
    - lan937x: add interrupt support for link detection
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKNMPQACgkQMUZtbf5S
 IrsRARAAuDyYs6jFYB3p+xazZdOnbF4iAgVv71+DQGvmsCl6CB9OrsNZMlvE85OL
 Q3gjcRbgjrkN4lhgI8DmiGYbsUJnAvVjFdNjccz1Z/vTLYvuIM0ol54MUp5S+9WY
 StncOJkOGJxxR/Gi5gzVmejPDsysU3Jik+hm/fpIcz8pybXxAsFKU5waY5qfl+/T
 TZepfV0VCfqRDjqcF1qA5+jJZNU8pdodQlZ1+mh8bwu6Jk1ZkWkj6Ov8MWdwQldr
 LnPeK/9hIGzkdJYHZfajxA3t8D0K5CHzSuih2bJ9ry8ZXgVBkXEThew778/R5izW
 uB0YZs9COFlrIP7XHjtRTy/2xHOdYIPlj2nWhVdfuQDX8Crvt4VRN6EZ1rjko1ZJ
 WanfG6WHF8NH5pXBRQbh3kIMKBnYn6OIzuCfCQSqd+niHcxFIM4vRiggeXI5C5TW
 vJgEWfK6X+NfDiFVa3xyCrEmp5ieA/pNecpwd8rVkql+MtFAAw4vfsotLKOJEAru
 J/XL6UE+YuLqIJV9ACZ9x1AFXXAo661jOxBunOo4VXhXVzWS9lYYz5r5ryIkgT/8
 /Fr0zjANJWgfIuNdIBtYfQ4qG+LozGq038VA06RhFUAZ5tF9DzhqJs2Q2AFuWWBC
 ewCePJVqo1j2Ceq2mGonXRt47OEnlePoOxTk9W+cKZb7ZWE+zEo=
 =Wjii
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core
  ----

   - Support TCPv6 segmentation offload with super-segments larger than
     64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).

   - Generalize skb freeing deferral to per-cpu lists, instead of
     per-socket lists.

   - Add a netdev statistic for packets dropped due to L2 address
     mismatch (rx_otherhost_dropped).

   - Continue work annotating skb drop reasons.

   - Accept alternative netdev names (ALT_IFNAME) in more netlink
     requests.

   - Add VLAN support for AF_PACKET SOCK_RAW GSO.

   - Allow receiving skb mark from the socket as a cmsg.

   - Enable memcg accounting for veth queues, sysctl tables and IPv6.

  BPF
  ---

   - Add libbpf support for User Statically-Defined Tracing (USDTs).

   - Speed up symbol resolution for kprobes multi-link attachments.

   - Support storing typed pointers to referenced and unreferenced
     objects in BPF maps.

   - Add support for BPF link iterator.

   - Introduce access to remote CPU map elements in BPF per-cpu map.

   - Allow middle-of-the-road settings for the
     kernel.unprivileged_bpf_disabled sysctl.

   - Implement basic types of dynamic pointers e.g. to allow for
     dynamically sized ringbuf reservations without extra memory copies.

  Protocols
  ---------

   - Retire port only listening_hash table, add a second bind table
     hashed by port and address. Avoid linear list walk when binding to
     very popular ports (e.g. 443).

   - Add bridge FDB bulk flush filtering support allowing user space to
     remove all FDB entries matching a condition.

   - Introduce accept_unsolicited_na sysctl for IPv6 to implement
     router-side changes for RFC9131.

   - Support for MPTCP path manager in user space.

   - Add MPTCP support for fallback to regular TCP for connections that
     have never connected additional subflows or transmitted
     out-of-sequence data (partial support for RFC8684 fallback).

   - Avoid races in MPTCP-level window tracking, stabilize and improve
     throughput.

   - Support lockless operation of GRE tunnels with seq numbers enabled.

   - WiFi support for host based BSS color collision detection.

   - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.

   - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).

   - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).

   - Allow matching on the number of VLAN tags via tc-flower.

   - Add tracepoint for tcp_set_ca_state().

  Driver API
  ----------

   - Improve error reporting from classifier and action offload.

   - Add support for listing line cards in switches (devlink).

   - Add helpers for reporting page pool statistics with ethtool -S.

   - Add support for reading clock cycles when using PTP virtual clocks,
     instead of having the driver convert to time before reporting. This
     makes it possible to report time from different vclocks.

   - Support configuring low-latency Tx descriptor push via ethtool.

   - Separate Clause 22 and Clause 45 MDIO accesses more explicitly.

  New hardware / drivers
  ----------------------

   - Ethernet:
      - Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
      - Sunplus SP7021 SoC (sp7021_emac)
      - Add support for Renesas RZ/V2M (in ravb)
      - Add support for MediaTek mt7986 switches (in mtk_eth_soc)

   - Ethernet PHYs:
      - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
      - TI DP83TD510 PHY
      - Microchip LAN8742/LAN88xx PHYs

   - WiFi:
      - Driver for pureLiFi X, XL, XC devices (plfxlc)
      - Driver for Silicon Labs devices (wfx)
      - Support for WCN6750 (in ath11k)
      - Support Realtek 8852ce devices (in rtw89)

   - Mobile:
      - MediaTek T700 modems (Intel 5G 5000 M.2 cards)

   - CAN:
      - ctucanfd: add support for CTU CAN FD open-source IP core from
        Czech Technical University in Prague

  Drivers
  -------

   - Delete a number of old drivers still using virt_to_bus().

   - Ethernet NICs:
      - intel: support TSO on tunnels MPLS
      - broadcom: support multi-buffer XDP
      - nfp: support VF rate limiting
      - sfc: use hardware tx timestamps for more than PTP
      - mlx5: multi-port eswitch support
      - hyper-v: add support for XDP_REDIRECT
      - atlantic: XDP support (including multi-buffer)
      - macb: improve real-time perf by deferring Tx processing to NAPI

   - High-speed Ethernet switches:
      - mlxsw: implement basic line card information querying
      - prestera: add support for traffic policing on ingress and egress

   - Embedded Ethernet switches:
      - lan966x: add support for packet DMA (FDMA)
      - lan966x: add support for PTP programmable pins
      - ti: cpsw_new: enable bc/mc storm prevention

   - Qualcomm 802.11ax WiFi (ath11k):
      - Wake-on-WLAN support for QCA6390 and WCN6855
      - device recovery (firmware restart) support
      - support setting Specific Absorption Rate (SAR) for WCN6855
      - read country code from SMBIOS for WCN6855/QCA6390
      - enable keep-alive during WoWLAN suspend
      - implement remain-on-channel support

   - MediaTek WiFi (mt76):
      - support Wireless Ethernet Dispatch offloading packet movement
        between the Ethernet switch and WiFi interfaces
      - non-standard VHT MCS10-11 support
      - mt7921 AP mode support
      - mt7921 IPv6 NS offload support

   - Ethernet PHYs:
      - micrel: ksz9031/ksz9131: cabletest support
      - lan87xx: SQI support for T1 PHYs
      - lan937x: add interrupt support for link detection"

* tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits)
  ptp: ocp: Add firmware header checks
  ptp: ocp: fix PPS source selector debugfs reporting
  ptp: ocp: add .init function for sma_op vector
  ptp: ocp: vectorize the sma accessor functions
  ptp: ocp: constify selectors
  ptp: ocp: parameterize input/output sma selectors
  ptp: ocp: revise firmware display
  ptp: ocp: add Celestica timecard PCI ids
  ptp: ocp: Remove #ifdefs around PCI IDs
  ptp: ocp: 32-bit fixups for pci start address
  Revert "net/smc: fix listen processing for SMC-Rv2"
  ath6kl: Use cc-disable-warning to disable -Wdangling-pointer
  selftests/bpf: Dynptr tests
  bpf: Add dynptr data slices
  bpf: Add bpf_dynptr_read and bpf_dynptr_write
  bpf: Dynptr support for ring buffers
  bpf: Add bpf_dynptr_from_mem for local dynptrs
  bpf: Add verifier support for dynptrs
  bpf: Suppress 'passing zero to PTR_ERR' warning
  bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
  ...
2022-05-25 12:22:58 -07:00
Luís Henriques
ea16567f11 ceph: fix decoding of client session messages flags
The cephfs kernel client started to show  the message:

 ceph: mds0 session blocklisted

when mounting a filesystem.  This is due to the fact that the session
messages are being incorrectly decoded: the skip needs to take into
account the 'len'.

While there, fixed some whitespaces too.

Cc: stable@vger.kernel.org
Fixes: e1c9788cb3 ("ceph: don't rely on error_string to validate blocklisted session.")
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
5e56776d52 ceph: switch TASK_INTERRUPTIBLE to TASK_KILLABLE
If the task is placed in the TASK_INTERRUPTIBLE state it will sleep
until either something explicitly wakes it up, or a non-masked signal
is received. Switch to TASK_KILLABLE to avoid the noises.

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Colin Ian King
2ecd0edd13 ceph: remove redundant variable ino
Variable ino is being assigned a value that is never read. The variable
and assignment are redundant, remove it.

Cleans up clang scan build warning:
warning: Although the value stored to 'ino' is used in the enclosing
expression, the value is never actually read from 'ino'
[deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
a74379543d ceph: try to queue a writeback if revoking fails
If the pagecaches writeback just finished and the i_wrbuffer_ref
reaches zero it will try to trigger ceph_check_caps(). But if just
before ceph_check_caps() the i_wrbuffer_ref could be increased
again by mmap/cache write, then the Fwb revoke will fail.

We need to try to queue a writeback in this case instead of
triggering the writeback by BDI's delayed work per 5 seconds.

URL: https://tracker.ceph.com/issues/46904
URL: https://tracker.ceph.com/issues/55377
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Luís Henriques
55ab552080 ceph: fix statfs for subdir mounts
When doing a mount using as base a directory that has 'max_bytes' quotas
statfs uses that value as the total; if a subdirectory is used instead,
the same 'max_bytes' too in statfs, unless there is another quota set.

Unfortunately, if this subdirectory only has the 'max_files' quota set,
then statfs uses the filesystem total.  Fix this by making sure we only
lookup realms that contain the 'max_bytes' quota.

Cc: Ryan Taylor <rptaylor@uvic.ca>
URL: https://tracker.ceph.com/issues/55090
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
825978fd6a ceph: fix possible deadlock when holding Fwb to get inline_data
1, mount with wsync.
2, create a file with O_RDWR, and the request was sent to mds.0:

   ceph_atomic_open()-->
     ceph_mdsc_do_request(openc)
     finish_open(file, dentry, ceph_open)-->
       ceph_open()-->
         ceph_init_file()-->
           ceph_init_file_info()-->
             ceph_uninline_data()-->
             {
               ...
               if (inline_version == 1 || /* initial version, no data */
                   inline_version == CEPH_INLINE_NONE)
                     goto out_unlock;
               ...
             }

The inline_version will be 1, which is the initial version for the
new create file. And here the ci->i_inline_version will keep with 1,
it's buggy.

3, buffer write to the file immediately:

   ceph_write_iter()-->
     ceph_get_caps(file, need=Fw, want=Fb, ...);
     generic_perform_write()-->
       a_ops->write_begin()-->
         ceph_write_begin()-->
           netfs_write_begin()-->
             netfs_begin_read()-->
               netfs_rreq_submit_slice()-->
                 netfs_read_from_server()-->
                   rreq->netfs_ops->issue_read()-->
                     ceph_netfs_issue_read()-->
                     {
                       ...
                       if (ci->i_inline_version != CEPH_INLINE_NONE &&
                           ceph_netfs_issue_op_inline(subreq))
                         return;
                       ...
                     }
     ceph_put_cap_refs(ci, Fwb);

The ceph_netfs_issue_op_inline() will send a getattr(Fsr) request to
mds.1.

4, then the mds.1 will request the rd lock for CInode::filelock from
the auth mds.0, the mds.0 will do the CInode::filelock state transation
from excl --> sync, but it need to revoke the Fxwb caps back from the
clients.

While the kernel client has aleady held the Fwb caps and waiting for
the getattr(Fsr).

It's deadlock!

URL: https://tracker.ceph.com/issues/55377
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
3459bd0c55 ceph: redirty the page for writepage on failure
When run out of memories we should redirty the page before failing
the writepage. Or we will hit BUG_ON(folio_get_private(folio)) in
ceph_dirty_folio().

URL: https://tracker.ceph.com/issues/55421
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
5eed80fba6 ceph: try to choose the auth MDS if possible for getattr
If any 'x' caps is issued we can just choose the auth MDS instead
of the random replica MDSes. Because only when the Locker is in
LOCK_EXEC state will the loner client could get the 'x' caps. And
if we send the getattr requests to any replica MDS it must auth pin
and tries to rdlock from the auth MDS, and then the auth MDS need
to do the Locker state transition to LOCK_SYNC. And after that the
lock state will change back.

This cost much when doing the Locker state transition and usually
will need to revoke caps from clients.

URL: https://tracker.ceph.com/issues/55240
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
f7a2d0688a ceph: disable updating the atime since cephfs won't maintain it
Since CephFS makes no attempt to maintain atime, we shouldn't
try to update it in mmap and generic read cases and ignore updating
it in direct and sync read cases.

And even we update it in mmap and generic read cases we will drop
it and won't sync it to MDS. And we are seeing the atime will be
updated and then dropped to the floor again and again.

URL: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VSJM7T4CS5TDRFF6XFPIYMHP75K73PZ6/
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
1b2ba3c561 ceph: flush the mdlog for filesystem sync
Before waiting for a request's safe reply, we will send the mdlog flush
request to the relevant MDS. And this will also flush the mdlog for all
the other unsafe requests in the same session, so we can record the last
session and no need to flush mdlog again in the next loop. But there
still have cases that it may send the mdlog flush requst twice or more,
but that should be not often.

Rename wait_unsafe_requests() to
flush_mdlog_and_wait_mdsc_unsafe_requests() to make it more
descriptive.

[xiubli: fold in MDS request refcount leak fix from Jeff]

URL: https://tracker.ceph.com/issues/55284
URL: https://tracker.ceph.com/issues/55411
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
ae06706330 ceph: rename unsafe_request_wait()
Rename it to flush_mdlog_and_wait_inode_unsafe_requests() to make
it more descriptive.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:14 +02:00
Xiubo Li
261998c300 ceph: fix statx AT_STATX_DONT_SYNC vs AT_STATX_FORCE_SYNC check
From the posix and the initial statx supporting commit comments,
the AT_STATX_DONT_SYNC is a lightweight stat and the
AT_STATX_FORCE_SYNC is a heaverweight one. And also checked all
the other current usage about these two flags they are all doing
the same, that is only when the AT_STATX_FORCE_SYNC is not set
and the AT_STATX_DONT_SYNC is set will they skip sync retriving
the attributes from storage.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Xiubo Li
68e5ec2ec9 ceph: no need to invalidate the fscache twice
Fixes: 400e1286c0 ("ceph: conversion to new fscache API")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Jakob Koschel
3ffa9d6f99 ceph: replace usage of found with dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.

To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean.

This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Jakob Koschel
57a5df0e86 ceph: use dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.

To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Xiubo Li
7ffe4fcea7 ceph: update the dlease for the hashed dentry when removing
The MDS will always refresh the dentry lease when removing the files
or directories. And if the dentry is still hashed, we can update
the dentry lease and no need to do the lookup from the MDS later.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Xiubo Li
546a5d6122 ceph: stop retrying the request when exceeding 256 times
The type of 'r_attempts' in kernel 'ceph_mds_request' is 'int',
while in 'ceph_mds_request_head' the type of 'num_retry' is '__u8'.
So in case the request retries exceeding 256 times, the MDS will
receive a incorrect retry seq.

In this case it's ususally a bug in MDS and continue retrying the
request makes no sense. For now let's limit it to 256. In future
this could be fixed in ceph code, so avoid using the hardcode here.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Xiubo Li
1980b1bf17 ceph: stop forwarding the request when exceeding 256 times
The type of 'num_fwd' in ceph 'MClientRequestForward' is 'int32_t',
while in 'ceph_mds_request_head' the type is '__u8'. So in case
the request bounces between MDSes exceeding 256 times, the client
will get stuck.

In this case it's ususally a bug in MDS and continue bouncing the
request makes no sense.

URL: https://tracker.ceph.com/issues/55130
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Xiubo Li
6c1dc50284 ceph: remove unused CEPH_MDS_LEASE_RELEASE related code
The ceph_mdsc_lease_release() has been removed by commit 8aa152c778
(ceph: remove ceph_mdsc_lease_release). ceph_mdsc_lease_send_msg will
never be called with CEPH_MDS_LEASE_RELEASE.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Venky Shankar
d7a2dc5230 ceph: allow ceph.dir.rctime xattr to be updatable
`rctime' has been a pain point in cephfs due to its buggy
nature - inconsistent values reported and those sorts.
Fixing rctime is non-trivial needing an overall redesign
of the entire nested statistics infrastructure.

As a workaround, PR

     http://github.com/ceph/ceph/pull/37938

allows this extended attribute to be manually set. This allows
users to "fixup" inconsistent rctime values. While this sounds
messy, its probably the wisest approach allowing users/scripts
to workaround buggy rctime values.

The above PR enables Ceph MDS to allow manually setting
rctime extended attribute with the corresponding user-land
changes. We may as well allow the same to be done via kclient
for parity.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25 20:45:13 +02:00
Yufen Yu
908ea65416 f2fs: add f2fs_init_write_merge_io function
Almost all other initialization of variables in f2fs_fill_super are
extraced to a single function. Also do it for write_io[], which can
make code more clean.

This patch just refactors the code, theres no functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-25 10:48:56 -07:00
Paulo Alcantara
de3a9e943d cifs: fix ntlmssp on old servers
Some older servers seem to require the workstation name during ntlmssp
to be at most 15 chars (RFC1001 name length), so truncate it before
sending when using insecure dialects.

Link: https://lore.kernel.org/r/e6837098-15d9-acb6-7e34-1923cf8c6fe1@winds.org
Reported-by: Byron Stanoszek <gandalf@winds.org>
Tested-by: Byron Stanoszek <gandalf@winds.org>
Fixes: 49bd49f983 ("cifs: send workstation name during ntlmssp session setup")
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-25 07:41:22 -05:00
Jens Axboe
54739cc6b4 io_uring: make prep and issue side of req handlers named consistently
Almost all of them are, the odd ones out are the poll remove and the
files update request. Name them like the others, which is:

io_#cmdname_prep	for request preparation
io_#cmdname		for request issue

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-25 05:37:06 -06:00
Jens Axboe
ecddc25d13 io_uring: make timeout prep handlers consistent with other prep handlers
All other opcodes take a {req, sqe} set for prep handling, split out
a timeout prep handler so that timeout and linked timeouts can use
the same one.

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-25 05:36:54 -06:00
Chao Yu
78901cfa44 f2fs: avoid unneeded error handling for revoke_entry_slab allocation
In __f2fs_commit_atomic_write(), we will guarantee success of
revoke_entry_slab allocation, so let's avoid unneeded error handling.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-24 23:06:51 -07:00
Sungjong Seo
66d34fcbbe f2fs: allow compression for mmap files in compress_mode=user
Since commit e3c548323d ("f2fs: let's allow compression for mmap files"),
it has been allowed to compress mmap files. However, in compress_mode=user,
it is not allowed yet. To keep the same concept in both compress_modes,
f2fs_ioc_(de)compress_file() should also allow it.

Let's remove checking mmap files in f2fs_ioc_(de)compress_file() so that
the compression for mmap files is also allowed in compress_mode=user.

Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-24 23:06:38 -07:00
Linus Torvalds
fdaf9a5840 Page cache changes for 5.19
- Appoint myself page cache maintainer
 
  - Fix how scsicam uses the page cache
 
  - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
 
  - Remove the AOP flags entirely
 
  - Remove pagecache_write_begin() and pagecache_write_end()
 
  - Documentation updates
 
  - Convert several address_space operations to use folios:
    - is_dirty_writeback
    - readpage becomes read_folio
    - releasepage becomes release_folio
    - freepage becomes free_folio
 
  - Change filler_t to require a struct file pointer be the first argument
    like ->read_folio
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
 gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
 xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
 nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
 WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
 =djLv
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache

Pull page cache updates from Matthew Wilcox:

 - Appoint myself page cache maintainer

 - Fix how scsicam uses the page cache

 - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS

 - Remove the AOP flags entirely

 - Remove pagecache_write_begin() and pagecache_write_end()

 - Documentation updates

 - Convert several address_space operations to use folios:
     - is_dirty_writeback
     - readpage becomes read_folio
     - releasepage becomes release_folio
     - freepage becomes free_folio

 - Change filler_t to require a struct file pointer be the first
   argument like ->read_folio

* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
  nilfs2: Fix some kernel-doc comments
  Appoint myself page cache maintainer
  fs: Remove aops->freepage
  secretmem: Convert to free_folio
  nfs: Convert to free_folio
  orangefs: Convert to free_folio
  fs: Add free_folio address space operation
  fs: Convert drop_buffers() to use a folio
  fs: Change try_to_free_buffers() to take a folio
  jbd2: Convert release_buffer_page() to use a folio
  jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
  reiserfs: Convert release_buffer_page() to use a folio
  fs: Remove last vestiges of releasepage
  ubifs: Convert to release_folio
  reiserfs: Convert to release_folio
  orangefs: Convert to release_folio
  ocfs2: Convert to release_folio
  nilfs2: Remove comment about releasepage
  nfs: Convert to release_folio
  jfs: Convert to release_folio
  ...
2022-05-24 19:55:07 -07:00
Linus Torvalds
8642174b52 New code for 5.19:
- Fix a couple of accounting errors in the buffered io code.
 - Discontinue the practice of marking folios !uptodate and invalidating
   them when writeback fails.  This fixes some UAF bugs when multipage
   folios are enabled, and brings the behavior of XFS/gfs/zonefs into
   alignment with the behavior of all the other Linux filesystems.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmKGceAACgkQ+H93GTRK
 tOvaHg//S8E5vHEVdDQUBmcpzJmAGEbwJ/h09Gtt82hcRdtCfw2ZLaehmm2X07Kg
 VorKsMzzEca3bQnR5H45t8PbHQ9amlhE7iXxMswRo92unTQf/K2UUojabmysPIJV
 lCg9kZ1pGcyZcvILbw0CWMnfmNktnfyFre8QOAkvX2WhJsJz+vktVXsDwzVbhxwn
 vJd+gSI8rj6+0jkjAreRwsUBHnuymHI51FbE9TFap/xCaRJqo/wlzMzHt/o1i0ep
 YLBXRdV3bPZxx1f+0G7hk3c1oW7LDDYaTR18A1CtsHyXsZwT+OAgO06DcX3aN+xR
 E//QpeqqhQvrp40xPKZA1YC/Uoiud8fM9RD9t8JGpMcnwDrwQSOnt7ylFLLkZVPj
 +dm1XSBK79Bxvz0WJKH9NdNLlijTyUxaUni9qu9dfeLXcsWV6CshszeZiaS/6G8a
 gilu9p04ha63gZhKj1hgfApMEla5htfz8/15g+9KXtHA92snfDEhecu6t+r8/PfY
 6I0NivB8I7FkYpXdWaUfT/jRLv7/Ov4GaPspSXWtH4k6f779vOd3IsPkeo6nNQhE
 oA0wqP6rNb/zl+e0QX29MKNNPE8LtzpU3WTDQgmMVG62+jHEbS8mCmzHJL4b/yim
 4QxciGP8F1CW7/Gay7JG/8l2iPlnNDy+YCIRCH3i3c0XLa76pPs=
 =lnqq
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "There's a couple of corrections sent in by Andreas for some accounting
  errors.

  The biggest change this time around is that writeback errors longer
  clear pageuptodate nor does XFS invalidate the page cache anymore.
  This brings XFS (and gfs2/zonefs) behavior in line with every other
  Linux filesystem driver, and fixes some UAF bugs that only cropped up
  after willy turned on multipage folios for XFS in 5.18-rc1.

  Regrettably, it took all the way to the end of the 5.18 cycle to find
  the source of these bugs and reach a consensus that XFS' writeback
  failure behavior from 20 years ago is no longer necessary.

  Summary:

   - Fix a couple of accounting errors in the buffered io code.

   - Discontinue the practice of marking folios !uptodate and
     invalidating them when writeback fails.

     This fixes some UAF bugs when multipage folios are enabled, and
     brings the behavior of XFS/gfs/zonefs into alignment with the
     behavior of all the other Linux filesystems"

* tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: don't invalidate folios after writeback errors
  iomap: iomap_write_end cleanup
  iomap: iomap_write_failed fix
2022-05-24 19:21:30 -07:00
Linus Torvalds
f289811258 dlm for 5.19
This set includes several large patches to improve endian handling and
 remove sparse warnings.  The code previously used in/out, in-place
 endianness conversion functions.  Other code cleanup includes the list
 iterator changes.  Finally, a long standing bug was found and fixed,
 caused by missed decrement on an lock struct ref count.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJijT7zAAoJEDgbc8f8gGmqQxUQAIJiDJEKmdiXWn4s8arqHHp3
 d5Teh8SCIwF/pVvpFKzibbAfqD5OzulbQneAJSy+tz8r6uW8QaOSADcQa3Lit1yw
 7wMQGn38VMsGw20y087sOdqOxTdVir/OwzjMCNoUtGOddhQ4XmNY06++O8tAiMFt
 EroGRfvVC2Y7UI78ntvK49Z9errAxhEKpJMAG5yiA4lt12vSaORoJOTMWfIcVdWa
 htjkPSf0W7W5kbvFGW+/wSSt9QGuGTDmjPY+PVqOmIHgVKQkcfrCCbkCjeXOxNav
 1XhoENIxL4FPbkRLnLxrJTpIbbz42H8F8AX82ra+AH8gEy5xBOd2ltA4ozTjBBNE
 /EmsSrfSPcHQOlFGXHv9gF2fRcOPAg6q+edNruIdk2hNb4LoTBtZJcwoqVbaNSdq
 b9V8U8vzSJDVxOi/wdyf9/8oRmYd6FcNtgEiDAA59IfqCHN/bQsVzoONKPB1C0PZ
 fCy4Pm2bbwffqCLC0d0QO0cs2Ow7mRgIWnPafzhXQnIJzNx6ZBx8bvmQNWUMg3Mg
 Ys+tbuN6I53dbYtbAgmLwOP1/4tyMi7vTAli007EZx1XhjXWvnOLxHN1GYFrcbLB
 UUWdru7PxIEEpb+HQDMd0I8lY9qG22Sem9UkzP6N9478ysIAbnZp5s05lucJ7ZeN
 l/XDofB5+CpL+8B7QxRR
 =aFyJ
 -----END PGP SIGNATURE-----

Merge tag 'dlm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This includes several large patches to improve endian handling and
  remove sparse warnings. The code previously used in/out, in-place
  endianness conversion functions.

  Other code cleanup includes the list iterator changes.

  Finally, a long standing bug was found and fixed, caused by missed
  decrement on an lock struct ref count"

* tag 'dlm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (28 commits)
  dlm: use kref_put_lock in __put_lkb
  dlm: use kref_put_lock in put_rsb
  dlm: remove unnecessary error assign
  dlm: fix missing lkb refcount handling
  fs: dlm: cast resource pointer to uintptr_t
  dlm: replace usage of found with dedicated list iterator variable
  dlm: remove usage of list iterator for list_add() after the loop body
  dlm: fix pending remove if msg allocation fails
  dlm: fix wake_up() calls for pending remove
  dlm: check required context while close
  dlm: cleanup lock handling in dlm_master_lookup
  dlm: remove found label in dlm_master_lookup
  dlm: remove __user conversion warnings
  dlm: move conversion to compile time
  dlm: use __le types for dlm messages
  dlm: use __le types for rcom messages
  dlm: use __le types for dlm header
  dlm: use __le types for options header
  dlm: add __CHECKER__ for false positives
  dlm: move global to static inits
  ...
2022-05-24 19:09:16 -07:00
Linus Torvalds
fea3043314 Various bug fixes and cleanups for ext4. In particular, move the
crypto related fucntions from fs/ext4/super.c into a new
 fs/ext4/crypto.c, and fix a number of bugs found by fuzzers and error
 injection tools.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmKNOh0ACgkQ8vlZVpUN
 gaP4kwf+KfqZ/iBDOOCMKV5C7/Z4ieiMLeNqzCWmvju7jceYBoSLOIz3w5MFjEV9
 5ZB/6MovMZ/vZRtm76k0K01ayHKUd1BKjwwvIaABjdNVDTar5Wg/Tq7MF0OMQ5Kw
 ec5rvOQ05VzbXwf/JOjp7IHP/9yEbtgKjAYzgVyMVGrE8jxLQ+UOSUBzzZEHv/js
 Xh7GmRGEs5V7bj+V4SuCaEKSf3wYjT/zlJNIPtsg9RJeQojOP2qlOFhcGeduF1X/
 E4OwabfHqdmlbdI0vL3ANb8nByi/bA0p8i9PGqGIDx0nRUK9UzJCjePmkPux6koT
 pPZLo8DKR8g5i0Hn/ennA9tAIXIaXg==
 =OliY
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Various bug fixes and cleanups for ext4.

  In particular, move the crypto related fucntions from fs/ext4/super.c
  into a new fs/ext4/crypto.c, and fix a number of bugs found by fuzzers
  and error injection tools"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
  ext4: only allow test_dummy_encryption when supported
  ext4: fix bug_on in __es_tree_search
  ext4: avoid cycles in directory h-tree
  ext4: verify dir block before splitting it
  ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state
  ext4: fix bug_on in ext4_writepages
  ext4: refactor and move ext4_ioctl_get_encryption_pwsalt()
  ext4: cleanup function defs from ext4.h into crypto.c
  ext4: move ext4 crypto code to its own file crypto.c
  ext4: fix memory leak in parse_apply_sb_mount_options()
  ext4: reject the 'commit' option on ext2 filesystems
  ext4: remove duplicated #include of dax.h in inode.c
  ext4: fix race condition between ext4_write and ext4_convert_inline_data
  ext4: convert symlink external data block mapping to bdev
  ext4: add nowait mode for ext4_getblk()
  ext4: fix journal_ioprio mount option handling
  ext4: mark group as trimmed only if it was fully scanned
  ext4: fix use-after-free in ext4_rename_dir_prepare
  ext4: add unmount filesystem message
  ext4: remove unnecessary conditionals
  ...
2022-05-24 19:04:46 -07:00
Linus Torvalds
7208c9842c gfs2 fixes
- Clean up the allocation of glocks that have an address space attached.
 - Quota locking fix and quota iomap conversion.
 - Fix the FITRIM error reporting.
 - Some list iterator cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmKNMukUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqKig/8DV+EjY6hLX7Olq5EL+TgPvUJ6fyy
 UQMRLL0wZerC9ArIv0yGTAtSpU6mDJTI4TbNS69LnQWfNYiP3WhZg1mrO4vEAP7o
 1h1yE4MCIKV0oDRzENqshzhaKFcvCJvcoY8ToLWYRgwgtuCZ9GLlQ8v38j+yZmQk
 hD+Z5LWNaQ51L8cl2A+yDfw8EFlaQgcJfcU3GUPdo8WDj+BXzTM0gI6Awrrwtg16
 y0NyEJiCRn5Cx4h/HBhqoMrOIfz+wWgtYov2OQowlVGzgDsNQ7TG7a4qKCOKEBww
 bZYVlet6zUt7csAweTcYXbCzL075dYOEiJ25tIpn2WXtDrwjTNNO3e78UXqy85Fb
 PPp4iESAuvqvQwByNazH4Nc0sJdDvKto8cDpvenoJtr6cEDBqSBAibFmojGde4e1
 X7VKBSyREnkjZ+FOr24wYTD0ztkfs0M91Ax5Ha22DQOmtmt1XNIEynGQYT+VDqJL
 LDnvObANIZ6h3+bKXB/ddBhCKkV3POI1VojWNwywOD2yryKVVVFu51S+dbLaeyaz
 W4hz6pqTl1fnFwWv9WPpYS/0kShQrHndhK8eCBZW32+0MpNbDlLfnjSwE7PoO4gV
 ZzqjfJYDYeRW5rq0CSIvbXvNy5EH9Av4fjuZ2/JOQRDG5LHa5Ov+5Yk+XrWcQ8fN
 duiMl/gvm8DKMUU=
 =7hCN
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Clean up the allocation of glocks that have an address space attached

 - Quota locking fix and quota iomap conversion

 - Fix the FITRIM error reporting

 - Some list iterator cleanups

* tag 'gfs2-v5.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Convert function bh_get to use iomap
  gfs2: use i_lock spin_lock for inode qadata
  gfs2: Return more useful errors from gfs2_rgrp_send_discards()
  gfs2: Use container_of() for gfs2_glock(aspace)
  gfs2: Explain some direct I/O oddities
  gfs2: replace 'found' with dedicated list iterator variable
2022-05-24 19:00:41 -07:00
Linus Torvalds
bd1b7c1384 for-5.19-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKLxJAACgkQxWXV+ddt
 WDvC4BAAnSNwZ15FJKe5Y423f6PS6EXjyMuc5t/fW6UumTTbI+tsS+Glkis+JNBf
 BiDZSlVQmiK9WoQSJe04epZgHaK8MaCARyZaRaxjDC4Nvfq4DlD9mbAU9D6e7tZY
 Mo8M99D8wDW+SB+P8RBpNjwB/oGCMmE3nKC83g+1ObmA0FVRCyQ1Kazf8RzNT1rZ
 DiaJoKTvU1/wDN3/1rw5yG+EfW2m9A14gRCihslhFYaDV7jhpuabl8wLT7MftZtE
 MtJ6EOOQbgIDjnp5BEIrPmowW/N0tKDT/gorF7cWgLG2R1cbSlKgqSH1Sq7CjFUE
 AKj/DwfqZArPLpqMThWklCwy2B9qDEezrQSy7renP/vkeFLbOp8hQuIY5KRzohdG
 oDI8ThlQGtCVjbny6NX/BbCnWRAfTz0TquCgag3Xl8NbkRFgFJtkf/cSxzb+3LW1
 tFeiUyTVLXVDS1cZLwgcb29Rrtp4bjd5/v3uECQlVD+or5pcAqSMkQgOBlyQJGbE
 Xb0nmPRihzQ8D4vINa63WwRyq0+QczVjvBxKj1daas0VEKGd32PIBS/0Qha+EpGl
 uFMiHBMSfqyl8QcShFk0cCbcgPMcNc7I6IAbXCE/WhhFG0ytqm9vpmlLqsTrXmHH
 z7/Eye/waqgACNEXoA8C4pyYzduQ4i1CeLDOdcsvBU6XQSuicSM=
 =lv6P
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features:

   - subpage:
      - support for PAGE_SIZE > 4K (previously only 64K)
      - make it work with raid56

   - repair super block num_devices automatically if it does not match
     the number of device items

   - defrag can convert inline extents to regular extents, up to now
     inline files were skipped but the setting of mount option
     max_inline could affect the decision logic

   - zoned:
      - minimal accepted zone size is explicitly set to 4MiB
      - make zone reclaim less aggressive and don't reclaim if there are
        enough free zones
      - add per-profile sysfs tunable of the reclaim threshold

   - allow automatic block group reclaim for non-zoned filesystems, with
     sysfs tunables

   - tree-checker: new check, compare extent buffer owner against owner
     rootid

  Performance:

   - avoid blocking on space reservation when doing nowait direct io
     writes (+7% throughput for reads and writes)

   - NOCOW write throughput improvement due to refined locking (+3%)

   - send: reduce pressure to page cache by dropping extent pages right
     after they're processed

  Core:

   - convert all radix trees to xarray

   - add iterators for b-tree node items

   - support printk message index

   - user bulk page allocation for extent buffers

   - switch to bio_alloc API, use on-stack bios where convenient, other
     bio cleanups

   - use rw lock for block groups to favor concurrent reads

   - simplify workques, don't allocate high priority threads for all
     normal queues as we need only one

   - refactor scrub, process chunks based on their constraints and
     similarity

   - allocate direct io structures on stack and pass around only
     pointers, avoids allocation and reduces potential error handling

  Fixes:

   - fix count of reserved transaction items for various inode
     operations

   - fix deadlock between concurrent dio writes when low on free data
     space

   - fix a few cases when zones need to be finished

  VFS, iomap:

   - add helper to check if sb write has started (usable for assertions)

   - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io"

* tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits)
  btrfs: zoned: introduce a minimal zone size 4M and reject mount
  btrfs: allow defrag to convert inline extents to regular extents
  btrfs: add "0x" prefix for unsupported optional features
  btrfs: do not account twice for inode ref when reserving metadata units
  btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
  btrfs: send: avoid trashing the page cache
  btrfs: send: keep the current inode open while processing it
  btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
  btrfs: move struct btrfs_dio_private to inode.c
  btrfs: remove the disk_bytenr in struct btrfs_dio_private
  btrfs: allocate dio_data on stack
  iomap: add per-iomap_iter private data
  iomap: allow the file system to provide a bio_set for direct I/O
  btrfs: add a btrfs_dio_rw wrapper
  btrfs: zoned: zone finish unused block group
  btrfs: zoned: properly finish block group on metadata write
  btrfs: zoned: finish block group when there are no more allocatable bytes left
  btrfs: zoned: consolidate zone finish functions
  btrfs: zoned: introduce btrfs_zoned_bg_is_full
  btrfs: improve error reporting in lookup_inline_extent_backref
  ...
2022-05-24 18:52:35 -07:00
Linus Torvalds
3842007b1a zonefs fix for 5.19-rc1
A single patch to fix zonefs_init_file_inode() return value.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYoyTTwAKCRDdoc3SxdoY
 dgtgAQC0x4Ecx415K1KuiMEWQU0+b14y6E5dg6r+puXVhRDwmgD9Hj4QJblfdcjz
 OPvppXawiA0fzfcOKpm7PJmzT2ASYAM=
 =Nnnm
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.19-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:
 "A single patch to fix zonefs_init_file_inode() return value"

* tag 'zonefs-5.19-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Fix zonefs_init_file_inode() return value
2022-05-24 18:48:36 -07:00
Linus Torvalds
65965d9530 Changes since last update:
- Add erofs on-demand load support over fscache;
 
  - Support NFS export for erofs;
 
  - Support idmapped mounts for erofs;
 
  - Don't prompt for risk any more when using big pcluster;
 
  - Fix buffer copy overflow of ztailpacking feature;
 
  - Several minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYojqfREceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBJ/vAP0XBbClZjsHhiSI/Gkp3UTcQHjR+uDIb2QR
 FhAui79F+QEAqCHoKF/F6YFkJdWtH0t6rBeNt6NL0UNU9hw3riF3IwY=
 =bcu7
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs (and fscache) updates from Gao Xiang:
 "After working on it on the mailing list for more than half a year, we
  finally form 'erofs over fscache' feature into shape. Hopefully it
  could bring more possibility to the communities.

  The story mainly started from a new project what we called "RAFS v6" [1]
  for Nydus image service almost a year ago, which enhances EROFS to be
  a new form of one bootstrap (which includes metadata representing the
  whole fs tree) + several data-deduplicated content addressable blobs
  (actually treated as multiple devices). Each blob can represent one
  container image layer but not quite exactly since all new data can be
  fully existed in the previous blobs so no need to introduce another
  new blob.

  It is actually not a new idea (at least on my side it's much like a
  simpilied casync [2] for now) and has many benefits over per-file
  blobs or some other exist ways since typically each RAFS v6 image only
  has dozens of device blobs instead of thousands of per-file blobs.
  It's easy to be signed with user keys as a golden image, transfered
  untouchedly with minimal overhead over the network, kept in some type
  of storage conveniently, and run with (optional) runtime verification
  but without involving too many irrelevant features crossing the system
  beyond EROFS itself. At least it's our final goal and we're keeping
  working on it. There was also a good summary of this approach from the
  casync author [3].

  Regardless further optimizations, this work is almost done in the
  previous Linux release cycles. In this round, we'd like to introduce
  on-demand load for EROFS with the fscache/cachefiles infrastructure,
  considering the following advantages:

   - Introduce new file-based backend to EROFS. Although each image only
     contains dozens of blobs but in densely-deployed runC host for
     example, there could still be massive blobs on a machine, which is
     messy if each blob is treated as a device. In contrast, fscache and
     cachefiles are really great interfaces for us to make them work.

   - Introduce on-demand load to fscache and EROFS. Previously, fscache
     is mainly used to caching network-likewise filesystems, now it can
     support on-demand downloading for local fses too with the exact
     localfs on-disk format. It has many advantages which we're been
     described in the latest patchset cover letter [4]. In addition to
     that, most importantly, the cached data is still stored in the
     original local fs on-disk format so that it's still the one signed
     with private keys but only could be partially available. Users can
     fully trust it during running. Later, users can also back up
     cachefiles easily to another machine.

   - More reliable on-demand approach in principle. After data is all
     available locally, user daemon can be no longer online in some use
     cases, which helps daemon crash recovery (filesystems can still in
     service) and hot-upgrade (user daemon can be upgraded more
     frequently due to new features or protocols introduced.)

   - Other format can also be converted to EROFS filesystem format over
     the internet on the fly with the new on-demand load feature and
     mounted. That is entirely possible with on-demand load feature as
     long as such archive format metadata can be fetched in advance like
     stargz.

  In addition, although currently our target user is Nydus image service [5],
  but laterly, it can be used for other use cases like on-demand system
  booting, etc. As for the fscache on-demand load feature itself,
  strictly it can be used for other local fses too. Laterly we could
  promote most code to the iomap infrastructure and also enhance it in
  the read-write way if other local fses are interested.

  Thanks David Howells for taking so much time and patience on this
  these months, many thanks with great respect here again! Thanks Jeffle
  for working on this feature and Xin Yin from Bytedance for
  asynchronous I/O implementation as well as Zichen Tian, Jia Zhu, and
  Yan Song for testing, much appeciated. We're also exploring more
  possibly over fscache cache management over FSDAX for secure
  containers and working on more improvements and useful features for
  fscache, cachefiles, and on-demand load.

  In addition to "erofs over fscache", NFS export and idmapped mount are
  also completed in this cycle for container use cases as well.

  Summary:

   - Add erofs on-demand load support over fscache

   - Support NFS export for erofs

   - Support idmapped mounts for erofs

   - Don't prompt for risk any more when using big pcluster

   - Fix buffer copy overflow of ztailpacking feature

   - Several minor cleanups"

[1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com
[2] https://github.com/systemd/casync
[3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
[4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com
[5] https://github.com/dragonflyoss/image-service

* tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits)
  erofs: scan devices from device table
  erofs: change to use asynchronous io for fscache readpage/readahead
  erofs: add 'fsid' mount option
  erofs: implement fscache-based data readahead
  erofs: implement fscache-based data read for inline layout
  erofs: implement fscache-based data read for non-inline layout
  erofs: implement fscache-based metadata read
  erofs: register fscache context for extra data blobs
  erofs: register fscache context for primary data blob
  erofs: add erofs_fscache_read_folios() helper
  erofs: add anonymous inode caching metadata for data blobs
  erofs: add fscache context helper functions
  erofs: register fscache volume
  erofs: add fscache mode check helper
  erofs: make erofs_map_blocks() generally available
  cachefiles: document on-demand read mode
  cachefiles: add tracepoints for on-demand read mode
  cachefiles: enable on-demand read mode
  cachefiles: implement on-demand read
  cachefiles: notify the user daemon when withdrawing cookie
  ...
2022-05-24 18:42:04 -07:00
Linus Torvalds
850f6033cd Description for this pull request:
- fix referencing wrong parent directory information during rename.
  - introduce a sys_tz mount option to use system timezone.
  - improve performance while zeroing a cluster with dirsync mount option.
  - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmKLhPkWHGxpbmtpbmpl
 b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCOeXD/9nJofEx/n9KK0pA1WP2zKoLBnP
 YgLfnWTBlXfErInklW4kg057S6q8M0pDm0iASLw9P6GZNe0VFV5PigrTyAjy5ghW
 hm3JiFAHIZgaOlOk2NQd/1Qv/IdlnkbRkngXqHcizxEX/LcKZpP+pb1mdW6NzWrt
 /HagLClFTUhb0Su3DT7TCqiam5lI+lkarRI0Jo4Scstgsn4aT+25jk9N0bfUih7f
 hGKJpii+5UWCLlBJnyyghrBRQiiPdsETadJdRnHgeDdzKg/UNWxMP0C+G6PmSko/
 mScrR+FeH2toURSUESi1Q558z1+3Fhb8rMbl3aWV70FJDmzMwn9YyhPgBrX3x6Gb
 AF7UBHFvORStYRUmmSMbX9XkY2gNoI9qZMXghDRlgF8t/WWY8VeVnyslaWwqDQhw
 qXyOIThiCuhLfKTD+r+MM08oUPcyFBtuGvdzDOH7/b56zEDwzab+hSHZz94xPWEz
 ESk0hNhaJCEvcgEr7IlSSF4k5Ff+hWVKN4R/DD78yxmjHveOuNMibTeE2YgDIFEX
 SiKFaAiXUuWGVsAPuAeJ+np/7rW3OEBG8yKhtri5vsoXk+Mqd56rp0EkEHzqbVHQ
 Ki5gua549KNRuxnbXRtWLjCKucwN86mE45WD0P0ORnBOjlfmgg8adp6BBSW5yVD2
 SZkfgI0FL8rWdBwRcQ==
 =fr78
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - fix referencing wrong parent directory information during rename

 - introduce a sys_tz mount option to use system timezone

 - improve performance while zeroing a cluster with dirsync mount option

 - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot

* tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: check if cluster num is valid
  exfat: reduce block requests when zeroing a cluster
  block: add sync_blockdev_range()
  exfat: introduce mount option 'sys_tz'
  exfat: fix referencing wrong parent directory information after renaming
2022-05-24 18:30:27 -07:00
Linus Torvalds
f30fabe78a fs.idmapped.v5.19
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYotC2wAKCRCRxhvAZXjc
 omivAQD7hDdmZdhGaWgHJKGMofPJ+j62F7QPyoc1UPEkr0sMvAEA1EehhXkw4E8L
 6aFsXKs+Bb77TfdZI5EI7cUw1fAWUwE=
 =wlyp
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs idmapping updates from Christian Brauner:
 "This contains two minor updates:

   - An update to the idmapping documentation by Rodrigo making it
     easier to understand that we first introduce several use-cases that
     fail without idmapped mounts simply to explain how they can be
     handled with idmapped mounts.

   - When changing a mount's idmapping we now hold writers to make it
     more robust.

     This is similar to turning a mount ro with the difference that in
     contrast to turning a mount ro changing the idmapping can only ever
     be done once while a mount can transition between ro and rw as much
     as it wants.

     The vfs layer itself takes care to retrieve the idmapping of a
     mount once ensuring that the idmapping used for vfs permission
     checking is identical to the idmapping passed down to the
     filesystem. All filesystems with FS_ALLOW_IDMAP raised take the
     same precautions as the vfs in code-paths that are outside of
     direct control of the vfs such as ioctl()s.

     However, holding writers makes this more robust and predictable for
     both the kernel and userspace.

     This is a minor user-visible change. But it is extremely unlikely
     to matter. The caller must've created a detached mount via
     OPEN_TREE_CLONE and then handed that O_PATH fd to another process
     or thread which then must've gotten a writable fd for that mount
     and started creating files in there while the caller is still
     changing mount properties. While not impossible it will be an
     extremely rare corner-case and should in general be considered a
     bug in the application. Consider making a mount MOUNT_ATTR_NOEXEC
     or MOUNT_ATTR_NODEV while allowing someone else to perform lookups
     or exec'ing in parallel by handing them a copy of the
     OPEN_TREE_CLONE fd or another fd beneath that mount.

     I've pinged all major users of idmapped mounts pointing out this
     change and none of them have active writers on a mount while still
     changing mount properties. It would've been strange if they did.

  The rest and majority of the work will be coming through the overlayfs
  tree this cycle. In addition to overlayfs this cycle should also see
  support for idmapped mounts on erofs as I've acked a patch to this
  effect a little while ago"

* tag 'fs.idmapped.v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: hold writers when changing mount's idmapping
  docs: Add small intro to idmap examples
2022-05-24 18:19:06 -07:00
Linus Torvalds
0350785b0a integrity-v5.19
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCYo0tOhQcem9oYXJAbGlu
 dXguaWJtLmNvbQAKCRDLwZzRsCrn5QJfAP47Ym9vacLc1m8/MUaRA/QjbJ/8t3TX
 h/4McK8kiRudxgD/RiPHII6gJ8q+qpBrYWJZ4ZZaHE8v0oA1viuZfbuN2wc=
 =KQYi
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull IMA updates from Mimi Zohar:
 "New is IMA support for including fs-verity file digests and signatures
  in the IMA measurement list as well as verifying the fs-verity file
  digest based signatures, both based on policy.

  In addition, are two bug fixes:

   - avoid reading UEFI variables, which cause a page fault, on Apple
     Macs with T2 chips.

   - remove the original "ima" template Kconfig option to address a boot
     command line ordering issue.

  The rest is a mixture of code/documentation cleanup"

* tag 'integrity-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Fix sparse warnings in keyring_handler
  evm: Clean up some variables
  evm: Return INTEGRITY_PASS for enum integrity_status value '0'
  efi: Do not import certificates from UEFI Secure Boot for T2 Macs
  fsverity: update the documentation
  ima: support fs-verity file digest based version 3 signatures
  ima: permit fsverity's file digests in the IMA measurement list
  ima: define a new template field named 'd-ngv2' and templates
  fs-verity: define a function to return the integrity protected file digest
  ima: use IMA default hash algorithm for integrity violations
  ima: fix 'd-ng' comments and documentation
  ima: remove the IMA_TEMPLATE Kconfig option
  ima: remove redundant initialization of pointer 'file'.
2022-05-24 13:50:39 -07:00
Linus Torvalds
a6b450573b execve updates for v5.19-rc1
- Fix binfmt_flat GOT handling for riscv (Niklas Cassel)
 
 - Remove unused/broken shared library and coredump code (Eric W. Biederman)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmKL3fAWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlVjD/47IFUgk2hmBlQbUCgA8iEv5IOY
 SmoXSrnT0fQOHNenubNODWUwpcfojRBOiOQbFuOlpSj3Ehsy+IwPGBUB+Kirq013
 joTeQuJfu+jsO34QycHDiEOb8hn6/eqWDb+VPHTUJOEzZXLDS8pExqvcifp/NrpB
 iIugY0yUjxtN5XlaGFV9cI5ZN1MjEBUnzzNlf7ik3dU4ydyqjOwnWXk8MjAF6+68
 SVh3mXpSUEx//MZD9LvDOmSFNdvosVD3UviX3X4ZvuHRzHe7ySgWxYyLi2uskM88
 /TMCRo55APrdJLFN7RhP68ufdkfIGrlLvdJsimg441GlQM82f3TYdDrFJOGgPAqj
 sy0vhdzDGmm0qlnhMc0LjhR2UsZTsusbKV8V1CmD/fpMh0FF3iAT5U/0o6ZM6pas
 7PocJTBahbI1yvDtkP2bUk/ZyQICTKaaQZrB/bqN66ccppHOSdaA9m1jhmDG6d2e
 g0AbyPQwTd29yr459cv7IHGp/zjfzEUy6Hcki4Wn1Lj6cDZUuOt2bu4H7bpRvlLM
 CJHJYHNlYPWXvQAgV8rBBk49rQzwBxMBDOfDk1Lbfe9RfbkUL9oLWFTQO+p3rufm
 qgUk1WQNqWaQyfx3InLnY1bPs9ZKy4/LHfxm3vfWjpnhdfVxfxJ28locZl/ypk1A
 lKG4oNzZ79wsSaAueg==
 =ci+a
 -----END PGP SIGNATURE-----

Merge tag 'execve-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve updates from Kees Cook:

 - Fix binfmt_flat GOT handling for riscv (Niklas Cassel)

 - Remove unused/broken binfmt_flat shared library and coredump code
   (Eric W. Biederman)

* tag 'execve-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_flat: Remove shared library support
  binfmt_flat: Drop vestiges of coredump support
  binfmt_flat: do not stop relocating GOT entries prematurely on riscv
2022-05-24 12:49:48 -07:00
Eric Biggers
5f41fdaea6 ext4: only allow test_dummy_encryption when supported
Make the test_dummy_encryption mount option require that the encrypt
feature flag be already enabled on the filesystem, rather than
automatically enabling it.  Practically, this means that "-O encrypt"
will need to be included in MKFS_OPTIONS when running xfstests with the
test_dummy_encryption mount option.  (ext4/053 also needs an update.)

Moreover, as long as the preconditions for test_dummy_encryption are
being tightened anyway, take the opportunity to start rejecting it when
!CONFIG_FS_ENCRYPTION rather than ignoring it.

The motivation for requiring the encrypt feature flag is that:

- Having the filesystem auto-enable feature flags is problematic, as it
  bypasses the usual sanity checks.  The specific issue which came up
  recently is that in kernel versions where ext4 supports casefold but
  not encrypt+casefold (v5.1 through v5.10), the kernel will happily add
  the encrypt flag to a filesystem that has the casefold flag, making it
  unmountable -- but only for subsequent mounts, not the initial one.
  This confused the casefold support detection in xfstests, causing
  generic/556 to fail rather than be skipped.

- The xfstests-bld test runners (kvm-xfstests et al.) already use the
  required mkfs flag, so they will not be affected by this change.  Only
  users of test_dummy_encryption alone will be affected.  But, this
  option has always been for testing only, so it should be fine to
  require that the few users of this option update their test scripts.

- f2fs already requires it (for its equivalent feature flag).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24 15:34:27 -04:00
Baokun Li
d36f6ed761 ext4: fix bug_on in __es_tree_search
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/extents_status.c:199!
[...]
RIP: 0010:ext4_es_end fs/ext4/extents_status.c:199 [inline]
RIP: 0010:__es_tree_search+0x1e0/0x260 fs/ext4/extents_status.c:217
[...]
Call Trace:
 ext4_es_cache_extent+0x109/0x340 fs/ext4/extents_status.c:766
 ext4_cache_extents+0x239/0x2e0 fs/ext4/extents.c:561
 ext4_find_extent+0x6b7/0xa20 fs/ext4/extents.c:964
 ext4_ext_map_blocks+0x16b/0x4b70 fs/ext4/extents.c:4384
 ext4_map_blocks+0xe26/0x19f0 fs/ext4/inode.c:567
 ext4_getblk+0x320/0x4c0 fs/ext4/inode.c:980
 ext4_bread+0x2d/0x170 fs/ext4/inode.c:1031
 ext4_quota_read+0x248/0x320 fs/ext4/super.c:6257
 v2_read_header+0x78/0x110 fs/quota/quota_v2.c:63
 v2_check_quota_file+0x76/0x230 fs/quota/quota_v2.c:82
 vfs_load_quota_inode+0x5d1/0x1530 fs/quota/dquot.c:2368
 dquot_enable+0x28a/0x330 fs/quota/dquot.c:2490
 ext4_quota_enable fs/ext4/super.c:6137 [inline]
 ext4_enable_quotas+0x5d7/0x960 fs/ext4/super.c:6163
 ext4_fill_super+0xa7c9/0xdc00 fs/ext4/super.c:4754
 mount_bdev+0x2e9/0x3b0 fs/super.c:1158
 mount_fs+0x4b/0x1e4 fs/super.c:1261
[...]
==================================================================

Above issue may happen as follows:
-------------------------------------
ext4_fill_super
 ext4_enable_quotas
  ext4_quota_enable
   ext4_iget
    __ext4_iget
     ext4_ext_check_inode
      ext4_ext_check
       __ext4_ext_check
        ext4_valid_extent_entries
         Check for overlapping extents does't take effect
   dquot_enable
    vfs_load_quota_inode
     v2_check_quota_file
      v2_read_header
       ext4_quota_read
        ext4_bread
         ext4_getblk
          ext4_map_blocks
           ext4_ext_map_blocks
            ext4_find_extent
             ext4_cache_extents
              ext4_es_cache_extent
               ext4_es_cache_extent
                __es_tree_search
                 ext4_es_end
                  BUG_ON(es->es_lblk + es->es_len < es->es_lblk)

The error ext4 extents is as follows:
0af3 0300 0400 0000 00000000    extent_header
00000000 0100 0000 12000000     extent1
00000000 0100 0000 18000000     extent2
02000000 0400 0000 14000000     extent3

In the ext4_valid_extent_entries function,
if prev is 0, no error is returned even if lblock<=prev.
This was intended to skip the check on the first extent, but
in the error image above, prev=0+1-1=0 when checking the second extent,
so even though lblock<=prev, the function does not return an error.
As a result, bug_ON occurs in __es_tree_search and the system panics.

To solve this problem, we only need to check that:
1. The lblock of the first extent is not less than 0.
2. The lblock of the next extent  is not less than
   the next block of the previous extent.
The same applies to extent_idx.

Cc: stable@kernel.org
Fixes: 5946d08937 ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518120816.1541863-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24 15:34:17 -04:00
Jan Kara
3ba733f879 ext4: avoid cycles in directory h-tree
A maliciously corrupted filesystem can contain cycles in the h-tree
stored inside a directory. That can easily lead to the kernel corrupting
tree nodes that were already verified under its hands while doing a node
split and consequently accessing unallocated memory. Fix the problem by
verifying traversed block numbers are unique.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24 15:34:13 -04:00
Jan Kara
46c116b920 ext4: verify dir block before splitting it
Before splitting a directory block verify its directory entries are sane
so that the splitting code does not access memory it should not.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24 15:34:08 -04:00
Theodore Ts'o
c878bea3c9 ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state
The EXT4_FC_REPLAY bit in sbi->s_mount_state is used to indicate that
we are in the middle of replay the fast commit journal.  This was
actually a mistake, since the sbi->s_mount_info is initialized from
es->s_state.  Arguably s_mount_state is misleadingly named, but the
name is historical --- s_mount_state and s_state dates back to ext2.

What should have been used is the ext4_{set,clear,test}_mount_flag()
inline functions, which sets EXT4_MF_* bits in sbi->s_mount_flags.

The problem with using EXT4_FC_REPLAY is that a maliciously corrupted
superblock could result in EXT4_FC_REPLAY getting set in
s_mount_state.  This bypasses some sanity checks, and this can trigger
a BUG() in ext4_es_cache_extent().  As a easy-to-backport-fix, filter
out the EXT4_FC_REPLAY bit for now.  We should eventually transition
away from EXT4_FC_REPLAY to something like EXT4_MF_REPLAY.

Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20220420192312.1655305-1-phind.uet@gmail.com
Link: https://lore.kernel.org/r/20220517174028.942119-1-tytso@mit.edu
Reported-by: syzbot+c7358a3cd05ee786eb31@syzkaller.appspotmail.com
2022-05-24 15:33:58 -04:00
Ronnie Sahlberg
d87c48ce4d cifs: cache the dirents for entries in a cached directory
This adds caching of the directory entries for a cached directory while we keep
a lease on the directory.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-24 14:33:11 -05:00
Bob Peterson
c360abbb9d gfs2: Convert function bh_get to use iomap
Before this patch, function bh_get used block_map to figure out the
block it needed to read in from the quota_change file. This patch
changes it to use iomap directly to make it more efficient.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24 21:29:14 +02:00
Bob Peterson
5fcff61eea gfs2: use i_lock spin_lock for inode qadata
Before this patch, functions gfs2_qa_get and _put used the i_rw_mutex to
prevent simultaneous access to its i_qadata. But i_rw_mutex is now used
for many other things, including iomap_begin and end, which causes a
conflict according to lockdep. We cannot just remove the lock since
simultaneous opens (gfs2_open -> gfs2_open_common -> gfs2_qa_get) can
then stomp on each others values for i_qadata.

This patch solves the conflict by using the i_lock spin_lock in the inode
to prevent simultaneous access.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24 21:29:14 +02:00
Andrew Price
f4a47561fc gfs2: Return more useful errors from gfs2_rgrp_send_discards()
The bug that 27ca8273f ("gfs2: Make sure FITRIM minlen is rounded up to
fs block size") fixes was a little confusing as the user saw
"Input/output error" which masked the -EINVAL that sb_issue_discard()
returned.

sb_issue_discard() can fail for various reasons, so we should return its
return value from gfs2_rgrp_send_discards() to avoid all errors being
reported as IO errors.

This improves error reporting for FITRIM and makes no difference to the
-o discard code path because the return value from
gfs2_rgrp_send_discards() gets thrown away in that case (and the option
switches off). Presumably that's why it was ok to just return -EIO in
the past, before FITRIM was implemented.

Tested with xfstests.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24 21:29:14 +02:00
Kees Cook
11d8b79e84 gfs2: Use container_of() for gfs2_glock(aspace)
Clang's structure layout randomization feature gets upset when it sees
struct address_space (which is randomized) cast to struct gfs2_glock.
This is due to seeing the mapping pointer as being treated as an array
of gfs2_glock, rather than "something else, before struct address_space":

In file included from fs/gfs2/acl.c:23:
fs/gfs2/meta_io.h:44:12: error: casting from randomized structure pointer type 'struct address_space *' to 'struct gfs2_glock *'
	return (((struct gfs2_glock *)mapping) - 1)->gl_name.ln_sbd;
		^

Replace the instances of open-coded pointer math with container_of()
usage, and update the allocator to match.

Some cleanups and conversion of gfs2_glock_get() and
gfs2_glock_dealloc() by Andreas.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202205041550.naKxwCBj-lkp@intel.com
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bill Wendling <morbo@google.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24 21:29:14 +02:00
Andreas Gruenbacher
53bb540fd5 gfs2: Explain some direct I/O oddities
Add some comments explaining the oddities of partial direct I/O reads
and writes.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24 21:29:14 +02:00
Linus Torvalds
51518aa68c fsverity updates for 5.19
A couple small cleanups for fs/verity/.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYosv7hQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/UUAQCFP7hbXhEWsL7GdlC/mbj37xt5eqZl
 Ezlkci0QZhi84QD7BeiTLaS5gwu4zWdNOL7zPawnA6+7tGuZjvphPNCvbAc=
 =q5iZ
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "A couple small cleanups for fs/verity/"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: Use struct_size() helper in enable_verity()
  fs-verity: remove unused parameter desc_size in fsverity_create_info()
2022-05-24 12:22:56 -07:00
Linus Torvalds
c1f4cfdbef fscrypt updates for 5.19
Some cleanups for fs/crypto/:
 
 - Split up the misleadingly-named FS_CRYPTO_BLOCK_SIZE constant.
 
 - Consistently report the encryption implementation that is being used.
 
 - Add helper functions for the test_dummy_encryption mount option that
   work properly with the new mount API.  ext4 and f2fs will use these.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYosv0xQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK17XAP0bsA+vaVC/r408kPVZKdbN/aekFJa6
 u81cTj2Yn1qylAEA4/3BXmjZerHF496lfFZCk7VpnN/3l5YOig2k1TECqQ8=
 =Glr9
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "Some cleanups for fs/crypto/:

   - Split up the misleadingly-named FS_CRYPTO_BLOCK_SIZE constant.

   - Consistently report the encryption implementation that is being
     used.

   - Add helper functions for the test_dummy_encryption mount option
     that work properly with the new mount API. ext4 and f2fs will use
     these"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: add new helper functions for test_dummy_encryption
  fscrypt: factor out fscrypt_policy_to_key_spec()
  fscrypt: log when starting to use inline encryption
  fscrypt: split up FS_CRYPTO_BLOCK_SIZE
2022-05-24 12:17:45 -07:00
Shyam Prasad N
5752bf645f cifs: avoid parallel session setups on same channel
After allowing channels to reconnect in parallel, it now
becomes important to take care that multiple processes do not
call negotiate/session setup in parallel on the same channel.

This change avoids that by marking a channel as "in_reconnect".
During session setup if the channel in question has this flag
set, we return immediately.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-24 14:16:32 -05:00
Shyam Prasad N
dd3cd8709e cifs: use new enum for ses_status
ses->status today shares statusEnum with server->tcpStatus.
This has been confusing, and tcon->status has deviated to use
a new enum. Follow suit and use new enum for ses_status as well.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-24 14:11:17 -05:00
Shyam Prasad N
1a6a41d4ce cifs: do not use tcpStatus after negotiate completes
Recent changes to multichannel to allow channel reconnects to
work in parallel and independent of each other did so by
making use of tcpStatus for the connection, and status for the
session. However, this did not take into account the multiuser
scenario, where same connection is used by multiple connections.

However, tcpStatus should be tracked only till the end of
negotiate exchange, and not used for session setup. This change
fixes this.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-24 14:08:25 -05:00
Jason Gunthorpe
a6f844da39 Linux 5.18
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmKKlIAeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGC3oH/iPm/fLG2sJut8My
 sU0RC9K+6ESV5h2Qy6k00/lqKstlu4EvBjw4V8vYpx3Q2+hbSFMn2SeWqqqT3Lkk
 Zb8KINCFuuyMtdCBb42PV0zhUf5pCQF7ocm/Ae4jllDHtPmqk3WJ6IGtZBK5JBlw
 z6RR/wKt0y0MRj9eZyPyYjOee2L2vuVh4tgnexK/4L8g2ZtMMRThhvUzSMWG4zxR
 STYYNp0uFcfT1Vt85+ODevFH4TvdECAj+SqAegN+seHLM17YY7M0/WiIYpxGRv8P
 lIpDQl4PBU8EBkpI5hkpJ/3qPincbuVOMLsYfxFtpcjjG12vGjFp2krGpS3TedZQ
 3mvaJ7c=
 =vLke
 -----END PGP SIGNATURE-----

Merge tag 'v5.18' into rdma.git for-next

Following patches have dependencies.

Resolve the merge conflict in
drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names
for the fs functions following linux-next:

https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-05-24 12:40:28 -03:00
Damien Le Moal
14bdb047a5 zonefs: Fix zonefs_init_file_inode() return value
Commit 87c9ce3ffe ("zonefs: Add active seq file accounting") wrongly
changed zonefs_init_file_inode() to always return 0 even if the call to
zonefs_zone_mgmt() fails. Fix this by propagating zonefs_zone_mgmt()
return value as the return value for zonefs_init_file_inode().

Fixes: 87c9ce3ffe ("zonefs: Add active seq file accounting")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-05-24 17:06:45 +09:00
Steve French
52832252dd smb3: add mount parm nosparse
To reduce risk of applications breaking that mount to servers
with only partial sparse file support, add optional mount parm
"nosparse" which disables setting files sparse (and thus
will return EOPNOTSUPP on certain fallocate operations).

Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-23 23:32:54 -05:00
Linus Torvalds
143a6252e1 arm64 updates for 5.19:
- Initial support for the ARMv9 Scalable Matrix Extension (SME). SME
   takes the approach used for vectors in SVE and extends this to provide
   architectural support for matrix operations. No KVM support yet, SME
   is disabled in guests.
 
 - Support for crashkernel reservations above ZONE_DMA via the
   'crashkernel=X,high' command line option.
 
 - btrfs search_ioctl() fix for live-lock with sub-page faults.
 
 - arm64 perf updates: support for the Hisilicon "CPA" PMU for monitoring
   coherent I/O traffic, support for Arm's CMN-650 and CMN-700
   interconnect PMUs, minor driver fixes, kerneldoc cleanup.
 
 - Kselftest updates for SME, BTI, MTE.
 
 - Automatic generation of the system register macros from a 'sysreg'
   file describing the register bitfields.
 
 - Update the type of the function argument holding the ESR_ELx register
   value to unsigned long to match the architecture register size
   (originally 32-bit but extended since ARMv8.0).
 
 - stacktrace cleanups.
 
 - ftrace cleanups.
 
 - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
   avoid executable mappings in kexec/hibernate code, drop TLB flushing
   from get_clear_flush() (and rename it to get_clear_contig()),
   ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmKH19IACgkQa9axLQDI
 XvEFWg//bf0p6zjeNaOJmBbyVFsXsVyYiEaLUpFPUs3oB+81s2YZ+9i1rgMrNCft
 EIDQ9+/HgScKxJxnzWf68heMdcBDbk76VJtLALExbge6owFsjByQDyfb/b3v/bLd
 ezAcGzc6G5/FlI1IP7ct4Z9MnQry4v5AG8lMNAHjnf6GlBS/tYNAqpmj8HpQfgRQ
 ZbhfZ8Ayu3TRSLWL39NHVevpmxQm/bGcpP3Q9TtjUqg0r1FQ5sK/LCqOksueIAzT
 UOgUVYWSFwTpLEqbYitVqgERQp9LiLoK5RmNYCIEydfGM7+qmgoxofSq5e2hQtH2
 SZM1XilzsZctRbBbhMit1qDBqMlr/XAy/R5FO0GauETVKTaBhgtj6mZGyeC9nU/+
 RGDljaArbrOzRwMtSuXF+Fp6uVo5spyRn1m8UT/k19lUTdrV9z6EX5Fzuc4Mnhed
 oz4iokbl/n8pDObXKauQspPA46QpxUYhrAs10B/ELc3yyp/Qj3jOfzYHKDNFCUOq
 HC9mU+YiO9g2TbYgCrrFM6Dah2E8fU6/cR0ZPMeMgWK4tKa+6JMEINYEwak9e7M+
 8lZnvu3ntxiJLN+PrPkiPyG+XBh2sux1UfvNQ+nw4Oi9xaydeX7PCbQVWmzTFmHD
 q7UPQ8220e2JNCha9pULS8cxDLxiSksce06DQrGXwnHc1Ir7T04=
 =0DjE
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Initial support for the ARMv9 Scalable Matrix Extension (SME).

   SME takes the approach used for vectors in SVE and extends this to
   provide architectural support for matrix operations. No KVM support
   yet, SME is disabled in guests.

 - Support for crashkernel reservations above ZONE_DMA via the
   'crashkernel=X,high' command line option.

 - btrfs search_ioctl() fix for live-lock with sub-page faults.

 - arm64 perf updates: support for the Hisilicon "CPA" PMU for
   monitoring coherent I/O traffic, support for Arm's CMN-650 and
   CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup.

 - Kselftest updates for SME, BTI, MTE.

 - Automatic generation of the system register macros from a 'sysreg'
   file describing the register bitfields.

 - Update the type of the function argument holding the ESR_ELx register
   value to unsigned long to match the architecture register size
   (originally 32-bit but extended since ARMv8.0).

 - stacktrace cleanups.

 - ftrace cleanups.

 - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
   avoid executable mappings in kexec/hibernate code, drop TLB flushing
   from get_clear_flush() (and rename it to get_clear_contig()),
   ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (145 commits)
  arm64/sysreg: Generate definitions for FAR_ELx
  arm64/sysreg: Generate definitions for DACR32_EL2
  arm64/sysreg: Generate definitions for CSSELR_EL1
  arm64/sysreg: Generate definitions for CPACR_ELx
  arm64/sysreg: Generate definitions for CONTEXTIDR_ELx
  arm64/sysreg: Generate definitions for CLIDR_EL1
  arm64/sve: Move sve_free() into SVE code section
  arm64: Kconfig.platforms: Add comments
  arm64: Kconfig: Fix indentation and add comments
  arm64: mm: avoid writable executable mappings in kexec/hibernate code
  arm64: lds: move special code sections out of kernel exec segment
  arm64/hugetlb: Implement arm64 specific huge_ptep_get()
  arm64/hugetlb: Use ptep_get() to get the pte value of a huge page
  arm64: kdump: Do not allocate crash low memory if not needed
  arm64/sve: Generate ZCR definitions
  arm64/sme: Generate defintions for SVCR
  arm64/sme: Generate SMPRI_EL1 definitions
  arm64/sme: Automatically generate SMPRIMAP_EL2 definitions
  arm64/sme: Automatically generate SMIDR_EL1 defines
  arm64/sme: Automatically generate defines for SMCR
  ...
2022-05-23 21:06:11 -07:00
Steve French
9ccfc23a72 smb3: don't set rc when used and unneeded in query_info_compound
rc is not checked so should not be set coming back from open_cached_dir
(the cfid pointer is checked instead to see if open_cached_dir failed)

Addresses-Coverity: 1518021 ("Code maintainability issues  (UNUSED_VALUE)")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-23 21:02:45 -05:00
Steve French
bbdf6cf56c smb3: check for null tcon
Although unlikely to be null, it is confusing to use a pointer
before checking for it to be null so move the use down after
null check.

Addresses-Coverity: 1517586 ("Null pointer dereferences  (REVERSE_INULL)")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-23 20:50:38 -05:00
Steve French
93ed91c020 cifs: fix minor compile warning
Add ifdef around nodfs variable from patch:
  "cifs: don't call cifs_dfs_query_info_nonascii_quirk() if nodfs was set"
which is unused when CONFIG_DFS_UPCALL is not set.

Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-23 20:24:12 -05:00
Steve French
a42078b9e8 Add various fsctl structs
Add missing structure definition for various newer fsctl operations
  - duplicate_extents_ex
  - get_integrity_information
  - query_file_regions
  - query_on_disk_volume_info

And move some fsctl defintions to smbfs_common

Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-23 20:24:12 -05:00
Steve French
22c5b91336 Add defines for various newer FSCTLs
Checking MS-FSCC section 2.3 found six FSCTL defines
that were missing

Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-23 20:23:57 -05:00
Linus Torvalds
a13dc4d409 - Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for unnecessary IPIs
 
 - Finally remove a.out support
 
 - The usual trivial cleanups and fixes all over x86
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe
 VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2
 XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV
 qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK
 /3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj
 0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK
 BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE
 M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB
 xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7
 2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W
 hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX
 TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0=
 =Dks/
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Serious sanitization and cleanup of the whole APERF/MPERF and
   frequency invariance code along with removing the need for
   unnecessary IPIs

 - Finally remove a.out support

 - The usual trivial cleanups and fixes all over x86

* tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86: Remove empty files
  x86/speculation: Add missing srbds=off to the mitigations= help text
  x86/prctl: Remove pointless task argument
  x86/aperfperf: Make it correct on 32bit and UP kernels
  x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
  x86/aperfmperf: Replace arch_freq_get_on_cpu()
  x86/aperfmperf: Replace aperfmperf_get_khz()
  x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  x86/aperfmperf: Make parts of the frequency invariance code unconditional
  x86/aperfmperf: Restructure arch_scale_freq_tick()
  x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  x86/aperfmperf: Separate AP/BP frequency invariance init
  x86/smp: Move APERF/MPERF code where it belongs
  x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  x86/process: Fix kernel-doc warning due to a changed function name
  x86: Remove a.out support
  x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
  x86: Replace cpumask_weight() with cpumask_empty() where appropriate
  x86/pkeys: Remove __arch_set_user_pkey_access() declaration
  ...
2022-05-23 18:17:09 -07:00
Linus Torvalds
140e40e39a zonefs changes for 5.19-rc1
This set of patches improve zonefs open sequential file accounting and
 adds accounting for active sequential files to allow the user to handle
 the maximum number of active zones of an NVMe ZNS drive. sysfs
 attributes for both open and active sequential files are also added to
 facilitate access to this information from applications without
 resorting to inspecting the block device limits.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYosTQQAKCRDdoc3SxdoY
 dqUWAQDGKoSkyRAPJAmuQXYOuOJTLu0b8DSfvyPopFLfKXpPHAEAg995JNTLUs0G
 R3m7lH6GK+OSBWhZ/Z5HOND3QS9BhgM=
 =hvqx
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs updates from Damien Le Moal:
 "This improves zonefs open sequential file accounting and adds
  accounting for active sequential files to allow the user to handle the
  maximum number of active zones of an NVMe ZNS drive.

  sysfs attributes for both open and active sequential files are also
  added to facilitate access to this information from applications
  without resorting to inspecting the block device limits"

* tag 'zonefs-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  documentation: zonefs: Document sysfs attributes
  documentation: zonefs: Cleanup the mount options section
  zonefs: Add active seq file accounting
  zonefs: Export open zone resource information through sysfs
  zonefs: Always do seq file write open accounting
  zonefs: Rename super block information fields
  zonefs: Fix management of open zones
  zonefs: Clear inode information flags on inode creation
2022-05-23 14:36:45 -07:00
Linus Torvalds
115cd47132 for-5.19/block-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17
 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s
 NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ
 aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A
 zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d
 sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE
 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2
 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td
 nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA
 sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv
 EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis
 iomvJ4yk0Q==
 =0Rw1
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Here are the core block changes for 5.19. This contains:

   - blk-throttle accounting fix (Laibin)

   - Series removing redundant assignments (Michal)

   - Expose bio cache via the bio_set, so that DM can use it (Mike)

   - Finish off the bio allocation interface cleanups by dealing with
     the weirdest member of the family. bio_kmalloc combines a kmalloc
     for the bio and bio_vecs with a hidden bio_init call and magic
     cleanup semantics (Christoph)

   - Clean up the block layer API so that APIs consumed by file systems
     are (almost) only struct block_device based, so that file systems
     don't have to poke into block layer internals like the
     request_queue (Christoph)

   - Clean up the blk_execute_rq* API (Christoph)

   - Clean up various lose end in the blk-cgroup code to make it easier
     to follow in preparation of reworking the blkcg assignment for bios
     (Christoph)

   - Fix use-after-free issues in BFQ when processes with merged queues
     get moved to different cgroups (Jan)

   - BFQ fixes (Jan)

   - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
     Wolfgang, me)"

* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
  blk-mq: fix typo in comment
  bfq: Remove bfq_requeue_request_body()
  bfq: Remove superfluous conversion from RQ_BIC()
  bfq: Allow current waker to defend against a tentative one
  bfq: Relax waker detection for shared queues
  blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
  blk-throttle: Set BIO_THROTTLED when bio has been throttled
  blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
  blk-cgroup: always terminate io.stat lines
  block, bfq: make bfq_has_work() more accurate
  block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
  block: cleanup the VM accounting in submit_bio
  block: Fix the bio.bi_opf comment
  block: reorder the REQ_ flags
  blk-iocost: combine local_stat and desc_stat to stat
  block: improve the error message from bio_check_eod
  block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
  block: remove superfluous calls to blkcg_bio_issue_init
  kthread: unexport kthread_blkcg
  blk-cgroup: cleanup blkcg_maybe_throttle_current
  ...
2022-05-23 13:56:39 -07:00
Linus Torvalds
df1c5d73d2 for-5.19/writeback-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrAMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuqeD/sH85B2UQOlHtcn10NOQSv9U8KerbJ9LOoq
 ZqCaBcb9NEPRvQPjOmFvr08S8rBvGcpGcniKpwvCZiX78mdp+DFAHZDJppWasSdX
 F5EXV+40Pxtg+kAOJNEh2XNXuTGRddys9i70sxbKbkLG9m74nT8pnDmND0WZn2KS
 3d1ljBKZkJ+Ohy1NuUXRTm9KkrMyjSrsOh0ge893DwY7Dmz7/M34wBvY4JOLnAjj
 03tz9Ge4/HNeqtEQMYCOFetxfKuxCeL583sJNP5SpmbCWFEnFtipY0ezGMUmDPoV
 QdLpqJTBMNpUiSLmNVmqQaaOF7IGdklWQRHoyFl3qspygnNe2xT+Lj3QHZnHTQVJ
 JaZRudW5eLTWYJ4wFw1FdhOQqXxU1NqNkFRblwdntPKfuq363URcwB9rFVCleNd0
 MMrUNDRZeYURfzpTMkbRKNJByDcdnbtvaxjhE8un1IwTyAzJ8TK3IvAr/sFt4xTB
 89R4lxRdZ+RD3dmhU6v+OrCJ5Xl3KlbmPTdfb21XSMF/NxizSWg6IY+Xwi3rlE8g
 b3lHETEpLV4jBA/OA/BsW2gOKxMwj/0hGUwXGAvr73haRWAxLOKjDpU5FhGi8sO1
 ioeZSO3AOlHxir0fYujvWcme4RsTWChdzZSlOUbYXV0UQVlq8s3PvuyI7XHdi7CB
 l+F3TvuOOw==
 =vChY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/writeback-2022-05-22' of git://git.kernel.dk/linux-block

Pull writeback fix from Jens Axboe:
 "A single writeback fix that didn't belong in any other branch,
  correcting the number of skipped pages"

* tag 'for-5.19/writeback-2022-05-22' of git://git.kernel.dk/linux-block:
  fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
2022-05-23 13:48:23 -07:00
Linus Torvalds
9836e93c0a for-5.19/io_uring-passthrough-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKovAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv9oD/4qCs7k3bPZZWZ6xoWb4EObyyWOUifi26lp
 vpsJHFUbA67S/i4++LV9H18SazWJ7h08ac4bjgZ+NQz40/1WkTN8/Fa76jo+BnNK
 7T10Wp4Ak6uwWVrKaA81pnT+G9+xmHlJ3X27aKxzLuT7BEPpShZ6ouFVjTkx9CzN
 LrLjuCDTOBBN+ZoaroWYfdLwTQX2VCAl9B15lOtQIlFvuuU8VlrvLboY+80K8TvY
 1wvTA2HTjnXoYx+/cTTMIFZIwQH3r1hsbwEDD8/YJj1+ouhSRQ1b0p/nk2pA+3ws
 HF5r/YS/rLBjlPF094IzeOBaUyA433AN1VhZqnII8ek7ViT3W3x+BRrgE9O6ZkWT
 0AjX1BXReI5rdFmxBmwsSdBnrSoGaJOf2GdsCCdubXBIi+F/RvyajrPf7PTB5zbW
 9WEK/uy3xvZsRVkUGAzOb9QGdvjcllgMzwPJsDegDCw5PdcPdT3mzy6KGIWipFLp
 j8R+br7hRMpOJv/YpihJDMzSDkQ/r1/SCwR4fpLid/QdSHG/eRTQK6c4Su5bNYEy
 QDy2F6kQdBVtEJCQHcEOsbhXzSTNBcdB+ujUUM5653FkaHe6y4JbomLrsNx407Id
 i/4ROwA5K1dioJx503Eap+OhbI5rV+PFytJTwxvLrNyVGccwbH2YOVq80fsVBP2e
 cZbn6EX4Vg==
 =/peE
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring NVMe command passthrough from Jens Axboe:
 "On top of everything else, this adds support for passthrough for
  io_uring.

  The initial feature for this is NVMe passthrough support, which allows
  non-filesystem based IO commands and admin commands.

  To support this, io_uring grows support for SQE and CQE members that
  are twice as big, allowing to pass in a full NVMe command without
  having to copy data around. And to complete with more than just a
  single 32-bit value as the output"

* tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block: (22 commits)
  io_uring: cleanup handling of the two task_work lists
  nvme: enable uring-passthrough for admin commands
  nvme: helper for uring-passthrough checks
  blk-mq: fix passthrough plugging
  nvme: add vectored-io support for uring-cmd
  nvme: wire-up uring-cmd support for io-passthru on char-device.
  nvme: refactor nvme_submit_user_cmd()
  block: wire-up support for passthrough plugging
  fs,io_uring: add infrastructure for uring-cmd
  io_uring: support CQE32 for nop operation
  io_uring: enable CQE32
  io_uring: support CQE32 in /proc info
  io_uring: add tracing for additional CQE32 fields
  io_uring: overflow processing for CQE32
  io_uring: flush completions for CQE32
  io_uring: modify io_get_cqe for CQE32
  io_uring: add CQE32 completion processing
  io_uring: add CQE32 setup processing
  io_uring: change ring size calculation for CQE32
  io_uring: store add. return values for CQE32
  ...
2022-05-23 13:06:15 -07:00
Linus Torvalds
e1a8fde720 for-5.19/io_uring-net-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKotMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmVwEACo7qBTjrrneZEwlYUWrSr45QtDNsQHPWjv
 aoK1dBLVH4ZjoZoOTI/aYcRgd5IJYo1P6I9tUrolM/+N3adM4UTEVC7i2PYDOaL3
 WUm/YT2aSLiyHaHQON7SMyGSVU8kfM9YvJAGbj7ohalO9A2VVtHfUAmcAtBdgWqv
 Dl/Uu6vbogOl19xztAwN4nvwqljA+GUMnbHJ/oeASzrMzYMOdQ0q3UsQbEt+pTXt
 rBzv8fCsrKsT2uBc59Bi3eFKeBMM6ERzux/40TlqcOnXf3KUCK7nM4VaRgPbvXdt
 GOOYfYs+j9L8SSEedvdKyYNq4vVwWgYfTRAKMNB0FPiOaTGZuUthqkgRZGYY8AA9
 +lJWxa+mzPmWEOmL+E44kt0OwtKDHX72ccEJUD7PHhTp0g87yKZfS6mXRNYLSxm7
 IYt7N1x3cOp0lrwUTvLDnSPOTuYOSEiB2JZtfkf+y3SuI5SWowIcudKOuO5p7G1r
 IpAROsZrpHzMf/eniINoX3IrqBSqr254jzwq+9IgUaw/ky76oPYqM1dWP9BnVxCg
 PXgvfT5zj6xrU43TxTeIPU92JoAqhMeXi6dcyoiAAf9+8Vih+sbmLzAdJbYb5F2v
 G0ISy31+x/Goi43fQS59HzS/MNXJplcmy2mxKUYBT7/ZoJ2A26Q8SukTWD+U8sDn
 XIrV4HEOUQ==
 =PUw1
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring 'more data in socket' support from Jens Axboe:
 "To be able to fully utilize the 'poll first' support in the core
  io_uring branch, it's advantageous knowing if the socket was empty
  after a receive. This adds support for that"

* tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block:
  io_uring: return hint on whether more data is available after receive
  tcp: pass back data left in socket after receive
2022-05-23 12:51:04 -07:00
Linus Torvalds
368da430d0 for-5.19/io_uring-socket-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKorgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm0eEACdTzhm7h5cXn9KjIvWLkdocAb/NOL8GYPn
 Q1mY1SqKQFZvs/fyKHkkZEiIBPxhvN6snVFXMpb4LDmPYeeH4GTUlNomrGTIjvf/
 j6SnZN4lCs9A2NlE+iDVWnFQOPQFALza2Y9BhC5xzay326qnKlO+0fQv3C1vXXrc
 /PNLqxQr7+GmO0a0PJnS6mGWGj6qF7nLqilB9apnKsTK6BKbJEec6ciKreqxU6ME
 WHaux11uIAbcf8rc6C/2myEK0k6jCOAue3vZ0lizygf+8klUCl2vMqV5BLwCBlXG
 /e7hBsUUrGr0CG0fryqhQQTUxsZLshioBbQH1vttSeZCli46mmWWAhPNy3/jb1ZU
 72bazA84Fe9ney9uVZvZoMoBsG+6t6UOatqND13MeRFAXnkRr0jZRuau2iBxgqAr
 OINJW+IVPU7IrCD+S4lV1/LCdhLhYcob8/zfKmIrdHMQnWG/gLonVpYJIBCyLDAv
 2jvHFIPJuSMUSGVjRKCb16LLNV6u7YG6VOWbKuippxfJxDdwA3TOtOhvTJIpYq0u
 TotPgpZ7bfcr4xDsGgD9mZS8E7jwsL/G0/MwsnixELykEXuhd++sgoTbr+RyUYdV
 45Hm6DsxlytjzOb/5uQrqhwrso05eVt14K74XApPa3fWKL8aWCh1jGSdo3CSbIyW
 iHwss919Ag==
 =nb5i
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring socket() support from Jens Axboe:
 "This adds support for socket(2) for io_uring. This is handy when using
  direct / registered file descriptors with io_uring.

  Outside of those two patches, a small series from Dylan on top that
  improves the tracing by providing a text representation of the opcode
  rather than needing to decode this by reading the header file every
  time.

  That sits in this branch as it was the last opcode added (until it
  wasn't...)"

* tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block:
  io_uring: use the text representation of ops in trace
  io_uring: rename op -> opcode
  io_uring: add io_uring_get_opcode
  io_uring: add type to op enum
  io_uring: add socket(2) support
  net: add __sys_socket_file()
2022-05-23 12:42:33 -07:00
Linus Torvalds
09beaff75e for-5.19/io_uring-xattr-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKopkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgakEACktFtUBQLrYOXbM/mVxMpR//ht4e29E8k0
 j/DkqK0yDKn9VkvDryALguH+ixNSI9Z4N7xSELLb/meNQsbJ7YdprL3xJn3BoUgs
 3zx44janE8J3Q5TsXvD2z2jPIMaT892t5+5aLFYZqP1g+KDXI8T1WpHsETMkKfRG
 ZPeerUrd0fhtnDpViaaYbRutIEt8V8tsPhh0XG/4GojWjUW0FTsRKBSGuQ0sQnUr
 aJDfF5VylOjOBzRGimGZ23vJIgtZ8UEpX0T2MxR5V6ffj4cI8bCFQOrphh7yHxF5
 f09xte80zX6pow5AivIpultZShR6IoQG5DIvF59woNP16uXy5yUyVTQvdnt8RlyY
 RjLd8ro9Gt4wBQGqckJLyY/o1FGhaQ8S99wOixUlpb9qKAOGmQZI97FQKFENqx/1
 Xe+bP6QmTt9uCXsYPIFBtZaaEv2u0yjHOyERFUSzKJQUuPTa5Rmen0EXYXRhe5/E
 p+sR3Qbk1wzlW7UHuCT2gcaI67SAFG+yDv1U6BAaVdcS71i0WCA+Q2a6AuB+NJzg
 ER4+JRoeOnjEXSP2UPvIUBL1Komdj4R2hnrOK4S80R3yQ3NaadrWywhBn5HNcniM
 wE2P6J0erzRFqyfBw9tyNLsZwR1iS7JqSD9/NuBLoWwb42O0l+WgqqwDTSxMsde4
 egKBaidRqg==
 =CfhD
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring xattr support from Jens Axboe:
 "Support for the xattr variants"

* tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block:
  io_uring: cleanup error-handling around io_req_complete
  io_uring: fix trace for reduced sqe padding
  io_uring: add fgetxattr and getxattr support
  io_uring: add fsetxattr and setxattr support
  fs: split off do_getxattr from getxattr
  fs: split off setxattr_copy and do_setxattr function from setxattr
2022-05-23 12:30:30 -07:00
Linus Torvalds
3a166bdbf3 for-5.19/io_uring-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKol0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn+sEACbdEQqG6OoCOhJ0ZuxTdQqNMGxCImKBxjP
 8Bqf+0hYNgwfG+80/UQvmc7olb+KxvZ6KtrgViC/ujhvMQmX0Xf/881kiiKG/iHJ
 XKoL9PdqIkenIGnlyEp1uRmnUbooYF+s4iT6Gj/pjnn29GbcKjsPzKV1CUNkt3GC
 R+wpdKczHQDaSwzDY5Ntyjf68QUQOyUznkHW+6JOcBeih3ET7NfapR/zsFS93RlL
 B9pQ9NiBBQfzCAUycVyQMC+p/rJbKWgidAiFk4fXKRm8/7iNwT4dB0+oUymlECxt
 xvalRVK6ER1s4RSdQcUTZoQA+SrzzOnK1DYja9cvcLT3wH+aojana6S0rOMDi8wp
 hoWT5jdMaZN09Vcm7J4sBN15i50m9aDITp21PKOVDZXSMVsebltCL9phaN5+9x/j
 AfF6Vki1WTB4gYaDHR8v6UkW+HcF1WOmMdq8GB9UMfnTya6EJqAooYT9lhQBP/rv
 jxkdj9Fu98O87dOfy1Av9AxH1UB8d7ypCJKkSEMAUPoWf0rC9HjYr0cRq/yppAj8
 pI/0PwXaXRfQuoHPqZyETrPel77VQdBw+Hg+6TS0KlTd3WlVEJMZJPtXK466IFLp
 pYSRVnSI9PuhiClOpxriTCw0cppfRIv11IerCxRziqH9S1zijk0VBCN40//XDs1o
 JfvoA6htKQ==
 =S+Uf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Here are the main io_uring changes for 5.19. This contains:

   - Fixes for sparse type warnings (Christoph, Vasily)

   - Support for multi-shot accept (Hao)

   - Support for io_uring managed fixed files, rather than always
     needing the applicationt o manage the indices (me)

   - Fix for a spurious poll wakeup (Dylan)

   - CQE overflow fixes (Dylan)

   - Support more types of cancelations (me)

   - Support for co-operative task_work signaling, rather than always
     forcing an IPI (me)

   - Support for doing poll first when appropriate, rather than always
     attempting a transfer first (me)

   - Provided buffer cleanups and support for mapped buffers (me)

   - Improve how io_uring handles inflight SCM files (Pavel)

   - Speedups for registered files (Pavel, me)

   - Organize the completion data in a struct in io_kiocb rather than
     keep it in separate spots (Pavel)

   - task_work improvements (Pavel)

   - Cleanup and optimize the submission path, in general and for
     handling links (Pavel)

   - Speedups for registered resource handling (Pavel)

   - Support sparse buffers and file maps (Pavel, me)

   - Various fixes and cleanups (Almog, Pavel, me)"

* tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block: (111 commits)
  io_uring: fix incorrect __kernel_rwf_t cast
  io_uring: disallow mixed provided buffer group registrations
  io_uring: initialize io_buffer_list head when shared ring is unregistered
  io_uring: add fully sparse buffer registration
  io_uring: use rcu_dereference in io_close
  io_uring: consistently use the EPOLL* defines
  io_uring: make apoll_events a __poll_t
  io_uring: drop a spurious inline on a forward declaration
  io_uring: don't use ERR_PTR for user pointers
  io_uring: use a rwf_t for io_rw.flags
  io_uring: add support for ring mapped supplied buffers
  io_uring: add io_pin_pages() helper
  io_uring: add buffer selection support to IORING_OP_NOP
  io_uring: fix locking state for empty buffer group
  io_uring: implement multishot mode for accept
  io_uring: let fast poll support multishot
  io_uring: add REQ_F_APOLL_MULTISHOT for requests
  io_uring: add IORING_ACCEPT_MULTISHOT for accept
  io_uring: only wake when the correct events are set
  io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
  ...
2022-05-23 12:22:49 -07:00
Julian Schroeder
fd5e363eac nfsd: destroy percpu stats counters after reply cache shutdown
Upon nfsd shutdown any pending DRC cache is freed. DRC cache use is
tracked via a percpu counter. In the current code the percpu counter
is destroyed before. If any pending cache is still present,
percpu_counter_add is called with a percpu counter==NULL. This causes
a kernel crash.
The solution is to destroy the percpu counter after the cache is freed.

Fixes: e567b98ce9 (“nfsd: protect concurrent access to nfsd stats counters”)
Signed-off-by: Julian Schroeder <jumaco@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 15:22:17 -04:00
Julia Lawall
759820c92a f2fs: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-23 10:29:41 -07:00
Julia Lawall
2999e1e387 writeback: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Link: https://lore.kernel.org/r/20220521111145.81697-32-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

Signed-off-by: Jan Kara <jack@suse.cz>
2022-05-23 17:39:20 +02:00
Zhang Xiaoxu
6f6f84aa21 nfsd: Fix null-ptr-deref in nfsd_fill_super()
KASAN report null-ptr-deref as follows:

  BUG: KASAN: null-ptr-deref in nfsd_fill_super+0xc6/0xe0 [nfsd]
  Write of size 8 at addr 000000000000005d by task a.out/852

  CPU: 7 PID: 852 Comm: a.out Not tainted 5.18.0-rc7-dirty #66
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   kasan_report+0xab/0x120
   ? nfsd_mkdir+0x71/0x1c0 [nfsd]
   ? nfsd_fill_super+0xc6/0xe0 [nfsd]
   nfsd_fill_super+0xc6/0xe0 [nfsd]
   ? nfsd_mkdir+0x1c0/0x1c0 [nfsd]
   get_tree_keyed+0x8e/0x100
   vfs_get_tree+0x41/0xf0
   __do_sys_fsconfig+0x590/0x670
   ? fscontext_read+0x180/0x180
   ? anon_inode_getfd+0x4f/0x70
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

This can be reproduce by concurrent operations:
	1. fsopen(nfsd)/fsconfig
	2. insmod/rmmod nfsd

Since the nfsd file system is registered before than nfsd_net allocated,
the caller may get the file_system_type and use the nfsd_net before it
allocated, then null-ptr-deref occurred.

So init_nfsd() should call register_filesystem() last.

Fixes: bd5ae9288d ("nfsd: register pernet ops last, unregister first")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Zhang Xiaoxu
62fdb65edb nfsd: Unregister the cld notifier when laundry_wq create failed
If laundry_wq create failed, the cld notifier should be unregistered.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
28df098881 SUNRPC: Use RMW bitops in single-threaded hot paths
I noticed CPU pipeline stalls while using perf.

Once an svc thread is scheduled and executing an RPC, no other
processes will touch svc_rqst::rq_flags. Thus bus-locked atomics are
not needed outside the svc thread scheduler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
bb283ca18d NFSD: Clean up the show_nf_flags() macro
The flags are defined using C macros, so TRACE_DEFINE_ENUM is
unnecessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
0122e88211 NFSD: Trace filecache opens
Instrument calls to nfsd_open_verified() to get a sense of the
filecache hit rate.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
7e2ce0cc15 NFSD: Move documenting comment for nfsd4_process_open2()
Clean up nfsd4_open() by converting a large comment at the only
call site for nfsd4_process_open2() to a kerneldoc comment in
front of that function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
26320d7e31 NFSD: Fix whitespace
Clean up: Pull case arms back one tab stop to conform every other
switch statement in fs/nfsd/nfs4proc.c.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
f67a16b147 NFSD: Remove dprintk call sites from tail of nfsd4_open()
Clean up: These relics are not likely to benefit server
administrators.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23 11:06:29 -04:00
Chuck Lever
fb70bf124b NFSD: Instantiate a struct file when creating a regular NFSv4 file
There have been reports of races that cause NFSv4 OPEN(CREATE) to
return an error even though the requested file was created. NFSv4
does not provide a status code for this case.

To mitigate some of these problems, reorganize the NFSv4
OPEN(CREATE) logic to allocate resources before the file is actually
created, and open the new file while the parent directory is still
locked.

Two new APIs are added:

+ Add an API that works like nfsd_file_acquire() but does not open
the underlying file. The OPEN(CREATE) path can use this API when it
already has an open file.

+ Add an API that is kin to dentry_open(). NFSD needs to create a
file and grab an open "struct file *" atomically. The
alloc_empty_file() has to be done before the inode create. If it
fails (for example, because the NFS server has exceeded its
max_files limit), we avoid creating the file and can still return
an error to the NFS client.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=382
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: JianHong Yin <jiyin@redhat.com>
2022-05-23 11:06:29 -04:00
Vasily Averin
dccd855771 fanotify: fix incorrect fmode_t casts
Fixes sparce warnings:
fs/notify/fanotify/fanotify_user.c:267:63: sparse:
 warning: restricted fmode_t degrades to integer
fs/notify/fanotify/fanotify_user.c:1351:28: sparse:
 warning: restricted fmode_t degrades to integer

FMODE_NONTIFY have bitwise fmode_t type and requires __force attribute
for any casts.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/9adfd6ac-1b89-791e-796b-49ada3293985@openvz.org
2022-05-23 11:42:53 +02:00
Tadeusz Struk
64ba4b15e5 exfat: check if cluster num is valid
Syzbot reported slab-out-of-bounds read in exfat_clear_bitmap.
This was triggered by reproducer calling truncute with size 0,
which causes the following trace:

BUG: KASAN: slab-out-of-bounds in exfat_clear_bitmap+0x147/0x490 fs/exfat/balloc.c:174
Read of size 8 at addr ffff888115aa9508 by task syz-executor251/365

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack_lvl+0x1e2/0x24b lib/dump_stack.c:118
 print_address_description+0x81/0x3c0 mm/kasan/report.c:233
 __kasan_report mm/kasan/report.c:419 [inline]
 kasan_report+0x1a4/0x1f0 mm/kasan/report.c:436
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:309
 exfat_clear_bitmap+0x147/0x490 fs/exfat/balloc.c:174
 exfat_free_cluster+0x25a/0x4a0 fs/exfat/fatent.c:181
 __exfat_truncate+0x99e/0xe00 fs/exfat/file.c:217
 exfat_truncate+0x11b/0x4f0 fs/exfat/file.c:243
 exfat_setattr+0xa03/0xd40 fs/exfat/file.c:339
 notify_change+0xb76/0xe10 fs/attr.c:336
 do_truncate+0x1ea/0x2d0 fs/open.c:65

Move the is_valid_cluster() helper from fatent.c to a common
header to make it reusable in other *.c files. And add is_valid_cluster()
to validate if cluster number is within valid range in exfat_clear_bitmap()
and exfat_set_bitmap().

Link: https://syzkaller.appspot.com/bug?id=50381fc73821ecae743b8cf24b4c9a04776f767c
Reported-by: syzbot+a4087e40b9c13aad7892@syzkaller.appspotmail.com
Fixes: 1e49a94cf7 ("exfat: add bitmap operations")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23 11:17:30 +09:00
Yuezhang Mo
1b61383854 exfat: reduce block requests when zeroing a cluster
If 'dirsync' is enabled, when zeroing a cluster, submitting
sector by sector will generate many block requests, will
cause the block device to not fully perform its performance.

This commit makes the sectors in a cluster to be submitted in
once, it will reduce the number of block requests. This will
make the block device to give full play to its performance.

Test create 1000 directories on SD card with:

$ time (for ((i=0;i<1000;i++)); do mkdir dir${i}; done)

Performance has been improved by more than 73% on imx6q-sabrelite.

Cluster size       Before         After       Improvement
64  KBytes         3m34.036s      0m56.052s   73.8%
128 KBytes         6m2.644s       1m13.354s   79.8%
256 KBytes         11m22.202s     1m39.451s   85.4%

imx6q-sabrelite:
  - CPU: 792 MHz x4
  - Memory: 1GB DDR3
  - SD Card: SanDisk 8GB Class 4

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23 11:17:30 +09:00
Chung-Chiang Cheng
9b002894b4 exfat: introduce mount option 'sys_tz'
EXFAT_TZ_VALID bit in {create,modify,access}_tz is corresponding to
OffsetValid field in exfat specification [1]. When this bit isn't
set, timestamps should be treated as having the same UTC offset as
the current local time.

Currently, there is an option 'time_offset' for users to specify the
UTC offset for this issue. This patch introduces a new mount option
'sys_tz' to use system timezone as time offset.

Link: [1] https://docs.microsoft.com/en-us/windows/win32/fileio/exfat-specification#74102-offsetvalid-field

Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23 11:17:29 +09:00
Yuezhang Mo
d8dad2588a exfat: fix referencing wrong parent directory information after renaming
During renaming, the parent directory information maybe
updated. But the file/directory still references to the
old parent directory information.

This bug will cause 2 problems.

(1) The renamed file can not be written.

    [10768.175172] exFAT-fs (sda1): error, failed to bmap (inode : 7afd50e4 iblock : 0, err : -5)
    [10768.184285] exFAT-fs (sda1): Filesystem has been set read-only
    ash: write error: Input/output error

(2) Some dentries of the renamed file/directory are not set
    to deleted after removing the file/directory.

exfat_update_parent_info() is a workaround for the wrong parent
directory information being used after renaming. Now that bug is
fixed, this is no longer needed, so remove it.

Fixes: 5f2aa07507 ("exfat: add inode operations")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23 11:17:29 +09:00
Dave Chinner
ab6a8d3f1a Merge branch 'guilt/xfs-5.19-misc-3' into xfs-5.19-for-next 2022-05-23 08:55:09 +10:00
Darrick J. Wong
4183e4f27f xfs: share xattr name and value buffers when logging xattr updates
While running xfs/297 and generic/642, I noticed a crash in
xfs_attri_item_relog when it tries to copy the attr name to the new
xattri log item.  I think what happened here was that we called
->iop_commit on the old attri item (which nulls out the pointers) as
part of a log force at the same time that a chained attr operation was
ongoing.  The system was busy enough that at some later point, the defer
ops operation decided it was necessary to relog the attri log item, but
as we've detached the name buffer from the old attri log item, we can't
copy it to the new one, and kaboom.

I think there's a broader refcounting problem with LARP mode -- the
setxattr code can return to userspace before the CIL actually formats
and commits the log item, which results in a UAF bug.  Therefore, the
xattr log item needs to be able to retain a reference to the name and
value buffers until the log items have completely cleared the log.
Furthermore, each time we create an intent log item, we allocate new
memory and (re)copy the contents; sharing here would be very useful.

Solve the UAF and the unnecessary memory allocations by having the log
code create a single refcounted buffer to contain the name and value
contents.  This buffer can be passed from old to new during a relog
operation, and the logging code can (optionally) attach it to the
xfs_attr_item for reuse when LARP mode is enabled.

This also fixes a problem where the xfs_attri_log_item objects weren't
being freed back to the same cache where they came from.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-23 08:43:46 +10:00
Darrick J. Wong
22a68ba724 xfs: do not use logged xattr updates on V4 filesystems
V4 superblocks do not contain the log_incompat feature bit, which means
that we cannot protect xattr log items against kernels that are too old
to know how to recover them.  Turn off the log items for such
filesystems and adjust the "delayed" name to reflect what it's really
controlling.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-23 08:41:03 +10:00
David Howells
adc9613ff6 afs: Adjust ACK interpretation to try and cope with NAT
If a client's address changes, say if it is NAT'd, this can disrupt an in
progress operation.  For most operations, this is not much of a problem,
but StoreData can be different as some servers modify the target file as
the data comes in, so if a store request is disrupted, the file can get
corrupted on the server.

The problem is that the server doesn't recognise packets that come after
the change of address as belonging to the original client and will bounce
them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if
the packet number falls within the initial sequence number window of a call
or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting
it.  In both cases, firstPacket will be 1 and previousPacket will be 0 in
the ACK information.

Fix this by the following means:

 (1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1
     and previousPacket as 0, assume this indicates that the server saw the
     incoming packets from a different peer and thus as a different call.
     Fail the call with error -ENETRESET.

 (2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the
     first packet has been hard-ACK'd.  If it hasn't been hard-ACK'd, the
     ACK packet will cause it to get retransmitted, so the call will just
     be repeated.

 (3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of
     the operation.

 (4) Prioritise the error code over things like -ECONNRESET as the server
     did actually respond.

 (5) Make writeback treat -ENETRESET as a retryable error and make it
     redirty all the pages involved in a write so that the VM will retry.

Note that there is still a circumstance that I can't easily deal with: if
the operation is fully received and processed by the server, but the reply
is lost due to address change.  There's no way to know if the op happened.
We can examine the server, but a conflicting change could have been made by
a third party - and we can't tell the difference.  In such a case, a
message like:

    kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a)

will be logged to dmesg on the next op to touch the file and the client
will reset the inode state, including invalidating clean parts of the
pagecache.

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:02 +01:00
David Howells
de696c4784 rxrpc, afs: Fix selection of abort codes
The RX_USER_ABORT code should really only be used to indicate that the user
of the rxrpc service (ie. userspace) implicitly caused a call to be aborted
- for instance if the AF_RXRPC socket is closed whilst the call was in
progress.  (The user may also explicitly abort a call and specify the abort
code to use).

Change some of the points of generation to use other abort codes instead:

 (1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see
     ENOMEM and EFAULT during received data delivery and abort with
     RX_CALL_DEAD in the default case.

 (2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a
     reply.

 (3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had
     heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't.

 (4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not
     completed successfully or been aborted.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:02 +01:00
David Howells
ad25f5cb39 rxrpc: Fix locking issue
There's a locking issue with the per-netns list of calls in rxrpc.  The
pieces of code that add and remove a call from the list use write_lock()
and the calls procfile uses read_lock() to access it.  However, the timer
callback function may trigger a removal by trying to queue a call for
processing and finding that it's already queued - at which point it has a
spare refcount that it has to do something with.  Unfortunately, if it puts
the call and this reduces the refcount to 0, the call will be removed from
the list.  Unfortunately, since the _bh variants of the locking functions
aren't used, this can deadlock.

================================
WARNING: inconsistent lock state
5.18.0-rc3-build4+ #10 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes:
ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b
{SOFTIRQ-ON-W} state was registered at:
...
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&rxnet->call_lock);
  <Interrupt>
    lock(&rxnet->call_lock);

 *** DEADLOCK ***

1 lock held by ksoftirqd/2/25:
 #0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d

Changes
=======
ver #2)
 - Changed to using list_next_rcu() rather than rcu_dereference() directly.

Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:01 +01:00
David Howells
2aeb8c86d4 afs: Fix afs_getattr() to refetch file status if callback break occurred
If a callback break occurs (change notification), afs_getattr() needs to
issue an FS.FetchStatus RPC operation to update the status of the file
being examined by the stat-family of system calls.

Fix afs_getattr() to do this if AFS_VNODE_CB_PROMISED has been cleared
on a vnode by a callback break.  Skip this if AT_STATX_DONT_SYNC is set.

This can be tested by appending to a file on one AFS client and then
using "stat -L" to examine its length on a machine running kafs.  This
can also be watched through tracing on the kafs machine.  The callback
break is seen:

     kworker/1:1-46      [001] .....   978.910812: afs_cb_call: c=0000005f YFSCB.CallBack
     kworker/1:1-46      [001] ...1.   978.910829: afs_cb_break: 100058:23b4c:242d2c2 b=2 s=1 break-cb
     kworker/1:1-46      [001] .....   978.911062: afs_call_done:    c=0000005f ret=0 ab=0 [0000000082994ead]

And then the stat command generated no traffic if unpatched, but with
this change a call to fetch the status can be observed:

            stat-4471    [000] .....   986.744122: afs_make_fs_call: c=000000ab 100058:023b4c:242d2c2 YFS.FetchStatus
            stat-4471    [000] .....   986.745578: afs_call_done:    c=000000ab ret=0 ab=0 [0000000087fc8c84]

Fixes: 08e0e7c82e ("[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC.")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
Tested-by: kafs-testing+fedora34_64checkkafs-build-496@auristor.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216010
Link: https://lore.kernel.org/r/165308359800.162686.14122417881564420962.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-05-22 09:25:47 -10:00
Jiapeng Chong
73c348d4ab xfs: Remove duplicate include
Clean up the following includecheck warning:

./fs/xfs/xfs_attr_item.c: xfs_inode.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 16:47:17 +10:00
Kaixu Xia
93e6aa4329 xfs: reduce IOCB_NOWAIT judgment for retry exclusive unaligned DIO
Retry unaligned DIO with exclusive blocking semantics only when the
IOCB_NOWAIT flag is not set. If we are doing nonblocking user I/O,
propagate the error directly.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 16:47:11 +10:00
Jiapeng Chong
e62c720817 xfs: Remove dead code
Remove tht entire xlog_recover_check_summary() function, this entire
function is dead code and has been for 12 years.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 16:46:57 +10:00
Julia Lawall
41bc61c02a xfs: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 16:46:38 +10:00
Darrick J. Wong
e3c5de2202 xfs: rename struct xfs_attr_item to xfs_attr_intent
Everywhere else in XFS, structures that capture the state of an ongoing
deferred work item all have names that end with "_intent".  The new
extended attribute deferred work items are not named as such, so fix it
to follow the naming convention used elsewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 16:00:26 +10:00
Darrick J. Wong
3768f69857 xfs: clean up state variable usage in xfs_attr_node_remove_attr
The state variable is now a local variable pointing to a heap
allocation, so we don't need to zero-initialize it, nor do we need the
conditional to decide if we should free it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 15:59:48 +10:00
Darrick J. Wong
4136e38af7 xfs: put attr[id] log item cache init with the others
Initialize and destroy the xattr log item caches in the same places that
we do all the other log item caches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 15:59:48 +10:00
Darrick J. Wong
500a512c60 xfs: remove struct xfs_attr_item.xattri_flags
Nobody uses this field, so get rid of it and the unused flag definition.
Rearrange the structure layout to reduce its size from 104 to 96 bytes.
This gets us from 39 to 42 objects per page.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 15:59:48 +10:00
Darrick J. Wong
e2c78949b6 xfs: use a separate slab cache for deferred xattr work state
Create a separate slab cache for struct xfs_attr_item objects, since we
can pack the (104-byte) intent items more tightly than we can with the
general slab cache objects.  On x86, this means 39 intents per memory
page instead of 32.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 15:59:48 +10:00
Darrick J. Wong
b53d212b4b xfs: put the xattr intent item op flags in their own namespace
The flags that are stored in the extended attr intent log item really
should have a separate namespace from the rest of the XFS_ATTR_* flags.
Give them one to make it a little more obvious that they're intent item
flags.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 15:59:48 +10:00
Darrick J. Wong
4d0cdd2bb8 xfs: clean up xfs_attr_node_hasname
The calling conventions of this function are a mess -- callers /can/
provide a pointer to a pointer to a state structure, but it's not
required, and as evidenced by the last two patches, the callers that do
weren't be careful enough about how to deal with an existing da state.

Push the allocation and freeing responsibilty to the callers, which
means that callers from the xattr node state machine steps now have the
visibility to allocate or free the da state structure as they please.
As a bonus, the node remove/add paths for larp-mode replaces can reset
the da state structure instead of freeing and immediately reallocating
it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-22 15:59:34 +10:00
Steve French
35a2b533a2 smb3: add trace point for oplock not found
In order to debug problems with server potentially
sending us an oplock that we don't recognize (or a race
with close and oplock break) it would be helpful to have
a dynamic trace point for this case.  New tracepoint
is called trace_smb3_oplock_not_found

Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-22 00:46:08 -05:00
ChenXiaoSong
2b058acecf cifs: return the more nuanced writeback error on close()
As filemap_check_errors() only report -EIO or -ENOSPC, we return more nuanced
writeback error -(file->f_mapping->wb_err & MAX_ERRNO).

  filemap_write_and_wait
    filemap_write_and_wait_range
      filemap_check_errors
        -ENOSPC or -EIO
  filemap_check_wb_err
    errseq_check
      return -(file->f_mapping->wb_err & MAX_ERRNO)

Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-22 00:01:06 -05:00
Steve French
fb253d5ba3 smb3: add trace point for lease not found issue
When trying to debug problems with server sending us a
lease we don't recognize, it would be helpful to have
a dynamic trace point for this case.  New tracepoint
is called trace_smb3_lease_not_found

Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 23:56:16 -05:00
Julia Lawall
fb64f7f105 cifs: smbd: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 23:35:43 -05:00
Ye Bin
ef09ed5d37 ext4: fix bug_on in ext4_writepages
we got issue as follows:
EXT4-fs error (device loop0): ext4_mb_generate_buddy:1141: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free cls
------------[ cut here ]------------
kernel BUG at fs/ext4/inode.c:2708!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 2 PID: 2147 Comm: rep Not tainted 5.18.0-rc2-next-20220413+ #155
RIP: 0010:ext4_writepages+0x1977/0x1c10
RSP: 0018:ffff88811d3e7880 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88811c098000
RDX: 0000000000000000 RSI: ffff88811c098000 RDI: 0000000000000002
RBP: ffff888128140f50 R08: ffffffffb1ff6387 R09: 0000000000000000
R10: 0000000000000007 R11: ffffed10250281ea R12: 0000000000000001
R13: 00000000000000a4 R14: ffff88811d3e7bb8 R15: ffff888128141028
FS:  00007f443aed9740(0000) GS:ffff8883aef00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020007200 CR3: 000000011c2a4000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 do_writepages+0x130/0x3a0
 filemap_fdatawrite_wbc+0x83/0xa0
 filemap_flush+0xab/0xe0
 ext4_alloc_da_blocks+0x51/0x120
 __ext4_ioctl+0x1534/0x3210
 __x64_sys_ioctl+0x12c/0x170
 do_syscall_64+0x3b/0x90

It may happen as follows:
1. write inline_data inode
vfs_write
  new_sync_write
    ext4_file_write_iter
      ext4_buffered_write_iter
        generic_perform_write
          ext4_da_write_begin
            ext4_da_write_inline_data_begin -> If inline data size too
            small will allocate block to write, then mapping will has
            dirty page
                ext4_da_convert_inline_data_to_extent ->clear EXT4_STATE_MAY_INLINE_DATA
2. fallocate
do_vfs_ioctl
  ioctl_preallocate
    vfs_fallocate
      ext4_fallocate
        ext4_convert_inline_data
          ext4_convert_inline_data_nolock
            ext4_map_blocks -> fail will goto restore data
            ext4_restore_inline_data
              ext4_create_inline_data
              ext4_write_inline_data
              ext4_set_inode_state -> set inode EXT4_STATE_MAY_INLINE_DATA
3. writepages
__ext4_ioctl
  ext4_alloc_da_blocks
    filemap_flush
      filemap_fdatawrite_wbc
        do_writepages
          ext4_writepages
            if (ext4_has_inline_data(inode))
              BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))

The root cause of this issue is we destory inline data until call
ext4_writepages under delay allocation mode.  But there maybe already
convert from inline to extent.  To solve this issue, we call
filemap_flush first..

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220516122634.1690462-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-21 22:24:24 -04:00
Ritesh Harjani
72f63f4a77 ext4: refactor and move ext4_ioctl_get_encryption_pwsalt()
This patch move code for FS_IOC_GET_ENCRYPTION_PWSALT case into
ext4's crypto.c file, i.e. ext4_ioctl_get_encryption_pwsalt()
and uuid_is_zero(). This is mostly refactoring logic and should
not affect any functionality change.

Suggested-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/5af98b17152a96b245b4f7d2dfb8607fc93e36aa.1652595565.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-21 22:24:24 -04:00
Ritesh Harjani
3030b59c85 ext4: cleanup function defs from ext4.h into crypto.c
Some of these functions when CONFIG_FS_ENCRYPTION is enabled are not
really inline (let compiler be the best judge of it).
Remove inline and move them into crypto.c where they should be present.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/b7b9de2c7226298663fb5a0c28909135e2ab220f.1652595565.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-21 22:24:24 -04:00
Ritesh Harjani
b1241c8eb9 ext4: move ext4 crypto code to its own file crypto.c
This is to cleanup super.c file which has grown quite large.
So, start moving ext4 crypto related code to where it should
be in the first place i.e. fs/ext4/crypto.c

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/7d637e093cbc34d727397e8d41a53a1b9ca7d7a4.1652595565.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-21 22:24:24 -04:00
Hyunchul Lee
376b913382 ksmbd: fix outstanding credits related bugs
outstanding credits must be initialized to 0,
because it means the sum of credits consumed by
in-flight requests.
And outstanding credits must be compared with
total credits in smb2_validate_credit_charge(),
because total credits are the sum of credits
granted by ksmbd.

This patch fix the following error,
while frametest with Windows clients:

Limits exceeding the maximum allowable outstanding requests,
given : 128, pending : 8065

Fixes: b589f5db6d ("ksmbd: limits exceeding the maximum allowable outstanding requests")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Reported-by: Yufan Chen <wiz.chen@gmail.com>
Tested-by: Yufan Chen <wiz.chen@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Hyunchul Lee
5366afc406 ksmbd: smbd: fix connection dropped issue
When there are bursty connection requests,
RDMA connection event handler is deferred and
Negotiation requests are received even if
connection status is NEW.

To handle it, set the status to CONNECTED
if Negotiation requests are received.

Reported-by: Yufan Chen <wiz.chen@gmail.com>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Tested-by: Yufan Chen <wiz.chen@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Yang Li
7820c6ee02 ksmbd: Fix some kernel-doc comments
Remove some warnings found by running scripts/kernel-doc,
which is caused by using 'make W=1'.

fs/ksmbd/misc.c:30: warning: Function parameter or member 'str' not
described in 'match_pattern'
fs/ksmbd/misc.c:30: warning: Excess function parameter 'string'
description in 'match_pattern'
fs/ksmbd/misc.c:163: warning: Function parameter or member 'share' not
described in 'convert_to_nt_pathname'
fs/ksmbd/misc.c:163: warning: Function parameter or member 'path' not
described in 'convert_to_nt_pathname'
fs/ksmbd/misc.c:163: warning: Excess function parameter 'filename'
description in 'convert_to_nt_pathname'
fs/ksmbd/misc.c:163: warning: Excess function parameter 'sharepath'
description in 'convert_to_nt_pathname'
fs/ksmbd/misc.c:259: warning: Function parameter or member 'share' not
described in 'convert_to_unix_name'
fs/ksmbd/misc.c:259: warning: Function parameter or member 'name' not
described in 'convert_to_unix_name'
fs/ksmbd/misc.c:259: warning: Excess function parameter 'path'
description in 'convert_to_unix_name'
fs/ksmbd/misc.c:259: warning: Excess function parameter 'tid'
description in 'convert_to_unix_name'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Namjae Jeon
7a84399e1c ksmbd: fix wrong smbd max read/write size check
smb-direct max read/write size can be different with smb2 max read/write
size. So smb2_read() can return error by wrong max read/write size check.
This patch use smb_direct_max_read_write_size for this check in
smb-direct read/write().

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Namjae Jeon
65bb45b97b ksmbd: add smbd max io size parameter
Add 'smbd max io size' parameter to adjust smbd-direct max read/write
size.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Namjae Jeon
65ca7a3fff ksmbd: handle smb2 query dir request for OutputBufferLength that is too small
We found the issue that ksmbd return STATUS_NO_MORE_FILES response
even though there are still dentries that needs to be read while
file read/write test using framtest utils.
windows client send smb2 query dir request included
OutputBufferLength(128) that is too small to contain even one entry.
This patch make ksmbd immediately returns OutputBufferLength of response
as zero to client.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Hyunchul Lee
ee1b055896 ksmbd: smbd: handle multiple Buffer descriptors
Make ksmbd handle multiple buffer descriptors
when reading and writing files using SMB direct:
Post the work requests of rdma_rw_ctx for
RDMA read/write in smb_direct_rdma_xmit(), and
the work request for the READ/WRITE response
with a remote invalidation in smb_direct_writev().

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:43 -05:00
Hyunchul Lee
4e3edd0092 ksmbd: smbd: change the return value of get_sg_list
Make get_sg_list return EINVAL if there aren't
mapped scatterlists.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:38 -05:00
Hyunchul Lee
11659a8ddb ksmbd: smbd: simplify tracking pending packets
Because we don't have to tracking pending packets
by dividing these into packets with payload and
packets without payload, merge the tracking code.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:33 -05:00
Hyunchul Lee
ddbdc861e3 ksmbd: smbd: introduce read/write credits for RDMA read/write
SMB2_READ/SMB2_WRITE request has to be granted the number
of rw credits, the pages the request wants to transfer
/ the maximum pages which can be registered with one
MR to read and write a file.
And allocate enough RDMA resources for the maximum
number of rw credits allowed by ksmbd.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:28 -05:00
Hyunchul Lee
1807abcf87 ksmbd: smbd: change prototypes of RDMA read/write related functions
Change the prototypes of RDMA read/write
operations to accept a pointer and length
of buffer descriptors.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 15:01:19 -05:00
Ronnie Sahlberg
c9fc5ca454 cifs: set the CREATE_NOT_FILE when opening the directory in use_cached_dir()
This enforces that we can only do this for directories and not normal files
or else the server will return an error.
This means that we will have conditionally check IF the path refers
to a directory or not in all the call-sites where we are unsure.
Right now this check is for "" i.e. root.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 12:23:24 -05:00
Ronnie Sahlberg
198bf836df cifs: check for smb1 in open_cached_dir()
Check protocol version in open_cached_dir() and return not supported
for SMB1.  This allows us to call open_cached_dir() from code that
is common to both smb1 and smb2/3 in future patches without having to
do this check in the call-site.
At the same time, add a check if tcon is valid or not for the same reason.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 12:23:08 -05:00
Ronnie Sahlberg
f695b28935 cifs: move definition of cifs_fattr earlier in cifsglob.h
This only moves these definitions to come earlier in the file
but not change the definition itself.
This is done to reduce the amount of changes in future patches.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-21 12:22:57 -05:00
Jens Axboe
3fe07bcd80 io_uring: cleanup handling of the two task_work lists
Rather than pass in a bool for whether or not this work item needs to go
into the priority list or not, provide separate helpers for it. For most
use cases, this also then gets rid of the branch for non-priority task
work.

While at it, rename the prior_task_list to prio_task_list. Prior is
a confusing name for it, as it would seem to indicate that this is the
previous task_work list. prio makes it clear that this is a priority
task_work list.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-21 09:17:05 -06:00
Enzo Matsumiya
71081e7ac1 cifs: print TIDs as hex
Makes these debug messages easier to read

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-20 17:46:22 -05:00
Enzo Matsumiya
337b8b0e43 cifs: return ENOENT for DFS lookup_cache_entry()
EEXIST didn't make sense to use when dfs_cache_find() couldn't find a
cache entry nor retrieve a referral target.

It also doesn't make sense cifs_dfs_query_info_nonascii_quirk() to
emulate ENOENT anymore.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-20 17:44:34 -05:00
Enzo Matsumiya
421ef3d565 cifs: don't call cifs_dfs_query_info_nonascii_quirk() if nodfs was set
Also return EOPNOTSUPP if path is remote but nodfs was set.

Fixes: a2809d0e16 ("cifs: quirk for STATUS_OBJECT_NAME_INVALID returned for non-ASCII dfs refs")
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-20 17:38:11 -05:00
Chuck Lever
f4d84c5264 NFSD: Clean up nfsd_open_verified()
Its only caller always passes S_IFREG as the @type parameter. As an
additional clean-up, add a kerneldoc comment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:25 -04:00
Chuck Lever
1c388f2775 NFSD: Remove do_nfsd_create()
Now that its two callers have their own version-specific instance of
this function, do_nfsd_create() is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:25 -04:00
Chuck Lever
254454a5aa NFSD: Refactor NFSv4 OPEN(CREATE)
Copy do_nfsd_create() to nfs4proc.c and remove NFSv3-specific logic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:25 -04:00
Chuck Lever
df9606abdd NFSD: Refactor NFSv3 CREATE
The NFSv3 CREATE and NFSv4 OPEN(CREATE) use cases are about to
diverge such that it makes sense to split do_nfsd_create() into one
version for NFSv3 and one for NFSv4.

As a first step, copy do_nfsd_create() to nfs3proc.c and remove
NFSv4-specific logic.

One immediate legibility benefit is that the logic for handling
NFSv3 createhow is now quite straightforward. NFSv4 createhow
has some subtleties that IMO do not belong in generic code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:24 -04:00
Chuck Lever
5f46e950c3 NFSD: Refactor nfsd_create_setattr()
I'd like to move do_nfsd_create() out of vfs.c. Therefore
nfsd_create_setattr() needs to be made publicly visible.

Note that both call sites in vfs.c commit both the new object and
its parent directory, so just combine those common metadata commits
into nfsd_create_setattr().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:24 -04:00
Chuck Lever
14ee45b70d NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create()
Clean up: The "out" label already invokes fh_drop_write().

Note that fh_drop_write() is already careful not to invoke
mnt_drop_write() if either it has already been done or there is
nothing to drop. Therefore no change in behavior is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:24 -04:00
Chuck Lever
e61568599c NFSD: Clean up nfsd3_proc_create()
As near as I can tell, mode bit masking and setting S_IFREG is
already done by do_nfsd_create() and vfs_create(). The NFSv4 path
(do_open_lookup), for example, does not bother with this special
processing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-20 13:18:24 -04:00
Darrick J. Wong
2fe3ffcf55 xfs: free xfs_attrd_log_items correctly
Technically speaking, objects allocated out of a specific slab cache are
supposed to be freed to that slab cache.  The popular slab backends will
take care of this for us, but SLOB famously doesn't.  Fix this, even if
slob + xfs are not that common of a combination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-20 14:42:49 +10:00
Darrick J. Wong
25b1e9dc32 xfs: validate xattr name earlier in recovery
When we're validating a recovered xattr log item during log recovery, we
should check the name before starting to allocate resources.  This isn't
strictly necessary on its own, but it means that we won't bother with
huge memory allocations during recovery if the attr name is garbage,
which will simplify the changes in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-20 14:42:36 +10:00
Darrick J. Wong
85d76aec6b xfs: reject unknown xattri log item filter flags during recovery
Make sure we screen the "attr flags" field of recovered xattr intent log
items to reject flag bits that we don't know about.  This is really the
attr *filter* field from xfs_da_args, so rename the field and create
a mask to make checking for invalid bits easier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-20 14:42:15 +10:00
Darrick J. Wong
356cb708ea xfs: reject unknown xattri log item operation flags during recovery
Make sure we screen the op flags field of recovered xattr intent log
items to reject flag bits that we don't know about.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-20 14:41:47 +10:00
Darrick J. Wong
a618acab13 xfs: don't leak the retained da state when doing a leaf to node conversion
If a setxattr operation finds an xattr structure in leaf format, adding
the attr can fail due to lack of space and hence requires an upgrade to
node format.  After this happens, we'll roll the transaction and
re-enter the state machine, at which time we need to perform a second
lookup of the attribute name to find its new location.  This lookup
attaches a new da state structure to the xfs_attr_item but doesn't free
the old one (from the leaf lookup) and leaks it.  Fix that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-20 14:41:42 +10:00
Darrick J. Wong
309001c22c xfs: don't leak da state when freeing the attr intent item
kmemleak reported that we lost an xfs_da_state while removing xattrs in
generic/020:

unreferenced object 0xffff88801c0e4b40 (size 480):
  comm "attr", pid 30515, jiffies 4294931061 (age 5.960s)
  hex dump (first 32 bytes):
    78 bc 65 07 00 c9 ff ff 00 30 60 1c 80 88 ff ff  x.e......0`.....
    02 00 00 00 00 00 00 00 80 18 83 4e 80 88 ff ff  ...........N....
  backtrace:
    [<ffffffffa023ef4a>] xfs_da_state_alloc+0x1a/0x30 [xfs]
    [<ffffffffa021b6f3>] xfs_attr_node_hasname+0x23/0x90 [xfs]
    [<ffffffffa021c6f1>] xfs_attr_set_iter+0x441/0xa30 [xfs]
    [<ffffffffa02b5104>] xfs_xattri_finish_update+0x44/0x80 [xfs]
    [<ffffffffa02b515e>] xfs_attr_finish_item+0x1e/0x40 [xfs]
    [<ffffffffa0244744>] xfs_defer_finish_noroll+0x184/0x740 [xfs]
    [<ffffffffa02a6473>] __xfs_trans_commit+0x153/0x3e0 [xfs]
    [<ffffffffa021d149>] xfs_attr_set+0x469/0x7e0 [xfs]
    [<ffffffffa02a78d9>] xfs_xattr_set+0x89/0xd0 [xfs]
    [<ffffffff812e6512>] __vfs_removexattr+0x52/0x70
    [<ffffffff812e6a08>] __vfs_removexattr_locked+0xb8/0x150
    [<ffffffff812e6af6>] vfs_removexattr+0x56/0x100
    [<ffffffff812e6bf8>] removexattr+0x58/0x90
    [<ffffffff812e6cce>] path_removexattr+0x9e/0xc0
    [<ffffffff812e6d44>] __x64_sys_lremovexattr+0x14/0x20
    [<ffffffff81786b35>] do_syscall_64+0x35/0x80

I think this is a consequence of xfs_attr_node_removename_setup
attaching a new da(btree) state to xfs_attr_item and never freeing it.
I /think/ it's the case that the remove paths could detach the da state
earlier in the remove state machine since nothing else accesses the
state.  However, let's future-proof the new xattr code by adding a
catch-all when we free the xfs_attr_item to make sure we never leak the
da state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-20 14:41:34 +10:00
Tom Rix
30476f7e6d namei: cleanup double word in comment
Remove the second 'to'.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:26:29 -04:00
Al Viro
52dba645ca get rid of dead code in legitimize_root()
Combination of LOOKUP_IS_SCOPED and NULL nd->root.mnt is impossible
after successful path_init().  All places where ->root.mnt might
become NULL do that only if LOOKUP_IS_SCOPED is not there and
path_init() itself can return success without setting nd->root
only if ND_ROOT_PRESET had been set (in which case nd->root
had been set by caller and never changed) or if the name had
been a relative one *and* none of the bits in LOOKUP_IS_SCOPED
had been present.

Since all calls of legitimize_root() must be downstream of successful
path_init(), the check for !nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)
is pure paranoia.

FWIW, it had been discussed (and agreed upon) with Aleksa back when
scoped lookups had been merged; looks like that had fallen through the
cracks back then.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:26:29 -04:00
Al Viro
e5ca024e16 fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()
!foo() != 0 is a strange way to spell !foo(); fallout from
"fs: make unlazy_walk() error handling consistent"...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:26:29 -04:00
Al Viro
f6957b7191 m->mnt_root->d_inode->i_sb is a weird way to spell m->mnt_sb...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:25:45 -04:00
Al Viro
a5f85d7834 uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
It's done once per (mount-related) syscall and there's no point
whatsoever making it inline.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:25:10 -04:00
Chao Liu
d9c454ab22 f2fs: make f2fs_read_inline_data() more readable
In f2fs_read_inline_data(), it is confused with checking of
inline_data flag, as we checked it before calling. So this
patch add some comments for f2fs_has_inline_data().

Signed-off-by: Chao Liu <liuchao@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-19 17:23:22 -07:00
Colin Ian King
69bc169ec3 fs/ntfs: remove redundant variable idx
The variable idx is assigned a value and is never read.  The variable is
not used and is redundant, remove it.

Cleans up clang scan build warning:
warning: Although the value stored to 'idx' is used in the enclosing
expression, the value is never actually read from 'idx'
[deadcode.DeadStores]

Link: https://lkml.kernel.org/r/20220517093646.93628-2-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng
1213375077 fat: remove time truncations in vfat_create/vfat_mkdir
All the timestamps in vfat_create() and vfat_mkdir() come from
fat_time_fat2unix() which ensures time granularity.  We don't need to
truncate them to fit FAT's format.

Moreover, fat_truncate_crtime() and fat_timespec64_trunc_10ms() are also
removed because there is no caller anymore.

Link: https://lkml.kernel.org/r/20220503152536.2503003-4-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng
30abce053f fat: report creation time in statx
creation time is no longer mixed with change time.  Add an in-memory field
for it, and report it in statx if supported.

Link: https://lkml.kernel.org/r/20220503152536.2503003-3-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng
0f9d148167 fat: ignore ctime updates, and keep ctime identical to mtime in memory
FAT supports creation time but not change time, and there was no
corresponding timestamp for creation time in previous VFS.  The original
implementation took the compromise of saving the in-memory change time
into the on-disk creation time field, but this would lead to compatibility
issues with non-linux systems.

To address this issue, this patch changes the behavior of ctime.  It will
no longer be loaded and stored from the creation time on disk.  Instead of
that, it'll be consistent with the in-memory mtime and share the same
on-disk field.  All updates to mtime will also be applied to ctime in
memory, while all updates to ctime will be ignored.

Link: https://lkml.kernel.org/r/20220503152536.2503003-2-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng
4dcc3f96e7 fat: split fat_truncate_time() into separate functions
Separate fat_truncate_time() to each timestamps for later creation time
work.

This patch does not introduce any functional changes, it's merely
refactoring change.

Link: https://lkml.kernel.org/r/20220503152536.2503003-1-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:10:31 -07:00
Johannes Weiner
f6498b776d mm: zswap: add basic meminfo and vmstat coverage
Currently it requires poking at debugfs to figure out the size and
population of the zswap cache on a host.  There are no counters for reads
and writes against the cache.  As a result, it's difficult to understand
zswap behavior on production systems.

Print zswap memory consumption and how many pages are zswapped out in
/proc/meminfo.  Count zswapouts and zswapins in /proc/vmstat.

Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Jakub Kicinski
d7e6f58360 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/main.c
  b33886971d ("net/mlx5: Initialize flow steering during driver probe")
  40379a0084 ("net/mlx5_fpga: Drop INNOVA TLS support")
  f2b41b32cd ("net/mlx5: Remove ipsec_ops function table")
https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/
  16d42d3133 ("net/mlx5: Drain fw_reset when removing device")
  8324a02c34 ("net/mlx5: Add exit route when waiting for FW")
https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  e274f71540 ("selftests: mptcp: add subflow limits test-cases")
  b6e074e171 ("selftests: mptcp: add infinite map testcase")
  5ac1d2d634 ("selftests: mptcp: Add tests for userspace PM type")
https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/

net/mptcp/options.c
  ba2c89e0ea ("mptcp: fix checksum byte order")
  1e39e5a32a ("mptcp: infinite mapping sending")
  ea66758c17 ("tcp: allow MPTCP to update the announced window")
https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/

net/mptcp/pm.c
  95d6865178 ("mptcp: fix subflow accounting on close")
  4d25247d3a ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/

net/mptcp/subflow.c
  ae66fb2ba6 ("mptcp: Do TCP fallback on early DSS checksum failure")
  0348c690ed ("mptcp: add the fallback check")
  f8d4bcacff ("mptcp: infinite mapping receiving")
https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 11:23:59 -07:00
Hao Luo
1a702dc88e kernfs: Separate kernfs_pr_cont_buf and rename_lock.
Previously the protection of kernfs_pr_cont_buf was piggy backed by
rename_lock, which means that pr_cont() needs to be protected under
rename_lock. This can cause potential circular lock dependencies.

If there is an OOM, we have the following call hierarchy:

 -> cpuset_print_current_mems_allowed()
   -> pr_cont_cgroup_name()
     -> pr_cont_kernfs_name()

pr_cont_kernfs_name() will grab rename_lock and call printk. So we have
the following lock dependencies:

 kernfs_rename_lock -> console_sem

Sometimes, printk does a wakeup before releasing console_sem, which has
the dependence chain:

 console_sem -> p->pi_lock -> rq->lock

Now, imagine one wants to read cgroup_name under rq->lock, for example,
printing cgroup_name in a tracepoint in the scheduler code. They will
be holding rq->lock and take rename_lock:

 rq->lock -> kernfs_rename_lock

Now they will deadlock.

A prevention to this circular lock dependency is to separate the
protection of pr_cont_buf from rename_lock. In principle, rename_lock
is to protect the integrity of cgroup name when copying to buf. Once
pr_cont_buf has got its content, rename_lock can be dropped. So it's
safe to drop rename_lock after kernfs_name_locked (and
kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock
to protect pr_cont_buf.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-19 19:37:06 +02:00
Zhang Jianhua
e6af1bb077 fs-verity: Use struct_size() helper in enable_verity()
Follow the best practice for allocating a variable-sized structure.

Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
[ebiggers: adjusted commit message]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220519022450.2434483-1-chris.zjh@huawei.com
2022-05-19 09:53:33 -07:00
Dai Ngo
e9488d5ae1 NFSD: Show state of courtesy client in client info
Update client_info_show to show state of courtesy client
and seconds since last renew.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:40 -04:00
Dai Ngo
27431affb0 NFSD: add support for lock conflict to courteous server
This patch allows expired client with lock state to be in COURTESY
state. Lock conflict with COURTESY client is resolved by the fs/lock
code using the lm_lock_expirable and lm_expire_lock callback in the
struct lock_manager_operations.

If conflict client is in COURTESY state, set it to EXPIRABLE and
schedule the laundromat to run immediately to expire the client. The
callback lm_expire_lock waits for the laundromat to flush its work
queue before returning to caller.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:39 -04:00
Dai Ngo
2443da2259 fs/lock: add 2 callbacks to lock_manager_operations to resolve conflict
Add 2 new callbacks, lm_lock_expirable and lm_expire_lock, to
lock_manager_operations to allow the lock manager to take appropriate
action to resolve the lock conflict if possible.

A new field, lm_mod_owner, is also added to lock_manager_operations.
The lm_mod_owner is used by the fs/lock code to make sure the lock
manager module such as nfsd, is not freed while lock conflict is being
resolved.

lm_lock_expirable checks and returns true to indicate that the lock
conflict can be resolved else return false. This callback must be
called with the flc_lock held so it can not block.

lm_expire_lock is called to resolve the lock conflict if the returned
value from lm_lock_expirable is true. This callback is called without
the flc_lock held since it's allowed to block. Upon returning from
this callback, the lock conflict should be resolved and the caller is
expected to restart the conflict check from the beginnning of the list.

Lock manager, such as NFSv4 courteous server, uses this callback to
resolve conflict by destroying lock owner, or the NFSv4 courtesy client
(client that has expired but allowed to maintains its states) that owns
the lock.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-19 12:25:39 -04:00
Dai Ngo
591502c5cb fs/lock: add helper locks_owner_has_blockers to check for blockers
Add helper locks_owner_has_blockers to check if there is any blockers
for a given lockowner.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-19 12:25:39 -04:00
Dai Ngo
d76cc46b37 NFSD: move create/destroy of laundry_wq to init_nfsd and exit_nfsd
This patch moves create/destroy of laundry_wq from nfs4_state_start
and nfs4_state_shutdown_net to init_nfsd and exit_nfsd to prevent
the laundromat from being freed while a thread is processing a
conflicting lock.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:39 -04:00
Dai Ngo
3d69427151 NFSD: add support for share reservation conflict to courteous server
This patch allows expired client with open state to be in COURTESY
state. Share/access conflict with COURTESY client is resolved by
setting COURTESY client to EXPIRABLE state, schedule laundromat
to run and returning nfserr_jukebox to the request client.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:39 -04:00
Dai Ngo
66af257999 NFSD: add courteous server support for thread with only delegation
This patch provides courteous server support for delegation only.
Only expired client with delegation but no conflict and no open
or lock state is allowed to be in COURTESY state.

Delegation conflict with COURTESY/EXPIRABLE client is resolved by
setting it to EXPIRABLE, queue work for the laundromat and return
delay to the caller. Conflict is resolved when the laudromat runs
and expires the EXIRABLE client while the NFS client retries the
OPEN request. Local thread request that gets conflict is doing the
retry in _break_lease.

Client in COURTESY or EXPIRABLE state is allowed to reconnect and
continues to have access to its state. Access to the nfs4_client by
the reconnecting thread and the laundromat is serialized via the
client_lock.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:39 -04:00
Chuck Lever
91e23b1c39 NFSD: Clean up nfsd_splice_actor()
nfsd_splice_actor() checks that the page being spliced does not
match the previous element in the svc_rqst::rq_pages array. We
believe this is to prevent a double put_page() in cases where the
READ payload is partially contained in the xdr_buf's head buffer.

However, the NFSD READ proc functions no longer place any part of
the READ payload in the head buffer, in order to properly support
NFS/RDMA READ with Write chunks. Therefore, simplify the logic in
nfsd_splice_actor() to remove this unnecessary check.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:38 -04:00
Paulo Alcantara
d80c69846d cifs: fix signed integer overflow when fl_end is OFFSET_MAX
This fixes the following when running xfstests generic/504:

[  134.394698] CIFS: Attempting to mount \\win16.vm.test\Share
[  134.420905] CIFS: VFS: generate_smb3signingkey: dumping generated
AES session keys
[  134.420911] CIFS: VFS: Session Id    05 00 00 00 00 c4 00 00
[  134.420914] CIFS: VFS: Cipher type   1
[  134.420917] CIFS: VFS: Session Key   ea 0b d9 22 2e af 01 69 30 1b
15 74 bf 87 41 11
[  134.420920] CIFS: VFS: Signing Key   59 28 43 5c f0 b6 b1 6f f5 7b
65 f2 9f 9e 58 7d
[  134.420923] CIFS: VFS: ServerIn Key  eb aa 58 c8 95 01 9a f7 91 98
e4 fa bc d8 74 f1
[  134.420926] CIFS: VFS: ServerOut Key 08 5b 21 e5 2e 4e 86 f6 05 c2
58 e0 af 53 83 e7
[  134.771946]
================================================================================
[  134.771953] UBSAN: signed-integer-overflow in fs/cifs/file.c:1706:19
[  134.771957] 9223372036854775807 + 1 cannot be represented in type
'long long int'
[  134.771960] CPU: 4 PID: 2773 Comm: flock Not tainted 5.11.22 #1
[  134.771964] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  134.771966] Call Trace:
[  134.771970]  dump_stack+0x8d/0xb5
[  134.771981]  ubsan_epilogue+0x5/0x50
[  134.771988]  handle_overflow+0xa3/0xb0
[  134.771997]  ? lockdep_hardirqs_on_prepare+0xe8/0x1b0
[  134.772006]  cifs_setlk+0x63c/0x680 [cifs]
[  134.772085]  ? _get_xid+0x5f/0xa0 [cifs]
[  134.772085]  cifs_flock+0x131/0x400 [cifs]
[  134.772085]  __x64_sys_flock+0xfc/0x120
[  134.772085]  do_syscall_64+0x33/0x40
[  134.772085]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  134.772085] RIP: 0033:0x7fea4f83b3fb
[  134.772085] Code: ff 48 8b 15 8f 1a 0d 00 f7 d8 64 89 02 b8 ff ff
ff ff eb da e8 16 0b 02 00 66 0f 1f 44 00 00 f3 0f 1e fa b8 49 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 1a 0d 00 f7 d8 64 89
01 48

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-19 10:54:41 -05:00
Zhihao Cheng
68f4c6eba7 fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
Commit 505a666ee3 ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA dead lock:

    wb_writeback		fat_file_fsync
blk_start_plug(&plug)
for (;;) {
  iter i-1: some reqs have been added into plug->mq_list  // LOCK A
  iter i:
    progress = __writeback_inodes_wb(wb, work)
    . writeback_sb_inodes // fat's bdev
    .   __writeback_single_inode
    .   . generic_writepages
    .   .   __block_write_full_page
    .   .   . . 	    __generic_file_fsync
    .   .   . . 	      sync_inode_metadata
    .   .   . . 	        writeback_single_inode
    .   .   . . 		  __writeback_single_inode
    .   .   . . 		    fat_write_inode
    .   .   . . 		      __fat_write_inode
    .   .   . . 		        sync_dirty_buffer	// fat's bdev
    .   .   . . 			  lock_buffer(bh)	// LOCK B
    .   .   . . 			    submit_bh
    .   .   . . 			      blk_mq_get_tag	// LOCK A
    .   .   . trylock_buffer(bh)  // LOCK B
    .   .   .   redirty_page_for_writepage
    .   .   .     wbc->pages_skipped++
    .   .   --wbc->nr_to_write
    .   wrote += write_chunk - wbc.nr_to_write  // wrote > 0
    .   requeue_inode
    .     redirty_tail_locked
    if (progress)    // progress > 0
      continue;
  iter i+1:
      queue_io
      // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)   // flush plug won't be called

Above process triggers a hungtask like:
[  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[  399.051539] task:bb              state:D stack:    0 pid: 2607 ppid:
2426 flags:0x00004000
[  399.051556] Call Trace:
[  399.051570]  __schedule+0x480/0x1050
[  399.051592]  schedule+0x92/0x1a0
[  399.051602]  io_schedule+0x22/0x50
[  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
[  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
[  399.051657]  blk_mq_submit_bio+0x68d/0xca0
[  399.051674]  __submit_bio+0x1b5/0x2d0
[  399.051708]  submit_bio_noacct+0x34e/0x720
[  399.051718]  submit_bio+0x3b/0x150
[  399.051725]  submit_bh_wbc+0x161/0x230
[  399.051734]  __sync_dirty_buffer+0xd1/0x420
[  399.051744]  sync_dirty_buffer+0x17/0x20
[  399.051750]  __fat_write_inode+0x289/0x310
[  399.051766]  fat_write_inode+0x2a/0xa0
[  399.051783]  __writeback_single_inode+0x53c/0x6f0
[  399.051795]  writeback_single_inode+0x145/0x200
[  399.051803]  sync_inode_metadata+0x45/0x70
[  399.051856]  __generic_file_fsync+0xa3/0x150
[  399.051880]  fat_file_fsync+0x1d/0x80
[  399.051895]  vfs_fsync_range+0x40/0xb0
[  399.051929]  __x64_sys_fsync+0x18/0x30

In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
unplug before cond_resched in writeback_sb_inodes") in function
'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
from write_cache_pages().

Fix it by correcting wrote number according number of skipped pages
in writeback_sb_inodes().

Goto Link to find a reproducer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: stable@vger.kernel.org # v4.3
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19 06:29:41 -06:00
Linus Torvalds
01464a73a6 io_uring-5.18-2022-05-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKFeygQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsXxD/9dnVf1LCfVR3wQpV/O7o1u5zRpjTSFetmS
 qGw4kMDUtfE63oCLzGW/WpmQLrHduoTa32Ms/v0vU1+1MixUHatzRzN5OF2Nv7Wv
 Ukc6JtH2PjcYFNouAFJ8r/H1uanrr9DOK7R/9n3t9CwOIoand+PUmZ0rIHVN9u0j
 KtpousBih66c7tajADqSxKb1uPlDrgTkzfAYVj5O4OwBAiIwY2CLJfvbLFTdoHP9
 bdCrsvo9R08QRH9PedPqctdQURBxfzd0kXm3N8+d4r4GEQvbaGI+6bN88VYW2aEa
 VfBCZD4A8/VupcGWWhAnKu0rOwxgrh2jDEd5g+0g/f3SRTY2LBUJAlEMM7N043vX
 MtAuO05E3/Svr5doAwAJsp44h6SvS2s5ogvPIYjEUvY4JK2UDdWOcYrb7HqXrNBi
 0jNsgUu4+N89i6jiewf0dPw6BKK7Cf9322fGy4X1dQ77gIkDWt4Z55iV46aSX3Y8
 dQt2Jaoj0c/wRkOgZtrDi7FmdM0xcMzGMB/oOyI07d3ERjbyZuDTFowWh/kqd1xA
 6vsrkyCQaGilmDu9xTywPSfarKoxIrX+TgZj5rnnJpv2tENJSx/o8yKWcu9wnO1L
 gnanzpwYW4qIOjkWKvIBLBHXXA6Qp8aXacif8rODUhIuABQpe85b+btpHeqO9lWL
 a3wS9Gc2nQ==
 =8Xxv
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two small changes fixing issues from the 5.18 merge window:

   - Fix wrong ordering of a tracepoint (Dylan)

   - Fix MSG_RING on IOPOLL rings (me)"

* tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block:
  io_uring: don't attempt to IOPOLL for MSG_RING requests
  io_uring: fix ordering of args in io_uring_queue_async_work
2022-05-18 14:21:30 -10:00
Chao Yu
677a82b44e f2fs: fix to do sanity check for inline inode
Yanming reported a kernel bug in Bugzilla kernel [1], which can be
reproduced. The bug message is:

The kernel message is shown below:

kernel BUG at fs/inode.c:611!
Call Trace:
 evict+0x282/0x4e0
 __dentry_kill+0x2b2/0x4d0
 dput+0x2dd/0x720
 do_renameat2+0x596/0x970
 __x64_sys_rename+0x78/0x90
 do_syscall_64+0x3b/0x90

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215895

The bug is due to fuzzed inode has both inline_data and encrypted flags.
During f2fs_evict_inode(), as the inode was deleted by rename(), it
will cause inline data conversion due to conflicting flags. The page
cache will be polluted and the panic will be triggered in clear_inode().

Try fixing the bug by doing more sanity checks for inline data inode in
sanity_check_inode().

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-18 15:36:11 -07:00
Chao Yu
958ed92922 f2fs: fix fallocate to use file_modified to update permissions consistently
This patch tries to fix permission consistency issue as all other
mainline filesystems.

Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files.  This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range).  Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.

The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.

Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-18 15:36:09 -07:00
Jens Axboe
2fcabce2d7 io_uring: disallow mixed provided buffer group registrations
It's nonsensical to register a provided buffer ring, if a classic
provided buffer group with the same ID exists. Depending on the order of
which we decide what type to pick, the other type will never get used.
Explicitly disallow it and return an error if this is attempted.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 16:21:30 -06:00
Jens Axboe
1d0dbbfa28 io_uring: initialize io_buffer_list head when shared ring is unregistered
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a
classic provided buffer group. If we unregister the shared ring and
then attempt to use it, buf_pages is zero yet the classic list head
isn't properly initialized. This causes io_buffer_select() to think
that we have classic buffers available, but then we crash when we try
and get one from the list.

Just initialize the list if we unregister a shared buffer ring, leaving
it in a sane state for either re-registration or for attempting to use
it. And do the same for the initial setup from the classic path.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 16:21:11 -06:00
Pavel Begunkov
0184f08e65 io_uring: add fully sparse buffer registration
Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed
buffers as well. It makes the rsrc API more consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 12:53:04 -06:00
Zhang Jianhua
b0487ede1f fs-verity: remove unused parameter desc_size in fsverity_create_info()
The parameter desc_size in fsverity_create_info() is useless and it is
not referenced anywhere. The greatest meaning of desc_size here is to
indecate the size of struct fsverity_descriptor and futher calculate the
size of signature. However, the desc->sig_size can do it also and it is
indeed, so remove it.

Therefore, it is no need to acquire desc_size by fsverity_get_descriptor()
in ensure_verity_info(), so remove the parameter desc_ret in
fsverity_get_descriptor() too.

Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220518132256.2297655-1-chris.zjh@huawei.com
2022-05-18 11:01:31 -07:00
Eric Biggers
c069db76ed ext4: fix memory leak in parse_apply_sb_mount_options()
If processing the on-disk mount options fails after any memory was
allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is
leaked.  Fix this by calling ext4_fc_free() instead of kfree() directly.

Reproducer:

    mkfs.ext4 -F /dev/vdc
    tune2fs /dev/vdc -E mount_opts=usrjquota=file
    echo clear > /sys/kernel/debug/kmemleak
    mount /dev/vdc /vdc
    echo scan > /sys/kernel/debug/kmemleak
    sleep 5
    echo scan > /sys/kernel/debug/kmemleak
    cat /sys/kernel/debug/kmemleak

Fixes: 7edfd85b1f ("ext4: Completely separate options parsing and sb setup")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-18 11:24:22 -04:00
Eric Biggers
cb8435dc8b ext4: reject the 'commit' option on ext2 filesystems
The 'commit' option is only applicable for ext3 and ext4 filesystems,
and has never been accepted by the ext2 filesystem driver, so the ext4
driver shouldn't allow it on ext2 filesystems.

This fixes a failure in xfstest ext4/053.

Fixes: 8dc0aa8cf0 ("ext4: check incompatible mount options while mounting ext2/3")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220510183232.172615-1-ebiggers@kernel.org
2022-05-18 11:24:22 -04:00
Yang Li
b10b6278ae ext4: remove duplicated #include of dax.h in inode.c
Fix following includecheck warning:
./fs/ext4/inode.c: linux/dax.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-18 11:24:22 -04:00
Christoph Hellwig
0b1e987c56 freevxfs: relicense to GPLv2 only
When I wrote the freevxfs driver I had some odd choice of licensing
statements, the options are either GPL (without version) or an odd
BSD-ish licensense with advertising clause.

The GPL vs always meant to be the same as the kernel, that is version
2 only, and the odd BSD-ish license doesn't make much sense.  Add
a GPL2.0-only SPDX tag to make the GPL intentions clear and drop the
bogus BSD license.

Acked-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-18 15:30:17 +02:00
Amir Goldstein
e730558adf fsnotify: consistent behavior for parent not watching children
The logic for handling events on child in groups that have a mark on
the parent inode, but without FS_EVENT_ON_CHILD flag in the mask is
duplicated in several places and inconsistent.

Move the logic into the preparation of mark type iterator, so that the
parent mark type will be excluded from all mark type iterations in that
case.

This results in several subtle changes of behavior, hopefully all
desired changes of behavior, for example:

- Group A has a mount mark with FS_MODIFY in mask
- Group A has a mark with ignore mask that does not survive FS_MODIFY
  and does not watch children on directory D.
- Group B has a mark with FS_MODIFY in mask that does watch children
  on directory D.
- FS_MODIFY event on file D/foo should not clear the ignore mask of
  group A, but before this change it does

And if group A ignore mask was set to survive FS_MODIFY:
- FS_MODIFY event on file D/foo should be reported to group A on account
  of the mount mark, but before this change it is wrongly ignored

Fixes: 2f02fd3fa1 ("fanotify: fix ignore mask logic for events on child and on dir")
Reported-by: Jan Kara <jack@suse.com>
Link: https://lore.kernel.org/linux-fsdevel/20220314113337.j7slrb5srxukztje@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220511190213.831646-3-amir73il@gmail.com
2022-05-18 15:08:09 +02:00
Amir Goldstein
14362a2541 fsnotify: introduce mark type iterator
fsnotify_foreach_iter_mark_type() is used to reduce boilerplate code
of iterating all marks of a specific group interested in an event
by consulting the iterator report_mask.

Use an open coded version of that iterator in fsnotify_iter_next()
that collects all marks of the current iteration group without
consulting the iterator report_mask.

At the moment, the two iterator variants are the same, but this
decoupling will allow us to exclude some of the group's marks from
reporting the event, for example for event on child and inode marks
on parent did not request to watch events on children.

Fixes: 2f02fd3fa1 ("fanotify: fix ignore mask logic for events on child and on dir")
Reported-by: Jan Kara <jack@suse.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220511190213.831646-2-amir73il@gmail.com
2022-05-18 15:07:43 +02:00
Christoph Hellwig
0bf1dbee9b io_uring: use rcu_dereference in io_close
Accessing the file table needs a rcu_dereference_protected().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
a294bef57c io_uring: consistently use the EPOLL* defines
POLL* are unannotated values for the userspace ABI, while everything
in-kernel should use EPOLL* and the __poll_t type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
58f5c8d39e io_uring: make apoll_events a __poll_t
apoll_events is fed to vfs_poll and the poll tables, so it should be
a __poll_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
ee67ba3b20 io_uring: drop a spurious inline on a forward declaration
io_file_get_normal isn't marked inline, so don't claim it as such in the
forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
984824db84 io_uring: don't use ERR_PTR for user pointers
ERR_PTR abuses the high bits of a pointer to transport error information.
This is only safe for kernel pointers and not user pointers.  Fix
io_buffer_select and its helpers to just return NULL for failure and get
rid of this abuse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:18:56 -06:00
Christoph Hellwig
20cbd21d89 io_uring: use a rwf_t for io_rw.flags
Use the proper type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:17:52 -06:00
Jens Axboe
c7fb19428d io_uring: add support for ring mapped supplied buffers
Provided buffers allow an application to supply io_uring with buffers
that can then be grabbed for a read/receive request, when the data
source is ready to deliver data. The existing scheme relies on using
IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
in real world applications. It's pretty efficient if the application
is able to supply back batches of provided buffers when they have been
consumed and the application is ready to recycle them, but if
fragmentation occurs in the buffer space, it can become difficult to
supply enough buffers at the time. This hurts efficiency.

Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
to setup a shared queue for each buffer group of provided buffers. The
application can then supply buffers simply by adding them to this ring,
and the kernel can consume then just as easily. The ring shares the head
with the application, the tail remains private in the kernel.

Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
ring, they must use the mapped ring. Mapped provided buffer rings can
co-exist with normal provided buffers, just not within the same group ID.

To gauge overhead of the existing scheme and evaluate the mapped ring
approach, a simple NOP benchmark was written. It uses a ring of 128
entries, and submits/completes 32 at the time. 'Replenish' is how
many buffers are provided back at the time after they have been
consumed:

Test			Replenish			NOPs/sec
================================================================
No provided buffers	NA				~30M
Provided buffers	32				~16M
Provided buffers	 1				~10M
Ring buffers		32				~27M
Ring buffers		 1				~27M

The ring mapped buffers perform almost as well as not using provided
buffers at all, and they don't care if you provided 1 or more back at
the same time. This means application can just replenish as they go,
rather than need to batch and compact, further reducing overhead in the
application. The NOP benchmark above doesn't need to do any compaction,
so that overhead isn't even reflected in the above test.

Co-developed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:42 -06:00
Jens Axboe
d8c2237d0a io_uring: add io_pin_pages() helper
Abstract this out from io_sqe_buffer_register() so we can use it
elsewhere too without duplicating this code.

No intended functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:42 -06:00
Jens Axboe
3d200242a6 io_uring: add buffer selection support to IORING_OP_NOP
Obviously not really useful since it's not transferring data, but it
is helpful in benchmarking overhead of provided buffers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:42 -06:00
Jens Axboe
e7637a492b io_uring: fix locking state for empty buffer group
io_provided_buffer_select() must drop the submit lock, if needed, even
in the error handling case. Failure to do so will leave us with the
ctx->uring_lock held, causing spew like:

====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
 #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xa4/0xd4
 show_stack+0x14/0x5c
 dump_stack_lvl+0x88/0xb0
 dump_stack+0x14/0x2c
 debug_check_no_locks_held+0x84/0x90
 try_to_freeze.isra.0+0x18/0x44
 get_signal+0x94/0x6ec
 io_wqe_worker+0x1d8/0x2b4
 ret_from_fork+0x10/0x20

and triggering later hangs off get_signal() because we attempt to
re-grab the lock.

Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b04a ("io_uring: abstract out provided buffer list selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:41 -06:00
Dave Wysochanski
9c4a5c75a6 NFS: Pass i_size to fscache_unuse_cookie() when a file is released
Pass updated i_size in fscache_unuse_cookie() when called
from nfs_fscache_release_file(), which ensures the size of
an fscache object gets written to the cache storage.  Failing
to do so results in unnessary reads from the NFS server, even
when the data is cached, due to a cachefiles object coherency
check failing with a trace similar to the following:
  cachefiles_coherency: o=0000000e BAD osiz B=afbb3 c=0

This problem can be reproduced as follows:
  #!/bin/bash
  v=4.2; NFS_SERVER=127.0.0.1
  set -e; trap cleanup EXIT; rc=1
  function cleanup {
          umount /mnt/nfs > /dev/null 2>&1
          RC_STR="TEST PASS"
          [ $rc -eq 1 ] && RC_STR="TEST FAIL"
          echo "$RC_STR on $(uname -r) with NFSv$v and server $NFS_SERVER"
  }
  mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs
  rm -f /mnt/nfs/file1.bin > /dev/null 2>&1
  dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 > /dev/null 2>&1
  echo 3 > /proc/sys/vm/drop_caches
  echo Read file 1st time from NFS server into fscache
  dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1
  umount /mnt/nfs && mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs
  echo 3 > /proc/sys/vm/drop_caches
  echo Read file 2nd time from fscache
  dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1
  echo Check mountstats for NFS read
  grep -q "READ: 0" /proc/self/mountstats # (1st number) == 0
  [ $? -eq 0 ] && rc=0

Fixes: a6b5a28eb5 "nfs: Convert to new fscache volume/cookie API"
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 15:39:45 -04:00
NeilBrown
3e2910c7e2 NFS: Improve warning message when locks are lost.
NFSv4 can lose locks if, for example there is a network partition for
longer than the lease period.  When this happens a warning message

  NFS: __nfs4_reclaim_open_state: Lock reclaim failed!

is generated, possibly once for each lock (though rate limited).

This is potentially misleading as is can be read as suggesting that lock
reclaim was attempted.  However the default behaviour is to not attempt
to recover locks (except due to server report).

This patch changes the reporting to produce at most one message for each
attempt to recover all state from a given server.  The message reports
the server name and the number of locks lost if that number is non-zero.
It reports that locks were lost and give no suggestion as to whether
there was an attempt or not.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 15:27:17 -04:00
Steve French
0a55cf74ff SMB3: EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op
There is  a race condition in smb2_compound_op:

after_close:
	num_rqst++;

	if (cfile) {
		cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
		cfile = NULL;

This is triggered by smb2_query_path_info operation that happens during
revalidate_dentry. In smb2_query_path_info, get_readable_path is called to
load the cfile, increasing the reference counter. If in the meantime, this
reference becomes the very last, this call to cifsFileInfo_put(cfile) will
trigger a SMB2_CLOSE request sent to the server just before sending this compound
request – and so then the compound request fails either with EBADF/EIO depending
on the timing at the server, because the handle is already closed.

In the first scenario, the race seems to be happening between smb2_query_path_info
triggered by the rename operation, and between “cleanup” of asynchronous writes – while
fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback
structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if
the timing is such that:
1) There are still outstanding references after close(fd) in the writeback structures
2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1
3) All writeback structures release the same cfile, reducing refcounter to 1
4) smb2_compound_op is called with that cfile

In the second scenario, the race seems to be similar – here open triggers the
smb2_query_path_info operation, and if all other threads in the meantime decrease the
refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the
server just before issuing the compound request. This case is harder to reproduce.

See https://bugzilla.samba.org/show_bug.cgi?id=15051

Cc: stable@vger.kernel.org
Fixes: 8de9e86c67 ("cifs: create a helper to find a writeable handle by path name")
Signed-off-by: Ondrej Hubsch <ohubsch@purestorage.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-05-17 13:47:27 -05:00
Jens Axboe
aa184e8671 io_uring: don't attempt to IOPOLL for MSG_RING requests
We gate whether to IOPOLL for a request on whether the opcode is allowed
on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING
is the only one that allows a file yet isn't pollable, it's merely
supported to allow communication on an IOPOLL ring, not because we can
poll for completion of it.

Put the assigned file early and clear it, so we don't attempt to poll
for it.

Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com
Fixes: 3f1d52abf0 ("io_uring: defer msg-ring file validity check until command issue")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17 12:46:04 -06:00
Eric Biggers
b5639bb431 f2fs: don't use casefolded comparison for "." and ".."
Tryng to rename a directory that has all following properties fails with
EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))'
in f2fs_match_ci_name():

    - The directory is casefolded
    - The directory is encrypted
    - The directory's encryption key is not yet set up
    - The parent directory is *not* encrypted

The problem is incorrect handling of the lookup of ".." to get the
parent reference to update.  fscrypt_setup_filename() treats ".." (and
".") specially, as it's never encrypted.  It's passed through as-is, and
setting up the directory's key is not attempted.  As the name isn't a
no-key name, f2fs treats it as a "normal" name and attempts a casefolded
comparison.  That breaks the assumption of the WARN_ON_ONCE() in
f2fs_match_ci_name() which assumes that for encrypted directories,
casefolded comparisons only happen when the directory's key is set up.

We could just remove this WARN_ON_ONCE().  However, since casefolding is
always a no-op on "." and ".." anyway, let's instead just not casefold
these names.  This results in the standard bytewise comparison.

Fixes: 7ad08a58bf ("f2fs: Handle casefolding with Encryption")
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-17 11:19:23 -07:00
Jaegeuk Kim
c81d5bae40 f2fs: do not stop GC when requiring a free section
The f2fs_gc uses a bitmap to indicate pinned sections, but when disabling
chckpoint, we call f2fs_gc() with NULL_SEGNO which selects the same dirty
segment as a victim all the time, resulting in checkpoint=disable failure,
for example. Let's pick another one, if we fail to collect it.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-17 11:19:19 -07:00
Baokun Li
f87c7a4b08 ext4: fix race condition between ext4_write and ext4_convert_inline_data
Hulk Robot reported a BUG_ON:
 ==================================================================
 EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
 block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
 kernel BUG at fs/ext4/ext4_jbd2.c:53!
 invalid opcode: 0000 [#1] SMP KASAN PTI
 CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
 RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
 RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
 [...]
 Call Trace:
  ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
  generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
  ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
  ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
  do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
  do_iter_write+0x107/0x430 fs/read_write.c:861
  vfs_writev fs/read_write.c:934 [inline]
  do_pwritev+0x1e5/0x380 fs/read_write.c:1031
 [...]
 ==================================================================

Above issue may happen as follows:
           cpu1                     cpu2
__________________________|__________________________
do_pwritev
 vfs_writev
  do_iter_write
   ext4_file_write_iter
    ext4_buffered_write_iter
     generic_perform_write
      ext4_da_write_begin
                           vfs_fallocate
                            ext4_fallocate
                             ext4_convert_inline_data
                              ext4_convert_inline_data_nolock
                               ext4_destroy_inline_data_nolock
                                clear EXT4_STATE_MAY_INLINE_DATA
                               ext4_map_blocks
                                ext4_ext_map_blocks
                                 ext4_mb_new_blocks
                                  ext4_mb_regular_allocator
                                   ext4_mb_good_group_nolock
                                    ext4_mb_init_group
                                     ext4_mb_init_cache
                                      ext4_mb_generate_buddy  --> error
       ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
                                ext4_restore_inline_data
                                 set EXT4_STATE_MAY_INLINE_DATA
       ext4_block_write_begin
      ext4_da_write_end
       ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
       ext4_write_inline_data_end
        handle=NULL
        ext4_journal_stop(handle)
         __ext4_journal_stop
          ext4_put_nojournal(handle)
           ref_cnt = (unsigned long)handle
           BUG_ON(ref_cnt == 0)  ---> BUG_ON

The lock held by ext4_convert_inline_data is xattr_sem, but the lock
held by generic_perform_write is i_rwsem. Therefore, the two locks can
be concurrent.

To solve above issue, we add inode_lock() for ext4_convert_inline_data().
At the same time, move ext4_convert_inline_data() in front of
ext4_punch_hole(), remove similar handling from ext4_punch_hole().

Fixes: 0c8d414f16 ("ext4: let fallocate handle inline data correctly")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-17 14:17:40 -04:00
Zhang Yi
6493792d32 ext4: convert symlink external data block mapping to bdev
Symlink's external data block is one kind of metadata block, and now
that almost all ext4 metadata block's page cache (e.g. directory blocks,
quota blocks...) belongs to bdev backing inode except the symlink. It
is essentially worked in data=journal mode like other regular file's
data block because probably in order to make it simple for generic VFS
code handling symlinks or some other historical reasons, but the logic
of creating external data block in ext4_symlink() is complicated. and it
also make things confused if user do not want to let the filesystem
worked in data=journal mode. This patch convert the final exceptional
case and make things clean, move the mapping of the symlink's external
data block to bdev like any other metadata block does.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com
2022-05-17 14:17:40 -04:00
Zhang Yi
9558cf14e8 ext4: add nowait mode for ext4_getblk()
Current ext4_getblk() might sleep if some resources are not valid or
could be race with a concurrent extents modifing procedure. So we
cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in
the atomic context in some fast path (e.g. the upcoming procedure of
getting symlink external block in the RCU context), even if the map
extents have already been check and cached.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com
2022-05-17 14:17:40 -04:00
Ojaswin Mujoo
e4e58e5df3 ext4: fix journal_ioprio mount option handling
In __ext4_super() we always overwrote the user specified journal_ioprio
value with a default value, expecting parse_apply_sb_mount_options() to
later correctly set ctx->journal_ioprio to the user specified value.
However, if parse_apply_sb_mount_options() returned early because of
empty sbi->es_s->s_mount_opts, the correct journal_ioprio value was
never set.

This patch fixes __ext4_super() to only use the default value if the
user has not specified any value for journal_ioprio.

Similarly, the remount behavior was to either use journal_ioprio
value specified during initial mount, or use the default value
irrespective of the journal_ioprio value specified during remount.
This patch modifies this to first check if a new value for ioprio
has been passed during remount and apply it.  If no new value is
passed, use the value specified during initial mount.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20220418083545.45778-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-05-17 14:17:29 -04:00
Dmitry Monakhov
d63c00ea43 ext4: mark group as trimmed only if it was fully scanned
Otherwise nonaligned fstrim calls will works inconveniently for iterative
scanners, for example:

// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim  -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim  -o $((1024*1024*144)) -l $((1024*1024*16)) ./m

[ Update function documentation for ext4_trim_all_free -- TYT ]

Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-05-17 14:17:21 -04:00
Ye Bin
0be698ecbe ext4: fix use-after-free in ext4_rename_dir_prepare
We got issue as follows:
EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
ext4_get_first_dir_block: bh->b_data=0xffff88810bee6000 len=34478
ext4_get_first_dir_block: *parent_de=0xffff88810beee6ae bh->b_data=0xffff88810bee6000
ext4_rename_dir_prepare: [1] parent_de=0xffff88810beee6ae
==================================================================
BUG: KASAN: use-after-free in ext4_rename_dir_prepare+0x152/0x220
Read of size 4 at addr ffff88810beee6ae by task rep/1895

CPU: 13 PID: 1895 Comm: rep Not tainted 5.10.0+ #241
Call Trace:
 dump_stack+0xbe/0xf9
 print_address_description.constprop.0+0x1e/0x220
 kasan_report.cold+0x37/0x7f
 ext4_rename_dir_prepare+0x152/0x220
 ext4_rename+0xf44/0x1ad0
 ext4_rename2+0x11c/0x170
 vfs_rename+0xa84/0x1440
 do_renameat2+0x683/0x8f0
 __x64_sys_renameat+0x53/0x60
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f45a6fc41c9
RSP: 002b:00007ffc5a470218 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f45a6fc41c9
RDX: 0000000000000005 RSI: 0000000020000180 RDI: 0000000000000005
RBP: 00007ffc5a470240 R08: 00007ffc5a470160 R09: 0000000020000080
R10: 00000000200001c0 R11: 0000000000000246 R12: 0000000000400bb0
R13: 00007ffc5a470320 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:00000000440015ce refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x10beee
flags: 0x200000000000000()
raw: 0200000000000000 ffffea00043ff4c8 ffffea0004325608 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88810beee580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88810beee600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88810beee680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                  ^
 ffff88810beee700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88810beee780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
ext4_rename_dir_prepare: [2] parent_de->inode=3537895424
ext4_rename_dir_prepare: [3] dir=0xffff888124170140
ext4_rename_dir_prepare: [4] ino=2
ext4_rename_dir_prepare: ent->dir->i_ino=2 parent=-757071872

Reason is first directory entry which 'rec_len' is 34478, then will get illegal
parent entry. Now, we do not check directory entry after read directory block
in 'ext4_get_first_dir_block'.
To solve this issue, check directory entry in 'ext4_get_first_dir_block'.

[ Trigger an ext4_error() instead of just warning if the directory is
  missing a '.' or '..' entry.   Also make sure we return an error code
  if the file system is corrupted.  -TYT ]

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220414025223.4113128-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-05-17 14:16:56 -04:00
Johannes Thumshirn
0a05fafe9d btrfs: zoned: introduce a minimal zone size 4M and reject mount
Zoned devices are expected to have zone sizes in the range of 1-2GB for
ZNS SSDs and SMR HDDs have zone sizes of 256MB, so there is no need to
allow arbitrarily small zone sizes on btrfs.

But for testing purposes with emulated devices it is sometimes desirable
to create devices with as small as 4MB zone size to uncover errors.

So use 4MB as the smallest possible zone size and reject mounts of devices
with a smaller zone size.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 20:15:25 +02:00
Qu Wenruo
d8101a0c8a btrfs: allow defrag to convert inline extents to regular extents
Btrfs defaults to max_inline=2K to make small writes inlined into
metadata.

The default value is always a win, as even DUP/RAID1/RAID10 doubles the
metadata usage, it should still cause less physical space used compared
to a 4K regular extents.

But since the introduction of RAID1C3 and RAID1C4 it's no longer the case,
users may find inlined extents causing too much space wasted, and want
to convert those inlined extents back to regular extents.

Unfortunately defrag will unconditionally skip all inline extents, no
matter if the user is trying to converting them back to regular extents.

So this patch will add a small exception for defrag_collect_targets() to
allow defragging inline extents, if and only if the inlined extents are
larger than max_inline, allowing users to convert them to regular ones.

This also allows us to defrag extents like the following:

	item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
		generation 7 type 0 (inline)
		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
	item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
		generation 7 type 1 (regular)
		extent data disk byte 13631488 nr 4096
		extent data offset 0 nr 16384 ram 16384
		extent compression 1 (zlib)

Previously we're unable to do any defrag, since the first extent is
inlined, and the second one has no extent to merge.

Now we can defrag it to just one single extent, saving 48 bytes metadata
space.

	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 13635584 nr 4096
		extent data offset 0 nr 20480 ram 20480
		extent compression 1 (zlib)

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 20:15:25 +02:00
Qu Wenruo
d5321a0fa8 btrfs: add "0x" prefix for unsupported optional features
The following error message lack the "0x" obviously:

  cannot mount because of unsupported optional features (4000)

Add the prefix to make it less confusing. This can happen on older
kernels that try to mount a filesystem with newer features so it makes
sense to backport to older trees.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 20:15:25 +02:00
Filipe Manana
97bdf1a903 btrfs: do not account twice for inode ref when reserving metadata units
When reserving metadata units for creating an inode, we don't need to
reserve one extra unit for the inode ref item because when creating the
inode, at btrfs_create_new_inode(), we always insert the inode item and
the inode ref item in a single batch (a single btree insert operation,
and both ending up in the same leaf).

As we have accounted already one unit for the inode item, the extra unit
for the inode ref item is superfluous, it only makes us reserve more
metadata than necessary and often adding more reclaim pressure if we are
low on available metadata space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 20:15:25 +02:00
Naohiro Aota
aa9ffadfca btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
The block_group->alloc_offset is an offset from the start of the block
group. OTOH, the ->meta_write_pointer is an address in the logical
space. So, we should compare the alloc_offset shifted with the
block_group->start.

Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 20:15:25 +02:00
Filipe Manana
152555b39c btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.

However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).

Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).

So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.

Example scenario, on a box with 32G of RAM:

  $ btrfs subvolume create /mnt/sv1
  $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1

  $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1

  $ free -m
                 total        used        free      shared  buff/cache   available
  Mem:           31937         186       26866           0        4883       31297
  Swap:           8188           0        8188

  # After this we get less 4G of free memory.
  $ btrfs send /mnt/snap1 >/dev/null

  $ free -m
                 total        used        free      shared  buff/cache   available
  Mem:           31937         186       22814           0        8935       31297
  Swap:           8188           0        8188

The same, obviously, applies to an incremental send.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 20:14:54 +02:00
Trond Myklebust
71342db057 NFSv4.1: Enable access to the NFSv4.1 'dacl' and 'sacl' attributes
Enable access to the NFSv4 acl via the NFSv4.1 'dacl' and 'sacl'
attributes.
This allows the server to authenticate the DACL and the SACL operations
separately, since reading and/or editing the SACL is usually considered
to be a privileged operation.
It also allows the propagation of automatic inheritance information that
was not supported by the NFSv4.0 'acl' attribute.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 13:32:46 -04:00
Trond Myklebust
db145db021 NFSv4: Add encoders/decoders for the NFSv4.1 dacl and sacl attributes
Add the ability to set or retrieve the acl using the NFSv4.1 'dacl' and
'sacl' attributes to the NFSv4 xdr encoders/decoders.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 13:32:46 -04:00
Trond Myklebust
7b8b44eb77 NFSv4: Specify the type of ACL to cache
When caching a NFSv4 ACL, we want to specify whether we are caching an
NFSv4.0 type acl, the NFSv4.1 dacl or the NFSv4.1 sacl.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 13:32:46 -04:00
Trond Myklebust
6949493884 NFSv4: Don't hold the layoutget locks across multiple RPC calls
When doing layoutget as part of the open() compound, we have to be
careful to release the layout locks before we can call any further RPC
calls, such as setattr(). The reason is that those calls could trigger
a recall, which could deadlock.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:53:33 -04:00
Trond Myklebust
126966dded pNFS/files: Fall back to I/O through the MDS on non-fatal layout errors
Only report the error when the server is returning a fatal error, such
as ESTALE, EIO, etc...

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:53:33 -04:00
Trond Myklebust
c6fd3511c3 NFS: Further fixes to the writeback error handling
When we handle an error by redirtying the page, we're not corrupting the
mapping, so we don't want the error to be recorded in the mapping.
If the caller has specified a sync_mode of WB_SYNC_NONE, we can just
return AOP_WRITEPAGE_ACTIVATE. However if we're dealing with
WB_SYNC_ALL, we need to ensure that retries happen when the errors are
non-fatal.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 8fc75bed96 ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:53:33 -04:00
Trond Myklebust
3764a17e31 NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout
Commit 587f03deb6 caused pnfs_update_layout() to stop returning ENOMEM
when the memory allocation fails, and hence causes it to fall back to
trying to do I/O through the MDS. There is no guarantee that this will
fare any better. If we're failing the pNFS layout allocation, then we
should just redirty the page and retry later.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 587f03deb6 ("pnfs: refactor send_layoutget")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:53:33 -04:00
Trond Myklebust
452284407c NFS: Memory allocation failures are not server fatal errors
We need to filter out ENOMEM in nfs_error_is_fatal_on_server(), because
running out of memory on our client is not a server error.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 2dc23afffb ("NFS: ENOMEM should also be a fatal error.")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:53:33 -04:00
Jeffle Xu
ba73eadd23 erofs: scan devices from device table
When "-o device" mount option is not specified, scan the device table
and instantiate the devices if there's any in the device table. In this
case, the tag field of each device slot uniquely specifies a device.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220512055601.106109-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:21 +08:00
Xin Yin
d435d53228 erofs: change to use asynchronous io for fscache readpage/readahead
Use asynchronous io to read data from fscache may greatly improve IO
bandwidth for sequential buffered read scenario.

Change erofs_fscache_read_folios to erofs_fscache_read_folios_async,
and read data from fscache asynchronously.
Make .readpage()/.readahead() to use this new helper.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-23-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
[ Gao Xiang: minor styling changes. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:21 +08:00