The error code to check for replay is not just -EAGAIN. In some
cases, the send request or receive response may result in
network errors, which we're now mapping to -ECONNABORTED.
This change introduces a helper function which checks
whether the returned error is one of the above two errors.
All checks for replays will now use this helper.
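A minimal userspace sketch of such a helper follows. The name
is_replay_error() is hypothetical; this message does not name the
helper that was actually added.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper name. Returns true when rc is one of the two
 * error codes that should trigger a replay of the request.
 */
static bool is_replay_error(int rc)
{
	return rc == -EAGAIN || rc == -ECONNABORTED;
}
```

A caller would then test is_replay_error(rc) wherever it previously
compared rc against -EAGAIN alone.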
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When the network stack returns various errors, we today bubble
up the error to the user (in case of soft mounts).
This change translates all network errors except -EINTR and
-EAGAIN to -ECONNABORTED. A similar approach is taken when
we receive network errors when reading from the socket.
The change also forces the cifsd thread to reconnect during
its next activity.
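The translation can be sketched roughly as below. The helper name
map_net_error() is hypothetical, and the sketch assumes that every
negative return other than -EINTR and -EAGAIN counts as a network
error, which is a simplification.

```c
#include <errno.h>

/*
 * Hypothetical helper: collapse a negative errno from the network
 * stack into -ECONNABORTED. Success values, -EINTR and -EAGAIN are
 * passed through unchanged.
 */
static int map_net_error(int rc)
{
	if (rc >= 0 || rc == -EINTR || rc == -EAGAIN)
		return rc;
	return -ECONNABORTED;
}
```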
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cifs_pick_channel today just selects a channel based
on the policy of least loaded channel. However, it
does not take into account if the channel needs
reconnect. As a result, we can have failures in send
that can be completely avoided.
With this change, a channel that needs reconnect is no longer
a candidate for this selection.
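A simplified sketch of least-loaded selection that skips channels
needing reconnect. struct chan and pick_channel() are hypothetical
stand-ins, not the kernel's structures, and returning -1 when no
channel is usable is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified channel state. */
struct chan {
	unsigned int in_flight;	/* requests currently outstanding */
	bool need_reconnect;	/* channel must reconnect before use */
};

/*
 * Return the index of the least loaded channel that does not need
 * reconnect, or -1 if every channel needs reconnect.
 */
static int pick_channel(const struct chan *chans, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (chans[i].need_reconnect)
			continue;
		if (best < 0 || chans[i].in_flight < chans[best].in_flight)
			best = (int)i;
	}
	return best;
}
```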
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use cifsi->netfs_ctx.remote_i_size instead of cifsi->server_eof so that
netfslib can refer to it too.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Recent versions of Clang get confused about the possible size of the
"user" allocation, and CONFIG_FORTIFY_SOURCE ends up emitting a
warning[1]:
  repro.c:126:4: warning: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
    126 | __write_overflow_field(p_size_field, size);
        | ^
for this memset():
        int len;
        __le16 *user;
        ...
        len = ses->user_name ? strlen(ses->user_name) : 0;
        user = kmalloc(2 + (len * 2), GFP_KERNEL);
        ...
        if (len) {
                ...
        } else {
                memset(user, '\0', 2);
        }
While Clang works on this bug[2], switch to using a direct assignment,
which avoids memset() entirely. This both simplifies the code and silences
the false positive warning. (Making "len" size_t also silences the
warning, but the direct assignment seems better.)
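A userspace sketch of the direct-assignment form. alloc_user() is a
hypothetical wrapper, malloc() and uint16_t stand in for kmalloc() and
__le16, and the ASCII widening loop is only a placeholder for the
elided conversion in the non-empty branch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate room for a 16-bit-per-character copy of name plus a 2-byte
 * terminator. For an empty name, terminate with a direct assignment
 * rather than memset(user, '\0', 2).
 */
static uint16_t *alloc_user(const char *name)
{
	size_t len = name ? strlen(name) : 0;
	uint16_t *user = malloc(2 + len * 2);

	if (!user)
		return NULL;
	if (len) {
		/* Placeholder conversion: widen each byte to 16 bits. */
		for (size_t i = 0; i < len; i++)
			user[i] = (unsigned char)name[i];
		user[len] = 0;
	} else {
		*user = 0;
	}
	return user;
}
```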
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://github.com/ClangBuiltLinux/linux/issues/1966 [1]
Link: https://github.com/llvm/llvm-project/issues/77813 [2]
Cc: Steve French <sfrench@samba.org>
Cc: Paulo Alcantara <pc@manguebit.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: llvm@lists.linux.dev
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'trace-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing and eventfs fixes from Steven Rostedt:
- Fix histogram tracing_map insertion.
The tracing_map_insert copies the value into the elt variable and
then assigns the elt to the entry value. But it is possible that the
entry value becomes visible on other CPUs before the elt is fully
initialized. This is fixed by adding a wmb() between the
initialization of the elt variable and assigning it.
- Have eventfs directories have unique inode numbers.
Having them be all the same proved to be a failure as the 'find'
application will think that the directories are causing loops, as it
checks for directory loops via their inodes. Have the eventfs dir
entries get their inodes assigned when they are referenced and then
save them in the eventfs_inode structure.
* tag 'trace-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Save directory inodes in the eventfs_inode structure
tracing: Ensure visibility when inserting an element into tracing_map
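The tracing_map ordering issue described in the first item can be
illustrated in userspace with C11 atomics. This sketch uses a
release store and an acquire load in place of the kernel's wmb(), and
struct elt, struct entry, publish() and lookup() are hypothetical
stand-ins.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the map element and entry. */
struct elt {
	int key;
	int val;
};

struct entry {
	_Atomic(struct elt *) val;
};

/*
 * Fully initialize the element, then publish it with a release store
 * so that a reader which observes the pointer also observes the
 * initialized fields.
 */
static void publish(struct entry *e, struct elt *elt, int key, int val)
{
	elt->key = key;
	elt->val = val;
	atomic_store_explicit(&e->val, elt, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static struct elt *lookup(struct entry *e)
{
	return atomic_load_explicit(&e->val, memory_order_acquire);
}
```

Without the ordering, the pointer store could become visible before
the field stores, which is the window the fix closes.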
Kernel has its own official true/false definitions.
The defines aren't even used in this file.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The eventfs inodes and directories are allocated when referenced. But this
leaves the issue of keeping inode numbers consistent, because the number is
only saved in the inode structure itself. When the inode is no longer
referenced, it can be freed. When the file that the inode was representing
is referenced again, the inode is once again created, but the inode number
needs to be the same as it was before.
Just making the inode numbers the same for all files is fine, but that
does not work with directories. The find command will check for loops via
the inode number and having the same inode number for directories triggers:
# find /sys/kernel/tracing
find: File system loop detected;
'/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
'/sys/kernel/debug/tracing/events/initcall'.
[..]
Linus pointed out that the eventfs_inode structure ends with a single
32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
alignment. We can use this hole to store the inode number for the
eventfs_inode. All directories in eventfs are represented by an
eventfs_inode and that data structure can hold its inode number.
That last int was also purposely placed at the end of the structure to
prevent holes from within. Now that there's a 4 byte number to hold the
inode, both the inode number and the last integer can be moved up in the
structure for better cache locality, where the llist and rcu fields can be
moved to the end as they are only used when the eventfs_inode is being
deleted.
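The alignment hole can be demonstrated with two hypothetical layouts
(the field names are illustrative and are not those of eventfs_inode).
On a typical 64-bit ABI both structures have the same size, because
the new 32-bit field occupies what was tail padding.

```c
#include <stddef.h>

/* Layout ending in a single 32-bit int: 4 bytes of tail padding on LP64. */
struct before {
	void *entries;
	unsigned int nr_entries;
};

/* Same layout with a 32-bit inode number placed in the former hole. */
struct after {
	void *entries;
	unsigned int ino;
	unsigned int nr_entries;
};
```

Comparing sizeof(struct before) with sizeof(struct after) on the target
is a quick way to confirm that the extra field is free.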
Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 53c41052ba ("eventfs: Have the inodes all for files and directories all be the same")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
An opaque directory cannot have xwhiteouts, so instead of marking an
xwhiteouts directory with a new xattr, overload overlay.opaque xattr
for marking both opaque dir ('y') and xwhiteouts dir ('x').
This is more efficient as the overlay.opaque xattr is checked during
lookup of directory anyway.
This also prevents unnecessarily checking the xattr when reading a
directory without xwhiteouts, i.e. most of the time.
Note that the xwhiteouts marker is not checked on the upper layer and
on the last layer in lowerstack, where xwhiteouts are not expected.
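A sketch of how one xattr value can encode both states. The enum and
classify_opaque() are hypothetical; only the 'y' and 'x' values come
from the description above.

```c
#include <stddef.h>

/* Hypothetical classification of a directory by its overlay.opaque value. */
enum dir_mark {
	DIR_PLAIN,	/* xattr absent or value not recognized */
	DIR_OPAQUE,	/* value "y" */
	DIR_XWHITEOUTS,	/* value "x" */
};

/* Classify from the raw xattr value, which is not NUL-terminated. */
static enum dir_mark classify_opaque(const char *val, size_t len)
{
	if (!val || len != 1)
		return DIR_PLAIN;
	if (val[0] == 'y')
		return DIR_OPAQUE;
	if (val[0] == 'x')
		return DIR_XWHITEOUTS;
	return DIR_PLAIN;
}
```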
Fixes: bc8df7a3dc ("ovl: Add an alternative type of whiteout")
Cc: <stable@vger.kernel.org> # v6.7
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Tested-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
This reverts commit 1e7f6def8b.
It causes my machine to not even boot, and Klara Modin reports that the
cause is that small zstd-compressed files return garbage when read.
Reported-by: Klara Modin <klarasmodin@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/
Reported-and-bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Qu Wenruo <wqu@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove afs_dynroot_d_revalidate(): it is redundant, as all it does is
return 1, which is what the caller assumes when the op is not given.
Suggested-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
When afs does a lookup, it tries to use FS.InlineBulkStatus to preemptively
look up a bunch of files in the parent directory and cache this locally, on
the basis that we might want to look at them too (for example if someone
does an ls on a directory, they may want to then stat every file
listed).
FS.InlineBulkStatus can be considered a compound op with the normal abort
code applying to the compound as a whole. Each status fetch within the
compound is then given its own individual abort code - but assuming no
error that prevents the bulk fetch from returning, the compound result will
be 0, even if all the constituent status fetches failed.
At the conclusion of afs_do_lookup(), we should use the abort code from the
appropriate status to determine the error to return, if any - but instead
it is assumed that we were successful if the op as a whole succeeded and we
return an incompletely initialised inode, resulting in ENOENT, no matter
the actual reason. In the particular instance reported, a vnode with no
permission granted to be accessed is being given a UAEACCES abort code
which should be reported as EACCES, but is instead being reported as
ENOENT.
Fix this by abandoning the inode (which will be cleaned up with the op) if
file[1] has an abort code indicated and turn that abort code into an error
instead.
Whilst we're at it, add a tracepoint so that the abort codes of the
individual subrequests of FS.InlineBulkStatus can be logged. At the moment
only the container abort code can be 0.
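The intended decision can be sketched as follows. The abort code
values, struct file_status and both function names are hypothetical;
only the UAEACCES to EACCES mapping is taken from the text above.

```c
#include <errno.h>

/* Hypothetical abort code values; the real protocol constants differ. */
enum {
	ABORT_NONE = 0,
	ABORT_UAEACCES = 1001,
	ABORT_UAENOENT = 1002,
};

/* Hypothetical per-file status slot filled in by the bulk status reply. */
struct file_status {
	int abort_code;
};

/* Translate an individual abort code into a negative errno. */
static int abort_to_error(int abort_code)
{
	switch (abort_code) {
	case ABORT_NONE:
		return 0;
	case ABORT_UAEACCES:
		return -EACCES;
	case ABORT_UAENOENT:
		return -ENOENT;
	default:
		return -EIO;	/* placeholder for unmapped codes */
	}
}

/*
 * A successful compound is not enough: also consult the abort code of
 * the individual status for the file being looked up.
 */
static int lookup_result(int op_error, const struct file_status *st)
{
	if (op_error)
		return op_error;
	return abort_to_error(st->abort_code);
}
```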
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
There appears to be a race between silly-rename files being created/removed
and various userspace tools iterating over the contents of a directory,
leading to such errors as:
find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory
tar: ./include/linux/greybus/.__afs3C95: File removed before we read it
when building a kernel.
Fix afs_readdir() so that it doesn't return .__afsXXXX silly-rename files
to userspace. This doesn't stop them being looked up directly by name as
we need to be able to look them up from within the kernel as part of the
silly-rename algorithm.
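A sketch of the filtering idea, assuming that a ".__afs" name prefix is
what identifies a silly-rename file. is_silly_name() and
filter_names() are hypothetical helpers, not the functions in
afs_readdir().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if name starts with the silly-rename prefix ".__afs". */
static bool is_silly_name(const char *name)
{
	static const char prefix[] = ".__afs";

	return strncmp(name, prefix, sizeof(prefix) - 1) == 0;
}

/*
 * Copy the names that should be reported to userspace into out and
 * return how many were kept. Direct lookup by name is unaffected.
 */
static size_t filter_names(const char *const *in, size_t n, const char **out)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++)
		if (!is_silly_name(in[i]))
			out[kept++] = in[i];
	return kept;
}
```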
Fixes: 79ddbfa500 ("afs: Implement sillyrename for unlink and rename")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cachefiles_ondemand_init_object() as called from cachefiles_open_file() and
cachefiles_create_tmpfile() does not check if object->ondemand is set
before dereferencing it, leading to an oops something like:
RIP: 0010:cachefiles_ondemand_init_object+0x9/0x41
...
Call Trace:
<TASK>
cachefiles_open_file+0xc9/0x187
cachefiles_lookup_cookie+0x122/0x2be
fscache_cookie_state_machine+0xbe/0x32b
fscache_cookie_worker+0x1f/0x2d
process_one_work+0x136/0x208
process_scheduled_works+0x3a/0x41
worker_thread+0x1a2/0x1f6
kthread+0xca/0xd2
ret_from_fork+0x21/0x33
Fix this by making cachefiles_ondemand_init_object() return immediately if
object->ondemand is NULL.
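The shape of the guard, as a userspace sketch with hypothetical
stand-in structures; the body after the check is a placeholder for the
real initialisation work.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct ondemand_info {
	int object_id;
};

struct object {
	struct ondemand_info *ondemand;	/* NULL when not in on-demand mode */
};

static int ondemand_init_object(struct object *obj)
{
	/* Nothing to initialise, and nothing safe to dereference. */
	if (!obj->ondemand)
		return 0;

	obj->ondemand->object_id = 1;	/* placeholder for the real work */
	return 0;
}
```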
Fixes: 3c5ecfe16e ("cachefiles: extract ondemand info field from cachefiles_object")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Gao Xiang <xiang@kernel.org>
cc: Chao Yu <chao@kernel.org>
cc: Yue Hu <huyue2@coolpad.com>
cc: Jeffle Xu <jefflexu@linux.alibaba.com>
cc: linux-erofs@lists.ozlabs.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
The netfs_grab_folio_for_write() function doesn't return NULL, it returns
error pointers. Update the check accordingly.
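For illustration, a userspace emulation of the error-pointer
convention. err_ptr(), ptr_err() and is_err() mimic the kernel's
ERR_PTR(), PTR_ERR() and IS_ERR(); grab_thing() and use_thing() are
hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno from an error pointer. */
static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the top MAX_ERRNO addresses, i.e. encodes an errno. */
static bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical function that never returns NULL, only error pointers. */
static void *grab_thing(bool fail)
{
	static int thing;

	return fail ? err_ptr(-ENOMEM) : (void *)&thing;
}

static int use_thing(bool fail)
{
	void *t = grab_thing(fail);

	if (is_err(t))		/* a "!t" check would never trigger */
		return (int)ptr_err(t);
	return 0;
}
```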
Fixes: c38f4e96e6 ("netfs: Provide func to copy data to pagecache for buffered write")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/29fb1310-8e2d-47ba-b68d-40354eb7b896@moroto.mountain/
This function dereferences "cache" and then checks if it's
IS_ERR_OR_NULL(). Check first, then dereference.
Fixes: 9549332df4 ("fscache: Implement cache registration")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/e84bc740-3502-4f16-982a-a40d5676615c@moroto.mountain/ # v2
Filesystems should use folio->index and folio->mapping, instead of
folio_index(folio), folio_mapping() and folio_file_mapping() since
they know that it's in the pagecache.
Change this automagically with:
perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/smb/client/*.c
perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/smb/client/*.c
perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/smb/client/*.c
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Filesystems should use folio->index and folio->mapping, instead of
folio_index(folio), folio_mapping() and folio_file_mapping() since
they know that it's in the pagecache.
Change this automagically with:
perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/afs/*.c
perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/afs/*.c
perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/afs/*.c
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Filesystems should use folio->index and folio->mapping, instead of
folio_index(folio), folio_mapping() and folio_file_mapping() since
they know that it's in the pagecache.
Change this automagically with:
perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/netfs/*.c
perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/netfs/*.c
perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/netfs/*.c
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
Merge tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- zoned mode fixes:
- fix slowdown when writing large file sequentially by looking up
block groups with enough space faster
- locking fixes when activating a zone
- new mount API fixes:
- preserve mount options for a ro/rw mount of the same subvolume
- scrub fixes:
- fix use-after-free in case the chunk length is not aligned to
64K, this does not happen normally but has been reported on
images converted from ext4
- similar alignment check was missing with raid-stripe-tree
- subvolume deletion fixes:
- prevent calling ioctl on already deleted subvolume
- properly track the flag marking a deleted subvolume
- in subpage mode, fix decompression of an inline extent (zlib, lzo,
zstd)
- fix crash when starting writeback on a folio, after integration with
recent MM changes this needs to be started conditionally
- reject unknown flags in defrag ioctl
- error handling, API fixes, minor warning fixes
* tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: limit RST scrub to chunk boundary
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
btrfs: don't unconditionally call folio_start_writeback in subpage
btrfs: use the original mount's mount options for the legacy reconfigure
btrfs: don't warn if discard range is not aligned to sector
btrfs: tree-checker: fix inline ref size in error messages
btrfs: zstd: fix and simplify the inline extent decompression
btrfs: lzo: fix and simplify the inline extent decompression
btrfs: zlib: fix and simplify the inline extent decompression
btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
btrfs: zoned: fix lock ordering in btrfs_zone_activate()
btrfs: fix unbalanced unlock of mapping_tree_lock
btrfs: ref-verify: free ref cache before clearing mount opt
btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()
btrfs: zoned: optimize hint byte for zoned allocator
btrfs: zoned: factor out prepare_allocation_zoned()
If get_unused_fd_flags() fails, the error handling is incomplete because
bprm->cred is already set to NULL, and therefore free_bprm will not
unlock the cred_guard_mutex. Note there are two error conditions which
end up here, one before and one after bprm->cred is cleared.
Fixes: b8a61c9e7b ("exec: Generic execfd support")
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/AS8P193MB128517ADB5EFF29E04389EDAE4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Consolidate the calls to allow_write_access()/fput() into a single
place, since we repeat this code pattern. Add comments around the
callers for the details on it.
Link: https://lore.kernel.org/r/202209161637.9EDAF6B18@keescook
Signed-off-by: Kees Cook <keescook@chromium.org>
REQ_OP_FLUSH is only for internal use in the blk-mq and request based
drivers. File systems and other block layer consumers must use
REQ_OP_WRITE | REQ_PREFLUSH as documented in
Documentation/block/writeback_cache_control.rst.
While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
get the proper flush state machine handling, and completely fails
for any bio based drivers, including all the stacking drivers. The
block layer will also get a check in 6.8 to reject this use case
entirely.
[Note: completely untested, but as this never got fixed since the
original bug report in November:
https://bugzilla.kernel.org/show_bug.cgi?id=218184
and the discussion in December:
https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/
this seems to be the best way to force it]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recently xfs/513 started failing on my test machines testing "-o
ro,norecovery" mount options. This was being emitted in dmesg:
[ 9906.932724] XFS (pmem0): no-recovery mounts must be read-only.
Turns out, readonly mounts with the fsopen()/fsconfig() mount API
have been busted since day zero. It's only taken 5 years for debian
unstable to start using this "new" mount API, and shortly after this
I noticed xfs/513 had started to fail as per above.
The syscall trace is:
fsopen("xfs", FSOPEN_CLOEXEC) = 3
mount_setattr(-1, NULL, 0, NULL, 0) = -1 EINVAL (Invalid argument)
.....
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/pmem0", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "norecovery", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 EINVAL (Invalid argument)
close(3) = 0
Showing that the actual mount instantiation (FSCONFIG_CMD_CREATE) is
what threw out the error.
During mount instantiation, we call xfs_fs_validate_params() which
does:
	/* No recovery flag requires a read-only mount */
	if (xfs_has_norecovery(mp) && !xfs_is_readonly(mp)) {
		xfs_warn(mp, "no-recovery mounts must be read-only.");
		return -EINVAL;
	}
and xfs_is_readonly() checks internal mount flags for read only
state. This state is set in xfs_init_fs_context() from the
context superblock flag state:
	/*
	 * Copy binary VFS mount flags we are interested in.
	 */
	if (fc->sb_flags & SB_RDONLY)
		set_bit(XFS_OPSTATE_READONLY, &mp->m_opstate);
With the old mount API, all of the VFS specific superblock flags
had already been parsed and set before xfs_init_fs_context() is
called, so this all works fine.
However, in the brave new fsopen/fsconfig world,
xfs_init_fs_context() is called from fsopen() context, before any
VFS superblock flags have been set or parsed. Hence if we use fsopen(),
the internal XFS readonly state is *never set*. Hence anything that
depends on xfs_is_readonly() actually returning true for read only
mounts is broken if fsopen() has been used to mount the filesystem.
Fix this by moving this internal state initialisation to
xfs_fs_fill_super() before we attempt to validate the parameters
that have been set prior to the FSCONFIG_CMD_CREATE call being made.
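The ordering problem can be modelled in userspace as below. All names
here (FLAG_RDONLY, struct fs_ctx, struct mount_state, simulate_mount())
are hypothetical; the sketch only mirrors the sequence fsopen(), "ro",
"norecovery", create.

```c
#include <errno.h>
#include <stdbool.h>

#define FLAG_RDONLY 1u	/* hypothetical stand-in for SB_RDONLY */

struct fs_ctx {
	unsigned int sb_flags;
};

struct mount_state {
	bool readonly;
	bool norecovery;
};

/* Mirrors the validation rule: norecovery requires read-only. */
static int validate(const struct mount_state *mp)
{
	if (mp->norecovery && !mp->readonly)
		return -EINVAL;
	return 0;
}

/* Simulate fsopen(), then the "ro" and "norecovery" flags, then create. */
static int simulate_mount(bool copy_flags_at_init)
{
	struct fs_ctx fc = { 0 };
	struct mount_state mp = { false, false };

	if (copy_flags_at_init)		/* old placement: flags not set yet */
		mp.readonly = (fc.sb_flags & FLAG_RDONLY) != 0;

	fc.sb_flags |= FLAG_RDONLY;	/* "ro" is parsed after fsopen() */
	mp.norecovery = true;		/* "norecovery" likewise */

	if (!copy_flags_at_init)	/* new placement: at create time */
		mp.readonly = (fc.sb_flags & FLAG_RDONLY) != 0;

	return validate(&mp);
}
```

In this model simulate_mount(true) fails with -EINVAL, as in the trace
above, while simulate_mount(false) succeeds.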
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Fixes: 73e5fff98b ("xfs: switch to use the new mount-api")
cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Fix some kernel-doc comments to silence the warnings:
fs/smb/server/transport_tcp.c:374: warning: Function parameter or struct member 'max_retries' not described in 'ksmbd_tcp_read'
fs/smb/server/transport_tcp.c:423: warning: Function parameter or struct member 'iface' not described in 'create_socket'
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"Some fixes, Some refactoring, some minor features:
- Assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on
multithreaded workloads: fstests (in addition to ktest tests) are
now checking slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes"
* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
bcachefs: Improve inode_to_text()
bcachefs: logged_ops_format.h
bcachefs: reflink_format.h
bcachefs; extents_format.h
bcachefs: ec_format.h
bcachefs: subvolume_format.h
bcachefs: snapshot_format.h
bcachefs: alloc_background_format.h
bcachefs: xattr_format.h
bcachefs: dirent_format.h
bcachefs: inode_format.h
bcachefs; quota_format.h
bcachefs: sb-counters_format.h
bcachefs: counters.c -> sb-counters.c
bcachefs: comment bch_subvolume
bcachefs: bch_snapshot::btime
bcachefs: add missing __GFP_NOWARN
bcachefs: opts->compression can now also be applied in the background
bcachefs: Prep work for variable size btree node buffers
bcachefs: grab s_umount only if snapshotting
...
Add a field to bch_snapshot for creation time; this will be important
when we start exposing the snapshot tree to userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.
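The fallback can be sketched as below, assuming a value of 0 means
"not set". struct io_opts_sketch and effective_background_compression()
are hypothetical names.

```c
/* Hypothetical option block; 0 means the option is not set. */
struct io_opts_sketch {
	unsigned int compression;
	unsigned int background_compression;
};

/* Use background_compression when set, otherwise fall back to compression. */
static unsigned int
effective_background_compression(const struct io_opts_sketch *o)
{
	return o->background_compression ? o->background_compression
					 : o->compression;
}
```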
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analogous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.
Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When issuing discards, we may need to flush the journal if there are too
many buckets that can't be discarded until a journal flush.
But the heuristic was bad; we should be comparing the number of buckets
that need a journal flush against the number of free buckets, not the number
of buckets we saw.
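A sketch contrasting the two comparisons. The factor of 2 and both
function names are assumptions made for illustration; only the choice
of what to compare against comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old heuristic: compare against the number of buckets seen. */
static bool should_flush_old(uint64_t need_flush, uint64_t seen)
{
	return need_flush * 2 > seen;
}

/* New heuristic: compare against the number of free buckets. */
static bool should_flush_new(uint64_t need_flush, uint64_t free_buckets)
{
	return need_flush * 2 > free_buckets;
}
```

With 10 buckets waiting on a flush, 1000 buckets seen and only 15 free,
the old check declines to flush while the new one requests it.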
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also print out the data_opts, so that we can see what specifically is
being done to an extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug with rebalance IOs getting stuck with reads completed,
but writes never being issued.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Drop the loop in bch2_kthread_io_clock_wait(): this allows the code
that uses it to be woken up for other reasons, and fixes a bug where
rebalance wouldn't wake up when a scan was requested.
This raises the possibility of spurious wakeups, but callers should
always be able to handle that reasonably well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't have to take locks in any particular ordering - we'll make
forward progress just fine - but if we try to stick to an ordering, it
can help to avoid excessive would_deadlock transaction restarts.
This tweaks the reflink path to take extents btree locks in the right
order.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The disk space accounting rewrite is splitting out accounting for each
replicas set - those are moving to btree keys, instead of percpu
counters.
This breaks bch2_trans_fs_usage_apply() up, splitting out the part we
will still need.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split out base filesystem usage into its own type; prep work for
breaking up bch2_trans_fs_usage_apply().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we added logging in the write path to ensure that any
unexpected errors getting reported to userspace have a log message; but
BCH_WRITE_ALLOC_NOWAIT is a special case, it's used for promotes where
errors are expected and not reported out to userspace - so we need to
silence those.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Merge tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
"Various smb client fixes, including multichannel and for SMB3.1.1
POSIX extensions:
- debugging improvement (display start time for stats)
- two reparse point handling fixes
- various multichannel improvements and fixes
- SMB3.1.1 POSIX extensions open/create parsing fix
- retry (reconnect) improvement including new retrans mount parm, and
handling of two additional return codes that need to be retried on
- two minor cleanup patches and another to remove duplicate query
info code
- two documentation cleanups, and one reviewer email correction"
* tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update iface_last_update on each query-and-update
cifs: handle servers that still advertise multichannel after disabling
cifs: new mount option called retrans
cifs: reschedule periodic query for server interfaces
smb: client: don't clobber ->i_rdev from cached reparse points
smb: client: get rid of smb311_posix_query_path_info()
smb: client: parse owner/group when creating reparse points
smb: client: fix parsing of SMB3.1.1 POSIX create context
cifs: update known bugs mentioned in kernel docs for cifs
cifs: new nt status codes from MS-SMB2
cifs: pick channel for tcon and tdis
cifs: open_cached_dir should not rely on primary channel
smb3: minor documentation updates
Update MAINTAINERS email address
cifs: minor comment cleanup
smb3: show beginning time for per share stats
cifs: remove redundant variable tcon_exist
Merge tag 'strlcpy-removal-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull strlcpy removal from Kees Cook:
"As promised, this is 'part 2' of the hardening tree, late in -rc1 now
that all the other trees with strlcpy() removals have landed. One new
user appeared (in bcachefs) but was a trivial refactor. The kernel is
now free of the strlcpy() API!
- Removal of the final (very recent) user of strlcpy() (in bcachefs)
- Remove the strlcpy() API. Long live strscpy()"
* tag 'strlcpy-removal-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
string: Remove strlcpy()
bcachefs: Replace strlcpy() with strscpy()
Merge tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Assorted CephFS fixes and cleanups with nothing standing out"
* tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client:
ceph: get rid of passing callbacks in __dentry_leases_walk()
ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
ceph: remove duplicated code in ceph_netfs_issue_read()
ceph: send oldest_client_tid when renewing caps
ceph: rename create_session_open_msg() to create_session_full_msg()
ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
ceph: fix deadlock or deadcode of misusing dget()
ceph: try to allocate a smaller extent map for sparse read
libceph: remove MAX_EXTENTS check for sparse reads
ceph: reinitialize mds feature bit even when session in open
ceph: skip reconnecting if MDS is not ready
Merge tag '6.8-rc-smb-server-fixes-part2' of git://git.samba.org/ksmbd
Pull more smb server updates from Steve French:
- Fix for incorrect oplock break on directories when leases disabled
- UAF fix for race between create and destroy of tcp connection
- Important session setup SPNEGO fix
- Update ksmbd feature status summary
* tag '6.8-rc-smb-server-fixes-part2' of git://git.samba.org/ksmbd:
ksmbd: only v2 leases handle the directory
ksmbd: fix UAF issue in ksmbd_tcp_new_connection()
ksmbd: validate mech token in session setup
ksmbd: update feature status in documentation
Merge tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This extends the netfs helper library that network filesystems can use
to replace their own implementations. Both afs and 9p are ported. cifs
is ready as well but the patches are way bigger and will be routed
separately once this is merged. That will remove lots of code as well.
The overall goal is to get high-level I/O and knowledge of the page
cache out of the filesystem drivers. This includes knowledge about
the existence of pages and folios.
The pull request converts afs and 9p. This removes about 800 lines of
code from afs and 300 from 9p. For 9p it is now possible to do writes
in larger than a page chunks. Additionally, multipage folio support
can be turned on for 9p. Separate patches exist for cifs removing
another 2000+ lines. I've included detailed information in the
individual pulls I took.
Summary:
- Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
calls to prevent these from happening at the same time.
- Support for direct and unbuffered I/O.
- Support for write-through caching in the page cache.
- O_*SYNC and RWF_*SYNC writes use write-through rather than writing
to the page cache and then flushing afterwards.
- Support for write-streaming.
- Support for write grouping.
- Skip reads for which the server could only return zeros or EOF.
- The fscache module is now part of the netfs library and the
corresponding maintainer entry is updated.
- Some helpers from the fscache subsystem are renamed to mark them as
belonging to the netfs library.
- Follow-up fixes for the netfs library.
- Follow-up fixes for the 9p conversion"
* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
netfs: Fix wrong #ifdef hiding wait
cachefiles: Fix signed/unsigned mixup
netfs: Fix the loop that unmarks folios after writing to the cache
netfs: Fix interaction between write-streaming and cachefiles culling
netfs: Count DIO writes
netfs: Mark netfs_unbuffered_write_iter_locked() static
netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
netfs: Rearrange netfs_io_subrequest to put request pointer first
9p: Use length of data written to the server in preference to error
9p: Do a couple of cleanups
9p: Fix initialisation of netfs_inode for 9p
cachefiles: Fix __cachefiles_prepare_write()
9p: Use netfslib read/write_iter
afs: Use the netfs write helpers
netfs: Export the netfs_sreq tracepoint
netfs: Optimise away reads above the point at which there can be no data
netfs: Implement a write-through caching option
netfs: Provide a launder_folio implementation
netfs: Provide a writepages implementation
netfs, cachefiles: Pass upper bound length to allow expansion
...
iface_last_update was an unused field when it was introduced.
Later, when we had periodic update of server interface list,
this field was used regularly to decide when to update next.
However, with the new logic of updating the interfaces, it
becomes crucial that this field be updated whenever
parse_server_interfaces runs successfully.
This change updates this field when the server does not support
querying interfaces, so that we do not query the interfaces
repeatedly. It also updates the field when the function reaches
the end.
Fixes: aa45dadd34 ("cifs: change iface_list from array to sorted linked list")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some servers like Azure SMB servers always advertise multichannel
capability in server capabilities list. Such servers return error
STATUS_NOT_IMPLEMENTED for ioctl calls to query server interfaces,
and expect clients to consider that as a sign that they do not support
multichannel.
We already handled this at mount time. Soon after the tree connect,
we query server interfaces. And when server returned STATUS_NOT_IMPLEMENTED,
we kept interface list as empty. When cifs_try_adding_channels gets
called, it would not find any interfaces, so will not add channels.
For the case where an active multichannel mount exists, and multichannel
is disabled by such a server, this change will now allow the client
to disable secondary channels on the mount. It will check the return
status of query server interfaces call soon after a tree reconnect.
If the return status is EOPNOTSUPP, then rather than checking whether to
add more channels, we'll disable the secondary channels.
For better code reuse, this change also moves the common code for
disabling multichannel to a helper function.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We have several places in the code where we treat the
error -EAGAIN very differently. Some code paths retry an
arbitrary number of times.
This change introduces a new mount option named "retrans", so
that all these handlers of -EAGAIN can retry a fixed
number of times. This applies only to soft mounts.
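The bounded retry described above can be sketched in user-space C as
follows. This is only an illustration under stated assumptions: struct
mount_opts, send_with_retrans() and always_eagain() are hypothetical
names, not the cifs structures or helpers, and the hard-mount behaviour
(retry until the error changes) is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mount state; not a cifs structure. */
struct mount_opts {
	bool soft;             /* soft mount: give up after 'retrans' retries */
	unsigned int retrans;  /* maximum number of retries on -EAGAIN */
};

typedef int (*request_fn)(void *ctx);

/*
 * Issue a request and retry while it returns -EAGAIN. On a soft mount the
 * number of retries is capped at opts->retrans. On a hard mount (assumed
 * here) the loop keeps retrying until the request returns something else.
 */
static int send_with_retrans(const struct mount_opts *opts,
			     request_fn send, void *ctx)
{
	unsigned int retries = 0;
	int rc;

	assert(opts && send);
	for (;;) {
		rc = send(ctx);
		if (rc != -EAGAIN)
			return rc;
		if (opts->soft && retries++ >= opts->retrans)
			return rc;
	}
}

/* Example request that always asks to be retried and counts its calls. */
static int always_eagain(void *ctx)
{
	unsigned int *calls = ctx;

	(*calls)++;
	return -EAGAIN;
}
```

With retrans=3 on a soft mount, the request is issued once and then
retried three times before -EAGAIN is returned to the caller.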
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Today, we schedule a periodic query for server interfaces
once every 10 minutes once a tree connection has been
established. A recent change to handle disabling of
multichannel disabled this delayed work.
This change re-enables it following a reconnect, provided
the server advertises multichannel.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Don't clobber ->i_rdev from valid reparse inodes over readdir(2) as it
can't be provided by query dir responses.
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge smb311_posix_query_path_info into ->query_path_info() to get rid
of duplicate code.
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Parse owner/group when creating special files and symlinks under
SMB3.1.1 POSIX mounts.
Move the parsing of owner/group to smb2_compound_op() so we don't have
to duplicate it in both smb2_get_reparse_inode() and
smb311_posix_query_path_info().
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The data offset for the SMB3.1.1 POSIX create context will always be
8-byte aligned so having the check 'noff + nlen >= doff' in
smb2_parse_contexts() is wrong as it will lead to -EINVAL because noff
+ nlen == doff.
Fix the sanity check to correctly handle aligned create context data.
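The boundary condition can be illustrated with a small user-space
sketch. It assumes the fix amounts to using '>' instead of '>=' in the
comparison; struct ctx_layout and check_name_before_data() are
hypothetical, simplified stand-ins and not the smb2_parse_contexts()
code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical, simplified create-context layout: only the three fields
 * the check needs. noff/nlen locate the name, doff locates the data.
 */
struct ctx_layout {
	uint16_t noff;  /* name offset */
	uint16_t nlen;  /* name length */
	uint16_t doff;  /* data offset (8-byte aligned for POSIX contexts) */
};

/*
 * Reject only a name that runs past the start of the data. A name that
 * ends exactly where the data begins (noff + nlen == doff) is valid.
 */
static int check_name_before_data(const struct ctx_layout *c)
{
	assert(c);
	if ((uint32_t)c->noff + c->nlen > c->doff)
		return -EINVAL;
	return 0;
}
```

For example, a 16-byte name at offset 16 followed directly by data at
offset 32 passes this check, whereas the old '>=' comparison would have
rejected it.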
Fixes: af1689a9b7 ("smb: client: fix potential OOBs in smb2_parse_contexts()")
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
MS-SMB2 spec has introduced two new status codes,
STATUS_SERVER_UNAVAILABLE and STATUS_FILE_NOT_AVAILABLE
which are to be treated as retryable errors.
This change adds these to the available mappings and
maps them to Linux errno EAGAIN.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Today, the tree connect and disconnect requests are
sent on the primary channel only. However, the new
multichannel logic allows the session to remain active
as long as at least one of the channels is alive. So a tree
connect can now be triggered during a reconnect on any of
its channels.
This change makes the tcon and tdis calls pick an
active channel instead of the first one.
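A minimal sketch of such a selection, in user-space C. The names
struct channel and pick_active_channel() are hypothetical, and both the
least-loaded policy and the fallback to index 0 (standing in for the
primary channel) are assumptions made for illustration, not a copy of
cifs_pick_channel().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-channel state; not the cifs channel structure. */
struct channel {
	bool needs_reconnect;    /* channel is down and must reconnect */
	unsigned int in_flight;  /* requests currently outstanding */
};

/*
 * Return the index of the least-loaded channel that does not need a
 * reconnect. If every channel needs a reconnect, fall back to index 0.
 */
static size_t pick_active_channel(const struct channel *chans, size_t n)
{
	size_t best = 0;
	bool found = false;

	assert(chans && n > 0);
	for (size_t i = 0; i < n; i++) {
		if (chans[i].needs_reconnect)
			continue;
		if (!found || chans[i].in_flight < chans[best].in_flight) {
			best = i;
			found = true;
		}
	}
	return best;
}
```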
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
open_cached_dir today selects ses->server, a.k.a. the primary channel,
to send requests. When multichannel is used, the primary
channel may be down. So it does not make sense to rely only
on that channel.
This fix makes this function pick a channel with the standard
helper function cifs_pick_channel.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'erofs-for-6.8-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix a "BUG: kernel NULL pointer dereference" issue due to
inconsistent on-disk indices of compressed inodes against
per-sb `available_compr_algs` generated by Syzkaller
- Don't use certain unnecessary folio_*() helpers if the folio
type (page cache) is known
* tag 'erofs-for-6.8-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: Don't use certain unnecessary folio_*() functions
erofs: fix inconsistent per-file compression format
Merge tag 'eventfs-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull eventfs updates from Steven Rostedt:
- Remove "lookup" parameter of create_dir_dentry() and
create_file_dentry(). These functions were called by lookup and the
readdir logic, where readdir needed it to up the ref count of the
dentry but the lookup did not. A "lookup" parameter was passed in to
tell it what to do, but this complicated the code. It is better to
just always up the ref count and require the caller to decrement it,
even for lookup.
- Modify the .iterate_shared callback to not use the dcache_readdir()
logic and just handle what gets displayed by that one function. This
removes the need for eventfs to hijack the file->private_data from
the dcache_readdir() "cursor" pointer, and makes the code a bit more
sane
- Use the root and instance inodes for default ownership. Instead of
walking the dentry tree and updating each dentry gid, use the
getattr(), setattr() and permission() callbacks to set the ownership
and permissions using the root or instance as the default
- Some other optimizations with the eventfs iterate_shared logic
- Hard-code the inodes for eventfs to the same number for files, and
the same number for directories
- Have getdent() not create dentries/inodes in iterate_shared() as now
it has hard-coded inode numbers
- Use kcalloc() instead of kzalloc() on a list of elements
- Fix seq_buf warning and make static work properly.
* tag 'eventfs-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
seq_buf: Make DECLARE_SEQ_BUF() usable
eventfs: Use kcalloc() instead of kzalloc()
eventfs: Do not create dentries nor inodes in iterate_shared
eventfs: Have the inodes all for files and directories all be the same
eventfs: Shortcut eventfs_iterate() by skipping entries already read
eventfs: Read ei->entries before ei->children in eventfs_iterate()
eventfs: Do ctx->pos update for all iterations in eventfs_iterate()
eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
tracefs/eventfs: Use root and instance inodes as default ownership
eventfs: Stop using dcache_readdir() for getdents()
eventfs: Remove "lookup" parameter from create_dir/file_dentry()
[BUG]
If there is an extent beyond chunk boundary, currently RST scrub would
error out.
[CAUSE]
In scrub_submit_extent_sector_read(), we completely rely on
extent_sector_bitmap, which is populated using extent tree.
The extent tree can be corrupted such that there is an extent item
beyond a chunk.
In that case, RST scrub would fail and error out.
[FIX]
In addition to using extent_sector_bitmap, also limit the read to the
chunk boundary.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that, on an ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepairable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only reads the chunk range
This is done by calculating the real number of sectors we need to
read, and adding them sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
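The sector calculation for the chunk layout above can be sketched as
follows. This is a user-space illustration only: sectors_within_chunk()
and min_u64() are hypothetical helpers, and the 64K stripe length is
taken from the stripe_len value in the chunk item shown earlier.

```c
#include <assert.h>
#include <stdint.h>

#define STRIPE_LEN 65536u  /* 64K, matching stripe_len in the chunk item */

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Number of sectors of the stripe starting at stripe_logical that lie
 * inside the chunk [chunk_start, chunk_start + chunk_len).
 */
static unsigned int sectors_within_chunk(uint64_t stripe_logical,
					 uint64_t chunk_start,
					 uint64_t chunk_len,
					 uint32_t sectorsize)
{
	uint64_t chunk_end = chunk_start + chunk_len;

	assert(sectorsize > 0);
	assert(stripe_logical >= chunk_start && stripe_logical < chunk_end);
	return (unsigned int)(min_u64(STRIPE_LEN, chunk_end - stripe_logical) /
			      sectorsize);
}
```

For the 86016-byte chunk at 2214658048 with 4K sectors, the first stripe
covers 16 sectors, while the second stripe (at 2214723584) covers only
the remaining 20480 bytes, i.e. 5 sectors, and stops at the chunk end
2214744064.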
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the normal case we check if a page is under writeback and skip it
before we attempt to begin writeback.
The exception is subpage metadata writes, where we know we don't have an
eb under writeback and we're doing it one eb at a time. Since
b5612c3686 ("mm: return void from folio_start_writeback() and related
functions") we now will BUG_ON() if we call folio_start_writeback()
on a folio that's already under writeback. Previously
folio_start_writeback() would bail if writeback was already started.
Fix this in the subpage code by checking if we have writeback set and
skipping it if we do. This fixes the panic we were seeing on subpage.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs/330, which tests our old trick to allow
mount -o ro,subvol=/x /dev/sda1 /foo
mount -o rw,subvol=/y /dev/sda1 /bar
fails on the block group tree. This is because we aren't preserving the
mount options for what is essentially a remount, and thus we're ending
up without the FREE_SPACE_TREE mount option, which triggers our free
space tree delete codepath. This isn't possible with the block group
tree and thus it falls over.
Fix this by making sure we copy the existing mount options for the
existing fs mount over in this case.
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a warning in btrfs_issue_discard() when the range is not aligned
to 512 bytes, originally added in 4d89d377bb ("btrfs:
btrfs_issue_discard ensure offset/length are aligned to sector
boundaries"). We can't do sub-sector writes anyway so the adjustment is
the only thing that we can do and the warning is unnecessary.
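The adjustment can be pictured with the following user-space sketch,
which shrinks a byte range inward to 512-byte boundaries. The helper
name align_discard_range() is hypothetical, and the code illustrates
the idea rather than reproducing btrfs_issue_discard().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Shrink [start, start + len) inward to 512-byte boundaries. Returns
 * false when no whole sector remains, in which case there is nothing
 * to discard and the outputs are left untouched.
 */
static bool align_discard_range(uint64_t start, uint64_t len,
				uint64_t *aligned_start, uint64_t *aligned_len)
{
	uint64_t end = start + len;
	/* Round the start up and the end down to a sector boundary. */
	uint64_t s = (start + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
	uint64_t e = end / SECTOR_SIZE * SECTOR_SIZE;

	assert(aligned_start && aligned_len);
	if (s >= e)
		return false;
	*aligned_start = s;
	*aligned_len = e - s;
	return true;
}
```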
CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4a4f1eba14eb5c3417d1@syzkaller.appspotmail.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error message should accurately reflect the size rather than the
type.
Fixes: f82d1c7ca8 ("btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
Then trying to reflink that extent in an aarch64 system with 64K page
size, the reflink would just fail:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In zstd_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is copied from other decompression code).
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector, thus we should not
need to do any buffer switch.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modifications:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize rather than PAGE_SIZE to properly initialize the
output buffer size
- Use correct destination offset inside the destination page
- Simplify the main loop
Since the input/output buffer should never switch, we only need one
zstd_decompress_stream() call.
- Consider early end as an error
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And results in the correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 2 (lzo)
Then trying to reflink that extent in an aarch64 system with 64K page
size, the reflink would just fail:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In lzo_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is from the very beginning of btrfs).
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector.
Note: The above assumption is only not true if we're going to support
multi-page sectorsize.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modifications:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize rather than PAGE_SIZE to properly initialize the
output buffer size
- Use correct destination offset inside the destination page
- Use memcpy_to_page() to copy the contents to the destination page
- Use memzero_page() to zero out the trailing part
- Consider early end as an error
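The copy step in the list above can be sketched in user-space C, with
memcpy() and memset() standing in for memcpy_to_page() and
memzero_page(). The helper name copy_decompressed() and its parameters
are hypothetical; this is an illustration under those assumptions, not
the btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy out_len decompressed bytes into the destination page at offset
 * dest_pgoff. destlen is the number of bytes the caller expected. A
 * short result ("early end") is reported as -EIO and the missing tail
 * is zeroed so that no stale data is left in the page.
 */
static int copy_decompressed(unsigned char *page, size_t page_size,
			     size_t dest_pgoff, const unsigned char *buf,
			     size_t out_len, size_t destlen,
			     size_t sectorsize)
{
	assert(page && buf);
	assert(out_len <= destlen && destlen <= sectorsize);
	assert(dest_pgoff + destlen <= page_size);

	memcpy(page + dest_pgoff, buf, out_len);
	if (out_len < destlen) {
		memset(page + dest_pgoff + out_len, 0, destlen - out_len);
		return -EIO;
	}
	return 0;
}
```

In the reflink example, the destination offset inside a 64K page is
61440 (60k), and exactly one 4K sector is written there.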
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And results in the correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
Which has an inline compressed extent at file offset 0, and its
decompressed size is 4K, allowing us to reflink that 4K range to another
location (which will not be compressed).
If we do such reflink on a subpage system, it would fail like this:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In zlib_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should switch our output
buffer.
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer, since the whole input/output
buffer should never exceed one sector.
Note: The above assumption is only not true if we're going to support
multi-page sectorsize.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modifications:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Add an extra ASSERT() inside btrfs_decompress() to make sure the
input/output size never exceeds one sector.
- Use Z_FINISH flag to make sure the decompression happens in one go
- Remove the loop needed to switch input/output buffers
- Use correct destination offset inside the destination page
- Consider early end as an error
After the fix, even on 64K page sized aarch64, the above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And it results in a correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
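To make the role of @dest_pgoff concrete: with 64K pages, the reflink destination at file offset 60k lands 61440 bytes into its page, and the single decompressed 4K sector is written there in one step. The following is only a userspace sketch under those assumptions. The names `offset_in_page_sketch`, `copy_sector_to_page`, `PAGE_SZ` and `SECTOR_SZ` are hypothetical and are not the btrfs code.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ   65536u  /* assumed 64K page size (e.g. aarch64) */
#define SECTOR_SZ 4096u   /* assumed 4K sectorsize */

/* Offset of a file position inside the page that contains it. */
static size_t offset_in_page_sketch(unsigned long long file_offset)
{
	return (size_t)(file_offset % PAGE_SZ);
}

/*
 * Copy one decompressed sector into the destination page at dest_pgoff.
 * Returns 0 on success, or -1 if the data exceeds one sector or would
 * run past the end of the page.
 */
static int copy_sector_to_page(unsigned char *page, size_t dest_pgoff,
			       const unsigned char *data, size_t len)
{
	if (len > SECTOR_SZ || dest_pgoff > PAGE_SZ - len)
		return -1;
	memcpy(page + dest_pgoff, data, len);
	return 0;
}
```

The size check plays the same role as the ASSERT() described above: the output is never expected to exceed one sector, so no buffer switching is needed.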
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().
Nothing checks the return value here, so a direct replacement with
strscpy() is possible.
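The difference in contract can be sketched in userspace. The function below is a hypothetical stand-in named `strscpy_sketch`, not the kernel implementation. It assumes the documented strscpy() behavior: the source read is bounded by the destination size, the destination is always NUL-terminated when the size is non-zero, and the return value is the number of bytes copied or -E2BIG on truncation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-in for the strscpy() contract (hypothetical name).
 * Unlike strlcpy(), it never scans src beyond `size` bytes.
 */
static long strscpy_sketch(char *dst, const char *src, size_t size)
{
	const char *nul;
	size_t len;

	if (size == 0)
		return -E2BIG;

	/* Bounded search for the terminator: at most `size` bytes are read. */
	nul = memchr(src, '\0', size);
	if (nul == NULL) {
		/* Source does not fit: copy what fits and terminate. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}

	len = (size_t)(nul - src);
	memcpy(dst, src, len + 1);	/* includes the terminating NUL */
	return (long)len;
}
```

A caller that ignores the return value, as in the code touched by this change, sees the same destination contents as with strlcpy(), which is why the direct replacement is safe there.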
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <linux-bcachefs@vger.kernel.org>
Link: https://lore.kernel.org/r/20240110235438.work.385-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
minor comment cleanup and trivial camelCase removal
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In analyzing problems, one missing piece of debug data is when the
mount occurred. A related problem is that when collecting stats we don't
know the period of time the stats cover, i.e. when this set of stats
for the tcon started to be collected. To make debugging easier, track
the stats begin time. Set it at mount time,
and reset it to the current time whenever stats are reset. For example,
...
1) \\localhost\test
SMBs: 14 since 2024-01-17 22:17:30 UTC
Bytes read: 0 Bytes written: 0
Open files: 0 total (local), 0 open on server
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
...
2) \\localhost\scratch
SMBs: 24 since 2024-01-17 22:16:04 UTC
Bytes read: 0 Bytes written: 0
Open files: 0 total (local), 0 open on server
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
...
Note the time "since ... UTC" is now displayed in /proc/fs/cifs/Stats
for each share that is mounted.
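As a rough illustration of the "since ... UTC" field, here is a userspace sketch that formats a stored begin time the way the sample output shows it. The helper name `format_stats_since` is hypothetical, and the sketch uses the C library's gmtime() and strftime() rather than whatever the kernel code uses internally.

```c
#include <stddef.h>
#include <time.h>

/*
 * Format a stats begin time as "since YYYY-MM-DD HH:MM:SS UTC".
 * Returns 0 on success, -1 if the time cannot be converted or the
 * buffer is too small. gmtime() is not reentrant; that is acceptable
 * for a single-threaded sketch.
 */
static int format_stats_since(time_t begin, char *buf, size_t len)
{
	const struct tm *tm = gmtime(&begin);

	if (tm == NULL)
		return -1;
	if (strftime(buf, len, "since %Y-%m-%d %H:%M:%S UTC", tm) == 0)
		return -1;
	return 0;
}
```

Resetting the stats would then amount to storing the current time in the same field, so the displayed period always matches the counters shown next to it.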
Suggested-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing
major in here this release cycle, just lots of small cleanups and some
tweaks on kernfs that in the very end, got reverted and will come back
in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
The 'needed' argument to trace_ext4_discard_preallocations is always 0,
which is meaningless. Just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The "needed" parameter controls the number of ext4_prealloc_space to discard in
ext4_discard_preallocations. Function ext4_discard_preallocations is
supposed to discard all non-used preallocated blocks when "needed"
is 0, and ext4_discard_preallocations is now always called with "needed"
= 0. Remove the unnecessary parameter "needed" and remove all non-used
preallocated spaces in ext4_discard_preallocations to simplify the
code.
Note: If the count of non-used preallocated spaces could be more than
UINT_MAX, there was a memory leak, as some non-used preallocated
spaces were left unused, and this commit fixes it. Otherwise,
there is no behavior change.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Function ext4_mb_release_context always returns 0 and the return value is
never used. Just remove the unneeded return value of ext4_mb_release_context.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Otherwise unlocking the group in ext4_grp_locked_error may allow other
processes to modify the core block bitmap that is known to be corrupt.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Place the logic for checking if the group's block bitmap is corrupt under
the protection of the group lock to avoid allocating blocks from the group
with a corrupted block bitmap.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Determine if the group block bitmap is corrupted before using ac_b_ex in
ext4_mb_try_best_found() to avoid allocating blocks from a group with a
corrupted block bitmap in the following concurrency scenario, which would
make the situation worse.
ext4_mb_regular_allocator
ext4_lock_group(sb, group)
ext4_mb_good_group
// check if the group bbitmap is corrupted
ext4_mb_complex_scan_group
// Scan group gets ac_b_ex but doesn't use it
ext4_unlock_group(sb, group)
ext4_mark_group_bitmap_corrupted(group)
// The block bitmap was corrupted during
// the group unlock gap.
ext4_mb_try_best_found
ext4_lock_group(ac->ac_sb, group)
ext4_mb_use_best_found
mb_mark_used
// Allocating blocks in block bitmap corrupted group
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Determine if bb_fragments is 0 instead of determining bb_free to eliminate
the risk of dividing by zero when the block bitmap is corrupted.
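The guard can be sketched as follows. The struct and function names are hypothetical simplifications, not the ext4 structures. The point is only that the divisor itself, rather than the dividend, is what gets tested.

```c
/* Hypothetical, simplified per-group statistics. */
struct group_stats {
	unsigned int bb_free;       /* free blocks in the group */
	unsigned int bb_fragments;  /* number of free extents */
};

/*
 * Average free-extent length, or 0 when there are no fragments.
 * With a corrupted bitmap, bb_free may be non-zero while bb_fragments
 * is zero, so checking bb_free alone would not prevent the division.
 */
static unsigned int avg_fragment_size_sketch(const struct group_stats *grp)
{
	if (grp->bb_fragments == 0)
		return 0;
	return grp->bb_free / grp->bb_fragments;
}
```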
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After updating bb_free in mb_free_blocks, it is possible to return without
updating bb_fragments because the block being freed is found to have
already been freed, which leads to inconsistency between bb_free and
bb_fragments.
Since the group may be unlocked in ext4_grp_locked_error(), this can lead
to problems such as dividing by zero when calculating the average fragment
length. Hence move the update of bb_free to after the block double-free
check, which guarantees that the corresponding statistics are updated only
after the core block bitmap is modified.
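The ordering can be illustrated with a small userspace model. Everything here is a hypothetical simplification (`struct blk_group`, `free_blocks_sketch`, a bool array in place of the bitmap). It only shows the double-free check running before any counter is touched.

```c
#include <stdbool.h>

#define GROUP_BLOCKS 64

/* Hypothetical stand-in for a block group: a bitmap and a free counter. */
struct blk_group {
	bool in_use[GROUP_BLOCKS];
	unsigned int bb_free;
};

/*
 * Free `count` blocks starting at `first`.
 * Returns 0 on success, -1 on a bad range or a double free. On error,
 * neither the bitmap nor bb_free has been modified, so the statistics
 * stay consistent with the bitmap.
 */
static int free_blocks_sketch(struct blk_group *grp, unsigned int first,
			      unsigned int count)
{
	unsigned int i;

	if (first >= GROUP_BLOCKS || count > GROUP_BLOCKS - first)
		return -1;

	/* Double-free check first: bail out before updating anything. */
	for (i = first; i < first + count; i++)
		if (!grp->in_use[i])
			return -1;

	for (i = first; i < first + count; i++)
		grp->in_use[i] = false;

	/* Statistics change only after the bitmap has been modified. */
	grp->bb_free += count;
	return 0;
}
```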
Fixes: eabe0444df ("ext4: speed-up releasing blocks on commit")
CC: <stable@vger.kernel.org> # 3.10
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This mostly reverts commit 6bd97bf273 ("ext4: remove redundant
mb_regenerate_buddy()") and reintroduces mb_regenerate_buddy(). Based on
code in mb_free_blocks(), fast commit replay can end up marking as free
blocks that are already marked as such. This causes corruption of the
buddy bitmap so we need to regenerate it in that case.
Reported-by: Jan Kara <jack@suse.cz>
Fixes: 6bd97bf273 ("ext4: remove redundant mb_regenerate_buddy()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Otherwise operating on a corrupted block bitmap can lead to all sorts
of unknown problems.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4_move_extents(), moved_len is only updated when all moves are
successfully executed, and the orig_inode and donor_inode preallocations
are only discarded when moved_len is not zero. When the loop exits with an
error after successfully moving some extents, moved_len is not updated and
remains at 0, so the preallocations are not discarded.
If the moved extents overlap with the preallocated extents, the
overlapped extents are freed twice in ext4_mb_release_inode_pa() and
ext4_process_freed_data() (as described in commit 94d7c16cbb ("ext4:
Fix double-free of blocks with EXT4_IOC_MOVE_EXT")), and bb_free is
incremented twice. Hence when trim is executed, a zero-division bug is
triggered in mb_update_avg_fragment_size() because bb_free is not zero
and bb_fragments is zero.
Therefore, update moved_len after each extent move to avoid the issue.
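The effect of updating the counter per iteration can be shown with a toy loop. The names `move_extents_sketch` and `should_discard_prealloc`, and the `fail_at` parameter, are hypothetical and exist only to simulate a failure partway through.

```c
/*
 * Pretend to move `nr` extents with the given lengths. The extent at
 * index `fail_at` fails (use -1 for no failure). *moved_len is updated
 * after every successful move, so it is accurate even on an error exit.
 */
static int move_extents_sketch(const unsigned int *ext_len, int nr,
			       int fail_at, unsigned long long *moved_len)
{
	int i;

	*moved_len = 0;
	for (i = 0; i < nr; i++) {
		if (i == fail_at)
			return -1;		/* earlier moves stay counted */
		*moved_len += ext_len[i];	/* update per extent */
	}
	return 0;
}

/* Preallocations must be discarded whenever anything was moved. */
static int should_discard_prealloc(unsigned long long moved_len)
{
	return moved_len != 0;
}
```

With the counter only written at the end of a fully successful run, the failing case would report 0 and the discard would be skipped, which is the situation described above.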
Reported-by: Wei Chen <harperchen1110@gmail.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/r/CAO4mrferzqBUnCag8R3m2zf897ts9UEuhjFQGPtODT92rYyR2Q@mail.gmail.com
Fixes: fcf6b1b729 ("ext4: refactor ext4_move_extents code base")
CC: <stable@vger.kernel.org> # 3.18
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For dio read, the bio will be left in flight when a successful partial
aio read has been set up, and blockdev_direct_IO() will return
-EIOCBQUEUED. In that case, iter->iov_offset will not be advanced, and
the oops reported by syzbot will occur if iter->iov_offset is reverted
with iov_iter_revert(). The unwritten part has already been zeroed by the
aio read, so there is no need to zero it in dio read.
Reported-by: syzbot+fd404f6b03a58e8bc403@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fd404f6b03a58e8bc403
Fixes: 11a347fb6c ("exfat: change to get file size from DataLength")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
switch a memory area between guest_memfd and regular anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees
confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new guest_memfd
and page attributes infrastructure. This is mostly useful for testing,
since there is no pKVM-like infrastructure to provide a meaningfully
reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually care
about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a stable TSC",
because some of them don't expect the "TSC stable" bit (added to the pvclock
ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
flushes on nested transitions, i.e. always satisfies flush requests. This
allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- On AMD machines with vNMI, always rely on hardware instead of intercepting
IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fails to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertently enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
base granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with
a prefix branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargeting the NV
support to that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is deleted/moved due to a missing flag
in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix the
various bugs that were lurking due to lack of said annotation.
There are two non-KVM patches buried in the middle of guest_memfd support:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fails to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertently enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargeting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is deleted/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
UBI:
- Use in-tree fault injection framework and add new injection types
- Fix for a memory leak in the block driver
UBIFS:
- kernel-doc fixes
- Various minor fixes
Merge tag 'ubifs-for-linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
"UBI:
- Use in-tree fault injection framework and add new injection types
- Fix for a memory leak in the block driver
UBIFS:
- kernel-doc fixes
- Various minor fixes"
* tag 'ubifs-for-linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: block: fix memleak in ubiblock_create()
ubifs: fix kernel-doc warnings
mtd: Add several functions to the fail_function list
ubi: Reserve sufficient buffer length for the input mask
ubi: Add six fault injection type for testing
ubi: Split io_failures into write_failure and erase_failure
ubi: Use the fault injection framework to enhance the fault injection capability
ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path
ubifs: Check @c->dirty_[n|p]n_cnt and @c->nroot state under @c->lp_mutex
ubifs: describe function parameters
ubifs: auth.c: fix kernel-doc function prototype warning
ubifs: use crypto_shash_tfm_digest() in ubifs_hmac_wkm()
Fix a bug in my change to how f2fs frees its superblock info (which was
part of changing the timing of fscrypt keyring destruction).
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt fix from Eric Biggers:
"Fix a bug in my change to how f2fs frees its superblock info (which
was part of changing the timing of fscrypt keyring destruction)"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
f2fs: fix double free of f2fs_sb_info
Merge tag 'vfs-6.8-rc1.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains two fixes for the current merge window. The listmount
changes that you requested and a fix for a fsnotify performance
regression:
- The proposed listmount changes are currently under my authorship. I
wasn't sure whether you'd wanted to be author as the patch wasn't
signed off. If you do I'm happy if you just apply your own patch.
I've tested the patch with my sh4 cross-build setup. And confirmed
that a) the build failure with sh on current upstream is
reproducible and that b) the proposed patch fixes the build
failure. That should only leave the task of fixing put_user on sh.
- The fsnotify regression was caused by moving one of the hooks out
of the security hook in preparation for other fsnotify work. This
meant that CONFIG_SECURITY would have compiled out the fsnotify
hook before but didn't do so now.
That led to up to a 6% performance regression in some io_uring
workloads that compile all fsnotify and security checks out. Fix
this by making sure that the relevant hooks are covered by the
already existing CONFIG_FANOTIFY_ACCESS_PERMISSIONS where the
relevant hook belongs"
* tag 'vfs-6.8-rc1.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
fs: rework listmount() implementation
fsnotify: compile out fsnotify permission hooks if !FANOTIFY_ACCESS_PERMISSIONS
Merge tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"For once not mostly MM-related.
17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable"
* tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: avoid huge_zero_page in UFFDIO_MOVE
MAINTAINERS: add entry for shrinker
selftests: mm: hugepage-vmemmap fails on 64K page size systems
mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval
mailmap: switch email for Tanzir Hasan
mailmap: add old address mappings for Randy
kernel/crash_core.c: make __crash_hotplug_lock static
efi: disable mirror feature during crashkernel
kexec: do syscore_shutdown() in kernel_kexec
mailmap: update entry for Manivannan Sadhasivam
fs/proc/task_mmu: move mmu notification mechanism inside mm lock
mm: zswap: switch maintainers to recently active developers and reviewers
scripts/decode_stacktrace.sh: optionally use LLVM utilities
kasan: avoid resetting aux_lock
lib/Kconfig.debug: disable CONFIG_DEBUG_INFO_BTF for Hexagon
MAINTAINERS: update LTP maintainers
kdump: defer the insertion of crashkernel resources
The variable tcon_exist is assigned but never read, so it is
redundant and can be removed.
Cleans up clang scan build warning:
warning: Although the value stored to 'tcon_exist' is used in
the enclosing expression, the value is never actually read from
'tcon_exist' [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
So, use the purpose-specific kcalloc() function instead of passing
size * count as the argument to kzalloc().
[1] https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
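The overflow risk described above can be illustrated with a minimal userspace sketch. The names checked_calloc() and open_coded_size() are hypothetical stand-ins invented for this example; they are not the kernel's kcalloc() or kzalloc(), and the sketch only assumes that an overflow-aware allocator rejects a request whose element count times element size would wrap.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for an overflow-aware array allocator:
 * refuse the request if count * size would wrap past SIZE_MAX, otherwise
 * return zeroed memory.
 */
void *checked_calloc(size_t count, size_t size)
{
	if (size != 0 && count > SIZE_MAX / size)
		return NULL;	/* the multiplication would overflow */
	return calloc(count, size);
}

/*
 * The open-coded pattern: the product silently wraps, so the allocator
 * would be asked for far less memory than the caller expects.
 */
size_t open_coded_size(size_t count, size_t size)
{
	return count * size;
}
```

With a count of SIZE_MAX / 2 + 1 and a size of 2, the open-coded product wraps to 0, while the checked variant returns NULL instead of handing back an undersized buffer.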
Link: https://lore.kernel.org/linux-trace-kernel/20240115181658.4562-1-erick.archer@gmx.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://github.com/KSPP/linux/issues/162
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and incremented
their ref count. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1] for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release, and because the release would decrement all ref counts of all
files and directories, that would include the kprobe directory that was
not there to have its ref count incremented in open. This would cause the
ref count to go negative and later crash the kernel.
To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.
Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. What was done was the eventfs
created another descriptor that had a void pointer to save the
dcache_readdir pointer, and it wrapped all the callbacks, so that it could
save the list of entries that had their ref counts incremented in the
open, and pass it to the release. The wrapped callbacks would just put
back the dcache_readdir pointer and call the functions it used so it could
still use its data[2].
But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
We finally agreed on doing everything within the iterate_shared
callback and leaving the dcache_readdir out of it[3]. All the information
needed for the getdents() could be created then.
But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].
An attempt was to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks.[6]
The difference between that and the original method, is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_shared
or release callbacks; only their names and inode numbers are.
Linus did not like that either[7] and just wanted to remove the dentries
being created in iterate_shared and use the hard coded inode numbers.
[ All this while Linus enjoyed an unexpected vacation during the merge
window due to lack of power. ]
[1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/
[2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home/
[3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218477@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.sang@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/
[6] https://lore.kernel.org/all/20240116114711.7e8637be@gandalf.local.home/
[7] https://lore.kernel.org/all/20240116170154.5bf0a250@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.573784051@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Fixes: 493ec81a8f ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.sang@intel.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.
Instead use a single unique inode number for all files and one for all
directories.
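As a rough illustration of the idea, the sketch below hands out one inode number for every file and another for every directory. The constant values and the helper name example_ino_for() are assumptions made up for this example; the commit message does not state which numbers are used.

```c
#include <stdbool.h>

/* Hypothetical values; the commit message does not specify them. */
#define EXAMPLE_FILE_INO 2UL
#define EXAMPLE_DIR_INO  3UL

/*
 * Every file shares one inode number and every directory shares another,
 * so no dentry or inode has to be created just to report a number.
 */
unsigned long example_ino_for(bool is_dir)
{
	return is_dir ? EXAMPLE_DIR_INO : EXAMPLE_FILE_INO;
}
```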
Link: https://lore.kernel.org/all/20240116133753.2808d45e@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180363@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
__dentry_leases_walk() gets a callback and calls it for
a bunch of dentries; there are exactly two callers and
we already have a flag telling them apart - lwc->dir_lease.
Seeing that indirect calls are costly these days, let's
get rid of the callback and just call the right function
directly. Has a side benefit of saner signatures...
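A minimal sketch of the pattern, assuming a walker that already carries a flag distinguishing its two callers. The structure and function names here (walk_ctl, check_dir_lease(), check_dentry_lease(), walk_leases()) are hypothetical and only mirror the shape of the change, not the actual ceph code.

```c
#include <stdbool.h>

/* Hypothetical walk control block; dir_lease tells the two callers apart. */
struct walk_ctl {
	bool dir_lease;
	int dir_hits;
	int dentry_hits;
};

void check_dir_lease(struct walk_ctl *wc)
{
	wc->dir_hits++;
}

void check_dentry_lease(struct walk_ctl *wc)
{
	wc->dentry_hits++;
}

/*
 * Instead of taking a function pointer and making an indirect call per
 * entry, branch on the flag and call the right function directly.
 */
void walk_leases(struct walk_ctl *wc, int nr_entries)
{
	for (int i = 0; i < nr_entries; i++) {
		if (wc->dir_lease)
			check_dir_lease(wc);
		else
			check_dentry_lease(wc);
	}
}
```

The walker's signature no longer needs a callback parameter, which is the "saner signatures" side benefit in miniature.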
[ xiubli: a minor fix in the commit title ]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Clean up the code.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
smatch reports that get_quota_realm() might return an ERR_PTR that we
do not handle. It's not an immediate bug, but we should still address
it to avoid potential bugs if get_quota_realm() is changed to return
other ERR_PTR values in the future.
Return the ceph_snap_realm pointer through a parameter of
get_quota_realm() to address this issue. The pointer is set to NULL if
get_quota_realm() fails to get the struct ceph_snap_realm, so no
ERR_PTR can be returned any more.
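The general shape of such a fix can be sketched as follows, assuming a lookup that reports its status as an integer and hands the object back through an out-parameter. The names struct realm and lookup_realm() are hypothetical and the lookup logic is a placeholder; this is not the ceph implementation.

```c
#include <errno.h>
#include <stddef.h>

struct realm {
	int id;
};

/* Placeholder object so the sketch has something to return. */
struct realm example_realm = { 42 };

/*
 * Returns 0 on success or a negative errno. *out is always either a valid
 * pointer or NULL, never an error value encoded as a pointer, so callers
 * only need a NULL check before dereferencing.
 */
int lookup_realm(int key, struct realm **out)
{
	*out = NULL;
	if (key < 0)
		return -EINVAL;
	if (key == 0)
		return 0;	/* nothing found: success, *out stays NULL */
	*out = &example_realm;
	return 0;
}
```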
[ xiubli: minor code style clean up ]
Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When allocating an osd request the libceph.ko will add the
'read_from_replica' flag by default.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Update the oldest_client_tid via the session renew caps msg so
that the MDSs don't let the completed request list grow very
large.
[ idryomov: drop inapplicable comment ]
Link: https://tracker.ceph.com/issues/63364
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make the create session msg helper more general so that it can
be used by other ops.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The kconfig options for filesystems that support FS_ENCRYPTION are
supposed to select FS_ENCRYPTION_ALGS. This is needed to ensure that
required crypto algorithms get enabled as loadable modules or builtin as
is appropriate for the set of enabled filesystems. Do this for CEPH_FS
so that there aren't any missing algorithms if someone happens to have
CEPH_FS as their only enabled filesystem that supports encryption.
Cc: stable@vger.kernel.org
Fixes: f061feda6c ("ceph: add fscrypt ioctls and ceph.fscrypt.auth vxattr")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The lock order between a dentry and its parent is incorrect; we should
always make sure that the parent gets the lock first.
But since this dead code is never used and the parent dir will always
be set by the callers, let's just remove it.
Link: https://lore.kernel.org/r/20231116081919.GZ1957730@ZenIV
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In the fscrypt case and for a smaller read length we can predict the
max count of the extent map. For small read length use cases
this could save some memory.
[ idryomov: squash into a single patch to avoid build break, drop
redundant variable in ceph_alloc_sparse_ext_map() ]
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Following along the same lines as per the user-space fix. Right
now this isn't really an issue with the ceph kernel driver because
of the feature bit laginess, however, that can change over time
(when the new snaprealm info type is ported to the kernel driver)
and depending on the MDS version that's being upgraded can cause
message decoding issues - so, fix that early on.
Link: http://tracker.ceph.com/issues/63188
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When the MDS closes the session, the kclient will try to reconnect to
it immediately. But if the MDS has just restarted and is not ready
yet, e.g. it is still in the up:replay state and the sessionmap journal
logs haven't been replayed, the MDS will close the session.
The kclient could then remove the session, and later, when the
mdsmap is in the RECONNECT phase, it will skip reconnecting. But the MDS
will wait until timeout and then evict the kclient.
Just skip sending the reconnection request until the MDS is ready.
Link: https://tracker.ceph.com/issues/62489
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When smb2 leases are disabled, ksmbd can send an oplock break notification
and then time out waiting for the oplock break ack. This may look like a
hang when accessing a directory. This patch makes only v2 leases handle
the directory.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The race is between the handling of a new TCP connection and
its disconnection. It leads to UAF on `struct tcp_transport` in
ksmbd_tcp_new_connection() function.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-22991
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If the client sends an invalid mech token in a session setup request,
ksmbd validates it and returns an error.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-22890
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
EROFS can select compression algorithms on a per-file basis, and each
per-file compression algorithm needs to be marked in the on-disk
superblock for initialization.
However, syzkaller can generate inconsistent crafted images that use
an unsupported algorithmtype for specific inodes, e.g. use MicroLZMA
algorithmtype even if it's not set in `sbi->available_compr_algs`. This
can lead to an unexpected "BUG: kernel NULL pointer dereference" if
the corresponding decompressor isn't built-in.
Fix this by checking against `sbi->available_compr_algs` for each
m_algorithmformat request. Incorrect !erofs_sb_has_compr_cfgs preset
bitmap is now fixed together since it was harmless previously.
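The kind of check described above might look like the following sketch, which tests a requested algorithm index against a bitmap of available algorithms before it is used. The enum values, the EX_ALG_MAX bound, and the function name check_algorithm() are assumptions for illustration only; they are not taken from the EROFS sources.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical algorithm indices; the real on-disk values may differ. */
enum {
	EX_ALG_LZ4 = 0,
	EX_ALG_LZMA = 1,
	EX_ALG_MAX = 4,
};

/*
 * Reject any algorithm whose bit is not set in the superblock's
 * available-algorithms bitmap, so a crafted inode cannot select a
 * decompressor that was never initialized.
 */
int check_algorithm(uint32_t available_algs, unsigned int alg)
{
	if (alg >= EX_ALG_MAX)
		return -EOPNOTSUPP;
	if (!(available_algs & (UINT32_C(1) << alg)))
		return -EOPNOTSUPP;
	return 0;
}
```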
Reported-by: <bugreport@ubisectech.com>
Fixes: 8f89926290 ("erofs: get compression algorithms directly on mapping")
Fixes: 622ceaddb7 ("erofs: lzma compression support")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20240113150602.1471050-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Linus pointed out that there are error handling and naming issues in the
listmount() implementation that we should rewrite:
* Perform the access checks for the buffer before actually doing any
work instead of doing it during the iteration.
* Rename the arguments to listmount() and do_listmount() to clarify what
the arguments are used for.
* Get rid of the pointless ctr variable and overflow checking.
* Get rid of the pointless speculation check.
Link: https://lore.kernel.org/r/CAHk-=wjh6Cypo8WC-McXgSzCaou3UXccxB+7PVeSuGR8AjCphg@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- Replace the internal table lookup algorithm with the hweight library
and ffs of the bitops library.
- Handle the two types of stream entry, valid data size (has been written)
and data size, separately. This improves compatibility with two
differently sized files created on Windows.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmWgmxoWHGxpbmtpbmpl
b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCERgD/4rHm1yG0ZlURvXiAwZwVOQMJoz
9Y8Gz3M1LsycJEN1uxNjSYfUe9LX/BlbXz5uIH8tVQjEEIbyl0RmJjITawVBHVbS
Ps/UMDiQvT5DPqIwhrfTh9qxy0cRi7WBuKNAXRXSSVRx2mMWYIjNxT+8dIcD2FEG
63ojnoYi8RwuYuvCwo51coxf8/E7GY+WAYnC97hqtj2jSQ6gMjeDtiyYx1m5PfUN
32NdG1IYaiTstD7EU1lv1QNzLZx/Q9gBhi0jhDu1qc0fI+rS49p0zqop1TeEtsIf
RD05XHZ8KRapChgoSvw+hb6CfZ7RanImFAHm6WnILqgFoY7uagUH1dn3oOJFgdLA
OTwbEA/sQmnIdqg07Hhgf74OI9bu/kgP7g8/xrooqhO2SkYGLXDLgYFhEk08aEyE
sp9fxtBfKhXUVHKafzkKtUmI+THl5W793aAfND5W+ahX2zDprwupzg/F7p4Sj3tJ
GbvaRL/n1d/O1dhf/doTmfggH7TnDODS729w0HBSNJU+q6zrGluLRyqB3XsRFXng
7RlN8f4HSI6eFRVG7KTwxVcfwsedtPmNRKLg3PEMkXz5jb4wsw7tZUB3gAFwy9qf
cZd7/+oU9qKEgrBRDJfJsFqq0IpzLCXDEZp00F5RregLIhWHZN4ghBrVU94ciDuT
gxBgoSWrLqObnXmLVQ==
=WFoc
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- Replace the internal table lookup algorithm with the hweight library
and ffs of the bitops library.
- Handle the two types of stream entry, valid data size (has been
written) and data size separately. It improves compatibility with two
differently sized files created on Windows.
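For readers unfamiliar with the two primitives named in the first item, here is a small userspace sketch of what they compute. count_set_bits() is a portable stand-in for a Hamming-weight helper, and first_free_bit() uses the POSIX ffs() function; both helper names and the 8-bit bitmap layout are assumptions for this example, not the exfat code.

```c
#include <strings.h>	/* ffs() */

/*
 * Hamming weight: number of set bits in a word. Each iteration clears
 * the lowest set bit, so the loop runs once per set bit.
 */
unsigned int count_set_bits(unsigned int word)
{
	unsigned int n = 0;

	while (word) {
		word &= word - 1;
		n++;
	}
	return n;
}

/*
 * Zero-based index of the first clear bit in an 8-bit bitmap byte, or -1
 * if every bit is set. ffs() returns a one-based position, or 0 when no
 * bit is set in its argument.
 */
int first_free_bit(unsigned char byte)
{
	int pos = ffs((unsigned char)~byte);

	return pos ? pos - 1 : -1;
}
```

Both helpers replace what would otherwise be a precomputed lookup table indexed by the byte value.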
* tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: do not zero the extended part
exfat: change to get file size from DataLength
exfat: using ffs instead of internal logic
exfat: using hweight instead of internal logic
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZaDougAKCRBZ7Krx/gZQ
60eJAQCtXa908kOFDjSSTetU6aBzWKcCCHszirjhXiTFJv1jTgD/TbvyGs4ku7Ri
oI4nh1XX4QMVWsup1VETnnLAjt6DhAw=
=fror
-----END PGP SIGNATURE-----
Merge tag 'pull-bcachefs-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull bcachefs locking fix from Al Viro:
"Fix broken locking in bch2_ioctl_subvolume_destroy()"
* tag 'pull-bcachefs-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
bch2_ioctl_subvolume_destroy(): fix locking
new helper: user_path_locked_at()
Move mmu notification mechanism inside mm lock to prevent race condition
in other components which depend on it. The notifier will invalidate
memory range. Depending upon the number of iterations, different memory
ranges would be invalidated.
The following warning would be removed by this patch:
WARNING: CPU: 0 PID: 5067 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:734 kvm_mmu_notifier_change_pte+0x860/0x960 arch/x86/kvm/../../../virt/kvm/kvm_main.c:734
There is no behavioural and performance change with this patch when
there is no component registered with the mmu notifier.
[akpm@linux-foundation.org: narrow the scope of `range', per Sean]
Link: https://lkml.kernel.org/r/20240109112445.590736-1-usama.anjum@collabora.com
Fixes: 52526ca7fd ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Reported-by: syzbot+81227d2bd69e9dedb802@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com/
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In this series, we've some progress to support Zoned block device regarding to
the power-cut recovery flow and enabling checkpoint=disable feature which is
essential for Android OTA. Other than that, some patches touched sysfs entries
and tracepoints which are minor, while several bug fixes on error handlers and
compression flows are good to improve the overall stability.
Enhancement:
- enable checkpoint=disable for zoned block device
- sysfs entries such as discard status, discard_io_aware, dir_level
- tracepoints such as f2fs_vm_page_mkwrite(), f2fs_rename(), f2fs_new_inode()
- use shared inode lock during f2fs_fiemap() and f2fs_seek_block()
Bug fix:
- address some power-cut recovery issues on zoned block device
- handle errors and logics on do_garbage_collect(), f2fs_reserve_new_block(),
f2fs_move_file_range(), f2fs_recover_xattr_data()
- don't set FI_PREALLOCATED_ALL for partial write
- fix to update iostat correctly in f2fs_filemap_fault()
- fix to wait on block writeback for post_read case
- fix to tag gcing flag on page during block migration
- restrict max filesize for 16K f2fs
- fix to avoid dirent corruption
- explicitly null-terminate the xattr list
There are also several clean-up patches to remove dead code and improve
readability.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmWgMYcACgkQQBSofoJI
UNJShxAAiYOXP7LPOAbPS1251BBgl8AIfs6u96hGTZkxOYsLHrBBbPbkWf3+nVbC
JsBsVOe9K50rssK9kPg6XHPbmFGC8ERlyYcZTpONLfjtHOaQicbRnc//2qOvnCx8
JOKcMVkZyLU/HbOCoUW6mzNCQlOl0aAV8tRcb7jwAxT0HgpjHTHxej/62gRcPKzC
1E5w4iNTY//R97YGB36jPeGlKhbBZ7Ox1NM6AWadgE7B0j9rcYiBnPQllyeyaVVo
XMCWRdl42tNMks2zgvU+vC41OrZ55bwLTQmVj3P1wnyKXig5/ZLQsrEcIGE+b2tP
Mx+imCIRNYZqLwv5KYl6FU+KuLQGuZT1AjpP70Cb95WLyiYvVE6+xeiZg0fVTCEF
3Hg7lEqMtAEAh1NEmJyYmbiAm9KQ3vHyse9ix++tfm+Xvgqj8b2flmzAtIFKpCBV
J+yFI+A55IYuYZt7gzPoZLkQL0tULPf80TKQrzwlnHNtZ6T6FK2Nunu+Urwf1/Th
s5IulqHJZxHU/Bgd6yQZUVfDILcXTkqNCpO3+qLZMPZizlH1hXiJFTeVzS6mnGvZ
sK2LL4rEJ8EhDHU1F0SJzCWJcuR8cQ/t2zKYUygo9LvHbtEM1bZwC1Bqfolt7NrU
+pgiM2wnE9yjkPdfZN1JgYZDq0/lGvxPQ5NAc/5ERX71QonRyn8=
=MQl3
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs update from Jaegeuk Kim:
"In this series, we've some progress to support Zoned block device
regarding to the power-cut recovery flow and enabling
checkpoint=disable feature which is essential for Android OTA.
Other than that, some patches touched sysfs entries and tracepoints
which are minor, while several bug fixes on error handlers and
compression flows are good to improve the overall stability.
Enhancements:
- enable checkpoint=disable for zoned block device
- sysfs entries such as discard status, discard_io_aware, dir_level
- tracepoints such as f2fs_vm_page_mkwrite(), f2fs_rename(),
f2fs_new_inode()
- use shared inode lock during f2fs_fiemap() and f2fs_seek_block()
Bug fixes:
- address some power-cut recovery issues on zoned block device
- handle errors and logics on do_garbage_collect(),
f2fs_reserve_new_block(), f2fs_move_file_range(),
f2fs_recover_xattr_data()
- don't set FI_PREALLOCATED_ALL for partial write
- fix to update iostat correctly in f2fs_filemap_fault()
- fix to wait on block writeback for post_read case
- fix to tag gcing flag on page during block migration
- restrict max filesize for 16K f2fs
- fix to avoid dirent corruption
- explicitly null-terminate the xattr list
There are also several clean-up patches to remove dead code and
improve readability"
* tag 'f2fs-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (33 commits)
f2fs: show more discard status by sysfs
f2fs: Add error handling for negative returns from do_garbage_collect
f2fs: Constrain the modification range of dir_level in the sysfs
f2fs: Use wait_event_freezable_timeout() for freezable kthread
f2fs: fix to check return value of f2fs_recover_xattr_data
f2fs: don't set FI_PREALLOCATED_ALL for partial write
f2fs: fix to update iostat correctly in f2fs_filemap_fault()
f2fs: fix to check compress file in f2fs_move_file_range()
f2fs: fix to wait on block writeback for post_read case
f2fs: fix to tag gcing flag on page during block migration
f2fs: add tracepoint for f2fs_vm_page_mkwrite()
f2fs: introduce f2fs_invalidate_internal_cache() for cleanup
f2fs: update blkaddr in __set_data_blkaddr() for cleanup
f2fs: introduce get_dnode_addr() to clean up codes
f2fs: delete obsolete FI_DROP_CACHE
f2fs: delete obsolete FI_FIRST_BLOCK_WRITTEN
f2fs: Restrict max filesize for 16K f2fs
f2fs: let's finish or reset zones all the time
f2fs: check write pointers when checkpoint=disable
f2fs: fix write pointers on zoned device after roll forward
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmWfZIYACgkQiiy9cAdy
T1Gijgv+MH40buaJETR2FjRxzUiC92FFafGcX+fh5dLM22Sxb9AW+cNzOf/CSqNE
0AKpdhh+MVq0xiYwaXrrGGUfpUOZ3fwTHNJjpCt5o34b8U6IrHIah96noCRhDwQm
pE1Loi5TyWYRYsCjajau+tIi9+lgROkJ9eM34bx2dkkDw5ng4MQAygqJDwDM1n+i
N/O3vRqG3ZUrK5h7v9kVBdFZlkiiVm8cjNH86prfecnT6TTa/6QZoNd9kQAyvGe2
hWhW4J70y+H3JHcaBeGp11wcLcyBeFwBdqeo+os5EUPN/BbWaXuUqktBKvfnHIWA
ucwrAMNVVK82JhudlkIGuYVgEOrLDrZsWjIJmDajFtJuO3Yo9VLkeZKcMRhPj45H
kodPbCfPxNXg/y2fn1P3nCyiRSHHFb0QFEvK5JZV0Zwv5yPeziA/5e+6lu6OWNW7
VYQb2ZdOzuNW6aXHXnPRQQDSbVFsgD5dVxqYA9cugvtKAslUb9Q6z4kNkCFbYq/F
xJ0VH0K5
=UhAE
-----END PGP SIGNATURE-----
Merge tag '6.8-rc-smb-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
- memory allocation fix
- three lease fixes, including important rename fix
- read only share fix
- thread freeze fix
- three cleanup fixes (two kernel doc related)
- locking fix in setting EAs
- packet header validation fix
* tag '6.8-rc-smb-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Add missing set_freezable() for freezable kthread
ksmbd: free ppace array on error in parse_dacl
ksmbd: send lease break notification on FILE_RENAME_INFORMATION
ksmbd: don't allow O_TRUNC open on read-only share
ksmbd: vfs: fix all kernel-doc warnings
ksmbd: auth: fix most kernel-doc warnings
ksmbd: Remove usage of the deprecated ida_simple_xx() API
ksmbd: don't increment epoch if current state and request state are same
ksmbd: fix potential circular locking issue in smb2_set_ea()
ksmbd: set v2 lease version on lease upgrade
ksmbd: validate the zero field of packet header
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ/BCAAKCRBZ7Krx/gZQ
68qqAQD6LtfYLDJGdJM+lNpyiG4BA7coYpPlJtmH7mzL+MbFPgEAnM7XsK6zyvza
3+rEggLM0UFWjg9Ln7Nlq035TeYtFwo=
=w1mD
-----END PGP SIGNATURE-----
Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem updates from Al Viro:
"Misc cleanups (the part that hadn't been picked by individual fs
trees)"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
apparmorfs: don't duplicate kfree_link()
orangefs: saner arguments passing in readdir guts
ocfs2_find_match(): there's no such thing as NULL or negative ->d_parent
reiserfs_add_entry(): get rid of pointless namelen checks
__ocfs2_add_entry(), ocfs2_prepare_dir_for_insert(): namelen checks
ext4_add_entry(): ->d_name.len is never 0
befs: d_obtain_alias(ERR_PTR(...)) will do the right thing
affs: d_obtain_alias(ERR_PTR(...)) will do the right thing
/proc/sys: use d_splice_alias() calling conventions to simplify failure exits
hostfs: use d_splice_alias() calling conventions to simplify failure exits
udf_fiiter_add_entry(): check for zero ->d_name.len is bogus...
udf: d_obtain_alias(ERR_PTR(...)) will do the right thing...
udf: d_splice_alias() will do the right thing on ERR_PTR() inode
nfsd: kill stale comment about simple_fill_super() requirements
bfs_add_entry(): get rid of pointless ->d_name.len checks
nilfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing...
zonefs: d_splice_alias() will do the right thing on ERR_PTR() inode
change of locking rules for __dentry_kill(), regularized refcounting
rules in that area, assorted cleanups and removal of weird corner
cases (e.g. now ->d_iput() on child is always called before the parent
might hit __dentry_kill(), etc.)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+sQQAKCRBZ7Krx/gZQ
6ybjAQDM5jiS93IUzfHjCWq0nVBX5YGbDAkZOeqxbmIdQb+2UAEA6elP5r0fBBcA
seo3bry4DirQMDaA/Cjh4+8r71YSOQs=
=7+Hk
-----END PGP SIGNATURE-----
Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
"Change of locking rules for __dentry_kill(), regularized refcounting
rules in that area, assorted cleanups and removal of weird corner
cases (e.g. now ->d_iput() on child is always called before the parent
might hit __dentry_kill(), etc)"
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
dcache: remove unnecessary NULL check in dget_dlock()
kill DCACHE_MAY_FREE
__d_unalias() doesn't use inode argument
d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
get rid of DCACHE_GENOCIDE
d_genocide(): move the extern into fs/internal.h
simple_fill_super(): don't bother with d_genocide() on failure
nsfs: use d_make_root()
d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
retain_dentry(): introduce a trimmed-down lockless variant
__dentry_kill(): new locking scheme
d_prune_aliases(): use a shrink list
switch select_collect{,2}() to use of to_shrink_list()
to_shrink_list(): call only if refcount is 0
fold dentry_kill() into dput()
don't try to cut corners in shrink_lock_dentry()
fold the call of retain_dentry() into fast_dput()
Call retain_dentry() with refcount 0
dentry_kill(): don't bother with retain_dentry() on slow path
...
broken in 6.5; we really can't lock two unrelated directories
without holding ->s_vfs_rename_mutex first and in case of
same-parent rename of a subdirectory 6.5 ends up doing just
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+lyQAKCRBZ7Krx/gZQ
60MWAP94hTqeMIpjhsUIkrTnylrIFaiw4UCWFJzIRG1QQYKqCgD/XUaWI9np7dL6
0wR/j4CQSdJjiEFKUFE2pD3QoSuJYAQ=
=+x0+
-----END PGP SIGNATURE-----
Merge tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull rename updates from Al Viro:
"Fix directory locking scheme on rename
This was broken in 6.5; we really can't lock two unrelated directories
without holding ->s_vfs_rename_mutex first and in case of same-parent
rename of a subdirectory 6.5 ends up doing just that"
* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rename(): avoid a deadlock in the case of parents having no common ancestor
kill lock_two_inodes()
rename(): fix the locking of subdirectories
f2fs: Avoid reading renamed directory if parent does not change
ext4: don't access the source subdirectory content on same-directory rename
ext2: Avoid reading renamed directory if parent does not change
udf_rename(): only access the child content on cross-directory rename
ocfs2: Avoid touching renamed directory if parent does not change
reiserfs: Avoid touching renamed directory if parent does not change
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+hqgAKCRBZ7Krx/gZQ
61nkAP0Ybu4QRGISg1w5uKlNY8uNO37wR51oZ8c6DrOHOqv7tQEA+tB2LGc0rDjp
5J6mMlmCQlJnXj4k3OjVch9S7xdsSA8=
=0C5U
-----END PGP SIGNATURE-----
Merge tag 'pull-minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull minixfs updates from Al Viro:
"minixfs kmap_local_page() switchover and related fixes - very similar
to sysv series"
* tag 'pull-minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
minixfs: switch to kmap_local_page()
minixfs: Use dir_put_page() in minix_unlink() and minix_rename()
minixfs: change the signature of dir_get_page()
minixfs: use offset_in_page()
- The minimum Sphinx requirement has been raised to 2.4.4, following a
warning that was added in 6.2.
- Some reworking of the Documentation/process front page to, hopefully,
make it more useful.
- Various kernel-doc tweaks to, for example, make it deal properly with
__counted_by annotations.
- We have also restored a warning for documentation of nonexistent
structure members that disappeared a while back. That had the delightful
consequence of adding some 600 warnings to the docs build. A sustained
effort by Randy, Vegard, and myself has addressed almost all of those,
bringing the documentation back into sync with the code. The fixes are
going through the appropriate maintainer trees.
- Various improvements to the HTML rendered docs, including automatic links
to Git revisions and a nice new pulldown to make translations easy to
access.
- Speaking of translations, more of those for Spanish and Chinese.
...plus the usual stream of documentation updates and typo fixes.
Merge tag 'docs-6.8' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"Another moderately busy cycle for documentation, including:
- The minimum Sphinx requirement has been raised to 2.4.4, following
a warning that was added in 6.2
- Some reworking of the Documentation/process front page to,
hopefully, make it more useful
- Various kernel-doc tweaks to, for example, make it deal properly
with __counted_by annotations
- We have also restored a warning for documentation of nonexistent
structure members that disappeared a while back. That had the
delightful consequence of adding some 600 warnings to the docs
build. A sustained effort by Randy, Vegard, and myself has
addressed almost all of those, bringing the documentation back into
sync with the code. The fixes are going through the appropriate
maintainer trees
- Various improvements to the HTML rendered docs, including automatic
links to Git revisions and a nice new pulldown to make translations
easy to access
- Speaking of translations, more of those for Spanish and Chinese
... plus the usual stream of documentation updates and typo fixes"
* tag 'docs-6.8' of git://git.lwn.net/linux: (57 commits)
MAINTAINERS: use tabs for indent of CONFIDENTIAL COMPUTING THREAT MODEL
A reworked process/index.rst
ring-buffer/Documentation: Add documentation on buffer_percent file
Translated the RISC-V architecture boot documentation.
Docs: remove mentions of fdformat from util-linux
Docs/zh_CN: Fix the meaning of DEBUG to pr_debug()
Documentation: move driver-api/dcdbas to userspace-api/
Documentation: move driver-api/isapnp to userspace-api/
Documentation/core-api : fix typo in workqueue
Documentation/trace: Fixed typos in the ftrace FLAGS section
kernel-doc: handle a void function without producing a warning
scripts/get_abi.pl: ignore some temp files
docs: kernel_abi.py: fix command injection
scripts/get_abi: fix source path leak
CREDITS, MAINTAINERS, docs/process/howto: Update man-pages' maintainer
docs: translations: add translations links when they exist
kernel-doc: Align quick help and the code
MAINTAINERS: add reviewer for Spanish translations
docs: ignore __counted_by attribute in structure definitions
scripts: kernel-doc: Clarify missing struct member description
..
Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
This is not really to enhance fuzzing tests, but as a preparation for
future expansion on btrfs_ioctl_defrag_range_args.
In the future we're going to add new members, allowing more fine tuning
for btrfs defrag. Without the -EOPNOTSUPP error, there would be no way
to detect if the kernel supports those new defrag features.
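A minimal userspace sketch of this kind of check, rejecting unknown flag
bits so that callers can probe for support. The flag names and values
below are hypothetical placeholders, not the real btrfs UAPI definitions:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the real btrfs values. */
#define DEFRAG_RANGE_COMPRESS   (1ULL << 0)
#define DEFRAG_RANGE_START_IO   (1ULL << 1)
#define DEFRAG_RANGE_FLAGS_SUPP (DEFRAG_RANGE_COMPRESS | DEFRAG_RANGE_START_IO)

/* Return 0 if every bit in flags is known, -EOPNOTSUPP otherwise. */
static int check_defrag_flags(uint64_t flags)
{
	if (flags & ~DEFRAG_RANGE_FLAGS_SUPP)
		return -EOPNOTSUPP;
	return 0;
}
```

With a check like this in place, a future flag bit is rejected by older
kernels instead of being silently ignored.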
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
Thread 1                                      | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume                        |
  btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
                                              |btrfs_mksubvol
                                              |  down_read(subvol_sem)
                                              |  create_snapshot
                                              |    ...
                                              |    create_pending_snapshot
                                              |      copy root item from source
  down_write(subvol_sem)                      |
This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.
create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs CI reported the following lockdep warning when running
generic/129.
WARNING: possible circular locking dependency detected
6.7.0-rc5+ #1 Not tainted
------------------------------------------------------
kworker/u5:5/793427 is trying to acquire lock:
ffff88813256d028 (&cache->lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x5e/0x130
but task is already holding lock:
ffff88810a23a318 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x34/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}:
...
-> #0 (&cache->lock){+.+.}-{2:2}:
...
This is because we take fs_info->zone_active_bgs_lock after a block_group's
lock in btrfs_zone_activate() while doing the opposite in other places.
Fix the issue by expanding the fs_info->zone_active_bgs_lock's critical
section and taking it before a block_group's lock.
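The ordering rule can be sketched in userspace with pthread mutexes.
This is only an illustration under assumed names (fs_info_sketch,
block_group_sketch, zone_activate, zone_finish are hypothetical), not
the btrfs code: every path takes the outer list lock first and the
per-block-group lock second.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for fs_info and a block group. */
struct fs_info_sketch {
	pthread_mutex_t zone_active_bgs_lock;	/* outer lock */
	int active_bgs;
};

struct block_group_sketch {
	pthread_mutex_t lock;			/* inner lock */
	bool zone_is_active;
};

/* Activate: outer lock first, then the block group's lock. */
static bool zone_activate(struct fs_info_sketch *fs,
			  struct block_group_sketch *bg)
{
	bool changed = false;

	pthread_mutex_lock(&fs->zone_active_bgs_lock);
	pthread_mutex_lock(&bg->lock);
	if (!bg->zone_is_active) {
		bg->zone_is_active = true;
		fs->active_bgs++;
		changed = true;
	}
	pthread_mutex_unlock(&bg->lock);
	pthread_mutex_unlock(&fs->zone_active_bgs_lock);
	return changed;
}

/* Finish: same order as activation, so no ABBA inversion is possible. */
static bool zone_finish(struct fs_info_sketch *fs,
			struct block_group_sketch *bg)
{
	bool changed = false;

	pthread_mutex_lock(&fs->zone_active_bgs_lock);
	pthread_mutex_lock(&bg->lock);
	if (bg->zone_is_active) {
		bg->zone_is_active = false;
		fs->active_bgs--;
		changed = true;
	}
	pthread_mutex_unlock(&bg->lock);
	pthread_mutex_unlock(&fs->zone_active_bgs_lock);
	return changed;
}
```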
Fixes: a7e1ac7bdc ("btrfs: zoned: reserve zones for an active metadata/system block group")
CC: stable@vger.kernel.org # 6.6
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error path of btrfs_get_chunk_map() releases
fs_info->mapping_tree_lock. However, that lock is both taken and released
inside btrfs_find_chunk_map(), so the error path must not release it again.
Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When compiling with gcc version 14.0.0 20231220 (experimental)
and W=1, I've noticed the following warning:
fs/btrfs/send.c: In function 'btrfs_ioctl_send':
fs/btrfs/send.c:8208:44: warning: 'kvcalloc' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
8208 | sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
| ^
Since 'n' and 'size' arguments of 'kvcalloc()' are multiplied to
calculate the final size, their actual order doesn't affect the result
and so this is not a bug. But it's still worth fixing.
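For illustration, the same convention with userspace calloc(), which
like kvcalloc() takes the element count first and the element size
second. The struct and helper names here are hypothetical:

```c
#include <stdlib.h>

/* Hypothetical element type standing in for a clone root entry. */
struct clone_root_sketch {
	unsigned long long ino;
	unsigned long long offset;
};

/* Element count goes first, element size second. */
static struct clone_root_sketch *alloc_clone_roots(size_t count)
{
	return calloc(count, sizeof(struct clone_root_sketch));
}
```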
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Writing sequentially to a huge file on btrfs on an SMR HDD revealed a
decline in performance (220 MiB/s to 30 MiB/s after 500 minutes).
The performance goes down because of increased latency of the extent
allocation, which is caused by traversing a lot of full block groups.
So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
with sufficient size from the active block group list, which does not
contain full block groups.
After applying the patch, the performance is maintained well.
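A rough sketch of the idea, assuming a simplified array in place of the
active block group list. The struct, field and function names are
hypothetical and the real allocator logic is more involved:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of an active (non-full) block group. */
struct active_bg_sketch {
	uint64_t start;		/* logical start address */
	uint64_t length;	/* usable capacity in bytes */
	uint64_t alloc_offset;	/* bytes already allocated */
};

/*
 * Return the start of the first active block group that still has at
 * least num_bytes free, or fallback if none qualifies.
 */
static uint64_t pick_hint_byte(const struct active_bg_sketch *bgs, size_t n,
			       uint64_t num_bytes, uint64_t fallback)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t avail = bgs[i].length - bgs[i].alloc_offset;

		if (avail >= num_bytes)
			return bgs[i].start;
	}
	return fallback;
}
```

Because only active block groups are scanned, the search never walks
over full block groups.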
Fixes: 2eda57089e ("btrfs: zoned: implement sequential extent allocation")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out prepare_allocation_zoned() for further extension. While at
it, optimize the if-branch a bit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
"Pretty quiet round this time around. This contains:
- NVMe updates via Keith:
- nvme fabrics spec updates (Guixin, Max)
- nvme target updates (Guixin, Evan)
- nvme attribute refactoring (Daniel)
- nvme-fc numa fix (Keith)
- MD updates via Song:
- Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai)
- Fix raid5 hang issue (Junxiao Bi)
- Add Yu Kuai as Reviewer of the md subsystem
- Remove deprecated flavors (Song Liu)
- raid1 read error check support (Li Nan)
- Better handle events off-by-1 case (Alex Lyakas)
- Efficiency improvements for passthrough (Kundan)
- Support for mapping integrity data directly (Keith)
- Zoned write fix (Damien)
- rnbd fixes (Kees, Santosh, Supriti)
- Default to a sane discard size granularity (Christoph)
- Make the default max transfer size naming less confusing
(Christoph)
- Remove support for deprecated host aware zoned model (Christoph)
- Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel,
Bart, Christoph)"
* tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits)
block: Treat sequential write preferred zone type as invalid
block: remove disk_clear_zoned
sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
drivers/block/xen-blkback/common.h: Fix spelling typo in comment
blk-cgroup: fix rcu lockdep warning in blkg_lookup()
blk-cgroup: don't use removal safe list iterators
block: floor the discard granularity to the physical block size
mtd_blkdevs: use the default discard granularity
bcache: use the default discard granularity
zram: use the default discard granularity
null_blk: use the default discard granularity
nbd: use the default discard granularity
ubd: use the default discard granularity
block: default the discard granularity to sector size
bcache: discard_granularity should not be smaller than a sector
block: remove two comments in bio_split_discard
block: rename and document BLK_DEF_MAX_SECTORS
loop: don't abuse BLK_DEF_MAX_SECTORS
aoe: don't abuse BLK_DEF_MAX_SECTORS
null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
...
Core & protocols
----------------
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up
build time warnings to safeguard against future header changes.
This improves TCP performance with many concurrent connections
by up to 40%.
- Add page-pool netlink-based introspection, exposing the
memory usage and recycling stats. This helps identify
bad PP users and possible leaks.
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set.
This lowers the time taken by connect() for hosts having
many active connections to the same destination.
- Refactor the TCP bind conflict code, shrinking related socket
structs.
- Refactor TCP SYN-Cookie handling, as a preparation step to
allow arbitrary SYN-Cookie processing via eBPF.
- Tune optmem_max for 0-copy usage, increasing the default value
to 128KB and making it per network namespace.
- Allow coalescing for cloned skbs coming from page pools, improving
RX performance with some common configurations.
- Reduce extension header parsing overhead at GRO time.
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries.
- Reorder nftables struct members, to keep data accessed by the
datapath first.
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer.
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex).
- More data-race annotations.
- Extend the diag interface to dump TCP bound-only sockets.
- Conditional notification of events for TC qdisc class and actions.
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID.
- Implement SMCv2.1 virtual ISM device support.
- Add support for batman-adv multicast packet type.
BPF
---
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers. It
improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload.
- Fix kCFI bugs in BPF so that all forms of indirect calls from BPF
into kernel and from kernel into BPF work with CFI enabled. This
allows BPF to work with CONFIG_FINEIBT=y.
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques.
- Support uid/gid options when mounting bpffs.
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is identified
by its id.
- Extend verifier to allow bpf_refcount_acquire() of a map value field
obtained via direct load which is a use-case needed in sched_ext.
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter.
- Support for VLAN tag in XDP hints.
- Remove deprecated bpfilter kernel leftovers given the project
is developed in user-space (https://github.com/facebook/bpfilter).
Misc
----
- Support for parallel TC self-tests execution.
- Increase MPTCP self-tests coverage.
- Updated the bridge documentation, including several so-far
undocumented features.
- Convert all the net self-tests to run in unique netns, to
avoid random failures due to conflict and allow concurrent
runs.
- Add TCP-AO self-tests.
- Add kunit tests for both cfg80211 and mac80211.
- Autogenerate Netlink families documentation from YAML spec.
- Add yml-gen support for fixed headers and recursive nests, the
tool can now generate user-space code for all genetlink families
for which we have specs.
- A bunch of additional module descriptions fixes.
- Catch incorrect freeing of pages belonging to a page pool.
Driver API
----------
- Rust abstractions for network PHY drivers; they do not yet cover the
full C API, but already allow implementing functional PHY drivers
in Rust.
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship.
- Introduce notifications filtering for devlink to allow control
applications to scale to thousands of instances.
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host.
- Add support for ethtool symmetric-xor RSS hash.
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform.
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute.
- Convert older drivers to platform remove callback returning void.
- Add support for PHY package MMD read/write.
New hardware / drivers
----------------------
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instruments DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed
-------
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one-by-one creation and removal of port representors
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- re-add FW logging
- add support for switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will allow
in future releases combining multiple PFs devices attached to
different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring param,
coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performance with many concurrent connections by
up to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps identify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and making it per network namespace
- Allow coalescing for cloned skbs coming from page pools, improving
RX performance with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for batman-adv multicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF so that all forms of indirect calls from BPF
into kernel and from kernel into BPF work with CFI enabled. This
allows BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parallel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; they do not yet cover the
full C API, but already allow implementing functional PHY drivers
in Rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
applications to scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Add support for PHY package MMD read/write
New hardware / drivers:
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instruments DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed:
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Driver updates:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one-by-one creation and removal of port representors
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- re-add FW logging
- add support for switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...
This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.
The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GFP_NOWAIT allocations, which
involve acquiring a non-IRQ-safe and non-raw lock, leading to the following
lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0
Let's revert the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We're only allocating from the realtime device if the inode is marked
for realtime and we're /not/ allocating into the attr fork.
Fixes: 5864346054 ("xfs: also use xfs_bmap_btalloc_accounting for RT allocations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
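The condition described above can be written as a two-input predicate. This is only a sketch of the stated rule, not the XFS code: the function name and the fork identifiers are hypothetical and chosen for illustration.

```python
# Hypothetical fork identifiers for this sketch; not the kernel's constants.
DATA_FORK = "data"
ATTR_FORK = "attr"


def allocates_from_rt_device(inode_is_realtime: bool, whichfork: str) -> bool:
    """Return True only for realtime inodes allocating outside the attr fork."""
    return inode_is_realtime and whichfork != ATTR_FORK


if __name__ == "__main__":
    # Print the full truth table for the two inputs.
    for rt in (False, True):
        for fork in (DATA_FORK, ATTR_FORK):
            print(rt, fork, allocates_from_rt_device(rt, fork))
```

The only combination that selects the realtime device is a realtime inode with a non-attr fork; attr fork allocations on a realtime inode still come from the data device.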
To help make the move of sysctls out of kernel/sysctl.c not incur a size
penalty sysctl has been changed to allow us to not require the sentinel, the
final empty element on the sysctl array. Joel Granados has been doing all this
work. On the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7 we had all arch/ and drivers/ modified to remove
the sentinel. For v6.8-rc1 we get a few more updates for fs/ directory only.
The kernel/ directory is left but we'll save that for v6.9-rc1 as those patches
are still being reviewed. After that we then can expect also the removal of the
no longer needed check for procname == NULL.
Let us recap the purpose of this work:
- this helps reduce the overall build time size of the kernel and run time
memory consumed by the kernel by about 64 bytes per array
- the extra 64-byte penalty is no longer incurred when we move sysctls
out from kernel/sysctl.c to their own files
Thomas Weißschuh also sent a few cleanups; for v6.9-rc1 we expect to see further
work by Thomas Weißschuh with the constification of the struct ctl_table.
Due to Joel Granados's work, and to help bring in new blood, I have suggested
for him to become a maintainer and he's accepted. So for v6.9-rc1 I look forward
to seeing him send you a pull request for further sysctl changes. This also
removes Iurii Zaikin as a maintainer as he has moved on to other projects and
has had no time to help at all.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmWdWDESHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinjJAP/jTNNoyzWisvrrvmXqR5txFGLOE+wW6x
Xv9avuiM+DTHsH/wK8CkXEivwDqYNAZEHU7NEcolS5bJX/ddSRwN9b5aSVlCrUdX
Ab4rXmpeSCNFp9zNszWJsDuBKIqjvsKw7qGleGtgZ2qAUHbbH30VROLWCggaee50
wU3icDLdwkasxrcMXy4Sq5dT5wYC4j/QelqBGIkYPT14Arl1im5zqPZ95gmO/s/6
mdicTAmq+hhAUfUBJBXRKtsvxY6CItxe55Q4fjpncLUJLHUw+VPVNoBKFWJlBwlh
LO3liKFfakPSkil4/en+/+zuMByd0JBkIzIJa+Kk5kjpbHRhK0RkmU4+Y5G5spWN
jjLfiv6RxInNaZ8EWQBMfjE95A7PmYDQ4TOH08+OvzdDIi6B0BB5tBGQpG9BnyXk
YsLg1Uo4CwE/vn1/a9w0rhadjUInvmAryhb/uSJYFz/lmApLm2JUpY3/KstwGetb
z+HmLstJb24Djkr6pH8DcjhzRBHeWQ5p0b4/6B+v1HqAUuEhdbyw1F2GrDywyF3R
h/UOAaKLm1+ffdA246o9TejKiDU96qEzzXMaCzPKyestaRZuiyuYEMDhYbvtsMV5
zIdMJj5HQ+U1KHDv4IN99DEj7+/vjE3f4Sjo+POFpQeQ8/d+fxpFNqXVv449dgnb
6xEkkxsR0ElM
=2qBt
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"To help make the move of sysctls out of kernel/sysctl.c not incur a
size penalty sysctl has been changed to allow us to not require the
sentinel, the final empty element on the sysctl array. Joel Granados
has been doing all this work.
In the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7 we had all arch/ and drivers/ modified to
remove the sentinel. For v6.8-rc1 we get a few more updates for fs/
directory only.
The kernel/ directory is left but we'll save that for v6.9-rc1 as
those patches are still being reviewed. After that we then can expect
also the removal of the no longer needed check for procname == NULL.
Let us recap the purpose of this work:
- this helps reduce the overall build time size of the kernel and run
time memory consumed by the kernel by about 64 bytes per array
- the extra 64-byte penalty is no longer incurred when we move
sysctls out from kernel/sysctl.c to their own files
Thomas Weißschuh also sent a few cleanups; for v6.9-rc1 we expect to
see further work by Thomas Weißschuh with the constification of the
struct ctl_table.
Due to Joel Granados's work, and to help bring in new blood, I have
suggested for him to become a maintainer and he's accepted. So for
v6.9-rc1 I look forward to seeing him send you a pull request for
further sysctl changes. This also removes Iurii Zaikin as a maintainer
as he has moved on to other projects and has had no time to help at
all"
* tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: remove struct ctl_path
sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR
coda: Remove the now superfluous sentinel elements from ctl_table array
sysctl: Remove the now superfluous sentinel elements from ctl_table array
fs: Remove the now superfluous sentinel elements from ctl_table array
cachefiles: Remove the now superfluous sentinel element from ctl_table array
sysclt: Clarify the results of selftest run
sysctl: Add a selftest for handling empty dirs
sysctl: Fix out of bounds access for empty sysctl registers
MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl
MAINTAINERS: remove Iurii Zaikin from proc sysctl
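To make the sentinel saving concrete, here is a small userspace model of the two ways a table's length can be determined. It is only a sketch: the Entry class, the helper names and the "foo"/"bar" entries are hypothetical, and the 64-byte entry size is simply the figure quoted in the pull request text.

```python
from dataclasses import dataclass
from typing import List, Optional

ENTRY_SIZE = 64  # bytes per table entry, the figure quoted in the text


@dataclass
class Entry:
    procname: Optional[str]  # None models the empty sentinel element


def count_with_sentinel(table: List[Entry]) -> int:
    """Old style: walk the array until the entry with no procname."""
    n = 0
    while table[n].procname is not None:
        n += 1
    return n


def count_with_size(table: List[Entry]) -> int:
    """New style: the array size is known, so no sentinel is needed."""
    return len(table)


# The same two entries, with and without the trailing sentinel.
old_table = [Entry("foo"), Entry("bar"), Entry(None)]
new_table = [Entry("foo"), Entry("bar")]

# One entry less per array is where the per-array saving comes from.
saved_bytes = (len(old_table) - len(new_table)) * ENTRY_SIZE
```

Both tables describe the same two entries; the sized form just drops the final empty element, which in this model accounts for the 64 bytes per array.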
The goal is to get sched.h down to a type only header, so the main thing
happening in this patchset is splitting out various _types.h headers and
dependency fixups, as well as moving some things out of sched.h to
better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdependencies.
Testing - it's been in -next, and fixes from pretty much all
architectures have percolated in - nothing major.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWfBwwACgkQE6szbY3K
bnZPwBAAmuRojXaeWxi01IPIOehSGDe68vw44PR9glEMZvxdnZuPOdvE4/+245/L
bRKU2WBCjBUokUbV9msIShwRkFTZAmEMPNfPAAsFMA+VXeDYHKB+ZRdwTggNAQ+I
SG6fZgh5m0HsewCDxU8oqVHkjVq4fXn0cy+aL6xLEd9gu67GoBzX2pDieS2Kvy6j
jnyoKTxFwb+LTQgph0P4EIpq5I2umAsdLwdSR8EJ+8e9NiNvMo1pI00Lx/ntAnFZ
JftWUJcMy3TQ5u1GkyfQN9y/yThX1bZK5GvmHS9SJ2Dkacaus5d+xaKCHtRuFS1I
7C6b8PsNgRczUMumBXus44HdlNfNs1yU3lvVxFvBIPE1qC9pYRHrkWIXXIocXLLC
oxTEJ6B2G3BQZVQgLIA4fOaxMVhmvKffi/aEZLi9vN9VVosd1a6XNKI6KbyRnXFp
GSs9qDqszhn5I3GYNlDNQTc/8UsRlhPFgS6nS0By6QnvxtGi9QkU2tBRBsXvqwCy
cLoCYIhc2tvugHvld70dz26umiJ4rnmxGlobStNoigDvIKAIUt1UmIdr1so8P8eH
xehnL9ZcOX6xnANDL0AqMFFHV6I58CJynhFdUoXfVQf/DWLGX48mpi9LVNsYBzsI
CAwVOAQ0UjGrpdWmJ9ueY/ABYqg9vRjzaDEXQ+MhAYO55CLaVsg=
=3tyT
-----END PGP SIGNATURE-----
Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdependencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
- btree write buffer rewrite: instead of adding keys to the btree write
buffer at transaction commit time, we now journal them with a
different journal entry type and copy them from the journal to the
write buffer just prior to journal write.
This reduces the number of atomic operations on shared cachelines
in the transaction commit path and is a significant performance
improvement on some workloads: multithreaded 4k random writes went
from ~650k iops to ~850k iops.
- Bring back optimistic spinning for six locks: the new implementation
doesn't use osq locks; instead we add to the lock waitlist as normal,
and then spin on the lock_acquired bit in the waitlist entry, _not_
the lock itself.
- BCH_IOCTL_DEV_USAGE_V2, which allows for new data types
- BCH_IOCTL_OFFLINE_FSCK, which runs the kernel implementation of fsck
but without mounting: useful for transparently using the kernel
version of fsck from 'bcachefs fsck' when the kernel version is a
better match for the on disk filesystem.
- BCH_IOCTL_ONLINE_FSCK: online fsck. Not all passes are supported yet,
but the passes that are supported are fully featured - errors may be
corrected as normal.
The new ioctls use the new 'thread_with_file' abstraction for kicking
off a kthread that's tied to a file descriptor returned to userspace
via the ioctl.
- btree_paths within a btree_trans are now dynamically growable,
instead of being limited to 64. This is important for the
check_directory_structure phase of fsck, and also fixes some issues
we were having with btree path overflow in the reflink btree.
- Trigger refactoring; prep work for the upcoming disk space accounting
rewrite
- Numerous bugfixes :)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWe8PUACgkQE6szbY3K
bnYw6g/9GAXfIGasTZZwK2XEr36RYtEFYMwd/m9V1ET0DH6d/MFH9G7tTYl52AQ4
k9cDFb0d2qdtNk2Rlml1lHFrxMzkp2Q7j9S4YcETrE+/Dir8ODVcJXrGeNTCMGmz
B+C12mTOpWrzGMrioRgFZjWAnacsY3RP8NFRTT9HIJHO9UCP+xN5y++sX10C5Gwv
7UVWTaUwjkgdYWkR8RCKGXuG5cNNlRp4Y0eeK2XruG1iI9VAilir1glcD/YMOY8M
vECQzmf2ZLGFS/tpnmqVhNbNwVWpTQMYassvKaisWNHLDUgskOoF8YfoYSH27t7F
GBb1154O2ga6ea866677FDeNVlg386mGCTUy2xOhMpDL3zW+/Is+8MdfJI4MJP5R
EwcjHnn2bk0C2kULbAohw0gnU42FulfvsLNnrfxCeygmZrDoOOCL1HpvnBG4vskc
Fp6NK83l974QnyLdPsjr1yB2d2pgb+uMP1v76IukQi0IjNSAyvwSa5nloPTHRzpC
j6e2cFpdtX+6vEu6KngXVKTblSEnwhVBTaTR37Lr8PX1sZqFS/+mjRDgg3HZa/GI
u0fC0mQyVL9KjDs5LJGpTc/qs8J4mpoS5+dfzn38MI76dFxd5TYZKWVfILTrOtDF
ugDnoLkMuYFdueKI2M3YzxXyaA7HBT+7McAdENuJJzJnEuSAZs0=
=JvA2
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs updates from Kent Overstreet:
- btree write buffer rewrite: instead of adding keys to the btree write
buffer at transaction commit time, we now journal them with a
different journal entry type and copy them from the journal to the
write buffer just prior to journal write.
This reduces the number of atomic operations on shared cachelines in
the transaction commit path and is a significant performance
improvement on some workloads: multithreaded 4k random writes went
from ~650k iops to ~850k iops.
- Bring back optimistic spinning for six locks: the new implementation
doesn't use osq locks; instead we add to the lock waitlist as normal,
and then spin on the lock_acquired bit in the waitlist entry, _not_
the lock itself.
- New ioctls:
- BCH_IOCTL_DEV_USAGE_V2, which allows for new data types
- BCH_IOCTL_OFFLINE_FSCK, which runs the kernel implementation of
fsck but without mounting: useful for transparently using the
kernel version of fsck from 'bcachefs fsck' when the kernel
version is a better match for the on disk filesystem.
- BCH_IOCTL_ONLINE_FSCK: online fsck. Not all passes are supported
yet, but the passes that are supported are fully featured - errors
may be corrected as normal.
The new ioctls use the new 'thread_with_file' abstraction for kicking
off a kthread that's tied to a file descriptor returned to userspace
via the ioctl.
- btree_paths within a btree_trans are now dynamically growable,
instead of being limited to 64. This is important for the
check_directory_structure phase of fsck, and also fixes some issues
we were having with btree path overflow in the reflink btree.
- Trigger refactoring; prep work for the upcoming disk space accounting
rewrite
- Numerous bugfixes :)
* tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (226 commits)
bcachefs: eytzinger0_find() search should be const
bcachefs: move "ptrs not changing" optimization to bch2_trigger_extent()
bcachefs: fix simulateously upgrading & downgrading
bcachefs: Restart recovery passes more reliably
bcachefs: bch2_dump_bset() doesn't choke on u64s == 0
bcachefs: improve checksum error messages
bcachefs: improve validate_bset_keys()
bcachefs: print sb magic when relevant
bcachefs: __bch2_sb_field_to_text()
bcachefs: %pg is banished
bcachefs: Improve would_deadlock trace event
bcachefs: fsck_err()s don't need to manually check c->sb.version anymore
bcachefs: Upgrades now specify errors to fix, like downgrades
bcachefs: no thread_with_file in userspace
bcachefs: Don't autofix errors we can't fix
bcachefs: add missing bch2_latency_acct() call
bcachefs: increase max_active on io_complete_wq
bcachefs: add time_stats for btree_node_read_done()
bcachefs: don't clear accessed bit in btree node fill
bcachefs: Add an option to control btree node prefetching
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmWfB8kACgkQiiy9cAdy
T1HnJwv/ZhSkyu7iSsP9pWmYaCvzKkT6OljIvixTMKilzVzLe24g+sqSSkDb+SSw
6W9v26AWJqjfzVYyoy8PtR+Erbp5CKoJifAQg+1KK/jJ38ApWCqTpOtRBuo/vEX+
OFdAolOaNr74QWpylInyg5GCyjarH78cvyctukj6ZNO419fX3xKQfIc8z0yE0Osp
0M4R/xy8MuuoKuEzDVuimzY6J9qPqQCsv2jrUDXRZry1bBR0JYl7mIxg3wPT+nBM
aFFVybF1kKlGvb0clmymSpI4isJg1i5v03Vy3lot01CQxI5kBftE/CkMdVNxMPBh
WDs2ZIXk3c9zsyWmI7ijhEaOVeaaOG8s1MQ/374bBRMeUIzcqGY5rn/gn4JEQbnh
rjy9+8oH17+h76qCxHzoVZpuoE2hwHuifQ4KKesoApV3edbPDOzMGol4gA+7KvQD
pJAfIDPSZFyA9kNOQiyoritAOG2YulQ+cMcOXG4HeIHJWqnbLiiIq0mt3sJCdvqI
5PiZDR81
=K+9M
-----END PGP SIGNATURE-----
Merge tag 'v6.8-rc-part1-smb-client' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Various smb client fixes, most related to better handling special file
types:
- Improve handling of special file types:
- performance improvement (better compounding and better caching
of readdir entries that are reparse points)
- extend support for creating special files (sockets, fifos,
block/char devices)
- fix renaming and hardlinking of reparse points
- extend support for creating symlinks with IO_REPARSE_TAG_SYMLINK
- Multichannel logging improvement
- Exception handling fix
- Minor cleanups"
* tag 'v6.8-rc-part1-smb-client' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko
cifs: remove unneeded return statement
cifs: make cifs_chan_update_iface() a void function
cifs: delete unnecessary NULL checks in cifs_chan_update_iface()
cifs: get rid of dup length check in parse_reparse_point()
smb: client: stop revalidating reparse points unnecessarily
cifs: Pass unbyteswapped eof value into SMB2_set_eof()
smb3: Improve exception handling in allocate_mr_list()
cifs: fix in logging in cifs_chan_update_iface
smb: client: handle special files and symlinks in SMB3 POSIX
smb: client: cleanup smb2_query_reparse_point()
smb: client: allow creating symlinks via reparse points
smb: client: fix hardlinking of reparse points
smb: client: fix renaming of reparse points
smb: client: optimise reparse point querying
smb: client: allow creating special files via reparse points
smb: client: extend smb2_compound_op() to accept more commands
smb: client: Fix minor whitespace errors and warnings
New Features:
* Always ask for type with READDIR
* Remove nfs_writepage()
Bugfixes:
* Fix a suspicious RCU usage warning
* Fix a blocklayoutdriver reference leak
* Fix the block driver's calculation of layoutget size
* Fix handling NFS4ERR_RETURNCONFLICT
* Fix _xprt_switch_find_current_entry()
* Fix v4.1 backchannel request timeouts
* Don't add zero-length pnfs block devices
* Use the parent cred in nfs_access_login_time()
Cleanups:
* A few improvements when dealing with referring calls from the server
* Clean up various unused variables, struct fields, and function calls
* Various tracepoint improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmWfDp8ACgkQ18tUv7Cl
QOsJ+Q/8DgrVmP3jwoM9Fu7JI/RnTQr9svk7zyrlyrQd3ywYqu6A1SC7lphcrzhy
qxH55ykUuVgCB4kFqWPsU5yilJ8UzPncTOUObiBxN3pCU885Wckm4PJ9PNXtF9ct
hc7+RpSTby/hYxiJABGVLgUADJ30rYBe6Y+KspSf+S1HvmgY1jbMPhEbVGpP2QBt
zSF5pmnecZ748LGzSwSeW29WUZhvRPBL5B204EB4aq9SmPAhnAclnE7uhErQ1u8e
Z6RVwSXv2j1FcM79F5xc/gAByCQhObGuMceFd0sAnx87RUttHi1fteVboz2gZxHB
rawZQ9p9K9c7ayCu8disxKWTxNYAztvXDOs+Dnij+c3/2EpAmEUD53AXnXAz025b
IbSnh6ggLlxoKLv1Lrwrli2d/Qi4TYTm2RSW/dY416pIhoO3aC6fv1a5tUnou9RX
3XpiiFeNoTixWswmS23AMT7BrJTWXY/+NX7AxFZUyPyJ8y9F2Ug8BCam8uAvTluf
80Dx0pB+7DRF19/ZkH0mUFU+2/+mlK/Ub0p+izSJhkhPSH5TwUTA7hvX6xb7yFtS
OY4aTVD0rpTbSOvHOEI+F4tWBnw8onTobYMfRcuwNKYJCvuEh4mLLpn44QEJwW9M
3nHIzdE75Nz3deO+gg6Jo5JuiMwqvh7AEGsxIA64FnAi/xRCDi0=
=9hVw
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull nfs client updates from Anna Schumaker:
"New Features:
- Always ask for type with READDIR
- Remove nfs_writepage()
Bugfixes:
- Fix a suspicious RCU usage warning
- Fix a blocklayoutdriver reference leak
- Fix the block driver's calculation of layoutget size
- Fix handling NFS4ERR_RETURNCONFLICT
- Fix _xprt_switch_find_current_entry()
- Fix v4.1 backchannel request timeouts
- Don't add zero-length pnfs block devices
- Use the parent cred in nfs_access_login_time()
Cleanups:
- A few improvements when dealing with referring calls from the
server
- Clean up various unused variables, struct fields, and function
calls
- Various tracepoint improvements"
* tag 'nfs-for-6.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (21 commits)
NFSv4.1: Use the nfs_client's rpc timeouts for backchannel
SUNRPC: Fixup v4.1 backchannel request timeouts
rpc_pipefs: Replace one label in bl_resolve_deviceid()
nfs: Remove writepage
NFS: drop unused nfs_direct_req bytes_left
pNFS: Fix the pnfs block driver's calculation of layoutget size
nfs: print fileid in lookup tracepoints
nfs: rename the nfs_async_rename_done tracepoint
nfs: add new tracepoint at nfs4 revalidate entry point
SUNRPC: fix _xprt_switch_find_current_entry logic
NFSv4.1/pnfs: Ensure we handle the error NFS4ERR_RETURNCONFLICT
NFSv4.1: if referring calls are complete, trust the stateid argument
NFSv4: Track the number of referring calls in struct cb_process_state
NFS: Use parent's objective cred in nfs_access_login_time()
NFSv4: Always ask for type with READDIR
pnfs/blocklayout: Don't add zero-length pnfs_block_dev
blocklayoutdriver: Fix reference leak of pnfs_device_node
SUNRPC: Fix a suspicious RCU usage warning
SUNRPC: Create a helper function for accessing the rpc_clnt's xprt_switch
SUNRPC: Remove unused function rpc_clnt_xprt_switch_put()
...
mostly in the fstrim and mballoc code paths. Also enable
dioread_nolock in the case where the block size is less than the page
size. (Dioread_nolock has been default in the bs == ps case for quite
some time.)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmWe6MMACgkQ8vlZVpUN
gaM/gAf/e9j4yCAR/W23cICNh/9hw2U0HItEONZF7GDfySlGADL5dsOADe58jLY9
g8UwBpHptOcyxmMTYgdKPQ2YpUF+3Kd4oi2M1Q6CjeeBeRbwuzT4lMTeKrtMEgiz
Ns8mqBgGX3DIXjcbkdO9QdLZPBj07djamAIQlWVLHAR2w6LPgiBhHebUSe+36Ufk
xLaj5X2nkdTtPcN1EnlTYNR+zMLyAwXUsxKf44aUveRwiNAfLGBgY9yvFby7hC+6
ENCP1WsalvVnaI8mr9pgt1KTXIrElknA1bbiWJ9RZ5Y8Za+MEHxXBKpP/AStX8Nc
WEo7a9tNB1AXU04+/SgVp9GAkXEViA==
=Zk8h
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Various ext4 bug fixes and cleanups. The fixes are mostly in the
fstrim and mballoc code paths.
Also enable dioread_nolock in the case where the block size is less
than the page size (dioread_nolock has been default in the bs == ps
case for quite some time)"
* tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix inconsistent between segment fstrim and full fstrim
ext4: fallback to complex scan if aligned scan doesn't work
ext4: convert ext4_da_do_write_end() to take a folio
ext4: allow for the last group to be marked as trimmed
ext4: move ext4_check_bdev_write_error() into nojournal mode
jbd2: abort journal when detecting metadata writeback error of fs dev
jbd2: remove unused 'JBD2_CHECKPOINT_IO_ERROR' and 'j_atomic_flags'
jbd2: replace journal state flag by checking errseq
jbd2: add errseq to detect client fs's bdev writeback error
ext4: improving calculation of 'fe_{len|start}' in mb_find_extent()
ext4: clarify handling of unwritten bh in __ext4_block_zero_page_range()
ext4: treat end of range as exclusive in ext4_zero_range()
ext4: enable dioread_nolock as default for bs < ps case
ext4: delete redundant calculations in ext4_mb_get_buddy_page_lock()
ext4: reduce unnecessary memory allocation in alloc_flex_gd()
ext4: avoid online resizing failures due to oversized flex bg
ext4: remove unnecessary check from alloc_flex_gd()
ext4: unify the type of flexbg_size to unsigned int
Other than the update to MAINTAINERS, this PR has only a fix to stop
ecryptfs from inadvertently mounting case-insensitive filesystems that
it cannot handle, which would otherwise cause post-mount failures. It
has been on linux-next for the past month and a half.
As a side note, the optimization to the case-insensitive comparison code
that you suggested, and the negative dentry support are still on the
list, and were postponed to the next release.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRIEdicmeMNCZKVCdo6u2Upsdk6RAUCZZxQsQAKCRA6u2Upsdk6
RDNtAP9QkDintyPet5GhZKvgUDuNkduMbjDymDv9aELCdhz3FAEAprzJOanEXjOY
JInwn3pvZIcpQzwFEJtDt1HZEZLF0Q0=
=sHKn
-----END PGP SIGNATURE-----
Merge tag 'unicode-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode updates from Gabriel Krisman Bertazi:
"Other than the update to MAINTAINERS, this PR has only a fix to stop
ecryptfs from inadvertently mounting case-insensitive filesystems that
it cannot handle, which would otherwise cause post-mount failures"
* tag 'unicode-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
MAINTAINERS: update unicode maintainer e-mail address
ecryptfs: Reject casefold directory inodes
- Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva,
Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcOsQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJoiDD/9gNhalNG+6MNF5TDwSvO9X7pvL
bQ6D3clByRxYjnJ4dMQ7p3s+rJ937uQt9PezIWHgRoldjQy3x7AJ5BxkhjeMlD2B
YLbfdVYPy09X0Ewk1Efvfm/ta6tJpBGYF7Bc7LIneZrdQ6gemBpLW1PNZAFYzcWX
oDjV+M1NytxaiF0aebxPZvZ1W+NGQ105Sxvj5MheDoezyO/j0CTe+ZYtCzFguFY0
8SPpR5FG4AFidb8GHd5Ndv0trVWjF1jat0FUFgEFOCE0fJNWLVR0Bbr2MtXiG7wL
LF7IZ/Mn+mi+O3BmcD6JiaYf9EPlMUXCyqc8NvsnoWGqhWhWmQPCInZVrpplMUNK
V/UHVMkmjDs4f/lAHBJoJHDK6fmOD+cAFaNMOltfErcjV4s+lEo6vHoiKl8hfPnH
EzpQaK3funGroVYwTc35e07NrJJHCzqIUhZ0FJO7ByuOE2tIomiVo9Xy9gy54iCT
qzC7zkrZ0MKqui4qiUY9FWayRRYLX4qNxELm4yie6Pzmk8943hNOaDofcyKWuZFC
eqvhIkvqb4LasLrzCBk+ehA2KWSRmTrR6E9IygwbBXUTsvn2yj2RRYeAlGQNBTBZ
adgSXQpRBmtKYqyihWLhP4QcunknEiQdDS3lS2qJmPH33Iv3jGH4yS6BNIBufMGL
PoC2UxSfGd+YT079fw==
=1Wxx
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
- Introduce the param_unknown_fn type and other clean ups (Andy
Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R.
Silva, Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
* tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
qnx4: Use get_directory_fname() in qnx4_match()
qnx4: Extract dir entry filename processing into helper
atags_proc: Add __counted_by for struct buffer and use struct_size()
tracing/uprobe: Replace strlcpy() with strscpy()
params: Fix multi-line comment style
params: Sort headers
params: Use size_add() for kmalloc()
params: Do not go over the limit when getting the string length
params: Introduce the param_unknown_fn type
lkdtm: Add kfence read after free crash type
nvme-fc: replace deprecated strncpy with strscpy
nvdimm/btt: replace deprecated strncpy with strscpy
nvme-fabrics: replace deprecated strncpy with strscpy
drm/modes: replace deprecated strncpy with strscpy_pad
afs: Add __counted_by for struct afs_acl and use struct_size()
VMCI: Annotate struct vmci_handle_arr with __counted_by
i40e: Annotate struct i40e_qvlist_info with __counted_by
HID: uhid: replace deprecated strncpy with strscpy
samples: Replace strlcpy() with strscpy()
SUNRPC: Replace strlcpy() with strscpy()
Suppose we issue two FITRIM ioctls for ranges [0,15] and [16,31] with
minimum length of trimmed range set to 8 blocks. If we have, say, a range of
blocks 10-22 free, this range will not be trimmed because it straddles the
boundary of the two FITRIM ranges and neither part is big enough. This is a
bit surprising to some users that call FITRIM on smaller ranges of blocks
to limit impact on the system. Also XFS trims all free space extents that
overlap with the specified range so we are inconsistent among filesystems.
Let's change ext4_try_to_trim_range() to consider for trimming the whole
free space extent that straddles the end of specified range, not just the
part of it within the range.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231216010919.1995851-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
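The example from the description (ranges [0,15] and [16,31], minimum length 8, blocks 10-22 free) can be replayed with a toy model. This is a sketch under simplifying assumptions: free space is a list of inclusive (first, last) block extents, and both function names are hypothetical; it does not mirror the code in ext4_try_to_trim_range().

```python
def trim_clipped(free_extents, start, end, minlen):
    """Old behaviour in this model: clip each free extent to [start, end]."""
    trimmed = []
    for first, last in free_extents:
        lo, hi = max(first, start), min(last, end)
        if lo <= hi and hi - lo + 1 >= minlen:
            trimmed.append((lo, hi))
    return trimmed


def trim_whole_extent(free_extents, start, end, minlen):
    """New behaviour in this model: an extent that overlaps the range is
    considered up to its real end, even if that lies past `end`."""
    trimmed = []
    for first, last in free_extents:
        lo = max(first, start)
        if lo > end or lo > last:
            continue  # extent does not overlap the requested range
        if last - lo + 1 >= minlen:
            trimmed.append((lo, last))
    return trimmed


free = [(10, 22)]  # blocks 10-22 are free, as in the description
requests = [(0, 15), (16, 31)]
MINLEN = 8

old_results = [trim_clipped(free, s, e, MINLEN) for s, e in requests]
new_results = [trim_whole_extent(free, s, e, MINLEN) for s, e in requests]
```

In the clipped model the two pieces are 6 and 7 blocks long, so neither request trims anything. With the whole extent considered, the first request sees 13 blocks and trims 10-22.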
Currently in case the goal length is a multiple of stripe size we use
ext4_mb_scan_aligned() to find the stripe size aligned physical blocks.
In case we are not able to find any, we again go back to calling
ext4_mb_choose_next_group() to search for a different suitable block
group. However, since the linear search always begins from the start,
most of the time we end up with the same BG and the cycle continues.
With large filesystems, the CPU can be stuck in this loop for hours,
which can slow down the whole system. Hence, until we figure out a
better way to continue the search (rather than starting from the beginning)
in ext4_mb_choose_next_group(), let's just fall back to
ext4_mb_complex_scan_group() in case aligned scan fails, as it is much
more likely to find the needed blocks.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ee033f6dfa0a7f2934437008a909c3788233950f.1702455010.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
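The fallback can be illustrated with a toy block group. This is only a sketch: a group is modelled as a list of booleans (True means free), and scan_aligned(), scan_complex() and allocate() are hypothetical stand-ins that capture the idea, not the behaviour of ext4_mb_scan_aligned() or ext4_mb_complex_scan_group().

```python
def scan_aligned(free, length, stripe):
    """Look for `length` free blocks starting only at stripe-aligned offsets."""
    for start in range(0, len(free) - length + 1, stripe):
        if all(free[start:start + length]):
            return start
    return None


def scan_complex(free, length):
    """Look for `length` consecutive free blocks starting at any offset."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == length:
            return i - length + 1
    return None


def allocate(free, length, stripe):
    """Try the aligned scan first; fall back to the complex scan if it fails."""
    if length % stripe == 0:
        start = scan_aligned(free, length, stripe)
        if start is not None:
            return start
    return scan_complex(free, length)


# A 16-block group where only blocks 5-8 are free: a 4-block run exists,
# but it does not start on a multiple of the stripe size (4).
group = [False] * 16
for blk in range(5, 9):
    group[blk] = True
```

Here the aligned scan alone finds nothing, while the fallback returns block 5 from the same group instead of going back to pick another group.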
There's nothing page-specific happening in ext4_da_do_write_end();
it's merely used for its refcount & lock, both of which are folio
properties. Saves four calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231214053035.1018876-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The ext4 filesystem tracks the trim status of blocks at the group
level. When an entire group has been trimmed then it is marked as
such and subsequent trim invocations with the same minimum trim size
will not be attempted on that group unless it is marked as able to be
trimmed again such as when a block is freed.
Currently the last group can't be marked as trimmed due to incorrect
logic in ext4_last_grp_cluster(). ext4_last_grp_cluster() is supposed
to return the zero based index of the last cluster in a group. This is
then used by ext4_try_to_trim_range() to determine if the trim
operation spans the entire group and as such if the trim status of the
group should be recorded.
ext4_last_grp_cluster() takes a 0 based group index, thus the valid
values for grp are 0..(ext4_get_groups_count - 1). Any group index
less than (ext4_get_groups_count - 1) is not the last group and must
have EXT4_CLUSTERS_PER_GROUP(sb) clusters. For the last group we need
to calculate the number of clusters based on the number of blocks in
the group. Finally subtract 1 from the number of clusters as zero
based indexing is expected. Rearrange the function slightly to make
it clear what we are calculating and returning.
Reproducer:
// Create file system where the last group has fewer blocks than
// blocks per group
$ mkfs.ext4 -b 4096 -g 8192 /dev/nvme0n1 8191
$ mount /dev/nvme0n1 /mnt
Before Patch:
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
// Group not marked as trimmed so second invocation still discards blocks
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
After Patch:
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
// Group marked as trimmed so second invocation DOESN'T discard any blocks
$ fstrim -v /mnt
/mnt: 0 B (0 bytes) trimmed
Fixes: 45e4ab320c ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()")
Cc: <stable@vger.kernel.org> # 4.19+
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213051635.37731-1-surajjs@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
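The calculation described above can be modelled in a few lines. This is a sketch, not the kernel function: all parameter names are hypothetical, the block-to-cluster conversion by a right shift is an assumption, and the example values assume the reproducer's filesystem has no bigalloc (cluster_bits of 0) and a first block number of 0.

```python
def last_grp_cluster(grp, groups_count, clusters_per_group,
                     blocks_count, group_first_block, cluster_bits):
    """Zero-based index of the last cluster in group `grp`."""
    if grp < groups_count - 1:
        # Not the last group: it must be a full group.
        nr_clusters = clusters_per_group
    else:
        # Last group: derive the cluster count from the blocks it holds.
        nr_clusters = (blocks_count - group_first_block) >> cluster_bits
    return nr_clusters - 1


# Reproducer geometry: 8191 blocks, 8192 blocks per group, a single group.
only_group = last_grp_cluster(grp=0, groups_count=1, clusters_per_group=8192,
                              blocks_count=8191, group_first_block=0,
                              cluster_bits=0)

# A full first group in a hypothetical two-group filesystem.
full_group = last_grp_cluster(grp=0, groups_count=2, clusters_per_group=8192,
                              blocks_count=12000, group_first_block=0,
                              cluster_bits=0)
```

For the reproducer the short last group ends at cluster 8190, while a full group ends at cluster 8191. A trim that reaches that last index can then be treated as covering the whole group.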
- Do not allow misconfigured ECC sizes (Sergey Shtylyov)
- Allow for odd number of CPUs (Weichen Chen)
- Refactor error handling to use cleanup.h
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcPVEWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJgjoD/wIWCiu4TWAziFlXy4Gmz2bFT+i
bY7APft9SkCa9QENIohhKaNuDYSymjGfq+cupvZ3/erDdfjgwPAg/Cs8fiKnAdRY
+sSFyDttcZu0Z9u7QB1TI2GG4E0MA/x9K001RwNzODj27yCj4mozuwoyfiuiTgHo
Dclkl2p7b4SjXrWuh5tCSaaV3TO3af8rAseT63phoqBM0BwRwh7rza1A3LhDoeWY
27/uba919KwTfvBH+yqOtglsWIe9bBI+vr4J9OGb2DOdpWi3yhwe074mjCn5C/BR
TpQDUT5moX0xsmdc4NaTKgyxWQ5EOa832TjNbPn5RMqaslVvnz6zYLCL+D1qYQvG
Jasbg8qa8hqdxS+KxgPZTSfkmpYi80AxzBGngRrlXEMArLTW40dhmebXN5QiT0CP
IKMYq7xuPiVN+GiZTl7hThqxFTOb5I6pbKDoIUFPCTIjJUcLTwM9y71dQ+XzJHKu
GAHvzvzLSD2Y0BwaWedWinPjTqaBsOfeqecE77dIkMoFWa7Y0dx0BxySUT2dUKny
6Z28mMX6C9sf5ncdJLjcEXf0UDECfnuXw+1NJUwyaSBtlR56pWIk33YFWf1+u3Jn
p6ZX6Jx6A77h4236A63zodSdna4NzuSQETmyqFvJOra8Gubidx2ggwL9EdxK4qHq
tQJxHbxI4+vRpaOm7A==
=TOWy
-----END PGP SIGNATURE-----
Merge tag 'pstore-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
- Do not allow misconfigured ECC sizes (Sergey Shtylyov)
- Allow for odd number of CPUs (Weichen Chen)
- Refactor error handling to use cleanup.h
* tag 'pstore-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: inode: Use cleanup.h for struct pstore_private
pstore: inode: Use __free(pstore_iput) for inode allocations
pstore: inode: Convert mutex usage to guard(mutex)
pstore: inode: Convert kfree() usage to __free(kfree)
pstore: ram_core: fix possible overflow in persistent_ram_init_ecc()
pstore/ram: Fix crash when setting number of cpus to an odd number
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmWeozwACgkQEVvVyTe/
1WqnyA//U2Ka5ZIncs/hA5D03LMyuCh9qlH5qAGce5vrBTxogTlFuTGGKtsUuCB5
Y4GALO+fw8aWAowt5X1XfHD3TETLVbCshT7dYjKsKy/ojANCbgkCipXBudYx+l9m
fllwTZyueK0UY14kCU2DAV5PYsI/XVVykk71GSMOMLCUfRJfDI7R0vBD0NaUd7Kz
Wp/M6t0MnXX23nGUdgNoroZPPj3Ts/gK2MXID+QHXGaR2+M1B1lLKfSu6TcRDLtn
tbe/ivaw4y1jj3jfFwMC7sSSDyIJeZh9tBB4Rvv2vsMiYU8zAC6Eg35eIbPONu42
pUMd0QQa79H3cyYEDtUzyskzur0Jry5azzb8JdQWipgVKFh5g3CHce2XAFlVjw2a
9RyCKg41A9LvdB5l/PvBtsxig2PzaYqE09rXAfUM7eLNFlOLbL99uc1WJbIFfG43
Czh9vPxsuJ5RkdwS7R0m4GYDw8+BKW6WjpaC+Eje4I8X1rAQK0H/BLTCxe2dLRB7
0neAg8e3g6NdisRSLOP74xoEn/dhijNP7ENOFF1EdP/BFPHL7+sRsV6XYwwBeUAc
c6YsxeAPylm6gvIq/ESoRiY+e5QWvImHIWP+zB/cySYdT0fQHL9WjO6/uZW0ALuv
oZugICSmZ15pYlACIU8iYztRkS19CJZrUV7Gbq4+AurUKP8kCEI=
=2Ohx
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs updates from Amir Goldstein:
"This is a very small update with no bug fixes and no new features.
The larger update of overlayfs for this cycle, the re-factoring of
overlayfs code into generic backing_file helpers, was already merged
via Christian.
Summary:
- Simplify/clarify some code
No bug fixes here, just some changes following questions from Al
about overlayfs code that could be a little simpler to follow.
- Overlayfs documentation style fixes
Mainly fixes for ReST formatting suggested by documentation
developers"
* tag 'ovl-update-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
overlayfs.rst: fix ReST formatting
overlayfs.rst: use consistent feature names
ovl: initialize ovl_copy_up_ctx.destname inside ovl_do_copy_up()
ovl: remove redundant ofs->indexdir member
- Add basic sub-page compressed data support;
- Fix a memory leak on MicroLZMA and DEFLATE compression;
- Fix a rare LZ4 inplace decompression issue on recent x86 CPUs;
- Fix a KASAN issue reported by syzbot around crafted images;
- Some cleanups.
Merge tag 'erofs-for-6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we'd like to enable basic sub-page compressed data
support for the Android ecosystem (for vendors to try out 16k page size
with 4k-block images in their compatibility mode) as well as container
images (so that 4k-block images can be parsed on arm64 cloud servers
using 64k page size.)
In addition, there are several bugfixes and cleanups as usual. All
commits have been in -next for a while and no potential merge conflict
is observed.
Summary:
- Add basic sub-page compressed data support
- Fix a memory leak on MicroLZMA and DEFLATE compression
- Fix a rare LZ4 inplace decompression issue on recent x86 CPUs
- Fix a KASAN issue reported by syzbot around crafted images
- Some cleanups"
* tag 'erofs-for-6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: make erofs_{err,info}() support NULL sb parameter
erofs: avoid debugging output for (de)compressed data
erofs: allow partially filled compressed bvecs
erofs: enable sub-page compressed block support
erofs: refine z_erofs_transform_plain() for sub-page block support
erofs: fix ztailpacking for subpage compressed blocks
erofs: fix up compacted indexes for block size < 4096
erofs: record `pclustersize` in bytes instead of pages
erofs: support I/O submission for sub-page compressed blocks
erofs: fix lz4 inplace decompression
erofs: fix memory leak on short-lived bounced pages
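To make the sub-page case above concrete (4k-block images on 16k or 64k pages), the following sketch works out which page a filesystem block falls into and at what byte offset. It is plain arithmetic under the assumption that blocks are laid out contiguously. `struct blk_pos` and `locate_block` are hypothetical names, not erofs functions.

```c
#include <stdint.h>

/* Position of a filesystem block inside the page-granular address space. */
struct blk_pos {
	uint64_t page_index;		/* which page holds the block */
	uint32_t offset_in_page;	/* byte offset of the block in that page */
};

/* Map block number blkaddr (block size blksz) onto pages of size pagesz.
 * Assumes pagesz is a multiple of blksz, as in the 4k-on-16k/64k case. */
static struct blk_pos locate_block(uint64_t blkaddr, uint32_t blksz,
				   uint32_t pagesz)
{
	uint64_t byte = blkaddr * blksz;
	struct blk_pos pos = {
		.page_index = byte / pagesz,
		.offset_in_page = (uint32_t)(byte % pagesz),
	};

	return pos;
}
```

With 4096-byte blocks and 16384-byte pages, four blocks share each page, so block 5 lands in page 1 at offset 4096. With block size equal to page size the offset is always zero, which is the only case that needed handling before sub-page support.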
Adjust the timing of the fscrypt keyring destruction, to prepare for
btrfs's fscrypt support. Also document that CephFS supports fscrypt now.
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"Adjust the timing of the fscrypt keyring destruction, to prepare for
btrfs's fscrypt support.
Also document that CephFS supports fscrypt now"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fs: move fscrypt keyring destruction to after ->put_super
f2fs: move release of block devices to after kill_block_super()
fscrypt: document that CephFS supports fscrypt now
fscrypt: update comment for do_remove_key()
fscrypt.rst: update definition of struct fscrypt_context_v2
The bulk of the patches for this release are clean-ups and minor bug
fixes.
There is one significant revert to mention: support for RDMA Read
operations in the server's RPC-over-RDMA transport implementation
has been fixed so it waits for Read completion in a way that avoids
tying up an nfsd thread. This prevents a possible DoS vector if an
RPC-over-RDMA client should become unresponsive during RDMA Read
operations.
As always I am grateful to NFSD contributors, reviewers, and
testers.
Merge tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are clean-ups and minor bug
fixes.
There is one significant revert to mention: support for RDMA Read
operations in the server's RPC-over-RDMA transport implementation has
been fixed so it waits for Read completion in a way that avoids tying
up an nfsd thread. This prevents a possible DoS vector if an
RPC-over-RDMA client should become unresponsive during RDMA Read
operations.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
nfsd: rename nfsd_last_thread() to nfsd_destroy_serv()
SUNRPC: discard sv_refcnt, and svc_get/svc_put
svc: don't hold reference for poolstats, only mutex.
SUNRPC: remove printk when back channel request not found
svcrdma: Implement multi-stage Read completion again
svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete()
svcrdma: Add back svcxprt_rdma::sc_read_complete_q
svcrdma: Add back svc_rdma_recv_ctxt::rc_pages
svcrdma: Clean up comment in svc_rdma_accept()
svcrdma: Remove queue-shortening warnings
svcrdma: Remove pointer addresses shown in dprintk()
svcrdma: Optimize svc_rdma_cc_init()
svcrdma: De-duplicate completion ID initialization helpers
svcrdma: Move the svc_rdma_cc_init() call
svcrdma: Remove struct svc_rdma_read_info
svcrdma: Update the synopsis of svc_rdma_read_special()
svcrdma: Update the synopsis of svc_rdma_read_call_chunk()
svcrdma: Update synopsis of svc_rdma_read_multiple_chunks()
svcrdma: Update synopsis of svc_rdma_copy_inline_range()
svcrdma: Update the synopsis of svc_rdma_read_data_item()
...
This set cleans up the interface between nfs lockd and dlm, which
is handling nfs file locking for gfs2 and ocfs2. Very basic lockd
functionality is fixed, in which the fl owner was using the lockd
pid instead of the owner value from nfs.
Merge tag 'dlm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set cleans up the interface between nfs lockd and dlm, which is
handling nfs file locking for gfs2 and ocfs2. Very basic lockd
functionality is fixed, in which the fl owner was using the lockd pid
instead of the owner value from nfs"
* tag 'dlm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: update format header reflect current format
dlm: fix format seq ops type 4
dlm: implement EXPORT_OP_ASYNC_LOCK
dlm: use FL_SLEEP to determine blocking vs non-blocking
dlm: use fl_owner from lockd
dlm: use kernel_connect() and kernel_bind()
Merge tag 'afs-fix-rotation-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs updates from David Howells:
"The majority of the patches are aimed at fixing and improving the AFS
filesystem's rotation over server IP addresses, but there are also
some fixes from Oleg Nesterov for the use of read_seqbegin_or_lock().
- Fix fileserver probe handling so that the next round of probes
doesn't break ongoing server/address rotation by clearing all the
probe result tracking. This could occasionally cause the rotation
algorithm to drop straight through, give a 'successful' result
without actually emitting any RPC calls, leaving the reply buffer
in an undefined state.
Instead, detach the probe results into a separate struct and
allocate a new one each time we start probing and update the
pointer to it. Probes are also sent in order of address preference
to try and improve the chance that the preferred one will complete
first.
- Fix server rotation so that it applies configurable address
preferences to the probes that have completed so far rather than
ranking them by RTT, as the latter doesn't necessarily give the best
route. The preference list can be altered by writing into
/proc/net/afs/addr_prefs.
- Fix the handling of Read-Only (and Backup) volume callbacks as
there is one per volume, not one per file, so if someone performs a
command that, say, offlines the volume but doesn't change it, when
it comes back online we don't spam the server with a status fetch
for every vnode we're using. Instead, check the Creation timestamp
in the VolSync record when prompted by a callback break.
- Handle volume regression (ie. a RW volume being restored from a
backup) by scrubbing all cache data for that volume. This is
detected from the VolSync creation timestamp.
- Adjust abort handling and abort -> error mapping to match better
with what other AFS clients do.
- Fix offline and busy volume state handling as they only apply to
individual server instances and not entire volumes and the rotation
algorithm should go and look at other servers if available. Also
make it sleep briefly before each retry if all the volume instances
are unavailable"
* tag 'afs-fix-rotation-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (40 commits)
afs: trace: Log afs_make_call(), including server address
afs: Fix offline and busy message emission
afs: Fix fileserver rotation
afs: Overhaul invalidation handling to better support RO volumes
afs: Parse the VolSync record in the reply of a number of RPC ops
afs: Don't leave DONTUSE/NEWREPSITE servers out of server list
afs: Fix comment in afs_do_lookup()
afs: Apply server breaks to mmap'd files in the call processor
afs: Move the vnode/volume validity checking code into its own file
afs: Defer volume record destruction to a workqueue
afs: Make it possible to find the volumes that are using a server
afs: Combine the endpoint state bools into a bitmask
afs: Keep a record of the current fileserver endpoint state
afs: Dispatch vlserver probes in priority order
afs: Dispatch fileserver probes in priority order
afs: Mark address lists with configured priorities
afs: Provide a way to configure address priorities
afs: Remove the unimplemented afs_cmp_addr_list()
afs: Add some more info to /proc/net/afs/servers
rxrpc: Create a procfile to display outstanding client conn bundles
...
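The rotation change described in the afs summary above (choose among the probes that have completed by configured preference rather than by RTT) can be sketched as follows. This is only an illustration: `struct addr_probe`, its fields and `pick_address` are hypothetical, and treating a larger `prio` value as "more preferred" is an assumption, not something taken from the afs code.

```c
#include <stdbool.h>
#include <stddef.h>

/* One candidate server address and the state of its probe. */
struct addr_probe {
	unsigned int prio;	/* configured preference; larger wins (assumed) */
	unsigned int rtt_us;	/* measured round-trip time, kept but not ranked on */
	bool responded;		/* has the probe completed so far? */
};

/* Return the index of the most preferred address among the probes that
 * have responded, or -1 if none has responded yet. RTT is deliberately
 * ignored: a low RTT does not necessarily mean the best route. */
static int pick_address(const struct addr_probe *p, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!p[i].responded)
			continue;
		if (best < 0 || p[i].prio > p[best].prio)
			best = (int)i;
	}
	return best;
}
```

Note that an address with a high preference but an outstanding probe is skipped until it responds, so the choice can improve as more probes complete.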
- Add support for non-blocking lookup (MAY_NOT_BLOCK / LOOKUP_RCU)
- Various minor fixes and cleanups
Merge tag 'gfs2-v6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Add support for non-blocking lookup (MAY_NOT_BLOCK / LOOKUP_RCU)
- Various minor fixes and cleanups
* tag 'gfs2-v6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix freeze consistency check in log_write_header
gfs2: Refcounting fix in gfs2_thaw_super
gfs2: Minor gfs2_{freeze,thaw}_super cleanup
gfs2: Use wait_event_freezable_timeout() for freezable kthread
gfs2: Add missing set_freezable() for freezable kthread
gfs2: Remove use of error flag in journal reads
gfs2: Lift withdraw check out of gfs2_ail1_empty
gfs2: Rename gfs2_withdrawn to gfs2_withdrawing_or_withdrawn
gfs2: Mark withdraws as unlikely
gfs2: Minor gfs2_ail1_empty cleanup
gfs2: use is_subdir()
gfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing
gfs2: Use GL_NOBLOCK flag for non-blocking lookups
gfs2: Add GL_NOBLOCK flag
gfs2: rgrp: fix kernel-doc warnings
gfs2: fix kernel BUG in gfs2_quota_cleanup
gfs2: Fix inode_go_instantiate description
gfs2: Fix kernel NULL pointer dereference in gfs2_rgrp_dump
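The non-blocking lookup support mentioned above follows a general VFS pattern: under RCU-walk (MAY_NOT_BLOCK / LOOKUP_RCU) a filesystem must not sleep, and conventionally returns -ECHILD so the caller can retry in blocking mode. The sketch below models that pattern in userspace. `SK_NOBLOCK`, `struct sk_lock`, `sk_acquire` and `sk_lookup` are hypothetical names loosely modeled on GL_NOBLOCK, and the "blocking" path only pretends to wait.

```c
#include <errno.h>
#include <stdbool.h>

#define SK_NOBLOCK 0x1	/* hypothetical flag, modeled on GL_NOBLOCK */

/* Toy lock: only records whether some other holder currently has it. */
struct sk_lock {
	bool held_elsewhere;
};

/* Try to take the lock. With SK_NOBLOCK set, refuse to wait and report
 * -ECHILD instead; without it, "wait" for the holder to go away. */
static int sk_acquire(struct sk_lock *l, unsigned int flags)
{
	if (l->held_elsewhere) {
		if (flags & SK_NOBLOCK)
			return -ECHILD;
		/* A real implementation would sleep here; the sketch
		 * simply pretends the other holder has released. */
		l->held_elsewhere = false;
	}
	return 0;
}

/* Fast path first, then fall back to the blocking path on -ECHILD,
 * mirroring how a path walk retries outside RCU mode. */
static int sk_lookup(struct sk_lock *l)
{
	int err = sk_acquire(l, SK_NOBLOCK);

	if (err == -ECHILD)
		err = sk_acquire(l, 0);
	return err;
}
```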
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users; it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated a long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows allocating with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order write is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
* New features/functionality
* Online repair
* Reserve disk space for online repairs.
* Fix misinteraction between the AIL and btree bulkloader because of
which the bulk load fails to queue a buffer for writeback if it
happens to be on the AIL list.
* Prevent transaction reservation overflows when reaping blocks during
online repair.
* Whenever possible, bulkloader now copies multiple records into a
block.
* Support repairing of
1. Per-AG free space, inode and refcount btrees.
2. Ondisk inodes.
3. File data and attribute fork mappings.
* Verify the contents of
1. Inode and data fork of realtime bitmap file.
2. Quota files.
* Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks about
a pmem device being removed.
* Bug fixes
* Fix memory leak of recovered attri intent items.
* Fix UAF during log intent recovery.
* Fix realtime geometry integer overflows.
* Prevent scrub from live locking in xchk_iget.
* Prevent fs shutdown when removing files during low free disk space.
* Prevent transaction reservation overflow when extending an RT device.
* Prevent incorrect warning from being printed when extending a
filesystem.
* Fix an off-by-one error in xreap_agextent_binval.
* Serialize access to perag radix tree during deletion operation.
* Fix perag memory leak during growfs.
* Allow allocation of minlen realtime extent when the maximum sized
realtime free extent is minlen in size.
* Cleanups
* Remove duplicate boilerplate code spread across functionality associated
with different log items.
* Clean up resblks interfaces.
* Pass defer ops pointer to defer helpers instead of an enum.
* Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings.
* Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size of
structures and structure member offsets. This is done in order to be
able to share the code with userspace.
* Move XFS documentation under a new directory specific to XFS.
* Do not invoke deferred ops' ->create_done callback if the deferred
operation does not have an intent item associated with it.
* Remove duplicate inclusion of header files from scrub/health.c.
* Refactor Realtime code.
* Clean up attr code.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Merge tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Chandan Babu:
"New features/functionality:
- Online repair:
- Reserve disk space for online repairs
- Fix misinteraction between the AIL and btree bulkloader because
of which the bulk load fails to queue a buffer for writeback if
it happens to be on the AIL list
- Prevent transaction reservation overflows when reaping blocks
during online repair
- Whenever possible, bulkloader now copies multiple records into
a block
- Support repairing of
1. Per-AG free space, inode and refcount btrees
2. Ondisk inodes
3. File data and attribute fork mappings
- Verify the contents of
1. Inode and data fork of realtime bitmap file
2. Quota files
- Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks
about a pmem device being removed
Bug fixes:
- Fix memory leak of recovered attri intent items
- Fix UAF during log intent recovery
- Fix realtime geometry integer overflows
- Prevent scrub from live locking in xchk_iget
- Prevent fs shutdown when removing files during low free disk space
- Prevent transaction reservation overflow when extending an RT
device
- Prevent incorrect warning from being printed when extending a
filesystem
- Fix an off-by-one error in xreap_agextent_binval
- Serialize access to perag radix tree during deletion operation
- Fix perag memory leak during growfs
- Allow allocation of minlen realtime extent when the maximum sized
realtime free extent is minlen in size
Cleanups:
- Remove duplicate boilerplate code spread across functionality
associated with different log items
- Clean up resblks interfaces
- Pass defer ops pointer to defer helpers instead of an enum
- Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings
- Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size
of structures and structure member offsets. This is done in order
to be able to share the code with userspace
- Move XFS documentation under a new directory specific to XFS
- Do not invoke deferred ops' ->create_done callback if the deferred
operation does not have an intent item associated with it
- Remove duplicate inclusion of header files from scrub/health.c
- Refactor Realtime code
- Clean up attr code"
* tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (123 commits)
xfs: use the op name in trace_xlog_intent_recovery_failed
xfs: fix a use after free in xfs_defer_finish_recovery
xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts
xfs: remove xfs_attr_sf_hdr_t
xfs: remove struct xfs_attr_shortform
xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue
xfs: remove xfs_attr_shortform_lookup
xfs: simplify xfs_attr_sf_findname
xfs: move the xfs_attr_sf_lookup tracepoint
xfs: return if_data from xfs_idata_realloc
xfs: make if_data a void pointer
xfs: fold xfs_rtallocate_extent into xfs_bmap_rtalloc
xfs: simplify and optimize the RT allocation fallback cascade
xfs: reorder the minlen and prod calculations in xfs_bmap_rtalloc
xfs: remove XFS_RTMIN/XFS_RTMAX
xfs: remove rt-wrappers from xfs_format.h
xfs: factor out a xfs_rtalloc_sumlevel helper
xfs: tidy up xfs_rtallocate_extent_exact
xfs: merge the calls to xfs_rtallocate_range in xfs_rtallocate_block
xfs: reflow the tail end of xfs_rtallocate_extent_block
...
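One cleanup in the xfs summary above replaces BUILD_BUG_ON_MSG() with static_assert() for validating structure sizes and member offsets, so the checks can be shared with userspace. A minimal sketch of that kind of check in standard C11 follows. `struct demo_ondisk_hdr` is a made-up layout for illustration and not an XFS structure.

```c
#include <assert.h>	/* C11 static_assert */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Hypothetical on-disk header; the fixed-width fields are ordered so
 * that the compiler inserts no padding. */
struct demo_ondisk_hdr {
	uint32_t magic;
	uint16_t version;
	uint16_t flags;
	uint64_t blkno;
};

/* Compile-time layout checks: the build fails with the given message
 * if the structure ever changes size or a member moves. */
static_assert(sizeof(struct demo_ondisk_hdr) == 16,
	      "demo_ondisk_hdr must stay 16 bytes");
static_assert(offsetof(struct demo_ondisk_hdr, blkno) == 8,
	      "blkno must stay at byte offset 8");

/* Runtime accessor, only here so the layout can also be inspected. */
static size_t demo_hdr_size(void)
{
	return sizeof(struct demo_ondisk_hdr);
}
```

Because static_assert() is part of the language standard rather than a kernel macro, the same header can be compiled unchanged by userspace tools.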