Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter. Pages with a type set can't be mapped to
userspace.
But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.
Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.
This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As the
result the code became more robust with fewer open-coded bit tricks.
This patch (of 4):
Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.
It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information. In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.
This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);
page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.
To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
This option lets a user set a per socket NAPI budget for
busy-polling. If the options is not set, it will use the default of 8.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
an opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
nfs_inc_stats() is already thread-safe, and there are no other reasons
to hold the inode lock here.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Remove the contentious inode lock, and instead provide thread safety
using the file->f_lock spinlock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Certain NFSv4.2/RDMA tests fail with v5.9-rc1.
rpcrdma_convert_kvec() runs off the end of the rl_segments array
because rq_rcv_buf.tail[0].iov_len holds a very large positive
value. The resultant kernel memory corruption is enough to crash
the client system.
Callers of rpc_prepare_reply_pages() must reserve an extra XDR_UNIT
in the maximum decode size for a possible XDR pad of the contents
of the xdr_buf's pages. That guarantees the allocated receive buffer
will be large enough to accommodate the usual contents plus that XDR
pad word.
encode_op_hdr() cannot add that extra word. If it does,
xdr_inline_pages() underruns the length of the tail iovec.
Fixes: 3e1f02123f ("NFSv4.2: add client side XDR handling for extended attributes")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We forgot to unregister the nfs4_xattr_large_entry_shrinker.
That leaves the global list of shrinkers corrupted after unload of the
nfs module, after which possibly unrelated code that calls
register_shrinker() or unregister_shrinker() gets a BUG() with
"supervisor write access in kernel mode".
And similarly for the nfs4_xattr_large_entry_lru.
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Tested-By: Kris Karas <bugs-a17@moonlit-rail.com>
Fixes: 95ad37f90c "NFSv4.2: add client side xattr caching."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl+qtyAACgkQxWXV+ddt
WDusYg/+P1SroRe2n33Pi6v9w47Luqjfi5qMfdrV/ex7g9bP/7dVPvAa0YJr7CpA
UHpvcgWMFK2e29oOoeEoYXukHQ4BKtC6F5L0MPgVJocDT6xsAWM3v98VZn69Olcu
TcUNkxUVbj7OBbQDaINV8dXnTbQWNOsfNlYXH+nPgWqrjSDbPoLkXIEVfZ9CTBeZ
P/qEshkTXqvx1Pux/uRcKrMSu+lSFICKlLvky0b9gRpg4usVlF8jlGQrvJHQvqnP
lECR1cb7/nf2PQ+HdPpgigD24bddiiORoyGW68Q1zZHgs+kGfL6p4M3WF04WINrV
Taiv7WVZ6qHiEB+LDxlOx2cy0Z6YzFOaGSASz+Hh64hvOezBOVGvCF1U9U/Dp+MC
n6QjUiw6c0rIjbdoxpTfQETCdIt/l3qXfOVEr9Zjr2KEbasLsZXdGSf3ydr81Uff
94CwrXp2wq429zu4mdCfOwihF/288+VrN8XRfkSy5RFQ5hHVnZBFQO4KbRIQ4i5X
ZIjHQPX0jA/XN/jpUde/RJL5AyLz20n0o9I3frjXwSs+rvU3f0wD//fxmXlRUshM
hsFXFKO0VdaFtoywVIf7VK/fDsKQhiq+9Yg48A8ylpk+W7meMjeDYuYMIEhMQX3m
S1OMG1Qf27pWXD6KEzpaqzI4SrYBOGVDsX8qxMxRws7n55koVJc=
=3FTR
-----END PGP SIGNATURE-----
Merge tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A handful of minor fixes and updates:
- handle missing device replace item on mount (syzbot report)
- fix space reservation calculation when finishing relocation
- fix memory leak on error path in ref-verify (debugging feature)
- fix potential overflow during defrag on 32bit arches
- minor code update to silence smatch warning
- minor error message updates"
* tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod
btrfs: dev-replace: fail mount if we don't have replace item with target device
btrfs: scrub: update message regarding read-only status
btrfs: clean up NULL checks in qgroup_unreserve_range()
btrfs: fix min reserved size calculation in merge_reloc_root
btrfs: print the block rsv type when we fail our reservation
btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch
Fix a regression where a new WARN_ON() was reachable when using
FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32 on ext4, causing xfstest generic/602
to sometimes fail on ext4.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX6nKxBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3FwAQD9r8ROaizX2LEYhXYO2uUIcnPMdngD
FOtghnSohKSKAQEAv4fm04Gd67kCIbRh25zUfykRJoC8kpdl52k+zzqMvwA=
=1vim
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt fix from Eric Biggers:
"Fix a regression where a new WARN_ON() was reachable when using
FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32 on ext4, causing xfstest
generic/602 to sometimes fail on ext4"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: remove reachable WARN in fscrypt_setup_iv_ino_lblk_32_key()
refactoring.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl+pk7QVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+lwgQAL0WE92H1QJwYtrC5bXko1CjXjL7
I1lv/rMf1ZHhdbZLZQNSqXFYTGrO3w6n02H7bJcYlryg5YSt8i8evdJXICYyeZIX
5QAT0K5hzHTNWKnumqBSwoVOPl1e6ImZtmyxqQvA/2sQP18OPvroK/9H0YkdnM3/
d8lcpKTBCJj0UAWmktaXGYG8PdNSjaNXMfPRwpCOGHiXk+QBAb+QjshB54PKjjhR
aiJTJzceroLer0YlQSXfVQMt6EwkTkjCbMbxPywfFYGGvl/Y7H4YgVA8rYqO/XZr
BmP9V+xX87GyB0IEGxoheVcmTMUSw37JUfAC2oBQB9g2emG5avRn4vdhL25nKd1T
sgaVC+0tnoMQ7KNaYp1SK6orgS+OIYeQLhxbu6jmU+viccJ621JmpRF+95OwEZ9Z
4+vBwI3Oft20jndgNwrTvCLgkzEVFpJuayBeZCk7pvchM2YjaWwl291ix+cwM2wQ
fwMVs6dpLIgfB8jNOM6qAfI1jB1HMePrPraqxddxh5tZ0Tt4C4uwpEIDDwaPesmJ
FK3JB+7GpU/tMHmmaeVFUMGx9V+8fJFEC0MFUrrqAMZ3XbzQ+DM5ysk1TQsO0OEO
F1ojiYNW8s4U+dLCY0S16vFVoQIuM9Ui1zXGaJHQgS04l+cFCmD495s4HtYA1k7l
H/T/o416bZlbOhcK
=bpPt
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.10-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"This is mainly server-to-server copy and fallout from Chuck's 5.10 rpc
refactoring"
* tag 'nfsd-5.10-1' of git://linux-nfs.org/~bfields/linux:
net/sunrpc: fix useless comparison in proc_do_xprt()
net/sunrpc: return 0 on attempt to write to "transports"
NFSD: fix missing refcount in nfsd4_copy by nfsd4_do_async_copy
NFSD: Fix use-after-free warning when doing inter-server copy
NFSD: MKNOD should return NFSERR_BADTYPE instead of NFSERR_INVAL
SUNRPC: Fix general protection fault in trace_rpc_xdr_overflow()
NFSD: NFSv3 PATHCONF Reply is improperly formed
few other miscellaneous bug fixes and a cleanup for the MAINTAINERS
file.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl+pfIoACgkQ8vlZVpUN
gaMGFAf9FP9DoKQp10zp98LphiPRzMAqt9/ghUWcpz1bkXy33+NHivEi1tOTFcrZ
hetvtOi/YnMlcD8f5IMf2vyOvj96ubI9fgsN3CIGNzU6kQm5E1s/h14PdQ2OkJbb
Kn/BpmaWcTZRj0OXt9CcnEqAYIGrRMaHgZLcDoMwOCr+WgUTJD9Sk7mMLDRBkh8u
QXnfRG2Ahsip8ZUdNTxB7fWPC8BkQAhkLnUe+9mMzIQEMDNs7kfPhnuN+ka334KV
62rc99lYvy3jWV34Iahd/pwS8VOYb0x4EHtcqD28bePy/WR9GU54bdbqMDV33bsx
D+gnQLfwgoW92+3/2TTXvpG4WPWeqQ==
=Y2bI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes and cleanups from Ted Ts'o:
"More fixes and cleanups for the new fast_commit features, but also a
few other miscellaneous bug fixes and a cleanup for the MAINTAINERS
file"
* tag 'ext4_for_linus_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (28 commits)
jbd2: fix up sparse warnings in checkpoint code
ext4: fix sparse warnings in fast_commit code
ext4: cleanup fast commit mount options
jbd2: don't start fast commit on aborted journal
ext4: make s_mount_flags modifications atomic
ext4: issue fsdev cache flush before starting fast commit
ext4: disable fast commit with data journalling
ext4: fix inode dirty check in case of fast commits
ext4: remove unnecessary fast commit calls from ext4_file_mmap
ext4: mark buf dirty before submitting fast commit buffer
ext4: fix code documentatioon
ext4: dedpulicate the code to wait on inode that's being committed
jbd2: don't read journal->j_commit_sequence without taking a lock
jbd2: don't touch buffer state until it is filled
jbd2: add todo for a fast commit performance optimization
jbd2: don't pass tid to jbd2_fc_end_commit_fallback()
jbd2: don't use state lock during commit path
jbd2: drop jbd2_fc_init documentation
ext4: clean up the JBD2 API that initializes fast commits
jbd2: rename j_maxlen to j_total_len and add jbd2_journal_max_txn_bufs
...
- fix setting up pcluster improperly for temporary pages;
- derive atime instead of leaving it empty.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCX6lG0xUcaHNpYW5na2Fv
QHJlZGhhdC5jb20ACgkQOTcx3B+15gRmjQEAspSscgSq13pb1s1z51dxrD7vljqP
gmQ66XB+YAOy6McBAJbvmn37K6Ku0YbeOAdBicIgYe9ykW9PMZagDodOeP8M
=Wnwh
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"A week ago, Vladimir reported an issue that the kernel log would
become polluted if the page allocation debug option is enabled. I also
found this when I cleaned up magical page->mapping and originally
planned to submit these all for 5.11 but it seems the impact can be
noticed so submit the fix in advance.
In addition, nl6720 also reported that atime is empty although EROFS
has the only one on-disk timestamp as a practical consideration for
now but it's better to derive it as what we did for the other
timestamps.
Summary:
- fix setting up pcluster improperly for temporary pages
- derive atime instead of leaving it empty"
* tag 'erofs-for-5.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix setting up pcluster for temporary pages
erofs: derive atime instead of leaving it empty
- Fix an uninitialized struct problem.
- Fix an iomap problem zeroing unwritten EOF blocks.
- Fix some clumsy error handling when writeback fails on
blocksize < pagesize filesystems.
- Fix a retry loop not resetting loop variables properly.
- Fix scrub flagging rtinherit inodes on a non-rt fs, since the kernel
actually does permit that combination.
- Fix excessive page cache flushing when unsharing part of a file.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+jWK8ACgkQ+H93GTRK
tOuiEQ/+IAEncpqUS1PTSWRlNX7MEQDvlnoLl9ZqhaYrW9pNyz8JzxejubkP/7RA
qkI/fgcBIhxOf+mTKguAUsu81we49PmlObWCEb5mfBI2aeoSL/yM4zikOHFNpy0o
f4U9++kpKJwrWG6kyvNwYMyT6r74vLW0EO9lhYjxAY6+5KgZL0SuFuRAaADDtWj8
SKIc/dli6qDS3IrnkibQtzFOOcmeOEn0qcWcS4gD7tbUpJlw0M2g88JjBPoT8oTK
wRBNrspbAA42YbYqlwmkBQZZwM+XZKLZNcvzzLgQLaQdTKEem2w2pB1j1KvJXsSo
ibxhmk1/tGFKtPTmbpm7dUC9ubr7xch6J+GHNwHuaWL2hxBWJzNRVokG1BsbDXcc
FW3ilwLFd8CFUXttQqQfhiUx8wfe2eJ1aXEBK5JeHWRwD+egLI9WXFJQzjUUwe+v
T+7r+0kS2TL3SXKU5TE+gsuuI5mcJpYvcWVqYPwBxjZW0tIhUzBldpfBYysG3ZAm
uhYcw3BHw1ucsjqcSe14CWqA4KnwgfAcKva5AJSLjJBu3wOi1wrFKg/+Wtpo0xA2
yFAqFP5FGW13oqeYtqJy0J79qOw6Po9wl+XnekSiBCEif965KtV+RBMP2/TBG+Pl
R+bNvSXlb1QiDNSjiIG2b34RiDNoiV2k+ELxOz3SSbzIxx8gvm4=
=MqoT
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.10-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix an uninitialized struct problem
- Fix an iomap problem zeroing unwritten EOF blocks
- Fix some clumsy error handling when writeback fails on filesystems
with blocksize < pagesize
- Fix a retry loop not resetting loop variables properly
- Fix scrub flagging rtinherit inodes on a non-rt fs, since the kernel
actually does permit that combination
- Fix excessive page cache flushing when unsharing part of a file
* tag 'xfs-5.10-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: only flush the unshared range in xfs_reflink_unshare
xfs: fix scrub flagging rtinherit even if there is no rt device
xfs: fix missing CoW blocks writeback conversion retry
iomap: clean up writeback state logic on writepage error
iomap: support partial page discard on writeback block mapping failure
xfs: flush new eof page on truncate to avoid post-eof corruption
xfs: set xefi_discard when creating a deferred agfl free log intent item
Merge procfs splice read fixes from Christoph Hellwig:
"Greg reported a problem due to the fact that Android tests use procfs
files to test splice, which stopped working with the changes for
set_fs() removal.
This series adds read_iter support for seq_file, and uses those for
various proc files using seq_file to restore splice read support"
[ Side note: Christoph initially had a scripted "move everything over"
patch, which looks fine, but I personally would prefer us to actively
discourage splice() on random files. So this does just the minimal
basic core set of proc file op conversions.
For completeness, and in case people care, that script was
sed -i -e 's/\.proc_read\(\s*=\s*\)seq_read/\.proc_read_iter\1seq_read_iter/g'
but I'll wait and see if somebody has a strong argument for using
splice on random small /proc files before I'd run it on the whole
kernel. - Linus ]
* emailed patches from Christoph Hellwig <hch@lst.de>:
proc "seq files": switch to ->read_iter
proc "single files": switch to ->read_iter
proc/stat: switch to ->read_iter
proc/cpuinfo: switch to ->read_iter
proc: wire up generic_file_splice_read for iter ops
seq_file: add seq_read_iter
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+m/0MQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjPHD/4gRQTmbcfC2K9lLvm1xrHZKDvaoRHW5IqT
gMHWYQzVGHKKi3lk0tRzgClXNRE21OGI19gyJ9gdGWqi/2NPYm+4CjDlZezXmQXC
x7icKhJygrOGgsi1/IDE5Y2XClIUIEL+iROIaFz+AtI7J+9v3BfuMty3kqMRK5VY
psqJDOUXdb5ci0SwXKMXzrvHJUHr4tiRUy+TbHBWYfjY4r8WObh35FKOtveBNNtK
5ZyxinXqRu0WcPPJ9t/wdvA8YiwiDJOY3w2zhlWPMGpiJ4gc+OBAxRiNmnEeJ6RN
XKW7xW7Qff6rmNad2K/rQMMMOK19e8csL6hk69gxgMnWDvkWqSJC0DUMzwRGbC2d
IezL/rrjJ7zwrKcwud+QpeGc7niO/DRgLbB4BuH96TE1J1ZbE/lqkVg/JkjDCDyo
atwzF62HbZ+z52+6/V/2IDqTFf0QalmjDJFXY/az3i6bzb8xQcVqri6MO8IbNJDE
pB44BDyZPKwPOH6e2W6U4s/K63E8wvp75YibwSOUhoFiAy7jkfkvQoE7dFXA21/s
kqCUl0voOrGAAAOchnxI4KAoQOXbBM1sjaJtiEWuKgMLxd57sR1GXSYw1zoz7ymX
8wqhKu76RP3WRFNtBQpEeq+xj3iCnaoXkX7q8EJ9tfpe6GJqln5/ROA3Gg9V6oHh
KMqgOAgdYg==
=zplt
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.10-2020-11-07' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A set of fixes for io_uring:
- SQPOLL cancelation fixes
- Two fixes for the io_identity COW
- Cancelation overflow fix (Pavel)
- Drain request cancelation fix (Pavel)
- Link timeout race fix (Pavel)"
* tag 'io_uring-5.10-2020-11-07' of git://git.kernel.dk/linux-block:
io_uring: fix link lookup racing with link timeout
io_uring: use correct pointer for io_uring_show_cred()
io_uring: don't forget to task-cancel drained reqs
io_uring: fix overflowed cancel w/ linked ->files
io_uring: drop req/tctx io_identity separately
io_uring: ensure consistent view of original task ->mm from SQPOLL
io_uring: properly handle SQPOLL request cancelations
io-wq: cancel request if it's asking for files and we don't have them
Add missing __acquires() and __releases() annotations. Also, in an
"this should never happen" WARN_ON check, if it *does* actually
happen, we need to release j_state_lock since this function is always
supposed to release that lock. Otherwise, things will quickly grind
to a halt after the WARN_ON trips.
Fixes: 96f1e09745 ("jbd2: avoid long hold times of j_state_lock...")
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add missing __acquire() and __releases() annotations, and make
fc_ineligible_reasons[] static, as it is not used outside of
fs/ext4/fast_commit.c.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Drop no_fc mount option that disable fast commit even if it was
enabled at mkfs time. Move fc_debug_force mount option under ifdef
EXT4_DEBUG to annotate that this is strictly for debugging and testing
purposes and should not be used in production.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-23-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fast commit file system states are recorded in
sbi->s_mount_flags. Fast commit expects these bit manipulations to be
atomic. This patch adds helpers to make those modifications atomic.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the journal dev is different from fsdev, issue a cache flush before
committing fast commit blocks to disk.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201106035911.1942128-20-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fast commits don't work with data journalling. This patch disables the
fast commit support when data journalling is turned on.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201106035911.1942128-19-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case of fast commits, determine if the inode is dirty by checking
if the inode is on fast commit list. This also helps us get rid of
ext4_inode_info.i_fc_committed_subtid field.
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-18-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add a TODO to remember fixing REQ_FUA | REQ_PREFLUSH for fast commit
buffers. Also, fix a typo in top level comment in fast_commit.c
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-15-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch removes the deduplicates the code that implements waiting
on inode that's being committed. That code is moved into a new
function.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-14-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fast commit buffers should be filled in before toucing their
state. Remove code that sets buffer state as dirty before the buffer
is passed to the file system.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-12-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fast commit performance can be optimized if commit thread doesn't wait
for ongoing fast commits to complete until the transaction enters
T_FLUSH state. Document this optimization.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-11-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In jbd2_fc_end_commit_fallback(), we know which tid to commit. There's
no need for caller to pass it.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-10-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Variables journal->j_fc_off, journal->j_fc_wbuf are accessed during
commit path. Since today we allow only one process to perform a fast
commit, there is no need take state lock before accessing these
variables. This patch removes these locks and adds comments to
describe this.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-9-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch removes jbd2_fc_init() API and its related functions to
simplify enabling fast commits. With this change, the number of fast
commit blocks to use is solely determined by the JBD2 layer. So, we
move the default value for minimum number of fast commit blocks from
ext4/fast_commit.h to include/linux/jbd2.h. However, whether or not to
use fast commits is determined by the file system. The file system
just sets the fast commit feature using
jbd2_journal_set_features(). JBD2 layer then determines how many
blocks to use for fast commits (based on the value found in the JBD2
superblock).
Note that the JBD2 feature flag of fast commits is just an indication
that there are fast commit blocks present on disk. It doesn't tell
JBD2 layer about the intent of the file system of whether to it wants
to use fast commit or not. That's why, we blindly clear the fast
commit flag in journal_reset() after the recovery is done.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The on-disk superblock field sb->s_maxlen represents the total size of
the journal including the fast commit area and is no more the max
number of blocks available for a transaction. The maximum number of
blocks available to a transaction is reduced by the number of fast
commit blocks. So, this patch renames j_maxlen to j_total_len to
better represent its intent. Also, it adds a function to calculate max
number of bufs available for a transaction.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Firstly, pass handle to all ext4_fc_track_* functions and use
transaction id found in handle->h_transaction->h_tid for tracking fast
commit updates. Secondly, don't pass inode to
ext4_fc_track_link/create/unlink functions. inode can be found inside
these functions as d_inode(dentry). However, rename path is an
exeception. That's because in that case, we need inode that's not same
as d_inode(dentry). To handle that, add a couple of low-level wrapper
functions that take inode and dentry as arguments.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_fc_track_range() should only be called when blocks are added or
removed from an inode. So, the only places from where we need to call
this function are ext4_map_blocks(), punch hole, collapse / zero
range, truncate. Remove all the other redundant calls to ths function.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If inode gets evicted due to memory pressure, we have to remove it
from the fast commit list. However, that inode may have uncommitted
changes that fast commits will lose. So, just fall back to full
commits in this case. Also, rename the fast commit ineligiblity reason
from "EXT4_FC_REASON_MEM" to "EXT4_FC_REASON_MEM_NOMEM" for better
expression.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fast commit feature has flags in the file system as well in JBD2. The
meaning of fast commit feature flags can get confusing. Update docs
and code to add more documentation about it.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201106035911.1942128-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It takes xattr_sem to check inline data again but without unlock it
in case not have. So unlock it before return.
Fixes: aef1c8513c ("ext4: let ext4_truncate handle inline data correctly")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604370542-124630-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Smatch complains that "i" can be uninitialized if we don't enter the
loop. I don't know if it's possible but we may as well silence this
warning.
[ Initialize i to sb->s_blocksize instead of 0. The only way the for
loop could be skipped entirely is the in-memory data structures, in
particular the bh->b_data for the on-disk superblock has gotten
corrupted enough that calculated value of group is >= to
ext4_get_groups_count(sb). In that case, we want to exit
immediately without allocating a block. -- TYT ]
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20201030114620.GB3251003@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The macro MOPT_Q is used to indicates the mount option is related to
quota stuff and is defined to be MOPT_NOSUPPORT when CONFIG_QUOTA is
disabled. Normally the quota options are handled explicitly, so it
didn't matter that the MOPT_STRING flag was missing, even though the
usrjquota and grpjquota mount options take a string argument. It's
important that's present in the !CONFIG_QUOTA case, since without
MOPT_STRING, the mount option matcher will match usrjquota= followed
by an integer, and will otherwise skip the table entry, and so "mount
option not supported" error message is never reported.
[ Fixed up the commit description to better explain why the fix
works. --TYT ]
Fixes: 26092bf524 ("ext4: use a table-driven handler for mount options")
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Link: https://lore.kernel.org/r/1603986396-28917-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
our REQUEST_CLOSE message. The code that handled this case was
inadvertently disabled in 5.9, this patch removes it entirely and
fixes the problem in a way that is consistent with ceph-fuse.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl+lookTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziwFGB/43MBB+nG4wBHM58GybIi2NkS/TMmd5
5D3GPmchWYbE1d3hzAJmAYUUZCIx8kh0TeWjPZzR0iEok+f9Zf8bjrGDEFRWWOZc
OL+PMfZckhVS2W8dUkx9CsypnA9/Rx2i7y/XKDDiC3eumfkDMktTdSS4UNxZT5cg
ElmfozOPdv7fRGNPZJiQnkgWdMRFkqiGsdL+9wgRP4qc8WOkipoFouw+gJ2lN3vK
odcY3UGcmx4iuiBj0uXjiFy/MtdYuNLjJrtMmkkEBklxGgIP/1dTOMnV3ktMMYkT
gRUNM7fz/HeZIXb1N6jFs2S/ai1uuS6wP7aTHGfi8W2xgQA5ukDLIbu/
=Tbrl
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.10-rc3' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a potential stall on umount caused by the MDS dropping our
REQUEST_CLOSE message. The code that handled this case was
inadvertently disabled in 5.9, this patch removes it entirely and
fixes the problem in a way that is consistent with ceph-fuse"
* tag 'ceph-for-5.10-rc3' of git://github.com/ceph/ceph-client:
ceph: check session state after bumping session->s_seq
Implement ->read_iter for all proc "seq files" so that splice works on
them.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement ->read_iter for all proc "single files" so that more bionic
tests cases can pass when they call splice() on other fun files like
/proc/version
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement ->read_iter so that splice can be used on this file.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement ->read_iter so that the Android bionic test suite can use
this random proc file for its splice test case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wire up generic_file_splice_read for the iter based proxy ops, so
that splice reads from them work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iov_iter based variant for reading a seq_file. seq_read is
reimplemented on top of the iter variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I_CREATING isn't actually set until the inode has been assigned an inode
number and inserted into the inode hash table. So the WARN_ON() in
fscrypt_setup_iv_ino_lblk_32_key() is wrong, and it can trigger when
creating an encrypted file on ext4. Remove it.
This was sometimes causing xfstest generic/602 to fail on ext4. I
didn't notice it before because due to a separate oversight, new inodes
that haven't been assigned an inode number yet don't necessarily have
i_ino == 0 as I had thought, so by chance I never saw the test fail.
Fixes: a992b20cd4 ("fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()")
Reported-by: Theodore Y. Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20201031004556.87862-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>