- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments.
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmU82MUbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMZ0H+wZHWVUsmqGLGNCt3gfi
m2EJX83VMwY8PzpwZ5ezrx4ibAcUyo7Dhh8OniGgEazC3BNeggoUu/HwpirS22gI
Tx0EMlgLOJQykauiUe6FPem0IbrlbQMI1gLplx6cVd8lgIYZQfMIM5gI0kuCywT3
Ka9sCgp6y3UKQNtHKFwtPRLYFTF3Afyy2C01wdsa800SEqeOAeTD9+8yz7ZnuFt+
bNgu6vJGFfJHkEkvYCwFFqZ1eIfXON6lUFpijNpCGvMN2h1XArLexSk8JRBf6j2+
8+1FrRQsTXRk3G6v9uQABeK7z5W2F8gufmSFyBlXajbZp2HT6j4s2S86u5lP9P9J
l1U=
=etyx
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name
* tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/kprobes: Fix symbol counting logic by looking at modules as well
tracing/kprobes: Fix the description of variable length arguments
Recent changes to count number of matching symbols when creating
a kprobe event failed to take into account kernel modules. As such, it
breaks kprobes on kernel module symbols, by assuming there is no match.
Fix this my calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform a proper counting.
Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/
Cc: Francis Laniel <flaniel@linux.microsoft.com>
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Fixes: b022f0c7e4 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fix the following kernel-doc warnings:
kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start'
kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields'
Refer to the usage of variable length arguments elsewhere in the kernel
code, "@..." is the proper way to express it in the description.
Link: https://lore.kernel.org/all/20231027041315.2613166-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310190437.paI6LYJF-lkp@intel.com/
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When allocating a new pool at runtime, reduce the number of slabs so
that the allocation order is at most MAX_ORDER. This avoids a kernel
warning in __alloc_pages().
The warning is relatively benign, because the pool size is subsequently
reduced when allocation fails, but it is silly to start with a request
that is known to fail, especially since this is the default behavior if
the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any
swiotlb= parameter.
Reported-by: Ben Greear <greearb@candelatech.com>
Closes: https://lore.kernel.org/netdev/4f173dd2-324a-0240-ff8d-abf5c191be18@candelatech.com/
Fixes: 1aaa736815 ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- kprobe-events: Fix kprobe events to reject if the attached symbol
is not unique name because it may not the function which the user
want to attach to. (User can attach a probe to such symbol using
the nearest unique symbol + offset.)
- selftest: Add a testcase to ensure the kprobe event rejects non
unique symbol correctly.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmUzdQobHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMNAH/inFWv8e+rMm8F5Po6ZI
CmBxuZbxy2l+KfYDjXqSHu7TLKngVd6Bhdb5H2K7fgdwiZxrS0i6qvdppo+Cxgop
Yod06peDTM80IKavioCcOJOwLPGXXpZkMlK5fdC48HN6vrf9km4vws5ZAagfc1ng
YhnYm1HHeXcIYwtLkE2dCr6HkwkaOebWTLdZ8c70d1OPw0L9rzxH+edjhKCq8uIw
6WUg9ERxJYPUuCkQxOxVJrTdzNMRXsgf28FHc0LyYRm8kDpECT2BP6e/Y+TBbsX5
2pN5cUY5qfI6t3Pc1HDs2KX8ui2QCmj0mCvT0VixhdjThdHpRf0VjIFFAANf3LNO
XVA=
=O1Aa
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- kprobe-events: Fix kprobe events to reject if the attached symbol is
not unique name because it may not the function which the user want
to attach to. (User can attach a probe to such symbol using the
nearest unique symbol + offset.)
- selftest: Add a testcase to ensure the kprobe event rejects non
unique symbol correctly.
* tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
selftests/ftrace: Add new test case which checks non unique symbol
tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols
When a kprobe is attached to a function that's name is not unique (is
static and shares the name with other functions in the kernel), the
kprobe is attached to the first function it finds. This is a bug as the
function that it is attaching to is not necessarily the one that the
user wants to attach to.
Instead of blindly picking a function to attach to what is ambiguous,
error with EADDRNOTAVAIL to let the user know that this function is not
unique, and that the user must use another unique function with an
address offset to get to the function they want to attach to.
Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/
Cc: stable@vger.kernel.org
Fixes: 413d37d1eb ("tracing: Add kprobe-based event tracer")
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTD6IQAKCRCRxhvAZXjc
opXLAQC9X+ECnGUAOy/kvOrEBkBb7G4BuZ8XsrnL976riVNp0gEA85LaJV9Ow7Xk
51k/1ujhYkglQbCsa0zo+mI4ueE3wAQ=
=Dqrj
-----END PGP SIGNATURE-----
Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
Because group consistency is non-atomic between parent (filedesc) and children
(inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum
non-matching counter groups -- with non-sensical results.
Add group_generation to distinguish the case where a parent group removes and
adds an event and thus has the same number, but a different configuration of
events as inherited groups.
This became a problem when commit fa8c269353 ("perf/core: Invert
perf_read_group() loops") flipped the order of child_list and sibling_list.
Previously it would iterate the group (sibling_list) first, and for each
sibling traverse the child_list. In this order, only the group composition of
the parent is relevant. By flipping the order the group composition of the
child (inherited) events becomes an issue and the mis-match in group
composition becomes evident.
That said; even prior to this commit, while reading of a group that is not
equally inherited was not broken, it still made no sense.
(Ab)use ECHILD as error return to indicate issues with child process group
composition.
Fixes: fa8c269353 ("perf/core: Invert perf_read_group() loops")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- In cgroup1, the `tasks` file could have duplicate pids which can trigger a
warning in seq_file. Fix it by removing duplicate items after sorting.
- Comment update.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZSh+2A4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfASAP9dgEe1Ay6jkJoCCGROnjPRDj2j7Cm9WWcHV79X
0Pr3zQEA/vFIpUzRZGbisrvnyXwNNLX12Hq/nwRX6DzN4UIkDgE=
=tPdI
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- In cgroup1, the `tasks` file could have duplicate pids which can
trigger a warning in seq_file. Fix it by removing duplicate items
after sorting
- Comment update
* tag 'cgroup-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix incorrect css_set_rwsem reference in comment
cgroup: Remove duplicates in cgroup v1 tasks file
* Fix access-after-free in pwq allocation error path.
* Implicitly ordered unbound workqueues should lose the implicit ordering if
an attribute change which isn't compatible with ordered operation is
requested. However, attribute changes requested through the sysfs
interface weren't doing that leaving no way to override the implicit
ordering through the sysfs interface. Fix it.
* Other doc and misc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZSh9vQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGTG4AQCklH7aGqSbzGBPuV19gN6q+BPjkNNLTkEtOzW7
3t1gewEAuwGiGr5FwuxCuGhzTUm5dkELFFsKdzYk+Pt7B2M5Pg0=
=YQtE
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Fix access-after-free in pwq allocation error path
- Implicitly ordered unbound workqueues should lose the implicit
ordering if an attribute change which isn't compatible with ordered
operation is requested. However, attribute changes requested through
the sysfs interface weren't doing that leaving no way to override the
implicit ordering through the sysfs interface. Fix it.
- Other doc and misc updates
* tag 'wq-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix -Wformat-truncation in create_worker
workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
workqueue: Use the kmem_cache_free() instead of kfree() to release pwq
workqueue: doc: Fix function and sysfs path errors
workqueue: Fix UAF report by KASAN in pwq_release_workfn()
Previous releases - regressions:
- af_packet: fix fortified memcpy() without flex array.
- tcp: fix crashes trying to free half-baked MTU probes
- xdp: fix zero-size allocation warning in xskq_create()
- can: sja1000: always restart the tx queue after an overrun
- eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
- eth: nfp: avoid rmmod nfp crash issues
- eth: octeontx2-pf: fix page pool frag allocation warning
Previous releases - always broken:
- mctp: perform route lookups under a RCU read-side lock
- bpf: s390: fix clobbering the caller's backchain in the trampoline
- phy: lynx-28g: cancel the CDR check work item on the remove path
- dsa: qca8k: fix qca8k driver for Turris 1.x
- eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
- eth: ixgbe: fix crash with empty VF macvlan list
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmUnw0USHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkN0EP/RKl317fLqlm6ZzRUMVP169CNRAgMaBG
7FIwxlCv4hfO2Rx09Mxu2wjDp+tBQKqBKaxfcwh8tEdLMqqCymOW2K5+tWVty8C8
TJJS+zggqLAo7DjXbnT8GBm5owHPLKGNxW6vRmnw9xraCD/nuV1wqolI2+l4IxB+
kqfliltepnJSakg0uXg7/uwAE87slBzX5VgB6K5JKLiiDMD8tYoAUmZzH8bMJd0l
Cl7+L+ucRfQkj0DPfuZM/FncM0el7oFB6imnKd36hD6vfDfCNxpyNBYG1yZ/61/N
7H3E595Hr9PA+YBZjja3UvQGbFXkyMHloQdYxmq4s0T2WHqKwRyjLlwPayMXvavn
OTJh2VAs68ivtti0ry5Nbgz4viiNfr32PLyZr6XySwCZ1/TCLjV4Cq9IYnaP3YeM
KA+CIl3d0asQdZuMXTBivmtF65Buawt9UX/gJzUst2mNdcqhV1RTNWDNWoFLQ0qW
gz8XN68V5LhbaaOq/Lat80krWgNLNZIlTNmSsE/Ie799w7dAHn/xvT6h+h5pF1XX
dhng9NK7RL7KVcI/9walArOnhz9ksGWc2+JPMQohuPM/ITMHW11oOUOX6NwAre5m
hBJKh+Rz7ylLDLn33C4qowUhxnJlqqm+rDCVDTmoYngEFQvhEl19mfndSsC8P/K/
xXQJ+diS/Jug
=orAS
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN and BPF.
We have a regression in TC currently under investigation, otherwise
the things that stand off most are probably the TCP and AF_PACKET
fixes, with both issues coming from 6.5.
Previous releases - regressions:
- af_packet: fix fortified memcpy() without flex array.
- tcp: fix crashes trying to free half-baked MTU probes
- xdp: fix zero-size allocation warning in xskq_create()
- can: sja1000: always restart the tx queue after an overrun
- eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
- eth: nfp: avoid rmmod nfp crash issues
- eth: octeontx2-pf: fix page pool frag allocation warning
Previous releases - always broken:
- mctp: perform route lookups under a RCU read-side lock
- bpf: s390: fix clobbering the caller's backchain in the trampoline
- phy: lynx-28g: cancel the CDR check work item on the remove path
- dsa: qca8k: fix qca8k driver for Turris 1.x
- eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
- eth: ixgbe: fix crash with empty VF macvlan list"
* tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
rswitch: Fix imbalance phy_power_off() calling
rswitch: Fix renesas_eth_sw_remove() implementation
octeontx2-pf: Fix page pool frag allocation warning
nfc: nci: assert requested protocol is valid
af_packet: Fix fortified memcpy() without flex array.
net: tcp: fix crashes trying to free half-baked MTU probes
net/smc: Fix pos miscalculation in statistics
nfp: flower: avoid rmmod nfp crash issues
net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
ethtool: Fix mod state of verbose no_mask bitset
net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
mctp: perform route lookups under a RCU read-side lock
net: skbuff: fix kernel-doc typos
s390/bpf: Fix unwinding past the trampoline
s390/bpf: Fix clobbering the caller's backchain in the trampoline
net/mlx5e: Again mutually exclude RX-FCS and RX-port-timestamp
net/smc: Fix dependency of SMC on ISM
ixgbe: fix crash with empty VF macvlan list
net/mlx5e: macsec: use update_pn flag instead of PN comparation
net: phy: mscc: macsec: reject PN update requests
...
Compiling with W=1 emitted the following warning
(Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig,
"Treat warnings as errors" turned off):
kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be
truncated writing between 1 and 10 bytes into a region of size
between 5 and 14 [-Wformat-truncation=]
kernel/workqueue.c:2188:50: note: directive argument in the range
[0, 2147483647]
kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes
into a destination of size 16
setting "id_buf" to size 23 will silence the warning, since GCC
determines snprintf's output to be max. 23 bytes in line 2188.
Please let me know if there are any mistakes in my patch!
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
to be ordered") enabled implicit ordered attribute to be added to
WQ_UNBOUND workqueues with max_active of 1. This prevented the changing
of attributes to these workqueues leading to fix commit 0a94efb5ac
("workqueue: implicit ordered attribute should be overridable").
However, workqueue_apply_unbound_cpumask() was not updated at that time.
So sysfs changes to wq_unbound_cpumask has no effect on WQ_UNBOUND
workqueues with implicit ordered attribute. Since not all WQ_UNBOUND
workqueues are visible on sysfs, we are not able to make all the
necessary cpumask changes even if we iterates all the workqueue cpumasks
in sysfs and changing them one by one.
Fix this problem by applying the corresponding change made
to apply_workqueue_attrs_locked() in the fix commit to
workqueue_apply_unbound_cpumask().
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, the kfree() be used for pwq objects allocated with
kmem_cache_alloc() in alloc_and_link_pwqs(), this isn't wrong.
but usually, use "trace_kmem_cache_alloc/trace_kmem_cache_free"
to track memory allocation and free. this commit therefore use
kmem_cache_free() instead of kfree() in alloc_and_link_pwqs()
and also consistent with release of the pwq in rcu_free_pwq().
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The verifier, as part of check_return_code(), verifies that async
callbacks such as from e.g. timers, will return 0. It does this by
correctly checking that R0->var_off is in tnum_const(0), which
effectively checks that it's in a range of 0. If this condition fails,
however, it prints an error message which says that the value should
have been in (0x0; 0x1). This results in possibly confusing output such
as the following in which an async callback returns 1:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1)
The fix is easy -- we should just pass the tnum_const(0) as the correct
range to verbose_invalid_scalar(), which will then print the following:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0)
Fixes: bfc6bb74e4 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com
One PID may appear multiple times in a preloaded pidlist.
(Possibly due to PID recycling but we have reports of the same
task_struct appearing with different PIDs, thus possibly involving
transfer of PID via de_thread().)
Because v1 seq_file iterator uses PIDs as position, it leads to
a message:
> seq_file: buggy .next function kernfs_seq_next did not update position index
Conservative and quick fix consists of removing duplicates from `tasks`
file (as opposed to removing pidlists altogether). It doesn't affect
correctness (it's sufficient to show a PID once), performance impact
would be hidden by unconditional sorting of the pidlist already in place
(asymptotically).
Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/
Suggested-by: Firo Yang <firo.yang@suse.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Commit 9e70a5e109 ("printk: Add per-console suspended state")
removed console lock usage during resume and replaced it with
the clearly defined console_list_lock and srcu mechanisms.
However, the console lock usage had an important side-effect
of flushing the consoles. After its removal, consoles were no
longer flushed before checking their progress.
Add the console_lock/console_unlock dance to the beginning
of __pr_flush() to actually flush the consoles before checking
their progress. Also add comments to clarify this additional
usage of the console lock.
Note that console_unlock() does not guarantee flushing all messages
since the commit dbdda842fe ("printk: Add console owner and waiter
logic to load balance console writes").
Reported-by: Todd Brandt <todd.e.brandt@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217955
Fixes: 9e70a5e109 ("printk: Add per-console suspended state")
Co-developed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20231006082151.6969-2-pmladek@suse.com
The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.
This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.
I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
Marek and Biju reported instances of:
"EEVDF scheduling fail, picking leftmost"
which Mike correlated with cgroup scheduling and the min_deadline heap
getting corrupted; some trace output confirms:
> And yeah, min_deadline is hosed somehow:
>
> validate_cfs_rq: --- /
> __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
> __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
> __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
> validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256
Turns out that reweight_entity(), which tries really hard to be fast,
does not do the normal dequeue+update+enqueue pattern but *does* scale
the deadline.
However, it then fails to propagate the updated deadline value up the
heap.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
- Two EEVDF fixes: one to fix sysctl_sched_base_slice propagation,
and to fix an avg_vruntime() corner-case.
- A cpufreq frequency scaling fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUidV8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1itsA//Yl5PmadMcHxX2UYzMho9bCMNdVZyb8im
C0xbvs1o7LO1RgzkqV4kpGjRy1BqHzrYeE5xCiI9K/HoqdmChtB7+oQBJ1Y7mjfy
miePmjGXRol+0H5eR94QqgL5M/SspmwEmsFm9QwfMmcnUOZOMkiWiElnCKMJbsbr
XE2Bj0sj+BuFu6PK6f0R+aoy/H6Za0g5DpujxGfRJHlep8vuJA4afIO9rL18EXa3
AI2YnvZh5IH9EMJXQ8c+dtqi0xPTWhSpQ28EDAMV89TnAJAv+uo/cMPoZj+ewjb3
PFN5ASI2f7IdCuK4vdixZM1E9vgM9UTI4Ju9IanUcXkUs8YNUXJVZXezaGZG/wZN
QXD827AjScTZJWIJGGMfaB8ubYVRqg6wG4NRcToFHxp5G5o7iZ0joTenSA6/nGy9
o5RpA8KbB1oyuSlWvHqNCYmc8QavujoiaDbyqlsY2E5mIqNHkbegK7kiAgeVNnL1
oqKgnzjAAER0gujqP/4jTHIlF23sh17/oIRgHb+y2wWMxwnZR/TKIxuMaYrmoq0I
FIPx7l20USl9n2VmSl29vzzUZaM07AeKl3HGtYGMdgAUG+meEH0dATn690WPnF12
MNQolPMMpp051LV4EUJVKySIEb/KCknTPtRHpbHNfBSnrptF6X9GsPl+FmeA+aBh
/XxuZEAKDWs=
=2T70
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc scheduler fixes from Ingo Molnar:
- Two EEVDF fixes: one to fix sysctl_sched_base_slice propagation, and
to fix an avg_vruntime() corner-case.
- A cpufreq frequency scaling fix
* tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpufreq: schedutil: Update next_freq when cpufreq_limits change
sched/eevdf: Fix avg_vruntime()
sched/eevdf: Also update slice on placement
The recently added tcx attachment extended the BPF UAPI for attaching and
detaching by a couple of fields. Those fields are currently only supported
for tcx, other types like cgroups and flow dissector silently ignore the
new fields except for the new flags.
This is problematic once we extend bpf_mprog to older attachment types, since
it's hard to figure out whether the syscall really was successful if the
kernel silently ignores non-zero values.
Explicitly reject non-zero fields relevant to bpf_mprog for attachment types
which don't use the latter yet.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Improve consistency for bpf_mprog_query() API and let the latter also handle
a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we
copy a count of 0 and revision of 1 to user space, so that this can be fed
into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self-
test as part of this series has been added to assert this case.
Suggested-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
While working on the ebpf-go [0] library integration for bpf_mprog and tcx,
Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A
typical workflow is to first gather the bpf_mprog count without passing program/
link arrays, followed by the second request which contains the actual array
pointers.
The initial call populates count and revision fields. The second call gets
rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to
query.revision instead of query.link_attach_flags since the former is really
the last member.
It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with
an on-stack bpf_attr that is memset() each time (and therefore query.revision
was reset to zero).
[0] https://ebpf-go.dev
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
When cpufreq's policy is 'single', there is a scenario that will
cause sg_policy's next_freq to be unable to update.
When the CPU's util is always max, the cpufreq will be max,
and then if we change the policy's scaling_max_freq to be a
lower freq, indeed, the sg_policy's next_freq need change to
be the lower freq, however, because the cpu_is_busy, the next_freq
would keep the max_freq.
For example:
The cpu7 is a single CPU:
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
pid 4737's current affinity mask: ff
pid 4737's new affinity mask: 80
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2171000
At this time, the sg_policy's next_freq would stay at 2301000, which
is wrong.
To fix this, add a check for the ->need_freq_update flag.
[ mingo: Clarified the changelog. ]
Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
I didn't collect precise data but feels like we've got a lot
of 6.5 fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(),
it doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size,
fix struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use
of the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmUe/hkACgkQMUZtbf5S
Iru94hAAlkVsHgGy7T8Te23m5Q5v33ZRj+7hbjFBN+be2TJu8cWo9qL6jGvrxBP6
D0X32ZvoWtX95Ua023Ibs1WtEc6ebROctTuDZpoIW35MYamFz9fYecoY4i+t+m1+
6Y/UVRu4eHWXUtZclRQUEUdMe5HAJms9uNlIkTxLivvFhERgmGKtAsca8nuU9wMo
lJJUAFxf4wJKx/338AUaa0yfsg6cQcgRpbm1csAR9VSa3mU1PbrIYzf7fMbWjRET
6DiijheeClb7biF4ZKqqHgZcProwVFxoFCq1GH+PCzaw8K1eIkGwXlpVEW89lgsC
Ukc1L9JgLJIBSObvNWiO2mu5w+V89+XR2rL9KM3tW6x+k5tEncNL3WOphqH69NPS
gfizGlvXjP+2aCJgHovvlQEaBn/7xNYWTAbqh1jVDleUC95Ur8ap4X42YB/3QvPN
X9l8hp4Htu8SclqjXKtMz9qt6Ug5Uyi88o+1U53BNE6C6ICKW9i4uApT1DsLBAK8
QM5WPTj/ChIBbQu7HWNW+Ux3NX5R6fFzZ5BfKrjbuNEHQKRauj2300gVtU6xGS7T
IFWiu8i00T34aXF2Vnfykc0zNRylhw/DHqtJFUxmJQOBQgyKlkjYacf2cYru5lnR
BWA8Zsg5wpapT5CWSGlSRid4sRMtcDiMsI7fnIqB5CPhJGnR6eI=
=JAtc
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, netfilter, BPF and WiFi.
I didn't collect precise data but feels like we've got a lot of 6.5
fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size, fix
struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use of
the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"
* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
MAINTAINERS: update Matthieu's email address
mptcp: userspace pm allow creating id 0 subflow
mptcp: fix delegated action races
net: stmmac: remove unneeded stmmac_poll_controller
net: lan743x: also select PHYLIB
net: ethernet: mediatek: disable irq before schedule napi
net: mana: Fix oversized sge0 for GSO packets
net: mana: Fix the tso_bytes calculation
net: mana: Fix TX CQE error handling
netlink: annotate data-races around sk->sk_err
sctp: update hb timer immediately after users change hb_interval
sctp: update transport state when processing a dupcook packet
tcp: fix delayed ACKs for MSS boundary condition
tcp: fix quick-ack counting to count actual ACKs of new data
page_pool: fix documentation typos
tipc: fix a potential deadlock on &tx->lock
net: stmmac: dwmac-stm32: fix resume on STM32 MCU
ipv4: Set offload_failed flag in fibmatch results
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
...
The following crash is observed 100% of the time during resume from
the hibernation on a x86 QEMU system.
[ 12.931887] ? __die_body+0x1a/0x60
[ 12.932324] ? page_fault_oops+0x156/0x420
[ 12.932824] ? search_exception_tables+0x37/0x50
[ 12.933389] ? fixup_exception+0x21/0x300
[ 12.933889] ? exc_page_fault+0x69/0x150
[ 12.934371] ? asm_exc_page_fault+0x26/0x30
[ 12.934869] ? get_buffer.constprop.0+0xac/0x100
[ 12.935428] snapshot_write_next+0x7c/0x9f0
[ 12.935929] ? submit_bio_noacct_nocheck+0x2c2/0x370
[ 12.936530] ? submit_bio_noacct+0x44/0x2c0
[ 12.937035] ? hib_submit_io+0xa5/0x110
[ 12.937501] load_image+0x83/0x1a0
[ 12.937919] swsusp_read+0x17f/0x1d0
[ 12.938355] ? create_basic_memory_bitmaps+0x1b7/0x240
[ 12.938967] load_image_and_restore+0x45/0xc0
[ 12.939494] software_resume+0x13c/0x180
[ 12.939994] resume_store+0xa3/0x1d0
The commit being fixed introduced a bug in copying the zero bitmap
to safe pages. A temporary bitmap is allocated with PG_ANY flag in
prepare_image() to make a copy of zero bitmap after the unsafe pages
are marked. Freeing this temporary bitmap with PG_UNSAFE_KEEP later
results in an inconsistent state of unsafe pages. Since free bit is
left as is for this temporary bitmap after free, these pages are
treated as unsafe pages when they are allocated again. This results
in incorrect calculation of the number of pages pre-allocated for the
image.
nr_pages = (nr_zero_pages + nr_copy_pages) - nr_highmem - allocated_unsafe_pages;
The allocate_unsafe_pages is estimated to be higher than the actual
which results in running short of pages in safe_pages_list. Hence the
crash is observed in get_buffer() due to NULL pointer access of
safe_pages_list.
Fix this issue by creating the temporary zero bitmap from safe pages
(free bit not set) so that the corresponding free bits can be cleared
while freeing this bitmap.
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Suggested-by:: Brian Geffon <bgeffon@google.com>
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZRqk1wAKCRDbK58LschI
g8GRAQC4E0bw6BTFRl0b3MxvpZES6lU0BUtX2gKVK4tLZdXw/wEAmTlBXQqNzF3b
BkCQknVbFTSw/8l8pzUW123Fb46wUAQ=
=E3hd
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-10-02
We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).
The main changes are:
1) Fix BPF verifier to reset backtrack_state masks on global function
exit as otherwise subsequent precision tracking would reuse them,
from Andrii Nakryiko.
2) Several sockmap fixes for available bytes accounting,
from John Fastabend.
3) Reject sk_msg egress redirects to non-TCP sockets given this
is only supported for TCP sockets today, from Jakub Sitnicki.
4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
limits with BPF_F_BEFORE directive, from Daniel Borkmann
and Nikolay Aleksandrov.
5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
size_index for selecting a bpf_mem_cache, from Hou Tao.
6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
from Song Liu.
7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
from Leon Hwang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use kmalloc_size_roundup() to adjust size_index
selftest/bpf: Add various selftests for program limits
bpf, mprog: Fix maximum program check on mprog attachment
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
bpf, sockmap: Add tests for MSG_F_PEEK
bpf, sockmap: Do not inc copied_seq when PEEK flag set
bpf: tcp_read_skb needs to pop skb regardless of seq
bpf: unconditionally reset backtrack_state masks on global func exit
bpf: Fix tr dereferencing
selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================
Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The expectation is that placing a task at avg_vruntime() makes it
eligible. Turns out there is a corner case where this is not the case.
Specifically, avg_vruntime() relies on the fact that integer division
is a flooring function (eg. it discards the remainder). By this
property the value returned is slightly left of the true average.
However! when the average is a negative (relative to min_vruntime) the
effect is flipped and it becomes a ceil, with the result that the
returned value is just right of the average and thus not eligible.
Fixes: af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tasks that never consume their full slice would not update their slice value.
This means that tasks that are spawned before the sysctl scaling keep their
original (UP) slice length.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
to issues which were introduced after 6.5.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZRmSDAAKCRDdBJ7gKXxA
jlSaAQCe3SnBdjRmuzbp5iIfNJOY7GXLN4NwMsArRUxRGY27IwD+KWhXZP/ydVnt
ZgS4x9rmarHuh5Pxds+6SRGhihRz/Ak=
=sf/5
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Fourteen hotfixes, eleven of which are cc:stable. The remainder
pertain to issues which were introduced after 6.5"
* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Crash: add lock to serialize crash hotplug handling
selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
mm, memcg: reconsider kmem.limit_in_bytes deprecation
mm: zswap: fix potential memory corruption on duplicate store
arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
mm: hugetlb: add huge page size param to set_huge_pte_at()
maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
maple_tree: add mas_is_active() to detect in-tree walks
nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
mm: abstract moving to the next PFN
mm: report success more often from filemap_map_folio_range()
fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUZMIoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iCuw/+Mc2ScQK+Y2gQWzOsACMIm863CqnwYxAK
rzvny0wEiESHDcRGFC46Bv6Ru6BZr8tPrBbsWHWUJTx9dO4RVSUlT/DwoaehQEXb
hqFTmio3YA+yUVbz3oh3BiELkUBQ/Q3M33Z5DiMrB7fH9/e4Disuw3aZu1zZ/CqX
AGHvoL4TILNOQhDMAmHjDXDgp3HZUqCZIhteNbHVa5HJ7Bpal8xh3j73EZ8cYEoj
GWKlFvwSxQQmmCexBTmLxZN7O0guLv27qL3LtlhfCtCJH8Hb6yC14QsPs8zJShqN
TZ5su+meeJKkyE4y5fyhNxCKSmB/8x0fFf8+juQFNo+V73XcfMg/Ymz4mVNuhlxj
bGKTOqkZEGLwAWpopJFvJK+hnLU8PvqOrTkJJ14JisiXBEV5YRccUARVz9r+jv8V
iGu4xI1SQjuK+Jq7/yQscii4VpQdZ6cYrCBQJY84cYFc+jSevqu0QKs5fl4LV8Q5
s9TRnlMk+Eo/oIbRTNJCc2a4IVofRnIiAy5AbKy8UvSlul7UXkMKnXZCJKo5Pl16
jyQ59R4wXnGCi2SDwLUdJnghOgsL29MFSdMqcvNS1VI+bIR9l9mWVStjtyIY/dKE
JWNGNQPvDCxevjp28rEtIzPx1AKQszsAVUKzAANZb+kHtcC+YP1zQHVWJ0ZknG7N
B41vXGbRlmM=
=ByhJ
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a RT tasks related lockup/live-lock during CPU offlining"
* tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix live lock between select_fallback_rq() and RT push
- Make sure 32 bit applications using user events have aligned access when
running on a 64 bit kernel.
- Add cond_resched in the loop that handles converting enums in print_fmt
string is trace events.
- Fix premature wake ups of polling processes in the tracing ring buffer. When
a task polls waiting for a percentage of the ring buffer to be filled, the
writer still will wake it up at every event. Add the polling's percentage to
the "shortest_full" list to tell the writer when to wake it up.
- For eventfs dir lookups on dynamic events, an event system's only event could
be removed, leaving its dentry with no children. This is totally legitimate.
But in eventfs_release() it must not access the children array, as it is only
allocated when the dentry has children.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZRiI2xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlvoAQDKbevbqA0C8lEV1rbVh4Q9Rnq580rz
EAyEO/RrSOwE9AEA2z+Q597mDjEiqQBvqTjBkS+0xZ7AUQYZRWgTHRIbegg=
=tqOM
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Make sure 32-bit applications using user events have aligned access
when running on a 64-bit kernel.
- Add cond_resched in the loop that handles converting enums in
print_fmt string is trace events.
- Fix premature wake ups of polling processes in the tracing ring
buffer. When a task polls waiting for a percentage of the ring buffer
to be filled, the writer still will wake it up at every event. Add
the polling's percentage to the "shortest_full" list to tell the
writer when to wake it up.
- For eventfs dir lookups on dynamic events, an event system's only
event could be removed, leaving its dentry with no children. This is
totally legitimate. But in eventfs_release() it must not access the
children array, as it is only allocated when the dentry has children.
* tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Test for dentries array allocated in eventfs_release()
tracing/user_events: Align set_bit() address for all archs
tracing: relax trace_event_eval_update() execution with cond_resched()
ring-buffer: Update "shortest_full" in polling
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.
Add a compat flag to user_event_enabler that indicates when a 32-bit
value is being used on a 64-bit kernel. Long align addresses and correct
the bit to be used by set_bit() to account for this alignment. Ensure
compat flags are copied during forks and used during deletion clears.
Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
Cc: stable@vger.kernel.org
Fixes: 7235759084 ("tracing/user_events: Use remote writes for event enablement")
Reported-by: Clément Léger <cleger@rivosinc.com>
Suggested-by: Clément Léger <cleger@rivosinc.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When kernel is compiled without preemption, the eval_map_work_func()
(which calls trace_event_eval_update()) will not be preempted up to its
complete execution. This can actually cause a problem since if another
CPU call stop_machine(), the call will have to wait for the
eval_map_work_func() function to finish executing in the workqueue
before being able to be scheduled. This problem was observe on a SMP
system at boot time, when the CPU calling the initcalls executed
clocksource_done_booting() which in the end calls stop_machine(). We
observed a 1 second delay because one CPU was executing
eval_map_work_func() and was not preempted by the stop_machine() task.
Adding a call to cond_resched() in trace_event_eval_update() allows
other tasks to be executed and thus continue working asynchronously
like before without blocking any pending task at boot time.
Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84 by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.
The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.
Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.
To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.
Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.
Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 42fb0a1e84 ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- fix the narea calculation in swiotlb initialization (Ross Lagerwall)
- fix the check whether a device has used swiotlb (Petr Tesarik)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmUYWTULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMilA/8DomLRCrDy792MoRvBCThWaY6auW4bCUoY7S6VZe7
LVECKAJRwFH7b+dk1mNjPhWTyq8/wB7kr/OLuU9HDcVIeiP9zxks4BMQ4RGc/ZuK
rSZ5ZHPVVCC4EOI3ncjQXrODwgGkGUvtdriByCtX2r4NLBjO1T0vUQB4bLyBTZf+
GnTCLxCkSrIxaqniRvM0K34yO/0rq0ci5840MNneR7MKQkVqPUDY83sHwL1KcQPf
s16lwclQdjZdOVpFMPxFin5NpvPIrjdrvhoaxdnz+8ZuwSACqRUZDQuNlZ3+Zep6
iaynNR04o0c2p0PTT5l3ZRD5vsyCjvc+/3kB3KlM33XbBArWi6XV+694QQn59JnZ
5MmHoIulwZGLsIlTG188QreZBlLrmxylUX311Kot5ood/HW8DsYbTo/krbiiUgEk
MXKWq9k6cQOdhgriS4zxvUl+xkjby12jvSFxv9tN3HHvFsFB8+veVrTuLZzEDXpX
a5PrmI/dcQmlVpCZllzVzeTgL2KeE1Jo0uRZ1vXhuoX8IBys4/TstIXOB4jnyVb4
kzrHbLIoVqLSN42eVMRKBrqGXGlZSWETBpkdSQ41St6t/3MurhKAWZ/1SFPXlI06
SnatIdOU7nSZRofK8/Xe1CnWia5NUpyUQpb+tLUHTgo4kZzGV330bf34iJ+BvC1h
aks=
=bhgc
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix the narea calculation in swiotlb initialization (Ross Lagerwall)
- fix the check whether a device has used swiotlb (Petr Tesarik)
* tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix the check whether a device has used software IO TLB
swiotlb: use the calculated number of areas
Commit d52b59315b ("bpf: Adjust size_index according to the value of
KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as
reported by Nathan, the adjustment is not enough, because
__kmalloc_minalign() also decides the minimal alignment of slab object
as shown in new_kmalloc_cache() and its value may be greater than
KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM).
Instead of invoking __kmalloc_minalign() in bpf subsystem to find the
maximal alignment, just using kmalloc_size_roundup() directly to get the
corresponding slab object size for each allocation size. If these two
sizes are unmatched, adjust size_index to select a bpf_mem_cache with
unit_size equal to the object_size of the underlying slab cache for the
allocation size.
Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eric reported that handling corresponding crash hotplug event can be
failed easily when many memory hotplug event are notified in a short
period. They failed because failing to take __kexec_lock.
=======
[ 78.714569] Fallback order for Node 0: 0
[ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886
[ 78.717133] Policy zone: Normal
[ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 80.056643] PEFILE: Unsigned PE binary
=======
The memory hotplug events are notified very quickly and very many, while
the handling of crash hotplug is much slower relatively. So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.
Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically. This doesn't impact the usage of __kexec_lock.
Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 2472627561 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During RCU-boost testing with the TREE03 rcutorture config, I found that
after a few hours, the machine locks up.
On tracing, I found that there is a live lock happening between 2 CPUs.
One CPU has an RT task running, while another CPU is being offlined
which also has an RT task running. During this offlining, all threads
are migrated. The migration thread is repeatedly scheduled to migrate
actively running tasks on the CPU being offlined. This results in a live
lock because select_fallback_rq() keeps picking the CPU that an RT task
is already running on only to get pushed back to the CPU being offlined.
It is anyway pointless to pick CPUs for pushing tasks to if they are
being offlined only to get migrated away to somewhere else. This could
also add unwanted latency to this task.
Fix these issues by not selecting CPUs in RT if they are not 'active'
for scheduling, using the cpu_active_mask. Other parts in core.c already
use cpu_active_mask to prevent tasks from being put on CPUs going
offline.
With this fix I ran the tests for days and could not reproduce the
hang. Without the patch, I hit it in a few hours.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
When CONFIG_SWIOTLB_DYNAMIC=y, devices which do not use the software IO TLB
can avoid swiotlb lookup. A flag is added by commit 1395706a14 ("swiotlb:
search the software IO TLB only if the device makes use of it"), the flag
is correctly set, but it is then never checked. Add the actual check here.
Note that this code is an alternative to the default pool check, not an
additional check, because:
1. swiotlb_find_pool() also searches the default pool;
2. if dma_uses_io_tlb is false, the default swiotlb pool is not used.
Tested in a KVM guest against a QEMU RAM-backed SATA disk over virtio and
*not* using software IO TLB, this patch increases IOPS by approx 2% for
4-way parallel I/O.
The write memory barrier in swiotlb_dyn_alloc() is not needed, because a
newly allocated pool must always be observed by swiotlb_find_slots() before
an address from that pool is passed to is_swiotlb_buffer().
Correctness was verified using the following litmus test:
C swiotlb-new-pool
(*
* Result: Never
*
* Check that a newly allocated pool is always visible when the
* corresponding swiotlb buffer is visible.
*)
{
mem_pools = default;
}
P0(int **mem_pools, int *pool)
{
/* add_mem_pool() */
WRITE_ONCE(*pool, 999);
rcu_assign_pointer(*mem_pools, pool);
}
P1(int **mem_pools, int *flag, int *buf)
{
/* swiotlb_find_slots() */
int *r0;
int r1;
rcu_read_lock();
r0 = READ_ONCE(*mem_pools);
r1 = READ_ONCE(*r0);
rcu_read_unlock();
if (r1) {
WRITE_ONCE(*flag, 1);
smp_mb();
}
/* device driver (presumed) */
WRITE_ONCE(*buf, r1);
}
P2(int **mem_pools, int *flag, int *buf)
{
/* device driver (presumed) */
int r0 = READ_ONCE(*buf);
/* is_swiotlb_buffer() */
int r1;
int *r2;
int r3;
smp_rmb();
r1 = READ_ONCE(*flag);
if (r1) {
/* swiotlb_find_pool() */
rcu_read_lock();
r2 = READ_ONCE(*mem_pools);
r3 = READ_ONCE(*r2);
rcu_read_unlock();
}
}
exists (2:r0<>0 /\ 2:r3=0) (* Not found. *)
Fixes: 1395706a14 ("swiotlb: search the software IO TLB only if the device makes use of it")
Reported-by: Jonathan Corbet <corbet@lwn.net>
Closes: https://lore.kernel.org/linux-iommu/87a5uz3ob8.fsf@meer.lwn.net/
Signed-off-by: Petr Tesarik <petr@tesarici.cz>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* Remove double allocation of wq_update_pod_attrs_buf.
* Fix missing allocation of pwq_release_worker when
wq_cpu_intensive_thresh_us is set to a custom value.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZRHSyg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGbNYAP93prDoDUYHLha4NAXyZJ441+bBA5jnOOdRYLiw
cd0yugEAgFzQQ/4Z6wKosdwiGdrSn33IAgnDCGdAXVWzbyM+wQU=
=968G
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Remove double allocation of wq_update_pod_attrs_buf
- Fix missing allocation of pwq_release_worker when
wq_cpu_intensive_thresh_us is set to a custom value
* tag 'wq-for-6.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init()
workqueue: Removed double allocation of wq_update_pod_attrs_buf
- Fix the "bytes" output of the per_cpu stat file
The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
accounting was not accurate. It is suppose to show how many used bytes are
still in the ring buffer, but even when the ring buffer was empty it would
still show there were bytes used.
- Fix a bug in eventfs where reading a dynamic event directory (open) and then
creating a dynamic event that goes into that diretory screws up the accounting.
On close, the newly created event dentry will get a "dput" without ever having
a "dget" done for it. The fix is to allocate an array on dir open to save what
dentries were actually "dget" on, and what ones to "dput" on close.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQ9wihQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quz4AP4vSFohvmAcTzC+sKP7gMLUvEmqL76+
1pixXrQOIP5BrQEApUW3VnjqYgjZJR2ne0N4MvvmYElm/ylBhDd4JRrD3g8=
=X9wd
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the "bytes" output of the per_cpu stat file
The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
accounting was not accurate. It is suppose to show how many used
bytes are still in the ring buffer, but even when the ring buffer was
empty it would still show there were bytes used.
- Fix a bug in eventfs where reading a dynamic event directory (open)
and then creating a dynamic event that goes into that diretory screws
up the accounting.
On close, the newly created event dentry will get a "dput" without
ever having a "dget" done for it. The fix is to allocate an array on
dir open to save what dentries were actually "dget" on, and what ones
to "dput" on close.
* tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Remember what dentries were created on dir open
ring-buffer: Fix bytes info in per_cpu buffer stats
cc:stable.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZQ8hRwAKCRDdBJ7gKXxA
jlK9AQDzT/FUQV3kIshsV1IwAKFcg7gtcFSN0vs+pV+e1+4tbQD/Z2OgfGFFsCSP
X6uc2cYHc9DG5/o44iFgadW8byMssQs=
=w+St
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes, 10 of which pertain to post-6.5 issues. The other three
are cc:stable"
* tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
proc: nommu: fix empty /proc/<pid>/maps
filemap: add filemap_map_order0_folio() to handle order0 folio
proc: nommu: /proc/<pid>/maps: release mmap read lock
mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
pidfd: prevent a kernel-doc warning
argv_split: fix kernel-doc warnings
scatterlist: add missing function params to kernel-doc
selftests/proc: fixup proc-empty-vm test after KSM changes
revert "scripts/gdb/symbols: add specific ko module load command"
selftests: link libasan statically for tests with -fsanitize=address
task_work: add kerneldoc annotation for 'data' argument
mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: c64e148a3b ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Current release - regressions:
- bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
- netfilter: fix entries val in rule reset audit log
- eth: stmmac: fix incorrect rxq|txq_stats reference
Previous releases - regressions:
- ipv4: fix null-deref in ipv4_link_failure
- netfilter:
- fix several GC related issues
- fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
- eth: team: fix null-ptr-deref when team device type is changed
- eth: i40e: fix VF VLAN offloading when port VLAN is configured
- eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
Previous releases - always broken:
- core: fix ETH_P_1588 flow dissector
- mptcp: fix several connection hang-up conditions
- bpf:
- avoid deadlock when using queue and stack maps from NMI
- add override check to kprobe multi link attach
- hsr: properly parse HSRv1 supervisor frames.
- eth: igc: fix infinite initialization loop with early XDP redirect
- eth: octeon_ep: fix tx dma unmap len values in SG
- eth: hns3: fix GRE checksum offload issue
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmUMFG8SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOksHAP+QE2eNf5yxo86dIS+3RQnOQ8kFBnNbEn
04lrheGnzG7PpNnGoCoTZna+xYQPYVLgbmmip2/CFnQvnQIsKyLQfCui85sfV2V9
KjUeE/kTgeC+jUQOWNDyz3zDP/MPC2LmiK8Gwyggvm9vFYn5tVZXC36aPZBZ7Vok
/DUW6iXyl31SeVGOOEKakcwn0GIYJSABhVFNsjrDe4tV+leUwvf8obAq3ZWxOGaU
D94ez28lSXgfOSWfQQ/l1rHI/yC0fr8HYyWJ60dNG2uS3fNEqT8LyqZfAUK24kVz
XbAGZa+GA7CDq3cVsU7vCWNWbB5fO+kXtmGOwPtuKtJQM5LPo4X77CuSHlpzdyvq
TuW0vxeVfdzAYVb3Zg+2QgWxDJjY0B8ujwdDWrnnKTPu4Ylhn6HLISXIlkMBoGwT
1/47TCnmn9t+lGagkMADppRRnJotHWObQG5wkzksqVa2CUB0HTESgbrm4rsxe6Ku
JiZhHbTiiPWy7LgY6EFtj/YGPvLs0CSltvh4QUsd+QtDTM/EN7y3HcHqkv88ropG
bSvJIh6WXdEJkwfSUdA0LECXSC6dizzZW2Y1glnT+7FMlhE1jVY4gruNJ37mCYMb
0gh9Zr76c2KYLA5vljGp6uo3j3A7wARJTdLfRFVcaFoz6NQmuFf9ZdBfDNDcymxs
AGvO3j55JAZf
=AoVg
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
Current release - regressions:
- bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
- netfilter: fix entries val in rule reset audit log
- eth: stmmac: fix incorrect rxq|txq_stats reference
Previous releases - regressions:
- ipv4: fix null-deref in ipv4_link_failure
- netfilter:
- fix several GC related issues
- fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
- eth: team: fix null-ptr-deref when team device type is changed
- eth: i40e: fix VF VLAN offloading when port VLAN is configured
- eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
Previous releases - always broken:
- core: fix ETH_P_1588 flow dissector
- mptcp: fix several connection hang-up conditions
- bpf:
- avoid deadlock when using queue and stack maps from NMI
- add override check to kprobe multi link attach
- hsr: properly parse HSRv1 supervisor frames.
- eth: igc: fix infinite initialization loop with early XDP redirect
- eth: octeon_ep: fix tx dma unmap len values in SG
- eth: hns3: fix GRE checksum offload issue"
* tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
sfc: handle error pointers returned by rhashtable_lookup_get_insert_fast()
igc: Expose tx-usecs coalesce setting to user
octeontx2-pf: Do xdp_do_flush() after redirects.
bnxt_en: Flush XDP for bnxt_poll_nitroa0()'s NAPI
net: ena: Flush XDP packets on error.
net/handshake: Fix memory leak in __sock_create() and sock_alloc_file()
net: hinic: Fix warning-hinic_set_vlan_fliter() warn: variable dereferenced before check 'hwdev'
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
netfilter: nf_tables: fix memleak when more than 255 elements expired
netfilter: nf_tables: disable toggling dormant table state more than once
vxlan: Add missing entries to vxlan_get_size()
net: rds: Fix possible NULL-pointer dereference
team: fix null-ptr-deref when team device type is changed
net: bridge: use DEV_STATS_INC()
net: hns3: add 5ms delay before clear firmware reset irq source
net: hns3: fix fail to delete tc flower rules during reset issue
net: hns3: only enable unicast promisc when mac table full
net: hns3: fix GRE checksum offload issue
net: hns3: add cmdq check for vf periodic service task
net: stmmac: fix incorrect rxq|txq_stats reference
...
In mark_chain_precision() logic, when we reach the entry to a global
func, it is expected that R1-R5 might be still requested to be marked
precise. This would correspond to some integer input arguments being
tracked as precise. This is all expected and handled as a special case.
What's not expected is that we'll leave backtrack_state structure with
some register bits set. This is because for subsequent precision
propagations backtrack_state is reused without clearing masks, as all
code paths are carefully written in a way to leave empty backtrack_state
with zeroed out masks, for speed.
The fix is trivial, we always clear register bit in the register mask, and
then, optionally, set reg->precise if register is SCALAR_VALUE type.
Reported-by: Chris Mason <clm@meta.com>
Fixes: be2ef81615 ("bpf: allow precision tracking for programs with subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Change the comment to match the function name that the SYSCALL_DEFINE()
macros generate to prevent a kernel-doc warning.
kernel/pid.c:628: warning: expecting prototype for pidfd_open(). Prototype was for sys_pidfd_open() instead
Link: https://lkml.kernel.org/r/20230912060822.2500-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52.
Fixes: 5f6130fa52 ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
Currently, if the wq_cpu_intensive_thresh_us is set to specific
value, will cause the wq_cpu_intensive_thresh_init() early exit
and missed creation of pwq_release_worker. this commit therefore
create the pwq_release_worker in advance before checking the
wq_cpu_intensive_thresh_us.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 967b494e2f ("workqueue: Use a kthread_worker to release pool_workqueues")
First commit 2930155b2e ("workqueue: Initialize unbound CPU pods later in
the boot") added the initialization of wq_update_pod_attrs_buf to
workqueue_init_early(), and then latter on, commit 84193c0710
("workqueue: Generalize unbound CPU pods") added it as well. This appeared
in a kmemleak run where the second allocation made the first allocation
leak.
Fixes: 84193c0710 ("workqueue: Generalize unbound CPU pods")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
balancing bug, and a topology setup bug on (Intel) hybrid processors.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOVQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iOahAAj3YsoNbT/k6m9yp622n1OopaNEQvsK+/
F2Q5g/hJrm3+W5764rF8CvjhDbmrv6owjp3yUyZLDIfSAFZYMvwoNody3a373Yr3
VFBMJ00jNIv/TAFCJZYeybg3yViwObKKfpu4JBj//QU+4uGWCoBMolkVekU2bBti
r50fMxBPgg2Yic57DCC8Y+JZzHI/ydQ3rvVXMzkrTZCO/zY4/YmERM9d+vp4wl4B
uG9cfXQ4Yf/1gZo0XDlTUkOJUXPnkMgi+N4eHYlGuyOCoIZOfATI24hRaPBoQcdx
PDwHcKmyNxH9iaRppNQMvi797g3KrKVEmZwlZg1IfsILhKC0F4GsQ85tw8qQWE8j
brFPkWVUxAUSl4LXoqVInaxDHmJWR2UC3RA7b+BxFF/GMLTow0z4a+JMC6eKlNyR
uBisZnuEuecqwF9TLhyD3KBHh7PihUPz8PuFHk+Um5sggaUE82I+VwX6uxEi5y8r
ke2kAkpuMxPWT5lwDmFPAXWfvpZz5vvTIRUxGGj2+s4d8b0YfLtZsx5+uOIacaub
Gw+wYFfSowph72tR/SUVq0An/UTSPPBxty8eFIVeE6sW9bw3ghTtkf8300xjV7Rj
sKVxXy/podAo8wG7R6aZfTfsCpohmeEjskiatYdThYamPPx7V0R5pq4twmTXTHLJ
bFvQ1GFCOu0=
=jIeN
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a performance regression on large SMT systems, an Intel SMT4
balancing bug, and a topology setup bug on (Intel) hybrid processors"
* tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain
sched/fair: Fix SMT4 group_smt_balance handling
sched/fair: Optimize should_we_balance() for large SMT systems
Alexei Starovoitov says:
====================
The following pull-request contains BPF updates for your *net* tree.
We've added 21 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 450 insertions(+), 36 deletions(-).
The main changes are:
1) Adjust bpf_mem_alloc buckets to match ksize(), from Hou Tao.
2) Check whether override is allowed in kprobe mult, from Jiri Olsa.
3) Fix btf_id symbol generation with ld.lld, from Jiri and Nick.
4) Fix potential deadlock when using queue and stack maps from NMI, from Toke Høiland-Jørgensen.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Alan Maguire, Biju Das, Björn Töpel, Dan Carpenter, Daniel Borkmann,
Eduard Zingerman, Hsin-Wei Hung, Marcus Seyfarth, Nathan Chancellor,
Satya Durga Srinivasu Prabhala, Song Liu, Stephen Rothwell
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the handling of block devices in the test_resume mode of
hibernation (Chen Yu).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmUEoZMSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxwqwP+wUj5ap2m6uBYXodFjCA7TbQIM+g8OIM
4rwLZUYnMQP/EJ+oGHONW06slDE30x7klJN7LDDoNNZaeqD8yBYiJI1+EXOsxTk7
dgEhOrIcHU+jCiUAo4WsCF403XuQ35OtsnRcGbo232m+P6RLGyR3UD5470dE8/It
an/ZR95RPnv9pE6JMw2g/e6oU42U082Y3qw3fHXCghj47D+QiJdKPVgliF2lRcLl
PCfdJ2WRoCcpNZdodPnOLuU9K1jMyfchgUaQfBrXBK31bzZW982vH9bmoRiHCPcX
plo1X8HM0XWLlMpdnuGcMTIjvnp5FVu3HykTFmA/cywt0VvJBNZGwtYz3Kwbt4Vt
C+3Mk8KgXJAs7zqNXrLP9w2yBFhN0R4ILSLZXtvRzkH533KuNiHEkcYijlBD2sjh
htuayu5nzyCoUlTV7ca0uAQe0/a/wti5bx5L/V0dBNhvgHZCeytbDqw2Kl5PUQY7
BZm3vUtXcnIHRnfNWeuRCkuSm3IXp1BJuNLLLgDC9ut1iopnyoSK7+5Sxt0pYL4O
yfn28evr97sQl65hR5xilBZCVpBpJo/m9IJgjY3behCJPR7Tuawl3LhaB6f++WQr
fUsPA2BmyWeKdKbq1rZv4Pq22bz/3Bzh5+XvSv1tNu1wh4G/I+m9YclC9KOd8GlX
M6iELzdiMUU4
=3TcT
-----END PGP SIGNATURE-----
Merge tag 'pm-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix the handling of block devices in the test_resume mode of
hibernation (Chen Yu)"
* tag 'pm-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix the exclusive get block device in test_resume mode
PM: hibernate: Rename function parameter from snapshot_test to exclusive
Commit:
5a5d7e9bad ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG")
amended warn_slowpath_fmt() to disable preemption until the WARN splat
has been emitted.
However the commit neglected to reenable preemption in the !fmt codepath,
i.e. when a WARN splat is emitted without additional format string.
One consequence is that users may see more splats than intended. E.g. a
WARN splat emitted in a work item results in at least two extra splats:
BUG: workqueue leaked lock or atomic
(emitted by process_one_work())
BUG: scheduling while atomic
(emitted by worker_thread() -> schedule())
Ironically the point of the commit was to *avoid* extra splats. ;)
Fix it.
Fixes: 5a5d7e9bad ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/3ec48fde01e4ee6505f77908ba351bad200ae3d1.1694763684.git.lukas@wunner.de
- Add missing LOCKDOWN checks for eventfs callers
When LOCKDOWN is active for tracing, it causes inconsistent state
when some functions succeed and others fail.
- Use dput() to free the top level eventfs descriptor
There was a race between accesses and freeing it.
- Fix a long standing bug that eventfs exposed due to changing timings
by dynamically creating files. That is, If a event file is opened
for an instance, there's nothing preventing the instance from being
removed which will make accessing the files cause use-after-free bugs.
- Fix a ring buffer race that happens when iterating over the ring
buffer while writers are active. Check to make sure not to read
the event meta data if it's beyond the end of the ring buffer sub buffer.
- Fix the print trigger that disappeared because the test to create it
was looking for the event dir field being filled, but now it has the
"ef" field filled for the eventfs structure.
- Remove the unused "dir" field from the event structure.
- Fix the order of the trace_dynamic_info as it had it backwards for the
offset and len fields for which one was for which endianess.
- Fix NULL pointer dereference with eventfs_remove_rec()
If an allocation fails in one of the eventfs_add_*() functions,
the caller of it in event_subsystem_dir() or event_create_dir()
assigns the result to the structure. But it's assigning the ERR_PTR
and not NULL. This was passed to eventfs_remove_rec() which expects
either a good pointer or a NULL, not ERR_PTR. The fix is to not
assign the ERR_PTR to the structure, but to keep it NULL on error.
- Fix list_for_each_rcu() to use list_for_each_srcu() in
dcache_dir_open_wrapper(). One iteration of the code used RCU
but because it had to call sleepable code, it had to be changed
to use SRCU, but one of the iterations was missed.
- Fix synthetic event print function to use "as_u64" instead of
passing in a pointer to the union. To fix big/little endian issues,
the u64 that represented several types was turned into a union to
define the types properly.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQCvoBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtgrAP9MiYiCMU+90oJ+61DFchbs3y7BNidP
s3lLRDUMJ935NQD/SSAm54PqWb+YXMpD7m9+3781l6xqwfabBMXNaEl+FwA=
=tlZu
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Add missing LOCKDOWN checks for eventfs callers
When LOCKDOWN is active for tracing, it causes inconsistent state
when some functions succeed and others fail.
- Use dput() to free the top level eventfs descriptor
There was a race between accesses and freeing it.
- Fix a long standing bug that eventfs exposed due to changing timings
by dynamically creating files. That is, If a event file is opened for
an instance, there's nothing preventing the instance from being
removed which will make accessing the files cause use-after-free
bugs.
- Fix a ring buffer race that happens when iterating over the ring
buffer while writers are active. Check to make sure not to read the
event meta data if it's beyond the end of the ring buffer sub buffer.
- Fix the print trigger that disappeared because the test to create it
was looking for the event dir field being filled, but now it has the
"ef" field filled for the eventfs structure.
- Remove the unused "dir" field from the event structure.
- Fix the order of the trace_dynamic_info as it had it backwards for
the offset and len fields for which one was for which endianess.
- Fix NULL pointer dereference with eventfs_remove_rec()
If an allocation fails in one of the eventfs_add_*() functions, the
caller of it in event_subsystem_dir() or event_create_dir() assigns
the result to the structure. But it's assigning the ERR_PTR and not
NULL. This was passed to eventfs_remove_rec() which expects either a
good pointer or a NULL, not ERR_PTR. The fix is to not assign the
ERR_PTR to the structure, but to keep it NULL on error.
- Fix list_for_each_rcu() to use list_for_each_srcu() in
dcache_dir_open_wrapper(). One iteration of the code used RCU but
because it had to call sleepable code, it had to be changed to use
SRCU, but one of the iterations was missed.
- Fix synthetic event print function to use "as_u64" instead of passing
in a pointer to the union. To fix big/little endian issues, the u64
that represented several types was turned into a union to define the
types properly.
* tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()
tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper()
tracing/synthetic: Print out u64 values properly
tracing/synthetic: Fix order of struct trace_dynamic_info
selftests/ftrace: Fix dependencies for some of the synthetic event tests
tracing: Remove unused trace_event_file dir field
tracing: Use the new eventfs descriptor for print trigger
ring-buffer: Do not attempt to read past "commit"
tracefs/eventfs: Free top level files on removal
ring-buffer: Avoid softlockup in ring_buffer_resize()
tracing: Have event inject files inc the trace array ref count
tracing: Have option files inc the trace array ref count
tracing: Have current_trace inc the trace array ref count
tracing: Have tracing_max_latency inc the trace array ref count
tracing: Increase trace array ref count on enable and filter files
tracefs/eventfs: Use dput to free the toplevel events directory
tracefs/eventfs: Add missing lockdown checks
tracefs: Add missing lockdown check to tracefs_create_dir()
For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.
Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.
Fixes: fee1759e4f ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
Commit 8ac0406335 ("swiotlb: reduce the number of areas to match
actual memory pool size") calculated the reduced number of areas in
swiotlb_init_remap() but didn't actually use the value. Replace usage of
default_nareas accordingly.
Fixes: 8ac0406335 ("swiotlb: reduce the number of areas to match actual memory pool size")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fix missing or extra function parameter kernel-doc warnings
in cgroup.c:
kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace'
kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 5904de0d73 ("PM: hibernate: Do not get block device exclusively
in test_resume mode") fixes a hibernation issue under test_resume mode.
That commit is supposed to open the block device in non-exclusive mode
when in test_resume. However the code does the opposite, which is against
its description.
In summary, the swap device is only opened exclusively by swsusp_check()
with its corresponding *close(), and must be in non test_resume mode.
This is to avoid the race condition that different processes scribble the
device at the same time. All the other cases should use non-exclusive mode.
Fix it by really disabling exclusive mode under test_resume.
Fixes: 5904de0d73 ("PM: hibernate: Do not get block device exclusively in test_resume mode")
Closes: https://lore.kernel.org/lkml/000000000000761f5f0603324129@google.com/
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Chenzhou Feng <chenzhoux.feng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Several functions reply on snapshot_test to decide whether to
open the resume device exclusively. However there is no strict
connection between the snapshot_test and the open mode. Rename
the 'snapshot_test' input parameter to 'exclusive' to better reflect
the use case.
No functional change is expected.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix for a bug observable under the following sequence of events:
1. Create a network device that does not support XDP offload.
2. Load a device bound XDP program with BPF_F_XDP_DEV_BOUND_ONLY flag
(such programs are not offloaded).
3. Load a device bound XDP program with zero flags
(such programs are offloaded).
At step (2) __bpf_prog_dev_bound_init() associates with device (1)
a dummy bpf_offload_netdev struct with .offdev field set to NULL.
At step (3) __bpf_prog_dev_bound_init() would reuse dummy struct
allocated at step (2).
However, downstream usage of the bpf_offload_netdev assumes that
.offdev field can't be NULL, e.g. in bpf_prog_offload_verifier_prep().
Adjust __bpf_prog_dev_bound_init() to require bpf_offload_netdev
with non-NULL .offdev for offloaded BPF programs.
Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Reported-by: syzbot+291100dcb32190ec02a8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000d97f3c060479c4f8@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230912005539.2248244-2-eddyz87@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Sysbot discovered that the queue and stack maps can deadlock if they are
being used from a BPF program that can be called from NMI context (such as
one that is attached to a perf HW counter event). To fix this, add an
in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if
grabbing the lock fails.
Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
Co-developed-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add extra check in bpf_mem_alloc_init() to ensure the unit_size of
bpf_mem_cache is matched with the object_size of underlying slab cache.
If these two sizes are unmatched, print a warning once and return
-EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and
the potential issue can be prevented.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When the unit_size of a bpf_mem_cache is unmatched with the object_size
of the underlying slab cache, the bpf_mem_cache will not be used, and
the allocation will be redirected to a bpf_mem_cache with a bigger
unit_size instead, so there is no need to prefill for these
unused bpf_mem_caches.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The following warning was reported when running "./test_progs -a
link_api -a linked_list" on a RISC-V QEMU VM:
------------[ cut here ]------------
WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill
Modules linked in: bpf_testmod(OE)
CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2
Hardware name: riscv-virtio,qemu (DT)
epc : bpf_mem_refill+0x1fc/0x206
ra : irq_work_single+0x68/0x70
epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
t5 : ff6000008fd28278 t6 : 0000000000040000
[<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
[<ffffffff8015fe84>] irq_work_single+0x68/0x70
[<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
[<ffffffff8015fefa>] irq_work_run+0x38/0x66
[<ffffffff8000828a>] handle_IPI+0x3a/0xb4
[<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
[<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
[<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
[<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
[<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
[<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
[<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
[<ffffffff812b6904>] do_irq+0x66/0x98
---[ end trace 0000000000000000 ]---
The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in
free_bulk(). The direct reason is that a object is allocated and
freed by bpf_mem_caches with different unit_size.
The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes
slab cache in the specific VM. When linked_list test allocates a
72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it
from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is
backed by 128-bytes slab cache. When the object is freed, bpf_mem_free()
uses ksize() to choose the corresponding bpf_mem_cache. Because the
object is allocated from 128-bytes slab cache, ksize() returns 128,
bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and
triggers the warning.
A similar warning will also be reported when using CONFIG_SLAB instead
of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines
KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32.
An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to
choose a bpf_mem_cache which has the same unit_size with the backing
slab cache, but it may introduce performance degradation, so fix the
warning by adjusting the indexes in size_index according to the value of
KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does.
Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Reported-by: Björn Töpel <bjorn@kernel.org>
Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* The kernel now dynamically probes for misaligned access speed, as
opposed to relying on a table of known implementations.
* Support for non-coherent devices on systems using the Andes AX45MP
core, including the RZ/Five SoCs.
* Support for the V extension in ptrace(), again.
* Support for KASLR.
* Support for the BPF prog pack allocator in RISC-V.
* A handful of bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmT8eV0THHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiQYTD/9V6asKMDdWUV+gti/gRvJsiYUjIrrK
h4MB8hL3fHfCLBpTD4rU6K1Gx6hzPjGsxIuQyAq/hf752KB/9XUiIVziRBv2ZEBb
GuTFCXfg0QXBUlxBZzFw5SKUuKXgRaMAQ14qjy3tfLk31YMQmBtAlEPdDM8mZOCQ
zNI3bbdn6zASeaSMh7hwBoOJWP2ACoOEW7RcD44EDT8jb3YW5rEF86x0XtYLgJb6
xhaR4ieIdaOLxz2RbjXj0GcPIBfhTxZbwN3fLlD8PxuGqCKn5kN03bPPwP9tMTAc
z02EgVcSDvJWpYikuuTkPMxpSi18OZPJ6eriwOv5ccP5NXQScO09iGo7IZEM7OzO
j1IrIXyncU4BhxlpWombU454Va+ezUlfh9uh+MrJ+Bnve3T3S9ax7AV4S8vkJZlT
bnmJVS/g7L/7nxTQdJ3zoAo2WzFQXL0C8SR5tGo/3aRk0uYoliHy/W419f55F9GZ
rFcc+LMqai8N4bLN3whaK0NnuodNWHoNlpcd/5ncJwecswuDkah3LWcd4rwBrWhu
8RIkIfpdr/vTQjUVXVLeMHdKB+lST3iF1feMqJj0PfTyvTZi5yfSppjAfkAdVq+9
lHqAjsaGdiCrOtLxb0oBR2PTDQPAm2gN2meuSMommDQR6Vul8K5WcQml9Zx9QEWA
eDXWYDZISKYJbA==
=s89m
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.6-mw2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Palmer Dabbelt:
- The kernel now dynamically probes for misaligned access speed, as
opposed to relying on a table of known implementations.
- Support for non-coherent devices on systems using the Andes AX45MP
core, including the RZ/Five SoCs.
- Support for the V extension in ptrace(), again.
- Support for KASLR.
- Support for the BPF prog pack allocator in RISC-V.
- A handful of bug fixes and cleanups.
* tag 'riscv-for-linus-6.6-mw2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (25 commits)
soc: renesas: Kconfig: For ARCH_R9A07G043 select the required configs if dependencies are met
riscv: Kconfig.errata: Add dependency for RISCV_SBI in ERRATA_ANDES config
riscv: Kconfig.errata: Drop dependency for MMU in ERRATA_ANDES_CMO config
riscv: Kconfig: Select DMA_DIRECT_REMAP only if MMU is enabled
bpf, riscv: use prog pack allocator in the BPF JIT
riscv: implement a memset like function for text
riscv: extend patch_text_nosync() for multiple pages
bpf: make bpf_prog_pack allocator portable
riscv: libstub: Implement KASLR by using generic functions
libstub: Fix compilation warning for rv32
arm64: libstub: Move KASLR handling functions to kaslr.c
riscv: Dump out kernel offset information on panic
riscv: Introduce virtual kernel mapping KASLR
RISC-V: Add ptrace support for vectors
soc: renesas: Kconfig: Select the required configs for RZ/Five SoC
cache: Add L2 cache management for Andes AX45MP RISC-V core
dt-bindings: cache: andestech,ax45mp-cache: Add DT binding documentation for L2 cache controller
riscv: mm: dma-noncoherent: nonstandard cache operations support
riscv: errata: Add Andes alternative ports
riscv: asm: vendorid_list: Add Andes Technology to the vendors list
...
- move a dma-debug call that prints a message out from a lock that's
causing problems with the lock order in serial drivers (Sergey Senozhatsky)
- fix the CONFIG_DMA_NUMA_CMA Kconfig entry to have the right dependency
on not default to y (Christoph Hellwig)
- move an ifdef a bit to remove a __maybe_unused that seems to trip up
some sensitivities (Christoph Hellwig)
- revert a bogus check in the CMA allocator (Zhenhua Huang)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmT8MJ8LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOOyg/7BaZt4lih24cSFhXsU10wB49cKQuOngsT+4KFX7Eg
8LqdCK/L61M1VWm/1tcw6/XKK7oBDADIWnlutFTv1BHIaC0fjbFk0IpOQyWoXsYE
vlCzm6Sh1rrryYK/dviLwC3+ccF33C5dzsZsKzeJ89v0cP//rENdOfXhwAIT2Wc7
FG/h0W7wWeQG4jHCz+DGhRp7X6r5urwW72KNap4BlBaDpZdAV9E5w1OeaplnENXt
E7CK8rnHZz/AgH98LIMa929fgNhJ7Bec5mV9pItpBzYWAwL2iWk6k11FbAkxDiFQ
MQ7gMnH5KRkCVpH/QILX1NU5ImsjdjyCUYn+2q8OJ5Y2C42K7V5ClspIgBCFk7XH
AbCAGveoRUoQ3iCnQYZ0YPJOtyNSaUNB+q4NzwR/WFipiXGJBPflggI8gPzRiN8e
8SnKhduODklzLhOZZr7+nUEJpmwgR8aCkmhERZN/bw8iidBs/chiy9t1PfLmiOH9
R545BWEfsgpRikpEgb0HnuGD/zg26LcugLUilNbSZq0fzH4tkD5iCKKUitAITYLP
ED0wDj9AdyQ98L6aSgAxc+9ip3eATFntM6hlg/16Ve4zdDTbFgucZpJhdSuVF8BI
rq1VV3wztUm++zhkTFptK9eN2T+AQtpC+3XI6QyXpByADbX+m8GbQJm+r8V3hu5A
X64=
=AvYr
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.6-2023-09-09' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- move a dma-debug call that prints a message out from a lock that's
causing problems with the lock order in serial drivers (Sergey
Senozhatsky)
- fix the CONFIG_DMA_NUMA_CMA Kconfig entry to have the right
dependency and not default to y (Christoph Hellwig)
- move an ifdef a bit to remove a __maybe_unused that seems to trip up
some sensitivities (Christoph Hellwig)
- revert a bogus check in the CMA allocator (Zhenhua Huang)
* tag 'dma-mapping-6.6-2023-09-09' of git://git.infradead.org/users/hch/dma-mapping:
Revert "dma-contiguous: check for memory region overlap"
dma-pool: remove a __maybe_unused label in atomic_pool_expand
dma-contiguous: fix the Kconfig entry for CONFIG_DMA_NUMA_CMA
dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lock
Now that eventfs structure is used to create the events directory via the
eventfs dynamically allocate code, the "dir" field of the trace_event_file
structure is no longer used. Remove it.
Link: https://lkml.kernel.org/r/20230908022001.580400115@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The check to create the print event "trigger" was using the obsolete "dir"
value of the trace_event_file to determine if it should create the trigger
or not. But that value will now be NULL because it uses the event file
descriptor.
Change it to test the "ef" field of the trace_event_file structure so that
the trace_marker "trigger" file appears again.
Link: https://lkml.kernel.org/r/20230908022001.371815239@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Fixes: 27152bceea ("eventfs: Move tracing/events to eventfs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When iterating over the ring buffer while the ring buffer is active, the
writer can corrupt the reader. There's barriers to help detect this and
handle it, but that code missed the case where the last event was at the
very end of the page and has only 4 bytes left.
The checks to detect the corruption by the writer to reads needs to see the
length of the event. If the length in the first 4 bytes is zero then the
length is stored in the second 4 bytes. But if the writer is in the process
of updating that code, there's a small window where the length in the first
4 bytes could be zero even though the length is only 4 bytes. That will
cause rb_event_length() to read the next 4 bytes which could happen to be off the
allocated page.
To protect against this, fail immediately if the next event pointer is
less than 8 bytes from the end of the commit (last byte of data), as all
events must be a minimum of 8 bytes anyway.
Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the multi_kprobe link attach does not check error
injection list for programs with bpf_override_return helper
and allows them to attach anywhere. Adding the missing check.
Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20230907200652.926951-1-jolsa@kernel.org
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmT6xsgACgkQUqAMR0iA
lPIpeA/+P8Jq25Hptd+Zh5XxkNO0lHKt8o9qZxgLwyA0Mh1YWLCTrmliqeK6jTWy
CrKXkeyZmOQxzTFrflEN9OVQ/NnIBZBdiT6TWl64pI87pc+ElS+awC1BxS9gKDRC
vhhwiMFWHt4DWlVhjKA+ARShYI/uI5+b/Ewlmg4ZeksWxWcFfnoZb0BCkQKGZcN+
9G1C1mPtxV4lT1FJAglkgx3hF3+BcpX9EEVYdjdpQbD9J0jmlTP08/w+yyMvTzjs
BVRPtcPeq0eb5iFp06SJKqa4377j6N9KMKtZG5IjzbkU/N6u7C4hXYxiGM/nOmyC
022/ZuFP6uwoeOiWBfuPJK9cadsomMbqeSJxC8wh/eLRqgTKU7N6wt8ybSSNynrC
oMzdEI+ovjYIVrb13ZFDE+YFsXCzNhw1xNMmxdJMGQeeFNVkjK5MV1yaGdisffgj
ps3eJbdaklFgU1m2GUkoVKglLeiYsihyvnSDZuxgfe+12GPReE0eTjvyj8gRUOd3
DLd1GPI7gmUv0c3k9VHyLu/ATrlBB9BvZ4cT2+anrC4Trf8Al2E8xgkzhYceqMOj
6XZFoGUGW+nhxU0vGy1psYCyw1k4L71vcT/WJD9ul+6RoHvwAnbDhmfu00W4LJ7C
YwMheIc+00v4ofu/oXt32DhVFuAy06VdGiC4LYwWKkaPJELatPA=
=bhor
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
- Revert exporting symbols needed for dumping the raw printk buffer in
panic().
I pushed the export prematurely before the user was ready for merging
into the mainline.
* tag 'printk-for-6.6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
Revert "printk: export symbols for debug modules"
Puranjay Mohan <puranjay12@gmail.com> says:
Here is some data to prove the V2 fixes the problem:
Without this series:
root@rv-selftester:~/src/kselftest/bpf# time ./test_tag
test_tag: OK (40945 tests)
real 7m47.562s
user 0m24.145s
sys 6m37.064s
With this series applied:
root@rv-selftester:~/src/selftest/bpf# time ./test_tag
test_tag: OK (40945 tests)
real 7m29.472s
user 0m25.865s
sys 6m18.401s
BPF programs currently consume a page each on RISCV. For systems with many BPF
programs, this adds significant pressure to instruction TLB. High iTLB pressure
usually causes slow down for the whole system.
Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue.
It packs multiple BPF programs into a single huge page. It is currently only
enabled for the x86_64 BPF JIT.
I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now.
This patch series enables the BPF prog pack allocator for the RISCV BPF JIT.
======================================================
Performance Analysis of prog pack allocator on RISCV64
======================================================
Test setup:
===========
Host machine: Debian GNU/Linux 11 (bullseye)
Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1)
u-boot-qemu Version: 2023.07+dfsg-1
opensbi Version: 1.3-1
To test the performance of the BPF prog pack allocator on RV, a stresser
tool[4] linked below was built. This tool loads 8 BPF programs on the system and
triggers 5 of them in an infinite loop by doing system calls.
The runner script starts 20 instances of the above which loads 8*20=160 BPF
programs on the system, 5*20=100 of which are being constantly triggered.
The script is passed a command which would be run in the above environment.
The script was run with following perf command:
./run.sh "perf stat -a \
-e iTLB-load-misses \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e instructions \
--timeout 60000"
The output of the above command is discussed below before and after enabling the
BPF prog pack allocator.
The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs
was created using Bjorn's riscv-cross-builder[5] docker container linked below.
Results
=======
Before enabling prog pack allocator:
------------------------------------
Performance counter stats for 'system wide':
4939048 iTLB-load-misses
5468689 dTLB-load-misses
465234 dTLB-store-misses
1441082097998 instructions
60.045791200 seconds time elapsed
After enabling prog pack allocator:
-----------------------------------
Performance counter stats for 'system wide':
3430035 iTLB-load-misses
5008745 dTLB-load-misses
409944 dTLB-store-misses
1441535637988 instructions
60.046296600 seconds time elapsed
Improvements in metrics
=======================
It was expected that the iTLB-load-misses would decrease as now a single huge
page is used to keep all the BPF programs compared to a single page for each
program earlier.
--------------------------------------------
The improvement in iTLB-load-misses: -30.5 %
--------------------------------------------
I repeated this expriment more than 100 times in different setups and the
improvement was always greater than 30%.
This patch series is boot tested on the Starfive VisionFive 2 board[6].
The performance analysis was not done on the board because it doesn't
expose iTLB-load-misses, etc. The stresser program was run on the board to test
the loading and unloading of BPF programs
[1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/
[2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/
[3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/
[4] https://github.com/puranjaymohan/BPF-Allocator-Bench
[5] https://github.com/bjoto/riscv-cross-builder
[6] https://www.starfivetech.com/en/site/boards
* b4-shazam-merge:
bpf, riscv: use prog pack allocator in the BPF JIT
riscv: implement a memset like function for text
riscv: extend patch_text_nosync() for multiple pages
bpf: make bpf_prog_pack allocator portable
Link: https://lore.kernel.org/r/20230831131229.497941-1-puranjay12@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This reverts commit 3fa6456ebe.
The Commit broke the CMA region creation through DT on arm64,
as showed below logs with "memblock=debug":
[ 0.000000] memblock_phys_alloc_range: 41943040 bytes align=0x200000
from=0x0000000000000000 max_addr=0x00000000ffffffff
early_init_dt_alloc_reserved_memory_arch+0x34/0xa0
[ 0.000000] memblock_reserve: [0x00000000fd600000-0x00000000ffdfffff]
memblock_alloc_range_nid+0xc0/0x19c
[ 0.000000] Reserved memory: overlap with other memblock reserved region
>From call flow, region we defined in DT was always reserved before entering
into rmem_cma_setup. Also, rmem_cma_setup has one routine cma_init_reserved_mem
to ensure the region was reserved. Checking the region not reserved here seems
not correct.
early_init_fdt_scan_reserved_mem:
fdt_scan_reserved_mem
__reserved_mem_reserve_reg
early_init_dt_reserve_memory
memblock_reserve(using “reg” prop case)
fdt_init_reserved_mem
__reserved_mem_alloc_size
*early_init_dt_alloc_reserved_memory_arch*
memblock_reserve(dynamic alloc case)
__reserved_mem_init_node
rmem_cma_setup(region overlap check here should always fail)
Example DT can be used to reproduce issue:
dump_mem: mem_dump_region {
compatible = "shared-dma-pool";
alloc-ranges = <0x0 0x00000000 0x0 0xffffffff>;
reusable;
size = <0 0x2800000>;
};
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Current release - regressions:
- eth: stmmac: fix failure to probe without MAC interface specified
Current release - new code bugs:
- docs: netlink: fix missing classic_netlink doc reference
Previous releases - regressions:
- deal with integer overflows in kmalloc_reserve()
- use sk_forward_alloc_get() in sk_get_meminfo()
- bpf_sk_storage: fix the missing uncharge in sk_omem_alloc
- fib: avoid warn splat in flow dissector after packet mangling
- skb_segment: call zero copy functions before using skbuff frags
- eth: sfc: check for zero length in EF10 RX prefix
Previous releases - always broken:
- af_unix: fix msg_controllen test in scm_pidfd_recv() for
MSG_CMSG_COMPAT
- xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()
- netfilter:
- nft_exthdr: fix non-linear header modification
- xt_u32, xt_sctp: validate user space input
- nftables: exthdr: fix 4-byte stack OOB write
- nfnetlink_osf: avoid OOB read
- one more fix for the garbage collection work from last release
- igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
- bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t
- handshake: fix null-deref in handshake_nl_done_doit()
- ip: ignore dst hint for multipath routes to ensure packets
are hashed across the nexthops
- phy: micrel:
- correct bit assignments for cable test errata
- disable EEE according to the KSZ9477 errata
Misc:
- docs/bpf: document compile-once-run-everywhere (CO-RE) relocations
- Revert "net: macsec: preserve ingress frame ordering", it appears
to have been developed against an older kernel, problem doesn't
exist upstream
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmT6R6wACgkQMUZtbf5S
IrsmTg//TgmRjxSZ0lrPQtJwZR/eN3ZR2oQG3rwnssCx+YgHEGGxQsfT4KHEMacR
ZgGDZVTpthUJkkACBPi8ZMoy++RdjEmlCcanfeDkGHoYGtiX1lhkofhLMn1KUHbI
rIbP9EdNKxQT0SsBlw/U28pD5jKyqOgL23QobEwmcjLTdMpamb+qIsD6/xNv9tEj
Tu4BdCIkhjxnBD622hsE3pFTG7oSn2WM6rf5NT1E43mJ3W8RrMcydSB27J7Oryo9
l3nYMAhz0vQINS2WQ9eCT1/7GI6gg1nDtxFtrnV7ASvxayRBPIUr4kg1vT+Tixsz
CZMnwVamEBIYl9agmj7vSji7d5nOUgXPhtWhwWUM2tRoGdeGw3vSi1pgDvRiUCHE
PJ4UHv7goa2AgnOlOQCFtRybAu+9nmSGm7V+GkeGLnH7xbFsEa5smQ/+FSPJs8Dn
Yf4q5QAhdN8tdnofRlrN/nCssoDF3cfmBsTJ7wo5h71gW+BWhsP58eDCJlXd/r8k
+Qnvoe2kw27ktFR1tjsUDZ0AcSmeVARNwmXCOBYZsG4tEek8pLyj008mDvJvdfyn
PGPn7Eo5DyaERlHVmPuebHXSyniDEPe2GLTmlHcGiRpGspoUHbB+HRiDAuRLMB9g
pkL8RHpNfppnuUXeUoNy3rgEkYwlpTjZX0QHC6N8NQ76ccB6CNM=
=YpmE
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking updates from Jakub Kicinski:
"Including fixes from netfilter and bpf.
Current release - regressions:
- eth: stmmac: fix failure to probe without MAC interface specified
Current release - new code bugs:
- docs: netlink: fix missing classic_netlink doc reference
Previous releases - regressions:
- deal with integer overflows in kmalloc_reserve()
- use sk_forward_alloc_get() in sk_get_meminfo()
- bpf_sk_storage: fix the missing uncharge in sk_omem_alloc
- fib: avoid warn splat in flow dissector after packet mangling
- skb_segment: call zero copy functions before using skbuff frags
- eth: sfc: check for zero length in EF10 RX prefix
Previous releases - always broken:
- af_unix: fix msg_controllen test in scm_pidfd_recv() for
MSG_CMSG_COMPAT
- xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()
- netfilter:
- nft_exthdr: fix non-linear header modification
- xt_u32, xt_sctp: validate user space input
- nftables: exthdr: fix 4-byte stack OOB write
- nfnetlink_osf: avoid OOB read
- one more fix for the garbage collection work from last release
- igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
- bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t
- handshake: fix null-deref in handshake_nl_done_doit()
- ip: ignore dst hint for multipath routes to ensure packets are
hashed across the nexthops
- phy: micrel:
- correct bit assignments for cable test errata
- disable EEE according to the KSZ9477 errata
Misc:
- docs/bpf: document compile-once-run-everywhere (CO-RE) relocations
- Revert "net: macsec: preserve ingress frame ordering", it appears
to have been developed against an older kernel, problem doesn't
exist upstream"
* tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs()
Revert "net: team: do not use dynamic lockdep key"
net: hns3: remove GSO partial feature bit
net: hns3: fix the port information display when sfp is absent
net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue
net: hns3: fix debugfs concurrency issue between kfree buffer and read
net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read()
net: hns3: Support query tx timeout threshold by debugfs
net: hns3: fix tx timeout issue
net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)
netfilter: nf_tables: Unbreak audit log reset
netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
netfilter: nfnetlink_osf: avoid OOB read
netfilter: nftables: exthdr: fix 4-byte stack OOB write
selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
bpf: bpf_sk_storage: Fix invalid wait context lockdep report
s390/bpf: Pass through tail call counter in trampolines
...
When user resize all trace ring buffer through file 'buffer_size_kb',
then in ring_buffer_resize(), kernel allocates buffer pages for each
cpu in a loop.
If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be allocated, it may not give up cpu
for a long time and finally cause a softlockup.
To avoid it, call cond_resched() after each cpu buffer allocation.
Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The event inject files add events for a specific trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.
Up the ref count of the trace_array when a event inject file is opened.
Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 6c3edaf9fd ("tracing: Introduce trace event injection")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The option files update the options for a given trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.
Up the ref count of the trace_array when an option file is opened.
Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The current_trace updates the trace array tracer. For an instance, if the
file is opened and the instance is deleted, reading or writing to the file
will cause a use after free.
Up the ref count of the trace array when current_trace is opened.
Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_max_latency file points to the trace_array max_latency field.
For an instance, if the file is opened and the instance is deleted,
reading or writing to the file will cause a use after free.
Up the ref count of the trace_array when tracing_max_latency is opened.
Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>