Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions which prevents
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230217-kobj_type-irq-v1-1-fedfacaf8cdb@weissschuh.net
Calling msi_ctrl_valid() ultimately results in calling
msi_get_device_domain(), which requires holding the device MSI lock.
However, in msi_domain_populate_irqs() the lock is taken right after having
called msi_ctrl_valid(), which is just a tad too late.
Take the lock before invoking msi_ctrl_valid().
Fixes: 40742716f2 ("genirq/msi: Make msi_add_simple_msi_descs() device domain aware")
Reported-by: "Russell King (Oracle)" <linux@armlinux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/Y/Opu6ETe3ZzZ/8E@shell.armlinux.org.uk
Link: https://lore.kernel.org/r/20230220190101.314446-1-maz@kernel.org
If ipi_send_{mask|single}() is called with an invalid interrupt number, all
the local variables there will be NULL. ipi_send_verify() which is invoked
from these functions does verify its 'data' parameter, resulting in a
kernel oops in irq_data_get_affinity_mask() as the passed NULL pointer gets
dereferenced.
Add a missing NULL pointer check in ipi_send_verify()...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Fixes: 3b8e29a82d ("genirq: Implement ipi_send_mask/single()")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/b541232d-c2b6-1fe9-79b4-a7129459e4d0@omp.ru
Posix-timers armed with a short interval with an ignored signal result
in an unpriviledged DoS. Due to the ignored signal the timer switches
into self rearm mode. This issue had been "fixed" before but a rework of
the alarmtimer code 5 years ago lost that workaround.
There is no real good solution for this issue, which is also worked around
in the core posix-timer code in the same way, but it certainly moved way
up on the ever growing todo list.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmPxYIwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocQ5D/45vfKxvXsF0mCHPpF+Dgx96T3rzpXY
aN/OeKWJ50A7IB5QKdxR3/VdDSuD6/HroFTfKTQIUVzkzAq47r7Ref7a0bgDsNbT
SLTX8oBUA1ALKVew3wGo5qkY317xyGb2WPiSCzfjAZk07MplOzDRC4YYgEfnYAWW
DzmzmkZQaAhKg3JONEB+kOas/E/T1MwTSs2NNJiY0hM6ATFK2+q2b706wjDHWspk
QwjCfUCbrFKSxol+hSQgvQpzV18FmS328xU4Ht4pioflch0dv1/HDrml9OGrCCzs
8hEmCcjxFwM2FXTPIix8/Jn4S8ppZwYQCi10bBjjF+0N9s897s/4dom88EbfwjfB
4YpE62SI8VZIyK0JauSuRcCAfnN7BwfgL9UpS3MEgHq4K+fnydZOItBYI3Rg4JY/
hAKizX5VzBSAq/iUFC9+DFKjonOz3cCvY5xoPLPfCXukykbYfOCz1BiBqrTP7n1z
/qWgDm+YvEyYNmoEqPjD5kktpHAvwGiMl6dmJJOUxaV6KNXWremykvI491oZK7fI
P3eWMPhJwJV7ukjUF6Wc5ylwupsakf+x802MF03rgmajXtzxnnFATotb/21g9SSD
/FYfdjS85CRArvYKuN59Yty8iMbjX4T1tRBaKgLYjpOdqgA6H6T3vmI+ShKS/FAM
T92uQTZMpVm7+w==
=s5VW
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A fix for a long standing issue in the alarmtimer code.
Posix-timers armed with a short interval with an ignored signal result
in an unpriviledged DoS. Due to the ignored signal the timer switches
into self rearm mode. This issue had been "fixed" before but a rework
of the alarmtimer code 5 years ago lost that workaround.
There is no real good solution for this issue, which is also worked
around in the core posix-timer code in the same way, but it certainly
moved way up on the ever growing todo list"
* tag 'timers-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Prevent starvation by small intervals and SIG_IGN
- New and improved irqdomain locking, closing a number of races that
became apparent now that we are able to probe drivers in parallel
- A bunch of OF node refcounting bugs have been fixed
- We now have a new IPI mux, lifted from the Apple AIC code and
made common. It is expected that riscv will eventually benefit
from it
- Two small fixes for the Broadcom L2 drivers
- Various cleanups and minor bug fixes
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmPw4OgPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDYVgP/iVFxCPs+DCWUYvyTC8rvNzOj51COHUV/7yD
mY5BTIjH3yTQPDhQmFvITCAjKaMYc3eDLml/nF4tTCU0MFig+KsRsWNIEFXtSsI0
wO+S19QhHzj5odUok5IDC+cNTXScp2HV+vFoOhhf0zDzXqwVxRr7lO5i+n37ELMp
Mm9g2+EeUt43xTQxzbmNn5Kkpq9PMEnQFU2UkvJleg+KCgzSYThcR8/KUDKySZpk
TP+mcR5PevcqGhLt7vYS2lGh8Ye1warzp54C7Je8P8Txg3BM8xBynT1d3fgrlKfm
AOAPVW3PV6bPhgVYXZJopH3ykfmYM4ZiIvhRcgLyf6tbZAU6Twpiq823TAOVHyPI
SRcW8dehuvgq1VJIpRGZOSB2qIvFrqLhl0B1CtT04gFWJW9bSa2n5Y1h4Gcqy29o
SLJiKscx2KqvPmQqarLUUnuOZ5hhIrtYhkhhJuuwqZqzS1Kkz/mSB1MkPQEGxJi1
MpoTfbQ/0KTYXCqqgs/GBnDJ0mYrcvtBoGP7bjnVYnXpANP2bs+ZpQVPVq+17uuQ
k0gjxe8iENqXjW6JMlFX5K3dxG5ygXjfECMWsCJ+JdCtJdaIL8I46X/u7wHU2mfY
bohhb7xS2+HIPxz6w8aRu3IQG00mMv06vCYPBbPh+W0dUtocdM3U2kpe5gPYm1iz
kWx3WLaM
=ONcj
-----END PGP SIGNATURE-----
Merge tag 'irqchip-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- New and improved irqdomain locking, closing a number of races that
became apparent now that we are able to probe drivers in parallel
- A bunch of OF node refcounting bugs have been fixed
- We now have a new IPI mux, lifted from the Apple AIC code and
made common. It is expected that riscv will eventually benefit
from it
- Two small fixes for the Broadcom L2 drivers
- Various cleanups and minor bug fixes
Link: https://lore.kernel.org/r/20230218143452.3817627-1-maz@kernel.org
Just after the fix to TASK_COMM_LEN not converted to its value in
trace_events was pulled, the kernel test robot reported that the helper
function trace_define_field_ext() added to that change was only used in
the file it was defined in but was not declared static. Make it a local
function.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY+pQvhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qpWMAP958Izvo22zPjlvqypLrC4wkwOrU6BG
ITApOESLGS6YMAEA3X1qVpjgXClFmRv6j+J7S6LdhUzhkOm9Sxg5Vejxzgo=
=4fmj
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixlet from Steven Rostedt:
"Make trace_define_field_ext() static.
Just after the fix to TASK_COMM_LEN not converted to its value in
trace_events was pulled, the kernel test robot reported that the
helper function trace_define_field_ext() added to that change was only
used in the file it was defined in but was not declared static.
Make it a local function"
* tag 'trace-v6.2-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Make trace_define_field_ext() static
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path:
do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy
However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit:
fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue
This results in use-after-free as pasted below.
The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it is in line
with the comment at commit 42288cb44c ("wait: add wake_up_pollfree()")
since the waitqueue's lifetime is not tied to file's one and can be
considered as another special case. While this would be fixable by somehow
making cgroup_file_release() be tied to the fput(), it would require
sizable refactoring at cgroups or higher layer which might be more
justifiable if we identify more cases like this.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
Call Trace:
<TASK>
dump_stack_lvl+0x73/0xa0
print_report+0x16c/0x4e0
kasan_report+0xc3/0xf0
kasan_check_range+0x2d2/0x310
_raw_spin_lock_irqsave+0x60/0xc0
remove_wait_queue+0x1a/0xa0
ep_free+0x12c/0x170
ep_eventpoll_release+0x26/0x30
__fput+0x202/0x400
task_work_run+0x11d/0x170
do_exit+0x495/0x1130
do_group_exit+0x100/0x100
get_signal+0xd67/0xde0
arch_do_signal_or_restart+0x2a/0x2b0
exit_to_user_mode_prepare+0x94/0x100
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x52/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60
__kasan_kmalloc+0x85/0x90
psi_trigger_create+0x113/0x3e0
pressure_write+0x146/0x2e0
cgroup_file_write+0x11c/0x250
kernfs_fop_write_iter+0x186/0x220
vfs_write+0x3d8/0x5c0
ksys_write+0x90/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60
kasan_save_free_info+0x27/0x40
____kasan_slab_free+0x11d/0x170
slab_free_freelist_hook+0x87/0x150
__kmem_cache_free+0xcb/0x180
psi_trigger_destroy+0x2e8/0x310
cgroup_file_release+0x4f/0xb0
kernfs_drain_open_files+0x165/0x1f0
kernfs_drain+0x162/0x1a0
__kernfs_remove+0x1fb/0x310
kernfs_remove_by_name_ns+0x95/0xe0
cgroup_addrm_files+0x67f/0x700
cgroup_destroy_locked+0x283/0x3c0
cgroup_rmdir+0x29/0x100
kernfs_iop_rmdir+0xd1/0x140
vfs_rmdir+0xfe/0x240
do_rmdir+0x13d/0x280
__x64_sys_rmdir+0x2c/0x30
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com
syzbot reported a RCU stall which is caused by setting up an alarmtimer
with a very small interval and ignoring the signal. The reproducer arms the
alarm timer with a relative expiry of 8ns and an interval of 9ns. Not a
problem per se, but that's an issue when the signal is ignored because then
the timer is immediately rearmed because there is no way to delay that
rearming to the signal delivery path. See posix_timer_fn() and commit
58229a1899 ("posix-timers: Prevent softirq starvation by small intervals
and SIG_IGN") for details.
The reproducer does not set SIG_IGN explicitely, but it sets up the timers
signal with SIGCONT. That has the same effect as explicitely setting
SIG_IGN for a signal as SIGCONT is ignored if there is no handler set and
the task is not ptraced.
The log clearly shows that:
[pid 5102] --- SIGCONT {si_signo=SIGCONT, si_code=SI_TIMER, si_timerid=0, si_overrun=316014, si_int=0, si_ptr=NULL} ---
It works because the tasks are traced and therefore the signal is queued so
the tracer can see it, which delays the restart of the timer to the signal
delivery path. But then the tracer is killed:
[pid 5087] kill(-5102, SIGKILL <unfinished ...>
...
./strace-static-x86_64: Process 5107 detached
and after it's gone the stall can be observed:
syzkaller login: [ 79.439102][ C0] hrtimer: interrupt took 68471 ns
[ 184.460538][ C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
...
[ 184.658237][ C1] rcu: Stack dump where RCU GP kthread last ran:
[ 184.664574][ C1] Sending NMI from CPU 1 to CPUs 0:
[ 184.669821][ C0] NMI backtrace for cpu 0
[ 184.669831][ C0] CPU: 0 PID: 5108 Comm: syz-executor192 Not tainted 6.2.0-rc6-next-20230203-syzkaller #0
...
[ 184.670036][ C0] Call Trace:
[ 184.670041][ C0] <IRQ>
[ 184.670045][ C0] alarmtimer_fired+0x327/0x670
posix_timer_fn() prevents that by checking whether the interval for
timers which have the signal ignored is smaller than a jiffie and
artifically delay it by shifting the next expiry out by a jiffie. That's
accurate vs. the overrun accounting, but slightly inaccurate
vs. timer_gettimer(2).
The comment in that function says what needs to be done and there was a fix
available for the regular userspace induced SIG_IGN mechanism, but that did
not work due to the implicit ignore for SIGCONT and similar signals. This
needs to be worked on, but for now the only available workaround is to do
exactly what posix_timer_fn() does:
Increase the interval of self-rearming timers, which have their signal
ignored, to at least a jiffie.
Interestingly this has been fixed before via commit ff86bf0c65
("alarmtimer: Rate limit periodic intervals") already, but that fix got
lost in a later rework.
Reported-by: syzbot+b9564ba6e8e00694511b@syzkaller.appspotmail.com
Fixes: f2c45807d3 ("alarmtimer: Switch over to generic set/get/rearm routine")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87k00q1no2.ffs@tglx
* irq/irqdomain-locking:
: .
: irqdomain locking overhaul courtesy of Johan Hovold.
:
: From the cover letter:
:
: "Parallel probing (e.g. due to asynchronous probing) of devices that
: share interrupts can currently result in two mappings for the same
: hardware interrupt to be created.
:
: This series fixes this mapping race and reworks the irqdomain locking so
: that in the end the global irq_domain_mutex is only used for managing
: the likewise global irq_domain_list, while domain operations (e.g. IRQ
: allocations) use per-domain (hierarchy) locking."
: .
irqdomain: Switch to per-domain locking
irqchip/mvebu-odmi: Use irq_domain_create_hierarchy()
irqchip/loongson-pch-msi: Use irq_domain_create_hierarchy()
irqchip/gic-v3-mbi: Use irq_domain_create_hierarchy()
irqchip/gic-v3-its: Use irq_domain_create_hierarchy()
irqchip/gic-v2m: Use irq_domain_create_hierarchy()
irqchip/alpine-msi: Use irq_domain_add_hierarchy()
x86/uv: Use irq_domain_create_hierarchy()
x86/ioapic: Use irq_domain_create_hierarchy()
irqdomain: Clean up irq_domain_push/pop_irq()
irqdomain: Drop leftover brackets
irqdomain: Drop dead domain-name assignment
irqdomain: Drop revmap mutex
irqdomain: Fix domain registration race
irqdomain: Fix mapping-creation race
irqdomain: Refactor __irq_domain_alloc_irqs()
irqdomain: Look for existing mapping only once
irqdomain: Drop bogus fwspec-mapping error handling
irqdomain: Fix disassociation race
irqdomain: Fix association race
Signed-off-by: Marc Zyngier <maz@kernel.org>
The IRQ domain structures are currently protected by the global
irq_domain_mutex. Switch to using more fine-grained per-domain locking,
which can speed up parallel probing by reducing lock contention.
On a recent arm64 laptop, the total time spent waiting for the locks
during boot drops from 160 to 40 ms on average, while the maximum
aggregate wait time drops from 550 to 90 ms over ten runs for example.
Note that the domain lock of the root domain (innermost domain) must be
used for hierarchical domains. For non-hierarchical domains (as for root
domains), the new root pointer is set to the domain itself so that
&domain->root->mutex always points to the right lock.
Also note that hierarchical domains should be constructed using
irq_domain_create_hierarchy() (or irq_domain_add_hierarchy()) to avoid
having racing allocations access a not fully initialised domain. As a
safeguard, the lockdep assertion in irq_domain_set_mapping() will catch
any offenders that also fail to set the root domain pointer.
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-21-johan+linaro@kernel.org
The irq_domain_push_irq() interface is used to add a new (outmost) level
to a hierarchical domain after IRQs have been allocated.
Possibly due to differing mental images of hierarchical domains, the
names used for the irq_data variables make these functions much harder
to understand than what they need to be.
Rename the struct irq_data pointer to the data embedded in the
descriptor as simply 'irq_data' and refer to the data allocated by this
interface as 'parent_irq_data' so that the names reflect how
hierarchical domains are implemented.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-12-johan+linaro@kernel.org
Drop some unnecessary brackets that were left in place when the
corresponding code was updated.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-11-johan+linaro@kernel.org
Since commit d59f6617ee ("genirq: Allow fwnode to carry name
information only") an IRQ domain is always given a name during
allocation (e.g. used for the debugfs entry).
Drop the leftover name assignment when allocating the first IRQ.
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-10-johan+linaro@kernel.org
The revmap mutex is essentially only used to maintain the integrity of
the radix tree during updates (lookups use RCU).
As the global irq_domain_mutex is now held in all paths that update the
revmap structures there is strictly no longer any need for the dedicated
mutex, which can be removed.
Drop the revmap mutex and add lockdep assertions to the revmap helpers
to make sure that the global lock is always held when updating the
revmap.
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-9-johan+linaro@kernel.org
Hierarchical domains created using irq_domain_create_hierarchy() are
currently added to the domain list before having been fully initialised.
This specifically means that a racing allocation request might fail to
allocate irq data for the inner domains of a hierarchy in case the
parent domain pointer has not yet been set up.
Note that this is not really any issue for irqchip drivers that are
registered early (e.g. via IRQCHIP_DECLARE() or IRQCHIP_ACPI_DECLARE())
but could potentially cause trouble with drivers that are registered
later (e.g. modular drivers using IRQCHIP_PLATFORM_DRIVER_BEGIN(),
gpiochip drivers, etc.).
Fixes: afb7da83b9 ("irqdomain: Introduce helper function irq_domain_add_hierarchy()")
Cc: stable@vger.kernel.org # 3.19
Signed-off-by: Marc Zyngier <maz@kernel.org>
[ johan: add commit message ]
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-8-johan+linaro@kernel.org
Parallel probing of devices that share interrupts (e.g. when a driver
uses asynchronous probing) can currently result in two mappings for the
same hardware interrupt to be created due to missing serialisation.
Make sure to hold the irq_domain_mutex when creating mappings so that
looking for an existing mapping before creating a new one is done
atomically.
Fixes: 765230b5f0 ("driver-core: add asynchronous probing support for drivers")
Fixes: b62b2cf575 ("irqdomain: Fix handling of type settings for existing mappings")
Link: https://lore.kernel.org/r/YuJXMHoT4ijUxnRb@hovoldconsulting.com
Cc: stable@vger.kernel.org # 4.8
Cc: Dmitry Torokhov <dtor@chromium.org>
Cc: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-7-johan+linaro@kernel.org
Refactor __irq_domain_alloc_irqs() so that it can be called internally
while holding the irq_domain_mutex.
This will be used to fix a shared-interrupt mapping race, hence the
Fixes tag.
Fixes: b62b2cf575 ("irqdomain: Fix handling of type settings for existing mappings")
Cc: stable@vger.kernel.org # 4.8
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-6-johan+linaro@kernel.org
Avoid looking for an existing mapping twice when creating a new mapping
using irq_create_fwspec_mapping() by factoring out the actual allocation
which is shared with irq_create_mapping_affinity().
The new helper function will also be used to fix a shared-interrupt
mapping race, hence the Fixes tag.
Fixes: b62b2cf575 ("irqdomain: Fix handling of type settings for existing mappings")
Cc: stable@vger.kernel.org # 4.8
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-5-johan+linaro@kernel.org
In case a newly allocated IRQ ever ends up not having any associated
struct irq_data it would not even be possible to dispose the mapping.
Replace the bogus disposal with a WARN_ON().
This will also be used to fix a shared-interrupt mapping race, hence the
CC-stable tag.
Fixes: 1e2a7d7849 ("irqdomain: Don't set type when mapping an IRQ")
Cc: stable@vger.kernel.org # 4.8
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-4-johan+linaro@kernel.org
The global irq_domain_mutex is held when mapping interrupts from
non-hierarchical domains but currently not when disposing them.
This specifically means that updates of the domain mapcount is racy
(currently only used for statistics in debugfs).
Make sure to hold the global irq_domain_mutex also when disposing
mappings from non-hierarchical domains.
Fixes: 9dc6be3d41 ("genirq/irqdomain: Add map counter")
Cc: stable@vger.kernel.org # 4.13
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-3-johan+linaro@kernel.org
The sanity check for an already mapped virq is done outside of the
irq_domain_mutex-protected section which means that an (unlikely) racing
association may not be detected.
Fix this by factoring out the association implementation, which will
also be used in a follow-on change to fix a shared-interrupt mapping
race.
Fixes: ddaf144c61 ("irqdomain: Refactor irq_domain_associate_many()")
Cc: stable@vger.kernel.org # 3.11
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-2-johan+linaro@kernel.org
Since commit 8f9ea86fdf ("sched: Always preserve the user requested
cpumask"), a successful call to sched_setaffinity() should always save
the user requested cpu affinity mask in a task's user_cpus_ptr. However,
when the given cpu mask is the same as the current one, user_cpus_ptr
is not updated. Fix this by saving the user mask in this case too.
Fixes: 8f9ea86fdf ("sched: Always preserve the user requested cpumask")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com
Tetsuo-San noted that commit f5d39b0208 ("freezer,sched: Rewrite
core freezer logic") broke call_usermodehelper_exec() for the KILLABLE
case.
Specifically it was missed that the second, unconditional,
wait_for_completion() was not optional and ensures the on-stack
completion is unused before going out-of-scope.
Fixes: f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
Reported-by: syzbot+6cd18e123583550cf469@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y90ar35uKQoUrLEK@hirez.programming.kicks-ass.net
trace_define_field_ext() is not used outside of trace_events.c, it should
be static.
Link: https://lore.kernel.org/oe-kbuild-all/202302130750.679RaRog-lkp@intel.com/
Fixes: b6c7abd1c2 ("tracing: Fix TASK_COMM_LEN in trace event format file")
Reported-by: Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The TASK_COMM_LEN was converted from a macro into an enum so that BTF
would have access to it. But this unfortunately caused TASK_COMM_LEN to
display in the format fields of trace events, as they are created by the
TRACE_EVENT() macro and such, macros convert to their values, where as
enums do not.
To handle this, instead of using the field itself to be display, save the
value of the array size as another field in the trace_event_fields
structure, and use that instead. Not only does this fix the issue, but
also converts the other trace events that have this same problem (but were
not breaking tooling). With this change, the original work around
b3bc8547d3 ("tracing: Have TRACE_DEFINE_ENUM affect trace event types
as well") could be reverted (but that should be done in the merge window).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY+lOqxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quYPAQD+9j+RPIUm9Ms4XCIEOXkFI04yjsbd
rQSRcpYBWyP59AEAnZNYNwE11vDsKBGxPrOPgGYe4Pzfr5yOWY84mgaxgwo=
=iYsE
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix showing of TASK_COMM_LEN instead of its value
The TASK_COMM_LEN was converted from a macro into an enum so that BTF
would have access to it. But this unfortunately caused TASK_COMM_LEN
to display in the format fields of trace events, as they are created
by the TRACE_EVENT() macro and such, macros convert to their values,
where as enums do not.
To handle this, instead of using the field itself to be display, save
the value of the array size as another field in the trace_event_fields
structure, and use that instead.
Not only does this fix the issue, but also converts the other trace
events that have this same problem (but were not breaking tooling).
With this change, the original work around b3bc8547d3 ("tracing:
Have TRACE_DEFINE_ENUM affect trace event types as well") could be
reverted (but that should be done in the merge window)"
* tag 'trace-v6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix TASK_COMM_LEN in trace event format file
After commit 3087c61ed2 ("tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN"),
the content of the format file under
/sys/kernel/tracing/events/task/task_newtask was changed from
field:char comm[16]; offset:12; size:16; signed:0;
to
field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:0;
John reported that this change breaks older versions of perfetto.
Then Mathieu pointed out that this behavioral change was caused by the
use of __stringify(_len), which happens to work on macros, but not on enum
labels. And he also gave the suggestion on how to fix it:
:One possible solution to make this more robust would be to extend
:struct trace_event_fields with one more field that indicates the length
:of an array as an actual integer, without storing it in its stringified
:form in the type, and do the formatting in f_show where it belongs.
The result as follows after this change,
$ cat /sys/kernel/tracing/events/task/task_newtask/format
field:char comm[16]; offset:12; size:16; signed:0;
Link: https://lore.kernel.org/lkml/Y+QaZtz55LIirsUO@google.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230210155921.4610-1-laoar.shao@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230212151303.12353-1-laoar.shao@gmail.com
Cc: stable@vger.kernel.org
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Kajetan Puchalski <kajetan.puchalski@arm.com>
CC: Qais Yousef <qyousef@layalina.io>
Fixes: 3087c61ed2 ("tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN")
Reported-by: John Stultz <jstultz@google.com>
Debugged-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPnV1wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gWxg/8D6jl/Wl/16LkcnLk8eRWGmF34u24lGPK
d8T9VN0fF0Pnz0vQVNyuhpdSARX1PvaOAyvx/gv0KE+OjeNYdXWTHavNEekpBeKA
0XFrw2xoa+M7kJE1LswARk23ynFtLZoFD05G3fNZ6NxK/uxRepZzxUyAmwYE9ibG
9gekg+IeTysaCtmIKHkCOgcLvSy41/JEJMtA3CHA3Bww/CdPd5JSc9ERqZKYCp3L
lVHNtmTZotZ4TA0Dzgx+OgF1JtoQqQyerPQVhkXjmq0MnTJLjKWnvesF2gBcFLHS
6rovr3eCaO2dm7KGsFlt8Ne6FYEd1us1ifK166xoJgRV+TFFpf2UoDZkEhiCOL63
5x3S35wOuKopsBT4IHK5j2LTfhT8KTFSOsZMN41EPhvYYY5/n1EOrHSvFQKmwEFO
jbVvAWHY56YGKH54qePULb+hSrAR1V+AO2ceghusg9BhT4IQ72aR0vkv4hbxd/Zh
mufUd5E3+vWLOVbYM7e3ZGFiC6DA/QVZkTvVxKIllE0bzJkGKI80ITeLbjAxFFmp
OCs+stGij+SwOxEWfK+I0qz6ae8mL/lgWUr7hhkAi8LXGA/t5q1jErCULZFiPr+6
vugrk2SeQZOEVvfUb0/U3GUdn01yGHak7sz7wJBsd++y8I8FR9q6fb7kawgMCo4I
ZydwCwXat2I=
=sfDB
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2023-02-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix an rtmutex missed-wakeup bug"
* tag 'locking-urgent-2023-02-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Ensure that the top waiter is always woken up
With the fix that made poll() and select() block if read would block
caused a slight regression in rasdaemon, as it needed that kind
of behavior. Add a way to make that behavior come back by writing
zero into the "buffer_percentage", which means to never block on read.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY+Jn3xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgQ6AQC30hHcPMPm8+drlH/P6wEYstRP6xbp
nHYHcvT6qXNPtAD+OhUQR2Zav66m6cE0qvkdnZb72E0YHRTfBhN5OpshTgQ=
=dJEF
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix regression in poll() and select()
With the fix that made poll() and select() block if read would block
caused a slight regression in rasdaemon, as it needed that kind of
behavior. Add a way to make that behavior come back by writing zero
into the 'buffer_percentage', which means to never block on read"
* tag 'trace-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw
set_cpus_allowed_ptr() will fail with -EINVAL if the requested
affinity mask is not a subset of the task_cpu_possible_mask() for the
task being updated. Consequently, on a heterogeneous system with cpusets
spanning the different CPU types, updates to the cgroup hierarchy can
silently fail to update task affinities when the effective affinity
mask for the cpuset is expanded.
For example, consider an arm64 system with 4 CPUs, where CPUs 2-3 are
the only cores capable of executing 32-bit tasks. Attaching a 32-bit
task to a cpuset containing CPUs 0-2 will correctly affine the task to
CPU 2. Extending the cpuset to CPUs 0-3, however, will fail to extend
the affinity mask of the 32-bit task because update_tasks_cpumask() will
pass the full 0-3 mask to set_cpus_allowed_ptr().
Extend update_tasks_cpumask() to take a temporary 'cpumask' paramater
and use it to mask the 'effective_cpus' mask with the possible mask for
each task being updated.
Fixes: 431c69fac0 ("cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()")
Signed-off-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since commit 8f9ea86fdf ("sched: Always preserve the user
requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
__sched_setaffinity() unconditionally. This helps to expose a bug in
the current cpuset hotplug code where the cpumasks of the tasks in
the top cpuset are not updated at all when some CPUs become online or
offline. It is likely caused by the fact that some of the tasks in the
top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
One way to reproduce this as suggested by Peter is:
- boot machine
- offline all CPUs except one
- taskset -p ffffffff $$
- online all CPUs
Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
includes offline CPUs for those tasks that are in the top cpuset. For
tasks not in the top cpuset, the old rule applies and only online CPUs
will be returned in the mask since hotplug events will update their
cpumasks accordingly.
Fixes: 8f9ea86fdf ("sched: Always preserve the user requested cpumask")
Reported-by: Will Deacon <will@kernel.org>
Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Will Deacon <will@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Using __irq_domain_alloc_irqs() is an unnecessary complexity. Use
irq_domain_alloc_irqs(), which is simpler and makes the code more
readable.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Let L1 and L2 be two spinlocks.
Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.
Let T2 be the task holding L2.
Let T3 be a task trying to acquire L1.
The following events will lead to a state in which the wait queue of L2
isn't empty, but no task actually holds the lock.
T1 T2 T3
== == ==
spin_lock(L1)
| raw_spin_lock(L1->wait_lock)
| rtlock_slowlock_locked(L1)
| | task_blocks_on_rt_mutex(L1, T3)
| | | orig_waiter->lock = L1
| | | orig_waiter->task = T3
| | | raw_spin_unlock(L1->wait_lock)
| | | rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
spin_unlock(L2) | | | |
| rt_mutex_slowunlock(L2) | | | |
| | raw_spin_lock(L2->wait_lock) | | | |
| | wakeup(T1) | | | |
| | raw_spin_unlock(L2->wait_lock) | | | |
| | | | waiter = T1->pi_blocked_on
| | | | waiter == rt_mutex_top_waiter(L2)
| | | | waiter->task == T1
| | | | raw_spin_lock(L2->wait_lock)
| | | | dequeue(L2, waiter)
| | | | update_prio(waiter, T1)
| | | | enqueue(L2, waiter)
| | | | waiter != rt_mutex_top_waiter(L2)
| | | | L2->owner == NULL
| | | | wakeup(T1)
| | | | raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()
If the deadline of T1 is updated before the call to update_prio(), and the
new deadline is greater than the deadline of the second top waiter, then
after the requeue, T1 is no longer the top waiter, and the wrong task is
woken up which will then go back to sleep because it is not the top waiter.
This can be reproduced in PREEMPT_RT with stress-ng:
while true; do
stress-ng --sched deadline --sched-period 1000000000 \
--sched-runtime 800000000 --sched-deadline \
1000000000 --mmapfork 23 -t 20
done
A similar issue was pointed out by Thomas versus the cases where the top
waiter drops out early due to a signal or timeout, which is a general issue
for all regular rtmutex use cases, e.g. futex.
The problematic code is in rt_mutex_adjust_prio_chain():
// Save the top waiter before dequeue/enqueue
prerequeue_top_waiter = rt_mutex_top_waiter(lock);
rt_mutex_dequeue(lock, waiter);
waiter_update_prio(waiter, task);
rt_mutex_enqueue(lock, waiter);
// Lock has no owner?
if (!rt_mutex_owner(lock)) {
// Top waiter changed
----> if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
----> wake_up_state(waiter->task, waiter->wake_state);
This only takes the case into account where @waiter is the new top waiter
due to the requeue operation.
But it fails to handle the case where @waiter is not longer the top
waiter due to the requeue operation.
Ensure that the new top waiter is woken up so in all cases so it can take
over the ownerless lock.
[ tglx: Amend changelog, add Fixes tag ]
Fixes: c014ef69b3 ("locking/rtmutex: Add wake_state to rt_mutex_waiter")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230117172649.52465-1-wander@redhat.com
Link: https://lore.kernel.org/r/20230202123020.14844-1-wander@redhat.com
Here are a number of small char/misc/whatever driver fixes for 6.2-rc7.
They include:
- IIO driver fixes for some reported problems
- nvmem driver fixes
- fpga driver fixes
- debugfs memory leak fix in the hv_balloon and irqdomain code
(irqdomain change was acked by the maintainer.)
All have been in linux-next with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY9+Xkw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yl92QCfSTATuad8nUXmtBDsveUyAV3G6pkAoJFMWAIj
otmAl9XOGWxdwAURdovs
=tSdo
-----END PGP SIGNATURE-----
Merge tag 'char-misc-6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a number of small char/misc/whatever driver fixes. They
include:
- IIO driver fixes for some reported problems
- nvmem driver fixes
- fpga driver fixes
- debugfs memory leak fix in the hv_balloon and irqdomain code
(irqdomain change was acked by the maintainer)
All have been in linux-next with no reported problems"
* tag 'char-misc-6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (33 commits)
kernel/irq/irqdomain.c: fix memory leak with using debugfs_lookup()
HV: hv_balloon: fix memory leak with using debugfs_lookup()
nvmem: qcom-spmi-sdam: fix module autoloading
nvmem: core: fix return value
nvmem: core: fix cell removal on error
nvmem: core: fix device node refcounting
nvmem: core: fix registration vs use race
nvmem: core: fix cleanup after dev_set_name()
nvmem: core: remove nvmem_config wp_gpio
nvmem: core: initialise nvmem->id early
nvmem: sunxi_sid: Always use 32-bit MMIO reads
nvmem: brcm_nvram: Add check for kzalloc
iio: imu: fxos8700: fix MAGN sensor scale and unit
iio: imu: fxos8700: remove definition FXOS8700_CTRL_ODR_MIN
iio: imu: fxos8700: fix failed initialization ODR mode assignment
iio: imu: fxos8700: fix incorrect ODR mode readback
iio: light: cm32181: Fix PM support on system with 2 I2C resources
iio: hid: fix the retval in gyro_3d_capture_sample
iio: hid: fix the retval in accel_3d_capture_sample
iio: imu: st_lsm6dsx: fix build when CONFIG_IIO_TRIGGERED_BUFFER=m
...
All RISC-V platforms have a single HW IPI provided by the INTC local
interrupt controller. The HW method to trigger INTC IPI can be through
external irqchip (e.g. RISC-V AIA), through platform specific device
(e.g. SiFive CLINT timer), or through firmware (e.g. SBI IPI call).
To support multiple IPIs on RISC-V, add a generic IPI multiplexing
mechanism which help us create multiple virtual IPIs using a single
HW IPI. This generic IPI multiplexing is inspired by the Apple AIC
irqchip driver and it is shared by various RISC-V irqchip drivers.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Hector Martin <marcan@marcan.st>
Tested-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230103141221.772261-4-apatel@ventanamicro.com
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time. To make things simpler, just
call debugfs_lookup_and_remove() instead which handles all of the logic
at once.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable <stable@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230202151554.2310273-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Current release - regressions:
- phy: fix null-deref in phy_attach_direct
- mac802154: fix possible double free upon parsing error
Previous releases - regressions:
- bpf: preserve reg parent/live fields when copying range info,
prevent mis-verification of programs as safe
- ip6: fix GRE tunnels not generating IPv6 link local addresses
- phy: dp83822: fix null-deref on DP83825/DP83826 devices
- sctp: do not check hb_timer.expires when resetting hb_timer
- eth: mtk_sock: fix SGMII configuration after phylink conversion
Previous releases - always broken:
- eth: xdp: execute xdp_do_flush() before napi_complete_done()
- skb: do not mix page pool and page referenced frags in GRO
- bpf:
- fix a possible task gone issue with bpf_send_signal[_thread]()
- fix an off-by-one bug in bpf_mem_cache_idx() to select
the right cache
- add missing btf_put to register_btf_id_dtor_kfuncs
- sockmap: fon't let sock_map_{close,destroy,unhash} call itself
- gso: fix null-deref in skb_segment_list()
- mctp: purge receive queues on sk destruction
- fix UaF caused by accept on already connected socket in exotic
socket families
- tls: don't treat list head as an entry in tls_is_tx_ready()
- netfilter: br_netfilter: disable sabotage_in hook after first
suppression
- wwan: t7xx: fix runtime PM implementation
Misc:
- MAINTAINERS: spring cleanup of networking maintainers
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmPcKlkACgkQMUZtbf5S
IrtHgxAAxvuXeN9kQ+/gDYbxTa4Fc3gwCAA7SdJiBdfxP9xJAUEBs9TNfh5MPynb
iIWl9tTe1OOQtWJTnp5CKMuegUj4hlJVYuLu4MK36hW56Q41cbCFvhsDk/1xoFTF
2mtmEr5JiXZfgEdf9/tEITTGKgfANsbXOAzL4b5HcV9qtvwvEd2aFxll3VO9N2Xn
k6o59Lr2mff7NnXQsrVkTLBxXRK9oir8VwyTnCs7j+T7E8Qe4hIkx2LDWcDiRC1e
vSxfoWFoe+uOGTQMxlarMOAgAOt7nvOQngwCaaVpMBa/OLzGo51Clf98cpPm+cSi
18ZjmA8r5owGghd75p2PtxcND8U1vXeeRJW+FKK60K8qznMc0D4s7HX+wHBfevnV
U649F0OHsNsX0Y75Z5sqA1eSmJbTR9QeyiRbuS6nfOnT7ZKDw6vPJrKGJRN789y0
erzG/JwrfnrCDavXfNNEgk0mMlP0squemffWjuH6FfL4Pt4/gUz63Nokr5vRFSns
yskHZgXVs4gq++yrbZDZjKfQaPDkA1P/z6VADVRWh4Y5m1m6GtvDUy6vJxK7/hUs
5giqmhbAAQq0U+2L5RIaoYGfjk2UaDcunNpqTCgjdCdMXx9Yz41QBloFgBjRshp/
3kg6ElYPzflXXCYK4pGRrzqO4Gqz1Hrvr9YuT+kka+wrpZPDtWc=
=OGU7
-----END PGP SIGNATURE-----
Merge tag 'net-6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- phy: fix null-deref in phy_attach_direct
- mac802154: fix possible double free upon parsing error
Previous releases - regressions:
- bpf: preserve reg parent/live fields when copying range info,
prevent mis-verification of programs as safe
- ip6: fix GRE tunnels not generating IPv6 link local addresses
- phy: dp83822: fix null-deref on DP83825/DP83826 devices
- sctp: do not check hb_timer.expires when resetting hb_timer
- eth: mtk_sock: fix SGMII configuration after phylink conversion
Previous releases - always broken:
- eth: xdp: execute xdp_do_flush() before napi_complete_done()
- skb: do not mix page pool and page referenced frags in GRO
- bpf:
- fix a possible task gone issue with bpf_send_signal[_thread]()
- fix an off-by-one bug in bpf_mem_cache_idx() to select the right
cache
- add missing btf_put to register_btf_id_dtor_kfuncs
- sockmap: fon't let sock_map_{close,destroy,unhash} call itself
- gso: fix null-deref in skb_segment_list()
- mctp: purge receive queues on sk destruction
- fix UaF caused by accept on already connected socket in exotic
socket families
- tls: don't treat list head as an entry in tls_is_tx_ready()
- netfilter: br_netfilter: disable sabotage_in hook after first
suppression
- wwan: t7xx: fix runtime PM implementation
Misc:
- MAINTAINERS: spring cleanup of networking maintainers"
* tag 'net-6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
mtk_sgmii: enable PCS polling to allow SFP work
net: mediatek: sgmii: fix duplex configuration
net: mediatek: sgmii: ensure the SGMII PHY is powered down on configuration
MAINTAINERS: update SCTP maintainers
MAINTAINERS: ipv6: retire Hideaki Yoshifuji
mailmap: add John Crispin's entry
MAINTAINERS: bonding: move Veaceslav Falico to CREDITS
net: openvswitch: fix flow memory leak in ovs_flow_cmd_new
net: ethernet: mtk_eth_soc: disable hardware DSA untagging for second MAC
virtio-net: Keep stop() to follow mirror sequence of open()
selftests: net: udpgso_bench_tx: Cater for pending datagrams zerocopy benchmarking
selftests: net: udpgso_bench: Fix racing bug between the rx/tx programs
selftests: net: udpgso_bench_rx/tx: Stop when wrong CLI args are provided
selftests: net: udpgso_bench_rx: Fix 'used uninitialized' compiler warning
can: mcp251xfd: mcp251xfd_ring_set_ringparam(): assign missing tx_obj_num_coalesce_irq
can: isotp: split tx timer into transmission and timeout
can: isotp: handle wait_event_interruptible() return values
can: raw: fix CAN FD frame transmissions over CAN XL devices
can: j1939: fix errant WARN_ON_ONCE in j1939_session_deactivate
hv_netvsc: Fix missed pagebuf entries in netvsc_dma_map/unmap()
...
poll() and select() on per_cpu trace_pipe and trace_pipe_raw do not work
since kernel 6.1-rc6. This issue is seen after the commit
42fb0a1e84 ("tracing/ring-buffer: Have
polling block on watermark").
This issue is firstly detected and reported, when testing the CXL error
events in the rasdaemon and also erified using the test application for poll()
and select().
This issue occurs for the per_cpu case, when calling the ring_buffer_poll_wait(),
in kernel/trace/ring_buffer.c, with the buffer_percent > 0 and then wait until the
percentage of pages are available. The default value set for the buffer_percent is 50
in the kernel/trace/trace.c.
As a fix, allow userspace application could set buffer_percent as 0 through
the buffer_percent_fops, so that the task will wake up as soon as data is added
to any of the specific cpu buffer.
Link: https://lore.kernel.org/linux-trace-kernel/20230202182309.742-2-shiju.jose@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mchehab@kernel.org>
Cc: <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 42fb0a1e84 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It was found that the check to see if a partition could use up all
the cpus from the parent cpuset in update_parent_subparts_cpumask()
was incorrect. As a result, it is possible to leave parent with no
effective cpu left even if there are tasks in the parent cpuset. This
can lead to system panic as reported in [1].
Fix this probem by updating the check to fail the enabling the partition
if parent's effective_cpus is a subset of the child's cpus_allowed.
Also record the error code when an error happens in update_prstate()
and add a test case where parent partition and child have the same cpu
list and parent has task. Enabling partition in the child will fail in
this case.
[1] https://www.spinics.net/lists/cgroups/msg36254.html
Fixes: f0af1bfc27 ("cgroup/cpuset: Relax constraints to partition & cpus changes")
Cc: stable@vger.kernel.org # v6.1
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Syzkaller triggered a WARN in put_pmu_ctx().
WARNING: CPU: 1 PID: 2245 at kernel/events/core.c:4925 put_pmu_ctx+0x1f0/0x278
This is because there is no locking around the access of "if
(!epc->ctx)" in find_get_pmu_context() and when it is set to NULL in
put_pmu_ctx().
The decrement of the reference count in put_pmu_ctx() also happens
outside of the spinlock, leading to the possibility of this order of
events, and the context being cleared in put_pmu_ctx(), after its
refcount is non zero:
CPU0 CPU1
find_get_pmu_context()
if (!epc->ctx) == false
put_pmu_ctx()
atomic_dec_and_test(&epc->refcount) == true
epc->refcount == 0
atomic_inc(&epc->refcount);
epc->refcount == 1
list_del_init(&epc->pmu_ctx_entry);
epc->ctx = NULL;
Another issue is that WARN_ON for no active PMU events in put_pmu_ctx()
is outside of the lock. If the perf_event_pmu_context is an embedded
one, even after clearing it, it won't be deleted and can be re-used. So
the warning can trigger. For this reason it also needs to be moved
inside the lock.
The above warning is very quick to trigger on Arm by running these two
commands at the same time:
while true; do perf record -- ls; done
while true; do perf record -- ls; done
[peterz: atomic_dec_and_raw_lock*()]
Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: syzbot+697196bc0265049822bd@syzkaller.appspotmail.com
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20230127143141.1782804-2-james.clark@arm.com
avoid leaking memory
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPWd70ACgkQEsHwGGHe
VUpTgxAAjKo7kjhyE9PjKA8tJ5pv6I/21tieu/JAmRyb8AVkhi21elXJaRoEPW4k
2TuSd0WMxZeoRJ1N366HcGPZcsldKN45v3xXEv1e3XSL6ZTAIFsYvwni6adwVocS
ztnZBrMuHf1KZW/6ES956qGmWsum0rYcU2sr0kvnS9ihPiXJsbeLfNF1VXPwaH2y
DTnTxgl/QtuipSQYOUvLSk1XBduvE+Wx8kE1S6TrdlWE9qsD/Jn1Jn7btaLQeHfs
27AXTNGYnVHeRZSVy6d38dIwU4Gh6LcLMFfTtNENIqTVCUfc/QcAb/gItDF0NvIr
fgKbkdmI7jc0rIfLGlBUM8WZ0OaxopFwvchVywjcogDpCfLvJbgR7JplEKTfltgz
62IjvTF6vXUCcjySwu/C48EE2G7CWL0FtIcJofZsI0yTeHESkwFf1r5eOjnTV2ao
M2QYGeToRtk7z9vb+pa0ZttD55TKXQKZDBv+lBRKLTy8SeIzlExJswIAuGTpWuKR
ntwrdtekyM2HNir/l9ad1X2mjVblivEEv1hLMpsxPsthigWXDr+PZtQIxHCao8wj
aKY4bI0F1AXByuKhQAwX/RVHUmB+AGrzJ/M5Fdj+ealQ6Nk+xCrPyTX2qDmveiSF
NlQZpGCxV+gI/8q9YU4coJm1AioT8TSFsMxaufGZCreqqaWBXl0=
=qrIZ
-----END PGP SIGNATURE-----
Merge tag 'irq_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Cleanup the firmware node for the new IRQ MSI domain properly, to
avoid leaking memory
* tag 'irq_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi: Free the fwnode created by msi_create_device_irq_domain()
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RCwAAKCRDbK58LschI
g7drAQDfMPc1Q2CE4LZ9oh2wu1Nt2/85naTDK/WirlCToKs0xwD+NRQOyO3hcoJJ
rOCwfjOlAm+7uqtiwwodBvWgTlDgVAM=
=0fC+
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
bpf 2023-01-27
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 10 files changed, 170 insertions(+), 59 deletions(-).
The main changes are:
1) Fix preservation of register's parent/live fields when copying
range-info, from Eduard Zingerman.
2) Fix an off-by-one bug in bpf_mem_cache_idx() to select the right
cache, from Hou Tao.
3) Fix stack overflow from infinite recursion in sock_map_close(),
from Jakub Sitnicki.
4) Fix missing btf_put() in register_btf_id_dtor_kfuncs()'s error path,
from Jiri Olsa.
5) Fix a splat from bpf_setsockopt() via lsm_cgroup/socket_sock_rcv_skb,
from Kui-Feng Lee.
6) Fix bpf_send_signal[_thread]() helpers to hold a reference on the task,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix the kernel crash caused by bpf_setsockopt().
selftests/bpf: Cover listener cloning with progs attached to sockmap
selftests/bpf: Pass BPF skeleton to sockmap_listen ops tests
bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
bpf: Add missing btf_put to register_btf_id_dtor_kfuncs
selftests/bpf: Verify copy_register_state() preserves parent/live fields
bpf: Fix to preserve reg parent/live fields when copying range info
bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers
bpf: Fix off-by-one error in bpf_mem_cache_idx()
====================
Link: https://lore.kernel.org/r/20230127215820.4993-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Fix filter memory leak by calling ftrace_free_filter()
- Initialize trace_printk() earlier so that ftrace_dump_on_oops shows data
on early crashes.
- Update the outdated instructions in scripts/tracing/ftrace-bisect.sh
- Add lockdep_is_held() to fix lockdep warning
- Add allocation failure check in create_hist_field()
- Don't initialize pointer that gets set right away in enabled_monitors_write()
- Update MAINTAINER entries
- Fix help messages in Kconfigs
- Fix kernel-doc header for update_preds()
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY9MaCBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqb4AP4p78mTVP/Rw5oLCRUM4E7JGb8hmxuZ
HD0RR9fbvyDaQAD/ffkar/KmvGCfCvBNrkGvvx98dwtGIssIB1dShoEZPAU=
=lBwh
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix filter memory leak by calling ftrace_free_filter()
- Initialize trace_printk() earlier so that ftrace_dump_on_oops shows
data on early crashes.
- Update the outdated instructions in scripts/tracing/ftrace-bisect.sh
- Add lockdep_is_held() to fix lockdep warning
- Add allocation failure check in create_hist_field()
- Don't initialize pointer that gets set right away in enabled_monitors_write()
- Update MAINTAINER entries
- Fix help messages in Kconfigs
- Fix kernel-doc header for update_preds()
* tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
bootconfig: Update MAINTAINERS file to add tree and mailing list
rv: remove redundant initialization of pointer ptr
ftrace: Maintain samples/ftrace
tracing/filter: fix kernel-doc warnings
lib: Kconfig: fix spellos
trace_events_hist: add check for return value of 'create_hist_field'
tracing/osnoise: Use built-in RCU list checking
tracing: Kconfig: Fix spelling/grammar/punctuation
ftrace/scripts: Update the instructions for ftrace-bisect.sh
tracing: Make sure trace_printk() can output as soon as it can be used
ftrace: Export ftrace_free_filter() to modules
The kernel crash was caused by a BPF program attached to the
"lsm_cgroup/socket_sock_rcv_skb" hook, which performed a call to
`bpf_setsockopt()` in order to set the TCP_NODELAY flag as an
example. Flags like TCP_NODELAY can prompt the kernel to flush a
socket's outgoing queue, and this hook
"lsm_cgroup/socket_sock_rcv_skb" is frequently triggered by
softirqs. The issue was that in certain circumstances, when
`tcp_write_xmit()` was called to flush the queue, it would also allow
BH (bottom-half) to run. This could lead to our program attempting to
flush the same socket recursively, which caused a `skbuff` to be
unlinked twice.
`security_sock_rcv_skb()` is triggered by `tcp_filter()`. This occurs
before the sock ownership is checked in `tcp_v4_rcv()`. Consequently,
if a bpf program runs on `security_sock_rcv_skb()` while under softirq
conditions, it may not possess the lock needed for `bpf_setsockopt()`,
thus presenting an issue.
The patch fixes this issue by ensuring that a BPF program attached to
the "lsm_cgroup/socket_sock_rcv_skb" hook is not allowed to call
`bpf_setsockopt()`.
The differences from v1 are
- changing commit log to explain holding the lock of the sock,
- emphasizing that TCP_NODELAY is not the only flag, and
- adding the fixes tag.
v1: https://lore.kernel.org/bpf/20230125000244.1109228-1-kuifeng@meta.com/
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Fixes: 9113d7e48e ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup")
Link: https://lore.kernel.org/r/20230127001732.4162630-1-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The pointer ptr is being initialized with a value that is never read,
it is being updated later on a call to strim. Remove the extraneous
initialization.
Link: https://lkml.kernel.org/r/20230116161612.77192-1-colin.i.king@gmail.com
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use the 'struct' keyword for a struct's kernel-doc notation and
use the correct function parameter name to eliminate kernel-doc
warnings:
kernel/trace/trace_events_filter.c:136: warning: cannot understand function prototype: 'struct prog_entry '
kerne/trace/trace_events_filter.c:155: warning: Excess function parameter 'when_to_branch' description in 'update_preds'
Also correct some trivial punctuation problems.
Link: https://lkml.kernel.org/r/20230108021238.16398-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Function 'create_hist_field' is called recursively at
trace_events_hist.c:1954 and can return NULL-value that's why we have
to check it to avoid null pointer dereference.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lkml.kernel.org/r/20230111120409.4111-1-n.petrova@fintech.ru
Cc: stable@vger.kernel.org
Fixes: 30350d65ac ("tracing: Add variable support to hist triggers")
Signed-off-by: Natalia Petrova <n.petrova@fintech.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence false lockdep
warning when CONFIG_PROVE_RCU_LIST is enabled.
Execute as follow:
[tracing]# echo osnoise > current_tracer
[tracing]# echo 1 > tracing_on
[tracing]# echo 0 > tracing_on
The trace_types_lock is held when osnoise_tracer_stop() or
timerlat_tracer_stop() are called in the non-RCU read side section.
So, pass lockdep_is_held(&trace_types_lock) to silence false lockdep
warning.
Link: https://lkml.kernel.org/r/20221227023036.784337-1-nashuiliang@gmail.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: dae181349f ("tracing/osnoise: Support a list of trace_array *tr")
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>