Now that all users are converted remove the old interfaces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.694291814@linutronix.de
Provide two sorts of interfaces to handle the different use cases:
- msi_domain_alloc_irqs_range():
Handles a caller defined precise range
- msi_domain_alloc_irqs_all():
Allocates all interrupts associated to a domain by scanning the
allocated MSI descriptors
The latter is useful for the existing PCI/MSI support which does not have
range information available.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.396497163@linutronix.de
Provide two sorts of interfaces to handle the different use cases:
- msi_domain_free_irqs_range():
Handles a caller defined precise range
- msi_domain_free_irqs_all():
Frees all interrupts associated to a domain
The latter is useful for device teardown and to handle the legacy MSI support
which does not have any range information available.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.337844751@linutronix.de
Allocating simple interrupt descriptors in the core code has to be multi
device irqdomain aware for the upcoming PCI/IMS support.
Change the interfaces to take a domain id into account. Use the internal
control struct for transport of arguments.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.279112474@linutronix.de
Change the descriptor free functions to take a domain id to prepare for the
upcoming multi MSI domain per device support.
To avoid changing and extending the interfaces over and over use an core
internal control struct and hand the pointer through the various functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.220788011@linutronix.de
Change the descriptor allocation and insertion functions to take a domain
id to prepare for the upcoming multi MSI domain per device support.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.163043028@linutronix.de
This reflects the functionality better. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.103554618@linutronix.de
In preparation of the upcoming per device multi MSI domain support, change
the interface to support lookups based on domain id and zero based index
within the domain.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.044613697@linutronix.de
To support multiple MSI interrupt domains per device it is necessary to
segment the xarray MSI descriptor storage. Each domain gets up to
MSI_MAX_INDEX entries.
Change the iterators so they operate with domain ids and take the domain
offsets into account.
The publicly available iterators which are mostly used in legacy
implementations and the PCI/MSI core default to MSI_DEFAULT_DOMAIN (0)
which is the id for the existing "global" domains.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.985498981@linutronix.de
With the upcoming per device MSI interrupt domain support it is necessary
to store the domain pointers per device.
Instead of delegating that storage to device drivers or subsystems add a
domain pointer to the msi_dev_domain array in struct msi_device_data.
This pointer is also used to take care of tearing down the irq domains when
msi_device_data is cleaned up via devres.
The interfaces into the MSI core will be changed from irqdomain pointer
based interfaces to domain id based interfaces to support multiple MSI
domains on a single device (e.g. PCI/MSI[-X] and PCI/IMS.
Once the per device domain support is complete the irq domain pointer in
struct device::msi.domain will not longer contain a pointer to the "global"
MSI domain. It will contain a pointer to the MSI parent domain instead.
It would be a horrible maze of conditionals to evaluate all over the place
which domain pointer should be used, i.e. the "global" one in
device::msi::domain or one from the internal pointer array.
To avoid this evaluate in msi_setup_device_data() whether the irq domain
which is associated to a device is a "global" or a parent MSI domain. If it
is global then copy the pointer into the first entry of the msi_dev_domain
array.
This allows to convert interfaces and implementation to domain ids while
keeping everything existing working.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.923860399@linutronix.de
The upcoming support for multiple MSI domains per device requires storage
for the MSI descriptors and in a second step storage for the irqdomain
pointers.
Move the xarray into a separate data structure msi_dev_domain and create an
array with size 1 in msi_device_data, which can be expanded later when the
support for per device domains is implemented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.864887773@linutronix.de
In the upcoming per device MSI domain concept the MSI parent domains are
not allowed to be used as regular MSI domains where the MSI allocation/free
operations are applicable.
Add appropriate checks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.806128070@linutronix.de
irq_domain::dev is a misnomer as it's usually the rule that a device
pointer points to something which is directly related to the instance.
irq_domain::dev can point to some other device for power management to
ensure that this underlying device is not powered down when an interrupt is
allocated.
The upcoming per device MSI domains really require a pointer to the device
which instantiated the irq domain and not to some random other device which
is required for power management down the chain.
Rename irq_domain::dev to irq_domain::pm_dev and fixup the few sites which
use that pointer.
Conversion was done with the help of coccinelle.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.574541683@linutronix.de
It's truly a MSI only flag and for the upcoming per device MSI domains this
must be in the MSI flags so it can be set during domain setup without
exposing this quirk outside of x86.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.454246167@linutronix.de
Similar to sk/inode/task local storage, enable sleepable support for
cgrp local storage.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221201050444.2785007-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin mentioned that the verifier cannot assume arguments from
LSM hook sk_alloc_security being trusted since after the hook
is called, the sk ref_count is set to 1. This will overwrite
the ref_count changed by the bpf program and may cause ref_count
underflow later on.
I then further checked some other hooks. For example,
for bpf_lsm_file_alloc() hook in fs/file_table.c,
f->f_cred = get_cred(cred);
error = security_file_alloc(f);
if (unlikely(error)) {
file_free_rcu(&f->f_rcuhead);
return ERR_PTR(error);
}
atomic_long_set(&f->f_count, 1);
The input parameter 'f' to security_file_alloc() cannot be trusted
as well.
Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free
lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other
hooks should not be considered as trusted. This may not be a complete
list, but it covers common usage for sk and task.
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that
commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED
so that it can be passed into kfuncs or helpers as an argument.
Martin raised a good question in [1] such that the rcu pointer,
although being able to accessing the object, might have reference
count of 0. This might cause a problem if the rcu pointer is passed
to a kfunc which expects trusted arguments where ref count should
be greater than 0.
This patch makes the following changes related to MEM_RCU pointer:
- MEM_RCU pointer might be NULL (PTR_MAYBE_NULL).
- Introduce KF_RCU so MEM_RCU ptr can be acquired with
a KF_RCU tagged kfunc which assumes ref count of rcu ptr
could be zero.
- For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and
'a' is tagged with __rcu as well. Let us mark 'b' as
MEM_RCU | PTR_MAYBE_NULL.
[1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/
Fixes: 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
see an already freed event
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOMqHUACgkQEsHwGGHe
VUpRzw/9Gow+0wbm2XhMuweUA6t3LgNweOmzDl9w8k1f55OD6niCvuDiF9jSaiKZ
UwGyErasp2dlEVjuNGnp42qSHos3vRiR7sdZZQG+7opWV2FFyxyFpx5x8UEgVnFy
gOuEij5vLXBApUdNRAcVqCbvivs4Lv6SggDyQ075zGzuOmUv57vw2jDt8YfKaFcp
jZTiL+j5GKwihndDB6ayx+7Gwo9a9ASKrTgz8JK2tPOIHZR4X9y9ot1IanZnxzwF
d0kFpLgF/ZqjPRpJoaFn/jgk1AfahQyYHXh7lQ1aP7rLSLRRGcfTBX4n9nC3BYT+
EHaA94l151L1mzbR69ij9tryAERU4NlguD/FIuCeW+6IEPiuwBNGklXF+rRegNj4
IYC0ZSld/NyWKtOrwNSrFRMsxFm583Pg6TaBkvU1rGd5YVQ7GImrj7UjecXO/W71
iXpfarF7ur2zmd+5+F5FB34VYw8GumRo+D+XIb34+8UMBURTX36hgXvSC3sVyyCw
b0c758F3+1zTwm8z52T1RhOOp47t5iWAznwTq6k1cT7788PDXJ9sGYXIpdLpwKcI
Fuj61alwamGeUciCr0iKGtCLRHayZII7OeQh1VjXuqgCwI3hI2j3EaI9C74WSApn
ttVInS0Ka2xcu//A1VFltkMOWNMQK9JeTlqdqctwypTL3WVb2XA=
=jo4r
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a use-after-free case where the perf pending task callback would
see an already freed event
* tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix perf_pending_task() UaF
When handing the SIGNAL_GROUP_EXIT flag, the info in ksignal isn't cleared.
However, the info acquired by dequeue_synchronous_signal/dequeue_signal is
initialized and can be safely used. Fortunately, the fatal signal process
just uses the si_signo and doesn't use any other member. Even so, the
initialization before use is more safer.
Signed-off-by: haifeng.xu <haifeng.xu@shopee.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221128065606.19570-1-haifeng.xu@shopee.com
Since Warn count is now tracked and is a fairly interesting signal, add
the entry /sys/kernel/warn_count to expose it to userspace.
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-6-keescook@chromium.org
Like oops_limit, add warn_limit for limiting the number of warnings when
panic_on_warn is not set.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll
their own warnings, and each check "panic_on_warn". Consolidate this
into a single function so that future instrumentation can be added in
a single location.
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Gow <davidgow@google.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommfd version.
The rc fix was done a different way when iommufd patches reworked this
code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The DYING/STARTING callbacks are not expected to fail. However, as reported
by Derek, buggy drivers such as tboot are still free to return errors
within those sections, which halts the hot(un)plug and leaves the CPU in an
unrecoverable state.
As there is no rollback possible, only log the failures and proceed with
the following steps.
This restores the hotplug behaviour prior to commit 453e410851
("cpu/hotplug: Add cpuhp_invoke_callback_range()")
Fixes: 453e410851 ("cpu/hotplug: Add cpuhp_invoke_callback_range()")
Reported-by: Derek Dolney <z23@posteo.net>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Derek Dolney <z23@posteo.net>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867
Link: https://lore.kernel.org/r/20220927101259.1149636-1-vdonnefort@google.com
Since the boot cpu does not go through the hotplug process it ends
up with state == CPUHP_ONLINE but target == CPUHP_OFFLINE.
So set the target to match in boot_cpu_hotplug_init().
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20221117162329.3164999-3-pauld@redhat.com
Writing the current state back in hotplug/target calls cpu_down()
which will set cpu dying even when it isn't and then nothing will
ever clear it. A stress test that reads values and writes them back
for all cpu device files in sysfs will trigger the BUG() in
select_fallback_rq once all cpus are marked as dying.
kernel/cpu.c::target_store()
...
if (st->state < target)
ret = cpu_up(dev->id, target);
else
ret = cpu_down(dev->id, target);
cpu_down() -> cpu_set_state()
bool bringup = st->state < target;
...
if (cpu_dying(cpu) != !bringup)
set_cpu_dying(cpu, !bringup);
Fix this by letting state==target fall through in the target_store()
conditional. Also make sure st->target == target in that case.
Fixes: 757c989b99 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com
Commit ca16d5bee5 ("futex: Prevent robust futex exit race") addressed
two cases when tasks waiting on a robust non-PI futex remained blocked
despite the futex not being owned anymore:
* if the owner died after writing zero to the futex word, but before
waking up a waiter
* if a task waiting on the futex was woken up, but died before updating
the futex word (effectively swallowing the notification without acting
on it)
In the second case, the task could be woken up either by the previous
owner (after the futex word was reset to zero) or by the kernel (after
the OWNER_DIED bit was set and the TID part of the futex word was reset
to zero) if the previous owner died without the resetting the futex.
Because the referenced commit wakes up a potential waiter only if the
whole futex word is zero, the latter subcase remains unaddressed.
Fix this by looking only at the TID part of the futex when deciding
whether a wake up is needed.
Fixes: ca16d5bee5 ("futex: Prevent robust futex exit race")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru
Variable and return descriptions were missing from the SRCU read
lock functions. Add them.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87zgcjpdvo.fsf@jogness.linutronix.de
The console_list_lock provides synchronization for console list and
console->flags updates. All call sites that were using the console_lock
for this synchronization have either switched to use the
console_list_lock or the SRCU list iterator.
Remove console_lock usage for console list updates and console->flags
updates.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-40-john.ogness@linutronix.de
With commit 9e124fe16ff2("xen: Enable console tty by default in domU
if it's not a dummy") a hack was implemented to make sure that the
tty console remains the console behind the /dev/console device. The
main problem with the hack is that, after getting the console pointer
to the tty console, it is assumed the pointer is still valid after
releasing the console_sem. This assumption is incorrect and unsafe.
Make the hack safe by introducing a new function
console_force_preferred_locked() and perform the full operation
under the console_list_lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-33-john.ogness@linutronix.de
Currently it is not possible for drivers to detect if they have
already successfully registered their console. Several drivers
have multiple paths that lead to console registration. To avoid
attempting a 2nd registration (which leads to a WARN), drivers
are implementing their own solution.
Introduce console_is_registered() so drivers can easily identify
if their console is currently registered. A _locked() variant
is also provided if the caller is already holding the
console_list_lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-22-john.ogness@linutronix.de
Use srcu console list iteration for console list traversal. It is
acceptable because the consoles might come and go at any time.
Strict synchronizing with console registration code would not bring
any advantage over srcu.
Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-21-john.ogness@linutronix.de
With SRCU it is now safe to traverse the console list, even if
the console_trylock() failed. However, overwriting console->seq
when console_trylock() failed is still an issue.
Switch to SRCU iteration and document remaining issue with
console->seq.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-20-john.ogness@linutronix.de
Use srcu console list iteration for console list traversal.
Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-19-john.ogness@linutronix.de
All users of console_is_usable() are SRCU iterators. Use the
appropriate wrapper function to locklessly read the flags.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-18-john.ogness@linutronix.de
Use srcu console list iteration for console list traversal.
Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-17-john.ogness@linutronix.de
Guarantee safe iteration of the console list by using SRCU.
Note that in the case of a handover, the SRCU read lock is also
released. This is documented in the function description and as
comments in the code. It is a bit tricky, but this preserves the
lockdep lock ordering for the context handing over the
console_lock:
console_lock()
| mutex_acquire(&console_lock_dep_map) <-- console lock
|
console_unlock()
| console_flush_all()
| | srcu_read_lock(&console_srcu) <-- srcu lock
| | console_emit_next_record()
| | | console_lock_spinning_disable_and_check()
| | | | srcu_read_unlock(&console_srcu) <-- srcu unlock
| | | | mutex_release(&console_lock_dep_map) <-- console unlock
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-16-john.ogness@linutronix.de
Guarantee safe iteration of the console list by using SRCU.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-15-john.ogness@linutronix.de
After switching to SRCU for console list iteration, some readers
will begin readings console->flags as a data race. Locklessly
reading console->flags provides a consistent value because there
is at most one CPU modifying console->flags and that CPU is
using only read-modify-write operations.
Introduce a wrapper for SRCU iterators to read console flags.
Introduce a matching wrapper to write to flags of registered
consoles. Writing to flags of registered consoles is synchronized
by the console_list_lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-13-john.ogness@linutronix.de
Currently there exist races in register_console(), where the types
of registered consoles are checked (without holding the console_lock)
and then after acquiring the console_lock, it is assumed that the list
has not changed. Also, some code that performs console_unregister()
make similar assumptions.
It might be possible to fix these races using the console_lock. But
it would require a complex analysis of all console drivers to make
sure that the console_lock is not taken in match() and setup()
callbacks. And we really prefer to split up and reduce the
responsibilities of console_lock rather than expand its complexity.
Therefore, introduce a new console_list_lock to provide full
synchronization for any console list changes.
In addition, also use console_list_lock for synchronization of
console->flags updates. All flags are either static or modified only
during the console registration. There are only two exceptions.
The first exception is CON_ENABLED, which is also modified by
console_start()/console_stop(). Therefore, these functions must
also take the console_list_lock.
The second exception is when the flags are modified by the console
driver init code before the console is registered. These will be
ignored because they are not visible to the rest of the system
via the console_drivers list.
Note that one of the various responsibilities of the console_lock is
also intended to provide console list and console->flags
synchronization. Later changes will update call sites relying on the
console_lock for these purposes. Once all call sites have been
updated, the console_lock will be relieved of synchronizing
console_list and console->flags updates.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87sficwokr.fsf@jogness.linutronix.de
It used to be that all consoles were synchronized with respect to
which message they were printing. After commit a699449bb1 ("printk:
refactor and rework printing logic"), all consoles have their own
@seq for tracking which message they are on. That commit also changed
how the initial sequence number was chosen. Instead of choosing the
next non-printed message, it chose the sequence number of the next
message that will be added to the ringbuffer.
That change created a possibility that a non-boot console taking over
for a boot console might skip messages if the boot console was behind
and did not have a chance to catch up before being unregistered.
Since it is not known which boot console is the same device, flush
all consoles and, if necessary, start with the message of the enabled
boot console that is the furthest behind. If no boot consoles are
enabled, begin with the next message that will be added to the
ringbuffer.
Also, since boot consoles are meant to be used at boot time, handle
them the same as CON_PRINTBUFFER to ensure that no initial messages
are skipped.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-7-john.ogness@linutronix.de
The code to initialize @seq for a new console needs to consider
more factors when choosing an initial value. Move the code into
a helper function console_init_seq() "as is" so this code can
be expanded without causing register_console() to become too
long. A later commit will implement the additional code.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-6-john.ogness@linutronix.de
The @bootcon_enabled and @realcon_enabled local variables actually
represent if such console types are registered. In general there
has been a confusion about enabled vs. registered. Incorrectly
naming such variables promotes such confusion.
Rename the variables to _registered.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-5-john.ogness@linutronix.de
Provide an NMI-safe SRCU protected variant to walk the console list.
Note that all console fields are now set before adding the console
to the list to avoid the console becoming visible by SCRU readers
before being fully initialized.
This is a preparatory change for a new console infrastructure which
operates independent of the console BKL.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-4-john.ogness@linutronix.de
Replace the open coded single linked list with a hlist so a conversion
to SRCU protected list walks can reuse the existing primitives.
Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-3-john.ogness@linutronix.de
The elf_check_arch() function is also used to test compatibility of
usermode binaries. Kernel modules may have more specific requirements,
for example powerpc would like to test for ABI version compatibility.
Add a weak module_elf_check_arch() that defaults to true, and call it
from elf_validity_check().
Signed-off-by: Jessica Yu <jeyu@kernel.org>
[np: added changelog, adjust name, rebase]
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221128041539.1742489-2-npiggin@gmail.com
Consider a verifier state with three acquired references, all with
release_on_unlock = true:
idx 0 1 2
state->refs = [2 4 6]
(with 2, 4, and 6 being the ref ids).
When bpf_spin_unlock is called, process_spin_lock will loop through all
acquired_refs and, for each ref, if it's release_on_unlock, calls
release_reference on it. That function in turn calls
release_reference_state, which removes the reference from state->refs by
swapping the reference state with the last reference state in
refs array and decrements acquired_refs count.
process_spin_lock's loop logic, which is essentially:
for (i = 0; i < state->acquired_refs; i++) {
if (!state->refs[i].release_on_unlock)
continue;
release_reference(state->refs[i].id);
}
will fail to release release_on_unlock references which are swapped from
the end. Running this logic on our example demonstrates:
state->refs = [2 4 6] (start of idx=0 iter)
release state->refs[0] by swapping w/ state->refs[2]
state->refs = [6 4] (start of idx=1)
release state->refs[1], no need to swap as it's the last idx
state->refs = [6] (start of idx=2, loop terminates)
ref_id 6 should have been removed but was skipped.
Fix this by looping from back-to-front, which results in refs that are
candidates for removal being swapped with refs which have already been
examined and kept.
If we modify our initial example such that ref 6 is replaced with ref 7,
which is _not_ release_on_unlock, and loop from the back, we'd see:
state->refs = [2 4 7] (start of idx=2)
state->refs = [2 4 7] (start of idx=1)
state->refs = [2 7] (start of idx=0, refs 7 and 4 swapped)
state->refs = [7] (after idx=0, 7 and 2 swapped, loop terminates)
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 534e86bc6c ("bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}")
Link: https://lore.kernel.org/r/20221201183406.1203621-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
1. With CONFIG_JUMP_LABEL=n static_key_slow_inc() doesn't have any
protection against key->enabled refcounter overflow.
2. With CONFIG_JUMP_LABEL=y static_key_slow_inc_cpuslocked()
still may turn the refcounter negative as (v + 1) may overflow.
key->enabled is indeed a ref-counter as it's documented in multiple
places: top comment in jump_label.h, Documentation/staging/static-keys.rst,
etc.
As -1 is reserved for static key that's in process of being enabled,
functions would break with negative key->enabled refcount:
- for CONFIG_JUMP_LABEL=n negative return of static_key_count()
breaks static_key_false(), static_key_true()
- the ref counter may become 0 from negative side by too many
static_key_slow_inc() calls and lead to use-after-free issues.
These flaws result in that some users have to introduce an additional
mutex and prevent the reference counter from overflowing themselves,
see bpf_enable_runtime_stats() checking the counter against INT_MAX / 2.
Prevent the reference counter overflow by checking if (v + 1) > 0.
Change functions API to return whether the increment was successful.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds lockdep checks for illegal use of synchronize_srcu()
within same-type SRCU read-side critical sections and within normal
RCU read-side critical sections. It also makes synchronize_srcu()
be a no-op during early boot.
These changes bring Tiny synchronize_srcu() into line with both Tree
synchronize_srcu() and Tiny synchronize_rcu().
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: John Ogness <john.ogness@linutronix.de>
The system memory map can change over a hibernation-restore cycle due
to a defect in the platform firmware, and some of the page frames used
by the kernel before hibernation may not be available any more during
the subsequent restore which leads to the error below.
[ T357] PM: Image loading progress: 0%
[ T357] PM: Read 2681596 kbytes in 0.03 seconds (89386.53 MB/s)
[ T357] PM: Error -14 resuming
[ T357] PM: Failed to load hibernation image, recovering.
[ T357] PM: Basic memory bitmaps freed
[ T357] OOM killer enabled.
[ T357] Restarting tasks ... done.
[ T357] PM: resume from hibernation failed (-14)
[ T357] PM: Hibernation image not present or could not be loaded.
Add an error message to the unpack() function to allow problematic
page frames to be identified and the source of the problem to be
diagnosed more easily. This can save developers quite a bit of
debugging time.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
[ rjw: New subject, edited changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since Oops count is now tracked and is a fairly interesting signal, add
the entry /sys/kernel/oops_count to expose it to userspace.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org
Many Linux systems are configured to not panic on oops; but allowing an
attacker to oops the system **really** often can make even bugs that look
completely unexploitable exploitable (like NULL dereferences and such) if
each crash elevates a refcount by one or a lock is taken in read mode, and
this causes a counter to eventually overflow.
The most interesting counters for this are 32 bits wide (like open-coded
refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
platforms is just 16 bits, but probably nobody cares about 32-bit platforms
that much nowadays.)
So let's panic the system if the kernel is constantly oopsing.
The speed of oopsing 2^32 times probably depends on several factors, like
how long the stack trace is and which unwinder you're using; an empirically
important one is whether your console is showing a graphical environment or
a text console that oopses will be printed to.
In a quick single-threaded benchmark, it looks like oopsing in a vfork()
child with a very short stack trace only takes ~510 microseconds per run
when a graphical console is active; but switching to a text console that
oopses are printed to slows it down around 87x, to ~45 milliseconds per
run.
(Adding more threads makes this faster, but the actual oops printing
happens under &die_lock on x86, so you can maybe speed this up by a factor
of around 2 and then any further improvement gets eaten up by lock
contention.)
It looks like it would take around 8-12 days to overflow a 32-bit counter
with repeated oopsing on a multi-core X86 system running a graphical
environment; both me (in an X86 VM) and Seth (with a distro kernel on
normal hardware in a standard configuration) got numbers in that ballpark.
12 days aren't *that* short on a desktop system, and you'd likely need much
longer on a typical server system (assuming that people don't run graphical
desktop environments on their servers), and this is a *very* noisy and
violent approach to exploiting the kernel; and it also seems to take orders
of magnitude longer on some machines, probably because stuff like EFI
pstore will slow it down a ton if that's active.
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
In preparation for adding more sysctls directly in kernel/panic.c, split
CONFIG_SMP from the logic that adds sysctls.
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-1-keescook@chromium.org
Use only one hyphen in kernel-doc notation between the function name
and its short description.
The is the documented kerenl-doc format. It also fixes the HTML
presentation to be consistent with other functions.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the introduction of clockevents, i.e., commit d316c57ff6
("clockevents: add core functionality"), there has been a mismatch between
the function and the kernel-doc comment for clockevent_delta2ns().
Hence, ./scripts/kernel-doc -none kernel/time/clockevents.c warns about it.
Adjust the kernel-doc comment for clockevent_delta2ns() for make W=1
happiness.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221102091048.15068-1-lukas.bulwahn@gmail.com
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.
Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/202211301601416229001@zte.com.cn
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.
(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)
Refactor the five copies into a central copy in kernel/time/namespace.c.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:
$ make LLVM=1 LLVM_IAS=1 -j
...
ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
cgroup_storage_map_btf_ids:;
^
make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1
In local_storage.c, we have
BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a9 ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.
Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.
Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
Following the pattern of io_uring, perf, skb, and bpf, iommfd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.
user->locked_vm is the good accounting to use for ulimit because it is
per-user, and the security sandboxing of locked pages is not supposed to
be per-process. Other places (vfio, vdpa and infiniband) have used
mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this
is only per-process and inconsistent with the new FOLL_LONGTERM users in
the kernel.
Concurrent work is underway to try to put this in a cgroup, so everything
can be consistent and the kernel can provide a FOLL_LONGTERM limit that
actually provides security.
Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The integer overflow is descripted with following codes:
> 317 static comp_t encode_comp_t(u64 value)
> 318 {
> 319 int exp, rnd;
......
> 341 exp <<= MANTSIZE;
> 342 exp += value;
> 343 return exp;
> 344 }
Currently comp_t is defined as type of '__u16', but the variable 'exp' is
type of 'int', so overflow would happen when variable 'exp' in line 343 is
greater than 65535.
Link: https://lkml.kernel.org/r/20210515140631.369106-3-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Jinhao <zhangjinhao2@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Fix encode_comp_t()".
Type conversion in encode_comp_t() may look a bit problematic.
This patch (of 2):
See calculation of ac_{u,s}time in fill_ac():
> ac->ac_utime = encode_comp_t(nsec_to_AHZ(pacct->ac_utime));
> ac->ac_stime = encode_comp_t(nsec_to_AHZ(pacct->ac_stime));
Return value of nsec_to_AHZ() is always type of 'u64', but it is handled
as type of 'unsigned long' in encode_comp_t, and accuracy loss would
happen on 32-bit platform when 'unsigned long' value is 32-bit-width.
So 'u64' value of encode_comp_t() may look better.
Link: https://lkml.kernel.org/r/20210515140631.369106-1-zhengyejian1@huawei.com
Link: https://lkml.kernel.org/r/20210515140631.369106-2-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Randy Dunlap <rdunlap@infradead.org> # build-tested
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Jinhao <zhangjinhao2@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Though vmcoreinfo is intended to be small, at just one page, useful
information is still added to it, so we risk running out of space.
Currently there is no runtime check to see whether the vmcoreinfo buffer
has been exhausted. Add a warning for this case.
Currently, my static checking tool[1] indicates that a good upper bound
for vmcoreinfo size is currently 3415 bytes, but the best time to add
warnings is before the risk becomes too high.
[1] https://github.com/brenns10/kernel_stuff/blob/master/vmcoreinfosize/vmcoreinfosize.py
Link: https://lkml.kernel.org/r/20221027205008.312534-1-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The percpu_counter is used for scenarios where performance is more
important than the accuracy. For percpu_counter users, who want more
accurate information in their slowpath, percpu_counter_sum is provided
which traverses all the online CPUs to accumulate the data. The reason it
only needs to traverse online CPUs is because percpu_counter does
implement CPU offline callback which syncs the local data of the offlined
CPU.
However there is a small race window between the online CPUs traversal of
percpu_counter_sum and the CPU offline callback. The offline callback has
to traverse all the percpu_counters on the system to flush the CPU local
data which can be a lot. During that time, the CPU which is going offline
has already been published as offline to all the readers. So, as the
offline callback is running, percpu_counter_sum can be called for one
counter which has some state on the CPU going offline. Since
percpu_counter_sum only traverses online CPUs, it will skip that specific
CPU and the offline callback might not have flushed the state for that
specific percpu_counter on that offlined CPU.
Normally this is not an issue because percpu_counter users can deal with
some inaccuracy for small time window. However a new user i.e. mm_struct
on the cleanup path wants to check the exact state of the percpu_counter
through check_mm(). For such users, this patch introduces
percpu_counter_sum_all() which traverses all possible CPUs and it is used
in fork.c:check_mm() to avoid the potential race.
This issue is exposed by the later patch "mm: convert mm's rss stats into
percpu_counter".
Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently mm_struct maintains rss_stats which are updated on page fault
and the unmapping codepaths. For page fault codepath the updates are
cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
The reason for caching is performance for multithreaded applications
otherwise the rss_stats updates may become hotspot for such applications.
However this optimization comes with the cost of error margin in the rss
stats. The rss_stats for applications with large number of threads can be
very skewed. At worst the error margin is (nr_threads * 64) and we have a
lot of applications with 100s of threads, so the error margin can be very
high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
Recently we started seeing the unbounded errors for rss_stats for specific
applications which use TCP rx0cp. It seems like vm_insert_pages()
codepath does not sync rss_stats at all.
This patch converts the rss_stats into percpu_counter to convert the error
margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). However
this conversion enable us to get the accurate stats for situations where
accuracy is more important than the cpu cost.
This patch does not make such tradeoffs - we can just use
percpu_counter_add_local() for the updates and percpu_counter_sum() (or
percpu_counter_sync() + percpu_counter_read) for the readers. At the
moment the readers are either procfs interface, oom_killer and memory
reclaim which I think are not performance critical and should be ok with
slow read. However I think we can make that change in a separate patch.
Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The networking programs typically don't require CAP_PERFMON, but through kfuncs
like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In
such case enforce CAP_PERFMON.
Also make sure that only GPL programs can access kernel data structures.
All kfuncs require GPL already.
Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and
different name for the same check only causes confusion.
Fixes: fd264ca020 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx")
Fixes: 50c6b8a9ae ("selftests/bpf: Add a test for btf_type_tag "percpu"")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
As explained in [1], we would like to remove SLOB if possible.
- There are no known users that need its somewhat lower memory footprint
so much that they cannot handle SLUB (after some modifications by the
previous patches) instead.
- It is an extra maintenance burden, and a number of features are
incompatible with it.
- It blocks the API improvement of allowing kfree() on objects allocated
via kmem_cache_alloc().
As the first step, rename the CONFIG_SLOB option in the slab allocator
configuration choice to CONFIG_SLOB_DEPRECATED. Add CONFIG_SLOB
depending on CONFIG_SLOB_DEPRECATED as an internal option to avoid code
churn. This will cause existing .config files and defconfigs with
CONFIG_SLOB=y to silently switch to the default (and recommended
replacement) SLUB, while still allowing SLOB to be configured by anyone
that notices and needs it. But those should contact the slab maintainers
and linux-mm@kvack.org as explained in the updated help. With no valid
objections, the plan is to update the existing defconfigs to SLUB and
remove SLOB in a few cycles.
To make SLUB more suitable replacement for SLOB, a CONFIG_SLUB_TINY
option was introduced to limit SLUB's memory overhead.
There is a number of defconfigs specifying CONFIG_SLOB=y. As part of
this patch, update them to select CONFIG_SLUB and CONFIG_SLUB_TINY.
[1] https://lore.kernel.org/all/b35c3f82-f67b-2103-7d82-7a7ba7521439@suse.cz/
Cc: Russell King <linux@armlinux.org.uk>
Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Conor Dooley <conor@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> # riscv k210
Acked-by: Arnd Bergmann <arnd@arndb.de> # arm
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Invoking ./scripts/kernel-doc -none kernel/notifier.c warns:
kernel/notifier.c:71: warning: Excess function parameter 'returns' description in 'notifier_call_chain'
kernel/notifier.c:119: warning: Function parameter or member 'v' not described in 'notifier_call_chain_robust'
These two warning are easy to fix, as they are just due to some minor slips
that makes the comment not follow kernel-doc's syntactic expectation.
Fix those minor slips in kernel-doc comments for make W=1 happiness.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fault injection tests trigger warnings like this:
kernfs: can not remove 'chip_name', no directory
WARNING: CPU: 0 PID: 253 at fs/kernfs/dir.c:1616 kernfs_remove_by_name_ns+0xce/0xe0
RIP: 0010:kernfs_remove_by_name_ns+0xce/0xe0
Call Trace:
<TASK>
remove_files.isra.1+0x3f/0xb0
sysfs_remove_group+0x68/0xe0
sysfs_remove_groups+0x41/0x70
__kobject_del+0x45/0xc0
kobject_del+0x29/0x40
free_desc+0x42/0x70
irq_free_descs+0x5e/0x90
The reason is that the interrupt descriptor sysfs handling does not roll
back on a failing kobject_add() during allocation. If the descriptor is
freed later on, kobject_del() is invoked with a not added kobject resulting
in the above warnings.
A proper rollback in case of a kobject_add() failure would be the straight
forward solution. But this is not possible due to the way how interrupt
descriptor sysfs handling works.
Interrupt descriptors are allocated before sysfs becomes available. So the
sysfs files for the early allocated descriptors are added later in the boot
process. At this point there can be nothing useful done about a failing
kobject_add(). For consistency the interrupt descriptor allocation always
treats kobject_add() failures as non-critical and just emits a warning.
To solve this problem, keep track in the interrupt descriptor whether
kobject_add() was successful or not and make the invocation of
kobject_del() conditional on that.
[ tglx: Massage changelog, comments and use a state bit. ]
Fixes: ecb3f394c5 ("genirq: Expose interrupt information through sysfs")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221128151612.1786122-1-yangyingliang@huawei.com
Kernels configured with CONFIG_PRINTK=n and CONFIG_SRCU=n get build
failures. This causes trouble for deep embedded systems. But given
that there are more than 25 instances of "select SRCU" in the kernel,
it is hard to believe that there are many kernels running in production
without SRCU. This commit therefore makes SRCU mandatory. The SRCU
Kconfig option remains for backwards compatibility, and will be removed
when it is no longer used.
[ paulmck: Update per kernel test robot feedback. ]
Reported-by: John Ogness <john.ogness@linutronix.de>
Reported-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: John Ogness <john.ogness@linutronix.de>
call_rcu() changes to save power will change the behavior of rcutorture
tests. Use the call_rcu_hurry() API instead which reverts to the old
behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcuscale uses call_rcu() to queue async readers. With recent changes to
save power, the test will have fewer async readers in flight. Use the
call_rcu_hurry() API instead to revert to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
call_rcu() changes to save power will slow down rcu sync. Use the
call_rcu_hurry() API instead which reverts to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds 2 tests to rcuscale. The first one is a startup test
to check whether we are not too lazy or too hard working. The second
one causes kfree_rcu() itself to use call_rcu() and checks memory
pressure. Testing indicates that the new call_rcu() keeps memory pressure
under control roughly as well as does kfree_rcu().
[ paulmck: Apply checkpatch feedback. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The shrinker is used to speed up the free'ing of memory potentially held
by RCU lazy callbacks. RCU kernel module test cases show this to be
effective. Test is introduced in a later patch.
Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This consolidates the code a bit and makes it cleaner. Functionally it
is the same.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Implement timer-based RCU callback batching (also known as lazy
callbacks). With this we save about 5-10% of power consumed due
to RCU requests that happen when system is lightly loaded or idle.
By default, all async callbacks (queued via call_rcu) are marked
lazy. An alternate API call_rcu_hurry() is provided for the few users,
for example synchronize_rcu(), that need the old behavior.
The batch is flushed whenever a certain amount of time has passed, or
the batch on a particular CPU grows too big. Also memory pressure will
flush it in a future patch.
To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists which were originally introduced to
address lock contention, to handle lazy CBs as well. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.
[ paulmck: Fix formatting of inline call_rcu_lazy() definition. ]
[ paulmck: Apply Zqiang feedback. ]
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Suggested-by: Paul McKenney <paulmck@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Current release - new code bugs:
- eth: mlx5e:
- use kvfree() in mlx5e_accel_fs_tcp_create()
- MACsec, fix RX data path 16 RX security channel limit
- MACsec, fix memory leak when MACsec device is deleted
- MACsec, fix update Rx secure channel active field
- MACsec, fix add Rx security association (SA) rule memory leak
Previous releases - regressions:
- wifi: cfg80211: don't allow multi-BSSID in S1G
- stmmac: set MAC's flow control register to reflect current settings
- eth: mlx5:
- E-switch, fix duplicate lag creation
- fix use-after-free when reverting termination table
Previous releases - always broken:
- ipv4: fix route deletion when nexthop info is not specified
- bpf: fix a local storage BPF map bug where the value's spin lock
field can get initialized incorrectly
- tipc: re-fetch skb cb after tipc_msg_validate
- wifi: wilc1000: fix Information Element parsing
- packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
- sctp: fix memory leak in sctp_stream_outq_migrate()
- can: can327: fix potential skb leak when netdev is down
- can: add number of missing netdev freeing on error paths
- aquantia: do not purge addresses when setting the number of rings
- wwan: iosm:
- fix incorrect skb length leading to truncated packet
- fix crash in peek throughput test due to skb UAF
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOGOdYACgkQMUZtbf5S
IrsknQ//SAoOyDOEu15YzOt8hAupLKoF6MM+D0dwwTEQZLf7IVXCjPpkKtVh7Si7
YCBoyrqrDs7vwaUrVoKY19Amwov+EYrHCpdC+c7wdZ7uxTaYfUbJJUGmxYOR179o
lV1+1Aiqg9F9C6CUsmZ5lDN2Yb7/uPDBICIV8LM+VzJAtXjurBVauyMwAxLxPOAr
cgvM+h5xzE7DXMF2z8R/mUq5MSIWoJo9hy2UwbV+f2liMTQuw9rwTbyw3d7+H/6p
xmJcBcVaABjoUEsEhld3NTlYbSEnlFgCQBfDWzf2e4y6jBxO0JepuIc7SZwJFRJY
XBqdsKcGw5RkgKbksKUgxe126XFX0SUUQEp0UkOIqe15k7eC2yO9uj1gRm6OuV4s
J94HKzHX9WNV5OQ790Ed2JyIJScztMZlNFVJ/cz2/+iKR42xJg6kaO6Rt2fobtmL
VC2cH+RfHzLl+2+7xnfzXEDgFePSBlA02Aq1wihU3zB3r7WCFHchEf9T7sGt1QF0
03R+8E3+N2tYqphPAXyDoy6kXQJTPxJHAe1FNHJlwgfieUDEWZi/Pm+uQrKIkDeo
oq9MAV2QBNSD1w4wl7cXfvicO5kBr/OP6YBqwkpsGao2jCSIgkWEX2DRrUaLczXl
5/Z+m/gCO5tAEcVRYfMivxUIon//9EIhbErVpHTlNWpRHk24eS4=
=0Lnw
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and wifi.
Current release - new code bugs:
- eth: mlx5e:
- use kvfree() in mlx5e_accel_fs_tcp_create()
- MACsec, fix RX data path 16 RX security channel limit
- MACsec, fix memory leak when MACsec device is deleted
- MACsec, fix update Rx secure channel active field
- MACsec, fix add Rx security association (SA) rule memory leak
Previous releases - regressions:
- wifi: cfg80211: don't allow multi-BSSID in S1G
- stmmac: set MAC's flow control register to reflect current settings
- eth: mlx5:
- E-switch, fix duplicate lag creation
- fix use-after-free when reverting termination table
Previous releases - always broken:
- ipv4: fix route deletion when nexthop info is not specified
- bpf: fix a local storage BPF map bug where the value's spin lock
field can get initialized incorrectly
- tipc: re-fetch skb cb after tipc_msg_validate
- wifi: wilc1000: fix Information Element parsing
- packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
- sctp: fix memory leak in sctp_stream_outq_migrate()
- can: can327: fix potential skb leak when netdev is down
- can: add number of missing netdev freeing on error paths
- aquantia: do not purge addresses when setting the number of rings
- wwan: iosm:
- fix incorrect skb length leading to truncated packet
- fix crash in peek throughput test due to skb UAF"
* tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
MAINTAINERS: Update maintainer list for chelsio drivers
ionic: update MAINTAINERS entry
sctp: fix memory leak in sctp_stream_outq_migrate()
packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
net/mlx5: Lag, Fix for loop when checking lag
Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path"
net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions
net: tun: Fix use-after-free in tun_detach()
net: mdiobus: fix unbalanced node reference count
net: hsr: Fix potential use-after-free
tipc: re-fetch skb cb after tipc_msg_validate
mptcp: fix sleep in atomic at close time
mptcp: don't orphan ssk in mptcp_close()
dsa: lan9303: Correct stat name
ipv4: Fix route deletion when nexthop info is not specified
net: wwan: iosm: fix incorrect skb length
net: wwan: iosm: fix crash in peek throughput test
net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type
net: wwan: iosm: fix kernel test robot reported error
...
Per syzbot it is possible for perf_pending_task() to run after the
event is free()'d. There are two related but distinct cases:
- the task_work was already queued before destroying the event;
- destroying the event itself queues the task_work.
The first cannot be solved using task_work_cancel() since
perf_release() itself might be called from a task_work (____fput),
which means the current->task_works list is already empty and
task_work_cancel() won't be able to find the perf_pending_task()
entry.
The simplest alternative is extending the perf_event lifetime to cover
the task_work.
The second is just silly, queueing a task_work while you know the
event is going away makes no sense and is easily avoided by
re-arranging how the event is marked STATE_DEAD and ensuring it goes
through STATE_OFF on the way down.
Reported-by: syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY4AC5QAKCRDbK58LschI
g1e0AQCfAqduTy7mYd02jDNCV0wLphNp9FbPiP9OrQT37ABpKAEA1ulj1X59bX3d
HnZdDKuatcPZT9MV5hDLM7MFJ9GjOA4=
=fNmM
-----END PGP SIGNATURE-----
Daniel Borkmann says:
====================
bpf-next 2022-11-25
We've added 101 non-merge commits during the last 11 day(s) which contain
a total of 109 files changed, 8827 insertions(+), 1129 deletions(-).
The main changes are:
1) Support for user defined BPF objects: the use case is to allocate own
objects, build own object hierarchies and use the building blocks to
build own data structures flexibly, for example, linked lists in BPF,
from Kumar Kartikeya Dwivedi.
2) Add bpf_rcu_read_{,un}lock() support for sleepable programs,
from Yonghong Song.
3) Add support storing struct task_struct objects as kptrs in maps,
from David Vernet.
4) Batch of BPF map documentation improvements, from Maryam Tahhan
and Donald Hunter.
5) Improve BPF verifier to propagate nullness information for branches
of register to register comparisons, from Eduard Zingerman.
6) Fix cgroup BPF iter infra to hold reference on the start cgroup,
from Hou Tao.
7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted
given it is not the case for them, from Alexei Starovoitov.
8) Improve BPF verifier's realloc handling to better play along with dynamic
runtime analysis tools like KASAN and friends, from Kees Cook.
9) Remove legacy libbpf mode support from bpftool,
from Sahid Orentino Ferdjaoui.
10) Rework zero-len skb redirection checks to avoid potentially breaking
existing BPF test infra users, from Stanislav Fomichev.
11) Two small refactorings which are independent and have been split out
of the XDP queueing RFC series, from Toke Høiland-Jørgensen.
12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen.
13) Documentation on how to run BPF CI without patch submission,
from Daniel Müller.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Fix osnoise duration type to 64bit not 32bit.
- Have histogram triggers be able to handle an unexpected NULL pointer
for the record event, that can happen when the histogram first starts up.
- Clear out ring buffers when dynamic events are removed, as the type
that is saved in the ring buffer is used to read the event, and a
stale type that is reused by another event could cause use after free
issues.
- Trivial comment fix.
- Fix memory leak in user_event_create().
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY4UPShQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qo3qAQCz2R814gCo54WEDcvRUmAmF9BSE2b5
wtaDfeyXQYbaMwEA4+cuNoMMlQvbVbmcss/Uh7RpRCB8+24iypeiEZNtIgg=
=GPrG
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix osnoise duration type to 64bit not 32bit
- Have histogram triggers be able to handle an unexpected NULL pointer
for the record event, which can happen when the histogram first
starts up
- Clear out ring buffers when dynamic events are removed, as the type
that is saved in the ring buffer is used to read the event, and a
stale type that is reused by another event could cause use after free
issues
- Trivial comment fix
- Fix memory leak in user_event_create()
* tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Free buffers when a used dynamic event is removed
tracing: Add tracing_reset_all_online_cpus_unlocked() function
tracing: Fix race where histograms can be called before the event
tracing/osnoise: Fix duration type
tracing/user_events: Fix memory leak in user_event_create()
tracing/hist: add in missing * in comment blocks
If user_event_trace_register() fails within user_event_parse() the
call's print_fmt member is not freed. Add kfree call to fix this.
Link: https://lkml.kernel.org/r/20221123183248.554-1-beaub@linux.microsoft.com
Fixes: aa3b2b4c66 ("user_events: Add print_fmt generation support for basic types")
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In kill_kprobe(), the check whether disarm_kprobe_ftrace() needs to be
called always fails. This is because before that we set the
KPROBE_FLAG_GONE flag for kprobe so that "!kprobe_disabled(p)" is always
false.
The disarm_kprobe_ftrace() call introduced by commit:
0cb2f1372b ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
to fix the NULL pointer reference problem. When the probe is enabled, if
we do not disarm it, this problem still exists.
Fix it by putting the probe enabled check before setting the
KPROBE_FLAG_GONE flag.
Link: https://lore.kernel.org/all/20221126114316.201857-1-lihuafei1@huawei.com/
Fixes: 3031313eb3 ("kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
- output the address in the sample only when it has been requested
- handle the case where user-only events can hit in kernel and thus
upset the sigtrap sanity checking
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmODP5wACgkQEsHwGGHe
VUqmBhAAnxQTCg4agqtKQyHAnFGZb09dE6pYghcpJzF2sOh9sffCYl9XYd53AGVS
z7nTGagPmT0AxdtjbFvDnxXuvpPVfcS9rIKG6oBzPUcXSVKr9PsqzZJJiAibg+uP
OSkC75tqNgBTnyDSGc+C5yd7LmWV98Y66t0fhwGUTHpAPY+vkbvYvi8yJYfVqZsA
Fc/6j+MJMFO1/52GoP42HVs2F7oyuXceJAAVhyYkYitC23mvrK2BRTZAkKsfRKBk
Un5gFMo+iFUTrOMo48f8BMNEtgfC5iAPPK2ZlFGwTEbDQ98+inf461s8jZq/Yeg7
KcdHw0Tk1rynBrFlbbUQArDEPyqzmj10FMVZ5VgUNdMneGEy+EUdpcoilX0ZccaA
NMnNDMPpU4S0cbuGn43Y2Jlv4QA9oECHqZ8NndAoJ5domR9g12WyVibD5W/bYBYQ
4Z2lv6luOCfWd2xQgT4s0tfMjwQb7NKmTrT4No9je7/IBDS5YjzBjPriAd4kbhFj
MUt6KKLK0fN4ASmxajXp6iYRALL2nvgJ2w1gzO0pcjMGCJ1WZBKDdfbNLbgulCwq
GJHNc55F7lWmkMNFLbWb28YLHjXGskQj4J/RARFszQq9khQnPEFj4X69StQfrghT
77rJF9CrS3KnxG4gCNWe5qvfQb9+Fi7IhwyWDI9VR0DITxGogvY=
=9CQa
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
"Two more fixes to the perf sigtrap handling:
- output the address in the sample only when it has been requested
- handle the case where user-only events can hit in kernel and thus
upset the sigtrap sanity checking"
* tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Consider OS filter fail
perf: Fixup SIGTRAP and sample_flags interaction
- Revert a recent schedutil cpufreq governor change that introduced
a performace regression on Pixel 6 (Sam Wu).
- Fix amd-pstate driver initialization after running the kernel via
kexec (Wyes Karny).
- Turn amd-pstate into a built-in driver which allows it to take
precedence over acpi-cpufreq by default on supported systems and
amend it with a mechanism to disable this behavior (Perry Yuan).
- Update amd-pstate documentation in accordance with the other changes
made to it (Perry Yuan).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmOA9bgSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxfpsP/j1JTOy3tPCsvPgI+b/RCWq33sm5PpaU
iFYpfxE9uexy2FyGgsZSqp4fMpMw9fCjiCpv/yfxtLmKccmOyW2kG7IYnNuV5QuN
FnN6B9WfsLohbuk87H07GpuGZTy4OISjmEGE/gTHXhY6n0eQ9utiX9b8zQuRdhsO
Gz7N0GaYrDh1lo9HeHNXiXbgIuVSj8iRyzWjJwdeha8NhZxwzQ3tZz5zNcBlCeJ+
sFtopCWqDZ5DbmKQy4rYXRuW8zC6UUQ4opqcpG5+0e5e51aa0lexJtUqxcyuKqHd
8plSZEEofCd7XecsvdXGzxhN29FFFIPkHZHic6/GQQou0gC8uGdp/LQMCoiZ3sJ1
3YblYJkYXMCsi9COjCAesMwy6PENL98XLfQYVmfT/EEHPgyxG1hD0a/xf0UvXzsf
V2h/gX+gfnDspCtStkrprapQb1WTCgA8d4tOB6x47EpCfY+H3HrNPFNIpJ93ErQJ
rAe8jAARh1qAbOEWYipc0XmnJLV4aDBAkR+6uE7GHXu6ESOUfvOzpVrQ5d3x4Ohk
a+vyupwiRGxRRoBjuGFyf/v15g6/UeLi5A/4Qi5SpCM7Uet7RnE/ltxjbgZJZgfG
gtBJvNaFExWJI5hFZIpz4hubsWvFMaF6SwlXnVMdYGuKZl0w119wAHFLkyO4EmHk
5RacLxDHEJ4q
=+qZV
-----END PGP SIGNATURE-----
Merge tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These revert a recent change in the schedutil cpufreq governor that
had not been expected to make any functional difference, but turned
out to introduce a performance regression, fix an initialization issue
in the amd-pstate driver and make it actually replace the venerable
ACPI cpufreq driver on the supported systems by default.
Specifics:
- Revert a recent schedutil cpufreq governor change that introduced a
performace regression on Pixel 6 (Sam Wu)
- Fix amd-pstate driver initialization after running the kernel via
kexec (Wyes Karny)
- Turn amd-pstate into a built-in driver which allows it to take
precedence over acpi-cpufreq by default on supported systems and
amend it with a mechanism to disable this behavior (Perry Yuan)
- Update amd-pstate documentation in accordance with the other
changes made to it (Perry Yuan)"
* tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Documentation: add amd-pstate kernel command line options
Documentation: amd-pstate: add driver working mode introduction
cpufreq: amd-pstate: add amd-pstate driver parameter for mode selection
cpufreq: amd-pstate: change amd-pstate driver to be built-in type
cpufreq: amd-pstate: cpufreq: amd-pstate: reset MSR_AMD_PERF_CTL register at init
Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
There have been a lot of hotfixes this cycle, and this is quite a large
batch given how far we are into the -rc cycle. Presumably a reflection of
the unusually large amount of MM material which went into 6.1-rc1.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4Bd6gAKCRDdBJ7gKXxA
jvX6AQCsG1ld24kMpdD+70XXUyC29g/6/jribgtZApHyDYjxSwD/WmLNpPlUPRax
WB071Y5w65vjSTUKvwU0OLGbHwyxgAw=
=swD5
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
issues.
There have been a lot of hotfixes this cycle, and this is quite a
large batch given how far we are into the -rc cycle. Presumably a
reflection of the unusually large amount of MM material which went
into 6.1-rc1"
* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
test_kprobes: fix implicit declaration error of test_kprobes
nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
swapfile: fix soft lockup in scan_swap_map_slots
hugetlb: fix __prep_compound_gigantic_page page flag setting
kfence: fix stack trace pruning
proc/meminfo: fix spacing in SecPageTables
mm: multi-gen LRU: retry folios written back while isolated
mailmap: update email address for Satya Priya
mm/migrate_device: return number of migrating pages in args->cpages
kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
MAINTAINERS: update Alex Hung's email address
mailmap: update Alex Hung's email address
mm: mmap: fix documentation for vma_mas_szero
mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
mm/memory: return vm_fault_t result from migrate_to_ram() callback
mm: correctly charge compressed memory to its memcg
ipc/shm: call underlying open/close vm_ops
gcov: clang: fix the buffer overflow issue
...
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
list_for_each_entry_reverse() assumes that the iterated list is nonempty
and that every list_head is embedded in the same type, but its use in
padata_do_serial() breaks both rules.
This doesn't cause any issues now because padata_priv and padata_list
happen to have their list fields at the same offset, but we really
shouldn't be relying on that.
Fixes: bfde23ce20 ("padata: unbind parallel jobs from specific CPUs")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
A deadlock can happen when an overloaded system runs ->parallel() in the
context of the current task:
padata_do_parallel
->parallel()
pcrypt_aead_enc/dec
padata_do_serial
spin_lock(&reorder->lock) // BHs still enabled
<interrupt>
...
__do_softirq
...
padata_do_serial
spin_lock(&reorder->lock)
It's a bug for BHs to be on in _do_serial as Steffen points out, so
ensure they're off in the "current task" case like they are in
padata_parallel_worker to avoid this situation.
Reported-by: syzbot+bc05445bc14148d51915@syzkaller.appspotmail.com
Fixes: 4611ce2246 ("padata: allocate work structures for parallel jobs from a pool")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The PTR_TRUSTED flag should only be applied to pointers where the verifier can
guarantee that such pointers are valid.
The fentry/fexit/fmod_ret programs are not in this category.
Only arguments of SEC("tp_btf") and SEC("iter") programs are trusted
(which have BPF_TRACE_RAW_TP and BPF_TRACE_ITER attach_type correspondingly)
This bug was masked because convert_ctx_accesses() was converting trusted
loads into BPF_PROBE_MEM loads. Fix it as well.
The loads from trusted pointers don't need exception handling.
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221124215314.55890-1-alexei.starovoitov@gmail.com
Add two kfunc's bpf_rcu_read_lock() and bpf_rcu_read_unlock(). These two kfunc's
can be used for all program types. The following is an example about how
rcu pointer are used w.r.t. bpf_rcu_read_lock()/bpf_rcu_read_unlock().
struct task_struct {
...
struct task_struct *last_wakee;
struct task_struct __rcu *real_parent;
...
};
Let us say prog does 'task = bpf_get_current_task_btf()' to get a
'task' pointer. The basic rules are:
- 'real_parent = task->real_parent' should be inside bpf_rcu_read_lock
region. This is to simulate rcu_dereference() operation. The
'real_parent' is marked as MEM_RCU only if (1). task->real_parent is
inside bpf_rcu_read_lock region, and (2). task is a trusted ptr. So
MEM_RCU marked ptr can be 'trusted' inside the bpf_rcu_read_lock region.
- 'last_wakee = real_parent->last_wakee' should be inside bpf_rcu_read_lock
region since it tries to access rcu protected memory.
- the ptr 'last_wakee' will be marked as PTR_UNTRUSTED since in general
it is not clear whether the object pointed by 'last_wakee' is valid or
not even inside bpf_rcu_read_lock region.
The verifier will reset all rcu pointer register states to untrusted
at bpf_rcu_read_unlock() kfunc call site, so any such rcu pointer
won't be trusted any more outside the bpf_rcu_read_lock() region.
The current implementation does not support nested rcu read lock
region in the prog.
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053217.2373910-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_func_proto->might_sleep to indicate a particular helper
might sleep. This will make later check whether a helper might be
sleepable or not easier.
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.
In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.
Expose new interfaces for this: timer_shutdown_sync() and timer_shutdown().
timer_shutdown_sync() has the same functionality as timer_delete_sync()
plus the NULL-ification of the timer function.
timer_shutdown() has the same functionality as timer_delete() plus the
NULL-ification of the timer function.
In both cases the rearming of the timer is prevented by silently discarding
rearm attempts due to timer->function being NULL.
Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201625.314230270@linutronix.de
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.
In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.
Add a shutdown argument to the relevant internal functions which makes the
actual deactivation code set timer->function to NULL which in turn prevents
rearming of the timer.
Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201625.253883224@linutronix.de
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.
In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.
Split the inner workings of try_do_del_timer_sync(), del_timer_sync() and
del_timer() into helper functions to prepare for implementing the shutdown
functionality.
No functional change.
Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201625.195147423@linutronix.de
Tearing down timers which have circular dependencies to other
functionality, e.g. workqueues, where the timer can schedule work and work
can arm timers, is not trivial.
In those cases it is desired to shutdown the timer in a way which prevents
rearming of the timer. The mechanism to do so is to set timer->function to
NULL and use this as an indicator for the timer arming functions to ignore
the (re)arm request.
In preparation for that replace the warnings in the relevant code paths
with checks for timer->function == NULL. If the pointer is NULL, then
discard the rearm request silently.
Add debug_assert_init() instead of the WARN_ON_ONCE(!timer->function)
checks so that debug objects can warn about non-initialized timers.
The warning of debug objects does not warn if timer->function == NULL. It
warns when timer was not initialized using timer_setup[_on_stack]() or via
DEFINE_TIMER(). If developers fail to enable debug objects and then waste
lots of time to figure out why their non-initialized timer is not firing,
they deserve it. Same for initializing a timer with a NULL function.
Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/87wn7kdann.ffs@tglx
The timer related functions do not have a strict timer_ prefixed namespace
which is really annoying.
Rename del_timer() to timer_delete() and provide del_timer()
as a wrapper. Document that del_timer() is not for new code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201625.015535022@linutronix.de
The timer related functions do not have a strict timer_ prefixed namespace
which is really annoying.
Rename del_timer_sync() to timer_delete_sync() and provide del_timer_sync()
as a wrapper. Document that del_timer_sync() is not for new code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.954785441@linutronix.de
del_timer_sync() is assumed to be pointless on uniprocessor systems and can
be mapped to del_timer() because in theory del_timer() can never be invoked
while the timer callback function is executed.
This is not entirely true because del_timer() can be invoked from interrupt
context and therefore hit in the middle of a running timer callback.
Contrary to that del_timer_sync() is not allowed to be invoked from
interrupt context unless the affected timer is marked with TIMER_IRQSAFE.
del_timer_sync() has proper checks in place to detect such a situation.
Give up on the UP optimization and make del_timer_sync() unconditionally
available.
Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201624.888306160@linutronix.de
The kernel-doc of timer related functions is partially uncomprehensible
word salad. Rewrite it to make it useful.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.828703870@linutronix.de
The timer code still has a few BUG_ON()s left which are crashing the kernel
in situations where it still can recover or simply refuse to take an
action.
Remove the one in the hotplug callback which checks for the CPU being
offline. If that happens then the whole hotplug machinery will explode in
colourful ways.
Replace the rest with WARN_ON_ONCE() and conditional returns where
appropriate.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.769128888@linutronix.de
del_singleshot_timer_sync() used to be an optimization for deleting timers
which are not rearmed from the timer callback function.
This optimization turned out to be broken and got mapped to
del_timer_sync() about 17 years ago.
Get rid of the undocumented indirection and use del_timer_sync() directly.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.706987932@linutronix.de
Event group from different hw pmus does not make sense and thus perf
has never allowed it. However, with recent rewrite that restriction
has been inadvertently removed. Fix it.
Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221122080326.228-1-ravi.bangoria@amd.com
With the advent of CFI it is no longer acceptible to cast function
pointers.
The robot complains thusly:
kernel-events-core.c⚠️cast-from-int-(-)(struct-perf_cpu_pmu_context-)-to-remote_function_f-(aka-int-(-)(void-)-)-converts-to-incompatible-function-type
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Some PMUs (notably the traditional hardware kind) have boundary issues
with the OS filter. Specifically, it is possible for
perf_event_attr::exclude_kernel=1 events to trigger in-kernel due to
SKID or errata.
This can upset the sigtrap logic some and trigger the WARN.
However, if this invalid sample is the first we must not loose the
SIGTRAP, OTOH if it is the second, it must not override the
pending_addr with a (possibly) invalid one.
Fixes: ca6c21327c ("perf: Fix missing SIGTRAPs")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Link: https://lkml.kernel.org/r/Y3hDYiXwRnJr8RYG@xpf.sh.intel.com
Callers can currently store tasks as kptrs using bpf_task_acquire(),
bpf_task_kptr_get(), and bpf_task_release(). These are useful if a
caller already has a struct task_struct *, but there may be some callers
who only have a pid, and want to look up the associated struct
task_struct * from that to e.g. find task->comm.
This patch therefore adds a new bpf_task_from_pid() kfunc which allows
BPF programs to get a struct task_struct * kptr from a pid.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221122145300.251210-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Song Shuai reported:
The list func (ftrace_ops_list_func) will be patched first
before the transition between old and new calls are set,
which fixed the race described in this commit `59338f75`.
While ftrace_trace_function changes from the list func to a
ftrace_ops func, like unregistering the klp_ops to leave the only
global_ops in ftrace_ops_list, the ftrace_[regs]_call will be
replaced with the list func although it already exists. So there
should be a condition to avoid this.
And suggested using another variable to keep track of what the ftrace
function is set to. But this could be simplified by using a helper
function that does the same with a static variable.
Link: https://lore.kernel.org/lkml/20221026132039.2236233-1-suagrfillet@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221122180905.737b6f52@gandalf.local.home
Reported-by: Song Shuai <suagrfillet@gmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After commit 060fa5c83e ("tracing/events: reuse trace event ids after
overflow"), trace events with dynamic type are linked up in list
'ftrace_event_list' through field 'trace_event.list'. Then when max
event type number used up, it's possible to reuse type number of some
freed one by traversing 'ftrace_event_list'.
As instead, using IDA to manage available type numbers can make codes
simpler and then the field 'trace_event.list' can be dropped.
Since 'struct trace_event' is used in static tracepoints, drop
'trace_event.list' can make vmlinux smaller. Local test with about 2000
tracepoints, vmlinux reduced about 64KB:
before:-rwxrwxr-x 1 root root 76669448 Nov 8 17:14 vmlinux
after: -rwxrwxr-x 1 root root 76604176 Nov 8 17:15 vmlinux
Link: https://lkml.kernel.org/r/20221110020319.1259291-1-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After change in commit 4239174570 ("tracing: Make tracepoint_printk a
static_key"), this symbol is not used outside of the file, so mark it
static.
Link: https://lkml.kernel.org/r/20221122091456.72055-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This patch uses strndup_user instead of kzalloc + strncpy_from_user,
which makes the code more concise.
Link: https://lkml.kernel.org/r/20221121080831.707409-1-nashuiliang@gmail.com
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add the tracing/osnoise/options file to control
osnoise/timerlat tracer features. It is a single
file to contain multiple features, similar to
the sched/features file.
Reading the file displays a list of options. Writing
the OPTION_NAME enables it, writing NO_OPTION_NAME disables
it.
The DEAFULTS is a particular option that resets the options
to the default ones.
It uses a bitmask to keep track of the status of the option. When
needed, we can add a list of static keys, but for now
it does not justify the memory increase.
Link: https://lkml.kernel.org/r/f8d34aefdb225d2603fcb4c02a120832a0cd3339.1668692096.git.bristot@kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After commit a389d86f7f ("ring-buffer: Have nested events still record
running time stamp"), the "event" parameter is no longer used in either
ring_buffer_unlock_commit() or rb_commit(). Best to remove it.
Link: https://lkml.kernel.org/r/1666274811-24138-1-git-send-email-chensong_2000@189.cn
Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allow triggers to be enabled at kernel boot up. For example:
trace_trigger="sched_switch.stacktrace if prev_state == 2"
The above will enable the stacktrace trigger on top of the sched_switch
event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
at boot up, a stacktrace will trigger and be recorded in the tracing ring
buffer every time the sched_switch happens where the previous state is
TASK_INTERRUPTIBLE.
Another useful trigger would be "traceoff" which can stop tracing on an
event if a field of the event matches a certain value defined by the
filter ("if" statement).
Link: https://lore.kernel.org/linux-trace-kernel/20221020210056.0d8d0a5b@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.
The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.
To show how this can be an issue, the following can crash the kernel:
# cd /sys/kernel/tracing
# for i in `seq 65536`; do
echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
# done
For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.
Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.
# echo 1 > kprobes/foo/enable
# cat /etc/passwd > /dev/null
# cat trace
cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
# echo 0 > kprobes/foo/enable
Now if we delete the kprobe and create a new one that reads a string:
# echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events
And now we can the trace:
# cat trace
sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1= cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
bash-1515 [007] ..... 534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U
And dmesg has:
==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049
CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
print_report+0x17f/0x47b
kasan_report+0xad/0x130
string+0xd4/0x1c0
vsnprintf+0x500/0x840
seq_buf_vprintf+0x62/0xc0
trace_seq_printf+0x10e/0x1e0
print_type_string+0x90/0xa0
print_kprobe_event+0x16b/0x290
print_trace_line+0x451/0x8e0
s_show+0x72/0x1f0
seq_read_iter+0x58e/0x750
seq_read+0x115/0x160
vfs_read+0x11d/0x460
ksys_read+0xa9/0x130
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc2e972ade2
Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
</TASK>
The buggy address belongs to the physical page:
page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).
As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.
If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.
Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Depends-on: e18eb8783e ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on: 5448d44c38 ("tracing: Add unified dynamic event framework")
Depends-on: 6212dd2968 ("tracing/kprobes: Use dyn_event framework for kprobe events")
Depends-on: 065e63f951 ("tracing: Only have rmmod clear buffers that its events were active in")
Depends-on: 575380da8b ("tracing: Only clear trace buffer on module unload if event was traced")
Fixes: 77b44d1b7c ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event")
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: kernel test robot <yujie.liu@intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the tracing_reset_all_online_cpus() requires the
trace_types_lock held. But only one caller of this function actually has
that lock held before calling it, and the other just takes the lock so
that it can call it. More users of this function is needed where the lock
is not held.
Add a tracing_reset_all_online_cpus_unlocked() function for the one use
case that calls it without being held, and also add a lockdep_assert to
make sure it is held when called.
Then have tracing_reset_all_online_cpus() take the lock internally, such
that callers do not need to worry about taking it.
Link: https://lkml.kernel.org/r/20221123192741.658273220@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
commit 94eedf3dde ("tracing: Fix race where eprobes can be called before
the event") fixed an issue where if an event is soft disabled, and the
trigger is being added, there's a small window where the event sees that
there's a trigger but does not see that it requires reading the event yet,
and then calls the trigger with the record == NULL.
This could be solved with adding memory barriers in the hot path, or to
make sure that all the triggers requiring a record check for NULL. The
latter was chosen.
Commit 94eedf3dde set the eprobe trigger handle to check for NULL, but
the same needs to be done with histograms.
Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Syzkaller managed to hit another decl_tag issue:
btf_func_proto_check kernel/bpf/btf.c:4506 [inline]
btf_check_all_types kernel/bpf/btf.c:4734 [inline]
btf_parse_type_sec+0x1175/0x1980 kernel/bpf/btf.c:4763
btf_parse kernel/bpf/btf.c:5042 [inline]
btf_new_fd+0x65a/0xb00 kernel/bpf/btf.c:6709
bpf_btf_load+0x6f/0x90 kernel/bpf/syscall.c:4342
__sys_bpf+0x50a/0x6c0 kernel/bpf/syscall.c:5034
__do_sys_bpf kernel/bpf/syscall.c:5093 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5091 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5091
do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48
This seems similar to commit ea68376c8b ("bpf: prevent decl_tag from being
referenced in func_proto") but for the argument.
Reported-by: syzbot+8dd0551dda6020944c5d@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221123035422.872531-2-sdf@google.com
In commit fda01efc61 ("bpf: Enable cgroups to be used as kptrs"), I
added an 'int idx' variable to kfunc_init() which was meant to
dynamically set the index of the btf id entries of the
'generic_dtor_ids' array. This was done to make the code slightly less
brittle as the struct cgroup * kptr kfuncs such as bpf_cgroup_aquire()
are compiled out if CONFIG_CGROUPS is not defined. This, however, causes
an lkp build warning:
>> kernel/bpf/helpers.c:2005:40: warning: multiple unsequenced
modifications to 'idx' [-Wunsequenced]
.btf_id = generic_dtor_ids[idx++],
Fix the warning by just hard-coding the indices.
Fixes: fda01efc61 ("bpf: Enable cgroups to be used as kptrs")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221123135253.637525-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.
Test for a NULL record to keep this from happening.
Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/all/20221117214249.2addbe10@gandalf.local.home/
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The duration type is a 64 long value, not an int. This was
causing some long noise to report wrong values.
Change the duration to a 64 bits value.
Link: https://lkml.kernel.org/r/a93d8a8378c7973e9c609de05826533c9e977939.1668692096.git.bristot@kernel.org
Cc: stable@vger.kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Before current_user_event_group(), it has allocated memory and save it
in @name, this should freed before return error.
Link: https://lkml.kernel.org/r/20221115014445.158419-1-xiujianfeng@huawei.com
Fixes: e5d271812e ("tracing/user_events: Move pages/locks into groups to prepare for namespaces")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
struct cgroup * objects have a variably sized struct cgroup *ancestors[]
field which stores pointers to their ancestor cgroups. If using a cgroup
as a kptr, it can be useful to access these ancestors, but doing so
requires variable offset accesses for PTR_TO_BTF_ID, which is currently
unsupported.
This is a very useful field to access for cgroup kptrs, as programs may
wish to walk their ancestor cgroups when determining e.g. their
proportional cpu.weight. So as to enable this functionality with cgroup
kptrs before var_off is supported for PTR_TO_BTF_ID, this patch adds a
bpf_cgroup_ancestor() kfunc which accesses the cgroup node on behalf of
the caller, and acquires a reference on it. Once var_off is supported
for PTR_TO_BTF_ID, and fields inside a struct can be marked as trusted
so they retain the PTR_TRUSTED modifier when walked, this can be
removed.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221122055458.173143-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Now that tasks can be used as kfuncs, and the PTR_TRUSTED flag is
available for us to easily add basic acquire / get / release kfuncs, we
can do the same for cgroups. This patch set adds the following kfuncs
which enable using cgroups as kptrs:
struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp);
struct cgroup *bpf_cgroup_kptr_get(struct cgroup **cgrpp);
void bpf_cgroup_release(struct cgroup *cgrp);
A follow-on patch will add a selftest suite which validates these
kfuncs.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221122055458.173143-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There are a couple of missing * in comment blocks. Fix these.
Cleans up two clang warnings:
kernel/trace/trace_events_hist.c:986: warning: bad line:
kernel/trace/trace_events_hist.c:3229: warning: bad line:
Link: https://lkml.kernel.org/r/20221020133019.1547587-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Change the function argument in the description of cpuset_css_alloc()
from 'struct cgroup' -> 'struct cgroup_subsys_state'. The change to the
argument type was introduced by commit eb95419b02 ("cgroup: pass
around cgroup_subsys_state instead of cgroup in subsystem methods").
Also, add more information to its description.
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit 6d5afdc97e.
On a Pixel 6 device, it is observed that this commit increases
latency by approximately 50ms, or 20%, in migrating a task
that requires full CPU utilization from a LITTLE CPU to Fmax
on a big CPU. Reverting this change restores the latency back
to its original baseline value.
Fixes: 6d5afdc97e ("cpufreq: schedutil: Move max CPU capacity to sugov_policy")
Signed-off-by: Sam Wu <wusamuel@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The filter() callback in struct kset_uevent_ops does not modify the
kobject passed into it, so make the pointer const to enforce this
restriction. When doing so, fix up all existing filter() callbacks to
have the correct signature to preserve the build.
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Christian König <christian.koenig@amd.com> for the changes to
Link: https://lore.kernel.org/r/20221121094649.1556002-3-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmN6wAgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0EYH/3/RO90NbrFItraN
Lzr+d3VdbGjTu8xd1M+PRTmwh3zxLpB+Jwqr0T0A2gzL9B/D+AUPUJdrCVbv9DqS
FLJAVqoeV20dNBAHSffOOLPsgCZ+Eu+LzlNN7Iqde0e8cyZICFMNktitui84Xm/i
1NgFVgz9OZ6+aieYvUj3FrFq0p8GTIaC/oybDZrxYKcO8ZzKVMJ11swRw10wwq0g
qOOECvV3w7wlQ8upQZkzFxItKFc7EexZI6R4elXeGSJJ9Hlc092dv/zsKB9dwV+k
WcwkJrZRoezYXzgGBFxUcQtzi+ethjrPjuJuM1rYLUSIcfIW/0lkaSLgRoBu8D+I
1GfXkXs=
=gt6P
-----END PGP SIGNATURE-----
Merge tag 'v6.1-rc6' into x86/core, to resolve conflicts
Resolve conflicts between these commits in arch/x86/kernel/asm-offsets.c:
# upstream:
debc5a1ec0 ("KVM: x86: use a separate asm-offsets.c file")
# retbleed work in x86/core:
5d8213864a ("x86/retbleed: Add SKL return thunk")
... and these commits in include/linux/bpf.h:
# upstram:
18acb7fac2 ("bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")")
# x86/core commits:
931ab63664 ("x86/ibt: Implement FineIBT")
bea75b3389 ("x86/Kconfig: Introduce function padding")
The latter two modify BPF_DISPATCHER_ATTRIBUTES(), which was removed upstream.
Conflicts:
arch/x86/kernel/asm-offsets.c
include/linux/bpf.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
bpf_selem_alloc function is used by inode_storage, sk_storage and
task_storage maps to set map value, for these map types, there may
be a spin lock in the map value, so if we use memcpy to copy the whole
map value from user, the spin lock field may be initialized incorrectly.
Since the spin lock field is zeroed by kzalloc, call copy_map_value
instead of memcpy to skip copying the spin lock field to fix it.
Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20221114134720.1057939-2-xukuohai@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_iter_attach_cgroup() has already acquired an extra reference for the
start cgroup, but the reference may be released if the iterator link fd
is closed after the creation of iterator fd, and it may lead to
user-after-free problem when reading the iterator fd.
An alternative fix is pinning iterator link when opening iterator,
but it will make iterator link being still visible after the close of
iterator link fd and the behavior is different with other link types, so
just fixing it by acquiring another reference for the start cgroup.
Fixes: d4ccaf58a8 ("bpf: Introduce cgroup iter")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221121073440.1828292-2-houtao@huaweicloud.com
Most allocation sites in the kernel want an explicitly sized allocation
(and not "more"), and that dynamic runtime analysis tools (e.g. KASAN,
UBSAN_BOUNDS, FORTIFY_SOURCE, etc) are looking for precise bounds checking
(i.e. not something that is rounded up). A tiny handful of allocations
were doing an implicit alloc/realloc loop that actually depended on
ksize(), and didn't actually always call realloc. This has created a
long series of bugs and problems over many years related to the runtime
bounds checking, so these callers are finally being adjusted to _not_
depend on the ksize() side-effect, by doing one of several things:
- tracking the allocation size precisely and just never calling ksize()
at all [1].
- always calling realloc and not using ksize() at all. (This solution
ends up actually be a subset of the next solution.)
- using kmalloc_size_roundup() to explicitly round up the desired
allocation size immediately [2].
The bpf/verifier case is this another of this latter case, and is the
last outstanding case to be fixed in the kernel.
Because some of the dynamic bounds checking depends on the size being an
_argument_ to an allocator function (i.e. see the __alloc_size attribute),
the ksize() users are rare, and it could waste local variables, it
was been deemed better to explicitly separate the rounding up from the
allocation itself [3].
Round up allocations with kmalloc_size_roundup() so that the verifier's
use of ksize() is always accurate.
[1] e.g.:
https://git.kernel.org/linus/712f210a457dhttps://git.kernel.org/linus/72c08d9f4c72
[2] e.g.:
https://git.kernel.org/netdev/net-next/c/12d6c1d3a2adhttps://git.kernel.org/netdev/net-next/c/ab3f7828c979https://git.kernel.org/netdev/net-next/c/d6dd508080a3
[3] https://lore.kernel.org/lkml/0ea1fc165a6c6117f982f4f135093e69cb884930.camel@redhat.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221118183409.give.387-kees@kernel.org
DMA allocations can never be turned back into a page pointer, so
requesting compound pages doesn't make sense and it can't even be
supported at all by various backends.
Reject __GFP_COMP with a warning in dma_alloc_attrs, and stop clearing
the flag in the arm dma ops and dma-iommu.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
The sysctl_numa_balancing_promote_rate_limit and sysctl_numa_balancing
are part of sched, move them to its own file.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Implement bpf_rdonly_cast() which tries to cast the object
to a specified type. This tries to support use case like below:
#define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
where skb_end_pointer(SKB) is a 'unsigned char *' and needs to
be casted to 'struct skb_shared_info *'.
The signature of bpf_rdonly_cast() looks like
void *bpf_rdonly_cast(void *obj, __u32 btf_id)
The function returns the same 'obj' but with PTR_TO_BTF_ID with
btf_id. The verifier will ensure btf_id being a struct type.
Since the supported type cast may not reflect what the 'obj'
represents, the returned btf_id is marked as PTR_UNTRUSTED, so
the return value and subsequent pointer chasing cannot be
used as helper/kfunc arguments.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195437.3114585-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_cast_to_kern_ctx() kfunc which does a type cast
of a uapi ctx object to the corresponding kernel ctx. Previously
if users want to access some data available in kctx but not
in uapi ctx, bpf_probe_read_kernel() helper is needed.
The introduction of bpf_cast_to_kern_ctx() allows direct
memory access which makes code simpler and easier to understand.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195432.3113982-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Later on, we will introduce kfuncs bpf_cast_to_kern_ctx() and
bpf_rdonly_cast() which apply to all program types. Currently kfunc set
only supports individual prog types. This patch added support for kfunc
applying to all program types.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195426.3113828-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the unlikely event that bpf_global_ma is not correctly initialized,
instead of checking the boolean everytime bpf_obj_new_impl is called,
simply check it while loading the program and return an error if
bpf_global_ma_set is false.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221120212610.2361700-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix possible NULL pointer dereference on trace_event_file in kprobe_event_gen_test_exit()
- Fix NULL pointer dereference for trace_array in kprobe_event_gen_test_exit()
- Fix memory leak of filter string for eprobes
- Fix a possible memory leak in rethook_alloc()
- Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
which can cause a possible use-after-free
- Fix warning in eprobe filter creation
- Fix eprobe filter creation as it picked the wrong event for the fields
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY3qRDhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgCdAP0cB1ZjmMM7O8OFdnrTV0jfavZTnNNC
ut9sczpYQ0upcwEArmVvB+H3sTM6Y7PrsQEUn8gsc7WmieUoDAOr0hIe4AI=
=DGVC
-----END PGP SIGNATURE-----
Merge tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing/probes fixes from Steven Rostedt:
- Fix possible NULL pointer dereference on trace_event_file in
kprobe_event_gen_test_exit()
- Fix NULL pointer dereference for trace_array in
kprobe_event_gen_test_exit()
- Fix memory leak of filter string for eprobes
- Fix a possible memory leak in rethook_alloc()
- Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case which
can cause a possible use-after-free
- Fix warning in eprobe filter creation
- Fix eprobe filter creation as it picked the wrong event for the
fields
* tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: Fix eprobe filter to make a filter correctly
tracing/eprobe: Fix warning in filter creation
kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
rethook: fix a potential memleak in rethook_alloc()
tracing/eprobe: Fix memory leak of filter string
tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
- Fix polling to block on watermark like the reads do, as user space
applications get confused when the select says read is available, and then
the read blocks.
- Fix accounting of ring buffer dropped pages as it is what is used to
determine if the buffer is empty or not.
- Fix memory leak in tracing_read_pipe()
- Fix struct trace_array warning about being declared in parameters
- Fix accounting of ftrace pages used in output at start up.
- Fix allocation of dyn_ftrace pages by subtracting one from order instead of
diving it by 2
- Static analyzer found a case were a pointer being used outside of a NULL
check. (rb_head_page_deactivate())
- Fix possible NULL pointer dereference if kstrdup() fails in ftrace_add_mod()
- Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
- Fix bad pointer dereference in register_synth_event() on error path.
- Remove unused __bad_type_size() method
- Fix possible NULL pointer dereference of entry in list 'tr->err_log'
- Fix NULL pointer deference race if eprobe is called before the event setup
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY3qNNBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiVzAP9vdtLkseOueVqPJ/Wc6v3z0xlkxO4L
Aj9jOac822SPOQEAvUJ1DM1bxm/D2BY5AQsfgSGjdaVYP+I3kvETNgWspQI=
=3ta3
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix polling to block on watermark like the reads do, as user space
applications get confused when the select says read is available, and
then the read blocks
- Fix accounting of ring buffer dropped pages as it is what is used to
determine if the buffer is empty or not
- Fix memory leak in tracing_read_pipe()
- Fix struct trace_array warning about being declared in parameters
- Fix accounting of ftrace pages used in output at start up.
- Fix allocation of dyn_ftrace pages by subtracting one from order
instead of diving it by 2
- Static analyzer found a case were a pointer being used outside of a
NULL check (rb_head_page_deactivate())
- Fix possible NULL pointer dereference if kstrdup() fails in
ftrace_add_mod()
- Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
- Fix bad pointer dereference in register_synth_event() on error path
- Remove unused __bad_type_size() method
- Fix possible NULL pointer dereference of entry in list 'tr->err_log'
- Fix NULL pointer deference race if eprobe is called before the event
setup
* tag 'trace-v6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix race where eprobes can be called before the event
tracing: Fix potential null-pointer-access of entry in list 'tr->err_log'
tracing: Remove unused __bad_type_size() method
tracing: Fix wild-memory-access in register_synth_event()
tracing: Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
ftrace: Fix null pointer dereference in ftrace_add_mod()
ring_buffer: Do not deactivate non-existant pages
ftrace: Optimize the allocation for mcount entries
ftrace: Fix the possible incorrect kernel message
tracing: Fix warning on variable 'struct trace_array'
tracing: Fix memory leak in tracing_read_pipe()
ring-buffer: Include dropped pages in counting dirty patches
tracing/ring-buffer: Have polling block on watermark
The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.
Test for a NULL record to keep this from happening.
Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221117214249.2addbe10@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
misunderstanding whether the task holds rq->lock or not
- Prevent processes from getting killed when using deprecated or unknown
rseq ABI flags in order to be able to fuzz the rseq() syscall with
syzkaller
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmN6FH0ACgkQEsHwGGHe
VUrsFw//QdGg/ZpGdeGH5njOFX9KCLrcMojz+Ip7vn4TeWiT/0J70jym4Pxkvwls
Cz1s47kmhEE56xYi0gMJJi6XtjYWh1/sptuTkS+wHNp+NliaHMnmPPY08wL6/ZEq
Zm8KP1e3OHjl+k82yhuIeYAFFdUpbLlQkjCKHqQkBlemh0MF+GeTavHDWe3kLolb
arEld+kIo7OUkBDmVGk6BaE3R0DuCYGokDcj9xGNx0alcQCtu+oT+T9jYMgwBIn2
zGpfM2jeB4nwYKYv+Mg6tY38vaz6bHZWSoDhaB/+GzewhZMdGGCo3SdN207WJHUK
TSjjz6a5sMJSZTpVW0K6MxJhv245KNt3Tc/JpDtqVq4KfKNcv0Rs4bDIWWyIsvow
KlSUvRBeDT3rx9152vB72GiPptZogWsTc5dM0/ZlBaCfM4UBreYgO37ZYYK9hPdx
M2rCOafUSwkDFG+A5JCWM7d5sPAndvAAX29qqmrgbTVSV+XKAx3UdWpGJTEL8bSF
OWIFhVd2nqOglL41c1Rm4zk9TJ3k9mMsgnsuskgclmAI7GmRkPADHlv19bRRe8UL
TEDBOCQaNYFkjkp+4IM/I3GXGa+SiOHEiK7G2CjJEeHbzWoGKQPFpJtOtH4zHNV+
m32Ek2XBw1zKjfAyr1WGyA+GE4tTIIphqgonu4cA2cKCYjm/SG4=
=I+nD
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix a small race on the task's exit path where there's a
misunderstanding whether the task holds rq->lock or not
- Prevent processes from getting killed when using deprecated or
unknown rseq ABI flags in order to be able to fuzz the rseq() syscall
with syzkaller
* tag 'sched_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix race in task_call_func()
rseq: Use pr_warn_once() when deprecated/unknown ABI flags are encountered
for more than 4K
- Fix a NULL ptr dereference which can happen after an NMI interferes
with the event enabling dance in amd_pmu_enable_all()
- Free the events array too when freeing uncore contexts on CPU online,
thereby fixing a memory leak
- Improve the pending SIGTRAP check
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmN6DfgACgkQEsHwGGHe
VUoVUxAAuAaPNOil4T95BGBYL7O7w5VvUQtaq9TtSm1ZwDlz63BNFwrLdlyi3pnc
4J84lWB1m0hPJHggqqWwBcIRL4xgrNZN4F0Q641Ns8IcAUtYbuQ4Gxm4io5IysFN
DYph6EZ597S7izdZD/S0wETqjgE5/zs16x2/E9E9aJ9CH6Kewqa0y5bXhieF4+UF
o0z55KB8YtS19BvopZLunj3tvwOSgAqgkgW7UtbR6clhcXBiq6/6bAjEBxoPrYRE
7B8LEEMol7ERyIZHLE/xA38D6o1BhlD9OnVNpwYtDeYe63dtPrQAfjDwkk7gOdWB
H9mNHqYB4kPMaUt4sDfDRTVYvioUXH0y7gjvl54efUtnj3vdxhUFzjD+JFh4nzrr
efCU+Cy8TzYV1V7c9yn9JjIp7w0rLX2jMlPvL8GGwlYQ8nXQKR6OHi3Dw9kYLKrT
/jtAxkv8+enbUiMukw0AyhJfgzSM8OkUIjtHGTgkY7Q2BEv+yC7ogvD9wIozocLa
w0EIJo5wtLUHu78u5+5tIOZYcZq1A89lPV8TXNzCGU6+xwm3wRzpvcAICp285NqA
gXBEtWGULUB+paz4db5tnChP9xK8MOt4Ro3/Y00Hxr3ZgO6zJ2nc7j8fENX9gON0
YAeZB9yPPcvy6oABYcVDH3sG/wyN5wCnlPGxrVom4EYRRWzhAL8=
=9ZHJ
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix an intel PT erratum where CPUs do not support single range output
for more than 4K
- Fix a NULL ptr dereference which can happen after an NMI interferes
with the event enabling dance in amd_pmu_enable_all()
- Free the events array too when freeing uncore contexts on CPU online,
thereby fixing a memory leak
- Improve the pending SIGTRAP check
* tag 'perf_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix sampling using single range output
perf/x86/amd: Fix crash due to race between amd_pmu_enable_all, perf NMI and throttling
perf/x86/amd/uncore: Fix memory leak for events array
perf: Improve missing SIGTRAP checking
Now that BPF supports adding new kernel functions with kfuncs, and
storing kernel objects in maps with kptrs, we can add a set of kfuncs
which allow struct task_struct objects to be stored in maps as
referenced kptrs. The possible use cases for doing this are plentiful.
During tracing, for example, it would be useful to be able to collect
some tasks that performed a certain operation, and then periodically
summarize who they are, which cgroup they're in, how much CPU time
they've utilized, etc.
In order to enable this, this patch adds three new kfuncs:
struct task_struct *bpf_task_acquire(struct task_struct *p);
struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
void bpf_task_release(struct task_struct *p);
A follow-on patch will add selftests validating these kfuncs.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
to the verifier that it should enforce that a BPF program passes it a
"safe", trusted pointer. Currently, "safe" means that the pointer is
either PTR_TO_CTX, or is refcounted. There may be cases, however, where
the kernel passes a BPF program a safe / trusted pointer to an object
that the BPF program wishes to use as a kptr, but because the object
does not yet have a ref_obj_id from the perspective of the verifier, the
program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
kfunc.
The solution is to expand the set of pointers that are considered
trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
with these pointers without getting rejected by the verifier.
There is already a PTR_UNTRUSTED flag that is set in some scenarios,
such as when a BPF program reads a kptr directly from a map
without performing a bpf_kptr_xchg() call. These pointers of course can
and should be rejected by the verifier. Unfortunately, however,
PTR_UNTRUSTED does not cover all the cases for safety that need to
be addressed to adequately protect kfuncs. Specifically, pointers
obtained by a BPF program "walking" a struct are _not_ considered
PTR_UNTRUSTED according to BPF. For example, say that we were to add a
kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
that a task was unsafe to pass to a kfunc, the verifier would mistakenly
allow the following unsafe BPF program to be loaded:
SEC("tp_btf/task_newtask")
int BPF_PROG(unsafe_acquire_task,
struct task_struct *task,
u64 clone_flags)
{
struct task_struct *acquired, *nested;
nested = task->last_wakee;
/* Would not be rejected by the verifier. */
acquired = bpf_task_acquire(nested);
if (!acquired)
return 0;
bpf_task_release(acquired);
return 0;
}
To address this, this patch defines a new type flag called PTR_TRUSTED
which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
passed directly from the kernel as a tracepoint or struct_ops callback
argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
pointer is no longer PTR_TRUSTED. From the example above, the struct
task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
obtained from 'task->last_wakee' is not PTR_TRUSTED.
A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
and then another patch will add selftests to validate.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
reg_type_str() in the verifier currently only allows a single register
type modifier to be present in the 'prefix' string which is eventually
stored in the env type_str_buf. This currently works fine because there
are no overlapping type modifiers, but once PTR_TRUSTED is added, that
will no longer be the case. This patch updates reg_type_str() to support
having multiple modifiers in the prefix string, and updates the size of
type_str_buf to be 128 bytes.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
task_numa_work. x86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).
No functional change intended.
Link: https://lkml.kernel.org/r/20220822173956.82525-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We already have struct range, so just use it.
Link: https://lkml.kernel.org/r/20220929042936.22012-4-bhe@redhat.com
Signed-off-by: Li Chen <lchen@ambarella.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Chen Lifu <chenlifu@huawei.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jianglei Nie <niejianglei2021@163.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Return the value kimage_add_entry() directly instead of storing it in
another redundant variable.
Link: https://lkml.kernel.org/r/20220929042936.22012-3-bhe@redhat.com
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chen Lifu <chenlifu@huawei.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jianglei Nie <niejianglei2021@163.com>
Cc: Li Chen <lchen@ambarella.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In subsequent patches we'll arrange for architectures to have an
ftrace_regs which is entirely distinct from pt_regs. In preparation for
this, we need to minimize the use of pt_regs to where strictly necessary
in the core ftrace code.
This patch adds new ftrace_regs_{get,set}_*() helpers which can be used
to manipulate ftrace_regs. When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y,
these can always be used on any ftrace_regs, and when
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n these can be used when regs are
available. A new ftrace_regs_has_args(fregs) helper is added which code
can use to check when these are usable.
Co-developed-by: Florent Revest <revest@chromium.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
In subsequent patches we'll add a sew of ftrace_regs_{get,set}_*()
helpers. In preparation, this patch renames
ftrace_instruction_pointer_set() to
ftrace_regs_set_instruction_pointer().
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
In subsequent patches we'll arrange for architectures to have an
ftrace_regs which is entirely distinct from pt_regs. In preparation for
this, we need to minimize the use of pt_regs to where strictly
necessary in the core ftrace code.
This patch changes the prototype of arch_ftrace_set_direct_caller() to
take ftrace_regs rather than pt_regs, and moves the extraction of the
pt_regs into arch_ftrace_set_direct_caller().
On x86, arch_ftrace_set_direct_caller() can be used even when
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n, and <linux/ftrace.h> defines
struct ftrace_regs. Due to this, it's necessary to define
arch_ftrace_set_direct_caller() as a macro to avoid using an incomplete
type. I've also moved the body of arch_ftrace_set_direct_caller() after
the CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y defineidion of struct
ftrace_regs.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Some architectures (powerpc) may not support ftrace locations being nop'ed
out at build time. Introduce CONFIG_HAVE_OBJTOOL_NOP_MCOUNT for objtool, as
a means for architectures to enable nop'ing of ftrace locations. Add --mnop
as an option to objtool --mcount, to indicate support for the same.
Also, make sure that --mnop can be passed as an option to objtool only when
--mcount is passed.
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Sathvika Vasireddy <sv@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221114175754.1131267-12-sv@linux.ibm.com
The old behavior of bpf_map_meta_equal was that it compared timer_off
to be equal (but not spin_lock_off, because that was not allowed), and
did memcmp of kptr_off_tab.
Now, we memcmp the btf_record of two bpf_map structs, which has all
fields.
We preserve backwards compat as we kzalloc the array, so if only spin
lock and timer exist in map, we only compare offset while the rest of
unused members in the btf_field struct are zeroed out.
In case of kptr, btf and everything else is of vmlinux or module, so as
long type is same it will match, since kernel btf, module, dtor pointer
will be same across maps.
Now with list_head in the mix, things are a bit complicated. We
implicitly add a requirement that both BTFs are same, because struct
btf_field_list_head has btf and value_rec members.
We obviously shouldn't force BTFs to be equal by default, as that breaks
backwards compatibility.
Currently it is only implicitly required due to list_head matching
struct btf and value_rec member. value_rec points back into a btf_record
stashed in the map BTF (btf member of btf_field_list_head). So that
pointer and btf member has to match exactly.
Document all these subtle details so that things don't break in the
future when touching this code.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit implements the delayed release logic for bpf_list_push_front
and bpf_list_push_back.
Once a node has been added to the list, it's pointer changes to
PTR_UNTRUSTED. However, it is only released once the lock protecting the
list is unlocked. For such PTR_TO_BTF_ID | MEM_ALLOC with PTR_UNTRUSTED
set but an active ref_obj_id, it is still permitted to read them as long
as the lock is held. Writing to them is not allowed.
This allows having read access to push items we no longer own until we
release the lock guarding the list, allowing a little more flexibility
when working with these APIs.
Note that enabling write support has fairly tricky interactions with
what happens inside the critical section. Just as an example, currently,
bpf_obj_drop is not permitted, but if it were, being able to write to
the PTR_UNTRUSTED pointer while the object gets released back to the
memory allocator would violate safety properties we wish to guarantee
(i.e. not crashing the kernel). The memory could be reused for a
different type in the BPF program or even in the kernel as it gets
eventually kfree'd.
Not enabling bpf_obj_drop inside the critical section would appear to
prevent all of the above, but that is more of an artifical limitation
right now. Since the write support is tangled with how we handle
potential aliasing of nodes inside the critical section that may or may
not be part of the list anymore, it has been deferred to a future patch.
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-18-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a linked list API for use in BPF programs, where it expects
protection from the bpf_spin_lock in the same allocation as the
bpf_list_head. For now, only one bpf_spin_lock can be present hence that
is assumed to be the one protecting the bpf_list_head.
The following functions are added to kick things off:
// Add node to beginning of list
void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node);
// Add node to end of list
void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node);
// Remove node at beginning of list and return it
struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head);
// Remove node at end of list and return it
struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head);
The lock protecting the bpf_list_head needs to be taken for all
operations. The verifier ensures that the lock that needs to be taken is
always held, and only the correct lock is taken for these operations.
These checks are made statically by relying on the reg->id preserved for
registers pointing into regions having both bpf_spin_lock and the
objects protected by it. The comment over check_reg_allocation_locked in
this change describes the logic in detail.
Note that bpf_list_push_front and bpf_list_push_back are meant to
consume the object containing the node in the 1st argument, however that
specific mechanism is intended to not release the ref_obj_id directly
until the bpf_spin_unlock is called. In this commit, nothing is done,
but the next commit will be introducing logic to handle this case, so it
has been left as is for now.
bpf_list_pop_front and bpf_list_pop_back delete the first or last item
of the list respectively, and return pointer to the element at the
list_node offset. The user can then use container_of style macro to get
the actual entry type. The verifier however statically knows the actual
type, so the safety properties are still preserved.
With these additions, programs can now manage their own linked lists and
store their objects in them.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pointer increment on seeing PTR_MAYBE_NULL is already protected against,
hence make an exception for PTR_TO_BTF_ID | MEM_ALLOC while still
keeping the warning for other unintended cases that might creep in.
bpf_list_pop_{front,_back} helpers planned to be introduced in next
commit will return a MEM_ALLOC register with incremented offset pointing
to bpf_list_node field. The user is supposed to then obtain the pointer
to the entry using container_of after NULL checking it. The current
restrictions trigger a warning when doing the NULL checking. Revisiting
the reason, it is meant as an assertion which seems to actually work and
catch the bad case.
Hence, under no other circumstances can reg->off be non-zero for a
register that has the PTR_MAYBE_NULL type flag set.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-16-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_obj_drop, which is the kfunc used to free allocated
objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it
implicitly destructs the fields part of object automatically without
user intervention.
Just like the previous patch, btf_struct_meta that is needed to free up
the special fields is passed as a hidden argument to the kfunc.
For the user, a convenience macro hides over the kernel side kfunc which
is named bpf_obj_drop_impl.
Continuing the previous example:
void prog(void) {
struct foo *f;
f = bpf_obj_new(typeof(*f));
if (!f)
return;
bpf_obj_drop(f);
}
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce type safe memory allocator bpf_obj_new for BPF programs. The
kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
to kfuncs still requires having them in prototype, unlike BPF helpers
which always take 5 arguments and have them checked using bpf_func_proto
in verifier, ignoring unset argument types.
Introduce __ign suffix to ignore a specific kfunc argument during type
checks, then use this to introduce support for passing type metadata to
the bpf_obj_new_impl kfunc.
The user passes BTF ID of the type it wants to allocates in program BTF,
the verifier then rewrites the first argument as the size of this type,
after performing some sanity checks (to ensure it exists and it is a
struct type).
The second argument is also fixed up and passed by the verifier. This is
the btf_struct_meta for the type being allocated. It would be needed
mostly for the offset array which is required for zero initializing
special fields while leaving the rest of storage in unitialized state.
It would also be needed in the next patch to perform proper destruction
of the object's special fields.
Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
using the any context BPF memory allocator introduced recently. To this
end, a global instance of the BPF memory allocator is initialized on
boot to be used for this purpose. This 'bpf_global_ma' serves all
allocations for bpf_obj_new. In the future, bpf_obj_new variants will
allow specifying a custom allocator.
Note that now that bpf_obj_new can be used to allocate objects that can
be linked to BPF linked list (when future linked list helpers are
available), we need to also free the elements using bpf_mem_free.
However, since the draining of elements is done outside the
bpf_spin_lock, we need to do migrate_disable around the call since
bpf_list_head_free can be called from map free path where migration is
enabled. Otherwise, when called from BPF programs migration is already
disabled.
A convenience macro is included in the bpf_experimental.h header to hide
over the ugly details of the implementation, leading to user code
looking similar to a language level extension which allocates and
constructs fields of a user type.
struct bar {
struct bpf_list_node node;
};
struct foo {
struct bpf_spin_lock lock;
struct bpf_list_head head __contains(bar, node);
};
void prog(void) {
struct foo *f;
f = bpf_obj_new(typeof(*f));
if (!f)
return;
...
}
A key piece of this story is still missing, i.e. the free function,
which will come in the next patch.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow passing known constant scalars as arguments to kfuncs that do not
represent a size parameter. We use mark_chain_precision for the constant
scalar argument to mark it precise. This makes the search pruning
optimization of verifier more conservative for such kfunc calls, and
each non-distinct argument is considered unequivalent.
We will use this support to then expose a bpf_obj_new function where it
takes the local type ID of a type in program BTF, and returns a
PTR_TO_BTF_ID | MEM_ALLOC to the local type, and allows programs to
allocate their own objects.
Each type ID resolves to a distinct type with a possibly distinct size,
hence the type ID constant matters in terms of program safety and its
precision needs to be checked between old and cur states inside regsafe.
The use of mark_chain_precision enables this.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As we continue to add more features, argument types, kfunc flags, and
different extensions to kfuncs, the code to verify the correctness of
the kfunc prototype wrt the passed in registers has become ad-hoc and
ugly to read. To make life easier, and make a very clear split between
different stages of argument processing, move all the code into
verifier.c and refactor into easier to read helpers and functions.
This also makes sharing code within the verifier easier with kfunc
argument processing. This will be more and more useful in later patches
as we are now moving to implement very core BPF helpers as kfuncs, to
keep them experimental before baking into UAPI.
Remove all kfunc related bits now from btf_check_func_arg_match, as
users have been converted away to refactored kfunc argument handling.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is no need to restrict users from locking bpf_spin_lock in map
values of inner maps. Each inner map lookup gets a unique reg->id
assigned to the returned PTR_TO_MAP_VALUE which will be preserved after
the NULL check. Distinct lookups into different inner map get unique
IDs, and distinct lookups into same inner map also get unique IDs.
Hence, lift the restriction by removing the check return -ENOTSUPP in
map_in_map.c. Later commits will add comprehensive test cases to ensure
that invalid cases are rejected.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Global variables reside in maps accessible using direct_value_addr
callbacks, so giving each load instruction's rewrite a unique reg->id
disallows us from holding locks which are global.
The reason for preserving reg->id as a unique value for registers that
may point to spin lock is that two separate lookups are treated as two
separate memory regions, and any possible aliasing is ignored for the
purposes of spin lock correctness.
This is not great especially for the global variable case, which are
served from maps that have max_entries == 1, i.e. they always lead to
map values pointing into the same map value.
So refactor the active_spin_lock into a 'active_lock' structure which
represents the lock identity, and instead of the reg->id, remember two
fields, a pointer and the reg->id. The pointer will store reg->map_ptr
or reg->btf. It's only necessary to distinguish for the id == 0 case of
global variables, but always setting the pointer to a non-NULL value and
using the pointer to check whether the lock is held simplifies code in
the verifier.
This is generic enough to allow it for global variables, map lookups,
and allocated objects at the same time.
Note that while whether a lock is held can be answered by just comparing
active_lock.ptr to NULL, to determine whether the register is pointing
to the same held lock requires comparing _both_ ptr and id.
Finally, as a result of this refactoring, pseudo load instructions are
not given a unique reg->id, as they are doing lookup for the same map
value (max_entries is never greater than 1).
Essentially, we consider that the tuple of (ptr, id) will always be
unique for any kind of argument to bpf_spin_{lock,unlock}.
Note that this can be extended in the future to also remember offset
used for locking, so that we can introduce multiple bpf_spin_lock fields
in the same allocation.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow locking a bpf_spin_lock in an allocated object, in addition to
already supported map value pointers. The handling is similar to that of
map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC
as well, and adjusting process_spin_lock to work with them and remember
the id in verifier state.
Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID |
MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the
reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to
preserve reg->id, that will be used in env->cur_state->active_spin_lock
to remember the currently held spin lock.
Also update the comment describing bpf_spin_lock implementation details
to also talk about PTR_TO_BTF_ID | MEM_ALLOC type.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ensure that there can be no ownership cycles among different types by
way of having owning objects that can hold some other type as their
element. For instance, a map value can only hold allocated objects, but
these are allowed to have another bpf_list_head. To prevent unbounded
recursion while freeing resources, elements of bpf_list_head in local
kptrs can never have a bpf_list_head which are part of list in a map
value. Later patches will verify this by having dedicated BTF selftests.
Also, to make runtime destruction easier, once btf_struct_metas is fully
populated, we can stash the metadata of the value type directly in the
metadata of the list_head fields, as that allows easier access to the
value type's layout to destruct it at runtime from the btf_field entry
of the list head itself.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in a
allocated object.
Also update btf_struct_access to reject direct access to these special
fields.
A bpf_list_head allows implementing map-in-map style use cases, where an
allocated object with bpf_list_head is linked into a list in a map
value. This would require embedding a bpf_list_node, support for which
is also included. The bpf_spin_lock is used to protect the bpf_list_head
and other data.
While we strictly don't require to hold a bpf_spin_lock while touching
the bpf_list_head in such objects, as when have access to it, we have
complete ownership of the object, the locking constraint is still kept
and may be conditionally lifted in the future.
Note that the specification of such types can be done just like map
values, e.g.:
struct bar {
struct bpf_list_node node;
};
struct foo {
struct bpf_spin_lock lock;
struct bpf_list_head head __contains(bar, node);
struct bpf_list_node node;
};
struct map_value {
struct bpf_spin_lock lock;
struct bpf_list_head head __contains(foo, node);
};
To recognize such types in user BTF, we build a btf_struct_metas array
of metadata items corresponding to each BTF ID. This is done once during
the btf_parse stage to avoid having to do it each time during the
verification process's requirement to inspect the metadata.
Moreover, the computed metadata needs to be passed to some helpers in
future patches which requires allocating them and storing them in the
BTF that is pinned by the program itself, so that valid access can be
assumed to such data during program runtime.
A key thing to note is that once a btf_struct_meta is available for a
type, both the btf_record and btf_field_offs should be available. It is
critical that btf_field_offs is available in case special fields are
present, as we extensively rely on special fields being zeroed out in
map values and allocated objects in later patches. The code ensures that
by bailing out in case of errors and ensuring both are available
together. If the record is not available, the special fields won't be
recognized, so not having both is also fine (in terms of being a
verification error and not a runtime bug).
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce support for representing pointers to objects allocated by the
BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF.
This is indicated by the presence of MEM_ALLOC type flag in reg->type to
avoid having to check btf_is_kernel when trying to match argument types
in helpers.
Whenever walking such types, any pointers being walked will always yield
a SCALAR instead of pointer. In the future we might permit kptr inside
such allocated objects (either kernel or program allocated), and it will
then form a PTR_TO_BTF_ID of the respective type.
For now, such allocated objects will always be referenced in verifier
context, hence ref_obj_id == 0 for them is a bug. It is allowed to write
to such objects, as long fields that are special are not touched
(support for which will be added in subsequent patches). Note that once
such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write
to it.
No PROBE_MEM handling is therefore done for loads into this type unless
PTR_UNTRUSTED is part of the register type, since they can never be in
an undefined state, and their lifetime will always be valid.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Far too much code simply assumes that both btf_record and btf_field_offs
are set to valid pointers together, or both are unset. They go together
hand in hand as btf_record describes the special fields and
btf_field_offs is compact representation for runtime copying/zeroing.
It is very difficult to make this clear in the code when the only
exception to this universal invariant is inner_map_meta which is used
as reg->map_ptr in the verifier. This is simply a bug waiting to happen,
as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE
is coming from an inner map, and if we ever end up using field_offs for
any reason in the future, we will silently ignore the special fields for
inner map case (as NULL is not an error but unset field_offs).
Hence, simply copy field_offs from inner map together with btf_record.
While at it, refactor code to unwind properly on errors with gotos.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Whenever btf_record_dup fails, we must free inner_map_meta that was
allocated before.
This fixes a memory leak (in case of errors) during inner map creation.
Fixes: aa3496accc ("bpf: Refactor kptr_off_tab into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since the commit being fixed, we now miss freeing btf_record for local
storage maps which will have a btf_record populated in case they have
bpf_spin_lock element.
This was missed because I made the choice of offloading the job to free
kptr_off_tab (now btf_record) to the map_free callback when adding
support for kptrs.
Revisiting the reason for this decision, there is the possibility that
the btf_record gets used inside map_free callback (e.g. in case of maps
embedding kptrs) to iterate over them and free them, hence doing it
before the map_free callback would be leaking special field memory, and
do invalid memory access. The btf_record keeps module references which
is critical to ensure the dtor call made for referenced kptr is safe to
do.
If doing it after map_free callback, the map area is already freed, so
we cannot access bpf_map structure anymore.
To fix this and prevent such lapses in future, move bpf_map_free_record
out of the map_free callback, and do it after map_free by remembering
the btf_record pointer. There is no need to access bpf_map structure in
that case, and we can avoid missing this case when support for new map
types is added for other special fields.
Since a btf_record and its btf_field_offs are used together, for
consistency delay freeing of field_offs as well. While not a problem
right now, a lot of code assumes that either both record and field_offs
are set or none at once.
Note that in case of map of maps (outer maps), inner_map_meta->record is
only used during verification, not to free fields in map value, hence we
simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
never touch map->inner_map_meta in bpf_map_free_deferred.
Add a comment making note of these details.
Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of returning directly with -EOPNOTSUPP for the timer case, we
need to free the btf_record before returning to userspace.
Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Entries in list 'tr->err_log' will be reused after entry number
exceed TRACING_LOG_ERRS_MAX.
The cmd string of the to be reused entry will be freed first then
allocated a new one. If the allocation failed, then the entry will
still be in list 'tr->err_log' but its 'cmd' field is set to be NULL,
later access of 'cmd' is risky.
Currently above problem can cause the loss of 'cmd' information of first
entry in 'tr->err_log'. When execute `cat /sys/kernel/tracing/error_log`,
reproduce logs like:
[ 37.495100] trace_kprobe: error: Maxactive is not for kprobe(null) ^
[ 38.412517] trace_kprobe: error: Maxactive is not for kprobe
Command: p4:myprobe2 do_sys_openat2
^
Link: https://lore.kernel.org/linux-trace-kernel/20221114104632.3547266-1-zhengyejian1@huawei.com
Fixes: 1581a884b7 ("tracing: Remove size restriction on tracing_log_err cmd strings")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This has nothing to do with random.c and everything to do with stack
protectors. Yes, it uses randomness. But many things use randomness.
random.h and random.c are concerned with the generation of randomness,
not with each and every use. So move this function into the more
specific stackprotector.h file where it belongs.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
These cases were done with this Coccinelle:
@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- + E
- - E
)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- - E
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- - E
+ F
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- + E
+ F
- - E
)
And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Since the eprobe filter was defined based on the eprobe's trace event
itself, it doesn't work correctly. Use the original trace event of
the eprobe when making the filter so that the filter works correctly.
Without this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
[ 114.551550] event trace: Could not enable event sys_enter_openat
-bash: echo: write error: Invalid argument
With this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
# tail trace
cat-241 [000] ...1. 266.498449: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
cat-242 [000] ...1. 266.977640: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
Link: https://lore.kernel.org/all/166823166395.1385292.8931770640212414483.stgit@devnote3/
Fixes: 752be5c5c9 ("tracing/eprobe: Add eprobe filter support")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Tested-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In __unregister_kprobe_top(), if the currently unregistered probe has
post_handler but other child probes of the aggrprobe do not have
post_handler, the post_handler of the aggrprobe is cleared. If this is
a ftrace-based probe, there is a problem. In later calls to
disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is
NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in
__disarm_kprobe_ftrace() and may even cause use-after-free:
Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2)
WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0
Modules linked in: testKprobe_007(-)
CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18
[...]
Call Trace:
<TASK>
__disable_kprobe+0xcd/0xe0
__unregister_kprobe_top+0x12/0x150
? mutex_lock+0xe/0x30
unregister_kprobes.part.23+0x31/0xa0
unregister_kprobe+0x32/0x40
__x64_sys_delete_module+0x15e/0x260
? do_user_addr_fault+0x2cd/0x6b0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...]
For the kprobe-on-ftrace case, we keep the post_handler setting to
identify this aggrprobe armed with kprobe_ipmodify_ops. This way we
can disarm it correctly.
Link: https://lore.kernel.org/all/20221112070000.35299-1-lihuafei1@huawei.com/
Fixes: 0bc11ed5ab ("kprobes: Allow kprobes coexist with livepatch")
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In rethook_alloc(), the variable rh is not freed or passed out
if handler is NULL, which could lead to a memleak, fix it.
Link: https://lore.kernel.org/all/20221110104438.88099-1-yiyang13@huawei.com/
[Masami: Add "rethook:" tag to the title.]
Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Acke-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>