2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-15 09:03:59 +08:00
Commit Graph

36315 Commits

Author SHA1 Message Date
Linus Torvalds
65c61de9d0 Summary of modules changes for the 5.13 merge window:
- Fix an age old bug involving jump_calls and static_labels when
   CONFIG_MODULE_UNLOAD=n. When CONFIG_MODULE_UNLOAD=n, it means you
   can't unload modules, so normally the __exit sections of a module are
   not loaded at all. However, dynamic code patching (jump_label,
   static_call, alternatives) can have sites in __exit sections even if
   __exit is never executed.
 
   Reported by Peter Zijlstra: "Alternatives, jump_labels and static_call
   all can have relocations into __exit code.  Not loading it at all would
   be BAD." Therefore, load the __exit sections even when
   CONFIG_MODULE_UNLOAD=n, and discard them after init.
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEVrp26glSWYuDNrCUwEV+OM47wXIFAmCKVqUQHGpleXVAa2Vy
 bmVsLm9yZwAKCRDARX44zjvBcjzUD/454XSMUia0ZuTBhkR21gjY9yz7nfed5R4c
 3ChyXEc6yfPPfO9OQRcVp+9UB/AGHvZM/+6mWHMpjbwv8mBW6gQ4xtvbVkVpQ0vB
 QXoqJZvILOcPtnY8ViTncoo3P+79ZsMWFniIRbCaZinmF017+peISRnlo+WppA+2
 NOOdmK1/oWIbZhxaurlqfzgR1UCDw31wK+djWu6+Ajy3YCvNWkDA4bVlXthTIrLe
 1r752f8eQw2wjVFi+sgN7wGW6JHCRRISYF01yy/VUsFTGRq6DL87nB6icNa8we+t
 Rs+PwWJsrb19VhRECegWVK6qJNfqpLR2QYyuu3JA6KM/YtuDJQcM5yHllt2UazjT
 daZQx6GKxIrNC9c8ttkllc+W6l2c3LyTyN0H8RnbUzpZ/8ZiQzXuIXLzEFCgd/7L
 5CByLgRTZ1n8c90K03xGN7px3ZOv4ugW4z4PdXPhfFmM6hl8u+GSk1tIBtXRG+VL
 uaRgToJF77vbL5WkIKd6keHY+mC8v2zu9CL20rJlmPC12/aXkU5O0RIp4ptYYL0I
 AB5UDyT5ciQiwNks/W8x7RNkAyPd22NRBQ1Bs6VYRTjJpjgiW3EcHUlXlFErnJGV
 LUNYVlQ5e7TxF87dTl78BUq3NHUJax/kOsMamWlcCrbgxXZkpdvutjHzsndZ9Qy+
 edbT0BWgyw==
 =zme0
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Fix an age old bug involving jump_calls and static_labels when
  CONFIG_MODULE_UNLOAD=n.

  When CONFIG_MODULE_UNLOAD=n, it means you can't unload modules, so
  normally the __exit sections of a module are not loaded at all.
  However, dynamic code patching (jump_label, static_call, alternatives)
  can have sites in __exit sections even if __exit is never executed.

  Reported by Peter Zijlstra:
     'Alternatives, jump_labels and static_call all can have relocations
      into __exit code. Not loading it at all would be BAD.'

  Therefore, load the __exit sections even when CONFIG_MODULE_UNLOAD=n,
  and discard them after init"

* tag 'modules-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD
2021-04-30 12:29:36 -07:00
Zqiang
e2b5bcf9f5 irq_work: record irq_work_queue() call stack
Add the irq_work_queue() call stack into the KASAN auxiliary stack in
order to improve KASAN reports.  this will let us know where the irq work
be queued.

Link: https://lkml.kernel.org/r/20210331063202.28770-1-qiang.zhang@windriver.com
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:42 -07:00
Walter Wu
23f61f0fe1 kasan: record task_work_add() call stack
Why record task_work_add() call stack?  Syzbot reports many use-after-free
issues for task_work, see [1].  After seeing the free stack and the
current auxiliary stack, we think they are useless, we don't know where
the work was registered.  This work may be the free call stack, so we miss
the root cause and don't solve the use-after-free.

Add the task_work_add() call stack into the KASAN auxiliary stack in order
to improve KASAN reports.  It helps programmers solve use-after-free
issues.

[1]: https://groups.google.com/g/syzkaller-bugs/search?q=kasan%20use-after-free%20task_work_run

Link: https://lkml.kernel.org/r/20210316024410.19967-1-walter-zh.wu@mediatek.com
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:42 -07:00
Nicholas Piggin
e82b9b3086 kernel/dma: remove unnecessary unmap_kernel_range
vunmap will remove ptes.

Link: https://lkml.kernel.org/r/20210322021806.892164-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Cédric Le Goater <clg@kaod.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Johannes Weiner
dc26532aed cgroup: rstat: punt root-level optimization to individual controllers
Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level.  This optimization is currently
implemented inside the generic rstat code, which doesn't track the root
cgroup and doesn't invoke the subsystem flush callbacks on it.

However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including at
the root level.  In preparation for the memory controller switching to
rstat, move the optimization from rstat core to the controllers.

Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.

This is the case for the io controller and the cgroup base stats.  In
their respective flush callbacks, check whether the parent is the root
cgroup, and if so, skip the unnecessary upward propagation.

The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root.  The
queueing for a flush touches only per-cpu data, and only the first stat
change since a flush requires a (per-cpu) lock.

Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:37 -07:00
Johannes Weiner
a7df69b81a cgroup: rstat: support cgroup1
Rstat currently only supports the default hierarchy in cgroup2.  In
order to replace memcg's private stats infrastructure - used in both
cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.

The initialization and destruction callbacks for regular cgroups are
already in place.  Remove the cgroup_on_dfl() guards to handle cgroup1.

The initialization of the root cgroup is currently hardcoded to only
handle cgrp_dfl_root.cgrp.  Move those callbacks to cgroup_setup_root()
and cgroup_destroy_root() to handle the default root as well as the
various cgroup1 roots we may set up during mounting.

The linking of css to cgroups happens in code shared between cgroup1 and
cgroup2 as well.  Simply remove the cgroup_on_dfl() guard.

Linkage of the root css to the root cgroup is a bit trickier: per
default, the root css of a subsystem controller belongs to the default
hierarchy (i.e.  the cgroup2 root).  When a controller is mounted in its
cgroup1 version, the root css is stolen and moved to the cgroup1 root;
on unmount, the css moves back to the default hierarchy.  Annotate
rebind_subsystems() to move the root css linkage along between roots.

Link: https://lkml.kernel.org/r/20210209163304.77088-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:37 -07:00
Muchun Song
27faca83a7 mm: memcontrol: fix kernel stack account
For simplification commit 991e767385 ("mm: memcontrol: account kernel
stack per node") changed the per zone vmalloc backed stack pages
accounting to per node.

By doing that we have lost a certain precision because those pages might
live in different NUMA nodes.  In the end NR_KERNEL_STACK_KB exported to
the userspace might be over estimated on some nodes while underestimated
on others.  But this is not a real world problem, just a problem found
by reading the code.  So there is no actual data to showing how much
impact it has on users.

This doesn't impose any real problem to correctnes of the kernel
behavior as the counter is not used for any internal processing but it
can cause some confusion to the userspace.

Address the problem by accounting each vmalloc backing page to its own
node.

Link: https://lkml.kernel.org/r/20210303151843.81156-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:37 -07:00
Petr Mladek
9bf3bc949f watchdog: cleanup handling of false positives
Commit d6ad3e286d ("softlockup: Add sched_clock_tick() to avoid kernel
warning on kgdb resume") introduced touch_softlockup_watchdog_sync().

It solved a problem when the watchdog was touched in an atomic context,
the timer callback was proceed right after releasing interrupts, and the
local clock has not been updated yet.  In this case, sched_clock_tick()
was called in watchdog_timer_fn() before updating the timer.

So far so good.

Later commit 5d1c0f4a80 ("watchdog: add check for suspended vm in
softlockup detector") added two kvm_check_and_clear_guest_paused()
calls.  They touch the watchdog when the guest has been sleeping.

The code makes my head spin around.

Scenario 1:

    + guest did sleep:
	+ PVCLOCK_GUEST_STOPPED is set

    + 1st watchdog_timer_fn() invocation:
	+ the watchdog is not touched yet
	+ is_softlockup() returns too big delay
	+ kvm_check_and_clear_guest_paused():
	   + clear PVCLOCK_GUEST_STOPPED
	   + call touch_softlockup_watchdog_sync()
		+ set SOFTLOCKUP_DELAY_REPORT
		+ set softlockup_touch_sync
	+ return from the timer callback

      + 2nd watchdog_timer_fn() invocation:

	+ call sched_clock_tick() even though it is not needed.
	  The timer callback was invoked again only because the clock
	  has already been updated in the meantime.

	+ call kvm_check_and_clear_guest_paused() that does nothing
	  because PVCLOCK_GUEST_STOPPED has been cleared already.

	+ call update_report_ts() and return. This is fine. Except
	  that sched_clock_tick() might allow to set it already
	  during the 1st invocation.

Scenario 2:

	+ guest did sleep

	+ 1st watchdog_timer_fn() invocation
	    + same as in 1st scenario

	+ guest did sleep again:
	    + set PVCLOCK_GUEST_STOPPED again

	+ 2nd watchdog_timer_fn() invocation
	    + SOFTLOCKUP_DELAY_REPORT is set from 1st invocation
	    + call sched_clock_tick()
	    + call kvm_check_and_clear_guest_paused()
		+ clear PVCLOCK_GUEST_STOPPED
		+ call touch_softlockup_watchdog_sync()
		    + set SOFTLOCKUP_DELAY_REPORT
		    + set softlockup_touch_sync
	    + call update_report_ts() (set real timestamp immediately)
	    + return from the timer callback

	+ 3rd watchdog_timer_fn() invocation
	    + timestamp is set from 2nd invocation
	    + softlockup_touch_sync is set but not checked because
	      the real timestamp is already set

Make the code more straightforward:

1. Always call kvm_check_and_clear_guest_paused() at the very
   beginning to handle PVCLOCK_GUEST_STOPPED. It touches the watchdog
   when the quest did sleep.

2. Handle the situation when the watchdog has been touched
   (SOFTLOCKUP_DELAY_REPORT is set).

   Call sched_clock_tick() when touch_*sync() variant was used. It makes
   sure that the timestamp will be up to date even when it has been
   touched in atomic context or quest did sleep.

As a result, kvm_check_and_clear_guest_paused() is called on a single
location.  And the right timestamp is always set when returning from the
timer callback.

Link: https://lkml.kernel.org/r/20210311122130.6788-7-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:36 -07:00
Petr Mladek
9f113bf760 watchdog: fix barriers when printing backtraces from all CPUs
Any parallel softlockup reports are skipped when one CPU is already
printing backtraces from all CPUs.

The exclusive rights are synchronized using one bit in
soft_lockup_nmi_warn.  There is also one memory barrier that does not make
much sense.

Use two barriers on the right location to prevent mixing two reports.

[pmladek@suse.com: use bit lock operations to prevent multiple soft-lockup reports]
  Link: https://lkml.kernel.org/r/YFSVsLGVWMXTvlbk@alley

Link: https://lkml.kernel.org/r/20210311122130.6788-6-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:36 -07:00
Petr Mladek
1bc503cb4a watchdog/softlockup: remove logic that tried to prevent repeated reports
The softlockup detector does some gymnastic with the variable
soft_watchdog_warn.  It was added by the commit 58687acba5
("lockup_detector: Combine nmi_watchdog and softlockup detector").

The purpose is not completely clear.  There are the following clues.  They
describe the situation how it looked after the above mentioned commit:

  1. The variable was checked with a comment "only warn once".

  2. The variable was set when softlockup was reported. It was cleared
     only when the CPU was not longer in the softlockup state.

  3. watchdog_touch_ts was not explicitly updated when the softlockup
     was reported. Without this variable, the report would normally
     be printed again during every following watchdog_timer_fn()
     invocation.

The logic has got even more tangled up by the commit ed235875e2
("kernel/watchdog.c: print traces for all cpus on lockup detection").
After this commit, soft_watchdog_warn is set only when
softlockup_all_cpu_backtrace is enabled.  But multiple reports from all
CPUs are prevented by a new variable soft_lockup_nmi_warn.

Conclusion:

The variable probably never worked as intended.  In each case, it has not
worked last many years because the softlockup was reported repeatedly
after the full period defined by watchdog_thresh.

The reason is that watchdog gets touched in many known slow paths, for
example, in printk_stack_address().  This code is called also when
printing the softlockup report.  It means that the watchdog timestamp gets
updated after each report.

Solution:

Simply remove the logic. People want the periodic report anyway.

Link: https://lkml.kernel.org/r/20210311122130.6788-5-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:36 -07:00
Petr Mladek
fef06efc2e watchdog/softlockup: report the overall time of softlockups
The softlockup detector currently shows the time spent since the last
report.  As a result it is not clear whether a CPU is infinitely hogged by
a single task or if it is a repeated event.

The situation can be simulated with a simply busy loop:

	while (true)
	      cpu_relax();

The softlockup detector produces:

[  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]

But it should be, something like:

[  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
[  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
[  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
[  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]

For the better output, add an additional timestamp of the last report.
Only this timestamp is reset when the watchdog is intentionally touched
from slow code paths or when printing the report.

Link: https://lkml.kernel.org/r/20210311122130.6788-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:36 -07:00
Petr Mladek
c9ad17c991 watchdog: explicitly update timestamp when reporting softlockup
The softlockup situation might stay for a long time or even forever.  When
it happens, the softlockup debug messages are printed in regular intervals
defined by get_softlockup_thresh().

There is a mystery.  The repeated message is printed after the full
interval that is defined by get_softlockup_thresh().  But the timer
callback is called more often as defined by sample_period.  The code looks
like the soflockup should get reported in every sample_period when it was
once behind the thresh.

It works only by chance.  The watchdog is touched when printing the stall
report, for example, in printk_stack_address().

Make the behavior clear and predictable by explicitly updating the
timestamp in watchdog_timer_fn() when the report gets printed.

Link: https://lkml.kernel.org/r/20210311122130.6788-3-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Petr Mladek
7c0012f522 watchdog: rename __touch_watchdog() to a better descriptive name
Patch series "watchdog/softlockup: Report overall time and some cleanup", v2.

I dug deep into the softlockup watchdog history when time permitted this
year.  And reworked the patchset that fixed timestamps and cleaned up the
code[2].

I split it into very small steps and did even more code clean up.  The
result looks quite strightforward and I am pretty confident with the
changes.

[1] v2: https://lore.kernel.org/r/20201210160038.31441-1-pmladek@suse.com
[2] v1: https://lore.kernel.org/r/20191024114928.15377-1-pmladek@suse.com

This patch (of 6):

There are many touch_*watchdog() functions.  They are called in situations
where the watchdog could report false positives or create unnecessary
noise.  For example, when CPU is entering idle mode, a virtual machine is
stopped, or a lot of messages are printed in the atomic context.

These functions set SOFTLOCKUP_RESET instead of a real timestamp.  It
allows to call them even in a context where jiffies might be outdated.
For example, in an atomic context.

The real timestamp is set by __touch_watchdog() that is called from the
watchdog timer callback.

Rename this callback to update_touch_ts().  It better describes the effect
and clearly distinguish is from the other touch_*watchdog() functions.

Another motivation is that two timestamps are going to be used.  One will
be used for the total softlockup time.  The other will be used to measure
time since the last report.  The new function name will help to
distinguish which timestamp is being updated.

Link: https://lkml.kernel.org/r/20210311122130.6788-1-pmladek@suse.com
Link: https://lkml.kernel.org/r/20210311122130.6788-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Steven Rostedt (VMware)
aafe104aa9 tracing: Restructure trace_clock_global() to never block
It was reported that a fix to the ring buffer recursion detection would
cause a hung machine when performing suspend / resume testing. The
following backtrace was extracted from debugging that case:

Call Trace:
 trace_clock_global+0x91/0xa0
 __rb_reserve_next+0x237/0x460
 ring_buffer_lock_reserve+0x12a/0x3f0
 trace_buffer_lock_reserve+0x10/0x50
 __trace_graph_return+0x1f/0x80
 trace_graph_return+0xb7/0xf0
 ? trace_clock_global+0x91/0xa0
 ftrace_return_to_handler+0x8b/0xf0
 ? pv_hash+0xa0/0xa0
 return_to_handler+0x15/0x30
 ? ftrace_graph_caller+0xa0/0xa0
 ? trace_clock_global+0x91/0xa0
 ? __rb_reserve_next+0x237/0x460
 ? ring_buffer_lock_reserve+0x12a/0x3f0
 ? trace_event_buffer_lock_reserve+0x3c/0x120
 ? trace_event_buffer_reserve+0x6b/0xc0
 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0
 ? dpm_run_callback+0x3b/0xc0
 ? pm_ops_is_empty+0x50/0x50
 ? platform_get_irq_byname_optional+0x90/0x90
 ? trace_device_pm_callback_start+0x82/0xd0
 ? dpm_run_callback+0x49/0xc0

With the following RIP:

RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200

Since the fix to the recursion detection would allow a single recursion to
happen while tracing, this lead to the trace_clock_global() taking a spin
lock and then trying to take it again:

ring_buffer_lock_reserve() {
  trace_clock_global() {
    arch_spin_lock() {
      queued_spin_lock_slowpath() {
        /* lock taken */
        (something else gets traced by function graph tracer)
          ring_buffer_lock_reserve() {
            trace_clock_global() {
              arch_spin_lock() {
                queued_spin_lock_slowpath() {
                /* DEAD LOCK! */

Tracing should *never* block, as it can lead to strange lockups like the
above.

Restructure the trace_clock_global() code to instead of simply taking a
lock to update the recorded "prev_time" simply use it, as two events
happening on two different CPUs that calls this at the same time, really
doesn't matter which one goes first. Use a trylock to grab the lock for
updating the prev_time, and if it fails, simply try again the next time.
If it failed to be taken, that means something else is already updating
it.

Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home

Cc: stable@vger.kernel.org
Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Fixes: b02414c8f0 ("ring-buffer: Fix recursion protection transitions between interrupt context") # started showing the problem
Fixes: 14131f2f98 ("tracing: implement trace_clock_*() APIs") # where the bug happened
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212761
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-30 13:48:07 -04:00
Linus Torvalds
8ca5297e7e Kconfig updates for v5.13
- Change 'option defconfig' to the environment variable
    KCONFIG_DEFCONFIG_LIST
 
  - Refactor tinyconfig without using allnoconfig_y
 
  - Remove 'option allnoconfig_y' syntax
 
  - Change 'option modules' to 'modules'
 
  - Do not use /boot/config-* etc. as base config for cross-compilation
 
  - Fix a search bug in nconf
 
  - Various code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmCKTy8VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGLFkQAJFaFORoOIGvkErYkTNv64LpDZsB
 ck7xV6gAUB0iSfv6x5mKfbZRWllc0GMr0dNY2hKs0iazvrvm3OKheLNR6zQ7OwI4
 aPd46lD7Dpvl09iNJcAAwVkwuqAcISKKk8wBhTsdFNx6A+ouPxNPWZHics5SqT14
 jw6YGkI/MJaDx74izRlDKOiBlrpq1gM9pyAud2gHyWfksxu9E2JQ2guao/UpB0I7
 XmCC8HzDdMP637gvA0cMj/+thW0/6ws8ev0bwhHTNFnB1F+N5Aop1urWnwTQKIoy
 WatTUfvhikaZPbJUBxOA21xbmhN4NnBxICXcmsFRLxYIsaZJY1UOk5hDZ1MvptnB
 jnOKUH52yqWeHMvBLqdsxSxktUawg3U85v5ygtYOUUJmuyhkP5nz3095eeFXS/J6
 3KZAnSfRubb2XbfZMG0YUUtVoi782Mv0OvdRbyvON/TsXFP8T1skKUtCaDaXm31Z
 ApjIs1xViuuTXfRqmk7vmjTn0oWIhRahnS49Wl1Ro00JH9VjBJz7N3T+rJ5naY2B
 GOCM2oTWh/qMW5makFCQNFEsaSr5HBsueepRhUoUOQcyJHQFuK/Cb+C4Rv2gp5ao
 3QYp2x49v0c+dkEmkmOW4LwUxjKUe573D3eVLcGnq+4MYouY7XGFWFKfUKYuPgCL
 aqVi/QHKNZpd5Wko
 =+eSh
 -----END PGP SIGNATURE-----

Merge tag 'kconfig-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kconfig updates from Masahiro Yamada:

 - Change 'option defconfig' to the environment variable
   KCONFIG_DEFCONFIG_LIST

 - Refactor tinyconfig without using allnoconfig_y

 - Remove 'option allnoconfig_y' syntax

 - Change 'option modules' to 'modules'

 - Do not use /boot/config-* etc. as base config for cross-compilation

 - Fix a search bug in nconf

 - Various code cleanups

* tag 'kconfig-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
  kconfig: refactor .gitignore
  kconfig: highlight xconfig 'comment' lines with '***'
  kconfig: highlight gconfig 'comment' lines with '***'
  kconfig: gconf: remove unused code
  kconfig: remove unused PACKAGE definition
  kconfig: nconf: stop endless search loops
  kconfig: split menu.c out of parser.y
  kconfig: nconf: refactor in print_in_middle()
  kconfig: nconf: remove meaningless wattrset() call from show_menu()
  kconfig: nconf: change set_config_filename() to void function
  kconfig: nconf: refactor attributes setup code
  kconfig: nconf: remove unneeded default for menu prompt
  kconfig: nconf: get rid of (void) casts from wattrset() calls
  kconfig: nconf: fix NORMAL attributes
  kconfig: mconf,nconf: remove unneeded '\0' termination after snprintf()
  kconfig: use /boot/config-* etc. as DEFCONFIG_LIST only for native build
  kconfig: change sym_change_count to a boolean flag
  kconfig: nconf: fix core dump when searching in empty menu
  kconfig: lxdialog: A spello fix and a punctuation added
  kconfig: streamline_config.pl: Couple of typo fixes
  ...
2021-04-29 14:32:00 -07:00
Linus Torvalds
b0030af53a Kbuild updates for v5.13
- Evaluate $(call cc-option,...) etc. only for build targets
 
  - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux
 
  - Remove unnecessary --gcc-toolchains Clang flag because the --prefix
    flag finds the toolchains
 
  - Do not pass Clang's --prefix flag when using the integrated as
 
  - Check the assembler version in Kconfig time
 
  - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up
    some dependencies in Kconfig
 
  - Fix invalid Module.symvers creation when building only modules without
    vmlinux
 
  - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is
    set, but there is no module to build
 
  - Refactor module installation Makefile
 
  - Support zstd for module compression
 
  - Convert alpha and ia64 to use generic shell scripts to generate the
    syscall headers
 
  - Add a new elfnote to indicate if the kernel was built with LTO, which
    will be used by pahole
 
  - Flatten the directory structure under include/config/ so CONFIG options
    and filenames match
 
  - Change the deb source package name from linux-$(KERNELRELEASE) to
    linux-upstream
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmCKOLUVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGdq8P/2z+saxIWGXVWt0ggavR0vimcY4e
 NQIKGu9uZpo/lfoC78UG8HO+XvzvPUrcRuOX+WIVr2GfScgVnweDukexUAY0/2oi
 4UvqhndJ0sjEwRj8mXXJ0O+PED+OtgrqrbhkLq9wHQd/jpSD4XEWXwn1g1XVrTZu
 WbwP6b1G/Rnjp2lz3HKC017rPkmfsCFQB7r+hbJGKhT0rCaceheUuBvGa/XqLknr
 IOyaUAY76u3Gtj6fVY1rk70kQgDMF8+LJPgdSSZ/XPCvbNJQAeop36EeRNfmxGIh
 vQhFJRJeqy+K5MhCpdGtTGYDawlmQVn/f/99SkDw9F04S4ZL2Xnaaqw4L1QDhjTh
 xBlckbPvmq36F4xSqWd5kYF3iwS+LsEJROwZKFLEVDb3zMsRQPEGQM/556QmrBi2
 5KXzwOYEJKuobWr1hQ3PwLumJKTPGLvGEFB3Bq2eG8LrgpOAHPI4ejC2EBu0vCez
 QbskP2lPlMj3MbL5iZg+6ZRlOChZ7RUrSDj6+iTeOcinmXHqQONCL6qy+um4Rfcb
 zUkfwTlqM9d88u6AbO2VvQMOobMjvp4bvmqi/Xv8IiTukLHco4tc8zTuySmZwSyI
 rd3RKYn367qWztX5YyaoGRPVmlMG7ssbRc4fkXiV13vfeZebNfVwlX/CHv9+IWwN
 RVnMhYBhUZR68h6z
 =ti9L
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Evaluate $(call cc-option,...) etc. only for build targets

 - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux

 - Remove unnecessary --gcc-toolchains Clang flag because the --prefix
   flag finds the toolchains

 - Do not pass Clang's --prefix flag when using the integrated as

 - Check the assembler version in Kconfig time

 - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up
   some dependencies in Kconfig

 - Fix invalid Module.symvers creation when building only modules
   without vmlinux

 - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is
   set, but there is no module to build

 - Refactor module installation Makefile

 - Support zstd for module compression

 - Convert alpha and ia64 to use generic shell scripts to generate the
   syscall headers

 - Add a new elfnote to indicate if the kernel was built with LTO, which
   will be used by pahole

 - Flatten the directory structure under include/config/ so CONFIG
   options and filenames match

 - Change the deb source package name from linux-$(KERNELRELEASE) to
   linux-upstream

* tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (42 commits)
  kbuild: Add $(KBUILD_HOSTLDFLAGS) to 'has_libelf' test
  kbuild: deb-pkg: change the source package name to linux-upstream
  tools: do not include scripts/Kbuild.include
  kbuild: redo fake deps at include/config/*.h
  kbuild: remove TMPO from try-run
  MAINTAINERS: add pattern for dummy-tools
  kbuild: add an elfnote for whether vmlinux is built with lto
  ia64: syscalls: switch to generic syscallhdr.sh
  ia64: syscalls: switch to generic syscalltbl.sh
  alpha: syscalls: switch to generic syscallhdr.sh
  alpha: syscalls: switch to generic syscalltbl.sh
  sysctl: use min() helper for namecmp()
  kbuild: add support for zstd compressed modules
  kbuild: remove CONFIG_MODULE_COMPRESS
  kbuild: merge scripts/Makefile.modsign to scripts/Makefile.modinst
  kbuild: move module strip/compression code into scripts/Makefile.modinst
  kbuild: refactor scripts/Makefile.modinst
  kbuild: rename extmod-prefix to extmod_prefix
  kbuild: check module name conflict for external modules as well
  kbuild: show the target directory for depmod log
  ...
2021-04-29 14:24:39 -07:00
Linus Torvalds
9d31d23389 Networking changes for 5.13.
Core:
 
  - bpf:
 	- allow bpf programs calling kernel functions (initially to
 	  reuse TCP congestion control implementations)
 	- enable task local storage for tracing programs - remove the
 	  need to store per-task state in hash maps, and allow tracing
 	  programs access to task local storage previously added for
 	  BPF_LSM
 	- add bpf_for_each_map_elem() helper, allowing programs to
 	  walk all map elements in a more robust and easier to verify
 	  fashion
 	- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
 	  redirection
 	- lpm: add support for batched ops in LPM trie
 	- add BTF_KIND_FLOAT support - mostly to allow use of BTF
 	  on s390 which has floats in its headers files
 	- improve BPF syscall documentation and extend the use of kdoc
 	  parsing scripts we already employ for bpf-helpers
 	- libbpf, bpftool: support static linking of BPF ELF files
 	- improve support for encapsulation of L2 packets
 
  - xdp: restructure redirect actions to avoid a runtime lookup,
 	improving performance by 4-8% in microbenchmarks
 
  - xsk: build skb by page (aka generic zerocopy xmit) - improve
 	performance of software AF_XDP path by 33% for devices
 	which don't need headers in the linear skb part (e.g. virtio)
 
  - nexthop: resilient next-hop groups - improve path stability
 	on next-hops group changes (incl. offload for mlxsw)
 
  - ipv6: segment routing: add support for IPv4 decapsulation
 
  - icmp: add support for RFC 8335 extended PROBE messages
 
  - inet: use bigger hash table for IP ID generation
 
  - tcp: deal better with delayed TX completions - make sure we don't
 	give up on fast TCP retransmissions only because driver is
 	slow in reporting that it completed transmitting the original
 
  - tcp: reorder tcp_congestion_ops for better cache locality
 
  - mptcp:
 	- add sockopt support for common TCP options
 	- add support for common TCP msg flags
 	- include multiple address ids in RM_ADDR
 	- add reset option support for resetting one subflow
 
  - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
 	co-existence with UDP tunnel GRO, allowing the first to take
 	place correctly	even for encapsulated UDP traffic
 
  - micro-optimize dev_gro_receive() and flow dissection, avoid
 	retpoline overhead on VLAN and TEB GRO
 
  - use less memory for sysctls, add a new sysctl type, to allow using
 	u8 instead of "int" and "long" and shrink networking sysctls
 
  - veth: allow GRO without XDP - this allows aggregating UDP
 	packets before handing them off to routing, bridge, OvS, etc.
 
  - allow specifing ifindex when device is moved to another namespace
 
  - netfilter:
 	- nft_socket: add support for cgroupsv2
 	- nftables: add catch-all set element - special element used
 	  to define a default action in case normal lookup missed
 	- use net_generic infra in many modules to avoid allocating
 	  per-ns memory unnecessarily
 
  - xps: improve the xps handling to avoid potential out-of-bound
 	accesses and use-after-free when XPS change race with other
 	re-configuration under traffic
 
  - add a config knob to turn off per-cpu netdev refcnt to catch
 	underflows in testing
 
 Device APIs:
 
  - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and vendor-
    -independent APIs
 
  - ethtool:
 	- add interface for reading IEEE MIB stats (incl. mlx5 and
 	  bnxt support)
 	- allow network drivers to dump arbitrary SFP EEPROM data,
 	  current offset+length API was a poor fit for modern SFP
 	  which define EEPROM in terms of pages (incl. mlx5 support)
 
  - act_police, flow_offload: add support for packet-per-second
 	policing (incl. offload for nfp)
 
  - psample: add additional metadata attributes like transit delay
 	for packets sampled from switch HW (and corresponding egress
 	and policy-based sampling in the mlxsw driver)
 
  - dsa: improve support for sandwiched LAGs with bridge and DSA
 
  - netfilter:
 	- flowtable: use direct xmit in topologies with IP
 	  forwarding, bridging, vlans etc.
 	- nftables: counter hardware offload support
 
  - Bluetooth:
 	- improvements for firmware download w/ Intel devices
 	- add support for reading AOSP vendor capabilities
 	- add support for virtio transport driver
 
  - mac80211:
 	- allow concurrent monitor iface and ethernet rx decap
 	- set priority and queue mapping for injected frames
 
  - phy: add support for Clause-45 PHY Loopback
 
  - pci/iov: add sysfs MSI-X vector assignment interface
 	to distribute MSI-X resources to VFs (incl. mlx5 support)
 
 New hardware/drivers:
 
  - dsa: mv88e6xxx: add support for Marvell mv88e6393x -
 	11-port Ethernet switch with 8x 1-Gigabit Ethernet
 	and 3x 10-Gigabit interfaces.
 
  - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
 	and BCM63xx switches
 
  - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
 
  - ath11k: support for QCN9074 a 802.11ax device
 
  - Bluetooth: Broadcom BCM4330 and BMC4334
 
  - phy: Marvell 88X2222 transceiver support
 
  - mdio: add BCM6368 MDIO mux bus controller
 
  - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
 
  - mana: driver for Microsoft Azure Network Adapter (MANA)
 
  - Actions Semi Owl Ethernet MAC
 
  - can: driver for ETAS ES58X CAN/USB interfaces
 
 Pure driver changes:
 
  - add XDP support to: enetc, igc, stmmac
  - add AF_XDP support to: stmmac
 
  - virtio:
 	- page_to_skb() use build_skb when there's sufficient tailroom
 	  (21% improvement for 1000B UDP frames)
 	- support XDP even without dedicated Tx queues - share the Tx
 	  queues with the stack when necessary
 
  - mlx5:
 	- flow rules: add support for mirroring with conntrack,
 	  matching on ICMP, GTP, flex filters and more
 	- support packet sampling with flow offloads
 	- persist uplink representor netdev across eswitch mode
 	  changes
 	- allow coexistence of CQE compression and HW time-stamping
 	- add ethtool extended link error state reporting
 
  - ice, iavf: support flow filters, UDP Segmentation Offload
 
  - dpaa2-switch:
 	- move the driver out of staging
 	- add spanning tree (STP) support
 	- add rx copybreak support
 	- add tc flower hardware offload on ingress traffic
 
  - ionic:
 	- implement Rx page reuse
 	- support HW PTP time-stamping
 
  - octeon: support TC hardware offloads - flower matching on ingress
 	and egress ratelimitting.
 
  - stmmac:
 	- add RX frame steering based on VLAN priority in tc flower
 	- support frame preemption (FPE)
 	- intel: add cross time-stamping freq difference adjustment
 
  - ocelot:
 	- support forwarding of MRP frames in HW
 	- support multiple bridges
 	- support PTP Sync one-step timestamping
 
  - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
 	learning, flooding etc.
 
  - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
 	SC7280 SoCs)
 
  - mt7601u: enable TDLS support
 
  - mt76:
 	- add support for 802.3 rx frames (mt7915/mt7615)
 	- mt7915 flash pre-calibration support
 	- mt7921/mt7663 runtime power management fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
 Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
 5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
 Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
 q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
 I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
 wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
 7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
 zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
 rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
 pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
 B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
 =vcbA
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - bpf:
        - allow bpf programs calling kernel functions (initially to
          reuse TCP congestion control implementations)
        - enable task local storage for tracing programs - remove the
          need to store per-task state in hash maps, and allow tracing
          programs access to task local storage previously added for
          BPF_LSM
        - add bpf_for_each_map_elem() helper, allowing programs to walk
          all map elements in a more robust and easier to verify fashion
        - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
          redirection
        - lpm: add support for batched ops in LPM trie
        - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
          s390 which has floats in its headers files
        - improve BPF syscall documentation and extend the use of kdoc
          parsing scripts we already employ for bpf-helpers
        - libbpf, bpftool: support static linking of BPF ELF files
        - improve support for encapsulation of L2 packets

   - xdp: restructure redirect actions to avoid a runtime lookup,
     improving performance by 4-8% in microbenchmarks

   - xsk: build skb by page (aka generic zerocopy xmit) - improve
     performance of software AF_XDP path by 33% for devices which don't
     need headers in the linear skb part (e.g. virtio)

   - nexthop: resilient next-hop groups - improve path stability on
     next-hops group changes (incl. offload for mlxsw)

   - ipv6: segment routing: add support for IPv4 decapsulation

   - icmp: add support for RFC 8335 extended PROBE messages

   - inet: use bigger hash table for IP ID generation

   - tcp: deal better with delayed TX completions - make sure we don't
     give up on fast TCP retransmissions only because driver is slow in
     reporting that it completed transmitting the original

   - tcp: reorder tcp_congestion_ops for better cache locality

   - mptcp:
        - add sockopt support for common TCP options
        - add support for common TCP msg flags
        - include multiple address ids in RM_ADDR
        - add reset option support for resetting one subflow

   - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
     co-existence with UDP tunnel GRO, allowing the first to take place
     correctly even for encapsulated UDP traffic

   - micro-optimize dev_gro_receive() and flow dissection, avoid
     retpoline overhead on VLAN and TEB GRO

   - use less memory for sysctls, add a new sysctl type, to allow using
     u8 instead of "int" and "long" and shrink networking sysctls

   - veth: allow GRO without XDP - this allows aggregating UDP packets
     before handing them off to routing, bridge, OvS, etc.

   - allow specifing ifindex when device is moved to another namespace

   - netfilter:
        - nft_socket: add support for cgroupsv2
        - nftables: add catch-all set element - special element used to
          define a default action in case normal lookup missed
        - use net_generic infra in many modules to avoid allocating
          per-ns memory unnecessarily

   - xps: improve the xps handling to avoid potential out-of-bound
     accesses and use-after-free when XPS change race with other
     re-configuration under traffic

   - add a config knob to turn off per-cpu netdev refcnt to catch
     underflows in testing

  Device APIs:

   - add WWAN subsystem to organize the WWAN interfaces better and
     hopefully start driving towards more unified and vendor-
     independent APIs

   - ethtool:
        - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
          support)
        - allow network drivers to dump arbitrary SFP EEPROM data,
          current offset+length API was a poor fit for modern SFP which
          define EEPROM in terms of pages (incl. mlx5 support)

   - act_police, flow_offload: add support for packet-per-second
     policing (incl. offload for nfp)

   - psample: add additional metadata attributes like transit delay for
     packets sampled from switch HW (and corresponding egress and
     policy-based sampling in the mlxsw driver)

   - dsa: improve support for sandwiched LAGs with bridge and DSA

   - netfilter:
        - flowtable: use direct xmit in topologies with IP forwarding,
          bridging, vlans etc.
        - nftables: counter hardware offload support

   - Bluetooth:
        - improvements for firmware download w/ Intel devices
        - add support for reading AOSP vendor capabilities
        - add support for virtio transport driver

   - mac80211:
        - allow concurrent monitor iface and ethernet rx decap
        - set priority and queue mapping for injected frames

   - phy: add support for Clause-45 PHY Loopback

   - pci/iov: add sysfs MSI-X vector assignment interface to distribute
     MSI-X resources to VFs (incl. mlx5 support)

  New hardware/drivers:

   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
     Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
     interfaces.

   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
     BCM63xx switches

   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

   - ath11k: support for QCN9074 a 802.11ax device

   - Bluetooth: Broadcom BCM4330 and BMC4334

   - phy: Marvell 88X2222 transceiver support

   - mdio: add BCM6368 MDIO mux bus controller

   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

   - mana: driver for Microsoft Azure Network Adapter (MANA)

   - Actions Semi Owl Ethernet MAC

   - can: driver for ETAS ES58X CAN/USB interfaces

  Pure driver changes:

   - add XDP support to: enetc, igc, stmmac

   - add AF_XDP support to: stmmac

   - virtio:
        - page_to_skb() use build_skb when there's sufficient tailroom
          (21% improvement for 1000B UDP frames)
        - support XDP even without dedicated Tx queues - share the Tx
          queues with the stack when necessary

   - mlx5:
        - flow rules: add support for mirroring with conntrack, matching
          on ICMP, GTP, flex filters and more
        - support packet sampling with flow offloads
        - persist uplink representor netdev across eswitch mode changes
        - allow coexistence of CQE compression and HW time-stamping
        - add ethtool extended link error state reporting

   - ice, iavf: support flow filters, UDP Segmentation Offload

   - dpaa2-switch:
        - move the driver out of staging
        - add spanning tree (STP) support
        - add rx copybreak support
        - add tc flower hardware offload on ingress traffic

   - ionic:
        - implement Rx page reuse
        - support HW PTP time-stamping

   - octeon: support TC hardware offloads - flower matching on ingress
     and egress ratelimitting.

   - stmmac:
        - add RX frame steering based on VLAN priority in tc flower
        - support frame preemption (FPE)
        - intel: add cross time-stamping freq difference adjustment

   - ocelot:
        - support forwarding of MRP frames in HW
        - support multiple bridges
        - support PTP Sync one-step timestamping

   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
     learning, flooding etc.

   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
     SC7280 SoCs)

   - mt7601u: enable TDLS support

   - mt76:
        - add support for 802.3 rx frames (mt7915/mt7615)
        - mt7915 flash pre-calibration support
        - mt7921/mt7663 runtime power management fixes"

* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ...
2021-04-29 11:57:23 -07:00
Christoph Hellwig
dfc06b389a swiotlb: don't override user specified size in swiotlb_adjust_size
If the user already specified a swiotlb size on the command line,
swiotlb_adjust_size should not overwrite it.

Fixes: 2cbc2776ef ("swiotlb: remove swiotlb_nr_tbl")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-04-29 18:46:06 +00:00
Linus Torvalds
635de956a7 The x86 MM changes in this cycle were:
- Implement concurrent TLB flushes, which overlaps the local TLB flush with the
    remote TLB flush. In testing this improved sysbench performance measurably by
    a couple of percentage points, especially if TLB-heavy security mitigations
    are active.
 
  - Further micro-optimizations to improve the performance of TLB flushes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCKbNcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hjYBAAsyNUa/gOu0g6/Cx8R86w9HtHHmm5vso/
 6nJjWj2fd2qJ9JShlddxvXEMeXtPTYabVWQkiiriFMuofk6JeKnlHm1Jzl6keABX
 OQFwjIFeNASPRcdXvuuYPOVWAJJdr2oL9QUr6OOK1ccQJTz/Cd0zA+VQ5YqcsCon
 yaWbkxELwKXpgql+qt66eAZ6Q2Y1TKXyrTW7ZgxQi0yeeWqMaEOub0/oyS7Ax1Rg
 qEJMwm1prb76NPzeqR/G3e4KTrDZfQ/B/KnSsz36GTJpl4eye6XqWDUgm1nAGNIc
 5dbc4Vx7JtZsUOuC0AmzWb3hsDyzVcN/lQvijdZ2RsYR3gvuYGaBhKqExqV0XH6P
 oqaWOKWCz+LqWbsgJmxCpqkt1LZl5+VUOcfJ97WkIS7DyIPtSHTzQXbBMZqKLeat
 mn5UcKYB2Gi7wsUPv6VC2ChKbDqN0VT8G86XbYylGo4BE46KoZKPUNY/QWKLUPd6
 0UKcVeNM2HFyf1C73p/tO/z7hzu3qLuMMnsphP6/c2pKLpdgawEXgbnVKNId1B/c
 NrzyhTvVaMt+Um28bBRhHONIlzPJwWcnZbdY7NqMnu+LBKQ68cL/h4FOIV/RDLNb
 GJLgfAr8fIw/zIpqYuFHiiMNo9wWqVtZko1MvXhGceXUL69QuzTra2XR/6aDxkPf
 6gQVesetTvo=
 =3Cyp
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tlb updates from Ingo Molnar:
 "The x86 MM changes in this cycle were:

   - Implement concurrent TLB flushes, which overlaps the local TLB
     flush with the remote TLB flush.

     In testing this improved sysbench performance measurably by a
     couple of percentage points, especially if TLB-heavy security
     mitigations are active.

   - Further micro-optimizations to improve the performance of TLB
     flushes"

* tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Micro-optimize smp_call_function_many_cond()
  smp: Inline on_each_cpu_cond() and on_each_cpu()
  x86/mm/tlb: Remove unnecessary uses of the inline keyword
  cpumask: Mark functions as pure
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  smp: Run functions concurrently in smp_call_function_many_cond()
2021-04-29 11:41:43 -07:00
Linus Torvalds
3644286f6c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJUfIACgkQnJ2qBz9k
 QNkStAf8CA7beya7LZ/GGN7HzXhv2cs+IpUFhRkynLklEM0lxKsOEagLFSZxkoMD
 IBSRSo4odkkderqI9W/yp+9OYhOd9+BQCq4isg1Gh9Tf5xANJEpLvBAPnWVhooJs
 9CrYZQY9Bdf+fF/8GHbKlrMAYm56vBCmWqyWTEtWUyPBOA12in2ZHQJmCa+5+nge
 zTT/B5cvuhN5K7uYhGM4YfeCU5DBmmvD4sV6YBTkQOgCU0bEF0f9R3JjHDo34a1s
 yqna3ypqKNRhsJVs8F+aOGRieUYxFoRqtYNHZK3qI9i07v7ndoTm5jzGN6OFlKs3
 U3rF9/+cBgeESahWG6IjHIqhXGXNhg==
 =KjNm
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - support for limited fanotify functionality for unpriviledged users

 - faster merging of fanotify events

 - a few smaller fsnotify improvements

* tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  shmem: allow reporting fanotify events with file handles on tmpfs
  fs: introduce a wrapper uuid_to_fsid()
  fanotify_user: use upper_32_bits() to verify mask
  fanotify: support limited functionality for unprivileged users
  fanotify: configurable limits via sysfs
  fanotify: limit number of event merge attempts
  fsnotify: use hash table for faster events merge
  fanotify: mix event info and pid into merge key hash
  fanotify: reduce event objectid to 29-bit hash
  fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-29 11:06:13 -07:00
Linus Torvalds
767fcbc80f \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJU1UACgkQnJ2qBz9k
 QNk62AgAgp05OIXU/AgObb7DvSyI3ycwCV8PeWBpwD8yoDAh5x0tmT7vnJu974p6
 yHdnF7rr69ZzvbNCHLJ5kRykRlUao9W7cO5fdOW1uTpL7Ic60QuJMks/NfgVTHp1
 2zIQmBDerfn1/LTK8r2pPGcvtcjRcr7Ep4beN0Duw57lfVMJhjsNRPnBbXGBcp0r
 QzKk4/8V3DCZvOw+XNC3nto7avjvf+nU9sJmuh83546eqh0atjWivvO5aAlDOe6W
 rhBiLlmP0in5u2n1fYqzI1OQvtgtleyEZT2G0CrbAZn0xjmV/if9wl+3K6TOwDvR
 778xDEX7sZCaO/xkB+WK3hrd15ftKg==
 =0kYE
 -----END PGP SIGNATURE-----

Merge tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota, ext2, reiserfs updates from Jan Kara:

 - support for path (instead of device) based quotactl syscall
   (quotactl_path(2))

 - ext2 conversion to kmap_local()

 - other minor cleanups & fixes

* tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fs/reiserfs/journal.c: delete useless variables
  fs/ext2: Replace kmap() with kmap_local_page()
  ext2: Match up ext2_put_page() with ext2_dotdot() and ext2_find_entry()
  fs/ext2/: fix misspellings using codespell tool
  quota: report warning limits for realtime space quotas
  quota: wire up quotactl_path
  quota: Add mountpath based quota support
2021-04-29 10:51:29 -07:00
Linus Torvalds
625434dafd for-5.13/io_uring-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIRBUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjt5D/9de6zCaha6CyfIIPiU+crropQ2jPzO49cb
 WzcOCmdhSv0GtYlhdnIqCOo5p8mRDWJAEBU9upTDTCWOx9hwr5Ms0TCNQHxuQ/T0
 4Ll+/cMsOxeTypiykfMtOG9TEmYSria2vTJKLgpyaP4ohfJa3uT7r2NZ8NK/8T4t
 wwbJ+jCSKewelI1l0XD8k8LBU39FS/KRgLTdfYj/rCW3PWt/ZE2eSIYjZQvMCVOC
 3fIdgOOJAMQVQafz+YAeJd2E+/l5/8YcJVKpJMVtBNbqTHIjA4EsInZauy8TpBgW
 OzJ3I+XdF70qZM119tI/nXw3sb0e+UV0fRsIXLkOwTEBzowernrAtsEwAOP+qFKS
 2YnqSKOSjMO5d5Mpkz6T0MDMloU45jph88lUH0RoShVxGa7jv+TMOL6QU1oOyxc1
 +gPPbApQs9WtSZDHsTJ0xFLpol804UDQmwb38mHdzedDVSE7iip1jANkw6LEhKkJ
 Mlg60ZF1Z305G+cDhrbs02ZGVa+fzbrtXtLlTqZw8bNX9lBp0JLtDpzskjbnUmck
 6A04nfg+Eto5GvAn+FRBuOCPridLEk2K6ygko/gwQWsYCgqkCgRuqjlIQCSZy5iu
 jHEFixIXKn6eACf+YzLVxSLyEQrmFyDSypbN7LvzoKJYo/loy8Q1+42nGlrVC3zi
 +CB1NokPng==
 =ZJ8L
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Support for multi-shot mode for POLL requests

 - More efficient reference counting. This is shamelessly stolen from
   the mm side. Even though referencing is mostly single/dual user, the
   128 count was retained to keep the code the same. Maybe this
   should/could be made generic at some point.

 - Removal of the need to have a manager thread for each ring. The
   manager threads only job was checking and creating new io-threads as
   needed, instead we handle this from the queue path.

 - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this
   thread is "just" a regular application thread, so no need to restrict
   use of it anymore.

 - Cleanup of how internal async poll data lifetime is managed.

 - Fix for syzbot reported crash on SQPOLL cancelation.

 - Make buffer registration more like file registrations, which includes
   flexibility in avoiding full set unregistration and re-registration.

 - Fix for io-wq affinity setting.

 - Be a bit more defensive in task->pf_io_worker setup.

 - Various SQPOLL fixes.

 - Cleanup of SQPOLL creds handling.

 - Improvements to in-flight request tracking.

 - File registration cleanups.

 - Tons of cleanups and little fixes

* tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits)
  io_uring: maintain drain logic for multishot poll requests
  io_uring: Check current->io_uring in io_uring_cancel_sqpoll
  io_uring: fix NULL reg-buffer
  io_uring: simplify SQPOLL cancellations
  io_uring: fix work_exit sqpoll cancellations
  io_uring: Fix uninitialized variable up.resv
  io_uring: fix invalid error check after malloc
  io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker
  kernel: always initialize task->pf_io_worker to NULL
  io_uring: update sq_thread_idle after ctx deleted
  io_uring: add full-fledged dynamic buffers support
  io_uring: implement fixed buffers registration similar to fixed files
  io_uring: prepare fixed rw for dynanic buffers
  io_uring: keep table of pointers to ubufs
  io_uring: add generic rsrc update with tags
  io_uring: add IORING_REGISTER_RSRC
  io_uring: enumerate dynamic resources
  io_uring: add generic path for rsrc update
  io_uring: preparation for rsrc tagging
  io_uring: decouple CQE filling from requests
  ...
2021-04-28 14:56:09 -07:00
Linus Torvalds
16b3d0cf5b Scheduler updates for this cycle are:
- Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces
    to a unified debugfs interface.
 
  - Signals: Allow caching one sigqueue object per task, to improve performance & latencies.
 
  - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups.
 
  - Improve energy-aware scheduling
 
  - Improve the PELT metrics for certain workloads
 
  - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous
    regressions
 
  - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This
    is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature,
    or the use of the resched_latency_warn_ms=xx boot parameter.
 
  - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining
    balance_push() vs. hotplug holes/races
 
  - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well
 
  - Fix/improve various load-balancing corner cases vs. capacity margins
 
  - Fix sched topology on systems with NUMA diameter of 3 or above
 
  - Fix PF_KTHREAD vs to_kthread() race
 
  - Minor rseq optimizations
 
  - Misc cleanups, optimizations, fixes and smaller updates
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJInsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i5XxAArh0b+fwXlkVGzTUly7HQjhU7lFbChnmF
 h6ToyNLi6pXoZ14VC/WoRIME+RzK3gmw9cEFaSLVPxbkbekTcyWS78kqmcg1/j2v
 kO/20QhXobiIxVskYfoMmqSavZ5mKhMWBqtFXkCuYfxwGylas0VVdh3AZLJ7N21G
 WEoFh99pVULwWnPHxM2ZQ87Ex9BkGKbsBTswxWpprCfXLqD0N2hHlABpwJP78zRf
 VniWFOcC7lslILCFawb7CqGgAwbgV85nDRS4QCuCKisrkFywvjJrEeu/W+h1NfhF
 d6ves/osNdEAM1DSALoxwEA42An8l8xh8NyJnl8JZV00LW0DM108O5/7pf5Zcryc
 RHV3RxA7skgezBh5uThvo60QzNK+kVMatI4qpQEHxLE52CaDl/fBu1Cgb/VUxnIl
 AEBfyiFbk+skHpuMFKtl30Tx3M+yJKMTzFPd4kYjHYGEDwtAcXcB3dJQW48A79i3
 H3IWcDcXpk5Rjo2UZmaXdt/qlj7mP6U0xdOUq8ZK6JOC4uY9skszVGsfuNN9QQ5u
 2E2YKKVrGFoQydl4C8R6A7axL2VzIJszHFZNipd8E3YOyW7PWRAkr02tOOkBTj8N
 dLMcNM7aPJWqEYiEIjEzGQN20pweJ1dRA29LDuOswKh+7W2bWTQFh6F2Q8Haansc
 RVg5PDzl+Mc=
 =E7mz
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and
   debugfs interfaces to a unified debugfs interface.

 - Signals: Allow caching one sigqueue object per task, to improve
   performance & latencies.

 - Improve newidle_balance() irq-off latencies on systems with a large
   number of CPU cgroups.

 - Improve energy-aware scheduling

 - Improve the PELT metrics for certain workloads

 - Reintroduce select_idle_smt() to improve load-balancing locality -
   but without the previous regressions

 - Add 'scheduler latency debugging': warn after long periods of pending
   need_resched. This is an opt-in feature that requires the enabling of
   the LATENCY_WARN scheduler feature, or the use of the
   resched_latency_warn_ms=xx boot parameter.

 - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix
   remaining balance_push() vs. hotplug holes/races

 - PSI fixes, plus allow /proc/pressure/ files to be written by
   CAP_SYS_RESOURCE tasks as well

 - Fix/improve various load-balancing corner cases vs. capacity margins

 - Fix sched topology on systems with NUMA diameter of 3 or above

 - Fix PF_KTHREAD vs to_kthread() race

 - Minor rseq optimizations

 - Misc cleanups, optimizations, fixes and smaller updates

* tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  cpumask/hotplug: Fix cpu_dying() state tracking
  kthread: Fix PF_KTHREAD vs to_kthread() race
  sched/debug: Fix cgroup_path[] serialization
  sched,psi: Handle potential task count underflow bugs more gracefully
  sched: Warn on long periods of pending need_resched
  sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
  sched/debug: Rename the sched_debug parameter to sched_verbose
  sched,fair: Alternative sched_slice()
  sched: Move /proc/sched_debug to debugfs
  sched,debug: Convert sysctl sched_domains to debugfs
  debugfs: Implement debugfs_create_str()
  sched,preempt: Move preempt_dynamic to debug.c
  sched: Move SCHED_DEBUG sysctl to debugfs
  sched: Don't make LATENCYTOP select SCHED_DEBUG
  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
  cpumask: Introduce DYING mask
  cpumask: Make cpu_{online,possible,present,active}() inline
  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
  ...
2021-04-28 13:33:57 -07:00
Linus Torvalds
42dec9a936 Perf events changes in this cycle were:
- Improve Intel uncore PMU support:
 
      - Parse uncore 'discovery tables' - a new hardware capability enumeration method
        introduced on the latest Intel platforms. This table is in a well-defined PCI
        namespace location and is read via MMIO. It is organized in an rbtree.
 
        These uncore tables will allow the discovery of standard counter blocks, but
        fancier counters still need to be enumerated explicitly.
 
      - Add Alder Lake support
 
      - Improve IIO stacks to PMON mapping support on Skylake servers
 
  - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs
    and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived)
    cores.
 
    The CPU-side feature set is entirely symmetrical - but on the PMU side there's
    core type dependent PMU functionality.
 
  - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by
    fixing the AUX allocation watermark logic.
 
  - Improve ring buffer allocation on NUMA systems
 
  - Put 'struct perf_event' into their separate kmem_cache pool
 
  - Add support for synchronous signals for select perf events. The immediate motivation
    is to support low-overhead sampling-based race detection for user-space code. The
    feature consists of the following main changes:
 
     - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits
       inheritance of events to CLONE_THREAD.
 
     - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec.
 
     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64
       ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.
 
    The siginfo support is adequate for breakpoints right now - but the new field can be used
    to introduce support for other types of metadata passed over siginfo as well.
 
  - Misc fixes, cleanups and smaller updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJGpERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j9zBAAuVbG2snV6SBSdXLhQcM66N3NckOXvSY5
 QjjhQcuwJQEK/NJB3266K5d8qSmdyRBsWf3GCsrmyBT67P1V28K44Pu7oCV0UDtf
 mpVRjEP0oR7hNsANSSgo8Fa4ZD7H5waX7dK7925Tvw8By3mMoZoddiD/84WJHhxO
 NDF+GRFaRj+/dpbhV8cdCoXTjYdkC36vYuZs3b9lu0tS9D/AJgsNy7TinLvO02Cs
 5peP+2y29dgvCXiGBiuJtEA6JyGnX3nUJCvfOZZ/DWDc3fdduARlRrc5Aiq4n/wY
 UdSkw1VTZBlZ1wMSdmHQVeC5RIH3uWUtRoNqy0Yc90lBm55AQ0EENwIfWDUDC5zy
 USdBqWTNWKMBxlEilUIyqKPQK8LW/31TRzqy8BWKPNcZt5yP5YS1SjAJRDDjSwL/
 I+OBw1vjLJamYh8oNiD5b+VLqNQba81jFASfv+HVWcULumnY6ImECCpkg289Fkpi
 BVR065boifJDlyENXFbvTxyMBXQsZfA+EhtxG7ju2Ni+TokBbogyCb3L2injPt9g
 7jjtTOqmfad4gX1WSc+215iYZMkgECcUd9E+BfOseEjBohqlo7yNKIfYnT8mE/Xq
 nb7eHjyvLiE8tRtZ+7SjsujOMHv9LhWFAbSaxU/kEVzpkp0zyd6mnnslDKaaHLhz
 goUMOL/D0lg=
 =NhQ7
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - Improve Intel uncore PMU support:

     - Parse uncore 'discovery tables' - a new hardware capability
       enumeration method introduced on the latest Intel platforms. This
       table is in a well-defined PCI namespace location and is read via
       MMIO. It is organized in an rbtree.

       These uncore tables will allow the discovery of standard counter
       blocks, but fancier counters still need to be enumerated
       explicitly.

     - Add Alder Lake support

     - Improve IIO stacks to PMON mapping support on Skylake servers

 - Add Intel Alder Lake PMU support - which requires the introduction of
   'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
   and Gracemont ('small' - Atom derived) cores.

   The CPU-side feature set is entirely symmetrical - but on the PMU
   side there's core type dependent PMU functionality.

 - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
   profiling, by fixing the AUX allocation watermark logic.

 - Improve ring buffer allocation on NUMA systems

 - Put 'struct perf_event' into their separate kmem_cache pool

 - Add support for synchronous signals for select perf events. The
   immediate motivation is to support low-overhead sampling-based race
   detection for user-space code. The feature consists of the following
   main changes:

     - Add thread-only event inheritance via
       perf_event_attr::inherit_thread, which limits inheritance of
       events to CLONE_THREAD.

     - Add the ability for events to not leak through exec(), via
       perf_event_attr::remove_on_exec.

     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
       extend siginfo with an u64 ::si_perf, and add the breakpoint
       information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.

   The siginfo support is adequate for breakpoints right now - but the
   new field can be used to introduce support for other types of
   metadata passed over siginfo as well.

 - Misc fixes, cleanups and smaller updates.

* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  signal, perf: Add missing TRAP_PERF case in siginfo_layout()
  signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
  perf/x86: Allow for 8<num_fixed_counters<16
  perf/x86/rapl: Add support for Intel Alder Lake
  perf/x86/cstate: Add Alder Lake CPU support
  perf/x86/msr: Add Alder Lake CPU support
  perf/x86/intel/uncore: Add Alder Lake support
  perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
  perf/x86/intel: Add Alder Lake Hybrid support
  perf/x86: Support filter_match callback
  perf/x86/intel: Add attr_update for Hybrid PMUs
  perf/x86: Add structures for the attributes of Hybrid PMUs
  perf/x86: Register hybrid PMUs
  perf/x86: Factor out x86_pmu_show_pmu_cap
  perf/x86: Remove temporary pmu assignment in event_init
  perf/x86/intel: Factor out intel_pmu_check_extra_regs
  perf/x86/intel: Factor out intel_pmu_check_event_constraints
  perf/x86/intel: Factor out intel_pmu_check_num_counters
  perf/x86: Hybrid PMU support for extra_regs
  perf/x86: Hybrid PMU support for event constraints
  ...
2021-04-28 13:03:44 -07:00
Linus Torvalds
0ff0edb550 Locking changes for this cycle were:
- rtmutex cleanup & spring cleaning pass that removes ~400 lines of code
  - Futex simplifications & cleanups
  - Add debugging to the CSD code, to help track down a tenacious race (or hw problem)
  - Add lockdep_assert_not_held(), to allow code to require a lock to not be held,
    and propagate this into the ath10k driver
  - Misc LKMM documentation updates
  - Misc KCSAN updates: cleanups & documentation updates
  - Misc fixes and cleanups
  - Fix locktorture bugs with ww_mutexes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJDn0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hPrRAAryS4zPnuDsfkVk0smxo7a0lK5ljbH2Xo
 28QUZXOl6upnEV8dzbjwG7eAjt5ZJVI5tKIeG0PV0NUJH2nsyHwESdtULGGYuPf/
 4YUzNwZJa+nI/jeBnVsXCimLVxxnNCRdR7yOVOHm4ukEwa+YTNt1pvlYRmUd4YyH
 Q5cCrpb3THvLka3AAamEbqnHnAdGxHKuuHYVRkODpMQ+zrQvtN8antYsuk8kJsqM
 m+GZg/dVCuLEPah5k+lOACtcq/w7HCmTlxS8t4XLvD52jywFZLcCPvi1rk0+JR+k
 Vd9TngC09GJ4jXuDpr42YKkU9/X6qy2Es39iA/ozCvc1Alrhspx/59XmaVSuWQGo
 XYuEPx38Yuo/6w16haSgp0k4WSay15A4uhCTQ75VF4vli8Bqgg9PaxLyQH1uG8e2
 xk8U90R7bDzLlhKYIx1Vu5Z0t7A1JtB5CJtgpcfg/zQLlzygo75fHzdAiU5fDBDm
 3QQXSU2Oqzt7c5ZypioHWazARk7tL6th38KGN1gZDTm5zwifpaCtHi7sml6hhZ/4
 ATH6zEPzIbXJL2UqumSli6H4ye5ORNjOu32r7YPqLI4IDbzpssfoSwfKYlQG4Tvn
 4H1Ukirzni0gz5+wbleItzf2aeo1rocs4YQTnaT02j8NmUHUz4AzOHGOQFr5Tvh0
 wk/P4MIoSb0=
 =cOOk
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - rtmutex cleanup & spring cleaning pass that removes ~400 lines of
   code

 - Futex simplifications & cleanups

 - Add debugging to the CSD code, to help track down a tenacious race
   (or hw problem)

 - Add lockdep_assert_not_held(), to allow code to require a lock to not
   be held, and propagate this into the ath10k driver

 - Misc LKMM documentation updates

 - Misc KCSAN updates: cleanups & documentation updates

 - Misc fixes and cleanups

 - Fix locktorture bugs with ww_mutexes

* tag 'locking-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  kcsan: Fix printk format string
  static_call: Relax static_call_update() function argument type
  static_call: Fix unused variable warn w/o MODULE
  locking/rtmutex: Clean up signal handling in __rt_mutex_slowlock()
  locking/rtmutex: Restrict the trylock WARN_ON() to debug
  locking/rtmutex: Fix misleading comment in rt_mutex_postunlock()
  locking/rtmutex: Consolidate the fast/slowpath invocation
  locking/rtmutex: Make text section and inlining consistent
  locking/rtmutex: Move debug functions as inlines into common header
  locking/rtmutex: Decrapify __rt_mutex_init()
  locking/rtmutex: Remove pointless CONFIG_RT_MUTEXES=n stubs
  locking/rtmutex: Inline chainwalk depth check
  locking/rtmutex: Move rt_mutex_debug_task_free() to rtmutex.c
  locking/rtmutex: Remove empty and unused debug stubs
  locking/rtmutex: Consolidate rt_mutex_init()
  locking/rtmutex: Remove output from deadlock detector
  locking/rtmutex: Remove rtmutex deadlock tester leftovers
  locking/rtmutex: Remove rt_mutex_timed_lock()
  MAINTAINERS: Add myself as futex reviewer
  locking/mutex: Remove repeated declaration
  ...
2021-04-28 12:37:53 -07:00
Linus Torvalds
9a45da9270 RCU changes for this cycle were:
- Bitmap support for "N" as alias for last bit
  - kvfree_rcu updates
  - mm_dump_obj() updates.  (One of these is to mm, but was suggested by Andrew Morton.)
  - RCU callback offloading update
  - Polling RCU grace-period interfaces
  - Realtime-related RCU updates
  - Tasks-RCU updates
  - Torture-test updates
  - Torture-test scripting updates
  - Miscellaneous fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJCZERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hRjw/+Jkb9KvR9odPt/zqN/KPtIlburCUWgsFb
 2zAlWN4uMocPAiXT2Xq58/8gqMkpyn7ZVZtL1tD8fZSvlwEr0U8Z74+/NdoQvYE+
 kMXIYIuhIAGRyAupmzkriqN33iY+BSZPacX3u6ziPj57/0OZzbWVN/DAhbuvyLqG
 J/oL4PHCa7XAqXbf95rd5Zjs680QJ3CbTRh4nA8uHArzJmKZOaaHJ05Pxd1LpULe
 SJ+5p1GQnnwxd1HqmlHMDu/dW+2hE35BGykF8zi78je9OJXualDoM/6JpIYGhMNY
 5qlhU55QYP1jzjuNGVZZUS4L77eS2/W7SpPAaTmMEy/SsVB59G8Kf22oNDpVaEqQ
 m+2ErqwaHvlkMjqnsx+JQbsOP0yCi2NZBoEPFdfk1H23E2deVlSDbxPso4Zb1oUD
 E12769kN+SWDytuLSOAe1PY/KXqmNUKjPZl1GDCGXL7HlCnWyggUDschTsKJa19O
 XXl+yCTGMUH4XAPSqavAKQbBjurqpT6i4zfooSH4TBtOHm1ExgZOUS8gglZ1JuJd
 q+uJdZIgS8BcGkGw/k1bYDWY5TA4Rjv3sAOKQL1PgYBl1t/yLK441mE7LI9gWOwz
 Crz7vlSxD6Jc2cYQeUVW0KPGt5aVd63Gd9HjpXxGkqYQSDRqYMCebHEAGagz+jj7
 Nv/nOnf34Uc=
 =mpNt
 -----END PGP SIGNATURE-----

Merge tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:

 - Support for "N" as alias for last bit in bitmap parsing library (eg
   using syntax like "nohz_full=2-N")

 - kvfree_rcu updates

 - mm_dump_obj() updates. (One of these is to mm, but was suggested by
   Andrew Morton.)

 - RCU callback offloading update

 - Polling RCU grace-period interfaces

 - Realtime-related RCU updates

 - Tasks-RCU updates

 - Torture-test updates

 - Torture-test scripting updates

 - Miscellaneous fixes

* tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
  rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
  rcu: Provide polling interfaces for Tiny RCU grace periods
  torture: Fix kvm.sh --datestamp regex check
  torture: Consolidate qemu-cmd duration editing into kvm-transform.sh
  torture: Print proper vmlinux path for kvm-again.sh runs
  torture: Make TORTURE_TRUST_MAKE available in kvm-again.sh environment
  torture: Make kvm-transform.sh update jitter commands
  torture: Add --duration argument to kvm-again.sh
  torture: Add kvm-again.sh to rerun a previous torture-test
  torture: Create a "batches" file for build reuse
  torture: De-capitalize TORTURE_SUITE
  torture: Make upper-case-only no-dot no-slash scenario names official
  torture: Rename SRCU-t and SRCU-u to avoid lowercase characters
  torture: Remove no-mpstat error message
  torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
  torture: Record jitter start/stop commands
  torture: Extract kvm-test-1-run-qemu.sh from kvm-test-1-run.sh
  torture: Record TORTURE_KCONFIG_GDB_ARG in qemu-cmd
  torture: Abstract jitter.sh start/stop into scripts
  rcu: Provide polling interfaces for Tree RCU grace periods
  ...
2021-04-28 12:00:13 -07:00
Steven Rostedt (VMware)
785e3c0a3a tracing: Map all PIDs to command lines
The default max PID is set by PID_MAX_DEFAULT, and the tracing
infrastructure uses this number to map PIDs to the comm names of the
tasks, such output of the trace can show names from the recorded PIDs in
the ring buffer. This mapping is also exported to user space via the
"saved_cmdlines" file in the tracefs directory.

But currently the mapping expects the PIDs to be less than
PID_MAX_DEFAULT, which is the default maximum and not the real maximum.
Recently, systemd will increases the maximum value of a PID on the system,
and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, its
comm is not recorded. This leads to the entire trace to have "<...>" as
the comm name, which is pretty useless.

Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead
of just mapping the index to the comm, map a mask of the PID
(PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the
map_cmdline_to_pid array (that already exists).

This bug goes back to the beginning of ftrace, but hasn't been an issue
until user space started increasing the maximum value of PIDs.

Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: bc0c38d139 ("ftrace: latency tracer infrastructure")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-28 13:20:04 -04:00
Linus Torvalds
55e6be657b Merge branch 'for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "The only notable change is Vipin's new misc cgroup controller.

  This implements generic support for resources which can be controlled
  by simply counting and limiting the number of resource instances - ie
  there's X number of these on the system and this cgroup subtree can
  have upto Y of those.

  The first user is the address space IDs used for virtual machine
  memory encryption and expected future usages are similar - niche
  hardware features with concrete resource limits and simple usage
  models"

* 'for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io()
  cgroup/cpuset: fix typos in comments
  cgroup: misc: mark dummy misc_cg_res_total_usage() static inline
  svm/sev: Register SEV and SEV-ES ASIDs to the misc controller
  cgroup: Miscellaneous cgroup documentation.
  cgroup: Add misc cgroup controller
2021-04-27 18:47:42 -07:00
Linus Torvalds
eb6bbacc46 Livepatching changes for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmCIF5EACgkQUqAMR0iA
 lPIABA/+MstVI15QFRD50xo/TyGUP3r7NZmU7BTbhzshSW2XkXFiWNn73VeULhZ8
 CDXf8bKuD2tIQTq30+RMRqmSUSN00mXepupA1eVeyKoUYqXzNsEU6oQBRQ3wKYQY
 HrvbcJtIMNC10G7TUIgltiqKsi5538YIfWn5EBN9wvxyHvnJQ/RzNS1OBIaDx+vV
 04E/65P+gcrNMBRXB03/Vl2KBfJQKb4Hj78Yo9puq6kJmV3uHmHgv2adYGT/veG/
 2o+cigfJS1uLg6vCiJC9bBkQgNJUj/3p+6EaVBFfgpZ1ddKW9AVEQMv2PDvaWCuR
 BKwuawobHl1eHgCPS+dofZMFZ0LT+z0pnf8jpduLmbcKFbHYaWLwmPjwFwSQNJ6e
 zqM91pnwRUkSVXmboxubcNHbioRFXhvIiswxHHbrzS4BBs3mXfSEnMRlbB75iKuJ
 cazVRP6u+ukg7XZhtsL2/8UYXOJ4bbIF7R9B/DM5o4zJD2gs9fRCkfDrp2n0Twtu
 x7NffZAAmlqBQ+7c9d/eEkZmzpkX76gRwUC/IvwJRIs+jOHfEehe+tPuqDlWN19b
 vM0a+kup0n8T/ZCfLeK8PLayjm8/hnNg2CK970zlovEKAdCIMovJHUJjnXboSx/n
 +4R/DH3LUG3TJo+rMKLCOZt9/ph2YC1ySTmOXv9GkE0VlsCn95c=
 =ZMd+
 -----END PGP SIGNATURE-----

Merge tag 'livepatching-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching update from Petr Mladek:

 - Use TIF_NOTIFY_SIGNAL infrastructure instead of the fake signal

* tag 'livepatching-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure
2021-04-27 18:14:38 -07:00
Linus Torvalds
7f3d08b255 printk changes for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmCIBMIACgkQUqAMR0iA
 lPIt9w//bbHUN/JsNtLCs/849oExdUn/thVajrD5yELttYZXhdzbXncNdkGX9tlU
 4JmExmUoqKYdN6JhSnrcYvckHj7XXZM7pVh9IdzqRh10MEXIQ+7IUHjQc8034Zs/
 W4/oZmfMtBjszap+cJ9hvdp9qaJkPz/fRLGlrbjc1K4hhxDa1gGmeD35SKswGltm
 q6RzX3uRl5JbBrYsLoqb28MGYRHhjf2+Pvndoj+5Nn9FtwPSot6jAkyqY5Y6iJlS
 W2EsFqOt+Kv7/I93FyQlnXC6Nx7vntmow7knmmGPXDf2BqLb0J8Bxl3fwuzpQoao
 nZzL/p9GQ4ZXF6y8gRV8+RzPIcftBdayOswEDGH0LzlTkbAe/9Sq9Lo7a4Z8jxHW
 ro0P+PSRK5Ksm7jvpVmSTg+Nt+XqDA5zA1lAorX1UjsyeDDNF9ndQ4C+ZNhCKo54
 y+RDgtAArJMIvsHLQ53ReoOct5NnGVNb8G/r3bIAu+Dn6K3nesr6fP1XG8iduseL
 yFlLB7w214BQMr2B/C+8lQvj54wWE4lea2+LNvObxC5b8puYj0fEniUxTYP6bcB5
 QT+LfTToufYz4US7ggJy6hoEfohifGWVvDHbn9tXmyXotSTHH7pHdYypqY+UO+kl
 7BkwzNFCm4qCIKsg8nyJxT2hDOlpcCrQx1dBIjveMqJ0c5+ahXU=
 =ovSn
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Stop synchronizing kernel log buffer readers by logbuf_lock. As a
   result, the access to the buffer is fully lockless now.

   Note that printk() itself still uses locks because it tries to flush
   the messages to the console immediately. Also the per-CPU temporary
   buffers are still there because they prevent infinite recursion and
   serialize backtraces from NMI. All this is going to change in the
   future.

 - kmsg_dump API rework and cleanup as a side effect of the logbuf_lock
   removal.

 - Make bstr_printf() aware that %pf and %pF formats could deference the
   given pointer.

 - Show also page flags by %pGp format.

 - Clarify the documentation for plain pointer printing.

 - Do not show no_hash_pointers warning multiple times.

 - Update Senozhatsky email address.

 - Some clean up.

* tag 'printk-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (24 commits)
  lib/vsprintf.c: remove leftover 'f' and 'F' cases from bstr_printf()
  printk: clarify the documentation for plain pointer printing
  kernel/printk.c: Fixed mundane typos
  printk: rename vprintk_func to vprintk
  vsprintf: dump full information of page flags in pGp
  mm, slub: don't combine pr_err with INFO
  mm, slub: use pGp to print page flags
  MAINTAINERS: update Senozhatsky email address
  lib/vsprintf: do not show no_hash_pointers message multiple times
  printk: console: remove unnecessary safe buffer usage
  printk: kmsg_dump: remove _nolock() variants
  printk: remove logbuf_lock
  printk: introduce a kmsg_dump iterator
  printk: kmsg_dumper: remove @active field
  printk: add syslog_lock
  printk: use atomic64_t for devkmsg_user.seq
  printk: use seqcount_latch for clear_seq
  printk: introduce CONSOLE_LOG_MAX
  printk: consolidate kmsg_dump_get_buffer/syslog_print_all code
  printk: refactor kmsg_dump_get_buffer()
  ...
2021-04-27 18:09:44 -07:00
Linus Torvalds
916a75965e kgdb patches for 5.13
Exclusively tidy ups this cycle. Most of them are thanks to Sumit Garg
 and, as it happens, the clean ups do result in a slight increase in
 the line count. This is due to registering kdb commands using data
 structures rather than function calls which, in turn, simplifies the
 memory management during command registration.
 
 In addition to changes to command registration we also have some dead
 code removal, a clearer implementation of environment variable handling
 and a typo fix.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmCIEcIACgkQfOMlXTn3
 iKHaWRAApW5fVDvM4ACAYYzqnq+ID2wW5h9XHpbyFqeaErQQEwjIcQHKMnHGMQ7J
 2Smb8r9hmk93HgetUAYUe1xoU1/UgiVUlLxZDmtWw8Kf/e9unRcFAa6nUUxeFDn9
 J3s1ZLAGOyjxThRKqOhgoQLE0c+tby7QaiEFF1O/iHTxWIn/euSTRbENoF40xDXU
 oheEfysWF9TpRo/96ZO14tbBHj6THrjyKkHFzYyjRs29wc/j/ofLg9OJj4tzuMne
 XecT7npDJ1JFoyZfkvxxK8TW/a2MLaH1EVO3EpiBM6tMSlHW3NRCehXXlJ6JyGY0
 wAYwd5hgldXcVL8kmDXIC8qhuhGf7mCaU/8AmWyISj+lQcmcNfZYivJmZjdEW3qT
 yFkZ3k/r18N7/tRPbvOarlhxNHx+eCaFGm5ffz9VmMHMp1nrTk1lGhK63gTel3B0
 ndIbvpN4ujWpes5lDg2knvZpaRIClSvisn/UyuuzZyWmeTT/kdUVHruSmopUFw9m
 BEWvAh0HwlR9dRhB8UKUyFF7w5fMXXw0+9j5HOw7gOR78QoNUZymBsxsISWp9fOO
 +Pzz3B9+TQdt9aZP1frMgDvNf6HDebxDZnmgCh5uMoLiKXAqJtJYdmLCOiTjtg35
 FG44QhZFFlyDQdmPFf3hY+hGf2YxNdk+zdlyYq/amFjm5N3jGUM=
 =2I9J
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Exclusively tidy ups this cycle. Most of them are thanks to Sumit Garg
  and, as it happens, the clean ups do result in a slight increase in
  the line count. This is due to registering kdb commands using data
  structures rather than function calls which, in turn, simplifies the
  memory management during command registration.

  In addition to changes to command registration we also have some dead
  code removal, a clearer implementation of environment variable
  handling and a typo fix"

* tag 'kgdb-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Refactor env variables get/set code
  kernel: debug: Ordinary typo fixes in the file gdbstub.c
  kdb: Simplify kdb commands registration
  kdb: Remove redundant function definitions/prototypes
2021-04-27 18:07:19 -07:00
Pedro Tammela
f008d732ab bpf: Add batched ops support for percpu array
Uses the already in-place infrastructure provided by the
'generic_map_*_batch' functions.

No tweak was needed as it transparently handles the percpu variant.

As arrays don't have delete operations, let it return a error to
user space (default behaviour).

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210424214510.806627-2-pctammela@mojatatu.com
2021-04-28 01:17:45 +02:00
Florent Revest
48cac3f4a9 bpf: Implement formatted output helpers with bstr_printf
BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
and bpf_snprintf. Their signatures specify that all arguments are
provided from the BPF world as u64s (in an array or as registers). All
of these helpers are currently implemented by calling functions such as
snprintf() whose signatures take a variable number of arguments, then
placed in a va_list by the compiler to call vsnprintf().

"d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
a bpf_printf_prepare function that fills an array of u64 sanitized
arguments with an array of "modifiers" which indicate what the "real"
size of each argument should be (given by the format specifier). The
BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
its real size. However, the C promotion rules implicitely cast them all
back to u64s. Therefore, the arguments given to snprintf are u64s and
the va_list constructed by the compiler will use 64 bits for each
argument. On 64 bit machines, this happens to work well because 32 bit
arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
architectures this breaks the layout of the va_list expected by the
called function and mangles values.

In "88a5c690b6 bpf: fix bpf_trace_printk on 32 bit archs", this problem
had been solved for bpf_trace_printk only with a "horrid workaround"
that emitted multiple calls to trace_printk where each call had
different argument types and generated different va_list layouts. One of
the call would be dynamically chosen at runtime. This was ok with the 3
arguments that bpf_trace_printk takes but bpf_seq_printf and
bpf_snprintf accept up to 12 arguments. Because this approach scales
code exponentially, it is not a viable option anymore.

Because the promotion rules are part of the language and because the
construction of a va_list is an arch-specific ABI, it's best to just
avoid variadic arguments and va_lists altogether. Thankfully the
kernel's snprintf() has an alternative in the form of bstr_printf() that
accepts arguments in a "binary buffer representation". These binary
buffers are currently created by vbin_printf and used in the tracing
subsystem to split the cost of printing into two parts: a fast one that
only dereferences and remembers values, and a slower one, called later,
that does the pretty-printing.

This patch refactors bpf_printf_prepare to construct binary buffers of
arguments consumable by bstr_printf() instead of arrays of arguments and
modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
bpf_printf_prepare usage but there are a few gotchas that change how
bpf_printf_prepare needs to do things.

Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
generic storage for strings and IP addresses. With this refactoring, the
temporary buffers now holds all the arguments in a structured binary
format.

To comply with the format expected by bstr_printf, certain format
specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
Because vsnprintf subroutines for these specifiers are hard to expose,
we pre-format these arguments with calls to snprintf().

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
2021-04-27 15:56:31 -07:00
Linus Torvalds
e359bce39d audit/stable-5.13 PR 20210426
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmCHM4YUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOncA/+OnDdkYFD2e/6PsHURsQ9XK3Yk0kc
 1PY7lJnT4Eb4cUeMe2DP9LpTkA0ldhCxxbz8HYJNn7TUADqeCGhkShBLs/Fxz6k0
 F63RLupJFU0NKhBOYOyccqwmkzc19Ortcj27mYrIgYGK+tSPuRHzJ25PGmjnvh1W
 U7Or0sb1aOegxqFkTXi9IP2wY2Dv+YWfWkSdZNi/W5z4bedCQr9fJgGyUvsDCJyY
 YBIRa/VOLoU9AGkS/XN+uM06lckImC5gqZAqRtJEAk4vj7MsxcWp/eNkENiyaPeH
 +vSUrsv1bj0Bv85CMY8SWGY/GDaiDKjEf+3fVMHF5B/Ft3CgCheykbGPyjRqt3eT
 iIkv0PR5f2MclV5WB5n3gxwE42rPV+FOE8Mh8vRiDdkub/T8r0/cK0FJYPuwYWyA
 r/wdNKQpQUky+laMQWXKpi4tDx6JSWZPBPLG0I8Za/m1CV964VVok68VIMSmBcFj
 sbzYD6e3z9VTnuuxvLiS5HqFTtKkN5VG2al3HmBvZFtkF60xeNs4zbgHV4dg7adK
 clcBE3X4j0RHmYwLs4WWdOzWMPgx99BFJxVgZw3YGXv4oXguLUDFAswTIrc5FNtf
 YYs0/zsPn6CLt15Q7m/3Ec1T0fDf0A+DW3V3KSRNvLaMB41+E1XIWPYpUbfrr13v
 zGfT3CIdu9IR36Q=
 =SzqM
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Another small pull request for audit, most of the patches are
  documentation updates with only two real code changes: one to fix a
  compiler warning for a dummy function/macro, and one to cleanup some
  code since we removed the AUDIT_FILTER_ENTRY ages ago (v4.17)"

* tag 'audit-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: drop /proc/PID/loginuid documentation Format field
  audit: avoid -Wempty-body warning
  audit: document /proc/PID/sessionid
  audit: document /proc/PID/loginuid
  MAINTAINERS: update audit files
  audit: further cleanup of AUDIT_FILTER_ENTRY deprecation
2021-04-27 13:50:58 -07:00
Linus Torvalds
f1c921fb70 selinux/stable-5.13 PR 20210426
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmCHM2sUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNfCg/9GmoCyCh+ZRj5RGQ6M+yJas1+yyJQ
 uEfTNde54yfATUTaaWYnZG59yqzM3I2uaV11U7tqg8ajiFPxJKqbs5R9jl3lnSjH
 0Dg22nXPSCOTKcU0x/DeLoKRr+M9jO1K/nQ8NEZvYX4nC/OgtCvJqb/oEQZIKAk5
 2a7OEmNNQyFGd274p9dELaDHxN9UIaJ2PzQFXtq7ROHgBXQO4ONb2ajOf6mDSFQb
 vP/CDHwaH+pcE28w44oRy0/YBkO1SrdqoFQchg5yFagM5tQRLGkXK4OFSs5KHi5Q
 YMtmaOzMPIv1e5eaC1HuuMJYA4pPb30T9hFHP7tmBVZfmZaFaDeUs+BhMm98WTiS
 o0iTP7tfs36/poOR1Q0/sB06uvF9hUAAX1ZuE95YySifbXU9hsUc9b0uQSwCdg9P
 /J9rcdHLTpWqjw9n02mezWmAvo5U8ZvbDs+0xPIwI+3RTUP5t6mp+Hd5Tc7bPTq1
 0rpWXx+FQoSytFap5qiUSiwBp+HF6HQnNIXB0Muf6wctChoTjvo7TwoxH//z4kEm
 +SddhOCNkB7VC/X7hOxhl0F/rdHuXvb1AFIWjpTLJH2CR1PvMtF+sGey+uPT6hKZ
 /gvhmQGjFdph99eGlfVbCNvx1pM61O25IscaYD1T2wGImw+z7dX4WkG3WoOdDSkR
 bRjrBkcHh0gLhWk=
 =HTEy
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:

 - Add support for measuring the SELinux state and policy capabilities
   using IMA.

 - A handful of SELinux/NFS patches to compare the SELinux state of one
   mount with a set of mount options. Olga goes into more detail in the
   patch descriptions, but this is important as it allows more
   flexibility when using NFS and SELinux context mounts.

 - Properly differentiate between the subjective and objective LSM
   credentials; including support for the SELinux and Smack. My clumsy
   attempt at a proper fix for AppArmor didn't quite pass muster so John
   is working on a proper AppArmor patch, in the meantime this set of
   patches shouldn't change the behavior of AppArmor in any way. This
   change explains the bulk of the diffstat beyond security/.

 - Fix a problem where we were not properly terminating the permission
   list for two SELinux object classes.

* tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: add proper NULL termination to the secclass_map permissions
  smack: differentiate between subjective and objective task credentials
  selinux: clarify task subjective and objective credentials
  lsm: separate security_task_getsecid() into subjective and objective variants
  nfs: account for selinux security context when deciding to share superblock
  nfs: remove unneeded null check in nfs_fill_super()
  lsm,selinux: add new hook to compare new mount to an existing mount
  selinux: fix misspellings using codespell tool
  selinux: fix misspellings using codespell tool
  selinux: measure state and policy capabilities
  selinux: Allow context mounts for unpriviliged overlayfs
2021-04-27 13:42:11 -07:00
Linus Torvalds
57fa2369ab CFI on arm64 series for v5.13-rc1
- Clean up list_sort prototypes (Sami Tolvanen)
 
 - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHCR8ACgkQiXL039xt
 wCZyFQ//fnUZaXR2K354zDyW6CJljMf+d94RF6rH+J6eMTH2/HXa5v0iJokwABLf
 ussP6qF4k5wtmI22Gm9A5Zc3e4iiry5pC0jOdk0mk4gzWwFN9MdgNxJZIGA3xqhS
 bsBK4AGrVKjtZl48G1/ZxJuNDeJhVp6GNK2n6/Gl4rZF6R7D/Upz0XelyJRdDpcM
 HIGma7jZl6xfGU0mdWCzpOGK1zdMca1WVs7A4YuurSbLn5PZJrcNVWLouDqt/Si2
 AduSri1gyPClicgvqWjMOzhUpuw/nJtBLRl1x1EsWk/KSZ1/uNVjlewfzdN4fZrr
 zbtFr2gLubYLK6JOX7/LqoHlOTgE3tYLL+WIVN75DsOGZBKgHhmebTmWLyqzV0SL
 oqcyM5d3ucC6msdtAK5Fv4MSp8rpjqlK1Ha4SGRT6kC2wut7AhZ3KD7eyRIz8mV9
 Sa9mhignGFJnTEUp+LSbYdrAudgSKxB40WyXPmswAXX4VJFRD4ONrrcAON/SzkUT
 Hw/JdFRCKkJjgwNQjIQoZcUNMTbFz2PlNIEnjJWm38YImQKQlCb2mXaZKCwBkf45
 aheCZk17eKoxTCXFMd+KxlyNEtS2yBfq/PpZgvw7GW/pfFbWUg1+2O41LnihIe5v
 zu0hN1wNCQqgfxiMZqX1OTb9C/2vybzGsXILt+9nppjZ8EBU7iU=
 =wU6U
 -----END PGP SIGNATURE-----

Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull CFI on arm64 support from Kees Cook:
 "This builds on last cycle's LTO work, and allows the arm64 kernels to
  be built with Clang's Control Flow Integrity feature. This feature has
  happily lived in Android kernels for almost 3 years[1], so I'm excited
  to have it ready for upstream.

  The wide diffstat is mainly due to the treewide fixing of mismatched
  list_sort prototypes. Other things in core kernel are to address
  various CFI corner cases. The largest code portion is the CFI runtime
  implementation itself (which will be shared by all architectures
  implementing support for CFI). The arm64 pieces are Acked by arm64
  maintainers rather than coming through the arm64 tree since carrying
  this tree over there was going to be awkward.

  CFI support for x86 is still under development, but is pretty close.
  There are a handful of corner cases on x86 that need some improvements
  to Clang and objtool, but otherwise works well.

  Summary:

   - Clean up list_sort prototypes (Sami Tolvanen)

   - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"

* tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arm64: allow CONFIG_CFI_CLANG to be selected
  KVM: arm64: Disable CFI for nVHE
  arm64: ftrace: use function_nocfi for ftrace_call
  arm64: add __nocfi to __apply_alternatives
  arm64: add __nocfi to functions that jump to a physical address
  arm64: use function_nocfi with __pa_symbol
  arm64: implement function_nocfi
  psci: use function_nocfi for cpu_resume
  lkdtm: use function_nocfi
  treewide: Change list_sort to use const pointers
  bpf: disable CFI in dispatcher functions
  kallsyms: strip ThinLTO hashes from static functions
  kthread: use WARN_ON_FUNCTION_MISMATCH
  workqueue: use WARN_ON_FUNCTION_MISMATCH
  module: ensure __cfi_check alignment
  mm: add generic function_nocfi macro
  cfi: add __cficanonical
  add support for Clang CFI
2021-04-27 10:16:46 -07:00
Claire Chang
95b079d821 swiotlb: Fix the type of index
Fix the type of index from unsigned int to int since find_slots() might
return -1.

Fixes: 26a7e09478 ("swiotlb: refactor swiotlb_tbl_map_single")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Claire Chang <tientzu@chromium.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
2021-04-27 17:03:20 +00:00
Linus Torvalds
7e4910b9ac seccomp updates for v5.13-rc1
- Fix "cacheable" typo in comments (Cui GaoSheng)
 
 - Fix CONFIG for /proc/$pid/status Seccomp_filters (Kenta.Tada@sony.com)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHBe0ACgkQiXL039xt
 wCadDA//cy6LXlzJ78tRy1Zj4/iRlvfLGQ6rNuhoWkm9nuLOTJlzmb9lPxLFo1lo
 N4FDuXE0daPvmgy/XVu9wBKDZsgTlegzikGARfQmeHJ7Wj1H8ibz1OJPd1o60p4Y
 pfeImxefoNKxx7IxnNFDMLHgVi+CtnOZklwlj+bobIWjzclNB2EacumnyJlPuboW
 4ZHBSkG1roLkBB4Q10fI7OHV8lSuQp/IyrAypLybydJ0xiZgvGD3NPOA4N8KH8nR
 A0kbA953Rld/PFzw5inRqepyPZKtT07LJyfl1ff60OtKOHkVBXPv6pYrdgWs0A9y
 XZxdHjVV/MHLvcK9dBoZZGi0/907fvcEgtMacaRekevD5sqiqtNOH5B5rQsMwtXs
 s/Kvg1KgmVJBQwFcMRuAfXqnnPy2672XvDU5/uptVbhpOIcIVeHtGvygPkiobuuO
 V1sE+huGCw+xnfRIOOmytRTpkHMlIS9ev1ApfXtuUtbXbM0W1G6H7adc0KE4bApm
 D/fpv97myH42r/UghOL5EHVaLcnw8embVr/ij4WpMiC1TrhWy0XU27oJisG6xRw6
 A2Q4ybO3VM85LgteeQg10BZFmnuwfHMRJPBL4TOhNSs5GBx5EmkEFByozvMst5xR
 W/GIDn7g7jy1H0wuQOQ7NCgU5+RDDslCOjCIJdSipwpsTc65QCQ=
 =m4xO
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:

 - Fix "cacheable" typo in comments (Cui GaoSheng)

 - Fix CONFIG for /proc/$pid/status Seccomp_filters (Kenta.Tada@sony.com)

* tag 'seccomp-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Fix "cacheable" typo in comments
  seccomp: Fix CONFIG tests for Seccomp_filters
2021-04-27 10:03:12 -07:00
Lorenzo Bianconi
bb02478077 bpf, cpumap: Bulk skb using netif_receive_skb_list
Rely on netif_receive_skb_list routine to send skbs converted from
xdp_frames in cpu_map_kthread_run in order to improve i-cache usage.
The proposed patch has been tested running xdp_redirect_cpu bpf sample
available in the kernel tree that is used to redirect UDP frames from
ixgbe driver to a cpumap entry and then to the networking stack. UDP
frames are generated using pktgen. Packets are discarded by the UDP
layer.

$ xdp_redirect_cpu  --cpu <cpu> --progname xdp_cpu_map0 --dev <eth>

bpf-next: ~2.35Mpps
bpf-next + cpumap skb-list: ~2.72Mpps

Rename drops counter in kmem_alloc_drops since now it reports just
kmem_cache_alloc_bulk failures

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/c729f83e5d7482d9329e0f165bdbe5adcefd1510.1619169700.git.lorenzo@kernel.org
2021-04-27 17:13:49 +02:00
Daniel Borkmann
10bf4e8316 bpf: Fix propagation of 32 bit unsigned bounds from 64 bit bounds
Similarly as b02709587e ("bpf: Fix propagation of 32-bit signed bounds
from 64-bit bounds."), we also need to fix the propagation of 32 bit
unsigned bounds from 64 bit counterparts. That is, really only set the
u32_{min,max}_value when /both/ {umin,umax}_value safely fit in 32 bit
space. For example, the register with a umin_value == 1 does /not/ imply
that u32_min_value is also equal to 1, since umax_value could be much
larger than 32 bit subregister can hold, and thus u32_min_value is in
the interval [0,1] instead.

Before fix, invalid tracking result of R2_w=inv1:

  [...]
  5: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0) R10=fp0
  5: (35) if r2 >= 0x1 goto pc+1
  [...] // goto path
  7: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umin_value=1) R10=fp0
  7: (b6) if w2 <= 0x1 goto pc+1
  [...] // goto path
  9: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,smin_value=-9223372036854775807,smax_value=9223372032559808513,umin_value=1,umax_value=18446744069414584321,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_max_value=1) R10=fp0
  9: (bc) w2 = w2
  10: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv1 R10=fp0
  [...]

After fix, correct tracking result of R2_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1)):

  [...]
  5: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0) R10=fp0
  5: (35) if r2 >= 0x1 goto pc+1
  [...] // goto path
  7: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umin_value=1) R10=fp0
  7: (b6) if w2 <= 0x1 goto pc+1
  [...] // goto path
  9: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001),s32_min_value=0,s32_max_value=1,u32_max_value=1) R10=fp0
  9: (bc) w2 = w2
  10: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1)) R10=fp0
  [...]

Thus, same issue as in b02709587e holds for unsigned subregister tracking.
Also, align __reg64_bound_u32() similarly to __reg64_bound_s32() as done in
b02709587e to make them uniform again.

Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Manfred Paul (@_manfp)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-27 17:13:49 +02:00
Florent Revest
38d26d89b3 bpf: Lock bpf_trace_printk's tmp buf before it is written to
bpf_trace_printk uses a shared static buffer to hold strings before they
are printed. A recent refactoring moved the locking of that buffer after
it gets filled by mistake.

Fixes: d9c9e4db18 ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210427112958.773132-1-revest@chromium.org
2021-04-27 08:04:34 -07:00
Linus Torvalds
5469f160e6 Power management updates for 5.13-rc1
- Add idle states table for IceLake-D to the intel_idle driver and
    update IceLake-X C6 data in it (Artem Bityutskiy).
 
  - Fix the C7 idle state on Tegra114 in the tegra cpuidle driver and
    drop the unused do_idle() firmware call from it (Dmitry Osipenko).
 
  - Fix cpuidle-qcom-spm Kconfig entry (He Ying).
 
  - Fix handling of possible negative tick_nohz_get_next_hrtimer()
    return values of in cpuidle governors (Rafael Wysocki).
 
  - Add support for frequency-invariance to the ACPI CPPC cpufreq
    driver and update the frequency-invariance engine (FIE) to use it
    as needed (Viresh Kumar).
 
  - Simplify the default delay_us setting in the ACPI CPPC cpufreq
    driver (Tom Saeger).
 
  - Clean up frequency-related computations in the intel_pstate
    cpufreq driver (Rafael Wysocki).
 
  - Fix TBG parent setting for load levels in the armada-37xx
    cpufreq driver and drop the CPU PM clock .set_parent method for
    armada-37xx (Marek Behún).
 
  - Fix multiple issues in the armada-37xx cpufreq driver (Pali Rohár).
 
  - Fix handling of dev_pm_opp_of_cpumask_add_table() return values
    in cpufreq-dt to take the -EPROBE_DEFER one into acconut as
    appropriate (Quanyang Wang).
 
  - Fix format string in ia64-acpi-cpufreq (Sergei Trofimovich).
 
  - Drop the unused for_each_policy() macro from cpufreq (Shaokun
    Zhang).
 
  - Simplify computations in the schedutil cpufreq governor to avoid
    unnecessary overhead (Yue Hu).
 
  - Fix typos in the s5pv210 cpufreq driver (Bhaskar Chowdhury).
 
  - Fix cpufreq documentation links in Kconfig (Alexander Monakov).
 
  - Fix PCI device power state handling in pci_enable_device_flags()
    to avoid issuse in some cases when the device depends on an ACPI
    power resource (Rafael Wysocki).
 
  - Add missing documentation of pm_runtime_resume_and_get() (Alan
    Stern).
 
  - Add missing static inline stub for pm_runtime_has_no_callbacks()
    to pm_runtime.h and drop the unused try_to_freeze_nowarn()
    definition (YueHaibing).
 
  - Drop duplicate struct device declaration from pm.h and fix a
    structure type declaration in intel_rapl.h (Wan Jiabing).
 
  - Use dev_set_name() instead of an open-coded equivalent of it in
    the wakeup sources code and drop a redundant local variable
    initialization from it (Andy Shevchenko, Colin Ian King).
 
  - Use crc32 instead of md5 for e820 memory map integrity check
    during resume from hibernation on x86 (Chris von Recklinghausen).
 
  - Fix typos in comments in the system-wide and hibernation support
    code (Lu Jialin).
 
  - Modify the generic power domains (genpd) code to avoid resuming
    devices in the "prepare" phase of system-wide suspend and
    hibernation (Ulf Hansson).
 
  - Add Hygon Fam18h RAPL support to the intel_rapl power capping
    driver (Pu Wen).
 
  - Add MAINTAINERS entry for the dynamic thermal power management
    (DTPM) code (Daniel Lezcano).
 
  - Add devm variants of operating performance points (OPP) API
    functions and switch over some users of the OPP framework to
    the new resource-managed API (Yangtao Li and Dmitry Osipenko).
 
  - Update devfreq core:
 
    * Register devfreq devices as cooling devices on demand (Daniel
      Lezcano).
 
    * Add missing unlock opeation in devfreq_add_device() (Lukasz
      Luba).
 
    * Use the next frequency as resume_freq instead of the previous
      frequency when using the opp-suspend property (Dong Aisheng).
 
    * Check get_dev_status in devfreq_update_stats() (Dong Aisheng).
 
    * Fix set_freq path for the userspace governor in Kconfig (Dong
      Aisheng).
 
    * Remove invalid description of get_target_freq() (Dong Aisheng).
 
  - Update devfreq drivers:
 
    * imx8m-ddrc: Remove imx8m_ddrc_get_dev_status() and unneeded
      of_match_ptr() (Dong Aisheng, Fabio Estevam).
 
    * rk3399_dmc: dt-bindings: Add rockchip,pmu phandle and drop
      references to undefined symbols (Enric Balletbo i Serra, Gaël
      PORTAY).
 
    * rk3399_dmc: Use dev_err_probe() to simplify the code (Krzysztof
      Kozlowski).
 
    * imx-bus: Remove unneeded of_match_ptr() (Fabio Estevam).
 
  - Fix kernel-doc warnings in three places (Pierre-Louis Bossart).
 
  - Fix typo in the pm-graph utility code (Ricardo Ribalda).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmCHAUISHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxAxMP/0tFjgxeaJ3chYaiqoPlk2QX/XdwqJvm
 8OOu2qBMWbt2bubcIlAPpdlCNaERI4itF7E8za7t9alswdq7YPWGmNR9snCXUKhD
 BzERuicZTeOcCk2P3DTgzLVc4EzF6wutA3lTdYYZIpf+LuuB+guG8zgMzScRHIsM
 N3I83O+iLTA9ifIqN0/wH//a0ISvo6rSWtcbx+48d5bYvYixW7CsBmoxWHhGiYsw
 4PJ4AzbdNJEhQp91SBYPIAmqwV88FZUPoYnRazXMxOSevMewhP9JuCHXAujC3gLV
 l5d2TBaBV4EBYLD5tfCpJvHMXhv/yBpg6KRivjri+zEnY1TAqIlfR4vYiL7puVvQ
 PdwjyvNFDNGyUSX/AAwYF6F4WCtIhw8hCahw6Dw2zcDz0plFdRZmWAiTdmIjECJK
 8EvwJNlmdl27G1y+EBc6NnwzEFZQwiu9F5bUHUkmc3fF1M1aFTza8WDNDo30TC94
 94c+uVBRL2fBePl4FfGZATfJbOMb8+vDIkroQxrIjQDT/7Ha3Mz75JZDRHItZo92
 +4fES2eFdfZISCLIQMBY5TeXox3O8qsirC1k4qELwy71gPUE9CpP3FkxKIvyZLlv
 +6fS9ttpUfyFBF7gqrEy+ziVk1Rm4oorLmWCtthz4xyerzj5+vtZQqKzcySk0GA5
 hYkseZkedR6y
 =t+SG
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add some new hardware support (for example, IceLake-D idle
  states in intel_idle), fix some issues (for example, the handling of
  negative "sleep length" values in cpuidle governors), add new
  functionality to the existing drivers (for example, scale-invariance
  support in the ACPI CPPC cpufreq driver) and clean up code all over.

  Specifics:

   - Add idle states table for IceLake-D to the intel_idle driver and
     update IceLake-X C6 data in it (Artem Bityutskiy).

   - Fix the C7 idle state on Tegra114 in the tegra cpuidle driver and
     drop the unused do_idle() firmware call from it (Dmitry Osipenko).

   - Fix cpuidle-qcom-spm Kconfig entry (He Ying).

   - Fix handling of possible negative tick_nohz_get_next_hrtimer()
     return values of in cpuidle governors (Rafael Wysocki).

   - Add support for frequency-invariance to the ACPI CPPC cpufreq
     driver and update the frequency-invariance engine (FIE) to use it
     as needed (Viresh Kumar).

   - Simplify the default delay_us setting in the ACPI CPPC cpufreq
     driver (Tom Saeger).

   - Clean up frequency-related computations in the intel_pstate cpufreq
     driver (Rafael Wysocki).

   - Fix TBG parent setting for load levels in the armada-37xx cpufreq
     driver and drop the CPU PM clock .set_parent method for armada-37xx
     (Marek Behún).

   - Fix multiple issues in the armada-37xx cpufreq driver (Pali Rohár).

   - Fix handling of dev_pm_opp_of_cpumask_add_table() return values in
     cpufreq-dt to take the -EPROBE_DEFER one into acconut as
     appropriate (Quanyang Wang).

   - Fix format string in ia64-acpi-cpufreq (Sergei Trofimovich).

   - Drop the unused for_each_policy() macro from cpufreq (Shaokun
     Zhang).

   - Simplify computations in the schedutil cpufreq governor to avoid
     unnecessary overhead (Yue Hu).

   - Fix typos in the s5pv210 cpufreq driver (Bhaskar Chowdhury).

   - Fix cpufreq documentation links in Kconfig (Alexander Monakov).

   - Fix PCI device power state handling in pci_enable_device_flags() to
     avoid issuse in some cases when the device depends on an ACPI power
     resource (Rafael Wysocki).

   - Add missing documentation of pm_runtime_resume_and_get() (Alan
     Stern).

   - Add missing static inline stub for pm_runtime_has_no_callbacks() to
     pm_runtime.h and drop the unused try_to_freeze_nowarn() definition
     (YueHaibing).

   - Drop duplicate struct device declaration from pm.h and fix a
     structure type declaration in intel_rapl.h (Wan Jiabing).

   - Use dev_set_name() instead of an open-coded equivalent of it in the
     wakeup sources code and drop a redundant local variable
     initialization from it (Andy Shevchenko, Colin Ian King).

   - Use crc32 instead of md5 for e820 memory map integrity check during
     resume from hibernation on x86 (Chris von Recklinghausen).

   - Fix typos in comments in the system-wide and hibernation support
     code (Lu Jialin).

   - Modify the generic power domains (genpd) code to avoid resuming
     devices in the "prepare" phase of system-wide suspend and
     hibernation (Ulf Hansson).

   - Add Hygon Fam18h RAPL support to the intel_rapl power capping
     driver (Pu Wen).

   - Add MAINTAINERS entry for the dynamic thermal power management
     (DTPM) code (Daniel Lezcano).

   - Add devm variants of operating performance points (OPP) API
     functions and switch over some users of the OPP framework to the
     new resource-managed API (Yangtao Li and Dmitry Osipenko).

   - Update devfreq core:

      * Register devfreq devices as cooling devices on demand (Daniel
        Lezcano).

      * Add missing unlock opeation in devfreq_add_device() (Lukasz
        Luba).

      * Use the next frequency as resume_freq instead of the previous
        frequency when using the opp-suspend property (Dong Aisheng).

      * Check get_dev_status in devfreq_update_stats() (Dong Aisheng).

      * Fix set_freq path for the userspace governor in Kconfig (Dong
        Aisheng).

      * Remove invalid description of get_target_freq() (Dong Aisheng).

   - Update devfreq drivers:

      * imx8m-ddrc: Remove imx8m_ddrc_get_dev_status() and unneeded
        of_match_ptr() (Dong Aisheng, Fabio Estevam).

      * rk3399_dmc: dt-bindings: Add rockchip,pmu phandle and drop
        references to undefined symbols (Enric Balletbo i Serra, Gaël
        PORTAY).

      * rk3399_dmc: Use dev_err_probe() to simplify the code (Krzysztof
        Kozlowski).

      * imx-bus: Remove unneeded of_match_ptr() (Fabio Estevam).

   - Fix kernel-doc warnings in three places (Pierre-Louis Bossart).

   - Fix typo in the pm-graph utility code (Ricardo Ribalda)"

* tag 'pm-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
  PM: wakeup: remove redundant assignment to variable retval
  PM: hibernate: x86: Use crc32 instead of md5 for hibernation e820 integrity check
  cpufreq: Kconfig: fix documentation links
  PM: wakeup: use dev_set_name() directly
  PM: runtime: Add documentation for pm_runtime_resume_and_get()
  cpufreq: intel_pstate: Simplify intel_pstate_update_perf_limits()
  cpufreq: armada-37xx: Fix module unloading
  cpufreq: armada-37xx: Remove cur_frequency variable
  cpufreq: armada-37xx: Fix determining base CPU frequency
  cpufreq: armada-37xx: Fix driver cleanup when registration failed
  clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
  clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
  cpufreq: armada-37xx: Fix the AVS value for load L1
  clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
  cpufreq: armada-37xx: Fix setting TBG parent for load levels
  cpuidle: Fix ARM_QCOM_SPM_CPUIDLE configuration
  cpuidle: tegra: Remove do_idle firmware call
  cpuidle: tegra: Fix C7 idling state on Tegra114
  PM: sleep: fix typos in comments
  cpufreq: Remove unused for_each_policy macro
  ...
2021-04-26 15:10:25 -07:00
Linus Torvalds
31a24ae89c arm64 updates for 5.13:
- MTE asynchronous support for KASan. Previously only synchronous
   (slower) mode was supported. Asynchronous is faster but does not allow
   precise identification of the illegal access.
 
 - Run kernel mode SIMD with softirqs disabled. This allows using NEON in
   softirq context for crypto performance improvements. The conditional
   yield support is modified to take softirqs into account and reduce the
   latency.
 
 - Preparatory patches for Apple M1: handle CPUs that only have the VHE
   mode available (host kernel running at EL2), add FIQ support.
 
 - arm64 perf updates: support for HiSilicon PA and SLLC PMU drivers, new
   functions for the HiSilicon HHA and L3C PMU, cleanups.
 
 - Re-introduce support for execute-only user permissions but only when
   the EPAN (Enhanced Privileged Access Never) architecture feature is
   available.
 
 - Disable fine-grained traps at boot and improve the documented boot
   requirements.
 
 - Support CONFIG_KASAN_VMALLOC on arm64 (only with KASAN_GENERIC).
 
 - Add hierarchical eXecute Never permissions for all page tables.
 
 - Add arm64 prctl(PR_PAC_{SET,GET}_ENABLED_KEYS) allowing user programs
   to control which PAC keys are enabled in a particular task.
 
 - arm64 kselftests for BTI and some improvements to the MTE tests.
 
 - Minor improvements to the compat vdso and sigpage.
 
 - Miscellaneous cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmB5xkkACgkQa9axLQDI
 XvEBgRAAsr6r8gsBQJP3FDHmbtbVf2ej5QJTCOAQAGHbTt0JH7Pk03pWSBr7h5nF
 vsddRDxxeDgB6xd7jWP7EvDaPxHeB0CdSj5gG8EP/ZdOm8sFAwB1ZIHWikgUgSwW
 nu6R28yXTMSj+EkyFtahMhTMJ1EMF4sCPuIgAo59ST5w/UMMqLCJByOu4ej6RPKZ
 aeSJJWaDLBmbgnTKWxRvCc/MgIx4J/LAHWGkdpGjuMK6SLp38Kdf86XcrklXtzwf
 K30ZYeoKq8zZ+nFOsK9gBVlOlocZcbS1jEbN842jD6imb6vKLQtBWrKk9A6o4v5E
 XulORWcSBhkZb3ItIU9+6SmelUExf0VeVlSp657QXYPgquoIIGvFl6rCwhrdGMGO
 bi6NZKCfJvcFZJoIN1oyhuHejgZSBnzGEcvhvzNdg7ItvOCed7q3uXcGHz/OI6tL
 2TZKddzHSEMVfTo0D+RUsYfasZHI1qAiQ0mWVC31c+YHuRuW/K/jlc3a5TXlSBUa
 Dwu0/zzMLiqx65ISx9i7XNMrngk55uzrS6MnwSByPoz4M4xsElZxt3cbUxQ8YAQz
 jhxTHs1Pwes8i7f4n61ay/nHCFbmVvN/LlsPRpZdwd8JumThLrDolF3tc6aaY0xO
 hOssKtnGY4Xvh/WitfJ5uvDb1vMObJKTXQEoZEJh4hlNQDxdeUE=
 =6NGI
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - MTE asynchronous support for KASan. Previously only synchronous
   (slower) mode was supported. Asynchronous is faster but does not
   allow precise identification of the illegal access.

 - Run kernel mode SIMD with softirqs disabled. This allows using NEON
   in softirq context for crypto performance improvements. The
   conditional yield support is modified to take softirqs into account
   and reduce the latency.

 - Preparatory patches for Apple M1: handle CPUs that only have the VHE
   mode available (host kernel running at EL2), add FIQ support.

 - arm64 perf updates: support for HiSilicon PA and SLLC PMU drivers,
   new functions for the HiSilicon HHA and L3C PMU, cleanups.

 - Re-introduce support for execute-only user permissions but only when
   the EPAN (Enhanced Privileged Access Never) architecture feature is
   available.

 - Disable fine-grained traps at boot and improve the documented boot
   requirements.

 - Support CONFIG_KASAN_VMALLOC on arm64 (only with KASAN_GENERIC).

 - Add hierarchical eXecute Never permissions for all page tables.

 - Add arm64 prctl(PR_PAC_{SET,GET}_ENABLED_KEYS) allowing user programs
   to control which PAC keys are enabled in a particular task.

 - arm64 kselftests for BTI and some improvements to the MTE tests.

 - Minor improvements to the compat vdso and sigpage.

 - Miscellaneous cleanups.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (86 commits)
  arm64/sve: Add compile time checks for SVE hooks in generic functions
  arm64/kernel/probes: Use BUG_ON instead of if condition followed by BUG.
  arm64: pac: Optimize kernel entry/exit key installation code paths
  arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
  arm64: mte: make the per-task SCTLR_EL1 field usable elsewhere
  arm64/sve: Remove redundant system_supports_sve() tests
  arm64: fpsimd: run kernel mode NEON with softirqs disabled
  arm64: assembler: introduce wxN aliases for wN registers
  arm64: assembler: remove conditional NEON yield macros
  kasan, arm64: tests supports for HW_TAGS async mode
  arm64: mte: Report async tag faults before suspend
  arm64: mte: Enable async tag check fault
  arm64: mte: Conditionally compile mte_enable_kernel_*()
  arm64: mte: Enable TCO in functions that can read beyond buffer limits
  kasan: Add report for async mode
  arm64: mte: Drop arch_enable_tagging()
  kasan: Add KASAN mode kernel parameter
  arm64: mte: Add asynchronous mode support
  arm64: Get rid of CONFIG_ARM64_VHE
  arm64: Cope with CPUs stuck in VHE mode
  ...
2021-04-26 10:25:03 -07:00
Linus Torvalds
87dcebff92 The time and timers updates contain:
Core changes:
 
    - Allow runtime power management when the clocksource is changed.
 
    - A correctness fix for clock_adjtime32() so that the return value
      on success is not overwritten by the result of the copy to user.
 
    - Allow late installment of broadcast clockevent devices which was
      broken because nothing switched them over to oneshot mode. This went
      unnoticed so far because clockevent devices used to be built in, but
      now people started to make them modular.
 
    - Debugfs related simplifications
 
    - Small cleanups and improvements here and there
 
 Driver changes:
 
    - The usual set of device tree binding updates for a wide range
      of drivers/devices.
 
    - The usual updates and improvements for drivers all over the place but
      nothing outstanding.
 
    - No new clocksource/event drivers. They'll come back next time.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGieYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobRJEACNCtecUXdyt/u+ViDgHwG1XOHSZUkG
 zBO6E/uZ3G6ZUkr6FogAaY2eMMrSdSUyqbiNBSYBJki2ptMJWF5Li5VzqINmrBuD
 VyjK3FEDV0bXW9EJOm4d+95pMyFQ/pYv9VPcByj7VW21t+IDE/4pLeZ8M8shNDHa
 pmMnR/tgX4ZZtSrX2NqCUNoTrkycaz8d5NOuso5HjKvPkJ5BU2kSxULTGmvaeTil
 8d+70AetApDgzAWpCnJFPlLlOHIPyhnMxS5edvsMIbMIkRLsnI+b3LsPZe+CqVZ0
 zaP6KYvG+iqU8nKdz7OweV1fLgBD52GKgHlpTkhhYs3GW4XBEXDrsyoEyeIiZ22u
 YUkTzFvZ4JG/+80UUaKpLDIGYWUj1h+xe/EtWS0s8lj108RsNLghd/0YjFMikspT
 fYC2WpaXJDz3URbSV57OXGbwhg2zOYI5Supg6wNrmFfcld3k6CSitG4idDpIGjJE
 8WIcZmeZSelDufskiY8RmsiTumqNOf5P33F71r9JRI6QU9RsyYb3fJN71AFKnLq2
 31YEAShpzPYG5EGRinPymJRi3icdmcEQECz/pWUb6ua0s/HG1+HD9emLwHzvPdul
 hcWRq19GaK1YBzOfV60+8cdxW8ZEOROvRVdYJO8FoYcnueUJmOSM+boqSkRtDw3o
 RywO8BetxukPJg==
 =F6Du
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "The time and timers updates contain:

  Core changes:

   - Allow runtime power management when the clocksource is changed.

   - A correctness fix for clock_adjtime32() so that the return value on
     success is not overwritten by the result of the copy to user.

   - Allow late installment of broadcast clockevent devices which was
     broken because nothing switched them over to oneshot mode. This
     went unnoticed so far because clockevent devices used to be built
     in, but now people started to make them modular.

   - Debugfs related simplifications

   - Small cleanups and improvements here and there

  Driver changes:

   - The usual set of device tree binding updates for a wide range of
     drivers/devices.

   - The usual updates and improvements for drivers all over the place
     but nothing outstanding.

   - No new clocksource/event drivers. They'll come back next time"

* tag 'timers-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  posix-timers: Preserve return value in clock_adjtime32()
  tick/broadcast: Allow late registered device to enter oneshot mode
  tick: Use tick_check_replacement() instead of open coding it
  time/timecounter: Mark 1st argument of timecounter_cyc2time() as const
  dt-bindings: timer: nuvoton,npcm7xx: Add wpcm450-timer
  clocksource/drivers/arm_arch_timer: Add __ro_after_init and __init
  clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940
  clocksource/drivers/timer-ti-dm: Prepare to handle dra7 timer wrap issue
  clocksource/drivers/dw_apb_timer_of: Add handling for potential memory leak
  clocksource/drivers/npcm: Add support for WPCM450
  clocksource/drivers/sh_cmt: Don't use CMTOUT_IE with R-Car Gen2/3
  clocksource/drivers/pistachio: Fix trivial typo
  clocksource/drivers/ingenic_ost: Fix return value check in ingenic_ost_probe()
  clocksource/drivers/timer-ti-dm: Add missing set_state_oneshot_stopped
  clocksource/drivers/timer-ti-dm: Fix posted mode status check order
  dt-bindings: timer: renesas,cmt: Document R8A77961
  dt-bindings: timer: renesas,cmt: Add r8a779a0 CMT support
  clocksource/drivers/ingenic-ost: Add support for the JZ4760B
  clocksource/drivers/ingenic: Add support for the JZ4760
  dt-bindings: timer: ingenic: Add compatible strings for JZ4760(B)
  ...
2021-04-26 09:54:03 -07:00
Linus Torvalds
91552ab8ff The usual updates from the irq departement:
Core changes:
 
  - Provide IRQF_NO_AUTOEN as a flag for request*_irq() so drivers can be
    cleaned up which either use a seperate mechanism to prevent auto-enable
    at request time or have a racy mechanism which disables the interrupt
    right after request.
 
  - Get rid of the last usage of irq_create_identity_mapping() and remove
    the interface.
 
  - An overhaul of tasklet_disable(). Most usage sites of tasklet_disable()
    are in task context and usually in cleanup, teardown code pathes.
    tasklet_disable() spinwaits for a tasklet which is currently executed.
    That's not only a problem for PREEMPT_RT where this can lead to a live
    lock when the disabling task preempts the softirq thread. It's also
    problematic in context of virtualization when the vCPU which runs the
    tasklet is scheduled out and the disabling code has to spin wait until
    it's scheduled back in. Though there are a few code pathes which invoke
    tasklet_disable() from non-sleepable context. For these a new disable
    variant which still spinwaits is provided which allows to switch
    tasklet_disable() to a sleep wait mechanism. For the atomic use cases
    this does not solve the live lock issue on PREEMPT_RT. That is mitigated
    by blocking on the RT specific softirq lock.
 
  - The PREEMPT_RT specific implementation of softirq processing and
    local_bh_disable/enable().
 
    On RT enabled kernels soft interrupt processing happens always in task
    context and all interrupt handlers, which are not explicitly marked to
    be invoked in hard interrupt context are forced into task context as
    well. This allows to protect against softirq processing with a per
    CPU lock, which in turn allows to make BH disabled regions preemptible.
 
    Most of the softirq handling code is still shared. The RT/non-RT
    specific differences are addressed with a set of inline functions which
    provide the context specific functionality. The local_bh_disable() /
    local_bh_enable() mechanism are obviously seperate.
 
  - The usual set of small improvements and cleanups
 
 Driver changes:
 
  - New drivers for Nuvoton WPCM450 and DT 79rc3243x interrupt controllers
 
  - Extended functionality for MStar, STM32 and SC7280 irq chips
 
  - Enhanced robustness for ARM GICv3/4.1 drivers
 
  - The usual set of cleanups and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGh5wTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZ+/EACWBpQ/2ZHizEw1bzjaDzJrR8U228xu
 wNi7nSP92Y07nJ3cCX7a6TJ53mqd0n3RT+DprlsOuqSN0D7Ktr/x44V/aZtm0d3N
 GkFOlpeGCRnHusLaUTwk7a8289LuoQ7OhSxIB409n1I4nLI96ZK41D1tYonMYl6E
 nxDiGADASfjaciBWbjwJO/mlwmiW/VRpSTxswx0wzakFfbIx9iKyKv1bCJQZ5JK+
 lHmf0jxpDIs1EVK/ElJ9Ky6TMBlEmZyiX7n6rujtwJ1W+Jc/uL/y8pLJvGwooVmI
 yHTYsLMqzviCbAMhJiB3h1qs3GbCGlM78prgJTnOd0+xEUOCcopCRQlsTXVBq8Nb
 OS+HNkYmYXRfiSH6lINJsIok8Xis28bAw/qWz2Ho+8wLq0TI8crK38roD1fPndee
 FNJRhsPPOBkscpIldJ0Cr0X5lclkJFiAhAxORPHoseKvQSm7gBMB7H99xeGRffTn
 yB3XqeTJMvPNmAHNN4Brv6ey3OjwnEWBgwcnIM2LtbIlRtlmxTYuR+82OPOgEvzk
 fSrjFFJqu0LEMLEOXS4pYN824PawjV//UAy4IaG8AodmUUCSGHgw1gTVa4sIf72t
 tXY54HqWfRWRpujhVRgsZETqBUtZkL6yvpoe8f6H7P91W5tAfv3oj4ch9RkhUo+Z
 b0/u9T0+Fpbg+w==
 =id4G
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "The usual updates from the irq departement:

  Core changes:

   - Provide IRQF_NO_AUTOEN as a flag for request*_irq() so drivers can
     be cleaned up which either use a seperate mechanism to prevent
     auto-enable at request time or have a racy mechanism which disables
     the interrupt right after request.

   - Get rid of the last usage of irq_create_identity_mapping() and
     remove the interface.

   - An overhaul of tasklet_disable().

     Most usage sites of tasklet_disable() are in task context and
     usually in cleanup, teardown code pathes. tasklet_disable()
     spinwaits for a tasklet which is currently executed. That's not
     only a problem for PREEMPT_RT where this can lead to a live lock
     when the disabling task preempts the softirq thread. It's also
     problematic in context of virtualization when the vCPU which runs
     the tasklet is scheduled out and the disabling code has to spin
     wait until it's scheduled back in.

     There are a few code pathes which invoke tasklet_disable() from
     non-sleepable context. For these a new disable variant which still
     spinwaits is provided which allows to switch tasklet_disable() to a
     sleep wait mechanism. For the atomic use cases this does not solve
     the live lock issue on PREEMPT_RT. That is mitigated by blocking on
     the RT specific softirq lock.

   - The PREEMPT_RT specific implementation of softirq processing and
     local_bh_disable/enable().

     On RT enabled kernels soft interrupt processing happens always in
     task context and all interrupt handlers, which are not explicitly
     marked to be invoked in hard interrupt context are forced into task
     context as well. This allows to protect against softirq processing
     with a per CPU lock, which in turn allows to make BH disabled
     regions preemptible.

     Most of the softirq handling code is still shared. The RT/non-RT
     specific differences are addressed with a set of inline functions
     which provide the context specific functionality. The
     local_bh_disable() / local_bh_enable() mechanism are obviously
     seperate.

   - The usual set of small improvements and cleanups

  Driver changes:

   - New drivers for Nuvoton WPCM450 and DT 79rc3243x interrupt
     controllers

   - Extended functionality for MStar, STM32 and SC7280 irq chips

   - Enhanced robustness for ARM GICv3/4.1 drivers

   - The usual set of cleanups and improvements all over the place"

* tag 'irq-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  irqchip/xilinx: Expose Kconfig option for Zynq/ZynqMP
  irqchip/gic-v3: Do not enable irqs when handling spurious interrups
  dt-bindings: interrupt-controller: Add IDT 79RC3243x Interrupt Controller
  irqchip: Add support for IDT 79rc3243x interrupt controller
  irqdomain: Drop references to recusive irqdomain setup
  irqdomain: Get rid of irq_create_strict_mappings()
  irqchip/jcore-aic: Kill use of irq_create_strict_mappings()
  ARM: PXA: Kill use of irq_create_strict_mappings()
  irqchip/gic-v4.1: Disable vSGI upon (GIC CPUIF < v4.1) detection
  irqchip/tb10x: Use 'fallthrough' to eliminate a warning
  genirq: Reduce irqdebug cacheline bouncing
  kernel: Initialize cpumask before parsing
  irqchip/wpcm450: Drop COMPILE_TEST
  irqchip/irq-mst: Support polarity configuration
  irqchip: Add driver for WPCM450 interrupt controller
  dt-bindings: interrupt-controller: Add nuvoton, wpcm450-aic
  dt-bindings: qcom,pdc: Add compatible for sc7280
  irqchip/stm32: Add usart instances exti direct event support
  irqchip/gic-v3: Fix OF_BAD_ADDR error handling
  irqchip/sifive-plic: Mark two global variables __ro_after_init
  ...
2021-04-26 09:43:16 -07:00
Linus Torvalds
3b671bf4a7 A trivial cleanup of typo fixes.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGgSATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTdtD/wPWKg5olDLUr3S9Oh15dPj2zzrZHl8
 opJkN/0LpZvTYJmAtVrV1SCywbQYhpsLEB+khBZj0OY/A9gDRGM2uBxxL3oyvyOU
 hRjJkimJuNF3ErDAIFnW3rmgYPIgRnUAnS1hflUpeROHeL5CfR/nPqQgT6I79fmZ
 c23kbGOdz7lw5vUNPiSvpU52UT4HfTfDs5NXB3B4A7Lc2292o3xCw20/LZGsICRy
 6CybMM9Lp7yKdosPZxyjjS9Fu2U2/HptmQ0ueNC4/GYKxq+zM3JDNHECZ4IXVSlu
 +KPgqIfrhec0fwcyWXdEPwDLvgg4XPipxr5mLi9qa45MFxXhtzeKdSD2ktUFU5QH
 G10VWfslDt1VBavKR4u4lz/1L0iiRW3EyuMnbuZNCvbObfNY/jKs+3Kn4OcZIYfL
 DurPMvBZo9BiS3+rIROEETJvzvf84sstDU2c4dZzQMKxtSs1DvVDpyypSqzBkBcr
 n2nWRNsAVhSz0avJ3ZP8yphy/8TFWUY9H1sLQC68ih3frn5sgKN7Ng2cWGCVzL5P
 2geUmbEybdUO9x8mz6ui78kDgwrapyHZlXOQvbaSmlEDA00tEQM0XRFC+B29OwkX
 P37SQjlkvnH5hYxU2x+v4tCrZvrefYuv30E0RbdV5g0WPXAYUL7OH7hK+UzZkKGh
 iZEVlJ2vEUw/lQ==
 =m4pW
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry updates from Thomas Gleixner:
 "A trivial cleanup of typo fixes"

* tag 'core-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Fix typos in comments
2021-04-26 09:41:15 -07:00
Linus Torvalds
ffc766b31e This is an irregular pull request for sending a lockdep patch.
Peter Zijlstra asked us to find bad annotation that blows up the lockdep
 storage [1][2][3] but we could not find such annotation [4][5], and
 Peter cannot give us feedback any more [6]. Since we tested this patch
 on linux-next.git without problems, and keeping this problem unresolved
 discourages kernel testing which is more painful, I'm sending this patch
 without forever waiting for response from Peter.
 
 [1] https://lkml.kernel.org/r/20200916115057.GO2674@hirez.programming.kicks-ass.net
 [2] https://lkml.kernel.org/r/20201118142357.GW3121392@hirez.programming.kicks-ass.net
 [3] https://lkml.kernel.org/r/20201118151038.GX3121392@hirez.programming.kicks-ass.net
 [4] https://lkml.kernel.org/r/CACT4Y+asqRbjaN9ras=P5DcxKgzsnV0fvV0tYb2VkT+P00pFvQ@mail.gmail.com
 [5] https://lkml.kernel.org/r/4b89985e-99f9-18bc-0bf1-c883127dc70c@i-love.sakura.ne.jp
 [6] https://lkml.kernel.org/r/CACT4Y+YnHFV1p5mbhby2nyOaNTy8c_yoVk86z5avo14KWs0s1A@mail.gmail.com
 
  kernel/locking/lockdep.c           |    2 -
  kernel/locking/lockdep_internals.h |    8 +++----
  lib/Kconfig.debug                  |   40 +++++++++++++++++++++++++++++++++++++
  3 files changed, 45 insertions(+), 5 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJghetlAAoJEEJfEo0MZPUqMaEP/i0pkfOyKBdUe61Y9g0A2TmN
 h5I59KiSsgmx7dK90Q2GP1kUQE9ROCiqIz9qHzCzWfk9jljgFgRfECBKHqH+K7Tq
 AlQQkJmAiwpg+1scSkhoxBOrSGXHe2xB4qvazvw7tAAIDPjcV/pkFlNKaUtItzr2
 VPr4t6Eis/MZ7Pau2xLFLX2gRn5KvpsbcL+wydrDfqlXx3pNXlBvChBxixk90HS6
 0BC5pgb68pXm8Emzbp3+iloy0VuG/BHDA/vy02k5zUjMM7Zy+aGxR/cl2jvc+lWd
 wyRWhwbSjTUrYs3Olmjkybj15lsgl573oIptVhIIrXuvjpyY5v1IH1gkLoxJgr5d
 yaKSdYwyN/OPI3KireEfaSgc6IqrJ1K9gLh1Knqw4JeoJngEVEkmBwBg/izpiXoL
 WVlWZuLkYtOTWxpsTOiCtzv4KkFhFtE61IEAIEsvvj9oeLQJu7JUR8oW0ZQtdfXg
 Em0IbObS8VGW322MNmb1p9SsaYvOueWyKzImEVlCBAb2g6PUYuiAwiOw8/tvsDFr
 KPXCPpaqKCFtp+BG21fn6GpTqJ4GteWy6JK6C9i/xhIWmv+QRijNEmPlyYQ0YMkd
 a8z8rqRqexknlPCJy/9AZWfBo6kg5Dt3icrrNVKoXLVC/LNYaHQvIKsGzZaQ1Pyq
 W6rnMbLCRD199sqoEFrH
 =E7U/
 -----END PGP SIGNATURE-----

Merge tag 'tomoyo-pr-20210426' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1

Pull lockdep capacity limit updates from Tetsuo Handa:
 "syzbot is occasionally reporting that fuzz testing is terminated due
  to hitting upper limits lockdep can track.

  Analysis via /proc/lockdep* did not show any obvious culprits, allow
  tuning tracing capacity constants"

* tag 'tomoyo-pr-20210426' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
  lockdep: Allow tuning tracing capacity constants.
2021-04-26 08:44:23 -07:00
Rafael J. Wysocki
bf0cc8360e Merge branches 'pm-core', 'pm-pci', 'pm-sleep', 'pm-domains' and 'powercap'
* pm-core:
  PM: runtime: Add documentation for pm_runtime_resume_and_get()
  PM: runtime: Replace inline function pm_runtime_callbacks_present()
  PM: core: Remove duplicate declaration from header file

* pm-pci:
  PCI: PM: Do not read power state in pci_enable_device_flags()

* pm-sleep:
  PM: wakeup: remove redundant assignment to variable retval
  PM: hibernate: x86: Use crc32 instead of md5 for hibernation e820 integrity check
  PM: wakeup: use dev_set_name() directly
  PM: sleep: fix typos in comments
  freezer: Remove unused inline function try_to_freeze_nowarn()

* pm-domains:
  PM: domains: Don't runtime resume devices at genpd_prepare()

* powercap:
  powercap: RAPL: Fix struct declaration in header file
  MAINTAINERS: Add DTPM subsystem maintainer
  powercap: Add Hygon Fam18h RAPL support
2021-04-26 16:57:17 +02:00
Rafael J. Wysocki
dd9f2ae924 Merge branch 'pm-cpufreq'
* pm-cpufreq: (22 commits)
  cpufreq: Kconfig: fix documentation links
  cpufreq: intel_pstate: Simplify intel_pstate_update_perf_limits()
  cpufreq: armada-37xx: Fix module unloading
  cpufreq: armada-37xx: Remove cur_frequency variable
  cpufreq: armada-37xx: Fix determining base CPU frequency
  cpufreq: armada-37xx: Fix driver cleanup when registration failed
  clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
  clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
  cpufreq: armada-37xx: Fix the AVS value for load L1
  clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
  cpufreq: armada-37xx: Fix setting TBG parent for load levels
  cpufreq: Remove unused for_each_policy macro
  cpufreq: dt: dev_pm_opp_of_cpumask_add_table() may return -EPROBE_DEFER
  cpufreq: intel_pstate: Clean up frequency computations
  cpufreq: cppc: simplify default delay_us setting
  cpufreq: Rudimentary typos fix in the file s5pv210-cpufreq.c
  cpufreq: CPPC: Add support for frequency invariance
  ia64: fix format string for ia64-acpi-cpu-freq
  cpufreq: schedutil: Call sugov_update_next_freq() before check to fast_switch_enabled
  arch_topology: Export arch_freq_scale and helpers
  ...
2021-04-26 16:56:50 +02:00
Jiri Olsa
f3a9507554 bpf: Allow trampoline re-attach for tracing and lsm programs
Currently we don't allow re-attaching of trampolines. Once
it's detached, it can't be re-attach even when the program
is still loaded.

Adding the possibility to re-attach the loaded tracing and
lsm programs.

Fixing missing unlock with proper cleanup goto jump reported
by Julia.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210414195147.1624932-2-jolsa@kernel.org
2021-04-25 21:09:01 -07:00
David S. Miller
5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Stefan Metzmacher
ff24430330 kernel: always initialize task->pf_io_worker to NULL
Otherwise io_wq_worker_{running,sleeping}() may dereference an
invalid pointer (in future). Currently all users of create_io_thread()
are fine and get task->pf_io_worker = NULL implicitly from the
wq_manager, which got it either from the userspace thread
of the sq_thread, which explicitly reset it to NULL.

I think it's safer to always reset it in order to avoid future
problems.

Fixes: 3bfe610669 ("io-wq: fork worker threads from original task")
cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:29:03 -06:00
Linus Torvalds
0146da0d4c - Fix ordering in the queued writer lock's slowpath.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCFOD4ACgkQEsHwGGHe
 VUrwChAAr3r3Qr60WDUVTtbx0TS+d7Rk+aq63DhN6PvFADMlZe3PfjRFnw9bamIO
 hgf/65+BsYPTbO8w9CdSdgrv00oVw0v1pbYVdIAJDBzzwrG489gjJxozrIgQHU+X
 ixpxnG7R8e4dZrA4Q/ip2EZYRwGfWyO1H2Z1K6/W/sSk3HGi2EKa3+4BGBIhVtnI
 K81ckR59H+/+Fi8D3PZozK0HmC1XXKuoA5I0Yx0mgivy6i1XQ6UPohgWLXW76uae
 JtvbC2E9APhlYaRlIzdAAA7bx2REvyic6JjfTR515llIo3X2npVWlbusipjQdnOG
 enuFmHCCu49z9IAgGzI/ez+jEC41/SwJtma6lyDlcrW8j3Ovw5j6eMju14rrZD72
 8wqFXll7eAoFNbln0H9xUwvOVVWeIvMbF0+WluFWMsX5nDY4XRnEtYa24XPdeOFA
 aDhtuSmyQzcLi/mzgOU9q3dg/VKPb4i9XROwaXtJXXD8GVzbN2i5yOMleLmfHJ0K
 e1o/qc4VFItF+nCyp0v+yqEB8sQ23zRWiuw7zE1PvolBsULp+gO1lmJIJvSwaGaB
 nKJ04k7AvdkK82926cjfnhC6TvsOdHMWsdFBq+J/01bSnFGXwtzSJ6qoFTCts3TX
 zi7E5u+TEDAa4EZ4eO4Ds0yaKdJ0tBncoKUDLkTg7M98dLqv+dI=
 =GDTe
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:
 "Fix ordering in the queued writer lock's slowpath"

* tag 'locking_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
2021-04-25 09:10:10 -07:00
Linus Torvalds
682b26bd80 - Fix a typo in a macro ifdeffery.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCFNfAACgkQEsHwGGHe
 VUoebg//cZf3az92fY7OTPOvEZrE1PgZhSYWDnjt6++1LSzOowh2lIXobPXfrFtF
 UAVojnqNZMEvV2CbUtpXMI9WGXyi6JdpeFI40wpfFvX5EaZkMlKtTG8q7jmGzus1
 akysbFmI3KwKT7ifu2Jmb7Olf4kaG+WfEP39E9s0TG20eCeduz4xSvfWo8OFesIY
 BZX1WC5xsdoCk2nlF2SQXqVr6tUxTrW5JumG2yS7cOzFEYTWODzza/BephmuMRgv
 /KeAEBPRzTY8mtdnERWaeH8cm2HvR6nlcblRYIL16B9StwFSPo1CMBV3FOd5gMhy
 l5JFR7j/AiL4LSaAOUbZffoIGl03nrTnVt+/GTGCbj4dScIVM5Cscx2exJx9oWuO
 vqF6LOq7yILK/OmN1/d+uiBSmBiQ1i9tuuKl1o8CWVRmh2WaDPh0VPnVAtrfqrKZ
 lTb6r8rFGywn7Zczdx7imjgxEYRJFqZXeR9Vr390Q7qc4uyNUrBOQWEWNeU+PbJc
 dJT0enVgUooNjjw/1piEyJ8e851JGOdYFmhBV92mdD7htCceJEBHwqI/hVY4OPzE
 BVxnojxrosEwBYqkA3ysdjzqWk7h1dhiRY4E9HD8Rc3hL2TqH1qlzfvjCBsZdjPS
 IqOlXUCsekBaGTNpAWa63slpfybwjt/tTW5m3NaEQ88Yvt9Ikj0=
 =D4bF
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:
 "Fix a typo in a macro ifdeffery"

* tag 'sched_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  preempt/dynamic: Fix typo in macro conditional statement
2021-04-25 09:08:19 -07:00
Alexey Dobriyan
0e0345b77a kbuild: redo fake deps at include/config/*.h
Make include/config/foo/bar.h fake deps files generation simpler.

* delete .h suffix
	those aren't header files, shorten filenames,

* delete tolower()
	Linux filesystems can deal with both upper and lowercase
	filenames very well,

* put everything in 1 directory
	Presumably 'mkdir -p' split is from dark times when filesystems
	handled huge directories badly, disks were round adding to
	seek times.

	x86_64 allmodconfig lists 12364 files in include/config.

	../obj/include/config/
	├── 104_QUAD_8
	├── 60XX_WDT
	├── 64BIT
		...
	├── ZSWAP_DEFAULT_ON
	├── ZSWAP_ZPOOL_DEFAULT
	└── ZSWAP_ZPOOL_DEFAULT_ZBUD

	0 directories, 12364 files

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-04-25 05:26:10 +09:00
Thomas Gleixner
765822e156 irqchip updates for Linux 5.13
New HW support:
 
 - New driver for the Nuvoton WPCM450 interrupt controller
 - New driver for the IDT 79rc3243x interrupt controller
 - Add support for interrupt trigger configuration to the MStar irqchip
 - Add more external interrupt support to the STM32 irqchip
 - Add new compatible strings for QCOM SC7280 to the qcom-pdc binding
 
 Fixes and cleanups:
 
 - Drop irq_create_strict_mappings() and irq_create_identity_mapping()
   from the irqdomain API, with cleanups in a couple of drivers
 - Fix nested NMI issue with spurious interrupts on GICv3
 - Don't allow GICv4.1 vSGIs when the CPU doesn't support them
 - Various cleanups and minor fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCD5kwPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDCWsQAL5yHXtApf4l3F0W99SJIooumrQh3UR6nENG
 2WVR66g+MiuZ/JQcHAojdLQ6W6K9W8eTcY3hRNFCqlI1lrKffz6ovstuYg3Wphog
 JX1gQYcpqt67WYtb/TVw3JM5D3NLU4XKPKZPhRzSHv5G9utI2QeAv13EBcPoHxZd
 UBRAEdUrv90KIFDe2CxWo8B5ra07xfgOpDvlYYKlee+jQLtf6i4Kj7Tm0XoK3hoW
 w0Mo//5r2SggdXfFLW1sm0BGs0bpJMSNixKCZWRfXbnZLAYIaBynSoLT9XoYT/uC
 FDegtFZ9IG/5NXJ1d3Yl0RjsPp+iPUOOTq/5gAoXI0hRCLZ1f8G1IuDEoIf8ElOg
 kxA1JpYE1fewxNt7oh48BAs3Qa3fdjJ1+k6gFlau4ctJBjxTHMz7v7lr7PmjhPz7
 HgcmzFCu9Wb8pj1IDHMINkOMmAiQhgr3N0WK372wQyNE8Z8iB0ZeCYX9jAV5YTK6
 eQdsDgNW18rv1ks/f7vzJw4EHRUM2tzSYimgf3oW+EJq6xKacMHfDMp9ERtHcnfJ
 +4CCEEafrSOj/KsNpNnA7Bq3Qjh+RdRXDtCPsoGQ3LS1L5/JOaUoSmrCkWNNfXuZ
 kUKTrNzopmMPvvwx6Q1YUypMbKCloNvlO3IgKalKNVP5drWA184abOIU2MGp+yI1
 LAA8SFYU
 =RqVj
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip and irqdomain updates from Marc Zyngier:

 New HW support:

  - New driver for the Nuvoton WPCM450 interrupt controller
  - New driver for the IDT 79rc3243x interrupt controller
  - Add support for interrupt trigger configuration to the MStar irqchip
  - Add more external interrupt support to the STM32 irqchip
  - Add new compatible strings for QCOM SC7280 to the qcom-pdc binding

 Fixes and cleanups:

  - Drop irq_create_strict_mappings() and irq_create_identity_mapping()
    from the irqdomain API, with cleanups in a couple of drivers
  - Fix nested NMI issue with spurious interrupts on GICv3
  - Don't allow GICv4.1 vSGIs when the CPU doesn't support them
  - Various cleanups and minor fixes

Link: https://lore.kernel.org/r/20210424094640.1731920-1-maz@kernel.org
2021-04-24 21:18:44 +02:00
Florent Revest
a8fad73e33 bpf: Remove unnecessary map checks for ARG_PTR_TO_CONST_STR
reg->type is enforced by check_reg_type() and map should never be NULL
(it would already have been dereferenced anyway) so these checks are
unnecessary.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-3-revest@chromium.org
2021-04-23 09:58:21 -07:00
Florent Revest
8e8ee109b0 bpf: Notify user if we ever hit a bpf_snprintf verifier bug
In check_bpf_snprintf_call(), a map_direct_value_addr() of the fmt map
should never fail because it has already been checked by
ARG_PTR_TO_CONST_STR. But if it ever fails, it's better to error out
with an explicit debug message rather than silently fail.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-2-revest@chromium.org
2021-04-23 09:58:21 -07:00
Paolo Bonzini
c4f71901d5 KVM/arm64 updates for Linux 5.13
New features:
 
 - Stage-2 isolation for the host kernel when running in protected mode
 - Guest SVE support when running in nVHE mode
 - Force W^X hypervisor mappings in nVHE mode
 - ITS save/restore for guests using direct injection with GICv4.1
 - nVHE panics now produce readable backtraces
 - Guest support for PTP using the ptp_kvm driver
 - Performance improvements in the S2 fault handler
 - Alexandru is now a reviewer (not really a new feature...)
 
 Fixes:
 - Proper emulation of the GICR_TYPER register
 - Handle the complete set of relocation in the nVHE EL2 object
 - Get rid of the oprofile dependency in the PMU code (and of the
   oprofile body parts at the same time)
 - Debug and SPE fixes
 - Fix vcpu reset
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCCpuAPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpD2G8QALWQYeBggKnNmAJfuihzZ2WariBmgcENs2R2
 qNZ/Py6dIF+b69P68nmgrEV1x2Kp35cPJbBwXnnrS4FCB5tk0b8YMaj00QbiRIYV
 UXbPxQTmYO1KbevpoEcw8NmR4bZJ/hRYPuzcQG7CCMKIZw0zj2cMcBofzQpTOAp/
 CgItdcv7at3iwamQatfU9vUmC0nDdnjdIwSxTAJOYMVV1ENwtnYSNgZVo4XLTg7n
 xR/5Qx27PKBJw7GyTRAIIxKAzNXG2tDL+GVIHe4AnRp3z3La8sr6PJf7nz9MCmco
 ISgeY7EGQINzmm4LahpnV+2xwwxOWo8QotxRFGNuRTOBazfARyAbp97yJ6eXJUpa
 j0qlg3xK9neyIIn9BQKkKx4sY9V45yqkuVDsK6odmqPq3EE01IMTRh1N/XQi+sTF
 iGrlM3ZW4AjlT5zgtT9US/FRXeDKoYuqVCObJeXZdm3sJSwEqTAs0JScnc0YTsh7
 m30CODnomfR2y5X6GoaubbQ0wcZ2I20K1qtIm+2F6yzD5P1/3Yi8HbXMxsSWyYWZ
 1ldoSa+ZUQlzV9Ot0S3iJ4PkphLKmmO96VlxE2+B5gQG50PZkLzsr8bVyYOuJC8p
 T83xT9xd07cy+FcGgF9veZL99Y6BLHMa6ZwFUolYNbzJxqrmqyR1aiJMEBIcX+aP
 ACeKW1w5
 =fpey
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for Linux 5.13

New features:

- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
- Alexandru is now a reviewer (not really a new feature...)

Fixes:
- Proper emulation of the GICR_TYPER register
- Handle the complete set of relocation in the nVHE EL2 object
- Get rid of the oprofile dependency in the PMU code (and of the
  oprofile body parts at the same time)
- Debug and SPE fixes
- Fix vcpu reset
2021-04-23 07:41:17 -04:00
Marco Elver
ed8e50800b signal, perf: Add missing TRAP_PERF case in siginfo_layout()
Add the missing TRAP_PERF case in siginfo_layout() for interpreting the
layout correctly as SIL_PERF_EVENT instead of just SIL_FAULT. This
ensures the si_perf field is copied and not just the si_addr field.

This was caught and tested by running the perf_events/sigtrap_threads
kselftest as a 32-bit binary with a 64-bit kernel.

Fixes: fb6cc127e0 ("signal: Introduce TRAP_PERF si_code and si_perf to siginfo")
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210422191823.79012-2-elver@google.com
2021-04-23 09:03:16 +02:00
Mickaël Salaün
265885daf3 landlock: Add syscall implementations
These 3 system calls are designed to be used by unprivileged processes
to sandbox themselves:
* landlock_create_ruleset(2): Creates a ruleset and returns its file
  descriptor.
* landlock_add_rule(2): Adds a rule (e.g. file hierarchy access) to a
  ruleset, identified by the dedicated file descriptor.
* landlock_restrict_self(2): Enforces a ruleset on the calling thread
  and its future children (similar to seccomp).  This syscall has the
  same usage restrictions as seccomp(2): the caller must have the
  no_new_privs attribute set or have CAP_SYS_ADMIN in the current user
  namespace.

All these syscalls have a "flags" argument (not currently used) to
enable extensibility.

Here are the motivations for these new syscalls:
* A sandboxed process may not have access to file systems, including
  /dev, /sys or /proc, but it should still be able to add more
  restrictions to itself.
* Neither prctl(2) nor seccomp(2) (which was used in a previous version)
  fit well with the current definition of a Landlock security policy.

All passed structs (attributes) are checked at build time to ensure that
they don't contain holes and that they are aligned the same way for each
architecture.

See the user and kernel documentation for more details (provided by a
following commit):
* Documentation/userspace-api/landlock.rst
* Documentation/security/landlock.rst

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: James Morris <jmorris@namei.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20210422154123.13086-9-mic@digikod.net
Signed-off-by: James Morris <jamorris@linux.microsoft.com>
2021-04-22 12:22:11 -07:00
Marc Zyngier
817aad5d08 irqdomain: Drop references to recusive irqdomain setup
It was never completely implemented, and was removed a long time
ago. Adjust the documentation to reflect this.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210406093557.1073423-8-maz@kernel.org
2021-04-22 15:55:22 +01:00
Marc Zyngier
1a0b05e435 irqdomain: Get rid of irq_create_strict_mappings()
No user of this helper is left, remove it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-04-22 15:55:22 +01:00
Marc Zyngier
9a8aae605b Merge branch 'kvm-arm64/kill_oprofile_dependency' into kvmarm-master/next
Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-04-22 13:41:49 +01:00
Arnd Bergmann
f4abe9967c kcsan: Fix printk format string
Printing a 'long' variable using the '%d' format string is wrong
and causes a warning from gcc:

kernel/kcsan/kcsan_test.c: In function 'nthreads_gen_params':
include/linux/kern_levels.h:5:25: error: format '%d' expects argument of type 'int', but argument 3 has type 'long int' [-Werror=format=]

Use the appropriate format modifier.

Fixes: f6a1491403 ("kcsan: Switch to KUNIT_CASE_PARAM for parameterized tests")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20210421135059.3371701-1-arnd@kernel.org
2021-04-22 14:36:03 +02:00
Marc Zyngier
7f318847a0 perf: Get rid of oprofile leftovers
perf_pmu_name() and perf_num_counters() are unused. Drop them.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210414134409.1266357-6-maz@kernel.org
2021-04-22 13:32:39 +01:00
Peter Zijlstra
2ea46c6fc9 cpumask/hotplug: Fix cpu_dying() state tracking
Vincent reported that for states with a NULL startup/teardown function
we do not call cpuhp_invoke_callback() (because there is none) and as
such we'll not update the cpu_dying() state.

The stale cpu_dying() can eventually lead to triggering BUG().

Rectify this by updating cpu_dying() in the exact same places the
hotplug machinery tracks its directional state, namely
cpuhp_set_state() and cpuhp_reset_state().

Reported-by: Vincent Donnefort <vincent.donnefort@arm.com>
Suggested-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YH7r+AoQEReSvxBI@hirez.programming.kicks-ass.net
2021-04-21 13:55:43 +02:00
Peter Zijlstra
3a7956e25e kthread: Fix PF_KTHREAD vs to_kthread() race
The kthread_is_per_cpu() construct relies on only being called on
PF_KTHREAD tasks (per the WARN in to_kthread). This gives rise to the
following usage pattern:

	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))

However, as reported by syzcaller, this is broken. The scenario is:

	CPU0				CPU1 (running p)

	(p->flags & PF_KTHREAD) // true

					begin_new_exec()
					  me->flags &= ~(PF_KTHREAD|...);
	kthread_is_per_cpu(p)
	  to_kthread(p)
	    WARN(!(p->flags & PF_KTHREAD) <-- *SPLAT*

Introduce __to_kthread() that omits the WARN and is sure to check both
values.

Use this to remove the problematic pattern for kthread_is_per_cpu()
and fix a number of other kthread_*() functions that have similar
issues but are currently not used in ways that would expose the
problem.

Notably kthread_func() is only ever called on 'current', while
kthread_probe_data() is only used for PF_WQ_WORKER, which implies the
task is from kthread_create*().

Fixes: ac687e6e8c ("kthread: Extract KTHREAD_IS_PER_CPU")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lkml.kernel.org/r/YH6WJc825C4P0FCK@hirez.programming.kicks-ass.net
2021-04-21 13:55:42 +02:00
Waiman Long
ad789f84c9 sched/debug: Fix cgroup_path[] serialization
The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other mean).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupt disabled for the whole duration of the calls to
print_*_stats() and print_rq() which could last for the quite some time
if the information dump happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actually serial console implementation of the
system.

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't
need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be problematic if
multiple instances are running. Allocating a stack buffer of PATH_MAX
bytes is not feasible because of the limited size of the kernel stack.

The solution implemented in this patch is to allow only one caller at a
time to use the full size group_path[], while other simultaneous callers
will have to use shorter stack buffers with the possibility of path
name truncation. A "..." suffix will be printed if truncation may have
happened.  The cgroup path name is provided for informational purpose
only, so occasional path name truncation should not be a big problem.

Fixes: efe25c2c7b ("sched: Reinstate group names in /proc/sched_debug")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com
2021-04-21 13:55:42 +02:00
Charan Teja Reddy
9d10a13d1e sched,psi: Handle potential task count underflow bugs more gracefully
psi_group_cpu->tasks, represented by the unsigned int, stores the
number of tasks that could be stalled on a psi resource(io/mem/cpu).
Decrementing these counters at zero leads to wrapping which further
leads to the psi_group_cpu->state_mask is being set with the
respective pressure state. This could result into the unnecessary time
sampling for the pressure state thus cause the spurious psi events.
This can further lead to wrong actions being taken at the user land
based on these psi events.

Though psi_bug is set under these conditions but that just for debug
purpose. Fix it by decrementing the ->tasks count only when it is
non-zero.

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org
2021-04-21 13:55:41 +02:00
Paul Turner
c006fac556 sched: Warn on long periods of pending need_resched
CPU scheduler marks need_resched flag to signal a schedule() on a
particular CPU. But, schedule() may not happen immediately in cases
where the current task is executing in the kernel mode (no
preemption state) for extended periods of time.

This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. If it goes off, it is
likely that there is a missing cond_resched() somewhere. Monitoring is
done via the tick and the accuracy is hence limited to jiffy scale. This
also means that we won't trigger the warning if the tick is disabled.

This feature (LATENCY_WARN) is default disabled.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210416212936.390566-1-joshdon@google.com
2021-04-21 13:55:41 +02:00
Linus Torvalds
1fe5501ba1 tracing: Fix tp_printk command line and trace events
Masami added a wrapper to be able to unhash trace event pointers
 as they are only read by root anyway, and they can also be extracted
 by the raw trace data buffers. But this wrapper utilized the iterator
 to have a temporary buffer to manipulate the text with.
 
 tp_printk is a kernel command line option that will send the trace
 output of a trace event to the console on boot up (useful when the
 system crashes before finishing the boot). But the code used the same
 wrapper that Masami added, and its iterator did not have a buffer,
 and this caused the system to crash.
 
 Have the wrapper just print the trace event normally if the iterator
 has no temporary buffer.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYH8exRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlvdAP436vts0GOM8UM0y9IAPNrSG5OSvjNe
 v5C5UOpnIxjlrgEArBtLLDaByLSBDQj+Vx0LmDW5uMOps49mo3A35TcIEAM=
 =GfL1
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix tp_printk command line and trace events

  Masami added a wrapper to be able to unhash trace event pointers as
  they are only read by root anyway, and they can also be extracted by
  the raw trace data buffers. But this wrapper utilized the iterator to
  have a temporary buffer to manipulate the text with.

  tp_printk is a kernel command line option that will send the trace
  output of a trace event to the console on boot up (useful when the
  system crashes before finishing the boot). But the code used the same
  wrapper that Masami added, and its iterator did not have a buffer, and
  this caused the system to crash.

  Have the wrapper just print the trace event normally if the iterator
  has no temporary buffer"

* tag 'trace-v5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix checking event hash pointer logic when tp_printk is enabled
2021-04-20 14:38:35 -07:00
Serge E. Hallyn
db2e718a47 capabilities: require CAP_SETFCAP to map uid 0
cap_setfcap is required to create file capabilities.

Since commit 8db6c34f1d ("Introduce v3 namespaced file capabilities"),
a process running as uid 0 but without cap_setfcap is able to work
around this as follows: unshare a new user namespace which maps parent
uid 0 into the child namespace.

While this task will not have new capabilities against the parent
namespace, there is a loophole due to the way namespaced file
capabilities are represented as xattrs.  File capabilities valid in
userns 1 are distinguished from file capabilities valid in userns 2 by
the kuid which underlies uid 0.  Therefore the restricted root process
can unshare a new self-mapping namespace, add a namespaced file
capability onto a file, then use that file capability in the parent
namespace.

To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the
capability for setting file capabilities.

As a further wrinkle: a task can unshare its user namespace, then open
its uid_map file itself, and map (only) its own uid.  In this case we do
not have the credential from before unshare, which was potentially more
restricted.  So, when creating a user namespace, we record whether the
creator had CAP_SETFCAP.  Then we can use that during map_write().

With this patch:

1. Unprivileged user can still unshare -Ur

   ubuntu@caps:~$ unshare -Ur
   root@caps:~# logout

2. Root user can still unshare -Ur

   ubuntu@caps:~$ sudo bash
   root@caps:/home/ubuntu# unshare -Ur
   root@caps:/home/ubuntu# logout

3. Root user without CAP_SETFCAP cannot unshare -Ur:

   root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
   root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
   unable to set CAP_SETFCAP effective capability: Operation not permitted
   root@caps:/home/ubuntu# unshare -Ur
   unshare: write failed /proc/self/uid_map: Operation not permitted

Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities.  This approach can be seen at [1].

Background history: commit 95ebabde38 ("capabilities: Don't allow
writing ambiguous v3 file capabilities") tried to fix the issue by
preventing v3 fscaps to be written to disk when the root uid would map
to the same uid in nested user namespaces.  This led to regressions for
various workloads.  For example, see [2].  Ultimately this is a valid
use-case we have to support meaning we had to revert this change in
3b0c2d3eaa ("Revert 95ebabde38 ("capabilities: Don't allow writing
ambiguous v3 file capabilities")").

Link: https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 [1]
Link: https://github.com/containers/buildah/issues/3071 [2]
Signed-off-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andrew G. Morgan <morgan@kernel.org>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-20 14:28:33 -07:00
Steven Rostedt (VMware)
0e1e71d349 tracing: Fix checking event hash pointer logic when tp_printk is enabled
Pointers in events that are printed are unhashed if the flags allow it,
and the logic to do so is called before processing the event output from
the raw ring buffer. In most cases, this is done when a user reads one of
the trace files.

But if tp_printk is added on the kernel command line, this logic is done
for trace events when they are triggered, and their output goes out via
printk. The unhash logic (and even the validation of the output) did not
support the tp_printk output, and would crash.

Link: https://lore.kernel.org/linux-tegra/9835d9f1-8d3a-3440-c53f-516c2606ad07@nvidia.com/

Fixes: efbbdaa22b ("tracing: Show real address for trace event arguments")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-20 10:56:58 -04:00
YueHaibing
3f5ad91488 sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
When !CONFIG_NO_HZ_COMMON we get this new GCC warning:

   kernel/sched/fair.c:8398:13: warning: ‘update_nohz_stats’ defined but not used [-Wunused-function]

Move update_nohz_stats() to an already existing CONFIG_NO_HZ_COMMON #ifdef
block.

Beyond fixing the GCC warning, this also simplifies the update_nohz_stats() function.

[ mingo: Rewrote the changelog. ]

Fixes: 0826530de3 ("sched/fair: Remove update of blocked load from newidle_balance")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210329144029.29200-1-yuehaibing@huawei.com
2021-04-20 10:14:15 +02:00
Ingo Molnar
d0d252b8ca Linux 5.12-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmB8qHweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGEXIIAILUbsTJsNsvZIkZ
 uQ6SY6gnsPFkRiSRjY0YsZLUnqjTuiiHeTz4gzkonddwdnAp/9g6OIHIEBaeTqBh
 sTUMU/61Fgtrt/IvkA1yJ3rlawqgwdMe2VdimB+EFhufcSKq+5vpd3MVP4IuGx4E
 J3psoTU4gVltFs5t+1QjvI3XmByN0Qm8FMRXR7iL+zov1QTmGwR3G6Rn4AymG+QT
 pdruKboyZPfsrFGSVx7wd3HpFyQcrclEX9rKmBNZqets9d9JGWnqnEN4vQKmwO86
 4MV29ucdMXH0AMB3kzGdVp0Ji2Ykt5W0K+MUWbFLtcSxnpu1OyBKGsEAMlRbD7ik
 gm0bMSw=
 =qHI0
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-20 10:13:58 +02:00
Dave Marchevsky
fd0b88f73f bpf: Refine retval for bpf_get_task_stack helper
Verifier can constrain the min/max bounds of bpf_get_task_stack's return
value more tightly than the default tnum_unknown. Like bpf_get_stack,
return value is num bytes written into a caller-supplied buf, or error,
so do_refine_retval_range will work.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210416204704.2816874-2-davemarchevsky@fb.com
2021-04-19 18:23:33 -07:00
Florent Revest
7b15523a98 bpf: Add a bpf_snprintf helper
The implementation takes inspiration from the existing bpf_trace_printk
helper but there are a few differences:

To allow for a large number of format-specifiers, parameters are
provided in an array, like in bpf_seq_printf.

Because the output string takes two arguments and the array of
parameters also takes two arguments, the format string needs to fit in
one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
a zero-terminated read-only map so we don't need a format string length
arg.

Because the format-string is known at verification time, we also do
a first pass of format string validation in the verifier logic. This
makes debugging easier.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
2021-04-19 15:27:36 -07:00
Florent Revest
fff13c4bb6 bpf: Add a ARG_PTR_TO_CONST_STR argument type
This type provides the guarantee that an argument is going to be a const
pointer to somewhere in a read-only map value. It also checks that this
pointer is followed by a zero character before the end of the map value.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
2021-04-19 15:27:36 -07:00
Florent Revest
d9c9e4db18 bpf: Factorize bpf_trace_printk and bpf_seq_printf
Two helpers (trace_printk and seq_printf) have very similar
implementations of format string parsing and a third one is coming
(snprintf). To avoid code duplication and make the code easier to
maintain, this moves the operations associated with format string
parsing (validation and argument sanitization) into one generic
function.

The implementation of the two existing helpers already drifted quite a
bit so unifying them entailed a lot of changes:

- bpf_trace_printk always expected fmt[fmt_size] to be the terminating
  NULL character, this is no longer true, the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percentage char).
- bpf_trace_printk now skips width formating fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
- argument casting on 32 bit has been simplified into one macro and
  using an enum instead of obscure int increments.

- bpf_seq_printf now uses bpf_trace_copy_string instead of
  strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
- bpf_seq_printf now prints longs correctly on 32 bit architectures.

- both were changed to use a global per-cpu tmp buffer instead of one
  stack buffer for trace_printk and 6 small buffers for seq_printf.
- to avoid per-cpu buffer usage conflict, these helpers disable
  preemption while the per-cpu buffer is in use.
- both helpers now support the %ps and %pS specifiers to print symbols.

The implementation is also moved from bpf_trace.c to helpers.c because
the upcoming bpf_snprintf helper will be made available to all BPF
programs and will need it.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
2021-04-19 15:27:36 -07:00
Linus Torvalds
7af0814097 Revert "gcov: clang: fix clang-11+ build"
This reverts commit 04c53de57c.

Nathan Chancellor points out that it should not have been merged into
mainline by itself. It was a fix for "gcov: use kvmalloc()", which is
still in -mm/-next. Merging it alone has broken the build.

Link: https://github.com/ClangBuiltLinux/continuous-integration2/runs/2384465683?check_suite_focus=true
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-19 15:08:49 -07:00
Kan Liang
55bcf6ef31 perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
Current Hardware events and Hardware cache events have special perf
types, PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. The two types don't
pass the PMU type in the user interface. For a hybrid system, the perf
subsystem doesn't know which PMU the events belong to. The first capable
PMU will always be assigned to the events. The events never get a chance
to run on the other capable PMUs.

Extend the two types to become PMU aware types. The PMU type ID is
stored at attr.config[63:32].

Add a new PMU capability, PERF_PMU_CAP_EXTENDED_HW_TYPE, to indicate a
PMU which supports the extended PERF_TYPE_HARDWARE and
PERF_TYPE_HW_CACHE.

The PMU type is only required when searching a specific PMU. The PMU
specific codes will only be interested in the 'real' config value, which
is stored in the low 32 bit of the event->attr.config. Update the
event->attr.config in the generic code, so the PMU specific codes don't
need to calculate it separately.

If a user specifies a PMU type, but the PMU doesn't support the extended
type, error out.

If an event cannot be initialized in a PMU specified by a user, error
out immediately. Perf should not try to open it on other PMUs.

The new PMU capability is only set for the X86 hybrid PMUs for now.
Other architectures, e.g., ARM, may need it as well. The support on ARM
may be implemented later separately.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-22-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:29 +02:00
Zhouyi Zhou
0c89d87d1d preempt/dynamic: Fix typo in macro conditional statement
Commit 40607ee97e ("preempt/dynamic: Provide irqentry_exit_cond_resched()
static call") tried to provide irqentry_exit_cond_resched() static call
in irqentry_exit, but has a typo in macro conditional statement.

Fixes: 40607ee97e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210410073523.5493-1-zhouzhouyi@gmail.com
2021-04-19 20:02:57 +02:00
Jakub Kicinski
8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
Linus Torvalds
88a5af9439 Networking fixes for 5.12-rc8, including fixes from netfilter,
and bpf. BPF verifier changes stand out, otherwise things have
 slowed down.
 
 Current release - regressions:
 
  - gro: ensure frag0 meets IP header alignment
 
  - Revert "net: stmmac: re-init rx buffers when mac resume back"
 
  - ethernet: macb: fix the restore of cmp registers
 
 Previous releases - regressions:
 
  - ixgbe: Fix NULL pointer dereference in ethtool loopback test
 
  - ixgbe: fix unbalanced device enable/disable in suspend/resume
 
  - phy: marvell: fix detection of PHY on Topaz switches
 
  - make tcp_allowed_congestion_control readonly in non-init netns
 
  - xen-netback: Check for hotplug-status existence before watching
 
 Previous releases - always broken:
 
  - bpf: mitigate a speculative oob read of up to map value size by
         tightening the masking window
 
  - sctp: fix race condition in sctp_destroy_sock
 
  - sit, ip6_tunnel: Unregister catch-all devices
 
  - netfilter: nftables: clone set element expression template
 
  - netfilter: flowtable: fix NAT IPv6 offload mangling
 
  - net: geneve: check skb is large enough for IPv4/IPv6 header
 
  - netlink: don't call ->netlink_bind with table lock held
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmB6aBQACgkQMUZtbf5S
 Iruu2BAAqHKdB5Qd1iBGaA1md8f+elErsotzjONz+eh2yqDKRaOW84+Fo9TPKgu6
 se0WmAY1HMUO3TEVdFeBsrgrs+bTY1E1OdfoZ39PFNkMdKMM80Ks1rn94nrPOohy
 q1uoNxe9jjT3nRQBTKHWdB3ZC3Jetwf3LP7G2b8SoA+gNd9xl+b1H/drmv7WdE/n
 pY7/GND7wd4qqidLRDgAaavaiGIdqym8V0bZEpz7cZtjT/U6RhjkBLKSB8JFGUxP
 PQ1NFrYKmLDM1zYTSObLOrKUmEaWzPPSsXmWqGkCE4qjJ8euX0e+5EbxF98JHdYW
 O+HMtdgr4UJGWAoxyGaxk7h9w0ydVyC1+Xgi6jAFWdXP7wgvXXQrldLnO44pX/6I
 dYlIM+Br/5VmnKiS1i1gBUURREBRSEy7ZYxtREjGC7dFSUn9RPm+0s0x/DCRBS9/
 MtNo0lCiuWsyaZ2v57aEKLX4YvGpilzg4UU3/45RNW6OnFzQubvjMBJPfap6EUAC
 Ii8uUc/vX0Jq4nZVZzDZ7vlkRcJTQgUqKrzgamUuwJmyPqzefkDcbSZub3tM8G39
 eetiHS1nqe3QwuP+TYM3MaBjw0bdgNz9Wt3xmY3Ehnf3pujMR5fbAsCbcdowV5/+
 OI2ZcTUZculeAW2q9DgsOCtyS/1huwMHG0zO32TgadbFv45UCS0=
 =LN+J
 -----END PGP SIGNATURE-----

Merge tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.12-rc8, including fixes from netfilter, and
  bpf. BPF verifier changes stand out, otherwise things have slowed
  down.

  Current release - regressions:

   - gro: ensure frag0 meets IP header alignment

   - Revert "net: stmmac: re-init rx buffers when mac resume back"

   - ethernet: macb: fix the restore of cmp registers

  Previous releases - regressions:

   - ixgbe: Fix NULL pointer dereference in ethtool loopback test

   - ixgbe: fix unbalanced device enable/disable in suspend/resume

   - phy: marvell: fix detection of PHY on Topaz switches

   - make tcp_allowed_congestion_control readonly in non-init netns

   - xen-netback: Check for hotplug-status existence before watching

  Previous releases - always broken:

   - bpf: mitigate a speculative oob read of up to map value size by
     tightening the masking window

   - sctp: fix race condition in sctp_destroy_sock

   - sit, ip6_tunnel: Unregister catch-all devices

   - netfilter: nftables: clone set element expression template

   - netfilter: flowtable: fix NAT IPv6 offload mangling

   - net: geneve: check skb is large enough for IPv4/IPv6 header

   - netlink: don't call ->netlink_bind with table lock held"

* tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
  netlink: don't call ->netlink_bind with table lock held
  MAINTAINERS: update my email
  bpf: Update selftests to reflect new error states
  bpf: Tighten speculative pointer arithmetic mask
  bpf: Move sanitize_val_alu out of op switch
  bpf: Refactor and streamline bounds check into helper
  bpf: Improve verifier error messages for users
  bpf: Rework ptr_limit into alu_limit and add common error path
  bpf: Ensure off_reg has no mixed signed bounds for all types
  bpf: Move off_reg into sanitize_ptr_alu
  bpf: Use correct permission flag for mixed signed bounds arithmetic
  ch_ktls: do not send snd_una update to TCB in middle
  ch_ktls: tcb close causes tls connection failure
  ch_ktls: fix device connection close
  ch_ktls: Fix kernel panic
  i40e: fix the panic when running bpf in xdpdrv mode
  net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
  net/mlx5e: Fix setting of RS FEC mode
  net/mlx5: Fix setting of devlink traps in switchdev mode
  Revert "net: stmmac: re-init rx buffers when mac resume back"
  ...
2021-04-17 09:57:15 -07:00
Chen Jun
2d036dfa5f posix-timers: Preserve return value in clock_adjtime32()
The return value on success (>= 0) is overwritten by the return value of
put_old_timex32(). That works correct in the fault case, but is wrong for
the success case where put_old_timex32() returns 0.

Just check the return value of put_old_timex32() and return -EFAULT in case
it is not zero.

[ tglx: Massage changelog ]

Fixes: 3a4d44b616 ("ntp: Move adjtimex related compat syscalls to native counterparts")
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com
2021-04-17 14:55:06 +02:00
Ali Saidi
84a24bf8c5 locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock.  The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange is completed
successfully which isn’t ordered. This exposes the window between the
acquire and the cmpxchg to an A-B-A problem which allows reads
following the lock acquisition to observe values speculatively before
the write lock is truly acquired.

We've seen a problem in epoll where the reader does a xchg while
holding the read lock, but the writer can see a value change out from
under it.

  Writer                                | Reader
  --------------------------------------------------------------------------------
  ep_scan_ready_list()                  |
  |- write_lock_irq()                   |
      |- queued_write_lock_slowpath()   |
	|- atomic_cond_read_acquire()   |
				        | read_lock_irqsave(&ep->lock, flags);
     --> (observes value before unlock) |  chain_epi_lockless()
     |                                  |    epi->next = xchg(&ep->ovflist, epi);
     |                                  | read_unlock_irqrestore(&ep->lock, flags);
     |                                  |
     |     atomic_cmpxchg_relaxed()     |
     |-- READ_ONCE(ep->ovflist);        |

A core can order the read of the ovflist ahead of the
atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire
semantics addresses this issue at which point the atomic_cond_read can
be switched to use relaxed semantics.

Fixes: b519b56e37 ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock")
Signed-off-by: Ali Saidi <alisaidi@amazon.com>
[peterz: use try_cmpxchg()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Steve Capper <steve.capper@arm.com>
2021-04-17 13:40:50 +02:00
Peter Zijlstra
9406415f46 sched/debug: Rename the sched_debug parameter to sched_verbose
CONFIG_SCHED_DEBUG is the build-time Kconfig knob, the boot param
sched_debug and the /debug/sched/debug_enabled knobs control the
sched_debug_enabled variable, but what they really do is make
SCHED_DEBUG more verbose, so rename the lot.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-04-17 13:22:44 +02:00
Johannes Berg
04c53de57c gcov: clang: fix clang-11+ build
With clang-11+, the code is broken due to my kvmalloc() conversion
(which predated the clang-11 support code) leaving one vmalloc() in
place.  Fix that.

Link: https://lkml.kernel.org/r/20210412214210.6e1ecca9cdc5.I24459763acf0591d5e6b31c7e3a59890d802f79c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-16 16:10:37 -07:00
David S. Miller
b022654296 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-04-17

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 8 files changed, 175 insertions(+), 111 deletions(-).

The main changes are:

1) Fix a potential NULL pointer dereference in libbpf's xsk
   umem handling, from Ciara Loftus.

2) Mitigate a speculative oob read of up to map value size by
   tightening the masking window, from Daniel Borkmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:48:08 -07:00
Daniel Borkmann
7fedb63a83 bpf: Tighten speculative pointer arithmetic mask
This work tightens the offset mask we use for unprivileged pointer arithmetic
in order to mitigate a corner case reported by Piotr and Benedict where in
the speculative domain it is possible to advance, for example, the map value
pointer by up to value_size-1 out-of-bounds in order to leak kernel memory
via side-channel to user space.

Before this change, the computed ptr_limit for retrieve_ptr_limit() helper
represents largest valid distance when moving pointer to the right or left
which is then fed as aux->alu_limit to generate masking instructions against
the offset register. After the change, the derived aux->alu_limit represents
the largest potential value of the offset register which we mask against which
is just a narrower subset of the former limit.

For minimal complexity, we call sanitize_ptr_alu() from 2 observation points
in adjust_ptr_min_max_vals(), that is, before and after the simulated alu
operation. In the first step, we retieve the alu_state and alu_limit before
the operation as well as we branch-off a verifier path and push it to the
verification stack as we did before which checks the dst_reg under truncation,
in other words, when the speculative domain would attempt to move the pointer
out-of-bounds.

In the second step, we retrieve the new alu_limit and calculate the absolute
distance between both. Moreover, we commit the alu_state and final alu_limit
via update_alu_sanitation_state() to the env's instruction aux data, and bail
out from there if there is a mismatch due to coming from different verification
paths with different states.

Reported-by: Piotr Krysiuk <piotras@gmail.com>
Reported-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Benedict Schlueter <benedict.schlueter@rub.de>
2021-04-16 23:52:01 +02:00
Daniel Borkmann
f528819334 bpf: Move sanitize_val_alu out of op switch
Add a small sanitize_needed() helper function and move sanitize_val_alu()
out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu()
as well out of its opcode switch so this helps to streamline both.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:57 +02:00
Daniel Borkmann
073815b756 bpf: Refactor and streamline bounds check into helper
Move the bounds check in adjust_ptr_min_max_vals() into a small helper named
sanitize_check_bounds() in order to simplify the former a bit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:54 +02:00
Daniel Borkmann
a6aaece00a bpf: Improve verifier error messages for users
Consolidate all error handling and provide more user-friendly error messages
from sanitize_ptr_alu() and sanitize_val_alu().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:48 +02:00
Daniel Borkmann
b658bbb844 bpf: Rework ptr_limit into alu_limit and add common error path
Small refactor with no semantic changes in order to consolidate the max
ptr_limit boundary check.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:43 +02:00
Daniel Borkmann
24c109bb15 bpf: Ensure off_reg has no mixed signed bounds for all types
The mixed signed bounds check really belongs into retrieve_ptr_limit()
instead of outside of it in adjust_ptr_min_max_vals(). The reason is
that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer
types that we handle in retrieve_ptr_limit() and given errors from the latter
propagate back to adjust_ptr_min_max_vals() and lead to rejection of the
program, it's a better place to reside to avoid anything slipping through
for future types. The reason why we must reject such off_reg is that we
otherwise would not be able to derive a mask, see details in 9d7eceede7
("bpf: restrict unknown scalars of mixed signed bounds for unprivileged").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:39 +02:00
Daniel Borkmann
6f55b2f2a1 bpf: Move off_reg into sanitize_ptr_alu
Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on can
use off_reg for generalizing some of the checks for all pointer types.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:36 +02:00
Daniel Borkmann
9601148392 bpf: Use correct permission flag for mixed signed bounds arithmetic
We forbid adding unknown scalars with mixed signed bounds due to the
spectre v1 masking mitigation. Hence this also needs bypass_spec_v1
flag instead of allow_ptr_leaks.

Fixes: 2c78ee898d ("bpf: Implement CAP_BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16 23:51:21 +02:00
Chunguang Xu
ffeee417d9 cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io()
If delayacct is disabled, then delayacct_is_task_waiting_on_io()
always returns false, which causes the statistical value to be
wrong. Perhaps tsk->in_iowait is better.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-16 16:49:37 -04:00
Jindong Yue
9c336c9935 tick/broadcast: Allow late registered device to enter oneshot mode
The broadcast device is switched to oneshot mode when the system switches
to oneshot mode. If a broadcast clock event device is registered after the
system switched to oneshot mode, it will stay in periodic mode forever.

Ensure that a late registered device which is selected as broadcast device
is initialized in oneshot mode when the system already uses oneshot mode.

[ tglx: Massage changelog ]

Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210331083318.21794-1-jindong.yue@nxp.com
2021-04-16 21:03:50 +02:00
Wang Wensheng
d7840aaadd tick: Use tick_check_replacement() instead of open coding it
The function tick_check_replacement() is the combination of
tick_check_percpu() and tick_check_preferred(), but tick_check_new_device()
has the same logic open coded.

Use the helper to simplify the code.

[ tglx: Massage changelog ]

Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210326022328.3266-1-wangwensheng4@huawei.com
2021-04-16 21:03:50 +02:00
Marc Kleine-Budde
07ff4aed01 time/timecounter: Mark 1st argument of timecounter_cyc2time() as const
The timecounter is not modified in this function. Mark it as const.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210303103544.994855-1-mkl@pengutronix.de
2021-04-16 21:03:50 +02:00
Peter Zijlstra
0c2de3f054 sched,fair: Alternative sched_slice()
The current sched_slice() seems to have issues; there's two possible
things that could be improved:

 - the 'nr_running' used for __sched_period() is daft when cgroups are
   considered. Using the RQ wide h_nr_running seems like a much more
   consistent number.

 - (esp) cgroups can slice it real fine, which makes for easy
   over-scheduling, ensure min_gran is what the name says.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.611897312@infradead.org
2021-04-16 17:06:35 +02:00
Peter Zijlstra
d27e9ae2f2 sched: Move /proc/sched_debug to debugfs
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.548833671@infradead.org
2021-04-16 17:06:35 +02:00
Peter Zijlstra
3b87f136f8 sched,debug: Convert sysctl sched_domains to debugfs
Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YHgB/s4KCBQ1ifdm@hirez.programming.kicks-ass.net
2021-04-16 17:06:35 +02:00
Peter Zijlstra
1011dcce99 sched,preempt: Move preempt_dynamic to debug.c
Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to
collect all the debugfs bits.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
2021-04-16 17:06:34 +02:00
Peter Zijlstra
8a99b6833c sched: Move SCHED_DEBUG sysctl to debugfs
Stop polluting sysctl with undocumented knobs that really are debug
only, move them all to /debug/sched/ along with the existing
/debug/sched_* files that already exist.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
2021-04-16 17:06:34 +02:00
Peter Zijlstra
1d1c2509de sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
CONFIG_SCHEDSTATS does not depend on SCHED_DEBUG, it is inconsistent
to have the sysctl depend on it.

Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.161151631@infradead.org
2021-04-16 17:06:33 +02:00
Mel Gorman
b7cc6ec744 sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
The ability to enable/disable NUMA balancing is not a debugging feature
and should not depend on CONFIG_SCHED_DEBUG.  For example, machines within
a HPC cluster may disable NUMA balancing temporarily for some jobs and
re-enable it for other jobs without needing to reboot.

This patch removes the dependency on CONFIG_SCHED_DEBUG for
kernel.numa_balancing sysctl. The other numa balancing related sysctls
are left as-is because if they need to be tuned then it is more likely
that NUMA balancing needs to be fixed instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210324133916.GQ15768@suse.de
2021-04-16 17:06:33 +02:00
Peter Zijlstra
b5c4477366 sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
Use the new cpu_dying() state to simplify and fix the balance_push()
vs CPU hotplug rollback state.

Specifically, we currently rely on notifiers sched_cpu_dying() /
sched_cpu_activate() to terminate balance_push, however if the
cpu_down() fails when we're past sched_cpu_deactivate(), it should
terminate balance_push at that point and not wait until we hit
sched_cpu_activate().

Similarly, when cpu_up() fails and we're going back down, balance_push
should be active, where it currently is not.

So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE
(when !cpu_active()), and gate it's utility with cpu_dying().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YHgAYef83VQhKdC2@hirez.programming.kicks-ass.net
2021-04-16 17:06:32 +02:00
Peter Zijlstra
e40f74c535 cpumask: Introduce DYING mask
Introduce a cpumask that indicates (for each CPU) what direction the
CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
an up fails and we do a roll-back down, it will accurately reflect the
direction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210310150109.151441252@infradead.org
2021-04-16 17:06:32 +02:00
Marco Elver
97ba62b278 perf: Add support for SIGTRAP on perf events
Adds bit perf_event_attr::sigtrap, which can be set to cause events to
send SIGTRAP (with si_code TRAP_PERF) to the task where the event
occurred. The primary motivation is to support synchronous signals on
perf events in the task where an event (such as breakpoints) triggered.

To distinguish perf events based on the event type, the type is set in
si_errno. For events that are associated with an address, si_addr is
copied from perf_sample_data.

The new field perf_event_attr::sig_data is copied to si_perf, which
allows user space to disambiguate which event (of the same type)
triggered the signal. For example, user space could encode the relevant
information it cares about in sig_data.

We note that the choice of an opaque u64 provides the simplest and most
flexible option. Alternatives where a reference to some user space data
is passed back suffer from the problem that modification of referenced
data (be it the event fd, or the perf_event_attr) can race with the
signal being delivered (of course, the same caveat applies if user space
decides to store a pointer in sig_data, but the ABI explicitly avoids
prescribing such a design).

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@hirez.programming.kicks-ass.net/
2021-04-16 16:32:41 +02:00
Marco Elver
fb6cc127e0 signal: Introduce TRAP_PERF si_code and si_perf to siginfo
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
2021-04-16 16:32:41 +02:00
Marco Elver
2e498d0a74 perf: Add support for event removal on exec
Adds bit perf_event_attr::remove_on_exec, to support removing an event
from a task on exec.

This option supports the case where an event is supposed to be
process-wide only, and should not propagate beyond exec, to limit
monitoring to the original process image only.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210408103605.1676875-5-elver@google.com
2021-04-16 16:32:41 +02:00
Marco Elver
2b26f0aa00 perf: Support only inheriting events if cloned with CLONE_THREAD
Adds bit perf_event_attr::inherit_thread, to restricting inheriting
events only if the child was cloned with CLONE_THREAD.

This option supports the case where an event is supposed to be
process-wide only (including subthreads), but should not propagate
beyond the current process's shared environment.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/YBvj6eJR%2FDY2TsEB@hirez.programming.kicks-ass.net/
2021-04-16 16:32:40 +02:00
Marco Elver
47f661eca0 perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
As with other ioctls (such as PERF_EVENT_IOC_{ENABLE,DISABLE}), fix up
handling of PERF_EVENT_IOC_MODIFY_ATTRIBUTES to also apply to children.

Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lkml.kernel.org/r/20210408103605.1676875-3-elver@google.com
2021-04-16 16:32:40 +02:00
Peter Zijlstra
ef54c1a476 perf: Rework perf_event_exit_event()
Make perf_event_exit_event() more robust, such that we can use it from
other contexts. Specifically the up and coming remove_on_exec.

For this to work we need to address a few issues. Remove_on_exec will
not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to
disable event_function_call() and we thus have to use
perf_remove_from_context().

When using perf_remove_from_context(), there's two races to consider.
The first is against close(), where we can have concurrent tear-down
of the event. The second is against child_list iteration, which should
not find a half baked event.

To address this, teach perf_remove_from_context() to special case
!ctx->is_active and about DETACH_CHILD.

[ elver@google.com: fix racing parent/child exit in sync_child_event(). ]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210408103605.1676875-2-elver@google.com
2021-04-16 16:32:40 +02:00
Alexander Shishkin
d68e6799a5 perf: Cap allocation order at aux_watermark
Currently, we start allocating AUX pages half the size of the total
requested AUX buffer size, ignoring the attr.aux_watermark setting. This,
in turn, makes intel_pt driver disregard the watermark also, as it uses
page order for its SG (ToPA) configuration.

Now, this can be fixed in the intel_pt PMU driver, but seeing as it's the
only one currently making use of high order allocations, there is no
reason not to fix the allocator instead. This way, any other driver
wishing to add this support would not have to worry about this.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210414154955.49603-2-alexander.shishkin@linux.intel.com
2021-04-16 16:32:39 +02:00
Christoph Hellwig
42eb0d54c0 fs: split receive_fd_replace from __receive_fd
receive_fd_replace shares almost no code with the general case, so split
it out.  Also remove the "Bump the sock usage counts" comment from
both copies, as that is now what __receive_sock actually does.

[AV: ... and make the only user of receive_fd_replace() choose between
it and receive_fd() according to what userland had passed to it in
flags]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-16 00:13:04 -04:00
Steven Rostedt (VMware)
e1db6338d6 ftrace: Reuse the output of the function tracer for func_repeats
The func_repeats event shows the output of the function tracer followed by
a count of the number of repeats the previous function had made, as well
as the timestamp of the last function that was repeated.

The printing of the function should be the same as is for the function it
is displaying. Reuse the code in trace_fn_trace() by making a helper
function print_fn_trace() and use it for trace_func_repeats_print().

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 16:34:26 -04:00
Yordan Karadzhov (VMware)
22db095d57 tracing: Add "func_no_repeats" option for function tracing
If the option is activated the function tracing record gets
consolidated in the cases when a single function is called number
of times consecutively. Instead of having an identical record for
each call of the function we will record only the first call
following by event showing the number of repeats.

Link: https://lkml.kernel.org/r/20210415181854.147448-7-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 14:50:02 -04:00
Yordan Karadzhov (VMware)
4994891ebb tracing: Unify the logic for function tracing options
Currently the logic for dealing with the options for function tracing
has two different implementations. One is used when we set the flags
(in "static int func_set_flag()") and another used when we initialize
the tracer (in "static int function_trace_init()"). Those two
implementations are meant to do essentially the same thing and they
are both not very convenient for adding new options. In this patch
we add a helper function that provides a single implementation of
the logic for dealing with the options and we make it such that new
options can be easily added.

Link: https://lkml.kernel.org/r/20210415181854.147448-6-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 14:50:02 -04:00
Yordan Karadzhov (VMware)
c658797f1a tracing: Add method for recording "func_repeats" events
This patch only provides the implementation of the method.
Later we will used it in a combination with a new option for
function tracing.

Link: https://lkml.kernel.org/r/20210415181854.147448-5-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 14:50:02 -04:00
Yordan Karadzhov (VMware)
20344c54d1 tracing: Add "last_func_repeats" to struct trace_array
The field is used to keep track of the consecutive (on the same CPU) calls
of a single function. This information is needed in order to consolidate
the function tracing record in the cases when a single function is called
number of times.

Link: https://lkml.kernel.org/r/20210415181854.147448-4-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 14:50:02 -04:00
Yordan Karadzhov (VMware)
f689e4f280 tracing: Define new ftrace event "func_repeats"
The event aims to consolidate the function tracing record in the cases
when a single function is called number of times consecutively.

	while (cond)
		do_func();

This may happen in various scenarios (busy waiting for example).
The new ftrace event can be used to show repeated function events with
a single event and save space on the ring buffer

Link: https://lkml.kernel.org/r/20210415181854.147448-3-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 14:50:01 -04:00
Yordan Karadzhov (VMware)
eaa7a89720 tracing: Define static void trace_print_time()
The part of the code that prints the time of the trace record in
"int trace_print_context()" gets extracted in a static function. This
is done as a preparation for a following patch, in which we will define
a new ftrace event called "func_repeats". The new static method,
defined here, will be used by this new event to print the time of the
last repeat of a function that is consecutively called number of times.

Link: https://lkml.kernel.org/r/20210415181854.147448-2-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-15 14:50:01 -04:00
Eric Dumazet
5e0ccd4a3b rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
Commit ec9c82e03a ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user() on 64bit arches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20210413203352.71350-4-eric.dumazet@gmail.com
2021-04-14 18:04:09 +02:00
Eric Dumazet
0ed9605153 rseq: Remove redundant access_ok()
After commit 8f28177014 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20210413203352.71350-3-eric.dumazet@gmail.com
2021-04-14 18:04:09 +02:00
Eric Dumazet
60af388d23 rseq: Optimize rseq_update_cpu_id()
Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.dumazet@gmail.com
2021-04-14 18:04:09 +02:00
Thomas Gleixner
4bad58ebc8 signal: Allow tasks to cache one sigqueue struct
The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite some of these
application scenarios a control tasks signals workers to start their
computations. There is usually only one signal per worker on flight.  This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.

To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.

The task cache is not preallocated so the first signal sent to a task goes
always to the cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.

After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org

One concern leading to the original limitation was to avoid a large amount
of pointlessly cached sigqueues in alive tasks. The other concern was
vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are less than 10 sigqueues cached in ~1500 tasks.

In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.

If this should be limited then this would need an extra counter in struct
user, more atomic instructions and a seperate rlimit. Yet another tunable
which is mostly unused.

The caching is actually used. After boot and a full kernel compile on a
64CPU machine with make -j128 the number of 'allocations' looks like this:

  From slab:	   23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%.

A typical pattern there is:

<...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free:   cache ffff8881132df460
<...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
  bash-1149 exit_task_sighand: free ffff8881132df460
  bash-1149 __sigqueue_free:   cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit and bash sticks it back into it's own
cache. Lather, rinse and repeat.

The caching is probably not noticable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches are
usually just serving from the fast path the slab merging (default) can
depending on the usage pattern of the merged slabs cause occasional slow
path allocations.

The time spared per cached entry is a few micro seconds per signal which is
not relevant for e.g. a kernel build, but for signal heavy workloads it's
measurable.

As there is no real downside of this caching mechanism making it
unconditionally available is preferred over more conditional code or new
magic tunables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
2021-04-14 18:04:08 +02:00
Thomas Gleixner
69995ebbb9 signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()
There is no point in having the conditional at the callsite.

Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
initialize sigqueue::flags.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
2021-04-14 18:04:08 +02:00
Sumit Garg
83fa2d13d6 kdb: Refactor env variables get/set code
Add two new kdb environment access methods as kdb_setenv() and
kdb_printenv() in order to abstract out environment access code
from kdb command functions.

Also, replace (char *)0 with NULL as an initializer for environment
variables array.

Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/1612771342-16883-1-git-send-email-sumit.garg@linaro.org
[daniel.thompson@linaro.org: Replaced (char *)0/NULL initializers with
an array size]
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2021-04-14 13:47:09 +01:00
Masahiro Yamada
f8f0d06438 kconfig: do not use allnoconfig_y option
allnoconfig_y is an ugly hack that sets a symbol to 'y' by allnoconfig.

allnoconfig does not mean a minimal set of CONFIG options because a
bunch of prompts are hidden by 'if EMBEDDED' or 'if EXPERT', but I do
not like to hack Kconfig this way.

Use the pre-existing feature, KCONFIG_ALLCONFIG, to provide a one
liner config fragment. CONFIG_EMBEDDED=y is still forced when
allnoconfig is invoked as a part of tinyconfig.

No change in the .config file produced by 'make tinyconfig'.

The output of 'make allnoconfig' will be changed; we will get
CONFIG_EMBEDDED=n because allnoconfig literally sets all symbols to n.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-04-14 15:22:49 +09:00
Linus Torvalds
50987beca0 tracing/dynevent: Fix a memory link in dyn_event_release()
An error path exited the function before freeing the allocated
 "argv" variable.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYHY3LRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qigOAPwOvbUI9PQTW3hs16XHDGbgtzdzX6A7
 kF7GlId5tXbZDwD/bW2gilFCjULCEPDuqsDy5EXrbZ7V7kulOfIw2e8CAQM=
 =HwKu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix a memory link in dyn_event_release().

  An error path exited the function before freeing the allocated 'argv'
  variable"

* tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/dynevent: Fix a memory leak in an error handling path
2021-04-13 18:40:00 -07:00
Toke Høiland-Jørgensen
441e8c66b2 bpf: Return target info when a tracing bpf_link is queried
There is currently no way to discover the target of a tracing program
attachment after the fact. Add this information to bpf_link_info and return
it when querying the bpf_link fd.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com
2021-04-13 18:18:57 -07:00
Peter Collingbourne
201698626f arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.

The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.

This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.

On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.

On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-04-13 17:31:44 +01:00
Christophe JAILLET
8db403b963 tracing/dynevent: Fix a memory leak in an error handling path
We must free 'argv' before returning, as already done in all the other
paths of this function.

Link: https://lkml.kernel.org/r/21e3594ccd7fc88c5c162c98450409190f304327.1618136448.git.christophe.jaillet@wanadoo.fr

Fixes: d262271d04 ("tracing/dynevent: Delegate parsing to create function")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-13 12:29:48 -04:00
Lu Jialin
d95af61df0 cgroup/cpuset: fix typos in comments
Change hierachy to hierarchy and unrechable to unreachable,
no functionality changed.

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-12 17:20:53 -04:00
Rafael J. Wysocki
0210b8eb72 Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull ARM cpufreq updates for v5.13 from Viresh Kumar:

"- Fix typos in s5pv210 cpufreq driver (Bhaskar Chowdhury).

 - Armada 37xx: Fix cpufreq changing base CPU speed to 800 MHz from
   1000 MHz (Pali Rohár and Marek Behún).

 - cpufreq-dt: Return -EPROBE_DEFER on failure to add table (Quanyang
   Wang).

 - Minor cleanup in cppc driver (Tom Saeger).

 - Add frequency invariance support for CPPC driver and generalize
   freq invariance support arch-topology driver (Viresh Kumar)."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: armada-37xx: Fix module unloading
  cpufreq: armada-37xx: Remove cur_frequency variable
  cpufreq: armada-37xx: Fix determining base CPU frequency
  cpufreq: armada-37xx: Fix driver cleanup when registration failed
  clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
  clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
  cpufreq: armada-37xx: Fix the AVS value for load L1
  clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
  cpufreq: armada-37xx: Fix setting TBG parent for load levels
  cpufreq: dt: dev_pm_opp_of_cpumask_add_table() may return -EPROBE_DEFER
  cpufreq: cppc: simplify default delay_us setting
  cpufreq: Rudimentary typos fix in the file s5pv210-cpufreq.c
  cpufreq: CPPC: Add support for frequency invariance
  arch_topology: Export arch_freq_scale and helpers
  arch_topology: Allow multiple entities to provide sched_freq_tick() callback
  arch_topology: Rename freq_scale as arch_freq_scale
2021-04-12 14:46:33 +02:00
Jens Axboe
c7aab1a7c5 task_work: add helper for more targeted task_work canceling
The only exported helper we have right now is task_work_cancel(), which
cancels any task_work from a given task where func matches the queued
work item. This is a bit too coarse for some use cases. Add a
task_work_cancel_match() that allows to more specifically target
individual work items outside of purely the callback function used.

task_work_cancel() can be trivially implemented on top of that, hence do
so.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:25 -06:00
Jens Axboe
66ae0d1e2d kernel: allow fork with TIF_NOTIFY_SIGNAL pending
fork() fails if signal_pending() is true, but there are two conditions
that can lead to that:

1) An actual signal is pending. We want fork to fail for that one, like
   we always have.

2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work.
   We don't need to make it fail for that case.

Allow fork() to proceed if just task_work is pending, by changing the
signal_pending() check to task_sigpending().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
Linus Torvalds
add6b92660 Two minor fixes: one for a Clang warning, the other improves an
ambiguous/confusing kernel log message.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmBy4N4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hvVg/7BWV2CDlG2g19KASGgWQo53+JO40E5j1j
 dG6h5u709TAYKtr2eckP0fC2Eah6Y3CXbZ2P2JMAwrO4KUETAyzyjlupFDBeiqap
 SI6XX6alir6M0p6uy0UaEjzqbRJHSMJJKYPyA3YoewsQiKVCbYf6xYQJ0cA5VHaQ
 ZcAof/lvmH7uMG3rDWbHDE3G1A67fK3FUp8AxR2i+7IAcG25q0Ov2gYWNOTM3Eh2
 vSAuot/HPXIFjZDakfnx+iC2bIRZ3D6jgwlZNPIyzPFXB7A4+fyuFoJ8VbKyNRUa
 38f0gQVYad7AnhriDjeaAngcPeHaBEWyWGnQXX99mZy7jqa/HbFZUI6Btxb+Ertg
 rGZ7XyfOalZfaD6UBfc8Pr0fvn7Ci6chww5XO6F/A4P02lWwcQreNLMJQ8Jzujvf
 FBADbHT7QvTm48JHJqfID/jNAp9/u2jjHq3I3B8k0gHkGsnVwliyHinxoaE8XyzG
 vXDk/C0RvdywJBAz5H3VbR1Q6NeTCl/IIzGf4e7XjxH6Y3tyHeDJI3EdSojUrnGk
 V74CzRnqsnC+kfvU92Ms5+daeRMOyctZTpOeIGSjD3AYxo7D7FqFTw3M1L2vnqDn
 oe6qrUDLHNITPzouuDDGDS9c0aHtRydfPSEz3NH3WLgVyfJM6rgZ5BX9ItZKFgMR
 ZCeoY21JXeA=
 =y7/k
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixlets from Ingo Molnar:
 "Two minor fixes: one for a Clang warning, the other improves an
  ambiguous/confusing kernel log message"

* tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Address clang -Wformat warning printing for %hd
  lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message
2021-04-11 11:47:03 -07:00
Ingo Molnar
eedd634134 Merge branch 'for-mingo-kcsan' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core
Pull KCSAN changes from Paul E. McKenney: misc updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-11 14:35:02 +02:00
Ingo Molnar
120b566d1d Merge branch 'for-mingo-rcu' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Bitmap support for "N" as alias for last bit

 - kvfree_rcu updates

 - mm_dump_obj() updates.  (One of these is to mm, but was suggested by Andrew Morton.)

 - RCU callback offloading update

 - Polling RCU grace-period interfaces

 - Realtime-related RCU updates

 - Tasks-RCU updates

 - Torture-test updates

 - Torture-test scripting updates

 - Miscellaneous fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-11 14:31:43 +02:00
Nicholas Piggin
7c07012eb1 genirq: Reduce irqdebug cacheline bouncing
note_interrupt() increments desc->irq_count for each interrupt even for
percpu interrupt handlers, even when they are handled successfully. This
causes cacheline bouncing and limits scalability.

Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cache line
bouncing in the common path.

This actually should give better consistency in handling misbehaving
irqs too, because instead of the first unhandled irq arriving at an
arbitrary point in the irq_count cycle, its arrival will begin the
irq_count cycle.

Cédric reports the result of his IPI throughput test:

               Millions of IPIs/s
 -----------   --------------------------------------
               upstream   upstream   patched
 chips  cpus   default    noirqdebug default (irqdebug)
 -----------   -----------------------------------------
 1      0-15     4.061      4.153      4.084
        0-31     7.937      8.186      8.158
        0-47    11.018     11.392     11.233
        0-63    11.460     13.907     14.022
 2      0-79     8.376     18.105     18.084
        0-95     7.338     22.101     22.266
        0-111    6.716     25.306     25.473
        0-127    6.223     27.814     28.029

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210402132037.574661-1-npiggin@gmail.com
2021-04-10 13:35:54 +02:00
Tetsuo Handa
c5e3a41187 kernel: Initialize cpumask before parsing
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.

  [  148.133411][ T5509] =====================================================
  [  148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
  [  148.137819][ T5509]
  [  148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
  [  148.140768][ T5509]  irq_affinity_proc_write+0xc3/0x3d0
  [  148.142298][ T5509]  irq_affinity_proc_write+0xc3/0x3d0
  [  148.143823][ T5509] =====================================================

Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
2021-04-10 13:35:54 +02:00
Jakub Kicinski
8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Linus Torvalds
adb2c4174f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 patches.

  Subsystems affected by this patch series: mm (kasan, gup, pagecache,
  and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2, ia64, and lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
  kfence, x86: fix preemptible warning on KPTI-enabled systems
  lib/test_kasan_module.c: suppress unused var warning
  kasan: fix conflict with page poisoning
  fs: direct-io: fix missing sdio->boundary
  ia64: fix user_stack_pointer() for ptrace()
  ocfs2: fix deadlock between setattr and dio_end_io_write
  gcov: re-fix clang-11+ support
  nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
  mm/gup: check page posion status for coredump.
  .mailmap: fix old email addresses
  mailmap: update email address for Jordan Crouse
  treewide: change my e-mail address, fix my name
  MAINTAINERS: update CZ.NIC's Turris information
2021-04-09 17:06:32 -07:00
Linus Torvalds
4e04e7513b Networking fixes for 5.12-rc7, including fixes from can, ipsec,
mac80211, wireless, and bpf trees. No scary regressions here
 or in the works, but small fixes for 5.12 changes keep coming.
 
 Current release - regressions:
 
  - virtio: do not pull payload in skb->head
 
  - virtio: ensure mac header is set in virtio_net_hdr_to_skb()
 
  - Revert "net: correct sk_acceptq_is_full()"
 
  - mptcp: revert "mptcp: provide subflow aware release function"
 
  - ethernet: lan743x: fix ethernet frame cutoff issue
 
  - dsa: fix type was not set for devlink port
 
  - ethtool: remove link_mode param and derive link params
             from driver
 
  - sched: htb: fix null pointer dereference on a null new_q
 
  - wireless: iwlwifi: Fix softirq/hardirq disabling in
                       iwl_pcie_enqueue_hcmd()
 
  - wireless: iwlwifi: fw: fix notification wait locking
 
  - wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding
                             the rtnl dependency
 
 Current release - new code bugs:
 
  - napi: fix hangup on napi_disable for threaded napi
 
  - bpf: take module reference for trampoline in module
 
  - wireless: mt76: mt7921: fix airtime reporting and related
                            tx hangs
 
  - wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
                                 config command
 
 Previous releases - regressions:
 
  - rfkill: revert back to old userspace API by default
 
  - nfc: fix infinite loop, refcount & memory leaks in LLCP sockets
 
  - let skb_orphan_partial wake-up waiters
 
  - xfrm/compat: Cleanup WARN()s that can be user-triggered
 
  - vxlan, geneve: do not modify the shared tunnel info when PMTU
                   triggers an ICMP reply
 
  - can: fix msg_namelen values depending on CAN_REQUIRED_SIZE
 
  - can: uapi: mark union inside struct can_frame packed
 
  - sched: cls: fix action overwrite reference counting
 
  - sched: cls: fix err handler in tcf_action_init()
 
  - ethernet: mlxsw: fix ECN marking in tunnel decapsulation
 
  - ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
 
  - ethernet: i40e: fix receiving of single packets in xsk zero-copy
                    mode
 
  - ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic
 
 Previous releases - always broken:
 
  - bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET
 
  - bpf: Refcount task stack in bpf_get_task_stack
 
  - bpf, x86: Validate computation of branch displacements
 
  - ieee802154: fix many similar syzbot-found bugs
     - fix NULL dereferences in netlink attribute handling
     - reject unsupported operations on monitor interfaces
     - fix error handling in llsec_key_alloc()
 
  - xfrm: make ipv4 pmtu check honor ip header df
 
  - xfrm: make hash generation lock per network namespace
 
  - xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
               offload
 
  - ethtool: fix incorrect datatype in set_eee ops
 
  - xdp: fix xdp_return_frame() kernel BUG throw for page_pool
         memory model
 
  - openvswitch: fix send of uninitialized stack memory in ct limit
                 reply
 
 Misc:
 
  - udp: add get handling for UDP_GRO sockopt
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmBwyfAACgkQMUZtbf5S
 IruJ/BAAnjghw2kWXRCKK3Tkm0pi0zjaKvTS30AcKCW2+GnqSxTdiWNv+mxqFgnm
 YdduPKiGwLoDkA2i2d4EF8/HK6m+Q6bHcUbZ2npEm1ElkKfxCYGmocor8n2kD+a9
 je94VGYV7zytnxXw85V6/jFLDqOXXwhBfHhlDMVBZP8OyzUfbDKGorWmyGuy9GJp
 81bvzqN2bHUGIM0cDr+ol3eYw2ituGWgiqNfnq7z+/NVcYmD0EPChDRbp0jtH1ng
 dcoONI6YlymDEDpu/9GmyKL1ken9lcWoVdvv/aDGtP62x6SYDt5HKe3wAtJ+Kjbq
 jIPADxPx5BymYIZRBtdNR0rP66LycA7hDtM/C/h1WoihDXwpGeNUU4g0aJ+hsP5Q
 ldwJI1DJo79VbwM2c3Kg73PaphLcPD4RdwF0/ovFsl0+bTDfj8i93ah4Wnzj0Qli
 EMiSDEDNb51e9nkW+xu+FjLWmxHJvLOL/+VgHV5bPJJBob2fqnjAMj2PkPEuEtXY
 TPWEh9y3zaEyp/9tNx0cstGOt6Gf5DQ5Nk6tX6hMpJT/BeL8mju1jm0yPLZhMJjF
 LlTrJgXftfP/cjltdSm4aVqSU5okjHNYDhmHlNgvzih5mt+NVslRJfzwq62Vudqy
 C0kpmVdQNFkOB0UcqQihevZg9mvem3m/dYl+v/MV7Uq6r4s4M2A=
 =SHL0
 -----END PGP SIGNATURE-----

Merge tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.12-rc7, including fixes from can, ipsec,
  mac80211, wireless, and bpf trees.

  No scary regressions here or in the works, but small fixes for 5.12
  changes keep coming.

  Current release - regressions:

   - virtio: do not pull payload in skb->head

   - virtio: ensure mac header is set in virtio_net_hdr_to_skb()

   - Revert "net: correct sk_acceptq_is_full()"

   - mptcp: revert "mptcp: provide subflow aware release function"

   - ethernet: lan743x: fix ethernet frame cutoff issue

   - dsa: fix type was not set for devlink port

   - ethtool: remove link_mode param and derive link params from driver

   - sched: htb: fix null pointer dereference on a null new_q

   - wireless: iwlwifi: Fix softirq/hardirq disabling in
     iwl_pcie_enqueue_hcmd()

   - wireless: iwlwifi: fw: fix notification wait locking

   - wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding the
     rtnl dependency

  Current release - new code bugs:

   - napi: fix hangup on napi_disable for threaded napi

   - bpf: take module reference for trampoline in module

   - wireless: mt76: mt7921: fix airtime reporting and related tx hangs

   - wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
     config command

  Previous releases - regressions:

   - rfkill: revert back to old userspace API by default

   - nfc: fix infinite loop, refcount & memory leaks in LLCP sockets

   - let skb_orphan_partial wake-up waiters

   - xfrm/compat: Cleanup WARN()s that can be user-triggered

   - vxlan, geneve: do not modify the shared tunnel info when PMTU
     triggers an ICMP reply

   - can: fix msg_namelen values depending on CAN_REQUIRED_SIZE

   - can: uapi: mark union inside struct can_frame packed

   - sched: cls: fix action overwrite reference counting

   - sched: cls: fix err handler in tcf_action_init()

   - ethernet: mlxsw: fix ECN marking in tunnel decapsulation

   - ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx

   - ethernet: i40e: fix receiving of single packets in xsk zero-copy
     mode

   - ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic

  Previous releases - always broken:

   - bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET

   - bpf: Refcount task stack in bpf_get_task_stack

   - bpf, x86: Validate computation of branch displacements

   - ieee802154: fix many similar syzbot-found bugs
       - fix NULL dereferences in netlink attribute handling
       - reject unsupported operations on monitor interfaces
       - fix error handling in llsec_key_alloc()

   - xfrm: make ipv4 pmtu check honor ip header df

   - xfrm: make hash generation lock per network namespace

   - xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
     offload

   - ethtool: fix incorrect datatype in set_eee ops

   - xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory
     model

   - openvswitch: fix send of uninitialized stack memory in ct limit
     reply

  Misc:

   - udp: add get handling for UDP_GRO sockopt"

* tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (182 commits)
  net: fix hangup on napi_disable for threaded napi
  net: hns3: Trivial spell fix in hns3 driver
  lan743x: fix ethernet frame cutoff issue
  net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
  net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
  net: dsa: lantiq_gswip: Don't use PHY auto polling
  net: sched: sch_teql: fix null-pointer dereference
  ipv6: report errors for iftoken via netlink extack
  net: sched: fix err handler in tcf_action_init()
  net: sched: fix action overwrite reference counting
  Revert "net: sched: bump refcount for new action in ACT replace mode"
  ice: fix memory leak of aRFS after resuming from suspend
  i40e: Fix sparse warning: missing error code 'err'
  i40e: Fix sparse error: 'vsi->netdev' could be null
  i40e: Fix sparse error: uninitialized symbol 'ring'
  i40e: Fix sparse errors in i40e_txrx.c
  i40e: Fix parameters in aq_get_phy_register()
  nl80211: fix beacon head validation
  bpf, x86: Validate computation of branch displacements for x86-32
  bpf, x86: Validate computation of branch displacements for x86-64
  ...
2021-04-09 15:26:51 -07:00
Nick Desaulniers
9562fd1329 gcov: re-fix clang-11+ support
LLVM changed the expected function signature for llvm_gcda_emit_function()
in the clang-11 release.  Users of clang-11 or newer may have noticed
their kernels producing invalid coverage information:

  $ llvm-cov gcov -a -c -u -f -b <input>.gcda -- gcno=<input>.gcno
  1 <func>: checksum mismatch, \
    (<lineno chksum A>, <cfg chksum B>) != (<lineno chksum A>, <cfg chksum C>)
  2 Invalid .gcda File!
  ...

Fix up the function signatures so calling this function interprets its
parameters correctly and computes the correct cfg checksum.  In
particular, in clang-11, the additional checksum is no longer optional.

Link: https://reviews.llvm.org/rG25544ce2df0daa4304c07e64b9c8b0f7df60c11d
Link: https://lkml.kernel.org/r/20210408184631.1156669-1-ndesaulniers@google.com
Reported-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org>	[5.4+]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-09 14:54:23 -07:00
Valentin Schneider
4aed8aa415 sched/fair: Introduce a CPU capacity comparison helper
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.

One may argue the 20% margin of fits_capacity() is excessive in the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

Also note that comparing capacity extrema of local and source sched_group's
doesn't make much sense when at the day of the day the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.

While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
2021-04-09 18:02:21 +02:00
Valentin Schneider
23fb06d960 sched/fair: Clean up active balance nr_balance_failed trickery
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().

This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.

Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Cleanup the LBF_DST_PINNED active balance special case.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
2021-04-09 18:02:20 +02:00
Lingutla Chandrasekhar
9bcb959d05 sched/fair: Ignore percpu threads for imbalance pulls
During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upgmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
2021-04-09 18:02:20 +02:00
Rik van Riel
c722f35b51 sched/fair: Bring back select_idle_smt(), but differently
Mel Gorman did some nice work in 9fe1f127b9 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.

The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.

This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.

On our web serving workload, that effect is usually negligible.

It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.

The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.

This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.

With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
2021-04-09 18:01:39 +02:00
Peter Zijlstra
9432bbd969 static_call: Relax static_call_update() function argument type
static_call_update() had stronger type requirements than regular C,
relax them to match. Instead of requiring the @func argument has the
exact matching type, allow any type which C is willing to promote to the
right (function) pointer type. Specifically this allows (void *)
arguments.

This cleans up a bunch of static_call_update() callers for
PREEMPT_DYNAMIC and should get around silly GCC11 warnings for free.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YFoN7nCl8OfGtpeh@hirez.programming.kicks-ass.net
2021-04-09 13:22:12 +02:00
Matthieu Baerts
7d95f22798 static_call: Fix unused variable warn w/o MODULE
Here is the warning converted as error and reported by GCC:

  kernel/static_call.c: In function ‘__static_call_update’:
  kernel/static_call.c:153:18: error: unused variable ‘mod’ [-Werror=unused-variable]
    153 |   struct module *mod = site_mod->mod;
        |                  ^~~
  cc1: all warnings being treated as errors
  make[1]: *** [scripts/Makefile.build:271: kernel/static_call.o] Error 1

This is simply because since recently, we no longer use 'mod' variable
elsewhere if MODULE is unset.

When using 'make tinyconfig' to generate the default kconfig, MODULE is
unset.

There are different ways to fix this warning. Here I tried to minimised
the number of modified lines and not add more #ifdef. We could also move
the declaration of the 'mod' variable inside the if-statement or
directly use site_mod->mod.

Fixes: 698bacefe9 ("static_call: Align static_call_is_init() patching condition")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210326105023.2058860-1-matthieu.baerts@tessares.net
2021-04-09 13:22:12 +02:00
Sami Tolvanen
8b8e6b5d3b kallsyms: strip ThinLTO hashes from static functions
With CONFIG_CFI_CLANG and ThinLTO, Clang appends a hash to the names
of all static functions not marked __used. This can break userspace
tools that don't expect the function name to change, so strip out the
hash from the output.

Suggested-by: Jack Pham <jackp@codeaurora.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-8-samitolvanen@google.com
2021-04-08 16:04:21 -07:00
Sami Tolvanen
0a5b412891 kthread: use WARN_ON_FUNCTION_MISMATCH
With CONFIG_CFI_CLANG, a callback function passed to
__kthread_queue_delayed_work from a module points to a jump table
entry defined in the module instead of the one used in the core
kernel, which breaks function address equality in this check:

  WARN_ON_ONCE(timer->function != ktead_delayed_work_timer_fn);

Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-7-samitolvanen@google.com
2021-04-08 16:04:21 -07:00
Sami Tolvanen
981731129e workqueue: use WARN_ON_FUNCTION_MISMATCH
With CONFIG_CFI_CLANG, a callback function passed to
__queue_delayed_work from a module points to a jump table entry
defined in the module instead of the one used in the core kernel,
which breaks function address equality in this check:

  WARN_ON_ONCE(timer->function != delayed_work_timer_fn);

Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-6-samitolvanen@google.com
2021-04-08 16:04:21 -07:00
Sami Tolvanen
cf68fffb66 add support for Clang CFI
This change adds support for Clang’s forward-edge Control Flow
Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler
injects a runtime check before each indirect function call to ensure
the target is a valid function with the correct static type. This
restricts possible call targets and makes it more difficult for
an attacker to exploit bugs that allow the modification of stored
function pointers. For more details, see:

  https://clang.llvm.org/docs/ControlFlowIntegrity.html

Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain
visibility to possible call targets. Kernel modules are supported
with Clang’s cross-DSO CFI mode, which allows checking between
independently compiled components.

With CFI enabled, the compiler injects a __cfi_check() function into
the kernel and each module for validating local call targets. For
cross-module calls that cannot be validated locally, the compiler
calls the global __cfi_slowpath_diag() function, which determines
the target module and calls the correct __cfi_check() function. This
patch includes a slowpath implementation that uses __module_address()
to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a
shadow map that speeds up module look-ups by ~3x.

Clang implements indirect call checking using jump tables and
offers two methods of generating them. With canonical jump tables,
the compiler renames each address-taken function to <function>.cfi
and points the original symbol to a jump table entry, which passes
__cfi_check() validation. This isn’t compatible with stand-alone
assembly code, which the compiler doesn’t instrument, and would
result in indirect calls to assembly code to fail. Therefore, we
default to using non-canonical jump tables instead, where the compiler
generates a local jump table entry <function>.cfi_jt for each
address-taken function, and replaces all references to the function
with the address of the jump table entry.

Note that because non-canonical jump table addresses are local
to each component, they break cross-module function address
equality. Specifically, the address of a global function will be
different in each module, as it's replaced with the address of a local
jump table entry. If this address is passed to a different module,
it won’t match the address of the same function taken there. This
may break code that relies on comparing addresses passed from other
components.

CFI checking can be disabled in a function with the __nocfi attribute.
Additionally, CFI can be disabled for an entire compilation unit by
filtering out CC_FLAGS_CFI.

By default, CFI failures result in a kernel panic to stop a potential
exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the
kernel prints out a rate-limited warning instead, and allows execution
to continue. This option is helpful for locating type mismatches, but
should only be enabled during development.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
2021-04-08 16:04:20 -07:00
Josh Hunt
6db12ee045 psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
2021-04-08 23:09:44 +02:00
Rafael J. Wysocki
71f4dd3441 Merge back earlier cpuidle updates for v5.13. 2021-04-08 20:05:49 +02:00
Lu Jialin
e4b2897ae1 PM: sleep: fix typos in comments
Change "occured" to "occurred" in kernel/power/autosleep.c.

Change "consiting" to "consisting" in kernel/power/snapshot.c.

Change "avaiable" to "available" in kernel/power/swap.c.

No functionality changed.

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-08 19:37:21 +02:00
Rafael J. Wysocki
4c81cb7e64 tick/nohz: Improve tick_nohz_get_next_hrtimer() kerneldoc
Make the tick_nohz_get_next_hrtimer() kerneldoc comment state clearly
that the function may return negative numbers.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-07 19:26:44 +02:00
Thomas Gleixner
b2c67cbe9f time: Add mechanism to recognize clocksource in time_get_snapshot
System time snapshots are not conveying information about the current
clocksource which was used, but callers like the PTP KVM guest
implementation have the requirement to evaluate the clocksource type to
select the appropriate mechanism.

Introduce a clocksource id field in struct clocksource which is by default
set to CSID_GENERIC (0). Clocksource implementations can set that field to
a value which allows to identify the clocksource.

Store the clocksource id of the current clocksource in the
system_time_snapshot so callers can evaluate which clocksource was used to
take the snapshot and act accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201209060932.212364-5-jianyong.wu@arm.com
2021-04-07 16:33:20 +01:00
Marc Zyngier
4a35d6a037 irqdomain: Get rid of irq_create_identity_mapping()
The sole user of irq_create_identity_mapping() having been converted,
get rid of the unused helper.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-04-07 13:25:52 +01:00
Muhammad Usama Anjum
957dca3df6 bpf, inode: Remove second initialization of the bpf_preload_lock
bpf_preload_lock is already defined with DEFINE_MUTEX(). There is no
need to initialize it again. Remove the extraneous initialization.

Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210405194904.GA148013@LEGION
2021-04-06 23:39:13 +02:00
Tetsuo Handa
5dc33592e9 lockdep: Allow tuning tracing capacity constants.
Since syzkaller continues various test cases until the kernel crashes,
syzkaller tends to examine more locking dependencies than normal systems.
As a result, syzbot is reporting that the fuzz testing was terminated
due to hitting upper limits lockdep can track [1] [2] [3]. Since analysis
via /proc/lockdep* did not show any obvious culprit [4] [5], we have no
choice but allow tuning tracing capacity constants.

[1] https://syzkaller.appspot.com/bug?id=3d97ba93fb3566000c1c59691ea427370d33ea1b
[2] https://syzkaller.appspot.com/bug?id=381cb436fe60dc03d7fd2a092b46d7f09542a72a
[3] https://syzkaller.appspot.com/bug?id=a588183ac34c1437fc0785e8f220e88282e5a29f
[4] https://lkml.kernel.org/r/4b8f7a57-fa20-47bd-48a0-ae35d860f233@i-love.sakura.ne.jp
[5] https://lkml.kernel.org/r/1c351187-253b-2d49-acaf-4563c63ae7d2@i-love.sakura.ne.jp

References: https://lkml.kernel.org/r/1595640639-9310-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
2021-04-05 20:33:57 +09:00
Vipin Sharma
7aef27f0b2 svm/sev: Register SEV and SEV-ES ASIDs to the misc controller
Secure Encrypted Virtualization (SEV) and Secure Encrypted
Virtualization - Encrypted State (SEV-ES) ASIDs are used to encrypt KVMs
on AMD platform. These ASIDs are available in the limited quantities on
a host.

Register their capacity and usage to the misc controller for tracking
via cgroups.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:34:46 -04:00
Vipin Sharma
a72232eabd cgroup: Add misc cgroup controller
The Miscellaneous cgroup provides the resource limiting and tracking
mechanism for the scalar resources which cannot be abstracted like the
other cgroup resources. Controller is enabled by the CONFIG_CGROUP_MISC
config option.

A resource can be added to the controller via enum misc_res_type{} in
the include/linux/misc_cgroup.h file and the corresponding name via
misc_res_name[] in the kernel/cgroup/misc.c file. Provider of the
resource must set its capacity prior to using the resource by calling
misc_cg_set_capacity().

Once a capacity is set then the resource usage can be updated using
charge and uncharge APIs. All of the APIs to interact with misc
controller are in include/linux/misc_cgroup.h.

Miscellaneous controller provides 3 interface files. If two misc
resources (res_a and res_b) are registered then:

misc.capacity
A read-only flat-keyed file shown only in the root cgroup.  It shows
miscellaneous scalar resources available on the platform along with
their quantities::

    $ cat misc.capacity
    res_a 50
    res_b 10

misc.current
A read-only flat-keyed file shown in the non-root cgroups.  It shows
the current usage of the resources in the cgroup and its children::

    $ cat misc.current
    res_a 3
    res_b 0

misc.max
A read-write flat-keyed file shown in the non root cgroups. Allowed
maximum usage of the resources in the cgroup and its children.::

    $ cat misc.max
    res_a max
    res_b 4

Limit can be set by::

    # echo res_a 1 > misc.max

Limit can be set to max by::

    # echo res_a max > misc.max

Limits can be set more than the capacity value in the misc.capacity
file.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:34:46 -04:00
Wang Qing
89e28ce60c workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()
84;0;0c84;0;0c
There are two workqueue-specific watchdog timestamps:

    + @wq_watchdog_touched_cpu (per-CPU) updated by
      touch_softlockup_watchdog()

    + @wq_watchdog_touched (global) updated by
      touch_all_softlockup_watchdogs()

watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by a known slow code.

Solution:
touch_softlockup_watchdog() must touch also the global @wq_watchdog_touched
timestamp.

The global timestamp can no longer be used for bound workqueues because
it is now updated from all CPUs. Instead, bound workqueues have to check
only @wq_watchdog_touched_cpu and these timestamps have to be updated for
all CPUs in touch_all_softlockup_watchdogs().

Beware:
The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.

It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.

V3:
- Modify the commit message clearly according to Petr's suggestion.

Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:26:49 -04:00
Zqiang
0687c66b5f workqueue: Move the position of debug_work_activate() in __queue_work()
The debug_work_activate() is called on the premise that
the work can be inserted, because if wq be in WQ_DRAINING
status, insert work may be failed.

Fixes: e41e704bc4 ("workqueue: improve destroy_workqueue() debuggability")
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:26:46 -04:00
He Fengqing
2ec9898e9c bpf: Remove unused parameter from ___bpf_prog_run
'stack' parameter is not used in ___bpf_prog_run() after f696b8f471
("bpf: split bpf core interpreter"), the base address have been set to
FP reg. So consequently remove it.

Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210331075135.3850782-1-hefengqing@huawei.com
2021-04-03 01:38:52 +02:00
David S. Miller
c2bcb4cf02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 68 non-merge commits during the last 7 day(s) which contain
a total of 70 files changed, 2944 insertions(+), 1139 deletions(-).

The main changes are:

1) UDP support for sockmap, from Cong.

2) Verifier merge conflict resolution fix, from Daniel.

3) xsk selftests enhancements, from Maciej.

4) Unstable helpers aka kernel func calling, from Martin.

5) Batches ops for LPM map, from Pedro.

6) Fix race in bpf_get_local_storage, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02 11:03:07 -07:00
Linus Torvalds
05de45383b Fix stack trace entry size to stop showing garbage
The macro that creates both the structure and the format displayed
 to user space for the stack trace event was changed a while ago
 to fix the parsing by user space tooling. But this change also modified
 the structure used to store the stack trace event. It changed the
 caller array field from [0] to [8]. Even though the size in the ring
 buffer is dynamic and can be something other than 8 (user space knows
 how to handle this), the 8 extra words was not accounted for when
 reserving the event on the ring buffer, and added 8 more entries, due
 to the calculation of "sizeof(*entry) + nr_entries * sizeof(long)",
 as the sizeof(*entry) now contains 8 entries. The size of the caller
 field needs to be subtracted from the size of the entry to create
 the correct allocation size.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYGccURQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiboAPwNM1q8A7EFLDGfj+3tXksvp4H3hXd3
 ErMd2OMlsNQtRAD9GGmYyt2OtFdxZWzKOSEC07vdxq2TYTz50mqJM81YbgE=
 =7hwx
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix stack trace entry size to stop showing garbage

  The macro that creates both the structure and the format displayed to
  user space for the stack trace event was changed a while ago to fix
  the parsing by user space tooling. But this change also modified the
  structure used to store the stack trace event. It changed the caller
  array field from [0] to [8].

  Even though the size in the ring buffer is dynamic and can be
  something other than 8 (user space knows how to handle this), the 8
  extra words was not accounted for when reserving the event on the ring
  buffer, and added 8 more entries, due to the calculation of
  "sizeof(*entry) + nr_entries * sizeof(long)", as the sizeof(*entry)
  now contains 8 entries.

  The size of the caller field needs to be subtracted from the size of
  the entry to create the correct allocation size"

* tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix stack trace event size
2021-04-02 08:39:00 -07:00
Xiang Chen
ca947482b0 dma-mapping: benchmark: Add support for multi-pages map/unmap
Currently it only support one page map/unmap once a time for dma-map
benchmark, but there are some other scenaries which need to support for
multi-page map/unmap: for those multi-pages interfaces such as
dma_alloc_coherent() and dma_map_sg(), the time spent on multi-pages
map/unmap is not the time of a single page * npages (not linear) as it
may use block description instead of page description when it is satified
with the size such as 2M/1G, and also it can send a single TLB invalidation
command to invalidate multi-pages instead of multi-times when RIL is
enabled (which will short the time of unmap). So it is necessary to add
support for multi-pages map/unmap.

Add a parameter "-g" to support multi-pages map/unmap.

Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Acked-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 16:41:08 +02:00
Hao Fang
42e4eefb08 dma-mapping: benchmark: use the correct HiSilicon copyright
s/Hisilicon/HiSilicon/g.
It should use capital S, according to
https://www.hisilicon.com/en/terms-of-use.

Signed-off-by: Hao Fang <fanghao11@huawei.com>
Acked-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 16:41:08 +02:00
Lorenz Bauer
d37300ed18 bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET
As for bpf_link, refuse creating a non-O_RDWR fd. Since program fds
currently don't allow modifications this is a precaution, not a
straight up bug fix.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-2-lmb@cloudflare.com
2021-04-01 14:33:48 -07:00
Lorenz Bauer
25fc94b2f0 bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GET
Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access
permissions based on file_flags, but the returned fd ignores flags.
This means that any user can acquire a "read-write" fd for a pinned
link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in
file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc.

Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works
because OBJ_GET by default returns a read write mapping and libbpf
doesn't expose a way to override this behaviour for programs
and links.

Fixes: 70ed506c3b ("bpf: Introduce pinnable bpf_link abstraction")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com
2021-04-01 14:33:14 -07:00
Dave Marchevsky
06ab134ce8 bpf: Refcount task stack in bpf_get_task_stack
On x86 the struct pt_regs * grabbed by task_pt_regs() points to an
offset of task->stack. The pt_regs are later dereferenced in
__bpf_get_stack (e.g. by user_mode() check). This can cause a fault if
the task in question exits while bpf_get_task_stack is executing, as
warned by task_stack_page's comment:

* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead.  task_stack_page will return a pointer
* that could get freed out from under you.

Taking the comment's advice and using try_get_task_stack() and
put_task_stack() to hold task->stack refcount, or bail early if it's
already 0. Incrementing stack_refcount will ensure the task's stack
sticks around while we're using its data.

I noticed this bug while testing a bpf task iter similar to
bpf_iter_task_stack in selftests, except mine grabbed user stack, and
getting intermittent crashes, which resulted in dumps like:

  BUG: unable to handle page fault for address: 0000000000003fe0
  \#PF: supervisor read access in kernel mode
  \#PF: error_code(0x0000) - not-present page
  RIP: 0010:__bpf_get_stack+0xd0/0x230
  <snip...>
  Call Trace:
  bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec
  bpf_iter_run_prog+0x24/0x81
  __task_seq_show+0x58/0x80
  bpf_seq_read+0xf7/0x3d0
  vfs_read+0x91/0x140
  ksys_read+0x59/0xd0
  do_syscall_64+0x48/0x120
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com
2021-04-01 13:58:07 -07:00
Steven Rostedt (VMware)
ceaaa12904 ftrace: Simplify the calculation of page number for ftrace_page->records some more
Commit b40c6eabfc ("ftrace: Simplify the calculation of page number for
ftrace_page->records") simplified the calculation of the number of pages
needed for each page group without having any empty pages, but it can be
simplified even further.

Link: https://lore.kernel.org/lkml/CAHk-=wjt9b7kxQ2J=aDNKbR1QBMB3Hiqb_hYcZbKsxGRSEb+gQ@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-01 16:56:47 -04:00
Linus Torvalds
db42523b4f ftrace: Store the order of pages allocated in ftrace_page
Instead of saving the size of the records field of the ftrace_page, store
the order it uses to allocate the pages, as that is what is needed to know
in order to free the pages. This simplifies the code.

Link: https://lore.kernel.org/lkml/CAHk-=whyMxheOqXAORt9a7JK9gc9eHTgCJ55Pgs4p=X3RrQubQ@mail.gmail.com/

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Steven Rostedt ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-01 16:55:45 -04:00
Florian Fainelli
2726bf3ff2 swiotlb: Make SWIOTLB_NO_FORCE perform no allocation
When SWIOTLB_NO_FORCE is used, there should really be no allocations of
default_nslabs to occur since we are not going to use those slabs. If a
platform was somehow setting swiotlb_no_force and a later call to
swiotlb_init() was to be made we would still be proceeding with
allocating the default SWIOTLB size (64MB), whereas if swiotlb=noforce
was set on the kernel command line we would have only allocated 2KB.

This would be inconsistent and the point of initializing default_nslabs
to 1, was intended to allocate the minimum amount of memory possible, so
simply remove that minimal allocation period.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-04-01 20:46:38 +00:00
Yordan Karadzhov (VMware)
f3ef7202ef tracing: Remove unused argument from "ring_buffer_time_stamp()
The "cpu" parameter is not being used by the function.

Link: https://lkml.kernel.org/r/20210329130331.199402-1-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-01 14:18:32 -04:00
Steven Rostedt (VMware)
22d5755a85 Merge branch 'trace/ftrace/urgent' into HEAD
Needed to merge trace/ftrace/urgent to get:

  Commit 59300b36f8 ("ftrace: Check if pages were allocated before calling free_pages()")

To clean up the code that is affected by it as well.
2021-04-01 14:16:37 -04:00
Steven Rostedt (VMware)
9deb193af6 tracing: Fix stack trace event size
Commit cbc3b92ce0 fixed an issue to modify the macros of the stack trace
event so that user space could parse it properly. Originally the stack
trace format to user space showed that the called stack was a dynamic
array. But it is not actually a dynamic array, in the way that other
dynamic event arrays worked, and this broke user space parsing for it. The
update was to make the array look to have 8 entries in it. Helper
functions were added to make it parse it correctly, as the stack was
dynamic, but was determined by the size of the event stored.

Although this fixed user space on how it read the event, it changed the
internal structure used for the stack trace event. It changed the array
size from [0] to [8] (added 8 entries). This increased the size of the
stack trace event by 8 words. The size reserved on the ring buffer was the
size of the stack trace event plus the number of stack entries found in
the stack trace. That commit caused the amount to be 8 more than what was
needed because it did not expect the caller field to have any size. This
produced 8 entries of garbage (and reading random data) from the stack
trace event:

          <idle>-0       [002] d... 1976396.837549: <stack trace>
 => trace_event_raw_event_sched_switch
 => __traceiter_sched_switch
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => secondary_startup_64_no_verify
 => 0xc8c5e150ffff93de
 => 0xffff93de
 => 0
 => 0
 => 0xc8c5e17800000000
 => 0x1f30affff93de
 => 0x00000004
 => 0x200000000

Instead, subtract the size of the caller field from the size of the event
to make sure that only the amount needed to store the stack trace is
reserved.

Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/

Cc: stable@vger.kernel.org
Fixes: cbc3b92ce0 ("tracing: Set kernel_stack's caller size properly")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-01 14:06:33 -04:00
Cong Wang
a7ba4558e6 sock_map: Introduce BPF_SK_SKB_VERDICT
Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
confusing and more importantly we still want to distinguish them
from user-space. So we can just reuse the stream verdict code but
introduce a new type of eBPF program, skb_verdict. Users are not
allowed to attach stream_verdict and skb_verdict programs to the
same map.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Linus Torvalds
d19cc4bfbf Add check of order < 0 before calling free_pages()
The function addresses that are traced by ftrace are stored in pages,
 and the size is held in a variable. If there's some error in creating
 them, the allocate ones will be freed. In this case, it is possible that
 the order of pages to be freed may end up being negative due to a size of
 zero passed to get_count_order(), and then that negative number will cause
 free_pages() to free a very large section. Make sure that does not happen.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYGR30BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnbDAP9yEhTLcDRUi3VLWnEq19Dt4Lsg86Bf
 QRpbWG6Ze9EbZQEAgYAOe1fsNCNEIMXXh/4nlKVpKKH+vviS0ux9Z6uhpQQ=
 =Veyq
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "Add check of order < 0 before calling free_pages()

  The function addresses that are traced by ftrace are stored in pages,
  and the size is held in a variable. If there's some error in creating
  them, the allocate ones will be freed. In this case, it is possible
  that the order of pages to be freed may end up being negative due to a
  size of zero passed to get_count_order(), and then that negative
  number will cause free_pages() to free a very large section.

  Make sure that does not happen"

* tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Check if pages were allocated before calling free_pages()
2021-03-31 10:14:55 -07:00
Cui GaoSheng
a3fc712c5b seccomp: Fix "cacheable" typo in comments
Do a trivial typo fix: s/cachable/cacheable/

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210331030724.84419-1-cuigaosheng1@huawei.com
2021-03-30 22:34:30 -07:00
Colin Ian King
235fc0e36d bpf: Remove redundant assignment of variable id
The variable id is being assigned a value that is never read, the
assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210326194348.623782-1-colin.king@canonical.com
2021-03-30 22:58:53 +02:00
Steven Rostedt (VMware)
59300b36f8 ftrace: Check if pages were allocated before calling free_pages()
It is possible that on error pg->size can be zero when getting its order,
which would return a -1 value. It is dangerous to pass in an order of -1
to free_pages(). Check if order is greater than or equal to zero before
calling free_pages().

Link: https://lore.kernel.org/lkml/20210330093916.432697c7@gandalf.local.home/

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-30 09:58:38 -04:00
Bhaskar Chowdhury
acebb5597f kernel/printk.c: Fixed mundane typos
s/sempahore/semaphore/
s/exacly/exactly/
s/unregistred/unregistered/
s/interation/iteration/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
[pmladek@suse.com: Removed 4th hunk. The string has already been removed in the meantime.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210328043932.8310-1-unixbhaskar@gmail.com
2021-03-30 15:34:17 +02:00
Rasmus Villemoes
28e1745b9f printk: rename vprintk_func to vprintk
The printk code is already hard enough to understand. Remove an
unnecessary indirection by renaming vprintk_func to vprintk (adding
the asmlinkage annotation), and removing the vprintk definition from
printk.c. That way, printk is implemented in terms of vprintk as one
would expect, and there's no "vprintk_func, what's that? Some function
pointer that gets set where?"

The declaration of vprintk in linux/printk.h already has the
__printf(1,0) attribute, there's no point repeating that with the
definition - it's for diagnostics in callers.

linux/printk.h already contains a static inline {return 0;} definition
of vprintk when !CONFIG_PRINTK.

Since the corresponding stub definition of vprintk_func was not marked
"static inline", any translation unit including internal.h would get a
definition of vprintk_func - it just so happens that for
!CONFIG_PRINTK, there is precisely one such TU, namely printk.c. Had
there been more, it would be a link error; now it's just a silly waste
of a few bytes of .text, which one must assume are rather precious to
anyone disabling PRINTK.

$ objdump -dr kernel/printk/printk.o
00000330 <vprintk_func>:
 330:   31 c0                   xor    %eax,%eax
 332:   c3                      ret
 333:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
 33a:   8d b6 00 00 00 00       lea    0x0(%esi),%esi

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210323144201.486050-1-linux@rasmusvillemoes.dk
2021-03-30 15:21:18 +02:00
Bartosz Golaszewski
883ccef355 genirq/irq_sim: Shrink devm_irq_domain_create_sim()
The custom devres structure manages only a single pointer which can
can be achieved by using devm_add_action_or_reset() as well which
makes the code simpler.

[ tglx: Fixed return value handling - found by smatch ]

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210301142659.8971-1-brgl@bgdev.pl
2021-03-30 13:21:27 +02:00
Miroslav Benes
8df1947c71 livepatch: Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure
Livepatch sends a fake signal to all remaining blocking tasks of a
running transition after a set period of time. It uses TIF_SIGPENDING
flag for the purpose. Commit 12db8b6900 ("entry: Add support for
TIF_NOTIFY_SIGNAL") added a generic infrastructure to achieve the same.
Replace our bespoke solution with the generic one.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2021-03-30 09:40:21 +02:00
Niklas Söderlund
d4c7c28806 timekeeping: Allow runtime PM from change_clocksource()
The struct clocksource callbacks enable() and disable() are described as a
way to allow clock sources to enter a power save mode. See commit
4614e6adaf ("clocksource: add enable() and disable() callbacks")

But using runtime PM from these callbacks triggers a cyclic lockdep warning when
switching clock source using change_clocksource().

  # echo e60f0000.timer > /sys/devices/system/clocksource/clocksource0/current_clocksource
   ======================================================
   WARNING: possible circular locking dependency detected
   ------------------------------------------------------
   migration/0/11 is trying to acquire lock:
   ffff0000403ed220 (&dev->power.lock){-...}-{2:2}, at: __pm_runtime_resume+0x40/0x74

   but task is already holding lock:
   ffff8000113c8f88 (tk_core.seq.seqcount){----}-{0:0}, at: multi_cpu_stop+0xa4/0x190

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (tk_core.seq.seqcount){----}-{0:0}:
          ktime_get+0x28/0xa0
          hrtimer_start_range_ns+0x210/0x2dc
          generic_sched_clock_init+0x70/0x88
          sched_clock_init+0x40/0x64
          start_kernel+0x494/0x524

   -> #1 (hrtimer_bases.lock){-.-.}-{2:2}:
          hrtimer_start_range_ns+0x68/0x2dc
          rpm_suspend+0x308/0x5dc
          rpm_idle+0xc4/0x2a4
          pm_runtime_work+0x98/0xc0
          process_one_work+0x294/0x6f0
          worker_thread+0x70/0x45c
          kthread+0x154/0x160
          ret_from_fork+0x10/0x20

   -> #0 (&dev->power.lock){-...}-{2:2}:
          _raw_spin_lock_irqsave+0x7c/0xc4
          __pm_runtime_resume+0x40/0x74
          sh_cmt_start+0x1c4/0x260
          sh_cmt_clocksource_enable+0x28/0x50
          change_clocksource+0x9c/0x160
          multi_cpu_stop+0xa4/0x190
          cpu_stopper_thread+0x90/0x154
          smpboot_thread_fn+0x244/0x270
          kthread+0x154/0x160
          ret_from_fork+0x10/0x20

   other info that might help us debug this:

   Chain exists of:
     &dev->power.lock --> hrtimer_bases.lock --> tk_core.seq.seqcount

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(tk_core.seq.seqcount);
                                  lock(hrtimer_bases.lock);
                                  lock(tk_core.seq.seqcount);
     lock(&dev->power.lock);

    *** DEADLOCK ***

   2 locks held by migration/0/11:
    #0: ffff8000113c9278 (timekeeper_lock){-.-.}-{2:2}, at: change_clocksource+0x2c/0x160
    #1: ffff8000113c8f88 (tk_core.seq.seqcount){----}-{0:0}, at: multi_cpu_stop+0xa4/0x190

Rework change_clocksource() so it enables the new clocksource and disables
the old clocksource outside of the timekeeper_lock and seqcount write held
region. There is no requirement that these callbacks are invoked from the
lock held region.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20210211134318.323910-1-niklas.soderlund+renesas@ragnatech.se
2021-03-29 16:41:59 +02:00
Thomas Gleixner
a51a327f3b locking/rtmutex: Clean up signal handling in __rt_mutex_slowlock()
The signal handling in __rt_mutex_slowlock() is open coded.

Use signal_pending_state() instead.

Aside of the cleanup this also prepares for the RT lock substituions which
require support for TASK_KILLABLE.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210326153944.533811987@linutronix.de
2021-03-29 15:57:05 +02:00
Thomas Gleixner
c2c360ed7f locking/rtmutex: Restrict the trylock WARN_ON() to debug
The warning as written is expensive and not really required for a
production kernel. Make it depend on rt mutex debugging and use !in_task()
for the condition which generates far better code and gives the same
answer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210326153944.436565064@linutronix.de
2021-03-29 15:57:04 +02:00
Thomas Gleixner
82cd5b1039 locking/rtmutex: Fix misleading comment in rt_mutex_postunlock()
Preemption is disabled in mark_wakeup_next_waiter(,) not in
rt_mutex_slowunlock().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210326153944.341734608@linutronix.de
2021-03-29 15:57:04 +02:00
Thomas Gleixner
70c80103aa locking/rtmutex: Consolidate the fast/slowpath invocation
The indirection via a function pointer (which is at least optimized into a
tail call by the compiler) is making the code hard to read.

Clean it up and move the futex related trylock functions down to the futex
section.

Move the wake_q wakeup into rt_mutex_slowunlock(). No point in handing it
to the caller. The futex code uses a different function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210326153944.247927548@linutronix.de
2021-03-29 15:57:04 +02:00