Commit Graph

31917 Commits

Yangtao Li
35f4cd96f5 stop_machine: Make stop_cpus() static
The function stop_cpus() is only used internally by stop_machine
to stop multiple CPUs.

Make it static.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191228161912.24082-1-tiny.windzz@gmail.com
2020-01-17 10:19:21 +01:00
Wei Li
02d4ac5885 sched/debug: Reset watchdog on all CPUs while processing sysrq-t
Lengthy output of sysrq-t may take a lot of time on slow serial console
with lots of processes and CPUs.

So we need to reset NMI-watchdog to avoid spurious lockup messages, and
we also reset softlockup watchdogs on all other CPUs since another CPU
might be blocked waiting for us to process an IPI or stop_machine.

Add this to sysrq_sched_debug_show(), as was done in show_state_filter().
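
A minimal sketch of the shape of that change, mirroring show_state_filter()
(the exact loop body of sysrq_sched_debug_show() is assumed here):

	void sysrq_sched_debug_show(void)
	{
		int cpu;

		sched_debug_header(NULL);
		for_each_online_cpu(cpu) {
			/*
			 * Reset softlockup watchdogs on all CPUs, because
			 * another CPU might be blocked waiting for us to
			 * process an IPI or stop_machine.
			 */
			touch_all_softlockup_watchdogs();
			touch_nmi_watchdog();
			print_cpu(NULL, cpu);
		}
	}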

Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
2020-01-17 10:19:20 +01:00
Li Guanglei
dcd6dffb0a sched/core: Fix size of rq::uclamp initialization
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.
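
A minimal sketch of the kind of initialization fix being described (assuming
rq->uclamp is declared as an array indexed by clamp id):

	/*
	 * rq->uclamp is an array of struct uclamp_rq; sizeof() on the array
	 * member covers every clamp id, not just the first entry.
	 */
	memset(&rq->uclamp, 0, sizeof(rq->uclamp));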

Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Li Guanglei <guanglei.li@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
2020-01-17 10:19:20 +01:00
Qais Yousef
7226017ad3 sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff(), which looks at the hierarchy and
updates to the most restrictive values.

Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.

Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max}.
This will cause the rq to be clamped to max, hence causing the
system to run at max frequency.

The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.

By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
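
A minimal sketch of the described fix, assuming a css_online hook for the
cpu controller (the hook name and locking details are assumed here;
cpu_util_update_eff() is the helper named above):

	static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
	{
	#ifdef CONFIG_UCLAMP_TASK_GROUP
		/* Propagate the effective uclamp values for the new group */
		cpu_util_update_eff(css);
	#endif

		return 0;
	}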

Fixes: 0b60ba2dd3 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
2020-01-17 10:19:20 +01:00
Viresh Kumar
323af6deaf sched/fair: Load balance aggressively for SCHED_IDLE CPUs
The fair scheduler performs periodic load balance on every CPU to check
if it can pull some tasks from other busy CPUs. The duration of this
periodic load balance is set to sd->balance_interval for the idle CPUs
and is calculated by multiplying the sd->balance_interval with the
sd->busy_factor (set to 32 by default) for the busy CPUs. The
multiplication is done for busy CPUs to avoid doing load balance too
often and rather spend more time executing actual tasks. While that is
the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
tasks.

With the recent enhancements in the fair scheduler around SCHED_IDLE
CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
CPU instead of other busy or idle CPUs. The same reasoning should be
applied to the load balancer as well to make it migrate tasks more
aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
latency of the migrated (SCHED_OTHER) tasks.

This patch makes minimal changes to the fair scheduler to do the next
load balance soon after the last non-SCHED_IDLE task is dequeued from a
runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is
ignored while calculating the balance_interval for such CPUs. This is
done to avoid delaying the periodic load balance by a few hundred
milliseconds for SCHED_IDLE CPUs.
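
A minimal sketch of the interval calculation being described (helper and
field names assumed):

	unsigned long interval = sd->balance_interval;

	/*
	 * A CPU running only SCHED_IDLE tasks is treated like an idle CPU:
	 * don't stretch its balance interval by sd->busy_factor (32 by
	 * default), so it can pull runnable tasks quickly.
	 */
	if (cpu_busy && !sched_idle_cpu(cpu))
		interval *= sd->busy_factor;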

This is tested on ARM64 Hikey620 platform (octa-core) with the help of
rt-app and it is verified, using kernel traces, that the newly
SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE
and pulls tasks from other busy CPUs.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
2020-01-17 10:19:20 +01:00
Vincent Guittot
5f68eb19b5 sched/fair : Improve update_sd_pick_busiest for spare capacity case
Similarly to calculate_imbalance() and find_busiest_group(), using the
number of idle CPUs when there is only 1 CPU in the group is not efficient
because we can't distinguish between a CPU running 1 task and a CPU
running dozens of small tasks competing for the same CPU but not enough
to overload it. More generally speaking, when groups have the same number
of idle CPUs, we should use the number of running tasks instead of blindly
selecting the first group.

When the groups have spare capacity and the same number of idle CPUs, we
compare the number of running tasks to select the busiest group.
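
A minimal sketch of that tie-break in update_sd_pick_busiest() (field names
assumed):

	case group_has_spare:
		/* Prefer the group with fewer idle CPUs ... */
		if (sgs->idle_cpus > busiest->idle_cpus)
			return false;
		/* ... and on a tie, the one with more running tasks. */
		if (sgs->idle_cpus == busiest->idle_cpus &&
		    sgs->sum_nr_running <= busiest->sum_nr_running)
			return false;
		break;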

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1576839893-26930-1-git-send-email-vincent.guittot@linaro.org
2020-01-17 10:19:19 +01:00
Jisheng Zhang
db5793c599 watchdog: Remove soft_lockup_hrtimer_cnt and related code
After commit 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u"
threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is
not used any more, so remove it and related code.

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191218131720.4146aea2@xhacker.debian
2020-01-17 10:19:19 +01:00
Qais Yousef
804d402fb6 sched/rt: Make RT capacity-aware
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.

In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.

The definition of task.requirement is dependent on the scheduling class.

For CFS, utilization is used to select a CPU that has a capacity >=
cfs_task.util:

	capacity_orig_of(cpu) >= cfs_task.util

DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:

https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/

	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime

For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.

ATM the uclamp values are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what is implemented here.

	capacity_orig_of(cpu) >= rt_task.uclamp_min

Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which makes for better, more
predictable behavior in the default case.

Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.

On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.
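
A minimal sketch of the fitness test being described (the helper name is
assumed; uclamp_eff_value() reads the task's effective clamps):

	static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
	{
		unsigned long min_cap = uclamp_eff_value(p, UCLAMP_MIN);
		unsigned long max_cap = uclamp_eff_value(p, UCLAMP_MAX);
		unsigned long cpu_cap = capacity_orig_of(cpu);

		/* capacity_orig_of(cpu) >= rt_task.uclamp_min (hint only) */
		return cpu_cap >= min(min_cap, max_cap);
	}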

[ mingo: Minor edits to comments. ]

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:10 +01:00
Valentin Schneider
1d42509e47 sched/fair: Make EAS wakeup placement consider uclamp restrictions
task_fits_capacity() has just been made uclamp-aware, and
find_energy_efficient_cpu() needs to go through the same treatment.

Things are somewhat different here however - using the task max clamp isn't
sufficient. Consider the following setup:

  The target runqueue, rq:
    rq.cpu_capacity_orig = 512
    rq.cfs.avg.util_avg = 200
    rq.uclamp.max = 768 // the max p.uclamp.max of all enqueued p's is 768

  The waking task, p (not yet enqueued on rq):
    p.util_est = 600
    p.uclamp.max = 100

Now, consider the following code which doesn't use the rq clamps:

  util = uclamp_task_util(p);
  // Does the task fit in the spare CPU capacity?
  cpu = cpu_of(rq);
  fits_capacity(util, cpu_capacity(cpu) - cpu_util(cpu))

This would lead to:

  util = 100;
  fits_capacity(100, 512 - 200)

fits_capacity() would return true. However, enqueuing p on that CPU *will*
cause it to become overutilized since rq clamp values are max-aggregated,
so we'd remain with

  rq.uclamp.max = 768

which comes from the other tasks already enqueued on rq. Thus, we could
select a high enough frequency to reach beyond 0.8 * 512 utilization
(== overutilized) after enqueuing p on rq. What find_energy_efficient_cpu()
needs here is uclamp_rq_util_with() which lets us peek at the future
utilization landscape, including rq-wide uclamp values.

Make find_energy_efficient_cpu() use uclamp_rq_util_with() for its
fits_capacity() check. This is in line with what compute_energy() ends up
using for estimating utilization.
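
A minimal sketch of the fits_capacity() check with the rq-wide clamps taken
into account (the utilization helper name is assumed):

	/* Utilization of this CPU with p enqueued on it ... */
	util = cpu_util_next(cpu, p, cpu);
	/* ... clamped by the rq-wide uclamp values p would be subject to. */
	util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
	if (!fits_capacity(util, capacity_of(cpu)))
		continue;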

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-6-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:09 +01:00
Valentin Schneider
a7008c07a5 sched/fair: Make task_fits_capacity() consider uclamp restrictions
task_fits_capacity() drives CPU selection at wakeup time, and is also used
to detect misfit tasks. Right now it does so by comparing task_util_est()
with a CPU's capacity, but doesn't take into account uclamp restrictions.

There are a few interesting uses that can come out of doing this. For
instance, a low uclamp.max value could prevent certain tasks from being
flagged as misfit tasks, so they could merrily remain on low-capacity CPUs.
Similarly, a high uclamp.min value would steer tasks towards high capacity
CPUs at wakeup (and, should that fail, later steered via misfit balancing),
so such "boosted" tasks would favor CPUs of higher capacity.

Introduce uclamp_task_util() and make task_fits_capacity() use it.
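
A minimal sketch of the new helper and its use (a sketch assuming the
effective clamps come from uclamp_eff_value()):

	static inline unsigned long uclamp_task_util(struct task_struct *p)
	{
		/* Clamp the task's estimated utilization by its own uclamp values */
		return clamp(task_util_est(p),
			     uclamp_eff_value(p, UCLAMP_MIN),
			     uclamp_eff_value(p, UCLAMP_MAX));
	}

	static inline int task_fits_capacity(struct task_struct *p, long capacity)
	{
		return fits_capacity(uclamp_task_util(p), capacity);
	}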

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:09 +01:00
Valentin Schneider
d2b58a286e sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
The current helper returns (CPU) rq utilization with uclamp restrictions
taken into account. A uclamp task utilization helper would be quite
helpful, but this requires some renaming.

Prepare the code for the introduction of a uclamp_task_util() by renaming
the existing uclamp_util_with() to uclamp_rq_util_with().

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:08 +01:00
Valentin Schneider
686516b55e sched/uclamp: Make uclamp util helpers use and return UL values
Vincent pointed out recently that the canonical type for utilization
values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for
cache optimization, but this doesn't have to be exported to its users.

Make the uclamp helpers that deal with utilization use and return unsigned
long values.

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:08 +01:00
Valentin Schneider
59fe675248 sched/uclamp: Remove uclamp_util()
The sole user of uclamp_util(), schedutil_cpu_util(), was made to use
uclamp_util_with() instead in commit:

  af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")

From then on, uclamp_util() has remained unused. Being a simple wrapper
around uclamp_util_with(), we can get rid of it and win back a few lines.

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:07 +01:00
Viresh Kumar
17346452b2 sched/fair: Make sched-idle CPU selection consistent throughout
There are instances where we keep searching for an idle CPU despite
already having a sched-idle CPU (in find_idlest_group_cpu(),
select_idle_smt() and select_idle_cpu()), and then there are places where
we don't necessarily do that and return a sched-idle CPU as soon as we
find one (in select_idle_sibling()). This looks a bit inconsistent and
it may be worth having the same policy everywhere.

On the other hand, choosing a sched-idle CPU over an idle one should be
beneficial from a performance and power point of view as well, as we don't
need to wake the CPU up from a deep idle state, which wastes quite a
lot of time and energy and delays the scheduling of the newly woken-up
task.

This patch tries to simplify code around sched-idle CPU selection and
make it consistent throughout.
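
A minimal sketch of the policy being made consistent (a sketch only; the
per-path details differ across the helpers listed above):

	/* Treat a sched-idle CPU like an idle one: stop the search right away */
	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
		return cpu;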

Testing is done with the help of rt-app on a hikey board (ARM64 octa-core,
2 clusters, CPUs 0-3 and 4-7). The cpufreq governor was set to performance to
avoid any side effects from CPU frequency. Following are the tests
performed:

Test 1: 1-cfs-task:

 A single SCHED_NORMAL task is pinned to CPU5 which runs for 2333 us
 out of 7777 us (so gives time for the cluster to go in deep idle
 state).

Test 2: 1-cfs-1-idle-task:

 A single SCHED_NORMAL task is pinned on CPU5 and single SCHED_IDLE
 task is pinned on CPU6 (to make sure cluster 1 doesn't go in deep idle
 state).

Test 3: 1-cfs-8-idle-task:

 A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE
 tasks are created which run forever (not pinned anywhere, so they run
 on all CPUs). Checked with kernelshark that as soon as NORMAL task
 sleeps, the SCHED_IDLE task starts running on CPU5.

And here are the results on mean latency (in us), using the "st" tool.

  $ st 1-cfs-task/rt-app-cfs_thread-0.log
  N       min     max     sum     mean    stddev
  642     90      592     197180  307.134 109.906

  $ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log
  N       min     max     sum     mean    stddev
  642     67      311     113850  177.336 41.4251

  $ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log
  N       min     max     sum     mean    stddev
  643     29      173     41364   64.3297 13.2344

The mean latency when we need to:

 - wakeup from deep idle state is 307 us.
 - wakeup from shallow idle state is 177 us.
 - preempt a SCHED_IDLE task is 64 us.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/b90cbcce608cef4e02a7bbfe178335f76d201bab.1573728344.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:07 +01:00
Qian Cai
53a23364b6 sched/core: Remove unused variable from set_user_nice()
This commit left behind an unused variable:

  5443a0be61 ("sched: Use fair:prio_changed() instead of ad-hoc implementation")

  kernel/sched/core.c: In function 'set_user_nice':
  kernel/sched/core.c:4507:16: warning: variable 'delta' set but not used
    int old_prio, delta;
                ^~~~~

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5443a0be61 ("sched: Use fair:prio_changed() instead of ad-hoc implementation")
Link: https://lkml.kernel.org/r/20191219140314.1252-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:06 +01:00
Ingo Molnar
1e5f8a3085 Linux 5.5-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4AEiYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGR3sH/ixrBBYUVyjRPOxS
 ce4iVoTqphGSoAzq/3FA1YZZOPQ/Ep0NXL4L2fTGxmoiqIiuy8JPp07/NKbHQjj1
 Rt6PGm6cw2pMJHaK9gRdlTH/6OyXkp06OkH1uHqKYrhPnpCWDnj+i2SHAX21Hr1y
 oBQh4/XKvoCMCV96J2zxRsLvw8OkQFE0ouWWfj6LbpXIsmWZ++s0OuaO1cVdP/oG
 j+j2Voi3B3vZNQtGgJa5W7YoZN5Qk4ZIj9bMPg7bmKRd3wNB228AiJH2w68JWD/I
 jCA+JcITilxC9ud96uJ6k7SMS2ufjQlnP0z6Lzd0El1yGtHYRcPOZBgfOoPU2Euf
 33WGSyI=
 =iEwx
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc3' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:41:37 +01:00
Linus Torvalds
78bac77b52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
    including adding a missing ipv6 match description.

 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
    Bhat.

 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.

 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.

 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.

 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
    Chaignon.

 7) Multicast MAC limit test is off by one in qede, from Manish Chopra.

 8) Fix established socket lookup race when socket goes from
    TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening
    RCU grace period. From Eric Dumazet.

 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.

10) Fix active backup transition after link failure in bonding, from
    Mahesh Bandewar.

11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.

12) Fix wrong interface passed to ->mac_link_up(), from Russell King.

13) Fix DSA egress flooding settings in b53, from Florian Fainelli.

14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.

15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.

16) Reject invalid MTU values in stmmac, from Jose Abreu.

17) Fix refcount leak in error path of u32 classifier, from Davide
    Caratti.

18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
    Kaseorg.

19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.

20) Disable hardware GRO when XDP is attached to qede, frm Manish
    Chopra.

21) Since we encode state in the low pointer bits, dst metrics must be
    at least 4 byte aligned, which is not necessarily true on m68k. Add
    annotations to fix this, from Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
  sfc: Include XDP packet headroom in buffer step size.
  sfc: fix channel allocation with brute force
  net: dst: Force 4-byte alignment of dst_metrics
  selftests: pmtu: fix init mtu value in description
  hv_netvsc: Fix unwanted rx_table reset
  net: phy: ensure that phy IDs are correctly typed
  mod_devicetable: fix PHY module format
  qede: Disable hardware gro when xdp prog is installed
  net: ena: fix issues in setting interrupt moderation params in ethtool
  net: ena: fix default tx interrupt moderation interval
  net/smc: unregister ib devices in reboot_event
  net: stmmac: platform: Fix MDIO init for platforms without PHY
  llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
  net: hisilicon: Fix a BUG trigered by wrong bytes_compl
  net: dsa: ksz: use common define for tag len
  s390/qeth: don't return -ENOTSUPP to userspace
  s390/qeth: fix promiscuous mode after reset
  s390/qeth: handle error due to unsupported transport mode
  cxgb4: fix refcount init for TC-MQPRIO offload
  tc-testing: initial tdc selftests for cls_u32
  ...
2019-12-22 09:54:33 -08:00
Linus Torvalds
b8e382a185 Various tracing fixes:
- Fix memory leak on error path of process_system_preds()
  - Lock inversion fix with updating tgid recording option
  - Fix histogram compare function on big endian machines
  - Fix histogram trigger function on big endian machines
  - Make trace_printk() irq sync on init for kprobe selftest correctness
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXf6MRxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlw6AQCny2YeASymmOjDqh9/G53UdhO539Y2
 oL/2nQ8B9T9KWgD6AmmohhbX+TS9l5Nwy2/bKmRgADZ7u+2XLM2f2mYR2Ag=
 =D7hI
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix memory leak on error path of process_system_preds()

 - Lock inversion fix with updating tgid recording option

 - Fix histogram compare function on big endian machines

 - Fix histogram trigger function on big endian machines

 - Make trace_printk() irq sync on init for kprobe selftest correctness

* tag 'trace-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix endianness bug in histogram trigger
  samples/trace_printk: Wait for IRQ work to finish
  tracing: Fix lock inversion in trace_event_enable_tgid_record()
  tracing: Have the histogram compare functions convert to u64 first
  tracing: Avoid memory leak in process_system_preds()
2019-12-21 15:16:56 -08:00
Sven Schnelle
fe6e096a5b tracing: Fix endianness bug in histogram trigger
At least on PA-RISC and s390 synthetic histogram triggers are failing
selftests because trace_event_raw_event_synth() always writes 64 bit
values, but the reader expects a field->size sized value. On little endian
machines this doesn't hurt, but on big endian this makes the reader always
read zero values.

Link: http://lore.kernel.org/linux-trace-devel/20191218074427.96184-4-svens@linux.ibm.com

Cc: stable@vger.kernel.org
Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-21 16:08:59 -05:00
Prateek Sood
3a53acf1d9 tracing: Fix lock inversion in trace_event_enable_tgid_record()
Task T2                             Task T3
trace_options_core_write()            subsystem_open()

 mutex_lock(trace_types_lock)           mutex_lock(event_mutex)

 set_tracer_flag()

   trace_event_enable_tgid_record()       mutex_lock(trace_types_lock)

    mutex_lock(event_mutex)

This gives a circular dependency deadlock between trace_types_lock and
event_mutex. To fix this invert the usage of trace_types_lock and
event_mutex in trace_options_core_write(). This keeps the sequence of
lock usage consistent.
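
A minimal sketch of the consistent ordering in trace_options_core_write()
(a sketch; the actual refactoring around set_tracer_flag() may differ):

	/* Take event_mutex before trace_types_lock, as subsystem_open() does */
	mutex_lock(&event_mutex);
	mutex_lock(&trace_types_lock);

	ret = set_tracer_flag(tr, 1 << index, val);

	mutex_unlock(&trace_types_lock);
	mutex_unlock(&event_mutex);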

Link: http://lkml.kernel.org/r/0101016eef175e38-8ca71caf-a4eb-480d-a1e6-6f0bbc015495-000000@us-west-2.amazonses.com

Cc: stable@vger.kernel.org
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-21 16:05:13 -05:00
Linus Torvalds
fd7a6d2b8f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes: a (rare) PSI crash fix, a CPU affinity related balancing
  fix, and a toning down of active migration attempts"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cfs: fix spurious active migration
  sched/fair: Fix find_idlest_group() to handle CPU affinity
  psi: Fix a division error in psi poll()
  sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
2019-12-21 10:52:10 -08:00
Linus Torvalds
c4ff10efe8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: a BTS fix, a PT NMI handling fix, a PMU sysfs fix and an
  SRCU annotation"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Add SRCU annotation for pmus list walk
  perf/x86/intel: Fix PT PMI handling
  perf/x86/intel/bts: Fix the use of page_private()
  perf/x86: Fix potential out-of-bounds access
2019-12-21 10:51:00 -08:00
Steven Rostedt (VMware)
106f41f5a3 tracing: Have the histogram compare functions convert to u64 first
The compare functions of the histogram code would be specific for the size
of the value being compared (byte, short, int, long long). It would
reference the value from the array via the type of the compare, but the
value was stored in a 64 bit number. This is fine for little endian
machines, but for big endian machines, it would end up comparing zeros or
all ones (depending on the sign) for anything but 64 bit numbers.

To fix this, first dereference the value as a u64 then convert it to the type
being compared.
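
A minimal sketch of one such compare function after the change (the function
name here is illustrative):

	static int tracing_map_cmp_s16(void *val_a, void *val_b)
	{
		/* The stored value is always a u64; convert after dereferencing */
		s16 a = (s16)(*(u64 *)val_a);
		s16 b = (s16)(*(u64 *)val_b);

		return (a > b) ? 1 : ((a < b) ? -1 : 0);
	}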

Link: http://lkml.kernel.org/r/20191211103557.7bed6928@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Acked-by: Tom Zanussi <zanussi@kernel.org>
Reported-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-19 18:26:00 -05:00
Keita Suzuki
79e65c27f0 tracing: Avoid memory leak in process_system_preds()
When failing in the allocation of filter_item, process_system_preds()
goes to fail_mem, where the allocated filter is freed.

However, this leads to a memory leak of filter->filter_string and
filter->prog, which are allocated before and in process_preds().
This bug has been detected by kmemleak as well.

Fix this by changing kfree to __free_fiter.

unreferenced object 0xffff8880658007c0 (size 32):
  comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
  hex dump (first 32 bytes):
    63 6f 6d 6d 6f 6e 5f 70 69 64 20 20 3e 20 31 30  common_pid  > 10
    00 00 00 00 00 00 00 00 65 73 00 00 00 00 00 00  ........es......
  backtrace:
    [<0000000067441602>] kstrdup+0x2d/0x60
    [<00000000141cf7b7>] apply_subsystem_event_filter+0x378/0x932
    [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
    [<0000000072da2bee>] vfs_write+0xe1/0x240
    [<000000004f14f473>] ksys_write+0xb4/0x150
    [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
    [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff888060c22d00 (size 64):
  comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 e8 d7 41 80 88 ff ff  ...........A....
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000b8c1b109>] process_preds+0x243/0x1820
    [<000000003972c7f0>] apply_subsystem_event_filter+0x3be/0x932
    [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
    [<0000000072da2bee>] vfs_write+0xe1/0x240
    [<000000004f14f473>] ksys_write+0xb4/0x150
    [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
    [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff888041d7e800 (size 512):
  comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
  hex dump (first 32 bytes):
    70 bc 85 97 ff ff ff ff 0a 00 00 00 00 00 00 00  p...............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000001e04af34>] process_preds+0x71a/0x1820
    [<000000003972c7f0>] apply_subsystem_event_filter+0x3be/0x932
    [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
    [<0000000072da2bee>] vfs_write+0xe1/0x240
    [<000000004f14f473>] ksys_write+0xb4/0x150
    [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
    [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Link: http://lkml.kernel.org/r/20191211091258.11310-1-keitasuzuki.park@sslab.ics.keio.ac.jp

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 404a3add43 ("tracing: Only add filter list when needed")
Signed-off-by: Keita Suzuki <keitasuzuki.park@sslab.ics.keio.ac.jp>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-19 18:24:17 -05:00
Daniel Borkmann
cc52d9140a bpf: Fix record_func_key to perform backtracking on r3
While testing Cilium with /unreleased/ Linus' tree under BPF-based NodePort
implementation, I noticed a strange BPF SNAT engine behavior from time to
time. In some cases it would do the correct SNAT/DNAT service translation,
but at a random point in time it would just stop and perform an unexpected
translation after SYN, SYN/ACK and stack would send a RST back. While initially
assuming that there is some sort of a race condition in BPF code, adding
trace_printk()s for debugging purposes at some point seemed to have resolved
the issue auto-magically.

Digging deeper on this Heisenbug and reducing the trace_printk() calls to
an absolute minimum, it turns out that a single call would suffice to
trigger / not trigger the seen RST issue, even though the logic of the
program itself remains unchanged. Turns out the single call changed verifier
pruning behavior to get everything to work. Reconstructing a minimal test
case, the incorrect JIT dump looked as follows:

  # bpftool p d j i 11346
  0xffffffffc0cba96c:
  [...]
    21:   movzbq 0x30(%rdi),%rax
    26:   cmp    $0xd,%rax
    2a:   je     0x000000000000003a
    2c:   xor    %edx,%edx
    2e:   movabs $0xffff89cc74e85800,%rsi
    38:   jmp    0x0000000000000049
    3a:   mov    $0x2,%edx
    3f:   movabs $0xffff89cc74e85800,%rsi
    49:   mov    -0x224(%rbp),%eax
    4f:   cmp    $0x20,%eax
    52:   ja     0x0000000000000062
    54:   add    $0x1,%eax
    57:   mov    %eax,-0x224(%rbp)
    5d:   jmpq   0xffffffffffff6911
    62:   mov    $0x1,%eax
  [...]

Hence, unexpectedly, the JIT emitted a direct jump even though a retpoline
based one would have been needed since in lines 2c and 3a we have different
slot keys in BPF reg r3. The verifier log of the test case reveals what happened:

  0: (b7) r0 = 14
  1: (73) *(u8 *)(r1 +48) = r0
  2: (71) r0 = *(u8 *)(r1 +48)
  3: (15) if r0 == 0xd goto pc+4
   R0_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=ctx(id=0,off=0,imm=0) R10=fp0
  4: (b7) r3 = 0
  5: (18) r2 = 0xffff89cc74d54a00
  7: (05) goto pc+3
  11: (85) call bpf_tail_call#12
  12: (b7) r0 = 1
  13: (95) exit
  from 3 to 8: R0_w=inv13 R1=ctx(id=0,off=0,imm=0) R10=fp0
  8: (b7) r3 = 2
  9: (18) r2 = 0xffff89cc74d54a00
  11: safe
  processed 13 insns (limit 1000000) [...]

The second branch is pruned by the verifier since it is considered safe, but
the issue is that record_func_key() couldn't have seen the index in line 3a
and therefore decided that emitting a direct jump at this location was okay.

Fix this by reusing our backtracking logic for precise scalar verification
in order to prevent pruning on the slot key. This means verifier will track
content of r3 all the way backwards and only prune if both scalars were
unknown in state equivalence check and therefore poisoned in the first place
in record_func_key(). The range is [x,x] in record_func_key() case since
the slot always would have to be constant immediate. Correct verification
after fix:

  0: (b7) r0 = 14
  1: (73) *(u8 *)(r1 +48) = r0
  2: (71) r0 = *(u8 *)(r1 +48)
  3: (15) if r0 == 0xd goto pc+4
   R0_w=invP(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=ctx(id=0,off=0,imm=0) R10=fp0
  4: (b7) r3 = 0
  5: (18) r2 = 0x0
  7: (05) goto pc+3
  11: (85) call bpf_tail_call#12
  12: (b7) r0 = 1
  13: (95) exit
  from 3 to 8: R0_w=invP13 R1=ctx(id=0,off=0,imm=0) R10=fp0
  8: (b7) r3 = 2
  9: (18) r2 = 0x0
  11: (85) call bpf_tail_call#12
  12: (b7) r0 = 1
  13: (95) exit
  processed 15 insns (limit 1000000) [...]

And correct corresponding JIT dump:

  # bpftool p d j i 11
  0xffffffffc0dc34c4:
  [...]
    21:	  movzbq 0x30(%rdi),%rax
    26:	  cmp    $0xd,%rax
    2a:	  je     0x000000000000003a
    2c:	  xor    %edx,%edx
    2e:	  movabs $0xffff9928b4c02200,%rsi
    38:	  jmp    0x0000000000000049
    3a:	  mov    $0x2,%edx
    3f:	  movabs $0xffff9928b4c02200,%rsi
    49:	  cmp    $0x4,%rdx
    4d:	  jae    0x0000000000000093
    4f:	  and    $0x3,%edx
    52:	  mov    %edx,%edx
    54:	  cmp    %edx,0x24(%rsi)
    57:	  jbe    0x0000000000000093
    59:	  mov    -0x224(%rbp),%eax
    5f:	  cmp    $0x20,%eax
    62:	  ja     0x0000000000000093
    64:	  add    $0x1,%eax
    67:	  mov    %eax,-0x224(%rbp)
    6d:	  mov    0x110(%rsi,%rdx,8),%rax
    75:	  test   %rax,%rax
    78:	  je     0x0000000000000093
    7a:	  mov    0x30(%rax),%rax
    7e:	  add    $0x19,%rax
    82:   callq  0x000000000000008e
    87:   pause
    89:   lfence
    8c:   jmp    0x0000000000000087
    8e:   mov    %rax,(%rsp)
    92:   retq
    93:   mov    $0x1,%eax
  [...]

Also explicitly add env->allow_ptr_leaks to fixup_bpf_calls() since
backtracking is enabled under the former (direct jumps as well, but those use
a different test). In case of only tracking different map pointers as in
c93552c443 ("bpf:
properly enforce index mask to prevent out-of-bounds speculation"), pruning
cannot make such short-cuts, neither if there are paths with scalar and non-scalar
types as r3. mark_chain_precision() is only needed after we know that
register_is_const(). If it was not the case, we already poison the key on first
path and non-const key in later paths are not matching the scalar range in regsafe()
either. Cilium NodePort testing passes fine as well now. Note, released kernels
not affected.

Fixes: d2e4c1e6c2 ("bpf: Constant map key tracking for prog array pokes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/ac43ffdeb7386c5bd688761ed266f3722bb39823.1576789878.git.daniel@iogearbox.net
2019-12-19 13:39:22 -08:00
Linus Torvalds
5f096c0ecd Power management fix for 5.5-rc3
Fix a problem related to CPU offline/online and cpufreq governors
 that in some system configurations may lead to a system-wide
 deadlock during CPU online.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl37lO4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxrUoP+wfiXQ8k3GncyD8NXY1/GhEmqB95v/f4
 clbn0xNu2WaQB3UdO/LkouL0+IaVw/i8PAt0cdeuEjKSgbPT8HHCkN28J0oia02H
 HD7JzdiUZh7ONG1eq9Z/7ckSXBflZaUIjzTi6C1axX8reEzGVVuy5LNhc+0iWjsh
 +mr9hRymgsRcGHPTN+CKi8Qhb29PPvVRt4YbghL0moQUDYewYENb/JBYJIjhgChG
 vWpHX6Kra99uveTMkAN5GVcgZP5b/RiM5E+cCpLEZDTSUnCIuTPM38ATGDTpadpW
 DSDuu+vEEmFu7RHO/lheN92n2fnTgjGpl5d6L5qwGCSzm0GeYZNo84RDEFCWwXZh
 5sY8oz+1wA2MIXV3f1bXYTDMWWQSitSVQ3A9OeKLlprGcZhG/66T2QB7aTut/D/R
 devyNt+xjMoqKcA7AaeVZ6XqUSHMTSCak88okXbKapJq6qkA6QkVsga+LArlRa0c
 xdA6lma2ICPG7Q2ta2G4nHekHd9mDSaR7aFkcKoApOkIDKUY9j47pI3KWSgVFCu3
 D6by7F7CCWHfp0Vw22eGuCQokBsLvhMsa7qwFlxKoxC6iJADANzBVkRzaH70wu2w
 QP2Xu9+WndyRJrrmIQS5iTrClUfgverOgXTJ5OH2jFm+Oi4r6quTKF83rturnDBr
 J8OK4odeh6E9
 =+MQE
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a problem related to CPU offline/online and cpufreq governors that
  in some system configurations may lead to a system-wide deadlock
  during CPU online"

* tag 'pm-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Avoid leaving stale IRQ work items during CPU offline
2019-12-19 08:09:43 -08:00
Linus Torvalds
9e8a0d5ff8 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Tone down mutex debugging complaints, and annotate/fix spinlock
  debugging data accesses for KCSAN"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "locking/mutex: Complain upon mutex API misuse in IRQ contexts"
  locking/spinlock/debug: Fix various data races
2019-12-17 11:00:46 -08:00
Daniel Borkmann
e47304232b bpf: Fix cgroup local storage prog tracking
Recently noticed that we're tracking programs related to local storage maps
through their prog pointer. This is a wrong assumption since the prog pointer
can still change throughout the verification process, for example, whenever
bpf_patch_insn_single() is called.

Therefore, the prog pointer that was assigned via bpf_cgroup_storage_assign()
is not guaranteed to be the same as we pass in bpf_cgroup_storage_release()
and the map would therefore remain in busy state forever. Fix this by using
the prog's aux pointer which is stable throughout verification and beyond.

Fixes: de9cbbaadb ("bpf: introduce cgroup storage maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/1471c69eca3022218666f909bc927a92388fd09e.1576580332.git.daniel@iogearbox.net
2019-12-17 08:58:02 -08:00
Yangtao Li
a5e37de90e stop_machine: remove try_stop_cpus helper
try_stop_cpus() is not used after this commit:

commit c190c3b16c ("rcu: Switch synchronize_sched_expedited() to
stop_one_cpu()")

So remove it.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191214195107.26480-1-tiny.windzz@gmail.com
2019-12-17 13:32:51 +01:00
Peng Wang
d040e0734f sched/fair: Skip calculating @contrib without load
Because of the:

	if (!load)
		runnable = running = 0;

clause in ___update_load_sum(), all the actual users of @contrib in
accumulate_sum():

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_load_sum += runnable * contrib;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

don't happen, and therefore we don't care what @contrib actually is and
calculating it is pointless.

If we count the times when @load equals zero and when it does not, as below:

	if (load) {
		load_is_not_zero_count++;
		contrib = __accumulate_pelt_segments(periods,
				1024 - sa->period_contrib,delta);
	} else
		load_is_zero_count++;

As we can see, load_is_zero_count is much bigger than
load_is_not_zero_count, and the gap is gradually widening:

	load_is_zero_count:            6016044 times
	load_is_not_zero_count:         244316 times
	19:50:43 up 1 min,  1 user,  load average: 0.09, 0.06, 0.02

	load_is_zero_count:            7956168 times
	load_is_not_zero_count:         261472 times
	19:51:42 up 2 min,  1 user,  load average: 0.03, 0.05, 0.01

	load_is_zero_count:           10199896 times
	load_is_not_zero_count:         278364 times
	19:52:51 up 3 min,  1 user,  load average: 0.06, 0.05, 0.01

	load_is_zero_count:           14333700 times
	load_is_not_zero_count:         318424 times
	19:54:53 up 5 min,  1 user,  load average: 0.01, 0.03, 0.00

Perhaps we can gain some performance advantage by saving these
unnecessary calculations.
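
A minimal sketch of the change being described in accumulate_sum() (a sketch
only; the surrounding decay handling is omitted):

	/*
	 * @contrib is only consumed when @load is non-zero (runnable and
	 * running were zeroed in ___update_load_sum() otherwise), so skip
	 * the calculation entirely in that case.
	 */
	if (load)
		contrib = __accumulate_pelt_segments(periods,
				1024 - sa->period_contrib, delta);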

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1576208740-35609-1-git-send-email-rocking@linux.alibaba.com
2019-12-17 13:32:51 +01:00
Cheng Jian
60588bfa22 sched/fair: Optimize select_idle_cpu
select_idle_cpu() will scan the LLC domain for idle CPUs, which is always
expensive, so the following commit:

	1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")

introduced a way to limit how many CPUs we scan.

But it may consume some of the 'nr' attempts on CPUs that are not allowed
for the task and thus waste them. The function can then always return
nr_cpumask_bits without finding a CPU on which our task is allowed to run.

The cpumask may be too big to put on the stack, so, similarly to
select_idle_core(), use the per-CPU 'select_idle_mask' to prevent stack
overflow.
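
A minimal sketch of the scan after the change (a sketch; the return-value
handling in the real function may differ):

	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

	/* Only ever scan CPUs the task is allowed to run on */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;		/* attempts exhausted */
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}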

Fixes: 1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
2019-12-17 13:32:51 +01:00
Peter Zijlstra
45178ac0ce cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order
Paul reported a very sporadic, rcutorture induced, workqueue failure.
When the planets align, the workqueue rescuer's self-migrate fails and
then triggers a WARN for running a work on the wrong CPU.

Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
could be ignored! When stopper->enabled is false, stop_machine will
instantly complete the work, without actually doing the work. Worse, it
will not WARN about this (we really should fix this).

It turns out there is a small window where a freshly online'ed CPU is
marked 'online' but doesn't yet have the stopper task running:

	BP				AP

	bringup_cpu()
	  __cpu_up(cpu, idle)	 -->	start_secondary()
					...
					cpu_startup_entry()
	  bringup_wait_for_ap()
	    wait_for_ap_thread() <--	  cpuhp_online_idle()
					  while (1)
					    do_idle()

					... available to run kthreads ...

	    stop_machine_unpark()
	      stopper->enabled = true;

Close this by moving the stop_machine_unpark() into
cpuhp_online_idle(), such that the stopper thread is ready before we
start the idle loop and schedule.
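
A minimal sketch of the resulting cpuhp_online_idle() (struct and helper
names assumed):

	void cpuhp_online_idle(enum cpuhp_state state)
	{
		struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);

		/* Happens for the boot CPU */
		if (state != CPUHP_AP_ONLINE_IDLE)
			return;

		/* Unpark the stopper thread before the idle loop can schedule */
		stop_machine_unpark(smp_processor_id());

		st->state = CPUHP_AP_ONLINE_IDLE;
		complete_ap_thread(st, true);
	}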

Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
2019-12-17 13:32:50 +01:00
Oleg Nesterov
cde6519450 sched/wait: fix ___wait_var_event(exclusive)
init_wait_var_entry() forgets to initialize wq_entry->flags.

Currently not a problem, we don't have wait_var_event_exclusive().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20191210191902.GB14449@redhat.com
2019-12-17 13:32:50 +01:00
Frederic Weisbecker
5443a0be61 sched: Use fair:prio_changed() instead of ad-hoc implementation
set_user_nice() implements its own version of fair::prio_changed() and
therefore misses a specific optimization towards nohz_full CPUs that
avoids sending a resched IPI to a reniced task running alone. Use the
proper callback instead.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-3-frederic@kernel.org
2019-12-17 13:32:50 +01:00
Frederic Weisbecker
7c2e8bbd87 sched: Spare resched IPI when prio changes on a single fair task
The runqueue of a fair task being remotely reniced is going to get a
resched IPI in order to reassess which task should be the current
running on the CPU. However that evaluation is useless if the fair task
is running alone, in which case we can spare that IPI, preventing
nohz_full CPUs from being disturbed.
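
A minimal sketch of the early return being described in fair's prio_changed()
callback:

	static void prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
	{
		if (!task_on_rq_queued(p))
			return;

		/*
		 * Reniced alone on its runqueue: nothing else to reassess,
		 * so spare the resched IPI (and the nohz_full disturbance).
		 */
		if (rq->cfs.nr_running == 1)
			return;

		/* ... otherwise fall through to the usual resched checks ... */
	}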

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-2-frederic@kernel.org
2019-12-17 13:32:50 +01:00
Vincent Guittot
6cf82d559e sched/cfs: fix spurious active migration
The load balance can fail to find a suitable task during the periodic check
because the imbalance is smaller than half of the load of the waiting
tasks. This results in an increase in the number of failed load balance
attempts, which can end up starting an active migration. This active
migration is useless because the currently running task is not a better
choice than the waiting ones. In fact, the current task was probably not
running but waiting for the CPU during one of the previous attempts and it
had already not been selected.

When load balance fails too many times to migrate a task, we should relax
the constraint on the maximum load of the tasks that can be migrated,
similarly to what is done with cache hotness.

Before the rework, load balance used to set the imbalance to the average
load_per_task in order to mitigate such situation. This increased the
likelihood of migrating a task but also of selecting a larger task than
needed while more appropriate ones were in the list.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1575036287-6052-1-git-send-email-vincent.guittot@linaro.org
2019-12-17 13:32:48 +01:00
Vincent Guittot
7ed735c331 sched/fair: Fix find_idlest_group() to handle CPU affinity
Because of CPU affinity, the local group can be skipped, which breaks the
assumption that statistics are always collected for the local group. With
uninitialized local_sgs, the comparison is meaningless and the behavior
unpredictable. This can even end up using the local pointer, which is
NULL in this case.

If the local group has been skipped because of CPU affinity, we return
the idlest group.

Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: John Stultz <john.stultz@linaro.org>
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: mingo@redhat.com
Cc: mgorman@suse.de
Cc: juri.lelli@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: bsegall@google.com
Cc: qais.yousef@arm.com
Link: https://lkml.kernel.org/r/1575483700-22153-1-git-send-email-vincent.guittot@linaro.org
2019-12-17 13:32:48 +01:00
Johannes Weiner
c3466952ca psi: Fix a division error in psi poll()
The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.
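
A minimal sketch of the change (variable names are illustrative):

	/* div_u64() only takes a 32-bit divisor; the window is a full u64 */
	sample = div64_u64(stall_delta, window_size);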

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
2019-12-17 13:32:48 +01:00
Johannes Weiner
3dfbe25c27 sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.
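
A minimal sketch of the initialization fix (field names assumed):

	static void group_init(struct psi_group *group)
	{
		/*
		 * Anchor the first sampling period at the cgroup's creation
		 * time, so it is ~2s long instead of spanning all the way
		 * back to sched_clock() == 0.
		 */
		group->avg_last_update = sched_clock();
		group->avg_next_update = group->avg_last_update + psi_period;
		/* ... remaining initialization unchanged ... */
	}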

Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
2019-12-17 13:32:47 +01:00
Sebastian Andrzej Siewior
9f0bff1180 perf/core: Add SRCU annotation for pmus list walk
Since commit
   28875945ba ("rcu: Add support for consolidated-RCU reader checking")

there is an additional check to ensure that a RCU related lock is held
while the RCU list is iterated.
This section holds the SRCU reader lock instead.

Add annotation to list_for_each_entry_rcu() that pmus_srcu must be
acquired during the list traversal.
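
A minimal sketch of the annotated list walk (the fourth argument is the
lockdep condition accepted by list_for_each_entry_rcu()):

	list_for_each_entry_rcu(pmu, &pmus, entry,
				lockdep_is_held(&pmus_srcu)) {
		/* ... walk protected by srcu_read_lock(&pmus_srcu) ... */
	}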

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lkml.kernel.org/r/20191119121429.zhcubzdhm672zasg@linutronix.de
2019-12-17 13:32:46 +01:00
Daniel Borkmann
a2ea07465c bpf: Fix missing prog untrack in release_maps
Commit da765a2f59 ("bpf: Add poke dependency tracking for prog array
maps") wrongly assumed that in case of prog load errors, we're cleaning
up all program tracking via bpf_free_used_maps().

However, it can happen that we're still at the point where we didn't copy
map pointers into the prog's aux section such that env->prog->aux->used_maps
is still zero, running into a UAF. In such case, the verifier has similar
release_maps() helper that drops references to used maps from its env.

Consolidate the release code into __bpf_free_used_maps() and call it from
all sides to fix it.
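
A minimal sketch of the consolidated helper (a sketch based on the
description above; the exact signature is assumed):

	static void __bpf_free_used_maps(struct bpf_prog_aux *aux,
					 struct bpf_map **used_maps, u32 len)
	{
		struct bpf_map *map;
		u32 i;

		for (i = 0; i < len; i++) {
			map = used_maps[i];
			/* Drop poke tracking before dropping the reference */
			if (map->ops->map_poke_untrack)
				map->ops->map_poke_untrack(map, aux);
			bpf_map_put(map);
		}
	}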

Fixes: da765a2f59 ("bpf: Add poke dependency tracking for prog array maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/1c2909484ca524ae9f55109b06f22b6213e76376.1576514756.git.daniel@iogearbox.net
2019-12-16 10:59:29 -08:00
Linus Torvalds
22ff311af9 treewide conversion from FIELD_SIZEOF() to sizeof_field()
-----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j
 j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O
 X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz
 urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw
 RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP
 8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat
 1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN
 UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz
 ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd
 5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH
 aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt
 y6FM3lApOjk3OyaTJQ==
 =YU4A
 -----END PGP SIGNATURE-----

Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull FIELD_SIZEOF conversion from Kees Cook:
 "A mostly mechanical treewide conversion from FIELD_SIZEOF() to
  sizeof_field(). This avoids the redundancy of having 2 macros
  (actually 3) doing the same thing, and consolidates on sizeof_field().
  While "field" is not an accurate name, it is the common name used in
  the kernel, and doesn't result in any unintended innuendo.

  As there are still users of FIELD_SIZEOF() in -next, I will clean up
  those during this coming development cycle and send the final old
  macro removal patch at that time"

* tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use sizeof_field() macro
  MIPS: OCTEON: Replace SIZEOF_FIELD() macro
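
For reference, the surviving macro is a one-liner; a sketch of its
definition (the authoritative copy lives in include/linux/stddef.h) and
of a typical mechanical conversion:

  #define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))

  /* before */  len = FIELD_SIZEOF(struct task_struct, comm);
  /* after  */  len = sizeof_field(struct task_struct, comm);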
2019-12-13 14:02:12 -08:00
Rafael J. Wysocki
85572c2c4a cpufreq: Avoid leaving stale IRQ work items during CPU offline
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.

Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.

If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point.  Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online, which may never happen.  In particular, a system-wide
deadlock may occur during CPU online as a result.

The failing scenario is as follows.  CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu).  It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline.  The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above.  Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3.  When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.

To address this problem, note that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to) should also
be prevented from running that code on behalf of other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set.  In
that case, the cpufreq_update_util_data pointer must be non-NULL both
for the CPU running the code and for the CPU which is the target of
the cpufreq utilization update in progress.

Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.
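
A sketch of the resulting check (an approximation of the helper in
kernel/sched/cpufreq.c): the local CPU may run the update either because
it belongs to the policy, or because remote updates are allowed and the
local CPU still has its own update hook installed:

  bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy)
  {
          return cpumask_test_cpu(smp_processor_id(), policy->cpus) ||
                 (policy->dvfs_possible_from_any_cpu &&
                  rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)));
  }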

Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch case as
well, in order to avoid the stale IRQ work issues.

Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-12 17:59:43 +01:00
Alexei Starovoitov
b91e014f07 bpf: Make BPF trampoline use register_ftrace_direct() API
Make the BPF trampoline attach its generated assembly code to kernel
functions via the register_ftrace_direct() API. This helps ftrace-based
tracers co-exist with the BPF trampoline on the same kernel function. It
also switches the attaching logic from the arch-specific text_poke to
generic ftrace, which is available on many architectures. text_poke is
still necessary for bpf-to-bpf attach and for the bpf_tail_call
optimization.
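
In rough terms (5.5-era ftrace API; error handling and the trampoline
generation itself are omitted, and the wrapper names below are made up
for illustration), attaching and detaching looks like this:

  /* ip: ftrace call site of the target function; addr: generated trampoline. */
  static int bpf_tramp_attach(unsigned long ip, unsigned long addr)
  {
          return register_ftrace_direct(ip, addr);
  }

  static int bpf_tramp_detach(unsigned long ip, unsigned long addr)
  {
          return unregister_ftrace_direct(ip, addr);
  }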

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191209000114.1876138-3-ast@kernel.org
2019-12-11 15:18:08 -08:00
Linus Torvalds
6674fdb25a This contains 3 changes:
  - Removal of code I accidentally applied when doing a minor fix up
    to a patch, and then using "git commit -a --amend", which pulled
    in some other changes I was playing with.
 
  - Remove an unused variable in the trace_events_inject code.
 
  - Fix to the function graph tracer when it traces an ftrace direct
    function. It will now skip tracing a function that has an ftrace
    direct trampoline attached. This is needed for eBPF to use the
    ftrace direct code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXfD/thQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoo2AP4j7ONw7BTmMyo+GdYqPPntBeDnClHK
 vfMKrgK1j5BxYgEA7LgkwuUT9bcyLjfJVcyfeW67rB2PtmovKTWnKihFOwI=
 =DZ6N
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Remove code I accidentally applied when doing a minor fix up to a
   patch, and then using "git commit -a --amend", which pulled in some
   other changes I was playing with.

 - Remove an unused variable in the trace_events_inject code.

 - Fix the function graph tracer when it traces an ftrace direct
   function. It will now skip tracing a function that has an ftrace
   direct trampoline attached. This is needed for eBPF to use the ftrace
   direct code.

* tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix function_graph tracer interaction with BPF trampoline
  tracing: remove set but not used variable 'buffer'
  module: Remove accidental change of module_enable_x()
2019-12-11 12:22:38 -08:00
Arnd Bergmann
4c80c7bc58 bpf: Fix build in minimal configurations, again
Building with -Werror showed another failure:

kernel/bpf/btf.c: In function 'btf_get_prog_ctx_type.isra.31':
kernel/bpf/btf.c:3508:63: error: array subscript 0 is above array bounds of 'u8[0]' {aka 'unsigned char[0]'} [-Werror=array-bounds]
  ctx_type = btf_type_member(conv_struct) + bpf_ctx_convert_map[prog_type] * 2;

I don't actually understand why the array is empty, but a similar
fix has addressed a related problem, so I suppose we can do the
same thing here.

Fixes: ce27709b81 ("bpf: Fix build in minimal configurations")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191210203553.2941035-1-arnd@arndb.de
2019-12-11 13:57:26 +01:00
Davidlohr Bueso
c571b72e2b Revert "locking/mutex: Complain upon mutex API misuse in IRQ contexts"
This ended up causing some noise in places such as rxrpc running in softirq.

The warning is misleading in this case, as the mutex trylock and unlock
operations are done within the same context, and therefore we need not
worry about the PI-boosting issues that come along with the lack of
single-owner lock guarantees.

While we don't want to support this in mutexes, there is no way out of
this yet; so let's get rid of the WARNs for now, as it is only fair to
code that has historically relied on non-preemptible softirq guarantees.
In addition, changing the lock type is also unviable: exclusive rwsems
have the same issue (just without the WARN_ON) and counting semaphores
would introduce a performance hit, as mutexes are a lot more optimized.

This reverts:

    a0855d24fc: ("locking/mutex: Complain upon mutex API misuse in IRQ contexts")

Fixes: a0855d24fc: ("locking/mutex: Complain upon mutex API misuse in IRQ contexts")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Tested-by: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-afs@lists.infradead.org
Cc: linux-fsdevel@vger.kernel.org
Cc: will@kernel.org
Link: https://lkml.kernel.org/r/20191210220523.28540-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-11 00:27:43 +01:00
Alexei Starovoitov
ff205766db ftrace: Fix function_graph tracer interaction with BPF trampoline
Depending on the type of BPF programs served by the BPF trampoline, it can
call the original function. In such a case the trampoline will skip one
stack frame while returning. That confuses the function_graph tracer and
causes crashes with a bad RIP. Teach the graph tracer to skip functions
that have a BPF trampoline attached.
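
The gist, as a hedged sketch (ftrace_find_rec_direct() and
ftrace_direct_func_count are the direct-call helpers from that
development cycle; the exact hunk on the graph-entry path may differ):

  /* Do not graph-trace call sites that are routed to a direct trampoline. */
  static bool has_direct_trampoline(unsigned long ip)
  {
          return ftrace_direct_func_count && ftrace_find_rec_direct(ip);
  }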

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-10 13:53:59 -05:00
YueHaibing
a61f810567 tracing: remove set but not used variable 'buffer'
kernel/trace/trace_events_inject.c: In function trace_inject_entry:
kernel/trace/trace_events_inject.c:20:22: warning: variable buffer set but not used [-Wunused-but-set-variable]

It is never used, so remove it.

Link: http://lkml.kernel.org/r/20191207034409.25668-1-yuehaibing@huawei.com

Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-10 13:53:51 -05:00
Steven Rostedt (VMware)
af74262337 module: Remove accidental change of module_enable_x()
When pulling in Divya Indi's patch, I made a minor fix to remove unneeded
braces. I committed my fixup via "git commit -a --amend". Unfortunately, I
didn't realize I had some changes I was testing in the module code, and
those changes were applied to Divya's patch as well.

This reverts the accidental updates to the module code.

Cc: Jessica Yu <jeyu@kernel.org>
Cc: Divya Indi <divya.indi@oracle.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Fixes: e585e6469d ("tracing: Verify if trace array exists before destroying it.")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-10 13:53:43 -05:00