Some console drivers code calls console_conditional_schedule()
that looks at @console_may_schedule. The value must be cleared
when the drivers are called from console_unlock() with
interrupts disabled. But rescheduling is fine when the same
code is called, for example, from tty operations where the
console semaphore is taken via console_lock().
This is why @console_may_schedule is cleared before calling console
drivers. The original value is stored to decide if we could sleep
between lines.
Now, @console_may_schedule is not cleared when we call
console_trylock() and jump back to the "again" goto label.
This has become a problem, since the commit 6b97a20d3a
("printk: set may_schedule for some of console_trylock() callers").
@console_may_schedule might get enabled now.
There is also the opposite problem. console_lock() can be called
only from preemptive context. It can always enable scheduling in
the console code. But console_trylock() is not able to detect it
when CONFIG_PREEMPT_COUNT is disabled. Therefore we should use the
original @console_may_schedule value after re-acquiring
the console semaphore in console_unlock().
This patch solves both problems by moving the "again" goto label.
Alternative solution was to clear and restore the value around
call_console_drivers(). Then console_conditional_schedule() could
be used also inside console_unlock(). But there was a potential race
with console_flush_on_panic() as reported by Sergey Senozhatsky.
That function should be called only where there is only one CPU
and with interrupts disabled. But better be on the safe side
because stopping CPUs might fail.
Fixes: 6b97a20d3a ("printk: set may_schedule for some of console_trylock() callers")
Link: http://lkml.kernel.org/r/1490372045-22288-1-git-send-email-pmladek@suse.com
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
The irq_create_affinity_masks routine is responsible for assigning a
number of interrupt vectors to CPUs. The optimal assignemnet will spread
requested vectors to all CPUs, with the fewest CPUs sharing a vector.
The algorithm may fail to assign some vectors to any CPUs if a node's
CPU count is lower than the average number of vectors per node. These
vectors are unusable and create an un-optimal spread.
Recalculate the number of vectors to assign at each node iteration by using
the remaining number of vectors and nodes to be assigned, not exceeding the
number of CPUs in that node. This will guarantee that every CPU is assigned
at least one vector.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/1491247553-7603-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There was a pure ->prio comparison left in try_to_wake_rt_mutex(),
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in kernel priority view).
This necessitated the introduction of cmp_task() which creates a
pointer to an unnamed stack variable of struct rt_mutex_waiter type to
compare against tasks.
With this, we can now also create and employ rt_mutex_waiter_equal().
Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.
Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.
Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.
Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameter, but for first approximation this is less
broken than it was.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.
So rework the code to pass a pi_task instead.
Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, this could lead to a stale pointer.
As for the XXX, I think its fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is immenent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:
if (waiter->prio == task->prio) {
if (!detect_deadlock)
goto out_unlock_pi; // out here
else
requeue = false;
}
As the deadline value of blocked deadline tasks(waiters) without
changing their sched_class(thus prio doesn't change) never changes,
this seems reasonable, but it actually misses the chance of updating
rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
deadline parameters(dl_runtime, dl_period) or boosted waiter changes
to !deadline class.
Thus, force deadline task not out by adding the !dl_prio() condition.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A crash happened while I was playing with deadline PI rtmutex.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
PGD 232a75067 PUD 230947067 PMD 0
Oops: 0000 [#1] SMP
CPU: 1 PID: 10994 Comm: a.out Not tainted
Call Trace:
[<ffffffff810b658c>] enqueue_task+0x2c/0x80
[<ffffffff810ba763>] activate_task+0x23/0x30
[<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
[<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
[<ffffffff8164e783>] __schedule+0xd3/0x900
[<ffffffff8164efd9>] schedule+0x29/0x70
[<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
[<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
[<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
[<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
[<ffffffff810ed8b0>] do_futex+0x190/0x5b0
[<ffffffff810edd50>] SyS_futex+0x80/0x180
This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.
In order to tackle it, we introduce new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.
Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).
In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock, that way its serialized against the wait loop in
__rt_mutex_slowlock().
Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enabling it after the wake up is over.
This gets us the right semantic order, but most importantly however;
this change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.
[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Relying on free_reserved_area() to call ftrace to free init memory proved to
not be sufficient. The issue is that on x86, when debug_pagealloc is
enabled, the init memory is not freed, but simply set as not present. Since
ftrace was uninformed of this, starting function tracing still tries to
update pages that are not present according to the page tables, causing
ftrace to bug, as well as killing the kernel itself.
Instead of relying on free_reserved_area(), have init/main.c call ftrace
directly just before it frees the init memory. Then it needs to use
__init_begin and __init_end to know where the init memory location is.
Looking at all archs (and testing what I can), it appears that this should
work for each of them.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull scheduler fixes from Thomas Gleixner:
"This update provides:
- make the scheduler clock switch to unstable mode smooth so the
timestamps stay at microseconds granularity instead of switching to
tick granularity.
- unbreak perf test tsc by taking the new offset into account which
was added in order to proveide better sched clock continuity
- switching sched clock to unstable mode runs all clock related
computations which affect the sched clock output itself from a work
queue. In case of preemption sched clock uses half updated data and
provides wrong timestamps. Keep the math in the protected context
and delegate only the static key switch to workqueue context.
- remove a duplicate header include"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/headers: Remove duplicate #include <linux/sched/debug.h> line
sched/clock: Fix broken stable to unstable transfer
sched/clock, x86/perf: Fix "perf test tsc"
sched/clock: Fix clear_sched_clock_stable() preempt wobbly
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.
Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.
Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.
Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.
To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the verifier doesn't reject unaligned access for map_value_adj
register types. Commit 484611357c ("bpf: allow access into map value
arrays") added logic to check_ptr_alignment() extending it from PTR_TO_PACKET
to also PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
non-zero, meaning, we can cause BPF_H/_W/_DW-based unaligned access for
architectures not supporting efficient unaligned access, and thus worst
case could raise exceptions on some archs that are unable to correct the
unaligned access or perform a different memory access to the actual
requested one and such.
i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0x42533a00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+11
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (61) r1 = *(u32 *)(r0 +0)
8: (35) if r1 >= 0xb goto pc+9
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
9: (07) r0 += 3
10: (79) r7 = *(u64 *)(r0 +0)
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
11: (79) r7 = *(u64 *)(r0 +2)
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
[...]
ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0x4df16a00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+19
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (07) r0 += 3
8: (7a) *(u64 *)(r0 +0) = 42
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
9: (7a) *(u64 *)(r0 +2) = 43
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
10: (7a) *(u64 *)(r0 -2) = 44
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
[...]
For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
was fetched, it later receives a reg->id from env->id_gen generator
once another register with UNKNOWN_VALUE type was added to it via
check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
is used in find_good_pkt_pointers() for setting the allowed access
range for regs with PTR_TO_PACKET of same id once verifier matched
on data/data_end tests, and ii) for check_ptr_alignment() to determine
that when not having efficient unaligned access and register with
UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
to access the content bytewise due to unknown unalignment. reg->id
was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
always zero, the only marking is in PTR_TO_MAP_VALUE_OR_NULL that
was added after 484611357c via 57a09bf0a4 ("bpf: Detect identical
PTR_TO_MAP_VALUE_OR_NULL registers"). Above tests will fail for
non-root environment due to prohibited pointer arithmetic.
The fix splits register-type specific checks into their own helper
instead of keeping them combined, so we don't run into a similar
issue in future once we extend check_ptr_alignment() further and
forget to add reg->type checks for some of the checks.
Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking into map_value_adj, I noticed that alu operations
directly on the map_value() resp. map_value_adj() register (any
alu operation on a map_value() register will turn it into a
map_value_adj() typed register) are not sufficiently protected
against some of the operations. Two non-exhaustive examples are
provided that the verifier needs to reject:
i) BPF_AND on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0xbf842a00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+2
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (57) r0 &= 8
8: (7a) *(u64 *)(r0 +0) = 22
R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
9: (95) exit
from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
9: (95) exit
processed 10 insns
ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0xc24eee00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+2
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (04) (u32) r0 += (u32) 0
8: (7a) *(u64 *)(r0 +0) = 22
R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
9: (95) exit
from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
9: (95) exit
processed 10 insns
Issue is, while min_value / max_value boundaries for the access
are adjusted appropriately, we change the pointer value in a way
that cannot be sufficiently tracked anymore from its origin.
Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
that is PTR_TO_MAP_VALUE{,_ADJ} was probably unintended, in fact,
all the test cases coming with 484611357c ("bpf: allow access
into map value arrays") perform BPF_ADD only on the destination
register that is PTR_TO_MAP_VALUE_ADJ.
Only for UNKNOWN_VALUE register types such operations make sense,
f.e. with unknown memory content fetched initially from a constant
offset from the map value memory into a register. That register is
then later tested against lower / upper bounds, so that the verifier
can then do the tracking of min_value / max_value, and properly
check once that UNKNOWN_VALUE register is added to the destination
register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
original use-case is solving. Note, tracking on what is being
added is done through adjust_reg_min_max_vals() and later access
to the map value enforced with these boundaries and the given offset
from the insn through check_map_access_adj().
Tests will fail for non-root environment due to prohibited pointer
arithmetic, in particular in check_alu_op(), we bail out on the
is_pointer_value() check on the dst_reg (which is false in root
case as we allow for pointer arithmetic via env->allow_ptr_leaks).
Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
mode BPF_ADD. The test_verifier suite runs fine after the patch
and it also rejects mentioned test cases.
Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that if I use dd to read the set_ftrace_filter file that the first
hash command is repeated.
# cd /sys/kernel/debug/tracing
# echo schedule > set_ftrace_filter
# echo do_IRQ >> set_ftrace_filter
# echo schedule:traceoff >> set_ftrace_filter
# echo do_IRQ:traceoff >> set_ftrace_filter
# cat set_ftrace_filter
schedule
do_IRQ
schedule:traceoff:unlimited
do_IRQ:traceoff:unlimited
# dd if=set_ftrace_filter bs=1
schedule
do_IRQ
schedule:traceoff:unlimited
schedule:traceoff:unlimited
do_IRQ:traceoff:unlimited
98+0 records in
98+0 records out
98 bytes copied, 0.00265011 s, 37.0 kB/s
This is due to the way t_start() calls t_next() as well as the seq_file
calls t_next() and the state is slightly different between the two. Namely,
t_start() will call t_next() with a local "pos" variable.
By separating out the function listing from t_next() into its own function,
we can have better control of outputting the functions and the hash of
triggers. This simplifies the code.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If all functions are enabled, there's a comment displayed in the file to
denote that:
# cd /sys/kernel/debug/tracing
# cat set_ftrace_filter
#### all functions enabled ####
If a function trigger is set, those are displayed as well:
# echo schedule:traceoff >> /debug/tracing/set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####
schedule:traceoff:unlimited
But if you read that file with dd, the output can change:
# dd if=/debug/tracing/set_ftrace_filter bs=1
#### all functions enabled ####
32+0 records in
32+0 records out
32 bytes copied, 7.0237e-05 s, 456 kB/s
This is because the "pos" variable is updated for the comment, but func_pos
is not. "func_pos" is used by the triggers (or hashes) to know how many
functions were printed and it bases its index from the pos - func_pos.
func_pos should be 1 to count for the comment printed. But since it is not,
t_hash_start() thinks that one trigger was already printed.
The cat gets to t_hash_start() via t_next() and not t_start() which updates
both pos and func_pos.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The loop in t_start() of calling t_next() will call t_hash_start() if the
pos is beyond the functions and enters the hash items. There's no reason to
check if p is NULL and call t_hash_start(), as that would be redundant.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of testing if the hash to use is the filter_hash or the notrace_hash
at each iteration, do the test at open, and set the iter->hash to point to
the corresponding filter or notrace hash. Then use that directly instead of
testing which hash needs to be used each iteration.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The return status check of __seq_open_private() is rather strange:
iter = __seq_open_private();
if (iter) {
/* do stuff */
}
return iter ? 0 : -ENOMEM;
It makes much more sense to do the return of failure right away:
iter = __seq_open_private();
if (!iter)
return -ENOMEM;
/* do stuff */
return 0;
This clean up will make updates to this code a bit nicer.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
- memory corruption when kmalloc fails in xts/lrw
- mark some CCP DMA channels as private
- fix reordering race in padata
- regression in omap-rng DT description"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
crypto: ccp - Make some CCP DMA channels private
padata: avoid race in reordering
dt-bindings: rng: clocks property on omap_rng not always mandatory
interp_forward is type bool so assignment from a logical operation directly
is sufficient.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1490382215-30505-1-git-send-email-der.herr@hofr.at
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timekeeping changes from John Stultz:
Main changes are the initial steps of Nicoli's work to make the clockevent
timers be corrected for NTP adjustments. Then a few smaller fixes that
I've queued, and adding Stephen Boyd to the maintainers list for
timekeeping.
It's reported that the time of insmoding a klp.ko for one of our
out-tree modules is too long.
~ time sudo insmod klp.ko
real 0m23.799s
user 0m0.036s
sys 0m21.256s
Then we found the reason: our out-tree module used a lot of static local
variables, so klp.ko has a lot of relocation records which reference the
module. Then for each such entry klp_find_object_symbol() is called to
resolve it, but this function uses the interface kallsyms_on_each_symbol()
even for finding module symbols, so will waste a lot of time on walking
through vmlinux kallsyms table many times.
This patch changes it to use module_kallsyms_on_each_symbol() for modules
symbols. After we apply this patch, the sys time reduced dramatically.
~ time sudo insmod klp.ko
real 0m1.007s
user 0m0.032s
sys 0m0.924s
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Use a timeout rather than a fixed number of loops to avoid running for
very long periods, such as under the kbuilder VMs.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170310105733.6444-1-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The main PELT function ___update_load_avg(), which implements the
accumulation and progression of the geometric average series, is
implemented along the following lines for the scenario where the time
delta spans all 3 possible sections (see figure below):
1. add the remainder of the last incomplete period
2. decay old sum
3. accumulate new sum in full periods since last_update_time
4. accumulate the current incomplete period
5. update averages
Or:
d1 d2 d3
^ ^ ^
| | |
|<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)
load_sum' = (load_sum + weight * scale * d1) * y^(p+1) + (1,2)
p
weight * scale * 1024 * \Sum y^n + (3)
n=1
weight * scale * d3 * y^0 (4)
load_avg' = load_sum' / LOAD_AVG_MAX (5)
Where:
d1 - is the delta part completing the remainder of the last
incomplete period,
d2 - is the delta part spannind complete periods, and
d3 - is the delta part starting the current incomplete period.
We can simplify the code in two steps; the first step is to separate
the first term into new and old parts like:
(load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
weight * scale * d1 * y^(p+1)
Once we've done that, its easy to see that all new terms carry the
common factors:
weight * scale
If we factor those out, we arrive at the form:
load_sum' = load_sum * y^(p+1) +
weight * scale * (d1 * y^(p+1) +
p
1024 * \Sum y^n +
n=1
d3 * y^0)
Which results in a simpler, smaller and faster implementation.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: matt@codeblueprint.co.uk
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The __update_load_avg() function is an __always_inline because its
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).
Explicitly instantiate the 3 variants.
Note that most of this is called from rather hot paths, so reducing
branches is good.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJY2C9qAAoJEHm+PkMAQRiGaBQIAIGzdlZ6ImiP6zoukrRv7qUr
44ITm0lsBiL85QGedhQQL+Y9UqwUmlqgFqnH0Gr8YHNbLJWXzdjGbl5aVo4KjASq
104NLUDXtPww/xZdH4wJMzhuwucYwZOUyDOjOr0ak3cGxOE2xjNjHMZXxWUf20GO
EpRr6WhV1DUAvAdjdNa9KlcOjMluNpMLLyL1CFLjrkkArrWAyqOURKHAb6ZLghfv
iZV1qJTVPyYGpnlI3kuEgu2GuDjxqpoNLSr3wHyEHm/pBPEl7MX6zPbzcegBV8TY
cRRlXo4notdsuknmSNcj0hHuTQvw1kl7BhieLKVsnCyCIM6jjX4TSQZFutmbzwM=
=5iRl
-----END PGP SIGNATURE-----
Backmerge tag 'v4.11-rc4' into drm-next
Linux 4.11-rc4
The i915 GVT team need the rc4 code to base some more code on.
We switched from "struct task_struct"->security to "struct cred"->security
in Linux 2.6.29. But not all LSM modules were happy with that change.
TOMOYO LSM module is an example which want to use per "struct task_struct"
security blob, for TOMOYO's security context is defined based on "struct
task_struct" rather than "struct cred". AppArmor LSM module is another
example which want to use it, for AppArmor is currently abusing the cred
a little bit to store the change_hat and setexeccon info. Although
security_task_free() hook was revived in Linux 3.4 because Yama LSM module
wanted to release per "struct task_struct" security blob,
security_task_alloc() hook and "struct task_struct"->security field were
not revived. Nowadays, we are getting proposals of lightweight LSM modules
which want to use per "struct task_struct" security blob.
We are already allowing multiple concurrent LSM modules (up to one fully
armored module which uses "struct cred"->security field or exclusive hooks
like security_xfrm_state_pol_flow_match(), plus unlimited number of
lightweight modules which do not use "struct cred"->security nor exclusive
hooks) as long as they are built into the kernel. But this patch does not
implement variable length "struct task_struct"->security field which will
become needed when multiple LSM modules want to use "struct task_struct"->
security field. Although it won't be difficult to implement variable length
"struct task_struct"->security field, let's think about it after we merged
this patch.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Tested-by: Djalal Harouni <tixxdz@gmail.com>
Acked-by: José Bollo <jobol@nonadev.net>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morris <james.l.morris@oracle.com>
Cc: José Bollo <jobol@nonadev.net>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Commit 5b52330bbf ("audit: fix auditd/kernel connection state
tracking") made inlining audit_signal_info() a bit pointless as
it was always calling into auditd_test_task() so let's remove the
inline function in kernel/audit.h and convert __audit_signal_info()
in kernel/auditsc.c into audit_signal_info().
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Use BUG_ON() rather than an explicit if followed by BUG() for
improved readability and also consistency.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Tejun Heo <tj@kernel.org>
When it is determined that the clock is actually unstable, and
we switch from stable to unstable, the __clear_sched_clock_stable()
function is eventually called.
In this function we set gtod_offset so the following holds true:
sched_clock() + raw_offset == ktime_get_ns() + gtod_offset
But instead of getting the latest timestamps, we use the last values
from scd, so instead of sched_clock() we use scd->tick_raw, and
instead of ktime_get_ns() we use scd->tick_gtod.
However, later, when we use gtod_offset sched_clock_local() we do not
add it to scd->tick_gtod to calculate the correct clock value when we
determine the boundaries for min/max clocks.
This can result in tick granularity sched_clock() values, so fix it.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the child domain prefers tasks to go siblings, the local group could
end up pulling tasks to itself even if the local group is almost equally
loaded as the source group.
Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
Everytime, local group has capacity and source group has atleast 2 threads,
local group tries to pull the task. This causes the threads to constantly
move between different cores. This is even more profound if the cores have
more threads, like in Power 8, smt 8 mode.
Fix this by only allowing local group to pull a task, if the source group
has more number of tasks than the local group.
Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.
Without patch:
Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
1,440 context-switches # 0.001 K/sec ( +- 1.26% )
366 cpu-migrations # 0.000 K/sec ( +- 5.58% )
3,933 page-faults # 0.002 K/sec ( +- 11.08% )
Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
6,287 context-switches # 0.001 K/sec ( +- 3.65% )
3,776 cpu-migrations # 0.001 K/sec ( +- 4.84% )
5,702 page-faults # 0.001 K/sec ( +- 9.36% )
Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
8,776 context-switches # 0.001 K/sec ( +- 0.73% )
2,790 cpu-migrations # 0.000 K/sec ( +- 0.98% )
10,540 page-faults # 0.001 K/sec ( +- 3.12% )
With patch:
Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
1,133 context-switches # 0.001 K/sec ( +- 4.72% )
123 cpu-migrations # 0.000 K/sec ( +- 3.42% )
3,858 page-faults # 0.002 K/sec ( +- 8.52% )
Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
2,169 context-switches # 0.000 K/sec ( +- 6.19% )
189 cpu-migrations # 0.000 K/sec ( +- 12.75% )
5,917 page-faults # 0.001 K/sec ( +- 8.09% )
Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
5,333 context-switches # 0.001 K/sec ( +- 5.91% )
506 cpu-migrations # 0.000 K/sec ( +- 3.35% )
10,792 page-faults # 0.001 K/sec ( +- 7.75% )
Which show that in these workloads CPU migrations get reduced significantly.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit 383776fa75 ("locking/lockdep: Handle statically initialized
PER_CPU locks properly") we try to collapse per-cpu locks into a single
class by giving them all the same key. For this key we choose the canonical
address of the per-cpu object, which would be the offset into the per-cpu
area.
This has two problems:
- there is a case where we run !0 lock->key through static_obj() and
expect this to pass; it doesn't for canonical pointers.
- 0 is a valid canonical address.
Cure both issues by redefining the canonical address as the address of the
per-cpu variable on the boot CPU.
Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at
all, track the boot CPU in a variable.
Fixes: 383776fa75 ("locking/lockdep: Handle statically initialized PER_CPU locks properly")
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-mm@kvack.org
Cc: wfg@linux.intel.com
Cc: kernel test robot <fengguang.wu@intel.com>
Cc: LKP <lkp@01.org>
Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j5cr@hirez.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull audit fix from Paul Moore:
"We've got an audit fix, and unfortunately it is big.
While I'm not excited that we need to be sending you something this
large during the -rcX phase, it does fix some very real, and very
tangled, problems relating to locking, backlog queues, and the audit
daemon connection.
This code has passed our testsuite without problem and it has held up
to my ad-hoc stress tests (arguably better than the existing code),
please consider pulling this as fix for the next v4.11-rcX tag"
* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
audit: fix auditd/kernel connection state tracking
llvm can optimize the 'if (ptr > data_end)' checks to be in the order
slightly different than the original C code which will confuse verifier.
Like:
if (ptr + 16 > data_end)
return TC_ACT_SHOT;
// may be followed by
if (ptr + 14 > data_end)
return TC_ACT_SHOT;
while llvm can see that 'ptr' is valid for all 16 bytes,
the verifier could not.
Fix verifier logic to account for such case and add a test.
Reported-by: Huapeng Zhou <hzhou@fb.com>
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently trace_handle_return() looks like this:
static inline enum print_line_t trace_handle_return(struct trace_seq *s)
{
return trace_seq_has_overflowed(s) ?
TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
}
Where trace_seq_overflowed(s) is:
static inline bool trace_seq_has_overflowed(struct trace_seq *s)
{
return s->full || seq_buf_has_overflowed(&s->seq);
}
And seq_buf_has_overflowed(&s->seq) is:
static inline bool
seq_buf_has_overflowed(struct seq_buf *s)
{
return s->len > s->size;
}
Making trace_handle_return() into:
return (s->full || (s->seq->len > s->seq->size)) ?
TRACE_TYPE_PARTIAL_LINE :
TRACE_TYPE_HANDLED;
One would think this is not an issue to keep as an inline. But because this
is used in the TRACE_EVENT() macro, it is extended for every tracepoint in
the system. Taking a look at a single tracepoint x86_irq_vector (was the
first one I randomly chosen). As trace_handle_return is used in the
TRACE_EVENT() macro of trace_raw_output_##call() we disassemble
trace_raw_output_x86_irq_vector and do a diff:
- is the original
+ is the out-of-line code
I removed identical lines that were different just due to different
addresses.
--- /tmp/irq-vec-orig 2017-03-16 09:12:48.569384851 -0400
+++ /tmp/irq-vec-ool 2017-03-16 09:13:39.378153385 -0400
@@ -6,27 +6,23 @@
53 push %rbx
48 89 fb mov %rdi,%rbx
4c 8b a7 c0 20 00 00 mov 0x20c0(%rdi),%r12
e8 f7 72 13 00 callq ffffffff81155c80 <trace_raw_output_prep>
83 f8 01 cmp $0x1,%eax
74 05 je ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23>
5b pop %rbx
41 5c pop %r12
5d pop %rbp
c3 retq
41 8b 54 24 08 mov 0x8(%r12),%edx
- 48 8d bb 98 10 00 00 lea 0x1098(%rbx),%rdi
+ 48 81 c3 98 10 00 00 add $0x1098,%rbx
- 48 c7 c6 7b 8a a0 81 mov $0xffffffff81a08a7b,%rsi
+ 48 c7 c6 ab 8a a0 81 mov $0xffffffff81a08aab,%rsi
- e8 c5 85 13 00 callq ffffffff81156f70 <trace_seq_printf>
=== here's the start of the main difference ===
+ 48 89 df mov %rbx,%rdi
+ e8 62 7e 13 00 callq ffffffff81156810 <trace_seq_printf>
- 8b 93 b8 20 00 00 mov 0x20b8(%rbx),%edx
- 31 c0 xor %eax,%eax
- 85 d2 test %edx,%edx
- 75 11 jne ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58>
- 48 8b 83 a8 20 00 00 mov 0x20a8(%rbx),%rax
- 48 39 83 a0 20 00 00 cmp %rax,0x20a0(%rbx)
- 0f 93 c0 setae %al
+ 48 89 df mov %rbx,%rdi
+ e8 4a c5 12 00 callq ffffffff8114af00 <trace_handle_return>
5b pop %rbx
- 0f b6 c0 movzbl %al,%eax
=== end ===
41 5c pop %r12
5d pop %rbp
c3 retq
If you notice, the original has 22 bytes of text more than the out of line
version. As this is for every TRACE_EVENT() defined in the system, this can
become quite large.
text data bss dec hex filename
8690305 5450490 1298432 15439227 eb957b vmlinux-orig
8681725 5450490 1298432 15430647 eb73f7 vmlinux-handle
This change has a total of 8580 bytes in savings.
$ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
324
That's 324 tracepoints. But this does not include modules (which contain
many more tracepoints). For an allyesconfig build:
$ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
1401
That's 1401 tracepoints giving us:
text data bss dec hex filename
137920629 140221067 53264384 331406080 13c0db00 vmlinux-allyes-orig
137827709 140221067 53264384 331313160 13bf7008 vmlinux-allyes-handle
92920 bytes in savings!!!
Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adding a hook into free_reserve_area() that informs ftrace that boot up init
text is being free, lets ftrace safely remove those init functions from its
records, which keeps ftrace from trying to modify text that no longer
exists.
Note, this still does not allow for tracing .init text of modules, as
modules require different work for freeing its init code.
Link: http://lkml.kernel.org/r/1488502497.7212.24.camel@linux.intel.com
Cc: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Requested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Register the function tracer right after the tracing buffers are initialized
in early boot up. This will allow function tracing to begin early if it is
enabled via the kernel command line.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As tracing can now be enabled very early in boot up, even before some
critical system services (like scheduling), do not run the tracer selftests
until after early_initcall() is performed. If a tracer is registered before
such time, it is saved off in a list and the test is run when the system is
able to handle more diverse functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Create an early_trace_init() function that will initialize the buffers and
allow for ealier use of trace_printk(). This will also allow for future work
to have function tracing start earlier at boot up.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is no need to always call blocking console_lock() in
console_cpu_notify(), it's quite possible that console_sem can
be locked by other CPU on the system, either already printing
or soon to begin printing the messages. console_lock() in this
case can simply block CPU hotplug for unknown period of time
(console_unlock() is time unbound). Not that hotplug is very
fast, but still, with other CPUs being online and doing
printk() console_cpu_notify() can stuck.
Use console_trylock() instead and opt-out if console_sem is
already acquired from another CPU, since that CPU will do
the printing for us.
Link: http://lkml.kernel.org/r/20170121104729.8585-1-sergey.senozhatsky@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Under extremely heavy uses of padata, crashes occur, and with list
debugging turned on, this happens instead:
[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
__list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next
(ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
[87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3
[87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
[87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140
[87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
[87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130
[87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
[87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
[87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
padata_reorder calls list_add_tail with the list to which its adding
locked, which seems correct:
spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);
This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:
next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
padata = list_entry(reorder->list.next,
struct padata_priv, list);
spin_lock(&reorder->lock);
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
spin_unlock(&reorder->lock);
pd->processed++;
goto out;
}
out:
return padata;
I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix is thus be hoist that lock outside of
that block.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- Make intel_pstate use one set of global P-state limits in the
active mode regardless of the scaling_governor settings for
individual CPUs instead of switching back and forth between two
of them in a way that is hard to control (Rafael Wysocki).
- Drop a useless function from intel_pstate to prevent it from
modifying the maximum supported frequency value unexpectedly
which may confuse the cpufreq core (Rafael Wysocki).
- Fix the cpufreq core to restore policy limits on CPU online so
that the limits are not reset over system suspend/resume, among
other things (Viresh Kumar).
- Fix the initialization of the schedutil cpufreq governor to
make the IO-wait boosting mechanism in it actually work on
systems with one CPU per cpufreq policy (Rafael Wysocki).
- Add a sanity check to the cpuidle core to prevent crashes from
happening if the architecture code initialization fails to set
up things as expected (Vaidyanathan Srinivasan).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJY1HLGAAoJEILEb/54YlRxjAsQAIFcYfJKosA8IlmKcricR/WH
CqBMCH9S6Y5YsggrFjcj3lVJbf5P43/h3U++O+/97lJsevfp4inCChpvVSQIWX4v
xzk2v+5Ms8ROqVSLy34yTaB0ysCC//J5FvdQLsj0Zw9W/8yvi0DosPfeiAD7sYeb
4qkJK3+yv5sLtZ41FmdVYabhC5KHQSAV6p6X+KOZnFV8cm+8TfOSERhStXASMTGc
tvDpjIjPA1GLpHYdOK4UQ+Er1Hgwk2fNX7eXrpHh7QCQx4eZEN+g7DAC95Ify9Am
gkTFc5eUfOFKU5KMshdQh6gnfoNaKi4d3E/ahmnU+KQuyKiy4KMyTNcuUBszSaDM
ZTm0GooseV0UajaLH08BNbfpqDsiKc2fm1qkCQkXxpGjs80/bYn5gK+fpvkziq9x
210Wc7XTWf7JfmPs0d3gZekaohHtqJVigCuA4dXH6kvbDvbDcfKOza6rqNmpSFrQ
ifWH6M12Ut/G5NfwihhTRhoKkeQjqHFgikNC8BjF2Myem20026Vr6MKZrubDAlkq
VWP7lT2zNSs1btsNqDrA9+wejwK8OwwrpfZOx3hbYL6Q+u/AuljIJ79aRz8ROZcE
jQZeOKprmlAaDIASdxIM4yjwzSQE0l/CaHteHfKaddmcWTrtKR+CEfU0ODzYWoK0
dcQwyMF3tOToj7BjWqKM
=jy4C
-----END PGP SIGNATURE-----
Merge tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"One of these is an intel_pstate regression fix and it is not a small
change, but it mostly removes code that shouldn't be there. That code
was acquired by mistake and has been a source of constant pain since
then, so the time has come to get rid of it finally. We have not seen
problems with this change in the lab, so fingers crossed.
The rest is more usual: one more intel_pstate commit removing useless
code, a cpufreq core fix to make it restore policy limits on CPU
online (which prevents the limits from being reset over system
suspend/resume), a schedutil cpufreq governor initialization fix to
make it actually work as advertised on all systems and an extra sanity
check in the cpuidle core to prevent crashes from happening if the
arch code messes things up.
Specifics:
- Make intel_pstate use one set of global P-state limits in the
active mode regardless of the scaling_governor settings for
individual CPUs instead of switching back and forth between two of
them in a way that is hard to control (Rafael Wysocki).
- Drop a useless function from intel_pstate to prevent it from
modifying the maximum supported frequency value unexpectedly which
may confuse the cpufreq core (Rafael Wysocki).
- Fix the cpufreq core to restore policy limits on CPU online so that
the limits are not reset over system suspend/resume, among other
things (Viresh Kumar).
- Fix the initialization of the schedutil cpufreq governor to make
the IO-wait boosting mechanism in it actually work on systems with
one CPU per cpufreq policy (Rafael Wysocki).
- Add a sanity check to the cpuidle core to prevent crashes from
happening if the architecture code initialization fails to set up
things as expected (Vaidyanathan Srinivasan)"
* tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Restore policy min/max limits on CPU online
cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
cpufreq: intel_pstate: Fix policy data management in passive mode
cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
cpufreq: intel_pstate: One set of global limits in active mode
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).
However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).
That allows code duplication in sugov_update_commit() to be reduced
somewhat too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c
Almost entirely overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
On systems with a large number of CPUs, running sysrq-<q> can cause
watchdog timeouts. There are two slow sections of code in the sysrq-<q>
path in timer_list.c.
1. print_active_timers() - This function is called by print_cpu() and
contains a slow goto loop. On a machine with hundreds of CPUs, this
loop took approximately 100ms for the first CPU in a NUMA node.
(Subsequent CPUs in the same node ran much quicker.) The total time
to print all of the CPUs is ultimately long enough to trigger the
soft lockup watchdog.
2. print_tickdevice() - This function outputs a large amount of textual
information. This function also took approximately 100ms per CPU.
Since sysrq-<q> is not a performance critical path, there should be no
harm in touching the nmi watchdog in both slow sections above. Touching
it in just one location was insufficient on systems with hundreds of
CPUs as occasional timeouts were still observed during testing.
This issue was observed on an Oracle T7 machine with 128 CPUs, but I
anticipate it may affect other systems with similarly large numbers of
CPUs.
Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The scheduler clock framework may not use the correct timeout for the clock
wrap. This happens when a new clock driver calls sched_clock_register()
after the kernel called sched_clock_postinit(). In this case the clock wrap
timeout is too long thus sched_clock_poll() is called too late and the clock
already wrapped.
On my ARM system the scheduler was no longer scheduling any other task than
the idle task because the sched_clock() wrapped.
Signed-off-by: David Engraf <david.engraf@sysgo.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
A clockevent device's rate should be configured before or at registration
and changed afterwards through clockevents_update_freq() only.
For the configuration at registration, we already have
clockevents_config_and_register().
Right now, there are no clockevents_config() users outside of the
clockevents core.
To mitigiate the risk of drivers errorneously reconfiguring their rates
through clockevents_config() *after* device registration, make
clockevents_config() static.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull networking fixes from David Miller:
1) Several netfilter fixes from Pablo and the crew:
- Handle fragmented packets properly in netfilter conntrack, from
Florian Westphal.
- Fix SCTP ICMP packet handling, from Ying Xue.
- Fix big-endian bug in nftables, from Liping Zhang.
- Fix alignment of fake conntrack entry, from Steven Rostedt.
2) Fix feature flags setting in fjes driver, from Taku Izumi.
3) Openvswitch ipv6 tunnel source address not set properly, from Or
Gerlitz.
4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.
5) sk->sk_frag.page not released properly in some cases, from Eric
Dumazet.
6) Fix RTNL deadlocks in nl80211, from Johannes Berg.
7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.
8) Cure improper inflight handling during AF_UNIX GC, from Andrey
Ulanov.
9) sch_dsmark doesn't write to packet headers properly, from Eric
Dumazet.
10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
Yeganeh.
11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.
12) Fix nametbl deadlock in tipc, from Ying Xue.
13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
Pressman.
14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.
15) Fix hashmap allocation handling, from Alexei Starovoitov.
16) nl_fib_input() needs stronger netlink message length checking, from
Eric Dumazet.
17) Fix double-free of sk->sk_filter during sock clone, from Daniel
Borkmann.
18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
net:ethernet:aquantia: Fix for RX checksum offload.
amd-xgbe: Fix the ECC-related bit position definitions
sfc: cleanup a condition in efx_udp_tunnel_del()
Bluetooth: btqcomsmd: fix compile-test dependency
inet: frag: release spinlock before calling icmp_send()
tcp: initialize icsk_ack.lrcvtime at session start time
genetlink: fix counting regression on ctrl_dumpfamily()
socket, bpf: fix sk_filter use after free in sk_clone_lock
ipv4: provide stronger user input validation in nl_fib_input()
bpf: fix hashmap extra_elems logic
enic: update enic maintainers
net: bcmgenet: remove bcmgenet_internal_phy_setup()
ipv6: make sure to initialize sockc.tsflags before first use
fjes: Do not load fjes driver if extended socket device is not power on.
fjes: Do not load fjes driver if system does not have extended socket device.
net/mlx5e: Count LRO packets correctly
net/mlx5e: Count GSO packets correctly
net/mlx5: Increase number of max QPs in default profile
net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
...
When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.
The problem is that it hold hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi() leads it to believe to see an
AB-BA deadlock.
Task1 (holds rt_mutex, Task2 (does FUTEX_LOCK_PI)
does FUTEX_UNLOCK_PI)
lock hb->lock
lock rt_mutex (as per start_proxy)
lock hb->lock
Which is a trivial AB-BA.
It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.
To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.
Aside of solving the RT problem this makes the lock and unlock mechanism
symetric and reduces the hb->lock held time.
Reported-and-tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to proof a bounded execution time on the
operation. And seeing that this is a RT operation, this is somewhat
important.
While in practise; given the previous patch; it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.
However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb-lock. Doing a hand-over, without leaving a hole.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.
This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:
Before:
futex_lock_pi() futex_unlock_pi()
unlock hb->lock
lock hb->lock
unlock hb->lock
lock rt_mutex->wait_lock
unlock rt_mutex_wait_lock
-EAGAIN
lock rt_mutex->wait_lock
list_add
unlock rt_mutex->wait_lock
schedule()
lock rt_mutex->wait_lock
list_del
unlock rt_mutex->wait_lock
<idem>
-EAGAIN
lock hb->lock
After:
futex_lock_pi() futex_unlock_pi()
lock hb->lock
lock rt_mutex->wait_lock
list_add
unlock rt_mutex->wait_lock
unlock hb->lock
schedule()
lock hb->lock
unlock hb->lock
lock hb->lock
lock rt_mutex->wait_lock
list_del
unlock rt_mutex->wait_lock
lock rt_mutex->wait_lock
unlock rt_mutex_wait_lock
-EAGAIN
unlock hb->lock
It does however solve the earlier starvation/live-lock scenario which got
introduced with the -EAGAIN since unlike the before scenario; where the
-EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the
after scenario it happens while futex_unlock_pi() actually holds a lock,
and then it is serialized on that lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.
Split split_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.
When the rtmutex was acquired successfully the waiter can be removed in the
acquisiton path safely, since there is no concurrency on the lock owner.
This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.
[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There's a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalient.
Notably:
- a PI inversion on hb->lock; and,
- a SCHED_DEADLINE crash because of pointer instability.
The previous changes:
- changed the locking rules to cover {uval,pi_state} with wait_lock.
- allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in
turn allows to rely on wait_lock atomicity completely.
- simplified the waiter conundrum.
It's now sufficient to hold rtmutex::wait_lock and a reference on the
pi_state to protect the state consistency, so hb->lock can be dropped
before calling rt_mutex_futex_unlock().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.
In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters, in particular rt_mutex will find no pending
waiters where futex_q thinks there are. In this case the rt_mutex unlock
code cannot assign an owner.
The futex side fixup code has to cleanup the inconsistencies with quite a
bunch of interesting corner cases.
Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the opportunity
to continue and the retried futex_unlock_pi() will now observe a coherent
state.
The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:
T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
CPU0
T1
lock_pi()
queue_me() <- Waiter is visible
preemption
T2
unlock_pi()
loops with -EAGAIN forever
Which is undesirable for PI primitives. Future patches will rectify
this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.
The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.
Rework and document the locking so rt_mutex::wait_lock is held accross all
operations which modify the user space value and the pi state.
This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.
Nothing yet relies on the new locking rules.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.
This means it cannot rely on the atomicy of wait_lock, which would be
preferred in order to not rely on hb->lock so much.
The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
race with the rt_mutex fastpath, however futexes have their own fast path.
Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement a rt_mutex variant without fastpath for them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A regression of the FTQ noise has been reported by Ying Huang,
on the following hardware:
8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory
... which was caused by this commit:
commit 4e5160766f ("sched/fair: Propagate asynchrous detach")
The only part of the patch that can increase the noise is the update
of blocked load of group entity in update_blocked_averages().
We can optimize this call and skip the update of group entity if its load
and utilization are already null and there is no pending propagation of load
in the task group.
This optimization partly restores the noise score. A more agressive
optimization has been tried but has shown worse score.
Reported-by: ying.huang@linux.intel.com
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: ying.huang@intel.com
Fixes: 4e5160766f ("sched/fair: Propagate asynchrous detach")
Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
[ Fixed typos, improved layout. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
People reported that commit:
5680d8094f ("sched/clock: Provide better clock continuity")
broke "perf test tsc".
That commit added another offset to the reported clock value; so
take that into account when computing the provided offset values.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Paul reported a problems with clear_sched_clock_stable(). Since we run
all of __clear_sched_clock_stable() from workqueue context, there's a
preempt problem.
Solve it by only running the static_key_disable() from workqueue.
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYzznuAAoJEHm+PkMAQRiGAzMIAJDBo5otTMMLhg8eKj8Cnab4
2NyaoWDN6mtU427rzEKEfZlTtp3gIBVdFex5x442weIdw6BgRQW0dvF/uwEn08yI
9Wx7VJmIUyH9M8VmhDtkUTFrhwUGr29qb3JhENMd7tv/CiJaehGRHCT3xqo5BDdu
xiyPcwSkwP/NH24TS91G87gV6r0I0oKLSAxu+KifEFESrb8gaZaduslzpEj3m/Ds
o9EPpfzaiGAdW5EdNfPtviYbBk7ZOXwtxdMV+zlvsLcaqtYnFEsJZd2WyZL0zGML
VXBVxaYtlyTeA7Mt8YYUL+rDHELSOtCeN5zLfxUvYt+Yc0Y6LFBLDOE5h8b3eCw=
=uKUo
-----END PGP SIGNATURE-----
BackMerge tag 'v4.11-rc3' into drm-next
Linux 4.11-rc3 as requested by Daniel
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.
That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.
That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task. If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away. That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.
This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.
On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.
To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
This patch adds hash of maps support (hashmap->bpf_map).
BPF_MAP_TYPE_HASH_OF_MAPS is added.
A map-in-map contains a pointer to another map and lets call
this pointer 'inner_map_ptr'.
Notes on deleting inner_map_ptr from a hash map:
1. For BPF_F_NO_PREALLOC map-in-map, when deleting
an inner_map_ptr, the htab_elem itself will go through
a rcu grace period and the inner_map_ptr resides
in the htab_elem.
2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
when deleting an inner_map_ptr, the htab_elem may
get reused immediately. This situation is similar
to the existing prealloc-ated use cases.
However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
inner_map->ops->map_free(inner_map) which will go
through a rcu grace period (i.e. all bpf_map's map_free
currently goes through a rcu grace period). Hence,
the inner_map_ptr is still safe for the rcu reader side.
This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
check_map_prealloc() in the verifier. preallocation is a
must for BPF_PROG_TYPE_PERF_EVENT. Hence, even we don't expect
heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
map-in-map first.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a few helper funcs to enable map-in-map
support (i.e. outer_map->inner_map). The first outer_map type
BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
The next patch will introduce a hash of maps type.
Any bpf map type can be acted as an inner_map. The exception
is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
indirection makes it harder to verify the owner_prog_type
and owner_jited.
Multi-level map-in-map is not supported (i.e. map->map is ok
but not map->map->map).
When adding an inner_map to an outer_map, it currently checks the
map_type, key_size, value_size, map_flags, max_entries and ops.
The verifier also uses those map's properties to do static analysis.
map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
is using a preallocated hashtab for the inner_hash also. ops and
max_entries are needed to generate inlined map-lookup instructions.
For simplicity reason, a simple '==' test is used for both map_flags
and max_entries. The equality of ops is implied by the equality of
map_type.
During outer_map creation time, an inner_map_fd is needed to create an
outer_map. However, the inner_map_fd's life time does not depend on the
outer_map. The inner_map_fd is merely used to initialize
the inner_map_meta of the outer_map.
Also, for the outer_map:
* It allows element update and delete from syscall
* It allows element lookup from bpf_prog
The above is similar to the current fd_array pattern.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix in verifier:
For the same bpf_map_lookup_elem() instruction (i.e. "call 1"),
a broken case is "a different type of map could be used for the
same lookup instruction". For example, an array in one case and a
hashmap in another. We have to resort to the old dynamic call behavior
in this case. The fix is to check for collision on insn_aux->map_ptr.
If there is collision, don't inline the map lookup.
Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later
patch for how-to trigger the above case.
Simplifications on array_map_gen_lookup():
1. Calculate elem_size from map->value_size. It removes the
need for 'struct bpf_array' which makes the later map-in-map
implementation easier.
2. Remove the 'elem_size == 1' test
Fixes: 81ed18ab30 ("bpf: add helper inlining infra and optimize map_array lookup")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In both kmalloc and prealloc mode the bpf_map_update_elem() is using
per-cpu extra_elems to do atomic update when the map is full.
There are two issues with it. The logic can be misused, since it allows
max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
at map creation time can fail percpu alloc for large map values with a warn:
WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
illegal size (32824) or align (8) for percpu allocation
The fixes for both of these issues are different for kmalloc and prealloc modes.
For prealloc mode allocate extra num_possible_cpus elements and store
their pointers into extra_elems array instead of actual elements.
Hence we can use these hidden(spare) elements not only when the map is full
but during bpf_map_update_elem() that replaces existing element too.
That also improves performance, since pcpu_freelist_pop/push is avoided.
Unfortunately this approach cannot be used for kmalloc mode which needs
to kfree elements after rcu grace period. Therefore switch it back to normal
kmalloc even when full and old element exists like it was prior to
commit 6c90598174 ("bpf: pre-allocate hash map elements").
Add tests to check for over max_entries and large map values.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
What started as a rather straightforward race condition reported by
Dmitry using the syzkaller fuzzer ended up revealing some major
problems with how the audit subsystem managed its netlink sockets and
its connection with the userspace audit daemon. Fixing this properly
had quite the cascading effect and what we are left with is this rather
large and complicated patch. My initial goal was to try and decompose
this patch into multiple smaller patches, but the way these changes
are intertwined makes it difficult to split these changes into
meaningful pieces that don't break or somehow make things worse for
the intermediate states.
The patch makes a number of changes, but the most significant are
highlighted below:
* The auditd tracking variables, e.g. audit_sock, are now gone and
replaced by a RCU/spin_lock protected variable auditd_conn which is
a structure containing all of the auditd tracking information.
* We no longer track the auditd sock directly, instead we track it
via the network namespace in which it resides and we use the audit
socket associated with that namespace. In spirit, this is what the
code was trying to do prior to this patch (at least I think that is
what the original authors intended), but it was done rather poorly
and added a layer of obfuscation that only masked the underlying
problems.
* Big backlog queue cleanup, again. In v4.10 we made some pretty big
changes to how the audit backlog queues work, here we haven't changed
the queue design so much as cleaned up the implementation. Brought
about by the locking changes, we've simplified kauditd_thread() quite
a bit by consolidating the queue handling into a new helper function,
kauditd_send_queue(), which allows us to eliminate a lot of very
similar code and makes the looping logic in kauditd_thread() clearer.
* All netlink messages sent to auditd are now sent via
auditd_send_unicast_skb(). Other than just making sense, this makes
the lock handling easier.
* Change the audit_log_start() sleep behavior so that we never sleep
on auditd events (unchanged) or if the caller is holding the
audit_cmd_mutex (changed). Previously we didn't sleep if the caller
was auditd or if the message type fell between a certain range; the
type check was a poor effort of doing what the cmd_mutex check now
does. Richard Guy Briggs originally proposed not sleeping the
cmd_mutex owner several years ago but his patch wasn't acceptable
at the time. At least the idea lives on here.
* A problem with the lost record counter has been resolved. Steve
Grubb and I both happened to notice this problem and according to
some quick testing by Steve, this problem goes back quite some time.
It's largely a harmless problem, although it may have left some
careful sysadmins quite puzzled.
Cc: <stable@vger.kernel.org> # 4.10.x-
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.
That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.
Fixes: 21ca6d2c52 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
Pull CPU hotplug fix from Thomas Gleixner:
"A single fix preventing the concurrent execution of the CPU hotplug
callback install/invocation machinery. Long standing bug caused by a
massive brain slip of that Gleixner dude, which went unnoticed for
almost a year"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Serialize callback invocations proper
Pull perf fixes from Thomas Gleixner:
"A set of perf related fixes:
- fix a CR4.PCE propagation issue caused by usage of mm instead of
active_mm and therefore propagated the wrong value.
- perf core fixes, which plug a use-after-free issue and make the
event inheritance on fork more robust.
- a tooling fix for symbol handling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf symbols: Fix symbols__fixup_end heuristic for corner cases
x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
perf/core: Better explain the inherit magic
perf/core: Simplify perf_event_free_task()
perf/core: Fix event inheritance on fork()
perf/core: Fix use-after-free in perf_release()
Pull scheduler fixes from Thomas Gleixner:
"From the scheduler departement:
- a bunch of sched deadline related fixes which deal with various
buglets and corner cases.
- two fixes for the loadavg spikes which are caused by the delayed
NOHZ accounting"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Use deadline instead of period when calculating overflow
sched/deadline: Throttle a constrained deadline task activated after the deadline
sched/deadline: Make sure the replenishment timer fires in the next period
sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
sched/deadline: Add missing update_rq_clock() in dl_task_timer()
Pull locking fixes from Thomas Gleixner:
"Three fixes related to locking:
- fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
for the XCHGADD variant already
- plug a potential use after free in the futex code
- prevent leaking a held spinlock in an futex error handling code
path"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
futex: Add missing error handling to FUTEX_REQUEUE_PI
futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
This function was removed in commit c6eb3f70d4 (hrtimer: Get rid of
hrtimer softirq, 2015-04-14) but the prototype wasn't ever deleted.
Delete it now.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170317010814.2591-1-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
Optimize:
bpf_call
bpf_map_lookup_elem
map->ops->map_lookup_elem
htab_map_lookup_elem
__htab_map_lookup_elem
into:
bpf_call
__htab_map_lookup_elem
to improve performance of JITed programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
into a sequence of bpf instructions.
When JIT is on the sequence of bpf instructions is the sequence
of native cpu instructions with significantly faster performance
than indirect call and two function's prologue/epilogue.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
convert_ctx_accesses() replaces single bpf instruction with a set of
instructions. Adjust corresponding insn_aux_data while patching.
It's needed to make sure subsequent 'for(all insn)' loops
have matching insn and insn_aux_data.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
reduce indent and make it iterate over instructions similar to
convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
no functional change.
move fixup_bpf_calls() to verifier.c
it's being refactored in the next patch
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bfc8c90139 ("mem-hotplug: implement get/put_online_mems")
introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
in order to allow similar semantics for memory hotplug like for cpu
hotplug.
The corresponding functions for cpu hotplug are get/put_online_cpus()
and cpu_hotplug_begin/done() for cpu hotplug.
The commit however missed to introduce functions that would serialize
memory hotplug operations like they are done for cpu hotplug with
cpu_maps_update_begin/done().
This basically leaves mem_hotplug.active_writer unprotected and allows
concurrent writers to modify it, which may lead to problems as outlined
by commit f931ab479d ("mm: fix devm_memremap_pages crash, use
mem_hotplug_{begin, done}").
That commit was extended again with commit b5d24fda9c ("mm,
devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
done}") which serializes memory hotplug operations for some call sites
by using the device_hotplug lock.
In addition with commit 3fc2192410 ("mm: validate device_hotplug is held
for memory hotplug") a sanity check was added to mem_hotplug_begin() to
verify that the device_hotplug lock is held.
This in turn triggers the following warning on s390:
WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
Call Trace:
assert_held_device_hotplug+0x40/0x58)
mem_hotplug_begin+0x34/0xc8
add_memory_resource+0x7e/0x1f8
add_memory+0xda/0x130
add_memory_merged+0x15c/0x178
sclp_detect_standby_memory+0x2ae/0x2f8
do_one_initcall+0xa2/0x150
kernel_init_freeable+0x228/0x2d8
kernel_init+0x2a/0x140
kernel_thread_starter+0x6/0xc
One possible fix would be to add more lock_device_hotplug() and
unlock_device_hotplug() calls around each call site of
mem_hotplug_begin/end(). But that would give the device_hotplug lock
additional semantics it better should not have (serialize memory hotplug
operations).
Instead add a new memory_add_remove_lock which has the similar semantics
like cpu_add_remove_lock for cpu hotplug.
To keep things hopefully a bit easier the lock will be locked and unlocked
within the mem_hotplug_begin/end() functions.
Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New features:
- Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)
Kernel:
- Default UPROBES_EVENTS to Y (Alexei Starovoitov)
- Fix check for kretprobe offset within function entry (Naveen N. Rao)
Infrastructure:
- Introduce util func is_sdt_event() (Ravi Bangoria)
- Make perf_event__synthesize_mmap_events() scale on older kernels where
reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps (Stephane Eranian)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYyrdSAAoJENZQFvNTUqpAe+4P/3c4ilBSOxLCCxGO7jDYo9oq
/KqlvsCIg7+vo5eqrOUJAb4qXFnvpYxwjMMkL5rx7gdsBCRfRXIINGWUMrq5mNyk
MgxuqYnp+yRuxLYml2wn+tdwLzcHWSN2EO9mqQ14N4I+HvgdLmVPQ44ACQXs6KfL
dk/Ix8YtnFWl2sDZjvyr7ZBqwCPzzklZgHM6erxNUr/WJspzUiixAWqUmewodOUl
P3PitlHXkITOK3AxSqOjJ4g1k933215nGih7hr0XdjEm4pIYaYksShQ6k9DASCrv
dn2o1pF1LTu7KCtAo70aaSB7GXydwoA//o2gRbDkSwJJ25DIImZxJXQz9PAYDOo1
vXSIhmlQ72c4/Yv/XzVOrIoMMMpmWKS3lGZxMVGR/Ie9Gw4kbotkaoEqEpNQsaDZ
iIaU5v/EcvvToT7T7VHrGg0+vmHgYxm5gSlyASi2IrO2/wJAs0v2pYfuL6gYhXGp
mhv/pHUv4l9OW+Ubm+zJEEcg337c2RQU5wT/bk4PihxY6nQyEH2Pn5VzdNbZLuMR
eWnqTH/md+8/bkhmuZJp71wm60oPHoPvbDjvtfVmXAa52AzO+NWSc9Veke3C/QRm
XgNkrXlzeKopEso3j4gw2iAolqw9t8FHFLGgbTkS+6UCKjAM7vNLiIV02LQqhM50
qCnKEusMDCRgzeOXxYt+
=Bg5M
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-4.12-20170316' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
New features:
- Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)
Kernel changes:
- Default UPROBES_EVENTS to Y (Alexei Starovoitov)
- Fix check for kretprobe offset within function entry (Naveen N. Rao)
Infrastructure changes:
- Introduce util func is_sdt_event() (Ravi Bangoria)
- Make perf_event__synthesize_mmap_events() scale on older kernels where
reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps (Stephane Eranian)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As it is already turned on by most distros, so just flip the default to
Y.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Wang Nan <wangnan0@huawei.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hemant Kumar <hemant@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170316005817.GA6805@ast-mbp.thefacebook.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
While going through the event inheritance code Oleg got confused.
Add some comments to better explain the silent dissapearance of
orphaned events.
So what happens is that at perf_event_release_kernel() time; when an
event looses its connection to userspace (and ceases to exist from the
user's perspective) we can still have an arbitrary amount of inherited
copies of the event. We want to synchronously find and remove all
these child events.
Since that requires a bit of lock juggling, there is the possibility
that concurrent clone()s will create new child events. Therefore we
first mark the parent event as DEAD, which marks all the extant child
events as orphaned.
We then avoid copying orphaned events; in order to avoid getting more
of them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have ctx->event_list that contains all events; no need to
repeatedly iterate the group lists to find them all.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While hunting for clues to a use-after-free, Oleg spotted that
perf_event_init_context() can loose an error value with the result
that fork() can succeed even though we did not fully inherit the perf
event context.
Spotted-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: oleg@redhat.com
Cc: stable@vger.kernel.org
Fixes: 889ff01506 ("perf/core: Split context's event group list into pinned and non-pinned lists")
Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently each thread starts an acquire context only once, and
performs all its loop iterations under it.
This means that the Wound/Wait relations between threads are fixed.
To make things a little more realistic and cover more of the
functionality with the test, open a new acquire context for each loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a PER_CPU struct which contains a spin_lock is statically initialized
via:
DEFINE_PER_CPU(struct foo, bla) = {
.lock = __SPIN_LOCK_UNLOCKED(bla.lock)
};
then lockdep assigns a seperate key to each lock because the logic for
assigning a key to statically initialized locks is to use the address as
the key. With per CPU locks the address is obvioulsy different on each CPU.
That's wrong, because all locks should have the same key.
To solve this the following modifications are required:
1) Extend the is_kernel/module_percpu_addr() functions to hand back the
canonical address of the per CPU address, i.e. the per CPU address
minus the per CPU offset.
2) Check the lock address with these functions and if the per CPU check
matches use the returned canonical address as the lock key, so all per
CPU locks have the same key.
3) Move the static_obj(key) check into look_up_lock_class() so this check
can be avoided for statically initialized per CPU locks. That's
required because the canonical address fails the static_obj(key) check
for obvious reasons.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Merged Dan's fixups for !MODULES and !SMP into this patch. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Murphy <dmurphy@ti.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170227143736.pectaimkjkan5kow@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
f8319483f5 ("locking/lockdep: Provide a type check for lock_is_held")
didn't fully cover rwsems as downgrade_write() was left out.
Introduce lock_downgrade() and use it to add new checks.
See-also: http://marc.info/?l=linux-kernel&m=148581164003149&w=2
Originally-written-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon05g@gmail.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Behaviour should not change.
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon05g@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A simple consolidataion to factor out repeated patterns.
The behaviour should not change.
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon05g@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Intel PT driver needs to be able to communicate partial AUX transactions,
that is, transactions with gaps in data for reasons other than no room
left in the buffer (i.e. truncated transactions). Therefore, this condition
does not imply a wakeup for the consumer.
To this end, add a new "partial" AUX flag.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170220133352.17995-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for adding more flags to perf AUX records, introduce a
separate API for setting the flags for a session, rather than appending
more bool arguments to perf_aux_output_end. This allows to set each
flag at the time a corresponding condition is detected, instead of
tracking it in each driver's private state.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170220133352.17995-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add DEQUEUE_NOCLOCK to all places where we just did an
update_rq_clock() already.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of relying on deactivate_task() to call update_rq_clock() and
handling the case where it didn't happen (task_on_rq_queued),
unconditionally do update_rq_clock() and skip any further updates.
This also avoids a double update on deactivate_task() + ttwu_local().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since all tasks on the wake_list are woken under a single rq->lock
avoid calling update_rq_clock() for each task.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because
DEQUEUE_SAVE will have done an update_rq_clock().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently {en,de}queue_task() do an unconditional update_rq_clock().
However since we want to avoid duplicate updates, so that each
rq->lock section appears atomic in time, we need to be able to skip
these clock updates.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The missing update_rq_clock() check can work with partial rq->lock
wrappery, since a missing wrapper can cause the warning to not be
emitted when it should have, but cannot cause the warning to trigger
when it should not have.
The duplicate update_rq_clock() check however can cause false warnings
to trigger. Therefore add more comprehensive rq->lock wrappery.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have no missing calls, add a warning to find multiple
calls.
By having only a single update_rq_clock() call per rq-lock section,
the section appears 'atomic' wrt time.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While looking into optimizations for the RT scheduler IPI logic, I realized
that the comments are lacking to describe it efficiently. It deserves a
lengthy description describing its design.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
[ Small typographical edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I was testing Daniel's changes with his test case, and tweaked it a
little. Instead of having the runtime equal to the deadline, I
increased the deadline ten fold.
Daniel's test case had:
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
To make it more interesting, I changed it to:
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
The results were rather surprising. The behavior that Daniel's patch
was fixing came back. The task started using much more than .1% of the
CPU. More like 20%.
Looking into this I found that it was due to the dl_entity_overflow()
constantly returning true. That's because it uses the relative period
against relative runtime vs the absolute deadline against absolute
runtime.
runtime / (deadline - t) > dl_runtime / dl_period
There's even a comment mentioning this, and saying that when relative
deadline equals relative period, that the equation is the same as using
deadline instead of period. That comment is backwards! What we really
want is:
runtime / (deadline - t) > dl_runtime / dl_deadline
We care about if the runtime can make its deadline, not its period. And
then we can say "when the deadline equals the period, the equation is
the same as using dl_period instead of dl_deadline".
After correcting this, now when the task gets enqueued, it can throttle
correctly, and Daniel's fix to the throttling of sleeping deadline
tasks works even when the runtime and deadline are not the same.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During the activation, CBS checks if it can reuse the current task's
runtime and period. If the deadline of the task is in the past, CBS
cannot use the runtime, and so it replenishes the task. This rule
works fine for implicit deadline tasks (deadline == period), and the
CBS was designed for implicit deadline tasks. However, a task with
constrained deadline (deadine < period) might be awakened after the
deadline, but before the next period. In this case, replenishing the
task would allow it to run for runtime / deadline. As in this case
deadline < period, CBS enables a task to run for more than the
runtime / period. In a very loaded system, this can cause a domino
effect, making other tasks miss their deadlines.
To avoid this problem, in the activation of a constrained deadline
task after the deadline but before the next period, throttle the
task and set the replenishing timer to the begin of the next period,
unless it is boosted.
Reproducer:
--------------- %< ---------------
int main (int argc, char **argv)
{
int ret;
int flags = 0;
unsigned long l = 0;
struct timespec ts;
struct sched_attr attr;
memset(&attr, 0, sizeof(attr));
attr.size = sizeof(attr);
attr.sched_policy = SCHED_DEADLINE;
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
ts.tv_sec = 0;
ts.tv_nsec = 2000 * 1000; /* 2 ms */
ret = sched_setattr(0, &attr, flags);
if (ret < 0) {
perror("sched_setattr");
exit(-1);
}
for(;;) {
/* XXX: you may need to adjust the loop */
for (l = 0; l < 150000; l++);
/*
* The ideia is to go to sleep right before the deadline
* and then wake up before the next period to receive
* a new replenishment.
*/
nanosleep(&ts, NULL);
}
exit(0);
}
--------------- >% ---------------
On my box, this reproducer uses almost 50% of the CPU time, which is
obviously wrong for a task with 2/2000 reservation.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the replenishment timer is set to fire at the deadline
of a task. Although that works for implicit deadline tasks because the
deadline is equals to the begin of the next period, that is not correct
for constrained deadline tasks (deadline < period).
For instance:
f.c:
--------------- %< ---------------
int main (void)
{
for(;;);
}
--------------- >% ---------------
# gcc -o f f.c
# trace-cmd record -e sched:sched_switch \
-e syscalls:sys_exit_sched_setattr \
chrt -d --sched-runtime 490000000 \
--sched-deadline 500000000 \
--sched-period 1000000000 0 ./f
# trace-cmd report | grep "{pid of ./f}"
After setting parameters, the task is replenished and continue running
until being throttled:
f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
The task is throttled after running 492318 ms, as expected:
f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]
But then, the task is replenished 500719 ms after the first
replenishment:
<idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]
Running for 490277 ms:
f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]
Hence, in the first period, the task runs 2 * runtime, and that is a bug.
During the first replenishment, the next deadline is set one period away.
So the runtime / period starts to be respected. However, as the second
replenishment took place in the wrong instant, the next replenishment
will also be held in a wrong instant of time. Rather than occurring in
the nth period away from the first activation, it is taking place
in the (nth period - relative deadline).
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We hang if SIGKILL has been sent, but the task is stuck in down_read()
(after do_exit()), even though no task is doing down_write() on the
rwsem in question:
INFO: task libupnp:21868 blocked for more than 120 seconds.
libupnp D 0 21868 1 0x08100008
...
Call Trace:
__schedule()
schedule()
__down_read()
do_exit()
do_group_exit()
__wake_up_parent()
This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
the following commit:
04cafed7fc ("locking/rwsem: Fix down_write_killable()")
... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Niklas Cassel <niklass@axis.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d47996082f ("locking/rwsem: Introduce basis for down_write_killable()")
Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'calc_load_update' is accessed without any kind of locking and there's
a clear assumption in the code that only a single value is read or
written.
Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
unintentionally seeing multiple values, or having the load/stores
split.
Technically the loads in calc_global_*() don't require this since
those are the only functions that update 'calc_load_update', but I've
added the READ_ONCE() for consistency.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This situation on exiting NO_HZ is described by:
this_rq->calc_load_update < jiffies < calc_load_update
In this scenario, what we should be doing is:
this_rq->calc_load_update = calc_load_update [ next window ]
But what we actually do is:
this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
This has the effect of delaying load average updates for potentially
up to ~9seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before,
commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
rq->clock_update_flags < RQCF_ACT_SKIP
CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
<IRQ>
dump_stack+0x85/0xc4
__warn+0x172/0x1b0
warn_slowpath_fmt+0xb4/0xf0
? __warn+0x1b0/0x1b0
? debug_check_no_locks_freed+0x2c0/0x2c0
? cpudl_set+0x3d/0x2b0
replenish_dl_entity+0x71e/0xc40
enqueue_task_dl+0x2ea/0x12e0
? dl_task_timer+0x777/0x990
? __hrtimer_run_queues+0x270/0xa50
dl_task_timer+0x316/0x990
? enqueue_task_dl+0x12e0/0x12e0
? enqueue_task_dl+0x12e0/0x12e0
__hrtimer_run_queues+0x270/0xa50
? hrtimer_cancel+0x20/0x20
? hrtimer_interrupt+0x119/0x600
hrtimer_interrupt+0x19c/0x600
? trace_hardirqs_off+0xd/0x10
local_apic_timer_interrupt+0x74/0xe0
smp_apic_timer_interrupt+0x76/0xa0
apic_timer_interrupt+0x93/0xa0
The DL task will be migrated to a suitable later deadline rq once the DL
timer fires and currnet rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after holding
the new rq's rq lock.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf specifies an offset from _text and since this offset is fed
directly into the arch-specific helper, kprobes tracer rejects
installation of kretprobes through perf. Fix this by looking up the
actual offset from a function for the specified sym+offset.
Refactor and reuse existing routines to limit code duplication -- we
repurpose kprobe_addr() for determining final kprobe address and we
split out the function entry offset determination into a separate
generic helper.
Before patch:
naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
probe-definition(0): do_open%return
symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
0 arguments
Looking at the vmlinux_path (8 entries long)
Using /boot/vmlinux for symbols
Open Debuginfo file: /boot/vmlinux
Try to find probe point from debuginfo.
Matched function: do_open [2d0c7ff]
Probe point found: do_open+0
Matched function: do_open [35d76dc]
found inline addr: 0xc0000000004ba9c4
Failed to find "do_open%return",
because do_open is an inlined function and has no return point.
An error occurred in debuginfo analysis (-22).
Trying to use symbols.
Opening /sys/kernel/debug/tracing//README write=0
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Writing event: r:probe/do_open _text+4469776
Failed to write event: Invalid argument
Error: Failed to add events. Reason: Invalid argument (Code: -22)
naveen@ubuntu:~/linux/tools/perf$ dmesg | tail
<snip>
[ 33.568656] Given offset is not valid for return probe.
After patch:
naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
probe-definition(0): do_open%return
symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
0 arguments
Looking at the vmlinux_path (8 entries long)
Using /boot/vmlinux for symbols
Open Debuginfo file: /boot/vmlinux
Try to find probe point from debuginfo.
Matched function: do_open [2d0c7d6]
Probe point found: do_open+0
Matched function: do_open [35d76b3]
found inline addr: 0xc0000000004ba9e4
Failed to find "do_open%return",
because do_open is an inlined function and has no return point.
An error occurred in debuginfo analysis (-22).
Trying to use symbols.
Opening /sys/kernel/debug/tracing//README write=0
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Writing event: r:probe/do_open _text+4469808
Writing event: r:probe/do_open_1 _text+4956344
Added new events:
probe:do_open (on do_open%return)
probe:do_open_1 (on do_open%return)
You can now use it in all perf tools, such as:
perf record -e probe:do_open_1 -aR sleep 1
naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
c000000000041370 k kretprobe_trampoline+0x0 [OPTIMIZED]
c0000000004ba0b8 r do_open+0x8 [DISABLED]
c000000000443430 r do_open+0x0 [DISABLED]
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/d8cd1ef420ec22e3643ac332fdabcffc77319a42.1488961018.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull networking fixes from David Miller:
1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
from Steffen Klassert.
2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
Alexei Starovoitov.
3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
Michal Schmidt.
4) We can get a divide by zero when TCP socket are morphed into
listening state, fix from Eric Dumazet.
5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
skb_complete_tx_timestamp(). From Eric Dumazet.
6) Use after free in dccp_feat_activate_values(), also from Eric
Dumazet.
7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
Jarod Wilson.
8) Fix use after free in vrf_xmit(), from David Ahern.
9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
Alexey Kodanev.
10) Properly check napi_complete_done() return value in order to decide
whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
Lendacky.
11) Fix double free of hwmon device in marvell phy driver, from Andrew
Lunn.
12) Don't crash on malformed netlink attributes in act_connmark, from
Etienne Noss.
13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
from Sabrina Dubroca.
14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
Florian Westphal.
15) Fix routing redirect races in dccp and tcp, basically the ICMP
handler can't modify the socket's cached route in it's locked by the
user at this moment. From Jon Maxwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
qed: Enable iSCSI Out-of-Order
qed: Correct out-of-bound access in OOO history
qed: Fix interrupt flags on Rx LL2
qed: Free previous connections when releasing iSCSI
qed: Fix mapping leak on LL2 rx flow
qed: Prevent creation of too-big u32-chains
qed: Align CIDs according to DORQ requirement
mlxsw: reg: Fix SPVMLR max record count
mlxsw: reg: Fix SPVM max record count
net: Resend IGMP memberships upon peer notification.
dccp: fix memory leak during tear-down of unsuccessful connection request
tun: fix premature POLLOUT notification on tun devices
dccp/tcp: fix routing redirect race
ucc/hdlc: fix two little issue
vxlan: fix ovs support
net: use net->count to check whether a netns is alive or not
bridge: drop netfilter fake rtable unconditionally
ipv6: avoid write to a possibly cloned skb
net: wimax/i2400m: fix NULL-deref at probe
isdn/gigaset: fix NULL-deref at probe
...
Pull workqueue fix from Tejun Heo:
"If a delayed work is queued with NULL @wq, workqueue code explodes
after the timer expires at which point it's difficult to tell who the
culprit was.
This actually happened and the offender was net/smc this time.
Add an explicit sanity check for it in the queueing path"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
The setup/remove_state/instance() functions in the hotplug core code are
serialized against concurrent CPU hotplug, but unfortunately not serialized
against themself.
As a consequence a concurrent invocation of these function results in
corruption of the callback machinery because two instances try to invoke
callbacks on remote cpus at the same time. This results in missing callback
invocations and initiator threads waiting forever on the completion.
The obvious solution to replace get_cpu_online() with cpu_hotplug_begin()
is not possible because at least one callsite calls into these functions
from a get_online_cpu() locked region.
Extend the protection scope of the cpuhp_state_mutex from solely protecting
the state arrays to cover the callback invocation machinery as well.
Fixes: 5b7aa87e04 ("cpu/hotplug: Implement setup/removal interface")
Reported-and-tested-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: hpa@zytor.com
Cc: mingo@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
commit fc62d0207a ("kprobes: Introduce weak variant of
kprobe_exceptions_notify()") used the __kprobes annotation to exclude
kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a
better way to do this enabling the symbol to be discovered as being
blacklisted, change over to using NOKPROBE_SYMBOL().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/3f25bf400da5c222cd9b10eec6ded2d6b58209f8.1488991670.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The trouble we have is that we can't really test all the shrinker
recursion stuff exhaustively in BAT because any kind of thrashing
stress test just takes too long.
But that leaves a really big gap open, since shrinker recursions are
one of the most annoying bugs. Now lockdep already has support for
checking allocation deadlocks:
- Direct reclaim paths are marked up with
lockdep_set_current_reclaim_state() and
lockdep_clear_current_reclaim_state().
- Any allocation paths are marked with lockdep_trace_alloc().
If we simply mark up our debugfs with the reclaim annotations, any
code and locks taken in there will automatically complete the picture
with any allocation paths we already have, as long as we have a simple
testcase in BAT which throws out a few objects using this interface.
Not stress test or thrashing needed at all.
v2: Need to EXPORT_SYMBOL_GPL to make it compile as a module.
v3: Fixup rebase fail (spotted by Chris).
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://patchwork.freedesktop.org/patch/msgid/20170312205340.16202-1-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
When a process is sent a SIGKILL because it exceeded CPU or RT limits,
the cause may not be obvious in userspace -- daemonised processes just
get killed, and even foreground process just see a 'Killed' message. The
lack of any information on why this might be happening in logs can be
confusing to users who are not aware of this mechanism.
Add messages which dump the process name and tid in dmesg when a process
exceeds its CPU or RT limits (soft and hard) in order to make it clearer to
people debugging such issues.
Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
Link: http://lkml.kernel.org/r/20170301145309.27214-1-arun@arunraghavan.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the advert of container technologies like docker, that depend on
namespaces for isolation, there is a need for tracing support for
namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
recording namespaces related info. By recording info for every
namespace, it is left to userspace to take a call on the definition of a
container and trace containers by updating perf tool accordingly.
Each namespace has a combination of device and inode numbers. Though
every namespace has the same device number currently, that may change in
future to avoid the need for a namespace of namespaces. Considering such
possibility, record both device and inode numbers separately for each
namespace.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.
It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need of calling
sugov_iowait_boost() at the top of the routine.
To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor. However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency. The latter
is where the real overhead comes from. The former is much less
expensive in terms of execution time and running it every time the
governor callback is invoked by the scheduler, after rate_limit_us
interval has passed since the last frequency update, would not be a
problem.
For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull x86 fixes from Thomas Gleixner:
- a fix for the kexec/purgatory regression which was introduced in the
merge window via an innocent sparse fix. We could have reverted that
commit, but on deeper inspection it turned out that the whole
machinery is neither documented nor robust. So a proper cleanup was
done instead
- the fix for the TLB flush issue which was discovered recently
- a simple typo fix for a reboot quirk
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tlb: Fix tlb flushing when lguest clears PGE
kexec, x86/purgatory: Unbreak it and clean it up
x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
On a specific audio system an interrupt input of an audio CODEC is used as a
shared interrupt. That interrupt input is handled by a CODEC specific irq
chip driver and triggers a CPU interrupt via the CODEC irq output line.
The CODEC interrupt handler demultiplexes the CODEC interrupt inputs and
the interrupt handlers for these demultiplexed inputs run nested in the
context of the CODEC interrupt handler.
The demultiplexed interrupts use handle_nested_irq() as their interrupt
handler, which unfortunately has no support for shared interrupts. So the
above hardware cannot be supported.
Add shared interrupt support to handle_nested_irq() by iterating over the
interrupt action chain.
[ tglx: Massaged changelog ]
Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Cc: patches@opensource.wolfsonmicro.com
Link: http://lkml.kernel.org/r/1488904098-5350-1-git-send-email-ckeepax@opensource.wolfsonmicro.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The purgatory code defines global variables which are referenced via a
symbol lookup in the kexec code (core and arch).
A recent commit addressing sparse warnings made these static and thereby
broke kexec_file.
Why did this happen? Simply because the whole machinery is undocumented and
lacks any form of forward declarations. The variable names are unspecific
and lack a prefix, so adding forward declarations creates shadow variables
in the core code. Aside of that the code relies on magic constants and
duplicate struct definitions with no way to ensure that these things stay
in sync. The section placement of the purgatory variables happened by
chance and not by design.
Unbreak kexec and cleanup the mess:
- Add proper forward declarations and document the usage
- Use common struct definition
- Use the proper common defines instead of magic constants
- Add a purgatory_ prefix to have a proper name space
- Use ARRAY_SIZE() instead of a homebrewn reimplementation
- Add proper sections to the purgatory variables [ From Mike ]
Fixes: 72042a8c7b ("x86/purgatory: Make functions and variables static")
Reported-by: Mike Galbraith <<efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicholas Mc Guire <der.herr@hofr.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Tobin C. Harding" <me@tobin.cc>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge fixes from Andrew Morton:
"26 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
userfaultfd: remove wrong comment from userfaultfd_ctx_get()
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
sh: cayman: IDE support fix
kasan: fix races in quarantine_remove_cache()
kasan: resched in quarantine_remove_cache()
mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
thp: fix another corner case of munlock() vs. THPs
rmap: fix NULL-pointer dereference on THP munlocking
mm/memblock.c: fix memblock_next_valid_pfn()
userfaultfd: selftest: vm: allow to build in vm/ directory
userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd: non-cooperative: fix fork fctx->new memleak
mm/cgroup: avoid panic when init with low memory
drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
mm/vmstats: add thp_split_pud event for clarity
include/linux/fs.h: fix unsigned enum warning with gcc-4.2
userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
userfaultfd: non-cooperative: robustness check
userfaultfd: non-cooperative: rollback userfaultfd_exit
x86, mm: unify exit paths in gup_pte_range()
...
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".
Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing. I've been doing testing before and this was also tested
by kbuild bot and exercised by the selftest, but this bug never
reproduced before.
I dropped userfaultfd_exit as result. I dropped it because of
implementation difficulty in receiving signals in __mmput and because I
think -ENOSPC as result from the background UFFDIO_COPY should be enough
already.
Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state. So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk to generate tasks in D state. The
same check exists in handle_userfault() for the same reason, except it
makes a difference there, while here is just a robustness check and it's
run under WARN_ON_ONCE.
While looking at the userfaultfd_event_wait_completion() function I
looked back at its callers too while at it and I think it's not ok to
stop executing dup_fctx on the fcs list because we relay on
userfaultfd_event_wait_completion to execute
userfaultfd_ctx_put(fctx->orig) which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
list_add(fcs). This change only takes care of fctx->orig but this area
also needs further review looking for similar problems in fctx->new.
The only patch that is urgent is the first because it's an use after
free during a SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.
This patch (of 3):
I once reproduced this oops with the userfaultfd selftest, it's not
easily reproducible and it requires SLUB poisoning to reproduce.
general protection fault: 0000 [#1] SMP
Modules linked in:
CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
RIP: 0010:[<ffffffff81451299>] [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
do_exit+0x297/0xd10
SyS_exit+0x17/0x20
tracesys+0xdd/0xe2
Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
RIP [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
RSP <ffff8801f833fe80>
---[ end trace 9fecd6dcb442846a ]---
In the debugger I located the "mm" pointer in the stack and walking
mm->mmap->vm_next through the end shows the vma->vm_next list is fully
consistent and it is null terminated list as expected. So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.
When userfaultfd_exit() run one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
The reason is that it's not running in __mmput but while there are still
other threads running and it's not holding the mmap_sem (it can't as it
has to wait the even to be received by the manager). So this is an use
after free that was happening for all processes.
One more implementation problem aside from the race condition:
userfaultfd_exit has really to check a flag in mm->flags before walking
the vma or it's going to slowdown the exit() path for regular tasks.
One more implementation problem: at that point signals can't be
delivered so it would also create a task in D state if the manager
doesn't read the event.
The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:
if (mmget_not_zero(ctx->mm)) {
[..]
} else {
return -ENOSPC;
}
It's safer to roll it back and re-introduce it later if at all.
[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix typos and add the following to the scripts/spelling.txt:
overide||override
While we are here, fix the doubled "address" in the touched line
Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.
Also, fix the comment block style in the touched hunks in
drivers/media/dvb-frontends/drx39xyj/drx_driver.h.
Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix typos and add the following to the scripts/spelling.txt:
disble||disable
disbled||disabled
I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched. The macro is not referenced at all, but this commit is
touching only comment blocks just in case.
Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Three fixes for intel_pstate problems related to the passive
mode (in which it acts as a regular cpufreq scaling driver), two
for the handling of global P-state limits and one for the handling
of the cpu_frequency tracepoint in that mode (Rafael Wysocki).
- Three fixes for the handling of P-state limits in intel_pstate in
the active mode (Rafael Wysocki).
- Introduction of a new cpufreq.off=1 kernel command line argument
that will disable cpufreq entirely if passed to the kernel and
is simply hooked up to the existing code used by Xen (Len Brown).
- Fix for the schedutil cpufreq governor to prevent it from using
stale raw frequency values in configurations with mutiple CPUs
sharing one policy object and a cleanup for it reducing its
overhead slightly (Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYwc9bAAoJEILEb/54YlRxU24P/2RPFpRYNtGNfbnvmyNjcA5H
MLQ/i0HEqhBVsfsj2Kmy6sRkY/X/kqnOiD2PzGcaOMTtMx2FcLNHzNqH0P0I7A4W
9wzUcq0SC5FzTJcLryN4gtJa3tfBDHjr6peAcZyji5t3DbGa+mygVwH8IGyr4feo
u7PE/6eXLfkRIySwG/kCI/YVE8tWuhjDHbjhR1pjgyrMjhDbPP480Mgsi4eDRRTO
BwGVyQJXoMo9e7/vZ8Y8Fkt7PMxYeyeE1yAGHkJzjkFfdrkZrn3IUPfgVgxy8IPV
N3w1BB+H84duEswPZH2patdIKQxXG3r0eaGHTXZIwy+5sHyc3mUMJ1FeHR67gasd
Z4p4hylTP+dGZGDo4G7PbVX4V6zEP+LSxgOQYjbo4n6k3nJJOrIF4wt+l5+tQz5m
Y+V6XVHgZm1LyFgjj06jMeVSmk6SS7X0qBHJ4WcfGzV/J+vStXmIvAmljs+cd/u2
R9z1W/rxelZFKT+4lRCuztpBYfJvlE5nXyM1XwC6Rz6QjFax0pJAHUPFznWyq24C
GlMAvQWPapDEoOvDrS/QKeqX1ROOYf2FwHs+uvPPCvOxMjL9NCQsQ34tqCM3MiL/
ebk3uZ0xC1e0LaOB87xr7tfsag12MyZzSCwX0D8nBLBLvntpa/xQwE5+4vu9ZguD
xlarnIsEh6bTwTVH6DJP
=Gwp7
-----END PGP SIGNATURE-----
Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix several issues in the intel_pstate driver and one issue in
the schedutil cpufreq governor, clean up that governor a bit and hook
up existing code for disabling cpufreq to a new kernel command line
option.
Specifics:
- Three fixes for intel_pstate problems related to the passive mode
(in which it acts as a regular cpufreq scaling driver), two for the
handling of global P-state limits and one for the handling of the
cpu_frequency tracepoint in that mode (Rafael Wysocki).
- Three fixes for the handling of P-state limits in intel_pstate in
the active mode (Rafael Wysocki).
- Introduction of a new cpufreq.off=1 kernel command line argument
that will disable cpufreq entirely if passed to the kernel and is
simply hooked up to the existing code used by Xen (Len Brown).
- Fix for the schedutil cpufreq governor to prevent it from using
stale raw frequency values in configurations with mutiple CPUs
sharing one policy object and a cleanup for it reducing its
overhead slightly (Viresh Kumar)"
* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: schedutil: Pass sg_policy to get_next_freq()
cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
when all map elements are pre-allocated one cpu can delete and reuse htab_elem
while another cpu is still walking the hlist. In such case the lookup may
miss the element. Convert hlist to hlist_nulls to avoid such scenario.
When bucket lock is taken there is no need to take such precautions,
so only convert map_lookup and map_get_next to nulls.
The race window is extremely small and only reproducible with explicit
udelay() inside lookup_nulls_elem_raw()
Similar to hlist add hlist_nulls_for_each_entry_safe() and
hlist_nulls_entry_safe() helpers.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Reported-by: Jonathan Perry <jonperry@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
when htab_elem is removed from the bucket list the htab_elem.hash_node.next
field should not be overridden too early otherwise we have a tiny race window
between lookup and delete.
The bug was discovered by manual code analysis and reproducible
only with explicit udelay() in lookup_elem_raw().
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Reported-by: Jonathan Perry <jonperry@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The scheduler header file split and cleanups ended up exposing a few
nasty header file dependencies, and in particular it showed how we in
<linux/wait.h> ended up depending on "signal_pending()", which now comes
from <linux/sched/signal.h>.
That's a very subtle and annoying dependency, which already caused a
semantic merge conflict (see commit e58bc92783 "Pull overlayfs updates
from Miklos Szeredi", which added that fixup in the merge commit).
It turns out that we can avoid this dependency _and_ improve code
generation by moving the guts of the fairly nasty helper #define
__wait_event_interruptible_locked() to out-of-line code. The code that
includes the signal_pending() check is all in the slow-path where we
actually go to sleep waiting for the event anyway, so using a helper
function is the right thing to do.
Using a helper function is also what we already did for the non-locked
versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
set of helper functions.
We might want to try to unify all these macro games, we have a _lot_ of
subtly different wait-event loops. But this is the minimal patch to fix
the annoying header dependency.
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
klp_mutex is shared between core.c and transition.c, and as such would
rather be properly located in a header so that we don't have to play
'extern' games from .c sources.
This also silences sparse warning (wrongly) suggesting that klp_mutex
should be defined static.
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently we do not allow patch module to unload since there is no
method to determine if a task is still running in the patched code.
The consistency model gives us the way because when the unpatching
finishes we know that all tasks were marked as safe to call an original
function. Thus every new call to the function calls the original code
and at the same time no task can be somewhere in the patched code,
because it had to leave that code to be marked as safe.
We can safely let the patch module go after that.
Completion is used for synchronization between module removal and sysfs
infrastructure in a similar way to commit 942e443127 ("module: Fix
mod->mkobj.kobj potentially freed too early").
Note that we still do not allow the removal for immediate model, that is
no consistency model. The module refcount may increase in this case if
somebody disables and enables the patch several times. This should not
cause any harm.
With this change a call to try_module_get() is moved to
__klp_enable_patch from klp_register_patch to make module reference
counting symmetric (module_put() is in a patch disable path) and to
allow to take a new reference to a disabled module when being enabled.
Finally, we need to be very careful about possible races between
klp_unregister_patch(), kobject_put() functions and operations
on the related sysfs files.
kobject_put(&patch->kobj) must be called without klp_mutex. Otherwise,
it might be blocked by enabled_store() that needs the mutex as well.
In addition, enabled_store() must check if the patch was not
unregisted in the meantime.
There is no need to do the same for other kobject_put() callsites
at the moment. Their sysfs operations neither take the lock nor
they access any data that might be freed in the meantime.
There was an attempt to use kobjects the right way and prevent these
races by design. But it made the patch definition more complicated
and opened another can of worms. See
https://lkml.kernel.org/r/1464018848-4303-1-git-send-email-pmladek@suse.com
[Thanks to Petr Mladek for improving the commit message.]
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Change livepatch to use a basic per-task consistency model. This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics. This is the
biggest remaining piece needed to make livepatch more generally useful.
This code stems from the design proposal made by Vojtech [1] in November
2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching. There are also a number of fallback options which make
it quite flexible.
Patches are applied on a per-task basis, when the task is deemed safe to
switch over. When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds. The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.
An interrupt handler inherits the patched state of the task it
interrupts. The same is true for forked tasks: the child inherits the
patched state of the parent.
Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:
1. The first and most effective approach is stack checking of sleeping
tasks. If no affected functions are on the stack of a given task,
the task is patched. In most cases this will patch most or all of
the tasks on the first try. Otherwise it'll keep trying
periodically. This option is only available if the architecture has
reliable stacks (HAVE_RELIABLE_STACKTRACE).
2. The second approach, if needed, is kernel exit switching. A
task is switched when it returns to user space from a system call, a
user space IRQ, or a signal. It's useful in the following cases:
a) Patching I/O-bound user tasks which are sleeping on an affected
function. In this case you have to send SIGSTOP and SIGCONT to
force it to exit the kernel and be patched.
b) Patching CPU-bound user tasks. If the task is highly CPU-bound
then it will get patched the next time it gets interrupted by an
IRQ.
c) In the future it could be useful for applying patches for
architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In
this case you would have to signal most of the tasks on the
system. However this isn't supported yet because there's
currently no way to patch kthreads without
HAVE_RELIABLE_STACKTRACE.
3. For idle "swapper" tasks, since they don't ever exit the kernel, they
instead have a klp_update_patch_state() call in the idle loop which
allows them to be patched before the CPU enters the idle state.
(Note there's not yet such an approach for kthreads.)
All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will disable per-task consistency and
patch all tasks immediately. This can be useful if the patch doesn't
change any function or data semantics. Note that, even with this flag
set, it's possible that some tasks may still be running with an old
version of the function, until that function returns.
There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency. This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.
For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
must set patch->immediate which causes all tasks to be patched
immediately. This option should be used with care, only when the patch
doesn't change any function or data semantics.
In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
may be allowed to use per-task consistency if we can come up with
another way to patch kthreads.
The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition. Only a single patch (the topmost patch on the stack)
can be in transition at a given time. A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.
A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress. Then all the tasks will attempt to
converge back to the original patch state.
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org> # for the scheduler changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
For the consistency model we'll need to know the sizes of the old and
new functions to determine if they're on the stacks of any tasks.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The sysfs enabled value is a boolean, so kstrtobool() is a better fit
for parsing the input string since it does the range checking for us.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Move functions related to the actual patching of functions and objects
into a new patch.c file.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
klp_patch_object()'s callers already ensure that the object is loaded,
so its call to klp_is_object_loaded() is unnecessary.
This will also make it possible to move the patching code into a
separate file.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Once we have a consistency model, patches and their objects will be
enabled and disabled at different times. For example, when a patch is
disabled, its loaded objects' funcs can remain registered with ftrace
indefinitely until the unpatching operation is complete and they're no
longer in use.
It's less confusing if we give them different names: patches can be
enabled or disabled; objects (and their funcs) can be patched or
unpatched:
- Enabled means that a patch is logically enabled (but not necessarily
fully applied).
- Patched means that an object's funcs are registered with ftrace and
added to the klp_ops func stack.
Also, since these states are binary, represent them with booleans
instead of ints.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Create temporary stubs for klp_update_patch_state() so we can add
TIF_PATCH_PENDING to different architectures in separate patches without
breaking build bisectability.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
For live patching and possibly other use cases, a stack trace is only
useful if it can be assured that it's completely reliable. Add a new
save_stack_trace_tsk_reliable() function to achieve that.
Note that if the target task isn't the current task, and the target task
is allowed to run, then it could be writing the stack while the unwinder
is reading it, resulting in possible corruption. So the caller of
save_stack_trace_tsk_reliable() must ensure that the task is either
'current' or inactive.
save_stack_trace_tsk_reliable() relies on the x86 unwinder's detection
of pt_regs on the stack. If the pt_regs are not user-mode registers
from a syscall, then they indicate an in-kernel interrupt or exception
(e.g. preemption or a page fault), in which case the stack is considered
unreliable due to the nature of frame pointers.
It also relies on the x86 unwinder's detection of other issues, such as:
- corrupted stack data
- stack grows the wrong way
- stack walk doesn't reach the bottom
- user didn't provide a large enough entries array
Such issues are reported by checking unwind_error() and !unwind_done().
Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can
determine at build time whether the function is implemented.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org> # for the x86 changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull timer fixes from Ingo Molnar:
"This includes a fix for lockups caused by incorrect nsecs related
cleanup, and a capabilities check fix for timerfd"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
timerfd: Only check CAP_WAKE_ALARM when it is needed
Pull scheduler fixes from Ingo Molnar:
"A fix for KVM's scheduler clock which (erroneously) was always marked
unstable, a fix for RT/DL load balancing, plus latency fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
sched/core: Fix pick_next_task() for RT,DL
sched/fair: Make select_idle_cpu() more aggressive
Pull perf fixes from Ingo Molnar:
"This includes a fix for a crash if certain special addresses are
kprobed, plus does a rename of two Kconfig variables that were a minor
misnomer"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS
kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
Pull locking fixes from Ingo Molnar:
- Change the new refcount_t warnings from WARN() to WARN_ONCE()
- two ww_mutex fixes
- plus a new lockdep self-consistency check for a bug that triggered in
practice
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/ww_mutex: Adjust the lock number for stress test
locking/lockdep: Add nest_lock integrity test
locking/ww_mutex: Replace cpu_relax() with cond_resched() for tests
locking/refcounts: Change WARN() to WARN_ONCE()
Pull namespace fix from Eric Biederman:
"This fixes a race between put_ucounts and get_ucounts that can cause a
use after free. The fix works by simplifying the code and so there is
not even a temptation to be clever and play spinlock vs atomic
reference games"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucount: Remove the atomicity from ucount->count
window. Namely powerpc broke as jump labels uses the two LSB bits as flags
in initialization. A check was added to make sure that all jump label
entries were 4 bytes aligned, but powerpc didn't work that way for modules.
Adding an alignment in the module linker script appeared to be the best
solution.
Jump labels also added an anonymous union to access those LSB bits as a
normal long. But because this structure had static initialization, it broke
older compilers that could not statically initialize anonymous unions
without brackets.
The command line parameter for setting function graph filter broke the
"EMPTY_HASH" descriptor by modifying it instead of creating a new hash to
hold the entries.
The command line parameter ftrace_graph_max_depth was added to allow its
setting at boot time. It uses existing code and only the command line hook
was added. This is not really a fix, but as it uses existing code without
affecting anything else, I added it to this release. It was ready before the
merge window closed, but I wanted to let it sit in linux-next for a couple
of days first.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJYvNrAFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
JGQIAMkayeZ0OCyYHRPR4EcCrdE3fATmt1huJWHrMPnT4/fLabL8XQqrOpnOBMq1
GFZb1SMkBmvGtAHF4GbvCxnIUfDQko6BTQAd8EMea1WM8+Kb66/BLgJawjWIU9I0
dNYre9ONgR2NOzkz6nfKRXnmy0lRcOweBb09YYGSzY11Md7d8T3T4TUrPNZdYrO9
8ZMbF4qRd9KLMRHcsWqvhWhBISxWnmtUSlthfweukKgDMy8OKpb7pR0ckjtYwsWX
RF41jqLqzSUqtd/nE2Sj/aT8XOP4pfrKEUuNM4SBj8q5jmNcZuqi8Q9wItu3LWR2
jqM/9UKTzaCr9cchwuvUC0i+jWc=
=kDql
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"There was some breakage with the changes for jump labels in the 4.11
merge window:
- powerpc broke as jump labels uses the two LSB bits as flags in
initialization.
A check was added to make sure that all jump label entries were 4
bytes aligned, but powerpc didn't work that way for modules. Adding
an alignment in the module linker script appeared to be the best
solution.
- Jump labels also added an anonymous union to access those LSB bits
as a normal long. But because this structure had static
initialization, it broke older compilers that could not statically
initialize anonymous unions without brackets.
- The command line parameter for setting function graph filter broke
the "EMPTY_HASH" descriptor by modifying it instead of creating a
new hash to hold the entries.
- The command line parameter ftrace_graph_max_depth was added to
allow its setting at boot time. It uses existing code and only the
command line hook was added.
This is not really a fix, but as it uses existing code without
affecting anything else, I added it to this release. It was ready
before the merge window closed, but I wanted to let it sit in
linux-next for a couple of days first"
* tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/graph: Add ftrace_graph_max_depth kernel parameter
tracing: Add #undef to fix compile error
jump_label: Add comment about initialization order for anonymous unions
jump_label: Fix anonymous union initialization
module: set __jump_table alignment to 8
ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter
tracing: Fix code comment for ftrace_ops_get_func()
commit 93825f2ec7 converted NSEC_PER_SEC to TICK_NSEC because the author
confused NSEC_PER_JIFFY with NSEC_PER_SEC.
As a result, the calculation of refined jiffies got broken, triggering
lockups.
Fixes: 93825f2ec7 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1488880534-3777-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
New features:
- Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis)
E.g.:
# perf report -s symbol_size,symbol
Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623
Overhead Symbol size Symbol
14.55% 326 [k] flush_tlb_mm_range
7.20% 1045 [k] filemap_map_pages
5.82% 124 [k] vma_interval_tree_insert
5.18% 2430 [k] unmap_page_range
2.57% 571 [k] vma_interval_tree_remove
1.94% 494 [k] page_add_file_rmap
1.82% 740 [k] page_remove_rmap
1.66% 1017 [k] release_pages
1.57% 1636 [k] update_blocked_averages
1.57% 76 [k] unlock_page
- Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim)
Change in behaviour:
- Make system wide (-a) the default option if no target was specified and one
of following conditions is met:
- No workload specified (current behaviour)
- A workload is specified but all requested events are system wide ones,
like uncore ones. (Jiri Olsa)
Fixes:
- Add missing initialization to the instruction decoder used in the
intel PT/BTS code, which was causing lots of failures in 'perf test',
looking for a value when there was none (Adrian Hunter)
Infrastructure:
- Add arch code needed to adopt the kernel's refcount_t to aid in
catching bugs when using atomic_t as a reference counter, basically
cmpxchg related functions (Arnaldo Carvalho de Melo)
- Convert the code using atomic_t as reference counts to refcount_t
(Elena Rashetova)
- Add feature test for sched_getcpu() to more easily check for its
presence in the many libc implementations and accross different
versions of such C libraries (Arnaldo Carvalho de Melo)
- Issue a HW watchdog disable hint in 'perf stat' for when some of the
requested events can't get counted because a PMU counter is taken by that
watchdog (Borislav Petkov).
- Add mapping for Intel's KnightsMill PMU events (Karol Wachowski)
Documentation:
- Clarify the term 'convergence' in:
perf bench numa numa-mem -h --show_convergence (Jiri Olsa)
Kernel code:
- Ensure probe location is at function entry in kretprobes (Naveen N. Rao)
- Allow return probes with offsets and absolute addresses (Naveen N. Rao)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYvbmTAAoJENZQFvNTUqpAN80QAJ2ETcTosR9fo06VrT2HqRT4
+iGe55wSu261TOekIkXOEW+ww321eNPfy4rIZeLCEFcCd9p03n5JceVbFnOjBuAz
Lk6jrKpaH+Ajp56nCLyWH4r3LYLJXdoIydNay4PZ08rl0GgagGqvevD8ZZCEO0sx
vjD1TFH2uSOq3UTxKapO++FHwhy+XqZ5S5I+rMuLxg6Qi+rLubXDztzIlcCfQPGx
g+zFkaJ/ms9TAtWK25xoj34QXsaqpBsF8qkCE1P8Zdjtnkp6zM2Rx3HvvbRDmgVx
/h0b1iua5IVElgnai/84ttJG3Bi6ovRbf/PFy+IceM4Qfx0eQeWmA3CAtcGOh9Gv
GTDCcJ7xWZBpM0g1wCk3ks2oApFTA6GkcnIt5alhTse5U3gNmImv3uvuN8d265KL
2oGKps7MH1nWMgpL4G4BNuZg2oqmM/uX9ERiuNjtCqj6WoHy2QSDyEMJN5Od3lYj
ar2PPGofHmiacsW3NNMT+LwQ/wL/d2dVfZTopMafeaxRDGTxdqkhLkB/sZT/wexQ
ySVijQPO+x0eLSIK/BWdEmD8K6JiYGdpIRDWVW+D043I7iiXFvPgui1bFABJEqn5
mZFa1qT4EQMuuSaLkkxtoOrdoF6YzJJA57sIx2IrouGDapJ2BDegiUOfE0PSp5l0
oeRuFcJYfpITC/TzntE3
=h2yo
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-4.11-20170306' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
New features:
- Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis)
E.g.:
# perf report -s symbol_size,symbol
Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623
Overhead Symbol size Symbol
14.55% 326 [k] flush_tlb_mm_range
7.20% 1045 [k] filemap_map_pages
5.82% 124 [k] vma_interval_tree_insert
5.18% 2430 [k] unmap_page_range
2.57% 571 [k] vma_interval_tree_remove
1.94% 494 [k] page_add_file_rmap
1.82% 740 [k] page_remove_rmap
1.66% 1017 [k] release_pages
1.57% 1636 [k] update_blocked_averages
1.57% 76 [k] unlock_page
- Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim)
Change in behaviour:
- Make system wide (-a) the default option if no target was specified and one
of following conditions is met:
- No workload specified (current behaviour)
- A workload is specified but all requested events are system wide ones,
like uncore ones. (Jiri Olsa)
Fixes:
- Add missing initialization to the instruction decoder used in the
intel PT/BTS code, which was causing lots of failures in 'perf test',
looking for a value when there was none (Adrian Hunter)
Infrastructure changes:
- Add arch code needed to adopt the kernel's refcount_t to aid in
catching bugs when using atomic_t as a reference counter, basically
cmpxchg related functions (Arnaldo Carvalho de Melo)
- Convert the code using atomic_t as reference counts to refcount_t
(Elena Rashetova)
- Add feature test for sched_getcpu() to more easily check for its
presence in the many libc implementations and accross different
versions of such C libraries (Arnaldo Carvalho de Melo)
- Issue a HW watchdog disable hint in 'perf stat' for when some of the
requested events can't get counted because a PMU counter is taken by that
watchdog (Borislav Petkov).
- Add mapping for Intel's KnightsMill PMU events (Karol Wachowski)
Documentation changes:
- Clarify the term 'convergence' in:
perf bench numa numa-mem -h --show_convergence (Jiri Olsa)
Kernel code changes:
- Ensure probe location is at function entry in kretprobes (Naveen N. Rao)
- Allow return probes with offsets and absolute addresses (Naveen N. Rao)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Always increment/decrement ucount->count under the ucounts_lock. The
increments are there already and moving the decrements there means the
locking logic of the code is simpler. This simplification in the
locking logic fixes a race between put_ucounts and get_ucounts that
could result in a use-after-free because the count could go zero then
be found by get_ucounts and then be freed by put_ucounts.
A bug presumably this one was found by a combination of syzkaller and
KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
spotted the race in the code.
Cc: stable@vger.kernel.org
Fixes: f6b2db1a3e ("userns: Make the count of user namespaces per user")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender. This actually happened with smc.
__queue_delayed_work() already does several input sanity checks
synchronously. Add NULL @wq check.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk
Signed-off-by: Tejun Heo <tj@kernel.org>
As found in grsecurity, this avoids exposing a kernel pointer through
the cgroup debug entries.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
pids_can_fork() is special in that the css association is guaranteed
to be stable throughout the function and thus doesn't need RCU
protection around task_css access. When determining the css to charge
the pid, task_css_check() is used to override the RCU sanity check.
While adding a warning message on fork rejection from pids limit,
135b8b37bd ("cgroup: Add pids controller event when fork fails
because of pid limit") incorrectly added a task_css access which is
neither RCU protected or explicitly annotated. This triggers the
following suspicious RCU usage warning when RCU debugging is enabled.
cgroup: fork rejected by pids controller in
===============================
[ ERR: suspicious RCU usage. ]
4.10.0-work+ #1 Not tainted
-------------------------------
./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
1 lock held by bash/1748:
#0: (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0
stack backtrace:
CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
Call Trace:
dump_stack+0x68/0x93
lockdep_rcu_suspicious+0xd7/0x110
pids_can_fork+0x1c7/0x1d0
cgroup_can_fork+0x67/0xc0
copy_process.part.58+0x1709/0x1e90
_do_fork+0xe6/0x6e0
SyS_clone+0x19/0x20
do_syscall_64+0x5c/0x140
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f7853fab93a
RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700
R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4
R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d
/asdf
There's no reason to dereference task_css again here when the
associated css is already available. Fix it by replacing the
task_cgroup() call with css->cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Fixes: 135b8b37bd ("cgroup: Add pids controller event when fork fails because of pid limit")
Cc: Kenny Yu <kennyyu@fb.com>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Tejun Heo <tj@kernel.org>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When SELinux was first added to the kernel, a process could only get
and set its own resource limits via getrlimit(2) and setrlimit(2), so no
MAC checks were required for those operations, and thus no security hooks
were defined for them. Later, SELinux introduced a hook for setlimit(2)
with a check if the hard limit was being changed in order to be able to
rely on the hard limit value as a safe reset point upon context
transitions.
Later on, when prlimit(2) was added to the kernel with the ability to get
or set resource limits (hard or soft) of another process, LSM/SELinux was
not updated other than to pass the target process to the setrlimit hook.
This resulted in incomplete control over both getting and setting the
resource limits of another process.
Add a new security_task_prlimit() hook to the check_prlimit_permission()
function to provide complete mediation. The hook is only called when
acting on another task, and only if the existing DAC/capability checks
would allow access. Pass flags down to the hook to indicate whether the
prlimit(2) call will read, write, or both read and write the resource
limits of the target process.
The existing security_task_setrlimit() hook is left alone; it continues
to serve a purpose in supporting the ability to make decisions based on
the old and/or new resource limit values when setting limits. This
is consistent with the DAC/capability logic, where
check_prlimit_permission() performs generic DAC/capability checks for
acting on another task, while do_prlimit() performs a capability check
based on a comparison of the old and new resource limits. Fix the
inline documentation for the hook to match the code.
Implement the new hook for SELinux. For setting resource limits, we
reuse the existing setrlimit permission. Note that this does overload
the setrlimit permission to mean the ability to set the resource limit
(soft or hard) of another process or the ability to change one's own
hard limit. For getting resource limits, a new getrlimit permission
is defined. This was not originally defined since getrlimit(2) could
only be used to obtain a process' own limits.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <james.l.morris@oracle.com>
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.
Move cached_raw_freq to struct sugov_policy.
Fixes: 5cbea46984 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking fixes from David Miller:
1) Fix double-free in batman-adv, from Sven Eckelmann.
2) Fix packet stats for fast-RX path, from Joannes Berg.
3) Netfilter's ip_route_me_harder() doesn't handle request sockets
properly, fix from Florian Westphal.
4) Fix sendmsg deadlock in rxrpc, from David Howells.
5) Add missing RCU locking to transport hashtable scan, from Xin Long.
6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.
7) Fix race in NAPI handling between poll handlers and busy polling,
from Eric Dumazet.
8) TX path in vxlan and geneve need proper RCU locking, from Jakub
Kicinski.
9) SYN processing in DCCP and TCP need to disable BH, from Eric
Dumazet.
10) Properly handle net_enable_timestamp() being invoked from IRQ
context, also from Eric Dumazet.
11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.
12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
Melo.
13) Fix use-after-free in netvsc driver, from Dexuan Cui.
14) Fix max MTU setting in bonding driver, from WANG Cong.
15) xen-netback hash table can be allocated from softirq context, so use
GFP_ATOMIC. From Anoob Soman.
16) Fix MAC address change bug in bgmac driver, from Hari Vyas.
17) strparser needs to destroy strp_wq on module exit, from WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
strparser: destroy workqueue on module exit
sfc: fix IPID endianness in TSOv2
sfc: avoid max() in array size
rds: remove unnecessary returned value check
rxrpc: Fix potential NULL-pointer exception
nfp: correct DMA direction in XDP DMA sync
nfp: don't tell FW about the reserved buffer space
net: ethernet: bgmac: mac address change bug
net: ethernet: bgmac: init sequence bug
xen-netback: don't vfree() queues under spinlock
xen-netback: keep a local pointer for vif in backend_disconnect()
netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
netfilter: nf_conntrack_sip: fix wrong memory initialisation
can: flexcan: fix typo in comment
can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
can: gs_usb: fix coding style
can: gs_usb: Don't use stack memory for USB transfers
ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
ixgbe: update the rss key on h/w, when ethtool ask for it
...
Let's not remove the warning about offsets and return probes when the
offset is invalid.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20170227115204.00f92846@gandalf.local.home
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since the kernel includes many non-global functions with same names, we
will need to use offsets from other symbols (typically _text/_stext) or
absolute addresses to place return probes on specific functions. Also,
the core register_kretprobe() API never forbid use of offsets or
absolute addresses with kretprobes.
Allow its use with the trace infrastructure. To distinguish kernels that
support this, update ftrace README to explicitly call this out.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/183e7ce2921a08c9c755ee9a5da3134febc6695b.1487770934.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
kretprobes can be registered by specifying an absolute address or by
specifying offset to a symbol. However, we need to ensure this falls at
function entry so as to be able to determine the return address.
Validate the same during kretprobe registration. By default, there
should not be any offset from a function entry, as determined through a
kallsyms_lookup(). Introduce arch_function_offset_within_entry() as a
way for architectures to override this.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/f1583bc4839a3862cfc2acefcc56f9c8837fa2ba.1487770934.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.
After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.
Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.
I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.
I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"
* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...
Early trace callgraphs can be extremely large on systems with
several seconds of boot time. The max_depth parameter limits how
deep the graph trace goes and reduces the output size. This
parameter is the same as the max_graph_depth file in tracefs.
Link: http://lkml.kernel.org/r/1488499935-23216-1-git-send-email-todd.e.brandt@linux.intel.com
Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
[ changed comments about debugfs to tracefs ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
On boot up, if the kernel command line sets a graph funtion with the kernel
command line options "ftrace_graph_filter" or "ftrace_graph_notrace" then it
updates the corresponding function graph hash, ftrace_graph_hash or
ftrace_graph_notrace_hash respectively. Unfortunately, at boot up, these
variables are pointers to the "EMPTY_HASH" which is a constant used as a
placeholder when a hash has no entities. The problem was that the comand
line version to set the hashes updated the actual EMPTY_HASH instead of
creating a new hash for the function graph. This broke the EMPTY_HASH
because not only did it modify a constant (not sure how that was allowed to
happen, except maybe because it was done at early boot, const variables were
still mutable), but it made the filters have functions listed in them when
they were actually empty.
The kernel command line function needs to allocate a new hash for the
function graph filters and assign the necessary variables to that new hash
instead.
Link: http://lkml.kernel.org/r/1488420091.7212.17.camel@linux.intel.com
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that
did not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature
added recently that did not reinitialize P-state limits properly
when switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases
of system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYuLeGAAoJEILEb/54YlRxJvcP/0BRmh8Hn4Itx/NIWNg71X6j
U+v8Pn8T3MP33gCcleLYlgre2JUIAUDmhdK99+UOx+/abhjMhQSaF3HhTOwYaPtQ
6njoHVS0NnfqUf+x5kp+EpRxBVNucYVbdRTVd1DsHIeLLz/96DFOzb/R7tko/pKx
pFMWvNdotHLLgXOG1UvdRimwTDlFMffxFzD8Se53LPjRXS0S73A5VWfqZOye44Re
j3W1AJ0Idgq5uduA6J8x1MWbaxDq1h+j6CSUm05yvqrINzxXwXt0Hv6stCQTo+Gb
YMdiBd8MujNyAgcchw3jiDQ8Vp+zmfLPcHrfPe//SSefj26eB8LyVNSYelvbUdOz
cNjvyErva37MmaegCL9QC7WbLM+A7VE6bU6YzDCi/rR8jYMJ51Fb9jGiYb/oimry
OLlblEekikUsskWv4hGV1JVt5VhmUMlagWtexxn+lMszATcZro0tfXu/vgQWksYs
noUnwuWJWxvj2aNMsvbzW3HLlTGSmYl2UxJ7IymQQaTDblwF9Kg61rm3+5coUctd
ifceynDVp9Gju25faYgZ+Dq9+o8ktlOGOHRRPdLIRNJ/T+4tUDnlGkdbPb+Tfn03
XUIzYCu74U8/oW8gOk6t0WpmWzvxEXNgdirdEIR6y3loYIC0Jr3v4gyD975Eug74
Hzfrdg7ignAmWV+nf6UY
=SeUF
-----END PGP SIGNATURE-----
Merge tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates deom Rafael Wysocki:
"These fix two bugs introduced by recent power management updates (in
the cpuidle menu governor and intel_pstate) and a few other issues,
clean up things and remove unused code.
Specifics:
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that did
not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature added
recently that did not reinitialize P-state limits properly when
switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases of
system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet)"
* tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / OPP: Documentation: Fix opp-microvolt in examples
intel_idle: stop exposing platform acronyms in sysfs
cpufreq: intel_pstate: Fix limits issue with operation mode switching
PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
PM / hibernate: Untangle power_down()
cpuidle: menu: Avoid taking spinlock for accessing QoS values
PM / QoS: Remove global notifiers
PM / runtime: Fix some typos
cpufreq: qoriq: clean up unused code
PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
PM / Domains: Power off masters immediately in the power off sequence
PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
PM / Domains: Move genpd_power_off() above genpd_power_on()
It's not used by any of the scheduler methods, but <linux/sched/task_stack.h>
needs it to pick up STACK_END_MAGIC.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a stray header that is not needed by anything in sched.h,
so remove it.
Update files that relied on the stray inclusion.
This reduces the size of the header dependency graph.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move cputime related functionality out of <linux/sched.h>, as most code
that includes <linux/sched.h> does not use that functionality.
Move data types that are not included in task_struct directly to
the signal definitions, into <linux/sched/signal.h>.
Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The task_lock()/task_unlock() APIs are not realated to core scheduling,
they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>.
Move them.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move rcu_copy_process() into kernel/fork.c, which is the only
user of this inline function.
This simplifies <linux/sched/task.h> to the level that <linux/sched.h>
does not have to be included in it anymore - which change is done
in a subsequent patch.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_struct::signal and task_struct::sighand are pointers, which would normally make it
straightforward to not define those types in sched.h.
That is not so, because the types are accompanied by a myriad of APIs (macros and inline
functions) that dereference them.
Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.
With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
trying to put accessors into sched.h as a test fails the following way:
./include/linux/sched.h: In function ‘test_signal_types’:
./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
^
This reduces the size and complexity of sched.h significantly.
Update all headers and .c code that relied on getting the signal handling
functionality from <linux/sched.h> to include <linux/sched/signal.h>.
The list of affected files in the preparatory patch was partly generated by
grepping for the APIs, and partly by doing coverage build testing, both
all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
cross-architecture builds.
Nevertheless some (trivial) build breakage is still expected related to rare
Kconfig combinations and in-flight patches to various kernel code, but most
of it should be handled by this patch.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because there are only 12 bits in held_lock::references, so we only
support 4095 nested lock held in the same time, adjust the lock number
for ww_mutex stress test to kill one lockdep splat:
[ ] [ BUG: bad unlock balance detected! ]
[ ] kworker/u2:0/5 is trying to release lock (ww_class_mutex) at:
[ ] ww_mutex_unlock()
[ ] but there are no more locks to release!
...
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170301150138.hdixnmafzfsox7nn@tardis.cn.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Boqun reported that hlock->references can overflow. Add a debug test
for that to generate a clear error when this happens.
Without this, lockdep is likely to report a mysterious failure on
unlock.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When busy-spinning on a ww_mutex_trylock(), we depend upon the other
thread advancing and releasing the lock. This can not happen on a single
CPU unless we relinquish it:
[ ] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:18]
...
[ ] Call Trace:
[ ] mutex_trylock()
[ ] test_mutex_work+0x31/0x56
[ ] process_one_work+0x1b4/0x2f9
[ ] worker_thread+0x1b0/0x27c
[ ] kthread+0xd1/0xd3
[ ] ret_from_fork+0x19/0x30
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f2a5fec173 ("locking/ww_mutex: Begin kselftests for ww_mutex")
Link: http://lkml.kernel.org/r/20170228094011.2595-1-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pavan noticed that the following commit:
49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")
... broke RT,DL balancing by robbing them of the opportinty to do new-'idle'
balancing when their last runnable task (on that runqueue) goes away.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kitsunyan reported desktop latency issues on his Celeron 887 because
of commit:
1b568f0aab ("sched/core: Optimize SCHED_SMT")
... even though his CPU doesn't do SMT.
The effect of running the SMT code on a !SMT part is basically a more
aggressive select_idle_cpu(). Removing the avg condition fixed things
for him.
I also know FB likes this test gone, even though other workloads like
having it.
For now, take it out by default, until we get a better idea.
Reported-by: kitsunyan <kitsunyan@inbox.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update usage sites with the new header dependency.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first introduce a trivial header and update usage sites.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update the usage site.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update the usage sites with the new header dependency.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update usage sites with the new header dependency.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.
Update all code that relies on these facilities.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update the code that uses these facilities with the
new header.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.
But first update code that relied on the implicit header inclusion.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update code that relied on sched.h including various MM types for them.
This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of including the full <linux/signal.h>, we are going to include the
types-only <linux/signal_types.h> header in <linux/sched.h>, to further
decouple the scheduler header from the signal headers.
This means that various files which relied on the full <linux/signal.h> need
to be updated to gain an explicit dependency on it.
Update the code that relies on sched.h's inclusion of the <linux/signal.h> header.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/hotplug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/hotplug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/nohz.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/nohz.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix up missing #includes in other places that rely on sched.h doing that for them.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.
Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent header reorganizations unearthed this hidden dependency:
kernel/sched/core.c:199:25: error: 'paravirt_steal_rq_enabled' undeclared (first use in this function)
kernel/sched/core.c:200:11: error: implicit declaration of function 'paravirt_steal_clock' [-Werror=implicit-function-declaration]
So move the asm/paravirt.h include from kernel/sched/cpuclock.c to kernel/sched/sched.h.
( NOTE: We do this change before doing the changes that introduce the build failure,
so the series remains fully bisectable. )
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move softlockup APIs out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
<linux/nmi.h> already includes <linux/sched.h>.
Include the <linux/nmi.h> header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/user.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/coredump.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/autogroup.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/autogroup.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.
Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.
Create empty placeholder header that maps to <linux/types.h>.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.
Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/idle.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/idle.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The <linux/sched.h> header includes various vmacache related defines,
which are arguably misplaced.
Move them to mm_types.h and minimize the sched.h impact by putting
all task vmacache state into a new 'struct vmacache' structure.
No change in functionality.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
threadgroup_change_begin()/end() is a pointless wrapper around
cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
in the !CONFIG_CGROUPS=y case.
Remove the wrappery, move the might_sleep() (the down_read()
already has a might_sleep() check).
This debloats <linux/sched.h> a bit and simplifies this API.
Update all call sites.
No change in functionality.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.
Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.
Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.
( For rcutiny this means that two dependent APIs have to be uninlined,
but that shouldn't be much of a problem as they are rare variants. )
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tsk_nr_cpus_allowed() too is a pretty pointless wrapper that
is not used consistently and which makes the code both harder
to read and longer as well.
So remove it - this also shrinks <linux/sched.h> a bit.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...
Also, the wrapper makes the code longer than the original expression!
So just get rid of it. This also shrinks <linux/sched.h> a bit.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's defined in <linux/sched.h>, but nothing outside the scheduler
uses it - so move it to the sched/core.c usage site.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The length of TASK_STATE_TO_CHAR_STR was still checked using the old
link-time manual error method - convert it to BUILD_BUG_ON(). This
has a couple of advantages:
- it's more obvious what's going on
- it reduces the size and complexity of <linux/sched.h>
- BUILD_BUG_ON() will fail during compilation, with a clearer
error message than the link time assert.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 07016151a4 ("bpf, verifier: further improve search
pruning") increased the limit of processed instructions from
32k to 64k, but the comment still mentioned the 32k limit.
This commit updates the comment to reflect the change.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Gary Lin <glin@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have uses of CONFIG_UPROBE_EVENT and CONFIG_KPROBE_EVENT as
well as CONFIG_UPROBE_EVENTS and CONFIG_KPROBE_EVENTS.
Consistently use the plurals.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: davem@davemloft.net
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20170216060050.20866-1-anton@ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Two rq-clock warnings related fixes, plus a cgroups related crash fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cgroup: Move sched_online_group() back into css_online() to fix crash
sched/fair: Update rq clock before changing a task's CPU affinity
sched/core: Fix update_rq_clock() splat on hotplug (and suspend/resume)
Pull cgroup updates from Tejun Heo:
"Several noteworthy changes.
- Parav's rdma controller is finally merged. It is very straight
forward and can limit the abosolute numbers of common rdma
constructs used by different cgroups.
- kernel/cgroup.c got too chubby and disorganized. Created
kernel/cgroup/ subdirectory and moved all cgroup related files
under kernel/ there and reorganized the core code. This hurts for
backporting patches but was long overdue.
- cgroup v2 process listing reimplemented so that it no longer
depends on allocating a buffer large enough to cache the entire
result to sort and uniq the output. v2 has always mangled the sort
order to ensure that users don't depend on the sorted output, so
this shouldn't surprise anybody. This makes the pid listing
functions use the same iterators that are used internally, which
have to have the same iterating capabilities anyway.
- perf cgroup filtering now works automatically on cgroup v2. This
patch was posted a long time ago but somehow fell through the
cracks.
- misc fixes asnd documentation updates"
* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
kernfs: fix locking around kernfs_ops->release() callback
cgroup: drop the matching uid requirement on migration for cgroup v2
cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
cgroup: misc cleanups
cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
cgroup: track migration context in cgroup_mgctx
cgroup: cosmetic update to cgroup_taskset_add()
rdmacg: Fixed uninitialized current resource usage
cgroup: Add missing cgroup-v2 PID controller documentation.
rdmacg: Added documentation for rdmacg
IB/core: added support to use rdma cgroup controller
rdmacg: Added rdma cgroup controller
cgroup: fix a comment typo
cgroup: fix RCU related sparse warnings
cgroup: move namespace code to kernel/cgroup/namespace.c
cgroup: rename functions for consistency
cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
cgroup: separate out cgroup1_kf_syscall_ops
cgroup: refactor mount path and clearly distinguish v1 and v2 paths
cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
...
We already have the helper, we can convert the rest of the kernel
mechanically using:
git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'
This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.
Link: http://lkml.kernel.org/r/20161218123229.22952-3-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:
git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'
This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.
(Michal Hocko provided most of the kerneldoc comment.)
Link: http://lkml.kernel.org/r/20161218123229.22952-2-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>