This patch adds helper bpf_perf_prog_read_cvalue for perf event based bpf
programs, to read event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
The typical use case for perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get scaling factor
between two bpf invocations, users can can save the time values in a map,
and use the value from the map and the current value to calculate
the scaling factor.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.
Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
normalized_num_samples = num_samples * time_enabled / time_running
normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.
This patch adds helper bpf_perf_event_read_value for kprobed based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve scaling factor between two bpf invocations, users
can can use cpu_id as the key (which is typical for perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does not impact existing functionalities.
It contains the changes in perf event area needed for
subsequent bpf_perf_event_read_value and
bpf_perf_prog_read_value helpers.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a code ordering issue in the main suspend-to-idle loop
that causes some "low power S0 idle" conditions to be incorrectly
reported as unmet with suspend/resume debug messages enabled.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZ1rGKAAoJEILEb/54YlRxONQP/0NzL79PrxFgtbsnhZZfDZ+U
os6pcNYl9J+n2YVq5NAP9nnqEhN3E2gctJwjYMRIuJC/g6uU8Z6Ym6h7D+QfZrv1
yzEsbQgLh8N9nR+lUGRi+meoF8BolOLeXnNgB18uP1ZShZLikvAELxMkmJUx+TjW
CVQaOkXe4I/Ey5O4Jjur+tgVn+ik+xl40akw7+wHAnY+I7KwzLffP7nwHBGavHsd
dCtbcRogWzpcihgpLpgJMaixjZXakJ2n/Zmg+IdpnYt9WRIMy2ztTu3+bPRPEVVJ
hcP9p93r1BZclkyyYNyM0QMv4Ac96xBOe8qigXP/9EdtWYHLeV+N+N+EFbeutHgn
LsiuO4h9FCrv4ltn7jm88sTyna0IpuKSva9pWk9nopI8e5r/Yplvi00ZJW1lz/B7
xrPWHdm6ozuK7UGQQoeeb5CVOuClTCBC3PFTw/XTD0emogjI+xJi802zIG00eRPm
VBAvMilLeS1ezegVWJte6wPptF3wUW2Ss6jYzh4zVgnJ5XKgfxbANhLeQizAhGFj
lxxW9/MS1APxTrV9VluY7zyqq/s9H1iWUMyf2H06zjetBfKPiqEn+oL4/k4rnYN6
SB4WrT8CwWxsDusMxN0aVjfftRwnGQ1+kPX3EP7mYQFP+zwiK44iilL7D37Rdyoe
Z8Hdmdf/sCORK3zXZWd8
=6lm4
-----END PGP SIGNATURE-----
Merge tag 'pm-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes a code ordering issue in the main suspend-to-idle loop that
causes some "low power S0 idle" conditions to be incorrectly reported
as unmet with suspend/resume debug messages enabled"
* tag 'pm-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / s2idle: Invoke the ->wake() platform callback earlier
Pull networking fixes from David Miller:
1) Check iwlwifi 9000 reorder buffer out-of-space condition properly,
from Sara Sharon.
2) Fix RCU splat in qualcomm rmnet driver, from Subash Abhinov
Kasiviswanathan.
3) Fix session and tunnel release races in l2tp, from Guillaume Nault
and Sabrina Dubroca.
4) Fix endian bug in sctp_diag_dump(), from Dan Carpenter.
5) Several mlx5 driver fixes from the Mellanox folks (max flow counters
cap check, invalid memory access in IPoIB support, etc.)
6) tun_get_user() should bail if skb->len is zero, from Alexander
Potapenko.
7) Fix RCU lookups in inetpeer, from Eric Dumazet.
8) Fix locking in packet_do_bund().
9) Handle cb->start() error properly in netlink dump code, from Jason
A. Donenfeld.
10) Handle multicast properly in UDP socket early demux code. From Paolo
Abeni.
11) Several erspan bug fixes in ip_gre, from Xin Long.
12) Fix use-after-free in socket filter code, in order to handle the
fact that listener lock is no longer taken during the three-way TCP
handshake. From Eric Dumazet.
13) Fix infoleak in RTM_GETSTATS, from Nikolay Aleksandrov.
14) Fix tail call generation in x86-64 BPF JIT, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: 8021q: skip packets if the vlan is down
bpf: fix bpf_tail_call() x64 JIT
net: stmmac: dwmac-rk: Add RK3128 GMAC support
rndis_host: support Novatel Verizon USB730L
net: rtnetlink: fix info leak in RTM_GETSTATS call
socket, bpf: fix possible use after free
mlxsw: spectrum_router: Track RIF of IPIP next hops
mlxsw: spectrum_router: Move VRF refcounting
net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'
net: mvpp2: Fix clock resource by adding an optional bus clock
r8152: add Linksys USB3GIGV1 id
l2tp: fix l2tp_eth module loading
ip_gre: erspan device should keep dst
ip_gre: set tunnel hlen properly in erspan_tunnel_init
ip_gre: check packet length and mtu correctly in erspan_xmit
ip_gre: get key from session_id correctly in erspan_rcv
tipc: use only positive error codes in messages
ppp: fix __percpu annotation
udp: perform source validation for mcast early demux
IPv4: early demux can return an error code
...
with addition of tnum logic the verifier got smart enough and
we can enforce return codes at program load time.
For now do so for cgroup-bpf program types.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce BPF_PROG_QUERY command to retrieve a set of either
attached programs to given cgroup or a set of effective programs
that will execute for events within a cgroup
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.
The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
that cgroup program gets run in addition to the program in this cgroup.
NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.
Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.
To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge misc fixes from Andrew Morton:
"A lot of stuff, sorry about that. A week on a beach, then a bunch of
time catching up then more time letting it bake in -next. Shan't do
that again!"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits)
include/linux/fs.h: fix comment about struct address_space
checkpatch: fix ignoring cover-letter logic
m32r: fix build failure
lib/ratelimit.c: use deferred printk() version
kernel/params.c: improve STANDARD_PARAM_DEF readability
kernel/params.c: fix an overflow in param_attr_show
kernel/params.c: fix the maximum length in param_get_string
mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
kernel/kcmp.c: drop branch leftover typo
memremap: add scheduling point to devm_memremap_pages
mm, page_alloc: add scheduling point to memmap_init_zone
mm, memory_hotplug: add scheduling point to __add_pages
lib/idr.c: fix comment for idr_replace()
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
lib/lz4: make arrays static const, reduces object code size
exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
...
- A memory fix with left over code from spliting out ftrace_ops
and function graph tracer, where the function graph tracer could
reset the trampoline pointer, leaving the old trampoline not to
be freed (memory leak).
- The update to Paul's patch that added the unnecessary READ_ONCE().
This removes the unnecessary READ_ONCE() instead of having to rebase
the branch to update the patch that added it.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEQEw9Eu0DdyUUkuUUybkF8mrZjcsFAlnU++sUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQybkF8mrZjcujzgf/ebIzGKe5vQKNrL4ITAcIz0T7Hvzl
pWw4uJp8kqO9x9EHMnztAkltQigvjvgDKZozJpUGgtNsFLuvdgQSBMK24YV8vLHs
UmXEnQ2tSB/2Sg2ccEnpjVXaMzL9aqlbeTmACbdd9UgZnvPiUYPejq2jFfECFQjb
k/gZT911ukBtx4mXYKzGFbTEZHdc/YUs6Y/wzB1ox5BBIUh71ZDZXxQTUHfXHlwS
Cst69/9dKl4nBEGDGas6/95iR+ORVv85osI/pqPtjSj4EkRnWfVRotaH1kNuSQil
gDIHSoy35NfXJx77/5IFHfrjFBAkr0IYRNL/jZaWazwM7rdqfAN8TwMQuA==
=4CtF
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.14-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixlets from Steven Rostedt:
"Two updates:
- A memory fix with left over code from spliting out ftrace_ops and
function graph tracer, where the function graph tracer could reset
the trampoline pointer, leaving the old trampoline not to be freed
(memory leak).
- The update to Paul's patch that added the unnecessary READ_ONCE().
This removes the unnecessary READ_ONCE() instead of having to
rebase the branch to update the patch that added it"
* tag 'trace-v4.14-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}()
ftrace: Fix kmemleak in unregister_ftrace_graph
Function param_attr_show could overflow the buffer it is operating on.
The buffer size is PAGE_SIZE, and the string returned by
attribute->param->ops->get is generated by scnprintf(buffer, PAGE_SIZE,
...) so it could be PAGE_SIZE - 1 long, with the terminating '\0' at the
very end of the buffer. Calling strcat(..., "\n") on this isn't safe, as
the '\0' will be replaced by '\n' (OK) and then another '\0' will be added
past the end of the buffer (not OK.)
Simply add the trailing '\n' when writing the attribute contents to the
buffer originally. This is safe, and also faster.
Credits to Teradata for discovering this issue.
Link: http://lkml.kernel.org/r/20170928162602.60c379c7@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The length parameter of strlcpy() is supposed to reflect the size of the
target buffer, not of the source string. Harmless in this case as the
buffer is PAGE_SIZE long and the source string is always much shorter than
this, but conceptually wrong, so let's fix it.
Link: http://lkml.kernel.org/r/20170928162515.24846b4f@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The else branch been left over and escaped the source code refresh. Not
a problem but better clean it up.
Fixes: 0791e3644e ("kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files")
Link: http://lkml.kernel.org/r/20170917165838.GA1887@uranus.lan
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
devm_memremap_pages is initializing struct pages in for_each_device_pfn
and that can take quite some time. We have even seen a soft lockup
triggering on a non preemptive kernel
NMI watchdog: BUG: soft lockup - CPU#61 stuck for 22s! [kworker/u641:11:1808]
[...]
RIP: 0010:[<ffffffff8118b6b7>] [<ffffffff8118b6b7>] devm_memremap_pages+0x327/0x430
[...]
Call Trace:
pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
nvdimm_bus_probe+0x64/0x110 [libnvdimm]
driver_probe_device+0x1f7/0x420
bus_for_each_drv+0x52/0x80
__device_attach+0xb0/0x130
bus_probe_device+0x87/0xa0
device_add+0x3fc/0x5f0
nd_async_device_register+0xe/0x40 [libnvdimm]
async_run_entry_fn+0x43/0x150
process_one_work+0x14e/0x410
worker_thread+0x116/0x490
kthread+0xc7/0xe0
ret_from_fork+0x3f/0x70
fix this by adding cond_resched every 1024 pages.
Link: http://lkml.kernel.org/r/20170918121410.24466-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_proc_douintvec_conv() has two UINT_MAX checks, we can remove one.
This has no functional changes other than fixing a compiler warning:
kernel/sysctl.c:2190]: (warning) Identical condition '*lvalp>UINT_MAX', second condition is always false
Fixes: 4f2fec00af ("sysctl: simplify unsigned int support")
Link: http://lkml.kernel.org/r/20170919072918.12066-1-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reported-by: David Binderman <dcb314@hotmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop the global lru lock in isolate callback before calling
zap_page_range which calls cond_resched, and re-acquire the global lru
lock before returning. Also change return code to LRU_REMOVED_RETRY.
Use mmput_async when fail to acquire mmap sem in an atomic context.
Fix "BUG: sleeping function called from invalid context"
errors when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
Also restore mmput_async, which was initially introduced in commit
ec8d7c14ea ("mm, oom_reaper: do not mmput synchronously from the oom
reaper context"), and was removed in commit 2129258024 ("mm: oom: let
oom_reap_task and exit_mmap run concurrently").
Link: http://lkml.kernel.org/r/20170914182231.90908-1-sherryy@android.com
Fixes: f2517eb76f ("android: binder: Add global lru shrinker to binder")
Signed-off-by: Sherry Yang <sherryy@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Kyle Yan <kyan@codeaurora.org>
Acked-by: Arve Hjønnevåg <arve@android.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Martijn Coenen <maco@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Riley Andrews <riandrews@android.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hoeun Ryu <hoeun.ryu@gmail.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This parameter is named kp, so the documentation should use that.
Fixes: 9b473de872 ("param: Fix duplicate module prefixes")
Link: http://lkml.kernel.org/r/20170919142656.64aea59e@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- bpf prog_array just like all other types of bpf array accepts 32-bit index.
Clarify that in the comment.
- fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
- tighten corresponding check in the interpreter to stay consistent
The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
in commit 96eabe7a40 in 4.14. Before that the map_flags would stay zero and
though JIT code is wrong it will check bounds correctly.
Hence two fixes tags. All other JITs don't have this problem.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 96eabe7a40 ("bpf: Allow selecting numa node during map creation")
Fixes: b52f00e6a7 ("x86: bpf_jit: implement bpf_tail_call() helper")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup fix from Tejun Heo:
"The recent migration code updates assumed that migrations always
execute from the top to the bottom once and didn't clean up internal
states after each migration round; however, cgroup_transfer_tasks()
repeats the inner steps multiple times and the garbage internal states
from the previous iteration led to OOPS.
Waiman fixed the bug by reinitializing the relevant states at the end
of each migration round"
* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns
The read of ->dynticks_nmi_nesting in rcu_irq_enter() and rcu_irq_exit()
is currently protected with READ_ONCE(). However, this protection is
unnecessary because (1) ->dynticks_nmi_nesting is updated only by the
current CPU, (2) Although NMI handlers can update this field, they reset
it back to its old value before return, and (3) Interrupts are disabled,
so nothing else can modify it. The value of ->dynticks_nmi_nesting is
thus effectively constant, and so no protection is required.
This commit therefore removes the READ_ONCE() protection from these
two accesses.
Link: http://lkml.kernel.org/r/20170926031902.GA2074@linux.vnet.ibm.com
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The trampoline allocated by function tracer was overwriten by function_graph
tracer, and caused a memory leak. The save_global_trampoline should have
saved the previous trampoline in register_ftrace_graph() and restored it in
unregister_ftrace_graph(). But as it is implemented, save_global_trampoline was
only used in unregister_ftrace_graph as default value 0, and it overwrote the
previous trampoline's value. Causing the previous allocated trampoline to be
lost.
kmmeleak backtrace:
kmemleak_vmalloc+0x77/0xc0
__vmalloc_node_range+0x1b5/0x2c0
module_alloc+0x7c/0xd0
arch_ftrace_update_trampoline+0xb5/0x290
ftrace_startup+0x78/0x210
register_ftrace_function+0x8b/0xd0
function_trace_init+0x4f/0x80
tracing_set_tracer+0xe6/0x170
tracing_set_trace_write+0x90/0xd0
__vfs_write+0x37/0x170
vfs_write+0xb2/0x1b0
SyS_write+0x55/0xc0
do_syscall_64+0x67/0x180
return_from_SYSCALL_64+0x0/0x6a
[
Looking further into this, I found that this was left over from when the
function and function graph tracers shared the same ftrace_ops. But in
commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer
together"), the two were separated, and the save_global_trampoline no
longer was necessary (and it may have been broken back then too).
-- Steven Rostedt
]
Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com
Cc: stable@vger.kernel.org
Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together")
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull smp/hotplug fixes from Thomas Gleixner:
"This addresses the fallout of the new lockdep mechanism which covers
completions in the CPU hotplug code.
The lockdep splats are false positives, but there is no way to
annotate that reliably. The solution is to split the completions for
CPU up and down, which requires some reshuffling of the failure
rollback handling as well"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp/hotplug: Hotplug state fail injection
smp/hotplug: Differentiate the AP completion between up and down
smp/hotplug: Differentiate the AP-work lockdep class between up and down
smp/hotplug: Callback vs state-machine consistency
smp/hotplug: Rewrite AP state machine core
smp/hotplug: Allow external multi-instance rollback
smp/hotplug: Add state diagram
Pull scheduler fixes from Thomas Gleixner:
"The scheduler pull request comes with the following updates:
- Prevent a divide by zero issue by validating the input value of
sysctl_sched_time_avg
- Make task state printing consistent all over the place and have
explicit state characters for IDLE and PARKED so they wont be
displayed as 'D' state which confuses tools"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/sysctl: Check user input value of sysctl_sched_time_avg
sched/debug: Add explicit TASK_PARKED printing
sched/debug: Ignore TASK_IDLE for SysRq-W
sched/debug: Add explicit TASK_IDLE printing
sched/tracing: Use common task-state helpers
sched/tracing: Fix trace_sched_switch task-state printing
sched/debug: Remove unused variable
sched/debug: Convert TASK_state to hex
sched/debug: Implement consistent task-state printing
Pull perf fixes from Thomas Gleixner:
- Prevent a division by zero in the perf aux buffer handling
- Sync kernel headers with perf tool headers
- Fix a build failure in the syscalltbl code
- Make the debug messages of perf report --call-graph work correctly
- Make sure that all required perf files are in the MANIFEST for
container builds
- Fix the atrr.exclude kernel handling so it respects the
perf_event_paranoid and the user permissions
- Make perf test on s390x work correctly
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Only update ->aux_wakeup in non-overwrite mode
perf test: Fix vmlinux failure on s390x part 2
perf test: Fix vmlinux failure on s390x
perf tools: Fix syscalltbl build failure
perf report: Fix debug messages with --call-graph option
perf evsel: Fix attr.exclude_kernel setting for default cycles:p
tools include: Sync kernel ABI headers with tooling headers
perf tools: Get all of tools/{arch,include}/ in the MANIFEST
Pull locking fixes from Thomas Gleixner:
"Two fixes for locking:
- Plug a hole the pi_stat->owner serialization which was changed
recently and failed to fixup two usage sites.
- Prevent reordering of the rwsem_has_spinner() check vs the
decrement of rwsem count in up_write() which causes a missed
wakeup"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem-xadd: Fix missed wakeup due to reordering of load
futex: Fix pi_state->owner serialization
Pull irq fixes from Thomas Gleixner:
- Add a missing NULL pointer check in free_irq()
- Fix a memory leak/memory corruption in the generic irq chip
- Add missing rcu annotations for radix tree access
- Use ffs instead of fls when extracting data from a chip register in
the MIPS GIC irq driver
- Fix the unmasking of IPI interrupts in the MIPS GIC driver so they
end up at the target CPU and not at CPU0
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irq/generic-chip: Don't replace domain's name
irqdomain: Add __rcu annotations to radix tree accessors
irqchip/mips-gic: Use effective affinity to unmask
irqchip/mips-gic: Fix shifts to extract register fields
genirq: Check __free_irq() return value for NULL
This patch uses u64_to_user_ptr() to cast info.map_ids to a userspace ptr.
It also tags the user_map_ids with '__user' for sparse check.
Fixes: cb4d2b3f03 ("bpf: Add name, load_time, uid and map_ids to bpf_prog_info")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull waitid fix from Al Viro:
"Fix infoleak in waitid()"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix infoleak in waitid(2)
kernel_waitid() can return a PID, an error or 0. rusage is filled in the first
case and waitid(2) rusage should've been copied out exactly in that case, *not*
whenever kernel_waitid() has not returned an error. Compat variant shares that
braino; none of kernel_wait4() callers do, so the below ought to fix it.
Reported-and-tested-by: Alexander Potapenko <glider@google.com>
Fixes: ce72a16fa7 ("wait4(2)/waitid(2): separate copying rusage to userland")
Cc: stable@vger.kernel.org # v4.13
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
which results in undesirable clutter.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a spinner is present, there is a chance that the load of
rwsem_has_spinner() in rwsem_wake() can be reordered with
respect to decrement of rwsem count in __up_write() leading
to wakeup being missed:
spinning writer up_write caller
--------------- -----------------------
[S] osq_unlock() [L] osq
spin_lock(wait_lock)
sem->count=0xFFFFFFFF00000001
+0xFFFFFFFF00000000
count=sem->count
MB
sem->count=0xFFFFFFFE00000001
-0xFFFFFFFF00000001
spin_trylock(wait_lock)
return
rwsem_try_write_lock(count)
spin_unlock(wait_lock)
schedule()
Reordering of atomic_long_sub_return_release() in __up_write()
and rwsem_has_spinner() in rwsem_wake() can cause missing of
wakeup in up_write() context. In spinning writer, sem->count
and local variable count is 0XFFFFFFFE00000001. It would result
in rwsem_try_write_lock() failing to acquire rwsem and spinning
writer going to sleep in rwsem_down_write_failed().
The smp_rmb() will make sure that the spinner state is
consulted after sem->count is updated in up_write context.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: longman@redhat.com
Cc: parri.andrea@gmail.com
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
d9a50b0256 ("perf/aux: Ensure aux_wakeup represents most recent wakeup index")
changed the AUX wakeup position calculation to rounddown(), which causes
a division-by-zero in AUX overwrite mode (aka "snapshot mode").
The zero denominator results from the fact that perf record doesn't set
aux_watermark to anything, in which case the kernel will set it to half
the AUX buffer size, but only for non-overwrite mode. In the overwrite
mode aux_watermark stays zero.
The good news is that, AUX overwrite mode, wakeups don't happen and
related bookkeeping is not relevant, so we can simply forego the whole
wakeup updates.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20170906160811.16510-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch allows userspace to specify a name for a map
during BPF_MAP_CREATE.
The map's name can later be exported to user space
via BPF_OBJ_GET_INFO_BY_FD.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds name and load_time to struct bpf_prog_aux. They
are also exported to bpf_prog_info.
The bpf_prog's name is passed by userspace during BPF_PROG_LOAD.
The kernel only stores the first (BPF_PROG_NAME_LEN - 1) bytes
and the name stored in the kernel is always \0 terminated.
The kernel will reject name that contains characters other than
isalnum() and '_'. It will also reject name that is not null
terminated.
The existing 'user->uid' of the bpf_prog_aux is also exported to
the bpf_prog_info as created_by_uid.
The existing 'used_maps' of the bpf_prog_aux is exported to
the newly added members 'nr_map_ids' and 'map_ids' of
the bpf_prog_info. On the input, nr_map_ids tells how
big the userspace's map_ids buffer is. On the output,
nr_map_ids tells the exact user_map_cnt and it will only
copy up to the userspace's map_ids buffer is allowed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The role of the ->wake() platform callback for suspend-to-idle is to
deal with possible spurious wakeups, among other things. The ACPI
implementation of it, acpi_s2idle_wake(), additionally checks the
conditions for entering the Low Power S0 Idle state by the platform
and reports the ones that have not been met.
However, the ->wake() platform callback is invoked after calling
dpm_noirq_resume_devices(), which means that the power states of some
devices may have changed since s2idle_enter() returned, so some unmet
Low Power S0 Idle conditions may be reported incorrectly as a result
of that.
To avoid these false positives, reorder the invocations of the
dpm_noirq_resume_devices() routine and the ->wake() platform callback
in s2idle_loop().
Fixes: 726fb6b4f2 (ACPI / PM: Check low power idle constraints for debug only)
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
BPF_NEG takes only one operand, unlike the bulk of BPF_ALU[64] which are
compound-assignments. So give it its own format in print_bpf_insn().
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
print_bpf_insn() was treating all BPF_ALU[64] the same, but BPF_END has a
different structure: it has a size in insn->imm (even if it's BPF_X) and
uses the BPF_SRC (X or K) to indicate which endianness to use. So it
needs different code to print it.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When generic irq chips are allocated for an irq domain the domain name is
set to the irq chip name. That was done to have named domains before the
recent changes which enforce domain naming were done.
Since then the overwrite causes a memory leak when the domain name is
dynamically allocated and even worse it would cause the domain free code to
free the wrong name pointer, which might point to a constant.
Remove the name assignment to prevent this.
Fixes: d59f6617ee ("genirq: Allow fwnode to carry name information only")
Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20170928043731.4764-1-jeffy.chen@rock-chips.com
As Chris explains, get_seccomp_filter() and put_seccomp_filter() can end
up using different filters. Once we drop ->siglock it is possible for
task->seccomp.filter to have been replaced by SECCOMP_FILTER_FLAG_TSYNC.
Fixes: f8e529ed94 ("seccomp, ptrace: add support for dumping seccomp filters")
Reported-by: Chris Salls <chrissalls5@gmail.com>
Cc: stable@vger.kernel.org # needs s/refcount_/atomic_/ for v4.12 and earlier
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[tycho: add __get_seccomp_filter vs. open coding refcount_inc()]
Signed-off-by: Tycho Andersen <tycho@docker.com>
[kees: tweak commit log]
Signed-off-by: Kees Cook <keescook@chromium.org>
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.
xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.
The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just do the rename into bpf_compute_data_pointers() as we'll add
one more pointer here to recompute.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
- Two sets of NVMe pull requests from Christoph:
- Fixes for the Fibre Channel host/target to fix spec compliance
- Allow a zero keep alive timeout
- Make the debug printk for broken SGLs work better
- Fix queue zeroing during initialization
- Set of RDMA and FC fixes
- Target div-by-zero fix
- bsg double-free fix.
- ndb unknown ioctl fix from Josef.
- Buffered vs O_DIRECT page cache inconsistency fix. Has been floating
around for a long time, well reviewed. From Lukas.
- brd overflow fix from Mikulas.
- Fix for a loop regression in this merge window, where using a union
for two members of the loop_cmd turned out to be a really bad idea.
From Omar.
- Fix for an iostat regression fix in this series, using the wrong API
to get at the block queue. From Shaohua.
- Fix for a potential blktrace delection deadlock. From Waiman.
* 'for-linus' of git://git.kernel.dk/linux-block: (30 commits)
nvme-fcloop: fix port deletes and callbacks
nvmet-fc: sync header templates with comments
nvmet-fc: ensure target queue id within range.
nvmet-fc: on port remove call put outside lock
nvme-rdma: don't fully stop the controller in error recovery
nvme-rdma: give up reconnect if state change fails
nvme-core: Use nvme_wq to queue async events and fw activation
nvme: fix sqhd reference when admin queue connect fails
block: fix a crash caused by wrong API
fs: Fix page cache inconsistency when mixing buffered and AIO DIO
nvmet: implement valid sqhd values in completions
nvme-fabrics: Allow 0 as KATO value
nvme: allow timed-out ios to retry
nvme: stop aer posting if controller state not live
nvme-pci: Print invalid SGL only once
nvme-pci: initialize queue memory before interrupts
nvmet-fc: fix failing max io queue connections
nvme-fc: use transport-specific sgl format
nvme: add transport SGL definitions
nvme.h: remove FC transport-specific error values
...
has been pointing out constant problems. The changes have been going into
the stack tracer, but it has been discovered that the problem isn't
with the stack tracer itself, but it is with calling save_stack_trace()
from within the internals of RCU. The stack tracer is the one that
can trigger the issue the easiest, but examining the problem further,
it could also happen from a WARN() in the wrong place, or even if
an NMI happened in this area and it did an rcu_read_lock().
The critical area is where RCU is not watching. Which can happen while
going to and from idle, or bringing up or taking down a CPU.
The final fix was to put the protection in kernel_text_address() as it
is the one that requires RCU to be watching while doing the stack trace.
To make this work properly, Paul had to allow rcu_irq_enter() happen after
rcu_nmi_enter(). This should have been done anyway, since an NMI can
page fault (reading vmalloc area), and a page fault triggers rcu_irq_enter().
One patch is just a consolidation of code so that the fix only needed
to be done in one location.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEQEw9Eu0DdyUUkuUUybkF8mrZjcsFAlnGyXoUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQybkF8mrZjctKtwf8CeKGqOdlqkZEafIpWaIASXmAVMO/
WE+hQK+rCydWFvzADgb/rOmsR0ou8WGEXcuUPxVxmvMyqhKhZ6AU1hE/7Y8P0pMq
F4bev+j3lAJC65ezFAh+ZQcIjaRIH4MFVPsUTaibSPSN7xziMNIpbf9VOVfpUm8A
jf9p6YAmyhFVi6DstCc29SWnywEVwC2ZWRVKRPXKry8/dPxjfVcLclGX680Eqi9I
EnYaOdC/mGbtvHPOUSs/P0cfxExHmyEErQHeOV8FPymj6KJ6+KoYIiELNlTHUBj/
eeKzrHc/b3j+lz0RPlA8WxYmpmEm4SE5cV3vRebdBNUBrABSN1RxeOozyQ==
=1KkS
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Stack tracing and RCU has been having issues with each other and
lockdep has been pointing out constant problems.
The changes have been going into the stack tracer, but it has been
discovered that the problem isn't with the stack tracer itself, but it
is with calling save_stack_trace() from within the internals of RCU.
The stack tracer is the one that can trigger the issue the easiest,
but examining the problem further, it could also happen from a WARN()
in the wrong place, or even if an NMI happened in this area and it did
an rcu_read_lock().
The critical area is where RCU is not watching. Which can happen while
going to and from idle, or bringing up or taking down a CPU.
The final fix was to put the protection in kernel_text_address() as it
is the one that requires RCU to be watching while doing the stack
trace.
To make this work properly, Paul had to allow rcu_irq_enter() happen
after rcu_nmi_enter(). This should have been done anyway, since an NMI
can page fault (reading vmalloc area), and a page fault triggers
rcu_irq_enter().
One patch is just a consolidation of code so that the fix only needed
to be done in one location"
* tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove RCU work arounds from stack tracer
extable: Enable RCU if it is not watching in kernel_text_address()
extable: Consolidate *kernel_text_address() functions
rcu: Allow for page faults in NMI handlers