were getting corrupted. In the process I found three bugs. One was the
culprit, but the other two scared me. After deeper investigation, they
were not as major as I thought they were, due to a signed compared to
an unsigned that prevented a negative number from doing actual harm.
The two bigger bugs:
- Mask the ring buffer data page length. There are data flags at the
high bits of the length field. These were not cleared via the
length function, and the length could return a negative number.
(Although the number returned was unsigned, but was assigned to a
signed number) Luckily, this value was compared to PAGE_SIZE which is
unsigned and kept it from entering the path that could have caused damage.
- Check the page usage before reusing the ring buffer reader page.
TCP increments the page ref when passing the page off to the network.
The page is passed back to the ring buffer for use on free. But
the page could still be in use by the TCP stack.
Minor bugs:
- Related to the first bug. No need to clear out the unused ring buffer
data before sending to user space. It is now done by the ring buffer
code itself.
- Reset pointers after free on error path. There were some cases in
the error path that pointers were freed but not set to NULL, and could
have them freed again, having a pointer freed twice.
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpD9O8UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUnC0Av9EqzJjJXlZuleCSiuh1umx33esgZv
gOYTOXH9QLdKFHLpwVzeCsrhrLXNhbUfrGMQ0ERcpvVacHCKVwRyzx0nfI5W3rbt
9sCsNsVR2SCVpzSWOvP9iJM0J/myFdZtYmGLC2BBJerXZFwl9Ciw+1bF7MFprb4v
6r+49YrYMAR/H/obT3Aoh/XCOz0W0czk9ECGPhuwqAjWoNPwSgpbTdqpR92bJf85
hGYppIX9d+4Gv4pZ2lfXDKrgiAPvHpp5I/znLDY8cG7GhcBjyXaetBb+XlfHI6D4
jTS59f13CqcEhyFE5x2qwQBr9TTh043EKviixDud+nI1L7aNhDIBtb6tYrAmGWWh
Rj1268gFjspi3pYTjI8cHXXCJSdQiAqFesiFLviU1c17PgjbBAnmkcsFSgOPxHqc
j225jravcXtUqQq5J0qKR6Sn3LObfYJQk6tqpN6gWN76P75QgUms5W4+/NiEI0a3
0LVjapxHZkDEYNRGmI+d0CvIJ3BWyb781Siw
=xhPf
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"While doing tests on tracing over the network, I found that the
packets were getting corrupted.
In the process I found three bugs.
One was the culprit, but the other two scared me. After deeper
investigation, they were not as major as I thought they were, due to a
signed compared to an unsigned that prevented a negative number from
doing actual harm.
The two bigger bugs:
- Mask the ring buffer data page length. There are data flags at the
high bits of the length field. These were not cleared via the
length function, and the length could return a negative number.
(Although the number returned was unsigned, but was assigned to a
signed number) Luckily, this value was compared to PAGE_SIZE which
is unsigned and kept it from entering the path that could have
caused damage.
- Check the page usage before reusing the ring buffer reader page.
TCP increments the page ref when passing the page off to the
network. The page is passed back to the ring buffer for use on
free. But the page could still be in use by the TCP stack.
Minor bugs:
- Related to the first bug. No need to clear out the unused ring
buffer data before sending to user space. It is now done by the
ring buffer code itself.
- Reset pointers after free on error path. There were some cases in
the error path that pointers were freed but not set to NULL, and
could have them freed again, having a pointer freed twice"
* tag 'trace-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix possible double free on failure of allocating trace buffer
tracing: Fix crash when it fails to alloc ring buffer
ring-buffer: Do no reuse reader page if still in use
tracing: Remove extra zeroing out of the ring buffer page
ring-buffer: Mask out the info bits when returning buffer page length
Jing Xia and Chunyan Zhang reported that on failing to allocate part of the
tracing buffer, memory is freed, but the pointers that point to them are not
initialized back to NULL, and later paths may try to free the freed memory
again. Jing and Chunyan fixed one of the locations that does this, but
missed a spot.
Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Reported-by: Jing Xia <jing.xia@spreadtrum.com>
Reported-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Double free of the ring buffer happens when it fails to alloc new
ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured.
The root cause is that the pointer is not set to NULL after the buffer
is freed in allocate_trace_buffers(), and the freeing of the ring
buffer is invoked again later if the pointer is not equal to Null,
as:
instance_mkdir()
|-allocate_trace_buffers()
|-allocate_trace_buffer(tr, &tr->trace_buffer...)
|-allocate_trace_buffer(tr, &tr->max_buffer...)
// allocate fail(-ENOMEM),first free
// and the buffer pointer is not set to null
|-ring_buffer_free(tr->trace_buffer.buffer)
// out_free_tr
|-free_trace_buffers()
|-free_trace_buffer(&tr->trace_buffer);
//if trace_buffer is not null, free again
|-ring_buffer_free(buf->buffer)
|-rb_free_cpu_buffer(buffer->buffers[cpu])
// ring_buffer_per_cpu is null, and
// crash in ring_buffer_per_cpu->pages
Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Signed-off-by: Jing Xia <jing.xia@spreadtrum.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To free the reader page that is allocated with ring_buffer_alloc_read_page(),
ring_buffer_free_read_page() must be called. For faster performance, this
page can be reused by the ring buffer to avoid having to free and allocate
new pages.
The issue arises when the page is used with a splice pipe into the
networking code. The networking code may up the page counter for the page,
and keep it active while sending it is queued to go to the network. The
incrementing of the page ref does not prevent it from being reused in the
ring buffer, and this can cause the page that is being sent out to the
network to be modified before it is sent by reading new data.
Add a check to the page ref counter, and only reuse the page if it is not
being used anywhere else.
Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ring_buffer_read_page() takes care of zeroing out any extra data in the
page that it returns. There's no need to zero it out again from the
consumer. It was removed from one consumer of this function, but
read_buffers_splice_read() did not remove it, and worse, it contained a
nasty bug because of it.
Cc: stable@vger.kernel.org
Fixes: 2711ca237a ("ring-buffer: Move zeroing out excess in page to ring buffer code")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Two info bits were added to the "commit" part of the ring buffer data page
when returned to be consumed. This was to inform the user space readers that
events have been missed, and that the count may be stored at the end of the
page.
What wasn't handled, was the splice code that actually called a function to
return the length of the data in order to zero out the rest of the page
before sending it up to user space. These data bits were returned with the
length making the value negative, and that negative value was not checked.
It was compared to PAGE_SIZE, and only used if the size was less than
PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an
unsigned compare, meaning the negative size value did not end up causing a
large portion of memory to be randomly zeroed out.
Cc: stable@vger.kernel.org
Fixes: 66a8cb95ed ("ring-buffer: Add place holder recording of dropped events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
fix off by one error in max call depth check
and add a test
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Instead of computing max stack depth for current call chain
during the main verifier pass track stack depth of each
function independently and after do_check() is done do
another pass over all instructions analyzing depth
of all possible call stacks.
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull x86 PTI preparatory patches from Thomas Gleixner:
"Todays Advent calendar window contains twentyfour easy to digest
patches. The original plan was to have twenty three matching the date,
but a late fixup made that moot.
- Move the cpu_entry_area mapping out of the fixmap into a separate
address space. That's necessary because the fixmap becomes too big
with NRCPUS=8192 and this caused already subtle and hard to
diagnose failures.
The top most patch is fresh from today and cures a brain slip of
that tall grumpy german greybeard, who ignored the intricacies of
32bit wraparounds.
- Limit the number of CPUs on 32bit to 64. That's insane big already,
but at least it's small enough to prevent address space issues with
the cpu_entry_area map, which have been observed and debugged with
the fixmap code
- A few TLB flush fixes in various places plus documentation which of
the TLB functions should be used for what.
- Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
more than sysenter now and keeping the name makes backtraces
confusing.
- Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
which is only invoked on fork().
- Make vysycall more robust.
- A few fixes and cleanups of the debug_pagetables code. Check
PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
C89 initialization of the address hint array which already was out
of sync with the index enums.
- Move the ESPFIX init to a different place to prepare for PTI.
- Several code moves with no functional change to make PTI
integration simpler and header files less convoluted.
- Documentation fixes and clarifications"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
init: Invoke init_espfix_bsp() from mm_init()
x86/cpu_entry_area: Move it out of the fixmap
x86/cpu_entry_area: Move it to a separate unit
x86/mm: Create asm/invpcid.h
x86/mm: Put MMU to hardware ASID translation in one place
x86/mm: Remove hard-coded ASID limit checks
x86/mm: Move the CR3 construction functions to tlbflush.h
x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
x86/mm: Remove superfluous barriers
x86/mm: Use __flush_tlb_one() for kernel memory
x86/microcode: Dont abuse the TLB-flush interface
x86/uv: Use the right TLB-flush API
x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
x86/mm/64: Improve the memory map documentation
x86/ldt: Prevent LDT inheritance on exec
x86/ldt: Rework locking
arch, mm: Allow arch_dup_mmap() to fail
x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
...
Commit cc2b14d510 ("bpf: teach verifier to recognize zero initialized
stack") introduced a very relaxed check when comparing stacks of different
states, effectively returning a positive result in many cases where it
shouldn't.
This can create problems in cases such as this following C pseudocode:
long var;
long *x = bpf_map_lookup(...);
if (!x)
return;
if (*x != 0xbeef)
var = 0;
else
var = 1;
/* This is the key part, calling a helper causes an explored state
* to be saved with the information that "var" is on the stack as
* STACK_ZERO, since the helper is first met by the verifier after
* the "var = 0" assignment. This state will however be wrongly used
* also for the "var = 1" case, so the verifier assumes "var" is always
* 0 and will replace the NULL assignment with nops, because the
* search pruning prevents it from exploring the faulty branch.
*/
bpf_ktime_get_ns();
if (var)
*(long *)0 = 0xbeef;
Fix the issue by making sure that the stack is fully explored before
returning a positive comparison result.
Also attach a couple tests that highlight the bad behavior. In the first
test, without this fix instructions 16 and 17 are replaced with nops
instead of being rejected by the verifier.
The second test, instead, allows a program to make a potentially illegal
read from the stack.
Fixes: cc2b14d510 ("bpf: teach verifier to recognize zero initialized stack")
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lots of overlapping changes. Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.
Include a necessary change by Jakub Kicinski, with log message:
====================
cls_bpf no longer takes care of offload tracking. Make sure
netdevsim performs necessary checks. This fixes a warning
caused by TC trying to remove a filter it has not added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller"
"What's a holiday weekend without some networking bug fixes? [1]
1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function
calls, from Daniel Borkmann.
2) Fix regression from errata limiting change to marvell PHY driver,
from Zhao Qiang.
3) Fix u16 overflow in SCTP, from Xin Long.
4) Fix potential memory leak during bridge newlink, from Nikolay
Aleksandrov.
5) Fix BPF selftest build on s390, from Hendrik Brueckner.
6) Don't append to cfg80211 automatically generated certs file,
always write new ones from scratch. From Thierry Reding.
7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai.
8) Fix hang on tg3 MTU change with certain chips, from Brian King.
9) Add stall detection to arc emac driver and reset chip when this
happens, from Alexander Kochetkov.
10) Fix MTU limitng in GRE tunnel drivers, from Xin Long.
11) Fix stmmac timestamping bug due to mis-shifting of field. From
Fredrik Hallenberg.
12) Fix metrics match when deleting an ipv4 route. The kernel sets
some internal metrics bits which the user isn't going to set when
it makes the delete request. From Phil Sutter.
13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix
from Yelena Krivosheev.
14) Fix double free and memory corruption in get_net_ns_by_id, from
Eric W. Biederman.
15) Flush ipv4 FIB tables in the reverse order. Some tables can share
their actual backing data, in particular this happens for the MAIN
and LOCAL tables. We have to kill the LOCAL table first, because
it uses MAIN's backing memory. Fix from Ido Schimmel.
16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann
Horn, and Alexei Starovoitov.
17) Make changes to ipv6 autoflowlabel sysctl really propagate to
sockets, unless the socket has set the per-socket value
explicitly. From Shaohua Li.
18) Fix leaks and double callback invocations of zerocopy SKBs, from
Willem de Bruijn"
[1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
skbuff: skb_copy_ubufs must release uarg even without user frags
skbuff: orphan frags before zerocopy clone
net: reevalulate autoflowlabel setting after sysctl setting
openvswitch: Fix pop_vlan action for double tagged frames
ipv6: Honor specified parameters in fibmatch lookup
bpf: do not allow root to mangle valid pointers
selftests/bpf: add tests for recent bugfixes
bpf: fix integer overflows
bpf: don't prune branches when a scalar is replaced with a pointer
bpf: force strict alignment checks for stack pointers
bpf: fix missing error return in check_stack_boundary()
bpf: fix 32-bit ALU op verification
bpf: fix incorrect tracking of register size truncation
bpf: fix incorrect sign extension in check_alu_op()
bpf/verifier: fix bounds calculation on BPF_RSH
ipv4: Fix use-after-free when flushing FIB tables
s390/qeth: fix error handling in checksum cmd callback
tipc: remove joining group member from congested list
selftests: net: Adding config fragment CONFIG_NUMA=y
nfp: bpf: keep track of the offloaded program
...
Right now kallsyms handling is not working with JITed subprogs.
The reason is that when in 1c2a088a66 ("bpf: x64: add JIT support
for multi-function programs") in jit_subprogs() they are passed
to bpf_prog_kallsyms_add(), then their prog type is 0, which BPF
core will think it's a cBPF program as only cBPF programs have a
0 type. Thus, they need to inherit the type from the main prog.
Once that is fixed, they are indeed added to the BPF kallsyms
infra, but their tag is 0. Therefore, since intention is to add
them as bpf_prog_F_<tag>, we need to pass them to bpf_prog_calc_tag()
first. And once this is resolved, there is a use-after-free on
prog cleanup: we remove the kallsyms entry from the main prog,
later walk all subprogs and call bpf_jit_free() on them. However,
the kallsyms linkage was never released on them. Thus, do that
for all subprogs right in __bpf_prog_put() when refcount hits 0.
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Do not allow root to convert valid pointers into unknown scalars.
In particular disallow:
ptr &= reg
ptr <<= reg
ptr += ptr
and explicitly allow:
ptr -= ptr
since pkt_end - pkt == length
1.
This minimizes amount of address leaks root can do.
In the future may need to further tighten the leaks with kptr_restrict.
2.
If program has such pointer math it's likely a user mistake and
when verifier complains about it right away instead of many instructions
later on invalid memory access it's easier for users to fix their progs.
3.
when register holding a pointer cannot change to scalar it allows JITs to
optimize better. Like 32-bit archs could use single register for pointers
instead of a pair required to hold 64-bit scalars.
4.
reduces architecture dependent behavior. Since code:
r1 = r10;
r1 &= 0xff;
if (r1 ...)
will behave differently arm64 vs x64 and offloaded vs native.
A significant chunk of ptr mangling was allowed by
commit f1174f77b5 ("bpf/verifier: rework value tracking")
yet some of it was allowed even earlier.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There were various issues related to the limited size of integers used in
the verifier:
- `off + size` overflow in __check_map_access()
- `off + reg->off` overflow in check_mem_access()
- `off + reg->var_off.value` overflow or 32-bit truncation of
`reg->var_off.value` in check_mem_access()
- 32-bit truncation in check_stack_boundary()
Make sure that any integer math cannot overflow by not allowing
pointer math with large values.
Also reduce the scope of "scalar op scalar" tracking.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Properly handle register truncation to a smaller size.
The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].
Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.
Debian assigned CVE-2017-16996 for this issue.
v2:
- flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
- add CVE number (Ben Hutchings)
Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.
Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.
Debian assigned CVE-2017-16995 for this issue.
v3:
- add CVE number (Ben Hutchings)
Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.
Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[jannh@google.com: changed description to reflect bug impact]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Deadlock during cgroup migration from cpu hotplug path when a task T is
being moved from source to destination cgroup.
kworker/0:0
cpuset_hotplug_workfn()
cpuset_hotplug_update_tasks()
hotplug_update_tasks_legacy()
remove_tasks_in_empty_cpuset()
cgroup_transfer_tasks() // stuck in iterator loop
cgroup_migrate()
cgroup_migrate_add_task()
In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
Task T will not migrate to destination cgroup. css_task_iter_start()
will keep pointing to task T in loop waiting for task T cg_list node
to be removed.
Task T
do_exit()
exit_signals() // sets PF_EXITING
exit_task_namespaces()
switch_task_namespaces()
free_nsproxy()
put_mnt_ns()
drop_collected_mounts()
namespace_unlock()
synchronize_rcu()
_synchronize_rcu_expedited()
schedule_work() // on cpu0 low priority worker pool
wait_event() // waiting for work item to execute
Task T inserted a work item in the worklist of cpu0 low priority
worker pool. It is waiting for expedited grace period work item
to execute. This work item will only be executed once kworker/0:0
complete execution of cpuset_hotplug_workfn().
kworker/0:0 ==> Task T ==>kworker/0:0
In case of PF_EXITING task being migrated from source to destination
cgroup, migrate next available task in source cgroup.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The tools/testing/selftests/bpf test program
test_dev_cgroup fails with the following error
when compiled with llvm 6.0. (I did not try
with earlier versions.)
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf:
0: (61) r2 = *(u32 *)(r1 +4)
1: (b7) r0 = 0
2: (55) if r2 != 0x1 goto pc+8
R0=inv0 R1=ctx(id=0,off=0,imm=0) R2=inv1 R10=fp0
3: (69) r2 = *(u16 *)(r1 +0)
invalid bpf_context access off=0 size=2
...
The culprit is the following statement in dev_cgroup.c:
short type = ctx->access_type & 0xFFFF;
This code is typical as the ctx->access_type is assigned
as below in kernel/bpf/cgroup.c:
struct bpf_cgroup_dev_ctx ctx = {
.access_type = (access << 16) | dev_type,
.major = major,
.minor = minor,
};
The compiler converts it to u16 access while
the verifier cgroup_dev_is_valid_access rejects
any non u32 access.
This patch permits the field access_type to be accessible
with type u16 and u8 as well.
Signed-off-by: Yonghong Song <yhs@fb.com>
Tested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Function skip_callee is local to the source and does not need to
be in global scope, so make it static. Also return NULL rather than 0.
Cleans up two sparse warnings:
symbol 'skip_callee' was not declared. Should it be static?
Using plain integer as NULL pointer
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Trivial fix to spelling mistake in error message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2017-12-18
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow arbitrary function calls from one BPF function to another BPF function.
As of today when writing BPF programs, __always_inline had to be used in
the BPF C programs for all functions, unnecessarily causing LLVM to inflate
code size. Handle this more naturally with support for BPF to BPF calls
such that this __always_inline restriction can be overcome. As a result,
it allows for better optimized code and finally enables to introduce core
BPF libraries in the future that can be reused out of different projects.
x86 and arm64 JIT support was added as well, from Alexei.
2) Add infrastructure for tagging functions as error injectable and allow for
BPF to return arbitrary error values when BPF is attached via kprobes on
those. This way of injecting errors generically eases testing and debugging
without having to recompile or restart the kernel. Tags for opting-in for
this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.
3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
call for XDP programs. First part of this work adds handling of BPF
capabilities included in the firmware, and the later patches add support
to the nfp verifier part and JIT as well as some small optimizations,
from Jakub.
4) The bpftool now also gets support for basic cgroup BPF operations such
as attaching, detaching and listing current BPF programs. As a requirement
for the attach part, bpftool can now also load object files through
'bpftool prog load'. This reuses libbpf which we have in the kernel tree
as well. bpftool-cgroup man page is added along with it, from Roman.
5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
a single perf event") added support for attaching multiple BPF programs
to a single perf event. Given they are configured through perf's ioctl()
interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
command in this work in order to return an array of one or multiple BPF
prog ids that are currently attached, from Yonghong.
6) Various minor fixes and cleanups to the bpftool's Makefile as well
as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
itself or prior installed documentation related to it, from Quentin.
7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
required for the test_dev_cgroup test case to run, from Naresh.
8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.
9) Fix libbpf's exit code from the Makefile when libelf was not found in
the system, also from Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-17
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a corner case in generic XDP where we have non-linear skbs
but enough tailroom in the skb to not miss to linearizing there,
from Song.
2) Fix BPF JIT bugs in s390x and ppc64 to not recache skb data when
BPF context is not skb, from Daniel.
3) Fix a BPF JIT bug in sparc64 where recaching skb data after helper
call would use the wrong register for the skb, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
CONFIG_NO_HZ_FULL doesn't make sense without CONFIG_CPU_ISOLATION. In
fact enabling the first without the second is a regression as nohz_full=
boot parameter gets silently ignored.
Besides this unnatural combination hangs RCU gp kthread when running
rcutorture for reasons that are not yet fully understood:
rcu_preempt kthread starved for 9974 jiffies! g4294967208
+c4294967207 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
rcu_preempt I 7464 8 2 0x80000000
Call Trace:
__schedule+0x493/0x620
schedule+0x24/0x40
schedule_timeout+0x330/0x3b0
? preempt_count_sub+0xea/0x140
? collect_expired_timers+0xb0/0xb0
rcu_gp_kthread+0x6bf/0xef0
This commit therefore makes NO_HZ_FULL select CPU_ISOLATION, which
prevents all these bad behaviours.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Fixes: 5c4991e24c ("sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL")
Link: http://lkml.kernel.org/r/1513275507-29200-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer fix from Thomas Gleixner:
"A single bugfix which prevents arbitrary sigev_notify values in
posix-timers"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timer: Properly check sigevent->sigev_notify
Things got moved around between the original bpf_override_return patches
and the final version, and now the ftrace kprobe dispatcher assumes if
you modified the ip that you also enabled preemption. Make a comment of
this and enable preemption, this fixes the lockdep splat that happened
when using this feature.
Fixes: 9802d86585 ("bpf: add a bpf_override_function helper")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Typical JIT does several passes over bpf instructions to
compute total size and relative offsets of jumps and calls.
With multitple bpf functions calling each other all relative calls
will have invalid offsets intially therefore we need to additional
last pass over the program to emit calls with correct offsets.
For example in case of three bpf functions:
main:
call foo
call bpf_map_lookup
exit
foo:
call bar
exit
bar:
exit
We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
x64 JIT typically does 4-5 passes to converge.
After these initial passes the image for these 3 functions
will be good except call targets, since start addresses of
foo() and bar() are unknown when we were JITing main()
(note that call bpf_map_lookup will be resolved properly
during initial passes).
Once start addresses of 3 functions are known we patch
call_insn->imm to point to right functions and call
bpf_int_jit_compile() again which needs only one pass.
Additional safety checks are done to make sure this
last pass doesn't produce image that is larger or smaller
than previous pass.
When constant blinding is on it's applied to all functions
at the first pass, since doing it once again at the last
pass can change size of the JITed code.
Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
All other JITs that support normal BPF_CALL will behave the same way
since bpf-to-bpf call is equivalent to bpf-to-kernel call from
JITs point of view.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
global bpf_jit_enable variable is tested multiple times in JITs,
blinding and verifier core. The malicious root can try to toggle
it while loading the programs. This race condition was accounted
for and there should be no issues, but it's safer to avoid
this race condition.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
though bpf_call is still the same call instruction and
calling convention 'bpf to bpf' and 'bpf to helper' is the same
the interpreter has to oparate on 'struct bpf_insn *'.
To distinguish these two cases add a kernel internal opcode and
mark call insns with it.
This opcode is seen by interpreter only. JITs will never see it.
Also add tiny bit of debug code to aid interpreter debugging.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
programs with function calls are often passing various
pointers via stack. When all calls are inlined llvm
flattens stack accesses and optimizes away extra branches.
When functions are not inlined it becomes the job of
the verifier to recognize zero initialized stack to avoid
exploring paths that program will not take.
The following program would fail otherwise:
ptr = &buffer_on_stack;
*ptr = 0;
...
func_call(.., ptr, ...) {
if (..)
*ptr = bpf_map_lookup();
}
...
if (*ptr != 0) {
// Access (*ptr)->field is valid.
// Without stack_zero tracking such (*ptr)->field access
// will be rejected
}
since stack slots are no longer uniform invalid | spill | misc
add liveness marking to all slots, but do it in 8 byte chunks.
So if nothing was read or written in [fp-16, fp-9] range
it will be marked as LIVE_NONE.
If any byte in that range was read, it will be marked LIVE_READ
and stacksafe() check will perform byte-by-byte verification.
If all bytes in the range were written the slot will be
marked as LIVE_WRITTEN.
This significantly speeds up state equality comparison
and reduces total number of states processed.
before after
bpf_lb-DLB_L3.o 2051 2003
bpf_lb-DLB_L4.o 3287 3164
bpf_lb-DUNKNOWN.o 1080 1080
bpf_lxc-DDROP_ALL.o 24980 12361
bpf_lxc-DUNKNOWN.o 34308 16605
bpf_netdev.o 15404 10962
bpf_overlay.o 7191 6679
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow arbitrary function calls from bpf function to another bpf function.
To recognize such set of bpf functions the verifier does:
1. runs control flow analysis to detect function boundaries
2. proceeds with verification of all functions starting from main(root) function
It recognizes that the stack of the caller can be accessed by the callee
(if the caller passed a pointer to its stack to the callee) and the callee
can store map_value and other pointers into the stack of the caller.
3. keeps track of the stack_depth of each function to make sure that total
stack depth is still less than 512 bytes
4. disallows pointers to the callee stack to be stored into the caller stack,
since they will be invalid as soon as the callee returns
5. to reuse all of the existing state_pruning logic each function call
is considered to be independent call from the verifier point of view.
The verifier pretends to inline all function calls it sees are being called.
It stores the callsite instruction index as part of the state to make sure
that two calls to the same callee from two different places in the caller
will be different from state pruning point of view
6. more safety checks are added to liveness analysis
Implementation details:
. struct bpf_verifier_state is now consists of all stack frames that
led to this function
. struct bpf_func_state represent one stack frame. It consists of
registers in the given frame and its stack
. propagate_liveness() logic had a premature optimization where
mark_reg_read() and mark_stack_slot_read() were manually inlined
with loop iterating over parents for each register or stack slot.
Undo this optimization to reuse more complex mark_*_read() logic
. skip_callee() logic is not necessary from safety point of view,
but without it mark_*_read() markings become too conservative,
since after returning from the funciton call a read of r6-r9
will incorrectly propagate the read marks into callee causing
inefficient pruning later
. mark_*_read() logic is now aware of control flow which makes it
more complex. In the future the plan is to rewrite liveness
to be hierarchical. So that liveness can be done within
basic block only and control flow will be responsible for
propagation of liveness information along cfg and between calls.
. tail_calls and ld_abs insns are not allowed in the programs with
bpf-to-bpf calls
. returning stack pointers to the caller or storing them into stack
frame of the caller is not allowed
Testing:
. no difference in cilium processed_insn numbers
. large number of tests follows in next patches
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow arbitrary function calls from bpf function to another bpf function.
Since the beginning of bpf all bpf programs were represented as a single function
and program authors were forced to use always_inline for all functions
in their C code. That was causing llvm to unnecessary inflate the code size
and forcing developers to move code to header files with little code reuse.
With a bit of additional complexity teach verifier to recognize
arbitrary function calls from one bpf function to another as long as
all of functions are presented to the verifier as a single bpf program.
New program layout:
r6 = r1 // some code
..
r1 = .. // arg1
r2 = .. // arg2
call pc+1 // function call pc-relative
exit
.. = r1 // access arg1
.. = r2 // access arg2
..
call pc+20 // second level of function call
...
It allows for better optimized code and finally allows to introduce
the core bpf libraries that can be reused in different projects,
since programs are no longer limited by single elf file.
With function calls bpf can be compiled into multiple .o files.
This patch is the first step. It detects programs that contain
multiple functions and checks that calls between them are valid.
It splits the sequence of bpf instructions (one program) into a set
of bpf functions that call each other. Calls to only known
functions are allowed. In the future the verifier may allow
calls to unresolved functions and will do dynamic linking.
This logic supports statically linked bpf functions only.
Such function boundary detection could have been done as part of
control flow graph building in check_cfg(), but it's cleaner to
separate function boundary detection vs control flow checks within
a subprogram (function) into logically indepedent steps.
Follow up patches may split check_cfg() further, but not check_subprogs().
Only allow bpf-to-bpf calls for root only and for non-hw-offloaded programs.
These restrictions can be relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ Note, this is a Git cherry-pick of the following commit:
506458efaf ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")
... for easier x86 PTI code testing and back-porting. ]
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.
2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
Brueckner.
3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
Berg.
4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.
5) Don't advertise gigabit in sh_eth when not available, from Thomas
Petazzoni.
6) Check network namespace when delivering to netlink taps, from Kevin
Cernekee.
7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.
8) Use correct address in TCP md5 lookups when replying to an incoming
segment, from Christoph Paasch.
9) Add schedule points to BPF map alloc/free, from Eric Dumazet.
10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
from Eric Dumazet.
11) Fix SKB leak in tipc, from Jon Maloy.
12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.
13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn.
14) Add some new qmi_wwan device IDs, from Daniele Palmas.
15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: qcom/emac: Reduce timeout for mdio read/write
net: sched: fix static key imbalance in case of ingress/clsact_init error
net: sched: fix clsact init error path
ip_gre: fix wrong return value of erspan_rcv
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
pkt_sched: Remove TC_RED_OFFLOADED from uapi
net: sched: Move to new offload indication in RED
net: sched: Add TCA_HW_OFFLOAD
net: aquantia: Increment driver version
net: aquantia: Fix typo in ethtool statistics names
net: aquantia: Update hw counters on hw init
net: aquantia: Improve link state and statistics check interval callback
net: aquantia: Fill in multicast counter in ndev stats from hardware
net: aquantia: Fill ndev stat couters from hardware
net: aquantia: Extend stat counters to 64bit values
net: aquantia: Fix hardware DMA stream overload on large MRRS
net: aquantia: Fix actual speed capabilities reporting
sock: free skb in skb_complete_tx_timestamp on error
s390/qeth: update takeover IPs after configuration change
s390/qeth: lock IP table while applying takeover changes
...
Pull locking fixes from Ingo Molnar:
"Misc fixes:
- Fix a S390 boot hang that was caused by the lock-break logic.
Remove lock-break to begin with, as review suggested it was
unreasonably fragile and our confidence in its continued good
health is lower than our confidence in its removal.
- Remove the lockdep cross-release checking code for now, because of
unresolved false positive warnings. This should make lockdep work
well everywhere again.
- Get rid of the final (and single) ACCESS_ONCE() straggler and
remove the API from v4.15.
- Fix a liblockdep build warning"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/lib/lockdep: Add missing declaration of 'pr_cont()'
checkpatch: Remove ACCESS_ONCE() warning
compiler.h: Remove ACCESS_ONCE()
tools/include: Remove ACCESS_ONCE()
tools/perf: Convert ACCESS_ONCE() to READ_ONCE()
locking/lockdep: Remove the cross-release locking checks
locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y
locking/core: Fix deadlock during boot on systems with GENERIC_LOCKBREAK
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a crash fix for an ARM SoC platform, and kernel-doc
warnings fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Do not pull from current CPU if only one CPU to pull
sched/core: Fix kernel-doc warnings after code movement
Some JITs don't cache skb context on stack in prologue, so when
LD_ABS/IND is used and helper calls yield bpf_helper_changes_pkt_data()
as true, then they temporarily save/restore skb pointer. However,
the assumption that skb always has to be in r1 is a bit of a
gamble. Right now it turned out to be true for all helpers listed
in bpf_helper_changes_pkt_data(), but lets enforce that from verifier
side, so that we make this a guarantee and bail out if the func
proto is misconfigured in future helpers.
In case of BPF helper calls from cBPF, bpf_helper_changes_pkt_data()
is completely unrelevant here (since cBPF is context read-only) and
therefore always false.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.
As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.
Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.
Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.
Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.
Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
timer_create() specifies via sigevent->sigev_notify the signal delivery for
the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
and (SIGEV_SIGNAL | SIGEV_THREAD_ID).
The sanity check in good_sigevent() is only checking the valid combination
for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
not set it accepts any random value.
This has no real effects on the posix timer and signal delivery code, but
it affects show_timer() which handles the output of /proc/$PID/timers. That
function uses a string array to pretty print sigev_notify. The access to
that array has no bound checks, so random sigev_notify cause access beyond
the array bounds.
Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
masking from various code pathes as SIGEV_NONE can never be set in
combination with SIGEV_THREAD_ID.
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: stable@vger.kernel.org
- Comment fixes
- Build fix
- Better memory alloction (don't use NR_CPUS)
- Configuration fix
- Build warning fix
- Enhanced callback parameter (to simplify users of trace hooks)
- Give up on stack tracing when RCU isn't watching (it's a lost cause)
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlozKm0UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUmhXwv7BEY923K3Nl3qC6LeYmNyrZ4g1PsD
nbZ+ZjU3KlMPugGbnJCJbfsS0utUp2Wd9gHT32O4BUf0/Pxjo3utXvkzRQJ3SwHT
X7QhXROkicAKRFrPxj0BaiLexC+yJR23wGp2YUVHLO4Aa/ptN8BJvH22+eDpsCLc
f1DWJdvdbyPBUeoHNKevjvccsUYMlnBfe1jhJ9nRWHnq1axGV3bllcd6v4T07LgK
LO28Krp4/V3tVN9Sq6jBoGTrULf5O1xtuBDVtANeXdha1oMUTYr4TUzTuOQxFgNu
sCIpUWTKu+PNGmU5bhlryb20C2+PveE4EMK0InVVlqrhYJ/XjXFyRaZYYkM2NAKq
XmeHVjMfc6Wmrd/nepuaTZvGGrK2kK/pCX/XuQthKKjAU6rO5X+FfGiSGodLPIYB
1m+QvfX8Re2IJkswm3lS68LKtG7SnjcSB9sY6PhVe6C6oP2ya2O1hthJ+TkrfNbc
lhE5HzfCJZ9ujSjcQoUGwjrhnIezQX4KnfJZ
=WdMQ
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fix-ups:
- comment fixes
- build fix
- better memory alloction (don't use NR_CPUS)
- configuration fix
- build warning fix
- enhanced callback parameter (to simplify users of trace hooks)
- give up on stack tracing when RCU isn't watching (it's a lost
cause)"
* tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have stack trace not record if RCU is not watching
tracing: Pass export pointer as argument to ->write()
ring-buffer: Remove unused function __rb_data_page_index()
tracing: make PREEMPTIRQ_EVENTS depend on TRACING
tracing: Allocate mask_str buffer dynamically
tracing: always define trace_{irq,preempt}_{enable_disable}
tracing: Fix code comments in trace.c
The stack tracer records a stack dump whenever it sees a stack usage that is
more than what it ever saw before. This can happen at any function that is
being traced. If it happens when the CPU is going idle (or other strange
locations), RCU may not be watching, and in this case, the recording of the
stack trace will trigger a warning. There's been lots of efforts to make
hacks to allow stack tracing to proceed even if RCU is not watching, but
this only causes more issues to appear. Simply do not trace a stack if RCU
is not watching. It probably isn't a bad stack anyway.
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
gcc toggle -fisolate-erroneous-paths-dereference (default at -O2
onwards) isolates faulty code paths such as null pointer access, divide
by zero etc. If gcc port doesnt implement __builtin_trap, an abort() is
generated which causes kernel link error.
In this case, gcc is generating abort due to 'divide by zero' in
lib/mpi/mpih-div.c.
Currently 'frv' and 'arc' are failing. Previously other arch was also
broken like m32r was fixed by commit d22e3d69ee ("m32r: fix build
failure").
Let's define this weak function which is common for all arch and fix the
problem permanently. We can even remove the arch specific 'abort' after
this is done.
Link: http://lkml.kernel.org/r/1513118956-8718-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.
This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info
- Remove the call to groups_sort from set_groups
Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f371b304f1 ("bpf/tracing: allow user space to
query prog array on the same tp") introduced a perf
ioctl command to query prog array attached to the
same perf tracepoint. The commit introduced a
compilation error under certain config conditions, e.g.,
(1). CONFIG_BPF_SYSCALL is not defined, or
(2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
nor CONFIG_KPROBE_EVENTS is defined.
Error message:
kernel/events/core.o: In function `perf_ioctl':
core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
This patch fixed this error by guarding the real definition under
CONFIG_BPF_EVENTS and provided static inline dummy function
if CONFIG_BPF_EVENTS was not defined.
It renamed the function from bpf_event_query_prog_array to
perf_event_query_prog_array and moved the definition from linux/bpf.h
to linux/trace_events.h so the definition is in proximity to
other prog_array related functions.
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While using large percpu maps, htab_map_alloc() can hold
cpu for hundreds of ms.
This patch adds cond_resched() calls to percpu alloc/free
call sites, all running in process context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When tracing and networking programs are both attached in the
system and both use event-output helpers that eventually call
into perf_event_output(), then we could end up in a situation
where the tracing attached program runs in user context while
a cls_bpf program is triggered on that same CPU out of softirq
context.
Since both rely on the same per-cpu perf_sample_data, we could
potentially corrupt it. This can only ever happen in a combination
of the two types; all tracing programs use a bpf_prog_active
counter to bail out in case a program is already running on
that CPU out of a different context. XDP and cls_bpf programs
by themselves don't have this issue as they run in the same
context only. Therefore, split both perf_sample_data so they
cannot be accessed from each other.
Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Reported-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Error injection is sloppy and very ad-hoc. BPF could fill this niche
perfectly with it's kprobe functionality. We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations. Accomplish this with the bpf_override_funciton
helper. This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function. This gives us a nice
clean way to implement systematic error injection for all of our code
paths.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Using BPF we can override kprob'ed functions and return arbitrary
values. Obviously this can be a bit unsafe, so make this feature opt-in
for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit e87c6bc385 ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32 ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
cgroup root name and file name have max length limit, we should
avoid copying longer name than that to the name.
tj: minor update to $SUBJ.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
while it found a number of old bugs initially, was also causing too many
false positives that caused people to disable lockdep - which is arguably
a worse overall outcome.
If we disable cross-release by default but keep the code upstream then
in practice the most likely outcome is that we'll allow the situation
to degrade gradually, by allowing entropy to introduce more and more
false positives, until it overwhelms maintenance capacity.
Another bad side effect was that people were trying to work around
the false positives by uglifying/complicating unrelated code. There's
a marked difference between annotating locking operations and
uglifying good code just due to bad lock debugging code ...
This gradual decrease in quality happened to a number of debugging
facilities in the kernel, and lockdep is pretty complex already,
so we cannot risk this outcome.
Either cross-release checking can be done right with no false positives,
or it should not be included in the upstream kernel.
( Note that it might make sense to maintain it out of tree and go through
the false positives every now and then and see whether new bugs were
introduced. )
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_GENERIC_LOCKBEAK=y, locking structures grow an extra int ->break_lock
field which is used to implement raw_spin_is_contended() by setting the field
to 1 when waiting on a lock and clearing it to zero when holding a lock.
However, there are a few problems with this approach:
- There is a write-write race between a CPU successfully taking the lock
(and subsequently writing break_lock = 0) and a waiter waiting on
the lock (and subsequently writing break_lock = 1). This could result
in a contended lock being reported as uncontended and vice-versa.
- On machines with store buffers, nothing guarantees that the writes
to break_lock are visible to other CPUs at any particular time.
- READ_ONCE/WRITE_ONCE are not used, so the field is potentially
susceptible to harmful compiler optimisations,
Consequently, the usefulness of this field is unclear and we'd be better off
removing it and allowing architectures to implement raw_spin_is_contended() by
providing a definition of arch_spin_is_contended(), as they can when
CONFIG_GENERIC_LOCKBREAK=n.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
a8a217c221 ("locking/core: Remove {read,spin,write}_can_lock()")
removed the definition of raw_spin_can_lock(), causing the GENERIC_LOCKBREAK
spin_lock() routines to poll the ->break_lock field when waiting on a lock.
This has been reported to cause a deadlock during boot on s390, because
the ->break_lock field is also set by the waiters, and can potentially
remain set indefinitely if no other CPUs come in to take the lock after
it has been released.
This patch removes the explicit spinning on ->break_lock from the waiters,
instead relying on the outer trylock() operation to determine when the
lock is available.
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a8a217c221 ("locking/core: Remove {read,spin,write}_can_lock()")
Link: http://lkml.kernel.org/r/1511894539-7988-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Prateek posted a couple patches to fix a deadlock involving cpuset
and workqueue. It unfortunately caused a different deadlock and the
recent workqueue hotplug simplification removed the original
deadlock, so Prateek's two patches are reverted for now.
- The new stat code was missing u64_stats initialization. Fixed.
- Doc and other misc changes
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add warning about RT not being supported on cgroup2
Revert "cgroup/cpuset: remove circular dependency deadlock"
Revert "cpuset: Make cpuset hotplug synchronous"
cgroup: properly init u64_stats
debug cgroup: use task_css_set instead of rcu_dereference
cpuset: Make cpuset hotplug synchronous
cgroup/cpuset: remove circular dependency deadlock
Pull workqueue fixes from Tejun Heo:
- Lai's hotplug simplifications inadvertently fix a possible deadlock
involving cpuset and workqueue
- CPU isolation fix which was reverted due to the changes in the
housekeeping code resurrected
- A trivial unused include removal
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove unneeded kallsyms include
workqueue/hotplug: remove the workaround in rebind_workers()
workqueue/hotplug: simplify workqueue_offline_cpu()
workqueue: respect isolated cpus when queueing an unbound work
main: kernel_start: move housekeeping_init() before workqueue_init_early()
The filw was converted from print_symbol() to %pf some time
ago (044c782ce3 "workqueue: fix checkpatch issues").
kallsyms does not seem to be needed anymore.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix the following kernel-doc warnings after code restructuring:
../kernel/sched/core.c:5113: warning: No description found for parameter 't'
../kernel/sched/core.c:5113: warning: Excess function parameter 'interval' description in 'sched_rr_get_interval'
get rid of set_fs()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: abca5fc535 ("sched_rr_get_interval(): move compat to native,
Link: http://lkml.kernel.org/r/995c6ded-b32e-bbe4-d9f5-4d42d121aff1@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sign_extend32 counts the sign bit parameter from 0, not from 1. So we
have to use "11" for 12th bit, not "12".
This mistake means we have not allowed negative op and cmp args since
commit 30d6e0a419 ("futex: Remove duplicated code and fix undefined
behaviour") till now.
Fixes: 30d6e0a419 ("futex: Remove duplicated code and fix undefined behaviour")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
drivers).
2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
to userspace and broke some apps. From Johannes Berg.
3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
Wang.
4) Gianfar MAC can't do EEE so don't advertise it by default, from
Claudiu Manoil.
5) Relax strict netlink attribute validation, but emit a warning. From
David Ahern.
6) Fix regression in checksum offload of thunderx driver, from Florian
Westphal.
7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.
8) New card support in iwlwifi, from Ihab Zhaika.
9) BBR congestion control bug fixes from Neal Cardwell.
10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.
11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.
12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.
13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
net: mvpp2: fix the RSS table entry offset
tcp: evaluate packet losses upon RTT change
tcp: fix off-by-one bug in RACK
tcp: always evaluate losses in RACK upon undo
tcp: correctly test congestion state in RACK
bnxt_en: Fix sources of spurious netpoll warnings
tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
tcp_bbr: reset full pipe detection on loss recovery undo
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
sfc: pass valid pointers from efx_enqueue_unwind
gianfar: Disable EEE autoneg by default
tcp: invalidate rate samples during SACK reneging
can: peak/pcie_fd: fix potential bug in restarting tx queue
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: mcba_usb: cancel urb on -EPROTO
usbnet: fix alignment for frames with no ethernet header
tcp: use current time in tcp_rcv_space_adjust()
...
* Fix long standing problem with kdb kallsyms_symbol_next() return value
* Add new co-maintainer Daniel Thompson
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaKG1/AAoJEIciOldedpOjtigP/Rp1CK9EGtzZCW7a1wZp7d/n
OsAUKrj1AEvve+PpZXBw9X3XMYpXbGyzDfU59NLzMZmgZ6idIdwBwlO6AgWSRyXY
EjvYOvkIytk2ONFVQeDPfbAi5pPaSgTmt813pApka4mbUlWYgC4TioNove4Q08R+
CZrXiD2WeKlBxB2TgfYu+0Sy3TtYv7aaJhV45ziPjiwyFpcl9Ktp5zEVOcK6Xw10
8T5ZLA7k/MKah+CoqBdzRZ0IprBVAB4MMTBV3q7HXNXOMX6JdXgg1UihqKJcYF5u
6d+PS6A8sFGicUguPXMZh90vM2f0hg+ZagavskSLCp8usL91+4UmAxHaRdKUTqWC
k1jmxODsJPiBFi05mSem94WtegjBkrUMohGcKyfjEdixjZfa9G6Zhux/951HFkrl
UdUSlTnfO39Cj3V5AkSR95wBVuPgjMKSBVDxKuUrWib2p3Wx0P/eeDtcgy+yPHkX
Gxxa5nhmBUJJZCWl08kKQ1oacVX5/sg9ZcmpAXzcR3lLYL/GlgiB9tS8piRGLUN6
GVneU5uqVnj1Jm67J7SVZtPsvbhV2jEP9aeNx8MZzaTJ412XzYeMB9P88JFQpZRj
UXxhF/Q5HsjxKX0XKUD9O3W4KXD1YYWHng3ljarzR25Sd3zA7Y5HJ3NnCZRRa3Mm
hVgJQIk1xB40aNnRVXJT
=2jeq
-----END PGP SIGNATURE-----
Merge tag 'for_linus-4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kgdb fixes from Jason Wessel:
- Fix long standing problem with kdb kallsyms_symbol_next() return
value
- Add new co-maintainer Daniel Thompson
* tag 'for_linus-4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kgdb/kdb/debug_core: Add co-maintainer Daniel Thompson
kdb: Fix handling of kallsyms_symbol_next() return value
Pull CPU hotplug fix from Ingo Molnar:
"A single fix moving the smp-call queue flush step to the intended
point in the state machine"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp/hotplug: Move step CPUHP_AP_SMPCFD_DYING to the correct place
Pull scheduler fixes from Ingo Molnar:
"This includes a fix for the add_wait_queue() queue ordering brown
paperbag bug, plus PELT accounting fixes for cgroups scheduling
artifacts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Update and fix the runnable propagation rule
sched/wait: Fix add_wait_queue() behavioral change
Pull perf fixes from Ingo Molnar:
"This includes perf namespace support kernel side fixes, plus an
accumulated set of perf tooling fixes - including UAPI header
synchronization that should make the perf build less noisy"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tooling/headers: Synchronize updated s390 and x86 UAPI headers
tools headers: Syncronize mman.h ABI header
tools headers: Synchronize prctl.h ABI header
tools headers: Synchronize KVM arch ABI headers
tools headers: Synchronize drm/i915_drm.h
tools headers uapi: Synchronize drm/drm.h
tools headers: Synchronize perf_event.h header
tools headers: Synchronize kernel ABI headers wrt SPDX tags
tools/headers: Synchronize kernel x86 UAPI headers
perf intel-pt: Bring instruction decoder files into line with the kernel
perf test: Fix test 21 for s390x
perf bench numa: Fixup discontiguous/sparse numa nodes
perf top: Use signal interface for SIGWINCH handler
perf top: Fix window dimensions change handling
perf: Fix header.size for namespace events
perf top: Ignore kptr_restrict when not sampling the kernel
perf record: Ignore kptr_restrict when not sampling the kernel
perf report: Ignore kptr_restrict when not sampling the kernel
perf evlist: Add helper to check if attr.exclude_kernel is set in all evsels
perf test shell: Fix test case probe libc's inet_pton on s390x
...
Pull lockdep fix from Ingo Molnar:
"Fix a possible NULL dereference for the (rare) case when a task
doesn't have ->xhlocks space allocated due to kmalloc() OOM-ing"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix possible NULL deref
Pull irq fixes from Ingo Molnar:
"Two fixes: use bool type consistently, plus a irq_matrix_available()
bugfix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdesc: Use bool return type instead of int
genirq/matrix: Fix the precedence fix for real
kallsyms_symbol_next() returns a boolean (true on success). Currently
kdb_read() tests the return value with an inequality that
unconditionally evaluates to true.
This is fixed in the obvious way and, since the conditional branch is
supposed to be unreachable, we also add a WARN_ON().
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.
Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:
gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
which assumes that tasks are equally runnable which is not true but
easy to compute.
Beside these estimates, we have several simple rules that help us to filter
out wrong ones:
- ge->avg.runnable_sum <= than LOAD_AVG_MAX
- ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
- ge->avg.runnable_sum can't increase when we detach a task
The effect of these fixes is better cgroups balancing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following cleanup commit:
50816c4899 ("sched/wait: Standardize internal naming of wait-queue entries")
... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.
Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CPUHP_AP_SCHED_MIGRATE_DYING doesn't exist, it looks like this was
supposed to refer to CPUHP_AP_SCHED_STARTING's teardown callback,
i.e. sched_cpu_dying().
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206105911.28093-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Geert Uytterhoeven reported a NFS oops, and pointed out that some of the
numbers were hashed and useless.
We could just turn them from '%p' into '%px', but those numbers are
really just legacy, and useless even when not hashed.
So just remove them entirely.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT
program type") introduced the bpf_perf_event_data structure which
exports the pt_regs structure. This is OK for multiple architectures
but fail for s390 and arm64 which do not export pt_regs. Programs
using them, for example, the bpf selftest fail to compile on these
architectures.
For s390, exporting the pt_regs is not an option because s390 wants
to allow changes to it. For arm64, there is a user_pt_regs structure
that covers parts of the pt_regs structure for use by user space.
To solve the broken uapi for s390 and arm64, introduce an abstract
type for pt_regs and add an asm/bpf_perf_event.h file that concretes
the type. An asm-generic header file covers the architectures that
export pt_regs today.
The arch-specific enablement for s390 and arm64 follows in separate
commits.
Reported-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Fixes: 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type")
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-and-tested-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reverts commit aa24163b2e.
This and the following commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7a ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.
Revert it for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the cpu/hotplug refactoring, DOWN_FAILED is never called without
preceding DOWN_PREPARE making the workaround unnecessary. Remove it.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the recent cpu/hotplug refactoring, workqueue_offline_cpu() is
guaranteed to run on the local cpu which is going offline.
This also fixes the following deadlock by removing work item
scheduling and flushing from CPU hotplug path.
http://lkml.kernel.org/r/1504764252-29091-1-git-send-email-prsood@codeaurora.org
tj: Description update.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit 1599a185f0.
This and the previous commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7a ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.
Revert it for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
The previous commit which made the operator precedence in
irq_matrix_available() explicit made the implicit brokenness explicitely
wrong. It was wrong in the original commit already. The overworked
maintainer did not notice it either when merging the patch.
Replace the confusing '?' construct by a simple and obvious if ().
Fixes: 75f1133873 ("genirq/matrix: Make - vs ?: Precedence explicit")
Reported-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Pull networking fixes from David Miller:
1) Various TCP control block fixes, including one that crashes with
SELinux, from David Ahern and Eric Dumazet.
2) Fix ACK generation in rxrpc, from David Howells.
3) ipvlan doesn't set the mark properly in the ipv4 route lookup key,
from Gao Feng.
4) SIT configuration doesn't take on the frag_off ipv4 field
configuration properly, fix from Hangbin Liu.
5) TSO can fail after device down/up on stmmac, fix from Lars Persson.
6) Various bpftool fixes (mostly in JSON handling) from Quentin Monnet.
7) Various SKB leak fixes in vhost/tun/tap (mostly observed as
performance problems). From Wei Xu.
8) mvpps's TX descriptors were not zero initialized, from Yan Markman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
rxrpc: Fix the MAINTAINERS record
rxrpc: Use correct netns source in rxrpc_release_sock()
liquidio: fix incorrect indentation of assignment statement
stmmac: reset last TSO segment size after device open
ipvlan: Add the skb->mark as flow4's member to lookup route
s390/qeth: build max size GSO skbs on L2 devices
s390/qeth: fix GSO throughput regression
s390/qeth: fix thinko in IPv4 multicast address tracking
tap: free skb if flags error
tun: free skb in early errors
vhost: fix skb leak in handle_rx()
bnxt_en: Fix a variable scoping in bnxt_hwrm_do_send_msg()
bnxt_en: fix dst/src fid for vxlan encap/decap actions
bnxt_en: wildcard smac while creating tunnel decap filter
bnxt_en: Need to unconditionally shut down RoCE in bnxt_shutdown
phylink: ensure we take the link down when phylink_stop() is called
sfp: warn about modules requiring address change sequence
sfp: improve RX_LOS handling
...
By passing an export descriptor to the write function, users don't need to
keep a global static pointer and can rely on container_of() to fetch their
own structure.
Link: http://lkml.kernel.org/r/20170602102025.5140-1-felipe.balbi@linux.intel.com
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This fixes the following warning when building with clang:
kernel/trace/ring_buffer.c:1842:1: error: unused function
'__rb_data_page_index' [-Werror,-Wunused-function]
Link: http://lkml.kernel.org/r/20170518001415.5223-1-mka@chromium.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When CONFIG_TRACING is disabled, the new preemptirq events tracer
produces a build failure:
In file included from kernel/trace/trace_irqsoff.c:17:0:
kernel/trace/trace.h: In function 'trace_test_and_set_recursion':
kernel/trace/trace.h:542:28: error: 'struct task_struct' has no member named 'trace_recursion'
Adding an explicit dependency avoids the broken configuration.
Link: http://lkml.kernel.org/r/20171103104031.270375-1-arnd@arndb.de
Fixes: d59158162e ("tracing: Add support for preempt and irq enable/disable events")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For my x86 distribution, the NR_CPUS is 8192 and
nr_cpu_ids is 4. About 2 pages are wasted.
Most machines don't have so many CPUs, so define a array with NR_CPUS
just wastes memory. So let's allocate the buffer dynamically when need.
With this change, the mutext tracing_cpumask_update_lock also can be
removed now, which was used to protect mask_str.
Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com
Fixes: 36dfe9252b ("ftrace: make use of tracing_cpumask")
Cc: stable@vger.kernel.org
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Naming in code comments for tracing_snapshot, tracing_snapshot_alloc
and trace_pid_filter_add_remove_task don't match the real function
names. And latency_trace has been removed from tracing directory.
Fix them.
Link: http://lkml.kernel.org/r/1508394753-20887-1-git-send-email-chuhu@redhat.com
Fixes: cab5037 ("tracing/ftrace: Enable snapshot function trigger")
Fixes: 886b5b7 ("tracing: remove /debug/tracing/latency_trace")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
[ Replaced /sys/kernel/debug/tracing with /sys/kerne/tracing ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-02
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a compilation warning in xdp redirect tracepoint due to
missing bpf.h include that pulls in struct bpf_map, from Xie.
2) Limit the maximum number of attachable BPF progs for a given
perf event as long as uabi is not frozen yet. The hard upper
limit is now 64 and therefore the same as with BPF multi-prog
for cgroups. Also add related error checking for the sample
BPF loader when enabling and attaching to the perf event, from
Yonghong.
3) Specifically set the RLIMIT_MEMLOCK for the test_verifier_log
case, so that the test case can always pass and not fail in
some environments due to too low default limit, also from
Yonghong.
4) Fix up a missing license header comment for kernel/bpf/offload.c,
from Jakub.
5) Several fixes for bpftool, among others a crash on incorrect
arguments when json output is used, error message handling
fixes on unknown options and proper destruction of json writer
for some exit cases, all from Quentin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>