Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it.
The BPF_TRAMP_F_IP_ARG adds extra 3 instructions to trampoline code
and is used only by programs with bpf_get_func_ip helper, which is
added in following patch and sets call_get_func_ip bit.
This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for
trampolines that have programs with call_get_func_ip set.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
Teach max stack depth checking algorithm about async callbacks
that don't increase bpf program stack size.
Also add sanity check that bpf_tail_call didn't sneak into async cb.
It's impossible, since PTR_TO_CTX is not available in async cb,
hence the program cannot contain bpf_tail_call(ctx,...);
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
and pass them into the helpers as function callbacks.
In case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and new frame in __check_func_call().
bpf_timer_set_callback() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as special kind
of jump by pushing verifier state into stack and let pop_stack() process it.
Special care needs to be taken during state pruning.
The call insn doing bpf_timer_set_callback has to be a prune_point.
Otherwise short timer callbacks might not have prune points in front of
bpf_timer_set_callback() which means is_state_visited() will be called
after this call insn is processed in __check_func_call(). Which means that
another async_cb state will be pushed to be walked later and the verifier
will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch the
infinite loop detection will trigger false positive. To recognize
this case mark such states as in_async_callback_fn.
To distinguish infinite loop in async callback vs the same callback called
with different arguments for different map and timer add async_entry_cnt
to bpf_func_state.
Enforce return zero from async callbacks.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
In the following bpf subprogram:
static int timer_cb(void *map, void *key, void *value)
{
bpf_timer_set_callback(.., timer_cb);
}
the 'timer_cb' is a pointer to a function.
ld_imm64 insn is used to carry this pointer.
bpf_pseudo_func() returns true for such ld_imm64 insn.
Unlike bpf_for_each_map_elem() the bpf_timer_set_callback() is asynchronous.
Relax control flow check to allow such "recursion" that is seen as an infinite
loop by check_cfg(). The distinction between bpf_for_each_map_elem() the
bpf_timer_set_callback() is done in the follow up patch.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com
BTF is required for 'struct bpf_timer' to be recognized inside map value.
The bpf timers are supported inside inner maps.
Remember 'struct btf *' in inner_map_meta to make it available
to the verifier in the sequence:
struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
if (inner_map)
timer = bpf_map_lookup_elem(&inner_map, ...);
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com
bpf_timer_init() arguments are:
1. pointer to a timer (which is embedded in map element).
2. pointer to a map.
Make sure that pointer to a timer actually belongs to that map.
Use map_uid (which is unique id of inner map) to reject:
inner_map1 = bpf_map_lookup_elem(outer_map, key1)
inner_map2 = bpf_map_lookup_elem(outer_map, key2)
if (inner_map1 && inner_map2) {
timer = bpf_map_lookup_elem(inner_map1);
if (timer)
// mismatch would have been allowed
bpf_timer_init(timer, inner_map2);
}
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
Restrict bpf timers to array, hash (both preallocated and kmalloced), and
lru map types. The per-cpu maps with timers don't make sense, since 'struct
bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
the number of timers depends on number of possible cpus and timers would not be
accessible from all cpus. lpm map support can be added in the future.
The timers in inner maps are supported.
The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
bpf_timer in a given map element.
Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
that map element indeed contains 'struct bpf_timer'.
Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
map element data is reused in preallocated htab and lru maps.
Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
map element. There could be one of each, but not more than one. Due to 'one
bpf_timer in one element' restriction do not support timers in global data,
since global data is a map of single element, but from bpf program side it's
seen as many global variables and restriction of single global timer would be
odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
timers, since user space could have corrupted mmap element and crashed the
kernel. The maps with timers cannot be readonly. Due to these restrictions
search for bpf_timer in datasec BTF in case it was placed in the global data to
report clear error.
The previous patch allowed 'struct bpf_timer' as a first field in a map
element only. Relax this restriction.
Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
the timer when lru map deletes an element as a part of it eviction algorithm.
Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
The timer operation are done through helpers only.
This is similar to 'struct bpf_spin_lock'.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field and helpers to operate on it:
// Initialize the timer.
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
// Configure the timer to call 'callback_fn' static function.
long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
// Arm the timer to expire 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);
Here is how BPF program might look like:
struct map_elem {
int counter;
struct bpf_timer timer;
};
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1000);
__type(key, int);
__type(value, struct map_elem);
} hmap SEC(".maps");
static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */
SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
struct map_elem *val;
int key = 0;
val = bpf_map_lookup_elem(&hmap, &key);
if (val) {
bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
bpf_timer_set_callback(&val->timer, timer_cb);
bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */, 0);
}
}
This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patches add necessary safety checks.
Only programs with CAP_BPF are allowed to use bpf_timer.
The amount of timers used by the program is constrained by
the memcg recorded at map creation time.
The bpf_timer_init() helper needs explicit 'map' argument because inner maps
are dynamic and not known at load time. While the bpf_timer_set_callback() is
receiving hidden 'aux->prog' argument supplied by the verifier.
The prog pointer is needed to do refcnting of bpf program to make sure that
program doesn't get freed while the timer is armed. This approach relies on
"user refcnt" scheme used in prog_array that stores bpf programs for
bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
paired with bpf_timer_cancel() that will drop the prog refcnt. The
ops->map_release_uref is responsible for cancelling the timers and dropping
prog refcnt when user space reference to a map reaches zero.
This uref approach is done to make sure that Ctrl-C of user space process will
not leave timers running forever unless the user space explicitly pinned a map
that contained timers in bpffs.
bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
have user references (is not held by open file descriptor from user space and
not pinned in bpffs).
The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if given map element had it allocated.
"bpftool map update" command can be used to cancel timers.
The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
'__u64 :64' has 1 byte alignment of 8 byte padding.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
Move ____bpf_spin_lock/unlock into helpers to make it more clear
that quadruple underscore bpf_spin_lock/unlock are irqsave/restore variants.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com
Currently bpf_prog_put() is called from the task context only.
With addition of bpf timers the timer related helpers will start calling
bpf_prog_put() from irq-saved region and in rare cases might drop
the refcnt to zero.
To address this case, first, convert bpf_prog_free_id() to be irq-save
(this is similar to bpf_map_free_id), and, second, defer non irq
appropriate calls into work queue.
For example:
bpf_audit_prog() is calling kmalloc and wake_up_interruptible,
bpf_prog_kallsyms_del_all()->bpf_ksym_del()->spin_unlock_bh().
They are not safe with irqs disabled.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com
In bpf_patch_insn_data(), we first use the bpf_patch_insn_single() to
insert new instructions, then use adjust_insn_aux_data() to adjust
insn_aux_data. If the old env->prog have no enough room for new inserted
instructions, we use bpf_prog_realloc to construct new_prog and free the
old env->prog.
There have two errors here. First, if adjust_insn_aux_data() return
ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data()
return ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has
been freed in bpf_prog_realloc, but we will use it in bpf_check().
So in this patch, we make the adjust_insn_aux_data() never fails. In
bpf_patch_insn_data(), we first pre-malloc memory for the new
insn_aux_data, then call bpf_patch_insn_single() to insert new
instructions, at last call adjust_insn_aux_data() to adjust
insn_aux_data.
Fixes: 8041902dae ("bpf: adjust insn_aux_data when patching insns")
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
During testing of f263a81451 ("bpf: Track subprog poke descriptors correctly
and fix use-after-free") under various failure conditions, for example, when
jit_subprogs() fails and tries to clean up the program to be run under the
interpreter, we ran into the following freeze:
[...]
#127/8 tailcall_bpf2bpf_3:FAIL
[...]
[ 92.041251] BUG: KASAN: slab-out-of-bounds in ___bpf_prog_run+0x1b9d/0x2e20
[ 92.042408] Read of size 8 at addr ffff88800da67f68 by task test_progs/682
[ 92.043707]
[ 92.044030] CPU: 1 PID: 682 Comm: test_progs Tainted: G O 5.13.0-53301-ge6c08cb33a30-dirty #87
[ 92.045542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 92.046785] Call Trace:
[ 92.047171] ? __bpf_prog_run_args64+0xc0/0xc0
[ 92.047773] ? __bpf_prog_run_args32+0x8b/0xb0
[ 92.048389] ? __bpf_prog_run_args64+0xc0/0xc0
[ 92.049019] ? ktime_get+0x117/0x130
[...] // few hundred [similar] lines more
[ 92.659025] ? ktime_get+0x117/0x130
[ 92.659845] ? __bpf_prog_run_args64+0xc0/0xc0
[ 92.660738] ? __bpf_prog_run_args32+0x8b/0xb0
[ 92.661528] ? __bpf_prog_run_args64+0xc0/0xc0
[ 92.662378] ? print_usage_bug+0x50/0x50
[ 92.663221] ? print_usage_bug+0x50/0x50
[ 92.664077] ? bpf_ksym_find+0x9c/0xe0
[ 92.664887] ? ktime_get+0x117/0x130
[ 92.665624] ? kernel_text_address+0xf5/0x100
[ 92.666529] ? __kernel_text_address+0xe/0x30
[ 92.667725] ? unwind_get_return_address+0x2f/0x50
[ 92.668854] ? ___bpf_prog_run+0x15d4/0x2e20
[ 92.670185] ? ktime_get+0x117/0x130
[ 92.671130] ? __bpf_prog_run_args64+0xc0/0xc0
[ 92.672020] ? __bpf_prog_run_args32+0x8b/0xb0
[ 92.672860] ? __bpf_prog_run_args64+0xc0/0xc0
[ 92.675159] ? ktime_get+0x117/0x130
[ 92.677074] ? lock_is_held_type+0xd5/0x130
[ 92.678662] ? ___bpf_prog_run+0x15d4/0x2e20
[ 92.680046] ? ktime_get+0x117/0x130
[ 92.681285] ? __bpf_prog_run32+0x6b/0x90
[ 92.682601] ? __bpf_prog_run64+0x90/0x90
[ 92.683636] ? lock_downgrade+0x370/0x370
[ 92.684647] ? mark_held_locks+0x44/0x90
[ 92.685652] ? ktime_get+0x117/0x130
[ 92.686752] ? lockdep_hardirqs_on+0x79/0x100
[ 92.688004] ? ktime_get+0x117/0x130
[ 92.688573] ? __cant_migrate+0x2b/0x80
[ 92.689192] ? bpf_test_run+0x2f4/0x510
[ 92.689869] ? bpf_test_timer_continue+0x1c0/0x1c0
[ 92.690856] ? rcu_read_lock_bh_held+0x90/0x90
[ 92.691506] ? __kasan_slab_alloc+0x61/0x80
[ 92.692128] ? eth_type_trans+0x128/0x240
[ 92.692737] ? __build_skb+0x46/0x50
[ 92.693252] ? bpf_prog_test_run_skb+0x65e/0xc50
[ 92.693954] ? bpf_prog_test_run_raw_tp+0x2d0/0x2d0
[ 92.694639] ? __fget_light+0xa1/0x100
[ 92.695162] ? bpf_prog_inc+0x23/0x30
[ 92.695685] ? __sys_bpf+0xb40/0x2c80
[ 92.696324] ? bpf_link_get_from_fd+0x90/0x90
[ 92.697150] ? mark_held_locks+0x24/0x90
[ 92.698007] ? lockdep_hardirqs_on_prepare+0x124/0x220
[ 92.699045] ? finish_task_switch+0xe6/0x370
[ 92.700072] ? lockdep_hardirqs_on+0x79/0x100
[ 92.701233] ? finish_task_switch+0x11d/0x370
[ 92.702264] ? __switch_to+0x2c0/0x740
[ 92.703148] ? mark_held_locks+0x24/0x90
[ 92.704155] ? __x64_sys_bpf+0x45/0x50
[ 92.705146] ? do_syscall_64+0x35/0x80
[ 92.706953] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]
Turns out that the program rejection from e411901c0b ("bpf: allow for tailcalls
in BPF subprograms for x64 JIT") is buggy since env->prog->aux->tail_call_reachable
is never true. Commit ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT") added a tracker into check_max_stack_depth() which propagates
the tail_call_reachable condition throughout the subprograms. This info is then
assigned to the subprogram's func[i]->aux->tail_call_reachable. However, in the
case of the rejection check upon JIT failure, env->prog->aux->tail_call_reachable
is used. func[0]->aux->tail_call_reachable which represents the main program's
information did not propagate this to the outer env->prog->aux, though. Add this
propagation into check_max_stack_depth() where it needs to belong so that the
check can be done reliably.
Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Co-developed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/618c34e3163ad1a36b1e82377576a6081e182f25.1626123173.git.daniel@iogearbox.net
Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
[...]
[ 402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
[ 402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
[ 402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G I 5.12.0+ #399
[ 402.824715] Call Trace:
[ 402.824719] dump_stack+0x93/0xc2
[ 402.824727] print_address_description.constprop.0+0x1a/0x140
[ 402.824736] ? prog_array_map_poke_run+0xc2/0x34e
[ 402.824740] ? prog_array_map_poke_run+0xc2/0x34e
[ 402.824744] kasan_report.cold+0x7c/0xd8
[ 402.824752] ? prog_array_map_poke_run+0xc2/0x34e
[ 402.824757] prog_array_map_poke_run+0xc2/0x34e
[ 402.824765] bpf_fd_array_map_update_elem+0x124/0x1a0
[...]
The elements concerned are walked as follows:
for (i = 0; i < elem->aux->size_poke_tab; i++) {
poke = &elem->aux->poke_tab[i];
[...]
The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:
[ 402.825004] The buggy address belongs to the object at ffff8881905a7800
which belongs to the cache kmalloc-1k of size 1024
[ 402.825008] The buggy address is located 320 bytes inside of
1024-byte region [ffff8881905a7800, ffff8881905a7c00)
The pahole output of bpf_prog_aux:
struct bpf_prog_aux {
[...]
/* --- cacheline 5 boundary (320 bytes) --- */
u32 size_poke_tab; /* 320 4 */
[...]
In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects special; I agree, its unnecessary. The fix here removes the per
subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main programs poke table. This way, map
tracking is simplified to the main program and we do not need to manage them
per subprogram.
This also means, bpf_prog_free_deferred(), which unwinds the program reference
counting and kfrees objects, needs to ensure that we don't try to double free
the poke_tab when free'ing the subprog structures. This is easily solved by
NULL'ing the poke_tab pointer. The second detail is to ensure that per
subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check while walking the poke_tab structure if the current
entry belongs to the current program. The aux pointer is stable and therefore
suitable for such comparison. On the jit_subprogs() error path, we omit
cleaning up the poke->aux field because these are only ever referenced from
the JIT side, but on error we will never make it to the JIT, so its fine to
leave them dangling. Removing these pointers would complicate the error path
for no reason. However, we do need to untrack all poke descriptors from the
main program as otherwise they could race with the freeing of JIT memory from
the subprograms. Lastly, a748c6975d ("bpf: propagate poke descriptors to
subprograms") had an off-by-one on the subprogram instruction index range
check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
However, subprog_end is the next subprogram's start instruction.
Fixes: a748c6975d ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
This lifts the restriction on running devmap BPF progs in generic
redirect mode. To match native XDP behavior, it is invoked right before
generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
XDP_DROP actions.
We also return 0 even if devmap program drops the packet, as
semantically redirect has already succeeded and the devmap prog is the
last point before TX of the packet to device where it can deliver a
verdict on the packet.
This also means it must take care of freeing the skb, as
xdp_do_generic_redirect callers only do that in case an error is
returned.
Since devmap entry prog is supported, remove the check in
generic_xdp_install entirely.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
This change implements CPUMAP redirect support for generic XDP programs.
The idea is to reuse the cpu map entry's queue that is used to push
native xdp frames for redirecting skb to a different CPU. This will
match native XDP behavior (in that RPS is invoked again for packet
reinjected into networking stack).
To be able to determine whether the incoming skb is from the driver or
cpumap, we reuse skb->redirected bit that skips generic XDP processing
when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
it has been lifted and it is always available.
>From the redirect side, we add the skb to ptr_ring with its lowest bit
set to 1. This should be safe as skb is not 1-byte aligned. This allows
kthread to discern between xdp_frames and sk_buff. On consumption of the
ptr_ring item, the lowest bit is unset.
In the end, the skb is simply added to the list that kthread is anyway
going to maintain for xdp_frames converted to skb, and then received
again by using netif_receive_skb_list.
Bulking optimization for generic cpumap is left as an exercise for a
future patch for now.
Since cpumap entry progs are now supported, also remove check in
generic_xdp_install for the cpumap.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
There were a couple of READ_ONCE()-invocations left-over by the devmap
RCU conversion. Convert these to rcu_dereference_check() as well to avoid
complaints from sparse.
Fixes: 782347b6bc ("xdp: Add proper __rcu annotations to redirect map entries")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210629093907.573598-1-toke@redhat.com
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-28
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 12 day(s) which contain
a total of 56 files changed, 394 insertions(+), 380 deletions(-).
The main changes are:
1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney.
2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski.
3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev.
4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria.
5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards.
6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa.
7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi.
8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim.
9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young.
10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer.
11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski.
12) Fix up bpfilter startup log-level to info level, from Gary Lin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
kmemleak scans struct page, but it does not scan the page content. If we
allocate some memory with kmalloc(), then allocate page with alloc_page(),
and if we put kmalloc pointer somewhere inside that page, kmemleak will
report kmalloc pointer as a false positive.
We can instruct kmemleak to scan the memory area by calling kmemleak_alloc()
and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space,
and if struct bpf_ringbuf changes we would have to revisit and review size
argument in kmemleak_alloc(), because we do not want kmemleak to scan the
user space memory. Let's simplify things and use kmemleak_not_leak() here.
For posterity, also adding additional prior analysis from Andrii:
I think either kmemleak or syzbot are misreporting this. I've added a
bunch of printks around all allocations performed by BPF ringbuf. [...]
On repro side I get these two warnings:
[vmuser@archvm bpf]$ sudo ./repro
BUG: memory leak
unreferenced object 0xffff88810d538c00 (size 64):
comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
hex dump (first 32 bytes):
00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff ................
80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff .........)......
backtrace:
[<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
[<00000000f601d565>] do_syscall_64+0x2d/0x40
[<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae
BUG: memory leak
unreferenced object 0xffff88810d538c80 (size 64):
comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
hex dump (first 32 bytes):
80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff ................
c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff .........D(.....
backtrace:
[<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
[<00000000f601d565>] do_syscall_64+0x2d/0x40
[<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00)
correspond to pages array bpf_ringbuf is allocating and tracking properly
internally. Note also that syzbot repro doesn't close FD of created BPF
ringbufs, and even when ./repro itself exits with error, there are still
two forked processes hanging around in my system. So clearly ringbuf maps
are alive at that point. So reporting any memory leak looks weird at that
point, because that memory is being used by active referenced BPF ringbuf.
It's also a question why repro doesn't clean up its forks. But if I do a
`pkill repro`, I do see that all the allocated memory is /properly/ cleaned
up [and the] "leaks" are deallocated properly.
BTW, if I add close() right after bpf() syscall in syzbot repro, I see that
everything is immediately deallocated, like designed. And no memory leak
is reported. So I don't think the problem is anywhere in bpf_ringbuf code,
rather in the leak detection and/or repro itself.
Reported-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
[ Daniel: also included analysis from Andrii to the commit log ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com
Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10
Link: https://lore.kernel.org/lkml/20210615101515.GC26027@arm.com
Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
Link: https://lore.kernel.org/bpf/20210626181156.1873604-1-rkovhaev@gmail.com
XDP_REDIRECT works by a three-step process: the bpf_redirect() and
bpf_redirect_map() helpers will lookup the target of the redirect and store
it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
Next, when the program returns the XDP_REDIRECT return code, the driver
will call xdp_do_redirect() which will use the information thus stored to
actually enqueue the frame into a bulk queue structure (that differs
slightly by map type, but shares the same principle). Finally, before
exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
flush all the different bulk queues, thus completing the redirect.
Pointers to the map entries will be kept around for this whole sequence of
steps, protected by RCU. However, there is no top-level rcu_read_lock() in
the core code; instead drivers add their own rcu_read_lock() around the XDP
portions of the code, but somewhat inconsistently as Martin discovered[0].
However, things still work because everything happens inside a single NAPI
poll sequence, which means it's between a pair of calls to
local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
document this intention by using rcu_dereference_check() with
rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
lockdep to verify that everything is done correctly.
This patch does just that: we add an __rcu annotation to the map entry
pointers and remove the various comments explaining the NAPI poll assurance
strewn through devmap.c in favour of a longer explanation in filter.c. The
goal is to have one coherent documentation of the entire flow, and rely on
the RCU annotations as a "standard" way of communicating the flow in the
map code (which can additionally be understood by sparse and lockdep).
The RCU annotation replacements result in a fairly straight-forward
replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the
proper constructs to cast the pointer back and forth between __rcu and
__kernel address space (for the benefit of sparse). The one complication is
that xskmap has a few constructions where double-pointers are passed back
and forth; these simply all gain __rcu annotations, and only the final
reference/dereference to the inner-most pointer gets changed.
With this, everything can be run through sparse without eliciting
complaints, and lockdep can verify correctness even without the use of
rcu_read_lock() in the drivers. Subsequent patches will clean these up from
the drivers.
[0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
[1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
XDP programs are called from a NAPI poll context, which means the RCU
reference liveness is ensured by local_bh_disable(). Add
rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
lockdep understands that the dereferences are safe from inside *either* an
rcu_read_lock() section *or* a local_bh_disable() section. While both
bh_disabled and rcu_read_lock() provide RCU protection, they are
semantically distinct, so we need both conditions to prevent lockdep
complaints.
This change is done in preparation for removing the redundant
rcu_read_lock()s from drivers.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210624160609.292325-5-toke@redhat.com
The sub-programs prog->aux->poke_tab[] is populated in jit_subprogs() and
then used when emitting 'BPF_JMP|BPF_TAIL_CALL' insn->code from the
individual JITs. The poke_tab[] to use is stored in the insn->imm by
the code adding it to that array slot. The JIT then uses imm to find the
right entry for an individual instruction. In the x86 bpf_jit_comp.c
this is done by calling emit_bpf_tail_call_direct with the poke_tab[]
of the imm value.
However, we observed the below null-ptr-deref when mixing tail call
programs with subprog programs. For this to happen we just need to
mix bpf-2-bpf calls and tailcalls with some extra calls or instructions
that would be patched later by one of the fixup routines. So whats
happening?
Before the fixup_call_args() -- where the jit op is done -- various
code patching is done by do_misc_fixups(). This may increase the
insn count, for example when we patch map_lookup_up using map_gen_lookup
hook. This does two things. First, it means the instruction index,
insn_idx field, of a tail call instruction will move by a 'delta'.
In verifier code,
struct bpf_jit_poke_descriptor desc = {
.reason = BPF_POKE_REASON_TAIL_CALL,
.tail_call.map = BPF_MAP_PTR(aux->map_ptr_state),
.tail_call.key = bpf_map_key_immediate(aux),
.insn_idx = i + delta,
};
Then subprog start values subprog_info[i].start will be updated
with the delta and any poke descriptor index will also be updated
with the delta in adjust_poke_desc(). If we look at the adjust
subprog starts though we see its only adjusted when the delta
occurs before the new instructions,
/* NOTE: fake 'exit' subprog should be updated as well. */
for (i = 0; i <= env->subprog_cnt; i++) {
if (env->subprog_info[i].start <= off)
continue;
Earlier subprograms are not changed because their start values
are not moved. But, adjust_poke_desc() does the offset + delta
indiscriminately. The result is poke descriptors are potentially
corrupted.
Then in jit_subprogs() we only populate the poke_tab[]
when the above insn_idx is less than the next subprogram start. From
above we corrupted our insn_idx so we might incorrectly assume a
poke descriptor is not used in a subprogram omitting it from the
subprogram. And finally when the jit runs it does the deref of poke_tab
when emitting the instruction and crashes with below. Because earlier
step omitted the poke descriptor.
The fix is straight forward with above context. Simply move same logic
from adjust_subprog_starts() into adjust_poke_descs() and only adjust
insn_idx when needed.
[ 82.396354] bpf_testmod: version magic '5.12.0-rc2alu+ SMP preempt mod_unload ' should be '5.12.0+ SMP preempt mod_unload '
[ 82.623001] loop10: detected capacity change from 0 to 8
[ 88.487424] ==================================================================
[ 88.487438] BUG: KASAN: null-ptr-deref in do_jit+0x184a/0x3290
[ 88.487455] Write of size 8 at addr 0000000000000008 by task test_progs/5295
[ 88.487471] CPU: 7 PID: 5295 Comm: test_progs Tainted: G I 5.12.0+ #386
[ 88.487483] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[ 88.487490] Call Trace:
[ 88.487498] dump_stack+0x93/0xc2
[ 88.487515] kasan_report.cold+0x5f/0xd8
[ 88.487530] ? do_jit+0x184a/0x3290
[ 88.487542] do_jit+0x184a/0x3290
...
[ 88.487709] bpf_int_jit_compile+0x248/0x810
...
[ 88.487765] bpf_check+0x3718/0x5140
...
[ 88.487920] bpf_prog_load+0xa22/0xf10
Fixes: a748c6975d ("bpf: propagate poke descriptors to subprograms")
Reported-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
In 32-bit architecture, the result of sizeof() is a 32-bit integer so
the expression becomes the multiplication between 2 32-bit integer which
can potentially leads to integer overflow. As a result,
bpf_map_area_alloc() allocates less memory than needed.
Fix this by casting 1 operand to u64.
Fixes: 0d2c4f9640 ("bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps")
Fixes: 99c51064fb ("devmap: Use bpf_map_area_alloc() for allocating hash buckets")
Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210613143440.71975-1-minhquangbui99@gmail.com
This reverts commit d37300ed18 ("bpf: program: Refuse non-O_RDWR flags
in BPF_OBJ_GET"). It breaks Android userspace which expects to be able to
fetch programs with just read permissions.
See: https://cs.android.com/android/platform/superproject/+/master:frameworks/libs/net/common/native/bpf_syscall_wrappers/include/BpfSyscallWrappers.h;drc=7005c764be23d31fa1d69e826b4a2f6689a8c81e;l=124
Side-note: another option to fix it would be to extend bpf_prog_new_fd()
and to pass in used file mode flags in the same way as we do for maps via
bpf_map_new_fd(). Meaning, they'd end up in anon_inode_getfd() and thus
would be retained for prog fd operations with bpf() syscall. Right now
these flags are not checked with progs since they are immutable for their
lifetime (as opposed to maps which can be updated from user space). In
future this could potentially change with new features, but at that point
it's still fine to do the bpf_prog_new_fd() extension when needed. For a
simple stable fix, a revert is less churn.
Fixes: d37300ed18 ("bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
[ Daniel: added side-note to commit message ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Greg Kroah-Hartman <gregkh@google.com>
Link: https://lore.kernel.org/bpf/20210618105526.265003-1-zenczykowski@gmail.com
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh
scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-17
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).
The main changes are:
1) BPF infrastructure to migrate TCP child sockets from a listener to another
in the same reuseport group/map, from Kuniyuki Iwashima.
2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.
3) Streamline error reporting changes in libbpf as planned out in the
'libbpf: the road to v1.0' effort, from Andrii Nakryiko.
4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.
5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.
6) Support new LLVM relocations in libbpf to make them more linker friendly,
also add a doc to describe the BPF backend relocations, from Yonghong Song.
7) Silence long standing KUBSAN complaints on register-based shifts in
interpreter, from Daniel Borkmann and Eric Biggers.
8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
target arch cannot be determined, from Lorenz Bauer.
9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.
10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.
11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
programs to bpf_helpers.h header, from Florent Revest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a shift-out-of-bounds that KUBSAN observed in the
interpreter:
[...]
UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2
shift exponent 255 is too large for 64-bit type 'long long unsigned int'
CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420
__bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735
bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline]
bpf_prog_run_clear_cb include/linux/filter.h:755 [inline]
run_filter+0x1a1/0x470 net/packet/af_packet.c:2031
packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104
dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387
xmit_one net/core/dev.c:3588 [inline]
dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609
__dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182
__bpf_tx_skb net/core/filter.c:2116 [inline]
__bpf_redirect_no_mac net/core/filter.c:2141 [inline]
__bpf_redirect+0x548/0xc80 net/core/filter.c:2164
____bpf_clone_redirect net/core/filter.c:2448 [inline]
bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420
___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523
__bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737
bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50
bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582
bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline]
__do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]
Generally speaking, KUBSAN reports from the kernel should be fixed.
However, in case of BPF, this particular report caused concerns since
the large shift is not wrong from BPF point of view, just undefined.
In the verifier, K-based shifts that are >= {64,32} (depending on the
bitwidth of the instruction) are already rejected. The register-based
cases were not given their content might not be known at verification
time. Ideas such as verifier instruction rewrite with an additional
AND instruction for the source register were brought up, but regularly
rejected due to the additional runtime overhead they incur.
As Edward Cree rightly put it:
Shifts by more than insn bitness are legal in the BPF ISA; they are
implementation-defined behaviour [of the underlying architecture],
rather than UB, and have been made legal for performance reasons.
Each of the JIT backends compiles the BPF shift operations to machine
instructions which produce implementation-defined results in such a
case; the resulting contents of the register may be arbitrary but
program behaviour as a whole remains defined.
Guard checks in the fast path (i.e. affecting JITted code) will thus
not be accepted.
The case of division by zero is not truly analogous here, as division
instructions on many of the JIT-targeted architectures will raise a
machine exception / fault on division by zero, whereas (to the best
of my knowledge) none will do so on an out-of-bounds shift.
Given the KUBSAN report only affects the BPF interpreter, but not JITs,
one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run().
That would make the shifts defined, and thus shuts up KUBSAN, and the
compiler would optimize out the AND on any CPU that interprets the shift
amounts modulo the width anyway (e.g., confirmed from disassembly that
on x86-64 and arm64 the generated interpreter code is the same before
and after this fix).
The BPF interpreter is slow path, and most likely compiled out anyway
as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of
BPF instructions by the interpreter. Given the main argument was to
avoid sacrificing performance, the fact that the AND is optimized away
from compiler for mainstream archs helps as well as a solution moving
forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors
to provide guidance when they see the ___bpf_prog_run() interpreter
code and use it as a model for a new JIT backend.
Reported-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Reported-by: Kurt Manucredo <fuzzybritches0@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Co-developed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Cc: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/0000000000008f912605bd30d5d7@google.com
Link: https://lore.kernel.org/bpf/bac16d8d-c174-bdc4-91bd-bfa62b410190@gmail.com
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
to check if the attached eBPF program is capable of migrating sockets. When
the eBPF program is attached, we run it for socket migration if the
expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
net.ipv4.tcp_migrate_req is enabled.
Currently, the expected_attach_type is not enforced for the
BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
earlier idea in the commit aac3fc320d ("bpf: Post-hooks for sys_bind") to
fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().
Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to
select a new listener based on the child socket. migrating_sk varies
depending on if it is migrating a request in the accept queue or during
3WHS.
- accept_queue : sock (ESTABLISHED/SYN_RECV)
- 3WHS : request_sock (NEW_SYN_RECV)
In the eBPF program, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.
- SK_PASS with selected_sk, select it as a new listener
- SK_PASS with selected_sk NULL, fallbacks to the random selection
- SK_DROP, cancel the migration.
There is a noteworthy point. We select a listening socket in three places,
but we do not have struct skb at closing a listener or retransmitting a
SYN+ACK. On the other hand, some helper functions do not expect skb is NULL
(e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer()
in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb
temporarily before running the eBPF program.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
The verifier only enumerates valid control-flow paths and skips paths that
are unreachable in the non-speculative domain. And so it can miss issues
under speculative execution on mispredicted branches.
For example, a type confusion has been demonstrated with the following
crafted program:
// r0 = pointer to a map array entry
// r6 = pointer to readable stack slot
// r9 = scalar controlled by attacker
1: r0 = *(u64 *)(r0) // cache miss
2: if r0 != 0x0 goto line 4
3: r6 = r9
4: if r0 != 0x1 goto line 6
5: r9 = *(u8 *)(r6)
6: // leak r9
Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier
concludes that the pointer dereference on line 5 is safe. But: if the
attacker trains both the branches to fall-through, such that the following
is speculatively executed ...
r6 = r9
r9 = *(u8 *)(r6)
// leak r9
... then the program will dereference an attacker-controlled value and could
leak its content under speculative execution via side-channel. This requires
to mistrain the branch predictor, which can be rather tricky, because the
branches are mutually exclusive. However such training can be done at
congruent addresses in user space using different branches that are not
mutually exclusive. That is, by training branches in user space ...
A: if r0 != 0x0 goto line C
B: ...
C: if r0 != 0x0 goto line D
D: ...
... such that addresses A and C collide to the same CPU branch prediction
entries in the PHT (pattern history table) as those of the BPF program's
lines 2 and 4, respectively. A non-privileged attacker could simply brute
force such collisions in the PHT until observing the attack succeeding.
Alternative methods to mistrain the branch predictor are also possible that
avoid brute forcing the collisions in the PHT. A reliable attack has been
demonstrated, for example, using the following crafted program:
// r0 = pointer to a [control] map array entry
// r7 = *(u64 *)(r0 + 0), training/attack phase
// r8 = *(u64 *)(r0 + 8), oob address
// [...]
// r0 = pointer to a [data] map array entry
1: if r7 == 0x3 goto line 3
2: r8 = r0
// crafted sequence of conditional jumps to separate the conditional
// branch in line 193 from the current execution flow
3: if r0 != 0x0 goto line 5
4: if r0 == 0x0 goto exit
5: if r0 != 0x0 goto line 7
6: if r0 == 0x0 goto exit
[...]
187: if r0 != 0x0 goto line 189
188: if r0 == 0x0 goto exit
// load any slowly-loaded value (due to cache miss in phase 3) ...
189: r3 = *(u64 *)(r0 + 0x1200)
// ... and turn it into known zero for verifier, while preserving slowly-
// loaded dependency when executing:
190: r3 &= 1
191: r3 &= 2
// speculatively bypassed phase dependency
192: r7 += r3
193: if r7 == 0x3 goto exit
194: r4 = *(u8 *)(r8 + 0)
// leak r4
As can be seen, in training phase (phase != 0x3), the condition in line 1
turns into false and therefore r8 with the oob address is overridden with
the valid map value address, which in line 194 we can read out without
issues. However, in attack phase, line 2 is skipped, and due to the cache
miss in line 189 where the map value is (zeroed and later) added to the
phase register, the condition in line 193 takes the fall-through path due
to prior branch predictor training, where under speculation, it'll load the
byte at oob address r8 (unknown scalar type at that point) which could then
be leaked via side-channel.
One way to mitigate these is to 'branch off' an unreachable path, meaning,
the current verification path keeps following the is_branch_taken() path
and we push the other branch to the verification stack. Given this is
unreachable from the non-speculative domain, this branch's vstate is
explicitly marked as speculative. This is needed for two reasons: i) if
this path is solely seen from speculative execution, then we later on still
want the dead code elimination to kick in in order to sanitize these
instructions with jmp-1s, and ii) to ensure that paths walked in the
non-speculative domain are not pruned from earlier walks of paths walked in
the speculative domain. Additionally, for robustness, we mark the registers
which have been part of the conditional as unknown in the speculative path
given there should be no assumptions made on their content.
The fix in here mitigates type confusion attacks described earlier due to
i) all code paths in the BPF program being explored and ii) existing
verifier logic already ensuring that given memory access instruction
references one specific data structure.
An alternative to this fix that has also been looked at in this scope was to
mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as
well as direction encoding (always-goto, always-fallthrough, unknown), such
that mixing of different always-* directions themselves as well as mixing of
always-* with unknown directions would cause a program rejection by the
verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else
{ x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this
would result in only single direction always-* taken paths, and unknown taken
paths being allowed, such that the former could be patched from a conditional
jump to an unconditional jump (ja). Compared to this approach here, it would
have two downsides: i) valid programs that otherwise are not performing any
pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are
required to turn off path pruning for unprivileged, where both can be avoided
in this work through pushing the invalid branch to the verification stack.
The issue was originally discovered by Adam and Ofek, and later independently
discovered and reported as a result of Benedict and Piotr's research work.
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Adam Morrison <mad@cs.tau.ac.il>
Reported-by: Ofek Kirzner <ofekkir@gmail.com>
Reported-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
... in such circumstances, we do not want to mark the instruction as seen given
the goal is still to jmp-1 rewrite/sanitize dead code, if it is not reachable
from the non-speculative path verification. We do however want to verify it for
safety regardless.
With the patch as-is all the insns that have been marked as seen before the
patch will also be marked as seen after the patch (just with a potentially
different non-zero count). An upcoming patch will also verify paths that are
unreachable in the non-speculative domain, hence this extension is needed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Instead of relying on current env->pass_cnt, use the seen count from the
old aux data in adjust_insn_aux_data(), and expand it to the new range of
patched instructions. This change is valid given we always expand 1:n
with n>=1, so what applies to the old/original instruction needs to apply
for the replacement as well.
Not relying on env->pass_cnt is a prerequisite for a later change where we
want to avoid marking an instruction seen when verified under speculative
execution path.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Commit 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:
1) The audit events that are triggered due to calls to security_locked_down()
can OOM kill a machine, see below details [0].
2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
when trying to wake up kauditd, for example, when using trace_sched_switch()
tracepoint, see details in [1]. Triggering this was not via some hypothetical
corner case, but with existing tools like runqlat & runqslower from bcc, for
example, which make use of this tracepoint. Rough call sequence goes like:
rq_lock(rq) -> -------------------------+
trace_sched_switch() -> |
bpf_prog_xyz() -> +-> deadlock
selinux_lockdown() -> |
audit_log_end() -> |
wake_up_interruptible() -> |
try_to_wake_up() -> |
rq_lock(rq) --------------+
What's worse is that the intention of 59438b4647 to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:
allow <who> <who> : lockdown { <reason> };
However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.
Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.
The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b4647 where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").
[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
I starting seeing this with F-34. When I run a container that is traced with
BPF to record the syscalls it is doing, auditd is flooded with messages like:
type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality }
for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
tclass=lockdown permissive=0
This seems to be leading to auditd running out of space in the backlog buffer
and eventually OOMs the machine.
[...]
auditd running at 99% CPU presumably processing all the messages, eventually I get:
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[...]
[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
Serhei Makarov says:
Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
ppc64le. Example stack trace:
[...]
[ 730.868702] stack backtrace:
[ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[ 730.873278] Call Trace:
[ 730.873770] dump_stack+0x7f/0xa1
[ 730.874433] check_noncircular+0xdf/0x100
[ 730.875232] __lock_acquire+0x1202/0x1e10
[ 730.876031] ? __lock_acquire+0xfc0/0x1e10
[ 730.876844] lock_acquire+0xc2/0x3a0
[ 730.877551] ? __wake_up_common_lock+0x52/0x90
[ 730.878434] ? lock_acquire+0xc2/0x3a0
[ 730.879186] ? lock_is_held_type+0xa7/0x120
[ 730.880044] ? skb_queue_tail+0x1b/0x50
[ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90
[ 730.881656] ? __wake_up_common_lock+0x52/0x90
[ 730.882532] __wake_up_common_lock+0x52/0x90
[ 730.883375] audit_log_end+0x5b/0x100
[ 730.884104] slow_avc_audit+0x69/0x90
[ 730.884836] avc_has_perm+0x8b/0xb0
[ 730.885532] selinux_lockdown+0xa5/0xd0
[ 730.886297] security_locked_down+0x20/0x40
[ 730.887133] bpf_probe_read_compat+0x66/0xd0
[ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820
[ 730.888917] trace_call_bpf+0xe9/0x240
[ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0
[ 730.890579] perf_trace_sched_switch+0x142/0x180
[ 730.891485] ? __schedule+0x6d8/0xb20
[ 730.892209] __schedule+0x6d8/0xb20
[ 730.892899] schedule+0x5b/0xc0
[ 730.893522] exit_to_user_mode_prepare+0x11d/0x240
[ 730.894457] syscall_exit_to_user_mode+0x27/0x70
[ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]
Fixes: 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
This patch introduces a new algorithm for multiplication of tristate
numbers (tnums) that is provably sound. It is faster and more precise when
compared to the existing method.
Like the existing method, this new algorithm follows the long
multiplication algorithm. The idea is to generate partial products by
multiplying each bit in the multiplier (tnum a) with the multiplicand
(tnum b), and adding the partial products after appropriately bit-shifting
them. The new algorithm, however, uses just a single loop over the bits of
the multiplier (tnum a) and accumulates only the uncertain components of
the multiplicand (tnum b) into a mask-only tnum. The following paper
explains the algorithm in more detail: https://arxiv.org/abs/2105.05398.
A natural way to construct the tnum product is by performing a tnum
addition on all the partial products. This algorithm presents another
method of doing this: decompose each partial product into two tnums,
consisting of the values and the masks separately. The mask-sum is
accumulated within the loop in acc_m. The value-sum tnum is generated
using a.value * b.value. The tnum constructed by tnum addition of the
value-sum and the mask-sum contains all possible summations of concrete
values drawn from the partial product tnums pairwise. We prove this result
in the paper.
Our evaluations show that the new algorithm is overall more precise
(producing tnums with less uncertain components) than the existing method.
As an illustrative example, consider the input tnums A and B. The numbers
in the parenthesis correspond to (value;mask).
A = 000000x1 (1;2)
B = 0010011x (38;1)
A * B (existing) = xxxxxxxx (0;255)
A * B (new) = 0x1xxxxx (32;95)
Importantly, we present a proof of soundness of the new algorithm in the
aforementioned paper. Additionally, we show that this new algorithm is
empirically faster than the existing method.
Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@rutgers.edu>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://arxiv.org/abs/2105.05398
Link: https://lore.kernel.org/bpf/20210531020157.7386-1-harishankar.vishwanathan@rutgers.edu
As Colin pointed out, the first drops assignment after declaration will
be overwritten by the second drops assignment before using, which makes
it useless.
Since the drops variable will be used only once. Just remove it and
use "cnt - sent" in trace_xdp_devmap_xmit().
Fixes: cb261b594b ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210528024356.24333-1-liuhangbin@gmail.com
These macros are convenient wrappers around the bpf_seq_printf and
bpf_snprintf helpers. They are currently provided by bpf_tracing.h which
targets low level tracing primitives. bpf_helpers.h is a better fit.
The __bpf_narg and __bpf_apply are needed in both files and provided
twice. __bpf_empty isn't used anywhere and is removed from bpf_tracing.h
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210526164643.2881368-1-revest@chromium.org
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.
With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.
When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.
Function bpf_clear_redirect_map() was removed in
commit ee75aef23a ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.
With test topology:
+-------------------+ +-------------------+
| Host A (i40e 10G) | ---------- | eno1(i40e 10G) |
+-------------------+ | |
| Host B |
+-------------------+ | |
| Host C (i40e 10G) | ---------- | eno2(i40e 10G) |
+-------------------+ | |
| +------+ |
| veth0 -- | Peer | |
| veth1 -- | | |
| veth2 -- | NS | |
| +------+ |
+-------------------+
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
All the veth peers in the NS have a XDP_DROP program loaded. The
forward_map max_entries in xdp_redirect_map_multi is modify to 4.
Testing the performance impact on the regular xdp_redirect path with and
without patch (to check impact of additional check for broadcast mode):
5.12 rc4 | redirect_map i40e->i40e | 2.0M | 9.7M
5.12 rc4 | redirect_map i40e->veth | 1.7M | 11.8M
5.12 rc4 + patch | redirect_map i40e->i40e | 2.0M | 9.6M
5.12 rc4 + patch | redirect_map i40e->veth | 1.7M | 11.7M
Testing the performance when cloning packets with the redirect_map_multi
test, using a redirect map size of 4, filled with 1-3 devices:
5.12 rc4 + patch | redirect_map multi i40e->veth (x1) | 1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi i40e->veth (x2) | 1.1M | 4.3M
5.12 rc4 + patch | redirect_map multi i40e->veth (x3) | 0.8M | 2.6M
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
This changes the devmap XDP program support to run the program when the
bulk queue is flushed instead of before the frame is enqueued. This has
a couple of benefits:
- It "sorts" the packets by destination devmap entry, and then runs the
same BPF program on all the packets in sequence. This ensures that we
keep the XDP program and destination device properties hot in I-cache.
- It makes the multicast implementation simpler because it can just
enqueue packets using bq_enqueue() without having to deal with the
devmap program at all.
The drawback is that if the devmap program drops the packet, the enqueue
step is redundant. However, arguably this is mostly visible in a
micro-benchmark, and with more mixed traffic the I-cache benefit should
win out. The performance impact of just this patch is as follows:
Using 2 10Gb i40e NIC, redirecting one to another, or into a veth interface,
which do XDP_DROP on veth peer. With xdp_redirect_map in sample/bpf, send
pkts via pktgen cmd:
./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
There are about +/- 0.1M deviation for native testing, the performance
improved for the base-case, but some drop back with xdp devmap prog attached.
Version | Test | Generic | Native | Native + 2nd xdp_prog
5.12 rc4 | xdp_redirect_map i40e->i40e | 1.9M | 9.6M | 8.4M
5.12 rc4 | xdp_redirect_map i40e->veth | 1.7M | 11.7M | 9.8M
5.12 rc4 + patch | xdp_redirect_map i40e->i40e | 1.9M | 9.8M | 8.0M
5.12 rc4 + patch | xdp_redirect_map i40e->veth | 1.7M | 12.0M | 9.4M
When bq_xmit_all() is called from bq_enqueue(), another packet will
always be enqueued immediately after, so clearing dev_rx, xdp_prog and
flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
and only check them once in bq_enqueue() since they are all modified
together.
This change also has the side effect of extending the lifetime of the
RCU-protected xdp_prog that lives inside the devmap entries: Instead of
just living for the duration of the XDP program invocation, the
reference now lives all the way until the bq is flushed. This is safe
because the bq flush happens at the end of the NAPI poll loop, so
everything happens between a local_bh_disable()/local_bh_enable() pair.
However, this is by no means obvious from looking at the call sites; in
particular, some drivers have an additional rcu_read_lock() around only
the XDP program invocation, which only confuses matters further.
Cleaning this up will be done in a separate patch series.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-2-liuhangbin@gmail.com
In 801c6058d1 ("bpf: Fix leakage of uninitialized bpf stack under
speculation") we replaced masking logic with direct loads of immediates
if the register is a known constant. Given in this case we do not apply
any masking, there is also no reason for the operation to be truncated
under the speculative domain.
Therefore, there is also zero reason for the verifier to branch-off and
simulate this case, it only needs to do it for unknown but bounded scalars.
As a side-effect, this also enables few test cases that were previously
rejected due to simulation under zero truncation.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Masking direction as indicated via mask_to_left is considered to be
calculated once and then used to derive pointer limits. Thus, this
needs to be placed into bpf_sanitize_info instead so we can pass it
to sanitize_ptr_alu() call after the pointer move. Piotr noticed a
corner case where the off reg causes masking direction change which
then results in an incorrect final aux->alu_limit.
Fixes: 7fedb63a83 ("bpf: Tighten speculative pointer arithmetic mask")
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Add a container structure struct bpf_sanitize_info which holds
the current aux info, and update call-sites to sanitize_ptr_alu()
to pass it in. This is needed for passing in additional state
later on.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Similarly as 6bdacdb48e ("bpf: Fix BPF_JIT kconfig symbol dependency") we
need to detangle the hard BPF_LSM dependency on NET. This was previously
implicit by its dependency on BPF_JIT which itself was dependent on NET (but
without any actual/real hard dependency code-wise). Given the latter was
lifted, so should be the former as BPF_LSMs could well exist on net-less
systems. This therefore also fixes a randconfig build error recently reported
by Randy:
ld: kernel/bpf/bpf_lsm.o: in function `bpf_lsm_func_proto':
bpf_lsm.c:(.text+0x1a0): undefined reference to `bpf_sk_storage_get_proto'
ld: bpf_lsm.c:(.text+0x1b8): undefined reference to `bpf_sk_storage_delete_proto'
[...]
Fixes: b24abcff91 ("bpf, kconfig: Add consolidated menu entry for bpf with core options")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Fix some spelling mistakes in comments:
aother ==> another
Netiher ==> Neither
desribe ==> describe
intializing ==> initializing
funciton ==> function
wont ==> won't and move the word 'the' at the end to the next line
accross ==> across
pathes ==> paths
triggerred ==> triggered
excute ==> execute
ether ==> either
conervative ==> conservative
convetion ==> convention
markes ==> marks
interpeter ==> interpreter
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210525025659.8898-2-thunder.leizhen@huawei.com
Extend the existing bpf_map_lookup_and_delete_elem() functionality to
hashtab map types, in addition to stacks and queues.
Create a new hashtab bpf_map_ops function that does lookup and deletion
of the element under the same bucket lock and add the created map_ops to
bpf.h.
Signed-off-by: Denis Salopek <denis.salopek@sartura.hr>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/4d18480a3e990ffbf14751ddef0325eed3be2966.1620763117.git.denis.salopek@sartura.hr
Commit 4976b718c3 ("bpf: Introduce pseudo_btf_id") switched the
order of resolve_pseudo_ldimm(), in which some pseudo instructions
are rewritten. Thus those rewritten instructions cannot be passed
to driver via 'prepare' offload callback.
Reorder the 'prepare' offload callback to fix it.
Fixes: 4976b718c3 ("bpf: Introduce pseudo_btf_id")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210520085834.15023-1-simon.horman@netronome.com
The cppcheck static code analysis reported the following error:
if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(bufs->tmp_bufs))) {
^
ARRAY_SIZE is a macro that expands to sizeofs, so bufs is not actually
dereferenced at runtime, and the code is actually safe. But to keep
things tidy, this patch removes the need for a call to ARRAY_SIZE by
extracting the size of the array into a macro. Cppcheck should no longer
be confused and the code ends up being a bit cleaner.
Fixes: e2d5b2bb76 ("bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-2-revest@chromium.org
The per-cpu buffers contain bprintf data rather than printf arguments.
The macro name and comment were a bit confusing, this rewords them in a
clearer way.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-1-revest@chromium.org
Randy reported a randconfig build error recently on i386:
ld: arch/x86/net/bpf_jit_comp32.o: in function `do_jit':
bpf_jit_comp32.c:(.text+0x28c9): undefined reference to `__bpf_call_base'
ld: arch/x86/net/bpf_jit_comp32.o: in function `bpf_int_jit_compile':
bpf_jit_comp32.c:(.text+0x3694): undefined reference to `bpf_jit_blind_constants'
ld: bpf_jit_comp32.c:(.text+0x3719): undefined reference to `bpf_jit_binary_free'
ld: bpf_jit_comp32.c:(.text+0x3745): undefined reference to `bpf_jit_binary_alloc'
ld: bpf_jit_comp32.c:(.text+0x37d3): undefined reference to `bpf_jit_prog_release_other'
[...]
The cause was that b24abcff91 ("bpf, kconfig: Add consolidated menu entry for
bpf with core options") moved BPF_JIT from net/Kconfig into kernel/bpf/Kconfig
and previously BPF_JIT was guarded by a 'if NET'. However, there is no actual
dependency on NET, it's just that menuconfig NET selects BPF. And the latter in
turn causes kernel/bpf/core.o to be built which contains above symbols. Randy's
randconfig didn't have NET set, and BPF wasn't either, but BPF_JIT otoh was.
Detangle this by making BPF_JIT depend on BPF instead. arm64 was the only arch
that pulled in its JIT in net/ via obj-$(CONFIG_NET), all others unconditionally
pull this dir in via obj-y. Do the same since CONFIG_NET guard there is really
useless as we compiled the JIT via obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o anyway.
Fixes: b24abcff91 ("bpf, kconfig: Add consolidated menu entry for bpf with core options")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
The sparse tool complains as follows:
kernel/bpf/syscall.c:4567:29: warning:
symbol 'bpf_sys_bpf_proto' was not declared. Should it be static?
kernel/bpf/syscall.c:4592:29: warning:
symbol 'bpf_sys_close_proto' was not declared. Should it be static?
This symbol is not used outside of syscall.c, so marks it static.
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210519064116.240536-1-pulehui@huawei.com
Add bpf_sys_close() helper to be used by the syscall/loader program to close
intermediate FDs and other cleanup.
Note this helper must never be allowed inside fdget/fdput bracketing.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-11-alexei.starovoitov@gmail.com
Add new helper:
long bpf_btf_find_by_name_kind(char *name, int name_sz, u32 kind, int flags)
Description
Find BTF type with given name and kind in vmlinux BTF or in module's BTFs.
Return
Returns btf_id and btf_obj_fd in lower and upper 32 bits.
It will be used by loader program to find btf_id to attach the program to
and to find btf_ids of ksyms.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-10-alexei.starovoitov@gmail.com
Typical program loading sequence involves creating bpf maps and applying
map FDs into bpf instructions in various places in the bpf program.
This job is done by libbpf that is using compiler generated ELF relocations
to patch certain instruction after maps are created and BTFs are loaded.
The goal of fd_idx is to allow bpf instructions to stay immutable
after compilation. At load time the libbpf would still create maps as usual,
but it wouldn't need to patch instructions. It would store map_fds into
__u32 fd_array[] and would pass that pointer to sys_bpf(BPF_PROG_LOAD).
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-9-alexei.starovoitov@gmail.com
Similar to prog_load make btf_load command to be availble to
bpf_prog_type_syscall program.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-7-alexei.starovoitov@gmail.com
With the help from bpfptr_t prepare relevant bpf syscall commands
to be used from kernel and user space.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-4-alexei.starovoitov@gmail.com
Add placeholders for bpf_sys_bpf() helper and new program type.
Make sure to check that expected_attach_type is zero for future extensibility.
Allow tracing helper functions to be used in this program type, since they will
only execute from user context via bpf_prog_test_run.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-2-alexei.starovoitov@gmail.com
The bpf_seq_printf, bpf_trace_printk and bpf_snprintf helpers share one
per-cpu buffer that they use to store temporary data (arguments to
bprintf). They "get" that buffer with try_get_fmt_tmp_buf and "put" it
by the end of their scope with bpf_bprintf_cleanup.
If one of these helpers gets called within the scope of one of these
helpers, for example: a first bpf program gets called, uses
bpf_trace_printk which calls raw_spin_lock_irqsave which is traced by
another bpf program that calls bpf_snprintf, then the second "get"
fails. Essentially, these helpers are not re-entrant. They would return
-EBUSY and print a warning message once.
This patch triples the number of bprintf buffers to allow three levels
of nesting. This is very similar to what was done for tracepoints in
"9594dc3c7e7 bpf: fix nested bpf tracepoints with per-cpu data"
Fixes: d9c9e4db18 ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Reported-by: syzbot+63122d0bc347f18c1884@syzkaller.appspotmail.com
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210511081054.2125874-1-revest@chromium.org
The recursion check in __bpf_prog_enter and __bpf_prog_exit
leaves some (not inlined) functions unprotected:
In __bpf_prog_enter:
- migrate_disable is called before prog->active is checked
In __bpf_prog_exit:
- migrate_enable,rcu_read_unlock_strict are called after
prog->active is decreased
When attaching trampoline to them we get panic like:
traps: PANIC: double fault, error_code: 0x0
double fault: 0000 [#1] SMP PTI
RIP: 0010:__bpf_prog_enter+0x4/0x50
...
Call Trace:
<IRQ>
bpf_trampoline_6442466513_0+0x18/0x1000
migrate_disable+0x5/0x50
__bpf_prog_enter+0x9/0x50
bpf_trampoline_6442466513_0+0x18/0x1000
migrate_disable+0x5/0x50
__bpf_prog_enter+0x9/0x50
bpf_trampoline_6442466513_0+0x18/0x1000
migrate_disable+0x5/0x50
__bpf_prog_enter+0x9/0x50
bpf_trampoline_6442466513_0+0x18/0x1000
migrate_disable+0x5/0x50
...
Fixing this by adding deny list of btf ids for tracing
programs and checking btf id during program verification.
Adding above functions to this list.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210429114712.43783-1-jolsa@kernel.org
Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.
This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.
We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.
Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
Right now, all core BPF related options are scattered in different Kconfig
locations mainly due to historic reasons. Moving forward, lets add a proper
subsystem entry under ...
General setup --->
BPF subsystem --->
... in order to have all knobs in a single location and thus ease BPF related
configuration. Networking related bits such as sockmap are out of scope for
the general setup and therefore better suited to remain in net/Kconfig.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/f23f58765a4d59244ebd8037da7b6a6b2fb58446.1620765074.git.daniel@iogearbox.net
Only the very first page of BPF ringbuf that contains consumer position
counter is supposed to be mapped as writeable by user-space. Producer
position is read-only and can be modified only by the kernel code. BPF ringbuf
data pages are read-only as well and are not meant to be modified by
user-code to maintain integrity of per-record headers.
This patch allows to map only consumer position page as writeable and
everything else is restricted to be read-only. remap_vmalloc_range()
internally adds VM_DONTEXPAND, so all the established memory mappings can't be
extended, which prevents any future violations through mremap()'ing.
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Ryota Shiga (Flatt Security)
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
A BPF program might try to reserve a buffer larger than the ringbuf size.
If the consumer pointer is way ahead of the producer, that would be
successfully reserved, allowing the BPF program to read or write out of
the ringbuf allocated area.
Reported-by: Ryota Shiga (Flatt Security)
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Fix a bug in the verifier's scalar32_min_max_*() functions which leads to
incorrect tracking of 32 bit bounds for the simulation of and/or/xor bitops.
When both the src & dst subreg is a known constant, then the assumption is
that scalar_min_max_*() will take care to update bounds correctly. However,
this is not the case, for example, consider a register R2 which has a tnum
of 0xffffffff00000000, meaning, lower 32 bits are known constant and in this
case of value 0x00000001. R2 is then and'ed with a register R3 which is a
64 bit known constant, here, 0x100000002.
What can be seen in line '10:' is that 32 bit bounds reach an invalid state
where {u,s}32_min_value > {u,s}32_max_value. The reason is scalar32_min_max_*()
delegates 32 bit bounds updates to scalar_min_max_*(), however, that really
only takes place when both the 64 bit src & dst register is a known constant.
Given scalar32_min_max_*() is intended to be designed as closely as possible
to scalar_min_max_*(), update the 32 bit bounds in this situation through
__mark_reg32_known() which will set all {u,s}32_{min,max}_value to the correct
constant, which is 0x00000000 after the fix (given 0x00000001 & 0x00000002 in
32 bit space). This is possible given var32_off already holds the final value
as dst_reg->var_off is updated before calling scalar32_min_max_*().
Before fix, invalid tracking of R2:
[...]
9: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=-9223372036854775807 (0x8000000000000001),smax_value=9223372032559808513 (0x7fffffff00000001),umin_value=1,umax_value=0xffffffff00000001,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_min_value=1,u32_max_value=1) R3_w=inv4294967298 R10=fp0
9: (5f) r2 &= r3
10: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=0,smax_value=4294967296 (0x100000000),umin_value=0,umax_value=0x100000000,var_off=(0x0; 0x100000000),s32_min_value=1,s32_max_value=0,u32_min_value=1,u32_max_value=0) R3_w=inv4294967298 R10=fp0
[...]
After fix, correct tracking of R2:
[...]
9: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=-9223372036854775807 (0x8000000000000001),smax_value=9223372032559808513 (0x7fffffff00000001),umin_value=1,umax_value=0xffffffff00000001,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_min_value=1,u32_max_value=1) R3_w=inv4294967298 R10=fp0
9: (5f) r2 &= r3
10: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=0,smax_value=4294967296 (0x100000000),umin_value=0,umax_value=0x100000000,var_off=(0x0; 0x100000000),s32_min_value=0,s32_max_value=0,u32_min_value=0,u32_max_value=0) R3_w=inv4294967298 R10=fp0
[...]
Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Fixes: 2921c90d47 ("bpf: Fix a verifier failure with xor")
Reported-by: Manfred Paul (@_manfp)
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
func_states_equal makes a very short lived allocation for idmap,
probably because it's too large to fit on the stack. However the
function is called quite often, leading to a lot of alloc / free
churn. Replace the temporary allocation with dedicated scratch
space in struct bpf_verifier_env.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com
Resizing and copying stack and reference tracking state currently
does a lot of kfree / kmalloc when the size of the tracked set changes.
The logic in copy_*_state and realloc_*_state is also hard to follow.
Refactor this into two core functions. copy_array copies from a source
into a destination. It avoids reallocation by taking the allocated
size of the destination into account via ksize(). The function is
essentially krealloc_array, with the difference that the contents of
dst are not preserved. realloc_array changes the size of an array and
zeroes newly allocated items. Contrary to krealloc both functions don't
free the destination if the size is zero. Instead we rely on free_func_state
to clean up.
realloc_stack_state is renamed to grow_stack_state to better convey
that it never shrinks the stack state.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210429134656.122225-2-lmb@cloudflare.com
We can't currently allow to attach functions with variable arguments.
The problem is that we should save all the registers for arguments,
which is probably doable, but if caller uses more than 6 arguments,
we need stack data, which will be wrong, because of the extra stack
frame we do in bpf trampoline, so we could crash.
Also currently there's malformed trampoline code generated for such
functions at the moment as described in:
https://lore.kernel.org/bpf/20210429212834.82621-1-jolsa@kernel.org/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210505132529.401047-1-jolsa@kernel.org
The current implemented mechanisms to mitigate data disclosure under
speculation mainly address stack and map value oob access from the
speculative domain. However, Piotr discovered that uninitialized BPF
stack is not protected yet, and thus old data from the kernel stack,
potentially including addresses of kernel structures, could still be
extracted from that 512 bytes large window. The BPF stack is special
compared to map values since it's not zero initialized for every
program invocation, whereas map values /are/ zero initialized upon
their initial allocation and thus cannot leak any prior data in either
domain. In the non-speculative domain, the verifier ensures that every
stack slot read must have a prior stack slot write by the BPF program
to avoid such data leaking issue.
However, this is not enough: for example, when the pointer arithmetic
operation moves the stack pointer from the last valid stack offset to
the first valid offset, the sanitation logic allows for any intermediate
offsets during speculative execution, which could then be used to
extract any restricted stack content via side-channel.
Given for unprivileged stack pointer arithmetic the use of unknown
but bounded scalars is generally forbidden, we can simply turn the
register-based arithmetic operation into an immediate-based arithmetic
operation without the need for masking. This also gives the benefit
of reducing the needed instructions for the operation. Given after
the work in 7fedb63a83 ("bpf: Tighten speculative pointer arithmetic
mask"), the aux->alu_limit already holds the final immediate value for
the offset register with the known scalar. Thus, a simple mov of the
immediate to AX register with using AX as the source for the original
instruction is sufficient and possible now in this case.
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
The negation logic for the case where the off_reg is sitting in the
dst register is not correct given then we cannot just invert the add
to a sub or vice versa. As a fix, perform the final bitwise and-op
unconditionally into AX from the off_reg, then move the pointer from
the src to dst and finally use AX as the source for the original
pointer arithmetic operation such that the inversion yields a correct
result. The single non-AX mov in between is possible given constant
blinding is retaining it as it's not an immediate based operation.
Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to
walk all map elements in a more robust and easier to verify
fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF
on s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices
which don't need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability
on next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is
slow in reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take
place correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP
packets before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used
to define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
-independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and
bnxt support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP
which define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay
for packets sampled from switch HW (and corresponding egress
and policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP
forwarding, bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface
to distribute MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x -
11-port Ethernet switch with 8x 1-Gigabit Ethernet
and 3x 10-Gigabit interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
and BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack,
matching on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode
changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
=vcbA
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to walk
all map elements in a more robust and easier to verify fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF on
s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices which don't
need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability on
next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is slow in
reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take place
correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP packets
before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used to
define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP which
define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay for
packets sampled from switch HW (and corresponding egress and
policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP forwarding,
bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface to distribute
MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack, matching
on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes"
* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
net: selftest: fix build issue if INET is disabled
net: netrom: nr_in: Remove redundant assignment to ns
net: tun: Remove redundant assignment to ret
net: phy: marvell: add downshift support for M88E1240
net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
net/sched: act_ct: Remove redundant ct get and check
icmp: standardize naming of RFC 8335 PROBE constants
bpf, selftests: Update array map tests for per-cpu batched ops
bpf: Add batched ops support for percpu array
bpf: Implement formatted output helpers with bstr_printf
seq_file: Add a seq_bprintf function
sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
net: fix a concurrency bug in l2tp_tunnel_register()
net/smc: Remove redundant assignment to rc
mpls: Remove redundant assignment to err
llc2: Remove redundant assignment to rc
net/tls: Remove redundant initialization of record
rds: Remove redundant assignment to nr_sig
dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
...
Uses the already in-place infrastructure provided by the
'generic_map_*_batch' functions.
No tweak was needed as it transparently handles the percpu variant.
As arrays don't have delete operations, let it return a error to
user space (default behaviour).
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210424214510.806627-2-pctammela@mojatatu.com
BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
and bpf_snprintf. Their signatures specify that all arguments are
provided from the BPF world as u64s (in an array or as registers). All
of these helpers are currently implemented by calling functions such as
snprintf() whose signatures take a variable number of arguments, then
placed in a va_list by the compiler to call vsnprintf().
"d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
a bpf_printf_prepare function that fills an array of u64 sanitized
arguments with an array of "modifiers" which indicate what the "real"
size of each argument should be (given by the format specifier). The
BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
its real size. However, the C promotion rules implicitely cast them all
back to u64s. Therefore, the arguments given to snprintf are u64s and
the va_list constructed by the compiler will use 64 bits for each
argument. On 64 bit machines, this happens to work well because 32 bit
arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
architectures this breaks the layout of the va_list expected by the
called function and mangles values.
In "88a5c690b6 bpf: fix bpf_trace_printk on 32 bit archs", this problem
had been solved for bpf_trace_printk only with a "horrid workaround"
that emitted multiple calls to trace_printk where each call had
different argument types and generated different va_list layouts. One of
the call would be dynamically chosen at runtime. This was ok with the 3
arguments that bpf_trace_printk takes but bpf_seq_printf and
bpf_snprintf accept up to 12 arguments. Because this approach scales
code exponentially, it is not a viable option anymore.
Because the promotion rules are part of the language and because the
construction of a va_list is an arch-specific ABI, it's best to just
avoid variadic arguments and va_lists altogether. Thankfully the
kernel's snprintf() has an alternative in the form of bstr_printf() that
accepts arguments in a "binary buffer representation". These binary
buffers are currently created by vbin_printf and used in the tracing
subsystem to split the cost of printing into two parts: a fast one that
only dereferences and remembers values, and a slower one, called later,
that does the pretty-printing.
This patch refactors bpf_printf_prepare to construct binary buffers of
arguments consumable by bstr_printf() instead of arrays of arguments and
modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
bpf_printf_prepare usage but there are a few gotchas that change how
bpf_printf_prepare needs to do things.
Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
generic storage for strings and IP addresses. With this refactoring, the
temporary buffers now holds all the arguments in a structured binary
format.
To comply with the format expected by bstr_printf, certain format
specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
Because vsnprintf subroutines for these specifiers are hard to expose,
we pre-format these arguments with calls to snprintf().
Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmCHM2sUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNfCg/9GmoCyCh+ZRj5RGQ6M+yJas1+yyJQ
uEfTNde54yfATUTaaWYnZG59yqzM3I2uaV11U7tqg8ajiFPxJKqbs5R9jl3lnSjH
0Dg22nXPSCOTKcU0x/DeLoKRr+M9jO1K/nQ8NEZvYX4nC/OgtCvJqb/oEQZIKAk5
2a7OEmNNQyFGd274p9dELaDHxN9UIaJ2PzQFXtq7ROHgBXQO4ONb2ajOf6mDSFQb
vP/CDHwaH+pcE28w44oRy0/YBkO1SrdqoFQchg5yFagM5tQRLGkXK4OFSs5KHi5Q
YMtmaOzMPIv1e5eaC1HuuMJYA4pPb30T9hFHP7tmBVZfmZaFaDeUs+BhMm98WTiS
o0iTP7tfs36/poOR1Q0/sB06uvF9hUAAX1ZuE95YySifbXU9hsUc9b0uQSwCdg9P
/J9rcdHLTpWqjw9n02mezWmAvo5U8ZvbDs+0xPIwI+3RTUP5t6mp+Hd5Tc7bPTq1
0rpWXx+FQoSytFap5qiUSiwBp+HF6HQnNIXB0Muf6wctChoTjvo7TwoxH//z4kEm
+SddhOCNkB7VC/X7hOxhl0F/rdHuXvb1AFIWjpTLJH2CR1PvMtF+sGey+uPT6hKZ
/gvhmQGjFdph99eGlfVbCNvx1pM61O25IscaYD1T2wGImw+z7dX4WkG3WoOdDSkR
bRjrBkcHh0gLhWk=
=HTEy
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Add support for measuring the SELinux state and policy capabilities
using IMA.
- A handful of SELinux/NFS patches to compare the SELinux state of one
mount with a set of mount options. Olga goes into more detail in the
patch descriptions, but this is important as it allows more
flexibility when using NFS and SELinux context mounts.
- Properly differentiate between the subjective and objective LSM
credentials; including support for the SELinux and Smack. My clumsy
attempt at a proper fix for AppArmor didn't quite pass muster so John
is working on a proper AppArmor patch, in the meantime this set of
patches shouldn't change the behavior of AppArmor in any way. This
change explains the bulk of the diffstat beyond security/.
- Fix a problem where we were not properly terminating the permission
list for two SELinux object classes.
* tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: add proper NULL termination to the secclass_map permissions
smack: differentiate between subjective and objective task credentials
selinux: clarify task subjective and objective credentials
lsm: separate security_task_getsecid() into subjective and objective variants
nfs: account for selinux security context when deciding to share superblock
nfs: remove unneeded null check in nfs_fill_super()
lsm,selinux: add new hook to compare new mount to an existing mount
selinux: fix misspellings using codespell tool
selinux: fix misspellings using codespell tool
selinux: measure state and policy capabilities
selinux: Allow context mounts for unpriviliged overlayfs
Rely on netif_receive_skb_list routine to send skbs converted from
xdp_frames in cpu_map_kthread_run in order to improve i-cache usage.
The proposed patch has been tested running xdp_redirect_cpu bpf sample
available in the kernel tree that is used to redirect UDP frames from
ixgbe driver to a cpumap entry and then to the networking stack. UDP
frames are generated using pktgen. Packets are discarded by the UDP
layer.
$ xdp_redirect_cpu --cpu <cpu> --progname xdp_cpu_map0 --dev <eth>
bpf-next: ~2.35Mpps
bpf-next + cpumap skb-list: ~2.72Mpps
Rename drops counter in kmem_alloc_drops since now it reports just
kmem_cache_alloc_bulk failures
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/c729f83e5d7482d9329e0f165bdbe5adcefd1510.1619169700.git.lorenzo@kernel.org
Similarly as b02709587e ("bpf: Fix propagation of 32-bit signed bounds
from 64-bit bounds."), we also need to fix the propagation of 32 bit
unsigned bounds from 64 bit counterparts. That is, really only set the
u32_{min,max}_value when /both/ {umin,umax}_value safely fit in 32 bit
space. For example, the register with a umin_value == 1 does /not/ imply
that u32_min_value is also equal to 1, since umax_value could be much
larger than 32 bit subregister can hold, and thus u32_min_value is in
the interval [0,1] instead.
Before fix, invalid tracking result of R2_w=inv1:
[...]
5: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0) R10=fp0
5: (35) if r2 >= 0x1 goto pc+1
[...] // goto path
7: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umin_value=1) R10=fp0
7: (b6) if w2 <= 0x1 goto pc+1
[...] // goto path
9: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,smin_value=-9223372036854775807,smax_value=9223372032559808513,umin_value=1,umax_value=18446744069414584321,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_max_value=1) R10=fp0
9: (bc) w2 = w2
10: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv1 R10=fp0
[...]
After fix, correct tracking result of R2_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1)):
[...]
5: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0) R10=fp0
5: (35) if r2 >= 0x1 goto pc+1
[...] // goto path
7: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umin_value=1) R10=fp0
7: (b6) if w2 <= 0x1 goto pc+1
[...] // goto path
9: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001),s32_min_value=0,s32_max_value=1,u32_max_value=1) R10=fp0
9: (bc) w2 = w2
10: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1)) R10=fp0
[...]
Thus, same issue as in b02709587e holds for unsigned subregister tracking.
Also, align __reg64_bound_u32() similarly to __reg64_bound_s32() as done in
b02709587e to make them uniform again.
Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Manfred Paul (@_manfp)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Currently we don't allow re-attaching of trampolines. Once
it's detached, it can't be re-attach even when the program
is still loaded.
Adding the possibility to re-attach the loaded tracing and
lsm programs.
Fixing missing unlock with proper cleanup goto jump reported
by Julia.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210414195147.1624932-2-jolsa@kernel.org
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-04-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).
The main changes are:
1) Add BPF static linker support for extern resolution of global, from Andrii.
2) Refine retval for bpf_get_task_stack helper, from Dave.
3) Add a bpf_snprintf helper, from Florent.
4) A bunch of miscellaneous improvements from many developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
reg->type is enforced by check_reg_type() and map should never be NULL
(it would already have been dereferenced anyway) so these checks are
unnecessary.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-3-revest@chromium.org
In check_bpf_snprintf_call(), a map_direct_value_addr() of the fmt map
should never fail because it has already been checked by
ARG_PTR_TO_CONST_STR. But if it ever fails, it's better to error out
with an explicit debug message rather than silently fail.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-2-revest@chromium.org
Verifier can constrain the min/max bounds of bpf_get_task_stack's return
value more tightly than the default tnum_unknown. Like bpf_get_stack,
return value is num bytes written into a caller-supplied buf, or error,
so do_refine_retval_range will work.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210416204704.2816874-2-davemarchevsky@fb.com
The implementation takes inspiration from the existing bpf_trace_printk
helper but there are a few differences:
To allow for a large number of format-specifiers, parameters are
provided in an array, like in bpf_seq_printf.
Because the output string takes two arguments and the array of
parameters also takes two arguments, the format string needs to fit in
one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
a zero-terminated read-only map so we don't need a format string length
arg.
Because the format-string is known at verification time, we also do
a first pass of format string validation in the verifier logic. This
makes debugging easier.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
This type provides the guarantee that an argument is going to be a const
pointer to somewhere in a read-only map value. It also checks that this
pointer is followed by a zero character before the end of the map value.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
Two helpers (trace_printk and seq_printf) have very similar
implementations of format string parsing and a third one is coming
(snprintf). To avoid code duplication and make the code easier to
maintain, this moves the operations associated with format string
parsing (validation and argument sanitization) into one generic
function.
The implementation of the two existing helpers already drifted quite a
bit so unifying them entailed a lot of changes:
- bpf_trace_printk always expected fmt[fmt_size] to be the terminating
NULL character, this is no longer true, the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percentage char).
- bpf_trace_printk now skips width formating fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
- argument casting on 32 bit has been simplified into one macro and
using an enum instead of obscure int increments.
- bpf_seq_printf now uses bpf_trace_copy_string instead of
strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
- bpf_seq_printf now prints longs correctly on 32 bit architectures.
- both were changed to use a global per-cpu tmp buffer instead of one
stack buffer for trace_printk and 6 small buffers for seq_printf.
- to avoid per-cpu buffer usage conflict, these helpers disable
preemption while the per-cpu buffer is in use.
- both helpers now support the %ps and %pS specifiers to print symbols.
The implementation is also moved from bpf_trace.c to helpers.c because
the upcoming bpf_snprintf helper will be made available to all BPF
programs and will need it.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
- keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
- fix build after move to net_generic
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This work tightens the offset mask we use for unprivileged pointer arithmetic
in order to mitigate a corner case reported by Piotr and Benedict where in
the speculative domain it is possible to advance, for example, the map value
pointer by up to value_size-1 out-of-bounds in order to leak kernel memory
via side-channel to user space.
Before this change, the computed ptr_limit for retrieve_ptr_limit() helper
represents largest valid distance when moving pointer to the right or left
which is then fed as aux->alu_limit to generate masking instructions against
the offset register. After the change, the derived aux->alu_limit represents
the largest potential value of the offset register which we mask against which
is just a narrower subset of the former limit.
For minimal complexity, we call sanitize_ptr_alu() from 2 observation points
in adjust_ptr_min_max_vals(), that is, before and after the simulated alu
operation. In the first step, we retieve the alu_state and alu_limit before
the operation as well as we branch-off a verifier path and push it to the
verification stack as we did before which checks the dst_reg under truncation,
in other words, when the speculative domain would attempt to move the pointer
out-of-bounds.
In the second step, we retrieve the new alu_limit and calculate the absolute
distance between both. Moreover, we commit the alu_state and final alu_limit
via update_alu_sanitation_state() to the env's instruction aux data, and bail
out from there if there is a mismatch due to coming from different verification
paths with different states.
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Reported-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Benedict Schlueter <benedict.schlueter@rub.de>
Add a small sanitize_needed() helper function and move sanitize_val_alu()
out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu()
as well out of its opcode switch so this helps to streamline both.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Move the bounds check in adjust_ptr_min_max_vals() into a small helper named
sanitize_check_bounds() in order to simplify the former a bit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Consolidate all error handling and provide more user-friendly error messages
from sanitize_ptr_alu() and sanitize_val_alu().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Small refactor with no semantic changes in order to consolidate the max
ptr_limit boundary check.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
The mixed signed bounds check really belongs into retrieve_ptr_limit()
instead of outside of it in adjust_ptr_min_max_vals(). The reason is
that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer
types that we handle in retrieve_ptr_limit() and given errors from the latter
propagate back to adjust_ptr_min_max_vals() and lead to rejection of the
program, it's a better place to reside to avoid anything slipping through
for future types. The reason why we must reject such off_reg is that we
otherwise would not be able to derive a mask, see details in 9d7eceede7
("bpf: restrict unknown scalars of mixed signed bounds for unprivileged").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on can
use off_reg for generalizing some of the checks for all pointer types.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
We forbid adding unknown scalars with mixed signed bounds due to the
spectre v1 masking mitigation. Hence this also needs bypass_spec_v1
flag instead of allow_ptr_leaks.
Fixes: 2c78ee898d ("bpf: Implement CAP_BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
There is currently no way to discover the target of a tracing program
attachment after the fact. Add this information to bpf_link_info and return
it when querying the bpf_link fd.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bpf_preload_lock is already defined with DEFINE_MUTEX(). There is no
need to initialize it again. Remove the extraneous initialization.
Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210405194904.GA148013@LEGION
'stack' parameter is not used in ___bpf_prog_run() after f696b8f471
("bpf: split bpf core interpreter"), the base address have been set to
FP reg. So consequently remove it.
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210331075135.3850782-1-hefengqing@huawei.com
As for bpf_link, refuse creating a non-O_RDWR fd. Since program fds
currently don't allow modifications this is a precaution, not a
straight up bug fix.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-2-lmb@cloudflare.com
Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access
permissions based on file_flags, but the returned fd ignores flags.
This means that any user can acquire a "read-write" fd for a pinned
link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in
file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc.
Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works
because OBJ_GET by default returns a read write mapping and libbpf
doesn't expose a way to override this behaviour for programs
and links.
Fixes: 70ed506c3b ("bpf: Introduce pinnable bpf_link abstraction")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com
On x86 the struct pt_regs * grabbed by task_pt_regs() points to an
offset of task->stack. The pt_regs are later dereferenced in
__bpf_get_stack (e.g. by user_mode() check). This can cause a fault if
the task in question exits while bpf_get_task_stack is executing, as
warned by task_stack_page's comment:
* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead. task_stack_page will return a pointer
* that could get freed out from under you.
Taking the comment's advice and using try_get_task_stack() and
put_task_stack() to hold task->stack refcount, or bail early if it's
already 0. Incrementing stack_refcount will ensure the task's stack
sticks around while we're using its data.
I noticed this bug while testing a bpf task iter similar to
bpf_iter_task_stack in selftests, except mine grabbed user stack, and
getting intermittent crashes, which resulted in dumps like:
BUG: unable to handle page fault for address: 0000000000003fe0
\#PF: supervisor read access in kernel mode
\#PF: error_code(0x0000) - not-present page
RIP: 0010:__bpf_get_stack+0xd0/0x230
<snip...>
Call Trace:
bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec
bpf_iter_run_prog+0x24/0x81
__task_seq_show+0x58/0x80
bpf_seq_read+0xf7/0x3d0
vfs_read+0x91/0x140
ksys_read+0x59/0xd0
do_syscall_64+0x48/0x120
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com