2014-07-23 14:01:58 +08:00
|
|
|
/*
|
|
|
|
* Linux Socket Filter - Kernel level socket filtering
|
|
|
|
*
|
|
|
|
* Based on the design of the Berkeley Packet Filter. The new
|
|
|
|
* internal format has been designed by PLUMgrid:
|
|
|
|
*
|
|
|
|
* Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
|
|
|
|
*
|
|
|
|
* Authors:
|
|
|
|
*
|
|
|
|
* Jay Schulist <jschlst@samba.org>
|
|
|
|
* Alexei Starovoitov <ast@plumgrid.com>
|
|
|
|
* Daniel Borkmann <dborkman@redhat.com>
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public License
|
|
|
|
* as published by the Free Software Foundation; either version
|
|
|
|
* 2 of the License, or (at your option) any later version.
|
|
|
|
*
|
|
|
|
* Andi Kleen - Fix a few bad bugs and races.
|
2014-07-31 11:34:14 +08:00
|
|
|
* Kris Katterjohn - Added many additional checks in bpf_check_classic()
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
2014-09-08 14:04:47 +08:00
|
|
|
|
bpf: Introduce bpf_func_info
This patch added interface to load a program with the following
additional information:
. prog_btf_fd
. func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.
The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
struct bpf_func_info {
__u32 insn_offset;
__u32 type_id;
};
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.
If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.
The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-20 07:29:11 +08:00
|
|
|
#include <uapi/linux/btf.h>
|
2014-07-23 14:01:58 +08:00
|
|
|
#include <linux/filter.h>
|
|
|
|
#include <linux/skbuff.h>
|
2014-09-03 04:53:44 +08:00
|
|
|
#include <linux/vmalloc.h>
|
2014-09-08 14:04:47 +08:00
|
|
|
#include <linux/random.h>
|
|
|
|
#include <linux/moduleloader.h>
|
2014-09-26 15:17:00 +08:00
|
|
|
#include <linux/bpf.h>
|
bpf: Introduce bpf_func_info
This patch added interface to load a program with the following
additional information:
. prog_btf_fd
. func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.
The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
struct bpf_func_info {
__u32 insn_offset;
__u32 type_id;
};
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.
If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.
The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-20 07:29:11 +08:00
|
|
|
#include <linux/btf.h>
|
2016-02-29 12:22:37 +08:00
|
|
|
#include <linux/frame.h>
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
#include <linux/rbtree_latch.h>
|
|
|
|
#include <linux/kallsyms.h>
|
|
|
|
#include <linux/rcupdate.h>
|
2018-04-29 13:28:08 +08:00
|
|
|
#include <linux/perf_event.h>
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
#include <asm/unaligned.h>
|
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
/* Registers */
|
|
|
|
#define BPF_R0 regs[BPF_REG_0]
|
|
|
|
#define BPF_R1 regs[BPF_REG_1]
|
|
|
|
#define BPF_R2 regs[BPF_REG_2]
|
|
|
|
#define BPF_R3 regs[BPF_REG_3]
|
|
|
|
#define BPF_R4 regs[BPF_REG_4]
|
|
|
|
#define BPF_R5 regs[BPF_REG_5]
|
|
|
|
#define BPF_R6 regs[BPF_REG_6]
|
|
|
|
#define BPF_R7 regs[BPF_REG_7]
|
|
|
|
#define BPF_R8 regs[BPF_REG_8]
|
|
|
|
#define BPF_R9 regs[BPF_REG_9]
|
|
|
|
#define BPF_R10 regs[BPF_REG_10]
|
|
|
|
|
|
|
|
/* Named registers */
|
|
|
|
#define DST regs[insn->dst_reg]
|
|
|
|
#define SRC regs[insn->src_reg]
|
|
|
|
#define FP regs[BPF_REG_FP]
|
|
|
|
#define ARG1 regs[BPF_REG_ARG1]
|
|
|
|
#define CTX regs[BPF_REG_CTX]
|
|
|
|
#define IMM insn->imm
|
|
|
|
|
|
|
|
/* No hurry in this branch
|
|
|
|
*
|
|
|
|
* Exported for the bpf jit load helper.
|
|
|
|
*/
|
|
|
|
void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
|
|
|
|
{
|
|
|
|
u8 *ptr = NULL;
|
|
|
|
|
|
|
|
if (k >= SKF_NET_OFF)
|
|
|
|
ptr = skb_network_header(skb) + k - SKF_NET_OFF;
|
|
|
|
else if (k >= SKF_LL_OFF)
|
|
|
|
ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
|
2015-05-30 05:23:07 +08:00
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
|
|
|
|
return ptr;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
|
|
|
|
{
|
2017-05-09 06:57:44 +08:00
|
|
|
gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
|
2014-09-26 15:17:00 +08:00
|
|
|
struct bpf_prog_aux *aux;
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *fp;
|
|
|
|
|
|
|
|
size = round_up(size, PAGE_SIZE);
|
|
|
|
fp = __vmalloc(size, gfp_flags, PAGE_KERNEL);
|
|
|
|
if (fp == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
aux = kzalloc(sizeof(*aux), GFP_KERNEL | gfp_extra_flags);
|
|
|
|
if (aux == NULL) {
|
2014-09-03 04:53:44 +08:00
|
|
|
vfree(fp);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
fp->pages = size / PAGE_SIZE;
|
2014-09-26 15:17:00 +08:00
|
|
|
fp->aux = aux;
|
2015-10-29 21:58:08 +08:00
|
|
|
fp->aux->prog = fp;
|
2017-12-15 09:55:14 +08:00
|
|
|
fp->jit_requested = ebpf_jit_enabled();
|
2014-09-03 04:53:44 +08:00
|
|
|
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode);
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
return fp;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(bpf_prog_alloc);
|
|
|
|
|
2018-12-08 08:42:25 +08:00
|
|
|
int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
if (!prog->aux->nr_linfo || !prog->jit_requested)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
prog->aux->jited_linfo = kcalloc(prog->aux->nr_linfo,
|
|
|
|
sizeof(*prog->aux->jited_linfo),
|
|
|
|
GFP_KERNEL | __GFP_NOWARN);
|
|
|
|
if (!prog->aux->jited_linfo)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_free_jited_linfo(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
kfree(prog->aux->jited_linfo);
|
|
|
|
prog->aux->jited_linfo = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
if (prog->aux->jited_linfo && !prog->aux->jited_linfo[0])
|
|
|
|
bpf_prog_free_jited_linfo(prog);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The jit engine is responsible to provide an array
|
|
|
|
* for insn_off to the jited_off mapping (insn_to_jit_off).
|
|
|
|
*
|
|
|
|
* The idx to this array is the insn_off. Hence, the insn_off
|
|
|
|
* here is relative to the prog itself instead of the main prog.
|
|
|
|
* This array has one entry for each xlated bpf insn.
|
|
|
|
*
|
|
|
|
* jited_off is the byte off to the last byte of the jited insn.
|
|
|
|
*
|
|
|
|
* Hence, with
|
|
|
|
* insn_start:
|
|
|
|
* The first bpf insn off of the prog. The insn off
|
|
|
|
* here is relative to the main prog.
|
|
|
|
* e.g. if prog is a subprog, insn_start > 0
|
|
|
|
* linfo_idx:
|
|
|
|
* The prog's idx to prog->aux->linfo and jited_linfo
|
|
|
|
*
|
|
|
|
* jited_linfo[linfo_idx] = prog->bpf_func
|
|
|
|
*
|
|
|
|
* For i > linfo_idx,
|
|
|
|
*
|
|
|
|
* jited_linfo[i] = prog->bpf_func +
|
|
|
|
* insn_to_jit_off[linfo[i].insn_off - insn_start - 1]
|
|
|
|
*/
|
|
|
|
void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
|
|
|
|
const u32 *insn_to_jit_off)
|
|
|
|
{
|
|
|
|
u32 linfo_idx, insn_start, insn_end, nr_linfo, i;
|
|
|
|
const struct bpf_line_info *linfo;
|
|
|
|
void **jited_linfo;
|
|
|
|
|
|
|
|
if (!prog->aux->jited_linfo)
|
|
|
|
/* Userspace did not provide linfo */
|
|
|
|
return;
|
|
|
|
|
|
|
|
linfo_idx = prog->aux->linfo_idx;
|
|
|
|
linfo = &prog->aux->linfo[linfo_idx];
|
|
|
|
insn_start = linfo[0].insn_off;
|
|
|
|
insn_end = insn_start + prog->len;
|
|
|
|
|
|
|
|
jited_linfo = &prog->aux->jited_linfo[linfo_idx];
|
|
|
|
jited_linfo[0] = prog->bpf_func;
|
|
|
|
|
|
|
|
nr_linfo = prog->aux->nr_linfo - linfo_idx;
|
|
|
|
|
|
|
|
for (i = 1; i < nr_linfo && linfo[i].insn_off < insn_end; i++)
|
|
|
|
/* The verifier ensures that linfo[i].insn_off is
|
|
|
|
* strictly increasing
|
|
|
|
*/
|
|
|
|
jited_linfo[i] = prog->bpf_func +
|
|
|
|
insn_to_jit_off[linfo[i].insn_off - insn_start - 1];
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_free_linfo(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
bpf_prog_free_jited_linfo(prog);
|
|
|
|
kvfree(prog->aux->linfo);
|
|
|
|
}
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
|
|
|
|
gfp_t gfp_extra_flags)
|
|
|
|
{
|
2017-05-09 06:57:44 +08:00
|
|
|
gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *fp;
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
u32 pages, delta;
|
|
|
|
int ret;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
|
|
|
BUG_ON(fp_old == NULL);
|
|
|
|
|
|
|
|
size = round_up(size, PAGE_SIZE);
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
pages = size / PAGE_SIZE;
|
|
|
|
if (pages <= fp_old->pages)
|
2014-09-03 04:53:44 +08:00
|
|
|
return fp_old;
|
|
|
|
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
delta = pages - fp_old->pages;
|
|
|
|
ret = __bpf_prog_charge(fp_old->aux->user, delta);
|
|
|
|
if (ret)
|
|
|
|
return NULL;
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
fp = __vmalloc(size, gfp_flags, PAGE_KERNEL);
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
if (fp == NULL) {
|
|
|
|
__bpf_prog_uncharge(fp_old->aux->user, delta);
|
|
|
|
} else {
|
2014-09-03 04:53:44 +08:00
|
|
|
memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
fp->pages = pages;
|
2015-10-29 21:58:08 +08:00
|
|
|
fp->aux->prog = fp;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
/* We keep fp->aux from fp_old around in the new
|
2014-09-03 04:53:44 +08:00
|
|
|
* reallocated structure.
|
|
|
|
*/
|
2014-09-26 15:17:00 +08:00
|
|
|
fp_old->aux = NULL;
|
2014-09-03 04:53:44 +08:00
|
|
|
__bpf_prog_free(fp_old);
|
|
|
|
}
|
|
|
|
|
|
|
|
return fp;
|
|
|
|
}
|
|
|
|
|
|
|
|
void __bpf_prog_free(struct bpf_prog *fp)
|
|
|
|
{
|
2014-09-26 15:17:00 +08:00
|
|
|
kfree(fp->aux);
|
2014-09-03 04:53:44 +08:00
|
|
|
vfree(fp);
|
|
|
|
}
|
|
|
|
|
2017-01-14 06:38:15 +08:00
|
|
|
int bpf_prog_calc_tag(struct bpf_prog *fp)
|
2016-12-05 06:19:41 +08:00
|
|
|
{
|
|
|
|
const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
|
2017-01-14 06:38:15 +08:00
|
|
|
u32 raw_size = bpf_prog_tag_scratch_size(fp);
|
|
|
|
u32 digest[SHA_DIGEST_WORDS];
|
2016-12-18 08:52:57 +08:00
|
|
|
u32 ws[SHA_WORKSPACE_WORDS];
|
2016-12-05 06:19:41 +08:00
|
|
|
u32 i, bsize, psize, blocks;
|
2016-12-18 08:52:57 +08:00
|
|
|
struct bpf_insn *dst;
|
2016-12-05 06:19:41 +08:00
|
|
|
bool was_ld_map;
|
2016-12-18 08:52:57 +08:00
|
|
|
u8 *raw, *todo;
|
2016-12-05 06:19:41 +08:00
|
|
|
__be32 *result;
|
|
|
|
__be64 *bits;
|
|
|
|
|
2016-12-18 08:52:57 +08:00
|
|
|
raw = vmalloc(raw_size);
|
|
|
|
if (!raw)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2017-01-14 06:38:15 +08:00
|
|
|
sha_init(digest);
|
2016-12-05 06:19:41 +08:00
|
|
|
memset(ws, 0, sizeof(ws));
|
|
|
|
|
|
|
|
/* We need to take out the map fd for the digest calculation
|
|
|
|
* since they are unstable from user space side.
|
|
|
|
*/
|
2016-12-18 08:52:57 +08:00
|
|
|
dst = (void *)raw;
|
2016-12-05 06:19:41 +08:00
|
|
|
for (i = 0, was_ld_map = false; i < fp->len; i++) {
|
|
|
|
dst[i] = fp->insnsi[i];
|
|
|
|
if (!was_ld_map &&
|
|
|
|
dst[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
|
|
|
|
dst[i].src_reg == BPF_PSEUDO_MAP_FD) {
|
|
|
|
was_ld_map = true;
|
|
|
|
dst[i].imm = 0;
|
|
|
|
} else if (was_ld_map &&
|
|
|
|
dst[i].code == 0 &&
|
|
|
|
dst[i].dst_reg == 0 &&
|
|
|
|
dst[i].src_reg == 0 &&
|
|
|
|
dst[i].off == 0) {
|
|
|
|
was_ld_map = false;
|
|
|
|
dst[i].imm = 0;
|
|
|
|
} else {
|
|
|
|
was_ld_map = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-12-18 08:52:57 +08:00
|
|
|
psize = bpf_prog_insn_size(fp);
|
|
|
|
memset(&raw[psize], 0, raw_size - psize);
|
2016-12-05 06:19:41 +08:00
|
|
|
raw[psize++] = 0x80;
|
|
|
|
|
|
|
|
bsize = round_up(psize, SHA_MESSAGE_BYTES);
|
|
|
|
blocks = bsize / SHA_MESSAGE_BYTES;
|
2016-12-18 08:52:57 +08:00
|
|
|
todo = raw;
|
2016-12-05 06:19:41 +08:00
|
|
|
if (bsize - psize >= sizeof(__be64)) {
|
|
|
|
bits = (__be64 *)(todo + bsize - sizeof(__be64));
|
|
|
|
} else {
|
|
|
|
bits = (__be64 *)(todo + bsize + bits_offset);
|
|
|
|
blocks++;
|
|
|
|
}
|
|
|
|
*bits = cpu_to_be64((psize - 1) << 3);
|
|
|
|
|
|
|
|
while (blocks--) {
|
2017-01-14 06:38:15 +08:00
|
|
|
sha_transform(digest, todo, ws);
|
2016-12-05 06:19:41 +08:00
|
|
|
todo += SHA_MESSAGE_BYTES;
|
|
|
|
}
|
|
|
|
|
2017-01-14 06:38:15 +08:00
|
|
|
result = (__force __be32 *)digest;
|
2016-12-05 06:19:41 +08:00
|
|
|
for (i = 0; i < SHA_DIGEST_WORDS; i++)
|
2017-01-14 06:38:15 +08:00
|
|
|
result[i] = cpu_to_be32(digest[i]);
|
|
|
|
memcpy(fp->tag, result, sizeof(fp->tag));
|
2016-12-18 08:52:57 +08:00
|
|
|
|
|
|
|
vfree(raw);
|
|
|
|
return 0;
|
2016-12-05 06:19:41 +08:00
|
|
|
}
|
|
|
|
|
2018-05-17 07:44:11 +08:00
|
|
|
static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, u32 delta,
|
|
|
|
u32 curr, const bool probe_pass)
|
2016-05-14 01:08:30 +08:00
|
|
|
{
|
2018-05-17 07:44:11 +08:00
|
|
|
const s64 imm_min = S32_MIN, imm_max = S32_MAX;
|
|
|
|
s64 imm = insn->imm;
|
|
|
|
|
|
|
|
if (curr < pos && curr + imm + 1 > pos)
|
|
|
|
imm += delta;
|
|
|
|
else if (curr > pos + delta && curr + imm + 1 <= pos + delta)
|
|
|
|
imm -= delta;
|
|
|
|
if (imm < imm_min || imm > imm_max)
|
|
|
|
return -ERANGE;
|
|
|
|
if (!probe_pass)
|
|
|
|
insn->imm = imm;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta,
|
|
|
|
u32 curr, const bool probe_pass)
|
|
|
|
{
|
|
|
|
const s32 off_min = S16_MIN, off_max = S16_MAX;
|
|
|
|
s32 off = insn->off;
|
|
|
|
|
|
|
|
if (curr < pos && curr + off + 1 > pos)
|
|
|
|
off += delta;
|
|
|
|
else if (curr > pos + delta && curr + off + 1 <= pos + delta)
|
|
|
|
off -= delta;
|
|
|
|
if (off < off_min || off > off_max)
|
|
|
|
return -ERANGE;
|
|
|
|
if (!probe_pass)
|
|
|
|
insn->off = off;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta,
|
|
|
|
const bool probe_pass)
|
|
|
|
{
|
|
|
|
u32 i, insn_cnt = prog->len + (probe_pass ? delta : 0);
|
2016-05-14 01:08:30 +08:00
|
|
|
struct bpf_insn *insn = prog->insnsi;
|
2018-05-17 07:44:11 +08:00
|
|
|
int ret = 0;
|
2016-05-14 01:08:30 +08:00
|
|
|
|
|
|
|
for (i = 0; i < insn_cnt; i++, insn++) {
|
2018-05-17 07:44:11 +08:00
|
|
|
u8 code;
|
|
|
|
|
|
|
|
/* In the probing pass we still operate on the original,
|
|
|
|
* unpatched image in order to check overflows before we
|
|
|
|
* do any other adjustments. Therefore skip the patchlet.
|
|
|
|
*/
|
|
|
|
if (probe_pass && i == pos) {
|
|
|
|
i += delta + 1;
|
|
|
|
insn++;
|
|
|
|
}
|
2017-12-15 09:55:13 +08:00
|
|
|
code = insn->code;
|
2018-05-17 07:44:11 +08:00
|
|
|
if (BPF_CLASS(code) != BPF_JMP ||
|
|
|
|
BPF_OP(code) == BPF_EXIT)
|
2017-12-15 09:55:13 +08:00
|
|
|
continue;
|
2018-05-17 07:44:11 +08:00
|
|
|
/* Adjust offset of jmps if we cross patch boundaries. */
|
2017-12-15 09:55:13 +08:00
|
|
|
if (BPF_OP(code) == BPF_CALL) {
|
2018-05-17 07:44:11 +08:00
|
|
|
if (insn->src_reg != BPF_PSEUDO_CALL)
|
2017-12-15 09:55:13 +08:00
|
|
|
continue;
|
2018-05-17 07:44:11 +08:00
|
|
|
ret = bpf_adj_delta_to_imm(insn, pos, delta, i,
|
|
|
|
probe_pass);
|
2017-12-15 09:55:13 +08:00
|
|
|
} else {
|
2018-05-17 07:44:11 +08:00
|
|
|
ret = bpf_adj_delta_to_off(insn, pos, delta, i,
|
|
|
|
probe_pass);
|
2017-12-15 09:55:13 +08:00
|
|
|
}
|
2018-05-17 07:44:11 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2016-05-14 01:08:30 +08:00
|
|
|
}
|
2018-05-17 07:44:11 +08:00
|
|
|
|
|
|
|
return ret;
|
2016-05-14 01:08:30 +08:00
|
|
|
}
|
|
|
|
|
2018-12-08 08:42:25 +08:00
|
|
|
static void bpf_adj_linfo(struct bpf_prog *prog, u32 off, u32 delta)
|
|
|
|
{
|
|
|
|
struct bpf_line_info *linfo;
|
|
|
|
u32 i, nr_linfo;
|
|
|
|
|
|
|
|
nr_linfo = prog->aux->nr_linfo;
|
|
|
|
if (!nr_linfo || !delta)
|
|
|
|
return;
|
|
|
|
|
|
|
|
linfo = prog->aux->linfo;
|
|
|
|
|
|
|
|
for (i = 0; i < nr_linfo; i++)
|
|
|
|
if (off < linfo[i].insn_off)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* Push all off < linfo[i].insn_off by delta */
|
|
|
|
for (; i < nr_linfo; i++)
|
|
|
|
linfo[i].insn_off += delta;
|
|
|
|
}
|
|
|
|
|
2016-05-14 01:08:30 +08:00
|
|
|
struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
|
|
|
|
const struct bpf_insn *patch, u32 len)
|
|
|
|
{
|
|
|
|
u32 insn_adj_cnt, insn_rest, insn_delta = len - 1;
|
2018-05-17 07:44:11 +08:00
|
|
|
const u32 cnt_max = S16_MAX;
|
2016-05-14 01:08:30 +08:00
|
|
|
struct bpf_prog *prog_adj;
|
|
|
|
|
|
|
|
/* Since our patchlet doesn't expand the image, we're done. */
|
|
|
|
if (insn_delta == 0) {
|
|
|
|
memcpy(prog->insnsi + off, patch, sizeof(*patch));
|
|
|
|
return prog;
|
|
|
|
}
|
|
|
|
|
|
|
|
insn_adj_cnt = prog->len + insn_delta;
|
|
|
|
|
2018-05-17 07:44:11 +08:00
|
|
|
/* Reject anything that would potentially let the insn->off
|
|
|
|
* target overflow when we have excessive program expansions.
|
|
|
|
* We need to probe here before we do any reallocation where
|
|
|
|
* we afterwards may not fail anymore.
|
|
|
|
*/
|
|
|
|
if (insn_adj_cnt > cnt_max &&
|
|
|
|
bpf_adj_branches(prog, off, insn_delta, true))
|
|
|
|
return NULL;
|
|
|
|
|
2016-05-14 01:08:30 +08:00
|
|
|
/* Several new instructions need to be inserted. Make room
|
|
|
|
* for them. Likely, there's no need for a new allocation as
|
|
|
|
* last page could have large enough tailroom.
|
|
|
|
*/
|
|
|
|
prog_adj = bpf_prog_realloc(prog, bpf_prog_size(insn_adj_cnt),
|
|
|
|
GFP_USER);
|
|
|
|
if (!prog_adj)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
prog_adj->len = insn_adj_cnt;
|
|
|
|
|
|
|
|
/* Patching happens in 3 steps:
|
|
|
|
*
|
|
|
|
* 1) Move over tail of insnsi from next instruction onwards,
|
|
|
|
* so we can patch the single target insn with one or more
|
|
|
|
* new ones (patching is always from 1 to n insns, n > 0).
|
|
|
|
* 2) Inject new instructions at the target location.
|
|
|
|
* 3) Adjust branch offsets if necessary.
|
|
|
|
*/
|
|
|
|
insn_rest = insn_adj_cnt - off - len;
|
|
|
|
|
|
|
|
memmove(prog_adj->insnsi + off + len, prog_adj->insnsi + off + 1,
|
|
|
|
sizeof(*patch) * insn_rest);
|
|
|
|
memcpy(prog_adj->insnsi + off, patch, sizeof(*patch) * len);
|
|
|
|
|
2018-05-17 07:44:11 +08:00
|
|
|
/* We are guaranteed to not fail at this point, otherwise
|
|
|
|
* the ship has sailed to reverse to the original state. An
|
|
|
|
* overflow cannot happen at this point.
|
|
|
|
*/
|
|
|
|
BUG_ON(bpf_adj_branches(prog_adj, off, insn_delta, false));
|
2016-05-14 01:08:30 +08:00
|
|
|
|
2018-12-08 08:42:25 +08:00
|
|
|
bpf_adj_linfo(prog_adj, off, insn_delta);
|
|
|
|
|
2016-05-14 01:08:30 +08:00
|
|
|
return prog_adj;
|
|
|
|
}
|
|
|
|
|
2018-06-15 08:30:47 +08:00
|
|
|
void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < fp->aux->func_cnt; i++)
|
|
|
|
bpf_prog_kallsyms_del(fp->aux->func[i]);
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
bpf_prog_kallsyms_del_subprogs(fp);
|
|
|
|
bpf_prog_kallsyms_del(fp);
|
|
|
|
}
|
|
|
|
|
2014-09-10 21:01:02 +08:00
|
|
|
#ifdef CONFIG_BPF_JIT
|
2018-01-20 08:24:33 +08:00
|
|
|
/* All BPF JIT sysctl knobs here. */
|
|
|
|
int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
|
|
|
|
int bpf_jit_harden __read_mostly;
|
|
|
|
int bpf_jit_kallsyms __read_mostly;
|
bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K
Michael and Sandipan report:
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.
For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:
root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
-1673527296
and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:
setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
16) = -1 ENOTSUPP (Unknown error 524)
and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
host would time out on systemd/netplan configuring a virtio-net NIC
with no noticeable errors in the logs.
Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().
Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
Additionally, for archs supporting large page sizes, we should change
the sysctl to be handled as long to not run into sysctl restrictions
in future.
Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-11 19:14:12 +08:00
|
|
|
long bpf_jit_limit __read_mostly;
|
2018-01-20 08:24:33 +08:00
|
|
|
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
static __always_inline void
|
|
|
|
bpf_get_prog_addr_region(const struct bpf_prog *prog,
|
|
|
|
unsigned long *symbol_start,
|
|
|
|
unsigned long *symbol_end)
|
|
|
|
{
|
|
|
|
const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(prog);
|
|
|
|
unsigned long addr = (unsigned long)hdr;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!bpf_prog_ebpf_jited(prog));
|
|
|
|
|
|
|
|
*symbol_start = addr;
|
|
|
|
*symbol_end = addr + hdr->pages * PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
|
|
|
|
{
|
2017-10-06 12:52:13 +08:00
|
|
|
const char *end = sym + KSYM_NAME_LEN;
|
bpf: Introduce bpf_func_info
This patch added interface to load a program with the following
additional information:
. prog_btf_fd
. func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.
The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
struct bpf_func_info {
__u32 insn_offset;
__u32 type_id;
};
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.
If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.
The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-20 07:29:11 +08:00
|
|
|
const struct btf_type *type;
|
|
|
|
const char *func_name;
|
2017-10-06 12:52:13 +08:00
|
|
|
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
BUILD_BUG_ON(sizeof("bpf_prog_") +
|
2017-10-06 12:52:13 +08:00
|
|
|
sizeof(prog->tag) * 2 +
|
|
|
|
/* name has been null terminated.
|
|
|
|
* We should need +1 for the '_' preceding
|
|
|
|
* the name. However, the null character
|
|
|
|
* is double counted between the name and the
|
|
|
|
* sizeof("bpf_prog_") above, so we omit
|
|
|
|
* the +1 here.
|
|
|
|
*/
|
|
|
|
sizeof(prog->aux->name) > KSYM_NAME_LEN);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
|
|
|
|
sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
|
|
|
|
sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
|
bpf: Introduce bpf_func_info
This patch added interface to load a program with the following
additional information:
. prog_btf_fd
. func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.
The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
struct bpf_func_info {
__u32 insn_offset;
__u32 type_id;
};
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.
If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.
The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-20 07:29:11 +08:00
|
|
|
|
|
|
|
/* prog->aux->name will be ignored if full btf name is available */
|
2018-12-06 09:35:43 +08:00
|
|
|
if (prog->aux->func_info_cnt) {
|
2018-11-25 15:20:44 +08:00
|
|
|
type = btf_type_by_id(prog->aux->btf,
|
|
|
|
prog->aux->func_info[prog->aux->func_idx].type_id);
|
bpf: Introduce bpf_func_info
This patch added interface to load a program with the following
additional information:
. prog_btf_fd
. func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.
The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
struct bpf_func_info {
__u32 insn_offset;
__u32 type_id;
};
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.
If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.
The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-20 07:29:11 +08:00
|
|
|
func_name = btf_name_by_offset(prog->aux->btf, type->name_off);
|
|
|
|
snprintf(sym, (size_t)(end - sym), "_%s", func_name);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-10-06 12:52:13 +08:00
|
|
|
if (prog->aux->name[0])
|
|
|
|
snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
|
|
|
|
else
|
|
|
|
*sym = 0;
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline unsigned long
|
|
|
|
bpf_get_prog_addr_start(struct latch_tree_node *n)
|
|
|
|
{
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
const struct bpf_prog_aux *aux;
|
|
|
|
|
|
|
|
aux = container_of(n, struct bpf_prog_aux, ksym_tnode);
|
|
|
|
bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
|
|
|
|
|
|
|
|
return symbol_start;
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline bool bpf_tree_less(struct latch_tree_node *a,
|
|
|
|
struct latch_tree_node *b)
|
|
|
|
{
|
|
|
|
return bpf_get_prog_addr_start(a) < bpf_get_prog_addr_start(b);
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline int bpf_tree_comp(void *key, struct latch_tree_node *n)
|
|
|
|
{
|
|
|
|
unsigned long val = (unsigned long)key;
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
const struct bpf_prog_aux *aux;
|
|
|
|
|
|
|
|
aux = container_of(n, struct bpf_prog_aux, ksym_tnode);
|
|
|
|
bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
|
|
|
|
|
|
|
|
if (val < symbol_start)
|
|
|
|
return -1;
|
|
|
|
if (val >= symbol_end)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct latch_tree_ops bpf_tree_ops = {
|
|
|
|
.less = bpf_tree_less,
|
|
|
|
.comp = bpf_tree_comp,
|
|
|
|
};
|
|
|
|
|
|
|
|
static DEFINE_SPINLOCK(bpf_lock);
|
|
|
|
static LIST_HEAD(bpf_kallsyms);
|
|
|
|
static struct latch_tree_root bpf_tree __cacheline_aligned;
|
|
|
|
|
|
|
|
static void bpf_prog_ksym_node_add(struct bpf_prog_aux *aux)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!list_empty(&aux->ksym_lnode));
|
|
|
|
list_add_tail_rcu(&aux->ksym_lnode, &bpf_kallsyms);
|
|
|
|
latch_tree_insert(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux)
|
|
|
|
{
|
|
|
|
if (list_empty(&aux->ksym_lnode))
|
|
|
|
return;
|
|
|
|
|
|
|
|
latch_tree_erase(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
|
|
|
|
list_del_rcu(&aux->ksym_lnode);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
return fp->jited && !bpf_prog_was_classic(fp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
return list_empty(&fp->aux->ksym_lnode) ||
|
|
|
|
fp->aux->ksym_lnode.prev == LIST_POISON2;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_kallsyms_add(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
if (!bpf_prog_kallsyms_candidate(fp) ||
|
|
|
|
!capable(CAP_SYS_ADMIN))
|
|
|
|
return;
|
|
|
|
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_lock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
bpf_prog_ksym_node_add(fp->aux);
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_unlock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_kallsyms_del(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
if (!bpf_prog_kallsyms_candidate(fp))
|
|
|
|
return;
|
|
|
|
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_lock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
bpf_prog_ksym_node_del(fp->aux);
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_unlock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
|
|
|
|
{
|
|
|
|
struct latch_tree_node *n;
|
|
|
|
|
|
|
|
if (!bpf_jit_kallsyms_enabled())
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops);
|
|
|
|
return n ?
|
|
|
|
container_of(n, struct bpf_prog_aux, ksym_tnode)->prog :
|
|
|
|
NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
|
|
|
|
unsigned long *off, char *sym)
|
|
|
|
{
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
struct bpf_prog *prog;
|
|
|
|
char *ret = NULL;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
prog = bpf_prog_kallsyms_find(addr);
|
|
|
|
if (prog) {
|
|
|
|
bpf_get_prog_addr_region(prog, &symbol_start, &symbol_end);
|
|
|
|
bpf_get_prog_name(prog, sym);
|
|
|
|
|
|
|
|
ret = sym;
|
|
|
|
if (size)
|
|
|
|
*size = symbol_end - symbol_start;
|
|
|
|
if (off)
|
|
|
|
*off = addr - symbol_start;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool is_bpf_text_address(unsigned long addr)
|
|
|
|
{
|
|
|
|
bool ret;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
ret = bpf_prog_kallsyms_find(addr) != NULL;
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
|
|
|
|
char *sym)
|
|
|
|
{
|
|
|
|
struct bpf_prog_aux *aux;
|
|
|
|
unsigned int it = 0;
|
|
|
|
int ret = -ERANGE;
|
|
|
|
|
|
|
|
if (!bpf_jit_kallsyms_enabled())
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(aux, &bpf_kallsyms, ksym_lnode) {
|
|
|
|
if (it++ != symnum)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
bpf_get_prog_name(aux->prog, sym);
|
|
|
|
|
2018-11-03 01:16:15 +08:00
|
|
|
*value = (unsigned long)aux->prog->bpf_func;
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
*type = BPF_SYM_ELF_TYPE;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-10-23 07:11:04 +08:00
|
|
|
static atomic_long_t bpf_jit_current;
|
|
|
|
|
bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K
Michael and Sandipan report:
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.
For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:
root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
-1673527296
and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:
setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
16) = -1 ENOTSUPP (Unknown error 524)
and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
host would time out on systemd/netplan configuring a virtio-net NIC
with no noticeable errors in the logs.
Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().
Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
Additionally, for archs supporting large page sizes, we should change
the sysctl to be handled as long to not run into sysctl restrictions
in future.
Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-11 19:14:12 +08:00
|
|
|
/* Can be overridden by an arch's JIT compiler if it has a custom,
|
|
|
|
* dedicated BPF backend memory area, or if neither of the two
|
|
|
|
* below apply.
|
|
|
|
*/
|
|
|
|
u64 __weak bpf_jit_alloc_exec_limit(void)
|
|
|
|
{
|
2018-10-23 07:11:04 +08:00
|
|
|
#if defined(MODULES_VADDR)
|
bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K
Michael and Sandipan report:
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.
For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:
root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
-1673527296
and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:
setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
16) = -1 ENOTSUPP (Unknown error 524)
and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
host would time out on systemd/netplan configuring a virtio-net NIC
with no noticeable errors in the logs.
Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().
Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
Additionally, for archs supporting large page sizes, we should change
the sysctl to be handled as long to not run into sysctl restrictions
in future.
Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-11 19:14:12 +08:00
|
|
|
return MODULES_END - MODULES_VADDR;
|
|
|
|
#else
|
|
|
|
return VMALLOC_END - VMALLOC_START;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2018-10-23 07:11:04 +08:00
|
|
|
static int __init bpf_jit_charge_init(void)
|
|
|
|
{
|
|
|
|
/* Only used as heuristic here to derive limit. */
|
bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K
Michael and Sandipan report:
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.
For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:
root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
-1673527296
and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:
setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
16) = -1 ENOTSUPP (Unknown error 524)
and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
host would time out on systemd/netplan configuring a virtio-net NIC
with no noticeable errors in the logs.
Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().
Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
Additionally, for archs supporting large page sizes, we should change
the sysctl to be handled as long to not run into sysctl restrictions
in future.
Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-11 19:14:12 +08:00
|
|
|
bpf_jit_limit = min_t(u64, round_up(bpf_jit_alloc_exec_limit() >> 2,
|
|
|
|
PAGE_SIZE), LONG_MAX);
|
2018-10-23 07:11:04 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
pure_initcall(bpf_jit_charge_init);
|
|
|
|
|
|
|
|
static int bpf_jit_charge_modmem(u32 pages)
|
|
|
|
{
|
|
|
|
if (atomic_long_add_return(pages, &bpf_jit_current) >
|
|
|
|
(bpf_jit_limit >> PAGE_SHIFT)) {
|
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
|
|
|
atomic_long_sub(pages, &bpf_jit_current);
|
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_jit_uncharge_modmem(u32 pages)
|
|
|
|
{
|
|
|
|
atomic_long_sub(pages, &bpf_jit_current);
|
|
|
|
}
|
|
|
|
|
2018-11-24 06:18:03 +08:00
|
|
|
void *__weak bpf_jit_alloc_exec(unsigned long size)
|
|
|
|
{
|
|
|
|
return module_alloc(size);
|
|
|
|
}
|
|
|
|
|
|
|
|
void __weak bpf_jit_free_exec(void *addr)
|
|
|
|
{
|
|
|
|
module_memfree(addr);
|
|
|
|
}
|
|
|
|
|
2014-09-08 14:04:47 +08:00
|
|
|
struct bpf_binary_header *
|
|
|
|
bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
|
|
|
|
unsigned int alignment,
|
|
|
|
bpf_jit_fill_hole_t bpf_fill_ill_insns)
|
|
|
|
{
|
|
|
|
struct bpf_binary_header *hdr;
|
2018-10-23 07:11:04 +08:00
|
|
|
u32 size, hole, start, pages;
|
2014-09-08 14:04:47 +08:00
|
|
|
|
|
|
|
/* Most of BPF filters are really small, but if some of them
|
|
|
|
* fill a page, allow at least 128 extra bytes to insert a
|
|
|
|
* random section of illegal instructions.
|
|
|
|
*/
|
|
|
|
size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
|
2018-10-23 07:11:04 +08:00
|
|
|
pages = size / PAGE_SIZE;
|
|
|
|
|
|
|
|
if (bpf_jit_charge_modmem(pages))
|
|
|
|
return NULL;
|
2018-11-24 06:18:03 +08:00
|
|
|
hdr = bpf_jit_alloc_exec(size);
|
2018-10-23 07:11:04 +08:00
|
|
|
if (!hdr) {
|
|
|
|
bpf_jit_uncharge_modmem(pages);
|
2014-09-08 14:04:47 +08:00
|
|
|
return NULL;
|
2018-10-23 07:11:04 +08:00
|
|
|
}
|
2014-09-08 14:04:47 +08:00
|
|
|
|
|
|
|
/* Fill space with illegal/arch-dep instructions. */
|
|
|
|
bpf_fill_ill_insns(hdr, size);
|
|
|
|
|
2018-10-23 07:11:04 +08:00
|
|
|
hdr->pages = pages;
|
2014-09-08 14:04:47 +08:00
|
|
|
hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
|
|
|
|
PAGE_SIZE - sizeof(*hdr));
|
2016-05-18 20:14:28 +08:00
|
|
|
start = (get_random_int() % hole) & ~(alignment - 1);
|
2014-09-08 14:04:47 +08:00
|
|
|
|
|
|
|
/* Leave a random number of instructions before BPF code. */
|
|
|
|
*image_ptr = &hdr->image[start];
|
|
|
|
|
|
|
|
return hdr;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_jit_binary_free(struct bpf_binary_header *hdr)
|
|
|
|
{
|
2018-10-23 07:11:04 +08:00
|
|
|
u32 pages = hdr->pages;
|
|
|
|
|
2018-11-24 06:18:03 +08:00
|
|
|
bpf_jit_free_exec(hdr);
|
2018-10-23 07:11:04 +08:00
|
|
|
bpf_jit_uncharge_modmem(pages);
|
2014-09-08 14:04:47 +08:00
|
|
|
}
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
/* This symbol is only overridden by archs that have different
|
|
|
|
* requirements than the usual eBPF JITs, f.e. when they only
|
|
|
|
* implement cBPF JIT, do not set images read-only, etc.
|
|
|
|
*/
|
|
|
|
void __weak bpf_jit_free(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
if (fp->jited) {
|
|
|
|
struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
|
|
|
|
|
|
|
|
bpf_jit_binary_unlock_ro(hdr);
|
|
|
|
bpf_jit_binary_free(hdr);
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
|
|
|
|
}
|
|
|
|
|
|
|
|
bpf_prog_unlock_free(fp);
|
|
|
|
}
|
|
|
|
|
2018-11-26 21:05:38 +08:00
|
|
|
int bpf_jit_get_func_addr(const struct bpf_prog *prog,
|
|
|
|
const struct bpf_insn *insn, bool extra_pass,
|
|
|
|
u64 *func_addr, bool *func_addr_fixed)
|
|
|
|
{
|
|
|
|
s16 off = insn->off;
|
|
|
|
s32 imm = insn->imm;
|
|
|
|
u8 *addr;
|
|
|
|
|
|
|
|
*func_addr_fixed = insn->src_reg != BPF_PSEUDO_CALL;
|
|
|
|
if (!*func_addr_fixed) {
|
|
|
|
/* Place-holder address till the last pass has collected
|
|
|
|
* all addresses for JITed subprograms in which case we
|
|
|
|
* can pick them up from prog->aux.
|
|
|
|
*/
|
|
|
|
if (!extra_pass)
|
|
|
|
addr = NULL;
|
|
|
|
else if (prog->aux->func &&
|
|
|
|
off >= 0 && off < prog->aux->func_cnt)
|
|
|
|
addr = (u8 *)prog->aux->func[off]->bpf_func;
|
|
|
|
else
|
|
|
|
return -EINVAL;
|
|
|
|
} else {
|
|
|
|
/* Address of a BPF helper call. Since part of the core
|
|
|
|
* kernel, it's always at a fixed location. __bpf_call_base
|
|
|
|
* and the helper with imm relative to it are both in core
|
|
|
|
* kernel.
|
|
|
|
*/
|
|
|
|
addr = (u8 *)__bpf_call_base + imm;
|
|
|
|
}
|
|
|
|
|
|
|
|
*func_addr = (unsigned long)addr;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
static int bpf_jit_blind_insn(const struct bpf_insn *from,
|
|
|
|
const struct bpf_insn *aux,
|
|
|
|
struct bpf_insn *to_buff)
|
|
|
|
{
|
|
|
|
struct bpf_insn *to = to_buff;
|
2016-05-18 20:14:28 +08:00
|
|
|
u32 imm_rnd = get_random_int();
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
s16 off;
|
|
|
|
|
|
|
|
BUILD_BUG_ON(BPF_REG_AX + 1 != MAX_BPF_JIT_REG);
|
|
|
|
BUILD_BUG_ON(MAX_BPF_REG + 1 != MAX_BPF_JIT_REG);
|
|
|
|
|
|
|
|
if (from->imm == 0 &&
|
|
|
|
(from->code == (BPF_ALU | BPF_MOV | BPF_K) ||
|
|
|
|
from->code == (BPF_ALU64 | BPF_MOV | BPF_K))) {
|
|
|
|
*to++ = BPF_ALU64_REG(BPF_XOR, from->dst_reg, from->dst_reg);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (from->code) {
|
|
|
|
case BPF_ALU | BPF_ADD | BPF_K:
|
|
|
|
case BPF_ALU | BPF_SUB | BPF_K:
|
|
|
|
case BPF_ALU | BPF_AND | BPF_K:
|
|
|
|
case BPF_ALU | BPF_OR | BPF_K:
|
|
|
|
case BPF_ALU | BPF_XOR | BPF_K:
|
|
|
|
case BPF_ALU | BPF_MUL | BPF_K:
|
|
|
|
case BPF_ALU | BPF_MOV | BPF_K:
|
|
|
|
case BPF_ALU | BPF_DIV | BPF_K:
|
|
|
|
case BPF_ALU | BPF_MOD | BPF_K:
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_ALU64 | BPF_ADD | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_SUB | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_AND | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_OR | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_XOR | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MUL | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MOV | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_DIV | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MOD | BPF_K:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU64_REG(from->code, from->dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_JMP | BPF_JEQ | BPF_K:
|
|
|
|
case BPF_JMP | BPF_JNE | BPF_K:
|
|
|
|
case BPF_JMP | BPF_JGT | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JLT | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JGE | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JLE | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JSGT | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JSLT | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JSGE | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JSLE | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JSET | BPF_K:
|
|
|
|
/* Accommodate for extra offset in case of a backjump. */
|
|
|
|
off = from->off;
|
|
|
|
if (off < 0)
|
|
|
|
off -= 2;
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_JMP_REG(from->code, from->dst_reg, BPF_REG_AX, off);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_LD | BPF_IMM | BPF_DW:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[1].imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32);
|
|
|
|
*to++ = BPF_ALU64_REG(BPF_MOV, aux[0].dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm);
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU64_REG(BPF_OR, aux[0].dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_ST | BPF_MEM | BPF_DW:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_W:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_H:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_B:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_STX_MEM(from->code, from->dst_reg, BPF_REG_AX, from->off);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return to - to_buff;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct bpf_prog *bpf_prog_clone_create(struct bpf_prog *fp_other,
|
|
|
|
gfp_t gfp_extra_flags)
|
|
|
|
{
|
2017-05-09 06:57:44 +08:00
|
|
|
gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
struct bpf_prog *fp;
|
|
|
|
|
|
|
|
fp = __vmalloc(fp_other->pages * PAGE_SIZE, gfp_flags, PAGE_KERNEL);
|
|
|
|
if (fp != NULL) {
|
|
|
|
/* aux->prog still points to the fp_other one, so
|
|
|
|
* when promoting the clone to the real program,
|
|
|
|
* this still needs to be adapted.
|
|
|
|
*/
|
|
|
|
memcpy(fp, fp_other, fp_other->pages * PAGE_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return fp;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_prog_clone_free(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
/* aux was stolen by the other clone, so we cannot free
|
|
|
|
* it from this path! It will be freed eventually by the
|
|
|
|
* other program on release.
|
|
|
|
*
|
|
|
|
* At this point, we don't need a deferred release since
|
|
|
|
* clone is guaranteed to not be locked.
|
|
|
|
*/
|
|
|
|
fp->aux = NULL;
|
|
|
|
__bpf_prog_free(fp);
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_jit_prog_release_other(struct bpf_prog *fp, struct bpf_prog *fp_other)
|
|
|
|
{
|
|
|
|
/* We have to repoint aux->prog to self, as we don't
|
|
|
|
* know whether fp here is the clone or the original.
|
|
|
|
*/
|
|
|
|
fp->aux->prog = fp;
|
|
|
|
bpf_prog_clone_free(fp_other);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
struct bpf_insn insn_buff[16], aux[2];
|
|
|
|
struct bpf_prog *clone, *tmp;
|
|
|
|
int insn_delta, insn_cnt;
|
|
|
|
struct bpf_insn *insn;
|
|
|
|
int i, rewritten;
|
|
|
|
|
2017-12-15 09:55:15 +08:00
|
|
|
if (!bpf_jit_blinding_enabled(prog) || prog->blinded)
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
return prog;
|
|
|
|
|
|
|
|
clone = bpf_prog_clone_create(prog, GFP_USER);
|
|
|
|
if (!clone)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
insn_cnt = clone->len;
|
|
|
|
insn = clone->insnsi;
|
|
|
|
|
|
|
|
for (i = 0; i < insn_cnt; i++, insn++) {
|
|
|
|
/* We temporarily need to hold the original ld64 insn
|
|
|
|
* so that we can still access the first part in the
|
|
|
|
* second blinding run.
|
|
|
|
*/
|
|
|
|
if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW) &&
|
|
|
|
insn[1].code == 0)
|
|
|
|
memcpy(aux, insn, sizeof(aux));
|
|
|
|
|
|
|
|
rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
|
|
|
|
if (!rewritten)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
tmp = bpf_patch_insn_single(clone, i, insn_buff, rewritten);
|
|
|
|
if (!tmp) {
|
|
|
|
/* Patching may have repointed aux->prog during
|
|
|
|
* realloc from the original one, so we need to
|
|
|
|
* fix it up here on error.
|
|
|
|
*/
|
|
|
|
bpf_jit_prog_release_other(prog, clone);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
|
|
|
clone = tmp;
|
|
|
|
insn_delta = rewritten - 1;
|
|
|
|
|
|
|
|
/* Walk new program and skip insns we just inserted. */
|
|
|
|
insn = clone->insnsi + i + insn_delta;
|
|
|
|
insn_cnt += insn_delta;
|
|
|
|
i += insn_delta;
|
|
|
|
}
|
|
|
|
|
2017-12-15 09:55:15 +08:00
|
|
|
clone->blinded = 1;
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
return clone;
|
|
|
|
}
|
2014-09-10 21:01:02 +08:00
|
|
|
#endif /* CONFIG_BPF_JIT */
|
2014-09-08 14:04:47 +08:00
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
/* Base function for offset calculation. Needs to go into .text section,
|
|
|
|
* therefore keeping it non-static as well; will also be used by JITs
|
bpf: allow for correlation of maps and helpers in dump
Currently a dump of an xlated prog (post verifier stage) doesn't
correlate used helpers as well as maps. The prog info lists
involved map ids, however there's no correlation of where in the
program they are used as of today. Likewise, bpftool does not
correlate helper calls with the target functions.
The latter can be done w/o any kernel changes through kallsyms,
and also has the advantage that this works with inlined helpers
and BPF calls.
Example, via interpreter:
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
direct-action not_in_hw id 1 tag c74773051b364165 <-- prog id:1
* Output before patch (calls/maps remain unclear):
# bpftool prog dump xlated id 1 <-- dump prog id:1
0: (b7) r1 = 2
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
3: (07) r2 += -4
4: (18) r1 = 0xffff95c47a8d4800
6: (85) call unknown#73040
7: (15) if r0 == 0x0 goto pc+18
8: (bf) r2 = r10
9: (07) r2 += -4
10: (bf) r1 = r0
11: (85) call unknown#73040
12: (15) if r0 == 0x0 goto pc+23
[...]
* Output after patch:
# bpftool prog dump xlated id 1
0: (b7) r1 = 2
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
3: (07) r2 += -4
4: (18) r1 = map[id:2] <-- map id:2
6: (85) call bpf_map_lookup_elem#73424 <-- helper call
7: (15) if r0 == 0x0 goto pc+18
8: (bf) r2 = r10
9: (07) r2 += -4
10: (bf) r1 = r0
11: (85) call bpf_map_lookup_elem#73424
12: (15) if r0 == 0x0 goto pc+23
[...]
# bpftool map show id 2 <-- show/dump/etc map id:2
2: hash_of_maps flags 0x0
key 4B value 4B max_entries 3 memlock 4096B
Example, JITed, same prog:
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
direct-action not_in_hw id 3 tag c74773051b364165 jited
# bpftool prog show id 3
3: sched_cls tag c74773051b364165
loaded_at Dec 19/13:48 uid 0
xlated 384B jited 257B memlock 4096B map_ids 2
# bpftool prog dump xlated id 3
0: (b7) r1 = 2
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
3: (07) r2 += -4
4: (18) r1 = map[id:2] <-- map id:2
6: (85) call __htab_map_lookup_elem#77408 <-+ inlined rewrite
7: (15) if r0 == 0x0 goto pc+2 |
8: (07) r0 += 56 |
9: (79) r0 = *(u64 *)(r0 +0) <-+
10: (15) if r0 == 0x0 goto pc+24
11: (bf) r2 = r10
12: (07) r2 += -4
[...]
Example, same prog, but kallsyms disabled (in that case we are
also not allowed to pass any relative offsets, etc, so prog
becomes pointer sanitized on dump):
# sysctl kernel.kptr_restrict=2
kernel.kptr_restrict = 2
# bpftool prog dump xlated id 3
0: (b7) r1 = 2
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
3: (07) r2 += -4
4: (18) r1 = map[id:2]
6: (85) call bpf_unspec#0
7: (15) if r0 == 0x0 goto pc+2
[...]
Example, BPF calls via interpreter:
# bpftool prog dump xlated id 1
0: (85) call pc+2#__bpf_prog_run_args32
1: (b7) r0 = 1
2: (95) exit
3: (b7) r0 = 2
4: (95) exit
Example, BPF calls via JIT:
# sysctl net.core.bpf_jit_enable=1
net.core.bpf_jit_enable = 1
# sysctl net.core.bpf_jit_kallsyms=1
net.core.bpf_jit_kallsyms = 1
# bpftool prog dump xlated id 1
0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
1: (b7) r0 = 1
2: (95) exit
3: (b7) r0 = 2
4: (95) exit
And finally, an example for tail calls that is now working
as well wrt correlation:
# bpftool prog dump xlated id 2
[...]
10: (b7) r2 = 8
11: (85) call bpf_trace_printk#-41312
12: (bf) r1 = r6
13: (18) r2 = map[id:1]
15: (b7) r3 = 0
16: (85) call bpf_tail_call#12
17: (b7) r1 = 42
18: (6b) *(u16 *)(r6 +46) = r1
19: (b7) r0 = 0
20: (95) exit
# bpftool map show id 1
1: prog_array flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-20 20:42:57 +08:00
|
|
|
* anyway later on, so do not let the compiler omit it. This also needs
|
|
|
|
* to go into kallsyms for correlation from e.g. bpftool, so naming
|
|
|
|
* must not change.
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
|
|
|
noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2015-07-21 11:34:19 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__bpf_call_base);
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2018-01-27 06:33:38 +08:00
|
|
|
/* All UAPI available opcodes. */
|
|
|
|
#define BPF_INSN_MAP(INSN_2, INSN_3) \
|
|
|
|
/* 32 bit ALU operations. */ \
|
|
|
|
/* Register based. */ \
|
2018-12-06 02:52:34 +08:00
|
|
|
INSN_3(ALU, ADD, X), \
|
|
|
|
INSN_3(ALU, SUB, X), \
|
|
|
|
INSN_3(ALU, AND, X), \
|
|
|
|
INSN_3(ALU, OR, X), \
|
|
|
|
INSN_3(ALU, LSH, X), \
|
|
|
|
INSN_3(ALU, RSH, X), \
|
|
|
|
INSN_3(ALU, XOR, X), \
|
|
|
|
INSN_3(ALU, MUL, X), \
|
|
|
|
INSN_3(ALU, MOV, X), \
|
|
|
|
INSN_3(ALU, ARSH, X), \
|
|
|
|
INSN_3(ALU, DIV, X), \
|
|
|
|
INSN_3(ALU, MOD, X), \
|
2018-01-27 06:33:38 +08:00
|
|
|
INSN_2(ALU, NEG), \
|
|
|
|
INSN_3(ALU, END, TO_BE), \
|
|
|
|
INSN_3(ALU, END, TO_LE), \
|
|
|
|
/* Immediate based. */ \
|
2018-12-06 02:52:34 +08:00
|
|
|
INSN_3(ALU, ADD, K), \
|
|
|
|
INSN_3(ALU, SUB, K), \
|
|
|
|
INSN_3(ALU, AND, K), \
|
|
|
|
INSN_3(ALU, OR, K), \
|
|
|
|
INSN_3(ALU, LSH, K), \
|
|
|
|
INSN_3(ALU, RSH, K), \
|
|
|
|
INSN_3(ALU, XOR, K), \
|
|
|
|
INSN_3(ALU, MUL, K), \
|
|
|
|
INSN_3(ALU, MOV, K), \
|
|
|
|
INSN_3(ALU, ARSH, K), \
|
|
|
|
INSN_3(ALU, DIV, K), \
|
|
|
|
INSN_3(ALU, MOD, K), \
|
2018-01-27 06:33:38 +08:00
|
|
|
/* 64 bit ALU operations. */ \
|
|
|
|
/* Register based. */ \
|
|
|
|
INSN_3(ALU64, ADD, X), \
|
|
|
|
INSN_3(ALU64, SUB, X), \
|
|
|
|
INSN_3(ALU64, AND, X), \
|
|
|
|
INSN_3(ALU64, OR, X), \
|
|
|
|
INSN_3(ALU64, LSH, X), \
|
|
|
|
INSN_3(ALU64, RSH, X), \
|
|
|
|
INSN_3(ALU64, XOR, X), \
|
|
|
|
INSN_3(ALU64, MUL, X), \
|
|
|
|
INSN_3(ALU64, MOV, X), \
|
|
|
|
INSN_3(ALU64, ARSH, X), \
|
|
|
|
INSN_3(ALU64, DIV, X), \
|
|
|
|
INSN_3(ALU64, MOD, X), \
|
|
|
|
INSN_2(ALU64, NEG), \
|
|
|
|
/* Immediate based. */ \
|
|
|
|
INSN_3(ALU64, ADD, K), \
|
|
|
|
INSN_3(ALU64, SUB, K), \
|
|
|
|
INSN_3(ALU64, AND, K), \
|
|
|
|
INSN_3(ALU64, OR, K), \
|
|
|
|
INSN_3(ALU64, LSH, K), \
|
|
|
|
INSN_3(ALU64, RSH, K), \
|
|
|
|
INSN_3(ALU64, XOR, K), \
|
|
|
|
INSN_3(ALU64, MUL, K), \
|
|
|
|
INSN_3(ALU64, MOV, K), \
|
|
|
|
INSN_3(ALU64, ARSH, K), \
|
|
|
|
INSN_3(ALU64, DIV, K), \
|
|
|
|
INSN_3(ALU64, MOD, K), \
|
|
|
|
/* Call instruction. */ \
|
|
|
|
INSN_2(JMP, CALL), \
|
|
|
|
/* Exit instruction. */ \
|
|
|
|
INSN_2(JMP, EXIT), \
|
|
|
|
/* Jump instructions. */ \
|
|
|
|
/* Register based. */ \
|
|
|
|
INSN_3(JMP, JEQ, X), \
|
|
|
|
INSN_3(JMP, JNE, X), \
|
|
|
|
INSN_3(JMP, JGT, X), \
|
|
|
|
INSN_3(JMP, JLT, X), \
|
|
|
|
INSN_3(JMP, JGE, X), \
|
|
|
|
INSN_3(JMP, JLE, X), \
|
|
|
|
INSN_3(JMP, JSGT, X), \
|
|
|
|
INSN_3(JMP, JSLT, X), \
|
|
|
|
INSN_3(JMP, JSGE, X), \
|
|
|
|
INSN_3(JMP, JSLE, X), \
|
|
|
|
INSN_3(JMP, JSET, X), \
|
|
|
|
/* Immediate based. */ \
|
|
|
|
INSN_3(JMP, JEQ, K), \
|
|
|
|
INSN_3(JMP, JNE, K), \
|
|
|
|
INSN_3(JMP, JGT, K), \
|
|
|
|
INSN_3(JMP, JLT, K), \
|
|
|
|
INSN_3(JMP, JGE, K), \
|
|
|
|
INSN_3(JMP, JLE, K), \
|
|
|
|
INSN_3(JMP, JSGT, K), \
|
|
|
|
INSN_3(JMP, JSLT, K), \
|
|
|
|
INSN_3(JMP, JSGE, K), \
|
|
|
|
INSN_3(JMP, JSLE, K), \
|
|
|
|
INSN_3(JMP, JSET, K), \
|
|
|
|
INSN_2(JMP, JA), \
|
|
|
|
/* Store instructions. */ \
|
|
|
|
/* Register based. */ \
|
|
|
|
INSN_3(STX, MEM, B), \
|
|
|
|
INSN_3(STX, MEM, H), \
|
|
|
|
INSN_3(STX, MEM, W), \
|
|
|
|
INSN_3(STX, MEM, DW), \
|
|
|
|
INSN_3(STX, XADD, W), \
|
|
|
|
INSN_3(STX, XADD, DW), \
|
|
|
|
/* Immediate based. */ \
|
|
|
|
INSN_3(ST, MEM, B), \
|
|
|
|
INSN_3(ST, MEM, H), \
|
|
|
|
INSN_3(ST, MEM, W), \
|
|
|
|
INSN_3(ST, MEM, DW), \
|
|
|
|
/* Load instructions. */ \
|
|
|
|
/* Register based. */ \
|
|
|
|
INSN_3(LDX, MEM, B), \
|
|
|
|
INSN_3(LDX, MEM, H), \
|
|
|
|
INSN_3(LDX, MEM, W), \
|
|
|
|
INSN_3(LDX, MEM, DW), \
|
|
|
|
/* Immediate based. */ \
|
bpf: implement ld_abs/ld_ind in native bpf
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:
* fdfaf64e7539 ("x86: bpf_jit: support negative offsets")
* 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets")
* e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
* 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
* 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context")
* 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context")
For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.
The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:
test_bpf tcpdump port 22 tcpdump complex
x64 - before 15 21 10 14 19 18
- after 7 10 10 7 10 15
arm64 - before 40 91 92 40 91 151
- after 51 64 73 51 62 113
For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-04 07:08:14 +08:00
|
|
|
INSN_3(LD, IMM, DW)
|
2018-01-27 06:33:38 +08:00
|
|
|
|
|
|
|
bool bpf_opcode_in_insntable(u8 code)
|
|
|
|
{
|
|
|
|
#define BPF_INSN_2_TBL(x, y) [BPF_##x | BPF_##y] = true
|
|
|
|
#define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
|
|
|
|
static const bool public_insntable[256] = {
|
|
|
|
[0 ... 255] = false,
|
|
|
|
/* Now overwrite non-defaults ... */
|
|
|
|
BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
|
bpf: implement ld_abs/ld_ind in native bpf
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:
* fdfaf64e7539 ("x86: bpf_jit: support negative offsets")
* 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets")
* e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
* 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
* 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context")
* 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context")
For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.
The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:
test_bpf tcpdump port 22 tcpdump complex
x64 - before 15 21 10 14 19 18
- after 7 10 10 7 10 15
arm64 - before 40 91 92 40 91 151
- after 51 64 73 51 62 113
For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-04 07:08:14 +08:00
|
|
|
/* UAPI exposed, but rewritten opcodes. cBPF carry-over. */
|
|
|
|
[BPF_LD | BPF_ABS | BPF_B] = true,
|
|
|
|
[BPF_LD | BPF_ABS | BPF_H] = true,
|
|
|
|
[BPF_LD | BPF_ABS | BPF_W] = true,
|
|
|
|
[BPF_LD | BPF_IND | BPF_B] = true,
|
|
|
|
[BPF_LD | BPF_IND | BPF_H] = true,
|
|
|
|
[BPF_LD | BPF_IND | BPF_W] = true,
|
2018-01-27 06:33:38 +08:00
|
|
|
};
|
|
|
|
#undef BPF_INSN_3_TBL
|
|
|
|
#undef BPF_INSN_2_TBL
|
|
|
|
return public_insntable[code];
|
|
|
|
}
|
|
|
|
|
bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 02:04:29 +08:00
|
|
|
#ifndef CONFIG_BPF_JIT_ALWAYS_ON
|
2014-07-23 14:01:58 +08:00
|
|
|
/**
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
* __bpf_prog_run - run eBPF program on a given context
|
|
|
|
* @ctx: is the data we are operating on
|
|
|
|
* @insn: is the array of eBPF instructions
|
2014-07-23 14:01:58 +08:00
|
|
|
*
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
* Decode and execute eBPF instructions.
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
2017-12-15 09:55:13 +08:00
|
|
|
static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
|
2014-07-23 14:01:58 +08:00
|
|
|
{
|
2017-05-31 04:31:28 +08:00
|
|
|
u64 tmp;
|
2018-01-27 06:33:38 +08:00
|
|
|
#define BPF_INSN_2_LBL(x, y) [BPF_##x | BPF_##y] = &&x##_##y
|
|
|
|
#define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
|
2014-07-23 14:01:58 +08:00
|
|
|
static const void *jumptable[256] = {
|
|
|
|
[0 ... 255] = &&default_label,
|
|
|
|
/* Now overwrite non-defaults ... */
|
2018-01-27 06:33:38 +08:00
|
|
|
BPF_INSN_MAP(BPF_INSN_2_LBL, BPF_INSN_3_LBL),
|
|
|
|
/* Non-UAPI available opcodes. */
|
2017-12-15 09:55:13 +08:00
|
|
|
[BPF_JMP | BPF_CALL_ARGS] = &&JMP_CALL_ARGS,
|
2017-05-31 04:31:27 +08:00
|
|
|
[BPF_JMP | BPF_TAIL_CALL] = &&JMP_TAIL_CALL,
|
2014-07-23 14:01:58 +08:00
|
|
|
};
|
2018-01-27 06:33:38 +08:00
|
|
|
#undef BPF_INSN_3_LBL
|
|
|
|
#undef BPF_INSN_2_LBL
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
u32 tail_call_cnt = 0;
|
2014-07-23 14:01:58 +08:00
|
|
|
|
|
|
|
#define CONT ({ insn++; goto select_insn; })
|
|
|
|
#define CONT_JMP ({ insn++; goto select_insn; })
|
|
|
|
|
|
|
|
select_insn:
|
|
|
|
goto *jumptable[insn->code];
|
|
|
|
|
|
|
|
/* ALU */
|
|
|
|
#define ALU(OPCODE, OP) \
|
|
|
|
ALU64_##OPCODE##_X: \
|
|
|
|
DST = DST OP SRC; \
|
|
|
|
CONT; \
|
|
|
|
ALU_##OPCODE##_X: \
|
|
|
|
DST = (u32) DST OP (u32) SRC; \
|
|
|
|
CONT; \
|
|
|
|
ALU64_##OPCODE##_K: \
|
|
|
|
DST = DST OP IMM; \
|
|
|
|
CONT; \
|
|
|
|
ALU_##OPCODE##_K: \
|
|
|
|
DST = (u32) DST OP (u32) IMM; \
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
ALU(ADD, +)
|
|
|
|
ALU(SUB, -)
|
|
|
|
ALU(AND, &)
|
|
|
|
ALU(OR, |)
|
|
|
|
ALU(LSH, <<)
|
|
|
|
ALU(RSH, >>)
|
|
|
|
ALU(XOR, ^)
|
|
|
|
ALU(MUL, *)
|
|
|
|
#undef ALU
|
|
|
|
ALU_NEG:
|
|
|
|
DST = (u32) -DST;
|
|
|
|
CONT;
|
|
|
|
ALU64_NEG:
|
|
|
|
DST = -DST;
|
|
|
|
CONT;
|
|
|
|
ALU_MOV_X:
|
|
|
|
DST = (u32) SRC;
|
|
|
|
CONT;
|
|
|
|
ALU_MOV_K:
|
|
|
|
DST = (u32) IMM;
|
|
|
|
CONT;
|
|
|
|
ALU64_MOV_X:
|
|
|
|
DST = SRC;
|
|
|
|
CONT;
|
|
|
|
ALU64_MOV_K:
|
|
|
|
DST = IMM;
|
|
|
|
CONT;
|
net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.
x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn
Note that old eBPF programs are binary compatible with new interpreter.
It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)
User space generated programs will use this instruction to load constants only.
To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 13:17:17 +08:00
|
|
|
LD_IMM_DW:
|
|
|
|
DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
|
|
|
|
insn++;
|
|
|
|
CONT;
|
2018-12-06 02:52:34 +08:00
|
|
|
ALU_ARSH_X:
|
|
|
|
DST = (u64) (u32) ((*(s32 *) &DST) >> SRC);
|
|
|
|
CONT;
|
|
|
|
ALU_ARSH_K:
|
|
|
|
DST = (u64) (u32) ((*(s32 *) &DST) >> IMM);
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
ALU64_ARSH_X:
|
|
|
|
(*(s64 *) &DST) >>= SRC;
|
|
|
|
CONT;
|
|
|
|
ALU64_ARSH_K:
|
|
|
|
(*(s64 *) &DST) >>= IMM;
|
|
|
|
CONT;
|
|
|
|
ALU64_MOD_X:
|
2015-04-28 05:40:37 +08:00
|
|
|
div64_u64_rem(DST, SRC, &tmp);
|
|
|
|
DST = tmp;
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_MOD_X:
|
|
|
|
tmp = (u32) DST;
|
|
|
|
DST = do_div(tmp, (u32) SRC);
|
|
|
|
CONT;
|
|
|
|
ALU64_MOD_K:
|
2015-04-28 05:40:37 +08:00
|
|
|
div64_u64_rem(DST, IMM, &tmp);
|
|
|
|
DST = tmp;
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_MOD_K:
|
|
|
|
tmp = (u32) DST;
|
|
|
|
DST = do_div(tmp, (u32) IMM);
|
|
|
|
CONT;
|
|
|
|
ALU64_DIV_X:
|
2015-04-28 05:40:37 +08:00
|
|
|
DST = div64_u64(DST, SRC);
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_DIV_X:
|
|
|
|
tmp = (u32) DST;
|
|
|
|
do_div(tmp, (u32) SRC);
|
|
|
|
DST = (u32) tmp;
|
|
|
|
CONT;
|
|
|
|
ALU64_DIV_K:
|
2015-04-28 05:40:37 +08:00
|
|
|
DST = div64_u64(DST, IMM);
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_DIV_K:
|
|
|
|
tmp = (u32) DST;
|
|
|
|
do_div(tmp, (u32) IMM);
|
|
|
|
DST = (u32) tmp;
|
|
|
|
CONT;
|
|
|
|
ALU_END_TO_BE:
|
|
|
|
switch (IMM) {
|
|
|
|
case 16:
|
|
|
|
DST = (__force u16) cpu_to_be16(DST);
|
|
|
|
break;
|
|
|
|
case 32:
|
|
|
|
DST = (__force u32) cpu_to_be32(DST);
|
|
|
|
break;
|
|
|
|
case 64:
|
|
|
|
DST = (__force u64) cpu_to_be64(DST);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
ALU_END_TO_LE:
|
|
|
|
switch (IMM) {
|
|
|
|
case 16:
|
|
|
|
DST = (__force u16) cpu_to_le16(DST);
|
|
|
|
break;
|
|
|
|
case 32:
|
|
|
|
DST = (__force u32) cpu_to_le32(DST);
|
|
|
|
break;
|
|
|
|
case 64:
|
|
|
|
DST = (__force u64) cpu_to_le64(DST);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
/* CALL */
|
|
|
|
JMP_CALL:
|
|
|
|
/* Function call scratches BPF_R1-BPF_R5 registers,
|
|
|
|
* preserves BPF_R6-BPF_R9, and stores return value
|
|
|
|
* into BPF_R0.
|
|
|
|
*/
|
|
|
|
BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
|
|
|
|
BPF_R4, BPF_R5);
|
|
|
|
CONT;
|
|
|
|
|
2017-12-15 09:55:13 +08:00
|
|
|
JMP_CALL_ARGS:
|
|
|
|
BPF_R0 = (__bpf_call_base_args + insn->imm)(BPF_R1, BPF_R2,
|
|
|
|
BPF_R3, BPF_R4,
|
|
|
|
BPF_R5,
|
|
|
|
insn + insn->off + 1);
|
|
|
|
CONT;
|
|
|
|
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
JMP_TAIL_CALL: {
|
|
|
|
struct bpf_map *map = (struct bpf_map *) (unsigned long) BPF_R2;
|
|
|
|
struct bpf_array *array = container_of(map, struct bpf_array, map);
|
|
|
|
struct bpf_prog *prog;
|
2017-10-04 06:37:20 +08:00
|
|
|
u32 index = BPF_R3;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
|
|
|
|
if (unlikely(index >= array->map.max_entries))
|
|
|
|
goto out;
|
|
|
|
if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
tail_call_cnt++;
|
|
|
|
|
2015-08-06 15:02:33 +08:00
|
|
|
prog = READ_ONCE(array->ptrs[index]);
|
2016-06-28 18:18:23 +08:00
|
|
|
if (!prog)
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
goto out;
|
|
|
|
|
2015-07-14 02:49:32 +08:00
|
|
|
/* ARG1 at this point is guaranteed to point to CTX from
|
|
|
|
* the verifier side due to the fact that the tail call is
|
|
|
|
* handeled like a helper, that is, bpf_tail_call_proto,
|
|
|
|
* where arg1_type is ARG_PTR_TO_CTX.
|
|
|
|
*/
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
insn = prog->insnsi;
|
|
|
|
goto select_insn;
|
|
|
|
out:
|
|
|
|
CONT;
|
|
|
|
}
|
2014-07-23 14:01:58 +08:00
|
|
|
/* JMP */
|
|
|
|
JMP_JA:
|
|
|
|
insn += insn->off;
|
|
|
|
CONT;
|
|
|
|
JMP_JEQ_X:
|
|
|
|
if (DST == SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JEQ_K:
|
|
|
|
if (DST == IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JNE_X:
|
|
|
|
if (DST != SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JNE_K:
|
|
|
|
if (DST != IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JGT_X:
|
|
|
|
if (DST > SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JGT_K:
|
|
|
|
if (DST > IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JLT_X:
|
|
|
|
if (DST < SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JLT_K:
|
|
|
|
if (DST < IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JGE_X:
|
|
|
|
if (DST >= SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JGE_K:
|
|
|
|
if (DST >= IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JLE_X:
|
|
|
|
if (DST <= SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JLE_K:
|
|
|
|
if (DST <= IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JSGT_X:
|
|
|
|
if (((s64) DST) > ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSGT_K:
|
|
|
|
if (((s64) DST) > ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JSLT_X:
|
|
|
|
if (((s64) DST) < ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSLT_K:
|
|
|
|
if (((s64) DST) < ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JSGE_X:
|
|
|
|
if (((s64) DST) >= ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSGE_K:
|
|
|
|
if (((s64) DST) >= ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JSLE_X:
|
|
|
|
if (((s64) DST) <= ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSLE_K:
|
|
|
|
if (((s64) DST) <= ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JSET_X:
|
|
|
|
if (DST & SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSET_K:
|
|
|
|
if (DST & IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_EXIT:
|
|
|
|
return BPF_R0;
|
|
|
|
|
|
|
|
/* STX and ST and LDX*/
|
|
|
|
#define LDST(SIZEOP, SIZE) \
|
|
|
|
STX_MEM_##SIZEOP: \
|
|
|
|
*(SIZE *)(unsigned long) (DST + insn->off) = SRC; \
|
|
|
|
CONT; \
|
|
|
|
ST_MEM_##SIZEOP: \
|
|
|
|
*(SIZE *)(unsigned long) (DST + insn->off) = IMM; \
|
|
|
|
CONT; \
|
|
|
|
LDX_MEM_##SIZEOP: \
|
|
|
|
DST = *(SIZE *)(unsigned long) (SRC + insn->off); \
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
LDST(B, u8)
|
|
|
|
LDST(H, u16)
|
|
|
|
LDST(W, u32)
|
|
|
|
LDST(DW, u64)
|
|
|
|
#undef LDST
|
|
|
|
STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
|
|
|
|
atomic_add((u32) SRC, (atomic_t *)(unsigned long)
|
|
|
|
(DST + insn->off));
|
|
|
|
CONT;
|
|
|
|
STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
|
|
|
|
atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
|
|
|
|
(DST + insn->off));
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
default_label:
|
2018-01-27 06:33:38 +08:00
|
|
|
/* If we ever reach this, we have a bug somewhere. Die hard here
|
|
|
|
* instead of just returning 0; we could be somewhere in a subprog,
|
|
|
|
* so execution could continue otherwise which we do /not/ want.
|
|
|
|
*
|
|
|
|
* Note, verifier whitelists all opcodes in bpf_opcode_in_insntable().
|
|
|
|
*/
|
|
|
|
pr_warn("BPF interpreter: unknown opcode %02x\n", insn->code);
|
|
|
|
BUG_ON(1);
|
2014-07-23 14:01:58 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2017-05-31 04:31:28 +08:00
|
|
|
STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
|
|
|
|
|
2017-05-31 04:31:33 +08:00
|
|
|
#define PROG_NAME(stack_size) __bpf_prog_run##stack_size
|
|
|
|
#define DEFINE_BPF_PROG_RUN(stack_size) \
|
|
|
|
static unsigned int PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \
|
|
|
|
{ \
|
|
|
|
u64 stack[stack_size / sizeof(u64)]; \
|
|
|
|
u64 regs[MAX_BPF_REG]; \
|
|
|
|
\
|
|
|
|
FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \
|
|
|
|
ARG1 = (u64) (unsigned long) ctx; \
|
|
|
|
return ___bpf_prog_run(regs, insn, stack); \
|
2017-05-31 04:31:28 +08:00
|
|
|
}
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2017-12-15 09:55:13 +08:00
|
|
|
#define PROG_NAME_ARGS(stack_size) __bpf_prog_run_args##stack_size
|
|
|
|
#define DEFINE_BPF_PROG_RUN_ARGS(stack_size) \
|
|
|
|
static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \
|
|
|
|
const struct bpf_insn *insn) \
|
|
|
|
{ \
|
|
|
|
u64 stack[stack_size / sizeof(u64)]; \
|
|
|
|
u64 regs[MAX_BPF_REG]; \
|
|
|
|
\
|
|
|
|
FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \
|
|
|
|
BPF_R1 = r1; \
|
|
|
|
BPF_R2 = r2; \
|
|
|
|
BPF_R3 = r3; \
|
|
|
|
BPF_R4 = r4; \
|
|
|
|
BPF_R5 = r5; \
|
|
|
|
return ___bpf_prog_run(regs, insn, stack); \
|
|
|
|
}
|
|
|
|
|
2017-05-31 04:31:33 +08:00
|
|
|
#define EVAL1(FN, X) FN(X)
|
|
|
|
#define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
|
|
|
|
#define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
|
|
|
|
#define EVAL4(FN, X, Y...) FN(X) EVAL3(FN, Y)
|
|
|
|
#define EVAL5(FN, X, Y...) FN(X) EVAL4(FN, Y)
|
|
|
|
#define EVAL6(FN, X, Y...) FN(X) EVAL5(FN, Y)
|
|
|
|
|
|
|
|
EVAL6(DEFINE_BPF_PROG_RUN, 32, 64, 96, 128, 160, 192);
|
|
|
|
EVAL6(DEFINE_BPF_PROG_RUN, 224, 256, 288, 320, 352, 384);
|
|
|
|
EVAL4(DEFINE_BPF_PROG_RUN, 416, 448, 480, 512);
|
|
|
|
|
2017-12-15 09:55:13 +08:00
|
|
|
EVAL6(DEFINE_BPF_PROG_RUN_ARGS, 32, 64, 96, 128, 160, 192);
|
|
|
|
EVAL6(DEFINE_BPF_PROG_RUN_ARGS, 224, 256, 288, 320, 352, 384);
|
|
|
|
EVAL4(DEFINE_BPF_PROG_RUN_ARGS, 416, 448, 480, 512);
|
|
|
|
|
2017-05-31 04:31:33 +08:00
|
|
|
#define PROG_NAME_LIST(stack_size) PROG_NAME(stack_size),
|
|
|
|
|
|
|
|
static unsigned int (*interpreters[])(const void *ctx,
|
|
|
|
const struct bpf_insn *insn) = {
|
|
|
|
EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
|
|
|
|
EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
|
|
|
|
EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
|
|
|
|
};
|
2017-12-15 09:55:13 +08:00
|
|
|
#undef PROG_NAME_LIST
|
|
|
|
#define PROG_NAME_LIST(stack_size) PROG_NAME_ARGS(stack_size),
|
|
|
|
static u64 (*interpreters_args[])(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5,
|
|
|
|
const struct bpf_insn *insn) = {
|
|
|
|
EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
|
|
|
|
EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
|
|
|
|
EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
|
|
|
|
};
|
|
|
|
#undef PROG_NAME_LIST
|
|
|
|
|
|
|
|
void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth)
|
|
|
|
{
|
|
|
|
stack_depth = max_t(u32, stack_depth, 1);
|
|
|
|
insn->off = (s16) insn->imm;
|
|
|
|
insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] -
|
|
|
|
__bpf_call_base_args;
|
|
|
|
insn->code = BPF_JMP | BPF_CALL_ARGS;
|
|
|
|
}
|
2017-05-31 04:31:33 +08:00
|
|
|
|
bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 02:04:29 +08:00
|
|
|
#else
|
2018-01-20 08:24:33 +08:00
|
|
|
static unsigned int __bpf_prog_ret0_warn(const void *ctx,
|
|
|
|
const struct bpf_insn *insn)
|
bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 02:04:29 +08:00
|
|
|
{
|
2018-01-20 08:24:33 +08:00
|
|
|
/* If this handler ever gets executed, then BPF_JIT_ALWAYS_ON
|
|
|
|
* is not working properly, so warn about it!
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(1);
|
bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 02:04:29 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
bool bpf_prog_array_compatible(struct bpf_array *array,
|
|
|
|
const struct bpf_prog *fp)
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
{
|
2017-12-12 00:36:48 +08:00
|
|
|
if (fp->kprobe_override)
|
|
|
|
return false;
|
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
if (!array->owner_prog_type) {
|
|
|
|
/* There's no owner yet where we could check for
|
|
|
|
* compatibility.
|
|
|
|
*/
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
array->owner_prog_type = fp->type;
|
|
|
|
array->owner_jited = fp->jited;
|
2015-05-30 05:23:07 +08:00
|
|
|
|
|
|
|
return true;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
}
|
2015-05-30 05:23:07 +08:00
|
|
|
|
|
|
|
return array->owner_prog_type == fp->type &&
|
|
|
|
array->owner_jited == fp->jited;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
}
|
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
static int bpf_check_tail_call(const struct bpf_prog *fp)
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
{
|
|
|
|
struct bpf_prog_aux *aux = fp->aux;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < aux->used_map_cnt; i++) {
|
2015-05-30 05:23:07 +08:00
|
|
|
struct bpf_map *map = aux->used_maps[i];
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
struct bpf_array *array;
|
|
|
|
|
|
|
|
if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
|
|
|
|
continue;
|
2015-05-30 05:23:07 +08:00
|
|
|
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
array = container_of(map, struct bpf_array, map);
|
|
|
|
if (!bpf_prog_array_compatible(array, fp))
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
bpf: reject any prog that failed read-only lock
We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.
Added via 65869a47f348 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().
Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.
Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-15 08:30:48 +08:00
|
|
|
static void bpf_prog_select_func(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
#ifndef CONFIG_BPF_JIT_ALWAYS_ON
|
|
|
|
u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
|
|
|
|
|
|
|
|
fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
|
|
|
|
#else
|
|
|
|
fp->bpf_func = __bpf_prog_ret0_warn;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
/**
|
2015-05-30 05:23:07 +08:00
|
|
|
* bpf_prog_select_runtime - select exec runtime for BPF program
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
* @fp: bpf_prog populated with internal BPF program
|
2016-05-14 01:08:31 +08:00
|
|
|
* @err: pointer to error variable
|
2014-07-23 14:01:58 +08:00
|
|
|
*
|
2015-05-30 05:23:07 +08:00
|
|
|
* Try to JIT eBPF program, if JIT is not available, use interpreter.
|
|
|
|
* The BPF program will be executed via BPF_PROG_RUN() macro.
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
2016-05-14 01:08:31 +08:00
|
|
|
struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
|
2014-07-23 14:01:58 +08:00
|
|
|
{
|
bpf: reject any prog that failed read-only lock
We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.
Added via 65869a47f348 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().
Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.
Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-15 08:30:48 +08:00
|
|
|
/* In case of BPF to BPF calls, verifier did all the prep
|
|
|
|
* work with regards to JITing, etc.
|
|
|
|
*/
|
|
|
|
if (fp->bpf_func)
|
|
|
|
goto finalize;
|
2017-06-29 01:41:24 +08:00
|
|
|
|
bpf: reject any prog that failed read-only lock
We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.
Added via 65869a47f348 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().
Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.
Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-15 08:30:48 +08:00
|
|
|
bpf_prog_select_func(fp);
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2016-05-14 01:08:31 +08:00
|
|
|
/* eBPF JITs can rewrite the program in case constant
|
|
|
|
* blinding is active. However, in case of error during
|
|
|
|
* blinding, bpf_int_jit_compile() must always return a
|
|
|
|
* valid program, which in this case would simply not
|
|
|
|
* be JITed, but falls back to the interpreter.
|
|
|
|
*/
|
2017-11-04 04:56:17 +08:00
|
|
|
if (!bpf_prog_is_dev_bound(fp->aux)) {
|
2018-12-08 08:42:25 +08:00
|
|
|
*err = bpf_prog_alloc_jited_linfo(fp);
|
|
|
|
if (*err)
|
|
|
|
return fp;
|
|
|
|
|
2017-11-04 04:56:17 +08:00
|
|
|
fp = bpf_int_jit_compile(fp);
|
bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 02:04:29 +08:00
|
|
|
if (!fp->jited) {
|
2018-12-08 08:42:25 +08:00
|
|
|
bpf_prog_free_jited_linfo(fp);
|
|
|
|
#ifdef CONFIG_BPF_JIT_ALWAYS_ON
|
bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 02:04:29 +08:00
|
|
|
*err = -ENOTSUPP;
|
|
|
|
return fp;
|
|
|
|
#endif
|
2018-12-08 08:42:25 +08:00
|
|
|
} else {
|
|
|
|
bpf_prog_free_unused_jited_linfo(fp);
|
|
|
|
}
|
2017-11-04 04:56:17 +08:00
|
|
|
} else {
|
|
|
|
*err = bpf_prog_offload_compile(fp);
|
|
|
|
if (*err)
|
|
|
|
return fp;
|
|
|
|
}
|
bpf: reject any prog that failed read-only lock
We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.
Added via 65869a47f348 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().
Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.
Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-15 08:30:48 +08:00
|
|
|
|
|
|
|
finalize:
|
2014-09-03 04:53:44 +08:00
|
|
|
bpf_prog_lock_ro(fp);
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
/* The tail call compatibility check can only be done at
|
|
|
|
* this late stage as we need to determine, if we deal
|
|
|
|
* with JITed or non JITed program concatenations and not
|
|
|
|
* all eBPF JITs might immediately support all features.
|
|
|
|
*/
|
2016-05-14 01:08:31 +08:00
|
|
|
*err = bpf_check_tail_call(fp);
|
bpf: undo prog rejection on read-only lock failure
Partially undo commit 9facc336876f ("bpf: reject any prog that failed
read-only lock") since it caused a regression, that is, syzkaller was
able to manage to cause a panic via fault injection deep in set_memory_ro()
path by letting an allocation fail: In x86's __change_page_attr_set_clr()
it was able to change the attributes of the primary mapping but not in
the alias mapping via cpa_process_alias(), so the second, inner call
to the __change_page_attr() via __change_page_attr_set_clr() had to split
a larger page and failed in the alloc_pages() with the artifically triggered
allocation error which is then propagated down to the call site.
Thus, for set_memory_ro() this means that it returned with an error, but
from debugging a probe_kernel_write() revealed EFAULT on that memory since
the primary mapping succeeded to get changed. Therefore the subsequent
hdr->locked = 0 reset triggered the panic as it was performed on read-only
memory, so call-site assumptions were infact wrong to assume that it would
either succeed /or/ not succeed at all since there's no such rollback in
set_memory_*() calls from partial change of mappings, in other words, we're
left in a state that is "half done". A later undo via set_memory_rw() is
succeeding though due to matching permissions on that part (aka due to the
try_preserve_large_page() succeeding). While reproducing locally with
explicitly triggering this error, the initial splitting only happens on
rare occasions and in real world it would additionally need oom conditions,
but that said, it could partially fail. Therefore, it is definitely wrong
to bail out on set_memory_ro() error and reject the program with the
set_memory_*() semantics we have today. Shouldn't have gone the extra mile
since no other user in tree today infact checks for any set_memory_*()
errors, e.g. neither module_enable_ro() / module_disable_ro() for module
RO/NX handling which is mostly default these days nor kprobes core with
alloc_insn_page() / free_insn_page() as examples that could be invoked long
after bootup and original 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") did neither when it got first introduced to BPF
so "improving" with bailing out was clearly not right when set_memory_*()
cannot handle it today.
Kees suggested that if set_memory_*() can fail, we should annotate it with
__must_check, and all callers need to deal with it gracefully given those
set_memory_*() markings aren't "advisory", but they're expected to actually
do what they say. This might be an option worth to move forward in future
but would at the same time require that set_memory_*() calls from supporting
archs are guaranteed to be "atomic" in that they provide rollback if part
of the range fails, once that happened, the transition from RW -> RO could
be made more robust that way, while subsequent RO -> RW transition /must/
continue guaranteeing to always succeed the undo part.
Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-29 05:34:59 +08:00
|
|
|
|
2016-05-14 01:08:31 +08:00
|
|
|
return fp;
|
2014-07-23 14:01:58 +08:00
|
|
|
}
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2017-10-24 14:53:08 +08:00
|
|
|
static unsigned int __bpf_prog_ret1(const void *ctx,
|
|
|
|
const struct bpf_insn *insn)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct bpf_prog_dummy {
|
|
|
|
struct bpf_prog prog;
|
|
|
|
} dummy_bpf_prog = {
|
|
|
|
.prog = {
|
|
|
|
.bpf_func = __bpf_prog_ret1,
|
|
|
|
},
|
|
|
|
};
|
|
|
|
|
2017-10-03 13:50:21 +08:00
|
|
|
/* to avoid allocating empty bpf_prog_array for cgroups that
|
|
|
|
* don't have bpf program attached use one global 'empty_prog_array'
|
|
|
|
* It will not be modified the caller of bpf_prog_array_alloc()
|
|
|
|
* (since caller requested prog_cnt == 0)
|
|
|
|
* that pointer should be 'freed' by bpf_prog_array_free()
|
|
|
|
*/
|
|
|
|
static struct {
|
|
|
|
struct bpf_prog_array hdr;
|
|
|
|
struct bpf_prog *null_prog;
|
|
|
|
} empty_prog_array = {
|
|
|
|
.null_prog = NULL,
|
|
|
|
};
|
|
|
|
|
2018-07-14 03:41:10 +08:00
|
|
|
struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags)
|
2017-10-03 13:50:21 +08:00
|
|
|
{
|
|
|
|
if (prog_cnt)
|
|
|
|
return kzalloc(sizeof(struct bpf_prog_array) +
|
2018-08-03 05:27:21 +08:00
|
|
|
sizeof(struct bpf_prog_array_item) *
|
|
|
|
(prog_cnt + 1),
|
2017-10-03 13:50:21 +08:00
|
|
|
flags);
|
|
|
|
|
|
|
|
return &empty_prog_array.hdr;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_array_free(struct bpf_prog_array __rcu *progs)
|
|
|
|
{
|
|
|
|
if (!progs ||
|
|
|
|
progs == (struct bpf_prog_array __rcu *)&empty_prog_array.hdr)
|
|
|
|
return;
|
|
|
|
kfree_rcu(progs, rcu);
|
|
|
|
}
|
|
|
|
|
2018-08-03 05:27:21 +08:00
|
|
|
int bpf_prog_array_length(struct bpf_prog_array __rcu *array)
|
2017-10-03 13:50:22 +08:00
|
|
|
{
|
2018-08-03 05:27:21 +08:00
|
|
|
struct bpf_prog_array_item *item;
|
2017-10-03 13:50:22 +08:00
|
|
|
u32 cnt = 0;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-08-03 05:27:21 +08:00
|
|
|
item = rcu_dereference(array)->items;
|
|
|
|
for (; item->prog; item++)
|
|
|
|
if (item->prog != &dummy_bpf_prog.prog)
|
2017-12-01 05:47:54 +08:00
|
|
|
cnt++;
|
2017-10-03 13:50:22 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
return cnt;
|
|
|
|
}
|
|
|
|
|
2018-08-03 05:27:21 +08:00
|
|
|
|
|
|
|
static bool bpf_prog_array_copy_core(struct bpf_prog_array __rcu *array,
|
2018-04-11 00:37:32 +08:00
|
|
|
u32 *prog_ids,
|
|
|
|
u32 request_cnt)
|
|
|
|
{
|
2018-08-03 05:27:21 +08:00
|
|
|
struct bpf_prog_array_item *item;
|
2018-04-11 00:37:32 +08:00
|
|
|
int i = 0;
|
|
|
|
|
2018-08-15 02:01:12 +08:00
|
|
|
item = rcu_dereference_check(array, 1)->items;
|
2018-08-03 05:27:21 +08:00
|
|
|
for (; item->prog; item++) {
|
|
|
|
if (item->prog == &dummy_bpf_prog.prog)
|
2018-04-11 00:37:32 +08:00
|
|
|
continue;
|
2018-08-03 05:27:21 +08:00
|
|
|
prog_ids[i] = item->prog->aux->id;
|
2018-04-11 00:37:32 +08:00
|
|
|
if (++i == request_cnt) {
|
2018-08-03 05:27:21 +08:00
|
|
|
item++;
|
2018-04-11 00:37:32 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-08-03 05:27:21 +08:00
|
|
|
return !!(item->prog);
|
2018-04-11 00:37:32 +08:00
|
|
|
}
|
|
|
|
|
2018-08-03 05:27:21 +08:00
|
|
|
int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *array,
|
2017-10-03 13:50:22 +08:00
|
|
|
__u32 __user *prog_ids, u32 cnt)
|
|
|
|
{
|
2018-02-03 07:14:05 +08:00
|
|
|
unsigned long err = 0;
|
|
|
|
bool nospc;
|
2018-04-11 00:37:32 +08:00
|
|
|
u32 *ids;
|
2018-02-03 07:14:05 +08:00
|
|
|
|
|
|
|
/* users of this function are doing:
|
|
|
|
* cnt = bpf_prog_array_length();
|
|
|
|
* if (cnt > 0)
|
|
|
|
* bpf_prog_array_copy_to_user(..., cnt);
|
|
|
|
* so below kcalloc doesn't need extra cnt > 0 check, but
|
|
|
|
* bpf_prog_array_length() releases rcu lock and
|
|
|
|
* prog array could have been swapped with empty or larger array,
|
|
|
|
* so always copy 'cnt' prog_ids to the user.
|
|
|
|
* In a rare race the user will see zero prog_ids
|
|
|
|
*/
|
2018-02-14 22:31:00 +08:00
|
|
|
ids = kcalloc(cnt, sizeof(u32), GFP_USER | __GFP_NOWARN);
|
2018-02-03 07:14:05 +08:00
|
|
|
if (!ids)
|
|
|
|
return -ENOMEM;
|
2017-10-03 13:50:22 +08:00
|
|
|
rcu_read_lock();
|
2018-08-03 05:27:21 +08:00
|
|
|
nospc = bpf_prog_array_copy_core(array, ids, cnt);
|
2017-10-03 13:50:22 +08:00
|
|
|
rcu_read_unlock();
|
2018-02-03 07:14:05 +08:00
|
|
|
err = copy_to_user(prog_ids, ids, cnt * sizeof(u32));
|
|
|
|
kfree(ids);
|
|
|
|
if (err)
|
|
|
|
return -EFAULT;
|
|
|
|
if (nospc)
|
2017-10-03 13:50:22 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-08-03 05:27:21 +08:00
|
|
|
void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *array,
|
2017-10-24 14:53:08 +08:00
|
|
|
struct bpf_prog *old_prog)
|
|
|
|
{
|
2018-08-03 05:27:21 +08:00
|
|
|
struct bpf_prog_array_item *item = array->items;
|
2017-10-24 14:53:08 +08:00
|
|
|
|
2018-08-03 05:27:21 +08:00
|
|
|
for (; item->prog; item++)
|
|
|
|
if (item->prog == old_prog) {
|
|
|
|
WRITE_ONCE(item->prog, &dummy_bpf_prog.prog);
|
2017-10-24 14:53:08 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
|
|
|
|
struct bpf_prog *exclude_prog,
|
|
|
|
struct bpf_prog *include_prog,
|
|
|
|
struct bpf_prog_array **new_array)
|
|
|
|
{
|
|
|
|
int new_prog_cnt, carry_prog_cnt = 0;
|
2018-08-03 05:27:21 +08:00
|
|
|
struct bpf_prog_array_item *existing;
|
2017-10-24 14:53:08 +08:00
|
|
|
struct bpf_prog_array *array;
|
2018-05-27 19:24:08 +08:00
|
|
|
bool found_exclude = false;
|
2017-10-24 14:53:08 +08:00
|
|
|
int new_prog_idx = 0;
|
|
|
|
|
|
|
|
/* Figure out how many existing progs we need to carry over to
|
|
|
|
* the new array.
|
|
|
|
*/
|
|
|
|
if (old_array) {
|
2018-08-03 05:27:21 +08:00
|
|
|
existing = old_array->items;
|
|
|
|
for (; existing->prog; existing++) {
|
|
|
|
if (existing->prog == exclude_prog) {
|
2018-05-27 19:24:08 +08:00
|
|
|
found_exclude = true;
|
|
|
|
continue;
|
|
|
|
}
|
2018-08-03 05:27:21 +08:00
|
|
|
if (existing->prog != &dummy_bpf_prog.prog)
|
2017-10-24 14:53:08 +08:00
|
|
|
carry_prog_cnt++;
|
2018-08-03 05:27:21 +08:00
|
|
|
if (existing->prog == include_prog)
|
2017-10-24 14:53:08 +08:00
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-05-27 19:24:08 +08:00
|
|
|
if (exclude_prog && !found_exclude)
|
|
|
|
return -ENOENT;
|
|
|
|
|
2017-10-24 14:53:08 +08:00
|
|
|
/* How many progs (not NULL) will be in the new array? */
|
|
|
|
new_prog_cnt = carry_prog_cnt;
|
|
|
|
if (include_prog)
|
|
|
|
new_prog_cnt += 1;
|
|
|
|
|
|
|
|
/* Do we have any prog (not NULL) in the new array? */
|
|
|
|
if (!new_prog_cnt) {
|
|
|
|
*new_array = NULL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* +1 as the end of prog_array is marked with NULL */
|
|
|
|
array = bpf_prog_array_alloc(new_prog_cnt + 1, GFP_KERNEL);
|
|
|
|
if (!array)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/* Fill in the new prog array */
|
|
|
|
if (carry_prog_cnt) {
|
2018-08-03 05:27:21 +08:00
|
|
|
existing = old_array->items;
|
|
|
|
for (; existing->prog; existing++)
|
|
|
|
if (existing->prog != exclude_prog &&
|
|
|
|
existing->prog != &dummy_bpf_prog.prog) {
|
|
|
|
array->items[new_prog_idx++].prog =
|
|
|
|
existing->prog;
|
|
|
|
}
|
2017-10-24 14:53:08 +08:00
|
|
|
}
|
|
|
|
if (include_prog)
|
2018-08-03 05:27:21 +08:00
|
|
|
array->items[new_prog_idx++].prog = include_prog;
|
|
|
|
array->items[new_prog_idx].prog = NULL;
|
2017-10-24 14:53:08 +08:00
|
|
|
*new_array = array;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32be ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 03:39:02 +08:00
|
|
|
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
|
2018-04-11 00:37:32 +08:00
|
|
|
u32 *prog_ids, u32 request_cnt,
|
|
|
|
u32 *prog_cnt)
|
bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32be ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 03:39:02 +08:00
|
|
|
{
|
|
|
|
u32 cnt = 0;
|
|
|
|
|
|
|
|
if (array)
|
|
|
|
cnt = bpf_prog_array_length(array);
|
|
|
|
|
2018-04-11 00:37:32 +08:00
|
|
|
*prog_cnt = cnt;
|
bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32be ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 03:39:02 +08:00
|
|
|
|
|
|
|
/* return early if user requested only program count or nothing to copy */
|
|
|
|
if (!request_cnt || !cnt)
|
|
|
|
return 0;
|
|
|
|
|
2018-04-11 00:37:32 +08:00
|
|
|
/* this function is called under trace/bpf_trace.c: bpf_event_mutex */
|
2018-08-03 05:27:21 +08:00
|
|
|
return bpf_prog_array_copy_core(array, prog_ids, request_cnt) ? -ENOSPC
|
2018-04-11 00:37:32 +08:00
|
|
|
: 0;
|
bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32be ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 03:39:02 +08:00
|
|
|
}
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
static void bpf_prog_free_deferred(struct work_struct *work)
|
|
|
|
{
|
2014-09-26 15:17:00 +08:00
|
|
|
struct bpf_prog_aux *aux;
|
2017-12-15 09:55:15 +08:00
|
|
|
int i;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
aux = container_of(work, struct bpf_prog_aux, work);
|
2017-11-04 04:56:17 +08:00
|
|
|
if (bpf_prog_is_dev_bound(aux))
|
|
|
|
bpf_prog_offload_destroy(aux->prog);
|
2018-04-29 13:28:08 +08:00
|
|
|
#ifdef CONFIG_PERF_EVENTS
|
|
|
|
if (aux->prog->has_callchain_buf)
|
|
|
|
put_callchain_buffers();
|
|
|
|
#endif
|
2017-12-15 09:55:15 +08:00
|
|
|
for (i = 0; i < aux->func_cnt; i++)
|
|
|
|
bpf_jit_free(aux->func[i]);
|
|
|
|
if (aux->func_cnt) {
|
|
|
|
kfree(aux->func);
|
|
|
|
bpf_prog_unlock_free(aux->prog);
|
|
|
|
} else {
|
|
|
|
bpf_jit_free(aux->prog);
|
|
|
|
}
|
2014-09-03 04:53:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Free internal BPF program */
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
void bpf_prog_free(struct bpf_prog *fp)
|
2014-07-23 14:01:58 +08:00
|
|
|
{
|
2014-09-26 15:17:00 +08:00
|
|
|
struct bpf_prog_aux *aux = fp->aux;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
INIT_WORK(&aux->work, bpf_prog_free_deferred);
|
|
|
|
schedule_work(&aux->work);
|
2014-07-23 14:01:58 +08:00
|
|
|
}
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
EXPORT_SYMBOL_GPL(bpf_prog_free);
|
2014-10-24 09:41:08 +08:00
|
|
|
|
bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 07:20:39 +08:00
|
|
|
/* RNG for unpriviledged user space with separated state from prandom_u32(). */
|
|
|
|
static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);
|
|
|
|
|
|
|
|
void bpf_user_rnd_init_once(void)
|
|
|
|
{
|
|
|
|
prandom_init_once(&bpf_user_rnd_state);
|
|
|
|
}
|
|
|
|
|
bpf: add BPF_CALL_x macros for declaring helpers
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.
This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.
BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.
I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 08:45:31 +08:00
|
|
|
BPF_CALL_0(bpf_user_rnd_u32)
|
bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 07:20:39 +08:00
|
|
|
{
|
|
|
|
/* Should someone ever have the rather unwise idea to use some
|
|
|
|
* of the registers passed into this function, then note that
|
|
|
|
* this function is called from native eBPF and classic-to-eBPF
|
|
|
|
* transformations. Register assignments from both sides are
|
|
|
|
* different, f.e. classic always sets fn(ctx, A, X) here.
|
|
|
|
*/
|
|
|
|
struct rnd_state *state;
|
|
|
|
u32 res;
|
|
|
|
|
|
|
|
state = &get_cpu_var(bpf_user_rnd_state);
|
|
|
|
res = prandom_u32_state(state);
|
2016-09-27 23:42:41 +08:00
|
|
|
put_cpu_var(bpf_user_rnd_state);
|
bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 07:20:39 +08:00
|
|
|
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2015-03-06 06:27:51 +08:00
|
|
|
/* Weak definitions of helper functions in case we don't have bpf syscall. */
|
|
|
|
const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_map_update_elem_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
|
2018-10-18 21:16:25 +08:00
|
|
|
const struct bpf_func_proto bpf_map_push_elem_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
|
2015-03-06 06:27:51 +08:00
|
|
|
|
2015-03-14 09:27:16 +08:00
|
|
|
const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
|
2015-03-14 09:27:17 +08:00
|
|
|
const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
|
2016-10-21 18:46:33 +08:00
|
|
|
const struct bpf_func_proto bpf_get_numa_node_id_proto __weak;
|
2015-05-30 05:23:06 +08:00
|
|
|
const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
|
2015-06-13 10:39:12 +08:00
|
|
|
const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
|
2018-06-04 06:59:41 +08:00
|
|
|
const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
|
2018-08-03 05:27:24 +08:00
|
|
|
const struct bpf_func_proto bpf_get_local_storage_proto __weak;
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
|
2015-06-13 10:39:13 +08:00
|
|
|
const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
2015-03-14 09:27:16 +08:00
|
|
|
|
2016-07-15 00:08:05 +08:00
|
|
|
u64 __weak
|
|
|
|
bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
|
|
|
|
void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
{
|
2016-07-15 00:08:05 +08:00
|
|
|
return -ENOTSUPP;
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
}
|
2018-05-04 09:37:10 +08:00
|
|
|
EXPORT_SYMBOL_GPL(bpf_event_output);
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
/* Always built-in helper functions. */
|
|
|
|
const struct bpf_func_proto bpf_tail_call_proto = {
|
|
|
|
.func = NULL,
|
|
|
|
.gpl_only = false,
|
|
|
|
.ret_type = RET_VOID,
|
|
|
|
.arg1_type = ARG_PTR_TO_CTX,
|
|
|
|
.arg2_type = ARG_CONST_MAP_PTR,
|
|
|
|
.arg3_type = ARG_ANYTHING,
|
|
|
|
};
|
|
|
|
|
2017-02-17 05:24:49 +08:00
|
|
|
/* Stub for JITs that only support cBPF. eBPF programs are interpreted.
|
|
|
|
* It is encouraged to implement bpf_int_jit_compile() instead, so that
|
|
|
|
* eBPF and implicitly also cBPF can get JITed!
|
|
|
|
*/
|
2016-05-14 01:08:31 +08:00
|
|
|
struct bpf_prog * __weak bpf_int_jit_compile(struct bpf_prog *prog)
|
2015-05-30 05:23:07 +08:00
|
|
|
{
|
2016-05-14 01:08:31 +08:00
|
|
|
return prog;
|
2015-05-30 05:23:07 +08:00
|
|
|
}
|
|
|
|
|
2017-02-17 05:24:49 +08:00
|
|
|
/* Stub for JITs that support eBPF. All cBPF code gets transformed into
|
|
|
|
* eBPF by the kernel and is later compiled by bpf_int_jit_compile().
|
|
|
|
*/
|
|
|
|
void __weak bpf_jit_compile(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2016-12-08 07:53:11 +08:00
|
|
|
bool __weak bpf_helper_changes_pkt_data(void *func)
|
2016-05-06 10:49:10 +08:00
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2014-10-24 09:41:08 +08:00
|
|
|
/* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
|
|
|
|
* skb_copy_bits(), so provide a weak definition of it for NET-less config.
|
|
|
|
*/
|
|
|
|
int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
|
|
|
|
int len)
|
|
|
|
{
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.
For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.
Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:
# ./perf record -a -e bpf:* sleep 10
# ./perf script
sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
[1] https://lwn.net/Articles/705270/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 09:28:18 +08:00
|
|
|
|
|
|
|
/* All definitions of tracepoints related to BPF. */
|
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <linux/bpf_trace.h>
|
|
|
|
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
|