==============
BPF Design Q&A
==============

BPF's extensibility and applicability to networking, tracing, and security
in the Linux kernel, together with several user space implementations of the
BPF virtual machine, have led to a number of misunderstandings about what BPF
actually is. This short Q&A is an attempt to address that and to outline the
direction in which BPF is heading long term.

.. contents::
    :local:
    :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
-------------------------------------
A: NO.

BPF is a generic instruction set *with* C calling convention.
-------------------------------------------------------------

Q: Why was the C calling convention chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (and takes into
consideration important quirks of other architectures), and
defines a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as the return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. The BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set (unlike the x64 ISA,
which allows msft, cdecl and other conventions).

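
A common way to live within the five-argument limit is to bundle the values
into a struct and pass a single pointer. The following is a minimal sketch in
plain C rather than an actual BPF program; ``struct flow_args`` and
``flow_score()`` are hypothetical names used only for illustration and are not
kernel APIs.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical bundle of six values, one more than fits in R1-R5. */
    struct flow_args {
        uint32_t saddr;
        uint32_t daddr;
        uint16_t sport;
        uint16_t dport;
        uint8_t proto;
        uint8_t flags;
    };

    /*
     * Takes one pointer argument (it would travel in R1) and returns one
     * scalar (it would come back in R0).
     */
    static uint64_t flow_score(const struct flow_args *a)
    {
        return (uint64_t)a->saddr + a->daddr + a->sport + a->dport +
               a->proto + a->flags;
    }

The caller fills the struct on its own stack and passes its address, so the
number of logical parameters is no longer tied to the number of argument
registers.
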
Q: Can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.

Q: Can BPF programs access stack pointer?
------------------------------------------
A: NO.

Only the frame pointer (register R10) is accessible.
From the compiler's point of view it is necessary to have a stack pointer.
For example, LLVM defines register R11 as the stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does C-calling convention diminish possible use cases?
-----------------------------------------------------------
A: YES.

BPF design forces the addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops.

Q: What are the verifier limits?
--------------------------------
A: The only limit known to the user space is BPF_MAXINSNS (4096).
It's the maximum number of instructions that an unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions. There is a limit to the maximum number
of subsequent branches, a limit to the number of nested bpf-to-bpf
calls, a limit to the number of the verifier states per instruction,
and a limit to the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.
There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) had a requirement that 'key' must be
a pointer to the stack. Now, 'key' can be a pointer to map value.
The verifier is steadily getting 'smarter'. The limits are
being removed. The only way to know that the program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that the future kernel
versions will accept all bpf programs that were accepted by
the earlier versions.

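
The pointer + bounded_register case can be illustrated with the usual
range-check pattern. This is a sketch in plain C, not a loadable BPF program;
``table``, ``TABLE_SIZE`` and ``lookup_bounded()`` are hypothetical names, and
in a real program the table would typically live in a map value.

.. code-block:: c

    #include <stdint.h>

    #define TABLE_SIZE 16   /* hypothetical table size */

    static const uint32_t table[TABLE_SIZE] = {
        0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225,
    };

    /*
     * The explicit range check is what gives 'idx' known bounds. After it,
     * 'table + idx' is a pointer + bounded_register expression.
     */
    static uint32_t lookup_bounded(uint32_t idx)
    {
        if (idx >= TABLE_SIZE)
            return 0;       /* out of range: refuse instead of reading */
        return table[idx];
    }

Without the check, the index would have no known upper bound and the access
could not be proven safe.
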
Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.

Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems not all BPF instructions are one-to-one to native CPU.
For example, why are BPF_JNE and other compare-and-jump instructions
not cpu-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

Q: Why doesn't the BPF_DIV instruction map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked a one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs a div-by-zero runtime check.

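
The runtime check can be modelled in a few lines of plain C. The sketch below
assumes the behaviour described in the BPF instruction set documentation, where
an unsigned division by zero sets the destination to zero instead of trapping;
``bpf_div64_model()`` and ``bpf_div32_model()`` are hypothetical names, not
kernel functions.

.. code-block:: c

    #include <stdint.h>

    /* 64-bit unsigned BPF_DIV model: a zero divisor yields 0, not a trap. */
    static uint64_t bpf_div64_model(uint64_t dst, uint64_t src)
    {
        return src ? dst / src : 0;
    }

    /* 32-bit variant: uses the low halves and zero-extends the result. */
    static uint64_t bpf_div32_model(uint64_t dst, uint64_t src)
    {
        uint32_t d = (uint32_t)dst;
        uint32_t s = (uint32_t)src;

        return s ? (uint64_t)(d / s) : 0;
    }

A JIT that emitted a bare native divide would inherit that CPU's
divide-by-zero behaviour, which is why the check has to be part of the
generated code.
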
Q: Why is there no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. LLVM errors out in such a case and
prints a suggestion to use unsigned divide instead.

Q: Why does BPF have an implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows, and in general
there are enough subtle differences between architectures that a naive
"store the return address on the stack" approach won't work. Another reason
is that BPF has to be safe from division by zero (and the legacy exception
path of the LD_ABS insn). Those instructions need to invoke the epilogue and
return implicitly.

Q: Why were BPF_JLT and BPF_JLE instructions not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and BPF authors felt that a compiler
workaround would be acceptable. It turned out that programs lose performance
due to the lack of these compare instructions, and they were added.
These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have one-to-one mapping to HW instructions
will not be accepted.

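
One possible form of the compiler workaround is to swap the operands and use
the "greater than" comparison instead. The sketch below shows the equivalence
in plain C; ``jlt()`` and ``jlt_via_jgt()`` are hypothetical helper names used
only to illustrate the rewrite.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Direct form, the condition BPF_JLT tests: dst < src (unsigned). */
    static bool jlt(uint64_t dst, uint64_t src)
    {
        return dst < src;
    }

    /* Workaround form: swap the operands and test "greater than". */
    static bool jlt_via_jgt(uint64_t dst, uint64_t src)
    {
        return src > dst;
    }

The two forms always agree, but when ``src`` is an immediate the swapped form
first has to load it into a register, which costs an extra instruction and a
register.
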
Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of BPF
registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO.

But some optimizations on zero-ing the upper 32 bits for BPF registers are
available, and can be leveraged to improve the performance of JITed BPF
programs for 32-bit architectures.

Starting with version 7, LLVM is able to generate instructions that operate
on 32-bit subregisters, provided the option -mattr=+alu32 is passed for
compiling a program. Furthermore, the verifier can now mark the
instructions for which zero-ing the upper bits of the destination register
is required, and insert an explicit zero-extension (zext) instruction
(a mov32 variant). This means that for architectures without zext hardware
support, the JIT back-ends do not need to clear the upper bits for
subregisters written by alu32 instructions or narrow loads. Instead, the
back-ends simply need to support code generation for that mov32 variant,
and to override bpf_jit_needs_zext() to make it return "true" (in order to
enable zext insertion in the verifier).

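
The zero-extension requirement itself can be modelled in plain C. This is only
an illustration of the register semantics described above, not JIT or verifier
code; ``alu64_add()`` and ``alu32_add()`` are hypothetical names.

.. code-block:: c

    #include <stdint.h>

    /* 64-bit add: the full register is written. */
    static uint64_t alu64_add(uint64_t dst, uint64_t src)
    {
        return dst + src;
    }

    /*
     * 32-bit (alu32) add: only the low 32 bits are computed, and the upper
     * 32 bits of the destination must read back as zero.
     */
    static uint64_t alu32_add(uint64_t dst, uint64_t src)
    {
        uint32_t res = (uint32_t)dst + (uint32_t)src;

        return (uint64_t)res;   /* the cast performs the zero-extension */
    }

On a 32-bit CPU the final widening step is the extra work that the explicit
zext instruction makes visible, so that a back-end only pays for it where the
verifier has determined it is needed.
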
Note that it is possible for a JIT back-end to have partial hardware
support for zext. In that case, if verifier zext insertion is enabled,
it could lead to the insertion of unnecessary zext instructions. Such
instructions could be removed by creating a simple peephole inside the JIT
back-end: if one instruction has hardware support for zext and if the next
instruction is an explicit zext, then the latter can be skipped when doing
the code generation.

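
A sketch of such a peephole, written against a deliberately simplified model.
``struct model_insn`` and ``count_emitted()`` are hypothetical and do not
correspond to the kernel's ``struct bpf_insn`` or to any real JIT; the check
that both instructions target the same destination register is an added
assumption.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical, simplified instruction record. */
    struct model_insn {
        bool is_explicit_zext;  /* verifier-inserted mov32 zext */
        bool hw_zext;           /* native encoding already clears upper bits */
        int dst_reg;
    };

    /*
     * Count the instructions that would be emitted, skipping an explicit
     * zext when the previous instruction already zero-extends the same
     * destination register.
     */
    static size_t count_emitted(const struct model_insn *insns, size_t n)
    {
        size_t emitted = 0;

        for (size_t i = 0; i < n; i++) {
            if (i > 0 && insns[i].is_explicit_zext &&
                insns[i - 1].hw_zext &&
                insns[i - 1].dst_reg == insns[i].dst_reg)
                continue;       /* redundant: skip code generation */
            emitted++;
        }
        return emitted;
    }

A real back-end would make the same decision while walking the program during
code generation rather than in a separate counting pass.
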
Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, the set of helper
functions and their arguments, and recognized return codes are all part
of the ABI. However, there is one specific exception for tracing programs
which use helpers like bpf_probe_read() to walk kernel internal
data structures and are compiled with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.

New BPF functionality is generally added through the use of kfuncs instead of
new helpers. Kfuncs are not considered part of the stable API, and have their own
lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`.

Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details hence they are
subject to change and can break with newer kernels. BPF programs need to change
accordingly when this happens.

Q: Are places where kprobes can attach part of the stable ABI?
--------------------------------------------------------------
A: NO. The places to which kprobes can attach are internal implementation
details, which means that they are subject to change and can break with
newer kernels. BPF programs need to change accordingly when this happens.

Q: How much stack space does a BPF program use?
-----------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used,
and both the interpreter and most JITed code consume only the necessary
amount.

Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does classic BPF interpreter still exist?
--------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call specific functions exposed as BPF helpers or
kfuncs. The set of available functions is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such a
program is loaded the kernel will print a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: Yes, through kfuncs and kptrs.

The core BPF functionality such as program types, maps and helpers cannot be
added to by modules. However, modules can expose functionality to BPF programs
by exporting kfuncs (which may return pointers to module-internal data
structures as kptrs).

Q: Directly calling kernel function is an ABI?
----------------------------------------------
Q: Some kernel functions (e.g. tcp_slow_start) can be called
by BPF programs. Do these kernel functions become an ABI?

A: NO.

The kernel function prototypes will change, and bpf programs that call them
will then be rejected by the verifier. Also, for example, some of the
bpf-callable kernel functions have already been used by other kernel tcp
cc (congestion-control) implementations. If any of these kernel
functions changes, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
programs, which have to be adjusted accordingly. See
:ref:`BPF_kfunc_lifecycle_expectations` for details.

Q: Attaching to arbitrary kernel functions is an ABI?
-----------------------------------------------------
Q: BPF programs can be attached to many kernel functions. Do these
kernel functions become part of the ABI?

A: NO.

The kernel function prototypes will change, and BPF programs attaching to
them will need to change. BPF compile-once-run-everywhere (CO-RE)
should be used in order to make it easier to adapt your BPF programs to
different versions of the kernel.

Q: Marking a function with BTF_ID makes that function an ABI?
-------------------------------------------------------------
A: NO.

The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.

Q: What is the compatibility story for special BPF types in map values?
-----------------------------------------------------------------------
Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
values (when using BTF support for BPF maps). This allows using helpers for
such objects on these fields inside map values. Users are also allowed to embed
pointers to some kernel types (with __kptr_untrusted and __kptr BTF tags). Will the
kernel preserve backwards compatibility for these features?

A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
NO, but see below.

For struct types that have been added already, like bpf_spin_lock and bpf_timer,
the kernel will preserve backwards compatibility, as they are part of UAPI.

For kptrs, they are also part of UAPI, but only with respect to the kptr
mechanism. The types that you can use with a __kptr_untrusted and __kptr tagged
pointer in your struct are NOT part of the UAPI contract. The supported types can
and will change across kernel releases. However, operations like accessing kptr
fields and the bpf_kptr_xchg() helper will continue to be supported across kernel
releases for the supported types.

For any other supported struct type, unless explicitly stated in this document
and added to bpf.h UAPI header, such types can and will arbitrarily change their
size, type, and alignment, or any other user visible API or ABI detail across
kernel releases. The users must adapt their BPF programs to the new changes and
update them to make sure their programs continue to work correctly.

NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
order to introduce more special fields in the future. Hence, user programs must
avoid defining types with 'bpf\_' prefix to not be broken in future releases.
In other words, no backwards compatibility is guaranteed if one uses a type
in BTF with 'bpf\_' prefix.

Q: What is the compatibility story for special BPF types in allocated objects?
------------------------------------------------------------------------------
Q: Same as above, but for allocated objects (i.e. objects allocated using
bpf_obj_new for user defined types). Will the kernel preserve backwards
compatibility for these features?

A: NO.

Unlike map value types, the API to work with allocated objects and any support
for special fields inside them is exposed through kfuncs, and thus has the same
lifecycle expectations as the kfuncs themselves. See
:ref:`BPF_kfunc_lifecycle_expectations` for details.