mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-16 16:54:20 +08:00
2dbb9b9e6d
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN. BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern" to store the bpf context instead of using the skb->cb[48]. At the SO_REUSEPORT sk lookup time, it is in the middle of transiting from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this point, it is not always clear where the bpf context can be appended in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not clear if the lower layer is only ipv4 and ipv6 in the future and will it not touch the cb[] again before transiting to the upper layer. For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB instead of IP[6]CB and it may still modify the cb[] after calling the udp[46]_lib_lookup_skb(). Because of the above reason, if sk->cb is used for the bpf ctx, saving-and-restoring is needed and likely the whole 48 bytes cb[] has to be saved and restored. Instead of saving, setting and restoring the cb[], this patch opts to create a new "struct sk_reuseport_kern" and setting the needed values in there. The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)" will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol specific usage at this point and it is also inline with the current sock_reuseport.c implementation (i.e. no protocol specific requirement). In "struct sk_reuseport_md", this patch exposes data/data_end/len with semantic similar to other existing usages. Together with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()", the bpf prog can peek anywhere in the skb. The "bind_inany" tells the bpf prog that the reuseport group is bind-ed to a local INANY address which cannot be learned from skb. The new "bind_inany" is added to "struct sock_reuseport" which will be used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order to avoid repeating the "bind INANY" test on "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can only be properly initialized when a "sk->sk_reuseport" enabled sk is adding to a hashtable (i.e. during "reuseport_alloc()" and "reuseport_add_sock()"). The new "sk_select_reuseport()" is the main helper that the bpf prog will use to select a SO_REUSEPORT sk. It is the only function that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in the earlier patch, the validity of a selected sk is checked in run time in "sk_select_reuseport()". Doing the check in verification time is difficult and inflexible (consider the map-in-map use case). The runtime check is to compare the selected sk's reuseport_id with the reuseport_id that we want. This helper will return -EXXX if the selected sk cannot serve the incoming request (e.g. reuseport_id not match). The bpf prog can decide if it wants to do SK_DROP as its discretion. When the bpf prog returns SK_PASS, the kernel will check if a valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL"). If it does , it will use the selected sk. If not, the kernel will select one from "reuse->socks[]" (as before this patch). The SK_DROP and SK_PASS handling logic will be in the next patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
---|---|---|
.. | ||
bpfilter | ||
netfilter | ||
af_inet.c | ||
ah4.c | ||
arp.c | ||
cipso_ipv4.c | ||
datagram.c | ||
devinet.c | ||
esp4_offload.c | ||
esp4.c | ||
fib_frontend.c | ||
fib_lookup.h | ||
fib_notifier.c | ||
fib_rules.c | ||
fib_semantics.c | ||
fib_trie.c | ||
fou.c | ||
gre_demux.c | ||
gre_offload.c | ||
icmp.c | ||
igmp.c | ||
inet_connection_sock.c | ||
inet_diag.c | ||
inet_fragment.c | ||
inet_hashtables.c | ||
inet_timewait_sock.c | ||
inetpeer.c | ||
ip_forward.c | ||
ip_fragment.c | ||
ip_gre.c | ||
ip_input.c | ||
ip_options.c | ||
ip_output.c | ||
ip_sockglue.c | ||
ip_tunnel_core.c | ||
ip_tunnel.c | ||
ip_vti.c | ||
ipcomp.c | ||
ipconfig.c | ||
ipip.c | ||
ipmr_base.c | ||
ipmr.c | ||
Kconfig | ||
Makefile | ||
metrics.c | ||
netfilter.c | ||
netlink.c | ||
ping.c | ||
proc.c | ||
protocol.c | ||
raw_diag.c | ||
raw.c | ||
route.c | ||
syncookies.c | ||
sysctl_net_ipv4.c | ||
tcp_bbr.c | ||
tcp_bic.c | ||
tcp_cdg.c | ||
tcp_cong.c | ||
tcp_cubic.c | ||
tcp_dctcp.c | ||
tcp_diag.c | ||
tcp_fastopen.c | ||
tcp_highspeed.c | ||
tcp_htcp.c | ||
tcp_hybla.c | ||
tcp_illinois.c | ||
tcp_input.c | ||
tcp_ipv4.c | ||
tcp_lp.c | ||
tcp_metrics.c | ||
tcp_minisocks.c | ||
tcp_nv.c | ||
tcp_offload.c | ||
tcp_output.c | ||
tcp_rate.c | ||
tcp_recovery.c | ||
tcp_scalable.c | ||
tcp_timer.c | ||
tcp_ulp.c | ||
tcp_vegas.c | ||
tcp_vegas.h | ||
tcp_veno.c | ||
tcp_westwood.c | ||
tcp_yeah.c | ||
tcp.c | ||
tunnel4.c | ||
udp_diag.c | ||
udp_impl.h | ||
udp_offload.c | ||
udp_tunnel.c | ||
udp.c | ||
udplite.c | ||
xfrm4_input.c | ||
xfrm4_mode_beet.c | ||
xfrm4_mode_transport.c | ||
xfrm4_mode_tunnel.c | ||
xfrm4_output.c | ||
xfrm4_policy.c | ||
xfrm4_protocol.c | ||
xfrm4_state.c | ||
xfrm4_tunnel.c |