2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-25 05:34:00 +08:00
linux-next/net
Lawrence Brakmo 40304b2a15 bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
..
6lowpan 6lowpan: Don't set IFF_NO_QUEUE 2017-04-12 22:02:40 +02:00
9p xen: fixes for 4.12 rc2 2017-05-19 15:06:48 -07:00
802 net: introduce __skb_put_[zero, data, u8] 2017-06-20 13:30:14 -04:00
8021q net: add netlink_ext_ack argument to rtnl_link_ops.validate 2017-06-26 23:13:22 -04:00
appletalk networking: make skb_push & __skb_push return void pointers 2017-06-16 11:48:40 -04:00
atm net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
ax25 networking: make skb_push & __skb_push return void pointers 2017-06-16 11:48:40 -04:00
batman-adv networking: make skb_put & friends return void pointers 2017-06-16 11:48:39 -04:00
bluetooth Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next 2017-07-01 15:57:29 -07:00
bpf bpf: Align packet data properly in program testing framework. 2017-05-02 11:46:28 -04:00
bridge net: convert nf_bridge_info.use from atomic_t to refcount_t 2017-07-01 07:39:07 -07:00
caif net: convert sock.sk_wmem_alloc from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
can networking: introduce and use skb_put_data() 2017-06-16 11:48:37 -04:00
ceph libceph: cleanup old messages according to reconnect seq 2017-05-24 18:10:51 +02:00
core bpf: BPF support for sock_ops 2017-07-01 16:15:13 -07:00
dcb dcb: enforce minimum length on IEEE_APPS attribute 2017-05-21 13:42:33 -04:00
dccp net: convert sk_buff.users from atomic_t to refcount_t 2017-07-01 07:39:07 -07:00
decnet net: convert neighbour.refcnt from atomic_t to refcount_t 2017-07-01 07:39:07 -07:00
dns_resolver Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
dsa net: manual clean code which call skb_put_[data:zero] 2017-06-20 13:30:15 -04:00
ethernet networking: make skb_push & __skb_push return void pointers 2017-06-16 11:48:40 -04:00
hsr net: add netlink_ext_ack argument to rtnl_link_ops.newlink 2017-06-26 23:13:21 -04:00
ieee802154 net: add netlink_ext_ack argument to rtnl_link_ops.validate 2017-06-26 23:13:22 -04:00
ife net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
ipv4 net: convert netlbl_lsm_cache.refcount from atomic_t to refcount_t 2017-07-01 07:39:09 -07:00
ipv6 net: convert netlbl_lsm_cache.refcount from atomic_t to refcount_t 2017-07-01 07:39:09 -07:00
ipx ipx: call ipxitf_put() in ioctl error path 2017-05-02 15:34:53 -04:00
irda net: manual clean code which call skb_put_[data:zero] 2017-06-20 13:30:15 -04:00
iucv af_iucv: Move sockaddr length checks to before accessing sa_family in bind and connect handlers 2017-06-25 11:42:58 -04:00
kcm net: convert sock.sk_wmem_alloc from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
key net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
l2tp net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
l3mdev
lapb
llc net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
mac80211 net: manual clean code which call skb_put_[data:zero] 2017-06-20 13:30:15 -04:00
mac802154 net: Fix inconsistent teardown and release of private netdev state. 2017-06-07 15:53:24 -04:00
mpls Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-06 22:20:08 -04:00
ncsi networking: make skb_push & __skb_push return void pointers 2017-06-16 11:48:40 -04:00
netfilter sctp: remove the typedef sctp_inithdr_t 2017-07-01 09:08:42 -07:00
netlabel netlink: pass extended ACK struct to parsing functions 2017-04-13 13:58:22 -04:00
netlink net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
netrom net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
nfc NFC: Add sockaddr length checks before accessing sa_family in bind handlers 2017-06-23 00:38:31 +02:00
openvswitch datapath: Avoid using stack larger than 1024. 2017-07-01 09:09:59 -07:00
packet net: convert packet_fanout.sk_ref from atomic_t to refcount_t 2017-07-01 07:39:09 -07:00
phonet net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
psample networking: make skb_put & friends return void pointers 2017-06-16 11:48:39 -04:00
qrtr networking: make skb_put & friends return void pointers 2017-06-16 11:48:39 -04:00
rds net: convert sock.sk_wmem_alloc from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
rfkill net: rfkill: gpio: Switch to devm_acpi_dev_add_driver_gpios() 2017-06-13 11:07:51 +02:00
rose net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
rxrpc net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
sched net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
sctp sctp: Add peeloff-flags socket option 2017-07-01 15:26:11 -07:00
smc net/smc: Add warning about remote memory exposure 2017-05-16 14:49:43 -04:00
strparser strparser: destroy workqueue on module exit 2017-03-03 20:43:26 -08:00
sunrpc SUNRPC: ensure correct error is reported by xs_tcp_setup_socket() 2017-05-31 12:26:44 -04:00
switchdev net: switchdev: Change notifier chain to be atomic 2017-06-08 14:16:24 -04:00
tipc net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
tls tls: return -EFAULT if copy_to_user() fails 2017-06-23 14:19:27 -04:00
unix net: convert unix_address.refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
vmw_vsock net: manual clean code which call skb_put_[data:zero] 2017-06-20 13:30:15 -04:00
wimax
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-21 17:35:22 -04:00
x25 net: manual clean code which call skb_put_[data:zero] 2017-06-20 13:30:15 -04:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-30 12:43:08 -04:00
compat.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-02-22 10:15:09 -08:00
Kconfig tls: kernel TLS support 2017-06-15 12:12:40 -04:00
Makefile tls: kernel TLS support 2017-06-15 12:12:40 -04:00
socket.c net: socket: fix a typo in sockfd_lookup(). 2017-05-22 12:14:04 -04:00
sysctl_net.c sysctl: Remove dead register_sysctl_root 2017-04-16 23:42:49 -05:00