linux/net
Andrey Ignatov 4fbac77d2d bpf: Hooks for sys_bind
== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured.  Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).

Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).

It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.

The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.

== Implementation notes ==

[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:18 +02:00
..
6lowpan
9p virtio: bugfixes 2018-02-15 14:29:27 -08:00
802 treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
appletalk net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
atm net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ax25 net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
bpf bpf: fix null pointer deref in bpf_prog_test_run_xdp 2018-02-01 07:43:56 -08:00
bridge bridge: Allow max MTU when multiple VLANs present 2018-03-23 12:17:30 -04:00
caif net: Convert caif_net_ops 2018-03-05 10:48:27 -05:00
can net: Convert can_pernet_ops 2018-03-22 11:11:29 -04:00
ceph libceph, ceph: avoid memory leak when specifying same option several times 2018-02-26 16:19:30 +01:00
core bpf: Hooks for sys_bind 2018-03-31 02:15:18 +02:00
dcb
dccp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
decnet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
dns_resolver afs: Support the AFS dynamic root 2018-02-06 14:43:37 +00:00
dsa Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
ethernet
hsr
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
ife
ipv4 bpf: Hooks for sys_bind 2018-03-31 02:15:18 +02:00
ipv6 bpf: Hooks for sys_bind 2018-03-31 02:15:18 +02:00
iucv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
kcm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
key net: Convert /proc creating and destroying pernet_operations 2018-02-27 11:01:35 -05:00
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
l3mdev
lapb treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts 2017-11-21 16:35:54 -08:00
llc net: llc: drop VLA in llc_sap_mcast() 2018-03-12 11:14:06 -04:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
mac802154 net/mac802154: disambiguate mac80215 vs mac802154 trace events 2018-03-28 22:55:18 +02:00
mpls net: Convert mpls_net_ops 2018-03-17 17:07:39 -04:00
ncsi net/ncsi: unlock on error in ncsi_set_interface_nl() 2018-03-08 21:49:58 -05:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
netlabel net/netlabel: Add list_next_rcu() in rcu_dereference(). 2017-11-18 10:32:41 +09:00
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
netrom net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
nsh
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
packet net: Convert packet_net_ops 2018-02-13 10:36:08 -05:00
phonet net: Convert /proc creating and destroying pernet_operations 2018-02-27 11:01:35 -05:00
psample
qrtr Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
rds rds: tcp: remove register_netdevice_notifier infrastructure. 2018-03-22 11:21:45 -04:00
rfkill vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
rose net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
rxrpc rxrpc: remove redundant initialization of variable 'len' 2018-03-16 09:48:39 -04:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
smc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
strparser strparser: Call sock_owned_by_user_nocheck 2017-12-28 14:28:22 -05:00
sunrpc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
switchdev
tipc tipc: step sk->sk_drops when rcv buffer is full 2018-03-22 14:43:37 -04:00
tls net: generalize sk_alloc_sg to work with scatterlist rings 2018-03-19 21:14:38 +01:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
vmw_vsock net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
wimax
wireless treewide: remove large struct-pass-by-value from tracepoint arguments 2018-03-28 22:55:18 +02:00
x25 x25: use %*ph to print small buffer 2018-02-20 13:51:47 -05:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
compat.c
Kconfig Staging/IIO patches for 4.16-rc1 2018-02-01 09:51:57 -08:00
Makefile ipx: move Novell IPX protocol support into staging 2017-11-28 13:55:00 +01:00
socket.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
sysctl_net.c net: Convert sysctl_pernet_ops 2018-02-13 10:36:05 -05:00