linux/net/core
Daniel Borkmann 983695fa67 bpf: fix unconnected udp hooks
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06 16:53:12 -07:00
..
bpf_sk_storage.c bpf: Use PTR_ERR_OR_ZERO in bpf_fd_sk_storage_update_elem() 2019-05-04 23:20:58 -07:00
datagram.c datagram: remove rendundant 'peeked' argument 2019-04-08 09:51:54 -07:00
datagram.h net/core: Allow the compiler to verify declaration and definition consistency 2019-03-27 13:49:44 -07:00
dev_addr_lists.c net: dev: Issue NETDEV_PRE_CHANGEADDR 2018-12-13 18:41:38 -08:00
dev_ioctl.c net/core: Document all dev_ioctl() arguments 2019-03-27 13:49:43 -07:00
dev.c net: avoid weird emergency message 2019-05-16 14:25:58 -07:00
devlink.c devlink: Change devlink health locking mechanism 2019-05-01 11:07:03 -04:00
drop_monitor.c genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
dst_cache.c net: core: dst_cache_set_ip6: Rename 'addr' parameter to 'saddr' for consistency 2018-03-05 12:52:45 -05:00
dst.c net: dst: remove gc leftovers 2019-03-21 13:39:25 -07:00
ethtool.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-04-05 14:14:19 -07:00
failover.c failover: allow name change on IFF_UP slave interfaces 2019-04-10 22:12:26 -07:00
fib_notifier.c net: Fix fib notifer to return errno 2018-03-29 14:10:30 -04:00
fib_rules.c fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied 2019-05-08 09:32:10 -07:00
filter.c bpf: fix unconnected udp hooks 2019-06-06 16:53:12 -07:00
flow_dissector.c flow_dissector: disable preemption around BPF calls 2019-05-13 09:53:42 -07:00
flow_offload.c flow_offload: support CVLAN match 2019-05-16 12:02:42 -07:00
gen_estimator.c net: core: protect rate estimator statistics pointer with lock 2018-08-11 12:37:10 -07:00
gen_stats.c Revert: "net: sched: put back q.qlen into a single location" 2019-04-10 12:20:46 -07:00
gro_cells.c gro_cells: make sure device is up in gro_cells_receive() 2019-03-10 11:07:14 -07:00
hwbm.c
link_watch.c net: linkwatch: add check for netdevice being present to linkwatch_do_dev 2018-09-19 21:06:46 -07:00
lwt_bpf.c netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
lwtunnel.c netlink: make nla_nest_start() add NLA_F_NESTED flag 2019-04-27 17:03:44 -04:00
Makefile bpf: Introduce bpf sk local storage 2019-04-27 09:07:04 -07:00
neighbour.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-05-07 17:22:09 -07:00
net_namespace.c netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
net-procfs.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
net-sysfs.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
net-sysfs.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
net-traces.c trace: events: add a few neigh tracepoints 2019-02-17 10:33:39 -08:00
netclassid_cgroup.c cgroup, netclassid: add a preemption point to write_classid 2018-10-23 12:58:17 -07:00
netevent.c
netpoll.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
netprio_cgroup.c cgroup: net: remove left over MODULE_LICENSE tag 2019-04-22 21:50:54 -07:00
page_pool.c page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings 2019-02-13 22:00:16 -08:00
pktgen.c xfrm: remove output indirection from xfrm_mode 2019-04-08 09:14:28 +02:00
ptp_classifier.c net/core: work around section mismatch warning for ptp_classifier 2019-04-16 20:46:17 -07:00
request_sock.c ipv4: Namespaceify tcp_max_syn_backlog knob 2016-12-29 11:38:31 -05:00
rtnetlink.c rtnetlink: always put IFLA_LINK for links with a link-netnsid 2019-05-14 15:40:01 -07:00
scm.c socket: Add SO_TIMESTAMPING_NEW 2019-02-03 11:17:31 -08:00
secure_seq.c infiniband: i40iw, nes: don't use wall time for TCP sequence numbers 2018-07-11 12:10:19 -06:00
skbuff.c bpf: sockmap, fix use after free from sleep in psock backlog workqueue 2019-05-24 23:18:42 +02:00
skmsg.c bpf: sockmap fix msg->sg.size account on ingress skb 2019-05-14 01:31:43 +02:00
sock_diag.c net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd() 2018-08-14 10:01:24 -07:00
sock_map.c bpf: skmsg, fix psock create on existing kcm/tls port 2018-10-20 00:40:45 +02:00
sock_reuseport.c net/core: Document reuseport_add_sock() bind_inany argument 2019-03-27 13:49:43 -07:00
sock.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2019-04-28 08:42:41 -04:00
stream.c tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT 2018-12-04 21:21:18 -08:00
sysctl_net_core.c net: convert rps_needed and rfs_needed to new static branch api 2019-03-23 21:57:38 -04:00
timestamping.c
tso.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
utils.c net: Remove some unneeded semicolon 2018-08-04 13:05:39 -07:00
xdp.c xdp: remove redundant variable 'headroom' 2018-09-01 01:35:53 +02:00