linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-02 16:44:10 +08:00

Daniel Borkmann 983695fa67 bpf: fix unconnected udp hooks

Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in `7828f20e37`
("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the
latter two hooks into Cilium to enable host based load-balancing with
Kubernetes, I ran into the issue that pods couldn't start up as DNS got
broken. Kubernetes typically sets up DNS as a service and is thus subject
to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks
API is currently insufficient and thus not usable as-is for standard
applications shipped with most distros. To break down the issue we ran
into with a simple example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in
msg_name address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current
sockaddr_in{,6} with the original service IP. From BPF side, this basically
tracks the service tuple plus socket cookie in an LRU map where the reverse
NAT can then be retrieved via map value as one example. Side-note: the BPF
cgroups static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1. IN A

  ;; AUTHORITY SECTION:
  . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: `1cedee13d2` ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2019-06-06 16:53:12 -07:00
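
The reverse translation described in the commit message above (track the service tuple plus socket cookie on sendmsg, restore the original service address on recvmsg) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it is not the kernel patch and not Cilium's BPF program. The names `addr4`, `service`, `rev_nat_entry`, `model_sendmsg4` and `model_recvmsg4` are hypothetical, the socket cookie is passed in as a plain integer, and a fixed-size table with round-robin overwrite stands in for the LRU map mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only; all names here are hypothetical illustrations. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct addr4 { uint32_t ip; uint16_t port; };

struct service { struct addr4 frontend; struct addr4 backend; };

/* Service table taken from the `cilium service list` output in the message. */
static const struct service services[] = {
    { { IP4(147, 75, 207, 207), 53 }, { IP4(8, 8, 8, 8), 53 } },
    { { IP4(147, 75, 207, 208), 53 }, { IP4(8, 8, 8, 8), 53 } },
};

/* Reverse-NAT state: (socket cookie, backend) -> original frontend. */
struct rev_nat_entry {
    int used;
    uint64_t cookie;
    struct addr4 backend;
    struct addr4 frontend;
};

#define REV_NAT_SLOTS 16  /* assumption: round-robin reuse instead of a real LRU */
static struct rev_nat_entry rev_nat[REV_NAT_SLOTS];
static unsigned int next_slot;

static int addr_eq(struct addr4 a, struct addr4 b)
{
    return a.ip == b.ip && a.port == b.port;
}

static struct rev_nat_entry *rev_nat_lookup(uint64_t cookie, struct addr4 backend)
{
    for (size_t i = 0; i < REV_NAT_SLOTS; i++) {
        struct rev_nat_entry *e = &rev_nat[i];
        if (e->used && e->cookie == cookie && addr_eq(e->backend, backend))
            return e;
    }
    return NULL;
}

/* sendmsg side: rewrite a service frontend to its backend and remember the
 * original address for this socket. Returns 1 to let the send proceed. */
int model_sendmsg4(uint64_t cookie, struct addr4 *dst)
{
    assert(dst != NULL);
    for (size_t i = 0; i < sizeof(services) / sizeof(services[0]); i++) {
        const struct service *svc = &services[i];
        if (!addr_eq(*dst, svc->frontend))
            continue;
        struct rev_nat_entry *e = rev_nat_lookup(cookie, svc->backend);
        if (e == NULL)
            e = &rev_nat[next_slot++ % REV_NAT_SLOTS];
        e->used = 1;
        e->cookie = cookie;
        e->backend = svc->backend;
        e->frontend = svc->frontend;
        *dst = svc->backend;
        break;
    }
    return 1;
}

/* recvmsg side: if the reply source is a backend this socket was redirected
 * to, put the original frontend back. Always returns 1, mirroring the [1,1]
 * return range the message describes for the recvmsg hooks. */
int model_recvmsg4(uint64_t cookie, struct addr4 *src)
{
    assert(src != NULL);
    const struct rev_nat_entry *e = rev_nat_lookup(cookie, *src);
    if (e != NULL)
        *src = e->frontend;
    return 1;
}
```

In this model, calling `model_sendmsg4` with cookie 42 and destination 147.75.207.207:53 rewrites the destination to 8.8.8.8:53, and a later `model_recvmsg4` with the same cookie and source 8.8.8.8:53 restores 147.75.207.207:53, which is the address a resolver such as nslookup expects to see in msg_name. A socket with a different cookie is left untouched. Because the key is only the cookie and the backend, two frontends that share a backend resolve to whichever frontend that socket used most recently.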
..
bpf	bpf: fix unconnected udp hooks	2019-06-06 16:53:12 -07:00
cgroup	kernel/sched/psi.c: expose pressure metrics on root cgroup	2019-05-14 19:52:48 -07:00
configs
debug	kdb: Fix bound check compiler warning	2019-05-14 13:44:24 +01:00
dma	DMA mapping updates for 5.2	2019-05-09 08:40:55 -07:00
events	mm/mmu_notifier: use correct mmu_notifier events for each invalidation	2019-05-14 09:47:49 -07:00
gcov	gcov: clang support	2019-05-14 19:52:51 -07:00
irq	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2019-05-19 10:58:45 -07:00
livepatch	The major changes in this tracing update includes:	2019-05-15 16:05:47 -07:00
locking	locking/rwsem: Prevent decrement of reader count before increment	2019-05-07 08:46:46 +02:00
power	Power management updates for 5.2-rc1	2019-05-06 19:40:31 -07:00
printk	panic: add an option to replay all the printk message in buffer	2019-05-18 15:52:26 -07:00
rcu	The major changes in this tracing update includes:	2019-05-15 16:05:47 -07:00
sched	kernel/sched/psi.c: expose pressure metrics on root cgroup	2019-05-14 19:52:48 -07:00
time	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2019-05-16 11:00:20 -07:00
trace	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2019-05-20 08:21:07 -07:00
.gitignore	Provide in-kernel headers to make extending kernel easier	2019-04-29 16:48:03 +02:00
acct.c	acct_on(): don't mess with freeze protection	2019-04-04 21:04:13 -04:00
async.c	treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively	2019-04-09 14:19:06 +02:00
audit_fsnotify.c	audit_compare_dname_path(): switch to const struct qstr *	2019-04-28 20:33:43 -04:00
audit_tree.c	fsnotify: switch send_to_group() and ->handle_event to const struct qstr *	2019-04-26 13:51:03 -04:00
audit_watch.c	audit_compare_dname_path(): switch to const struct qstr *	2019-04-28 20:33:43 -04:00
audit.c	audit: connect LOGIN record to its syscall record	2019-03-20 20:57:48 -04:00
audit.h	audit_compare_dname_path(): switch to const struct qstr *	2019-04-28 20:33:43 -04:00
auditfilter.c	Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2019-05-07 20:03:32 -07:00
auditsc.c	Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2019-05-07 20:03:32 -07:00
backtracetest.c	backtrace-test: Simplify stack trace handling	2019-04-29 12:37:47 +02:00
bounds.c
capability.c	LSM: add SafeSetID module that gates setid calls	2019-01-25 11:22:43 -08:00
compat.c	kernel/compat.c: mark expected switch fall-throughs	2019-05-15 08:16:14 -07:00
configs.c	kernel/configs: use .incbin directive to embed config_data.gz	2019-03-07 18:32:02 -08:00
context_tracking.c
cpu_pm.c
cpu.c	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2019-05-06 14:50:46 -07:00
crash_core.c	kexec: export PG_offline to VMCOREINFO	2019-03-05 21:07:14 -08:00
crash_dump.c
cred.c	SELinux: Remove cred security blob poisoning	2019-01-08 13:18:44 -08:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c	mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE	2019-05-14 19:52:47 -07:00
extable.c
fail_function.c	treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively	2019-04-09 14:19:06 +02:00
fork.c	kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing	2019-05-14 19:52:49 -07:00
freezer.c
futex.c	mm/gup: change GUP fast to use flags rather than a write 'bool'	2019-05-14 09:47:46 -07:00
gen_ikh_data.sh	Provide in-kernel headers to make extending kernel easier	2019-04-29 16:48:03 +02:00
groups.c
hung_task.c	kernel/hung_task.c: Use continuously blocked time when reporting.	2019-03-07 18:31:59 -08:00
iomem.c	mm/resource: Use resource_overlaps() to simplify region_intersects()	2019-04-19 12:59:36 +02:00
irq_work.c	irq_work: Do not raise an IPI when queueing work on the local CPU	2019-04-18 14:07:52 +02:00
jump_label.c	locking/static_key: Don't take sleeping locks in __static_key_slow_dec_deferred()	2019-04-29 08:29:21 +02:00
kallsyms.c	bpf: Add module name [bpf] to ksymbols for bpf programs	2019-01-21 17:38:56 -03:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks	Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())	2019-05-06 16:57:52 -07:00
Kconfig.preempt	kconfig: warn no new line at end of file	2018-12-15 17:44:35 +09:00
kcov.c	kcov: convert kcov.refcount to refcount_t	2019-03-07 18:32:02 -08:00
kexec_core.c	power/suspend: Add function to disable secondaries for suspend	2019-05-03 19:42:41 +02:00
kexec_file.c	mm: memblock: make keeping memblock memory opt-in rather than opt-out	2019-05-14 09:47:50 -07:00
kexec_internal.h
kexec.c
kheaders.c	Provide in-kernel headers to make extending kernel easier	2019-04-29 16:48:03 +02:00
kmod.c
kprobes.c	kprobes: Fix error check when reusing optimized probes	2019-04-16 09:38:16 +02:00
ksysfs.c
kthread.c	include/: refactor headers to allow kthread.h inclusion in psi_types.h	2019-05-14 19:52:48 -07:00
latencytop.c	kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing	2019-05-14 19:52:49 -07:00
Makefile	kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executable	2019-05-14 19:52:47 -07:00
memremap.c	kernel/memremap.c: remove the unused device_private_entry_fault() export	2019-05-14 09:47:51 -07:00
module_signing.c	modsign: use all trusted keys to verify module signature	2018-11-07 14:41:41 +01:00
module-internal.h	kallsyms: store type information in its own array	2019-03-28 15:00:37 +01:00
module.c	Modules updates for v5.2	2019-05-14 10:55:54 -07:00
notifier.c	kernel/notifier.c: double register detection	2019-05-14 19:52:49 -07:00
nsproxy.c
padata.c	padata: Replace padata_attr_type default_attrs field with groups	2019-04-25 22:06:11 +02:00
panic.c	panic: add an option to replay all the printk message in buffer	2019-05-18 15:52:26 -07:00
params.c
pid_namespace.c
pid.c	kernel/pid.c: remove unneeded hash header file	2019-05-14 19:52:51 -07:00
profile.c
ptrace.c	ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK	2019-03-29 10:01:37 -07:00
range.c
reboot.c	panic/reboot: allow specifying reboot_mode for panic only	2019-05-14 19:52:51 -07:00
relay.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2019-03-12 13:27:20 -07:00
resource.c	mm/resource: Use resource_overlaps() to simplify region_intersects()	2019-04-19 12:59:36 +02:00
rseq.c	rseq: Remove superfluous rseq_len from task_struct	2019-04-19 12:39:32 +02:00
seccomp.c	audit/stable-5.2 PR 20190507	2019-05-07 19:06:04 -07:00
signal.c	signal: unconditionally leave the frozen state in ptrace_stop()	2019-05-16 10:43:58 -07:00
smp.c	cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM	2019-01-30 19:27:00 +01:00
smpboot.c
smpboot.h
softirq.c	softirq: Remove tasklet_hrtimer	2019-03-22 14:36:02 +01:00
stackleak.c	stackleak: Mark stackleak_track_stack() as notrace	2018-12-05 19:31:44 -08:00
stacktrace.c	stacktrace: Provide common infrastructure	2019-04-29 12:37:57 +02:00
stop_machine.c	treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively	2019-04-09 14:19:06 +02:00
sys_ni.c	signal: support CLONE_PIDFD with pidfd_send_signal	2019-05-07 14:31:03 +02:00
sys.c	kernel/sys.c: prctl: fix false positive in validate_prctl_map()	2019-05-14 09:47:44 -07:00
sysctl_binary.c	kernel/sysctl: add panic_print into sysctl	2019-01-04 13:13:47 -08:00
sysctl.c	kernel/sysctl.c: fix proc_do_large_bitmap for large input buffers	2019-05-14 19:52:51 -07:00
task_work.c
taskstats.c	genetlink: optionally validate strictly/dumps	2019-04-27 17:07:22 -04:00
test_kprobes.c
torture.c	torture: Don't try to offline the last CPU	2019-03-26 14:42:53 -07:00
tracepoint.c	tracing: Replace synchronize_sched() and call_rcu_sched()	2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c	umh: add exit routine for UMH process	2019-01-11 18:05:40 -08:00
up.c
user_namespace.c	userns: also map extents in the reverse map to kernel IDs	2018-11-07 23:51:16 -06:00
user-return-notifier.c
user.c	kernel/user.c: clean up some leftover code	2019-05-14 19:52:49 -07:00
utsname_sysctl.c
utsname.c
watchdog_hld.c	kernel/watchdog_hld.c: hard lockup message should end with a newline	2019-04-19 09:46:05 -07:00
watchdog.c	watchdog: Fix typo in comment	2019-04-18 14:05:51 +02:00
workqueue_internal.h	sched/core, workqueues: Distangle worker accounting from rq lock	2019-04-16 16:55:15 +02:00
workqueue.c	Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2019-05-09 13:48:52 -07:00