linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-26 13:44:15 +08:00

Author	SHA1	Message	Date
Alexei Starovoitov	02620d9e62	Merge branch 'bpf-dispatcher' Björn Töpel says: ==================== Overview ======== This is the 6th iteration of the series that introduces the BPF dispatcher, which is a mechanism to avoid indirect calls. The BPF dispatcher is a multi-way branch code generator, targeted for BPF programs. E.g. when an XDP program is executed via the bpf_prog_run_xdp(), it is invoked via an indirect call. With retpolines enabled, the indirect call has a substantial performance impact. The dispatcher is a mechanism that transforms indirect calls to direct calls, and therefore avoids the retpoline. The dispatcher is generated using the BPF JIT, and relies on text poking provided by bpf_arch_text_poke(). The dispatcher hijacks a trampoline function via the __fentry__ nop of the trampoline. One dispatcher instance currently supports up to 48 dispatch points. This can be extended in the future. In this series, only one dispatcher instance is supported, and the only user is XDP. The dispatcher is updated when an XDP program is attached/detached to/from a netdev. An alternative to this could have been to update the dispatcher at program load point, but as there are usually more XDP programs loaded than attached, updating at attach/detach time was picked. The XDP dispatcher is always enabled, if available, because it helps even when retpolines are disabled. Please refer to the "Performance" section below. The first patch refactors the image allocation from the BPF trampoline code. Patch two introduces the dispatcher, and patch three adds a dispatcher for XDP, and wires up the XDP control-/ fast-path. Patch four adds the dispatcher to BPF_TEST_RUN. Patch five adds a simple selftest, and the last adds alignment to jump targets. I have rebased the series on commit `679152d3a3` ("libbpf: Fix printf compilation warnings on ppc64le arch"). Generated code, x86-64 ====================== The dispatcher currently has a maximum of 48 entries, where one entry is a unique BPF program. 
Multiple users of a dispatcher instance using the same BPF program will share that entry. The program/slot lookup is performed by a binary search, O(log n). Let's have a look at the generated code. The trampoline function has the following signature: unsigned int tramp(const void *ctx, const struct bpf_insn *insnsi, unsigned int (*bpf_func)(const void *, const struct bpf_insn *)) On Intel x86-64 this means that rdx will contain the bpf_func. To make it easier to read, I've let the BPF programs have the following range: 0xffffffffffffffff (-1) to 0xfffffffffffffff0 (-16). 0xffffffff81c00f10 is the retpoline thunk, in this case __x86_indirect_thunk_rdx. If retpolines are disabled the thunk will be a regular indirect call. The minimal dispatcher will then look like this: ffffffffc0002000: cmp rdx,0xffffffffffffffff ffffffffc0002007: je 0xffffffffffffffff ; -1 ffffffffc000200d: jmp 0xffffffff81c00f10 A 16 entry dispatcher looks like this: ffffffffc0020000: cmp rdx,0xfffffffffffffff7 ; -9 ffffffffc0020007: jg 0xffffffffc0020130 ffffffffc002000d: cmp rdx,0xfffffffffffffff3 ; -13 ffffffffc0020014: jg 0xffffffffc00200a0 ffffffffc002001a: cmp rdx,0xfffffffffffffff1 ; -15 ffffffffc0020021: jg 0xffffffffc0020060 ffffffffc0020023: cmp rdx,0xfffffffffffffff0 ; -16 ffffffffc002002a: jg 0xffffffffc0020040 ffffffffc002002c: cmp rdx,0xfffffffffffffff0 ; -16 ffffffffc0020033: je 0xfffffffffffffff0 ; -16 ffffffffc0020039: jmp 0xffffffff81c00f10 ffffffffc002003e: xchg ax,ax ffffffffc0020040: cmp rdx,0xfffffffffffffff1 ; -15 ffffffffc0020047: je 0xfffffffffffffff1 ; -15 ffffffffc002004d: jmp 0xffffffff81c00f10 ffffffffc0020052: nop DWORD PTR [rax+rax*1+0x0] ffffffffc002005a: nop WORD PTR [rax+rax*1+0x0] ffffffffc0020060: cmp rdx,0xfffffffffffffff2 ; -14 ffffffffc0020067: jg 0xffffffffc0020080 ffffffffc0020069: cmp rdx,0xfffffffffffffff2 ; -14 ffffffffc0020070: je 0xfffffffffffffff2 ; -14 ffffffffc0020076: jmp 0xffffffff81c00f10 ffffffffc002007b: nop DWORD PTR [rax+rax*1+0x0] ffffffffc0020080: 
cmp rdx,0xfffffffffffffff3 ; -13 ffffffffc0020087: je 0xfffffffffffffff3 ; -13 ffffffffc002008d: jmp 0xffffffff81c00f10 ffffffffc0020092: nop DWORD PTR [rax+rax*1+0x0] ffffffffc002009a: nop WORD PTR [rax+rax*1+0x0] ffffffffc00200a0: cmp rdx,0xfffffffffffffff5 ; -11 ffffffffc00200a7: jg 0xffffffffc00200f0 ffffffffc00200a9: cmp rdx,0xfffffffffffffff4 ; -12 ffffffffc00200b0: jg 0xffffffffc00200d0 ffffffffc00200b2: cmp rdx,0xfffffffffffffff4 ; -12 ffffffffc00200b9: je 0xfffffffffffffff4 ; -12 ffffffffc00200bf: jmp 0xffffffff81c00f10 ffffffffc00200c4: nop DWORD PTR [rax+rax*1+0x0] ffffffffc00200cc: nop DWORD PTR [rax+0x0] ffffffffc00200d0: cmp rdx,0xfffffffffffffff5 ; -11 ffffffffc00200d7: je 0xfffffffffffffff5 ; -11 ffffffffc00200dd: jmp 0xffffffff81c00f10 ffffffffc00200e2: nop DWORD PTR [rax+rax*1+0x0] ffffffffc00200ea: nop WORD PTR [rax+rax*1+0x0] ffffffffc00200f0: cmp rdx,0xfffffffffffffff6 ; -10 ffffffffc00200f7: jg 0xffffffffc0020110 ffffffffc00200f9: cmp rdx,0xfffffffffffffff6 ; -10 ffffffffc0020100: je 0xfffffffffffffff6 ; -10 ffffffffc0020106: jmp 0xffffffff81c00f10 ffffffffc002010b: nop DWORD PTR [rax+rax*1+0x0] ffffffffc0020110: cmp rdx,0xfffffffffffffff7 ; -9 ffffffffc0020117: je 0xfffffffffffffff7 ; -9 ffffffffc002011d: jmp 0xffffffff81c00f10 ffffffffc0020122: nop DWORD PTR [rax+rax*1+0x0] ffffffffc002012a: nop WORD PTR [rax+rax*1+0x0] ffffffffc0020130: cmp rdx,0xfffffffffffffffb ; -5 ffffffffc0020137: jg 0xffffffffc00201d0 ffffffffc002013d: cmp rdx,0xfffffffffffffff9 ; -7 ffffffffc0020144: jg 0xffffffffc0020190 ffffffffc0020146: cmp rdx,0xfffffffffffffff8 ; -8 ffffffffc002014d: jg 0xffffffffc0020170 ffffffffc002014f: cmp rdx,0xfffffffffffffff8 ; -8 ffffffffc0020156: je 0xfffffffffffffff8 ; -8 ffffffffc002015c: jmp 0xffffffff81c00f10 ffffffffc0020161: nop DWORD PTR [rax+rax*1+0x0] ffffffffc0020169: nop DWORD PTR [rax+0x0] ffffffffc0020170: cmp rdx,0xfffffffffffffff9 ; -7 ffffffffc0020177: je 0xfffffffffffffff9 ; -7 ffffffffc002017d: jmp 0xffffffff81c00f10 
ffffffffc0020182: nop DWORD PTR [rax+rax*1+0x0] ffffffffc002018a: nop WORD PTR [rax+rax*1+0x0] ffffffffc0020190: cmp rdx,0xfffffffffffffffa ; -6 ffffffffc0020197: jg 0xffffffffc00201b0 ffffffffc0020199: cmp rdx,0xfffffffffffffffa ; -6 ffffffffc00201a0: je 0xfffffffffffffffa ; -6 ffffffffc00201a6: jmp 0xffffffff81c00f10 ffffffffc00201ab: nop DWORD PTR [rax+rax*1+0x0] ffffffffc00201b0: cmp rdx,0xfffffffffffffffb ; -5 ffffffffc00201b7: je 0xfffffffffffffffb ; -5 ffffffffc00201bd: jmp 0xffffffff81c00f10 ffffffffc00201c2: nop DWORD PTR [rax+rax*1+0x0] ffffffffc00201ca: nop WORD PTR [rax+rax*1+0x0] ffffffffc00201d0: cmp rdx,0xfffffffffffffffd ; -3 ffffffffc00201d7: jg 0xffffffffc0020220 ffffffffc00201d9: cmp rdx,0xfffffffffffffffc ; -4 ffffffffc00201e0: jg 0xffffffffc0020200 ffffffffc00201e2: cmp rdx,0xfffffffffffffffc ; -4 ffffffffc00201e9: je 0xfffffffffffffffc ; -4 ffffffffc00201ef: jmp 0xffffffff81c00f10 ffffffffc00201f4: nop DWORD PTR [rax+rax*1+0x0] ffffffffc00201fc: nop DWORD PTR [rax+0x0] ffffffffc0020200: cmp rdx,0xfffffffffffffffd ; -3 ffffffffc0020207: je 0xfffffffffffffffd ; -3 ffffffffc002020d: jmp 0xffffffff81c00f10 ffffffffc0020212: nop DWORD PTR [rax+rax*1+0x0] ffffffffc002021a: nop WORD PTR [rax+rax*1+0x0] ffffffffc0020220: cmp rdx,0xfffffffffffffffe ; -2 ffffffffc0020227: jg 0xffffffffc0020240 ffffffffc0020229: cmp rdx,0xfffffffffffffffe ; -2 ffffffffc0020230: je 0xfffffffffffffffe ; -2 ffffffffc0020236: jmp 0xffffffff81c00f10 ffffffffc002023b: nop DWORD PTR [rax+rax*1+0x0] ffffffffc0020240: cmp rdx,0xffffffffffffffff ; -1 ffffffffc0020247: je 0xffffffffffffffff ; -1 ffffffffc002024d: jmp 0xffffffff81c00f10 The nops are there to align jump targets to 16 B. Performance =========== The tests were performed using the xdp_rxq_info sample program with the following command-line: 1. XDP_DRV: # xdp_rxq_info --dev eth0 --action XDP_DROP 2. XDP_SKB: # xdp_rxq_info --dev eth0 -S --action XDP_DROP 3. 
xdp-perf, from selftests/bpf: # test_progs -v -t xdp_perf Run with mitigations=auto ------------------------- Baseline: 1. 21.7 Mpps (21736190) 2. 3.8 Mpps (3837582) 3. 15 ns Dispatcher: 1. 30.2 Mpps (30176320) 2. 4.0 Mpps (4015579) 3. 5 ns Dispatcher (full; walk all entries, and fallback): 1. 22.0 Mpps (21986704) 2. 3.8 Mpps (3831298) 3. 17 ns Run with mitigations=off ------------------------ Baseline: 1. 29.9 Mpps (29875135) 2. 4.1 Mpps (4100179) 3. 4 ns Dispatcher: 1. 30.4 Mpps (30439241) 2. 4.1 Mpps (4109350) 3. 4 ns Dispatcher (full; walk all entries, and fallback): 1. 28.9 Mpps (28903269) 2. 4.1 Mpps (4080078) 3. 5 ns xdp-perf runs, aligned vs non-aligned jump targets -------------------------------------------------- In this test dispatchers of different sizes, with and without jump target alignment, were exercised. As outlined above the function lookup is performed via binary search. This means that depending on the pointer value of the function, it can reside in the upper or lower part of the search table. The performed tests were: 1. aligned, mitigations=auto, function entry < other entries 2. aligned, mitigations=auto, function entry > other entries 3. non-aligned, mitigations=auto, function entry < other entries 4. non-aligned, mitigations=auto, function entry > other entries 5. aligned, mitigations=off, function entry < other entries 6. aligned, mitigations=off, function entry > other entries 7. non-aligned, mitigations=off, function entry < other entries 8. non-aligned, mitigations=off, function entry > other entries The micro benchmarks showed that alignment of jump targets has some positive impact. A reply to this cover letter will contain complete data for all runs. 
Multiple xdp-perf baseline with mitigations=auto ------------------------------------------------ Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs): 16.69 msec task-clock # 0.984 CPUs utilized ( +- 0.08% ) 2 context-switches # 0.123 K/sec ( +- 1.11% ) 0 cpu-migrations # 0.000 K/sec ( +- 70.68% ) 97 page-faults # 0.006 M/sec ( +- 0.05% ) 49,254,635 cycles # 2.951 GHz ( +- 0.09% ) (12.28%) 42,138,558 instructions # 0.86 insn per cycle ( +- 0.02% ) (36.15%) 7,315,291 branches # 438.300 M/sec ( +- 0.01% ) (59.43%) 1,011,201 branch-misses # 13.82% of all branches ( +- 0.01% ) (83.31%) 15,440,788 L1-dcache-loads # 925.143 M/sec ( +- 0.00% ) (99.40%) 39,067 L1-dcache-load-misses # 0.25% of all L1-dcache hits ( +- 0.04% ) 6,531 LLC-loads # 0.391 M/sec ( +- 0.05% ) 442 LLC-load-misses # 6.76% of all LL-cache hits ( +- 0.77% ) <not supported> L1-icache-loads 57,964 L1-icache-load-misses ( +- 0.06% ) 15,442,496 dTLB-loads # 925.246 M/sec ( +- 0.00% ) 514 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 0.73% ) (40.57%) 130 iTLB-loads # 0.008 M/sec ( +- 2.75% ) (16.69%) <not counted> iTLB-load-misses ( +- 8.71% ) (0.60%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 0.0169558 +- 0.0000127 seconds time elapsed ( +- 0.07% ) Multiple xdp-perf dispatcher with mitigations=auto -------------------------------------------------- Note that this includes generating the dispatcher. 
Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs): 4.80 msec task-clock # 0.953 CPUs utilized ( +- 0.06% ) 1 context-switches # 0.258 K/sec ( +- 1.57% ) 0 cpu-migrations # 0.000 K/sec 97 page-faults # 0.020 M/sec ( +- 0.05% ) 14,185,861 cycles # 2.955 GHz ( +- 0.17% ) (50.49%) 45,691,935 instructions # 3.22 insn per cycle ( +- 0.01% ) (99.19%) 8,346,008 branches # 1738.709 M/sec ( +- 0.00% ) 13,046 branch-misses # 0.16% of all branches ( +- 0.10% ) 15,443,735 L1-dcache-loads # 3217.365 M/sec ( +- 0.00% ) 39,585 L1-dcache-load-misses # 0.26% of all L1-dcache hits ( +- 0.05% ) 7,138 LLC-loads # 1.487 M/sec ( +- 0.06% ) 671 LLC-load-misses # 9.40% of all LL-cache hits ( +- 0.73% ) <not supported> L1-icache-loads 56,213 L1-icache-load-misses ( +- 0.08% ) 15,443,735 dTLB-loads # 3217.365 M/sec ( +- 0.00% ) <not counted> dTLB-load-misses (0.00%) <not counted> iTLB-loads (0.00%) <not counted> iTLB-load-misses (0.00%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 0.00503705 +- 0.00000546 seconds time elapsed ( +- 0.11% ) Revisions ========= v4->v5: [1] * Fixed s/xdp_ctx/ctx/ type-o (Toke) * Marked dispatcher trampoline with noinline attribute (Alexei) v3->v4: [2] * Moved away from doing dispatcher lookup based on the trampoline function, to a model where the dispatcher instance is explicitly passed to the bpf_dispatcher_change_prog() (Alexei) v2->v3: [3] * Removed xdp_call, and instead make the dispatcher available to all XDP users via bpf_prog_run_xdp() and dev_xdp_install(). 
(Toke) * Always enable the dispatcher, if available (Alexei) * Reuse BPF trampoline image allocator (Alexei) * Make sure the dispatcher is exercised in selftests (Alexei) * Only allow one dispatcher, and wire it to XDP v1->v2: [4] * Fixed i386 build warning (kbuild robot) * Made bpf_dispatcher_lookup() static (kbuild robot) * Make sure xdp_call.h is only enabled for builtins * Add xdp_call() to ixgbe, mlx4, and mlx5 RFC->v1: [5] * Improved error handling (Edward and Andrii) * Explicit cleanup (Andrii) * Use 32B with sext cmp (Alexei) * Align jump targets to 16B (Alexei) * 4 to 16 entries (Toke) * Added stats to xdp_call_run() [1] https://lore.kernel.org/bpf/20191211123017.13212-1-bjorn.topel@gmail.com/ [2] https://lore.kernel.org/bpf/20191209135522.16576-1-bjorn.topel@gmail.com/ [3] https://lore.kernel.org/bpf/20191123071226.6501-1-bjorn.topel@gmail.com/ [4] https://lore.kernel.org/bpf/20191119160757.27714-1-bjorn.topel@gmail.com/ [5] https://lore.kernel.org/bpf/20191113204737.31623-1-bjorn.topel@gmail.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-12-13 13:15:40 -08:00
Björn Töpel	116eb788f5	bpf, x86: Align dispatcher branch targets to 16B From Intel 64 and IA-32 Architectures Optimization Reference Manual, 3.4.1.4 Code Alignment, Assembly/Compiler Coding Rule 11: All branch targets should be 16-byte aligned. This commit aligns branch targets according to the Intel manual. The nops used to align branch targets make the dispatcher larger, and therefore the number of supported dispatch points/programs is decreased from 64 to 48. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191213175112.30208-7-bjorn.topel@gmail.com	2019-12-13 13:09:32 -08:00
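The padding rule described in the commit above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the JIT code from the patch; `pad_to_16()` and `align_up_16()` are hypothetical helper names.

```c
#include <stdint.h>

/* Hypothetical helper: number of nop bytes needed so that the next
 * instruction starts on a 16-byte boundary. Unsigned negation is well
 * defined, so (-addr & 15) is the distance to the next multiple of 16. */
static inline unsigned int pad_to_16(uint64_t addr)
{
	return (unsigned int)(-addr & 15);
}

/* Hypothetical helper: address rounded up to the next 16-byte boundary. */
static inline uint64_t align_up_16(uint64_t addr)
{
	return (addr + 15) & ~(uint64_t)15;
}
```

Applied to the 16-entry listing in the merge message, the code ending at 0xffffffffc002003e needs 2 bytes of padding (the `xchg ax,ax`), and the code ending at 0xffffffffc0020052 needs 14 bytes (an 8-byte nop followed by a 6-byte nop).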
Björn Töpel	e754f5a6e3	selftests: bpf: Add xdp_perf test The xdp_perf is a dummy XDP test, only used to measure the cost of jumping into a naive XDP program one million times. To build and run the program: $ cd tools/testing/selftests/bpf $ make $ ./test_progs -v -t xdp_perf Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191213175112.30208-6-bjorn.topel@gmail.com	2019-12-13 13:09:32 -08:00
Björn Töpel	f23c4b3924	bpf: Start using the BPF dispatcher in BPF_TEST_RUN In order to properly exercise the BPF dispatcher, this commit adds BPF dispatcher usage to BPF_TEST_RUN when executing XDP programs. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191213175112.30208-5-bjorn.topel@gmail.com	2019-12-13 13:09:32 -08:00
Björn Töpel	7e6897f959	bpf, xdp: Start using the BPF dispatcher for XDP This commit adds a BPF dispatcher for XDP. The dispatcher is updated from the XDP control-path, dev_xdp_install(), and used when an XDP program is run via bpf_prog_run_xdp(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191213175112.30208-4-bjorn.topel@gmail.com	2019-12-13 13:09:32 -08:00
Björn Töpel	75ccbef636	bpf: Introduce BPF dispatcher The BPF dispatcher is a multi-way branch code generator, mainly targeted for XDP programs. When an XDP program is executed via the bpf_prog_run_xdp(), it is invoked via an indirect call. The indirect call has a substantial performance impact, when retpolines are enabled. The dispatcher transforms indirect calls to direct calls, and therefore avoids the retpoline. The dispatcher is generated using the BPF JIT, and relies on text poking provided by bpf_arch_text_poke(). The dispatcher hijacks a trampoline function via the __fentry__ nop of the trampoline. One dispatcher instance currently supports up to 64 dispatch points. A user creates a dispatcher with its corresponding trampoline with the DEFINE_BPF_DISPATCHER macro. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191213175112.30208-3-bjorn.topel@gmail.com	2019-12-13 13:09:32 -08:00
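The lookup the dispatcher performs can be modelled in plain C as a binary search over a sorted table of program addresses, with a fallback to an ordinary indirect call on a miss. This is a user-space sketch of the idea only. The real dispatcher is JIT-generated machine code that branches directly to the target, whereas this model still calls through a pointer. All names here (`prog_fn`, `struct dispatcher`, `dispatcher_add()`, `dispatch()`, `prog_a` and friends) are hypothetical, and the table size of 48 simply mirrors the limit quoted in the merge message.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 48	/* assumption: mirrors the limit quoted in the merge message */

/* Hypothetical stand-in for a BPF program entry point. */
typedef unsigned int (*prog_fn)(const void *ctx);

/* Toy programs used only to populate the table. */
static unsigned int prog_a(const void *ctx) { (void)ctx; return 1; }
static unsigned int prog_b(const void *ctx) { (void)ctx; return 2; }
static unsigned int prog_c(const void *ctx) { (void)ctx; return 3; }

struct dispatcher {
	prog_fn progs[MAX_ENTRIES];	/* kept sorted by address */
	size_t num;
};

/* Insert fn in address order; users of the same program share one entry.
 * Casting a function pointer to uintptr_t is implementation-defined but
 * works on common platforms. Returns 0 on success, -1 if the table is full. */
static int dispatcher_add(struct dispatcher *d, prog_fn fn)
{
	uintptr_t key = (uintptr_t)fn;
	size_t i = 0;

	while (i < d->num && (uintptr_t)d->progs[i] < key)
		i++;
	if (i < d->num && d->progs[i] == fn)
		return 0;
	if (d->num == MAX_ENTRIES)
		return -1;
	for (size_t j = d->num; j > i; j--)
		d->progs[j] = d->progs[j - 1];
	d->progs[i] = fn;
	d->num++;
	return 0;
}

/* Binary search, O(log n). *hit reports whether fn was found in the table;
 * on a miss the call falls back to a plain indirect call. */
static unsigned int dispatch(const struct dispatcher *d, prog_fn fn,
			     const void *ctx, int *hit)
{
	uintptr_t key = (uintptr_t)fn;
	size_t lo = 0, hi = d->num;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uintptr_t cur = (uintptr_t)d->progs[mid];

		if (cur == key) {
			*hit = 1;
			return d->progs[mid](ctx);
		}
		if (cur < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	*hit = 0;
	return fn(ctx);
}
```

A caller would register programs with `dispatcher_add()` on attach and run them through `dispatch()`. In this model, a program that was never registered still runs, just via the slower fallback path.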
Björn Töpel	98e8627efc	bpf: Move trampoline JIT image allocation to a function Refactor the image allocation in the BPF trampoline code into a separate function, so it can be shared with the BPF dispatcher in upcoming commits. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191213175112.30208-2-bjorn.topel@gmail.com	2019-12-13 13:09:32 -08:00
Andrii Nakryiko	91cbdf740a	selftests/bpf: Fix perf_buffer test on systems w/ offline CPUs Fix up perf_buffer.c selftest to take into account offline/missing CPUs. Fixes: `ee5cf82ce0` ("selftests/bpf: test perf buffer API") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212013621.1691858-1-andriin@fb.com	2019-12-13 13:00:25 -08:00
Andrii Nakryiko	783b8f01f5	libbpf: Don't attach perf_buffer to offline/missing CPUs It's quite common on some systems to have more CPUs enlisted as "possible", than there are (and could ever be) present/online CPUs. In such cases, perf_buffer creation will fail due to inability to create perf event on missing CPU with an error like this: libbpf: failed to open perf buffer event on cpu #16: No such device This patch fixes the logic of perf_buffer__new() to ignore CPUs that are missing or currently offline. In rare cases where user explicitly listed specific CPUs to connect to, behavior is unchanged: libbpf will try to open perf event buffer on specified CPU(s) anyways. Fixes: `fb84b82246` ("libbpf: add perf buffer API") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212013609.1691168-1-andriin@fb.com	2019-12-13 13:00:09 -08:00
Andrii Nakryiko	65bc4c4063	selftests/bpf: Add CPU mask parsing tests Add a bunch of tests validating CPU mask parsing logic and error handling. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212013559.1690898-1-andriin@fb.com	2019-12-13 12:59:55 -08:00
Andrii Nakryiko	6803ee25f0	libbpf: Extract and generalize CPU mask parsing logic This logic is re-used for parsing a set of online CPUs. Having it as an isolated piece of code working with input string makes it convenient to test this logic as well. While refactoring, also improve the robustness of original implementation. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212013548.1690564-1-andriin@fb.com	2019-12-13 12:58:51 -08:00
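A CPU list such as the one the kernel exposes in /sys/devices/system/cpu/online (for example `0-3,5`) can be parsed from a string in isolation, which is what makes this kind of logic easy to test. The following is a minimal sketch under that assumption; `parse_cpu_list()` is a hypothetical function and not libbpf's actual implementation or signature.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a list like "0-3,5" (optionally ending in a
 * newline) into mask[0..mask_sz-1]. Returns 0 on success and -EINVAL on
 * malformed input or a CPU number that does not fit in the mask. */
static int parse_cpu_list(const char *s, bool *mask, size_t mask_sz)
{
	memset(mask, 0, mask_sz * sizeof(*mask));

	for (;;) {
		unsigned long start, stop;
		char *end;

		/* Every element must begin with a digit; this also rejects
		 * empty input and a trailing comma. */
		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		start = strtoul(s, &end, 10);
		stop = start;

		if (*end == '-') {
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -EINVAL;
			stop = strtoul(s, &end, 10);
		}
		if (stop < start || stop >= mask_sz)
			return -EINVAL;

		for (unsigned long cpu = start; cpu <= stop; cpu++)
			mask[cpu] = true;

		s = end;
		if (*s != ',')
			break;
		s++;
	}

	/* Only the end of the string or a final newline may follow. */
	if (*s != '\0' && *s != '\n')
		return -EINVAL;
	return 0;
}
```

A caller that wants to skip offline CPUs could read the sysfs file into a buffer, run it through a parser of this shape, and then only open per-CPU resources where the mask entry is true.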
Alexei Starovoitov	7708bd430d	Merge branch 'reuseport_to_test_progs' Jakub Sitnicki says: ==================== This change has been suggested by Martin Lau [0] during a review of a related patch set that extends reuseport tests [1]. Patches 1 & 2 address a warning due to unrecognized section name from libbpf when running reuseport tests. We don't want to carry this warning into test_progs. Patches 3-8 massage the reuseport tests to ease the switch to test_progs framework. The intention here is to show the work. Happy to squash these, if needed. Patches 9-10 do the actual move and conversion to test_progs. Output from a test_progs run after changes pasted below. Thanks, Jakub [0] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/T/#m607d822caeb1eb5db101172821a78cc3896ff1c3 [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/T/#m55881bae9fb6e34837d07a0c0a7ffbc138f8d06f ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-12-13 12:38:04 -08:00
Jakub Sitnicki	7ee0d4e97b	selftests/bpf: Switch reuseport tests for test_progs framework The tests were originally written in abort-on-error style. With the switch to test_progs we can no longer do that. So at the risk of not cleaning up some resource on failure, we now return to the caller on error. That said, failure inside one test should not affect others because we run setup/cleanup before/after every test. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-11-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	415bb4e125	selftests/bpf: Move reuseport tests under prog_tests/ Do a pure move to show the actual work needed to adapt the tests in the subsequent patch, at the cost of breaking the test_progs build for the moment. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-10-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	250a91d48a	selftests/bpf: Pull up printing the test name into test runner Again, prepare for switching reuseport tests to test_progs framework. test_progs framework will print the subtest name for us if we set it. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-9-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	9af6c84435	selftests/bpf: Propagate errors during setup for reuseport tests Prepare for switching reuseport tests to test_progs framework, where we don't have the luxury to terminate the process on failure. Modify setup helpers to signal failure via the return value with the help of a macro similar to the one currently in use by the tests. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-8-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	ce7cb5f392	selftests/bpf: Run reuseport tests in a loop Prepare for switching reuseport tests to test_progs framework. Loop over the tests and perform setup/cleanup for each test separately, remembering that with test_progs we can select tests to run. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-7-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	9936338258	selftests/bpf: Unroll the main loop in reuseport test Prepare for iterating over individual tests without introducing another nested loop in the main test function. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-6-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	a9ce4cf4e4	selftests/bpf: Add helpers for getting socket family & type name Having string arrays to map socket family & type to a name prevents us from unrolling the test runner loop in the subsequent patch. Introduce helpers that do the same thing. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-5-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	11f80355d4	selftests/bpf: Use sa_family_t everywhere in reuseport tests Update the only function that is not using sa_family_t in this source file. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-4-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	1fbcef929d	selftests/bpf: Let libbpf determine program type from section name Now that libbpf can recognize SK_REUSEPORT programs, we no longer have to pass a prog_type hint before loading the object file. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-3-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Jakub Sitnicki	67d69ccdf3	libbpf: Recognize SK_REUSEPORT programs from section name Allow loading BPF object files that contain SK_REUSEPORT programs without having to manually set the program type before load if the section name is set to "sk_reuseport". Makes user-space code needed to load SK_REUSEPORT BPF program more concise. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191212102259.418536-2-jakub@cloudflare.com	2019-12-13 12:38:00 -08:00
Andrii Nakryiko	679152d3a3	libbpf: Fix printf compilation warnings on ppc64le arch On ppc64le __s64 and __u64 are defined as long int and unsigned long int, respectively. This causes the compiler to emit a warning when %lld/%llu are used to printf 64-bit numbers. Fix this by casting to size_t/ssize_t with %zu and %zd format specifiers, respectively. v1->v2: - use size_t/ssize_t instead of custom typedefs (Martin). Fixes: `1f8e2bcb2c` ("libbpf: Refactor relocation handling") Fixes: `abd29c9314` ("libbpf: allow specifying map definitions using BTF") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191212171918.638010-1-andriin@fb.com	2019-12-12 13:47:24 -08:00
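The cast-and-format pattern from the commit above looks roughly like this in user-space C. It is a sketch of the technique only; `format_counts()` is a hypothetical function, and the casts assume the values fit in size_t/ssize_t (on a 32-bit target they would be truncated).

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

/* Hypothetical helper: print one unsigned and one signed 64-bit value
 * without depending on whether the 64-bit types are 'long' or 'long long'
 * on the target. %zu and %zd always match size_t and ssize_t. */
static int format_counts(char *buf, size_t len, uint64_t u, int64_t s)
{
	return snprintf(buf, len, "%zu %zd", (size_t)u, (ssize_t)s);
}
```

The same idea applies to any printf-style call: the format specifier is tied to the cast type rather than to the architecture-dependent definition of the 64-bit typedef.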
Daniel Borkmann	81c22041d9	bpf, x86, arm64: Enable jit by default when not built as always-on After Spectre 2 fix via `290af86629` ("bpf: introduce BPF_JIT_ALWAYS_ON config") most major distros use BPF_JIT_ALWAYS_ON configuration these days which compiles out the BPF interpreter entirely and always enables the JIT. Also given recent fix in `e1608f3fa8` ("bpf: Avoid setting bpf insns pages read-only when prog is jited"), we additionally avoid fragmenting the direct map for the BPF insns pages sitting in the general data heap since they are not used during execution. Latter is only needed when run through the interpreter. Since both x86 and arm64 JITs have seen a lot of exposure over the years, are generally most up to date and maintained, there is more downside in !BPF_JIT_ALWAYS_ON configurations to have the interpreter enabled by default rather than the JIT. Add an ARCH_WANT_DEFAULT_BPF_JIT config which archs can use to set the bpf_jit_{enable,kallsyms} to 1. Back in the day the bpf_jit_kallsyms knob was set to 0 by default since major distros still had /proc/kallsyms addresses exposed to unprivileged user space which is not the case anymore. Hence both knobs are set via BPF_JIT_DEFAULT_ON which is set to 'y' in case of BPF_JIT_ALWAYS_ON or ARCH_WANT_DEFAULT_BPF_JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/f78ad24795c2966efcc2ee19025fa3459f622185.1575903816.git.daniel@iogearbox.net	2019-12-11 16:16:01 -08:00
Daniel Borkmann	bae141f54b	bpf: Emit audit messages upon successful prog load and unload Allow for audit messages to be emitted upon BPF program load and unload for having a timeline of events. The load itself is in syscall context, so additional info about the process initiating the BPF prog creation can be logged and later directly correlated to the unload event. The only info really needed from BPF side is the globally unique prog ID where then audit user space tooling can query / dump all info needed about the specific BPF program right upon load event and enrich the record, thus these changes needed here can be kept small and non-intrusive to the core. Raw example output: # auditctl -D # auditctl -a always,exit -F arch=x86_64 -S bpf # ausearch --start recent -m 1334 ... ---- time->Wed Nov 27 16:04:13 2019 type=PROCTITLE msg=audit(1574867053.120:84664): proctitle="./bpf" type=SYSCALL msg=audit(1574867053.120:84664): arch=c000003e syscall=321 \ success=yes exit=3 a0=5 a1=7ffea484fbe0 a2=70 a3=0 items=0 ppid=7477 \ pid=12698 auid=1001 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001 \ egid=1001 sgid=1001 fsgid=1001 tty=pts2 ses=4 comm="bpf" \ exe="/home/jolsa/auditd/audit-testsuite/tests/bpf/bpf" \ subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=UNKNOWN[1334] msg=audit(1574867053.120:84664): prog-id=76 op=LOAD ---- time->Wed Nov 27 16:04:13 2019 type=UNKNOWN[1334] msg=audit(1574867053.120:84665): prog-id=76 op=UNLOAD ... Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/bpf/20191206214934.11319-1-jolsa@kernel.org	2019-12-11 17:41:09 +01:00
Stanislav Fomichev	b590cb5f80	bpf: Switch to offsetofend in BPF_PROG_TEST_RUN Switch existing pattern of "offsetof(..., member) + FIELD_SIZEOF(..., member)" to "offsetofend(..., member)" which does exactly what we need without all the copy-paste. Suggested-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191210191933.105321-1-sdf@google.com	2019-12-11 14:52:18 +01:00
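The equivalence the commit relies on can be shown with a user-space re-creation of the macro. `OFFSETOFEND`, `struct demo_attr` and `buffer_covers()` below are hypothetical names made up for this sketch; the kernel's own `offsetofend()` lives in its headers and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space sketch of the idea: the offset of the first byte past a
 * member, i.e. offsetof(member) plus the size of that member. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical structure used only for illustration. */
struct demo_attr {
	uint32_t a;
	uint32_t b;
	uint64_t c;
};

/* Hypothetical check: does a caller-supplied buffer of user_len bytes
 * fully contain the member that ends at offset 'end'? */
static int buffer_covers(size_t user_len, size_t end)
{
	return user_len >= end;
}
```

With this layout, `OFFSETOFEND(struct demo_attr, a)` equals `offsetof(struct demo_attr, b)`, and a caller that passes only the first 8 bytes covers `a` and `b` but not `c`.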
Andrii Nakryiko	09c4708d3c	libbpf: Bump libbpf current version to v0.0.7 New development cycle starts, bump to v0.0.7 proactively. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191209224022.3544519-1-andriin@fb.com	2019-12-11 14:50:37 +01:00
Russell King	c453312857	ARM: net: bpf: Improve prologue code sequence Improve the prologue code sequence to be able to take advantage of 64-bit stores, changing the code from: push {r4, r5, r6, r7, r8, r9, fp, lr} mov fp, sp sub ip, sp, #80 ; 0x50 sub sp, sp, #600 ; 0x258 str ip, [fp, #-100] ; 0xffffff9c mov r6, #0 str r6, [fp, #-96] ; 0xffffffa0 mov r4, #0 mov r3, r4 mov r2, r0 str r4, [fp, #-104] ; 0xffffff98 str r4, [fp, #-108] ; 0xffffff94 to the tighter: push {r4, r5, r6, r7, r8, r9, fp, lr} mov fp, sp mov r3, #0 sub r2, sp, #80 ; 0x50 sub sp, sp, #600 ; 0x258 strd r2, [fp, #-100] ; 0xffffff9c mov r2, #0 strd r2, [fp, #-108] ; 0xffffff94 mov r2, r0 resulting in a saving of three instructions. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/E1ieH2g-0004ih-Rb@rmk-PC.armlinux.org.uk	2019-12-11 14:34:26 +01:00
Shahjada Abul Husain	c219399988	cxgb4: add support for high priority filters T6 has a separate region known as high priority filter region that allows classifying packets going through ULD path. So, query firmware for HPFILTER resources and enable the high priority offload filter support when it is available. Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:52:41 -08:00
Chen Wandun	6525b5ef65	enetc: remove variable 'tc_max_sized_frame' set but not used Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/freescale/enetc/enetc_qos.c: In function enetc_setup_tc_cbs: drivers/net/ethernet/freescale/enetc/enetc_qos.c:195:6: warning: variable tc_max_sized_frame set but not used [-Wunused-but-set-variable] Fixes: `c431047c4e` ("enetc: add support Credit Based Shaper(CBS) for hardware offload") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:47:23 -08:00
Jakub Kicinski	ca866ee825	nfp: add support for TLV device stats Device stats are currently hard coded in the PCI BAR0 layout. Add the ability to read them from the TLV area instead. Names for the stats are maintained by the driver, and their meaning documented. This allows us to more easily add and remove device stats. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:34:43 -08:00
Kuniyuki Iwashima	5000b28b0b	tcp: Cleanup duplicate initialization of sk->sk_state. When a TCP socket is created, sk->sk_state is initialized twice as TCP_CLOSE in sock_init_data() and tcp_init_sock(). The tcp_init_sock() is always called after the sock_init_data(), so it is not necessary to update sk->sk_state in the tcp_init_sock(). Before v2.1.8, the code of the two functions was in the inet_create(). In the patch of v2.1.8, the tcp_v4/v6_init_sock() were added and the code of initialization of sk->state was duplicated. Signed-off-by: Kuniyuki Iwashima <kuni1840@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:33:29 -08:00
Michael Walle	4caefbce06	enetc: add software timestamping Provide a software TX timestamp and add it to the ethtool query interface. skb_tx_timestamp() is also needed if one would like to use PHY timestamping. Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:32:06 -08:00
David S. Miller	bb9d8454bb	Merge branch 'tipc-introduce-variable-window-congestion-control' Jon Maloy says: ==================== tipc: introduce variable window congestion control We improve throughput greatly by introducing a variety of the Reno congestion control algorithm at the link level. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Jon Maloy	16ad3f4022	tipc: introduce variable window congestion control We introduce a simple variable window congestion control for links. The algorithm is inspired by the Reno algorithm, covering the 'slow start', 'congestion avoidance', and 'fast recovery' modes. - We introduce hard lower and upper window limits per link, still different and configurable per bearer type. - We introduce a 'slow start threshold' variable, initially set to the maximum window size. - We let a link start at the minimum congestion window, i.e. in slow start mode, and then let it grow rapidly (+1 per received ACK) until it reaches the slow start threshold and enters congestion avoidance mode. - In congestion avoidance mode we increment the congestion window for each window-size number of acked packets, up to a possible maximum equal to the configured maximum window. - For each non-duplicate NACK received, we drop back to fast recovery mode, by setting both the slow start threshold and the congestion window to (current_congestion_window / 2). - If the timeout handler finds that the transmit queue has not moved since the previous timeout, it drops the link back to slow start and forces a probe containing the last sent sequence number to be sent to the peer, so that this can discover the stale situation. This change does in reality have effect only on unicast ethernet transport, as we have seen that there is no room whatsoever for increasing the window max size for the UDP bearer. For now, we also choose to keep the limits for the broadcast link unchanged and equal. This algorithm seems to give a 50-100% throughput improvement for messages larger than MTU. Suggested-by: Xin Long <lucien.xin@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Jon Maloy	d3b09995ab	tipc: eliminate more unnecessary nacks and retransmissions When we increase the link transmit window we often observe the following scenario: 1) A STATE message bypasses a sequence of traffic packets and arrives far ahead of those to the receiver. STATE messages contain a 'peers_nxt_snt' field to indicate which was the last packet sent from the peer. This mechanism is intended as a last resort for the receiver to detect missing packets, e.g., during very low traffic when there is no packet flow to help early loss detection. 3) The receiving link compares the 'peer_nxt_snt' field to its own 'rcv_nxt', finds that there is a gap, and immediately sends a NACK message back to the peer. 4) When this NACK arrives at the sender, all the requested retransmissions are performed, since it is a first-time request. Just like in the scenario described in the previous commit this leads to many redundant retransmissions, with decreased throughput as a consequence. We fix this by adding two more conditions before we send a NACK in this situation. First, the deferred queue must be empty, so we cannot assume that the potential packet loss has already been detected by other means. Second, we check the 'peers_snd_nxt' field only in probe/ probe_reply messages, thus turning this into a true mechanism of last resort as it was really meant to be. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Jon Maloy	02288248b0	tipc: eliminate gap indicator from ACK messages When we increase the link send window we sometimes observe the following scenario: 1) A packet #N arrives out of order far ahead of a sequence of older packets which are still under way. The packet is added to the deferred queue. 2) The missing packets arrive in sequence, and for each 16th of them an ACK is sent back to the receiver, as it should be. 3) When building those ACK messages, it is checked if there is a gap between the link's 'rcv_nxt' and the first packet in the deferred queue. This is always the case until packet number #N-1 arrives, and a 'gap' indicator is added, effectively turning them into NACK messages. 4) When those NACKs arrive at the sender, all the requested retransmissions are done, since it is a first-time request. This sometimes leads to a huge amount of redundant retransmissions, causing a drop in max throughput. This problem gets worse when we in a later commit introduce variable window congestion control, since it drops the link back to 'fast recovery' much more often than necessary. We now fix this by not sending any 'gap' indicator in regular ACK messages. We already have a mechanism for sending explicit NACKs in place, and this is sufficient to keep up the packet flow. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Nathan Chancellor	08cbc75f96	ppp: Adjust indentation into ppp_async_input Clang warns: ../drivers/net/ppp/ppp_async.c:877:6: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] ap->rpkt = skb; ^ ../drivers/net/ppp/ppp_async.c:875:5: note: previous statement is here if (!skb) ^ 1 warning generated. This warning occurs because there is a space before the tab on this line. Clean up this entire block's indentation so that it is consistent with the Linux kernel coding style and clang no longer warns. Fixes: `6722e78c90` ("[PPP]: handle misaligned accesses") Link: https://github.com/ClangBuiltLinux/linux/issues/800 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:32:40 -08:00
Nathan Chancellor	5c61e22300	net: smc911x: Adjust indentation in smc911x_phy_configure Clang warns: ../drivers/net/ethernet/smsc/smc911x.c:939:3: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] if (!lp->ctl_rfduplx) ^ ../drivers/net/ethernet/smsc/smc911x.c:936:2: note: previous statement is here if (lp->ctl_rspeed != 100) ^ 1 warning generated. This warning occurs because there is a space after the tab on this line. Remove it so that the indentation is consistent with the Linux kernel coding style and clang no longer warns. Fixes: `0a0c72c911` ("[PATCH] RE: [PATCH 1/1] net driver: Add support for SMSC LAN911x line of ethernet chips") Link: https://github.com/ClangBuiltLinux/linux/issues/796 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:31:46 -08:00
Nathan Chancellor	fe06bf3d83	net: tulip: Adjust indentation in {dmfe, uli526x}_init_module Clang warns: ../drivers/net/ethernet/dec/tulip/uli526x.c:1812:3: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] switch (mode) { ^ ../drivers/net/ethernet/dec/tulip/uli526x.c:1809:2: note: previous statement is here if (cr6set) ^ 1 warning generated. ../drivers/net/ethernet/dec/tulip/dmfe.c:2217:3: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] switch(mode) { ^ ../drivers/net/ethernet/dec/tulip/dmfe.c:2214:2: note: previous statement is here if (cr6set) ^ 1 warning generated. This warning occurs because there is a space before the tab on these lines. Remove them so that the indentation is consistent with the Linux kernel coding style and clang no longer warns. While we are here, adjust the default block in dmfe_init_module to have a proper break between the label and assignment and add a space between the switch and opening parentheses to avoid a checkpatch warning. Fixes: `e1c3e50140` ("[PATCH] initialisation cleanup for ULI526x-net-driver") Link: https://github.com/ClangBuiltLinux/linux/issues/795 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:28:22 -08:00
David S. Miller	80bfc3b40a	Merge branch 'dp83867-fix-fifo-depth' Dan Murphy says: ==================== Fix Tx/Rx FIFO depth for DP83867 The DP83867 supports both the RGMII and SGMII modes. The Tx and Rx FIFO depths are configurable in these modes but may not be applicable for both modes. When the device is configured for RGMII mode the Tx FIFO depth is applicable and for SGMII mode both Tx and Rx FIFO depth settings are applicable. When the driver was originally written only the RGMII device was available and there were no standard fifo-depth DT properties. The patchset converts the special ti,fifo-depth property to the standard tx-fifo-depth property while still allowing the ti,fifo-depth property to be set so as to maintain backward compatibility. In addition to this change the rx-fifo-depth property support was added and only written when the device is configured for SGMII mode. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:19:10 -08:00
Dan Murphy	e02d18161e	net: phy: dp83867: Add rx-fifo-depth and tx-fifo-depth This code changes the TI specific ti,fifo-depth to the common tx-fifo-depth property. The tx depth is applicable for both RGMII and SGMII modes of operation. rx-fifo-depth was added as well but this is only applicable for SGMII mode. So in summary: in RGMII mode, write the tx fifo depth only; in SGMII mode, write both rx and tx fifo depths. If the property is not populated in the device tree then set the value to the default values. Signed-off-by: Dan Murphy <dmurphy@ti.com> Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:19:10 -08:00
Dan Murphy	96ae38af9d	dt-bindings: dp83867: Convert fifo-depth to common fifo-depth and make optional Convert the ti,fifo-depth from a TI specific property to the common tx-fifo-depth property. Also add support for the rx-fifo-depth. These are optional properties for this device and if these are not available then the fifo depths are set to device default values. Signed-off-by: Dan Murphy <dmurphy@ti.com> Reported-by: Adrian Bunk <bunk@kernel.org> CC: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:19:10 -08:00
Kevin(Yudong) Yang	65e6d90168	net-tcp: Disable TCP ssthresh metrics cache by default This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save" that disables TCP ssthresh metrics cache by default. Other parts of TCP metrics cache, e.g. rtt, cwnd, remain unchanged. As modern networks become more and more dynamic, TCP metrics cache today often causes more harm than benefits. For example, the same IP address is often shared by different subscribers behind NAT in residential networks. Even if the IP address is not shared by different users, caching the slow-start threshold of a previous short flow using loss-based congestion control (e.g. cubic) often causes the future longer flows of the same network path to exit slow-start prematurely with abysmal throughput. Caching ssthresh is very risky and can lead to terrible performance. Therefore it makes sense to disable ssthresh caching by default and let administrators opt in for specific networks. This practice also has worked well for several years of deployment with CUBIC congestion control at Google. Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:17:48 -08:00
Xin Long	4e7696d90b	sctp: get netns from asoc and ep base Commit `312434617c` ("sctp: cache netns in sctp_ep_common") set netns in asoc and ep base since they're created, and it will never change. It's a better way to get netns from asoc and ep base, comparing to calling sock_net(). This patch is to replace them. v1->v2: - no change. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 20:14:01 -08:00
Russell King	26c97a2d82	net: sfp: avoid tx-fault with Nokia GPON module The Nokia GPON module can hold tx-fault active while it is initialising which can take up to 60s. Avoid this causing the module to be declared faulty after the SFP MSA defined non-cooled module timeout. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 14:32:24 -08:00
Colin Ian King	e70ac62828	qed: remove redundant assignments to rc The variable rc is assigned with a value that is never read and it is re-assigned a new value later on. The assignment is redundant and can be removed. Clean up multiple occurrences of this pattern. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 14:28:19 -08:00
Mao Wenan	718eae277e	NFC: port100: Convert cpu_to_le16(le16_to_cpu(E1) + E2) to use le16_add_cpu(). Convert cpu_to_le16(le16_to_cpu(frame->datalen) + len) to use le16_add_cpu(), which is more concise and does the same thing. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 14:27:26 -08:00
David S. Miller	4a63ef710c	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2019-12-09 Here's the first bluetooth-next pull request for 5.6: - Devicetree bindings updates for Broadcom controllers - Add support for PCM configuration for Broadcom controllers - btusb: Fixes for Realtek devices - btusb: A few other smaller fixes (mem leak & non-atomic allocation issue) Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-09 10:49:25 -08:00
Jason A. Donenfeld	e7096c131e	net: WireGuard secure network tunnel WireGuard is a layer 3 secure networking tunnel made specifically for the kernel, that aims to be much simpler and easier to audit than IPsec. Extensive documentation and description of the protocol and considerations, along with formal proofs of the cryptography, are available at: * https://www.wireguard.com/ * https://www.wireguard.com/papers/wireguard.pdf This commit implements WireGuard as a simple network device driver, accessible in the usual RTNL way used by virtual network drivers. It makes use of the udp_tunnel APIs, GRO, GSO, NAPI, and the usual set of networking subsystem APIs. It has a somewhat novel multicore queueing system designed for maximum throughput and minimal latency of encryption operations, but it is implemented modestly using workqueues and NAPI. Configuration is done via generic Netlink, and following a review from the Netlink maintainer a year ago, several high profile userspace tools have already implemented the API. This commit also comes with several different tests, both in-kernel tests and out-of-kernel tests based on network namespaces, taking profit of the fact that sockets used by WireGuard intentionally stay in the namespace the WireGuard interface was originally created, exactly like the semantics of userspace tun devices. See wireguard.com/netns/ for pictures and examples. The source code is fairly short, but rather than combining everything into a single file, WireGuard is developed as cleanly separable files, making auditing and comprehension easier. Things are laid out as follows: * noise.[ch], cookie.[ch], messages.h: These implement the bulk of the cryptographic aspects of the protocol, and are mostly data-only in nature, taking in buffers of bytes and spitting out buffers of bytes. They also handle reference counting for their various shared pieces of data, like keys and key lists. 
* ratelimiter.[ch]: Used as an integral part of cookie.[ch] for ratelimiting certain types of cryptographic operations in accordance with particular WireGuard semantics. * allowedips.[ch], peerlookup.[ch]: The main lookup structures of WireGuard, the former being trie-like with particular semantics, an integral part of the design of the protocol, and the latter just being nice helper functions around the various hashtables we use. * device.[ch]: Implementation of functions for the netdevice and for rtnl, responsible for maintaining the life of a given interface and wiring it up to the rest of WireGuard. * peer.[ch]: Each interface has a list of peers, with helper functions available here for creation, destruction, and reference counting. * socket.[ch]: Implementation of functions related to udp_socket and the general set of kernel socket APIs, for sending and receiving ciphertext UDP packets, and taking care of WireGuard-specific sticky socket routing semantics for the automatic roaming. * netlink.[ch]: Userspace API entry point for configuring WireGuard peers and devices. The API has been implemented by several userspace tools and network management utilities, and the WireGuard project distributes the basic wg(8) tool. * queueing.[ch]: Shared functions on the rx and tx paths for handling the various queues used in the multicore algorithms. * send.c: Handles encrypting outgoing packets in parallel on multiple cores, before sending them in order on a single core, via workqueues and ring buffers. Also handles sending handshake and cookie messages as part of the protocol, in parallel. * receive.c: Handles decrypting incoming packets in parallel on multiple cores, before passing them off in order to be ingested via the rest of the networking subsystem with GRO via the typical NAPI poll function. Also handles receiving handshake and cookie messages as part of the protocol, in parallel. 
* timers.[ch]: Uses the timer wheel to implement protocol particular event timeouts, and gives a set of very simple event-driven entry point functions for callers. * main.c, version.h: Initialization and deinitialization of the module. * selftest/*.h: Runtime unit tests for some of the most security sensitive functions. * tools/testing/selftests/wireguard/netns.sh: Aforementioned testing script using network namespaces. This commit aims to be as self-contained as possible, implementing WireGuard as a standalone module not needing much special handling or coordination from the network subsystem. I expect for future optimizations to the network stack to positively improve WireGuard, and vice-versa, but for the time being, this exists as intentionally standalone. We introduce a menu option for CONFIG_WIREGUARD, as well as providing a verbose debug log and self-tests via CONFIG_WIREGUARD_DEBUG. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: David Miller <davem@davemloft.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-08 17:48:42 -08:00
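
The NFC port100 commit (718eae277e) listed above replaces an open-coded cpu_to_le16(le16_to_cpu(E1) + E2) expression with le16_add_cpu(). The user-space sketch below illustrates why the two call sites are equivalent. Everything in it is a hypothetical stand-in: le16_t, le16_to_host(), host_to_le16(), le16_add_host() and struct frame model the kernel's __le16, le16_to_cpu(), cpu_to_le16(), le16_add_cpu() and the port100 frame respectively, and are not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __le16: two bytes, least
 * significant first, so the sketch behaves the same on any host. */
typedef struct { uint8_t b[2]; } le16_t;

/* Model of le16_to_cpu(): little-endian bytes -> host integer. */
static uint16_t le16_to_host(le16_t v)
{
	return (uint16_t)(v.b[0] | (v.b[1] << 8));
}

/* Model of cpu_to_le16(): host integer -> little-endian bytes. */
static le16_t host_to_le16(uint16_t x)
{
	le16_t v = { { (uint8_t)(x & 0xff), (uint8_t)(x >> 8) } };
	return v;
}

/* Model of le16_add_cpu(): add a host-order value in place. */
static void le16_add_host(le16_t *var, uint16_t val)
{
	*var = host_to_le16((uint16_t)(le16_to_host(*var) + val));
}

/* Hypothetical frame with a little-endian length field. */
struct frame { le16_t datalen; };

/* Call site before the conversion: the round trip is spelled out. */
static void grow_open_coded(struct frame *f, uint16_t len)
{
	f->datalen = host_to_le16((uint16_t)(le16_to_host(f->datalen) + len));
}

/* Call site after the conversion: the helper hides the round trip. */
static void grow_with_helper(struct frame *f, uint16_t len)
{
	le16_add_host(&f->datalen, len);
}

/* Both call sites must leave identical bytes in the field. */
static void check_equivalence(uint16_t start, uint16_t len)
{
	struct frame a, b;

	a.datalen = host_to_le16(start);
	b.datalen = host_to_le16(start);
	grow_open_coded(&a, len);
	grow_with_helper(&b, len);
	assert(a.datalen.b[0] == b.datalen.b[0]);
	assert(a.datalen.b[1] == b.datalen.b[1]);
}
```

Calling check_equivalence(0x01fe, 3) exercises a carry from the low byte into the high byte; the helper form simply keeps the conversion in one place.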


886913 Commits
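
The window rules described in the tipc commit "introduce variable window congestion control" (16ad3f4022) above can be sketched as a small user-space state machine. This is only an illustration of the prose in that commit message, not the kernel implementation: struct cc_link and the cc_* functions are hypothetical names, clamping the halved window to the lower limit is an assumption, and leaving the threshold untouched on a stalled timeout is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-link state; not the kernel's struct tipc_link. */
struct cc_link {
	uint16_t min_win;   /* hard lower window limit */
	uint16_t max_win;   /* hard upper window limit */
	uint16_t ssthresh;  /* slow start threshold */
	uint16_t window;    /* current congestion window */
	uint16_t acked;     /* ACKs counted in congestion avoidance */
};

/* Start at the minimum window with the threshold at the maximum. */
static void cc_init(struct cc_link *l, uint16_t min_win, uint16_t max_win)
{
	assert(min_win > 0 && min_win <= max_win);
	l->min_win = min_win;
	l->max_win = max_win;
	l->ssthresh = max_win;
	l->window = min_win;
	l->acked = 0;
}

/* One received ACK. */
static void cc_on_ack(struct cc_link *l)
{
	if (l->window >= l->max_win)
		return;                 /* never exceed the upper limit */
	if (l->window < l->ssthresh) {
		l->window++;            /* slow start: +1 per ACK */
		return;
	}
	/* Congestion avoidance: +1 per window-size number of ACKs. */
	if (++l->acked >= l->window) {
		l->acked = 0;
		l->window++;
	}
}

/* One non-duplicate NACK: fast recovery halves window and threshold. */
static void cc_on_nack(struct cc_link *l)
{
	uint16_t half = l->window / 2;

	if (half < l->min_win)      /* assumption: respect the lower limit */
		half = l->min_win;
	l->ssthresh = half;
	l->window = half;
	l->acked = 0;
}

/* Timeout with no progress on the transmit queue: back to slow start.
 * Assumption: the threshold is left as is; probing is not modelled. */
static void cc_on_stalled_timeout(struct cc_link *l)
{
	l->window = l->min_win;
	l->acked = 0;
}
```

With limits of 4 and 64, ten ACKs grow the window from 4 to 14, a NACK then sets both window and threshold to 7, and it takes seven further ACKs to reach 8, which shows the switch from per-ACK growth to per-window growth.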