mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-23 20:24:12 +08:00
6e95ef0258
151 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Xu Kuohai
|
87cb58aebd |
bpf, arm64: Remove garbage frame for struct_ops trampoline
The callsite layout for arm64 fentry is:
mov x9, lr
nop
When a bpf prog is attached, the nop instruction is patched to a call
to bpf trampoline:
mov x9, lr
bl <bpf trampoline>
So two return addresses are passed to bpf trampoline: the return address
for the traced function/prog, stored in x9, and the return address for
the bpf trampoline itself, stored in lr. To obtain a full and accurate
call stack, the bpf trampoline constructs two fake function frames using
x9 and lr.
However, struct_ops progs are invoked directly as function callbacks,
meaning that x9 is not set as it is in the fentry callsite. In this case,
the frame constructed using x9 is garbage. The following stack trace for
struct_ops, captured by perf sampling, illustrates this issue, where
tcp_ack+0x404 is a garbage frame:
ffffffc0801a04b4 bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid+0x98 (bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid)
ffffffc0801a228c [unknown] ([kernel.kallsyms]) // bpf trampoline
ffffffd08d362590 tcp_ack+0x798 ([kernel.kallsyms]) // caller for bpf trampoline
ffffffd08d3621fc tcp_ack+0x404 ([kernel.kallsyms]) // garbage frame
ffffffd08d36452c tcp_rcv_established+0x4ac ([kernel.kallsyms])
ffffffd08d375c58 tcp_v4_do_rcv+0x1f0 ([kernel.kallsyms])
ffffffd08d378630 tcp_v4_rcv+0xeb8 ([kernel.kallsyms])
To fix it, construct only one frame using lr for struct_ops.
The above stack trace also indicates that there is no kernel symbol for
struct_ops bpf trampoline. This will be addressed in a follow-up patch.
Fixes:
|
||
Peter Collingbourne
|
a552e2ef5f |
bpf, arm64: Fix address emission with tag-based KASAN enabled
When BPF_TRAMP_F_CALL_ORIG is enabled, the address of a bpf_tramp_image
struct on the stack is passed during the size calculation pass and
an address on the heap is passed during code generation. This may
cause a heap buffer overflow if the heap address is tagged because
emit_a64_mov_i64() will emit longer code than it did during the size
calculation pass. The same problem could occur without tag-based
KASAN if one of the 16-bit words of the stack address happened to
be all-ones during the size calculation pass. Fix the problem by
assuming the worst case (4 instructions) when calculating the size
of the bpf_tramp_image address emission.
Fixes:
|
||
Xu Kuohai
|
ddbe9ec550 |
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to indirect call. When target is within the range of direct call, BPF_CALL can be jited to direct call. For example, the following BPF_CALL call __htab_map_lookup_elem is always jited to indirect call: mov x10, #0xffffffffffff18f4 movk x10, #0x821, lsl #16 movk x10, #0x8000, lsl #32 blr x10 When the address of target __htab_map_lookup_elem is within the range of direct call, the BPF_CALL can be jited to: bl 0xfffffffffd33bc98 This patch does such jit optimization by emitting arm64 direct calls for BPF_CALL when possible, indirect calls otherwise. Without this patch, the jit works as follows. 1. First pass A. Determine jited position and size for each bpf instruction. B. Computed the jited image size. 2. Allocate jited image with size computed in step 1. 3. Second pass A. Adjust jump offset for jump instructions B. Write the final image. This works because, for a given bpf prog, regardless of where the jited image is allocated, the jited result for each instruction is fixed. The second pass differs from the first only in adjusting the jump offsets, like changing "jmp imm1" to "jmp imm2", while the position and size of the "jmp" instruction remain unchanged. Now considering whether to jit BPF_CALL to arm64 direct or indirect call instruction. The choice depends solely on the jump offset: direct call if the jump offset is within 128MB, indirect call otherwise. For a given BPF_CALL, the target address is known, so the jump offset is decided by the jited address of the BPF_CALL instruction. In other words, for a given bpf prog, the jited result for each BPF_CALL is determined by its jited address. The jited address for a BPF_CALL is the jited image address plus the total jited size of all preceding instructions. For a given bpf prog, there are clearly no BPF_CALL instructions before the first BPF_CALL instruction. Since the jited result for all other instructions other than BPF_CALL are fixed, the total jited size preceding the first BPF_CALL is also fixed. Therefore, once the jited image is allocated, the jited address for the first BPF_CALL is fixed. Now that the jited result for the first BPF_CALL is fixed, the jited results for all instructions preceding the second BPF_CALL are fixed. So the jited address and result for the second BPF_CALL are also fixed. Similarly, we can conclude that the jited addresses and results for all subsequent BPF_CALL instructions are fixed. This means that, for a given bpf prog, once the jited image is allocated, the jited address and result for all instructions, including all BPF_CALL instructions, are fixed. Based on the observation, with this patch, the jit works as follows. 1. First pass Estimate the maximum jited image size. In this pass, all BPF_CALLs are jited to arm64 indirect calls since the jump offsets are unknown because the jited image is not allocated. 2. Allocate jited image with size estimated in step 1. 3. Second pass A. Determine the jited result for each BPF_CALL. B. Determine jited address and size for each bpf instruction. 4. Third pass A. Adjust jump offset for jump instructions. B. Write the final image. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Xu Kuohai
|
5d4fa9ec56 |
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making the jited result looks a bit too compliated. For example, for an empty prog, the jited result is: 0: bti jc 4: mov x9, lr 8: nop c: paciasp 10: stp fp, lr, [sp, #-16]! 14: mov fp, sp 18: stp x19, x20, [sp, #-16]! 1c: stp x21, x22, [sp, #-16]! 20: stp x26, x25, [sp, #-16]! 24: mov x26, #0 28: stp x26, x25, [sp, #-16]! 2c: mov x26, sp 30: stp x27, x28, [sp, #-16]! 34: mov x25, sp 38: bti j // tailcall target 3c: sub sp, sp, #0 40: mov x7, #0 44: add sp, sp, #0 48: ldp x27, x28, [sp], #16 4c: ldp x26, x25, [sp], #16 50: ldp x26, x25, [sp], #16 54: ldp x21, x22, [sp], #16 58: ldp x19, x20, [sp], #16 5c: ldp fp, lr, [sp], #16 60: mov x0, x7 64: autiasp 68: ret Clearly, there is no need to save/restore unused callee-saved registers. This patch does this change, making the jited image to only save/restore the callee-saved registers it uses. Now the jited result of empty prog is: 0: bti jc 4: mov x9, lr 8: nop c: paciasp 10: stp fp, lr, [sp, #-16]! 14: mov fp, sp 18: stp xzr, x26, [sp, #-16]! 1c: mov x26, sp 20: bti j // tailcall target 24: mov x7, #0 28: ldp xzr, x26, [sp], #16 2c: ldp fp, lr, [sp], #16 30: mov x0, x7 34: autiasp 38: ret Since bpf prog saves/restores its own callee-saved registers as needed, to make tailcall work correctly, the caller needs to restore its saved registers before tailcall, and the callee needs to save its callee-saved registers after tailcall. This extra restoring/saving instructions increases preformance overhead. [1] provides 2 benchmarks for tailcall scenarios. Below is the perf number measured in an arm64 KVM guest. The result indicates that the performance difference before and after the patch in typical tailcall scenarios is negligible. - Before: Performance counter stats for './test_progs -t tailcalls' (5 runs): 4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% ) 574 context-switches # 133.073 /sec ( +- 1.14% ) 0 cpu-migrations # 0.000 /sec 538 page-faults # 124.727 /sec ( +- 0.57% ) 10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%) 25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%) 5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%) 2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%) TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%) # 0.21 frontend_bound ( +- 0.15% ) (61.31%) # 0.12 bad_speculation ( +- 0.08% ) (50.11%) # 0.07 backend_bound ( +- 0.16% ) (33.30%) 8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%) 468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%) 385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%) 38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%) 6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%) 1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%) 9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%) 416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%) 6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%) 66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% ) Performance counter stats for './test_progs -t flow_dissector' (5 runs): 10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% ) 603 context-switches # 55.197 /sec ( +- 1.13% ) 0 cpu-migrations # 0.000 /sec 566 page-faults # 51.810 /sec ( +- 0.42% ) 27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%) 56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%) 10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%) 3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%) TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%) # 0.27 frontend_bound ( +- 0.14% ) (61.27%) # 0.14 bad_speculation ( +- 0.19% ) (50.36%) # 0.07 backend_bound ( +- 0.42% ) (33.89%) 18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%) 13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%) 4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%) 267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%) 15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%) 2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%) 20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%) 732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%) 15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%) 152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% ) - After: Performance counter stats for './test_progs -t tailcalls' (5 runs): 4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% ) 569 context-switches # 132.982 /sec ( +- 0.58% ) 0 cpu-migrations # 0.000 /sec 539 page-faults # 125.970 /sec ( +- 0.43% ) 10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%) 25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%) 5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%) 2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%) TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%) # 0.22 frontend_bound ( +- 0.21% ) (60.83%) # 0.12 bad_speculation ( +- 0.26% ) (50.25%) # 0.06 backend_bound ( +- 0.17% ) (33.52%) 8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%) 694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%) 1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%) 96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%) 6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%) 1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%) 8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%) 438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%) 6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%) 67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% ) Performance counter stats for './test_progs -t flow_dissector' (5 runs): 10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% ) 615 context-switches # 56.173 /sec ( +- 1.65% ) 1 cpu-migrations # 0.091 /sec ( +- 31.62% ) 567 page-faults # 51.788 /sec ( +- 0.44% ) 27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%) 56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%) 10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%) 3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%) TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%) # 0.27 frontend_bound ( +- 0.09% ) (60.91%) # 0.14 bad_speculation ( +- 0.08% ) (49.85%) # 0.07 backend_bound ( +- 0.16% ) (33.33%) 18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%) 8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%) 2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%) 264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%) 15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%) 3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%) 20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%) 961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%) 15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%) 167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% ) [1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/ Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Xu Kuohai
|
bd737fcb64 |
bpf, arm64: Get rid of fpb
bpf prog accesses stack using BPF_FP as the base address and a negative
immediate number as offset. But arm64 ldr/str instructions only support
non-negative immediate number as offset. To simplify the jited result,
commit
|
||
Leon Hwang
|
66ff4d61dc |
bpf, arm64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall in
bpf2bpf feature on arm64 like the way of "bpf, x64: Fix tailcall
hierarchy".
On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
increment tail_call_cnt, too.
At the prologue of main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.
At the prologue of subprog, it pushes x26 register twice, and does not
initialize tail_call_cnt.
At the epilogue, it pops x26 twice, no matter whether it is main prog or
subprog.
Fixes:
|
||
Puranjay Mohan
|
19d3c179a3 |
bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG
When BPF_TRAMP_F_CALL_ORIG is set, the trampoline calls
__bpf_tramp_enter() and __bpf_tramp_exit() functions, passing them
the struct bpf_tramp_image *im pointer as an argument in R0.
The trampoline generation code uses emit_addr_mov_i64() to emit
instructions for moving the bpf_tramp_image address into R0, but
emit_addr_mov_i64() assumes the address to be in the vmalloc() space
and uses only 48 bits. Because bpf_tramp_image is allocated using
kzalloc(), its address can use more than 48-bits, in this case the
trampoline will pass an invalid address to __bpf_tramp_enter/exit()
causing a kernel crash.
Fix this by using emit_a64_mov_i64() in place of emit_addr_mov_i64()
as it can work with addresses that are greater than 48-bits.
Fixes:
|
||
Puranjay Mohan
|
2bb138cb20 |
bpf, arm64: Inline bpf_get_current_task/_btf() helpers
On ARM64, the pointer to task_struct is always available in the sp_el0 register and therefore the calls to bpf_get_current_task() and bpf_get_current_task_btf() can be inlined into a single MRS instruction. Here is the difference before and after this change: Before: ; struct task_struct *task = bpf_get_current_task_btf(); 54: mov x10, #0xffffffffffff7978 // #-34440 58: movk x10, #0x802b, lsl #16 5c: movk x10, #0x8000, lsl #32 60: blr x10 --------------> 0xffff8000802b7978 <+0>: mrs x0, sp_el0 64: add x7, x0, #0x0 <-------------- 0xffff8000802b797c <+4>: ret After: ; struct task_struct *task = bpf_get_current_task_btf(); 54: mrs x7, sp_el0 This shows around 1% performance improvement in artificial microbenchmark. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Xu Kuohai <xukuohai@huawei.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240619131334.4297-1-puranjay@kernel.org |
||
Rafael Passos
|
9919c5c98c |
bpf: remove unused parameter in bpf_jit_binary_pack_finalize
Fixes a compiler warning. the bpf_jit_binary_pack_finalize function was taking an extra bpf_prog parameter that went unused. This removves it and updates the callers accordingly. Signed-off-by: Rafael Passos <rafael@rcpassos.me> Link: https://lore.kernel.org/r/20240615022641.210320-2-rafael@rcpassos.me Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Linus Torvalds
|
a49468240e |
Modules changes for v6.10-rc1
Finally something fun. Mike Rapoport does some cleanup to allow us to take out module_alloc() out of modules into a new paint shedded execmem_alloc() and execmem_free() so to make emphasis these helpers are actually used outside of modules. It starts with a no-functional changes API rename / placeholders to then allow architectures to define their requirements into a new shiny struct execmem_info with ranges, and requirements for those ranges. Archs now can intitialize this execmem_info as the last part of mm_core_init() if they have to diverge from the norm. Each range is a known type clearly articulated and spelled out in enum execmem_type. Although a lot of this is major cleanup and prep work for future enhancements an immediate clear gain is we get to enable KPROBES without MODULES now. That is ultimately what motiviated to pick this work up again, now with smaller goal as concrete stepping stone. This has been sitting on linux-next for a little less than a month, a few issues were found already and fixed, in particular an odd mips boot issue. Arch folks reviewed the code too. This is ready for wider exposure and testing. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmZDHfMSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinfIwP/iFsr89v9BjWdRTqzufuHwjOxvFymWxU BbEpOppRny3CckDU9ag9hLIlUaSL1Bg56Zb+znzp5stKOoiQYMDBvjSYdfybPxW2 mRS6SClMF1ubWbzdysdp5Ld9u8T0MQPCLX+P2pKhZRGi0wjkBf5WEkTje+muJKI3 4vYkXS7bNhuTwRQ+EGfze4+AeleGdQJKDWFY00TW9mZTTBADjfHyYU5o0m9ijf5l 3V/weUznODvjVJStbIF7wEQ845Ae02LN1zXfsloIOuBMhcMju+x8IjPgPbD0KhX2 yA48q7mVWkirYp0L5GSQchtqV1GBiP0NK1xXWEpyx6EqQZ4RJCsQhlhjijoExYBR ylP4bqiGVuE3IN075X0OzGCnmOStuzwssfDmug0sMAZH/MvmOQ21WzZdet2nLMas wwJArHqZsBI9BnBlvH9ZM4Y9f1zC7iR1wULaNGwXLPx34X9PIch8Yk+RElP1kMFQ +YrjOuWPjl63pmSkrkk+Pe2eesMPcPB41M6Q2iCjDlp0iBp63LIx2XISUbTf0ljM EsI4ZQseYpx+BmC7AuQfmXvEOjuXII9z072/artVWcB2u/87ixIprnqZVhcs/spy 73DnXB4ufor2PCCC5Xrb/6kT6G+PzF3VwTbHQ1D+fYZ5n2qdyG+LKxgXbtxsRVTp oUg+Z/AJaCMt =Nsg4 -----END PGP SIGNATURE----- Merge tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "Finally something fun. Mike Rapoport does some cleanup to allow us to take out module_alloc() out of modules into a new paint shedded execmem_alloc() and execmem_free() so to make emphasis these helpers are actually used outside of modules. It starts with a non-functional changes API rename / placeholders to then allow architectures to define their requirements into a new shiny struct execmem_info with ranges, and requirements for those ranges. Archs now can intitialize this execmem_info as the last part of mm_core_init() if they have to diverge from the norm. Each range is a known type clearly articulated and spelled out in enum execmem_type. Although a lot of this is major cleanup and prep work for future enhancements an immediate clear gain is we get to enable KPROBES without MODULES now. That is ultimately what motiviated to pick this work up again, now with smaller goal as concrete stepping stone" * tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of kprobes: remove dependency on CONFIG_MODULES powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate x86/ftrace: enable dynamic ftrace without CONFIG_MODULES arch: make execmem setup available regardless of CONFIG_MODULES powerpc: extend execmem_params for kprobes allocations arm64: extend execmem_info for generated code allocations riscv: extend execmem_params for generated code allocations mm/execmem, arch: convert remaining overrides of module_alloc to execmem mm/execmem, arch: convert simple overrides of module_alloc to execmem mm: introduce execmem_alloc() and execmem_free() module: make module_memory_{alloc,free} more self-contained sparc: simplify module_alloc() nios2: define virtual address space for modules mips: module: rename MODULE_START to MODULES_VADDR arm64: module: remove unneeded call to kasan_alloc_module_shadow() kallsyms: replace deprecated strncpy with strscpy module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree. |
||
Mike Rapoport (IBM)
|
e2effa2235 |
arm64: extend execmem_info for generated code allocations
The memory allocations for kprobes and BPF on arm64 can be placed anywhere in vmalloc address space and currently this is implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64. Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
||
Jakub Kicinski
|
6e62702feb |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3 Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk= =5Dk7 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-13 We've added 119 non-merge commits during the last 14 day(s) which contain a total of 134 files changed, 9462 insertions(+), 4742 deletions(-). The main changes are: 1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi. 2) Add BPF range computation improvements to the verifier in particular around XOR and OR operators, refactoring of checks for range computation and relaxing MUL range computation so that src_reg can also be an unknown scalar, from Cupertino Miranda. 3) Add support to attach kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. Session mode is a common use-case for tetragon and bpftrace, from Jiri Olsa. 4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf as well as BPF selftest's struct_ops handling, from Andrii Nakryiko. 5) Improvements to BPF selftests in context of BPF gcc backend, from Jose E. Marchesi & David Faust. 6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test- -style in order to retire the old test, run it in BPF CI and additionally expand test coverage, from Jordan Rife. 7) Big batch for BPF selftest refactoring in order to remove duplicate code around common network helpers, from Geliang Tang. 8) Another batch of improvements to BPF selftests to retire obsolete bpf_tcp_helpers.h as everything is available vmlinux.h, from Martin KaFai Lau. 9) Fix BPF map tear-down to not walk the map twice on free when both timer and wq is used, from Benjamin Tissoires. 10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL, from Alexei Starovoitov. 11) Change BTF build scripts to using --btf_features for pahole v1.26+, from Alan Maguire. 12) Small improvements to BPF reusing struct_size() and krealloc_array(), from Andy Shevchenko. 13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions, from Ilya Leoshkevich. 14) Extend TCP ->cong_control() callback in order to feed in ack and flag parameters and allow write-access to tp->snd_cwnd_stamp from BPF program, from Miao Xu. 15) Add support for internal-only per-CPU instructions to inline bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs, from Puranjay Mohan. 16) Follow-up to remove the redundant ethtool.h from tooling infrastructure, from Tushar Vyavahare. 17) Extend libbpf to support "module:<function>" syntax for tracing programs, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) bpf: make list_for_each_entry portable bpf: ignore expected GCC warning in test_global_func10.c bpf: disable strict aliasing in test_global_func9.c selftests/bpf: Free strdup memory in xdp_hw_metadata selftests/bpf: Fix a few tests for GCC related warnings. bpf: avoid gcc overflow warning in test_xdp_vlan.c tools: remove redundant ethtool.h from tooling infra selftests/bpf: Expand ATTACH_REJECT tests selftests/bpf: Expand getsockname and getpeername tests sefltests/bpf: Expand sockaddr hook deny tests selftests/bpf: Expand sockaddr program return value tests selftests/bpf: Retire test_sock_addr.(c|sh) selftests/bpf: Remove redundant sendmsg test cases selftests/bpf: Migrate ATTACH_REJECT test cases selftests/bpf: Migrate expected_attach_type tests selftests/bpf: Migrate wildcard destination rewrite test selftests/bpf: Migrate sendmsg6 v4 mapped address tests selftests/bpf: Migrate sendmsg deny test cases selftests/bpf: Migrate WILDCARD_IP test selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases ... ==================== Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Puranjay Mohan
|
75fe4c0b3e |
bpf, arm64: inline bpf_get_smp_processor_id() helper
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] https://github.com/anakryiko/linux/commit/8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240502151854.9810-5-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Puranjay Mohan
|
7a4c32222b |
arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs
Support an instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF
JITs.
Since commit
|
||
Puranjay Mohan
|
e612b5c1d3 |
bpf, arm64: Add support for lse atomics in bpf_arena
When LSE atomics are available, BPF atomic instructions are implemented as single ARM64 atomic instructions, therefore it is easy to enable these in bpf_arena using the currently available exception handling setup. LL_SC atomics use loops and therefore would need more work to enable in bpf_arena. Enable LSE atomics based instructions in bpf_arena and use the bpf_jit_supports_insn() callback to reject atomics in bpf_arena if LSE atomics are not available. All atomics and arena_atomics selftests are passing: [root@ip-172-31-2-216 bpf]# ./test_progs -a atomics,arena_atomics #3/1 arena_atomics/add:OK #3/2 arena_atomics/sub:OK #3/3 arena_atomics/and:OK #3/4 arena_atomics/or:OK #3/5 arena_atomics/xor:OK #3/6 arena_atomics/cmpxchg:OK #3/7 arena_atomics/xchg:OK #3 arena_atomics:OK #10/1 atomics/add:OK #10/2 atomics/sub:OK #10/3 atomics/and:OK #10/4 atomics/or:OK #10/5 atomics/xor:OK #10/6 atomics/cmpxchg:OK #10/7 atomics/xchg:OK #10 atomics:OK Summary: 2/14 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20240426161116.441-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jakub Kicinski
|
e958da0ddb |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c |
||
Xu Kuohai
|
dc7d7447b5 |
bpf, arm64: Fix incorrect runtime stats
When __bpf_prog_enter() returns zero, the arm64 register x20 that stores
prog start time is not assigned to zero, causing incorrect runtime stats.
To fix it, assign the return value of bpf_prog_enter() to x20 register
immediately upon its return.
Fixes:
|
||
Puranjay Mohan
|
4dd31243e3 |
bpf: Add arm64 JIT support for bpf_addr_space_cast instruction.
LLVM generates bpf_addr_space_cast instruction while translating pointers between native (zero) address space and __attribute__((address_space(N))). The addr_space=0 is reserved as bpf_arena address space. rY = addr_space_cast(rX, 0, 1) is processed by the verifier and converted to normal 32-bit move: wX = wY. rY = addr_space_cast(rX, 1, 0) : used to convert a bpf arena pointer to a pointer in the userspace vma. This has to be converted by the JIT. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240325150716.4387-3-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Puranjay Mohan
|
339af577ec |
bpf: Add arm64 JIT support for PROBE_MEM32 pseudo instructions.
Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions. They are similar to PROBE_MEM instructions with the following differences: - PROBE_MEM32 supports store. - PROBE_MEM32 relies on the verifier to clear upper 32-bit of the src/dst register - PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in R28 in the prologue). Due to bpf_arena constructions such R28 + reg + off16 access is guaranteed to be within arena virtual range, so no address check at run-time. - PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When LDX faults the destination register is zeroed. To support these on arm64, we do tmp2 = R28 + src/dst reg and then use tmp2 as the new src/dst register. This allows us to reuse most of the code for normal [LDX | STX | ST]. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240325150716.4387-2-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jakub Kicinski
|
5e47fbe5ce |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Artem Savkov
|
a51cd6bf8e |
arm64: bpf: fix 32bit unconditional bswap
In case when is64 == 1 in emit(A64_REV32(is64, dst, dst), ctx) the
generated insn reverses byte order for both high and low 32-bit words,
resuling in an incorrect swap as indicated by the jit test:
[ 9757.262607] test_bpf: #312 BSWAP 16: 0x0123456789abcdef -> 0xefcd jited:1 8 PASS
[ 9757.264435] test_bpf: #313 BSWAP 32: 0x0123456789abcdef -> 0xefcdab89 jited:1 ret 1460850314 != -271733879 (0x5712ce8a != 0xefcdab89)FAIL (1 times)
[ 9757.266260] test_bpf: #314 BSWAP 64: 0x0123456789abcdef -> 0x67452301 jited:1 8 PASS
[ 9757.268000] test_bpf: #315 BSWAP 64: 0x0123456789abcdef >> 32 -> 0xefcdab89 jited:1 8 PASS
[ 9757.269686] test_bpf: #316 BSWAP 16: 0xfedcba9876543210 -> 0x1032 jited:1 8 PASS
[ 9757.271380] test_bpf: #317 BSWAP 32: 0xfedcba9876543210 -> 0x10325476 jited:1 ret -1460850316 != 271733878 (0xa8ed3174 != 0x10325476)FAIL (1 times)
[ 9757.273022] test_bpf: #318 BSWAP 64: 0xfedcba9876543210 -> 0x98badcfe jited:1 7 PASS
[ 9757.274721] test_bpf: #319 BSWAP 64: 0xfedcba9876543210 >> 32 -> 0x10325476 jited:1 9 PASS
Fix this by forcing 32bit variant of rev32.
Fixes:
|
||
Puranjay Mohan
|
114b5b3b4b |
bpf, arm64: fix bug in BPF_LDX_MEMSX
A64_LDRSW() takes three registers: Xt, Xn, Xm as arguments and it loads
and sign extends the value at address Xn + Xm into register Xt.
Currently, the offset is being directly used in place of the tmp
register which has the offset already loaded by the last emitted
instruction.
This will cause JIT failures. The easiest way to reproduce this is to
test the following code through test_bpf module:
{
"BPF_LDX_MEMSX | BPF_W",
.u.insns_int = {
BPF_LD_IMM64(R1, 0x00000000deadbeefULL),
BPF_LD_IMM64(R2, 0xffffffffdeadbeefULL),
BPF_STX_MEM(BPF_DW, R10, R1, -7),
BPF_LDX_MEMSX(BPF_W, R0, R10, -7),
BPF_JMP_REG(BPF_JNE, R0, R2, 1),
BPF_ALU64_IMM(BPF_MOV, R0, 0),
BPF_EXIT_INSN(),
},
INTERNAL,
{ },
{ { 0, 0 } },
.stack_depth = 7,
},
We need to use the offset as -7 to trigger this code path, there could
be other valid ways to trigger this from proper BPF programs as well.
This code is rejected by the JIT because -7 is passed to A64_LDRSW() but
it expects a valid register (0 - 31).
roott@pjy:~# modprobe test_bpf test_name="BPF_LDX_MEMSX | BPF_W"
[11300.490371] test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
[11300.491750] test_bpf: #345 BPF_LDX_MEMSX | BPF_W
[11300.493179] aarch64_insn_encode_register: unknown register encoding -7
[11300.494133] aarch64_insn_encode_register: unknown register encoding -7
[11300.495292] FAIL to select_runtime err=-524
[11300.496804] test_bpf: Summary: 0 PASSED, 1 FAILED, [0/0 JIT'ed]
modprobe: ERROR: could not insert 'test_bpf': Invalid argument
Applying this patch fixes the issue.
root@pjy:~# modprobe test_bpf test_name="BPF_LDX_MEMSX | BPF_W"
[ 292.837436] test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
[ 292.839416] test_bpf: #345 BPF_LDX_MEMSX | BPF_W jited:1 156 PASS
[ 292.844794] test_bpf: Summary: 1 PASSED, 0 FAILED, [1/1 JIT'ed]
Fixes:
|
||
Christophe Leroy
|
c733239f8f |
bpf: Check return from set_memory_rox()
arch_protect_bpf_trampoline() and alloc_new_pack() call set_memory_rox() which can fail, leading to unprotected memory. Take into account return from set_memory_rox() function and add __must_check flag to arch_protect_bpf_trampoline(). Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.1710574353.git.christophe.leroy@csgroup.eu Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> |
||
Christophe Leroy
|
e3362acd79 |
bpf: Remove arch_unprotect_bpf_trampoline()
Last user of arch_unprotect_bpf_trampoline() was removed by commit |
||
Puranjay Mohan
|
96b0f5addc |
arm64, bpf: Use bpf_prog_pack for arm64 bpf trampoline
We used bpf_prog_pack to aggregate bpf programs into huge page to relieve the iTLB pressure on the system. This was merged for ARM64[1] We can apply it to bpf trampoline as well. This would increase the preformance of fentry and struct_ops programs. [1] https://lore.kernel.org/bpf/20240228141824.119877-1-puranjay12@gmail.com/ Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Message-ID: <20240304202803.31400-1-puranjay12@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Puranjay Mohan
|
1dad391dae |
bpf, arm64: use bpf_prog_pack for memory management
Use bpf_jit_binary_pack_alloc for memory management of JIT binaries in ARM64 BPF JIT. The bpf_jit_binary_pack_alloc creates a pair of RW and RX buffers. The JIT writes the program into the RW buffer. When the JIT is done, the program is copied to the final RX buffer with bpf_jit_binary_pack_finalize. Implement bpf_arch_text_copy() and bpf_arch_text_invalidate() for ARM64 JIT as these functions are required by bpf_jit_binary_pack allocator. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20240228141824.119877-3-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Puranjay Mohan
|
22fc0e80ae |
bpf, arm64: support exceptions
The prologue generation code has been modified to make the callback program use the stack of the program marked as exception boundary where callee-saved registers are already pushed. As the bpf_throw function never returns, if it clobbers any callee-saved registers, they would remain clobbered. So, the prologue of the exception-boundary program is modified to push R23 and R24 as well, which the callback will then recover in its epilogue. The Procedure Call Standard for the Arm 64-bit Architecture[1] states that registers r19 to r28 should be saved by the callee. BPF programs on ARM64 already save all callee-saved registers except r23 and r24. This patch adds an instruction in prologue of the program to save these two registers and another instruction in the epilogue to recover them. These extra instructions are only added if bpf_throw() is used. Otherwise the emitted prologue/epilogue remains unchanged. [1] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240201125225.72796-3-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Hou Tao
|
18a45f12d7 |
bpf, arm64: Enable the inline of bpf_kptr_xchg()
ARM64 bpf jit satisfies the following two conditions: 1) support BPF_XCHG() on pointer-sized word. 2) the implementation of xchg is the same as atomic_xchg() on pointer-sized words. Both of these two functions use arch_xchg() to implement the exchange. So enable the inline of bpf_kptr_xchg() for arm64 bpf jit. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240119102529.99581-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Song Liu
|
26ef208c20 |
bpf: Use arch_bpf_trampoline_size
Instead of blindly allocating PAGE_SIZE for each trampoline, check the size of the trampoline with arch_bpf_trampoline_size(). This size is saved in bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback arch_alloc_bpf_trampoline() still allocates a whole page because we need to use set_memory_* to protect the memory. struct_ops trampoline still uses a whole page for multiple trampolines. With this size check at caller (regular trampoline and struct_ops trampoline), remove arch_bpf_trampoline_size() from arch_prepare_bpf_trampoline() in archs. Also, update bpf_image_ksym_add() to handle symbol of different sizes. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Song Liu
|
96d1b7c081 |
bpf: Add arch_bpf_trampoline_size()
This helper will be used to calculate the size of the trampoline before allocating the memory. arch_prepare_bpf_trampoline() for arm64 and riscv64 can use arch_bpf_trampoline_size() to check the trampoline fits in the image. OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process twice, so it cannot use arch_bpf_trampoline_size(). Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Song Liu
|
7a3d9a159b |
bpf: Adjust argument names of arch_prepare_bpf_trampoline()
We are using "im" for "struct bpf_tramp_image" and "tr" for "struct bpf_trampoline" in most of the code base. The only exception is the prototype and fallback version of arch_prepare_bpf_trampoline(). Update them to match the rest of the code base. We mix "orig_call" and "func_addr" for the argument in different versions of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Kumar Kartikeya Dwivedi
|
9af27da631 |
bpf: Use bpf_is_subprog to check for subprogs
We would like to know whether a bpf_prog corresponds to the main prog or one of the subprogs. The current JIT implementations simply check this using the func_idx in bpf_prog->aux->func_idx. When the index is 0, it belongs to the main program, otherwise it corresponds to some subprogram. This will also be necessary to halt exception propagation while walking the stack when an exception is thrown, so we add a simple helper function to check this, named bpf_is_subprog, and convert existing JIT implementations to also make use of it. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Xu Kuohai
|
68b18191fe |
bpf, arm64: Support signed div/mod instructions
Add JIT for signed div/mod instructions. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Florent Revest <revest@chromium.org> Acked-by: Florent Revest <revest@chromium.org> Link: https://lore.kernel.org/bpf/20230815154158.717901-7-xukuohai@huaweicloud.com |
||
Xu Kuohai
|
c32b6ee514 |
bpf, arm64: Support 32-bit offset jmp instruction
Add support for 32-bit offset jmp instructions. Given the arm64 direct jump range is +-128MB, which is large enough for BPF prog, jumps beyond this range are not supported. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Florent Revest <revest@chromium.org> Acked-by: Florent Revest <revest@chromium.org> Link: https://lore.kernel.org/bpf/20230815154158.717901-6-xukuohai@huaweicloud.com |
||
Xu Kuohai
|
1104247f3f |
bpf, arm64: Support unconditional bswap
Add JIT support for unconditional bswap instructions. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Florent Revest <revest@chromium.org> Acked-by: Florent Revest <revest@chromium.org> Link: https://lore.kernel.org/bpf/20230815154158.717901-5-xukuohai@huaweicloud.com |
||
Xu Kuohai
|
bb0a1d6b49 |
bpf, arm64: Support sign-extension mov instructions
Add JIT support for BPF sign-extension mov instructions with arm64 SXTB/SXTH/SXTW instructions. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Florent Revest <revest@chromium.org> Acked-by: Florent Revest <revest@chromium.org> Link: https://lore.kernel.org/bpf/20230815154158.717901-4-xukuohai@huaweicloud.com |
||
Xu Kuohai
|
cc88f540da |
bpf, arm64: Support sign-extension load instructions
Add JIT support for sign-extension load instructions. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Florent Revest <revest@chromium.org> Acked-by: Florent Revest <revest@chromium.org> Link: https://lore.kernel.org/bpf/20230815154158.717901-3-xukuohai@huaweicloud.com |
||
Alexander Duyck
|
a3f25d614b |
bpf, arm64: Fix BTI type used for freplace attached functions
When running an freplace attached bpf program on an arm64 system w were
seeing the following issue:
Unhandled 64-bit el1h sync exception on CPU47, ESR 0x0000000036000003 -- BTI
After a bit of work to track it down I determined that what appeared to be
happening is that the 'bti c' at the start of the program was somehow being
reached after a 'br' instruction. Further digging pointed me toward the
fact that the function was attached via freplace. This in turn led me to
build_plt which I believe is invoking the long jump which is triggering
this error.
To resolve it we can replace the 'bti c' with 'bti jc' and add a comment
explaining why this has to be modified as such.
Fixes:
|
||
Florent Revest
|
90564f1e3d |
bpf, arm64: Support struct arguments in the BPF trampoline
This extends the BPF trampoline JIT to support attachment to functions that take small structures (up to 128bit) as argument. This is trivially achieved by saving/restoring a number of "argument registers" rather than a number of arguments. The AAPCS64 section 6.8.2 describes the parameter passing ABI. "Composite types" (like C structs) below 16 bytes (as enforced by the BPF verifier) are provided as part of the 8 argument registers as explained in the section C.12. Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/bpf/20230511140507.514888-1-revest@chromium.org |
||
Xu Kuohai
|
738a96c4a8 |
bpf, arm64: Fixed a BTI error on returning to patched function
When BPF_TRAMP_F_CALL_ORIG is set, BPF trampoline uses BLR to jump
back to the instruction next to call site to call the patched function.
For BTI-enabled kernel, the instruction next to call site is usually
PACIASP, in this case, it's safe to jump back with BLR. But when
the call site is not followed by a PACIASP or bti, a BTI exception
is triggered.
Here is a fault log:
Unhandled 64-bit el1h sync exception on CPU0, ESR 0x0000000034000002 -- BTI
CPU: 0 PID: 263 Comm: test_progs Tainted: GF
Hardware name: linux,dummy-virt (DT)
pstate: 40400805 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=-c)
pc : bpf_fentry_test1+0xc/0x30
lr : bpf_trampoline_6442573892_0+0x48/0x1000
sp : ffff80000c0c3a50
x29: ffff80000c0c3a90 x28: ffff0000c2e6c080 x27: 0000000000000000
x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000050
x23: 0000000000000000 x22: 0000ffffcfd2a7f0 x21: 000000000000000a
x20: 0000ffffcfd2a7f0 x19: 0000000000000000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffcfd2a7f0
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: ffff80000914f5e4 x9 : ffff8000082a1528
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0101010101010101
x5 : 0000000000000000 x4 : 00000000fffffff2 x3 : 0000000000000001
x2 : ffff8001f4b82000 x1 : 0000000000000000 x0 : 0000000000000001
Kernel panic - not syncing: Unhandled exception
CPU: 0 PID: 263 Comm: test_progs Tainted: GF
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xec/0x144
show_stack+0x24/0x7c
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
panic+0x1cc/0x3ec
__el0_error_handler_common+0x0/0x130
el1h_64_sync_handler+0x60/0xd0
el1h_64_sync+0x78/0x7c
bpf_fentry_test1+0xc/0x30
bpf_fentry_test1+0xc/0x30
bpf_prog_test_run_tracing+0xdc/0x2a0
__sys_bpf+0x438/0x22a0
__arm64_sys_bpf+0x30/0x54
invoke_syscall+0x78/0x110
el0_svc_common.constprop.0+0x6c/0x1d0
do_el0_svc+0x38/0xe0
el0_svc+0x30/0xd0
el0t_64_sync_handler+0x1ac/0x1b0
el0t_64_sync+0x1a0/0x1a4
Kernel Offset: disabled
CPU features: 0x0000,00034c24,f994fdab
Memory Limit: none
And the instruction next to call site of bpf_fentry_test1 is ADD,
not PACIASP:
<bpf_fentry_test1>:
bti c
nop
nop
add w0, w0, #0x1
paciasp
For BPF prog, JIT always puts a PACIASP after call site for BTI-enabled
kernel, so there is no problem. To fix it, replace BLR with RET to bypass
the branch target check.
Fixes:
|
||
Martin KaFai Lau
|
271de525e1 |
bpf: Remove prog->active check for bpf_lsm and bpf_iter
The commit |
||
Yonghong Song
|
eb707dde26 |
bpf: arm64: No support of struct argument in trampoline programs
ARM64 does not support struct argument for trampoline based bpf programs yet. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220831152702.2079066-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Xu Kuohai
|
aada476655 |
bpf, arm64: Fix bpf trampoline instruction endianness
The sparse tool complains as follows:
arch/arm64/net/bpf_jit_comp.c:1684:16:
warning: incorrect type in assignment (different base types)
arch/arm64/net/bpf_jit_comp.c:1684:16:
expected unsigned int [usertype] *branch
arch/arm64/net/bpf_jit_comp.c:1684:16:
got restricted __le32 [usertype] *
arch/arm64/net/bpf_jit_comp.c:1700:52:
error: subtraction of different types can't work (different base
types)
arch/arm64/net/bpf_jit_comp.c:1734:29:
warning: incorrect type in assignment (different base types)
arch/arm64/net/bpf_jit_comp.c:1734:29:
expected unsigned int [usertype] *
arch/arm64/net/bpf_jit_comp.c:1734:29:
got restricted __le32 [usertype] *
arch/arm64/net/bpf_jit_comp.c:1918:52:
error: subtraction of different types can't work (different base
types)
This is because the variable branch in function invoke_bpf_prog and the
variable branches in function prepare_trampoline are defined as type
u32 *, which conflicts with ctx->image's type __le32 *, so sparse complains
when assignment or arithmetic operation are performed on these two
variables and ctx->image.
Since arm64 instructions are always little-endian, change the type of
these two variables to __le32 * and call cpu_to_le32() to convert
instruction to little-endian before writing it to memory. This is also
in line with emit() which internally does cpu_to_le32(), too.
Fixes:
|
||
Aijun Sun
|
19f68ed6dc |
bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
It is not necessary to allocate contiguous physical memory for BPF program buffer using kcalloc. When the BPF program is large more than memory page size, kcalloc allocates multiple memory pages from buddy system. If the device can not provide sufficient memory, for example in low-end android devices [0], memory allocation for BPF program is likely to fail. Test cases in lib/test_bpf.c all pass on ARM64 QEMU. [0] AndroidTestSuit: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=foreground,mems_allowed=0 Call trace: dump_stack+0xa4/0x114 warn_alloc+0xf8/0x14c __alloc_pages_slowpath+0xac8/0xb14 __alloc_pages_nodemask+0x194/0x3d0 kmalloc_order_trace+0x44/0x1e8 __kmalloc+0x29c/0x66c bpf_int_jit_compile+0x17c/0x568 bpf_prog_select_runtime+0x4c/0x1b0 bpf_prepare_filter+0x5fc/0x6bc bpf_prog_create_from_user+0x118/0x1c0 seccomp_set_mode_filter+0x1c4/0x7cc __do_sys_prctl+0x380/0x1424 __arm64_sys_prctl+0x20/0x2c el0_svc_common+0xc8/0x22c el0_svc_handler+0x1c/0x28 el0_svc+0x8/0x100 Signed-off-by: Aijun Sun <aijun.sun@unisoc.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220804025442.22524-1-aijun.sun@unisoc.com |
||
Xu Kuohai
|
339ed900b3 |
bpf, arm64: Fix compile error in dummy_tramp()
dummy_tramp() uses "lr" to refer to the x30 register, but some assembler
does not recognize "lr" and reports a build failure:
/tmp/cc52xO0c.s: Assembler messages:
/tmp/cc52xO0c.s:8: Error: operand 1 should be an integer register -- `mov lr,x9'
/tmp/cc52xO0c.s:7: Error: undefined symbol lr used as an immediate value
make[2]: *** [scripts/Makefile.build:250: arch/arm64/net/bpf_jit_comp.o] Error 1
make[1]: *** [scripts/Makefile.build:525: arch/arm64/net] Error 2
So replace "lr" with "x30" to fix it.
Fixes:
|
||
Nathan Chancellor
|
33f32e5072 |
bpf, arm64: Mark dummy_tramp as global
When building with clang + CONFIG_CFI_CLANG=y, the following error
occurs at link time:
ld.lld: error: undefined symbol: dummy_tramp
dummy_tramp is declared globally in C but its definition in inline
assembly does not use .global, which prevents clang from properly
resolving the references to it when creating the CFI jump tables.
Mark dummy_tramp as global so that the reference can be properly
resolved.
Fixes:
|
||
Xu Kuohai
|
efc9909fdc |
bpf, arm64: Add bpf trampoline for arm64
This is arm64 version of commit
|
||
Xu Kuohai
|
b2ad54e153 |
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline can be patched with it. When the target address is NULL, the original instruction is patched to a NOP. When the target address and the source address are within the branch range, the original instruction is patched to a bl instruction to the target address directly. To support attaching bpf trampoline to both regular kernel function and bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two instructions are inserted at the beginning of bpf prog, the first one saves the return address to x9, and the second is a nop which will be patched to a bl instruction when a bpf trampoline is attached. However, when a bpf trampoline is attached to bpf prog, the distance between target address and source address may exceed 128MB, the maximum branch range, because bpf trampoline and bpf prog are allocated separately with vmalloc. So long jump should be handled. When a bpf prog is constructed, a plt pointing to empty trampoline dummy_tramp is placed at the end: bpf_prog: mov x9, lr nop // patchsite ... ret plt: ldr x10, target br x10 target: .quad dummy_tramp // plt target This is also the state when no trampoline is attached. When a short-jump bpf trampoline is attached, the patchsite is patched to a bl instruction to the trampoline directly: bpf_prog: mov x9, lr bl <short-jump bpf trampoline address> // patchsite ... ret plt: ldr x10, target br x10 target: .quad dummy_tramp // plt target When a long-jump bpf trampoline is attached, the plt target is filled with the trampoline address and the patchsite is patched to a bl instruction to the plt: bpf_prog: mov x9, lr bl plt // patchsite ... ret plt: ldr x10, target br x10 target: .quad <long-jump bpf trampoline address> dummy_tramp is used to prevent another CPU from jumping to an unknown location during the patching process, making the patching process easier. The patching process is as follows: 1. when neither the old address or the new address is a long jump, the patchsite is replaced with a bl to the new address, or nop if the new address is NULL; 2. when the old address is not long jump but the new one is, the branch target address is written to plt first, then the patchsite is replaced with a bl instruction to the plt; 3. when the old address is long jump but the new one is not, the address of dummy_tramp is written to plt first, then the patchsite is replaced with a bl to the new address, or a nop if the new address is NULL; 4. when both the old address and the new address are long jump, the new address is written to plt and the patchsite is not changed. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com |
||
Jakub Sitnicki
|
d4609a5d8c |
bpf, arm64: Keep tail call count across bpf2bpf calls
Today doing a BPF tail call after a BPF to BPF call, that is from a subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features requires support from JIT. Tail call count has to be tracked through BPF to BPF calls, as well as through BPF tail calls to prevent unbounded chains of tail calls. arm64 BPF JIT stores the tail call count (TCC) in a dedicated register (X26). This makes it easier to support bpf2bpf calls mixed with tail calls than on x86 platform. In order to keep the tail call count in tact throughout bpf2bpf calls, all we need to do is tweak the program prologue generator. When emitting prologue for a subprogram, we skip the block that initializes the tail call count and emits a jump pad for the tail call. With this change, a sample execution flow where a bpf2bpf call is followed by a tail call would look like so: int entry(struct __sk_buff *skb): 0xffffffc0090151d4: paciasp 0xffffffc0090151d8: stp x29, x30, [sp, #-16]! 0xffffffc0090151dc: mov x29, sp 0xffffffc0090151e0: stp x19, x20, [sp, #-16]! 0xffffffc0090151e4: stp x21, x22, [sp, #-16]! 0xffffffc0090151e8: stp x25, x26, [sp, #-16]! 0xffffffc0090151ec: stp x27, x28, [sp, #-16]! 0xffffffc0090151f0: mov x25, sp 0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only 0xffffffc0090151f8: bti j // in main prog 0xffffffc0090151fc: sub x27, x25, #0x0 0xffffffc009015200: sub sp, sp, #0x10 0xffffffc009015204: mov w1, #0x0 0xffffffc009015208: mov x10, #0xffffffffffffffff 0xffffffc00901520c: strb w1, [x25, x10] 0xffffffc009015210: mov x10, #0xffffffffffffd25c 0xffffffc009015214: movk x10, #0x902, lsl #16 0xffffffc009015218: movk x10, #0xffc0, lsl #32 0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call 0xffffffc009015220: add x7, x0, #0x0 <-------------. 0xffffffc009015224: add sp, sp, #0x10 | | 0xffffffc009015228: ldp x27, x28, [sp], #16 | | 0xffffffc00901522c: ldp x25, x26, [sp], #16 | | 0xffffffc009015230: ldp x21, x22, [sp], #16 | | 0xffffffc009015234: ldp x19, x20, [sp], #16 | | 0xffffffc009015238: ldp x29, x30, [sp], #16 | | 0xffffffc00901523c: add x0, x7, #0x0 | | 0xffffffc009015240: autiasp | | 0xffffffc009015244: ret | | | | int subprog_tail(struct __sk_buff *skb): | | 0xffffffc00902d25c: paciasp <----------------------' | 0xffffffc00902d260: stp x29, x30, [sp, #-16]! | 0xffffffc00902d264: mov x29, sp | 0xffffffc00902d268: stp x19, x20, [sp, #-16]! | 0xffffffc00902d26c: stp x21, x22, [sp, #-16]! | 0xffffffc00902d270: stp x25, x26, [sp, #-16]! | 0xffffffc00902d274: stp x27, x28, [sp, #-16]! | 0xffffffc00902d278: mov x25, sp | 0xffffffc00902d27c: sub x27, x25, #0x0 | 0xffffffc00902d280: sub sp, sp, #0x10 | // <- end of prologue, notice: 0xffffffc00902d284: add x19, x0, #0x0 | // 1) TCC not touched, and 0xffffffc00902d288: mov w0, #0x1 | // 2) no tail call jump pad 0xffffffc00902d28c: mov x10, #0xfffffffffffffffc | 0xffffffc00902d290: str w0, [x25, x10] | 0xffffffc00902d294: mov x20, #0xffffff80ffffffff | 0xffffffc00902d298: movk x20, #0xc033, lsl #16 | 0xffffffc00902d29c: movk x20, #0x4e00 | 0xffffffc00902d2a0: add x0, x19, #0x0 | 0xffffffc00902d2a4: add x1, x20, #0x0 | 0xffffffc00902d2a8: mov x2, #0x0 | 0xffffffc00902d2ac: mov w10, #0x24 | 0xffffffc00902d2b0: ldr w10, [x1, x10] | 0xffffffc00902d2b4: add w2, w2, #0x0 | 0xffffffc00902d2b8: cmp w2, w10 | 0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 | 0xffffffc00902d2c0: mov w10, #0x21 | 0xffffffc00902d2c4: cmp x26, x10 | // TCC >= MAX_TAIL_CALL_CNT? 0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 | 0xffffffc00902d2cc: add x26, x26, #0x1 | // TCC++ 0xffffffc00902d2d0: mov w10, #0x110 | 0xffffffc00902d2d4: add x10, x1, x10 | 0xffffffc00902d2d8: lsl x11, x2, #3 | 0xffffffc00902d2dc: ldr x11, [x10, x11] | 0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 | 0xffffffc00902d2e4: mov w10, #0x30 | 0xffffffc00902d2e8: ldr x10, [x11, x10] | 0xffffffc00902d2ec: add x10, x10, #0x24 | 0xffffffc00902d2f0: add sp, sp, #0x10 | // <- destroy just current 0xffffffc00902d2f4: br x10 ---------------------. | // BPF stack frame 0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc | | // before the tail call 0xffffffc00902d2fc: ldr w7, [x25, x10] | | 0xffffffc00902d300: add sp, sp, #0x10 | | 0xffffffc00902d304: ldp x27, x28, [sp], #16 | | 0xffffffc00902d308: ldp x25, x26, [sp], #16 | | 0xffffffc00902d30c: ldp x21, x22, [sp], #16 | | 0xffffffc00902d310: ldp x19, x20, [sp], #16 | | 0xffffffc00902d314: ldp x29, x30, [sp], #16 | | 0xffffffc00902d318: add x0, x7, #0x0 | | 0xffffffc00902d31c: autiasp | | 0xffffffc00902d320: ret | | | | int classifier_0(struct __sk_buff *skb): | | 0xffffffc008ff5874: paciasp | | 0xffffffc008ff5878: stp x29, x30, [sp, #-16]! | | 0xffffffc008ff587c: mov x29, sp | | 0xffffffc008ff5880: stp x19, x20, [sp, #-16]! | | 0xffffffc008ff5884: stp x21, x22, [sp, #-16]! | | 0xffffffc008ff5888: stp x25, x26, [sp, #-16]! | | 0xffffffc008ff588c: stp x27, x28, [sp, #-16]! | | 0xffffffc008ff5890: mov x25, sp | | 0xffffffc008ff5894: mov x26, #0x0 | | 0xffffffc008ff5898: bti j <----------------------' | 0xffffffc008ff589c: sub x27, x25, #0x0 | 0xffffffc008ff58a0: sub sp, sp, #0x0 | 0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff | 0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 | 0xffffffc008ff58ac: movk x0, #0x6000 | 0xffffffc008ff58b0: mov w1, #0x1 | 0xffffffc008ff58b4: str w1, [x0] | 0xffffffc008ff58b8: mov w7, #0x0 | 0xffffffc008ff58bc: mov sp, sp | 0xffffffc008ff58c0: ldp x27, x28, [sp], #16 | 0xffffffc008ff58c4: ldp x25, x26, [sp], #16 | 0xffffffc008ff58c8: ldp x21, x22, [sp], #16 | 0xffffffc008ff58cc: ldp x19, x20, [sp], #16 | 0xffffffc008ff58d0: ldp x29, x30, [sp], #16 | 0xffffffc008ff58d4: add x0, x7, #0x0 | 0xffffffc008ff58d8: autiasp | 0xffffffc008ff58dc: ret -------------------------------' Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com |
||
Eric Dumazet
|
10f3b29c65 |
bpf, arm64: Clear prog->jited_len along prog->jited
syzbot reported an illegal copy_to_user() attempt from bpf_prog_get_info_by_fd() [1] There was no repro yet on this bug, but I think that commit |