From: Xu Kuohai <xukuohai@huawei.com>
To: <bpf@vger.kernel.org>, <linux-arm-kernel@lists.infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Zi Shen Lim <zlim.lnx@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>,
	Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Julien Thierry <jthierry@redhat.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Hou Tao <houtao1@huawei.com>,
	Fuad Tabba <tabba@google.com>,
	James Morse <james.morse@arm.com>
Subject: [PATCH bpf-next v3 0/4] bpf, arm64: Optimize BPF store/load using
Date: Wed, 16 Mar 2022 12:26:17 -0400
Message-ID: <20220316162621.3842604-1-xukuohai@huawei.com>

Currently, the arm64 JIT translates each BPF store/load instruction into
two instructions. The first instruction moves the immediate offset into a
temporary register. The second instruction uses this temporary register to
do the real store/load.

In fact, arm64 supports addressing with immediate offsets. So this series
introduces an optimization that uses the arm64 str/ldr instruction with an
immediate offset when the offset fits.

Example of generated instructions for r2 = *(u64 *)(r1 + 0):

Without optimization:
mov x10, 0
ldr x1, [x0, x10]

With optimization:
ldr x1, [x0, 0]

For the following bpftrace command:

  bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'

Without this series, jited code (fragment):

   0:   bti     c
   4:   stp     x29, x30, [sp, #-16]!
   8:   mov     x29, sp
   c:   stp     x19, x20, [sp, #-16]!
  10:   stp     x21, x22, [sp, #-16]!
  14:   stp     x25, x26, [sp, #-16]!
  18:   mov     x25, sp
  1c:   mov     x26, #0x0                       // #0
  20:   bti     j
  24:   sub     sp, sp, #0x90
  28:   add     x19, x0, #0x0
  2c:   mov     x0, #0x0                        // #0
  30:   mov     x10, #0xffffffffffffff78        // #-136
  34:   str     x0, [x25, x10]
  38:   mov     x10, #0xffffffffffffff80        // #-128
  3c:   str     x0, [x25, x10]
  40:   mov     x10, #0xffffffffffffff88        // #-120
  44:   str     x0, [x25, x10]
  48:   mov     x10, #0xffffffffffffff90        // #-112
  4c:   str     x0, [x25, x10]
  50:   mov     x10, #0xffffffffffffff98        // #-104
  54:   str     x0, [x25, x10]
  58:   mov     x10, #0xffffffffffffffa0        // #-96
  5c:   str     x0, [x25, x10]
  60:   mov     x10, #0xffffffffffffffa8        // #-88
  64:   str     x0, [x25, x10]
  68:   mov     x10, #0xffffffffffffffb0        // #-80
  6c:   str     x0, [x25, x10]
  70:   mov     x10, #0xffffffffffffffb8        // #-72
  74:   str     x0, [x25, x10]
  78:   mov     x10, #0xffffffffffffffc0        // #-64
  7c:   str     x0, [x25, x10]
  80:   mov     x10, #0xffffffffffffffc8        // #-56
  84:   str     x0, [x25, x10]
  88:   mov     x10, #0xffffffffffffffd0        // #-48
  8c:   str     x0, [x25, x10]
  90:   mov     x10, #0xffffffffffffffd8        // #-40
  94:   str     x0, [x25, x10]
  98:   mov     x10, #0xffffffffffffffe0        // #-32
  9c:   str     x0, [x25, x10]
  a0:   mov     x10, #0xffffffffffffffe8        // #-24
  a4:   str     x0, [x25, x10]
  a8:   mov     x10, #0xfffffffffffffff0        // #-16
  ac:   str     x0, [x25, x10]
  b0:   mov     x10, #0xfffffffffffffff8        // #-8
  b4:   str     x0, [x25, x10]
  b8:   mov     x10, #0x8                       // #8
  bc:   ldr     x2, [x19, x10]
  [...]

With this series, jited code (fragment):

   0:   bti     c
   4:   stp     x29, x30, [sp, #-16]!
   8:   mov     x29, sp
   c:   stp     x19, x20, [sp, #-16]!
  10:   stp     x21, x22, [sp, #-16]!
  14:   stp     x25, x26, [sp, #-16]!
  18:   stp     x27, x28, [sp, #-16]!
  1c:   mov     x25, sp
  20:   sub     x27, x25, #0x88
  24:   mov     x26, #0x0                       // #0
  28:   bti     j
  2c:   sub     sp, sp, #0x90
  30:   add     x19, x0, #0x0
  34:   mov     x0, #0x0                        // #0
  38:   str     x0, [x27]
  3c:   str     x0, [x27, #8]
  40:   str     x0, [x27, #16]
  44:   str     x0, [x27, #24]
  48:   str     x0, [x27, #32]
  4c:   str     x0, [x27, #40]
  50:   str     x0, [x27, #48]
  54:   str     x0, [x27, #56]
  58:   str     x0, [x27, #64]
  5c:   str     x0, [x27, #72]
  60:   str     x0, [x27, #80]
  64:   str     x0, [x27, #88]
  68:   str     x0, [x27, #96]
  6c:   str     x0, [x27, #104]
  70:   str     x0, [x27, #112]
  74:   str     x0, [x27, #120]
  78:   str     x0, [x27, #128]
  7c:   ldr     x2, [x19, #8]
  [...]

Tested with test_bpf on both big-endian and little-endian arm64 qemu:

 test_bpf: Summary: 1026 PASSED, 0 FAILED, [1014/1014 JIT'ed]
 test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
 test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED

v2 -> v3:
 1. Split the v2 patch into 2 patches, one for the arm64 instruction
    encoder, the other for the BPF JIT
 2. Add tests for BPF_LDX/BPF_STX with different offsets
 3. Adjust the offset of str/ldr(immediate) to a positive number

v1 -> v2:
 1. Remove macro definition that causes checkpatch to fail
 2. Append result to commit message

Xu Kuohai (4):
  arm64: insn: add ldr/str with immediate offset
  bpf, arm64: Optimize BPF store/load using str/ldr with immediate
    offset
  bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets
  bpf, arm64: adjust the offset of str/ldr(immediate) to positive number

 arch/arm64/include/asm/insn.h |   9 ++
 arch/arm64/lib/insn.c         |  67 ++++++--
 arch/arm64/net/bpf_jit.h      |  14 ++
 arch/arm64/net/bpf_jit_comp.c | 212 ++++++++++++++++++++++---
 lib/test_bpf.c                | 285 +++++++++++++++++++++++++++++++++-
 5 files changed, 549 insertions(+), 38 deletions(-)

-- 
2.30.2
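
For readers following the cover letter, the "when the offset fits" condition and the v3 stack-offset adjustment can be sketched roughly as below. This is an illustration only, not code from the series: the helper names fits_scaled_uimm12() and rebase_fp_offset() are hypothetical. It assumes the arm64 LDR/STR (immediate, unsigned offset) encoding, whose 12-bit immediate is scaled by the access size, and it mirrors the fragment above, where x27 = x25 - 0x88 turns the negative frame-pointer offsets -136..-8 into 0..128.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the series): returns true when a
 * byte offset can be encoded in the arm64 LDR/STR (immediate, unsigned
 * offset) form. That form carries a 12-bit unsigned immediate scaled by
 * the access size, so the offset must be non-negative, a multiple of the
 * access size, and at most 4095 * size.
 *
 * scale is log2 of the access size in bytes: 0, 1, 2 or 3.
 */
static bool fits_scaled_uimm12(int32_t off, unsigned int scale)
{
	int32_t size = 1 << scale;

	if (off < 0)
		return false;		/* negative offsets are not encodable */
	if (off & (size - 1))
		return false;		/* must be aligned to the access size */
	return (off >> scale) <= 0xfff;	/* scaled value must fit in 12 bits */
}

/*
 * Hypothetical helper: rebase a negative BPF frame-pointer offset onto a
 * register that points at the bottom of the BPF stack, as x27 does in the
 * fragment above (x27 = x25 - 0x88), so the offset becomes non-negative.
 */
static int32_t rebase_fp_offset(int32_t off, int32_t stack_size)
{
	return off + stack_size;
}
```

With a stack size of 0x88 (136) bytes, the store at frame-pointer offset -136 rebases to 0 and the store at -8 rebases to 128, both of which pass the check for 8-byte accesses. An offset that fails the check would still need the two-instruction mov plus register-offset form.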
Thread overview: 11+ messages
2022-03-16 16:26 Xu Kuohai [this message]
2022-03-16 16:26 ` [PATCH bpf-next v3 0/4] bpf, arm64: Optimize BPF store/load using Xu Kuohai
2022-03-16 16:26 ` [PATCH -next v3 1/4] arm64: insn: add ldr/str with immediate offset Xu Kuohai
2022-03-16 16:26   ` Xu Kuohai
2022-03-16 16:26 ` [PATCH -next v3 2/4] bpf, arm64: Optimize BPF store/load using str/ldr " Xu Kuohai
2022-03-16 16:26   ` Xu Kuohai
2022-03-16 16:26 ` [PATCH -next v3 3/4] bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets Xu Kuohai
2022-03-16 16:26   ` Xu Kuohai
2022-03-17  0:47   ` kernel test robot
2022-03-16 16:26 ` [PATCH -next v3 4/4] bpf, arm64: adjust the offset of str/ldr(immediate) to positive number Xu Kuohai
2022-03-16 16:26   ` Xu Kuohai