* [PATCH -next v5 0/5] bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate)
@ 2022-03-21 15:28 ` Xu Kuohai
0 siblings, 0 replies; 8+ messages in thread
From: Xu Kuohai @ 2022-03-21 15:28 UTC (permalink / raw)
To: bpf, linux-arm-kernel
Cc: Catalin Marinas, Will Deacon, Daniel Borkmann,
Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Julien Thierry, Mark Rutland, Hou Tao, Fuad Tabba,
James Morse
The current BPF store/load instruction is translated by the JIT into two
instructions. The first instruction moves the immediate offset into a
temporary register. The second instruction uses this temporary register
to do the real store/load.
In fact, arm64 supports addressing with immediate offsets, so this series
introduces an optimization that uses arm64 str/ldr instructions with an
immediate offset when the offset fits.
Example of the generated instructions for r2 = *(u64 *)(r1 + 0):
Without optimization:
mov x10, 0
ldr x1, [x0, x10]
With optimization:
ldr x1, [x0, 0]
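On arm64, str/ldr with an unsigned immediate takes a 12-bit offset scaled by
the access size, and stur/ldur takes a 9-bit signed unscaled offset; an offset
only "fits" when it can be encoded in one of these forms. A rough stand-alone
sketch of such a check (hypothetical helper for illustration only, not the
actual bpf_jit_comp.c logic):

/*
 * Hypothetical sketch only: decide whether a BPF load/store offset can be
 * encoded directly in an arm64 str/ldr instruction instead of being moved
 * into a temporary register first. 'size' is the access size in bytes
 * (1, 2, 4 or 8).
 */
#include <stdbool.h>
#include <stdint.h>

static bool offset_fits_arm64_imm(int16_t off, uint8_t size)
{
        /* str/ldr (unsigned immediate): 12-bit offset scaled by access size */
        if (off >= 0 && off % size == 0 && off / size < (1 << 12))
                return true;

        /* stur/ldur (unscaled immediate): 9-bit signed offset */
        return off >= -256 && off <= 255;
}

Offsets that fit neither form would still fall back to the mov + str/ldr pair.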
For the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this series, jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this series, jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
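The prologue change is where patch 3 ("adjust the offset of str/ldr(immediate)
to positive number") comes in: a second base pointer x27 = x25 - 0x88 is set
up so that the negative BPF stack offsets above (-136 .. -8) are rebased to
small non-negative offsets (0 .. 128), which the scaled str/ldr immediate form
can encode. A stand-alone illustration of that rebasing, with 0x88 taken from
the listing above rather than from the JIT's actual computation:

#include <stdio.h>

int main(void)
{
        /* from "sub x27, x25, #0x88" in the prologue above */
        const int fpb_off = 0x88;

        /* BPF stack slots FP-136 .. FP-8 become [x27, #0] .. [x27, #128] */
        for (int off = -136; off <= -8; off += 8)
                printf("FP%+d -> [x27, #%d]\n", off, off + fpb_off);

        return 0;
}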
Tested with test_bpf on both big-endian and little-endian arm64 qemu:
test_bpf: Summary: 1026 PASSED, 0 FAILED, [1014/1014 JIT'ed]
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]
test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED
v4->v5:
1. Fix incorrect FP offset in tail call scenario pointed out by Daniel,
and add a tail call test case for this issue
2. Align down fpb_offset to 8 bytes to avoid unaligned offsets
3. Style and spelling fix
v3->v4:
1. Fix compile error reported by kernel test robot
2. Add one more test case for load/store with different offsets, and move
the test case to the last patch
3. Fix some obvious bugs
v2 -> v3:
1. Split the v2 patch into 2 patches, one for arm64 instruction encoder,
the other for BPF JIT
2. Add tests for BPF_LDX/BPF_STX with different offsets
3. Adjust the offset of str/ldr(immediate) to positive number
v1 -> v2:
1. Remove macro definition that causes checkpatch to fail
2. Append result to commit message
Xu Kuohai (5):
arm64: insn: add ldr/str with immediate offset
bpf, arm64: Optimize BPF store/load using str/ldr with immediate
offset
bpf, arm64: adjust the offset of str/ldr(immediate) to positive number
bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets
bpf, arm64: add load store test case for tail call
arch/arm64/include/asm/insn.h | 9 +
arch/arm64/lib/insn.c | 67 ++++++--
arch/arm64/net/bpf_jit.h | 14 ++
arch/arm64/net/bpf_jit_comp.c | 243 ++++++++++++++++++++++++--
lib/test_bpf.c | 315 +++++++++++++++++++++++++++++++++-
5 files changed, 613 insertions(+), 35 deletions(-)
--
2.30.2
* Re: [PATCH -next v5 0/5] bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate)
2022-03-21 15:28 ` Xu Kuohai
@ 2022-03-30 3:43 ` Xu Kuohai
-1 siblings, 0 replies; 8+ messages in thread
From: Xu Kuohai @ 2022-03-30 3:43 UTC (permalink / raw)
To: bpf, linux-arm-kernel
Cc: Catalin Marinas, Will Deacon, Daniel Borkmann,
Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Julien Thierry, Mark Rutland, Hou Tao, Fuad Tabba,
James Morse
On 2022/3/21 23:28, Xu Kuohai wrote:
> The current BPF store/load instruction is translated by the JIT into two
> instructions. The first instruction moves the immediate offset into a
> temporary register. The second instruction uses this temporary register
> to do the real store/load.
>
> In fact, arm64 supports addressing with immediate offsets, so this series
> introduces an optimization that uses arm64 str/ldr instructions with an
> immediate offset when the offset fits.
>
> [...]
ping ;)
* Re: [PATCH -next v5 0/5] bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate)
2022-03-21 15:28 ` Xu Kuohai
@ 2022-03-31 23:20 ` patchwork-bot+netdevbpf
-1 siblings, 0 replies; 8+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-03-31 23:20 UTC (permalink / raw)
To: Xu Kuohai
Cc: bpf, linux-arm-kernel, catalin.marinas, will, daniel, ast,
zlim.lnx, andrii, kafai, songliubraving, yhs, john.fastabend,
kpsingh, jthierry, mark.rutland, houtao1, tabba, james.morse
Hello:
This series was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:
On Mon, 21 Mar 2022 11:28:47 -0400 you wrote:
> The current BPF store/load instruction is translated by the JIT into two
> instructions. The first instruction moves the immediate offset into a
> temporary register. The second instruction uses this temporary register
> to do the real store/load.
>
> In fact, arm64 supports addressing with immediate offsets, so this series
> introduces an optimization that uses arm64 str/ldr instructions with an
> immediate offset when the offset fits.
>
> [...]
Here is the summary with links:
- [bpf-next,v5,1/5] arm64: insn: add ldr/str with immediate offset
https://git.kernel.org/bpf/bpf-next/c/30c90f6757a7
- [bpf-next,v5,2/5] bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate offset)
https://git.kernel.org/bpf/bpf-next/c/7db6c0f1d8ee
- [bpf-next,v5,3/5] bpf, arm64: adjust the offset of str/ldr(immediate) to positive number
https://git.kernel.org/bpf/bpf-next/c/5b3d19b9bd40
- [bpf-next,v5,4/5] bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets
https://git.kernel.org/bpf/bpf-next/c/f516420f683d
- [bpf-next,v5,5/5] bpf, arm64: add load store test case for tail call
https://git.kernel.org/bpf/bpf-next/c/38608ee7b690
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html