bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* bpf_get_branch_snapshot on qemu-kvm
@ 2021-09-29  0:04 Song Liu
  2021-09-29  7:35 ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2021-09-29  0:04 UTC (permalink / raw)
  To: Peter Zijlstra, bpf
  Cc: Networking, Kernel Team, Andrii Nakryiko, Alexei Starovoitov

Hi Peter, 

We have see the warning below while testing the new bpf_get_branch_snapshot 
helper, on a QEMU vm (/usr/bin/qemu-system-x86_64 -enable-kvm -cpu host ...). 
This issue doesn't happen on bare metal systems (no QEMU). 

We didn't cover this case, as LBR didn't really work in QEMU. But it seems to
work after I upgrade the host kernel to 5.12. 

At the moment, we don't have much idea on how to debug and fix the issue. Could 
you please share your thoughts on this? 

Thanks in advance!

Song




============================== 8< ============================

[  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)
[  139.495587] Call Trace:
[  139.495845]  bpf_get_branch_snapshot+0x17/0x40
[  139.496285]  bpf_prog_35810402cd1d294c_test1+0x33/0xe6c
[  139.496791]  bpf_trampoline_10737534536_0+0x4c/0x1000
[  139.497274]  bpf_testmod_loop_test+0x5/0x20 [bpf_testmod]
[  139.497799]  bpf_testmod_test_read+0x71/0x1f0 [bpf_testmod]
[  139.498332]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
[  139.498878]  ? sysfs_kf_bin_read+0xbe/0x110
[  139.499284]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
[  139.499829]  kernfs_fop_read_iter+0x1ac/0x2c0
[  139.500245]  ? kernfs_create_link+0x110/0x110
[  139.500667]  new_sync_read+0x24b/0x360
[  139.501037]  ? __x64_sys_llseek+0x1e0/0x1e0
[  139.501444]  ? rcu_read_lock_held_common+0x1a/0x50
[  139.501942]  ? rcu_read_lock_held_common+0x1a/0x50
[  139.502404]  ? rcu_read_lock_sched_held+0x5f/0xd0
[  139.502865]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  139.503294]  ? security_file_permission+0xe7/0x2c0
[  139.503758]  vfs_read+0x1a4/0x2a0
[  139.504091]  ksys_read+0xc0/0x160
[  139.504413]  ? vfs_write+0x510/0x510
[  139.504756]  ? ktime_get_coarse_real_ts64+0xe4/0xf0
[  139.505234]  do_syscall_64+0x3a/0x80
[  139.505581]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  139.506066] RIP: 0033:0x7fb8a05728b2
[  139.506413] Code: 97 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 96 db 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[  139.508164] RSP: 002b:00007ffe66315a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  139.508870] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8a05728b2
[  139.509545] RDX: 0000000000000064 RSI: 0000000000000000 RDI: 0000000000000010
[  139.510225] RBP: 00007ffe66315a60 R08: 0000000000000000 R09: 00007ffe66315907
[  139.510897] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040c8b0
[  139.511570] R13: 00007ffe66315cc0 R14: 0000000000000000 R15: 0000000000000000 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29  0:04 bpf_get_branch_snapshot on qemu-kvm Song Liu
@ 2021-09-29  7:35 ` Peter Zijlstra
  2021-09-29 12:05   ` Like Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-09-29  7:35 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu

On Wed, Sep 29, 2021 at 12:04:21AM +0000, Song Liu wrote:
> Hi Peter, 
> 
> We have see the warning below while testing the new bpf_get_branch_snapshot 
> helper, on a QEMU vm (/usr/bin/qemu-system-x86_64 -enable-kvm -cpu host ...). 
> This issue doesn't happen on bare metal systems (no QEMU). 
> 
> We didn't cover this case, as LBR didn't really work in QEMU. But it seems to
> work after I upgrade the host kernel to 5.12. 
> 
> At the moment, we don't have much idea on how to debug and fix the issue. Could 
> you please share your thoughts on this? 

Well, that's virt, afaik stuff not working is like a feature there or
something, who knows. I've Cc'ed Like Xu who might have clue since he
did the patches.

Virt just ain't worth the pain if you ask me.

> 
> Thanks in advance!
> 
> Song
> 
> 
> 
> 
> ============================== 8< ============================
> 
> [  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)
> [  139.495587] Call Trace:
> [  139.495845]  bpf_get_branch_snapshot+0x17/0x40
> [  139.496285]  bpf_prog_35810402cd1d294c_test1+0x33/0xe6c
> [  139.496791]  bpf_trampoline_10737534536_0+0x4c/0x1000
> [  139.497274]  bpf_testmod_loop_test+0x5/0x20 [bpf_testmod]
> [  139.497799]  bpf_testmod_test_read+0x71/0x1f0 [bpf_testmod]
> [  139.498332]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
> [  139.498878]  ? sysfs_kf_bin_read+0xbe/0x110
> [  139.499284]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
> [  139.499829]  kernfs_fop_read_iter+0x1ac/0x2c0
> [  139.500245]  ? kernfs_create_link+0x110/0x110
> [  139.500667]  new_sync_read+0x24b/0x360
> [  139.501037]  ? __x64_sys_llseek+0x1e0/0x1e0
> [  139.501444]  ? rcu_read_lock_held_common+0x1a/0x50
> [  139.501942]  ? rcu_read_lock_held_common+0x1a/0x50
> [  139.502404]  ? rcu_read_lock_sched_held+0x5f/0xd0
> [  139.502865]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [  139.503294]  ? security_file_permission+0xe7/0x2c0
> [  139.503758]  vfs_read+0x1a4/0x2a0
> [  139.504091]  ksys_read+0xc0/0x160
> [  139.504413]  ? vfs_write+0x510/0x510
> [  139.504756]  ? ktime_get_coarse_real_ts64+0xe4/0xf0
> [  139.505234]  do_syscall_64+0x3a/0x80
> [  139.505581]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  139.506066] RIP: 0033:0x7fb8a05728b2
> [  139.506413] Code: 97 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 96 db 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
> [  139.508164] RSP: 002b:00007ffe66315a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [  139.508870] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8a05728b2
> [  139.509545] RDX: 0000000000000064 RSI: 0000000000000000 RDI: 0000000000000010
> [  139.510225] RBP: 00007ffe66315a60 R08: 0000000000000000 R09: 00007ffe66315907
> [  139.510897] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040c8b0
> [  139.511570] R13: 00007ffe66315cc0 R14: 0000000000000000 R15: 0000000000000000 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29  7:35 ` Peter Zijlstra
@ 2021-09-29 12:05   ` Like Xu
  2021-09-29 12:26     ` Peter Zijlstra
  2021-09-29 14:39     ` Song Liu
  0 siblings, 2 replies; 18+ messages in thread
From: Like Xu @ 2021-09-29 12:05 UTC (permalink / raw)
  To: Peter Zijlstra, Song Liu
  Cc: bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu

On 29/9/2021 3:35 pm, Peter Zijlstra wrote:
> On Wed, Sep 29, 2021 at 12:04:21AM +0000, Song Liu wrote:
>> Hi Peter,
>>
>> We have see the warning below while testing the new bpf_get_branch_snapshot
>> helper, on a QEMU vm (/usr/bin/qemu-system-x86_64 -enable-kvm -cpu host ...).
>> This issue doesn't happen on bare metal systems (no QEMU).
>>
>> We didn't cover this case, as LBR didn't really work in QEMU. But it seems to
>> work after I upgrade the host kernel to 5.12.

The guest LBR is enabled since the v5.12.

>>
>> At the moment, we don't have much idea on how to debug and fix the issue. Could
>> you please share your thoughts on this?
> 
> Well, that's virt, afaik stuff not working is like a feature there or
> something, who knows. I've Cc'ed Like Xu who might have clue since he
> did the patches.
> 
> Virt just ain't worth the pain if you ask me.

Just cc me for any vPMU/x86 stuff.

> 
>>
>> Thanks in advance!
>>
>> Song
>>
>>
>>
>>
>> ============================== 8< ============================
>>
>> [  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)

Uh, it uses a PEBS counter to sample or count, which is not yet upstream but 
should be soon.

Song, can you try to fix bpf_get_branch_snapshot on a normal PMC counter,
or where is the src for bpf_get_branch_snapshot? I am more than happy to help.

>> [  139.495587] Call Trace:
>> [  139.495845]  bpf_get_branch_snapshot+0x17/0x40
>> [  139.496285]  bpf_prog_35810402cd1d294c_test1+0x33/0xe6c
>> [  139.496791]  bpf_trampoline_10737534536_0+0x4c/0x1000
>> [  139.497274]  bpf_testmod_loop_test+0x5/0x20 [bpf_testmod]
>> [  139.497799]  bpf_testmod_test_read+0x71/0x1f0 [bpf_testmod]
>> [  139.498332]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
>> [  139.498878]  ? sysfs_kf_bin_read+0xbe/0x110
>> [  139.499284]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
>> [  139.499829]  kernfs_fop_read_iter+0x1ac/0x2c0
>> [  139.500245]  ? kernfs_create_link+0x110/0x110
>> [  139.500667]  new_sync_read+0x24b/0x360
>> [  139.501037]  ? __x64_sys_llseek+0x1e0/0x1e0
>> [  139.501444]  ? rcu_read_lock_held_common+0x1a/0x50
>> [  139.501942]  ? rcu_read_lock_held_common+0x1a/0x50
>> [  139.502404]  ? rcu_read_lock_sched_held+0x5f/0xd0
>> [  139.502865]  ? rcu_read_lock_bh_held+0xb0/0xb0
>> [  139.503294]  ? security_file_permission+0xe7/0x2c0
>> [  139.503758]  vfs_read+0x1a4/0x2a0
>> [  139.504091]  ksys_read+0xc0/0x160
>> [  139.504413]  ? vfs_write+0x510/0x510
>> [  139.504756]  ? ktime_get_coarse_real_ts64+0xe4/0xf0
>> [  139.505234]  do_syscall_64+0x3a/0x80
>> [  139.505581]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [  139.506066] RIP: 0033:0x7fb8a05728b2
>> [  139.506413] Code: 97 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 96 db 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
>> [  139.508164] RSP: 002b:00007ffe66315a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [  139.508870] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8a05728b2
>> [  139.509545] RDX: 0000000000000064 RSI: 0000000000000000 RDI: 0000000000000010
>> [  139.510225] RBP: 00007ffe66315a60 R08: 0000000000000000 R09: 00007ffe66315907
>> [  139.510897] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040c8b0
>> [  139.511570] R13: 00007ffe66315cc0 R14: 0000000000000000 R15: 0000000000000000

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 12:05   ` Like Xu
@ 2021-09-29 12:26     ` Peter Zijlstra
  2021-09-29 14:42       ` Song Liu
  2021-09-29 14:39     ` Song Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-09-29 12:26 UTC (permalink / raw)
  To: Like Xu
  Cc: Song Liu, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu

On Wed, Sep 29, 2021 at 08:05:24PM +0800, Like Xu wrote:
> On 29/9/2021 3:35 pm, Peter Zijlstra wrote:

> > > [  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)
> 
> Uh, it uses a PEBS counter to sample or count, which is not yet upstream but
> should be soon.

Ooh that's PEBS_ENABLE

> Song, can you try to fix bpf_get_branch_snapshot on a normal PMC counter,
> or where is the src for bpf_get_branch_snapshot? I am more than happy to help.

Nah, all that code wants to do is disable PEBS... and virt being virt,
it's all sorts of weird with f/m/s :/

I so hate all that. So there's two solutions:

 - get confirmation that clearing GLOBAL_CTRL is suffient to supress
   PEBS, in which case we can simply remove the PEBS_ENABLE clear.

or

 - create another 2 intel_pmu_snapshot_{,arch_}branch_stack() copies
   that omit the PEBS thing.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 12:05   ` Like Xu
  2021-09-29 12:26     ` Peter Zijlstra
@ 2021-09-29 14:39     ` Song Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Song Liu @ 2021-09-29 14:39 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu

Hi Like,

> On Sep 29, 2021, at 5:05 AM, Like Xu <like.xu.linux@gmail.com> wrote:
> 
> On 29/9/2021 3:35 pm, Peter Zijlstra wrote:
>> On Wed, Sep 29, 2021 at 12:04:21AM +0000, Song Liu wrote:
>>> Hi Peter,
>>> 
>>> We have see the warning below while testing the new bpf_get_branch_snapshot
>>> helper, on a QEMU vm (/usr/bin/qemu-system-x86_64 -enable-kvm -cpu host ...).
>>> This issue doesn't happen on bare metal systems (no QEMU).
>>> 
>>> We didn't cover this case, as LBR didn't really work in QEMU. But it seems to
>>> work after I upgrade the host kernel to 5.12.
> 
> The guest LBR is enabled since the v5.12.
> 
>>> 
>>> At the moment, we don't have much idea on how to debug and fix the issue. Could
>>> you please share your thoughts on this?
>> Well, that's virt, afaik stuff not working is like a feature there or
>> something, who knows. I've Cc'ed Like Xu who might have clue since he
>> did the patches.
>> Virt just ain't worth the pain if you ask me.
> 
> Just cc me for any vPMU/x86 stuff.
> 
>>> 
>>> Thanks in advance!
>>> 
>>> Song
>>> 
>>> 
>>> 
>>> 
>>> ============================== 8< ============================
>>> 
>>> [  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)
> 
> Uh, it uses a PEBS counter to sample or count, which is not yet upstream but should be soon.
> 
> Song, can you try to fix bpf_get_branch_snapshot on a normal PMC counter,
> or where is the src for bpf_get_branch_snapshot? I am more than happy to help.

bpf_get_branch_snapshot is available in the bpf-next tree:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/

To repro the issue:

1. build the kernel and boot qemu with it;
2. build tools/testing/selftests/bpf
     cd tools/testing/selftests/bpf ; make -j
3. copy test_progs and bpf_testmod.ko from tools/testing/selftests/bpf
   to the qemu vm;
4. run the test as 
     ./test_progs -t snapshot -v

This should trigger "unchecked MSR access error" in dmesg. Please 
let me know if this doesn't work. 

Thanks,
Song

> 
>>> [  139.495587] Call Trace:
>>> [  139.495845]  bpf_get_branch_snapshot+0x17/0x40
>>> [  139.496285]  bpf_prog_35810402cd1d294c_test1+0x33/0xe6c
>>> [  139.496791]  bpf_trampoline_10737534536_0+0x4c/0x1000
>>> [  139.497274]  bpf_testmod_loop_test+0x5/0x20 [bpf_testmod]
>>> [  139.497799]  bpf_testmod_test_read+0x71/0x1f0 [bpf_testmod]
>>> [  139.498332]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
>>> [  139.498878]  ? sysfs_kf_bin_read+0xbe/0x110
>>> [  139.499284]  ? bpf_testmod_loop_test+0x20/0x20 [bpf_testmod]
>>> [  139.499829]  kernfs_fop_read_iter+0x1ac/0x2c0
>>> [  139.500245]  ? kernfs_create_link+0x110/0x110
>>> [  139.500667]  new_sync_read+0x24b/0x360
>>> [  139.501037]  ? __x64_sys_llseek+0x1e0/0x1e0
>>> [  139.501444]  ? rcu_read_lock_held_common+0x1a/0x50
>>> [  139.501942]  ? rcu_read_lock_held_common+0x1a/0x50
>>> [  139.502404]  ? rcu_read_lock_sched_held+0x5f/0xd0
>>> [  139.502865]  ? rcu_read_lock_bh_held+0xb0/0xb0
>>> [  139.503294]  ? security_file_permission+0xe7/0x2c0
>>> [  139.503758]  vfs_read+0x1a4/0x2a0
>>> [  139.504091]  ksys_read+0xc0/0x160
>>> [  139.504413]  ? vfs_write+0x510/0x510
>>> [  139.504756]  ? ktime_get_coarse_real_ts64+0xe4/0xf0
>>> [  139.505234]  do_syscall_64+0x3a/0x80
>>> [  139.505581]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [  139.506066] RIP: 0033:0x7fb8a05728b2
>>> [  139.506413] Code: 97 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 96 db 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
>>> [  139.508164] RSP: 002b:00007ffe66315a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>>> [  139.508870] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8a05728b2
>>> [  139.509545] RDX: 0000000000000064 RSI: 0000000000000000 RDI: 0000000000000010
>>> [  139.510225] RBP: 00007ffe66315a60 R08: 0000000000000000 R09: 00007ffe66315907
>>> [  139.510897] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040c8b0
>>> [  139.511570] R13: 00007ffe66315cc0 R14: 0000000000000000 R15: 0000000000000000


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 12:26     ` Peter Zijlstra
@ 2021-09-29 14:42       ` Song Liu
  2021-09-29 15:59         ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2021-09-29 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Like Xu, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu

Hi Peter, 

> On Sep 29, 2021, at 5:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Sep 29, 2021 at 08:05:24PM +0800, Like Xu wrote:
>> On 29/9/2021 3:35 pm, Peter Zijlstra wrote:
> 
>>>> [  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)
>> 
>> Uh, it uses a PEBS counter to sample or count, which is not yet upstream but
>> should be soon.
> 
> Ooh that's PEBS_ENABLE
> 
>> Song, can you try to fix bpf_get_branch_snapshot on a normal PMC counter,
>> or where is the src for bpf_get_branch_snapshot? I am more than happy to help.
> 
> Nah, all that code wants to do is disable PEBS... and virt being virt,
> it's all sorts of weird with f/m/s :/
> 
> I so hate all that. So there's two solutions:
> 
> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>   PEBS, in which case we can simply remove the PEBS_ENABLE clear.

How should we confirm this? Can we run some tests for this? Or do we
need hardware experts' input for this?

Thanks,
Song

> 
> or
> 
> - create another 2 intel_pmu_snapshot_{,arch_}branch_stack() copies
>   that omit the PEBS thing.
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 14:42       ` Song Liu
@ 2021-09-29 15:59         ` Peter Zijlstra
  2021-09-29 16:35           ` Liang, Kan
  2021-09-29 23:20           ` Andi Kleen
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:59 UTC (permalink / raw)
  To: Song Liu
  Cc: Like Xu, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Liang, Kan, Andi Kleen

On Wed, Sep 29, 2021 at 02:42:27PM +0000, Song Liu wrote:
> Hi Peter, 
> 
> > On Sep 29, 2021, at 5:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Wed, Sep 29, 2021 at 08:05:24PM +0800, Like Xu wrote:
> >> On 29/9/2021 3:35 pm, Peter Zijlstra wrote:
> > 
> >>>> [  139.494159] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0000000000000000) at rIP: 0xffffffff81011a8b (intel_pmu_snapshot_branch_stack+0x3b/0xd0)
> >> 
> >> Uh, it uses a PEBS counter to sample or count, which is not yet upstream but
> >> should be soon.
> > 
> > Ooh that's PEBS_ENABLE
> > 
> >> Song, can you try to fix bpf_get_branch_snapshot on a normal PMC counter,
> >> or where is the src for bpf_get_branch_snapshot? I am more than happy to help.
> > 
> > Nah, all that code wants to do is disable PEBS... and virt being virt,
> > it's all sorts of weird with f/m/s :/
> > 
> > I so hate all that. So there's two solutions:
> > 
> > - get confirmation that clearing GLOBAL_CTRL is suffient to supress
> >   PEBS, in which case we can simply remove the PEBS_ENABLE clear.
> 
> How should we confirm this? Can we run some tests for this? Or do we
> need hardware experts' input for this?

I'll put it on the list to ask the hardware people when I talk to them
next. But maybe Kan or Andi know without asking.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 15:59         ` Peter Zijlstra
@ 2021-09-29 16:35           ` Liang, Kan
  2021-09-29 20:05             ` Song Liu
  2021-09-29 23:20           ` Andi Kleen
  1 sibling, 1 reply; 18+ messages in thread
From: Liang, Kan @ 2021-09-29 16:35 UTC (permalink / raw)
  To: Peter Zijlstra, Song Liu
  Cc: Like Xu, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen

> > > - get confirmation that clearing GLOBAL_CTRL is suffient to supress
> > >   PEBS, in which case we can simply remove the PEBS_ENABLE clear.
> >
> > How should we confirm this? Can we run some tests for this? Or do we
> > need hardware experts' input for this?
> 
> I'll put it on the list to ask the hardware people when I talk to them next. But
> maybe Kan or Andi know without asking.

If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
It doesn't matter if PEBS is enabled or not. 

See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
access in PMI "). We optimized the PMU handler base on it.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 16:35           ` Liang, Kan
@ 2021-09-29 20:05             ` Song Liu
  2021-10-08  3:34               ` Like Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2021-09-29 20:05 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Like Xu, bpf, Networking, Kernel Team,
	Andrii Nakryiko, Alexei Starovoitov, like.xu, Andi Kleen

Hi Kan,

> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
> 
>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>  PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>> 
>>> How should we confirm this? Can we run some tests for this? Or do we
>>> need hardware experts' input for this?
>> 
>> I'll put it on the list to ask the hardware people when I talk to them next. But
>> maybe Kan or Andi know without asking.
> 
> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
> It doesn't matter if PEBS is enabled or not. 
> 
> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
> access in PMI "). We optimized the PMU handler base on it.

Thanks for these information!

IIUC, all we need is the following on top of bpf-next/master:

diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
index 1248fc1937f82..d0d357e7d6f21 100644
--- i/arch/x86/events/intel/core.c
+++ w/arch/x86/events/intel/core.c
@@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
        /* must not have branches... */
        local_irq_save(flags);
        __intel_pmu_disable_all(false); /* we don't care about BTS */
-       __intel_pmu_pebs_disable_all();
        __intel_pmu_lbr_disable();
        /*            ... until here */
        return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);
@@ -2223,7 +2222,6 @@ intel_pmu_snapshot_arch_branch_stack(struct perf_branch_entry *entries, unsigned
        /* must not have branches... */
        local_irq_save(flags);
        __intel_pmu_disable_all(false); /* we don't care about BTS */
-       __intel_pmu_pebs_disable_all();
        __intel_pmu_arch_lbr_disable();
        /*            ... until here */
        return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);


In the test, this does eliminate the warning. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 15:59         ` Peter Zijlstra
  2021-09-29 16:35           ` Liang, Kan
@ 2021-09-29 23:20           ` Andi Kleen
  1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2021-09-29 23:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Like Xu, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Liang, Kan, Andi Kleen

> > How should we confirm this? Can we run some tests for this? Or do we
> > need hardware experts' input for this?
> 
> I'll put it on the list to ask the hardware people when I talk to them
> next. But maybe Kan or Andi know without asking.

On modern CPUs it should be sufficient. In fact the perfmon v4 path that
that was recently deleted did it.

I'm not sure about all old CPUs. There was a corner case on older CPUs
(Haswell and Broadwell) with old Microcode where it was needed when entering virtualization
(but that should be fixed now)


-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-09-29 20:05             ` Song Liu
@ 2021-10-08  3:34               ` Like Xu
  2021-10-08  5:46                 ` Song Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Like Xu @ 2021-10-08  3:34 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan

On 30/9/2021 4:05 am, Song Liu wrote:
> Hi Kan,
> 
>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>
>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>   PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>
>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>> need hardware experts' input for this?
>>>
>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>> maybe Kan or Andi know without asking.
>>
>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>> It doesn't matter if PEBS is enabled or not.
>>
>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>> access in PMI "). We optimized the PMU handler base on it.
> 
> Thanks for these information!
> 
> IIUC, all we need is the following on top of bpf-next/master:
> 
> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
> index 1248fc1937f82..d0d357e7d6f21 100644
> --- i/arch/x86/events/intel/core.c
> +++ w/arch/x86/events/intel/core.c
> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>          /* must not have branches... */
>          local_irq_save(flags);
>          __intel_pmu_disable_all(false); /* we don't care about BTS */

If the value passed in is true, does it affect your use case?

> -       __intel_pmu_pebs_disable_all();

In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
regardless of whether PEBS is supported or enabled inside the guest and the host ?

>          __intel_pmu_lbr_disable();

How about using intel_pmu_lbr_disable_all() to cover Arch LBR?

>          /*            ... until here */
>          return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);
> @@ -2223,7 +2222,6 @@ intel_pmu_snapshot_arch_branch_stack(struct perf_branch_entry *entries, unsigned
>          /* must not have branches... */
>          local_irq_save(flags);
>          __intel_pmu_disable_all(false); /* we don't care about BTS */
> -       __intel_pmu_pebs_disable_all();
>          __intel_pmu_arch_lbr_disable();
>          /*            ... until here */
>          return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);
> 
> 
> In the test, this does eliminate the warning.
> 
> Thanks,
> Song
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-08  3:34               ` Like Xu
@ 2021-10-08  5:46                 ` Song Liu
  2021-10-08  6:36                   ` Like Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2021-10-08  5:46 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan



> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
> 
> On 30/9/2021 4:05 am, Song Liu wrote:
>> Hi Kan,
>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>> 
>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>  PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>> 
>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>> need hardware experts' input for this?
>>>> 
>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>> maybe Kan or Andi know without asking.
>>> 
>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>> It doesn't matter if PEBS is enabled or not.
>>> 
>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>> access in PMI "). We optimized the PMU handler base on it.
>> Thanks for these information!
>> IIUC, all we need is the following on top of bpf-next/master:
>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>> index 1248fc1937f82..d0d357e7d6f21 100644
>> --- i/arch/x86/events/intel/core.c
>> +++ w/arch/x86/events/intel/core.c
>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>         /* must not have branches... */
>>         local_irq_save(flags);
>>         __intel_pmu_disable_all(false); /* we don't care about BTS */
> 
> If the value passed in is true, does it affect your use case?
> 
>> -       __intel_pmu_pebs_disable_all();
> 
> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
> regardless of whether PEBS is supported or enabled inside the guest and the host ?
> 
>>         __intel_pmu_lbr_disable();
> 
> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?

We are using LBR without PMI, so there isn't any hardware mechanism to 
stop the LBR, we have to stop it in software. There is always a delay
between the event triggers and the LBR is stopped. In this window, 
the LBR is still running and old entries are being replaced by new entries. 
We actually need the old entries before the triggering event, so the key
design goal here is to minimize the number of branch instructions between
the event triggers and the LBR is stopped. 

Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
are used to optimize for this goal: the fewer branch instructions the 
better. 

After removing __intel_pmu_pebs_disable_all() from 
intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in 
extable related code. With these entries, snapshot branch stack is not 
really useful in the VM, because all the interesting entries are flushed
by these. I am not sure how to further optimize these. Do you have some
suggestions on this? 

Thanks,
Song

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-08  5:46                 ` Song Liu
@ 2021-10-08  6:36                   ` Like Xu
  2021-10-08 17:08                     ` Song Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Like Xu @ 2021-10-08  6:36 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan

On 8/10/2021 1:46 pm, Song Liu wrote:
> 
> 
>> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> On 30/9/2021 4:05 am, Song Liu wrote:
>>> Hi Kan,
>>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>>>
>>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>>   PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>>>
>>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>>> need hardware experts' input for this?
>>>>>
>>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>>> maybe Kan or Andi know without asking.
>>>>
>>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>>> It doesn't matter if PEBS is enabled or not.
>>>>
>>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>>> access in PMI "). We optimized the PMU handler base on it.
>>> Thanks for these information!
>>> IIUC, all we need is the following on top of bpf-next/master:
>>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>>> index 1248fc1937f82..d0d357e7d6f21 100644
>>> --- i/arch/x86/events/intel/core.c
>>> +++ w/arch/x86/events/intel/core.c
>>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>>          /* must not have branches... */
>>>          local_irq_save(flags);
>>>          __intel_pmu_disable_all(false); /* we don't care about BTS */
>>
>> If the value passed in is true, does it affect your use case?
>>
>>> -       __intel_pmu_pebs_disable_all();
>>
>> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
>> regardless of whether PEBS is supported or enabled inside the guest and the host ?
>>
>>>          __intel_pmu_lbr_disable();
>>
>> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?
> 
> We are using LBR without PMI, so there isn't any hardware mechanism to
> stop the LBR, we have to stop it in software. There is always a delay
> between the event triggers and the LBR is stopped. In this window,

Do you use counters for snapshot branch stack?

Can the assumption of "without PMI" be broken sine Intel does have
the hardware mechanism like "freeze LBR on counter overflow
(aka, DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)" ?

> the LBR is still running and old entries are being replaced by new entries.
> We actually need the old entries before the triggering event, so the key
> design goal here is to minimize the number of branch instructions between
> the event triggers and the LBR is stopped.

Yes, it makes sense.

> 
> Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
> are used to optimize for this goal: the fewer branch instructions the
> better.

Is it possible that we have another LBR in-kernel user in addition to the current
BPF-LBR snapshot user, such as another BPF-LBR snapshot user or a LBR perf user ?

In the intel_pmu_snapshot_[arch]_branch_stack(), what if there is a PMI or NMI 
handler
to be called before __intel_pmu_lbr_disable(), which means more branch instructions
(assuming we don't use the FREEZE_LBRS_ON_xxx capability)?

How about try to disable LBR at the earliest possible time, before 
__intel_pmu_disable_all(false) ?

> 
> After removing __intel_pmu_pebs_disable_all() from
> intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in
> extable related code. With these entries, snapshot branch stack is not

Are you saying that you still need to call
__intel_pmu_pebs_disable_all() to maintain precision ?

> really useful in the VM, because all the interesting entries are flushed
> by these. I am not sure how to further optimize these. Do you have some
> suggestions on this?


> 
> Thanks,
> Song
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-08  6:36                   ` Like Xu
@ 2021-10-08 17:08                     ` Song Liu
  2021-10-09  9:03                       ` Like Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2021-10-08 17:08 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan



> On Oct 7, 2021, at 11:36 PM, Like Xu <like.xu.linux@gmail.com> wrote:
> 
> On 8/10/2021 1:46 pm, Song Liu wrote:
>>> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>> 
>>> On 30/9/2021 4:05 am, Song Liu wrote:
>>>> Hi Kan,
>>>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>>>> 
>>>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>>>  PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>>>> 
>>>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>>>> need hardware experts' input for this?
>>>>>> 
>>>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>>>> maybe Kan or Andi know without asking.
>>>>> 
>>>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>>>> It doesn't matter if PEBS is enabled or not.
>>>>> 
>>>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>>>> access in PMI "). We optimized the PMU handler base on it.
>>>> Thanks for these information!
>>>> IIUC, all we need is the following on top of bpf-next/master:
>>>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>>>> index 1248fc1937f82..d0d357e7d6f21 100644
>>>> --- i/arch/x86/events/intel/core.c
>>>> +++ w/arch/x86/events/intel/core.c
>>>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>>>         /* must not have branches... */
>>>>         local_irq_save(flags);
>>>>         __intel_pmu_disable_all(false); /* we don't care about BTS */
>>> 
>>> If the value passed in is true, does it affect your use case?
>>> 
>>>> -       __intel_pmu_pebs_disable_all();
>>> 
>>> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
>>> regardless of whether PEBS is supported or enabled inside the guest and the host ?
>>> 
>>>>         __intel_pmu_lbr_disable();
>>> 
>>> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?
>> We are using LBR without PMI, so there isn't any hardware mechanism to
>> stop the LBR, we have to stop it in software. There is always a delay
>> between the event triggers and the LBR is stopped. In this window,
> 
> Do you use counters for snapshot branch stack?
> 
> Can the assumption of "without PMI" be broken sine Intel does have
> the hardware mechanism like "freeze LBR on counter overflow
> (aka, DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)" ?

We are capturing LBR on software events. For example, when a complex syscall,
such as sys_bpf() and sys_perf_event_open(), returns -EINVAL, it is not obvious
what wen wrong. The branch stack at the return (on a kretprobe or fexit) could 
give us additional information. 

> 
>> the LBR is still running and old entries are being replaced by new entries.
>> We actually need the old entries before the triggering event, so the key
>> design goal here is to minimize the number of branch instructions between
>> the event triggers and the LBR is stopped.
> 
> Yes, it makes sense.
> 
>> Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
>> are used to optimize for this goal: the fewer branch instructions the
>> better.
> 
> Is it possible that we have another LBR in-kernel user in addition to the current
> BPF-LBR snapshot user, such as another BPF-LBR snapshot user or a LBR perf user ?

I think it is OK to have another user. We just need to capture the LBR entries. 
In fact, we simply enable LBR by opening a perf_event on each CPU. So from the 
kernel's point of view, the LBR is owned used by "another user". 

> 
> In the intel_pmu_snapshot_[arch]_branch_stack(), what if there is a PMI or NMI handler
> to be called before __intel_pmu_lbr_disable(), which means more branch instructions
> (assuming we don't use the FREEZE_LBRS_ON_xxx capability)?

If we are unlucky and hit an NMI, we may get garbage data. The user will run the 
test again. 

> How about try to disable LBR at the earliest possible time, before __intel_pmu_disable_all(false) ?

I am not sure which solution is the best here. On bare metal, current version works
fine (available in bpf-next tree). 

> 
>> After removing __intel_pmu_pebs_disable_all() from
>> intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in
>> extable related code. With these entries, snapshot branch stack is not
> 
> Are you saying that you still need to call
> __intel_pmu_pebs_disable_all() to maintain precision ?

I think we don't need pebs_disable_all. In the VM, pebs_disable_all will trigger
"unchecked MSR access error" warning. After removing it, the warning message is 
gone. However, after we remove pebs_disable_all, we still see too many LBR entries
are flushed before LBR is stopped. Most of these new entries are in extable code. 
I guess this is because the VM access these MSR differently. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-08 17:08                     ` Song Liu
@ 2021-10-09  9:03                       ` Like Xu
  2021-10-26  7:09                         ` Song Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Like Xu @ 2021-10-09  9:03 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan

On 9/10/2021 1:08 am, Song Liu wrote:
> 
> 
>> On Oct 7, 2021, at 11:36 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> On 8/10/2021 1:46 pm, Song Liu wrote:
>>>> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>>>
>>>> On 30/9/2021 4:05 am, Song Liu wrote:
>>>>> Hi Kan,
>>>>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>>>>>
>>>>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>>>>   PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>>>>>
>>>>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>>>>> need hardware experts' input for this?
>>>>>>>
>>>>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>>>>> maybe Kan or Andi know without asking.
>>>>>>
>>>>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>>>>> It doesn't matter if PEBS is enabled or not.
>>>>>>
>>>>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>>>>> access in PMI "). We optimized the PMU handler base on it.
>>>>> Thanks for these information!
>>>>> IIUC, all we need is the following on top of bpf-next/master:
>>>>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>>>>> index 1248fc1937f82..d0d357e7d6f21 100644
>>>>> --- i/arch/x86/events/intel/core.c
>>>>> +++ w/arch/x86/events/intel/core.c
>>>>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>>>>          /* must not have branches... */
>>>>>          local_irq_save(flags);
>>>>>          __intel_pmu_disable_all(false); /* we don't care about BTS */
>>>>
>>>> If the value passed in is true, does it affect your use case?
>>>>
>>>>> -       __intel_pmu_pebs_disable_all();
>>>>
>>>> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
>>>> regardless of whether PEBS is supported or enabled inside the guest and the host ?
>>>>
>>>>>          __intel_pmu_lbr_disable();
>>>>
>>>> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?
>>> We are using LBR without PMI, so there isn't any hardware mechanism to
>>> stop the LBR, we have to stop it in software. There is always a delay
>>> between the event triggers and the LBR is stopped. In this window,
>>
>> Do you use counters for snapshot branch stack?
>>
>> Can the assumption of "without PMI" be broken sine Intel does have
>> the hardware mechanism like "freeze LBR on counter overflow
>> (aka, DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)" ?
> 
> We are capturing LBR on software events. For example, when a complex syscall,
> such as sys_bpf() and sys_perf_event_open(), returns -EINVAL, it is not obvious
> what wen wrong. The branch stack at the return (on a kretprobe or fexit) could
> give us additional information.
> 
>>
>>> the LBR is still running and old entries are being replaced by new entries.
>>> We actually need the old entries before the triggering event, so the key
>>> design goal here is to minimize the number of branch instructions between
>>> the event triggers and the LBR is stopped.
>>
>> Yes, it makes sense.
>>
>>> Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
>>> are used to optimize for this goal: the fewer branch instructions the
>>> better.
>>
>> Is it possible that we have another LBR in-kernel user in addition to the current
>> BPF-LBR snapshot user, such as another BPF-LBR snapshot user or a LBR perf user ?
> 
> I think it is OK to have another user. We just need to capture the LBR entries.
> In fact, we simply enable LBR by opening a perf_event on each CPU. So from the
> kernel's point of view, the LBR is owned used by "another user".
> 
>>
>> In the intel_pmu_snapshot_[arch]_branch_stack(), what if there is a PMI or NMI handler
>> to be called before __intel_pmu_lbr_disable(), which means more branch instructions
>> (assuming we don't use the FREEZE_LBRS_ON_xxx capability)?
> 
> If we are unlucky and hit an NMI, we may get garbage data. The user will run the
> test again.
> 
>> How about try to disable LBR at the earliest possible time, before __intel_pmu_disable_all(false) ?
> 
> I am not sure which solution is the best here. On bare metal, current version works
> fine (available in bpf-next tree).
> 
>>
>>> After removing __intel_pmu_pebs_disable_all() from
>>> intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in
>>> extable related code. With these entries, snapshot branch stack is not
>>
>> Are you saying that you still need to call
>> __intel_pmu_pebs_disable_all() to maintain precision ?
> 
> I think we don't need pebs_disable_all. In the VM, pebs_disable_all will trigger
> "unchecked MSR access error" warning. After removing it, the warning message is
> gone. However, after we remove pebs_disable_all, we still see too many LBR entries
> are flushed before LBR is stopped. Most of these new entries are in extable code.
> I guess this is because the VM access these MSR differently.

Hi Song,

Thanks for your detailed input. I saw your workaround "if (is_hypervisor())" on 
the tree.

Even when the guest supports PEBS, this use case fails and the root cause is still
playing hide-and-seek with me. Just check with you to see if you get similar results
when the guest LBR behavior makes the test case fail like this:

serial_test_get_branch_snapshot:FAIL:find_looptest_in_lbr unexpected 
find_looptest_in_lbr: actual 0 <= expected 6
serial_test_get_branch_snapshot:FAIL:check_wasted_entries unexpected 
check_wasted_entries: actual 32 >= expected 10
#52 get_branch_snapshot:FAIL

Also, do you know or rough guess about how extable code relates to the test case ?

> 
> Thanks,
> Song
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-09  9:03                       ` Like Xu
@ 2021-10-26  7:09                         ` Song Liu
  2021-10-28  3:09                           ` Like Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2021-10-26  7:09 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan



> On Oct 9, 2021, at 2:03 AM, Like Xu <like.xu.linux@gmail.com> wrote:
> 
> On 9/10/2021 1:08 am, Song Liu wrote:
>>> On Oct 7, 2021, at 11:36 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>> 
>>> On 8/10/2021 1:46 pm, Song Liu wrote:
>>>>> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>>>> 
>>>>> On 30/9/2021 4:05 am, Song Liu wrote:
>>>>>> Hi Kan,
>>>>>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>>>>>> 
>>>>>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>>>>>  PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>>>>>> 
>>>>>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>>>>>> need hardware experts' input for this?
>>>>>>>> 
>>>>>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>>>>>> maybe Kan or Andi know without asking.
>>>>>>> 
>>>>>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>>>>>> It doesn't matter if PEBS is enabled or not.
>>>>>>> 
>>>>>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>>>>>> access in PMI "). We optimized the PMU handler base on it.
>>>>>> Thanks for these information!
>>>>>> IIUC, all we need is the following on top of bpf-next/master:
>>>>>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>>>>>> index 1248fc1937f82..d0d357e7d6f21 100644
>>>>>> --- i/arch/x86/events/intel/core.c
>>>>>> +++ w/arch/x86/events/intel/core.c
>>>>>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>>>>>         /* must not have branches... */
>>>>>>         local_irq_save(flags);
>>>>>>         __intel_pmu_disable_all(false); /* we don't care about BTS */
>>>>> 
>>>>> If the value passed in is true, does it affect your use case?
>>>>> 
>>>>>> -       __intel_pmu_pebs_disable_all();
>>>>> 
>>>>> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
>>>>> regardless of whether PEBS is supported or enabled inside the guest and the host ?
>>>>> 
>>>>>>         __intel_pmu_lbr_disable();
>>>>> 
>>>>> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?
>>>> We are using LBR without PMI, so there isn't any hardware mechanism to
>>>> stop the LBR, we have to stop it in software. There is always a delay
>>>> between the event triggers and the LBR is stopped. In this window,
>>> 
>>> Do you use counters for snapshot branch stack?
>>> 
>>> Can the assumption of "without PMI" be broken sine Intel does have
>>> the hardware mechanism like "freeze LBR on counter overflow
>>> (aka, DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)" ?
>> We are capturing LBR on software events. For example, when a complex syscall,
>> such as sys_bpf() and sys_perf_event_open(), returns -EINVAL, it is not obvious
>> what wen wrong. The branch stack at the return (on a kretprobe or fexit) could
>> give us additional information.
>>> 
>>>> the LBR is still running and old entries are being replaced by new entries.
>>>> We actually need the old entries before the triggering event, so the key
>>>> design goal here is to minimize the number of branch instructions between
>>>> the event triggers and the LBR is stopped.
>>> 
>>> Yes, it makes sense.
>>> 
>>>> Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
>>>> are used to optimize for this goal: the fewer branch instructions the
>>>> better.
>>> 
>>> Is it possible that we have another LBR in-kernel user in addition to the current
>>> BPF-LBR snapshot user, such as another BPF-LBR snapshot user or a LBR perf user ?
>> I think it is OK to have another user. We just need to capture the LBR entries.
>> In fact, we simply enable LBR by opening a perf_event on each CPU. So from the
>> kernel's point of view, the LBR is owned used by "another user".
>>> 
>>> In the intel_pmu_snapshot_[arch]_branch_stack(), what if there is a PMI or NMI handler
>>> to be called before __intel_pmu_lbr_disable(), which means more branch instructions
>>> (assuming we don't use the FREEZE_LBRS_ON_xxx capability)?
>> If we are unlucky and hit an NMI, we may get garbage data. The user will run the
>> test again.
>>> How about try to disable LBR at the earliest possible time, before __intel_pmu_disable_all(false) ?
>> I am not sure which solution is the best here. On bare metal, current version works
>> fine (available in bpf-next tree).
>>> 
>>>> After removing __intel_pmu_pebs_disable_all() from
>>>> intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in
>>>> extable related code. With these entries, snapshot branch stack is not
>>> 
>>> Are you saying that you still need to call
>>> __intel_pmu_pebs_disable_all() to maintain precision ?
>> I think we don't need pebs_disable_all. In the VM, pebs_disable_all will trigger
>> "unchecked MSR access error" warning. After removing it, the warning message is
>> gone. However, after we remove pebs_disable_all, we still see too many LBR entries
>> are flushed before LBR is stopped. Most of these new entries are in extable code.
>> I guess this is because the VM access these MSR differently.
> 
> Hi Song,
> 
> Thanks for your detailed input. I saw your workaround "if (is_hypervisor())" on the tree.
> 
> Even when the guest supports PEBS, this use case fails and the root cause is still
> playing hide-and-seek with me. Just check with you to see if you get similar results
> when the guest LBR behavior makes the test case fail like this:
> 
> serial_test_get_branch_snapshot:FAIL:find_looptest_in_lbr unexpected find_looptest_in_lbr: actual 0 <= expected 6
> serial_test_get_branch_snapshot:FAIL:check_wasted_entries unexpected check_wasted_entries: actual 32 >= expected 10
> #52 get_branch_snapshot:FAIL
> 
> Also, do you know or rough guess about how extable code relates to the test case ?

Sorry for the delayed response. I finally got some time to look into 
this again. After disabling most debug configs, I managed to get it 
work in the VM with a simple change as 

diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
index 1248fc1937f82..3887b579297d7 100644
--- i/arch/x86/events/intel/core.c
+++ w/arch/x86/events/intel/core.c
@@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
        /* must not have branches... */
        local_irq_save(flags);
        __intel_pmu_disable_all(false); /* we don't care about BTS */
-       __intel_pmu_pebs_disable_all();
        __intel_pmu_lbr_disable();
        /*            ... until here */
        return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);


(of course we also need to remove the is_hypervisor() check.). 

But I am not sure whether this is the best fix. 

I pushed all the change and debug code I used to 

https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=get_branch_snapshot_in_vm

Could you please take a look at it and share your feedback on this?
Specifically, can we fix intel_pmu_snapshot_branch_stack in vm with the
change above?

Thanks,
Song





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-26  7:09                         ` Song Liu
@ 2021-10-28  3:09                           ` Like Xu
  2021-10-29 18:09                             ` Song Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Like Xu @ 2021-10-28  3:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan

On 26/10/2021 3:09 pm, Song Liu wrote:
> 
> 
>> On Oct 9, 2021, at 2:03 AM, Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> On 9/10/2021 1:08 am, Song Liu wrote:
>>>> On Oct 7, 2021, at 11:36 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>>>
>>>> On 8/10/2021 1:46 pm, Song Liu wrote:
>>>>>> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>>>>>
>>>>>> On 30/9/2021 4:05 am, Song Liu wrote:
>>>>>>> Hi Kan,
>>>>>>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>>>>>>>
>>>>>>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>>>>>>   PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>>>>>>>
>>>>>>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>>>>>>> need hardware experts' input for this?
>>>>>>>>>
>>>>>>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>>>>>>> maybe Kan or Andi know without asking.
>>>>>>>>
>>>>>>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>>>>>>> It doesn't matter if PEBS is enabled or not.
>>>>>>>>
>>>>>>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>>>>>>> access in PMI "). We optimized the PMU handler base on it.
>>>>>>> Thanks for these information!
>>>>>>> IIUC, all we need is the following on top of bpf-next/master:
>>>>>>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>>>>>>> index 1248fc1937f82..d0d357e7d6f21 100644
>>>>>>> --- i/arch/x86/events/intel/core.c
>>>>>>> +++ w/arch/x86/events/intel/core.c
>>>>>>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>>>>>>          /* must not have branches... */
>>>>>>>          local_irq_save(flags);
>>>>>>>          __intel_pmu_disable_all(false); /* we don't care about BTS */
>>>>>>
>>>>>> If the value passed in is true, does it affect your use case?
>>>>>>
>>>>>>> -       __intel_pmu_pebs_disable_all();
>>>>>>
>>>>>> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
>>>>>> regardless of whether PEBS is supported or enabled inside the guest and the host ?
>>>>>>
>>>>>>>          __intel_pmu_lbr_disable();
>>>>>>
>>>>>> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?
>>>>> We are using LBR without PMI, so there isn't any hardware mechanism to
>>>>> stop the LBR, we have to stop it in software. There is always a delay
>>>>> between the event triggers and the LBR is stopped. In this window,
>>>>
>>>> Do you use counters for snapshot branch stack?
>>>>
>>>> Can the assumption of "without PMI" be broken sine Intel does have
>>>> the hardware mechanism like "freeze LBR on counter overflow
>>>> (aka, DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)" ?
>>> We are capturing LBR on software events. For example, when a complex syscall,
>>> such as sys_bpf() and sys_perf_event_open(), returns -EINVAL, it is not obvious
>>> what wen wrong. The branch stack at the return (on a kretprobe or fexit) could
>>> give us additional information.
>>>>
>>>>> the LBR is still running and old entries are being replaced by new entries.
>>>>> We actually need the old entries before the triggering event, so the key
>>>>> design goal here is to minimize the number of branch instructions between
>>>>> the event triggers and the LBR is stopped.
>>>>
>>>> Yes, it makes sense.
>>>>
>>>>> Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
>>>>> are used to optimize for this goal: the fewer branch instructions the
>>>>> better.
>>>>
>>>> Is it possible that we have another LBR in-kernel user in addition to the current
>>>> BPF-LBR snapshot user, such as another BPF-LBR snapshot user or a LBR perf user ?
>>> I think it is OK to have another user. We just need to capture the LBR entries.
>>> In fact, we simply enable LBR by opening a perf_event on each CPU. So from the
>>> kernel's point of view, the LBR is owned used by "another user".
>>>>
>>>> In the intel_pmu_snapshot_[arch]_branch_stack(), what if there is a PMI or NMI handler
>>>> to be called before __intel_pmu_lbr_disable(), which means more branch instructions
>>>> (assuming we don't use the FREEZE_LBRS_ON_xxx capability)?
>>> If we are unlucky and hit an NMI, we may get garbage data. The user will run the
>>> test again.
>>>> How about try to disable LBR at the earliest possible time, before __intel_pmu_disable_all(false) ?
>>> I am not sure which solution is the best here. On bare metal, current version works
>>> fine (available in bpf-next tree).
>>>>
>>>>> After removing __intel_pmu_pebs_disable_all() from
>>>>> intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in
>>>>> extable related code. With these entries, snapshot branch stack is not
>>>>
>>>> Are you saying that you still need to call
>>>> __intel_pmu_pebs_disable_all() to maintain precision ?
>>> I think we don't need pebs_disable_all. In the VM, pebs_disable_all will trigger
>>> "unchecked MSR access error" warning. After removing it, the warning message is
>>> gone. However, after we remove pebs_disable_all, we still see too many LBR entries
>>> are flushed before LBR is stopped. Most of these new entries are in extable code.
>>> I guess this is because the VM access these MSR differently.
>>
>> Hi Song,
>>
>> Thanks for your detailed input. I saw your workaround "if (is_hypervisor())" on the tree.
>>
>> Even when the guest supports PEBS, this use case fails and the root cause is still
>> playing hide-and-seek with me. Just check with you to see if you get similar results
>> when the guest LBR behavior makes the test case fail like this:
>>
>> serial_test_get_branch_snapshot:FAIL:find_looptest_in_lbr unexpected find_looptest_in_lbr: actual 0 <= expected 6
>> serial_test_get_branch_snapshot:FAIL:check_wasted_entries unexpected check_wasted_entries: actual 32 >= expected 10
>> #52 get_branch_snapshot:FAIL
>>
>> Also, do you know or rough guess about how extable code relates to the test case ?
> 
> Sorry for the delayed response. I finally got some time to look into
> this again. After disabling most debug configs, I managed to get it
> work in the VM with a simple change as

Yes, most of the contaminated lbr records come from these guest symbols:

intel_pmu_snapshot_branch_stack
native_write_msr
trace_hardirqs_off
lockdep_hardirqs_off
__lock_acquire
mark_lock
migrate_disable
rcu_is_watching
bpf_get_branch_snapshot
__bpf_prog_enter

I think we're fine with the current guest LBR emulation, right?

> 
> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
> index 1248fc1937f82..3887b579297d7 100644
> --- i/arch/x86/events/intel/core.c
> +++ w/arch/x86/events/intel/core.c
> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>          /* must not have branches... */
>          local_irq_save(flags);
>          __intel_pmu_disable_all(false); /* we don't care about BTS */
> -       __intel_pmu_pebs_disable_all();
>          __intel_pmu_lbr_disable();
>          /*            ... until here */
>          return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);

LGTM.

> 
> 
> (of course we also need to remove the is_hypervisor() check.).
> 
> But I am not sure whether this is the best fix.
> 
> I pushed all the change and debug code I used to
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=get_branch_snapshot_in_vm
> 
> Could you please take a look at it and share your feedback on this?

How do we inform the user of bpf_get_branch_snapshot in a reasonable way
that the lbr data will be inaccurate when using the debug kernel?

Is it better to check for mutual exclusion in code or to use the user documentation
to specify this part of the restriction? It affects the user experience.

> Specifically, can we fix intel_pmu_snapshot_branch_stack in vm with the
> change above?

At least it's a valid fix and we can start from this change.

> 
> Thanks,
> Song
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bpf_get_branch_snapshot on qemu-kvm
  2021-10-28  3:09                           ` Like Xu
@ 2021-10-29 18:09                             ` Song Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Song Liu @ 2021-10-29 18:09 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, bpf, Networking, Kernel Team, Andrii Nakryiko,
	Alexei Starovoitov, like.xu, Andi Kleen, Liang, Kan



> On Oct 27, 2021, at 8:09 PM, Like Xu <like.xu.linux@gmail.com> wrote:
> 
> On 26/10/2021 3:09 pm, Song Liu wrote:
>>> On Oct 9, 2021, at 2:03 AM, Like Xu <like.xu.linux@gmail.com> wrote:
>>> 
>>> On 9/10/2021 1:08 am, Song Liu wrote:
>>>>> On Oct 7, 2021, at 11:36 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>>>> 
>>>>> On 8/10/2021 1:46 pm, Song Liu wrote:
>>>>>>> On Oct 7, 2021, at 8:34 PM, Like Xu <like.xu.linux@gmail.com> wrote:
>>>>>>> 
>>>>>>> On 30/9/2021 4:05 am, Song Liu wrote:
>>>>>>>> Hi Kan,
>>>>>>>>> On Sep 29, 2021, at 9:35 AM, Liang, Kan <kan.liang@intel.com> wrote:
>>>>>>>>> 
>>>>>>>>>>>> - get confirmation that clearing GLOBAL_CTRL is suffient to supress
>>>>>>>>>>>>  PEBS, in which case we can simply remove the PEBS_ENABLE clear.
>>>>>>>>>>> 
>>>>>>>>>>> How should we confirm this? Can we run some tests for this? Or do we
>>>>>>>>>>> need hardware experts' input for this?
>>>>>>>>>> 
>>>>>>>>>> I'll put it on the list to ask the hardware people when I talk to them next. But
>>>>>>>>>> maybe Kan or Andi know without asking.
>>>>>>>>> 
>>>>>>>>> If the GLOBAL_CTRL is explicitly disabled, the counters do not count anymore.
>>>>>>>>> It doesn't matter if PEBS is enabled or not.
>>>>>>>>> 
>>>>>>>>> See 6c1c07b33eb0 ("perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR
>>>>>>>>> access in PMI "). We optimized the PMU handler base on it.
>>>>>>>> Thanks for these information!
>>>>>>>> IIUC, all we need is the following on top of bpf-next/master:
>>>>>>>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>>>>>>>> index 1248fc1937f82..d0d357e7d6f21 100644
>>>>>>>> --- i/arch/x86/events/intel/core.c
>>>>>>>> +++ w/arch/x86/events/intel/core.c
>>>>>>>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>>>>>>>         /* must not have branches... */
>>>>>>>>         local_irq_save(flags);
>>>>>>>>         __intel_pmu_disable_all(false); /* we don't care about BTS */
>>>>>>> 
>>>>>>> If the value passed in is true, does it affect your use case?
>>>>>>> 
>>>>>>>> -       __intel_pmu_pebs_disable_all();
>>>>>>> 
>>>>>>> In that case, we can reuse "static __always_inline void intel_pmu_disable_all(void)"
>>>>>>> regardless of whether PEBS is supported or enabled inside the guest and the host ?
>>>>>>> 
>>>>>>>>         __intel_pmu_lbr_disable();
>>>>>>> 
>>>>>>> How about using intel_pmu_lbr_disable_all() to cover Arch LBR?
>>>>>> We are using LBR without PMI, so there isn't any hardware mechanism to
>>>>>> stop the LBR, we have to stop it in software. There is always a delay
>>>>>> between the event triggers and the LBR is stopped. In this window,
>>>>> 
>>>>> Do you use counters for snapshot branch stack?
>>>>> 
>>>>> Can the assumption of "without PMI" be broken sine Intel does have
>>>>> the hardware mechanism like "freeze LBR on counter overflow
>>>>> (aka, DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)" ?
>>>> We are capturing LBR on software events. For example, when a complex syscall,
>>>> such as sys_bpf() and sys_perf_event_open(), returns -EINVAL, it is not obvious
>>>> what wen wrong. The branch stack at the return (on a kretprobe or fexit) could
>>>> give us additional information.
>>>>> 
>>>>>> the LBR is still running and old entries are being replaced by new entries.
>>>>>> We actually need the old entries before the triggering event, so the key
>>>>>> design goal here is to minimize the number of branch instructions between
>>>>>> the event triggers and the LBR is stopped.
>>>>> 
>>>>> Yes, it makes sense.
>>>>> 
>>>>>> Here, both __intel_pmu_disable_all(false) and __intel_pmu_lbr_disable()
>>>>>> are used to optimize for this goal: the fewer branch instructions the
>>>>>> better.
>>>>> 
>>>>> Is it possible that we have another LBR in-kernel user in addition to the current
>>>>> BPF-LBR snapshot user, such as another BPF-LBR snapshot user or a LBR perf user ?
>>>> I think it is OK to have another user. We just need to capture the LBR entries.
>>>> In fact, we simply enable LBR by opening a perf_event on each CPU. So from the
>>>> kernel's point of view, the LBR is owned used by "another user".
>>>>> 
>>>>> In the intel_pmu_snapshot_[arch]_branch_stack(), what if there is a PMI or NMI handler
>>>>> to be called before __intel_pmu_lbr_disable(), which means more branch instructions
>>>>> (assuming we don't use the FREEZE_LBRS_ON_xxx capability)?
>>>> If we are unlucky and hit an NMI, we may get garbage data. The user will run the
>>>> test again.
>>>>> How about try to disable LBR at the earliest possible time, before __intel_pmu_disable_all(false) ?
>>>> I am not sure which solution is the best here. On bare metal, current version works
>>>> fine (available in bpf-next tree).
>>>>> 
>>>>>> After removing __intel_pmu_pebs_disable_all() from
>>>>>> intel_pmu_snapshot_branch_stack(), we found quite a few LBR entries in
>>>>>> extable related code. With these entries, snapshot branch stack is not
>>>>> 
>>>>> Are you saying that you still need to call
>>>>> __intel_pmu_pebs_disable_all() to maintain precision ?
>>>> I think we don't need pebs_disable_all. In the VM, pebs_disable_all will trigger
>>>> "unchecked MSR access error" warning. After removing it, the warning message is
>>>> gone. However, after we remove pebs_disable_all, we still see too many LBR entries
>>>> are flushed before LBR is stopped. Most of these new entries are in extable code.
>>>> I guess this is because the VM access these MSR differently.
>>> 
>>> Hi Song,
>>> 
>>> Thanks for your detailed input. I saw your workaround "if (is_hypervisor())" on the tree.
>>> 
>>> Even when the guest supports PEBS, this use case fails and the root cause is still
>>> playing hide-and-seek with me. Just check with you to see if you get similar results
>>> when the guest LBR behavior makes the test case fail like this:
>>> 
>>> serial_test_get_branch_snapshot:FAIL:find_looptest_in_lbr unexpected find_looptest_in_lbr: actual 0 <= expected 6
>>> serial_test_get_branch_snapshot:FAIL:check_wasted_entries unexpected check_wasted_entries: actual 32 >= expected 10
>>> #52 get_branch_snapshot:FAIL
>>> 
>>> Also, do you know or rough guess about how extable code relates to the test case ?
>> Sorry for the delayed response. I finally got some time to look into
>> this again. After disabling most debug configs, I managed to get it
>> work in the VM with a simple change as
> 
> Yes, most of the contaminated lbr records come from these guest symbols:
> 
> intel_pmu_snapshot_branch_stack
> native_write_msr
> trace_hardirqs_off
> lockdep_hardirqs_off
> __lock_acquire
> mark_lock
> migrate_disable
> rcu_is_watching
> bpf_get_branch_snapshot
> __bpf_prog_enter
> 
> I think we're fine with the current guest LBR emulation, right?

Yes, current result looks good to me. But it does rely on the .config.  

> 
>> diff --git i/arch/x86/events/intel/core.c w/arch/x86/events/intel/core.c
>> index 1248fc1937f82..3887b579297d7 100644
>> --- i/arch/x86/events/intel/core.c
>> +++ w/arch/x86/events/intel/core.c
>> @@ -2209,7 +2209,6 @@ intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int
>>         /* must not have branches... */
>>         local_irq_save(flags);
>>         __intel_pmu_disable_all(false); /* we don't care about BTS */
>> -       __intel_pmu_pebs_disable_all();
>>         __intel_pmu_lbr_disable();
>>         /*            ... until here */
>>         return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);
> 
> LGTM.
> 
>> (of course we also need to remove the is_hypervisor() check.).
>> But I am not sure whether this is the best fix.
>> I pushed all the change and debug code I used to
>> https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=get_branch_snapshot_in_vm
>> Could you please take a look at it and share your feedback on this?
> 
> How do we inform the user of bpf_get_branch_snapshot in a reasonable way
> that the lbr data will be inaccurate when using the debug kernel?
> 
> Is it better to check for mutual exclusion in code or to use the user documentation
> to specify this part of the restriction? It affects the user experience.

In uapi/linux/bpf.h, we have 

 * long bpf_get_branch_snapshot(void *entries, u32 size, u64 flags)
 *      Description
 *              Get branch trace from hardware engines like Intel LBR. The
 *              hardware engine is stopped shortly after the helper is
 *              called. Therefore, the user need to filter branch entries
 *              based on the actual use case. To capture branch trace
 *              before the trigger point of the BPF program, the helper
 *              should be called at the beginning of the BPF program.

I guess this is enough. 

> 
>> Specifically, can we fix intel_pmu_snapshot_branch_stack in vm with the
>> change above?
> 
> At least it's a valid fix and we can start from this change.

Thanks! I will send the fix. 

Song


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-10-29 18:09 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-29  0:04 bpf_get_branch_snapshot on qemu-kvm Song Liu
2021-09-29  7:35 ` Peter Zijlstra
2021-09-29 12:05   ` Like Xu
2021-09-29 12:26     ` Peter Zijlstra
2021-09-29 14:42       ` Song Liu
2021-09-29 15:59         ` Peter Zijlstra
2021-09-29 16:35           ` Liang, Kan
2021-09-29 20:05             ` Song Liu
2021-10-08  3:34               ` Like Xu
2021-10-08  5:46                 ` Song Liu
2021-10-08  6:36                   ` Like Xu
2021-10-08 17:08                     ` Song Liu
2021-10-09  9:03                       ` Like Xu
2021-10-26  7:09                         ` Song Liu
2021-10-28  3:09                           ` Like Xu
2021-10-29 18:09                             ` Song Liu
2021-09-29 23:20           ` Andi Kleen
2021-09-29 14:39     ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).