* Deadlock due to EPT_VIOLATION
@ 2023-05-23 14:02 Brian Rak
  2023-05-23 16:22 ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Brian Rak @ 2023-05-23 14:02 UTC (permalink / raw)
  To: kvm

We've been hitting an issue lately where KVM guests (w/ qemu) have been 
getting stuck in a loop of EPT_VIOLATIONs, and end up requiring a guest 
reboot to fix.

On Intel machines the trace ends up looking like:

     CPU-2386625 [094] 6598425.465404: kvm_entry:            vcpu 1, rip 
0xffffffffc0771aa2
     CPU-2386625 [094] 6598425.465405: kvm_exit:             reason 
EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
     CPU-2386625 [094] 6598425.465405: kvm_page_fault:       vcpu 1 rip 
0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683
     CPU-2386625 [094] 6598425.465406: kvm_inj_virq:         IRQ 0xec 
[reinjected]
     CPU-2386625 [094] 6598425.465406: kvm_entry:            vcpu 1, rip 
0xffffffffc0771aa2
     CPU-2386625 [094] 6598425.465407: kvm_exit:             reason 
EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
     CPU-2386625 [094] 6598425.465407: kvm_page_fault:       vcpu 1 rip 
0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683
     CPU-2386625 [094] 6598425.465408: kvm_inj_virq:         IRQ 0xec 
[reinjected]
     CPU-2386625 [094] 6598425.465408: kvm_entry:            vcpu 1, rip 
0xffffffffc0771aa2
     CPU-2386625 [094] 6598425.465409: kvm_exit:             reason 
EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
     CPU-2386625 [094] 6598425.465410: kvm_page_fault:       vcpu 1 rip 
0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683
     CPU-2386625 [094] 6598425.465410: kvm_inj_virq:         IRQ 0xec 
[reinjected]
     CPU-2386625 [094] 6598425.465410: kvm_entry:            vcpu 1, rip 
0xffffffffc0771aa2
     CPU-2386625 [094] 6598425.465411: kvm_exit:             reason 
EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
     CPU-2386625 [094] 6598425.465412: kvm_page_fault:       vcpu 1 rip 
0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683
     CPU-2386625 [094] 6598425.465413: kvm_inj_virq:         IRQ 0xec 
[reinjected]
     CPU-2386625 [094] 6598425.465413: kvm_entry:            vcpu 1, rip 
0xffffffffc0771aa2
     CPU-2386625 [094] 6598425.465414: kvm_exit:             reason 
EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
     CPU-2386625 [094] 6598425.465414: kvm_page_fault:       vcpu 1 rip 
0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683
     CPU-2386625 [094] 6598425.465415: kvm_inj_virq:         IRQ 0xec 
[reinjected]
     CPU-2386625 [094] 6598425.465415: kvm_entry:            vcpu 1, rip 
0xffffffffc0771aa2
     CPU-2386625 [094] 6598425.465417: kvm_exit:             reason 
EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
     CPU-2386625 [094] 6598425.465417: kvm_page_fault:       vcpu 1 rip 
0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683

on AMD machines, we end up with:

     CPU-14414 [063] 3039492.055571: kvm_page_fault:       vcpu 0 rip 
0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
     CPU-14414 [063] 3039492.055571: kvm_entry:            vcpu 0, rip 
0xffffffffb172ab2b
     CPU-14414 [063] 3039492.055572: kvm_exit:             reason 
EXIT_NPF rip 0xffffffffb172ab2b info 200000006 f88eb2ff8
     CPU-14414 [063] 3039492.055572: kvm_page_fault:       vcpu 0 rip 
0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
     CPU-14414 [063] 3039492.055573: kvm_entry:            vcpu 0, rip 
0xffffffffb172ab2b
     CPU-14414 [063] 3039492.055574: kvm_exit:             reason 
EXIT_NPF rip 0xffffffffb172ab2b info 200000006 f88eb2ff8
     CPU-14414 [063] 3039492.055574: kvm_page_fault:       vcpu 0 rip 
0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
     CPU-14414 [063] 3039492.055575: kvm_entry:            vcpu 0, rip 
0xffffffffb172ab2b
     CPU-14414 [063] 3039492.055575: kvm_exit:             reason 
EXIT_NPF rip 0xffffffffb172ab2b info 200000006 f88eb2ff8
     CPU-14414 [063] 3039492.055576: kvm_page_fault:       vcpu 0 rip 
0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
     CPU-14414 [063] 3039492.055576: kvm_entry:            vcpu 0, rip 
0xffffffffb172ab2b
     CPU-14414 [063] 3039492.055577: kvm_exit:             reason 
EXIT_NPF rip 0xffffffffb172ab2b info 200000006 f88eb2ff8
     CPU-14414 [063] 3039492.055577: kvm_page_fault:       vcpu 0 rip 
0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
     CPU-14414 [063] 3039492.055578: kvm_entry:            vcpu 0, rip 
0xffffffffb172ab2b
     CPU-14414 [063] 3039492.055579: kvm_exit:             reason 
EXIT_NPF rip 0xffffffffb172ab2b info 200000006 f88eb2ff8
     CPU-14414 [063] 3039492.055579: kvm_page_fault:       vcpu 0 rip 
0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
     CPU-14414 [063] 3039492.055580: kvm_entry:            vcpu 0, rip 
0xffffffffb172ab2b


The qemu process ends up looking like this once it happens:

     0x00007fdc6a51be26 in internal_fallocate64 (fd=-514841856, 
offset=16, len=140729021657088) at ../sysdeps/posix/posix_fallocate64.c:36
     36     return EINVAL;
     (gdb) thread apply all bt

     Thread 6 (Thread 0x7fdbdefff700 (LWP 879746) "vnc_worker"):
     #0  futex_wait_cancelable (private=0, expected=0, 
futex_word=0x7fdc688f66cc) at ../sysdeps/nptl/futex-internal.h:186
     #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, 
mutex=0x7fdc688f66d8, cond=0x7fdc688f66a0) at pthread_cond_wait.c:508
     #2  __pthread_cond_wait (cond=cond@entry=0x7fdc688f66a0, 
mutex=mutex@entry=0x7fdc688f66d8) at pthread_cond_wait.c:638
     #3  0x0000563424cbd32b in qemu_cond_wait_impl (cond=0x7fdc688f66a0, 
mutex=0x7fdc688f66d8, file=0x563424d302b4 "../../ui/vnc-jobs.c", 
line=248) at ../../util/qemu-thread-posix.c:220
     #4  0x00005634247dac33 in vnc_worker_thread_loop 
(queue=0x7fdc688f66a0) at ../../ui/vnc-jobs.c:248
     #5  0x00005634247db8f8 in vnc_worker_thread 
(arg=arg@entry=0x7fdc688f66a0) at ../../ui/vnc-jobs.c:361
     #6  0x0000563424cbc7e9 in qemu_thread_start (args=0x7fdbdeffcf30) 
at ../../util/qemu-thread-posix.c:505
     #7  0x00007fdc6a8e1ea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
     #8  0x00007fdc6a527a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

     Thread 5 (Thread 0x7fdbe5dff700 (LWP 879738) "CPU 1/KVM"):
     #0  0x00007fdc6a51d5f7 in preadv64v2 (fd=1756258112, 
vector=0x563424b5f007 <kvm_vcpu_ioctl+103>, count=0, offset=0, 
flags=44672) at ../sysdeps/unix/sysv/linux/preadv64v2.c:31
     #1  0x0000000000000000 in ?? ()

     Thread 4 (Thread 0x7fdbe6fff700 (LWP 879737) "CPU 0/KVM"):
     #0  0x00007fdc6a51d5f7 in preadv64v2 (fd=1755834304, 
vector=0x563424b5f007 <kvm_vcpu_ioctl+103>, count=0, offset=0, 
flags=44672) at ../sysdeps/unix/sysv/linux/preadv64v2.c:31
     #1  0x0000000000000000 in ?? ()

     Thread 3 (Thread 0x7fdbe83ff700 (LWP 879735) "IO mon_iothread"):
     #0  0x00007fdc6a51bd2f in internal_fallocate64 (fd=-413102080, 
offset=3, len=4294967295) at ../sysdeps/posix/posix_fallocate64.c:32
     #1  0x000d5572b9bb0764 in ?? ()
     #2  0x000000016891db00 in ?? ()
     #3  0xffffffff7fffffff in ?? ()
     #4  0xf6b8254512850600 in ?? ()
     #5  0x0000000000000000 in ?? ()

     Thread 2 (Thread 0x7fdc693ff700 (LWP 879730) "qemu-kvm"):
     #0  0x00007fdc6a5212e9 in ?? () from 
target:/lib/x86_64-linux-gnu/libc.so.6
     #1  0x0000563424cbd9aa in qemu_futex_wait (val=<optimized out>, 
f=<optimized out>) at ./include/qemu/futex.h:29
     #2  qemu_event_wait (ev=ev@entry=0x5634254bd1a8 
<rcu_call_ready_event>) at ../../util/qemu-thread-posix.c:430
     #3  0x0000563424cc6d80 in call_rcu_thread (opaque=opaque@entry=0x0) 
at ../../util/rcu.c:261
     #4  0x0000563424cbc7e9 in qemu_thread_start (args=0x7fdc693fcf30) 
at ../../util/qemu-thread-posix.c:505
     #5  0x00007fdc6a8e1ea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
     #6  0x00007fdc6a527a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

     Thread 1 (Thread 0x7fdc69c3a680 (LWP 879712) "qemu-kvm"):
     #0  0x00007fdc6a51be26 in internal_fallocate64 (fd=-514841856, 
offset=16, len=140729021657088) at ../sysdeps/posix/posix_fallocate64.c:36
     #1  0x0000000000000000 in ?? ()

We first started seeing this back in 5.19, and we're still seeing it as 
of 6.1.24 (likely later too, we don't have a ton of data for newer 
versions).  We haven't been able to link this to any specific hardware.  
It appears to happen more often on Intel, but our sample size is much 
larger there.  Guest operating system type/version doesn't appear to 
matter.  This usually happens to guests with a heavy network/disk 
workload, but it can happen to even idle guests. This has happened on 
qemu 7.0 and 7.2 (upgrading to 7.2.2 is on our list to do).

Where do we go from here?  We haven't really made a lot of progress in 
figuring out why this keeps happening, nor have we been able to come up 
with a reliable way to reproduce it.


* Re: Deadlock due to EPT_VIOLATION
  2023-05-23 14:02 Deadlock due to EPT_VIOLATION Brian Rak
@ 2023-05-23 16:22 ` Sean Christopherson
  2023-05-24 13:39   ` Brian Rak
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-05-23 16:22 UTC (permalink / raw)
  To: brak; +Cc: kvm

Nit, this isn't deadlock.  It may or may not even be a livelock AFAICT.  The vCPUs
are simply stuck and not making forward progress, _why_ they aren't making forward
progress is unknown at this point (obviously :-) ).

On Tue, May 23, 2023, Brian Rak wrote:
> We've been hitting an issue lately where KVM guests (w/ qemu) have been
> getting stuck in a loop of EPT_VIOLATIONs, and end up requiring a guest
> reboot to fix.
> 
> On Intel machines the trace ends up looking like:
> 
>     CPU-2386625 [094] 6598425.465404: kvm_entry:            vcpu 1, rip 0xffffffffc0771aa2
>     CPU-2386625 [094] 6598425.465405: kvm_exit:             reason EPT_VIOLATION rip 0xffffffffc0771aa2 info 683 800000ec
>     CPU-2386625 [094] 6598425.465405: kvm_page_fault:       vcpu 1 rip 0xffffffffc0771aa2 address 0x0000000002594fe0 error_code 0x683
>     CPU-2386625 [094] 6598425.465406: kvm_inj_virq:         IRQ 0xec [reinjected]
>     [...]
> 
> on AMD machines, we end up with:
> 
>     CPU-14414 [063] 3039492.055571: kvm_page_fault:       vcpu 0 rip 0xffffffffb172ab2b address 0x0000000f88eb2ff8 error_code 0x200000006
>     CPU-14414 [063] 3039492.055571: kvm_entry:            vcpu 0, rip 0xffffffffb172ab2b
>     CPU-14414 [063] 3039492.055572: kvm_exit:             reason EXIT_NPF rip 0xffffffffb172ab2b info 200000006 f88eb2ff8
>     [...]

In both cases, the TDP fault (EPT violation on Intel, #NPF on AMD) is occurring
when translating a guest paging structure.  I can't glean much from the AMD case,
but in the Intel trace, the fault occurs during delivery of the timer interrupt
(vector 0xec).  That may or may not be relevant to what's going on.

It's definitely suspicious that both traces show that the guest is stuck faulting
on a guest paging structure.  Purely from a probability perspective, the odds of
that being a coincidence are low, though definitely not impossible.
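
For reference, a quick sketch of decoding the exit qualification from the trace above
(bit layout per the Intel SDM's EPT-violation exit qualification; 0x683 is the value in
the kvm_exit/kvm_page_fault lines):

    # bit 0 = data read, bit 1 = data write, bit 2 = instruction fetch,
    # bit 7 = guest linear address valid,
    # bit 8 (meaningful when bit 7 is set) = access was to the final translation;
    #        if clear, the access was to a guest paging-structure entry.
    q=0x683
    printf 'read=%d write=%d fetch=%d gla_valid=%d final_translation=%d\n' \
        $((q & 1)) $((q >> 1 & 1)) $((q >> 2 & 1)) $((q >> 7 & 1)) $((q >> 8 & 1))
    # -> read=1 write=1 fetch=0 gla_valid=1 final_translation=0, i.e. the fault hit
    #    a paging-structure entry, not the final guest physical page.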

> The qemu process ends up looking like this once it happens:
> 
>     0x00007fdc6a51be26 in internal_fallocate64 (fd=-514841856, offset=16, len=140729021657088) at ../sysdeps/posix/posix_fallocate64.c:36
>     36     return EINVAL;
>     (gdb) thread apply all bt
>     [...]
> 
> We first started seeing this back in 5.19, and we're still seeing it as of
> 6.1.24 (likely later too, we don't have a ton of data for newer versions).
> We haven't been able to link this to any specific hardware.

Just to double check, this is the host kernel version, correct?  When you upgraded
to kernel 5.19, did you change anything else in the stack?  E.g. did you upgrade
QEMU at the same time?  And what kernel were you upgrading from?

> It appears to happen more often on Intel, but our sample size is much larger
> there.  Guest operating system type/version doesn't appear to matter.  This
> usually happens to guests with a heavy network/disk workload, but it can
> happen to even idle guests. This has happened on qemu 7.0 and 7.2 (upgrading
> to 7.2.2 is on our list to do).
> 
> Where do we go from here?  We haven't really made a lot of progress in
> figuring out why this keeps happening, nor have we been able to come up with
> a reliable way to reproduce it.

Is it possible to capture a failure with the trace_kvm_unmap_hva_range,
kvm_mmu_spte_requested and kvm_mmu_set_spte tracepoints enabled?  That will hopefully
provide insight into why the vCPU keeps faulting, e.g. should show if KVM is
installing a "bad" SPTE, or if KVM is doing nothing and intentionally retrying
the fault because there are constant and/or unresolved mmu_notifier events.  My
guess is that it's the latter (KVM doing nothing) due to the fallocate() calls
in the stack, but that's really just a guess.
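
Something like the following should capture those on the host while a vCPU is spinning
(a rough sketch, assuming trace-cmd is available; the tracepoints above should be covered
by the kvm and kvmmmu event groups):

    # Record the kvm and kvmmmu event groups for a few seconds while the guest is stuck.
    trace-cmd record -e kvm -e kvmmmu sleep 5
    # Then pull out just the events of interest.
    trace-cmd report | grep -E 'kvm_unmap_hva_range|kvm_mmu_spte_requested|kvm_mmu_set_spte'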

The other thing that would be helpful would be getting kernel stack traces of the
relevant tasks/threads.  The vCPU stack traces won't be interesting, but it'll
likely help to see what the fallocate() tasks are doing.
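
Something like this on the host is one way to get those (a rough sketch; run as root,
and <qemu-pid> is a placeholder for the stuck QEMU process):

    # Dump the kernel-mode stack of every thread in the QEMU process.
    for t in /proc/<qemu-pid>/task/*; do
        echo "== $(cat $t/comm) (tid $(basename $t)) =="
        cat $t/stack
    done
    # "echo t > /proc/sysrq-trigger" will also dump all task backtraces to dmesg.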


* Re: Deadlock due to EPT_VIOLATION
  2023-05-23 16:22 ` Sean Christopherson
@ 2023-05-24 13:39   ` Brian Rak
  2023-05-26 16:59     ` Brian Rak
  0 siblings, 1 reply; 48+ messages in thread
From: Brian Rak @ 2023-05-24 13:39 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm


On 5/23/2023 12:22 PM, Sean Christopherson wrote:
> Nit, this isn't deadlock.  It may or may not even be a livelock AFAICT.  The vCPUs
> are simply stuck and not making forward progress, _why_ they aren't making forward
> progress is unknown at this point (obviously :-) ).
>
> [...]
>
>> We first started seeing this back in 5.19, and we're still seeing it as of
>> 6.1.24 (likely later too, we don't have a ton of data for newer versions).
>> We haven't been able to link this to any specific hardware.
> Just to double check, this is the host kernel version, correct?  When you upgraded
> to kernel 5.19, did you change anything else in the stack?  E.g. did you upgrade
> QEMU at the same time?  And what kernel were you upgrading from?
Those are host kernel versions, correct.  We went from 5.15 -> 5.19.  
That was the only change at the time.
>
>> It appears to happen more often on Intel, but our sample size is much larger
>> there.  Guest operating system type/version doesn't appear to matter.  This
>> usually happens to guests with a heavy network/disk workload, but it can
>> happen to even idle guests. This has happened on qemu 7.0 and 7.2 (upgrading
>> to 7.2.2 is on our list to do).
>>
>> Where do we go from here?  We haven't really made a lot of progress in
>> figuring out why this keeps happening, nor have we been able to come up with
>> a reliable way to reproduce it.
> Is it possible to capture a failure with the trace_kvm_unmap_hva_range,
> kvm_mmu_spte_requested and kvm_mmu_set_spte tracepoints enabled?  That will hopefully
> provide insight into why the vCPU keeps faulting, e.g. should show if KVM is
> installing a "bad" SPTE, or if KVM is doing nothing and intentionally retrying
> the fault because there are constant and/or unresolved mmu_notifier
> guess is that it's the latter (KVM doing nothing) due to the fallocate() calls
> in the stack, but that's really just a guess.

In that trace, I had all the kvm/kvmmmu events enabled (trace-cmd record 
-e kvm -e kvmmmu).  Just to be sure, I repeated this with `trace-cmd 
record -e all`:

              CPU-1365880 [038] 5559771.610941: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610941: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610942: write_msr:            
1d9, value 4000
              CPU-1365880 [038] 5559771.610942: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffb0e9bed0 info 83 0
              CPU-1365880 [038] 5559771.610942: kvm_page_fault:       
vcpu 0 rip 0xffffffffb0e9bed0 address 0x000000016e5b2ff8 error_code 0x83
              CPU-1365880 [038] 5559771.610943: kvm_entry:            
vcpu 0, rip 0xffffffffb0e9bed0
              CPU-1365880 [038] 5559771.610943: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610943: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610944: write_msr:            
1d9, value 4000
              CPU-1365880 [038] 5559771.610944: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffb0e9bed0 info 83 0
              CPU-1365880 [038] 5559771.610944: kvm_page_fault:       
vcpu 0 rip 0xffffffffb0e9bed0 address 0x000000016e5b2ff8 error_code 0x83
              CPU-1365880 [038] 5559771.610945: kvm_entry:            
vcpu 0, rip 0xffffffffb0e9bed0
              CPU-1365880 [038] 5559771.610945: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610945: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610946: write_msr:            
1d9, value 4000
              CPU-1365880 [038] 5559771.610946: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffb0e9bed0 info 83 0
              CPU-1365880 [038] 5559771.610946: kvm_page_fault:       
vcpu 0 rip 0xffffffffb0e9bed0 address 0x000000016e5b2ff8 error_code 0x83
              CPU-1365880 [038] 5559771.610947: kvm_entry:            
vcpu 0, rip 0xffffffffb0e9bed0
              CPU-1365880 [038] 5559771.610947: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610947: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610948: write_msr:            
1d9, value 4000
              CPU-1365880 [038] 5559771.610948: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffb0e9bed0 info 83 0
              CPU-1365880 [038] 5559771.610948: kvm_page_fault:       
vcpu 0 rip 0xffffffffb0e9bed0 address 0x000000016e5b2ff8 error_code 0x83
              CPU-1365880 [038] 5559771.610948: kvm_entry:            
vcpu 0, rip 0xffffffffb0e9bed0
              CPU-1365880 [038] 5559771.610949: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610949: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610949: write_msr:            
1d9, value 4000
              CPU-1365880 [038] 5559771.610950: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffb0e9bed0 info 83 0
              CPU-1365880 [038] 5559771.610950: kvm_page_fault:       
vcpu 0 rip 0xffffffffb0e9bed0 address 0x000000016e5b2ff8 error_code 0x83
              CPU-1365880 [038] 5559771.610950: kvm_entry:            
vcpu 0, rip 0xffffffffb0e9bed0
              CPU-1365880 [038] 5559771.610950: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610951: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610951: write_msr:            
1d9, value 4000
              CPU-1365880 [038] 5559771.610951: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffb0e9bed0 info 83 0
              CPU-1365880 [038] 5559771.610952: kvm_page_fault:       
vcpu 0 rip 0xffffffffb0e9bed0 address 0x000000016e5b2ff8 error_code 0x83
              CPU-1365880 [038] 5559771.610952: kvm_entry:            
vcpu 0, rip 0xffffffffb0e9bed0
              CPU-1365880 [038] 5559771.610952: rcu_utilization:      
Start context switch
              CPU-1365880 [038] 5559771.610952: rcu_utilization:      
End context switch
              CPU-1365880 [038] 5559771.610953: write_msr:            
1d9, value 4000

> The other thing that would be helpful would be getting kernel stack traces of the
> relevant tasks/threads.  The vCPU stack traces won't be interesting, but it'll
> likely help to see what the fallocate() tasks are doing.
I'll see what I can come up with here, I was running into some 
difficulty getting useful stack traces out of the VM


* Re: Deadlock due to EPT_VIOLATION
  2023-05-24 13:39   ` Brian Rak
@ 2023-05-26 16:59     ` Brian Rak
  2023-05-26 21:02       ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Brian Rak @ 2023-05-26 16:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm


On 5/24/2023 9:39 AM, Brian Rak wrote:
>
> On 5/23/2023 12:22 PM, Sean Christopherson wrote:
> [...]
>> The other thing that would be helpful would be getting kernel stack 
>> traces of the
>> relevant tasks/threads.  The vCPU stack traces won't be interesting, 
>> but it'll
>> likely help to see what the fallocate() tasks are doing.
> I'll see what I can come up with here, I was running into some 
> difficulty getting useful stack traces out of the VM

I didn't have any luck gathering guest-level stack traces - kaslr makes 
it pretty difficult even if I have the guest kernel symbols.

Interestingly, we've been seeing this on Windows guests as well, which would rule 
out a bug within the guest kernel.  This was from a Windows guest:

              CPU-5638  [034] 5446408.762619: rcu_utilization:      End 
context switch
              CPU-5638  [034] 5446408.762620: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762620: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762621: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762621: kvm_inj_virq: IRQ 0xd1 
[reinjected]
              CPU-5638  [034] 5446408.762622: kvm_entry: vcpu 0, rip 
0xfffff802057fee67
              CPU-5638  [034] 5446408.762622: rcu_utilization: Start 
context switch
              CPU-5638  [034] 5446408.762622: rcu_utilization: End 
context switch
              CPU-5638  [034] 5446408.762623: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762624: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762624: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762625: kvm_inj_virq: IRQ 0xd1 
[reinjected]
              CPU-5638  [034] 5446408.762625: kvm_entry: vcpu 0, rip 
0xfffff802057fee67
              CPU-5638  [034] 5446408.762625: rcu_utilization: Start 
context switch
              CPU-5638  [034] 5446408.762625: rcu_utilization: End 
context switch
              CPU-5638  [034] 5446408.762626: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762627: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762627: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762628: kvm_inj_virq: IRQ 0xd1 
[reinjected]
              CPU-5638  [034] 5446408.762628: kvm_entry: vcpu 0, rip 
0xfffff802057fee67
              CPU-5638  [034] 5446408.762628: rcu_utilization: Start 
context switch
              CPU-5638  [034] 5446408.762628: rcu_utilization: End 
context switch
              CPU-5638  [034] 5446408.762629: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762630: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762630: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762631: kvm_inj_virq: IRQ 0xd1 
[reinjected]
              CPU-5638  [034] 5446408.762631: kvm_entry: vcpu 0, rip 
0xfffff802057fee67
              CPU-5638  [034] 5446408.762631: rcu_utilization: Start 
context switch
              CPU-5638  [034] 5446408.762632: rcu_utilization: End 
context switch
              CPU-5638  [034] 5446408.762633: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762633: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762633: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762634: kvm_inj_virq: IRQ 0xd1 
[reinjected]
              CPU-5638  [034] 5446408.762634: kvm_entry: vcpu 0, rip 
0xfffff802057fee67
              CPU-5638  [034] 5446408.762635: rcu_utilization: Start 
context switch
              CPU-5638  [034] 5446408.762635: rcu_utilization: End 
context switch
              CPU-5638  [034] 5446408.762636: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762636: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762636: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762637: kvm_inj_virq: IRQ 0xd1 
[reinjected]
              CPU-5638  [034] 5446408.762638: kvm_entry: vcpu 0, rip 
0xfffff802057fee67
              CPU-5638  [034] 5446408.762638: rcu_utilization: Start 
context switch
              CPU-5638  [034] 5446408.762638: rcu_utilization: End 
context switch
              CPU-5638  [034] 5446408.762639: write_msr: 1d9, value 4000
              CPU-5638  [034] 5446408.762639: kvm_exit: reason 
EPT_VIOLATION rip 0xfffff802057fee67 info 683 800000d1
              CPU-5638  [034] 5446408.762640: kvm_page_fault: vcpu 0 rip 
0xfffff802057fee67 address 0x0000000054200f80 error_code 0x683
              CPU-5638  [034] 5446408.762640: kvm_inj_virq: IRQ 0xd1 
[reinjected]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-05-26 16:59     ` Brian Rak
@ 2023-05-26 21:02       ` Sean Christopherson
  2023-05-30 17:35         ` Brian Rak
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-05-26 21:02 UTC (permalink / raw)
  To: brak; +Cc: kvm

On Fri, May 26, 2023, Brian Rak wrote:
> 
> On 5/24/2023 9:39 AM, Brian Rak wrote:
> > 
> > On 5/23/2023 12:22 PM, Sean Christopherson wrote:
> > > The other thing that would be helpful would be getting kernel stack
> > > traces of the
> > > relevant tasks/threads.  The vCPU stack traces won't be interesting,
> > > but it'll
> > > likely help to see what the fallocate() tasks are doing.
> > I'll see what I can come up with here, I was running into some
> > difficulty getting useful stack traces out of the VM
> 
> I didn't have any luck gathering guest-level stack traces - kaslr makes it
> pretty difficult even if I have the guest kernel symbols.

Sorry, I was hoping to get host stack traces, not guest stack traces.  I am hoping
to see what the fallocate() in the *host* is doing.

Another datapoint that might provide insight would be seeing if/how KVM's page
faults stats change, e.g. look at /sys/kernel/debug/kvm/pf_* multiple times when
the guest is stuck.

Are you able to run modified host kernels?  If so, the easiest next step, assuming
stack traces don't provide a smoking gun, would be to add printks into the page
fault path to see why KVM is retrying instead of installing a SPTE.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-05-26 21:02       ` Sean Christopherson
@ 2023-05-30 17:35         ` Brian Rak
  2023-05-30 18:36           ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Brian Rak @ 2023-05-30 17:35 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm


On 5/26/2023 5:02 PM, Sean Christopherson wrote:
> On Fri, May 26, 2023, Brian Rak wrote:
>> On 5/24/2023 9:39 AM, Brian Rak wrote:
>>> On 5/23/2023 12:22 PM, Sean Christopherson wrote:
>>>> The other thing that would be helpful would be getting kernel stack
>>>> traces of the
>>>> relevant tasks/threads.  The vCPU stack traces won't be interesting,
>>>> but it'll
>>>> likely help to see what the fallocate() tasks are doing.
>>> I'll see what I can come up with here, I was running into some
>>> difficulty getting useful stack traces out of the VM
>> I didn't have any luck gathering guest-level stack traces - kaslr makes it
>> pretty difficult even if I have the guest kernel symbols.
> Sorry, I was hoping to get host stack traces, not guest stack traces.  I am hoping
> to see what the fallocate() in the *host* is doing.

Ah - here's a different instance of it with a full backtrace from the host:

(gdb) thread apply all bt

Thread 8 (Thread 0x7fbbd0bff700 (LWP 353251) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, 
futex_word=0x7fbddd4f6b2c) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, 
mutex=0x7fbddd4f6b38, cond=0x7fbddd4f6b00) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7fbddd4f6b00, 
mutex=mutex@entry=0x7fbddd4f6b38) at pthread_cond_wait.c:638
#3  0x00005593ea9a232b in qemu_cond_wait_impl (cond=0x7fbddd4f6b00, 
mutex=0x7fbddd4f6b38, file=0x5593eaa152b4 "../../ui/vnc-jobs.c", 
line=248) at ../../util/qemu-thread-posix.c:220
#4  0x00005593ea4bfc33 in vnc_worker_thread_loop (queue=0x7fbddd4f6b00) 
at ../../ui/vnc-jobs.c:248
#5  0x00005593ea4c08f8 in vnc_worker_thread 
(arg=arg@entry=0x7fbddd4f6b00) at ../../ui/vnc-jobs.c:361
#6  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbbd0bfcf30) at 
../../util/qemu-thread-posix.c:505
#7  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#8  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7fbbd85ff700 (LWP 353248) "CPU 3/KVM"):
#0  0x00007fbddf11d237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005593ea844007 in kvm_vcpu_ioctl (cpu=cpu@entry=0x7fbddd42f7c0, 
type=type@entry=44672) at ../../accel/kvm/kvm-all.c:3035
#2  0x00005593ea844171 in kvm_cpu_exec (cpu=cpu@entry=0x7fbddd42f7c0) at 
../../accel/kvm/kvm-all.c:2850
#3  0x00005593ea8457ed in kvm_vcpu_thread_fn 
(arg=arg@entry=0x7fbddd42f7c0) at ../../accel/kvm/kvm-accel-ops.c:51
#4  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbbd85fcf30) at 
../../util/qemu-thread-posix.c:505
#5  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#6  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fbbd97ff700 (LWP 353247) "CPU 2/KVM"):
#0  0x00007fbddf11d237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005593ea844007 in kvm_vcpu_ioctl (cpu=cpu@entry=0x7fbddd4213c0, 
type=type@entry=44672) at ../../accel/kvm/kvm-all.c:3035
#2  0x00005593ea844171 in kvm_cpu_exec (cpu=cpu@entry=0x7fbddd4213c0) at 
../../accel/kvm/kvm-all.c:2850
#3  0x00005593ea8457ed in kvm_vcpu_thread_fn 
(arg=arg@entry=0x7fbddd4213c0) at ../../accel/kvm/kvm-accel-ops.c:51
#4  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbbd97fcf30) at 
../../util/qemu-thread-posix.c:505
#5  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#6  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fbbda9ff700 (LWP 353246) "CPU 1/KVM"):
#0  0x00007fbddf11d237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005593ea844007 in kvm_vcpu_ioctl (cpu=cpu@entry=0x7fbddd6f5e40, 
type=type@entry=44672) at ../../accel/kvm/kvm-all.c:3035
#2  0x00005593ea844171 in kvm_cpu_exec (cpu=cpu@entry=0x7fbddd6f5e40) at 
../../accel/kvm/kvm-all.c:2850
#3  0x00005593ea8457ed in kvm_vcpu_thread_fn 
(arg=arg@entry=0x7fbddd6f5e40) at ../../accel/kvm/kvm-accel-ops.c:51
#4  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbbda9fcf30) at 
../../util/qemu-thread-posix.c:505
#5  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#6  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fbbdbbff700 (LWP 353245) "CPU 0/KVM"):
#0  0x00007fbddf11d237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005593ea844007 in kvm_vcpu_ioctl (cpu=cpu@entry=0x7fbddd6a8c00, 
type=type@entry=44672) at ../../accel/kvm/kvm-all.c:3035
#2  0x00005593ea844171 in kvm_cpu_exec (cpu=cpu@entry=0x7fbddd6a8c00) at 
../../accel/kvm/kvm-all.c:2850
#3  0x00005593ea8457ed in kvm_vcpu_thread_fn 
(arg=arg@entry=0x7fbddd6a8c00) at ../../accel/kvm/kvm-accel-ops.c:51
#4  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbbdbbfcf30) at 
../../util/qemu-thread-posix.c:505
#5  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#6  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fbbdcfff700 (LWP 353244) "IO mon_iothread"):
#0  0x00007fbddf11b96f in __GI___poll (fds=0x7fbbdc209000, nfds=3, 
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fbddf6eb0ae in ?? () from 
target:/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007fbddf6eb40b in g_main_loop_run () from 
target:/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00005593ea880509 in iothread_run 
(opaque=opaque@entry=0x7fbddd51db00) at ../../iothread.c:74
#4  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbbdcffcf30) at 
../../util/qemu-thread-posix.c:505
#5  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#6  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fbdddfff700 (LWP 353235) "qemu-kvm"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00005593ea9a29aa in qemu_futex_wait (val=<optimized out>, 
f=<optimized out>) at ./include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x5593eb1a21a8 <rcu_call_ready_event>) 
at ../../util/qemu-thread-posix.c:430
#3  0x00005593ea9abd80 in call_rcu_thread (opaque=opaque@entry=0x0) at 
../../util/rcu.c:261
#4  0x00005593ea9a17e9 in qemu_thread_start (args=0x7fbdddffcf30) at 
../../util/qemu-thread-posix.c:505
#5  0x00007fbddf51bea7 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#6  0x00007fbddf127a2f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fbdde874680 (LWP 353228) "qemu-kvm"):
#0  0x00007fbddf11ba66 in __ppoll (fds=0x7fbbd39b9f00, nfds=21, 
timeout=<optimized out>, timeout@entry=0x7fff9ff32c90, 
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x00005593ea9b70f1 in ppoll (__ss=0x0, __timeout=0x7fff9ff32c90, 
__nfds=<optimized out>, __fds=<optimized out>) at 
/usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
timeout=timeout@entry=2429393072) at ../../util/qemu-timer.c:351
#3  0x00005593ea9b4955 in os_host_main_loop_wait (timeout=2429393072) at 
../../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at 
../../util/main-loop.c:606
#5  0x00005593ea64ca91 in qemu_main_loop () at ../../softmmu/runstate.c:739
#6  0x00005593ea84c8e7 in qemu_default_main () at ../../softmmu/main.c:37
#7  0x00007fbddf04fd0a in __libc_start_main (main=0x5593ea497620 <main>, 
argc=109, argv=0x7fff9ff32e58, init=<optimized out>, fini=<optimized 
out>, rtld_fini=<optimized out>, stack_end=0x7fff9ff32e48) at 
../csu/libc-start.c:308
#8  0x00005593ea498dda in _start ()

and trace with -e all from that same guest:

              CPU-353246 [041] 5473778.770320: write_msr:            
1d9, value 4000
              CPU-353248 [027] 5473778.770321: write_msr:            
1d9, value 4000
              CPU-353246 [041] 5473778.770321: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc726140 info 683 0
              CPU-353248 [027] 5473778.770321: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353245 [039] 5473778.770321: kvm_entry:            
vcpu 0, rip 0xffffffffbc7648b0
              CPU-353246 [041] 5473778.770321: kvm_page_fault:       
vcpu 1 rip 0xffffffffbc726140 address 0x00000001170aaff8 error_code 0x683
              CPU-353247 [019] 5473778.770321: write_msr:            
1d9, value 4000
              CPU-353248 [027] 5473778.770321: kvm_page_fault:       
vcpu 3 rip 0xffffffffbc7648b0 address 0x0000000274300ff8 error_code 0x683
              CPU-353245 [039] 5473778.770321: rcu_utilization:      
Start context switch
              CPU-353247 [019] 5473778.770321: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353245 [039] 5473778.770322: rcu_utilization:      End 
context switch
              CPU-353246 [041] 5473778.770322: kvm_entry:            
vcpu 1, rip 0xffffffffbc726140
              CPU-353247 [019] 5473778.770322: kvm_page_fault:       
vcpu 2 rip 0xffffffffbc7648b0 address 0x000000026e580ff8 error_code 0x683
              CPU-353248 [027] 5473778.770322: kvm_entry:            
vcpu 3, rip 0xffffffffbc7648b0
              CPU-353246 [041] 5473778.770322: rcu_utilization:      
Start context switch
              CPU-353246 [041] 5473778.770322: rcu_utilization:      End 
context switch
              CPU-353248 [027] 5473778.770322: rcu_utilization:      
Start context switch
              CPU-353248 [027] 5473778.770322: rcu_utilization:      End 
context switch
              CPU-353247 [019] 5473778.770323: kvm_entry:            
vcpu 2, rip 0xffffffffbc7648b0
              CPU-353245 [039] 5473778.770323: write_msr:            
1d9, value 4000
              CPU-353245 [039] 5473778.770323: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353247 [019] 5473778.770323: rcu_utilization:      
Start context switch
              CPU-353246 [041] 5473778.770323: write_msr:            
1d9, value 4000
              CPU-353245 [039] 5473778.770323: kvm_page_fault:       
vcpu 0 rip 0xffffffffbc7648b0 address 0x0000000263670ff8 error_code 0x683
              CPU-353246 [041] 5473778.770323: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc726140 info 683 0
              CPU-353247 [019] 5473778.770323: rcu_utilization:      End 
context switch
              CPU-353248 [027] 5473778.770324: write_msr:            
1d9, value 4000
              CPU-353246 [041] 5473778.770324: kvm_page_fault:       
vcpu 1 rip 0xffffffffbc726140 address 0x00000001170aaff8 error_code 0x683
              CPU-353248 [027] 5473778.770324: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353245 [039] 5473778.770324: kvm_entry:            
vcpu 0, rip 0xffffffffbc7648b0
              CPU-353248 [027] 5473778.770324: kvm_page_fault:       
vcpu 3 rip 0xffffffffbc7648b0 address 0x0000000274300ff8 error_code 0x683
              CPU-353246 [041] 5473778.770324: kvm_entry:            
vcpu 1, rip 0xffffffffbc726140
              CPU-353245 [039] 5473778.770324: rcu_utilization:      
Start context switch
              CPU-353246 [041] 5473778.770325: rcu_utilization:      
Start context switch
              CPU-353247 [019] 5473778.770325: write_msr:            
1d9, value 4000
              CPU-353245 [039] 5473778.770325: rcu_utilization:      End 
context switch
              CPU-353246 [041] 5473778.770325: rcu_utilization:      End 
context switch
              CPU-353247 [019] 5473778.770325: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353248 [027] 5473778.770325: kvm_entry:            
vcpu 3, rip 0xffffffffbc7648b0
              CPU-353247 [019] 5473778.770325: kvm_page_fault:       
vcpu 2 rip 0xffffffffbc7648b0 address 0x000000026e580ff8 error_code 0x683
              CPU-353248 [027] 5473778.770325: rcu_utilization:      
Start context switch
              CPU-353248 [027] 5473778.770326: rcu_utilization:      End 
context switch
              CPU-353245 [039] 5473778.770326: write_msr:            
1d9, value 4000
              CPU-353246 [041] 5473778.770326: write_msr:            
1d9, value 4000
              CPU-353245 [039] 5473778.770326: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353247 [019] 5473778.770326: kvm_entry:            
vcpu 2, rip 0xffffffffbc7648b0
              CPU-353246 [041] 5473778.770326: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc726140 info 683 0
              CPU-353245 [039] 5473778.770326: kvm_page_fault:       
vcpu 0 rip 0xffffffffbc7648b0 address 0x0000000263670ff8 error_code 0x683
              CPU-353246 [041] 5473778.770326: kvm_page_fault:       
vcpu 1 rip 0xffffffffbc726140 address 0x00000001170aaff8 error_code 0x683
              CPU-353247 [019] 5473778.770326: rcu_utilization:      
Start context switch
              CPU-353247 [019] 5473778.770327: rcu_utilization:      End 
context switch
              CPU-353248 [027] 5473778.770327: write_msr:            
1d9, value 4000
              CPU-353246 [041] 5473778.770327: kvm_entry:            
vcpu 1, rip 0xffffffffbc726140
              CPU-353248 [027] 5473778.770327: kvm_exit:             
reason EPT_VIOLATION rip 0xffffffffbc7648b0 info 683 0
              CPU-353245 [039] 5473778.770327: kvm_entry:            
vcpu 0, rip 0xffffffffbc7648b0
              CPU-353246 [041] 5473778.770327: rcu_utilization:      
Start context switch
              CPU-353246 [041] 5473778.770328: rcu_utilization:      End 
context switch

> Another datapoint that might provide insight would be seeing if/how KVM's page
> faults stats change, e.g. look at /sys/kernel/debug/kvm/pf_* multiple times when
> the guest is stuck.

It looks like pf_taken is the only real one incrementing:

# cd /sys/kernel/debug/kvm/353228-12

# tail -n1 pf*; sleep 5; echo "======"; tail -n1 pf*
==> pf_emulate <==
12353

==> pf_fast <==
56

==> pf_fixed <==
27039604264

==> pf_guest <==
0

==> pf_mmio_spte_created <==
12353

==> pf_spurious <==
2348694

==> pf_taken <==
74486522452
======
==> pf_emulate <==
12353

==> pf_fast <==
56

==> pf_fixed <==
27039604264

==> pf_guest <==
0

==> pf_mmio_spte_created <==
12353

==> pf_spurious <==
2348694

==> pf_taken <==
74499574212
# tail -n1 *tlb*; sleep 5; echo "======"; tail -n1 *tlb*
==> remote_tlb_flush <==
1731549762

==> remote_tlb_flush_requests <==
3092024754

==> tlb_flush <==
8203297646
======
==> remote_tlb_flush <==
1731549762

==> remote_tlb_flush_requests <==
3092024754

==> tlb_flush <==
8203297649


> Are you able to run modified host kernels?  If so, the easiest next step, assuming
> stack traces don't provide a smoking gun, would be to add printks into the page
> fault path to see why KVM is retrying instead of installing a SPTE.
We can, but it can take quite some time from when we do the update to 
actually seeing results.  This problem is inconsistent at best, and even 
though we're seeing it a ton of times a day, it can show up anywhere.  
Even if we rolled it out today, we'd still be looking at weeks/months 
before we had any significant number of machines on it.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-05-30 17:35         ` Brian Rak
@ 2023-05-30 18:36           ` Sean Christopherson
  2023-05-31 17:40             ` Brian Rak
  2023-07-21 14:34             ` Amaan Cheval
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-05-30 18:36 UTC (permalink / raw)
  To: brak; +Cc: kvm

On Tue, May 30, 2023, Brian Rak wrote:
> 
> On 5/26/2023 5:02 PM, Sean Christopherson wrote:
> > On Fri, May 26, 2023, Brian Rak wrote:
> > > On 5/24/2023 9:39 AM, Brian Rak wrote:
> > > > On 5/23/2023 12:22 PM, Sean Christopherson wrote:
> > > > > The other thing that would be helpful would be getting kernel stack
> > > > > traces of the
> > > > > relevant tasks/threads.  The vCPU stack traces won't be interesting,
> > > > > but it'll
> > > > > likely help to see what the fallocate() tasks are doing.
> > > > I'll see what I can come up with here, I was running into some
> > > > difficulty getting useful stack traces out of the VM
> > > I didn't have any luck gathering guest-level stack traces - kaslr makes it
> > > pretty difficult even if I have the guest kernel symbols.
> > Sorry, I was hoping to get host stack traces, not guest stack traces.  I am hoping
> > to see what the fallocate() in the *host* is doing.
> 
> Ah - here's a different instance of it with a full backtrace from the host:

Gah, I wasn't specific enough again.  Though there's no longer an fallocate() for
any of the threads, so that's probably a moot point.  What I wanted to see is what
exactly the host kernel was doing, e.g. if something in the host memory management
was indirectly preventing vCPUs from making forward progress.  But that doesn't
seem to be the case here, and I would expect other problems if fallocate() was
stuck.  So ignore that request for now.

> > Another datapoint that might provide insight would be seeing if/how KVM's page
> > faults stats change, e.g. look at /sys/kernel/debug/kvm/pf_* multiple times when
> > the guest is stuck.
> 
> It looks like pf_taken is the only real one incrementing:

Drat.  That's what I expected, but it doesn't narrow down the search much.

> > Are you able to run modified host kernels?  If so, the easiest next step, assuming
> > stack traces don't provide a smoking gun, would be to add printks into the page
> > fault path to see why KVM is retrying instead of installing a SPTE.
> We can, but it can take quite some time from when we do the update to
> actually seeing results.  This problem is inconsistent at best, and even
> though we're seeing it a ton of times a day, it can show up anywhere.
> Even if we rolled it out today, we'd still be looking at weeks/months before
> we had any significant number of machines on it.

Would you be able to run a bpftrace program on a host with a stuck guest?  If so,
I believe I could craft a program for the kvm_exit tracepoint that would rule out
or confirm two of the three likely culprits.

Can you also dump the kvm.ko module params?  E.g. `tail /sys/module/kvm/parameters/*`

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-05-30 18:36           ` Sean Christopherson
@ 2023-05-31 17:40             ` Brian Rak
  2023-07-21 14:34             ` Amaan Cheval
  1 sibling, 0 replies; 48+ messages in thread
From: Brian Rak @ 2023-05-31 17:40 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm


On 5/30/2023 2:36 PM, Sean Christopherson wrote:
> On Tue, May 30, 2023, Brian Rak wrote:
>> On 5/26/2023 5:02 PM, Sean Christopherson wrote:
>>> On Fri, May 26, 2023, Brian Rak wrote:
>>>> On 5/24/2023 9:39 AM, Brian Rak wrote:
>>>>> On 5/23/2023 12:22 PM, Sean Christopherson wrote:
>>>>>> The other thing that would be helpful would be getting kernel stack
>>>>>> traces of the
>>>>>> relevant tasks/threads.  The vCPU stack traces won't be interesting,
>>>>>> but it'll
>>>>>> likely help to see what the fallocate() tasks are doing.
>>>>> I'll see what I can come up with here, I was running into some
>>>>> difficulty getting useful stack traces out of the VM
>>>> I didn't have any luck gathering guest-level stack traces - kaslr makes it
>>>> pretty difficult even if I have the guest kernel symbols.
>>> Sorry, I was hoping to get host stack traces, not guest stack traces.  I am hoping
>>> to see what the fallocate() in the *host* is doing.
>> Ah - here's a different instance of it with a full backtrace from the host:
> Gah, I wasn't specific enough again.  Though there's no longer an fallocate() for
> any of the threads, so that's probably a moot point.  What I wanted to see is what
> exactly the host kernel was doing, e.g. if something in the host memory management
> was indirectly preventing vCPUs from making forward progress.  But that doesn't
> seem to be the case here, and I would expect other problems if fallocate() was
> stuck.  So ignore that request for now.
>
>>> Another datapoint that might provide insight would be seeing if/how KVM's page
>>> faults stats change, e.g. look at /sys/kernel/debug/kvm/pf_* multiple times when
>>> the guest is stuck.
>> It looks like pf_taken is the only real one incrementing:
> Drat.  That's what I expected, but it doesn't narrow down the search much.
>
>>> Are you able to run modified host kernels?  If so, the easiest next step, assuming
>>> stack traces don't provide a smoking gun, would be to add printks into the page
>>> fault path to see why KVM is retrying instead of installing a SPTE.
>> We can, but it can take quite some time from when we do the update to
>> actually seeing results.  This problem is inconsistent at best, and even
>> though we're seeing it a ton of times a day, it can show up anywhere.
>> Even if we rolled it out today, we'd still be looking at weeks/months before
>> we had any significant number of machines on it.
> Would you be able to run a bpftrace program on a host with a stuck guest?  If so,
> I believe I could craft a program for the kvm_exit tracepoint that would rule out
> or confirm two of the three likely culprits.
>
> Can you also dump the kvm.ko module params?  E.g. `tail /sys/module/kvm/parameters/*`

Yes, we can run bpftrace programs.

# tail /sys/module/kvm/parameters/*
==> /sys/module/kvm/parameters/eager_page_split <==
Y

==> /sys/module/kvm/parameters/enable_pmu <==
Y

==> /sys/module/kvm/parameters/enable_vmware_backdoor <==
N

==> /sys/module/kvm/parameters/flush_on_reuse <==
N

==> /sys/module/kvm/parameters/force_emulation_prefix <==
0

==> /sys/module/kvm/parameters/halt_poll_ns <==
200000

==> /sys/module/kvm/parameters/halt_poll_ns_grow <==
2

==> /sys/module/kvm/parameters/halt_poll_ns_grow_start <==
10000

==> /sys/module/kvm/parameters/halt_poll_ns_shrink <==
0

==> /sys/module/kvm/parameters/ignore_msrs <==
N

==> /sys/module/kvm/parameters/kvmclock_periodic_sync <==
Y

==> /sys/module/kvm/parameters/lapic_timer_advance_ns <==
-1

==> /sys/module/kvm/parameters/min_timer_period_us <==
200

==> /sys/module/kvm/parameters/mitigate_smt_rsb <==
N

==> /sys/module/kvm/parameters/mmio_caching <==
Y

==> /sys/module/kvm/parameters/nx_huge_pages <==
Y

==> /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms <==
0

==> /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio <==
60

==> /sys/module/kvm/parameters/pi_inject_timer <==
0

==> /sys/module/kvm/parameters/report_ignored_msrs <==
Y

==> /sys/module/kvm/parameters/tdp_mmu <==
Y

==> /sys/module/kvm/parameters/tsc_tolerance_ppm <==
250

==> /sys/module/kvm/parameters/vector_hashing <==
Y



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-05-30 18:36           ` Sean Christopherson
  2023-05-31 17:40             ` Brian Rak
@ 2023-07-21 14:34             ` Amaan Cheval
  2023-07-21 17:37               ` Sean Christopherson
  1 sibling, 1 reply; 48+ messages in thread
From: Amaan Cheval @ 2023-07-21 14:34 UTC (permalink / raw)
  To: seanjc; +Cc: brak, kvm


Hey Sean,

I'm helping Brian look into this issue.

> Would you be able to run a bpftrace program on a host with a stuck guest?  If so,
> I believe I could craft a program for the kvm_exit tracepoint that would rule out
> or confirm two of the three likely culprits.

Could you share your thoughts on what the 2-3 likely culprits might be, and the
bpftrace program if possible?

I ran one just to dump all args on the kvm_exit tracepoint on an affected host;
here's a snippet:


```
# bpftrace -e 'tracepoint:kvm:kvm_exit { printf("%s: rip=%lx reason=%u isa=%u info1=%lx info2=%lx intr=%u error=%u vcpu=%u \n", comm, args->guest_rip, args->exit_reason, args->isa, args->info1, args->info2, args->intr_info, args->error_code, args->vcpu_id); }'

CPU 3/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=3
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa7b88eaa reason=12 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 3/KVM: rip=7ff4543b74e4 reason=1 isa=1 info1=0 info2=0 intr=2147483884 error=0 vcpu=3
CPU 0/KVM: rip=ffffffff94f3ff15 reason=1 isa=1 info1=0 info2=0 intr=2147483884 error=0 vcpu=0
CPU 0/KVM: rip=ffffffff94e683a8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffff94e683aa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffff95516005 reason=12 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 3/KVM: rip=7ff45260dd24 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=7ff45260dd24 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=7ff45260df88 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=7ff45260df88 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
```

I've also run a `function_graph` trace on some of the affected hosts, if you
think it might be helpful to have a look at that to see what the host kernel
might be doing while the guests are looping on EPT_VIOLATIONs. Nothing obvious
stands out to me right now.

We suspected KSM briefly, but ruled that out by turning KSM off and unmerging
KSM pages - after doing that, a guest VM still locked up / started looping
EPT_VIOLATIONS (like in Brian's original email), so it's unlikely this is KSM specific.

Another interesting observation we made was that when we migrate a guest to a
different host, the guest _stays_ locked up and throws EPT violations on the new
host as well - so it's unlikely the issue is in the guest kernel itself (since
we see it across guest operating systems), but perhaps the host kernel is
messing the state of the guest kernel up in a way that keeps it locked up after
migrating as well?

If you have any thoughts on anything else to try, let me know!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-07-21 14:34             ` Amaan Cheval
@ 2023-07-21 17:37               ` Sean Christopherson
  2023-07-24 12:08                 ` Amaan Cheval
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-07-21 17:37 UTC (permalink / raw)
  To: Amaan Cheval; +Cc: brak, kvm

On Fri, Jul 21, 2023, Amaan Cheval wrote:
> I've also run a `function_graph` trace on some of the affected hosts, if you
> think it might be helpful to have a look at that to see what the host kernel
> might be doing while the guests are looping on EPT_VIOLATIONs. Nothing obvious
> stands out to me right now.

It wouldn't hurt to see it.

> We suspected KSM briefly, but ruled that out by turning KSM off and unmerging
> KSM pages - after doing that, a guest VM still locked up / started looping
> EPT_VIOLATIONS (like in Brian's original email), so it's unlikely this is KSM specific.
> 
> Another interesting observation we made was that when we migrate a guest to a
> different host, the guest _stays_ locked up and throws EPT violations on the new
> host as well 

Ooh, that's *very* interesting.  That pretty much rules out memslot and mmu_notifier
issues.

>- so it's unlikely the issue is in the guest kernel itself (since
> we see it across guest operating systems), but perhaps the host kernel is
> messing the state of the guest kernel up in a way that keeps it locked up after
> migrating as well?
> 
> If you have any thoughts on anything else to try, let me know!

Good news and bad news.  Good news: I have a plausible theory as to what might be
going wrong.  Bad news: if my theory is correct, our princess is in another castle
(the bug isn't in KVM).

One of the scenario where KVM retries page faults is if KVM asynchronously faults-in
the host backing page.  If faulting in the page would require I/O, e.g. because
it's been swapped out, instead of synchronously doing the I/O on the vCPU task,
KVM uses a workqueue to fault in the page and immediately resumes the guest.

There are a variety of conditions that must be met to try an async page fault, but
assuming you aren't disable HLT VM-Exit, i.e. aren't letting the guest execute HLT,
it really just boils down to IRQs being enabled in the guest, which looking at the
traces is pretty much guaranteed to be true.

What's _supposed_ to happen is that async_pf_execute() successfully faults in the
page via get_user_pages_remote(), and then KVM installs a mapping for the guest
either in kvm_arch_async_page_ready() or by resuming the guest and cleanly handling
the retried guest page fault.

What I suspect is happening is that get_user_pages_remote() fails for some reason,
i.e. the workqueue doesn't fault in the page, and the vCPU gets stuck trying to
fault in a page that can't be faulted in for whatever reason.  AFAICT, nothing in
KVM will actually complain or even surface the problem in tracepoints (yeah, that's
not good).

Circling back to the bad news, if that's indeed what's happening, it likely means
there's a bug somewhere else in the stack.  E.g. it could be core mm/, might be
in the block layer, in swap, possibly in the exact filesystem you're using, etc.

Note, there's also a paravirt extension to async #PFs, where instead of putting
the vCPU into a synthetic halted state, KVM instead *may* inject a synthetic #PF
into the guest, e.g. so that the guest can go run a different task while the
faulting task is blocked.  But this really is just a note, guest enabling of PV
async #PF shouldn't actually matter, again assuming my theory is correct.

To mostly confirm this is likely what's happening, can you enable all of the async
#PF tracepoints in KVM?  The exact tracepoints might vary dependending on which kernel
version you're running, just enable everything with "async" in the name, e.g.

  # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
  kvm_async_pf_completed/
  kvm_async_pf_not_present/
  kvm_async_pf_ready/
  kvm_async_pf_repeated_fault/
  kvm_try_async_get_page/

If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
then this is likely what's happening.
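
For example, an untested sketch that counts the async tracepoint against the
overall guest page fault rate (using the tracepoint names listed above) would be
something like:

  # bpftrace -e '
      // async gets vs. total guest page faults, dumped every 5 seconds
      tracepoint:kvm:kvm_try_async_get_page { @async = count(); }
      tracepoint:kvm:kvm_page_fault { @pf = count(); }
      interval:s:5 { print(@async); print(@pf); clear(@async); clear(@pf); }'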

And then to really confirm, this small bpf program will yell if get_user_pages_remote()
fails when attempting to get a single page (which is always the case for KVM's async
#PF usage).

FWIW, get_user_pages_remote() isn't used all that much, e.g. when running a VM in
my setup, KVM is the only user.  So you can likely aggressively instrument
get_user_pages_remote() via bpf without major problems, or maybe even assume that
any call is from KVM.

$ tail gup_remote.bt 
kretfunc:get_user_pages_remote
{
        if ( args->nr_pages == 1 && retval != 1 ) {
                printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
        }
}


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-07-21 17:37               ` Sean Christopherson
@ 2023-07-24 12:08                 ` Amaan Cheval
  2023-07-25 17:30                   ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Amaan Cheval @ 2023-07-24 12:08 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: brak, kvm

> > I've also run a `function_graph` trace on some of the affected hosts, if you
> > think it might be helpful...
>
> It wouldn't hurt to see it.
>

Here you go:
https://transfer.sh/SfXSCHp5xI/ept-function-graph.log

> > Another interesting observation we made was that when we migrate a guest to a
> > different host, the guest _stays_ locked up and throws EPT violations on the new
> > host as well
>
> Ooh, that's *very* interesting.  That pretty much rules out memslot and mmu_notifier
> issues.

Good to know, thanks!

> What I suspect is happening is that get_user_pages_remote() fails for some reason,
> i.e. the workqueue doesn't fault in the page, and the vCPU gets stuck trying to
> fault in a page that can't be faulted in for whatever reason.  AFAICT, nothing in
> KVM will actually complain or even surface the problem in tracepoints (yeah, that's
> not good).

Thanks for the explanation, I did suspect something similar seeing how the page
faults / EPT_VIOLATIONs tend to loop on the same eip/rip/instruction and address
(not always, but quite often).

> To mostly confirm this is likely what's happening, can you enable all of the async
> #PF tracepoints in KVM?  The exact tracepoints might vary dependending on which kernel
> version you're running, just enable everything with "async" in the name, e.g.
>
>   # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
>   kvm_async_pf_completed/
>   kvm_async_pf_not_present/
>   kvm_async_pf_ready/
>   kvm_async_pf_repeated_fault/
>   kvm_try_async_get_page/
>
> If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
> then this is likely what's happening.

I did this and, unfortunately, don't see any of these functions being called at
all despite EPT_VIOLATIONs still being thrown and pf_taken still climbing. (Tried
both with `trace-cmd -e ...` and using `bpftrace`, and none of those functions
are being called during the deadlock/guest being stuck.)

> And then to really confirm, this small bpf program will yell if get_user_pages_remote()
> fails when attempting get a single page (which is always the case for KVM's async
> #PF usage).
>
> $ tail gup_remote.bt
> kretfunc:get_user_pages_remote
> {
>         if ( args->nr_pages == 1 && retval != 1 ) {
>                 printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
>         }
> }
>

Our hosts don't have kfunc/kretfunc support (`bpftrace --info` reports `kret: no`),
but I tried just a kprobe to verify that get_user_pages_remote is being called at
all - it does not seem like it is, unfortunately:

```
# bpftrace -e 'kprobe:get_user_pages_remote { @[comm] = count(); }'
Attaching 1 probe...
^C
#
```

So I guess that disproves the async #PF theory? Any other potential causes you
can think of, or anything we can try on faulting hosts that might help illuminate
the issue further?

Thanks for your time and help!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-07-24 12:08                 ` Amaan Cheval
@ 2023-07-25 17:30                   ` Sean Christopherson
  2023-08-02 14:21                     ` Amaan Cheval
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-07-25 17:30 UTC (permalink / raw)
  To: Amaan Cheval; +Cc: brak, kvm

On Mon, Jul 24, 2023, Amaan Cheval wrote:
> > > I've also run a `function_graph` trace on some of the affected hosts, if you
> > > think it might be helpful...
> >
> > It wouldn't hurt to see it.
> >
> 
> Here you go:
> https://transfer.sh/SfXSCHp5xI/ept-function-graph.log

Yeesh.  There is a ridiculous amount of potentially problematic activity.  KSM is
active in that trace, it looks like NUMA balancing might be in play, there might
be hugepage shattering, etc.

> > > Another interesting observation we made was that when we migrate a guest to a
> > > different host, the guest _stays_ locked up and throws EPT violations on the new
> > > host as well
> >
> > Ooh, that's *very* interesting.  That pretty much rules out memslot and mmu_notifier
> > issues.
> 
> Good to know, thanks!

Let me rephrase that statement: it rules out a certain class of memslot and
mmu_notifier bugs, namely bugs where KVM would incorrectly leave an invalidation
refcount (for lack of a better term) elevated.  It doesn't mean memslot changes
and/or mmu_notifier events aren't at fault.

Can you migrate a hung guest to a host that is completely unloaded?  And ideally,
disable KSM and NUMA autobalancing on the target host.  And then get a
function_graph trace on that host, assuming the vCPU remains stuck.  There is *so*
much going on in the above graph that it's impossible to determine if there's a
kernel bug, e.g. it's possible the vCPU is stuck purely because it's being thrashed
to the point where it can't make forward progress.

> > To mostly confirm this is likely what's happening, can you enable all of the async
> > #PF tracepoints in KVM?  The exact tracepoints might vary dependending on which kernel
> > version you're running, just enable everything with "async" in the name, e.g.
> >
> >   # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
> >   kvm_async_pf_completed/
> >   kvm_async_pf_not_present/
> >   kvm_async_pf_ready/
> >   kvm_async_pf_repeated_fault/
> >   kvm_try_async_get_page/
> >
> > If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
> > then this is likely what's happening.
> 
> I did this and unfortunately, don't see any of these functions being
> called at all despite
> EPT_VIOLATIONs still being thrown and pf_taken still climbing. (Tried both with
> `trace-cmd -e ...` and using `bpftrace` and none of those functions
> are being called
> during the deadlock/guest being stuck.)

Well fudge.

> > And then to really confirm, this small bpf program will yell if get_user_pages_remote()
> > fails when attempting get a single page (which is always the case for KVM's async
> > #PF usage).
> >
> > $ tail gup_remote.bt
> > kretfunc:get_user_pages_remote
> > {
> >         if ( args->nr_pages == 1 && retval != 1 ) {
> >                 printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
> >         }
> > }
> >
> 
> Our hosts don't have kfunc/kretfunc support (`bpftrace --info` reports
> `kret: no`),
> but I tried just a kprobe to verify that get_user_pages_remote is
> being called at all -
> does not seem like it is, unfortunately:
> 
> ```
> # bpftrace -e 'kprobe:get_user_pages_remote { @[comm] = count(); }'
> Attaching 1 probe...
> ^C
> #
> ```
> 
> So I guess that disproves the async #PF theory?

Yeah.  Definitely not related async page fault.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-07-25 17:30                   ` Sean Christopherson
@ 2023-08-02 14:21                     ` Amaan Cheval
  2023-08-02 15:34                       ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Amaan Cheval @ 2023-08-02 14:21 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: brak, kvm

> Yeesh.  There is a ridiculous amount of potentially problematic activity.  KSM is
> active in that trace, it looks like NUMA balancing might be in play,

Sorry about the delayed response - it seems like the majority of locked up guest
VMs stop throwing repeated EPT_VIOLATIONs as soon as we turn `numa_balancing`
off.
They still remain locked up, but that might be because the original cause of the
looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
any ways you can think of that that might happen)?

----

We experimented with numa_balancing + transparent hugepage settings in certain
data centers (to determine if the settings make the lockups disappear) and the
incidence rate of locked up guests has lowered significantly for the
numa_balancing=0 and thp=1 case, but numa_balancing=0 and thp=0 are still
locking up / looping on EPT_VIOLATIONs at about the same rate (or slightly
lower than both numa_balancing=thp=1).

Here's a function_graph trace of a host which had numa_balancing=0, thp=1, ksm=2
(KSM unloaded and unmerged after it was initially on):

https://transfer.sh/M4WdfxaTJs/ept-fn-graph.log

```
# bpftrace -e 'kprobe:handle_ept_violation { @ept[comm] = count(); }
tracepoint:kvm:kvm_page_fault { @pf[comm] = count(); }'
Attaching 2 probes...
^C

@ept[CPU 0/KVM]: 52
@ept[CPU 3/KVM]: 61
@ept[CPU 2/KVM]: 112
@ept[CPU 1/KVM]: 257

@pf[CPU 0/KVM]: 52
@pf[CPU 3/KVM]: 61
@pf[CPU 2/KVM]: 111
@pf[CPU 1/KVM]: 262
```

> there might be hugepage shattering, etc.

Is there a BPF program / another way we can confirm this is the case? I think
the fact that guests lock up at about the same rate with thp=0, numa_balancing=0
as with thp=1, numa_balancing=1 is interesting and relevant.

The thp=1, numa_balancing=0 configuration seems to have the fewest guests locking up.

> Let me rephrase that statement: it rules out a certain class of memslot and
> mmu_notifier bugs, namely bugs where KVM would incorrectly leave an invalidation
> refcount (for lack of a better term) elevated.  It doesn't mean memslot changes
> and/or mmu_notifier events aren't at fault.

I see, thanks!

> kernel bug, e.g. it's possible the vCPU is stuck purely because it's being thrashed
> to the point where it can't make forward progress.

Given that the guest stays locked-up post-migration on a completely unloaded
host, I think this is unlikely unless the thrashing also corrupts the guest's
state before the migration somehow?

> Yeah.  Definitely not related async page fault.

I guess the biggest lead currently is why `numa_balancing=1` increases the
odds of this issue occurring, and why it is specifically more likely with
transparent hugepages off (`thp=0`)?

To be clear, the lockups occur in all configurations we've tried so far, so none
of these are likely the direct cause, just relevant factors.

If there are any changes in the kernel that might help illuminate the issue
further, we can run a custom kernel and migrate a guest to the modified host -
let me know if there's anything that might help!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-02 14:21                     ` Amaan Cheval
@ 2023-08-02 15:34                       ` Sean Christopherson
  2023-08-02 16:45                         ` Amaan Cheval
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-02 15:34 UTC (permalink / raw)
  To: Amaan Cheval; +Cc: brak, kvm

On Wed, Aug 02, 2023, Amaan Cheval wrote:
> > Yeesh.  There is a ridiculous amount of potentially problematic activity.  KSM is
> > active in that trace, it looks like NUMA balancing might be in play,
> 
> Sorry about the delayed response - it seems like the majority of locked up guest
> VMs stop throwing repeated EPT_VIOLATIONs as soon as we turn `numa_balancing`
> off.

LOL, NUMA autobalancing.  I have a longstanding hatred of that feature.  I'm sure
there are setups where it adds value, but from my perspective it's nothing but
pain and misery.

> They still remain locked up, but that might be because the original cause of the
> looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
> any ways you can think of that that might happen)?

Define "remain locked up".  If the vCPUs are actively running in the guest and
making forward progress, i.e. not looping on VM-Exits on a single RIP, then they
aren't stuck from KVM's perspective.

But that doesn't mean the guest didn't take punitive action when a vCPU was
effectively stalled indefinitely by KVM, e.g. from the guest's perspective the
stuck vCPU will likely manifest as a soft lockup, and that could lead to a panic()
if the guest is a Linux kernel running with softlockup_panic=1.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-02 15:34                       ` Sean Christopherson
@ 2023-08-02 16:45                         ` Amaan Cheval
  2023-08-02 17:52                           ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Amaan Cheval @ 2023-08-02 16:45 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: brak, kvm

> LOL, NUMA autobalancing.  I have a longstanding hatred of that feature.  I'm sure
> there are setups where it adds value, but from my perspective it's nothing but
> pain and misery.

Do you think autobalancing is increasing the odds of some edge-case race
condition, perhaps? I find it really curious that numa_balancing definitely
affects this issue, but particularly when thp=0. Is it just that there are too
many EPT entries to install when transparent hugepages are disabled, increasing
the likelihood of a race condition / lock contention of some sort?

> > They still remain locked up, but that might be because the original cause of the
> > looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
> > any ways you can think of that that might happen)?
>
> Define "remain locked up".  If the vCPUs are actively running in the guest and
> making forward progress, i.e. not looping on VM-Exits on a single RIP, then they
> aren't stuck from KVM's perspective.

Right, the traces look like they're not stuck (i.e. no looping on the same
RIP). By "remain locked up" I mean that the VM is unresponsive on both the
console and services (such as ssh) used to connect to it.

> But that doesn't mean the guest didn't take punitive action when a vCPU was
> effectively stalled indefinitely by KVM, e.g. from the guest's perspective the
> stuck vCPU will likely manifest as a soft lockup, and that could lead to a panic()
> if the guest is a Linux kernel running with softlockup_panic=1.

So far we haven't had any guest kernels with softlockup_panic=1 have this issue,
so it's hard to confirm, but it makes sense that the guest took punitive action
in response to being stalled.

Any thoughts on how we might reproduce the issue or trace it down better?

Anything look suspect in the function_graph trace?
(Note that this was on a host that had numa_balancing=0,thp=1 from before
the guest booted, and it still ended up in the EPT_VIOLATION loop and
"locked up" (unresponsive on console).)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-02 16:45                         ` Amaan Cheval
@ 2023-08-02 17:52                           ` Sean Christopherson
  2023-08-08 15:34                             ` Amaan Cheval
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-02 17:52 UTC (permalink / raw)
  To: Amaan Cheval; +Cc: brak, kvm

On Wed, Aug 02, 2023, Amaan Cheval wrote:
> > LOL, NUMA autobalancing.  I have a longstanding hatred of that feature.  I'm sure
> > there are setups where it adds value, but from my perspective it's nothing but
> > pain and misery.
> 
> Do you think autobalancing is increasing the odds of some edge-case race
> condition, perhaps?
> I find it really curious that numa_balancing definitely affects this issue, but
> particularly when thp=0. Is it just too many EPT entries to install
> when transparent hugepages is disabled, increasing the likelihood of
> a race condition / lock contention of some sort?

NUMA balancing works by zapping PTEs[*] in userspace page tables for mappings to
remote memory, and then migrating the data to local memory on the resulting page
fault.  When that memory is being used to back a KVM guest, zapping the userspace
(primary) PTEs triggers an mmu_notifier event that in turn zaps KVM's PTEs, a.k.a.
SPTEs (which used to mean Shadow PTEs, but we're retroactively redefining SPTE to
also mean Secondary PTEs so that it's correct when shadow paging isn't being used).

If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
dislike NUMA balancing is that it's all too easy to end up with subtle bugs
and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
actually being able to move the page in the end, i.e. it's (IMO) too easy for
NUMA balancing to get false positives when determining whether or not to try and
migrate a page.

That said, it's definitely very unexpected that NUMA balancing would be zapping
SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
possible that that's what's happening, but quite unlikely, especially since it
sounds like you're seeing issues even with NUMA balancing disabled.

More likely is that there is a bug somewhere that results in the mmu_notifier
event refcount staying incorrectly elevated, but that type of bug shouldn't follow
the VM across a live migration...

[*] Not technically a full zap of the PTE, it's just marked PROT_NONE, i.e.
    !PRESENT, but on the KVM side of things it does manifest as a full zap of the
    SPTE.
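
For a rough picture of how much of that activity is hitting the guest's memory
while a vCPU is spinning, something like this could be run on the host (untested
sketch; it assumes migrate_misplaced_page() exists under that name on your kernel
and is kprobe-able).  Keying by comm means migrations triggered from the vCPU
threads ("CPU X/KVM") should stand out:

  # bpftrace -e '
      // NUMA balancing migration attempts vs. guest page faults, per thread
      kprobe:migrate_misplaced_page { @numa_migrate[comm] = count(); }
      tracepoint:kvm:kvm_page_fault { @pf[comm] = count(); }'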

> > > They still remain locked up, but that might be because the original cause of the
> > > looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
> > > any ways you can think of that that might happen)?
> >
> > Define "remain locked up".  If the vCPUs are actively running in the guest and
> > making forward progress, i.e. not looping on VM-Exits on a single RIP, then they
> > aren't stuck from KVM's perspective.
> 
> Right, the traces look like they're not stuck (i.e. no looping on the same
> RIP). By "remain locked up" I mean that the VM is unresponsive on both the
> console and services (such as ssh) used to connect to it.
> 
> > But that doesn't mean the guest didn't take punitive action when a vCPU was
> > effectively stalled indefinitely by KVM, e.g. from the guest's perspective the
> > stuck vCPU will likely manifest as a soft lockup, and that could lead to a panic()
> > if the guest is a Linux kernel running with softlockup_panic=1.
> 
> So far we haven't had any guest kernels with softlockup_panic=1 have this issue,
> so it's hard to confirm, but it makes sense that the guest took punitive action
> in response to being stalled.
> 
> Any thoughts on how we might reproduce the issue or trace it down better?

Before going further, can you confirm that this earlier statement is correct?

 : Another interesting observation we made was that when we migrate a guest to a
 : different host, the guest _stays_ locked up and throws EPT violations on the new
 : host as well

Specifically, after migration, is the vCPU still fully stuck on EPT violations,
i.e. not making forward progress from KVM's perspective?  Or is the guest "stuck"
after migration purely because the guest itself gave up?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-02 17:52                           ` Sean Christopherson
@ 2023-08-08 15:34                             ` Amaan Cheval
  2023-08-08 17:07                               ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Amaan Cheval @ 2023-08-08 15:34 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: brak, kvm

Hey Sean,

> If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> actually being able to move the page in the end, i.e. it's (IMO) too easy for
> NUMA balancing to get false positives when determining whether or not to try and
> migrate a page.

What are some situations where it might not be able to move the page in the end?

> That said, it's definitely very unexpected that NUMA balancing would be zapping
> SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> possible that that's what's happening, but quite unlikely, especially since it
> sounds like you're seeing issues even with NUMA balancing disabled.

Yep, we're definitely seeing the issue occur even with numa_balancing disabled,
but the likelihood of it occurring has significantly dropped since we've disabled
numa_balancing.

> More likely is that there is a bug somewhere that results in the mmu_notifier
> event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> the VM across a live migration...

Good news! We managed to live migrate a guest and that did "fix it".

The console was locked-up on the login screen before migration for about 6.5
hours, looping EPT_VIOLATIONs.
Post migration, we saw `rcu_sched detected stalls on CPUs/tasks` on the console,
and then the VM resumed normal operation. Here's a screenshot of the console (it
was "locked up"/frozen on the login screen until the migration):

https://i.imgur.com/n6CSsAv.png

> [*] Not technically a full zap of the PTE, it's just marked PROT_NONE, i.e.
>     !PRESENT, but on the KVM side of things it does manifest as a full zap of the
>     SPTE.

Thank you so much for that detailed explanation!

A colleague also modified a host kernel with KFI (Kernel Function
Instrumentation) and wrote a kernel module that intercepts the vmexit handler,
handle_ept_violation, and does an EPT walk for the faulting GPA, checking each
level's pfn against /proc/iomem.

Assuming the EPT walking code is correct, we see this surprising result of a
PDPTE's pfn=0:

```
[15295.792019] kvm-kfi: enter: handle_ept_violation
[15295.792021] kvm-kfi: ept walk: eptp=0x103aaa05e gpa=0x792d4ff8
[15295.792023]   PML4E : [0x103aaa05e] pfn=0x103aaa         : is within the range: 0x100000-0x3fffffff: System RAM
[15295.792026]   PDPTE : [0x0] pfn=0x0         : is within the range: 0x0-0xfff: Reserved
[15295.792029]   PDE   : [0xf000eef3f000e2c3] pfn=0xeef3f000e [large] : is within the range: 0x100000000-0x1075ffffff: System RAM
```

For comparison, the same module's output on a host without any "locked up"
guests:

```
[13956.578732] kvm-kfi: ept walk: eptp=0x1061b505e gpa=0xfcf28
[13956.578733]   PML4E : [0x1061b505e] pfn=0x1061b5         : is within the range: 0x100000-0x3fffffff: System RAM
[13956.578736]   PDPTE : [0x11f29a907] pfn=0x11f29a         : is within the range: 0x100000-0x3fffffff: System RAM
[13956.578739]   PDE   : [0x11c205907] pfn=0x11c205         : is within the range: 0x100000-0x3fffffff: System RAM
[13956.578741]   PTE   : [0x11c204907] pfn=0x11c204         : is within the range: 0x100000-0x3fffffff: System RAM
```

Does this seem to indicate an mmu_notifier refcount issue to you, given that
migration did fix it? Any way to verify?

We haven't found any guests with `softlockup_panic=1` yet, and since we can't
reproduce the issue on command ourselves yet, we might have to wait a bit - but
I imagine that the fact that live migration "fixed" the locked up guest confirms
that the other guests that didn't get "fixed" were likely softlocked from the
CPU stalling?

If you have any suggestions on how modifying the host kernel (and then migrating
a locked up guest to it) or eBPF programs that might help illuminate the issue
further, let me know!

Thanks for all your help so far!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-08 15:34                             ` Amaan Cheval
@ 2023-08-08 17:07                               ` Sean Christopherson
  2023-08-10  0:48                                 ` Eric Wheeler
  2023-08-15  0:30                                 ` Eric Wheeler
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-08 17:07 UTC (permalink / raw)
  To: Amaan Cheval; +Cc: brak, kvm

On Tue, Aug 08, 2023, Amaan Cheval wrote:
> Hey Sean,
> 
> > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > NUMA balancing to get false positives when determining whether or not to try and
> > migrate a page.
> 
> What are some situations where it might not be able to move the page in the end?

There's a pretty big list, see the "failure" paths of do_numa_page() and
migrate_misplaced_page().

> > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> > possible that that's what's happening, but quite unlikely, especially since it
> > sounds like you're seeing issues even with NUMA balancing disabled.
> 
> Yep, we're definitely seeing the issue occur even with numa_balancing
> disabled, but the likelihood of it occurring has significantly dropped since
> we've disabled numa_balancing.

So the fact that this occurs with NUMA balancing disabled means the problem likely
isn't with NUMA balancing itself.  NUMA balancing probably exposes some underlying
issue due to it generating a near-constant stream of mmu_notifier invalidation.

> > More likely is that there is a bug somewhere that results in the mmu_notifier
> > event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> > the VM across a live migration...
> 
> Good news! We managed to live migrate a guest and that did "fix it".

...

> A colleague also modified a host kernel with KFI (Kernel Function
> Instrumentation) and wrote a kernel module that intercepts the vmexit handler,
> handle_ept_violation, and does an EPT walk, comparing each pfn against
> /proc/iomem.
> 
> Assuming the EPT walking code is correct, we see this surprising result of a
> PDPTE's pfn=0:

Not surprising.  The entire EPTE is zero, i.e. has been zapped by KVM.  This is
exactly what is expected.

> Does this seem to indicate an mmu_notifier refcount issue to you, given that
> migration did fix it? Any way to verify?

It doesn't move the needle either way, it just confirms what we already know: the
vCPU is repeatedly taking !PRESENT faults.  The unexpected part is that KVM never
"fixes" the fault and never outright fails.

> We haven't found any guests with `softlockup_panic=1` yet, and since we can't
> reproduce the issue on command ourselves yet, we might have to wait a bit - but
> I imagine that the fact that live migration "fixed" the locked up guest confirms
> that the other guests that didn't get "fixed" were likely softlocked from the
> CPU stalling?

Yes.

> If you have any suggestions on how modifying the host kernel (and then migrating
> a locked up guest to it) or eBPF programs that might help illuminate the issue
> further, let me know!
> 
> Thanks for all your help so far!

Since it sounds like you can test with a custom kernel, try running with this
patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
below expands said tracepoint to capture information about mmu_notifiers and
memslots generation.  With luck, it will reveal a smoking gun.

---
 arch/x86/kvm/mmu/mmu.c          | 10 ----------
 arch/x86/kvm/mmu/mmu_internal.h |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
 arch/x86/kvm/trace.h            | 28 ++++++++++++++++++++++++++--
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9e4cd8b4a202..122bfc884293 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
 	return true;
 }
 
-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	if (sp->role.invalid)
-		return true;
-
-	/* TDP MMU pages do not use the MMU generation. */
-	return !is_tdp_mmu_page(sp) &&
-	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
 struct mmu_page_path {
 	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
 	unsigned int idx[PT64_ROOT_MAX_LEVEL];
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index f1ef670058e5..cf7ba0abaa8f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@
 #include <linux/kvm_host.h>
 #include <asm/kvm_host.h>
 
+#include "mmu.h"
+
 #ifdef CONFIG_KVM_PROVE_MMU
 #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
 #else
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..a0d7c8acf78f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
 #endif
 
+static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	if (sp->role.invalid)
+		return true;
+
+	/* TDP MMU pages do not use the MMU generation. */
+	return !is_tdp_mmu_page(sp) &&
+	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 83843379813e..ff4a384ab03a 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -8,6 +8,8 @@
 #include <asm/clocksource.h>
 #include <asm/pvclock-abi.h>
 
+#include "mmu/tdp_mmu.h"
+
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
 
@@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault,
 		__field(	unsigned long,	guest_rip	)
 		__field(	u64,		fault_address	)
 		__field(	u64,		error_code	)
+		__field(	unsigned long,  mmu_invalidate_seq)
+		__field(	long,  mmu_invalidate_in_progress)
+		__field(	unsigned long,  mmu_invalidate_range_start)
+		__field(	unsigned long,  mmu_invalidate_range_end)
+		__field(	bool,		root_is_valid)
+		__field(	bool,		root_has_sp)
+		__field(	bool,		root_is_obsolete)
 	),
 
 	TP_fast_assign(
@@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault,
 		__entry->guest_rip	= kvm_rip_read(vcpu);
 		__entry->fault_address	= fault_address;
 		__entry->error_code	= error_code;
+		__entry->mmu_invalidate_seq		= vcpu->kvm->mmu_invalidate_seq;
+		__entry->mmu_invalidate_in_progress	= vcpu->kvm->mmu_invalidate_in_progress;
+		__entry->mmu_invalidate_range_start	= vcpu->kvm->mmu_invalidate_range_start;
+		__entry->mmu_invalidate_range_end	= vcpu->kvm->mmu_invalidate_range_end;
+		__entry->root_is_valid			= VALID_PAGE(vcpu->arch.mmu->root.hpa);
+		__entry->root_has_sp			= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
+							  to_shadow_page(vcpu->arch.mmu->root.hpa);
+		__entry->root_is_obsolete		= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
+							  to_shadow_page(vcpu->arch.mmu->root.hpa) &&
+							  is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa));
 	),
 
-	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
+	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s",
 		  __entry->vcpu_id, __entry->guest_rip,
-		  __entry->fault_address, __entry->error_code)
+		  __entry->fault_address, __entry->error_code,
+		  __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress,
+		  __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end,
+		  !__entry->root_is_valid ? "invalid" :
+		  !__entry->root_has_sp ? "no shadow page" :
+		  __entry->root_is_obsolete ? "obsolete" : "fresh")
 );
 
 /*

base-commit: 240f736891887939571854bd6d734b6c9291f22e
-- 


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-08 17:07                               ` Sean Christopherson
@ 2023-08-10  0:48                                 ` Eric Wheeler
  2023-08-10  1:27                                   ` Eric Wheeler
  2023-08-15  0:30                                 ` Eric Wheeler
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-10  0:48 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Tue, 8 Aug 2023, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Amaan Cheval wrote:
> > Hey Sean,
> > 
> > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > > NUMA balancing to get false positives when determining whether or not to try and
> > > migrate a page.
> > 
> > What are some situations where it might not be able to move the page in the end?
> 
> There's a pretty big list, see the "failure" paths of do_numa_page() and
> migrate_misplaced_page().
> 
> > > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > > SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> > > possible that that's what's happening, but quite unlikely, especially since it
> > > sounds like you're seeing issues even with NUMA balancing disabled.
> > 
> > Yep, we're definitely seeing the issue occur even with numa_balancing
> > disabled, but the likelihood of it occurring has significantly dropped since
> > we've disabled numa_balancing.
> 
> So the fact that this occurs with NUMA balancing disabled means the problem likely
> isn't with NUMA balancing itself.  NUMA balancing probably exposes some underlying
> issue due to it generating a near-constant stream of mmu_notifier invalidation.
> 
> > > More likely is that there is a bug somewhere that results in the mmu_notifier
> > > event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> > > the VM across a live migration...
> > 
> > Good news! We managed to live migrate a guest and that did "fix it".

Does the VM make progress even if it is migrated to a kernel that presents
the bug?

What was kernel version being migrated from and to?

For example, was it from a >5.19 kernel to something earlier than 5.19?

For example, if the hung VM remains stuck after migrating to a >5.19 
kernel but _not_ to a <5.19 kernel, then maybe bisect is an option.


--
Eric Wheeler




> ...
> 
> > A colleague also modified a host kernel with KFI (Kernel Function
> > Instrumentation) and wrote a kernel module that intercepts the vmexit handler,
> > handle_ept_violation, and does an EPT walk, comparing each pfn against
> > /proc/iomem.
> > 
> > Assuming the EPT walking code is correct, we see this surprising result of a
> > PDPTE's pfn=0:
> 
> Not surprising.  The entire EPTE is zero, i.e. has been zapped by KVM.  This is
> exactly what is expected.
> 
> > Does this seem to indicate an mmu_notifier refcount issue to you, given that
> > migration did fix it? Any way to verify?
> 
> It doesn't move the needle either way, it just confirms what we already know: the
> vCPU is repeatedly taking !PRESENT faults.  The unexpected part is that KVM never
> "fixes" the fault and never outright fails.
> 
> > We haven't found any guests with `softlockup_panic=1` yet, and since we can't
> > reproduce the issue on command ourselves yet, we might have to wait a bit - but
> > I imagine that the fact that live migration "fixed" the locked up guest confirms
> > that the other guests that didn't get "fixed" were likely softlocked from the
> > CPU stalling?
> 
> Yes.
> 
> > If you have any suggestions on how modifying the host kernel (and then migrating
> > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > further, let me know!
> > 
> > Thanks for all your help so far!
> 
> Since it sounds like you can test with a custom kernel, try running with this
> patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> below expands said tracepoint to capture information about mmu_notifiers and
> memslots generation.  With luck, it will reveal a smoking gun.
> 
> ---
>  arch/x86/kvm/mmu/mmu.c          | 10 ----------
>  arch/x86/kvm/mmu/mmu_internal.h |  2 ++
>  arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
>  arch/x86/kvm/trace.h            | 28 ++++++++++++++++++++++++++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..122bfc884293 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
>  	return true;
>  }
>  
> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> -	if (sp->role.invalid)
> -		return true;
> -
> -	/* TDP MMU pages do not use the MMU generation. */
> -	return !is_tdp_mmu_page(sp) &&
> -	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> -}
> -
>  struct mmu_page_path {
>  	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
>  	unsigned int idx[PT64_ROOT_MAX_LEVEL];
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index f1ef670058e5..cf7ba0abaa8f 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
>  #include <linux/kvm_host.h>
>  #include <asm/kvm_host.h>
>  
> +#include "mmu.h"
> +
>  #ifdef CONFIG_KVM_PROVE_MMU
>  #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>  #else
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 0a63b1afabd3..a0d7c8acf78f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  #endif
>  
> +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	if (sp->role.invalid)
> +		return true;
> +
> +	/* TDP MMU pages do not use the MMU generation. */
> +	return !is_tdp_mmu_page(sp) &&
> +	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 83843379813e..ff4a384ab03a 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -8,6 +8,8 @@
>  #include <asm/clocksource.h>
>  #include <asm/pvclock-abi.h>
>  
> +#include "mmu/tdp_mmu.h"
> +
>  #undef TRACE_SYSTEM
>  #define TRACE_SYSTEM kvm
>  
> @@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault,
>  		__field(	unsigned long,	guest_rip	)
>  		__field(	u64,		fault_address	)
>  		__field(	u64,		error_code	)
> +		__field(	unsigned long,  mmu_invalidate_seq)
> +		__field(	long,  mmu_invalidate_in_progress)
> +		__field(	unsigned long,  mmu_invalidate_range_start)
> +		__field(	unsigned long,  mmu_invalidate_range_end)
> +		__field(	bool,		root_is_valid)
> +		__field(	bool,		root_has_sp)
> +		__field(	bool,		root_is_obsolete)
>  	),
>  
>  	TP_fast_assign(
> @@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault,
>  		__entry->guest_rip	= kvm_rip_read(vcpu);
>  		__entry->fault_address	= fault_address;
>  		__entry->error_code	= error_code;
> +		__entry->mmu_invalidate_seq		= vcpu->kvm->mmu_invalidate_seq;
> +		__entry->mmu_invalidate_in_progress	= vcpu->kvm->mmu_invalidate_in_progress;
> +		__entry->mmu_invalidate_range_start	= vcpu->kvm->mmu_invalidate_range_start;
> +		__entry->mmu_invalidate_range_end	= vcpu->kvm->mmu_invalidate_range_end;
> +		__entry->root_is_valid			= VALID_PAGE(vcpu->arch.mmu->root.hpa);
> +		__entry->root_has_sp			= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> +							  to_shadow_page(vcpu->arch.mmu->root.hpa);
> +		__entry->root_is_obsolete		= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> +							  to_shadow_page(vcpu->arch.mmu->root.hpa) &&
> +							  is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa));
>  	),
>  
> -	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
> +	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s",
>  		  __entry->vcpu_id, __entry->guest_rip,
> -		  __entry->fault_address, __entry->error_code)
> +		  __entry->fault_address, __entry->error_code,
> +		  __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress,
> +		  __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end,
> +		  !__entry->root_is_valid ? "invalid" :
> +		  !__entry->root_has_sp ? "no shadow page" :
> +		  __entry->root_is_obsolete ? "obsolete" : "fresh")
>  );
>  
>  /*
> 
> base-commit: 240f736891887939571854bd6d734b6c9291f22e
> -- 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-10  0:48                                 ` Eric Wheeler
@ 2023-08-10  1:27                                   ` Eric Wheeler
  2023-08-10 23:58                                     ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-10  1:27 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Wed, 9 Aug 2023, Eric Wheeler wrote:
> On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Amaan Cheval wrote:
> > > Hey Sean,
> > > 
> > > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > > > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > > > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > > > NUMA balancing to get false positives when determining whether or not to try and
> > > > migrate a page.
> > > 
> > > What are some situations where it might not be able to move the page in the end?
> > 
> > There's a pretty big list, see the "failure" paths of do_numa_page() and
> > migrate_misplaced_page().
> > 
> > > > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > > > SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> > > > possible that that's what's happening, but quite unlikely, especially since it
> > > > sounds like you're seeing issues even with NUMA balancing disabled.

Brak indicated that they've seen this as early as v5.19.  IIRC, Hunter
said that v5.15 is working fine, so I went through the >v5.15 and <v5.19
commit logs for KVM that appear to be related to EPT. Of course if the
problem is outside of KVM, then this is moot, but maybe these are worth
a second look.

Sean, could any of these commits cause or hint at the problem?


  54275f74c KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
	- this mentions !PRESENT related to faulting out of mmu_lock.

  ec283cb1d KVM: x86/mmu: remove ept_ad field
	- looks like a simple patch, but could there be a reason that
	  this is somehow invalid in corner cases?  Here is the relevant 
	  diff snippet:

	+++ b/arch/x86/kvm/mmu/mmu.c
	@@ -5007,7 +5007,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
	 
			context->shadow_root_level = level;
	 
	-               context->ept_ad = accessed_dirty;

	+++ b/arch/x86/kvm/mmu/paging_tmpl.h
	-       #define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad)
	+       #define PT_HAVE_ACCESSED_DIRTY(mmu) (!(mmu)->cpu_role.base.ad_disabled)

  ca2a7c22a KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits
	- "No functional change intended" but it mentions EPT
	  violations.  Could something unintentional have happened here?

  4f4aa80e3 KVM: X86: Handle implicit supervisor access with SMAP
	- This is a small change, but maybe it would be worth a quick review
	
  5b22bbe71 KVM: X86: Change the type of access u32 to u64
	- This is just a datatype change in 5.17-rc3, probably not it.

-Eric

> > > 
> > > Yep, we're definitely seeing the issue occur even with numa_balancing
> > > disabled, but the likelihood of it occurring has significantly dropped since
> > > we've disabled numa_balancing.
> > 
> > So the fact that this occurs with NUMA balancing disabled means the problem likely
> > isn't with NUMA balancing itself.  NUMA balancing probably exposes some underlying
> > issue due to it generating a near-constant stream of mmu_notifier invalidation.
> > 
> > > > More likely is that there is a bug somewhere that results in the mmu_notifier
> > > > event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> > > > the VM across a live migration...
> > > 
> > > Good news! We managed to live migrate a guest and that did "fix it".
> 
> Does the VM make progress even if it is migrated to a kernel that presents
> the bug?
> 
> What was kernel version being migrated from and to?
> 
> For example, was it from a >5.19 kernel to something earlier than 5.19?
> 
> For example, if the hung VM remains stuck after migrating to a >5.19 
> kernel but _not_ to a <5.19 kernel, then maybe bisect is an option.
> 
> 
> --
> Eric Wheeler
> 
> 
> 
> 
> > ...
> > 
> > > A colleague also modified a host kernel with KFI (Kernel Function
> > > Instrumentation) and wrote a kernel module that intercepts the vmexit handler,
> > > handle_ept_violation, and does an EPT walk, comparing each pfn against
> > > /proc/iomem.
> > > 
> > > Assuming the EPT walking code is correct, we see this surprising result of a
> > > PDPTE's pfn=0:
> > 
> > Not surprising.  The entire EPTE is zero, i.e. has been zapped by KVM.  This is
> > exactly what is expected.
> > 
> > > Does this seem to indicate an mmu_notifier refcount issue to you, given that
> > > migration did fix it? Any way to verify?
> > 
> > It doesn't move the needle either way, it just confirms what we already know: the
> > vCPU is repeatedly taking !PRESENT faults.  The unexpected part is that KVM never
> > "fixes" the fault and never outright fails.
> > 
> > > We haven't found any guests with `softlockup_panic=1` yet, and since we can't
> > > reproduce the issue on command ourselves yet, we might have to wait a bit - but
> > > I imagine that the fact that live migration "fixed" the locked up guest confirms
> > > that the other guests that didn't get "fixed" were likely softlocked from the
> > > CPU stalling?
> > 
> > Yes.
> > 
> > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > further, let me know!
> > > 
> > > Thanks for all your help so far!
> > 
> > Since it sounds like you can test with a custom kernel, try running with this
> > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > below expands said tracepoint to capture information about mmu_notifiers and
> > memslots generation.  With luck, it will reveal a smoking gun.
> > 
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 10 ----------
> >  arch/x86/kvm/mmu/mmu_internal.h |  2 ++
> >  arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
> >  arch/x86/kvm/trace.h            | 28 ++++++++++++++++++++++++++--
> >  4 files changed, 38 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9e4cd8b4a202..122bfc884293 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
> >  	return true;
> >  }
> >  
> > -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > -{
> > -	if (sp->role.invalid)
> > -		return true;
> > -
> > -	/* TDP MMU pages do not use the MMU generation. */
> > -	return !is_tdp_mmu_page(sp) &&
> > -	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > -}
> > -
> >  struct mmu_page_path {
> >  	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
> >  	unsigned int idx[PT64_ROOT_MAX_LEVEL];
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index f1ef670058e5..cf7ba0abaa8f 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -6,6 +6,8 @@
> >  #include <linux/kvm_host.h>
> >  #include <asm/kvm_host.h>
> >  
> > +#include "mmu.h"
> > +
> >  #ifdef CONFIG_KVM_PROVE_MMU
> >  #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> >  #else
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 0a63b1afabd3..a0d7c8acf78f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
> >  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
> >  #endif
> >  
> > +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +{
> > +	if (sp->role.invalid)
> > +		return true;
> > +
> > +	/* TDP MMU pages do not use the MMU generation. */
> > +	return !is_tdp_mmu_page(sp) &&
> > +	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > +}
> > +
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 83843379813e..ff4a384ab03a 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -8,6 +8,8 @@
> >  #include <asm/clocksource.h>
> >  #include <asm/pvclock-abi.h>
> >  
> > +#include "mmu/tdp_mmu.h"
> > +
> >  #undef TRACE_SYSTEM
> >  #define TRACE_SYSTEM kvm
> >  
> > @@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault,
> >  		__field(	unsigned long,	guest_rip	)
> >  		__field(	u64,		fault_address	)
> >  		__field(	u64,		error_code	)
> > +		__field(	unsigned long,  mmu_invalidate_seq)
> > +		__field(	long,  mmu_invalidate_in_progress)
> > +		__field(	unsigned long,  mmu_invalidate_range_start)
> > +		__field(	unsigned long,  mmu_invalidate_range_end)
> > +		__field(	bool,		root_is_valid)
> > +		__field(	bool,		root_has_sp)
> > +		__field(	bool,		root_is_obsolete)
> >  	),
> >  
> >  	TP_fast_assign(
> > @@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault,
> >  		__entry->guest_rip	= kvm_rip_read(vcpu);
> >  		__entry->fault_address	= fault_address;
> >  		__entry->error_code	= error_code;
> > +		__entry->mmu_invalidate_seq		= vcpu->kvm->mmu_invalidate_seq;
> > +		__entry->mmu_invalidate_in_progress	= vcpu->kvm->mmu_invalidate_in_progress;
> > +		__entry->mmu_invalidate_range_start	= vcpu->kvm->mmu_invalidate_range_start;
> > +		__entry->mmu_invalidate_range_end	= vcpu->kvm->mmu_invalidate_range_end;
> > +		__entry->root_is_valid			= VALID_PAGE(vcpu->arch.mmu->root.hpa);
> > +		__entry->root_has_sp			= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> > +							  to_shadow_page(vcpu->arch.mmu->root.hpa);
> > +		__entry->root_is_obsolete		= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> > +							  to_shadow_page(vcpu->arch.mmu->root.hpa) &&
> > +							  is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa));
> >  	),
> >  
> > -	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
> > +	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s",
> >  		  __entry->vcpu_id, __entry->guest_rip,
> > -		  __entry->fault_address, __entry->error_code)
> > +		  __entry->fault_address, __entry->error_code,
> > +		  __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress,
> > +		  __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end,
> > +		  !__entry->root_is_valid ? "invalid" :
> > +		  !__entry->root_has_sp ? "no shadow page" :
> > +		  __entry->root_is_obsolete ? "obsolete" : "fresh")
> >  );
> >  
> >  /*
> > 
> > base-commit: 240f736891887939571854bd6d734b6c9291f22e
> > -- 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-10  1:27                                   ` Eric Wheeler
@ 2023-08-10 23:58                                     ` Sean Christopherson
  2023-08-11 12:37                                       ` Amaan Cheval
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-10 23:58 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Wed, Aug 09, 2023, Eric Wheeler wrote:
> On Wed, 9 Aug 2023, Eric Wheeler wrote:
> > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Amaan Cheval wrote:
> > > > Hey Sean,
> > > > 
> > > > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > > > > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > > > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > > > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > > > > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > > > > NUMA balancing to get false positives when determining whether or not to try and
> > > > > migrate a page.
> > > > 
> > > > What are some situations where it might not be able to move the page in the end?
> > > 
> > > There's a pretty big list, see the "failure" paths of do_numa_page() and
> > > migrate_misplaced_page().
> > > 
> > > > > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > > > > SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> > > > > possible that that's what's happening, but quite unlikely, especially since it
> > > > > sounds like you're seeing issues even with NUMA balancing disabled.
> 
> Brak indicated that they've seen this as early as v5.19.  IIRC, Hunter
> said that v5.15 is working fine, so I went through the >v5.15 and <v5.19
> commit logs for KVM that appear to be related to EPT. Of course if the
> problem is outside of KVM, then this is moot, but maybe these are worth
> a second look.
> 
> Sean, could any of these commits cause or hint at the problem?

No, it's extremely unlikely any of these are related.  FWIW, my money is on this
being a bug in generic KVM code or even outside of KVM, not a bug in KVM x86's MMU.
But I'm not confident enough to bet real money ;-)

>   54275f74c KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
> 	- this mentions !PRESENT related to faulting out of mmu_lock.
> 
>   ec283cb1d KVM: x86/mmu: remove ept_ad field
> 	- looks like a simple patch, but could there be a reason that
> 	  this is somehow invalid in corner cases?  Here is the relevant 
> 	  diff snippet:
> 
> 	+++ b/arch/x86/kvm/mmu/mmu.c
> 	@@ -5007,7 +5007,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> 	 
> 			context->shadow_root_level = level;
> 	 
> 	-               context->ept_ad = accessed_dirty;
> 
> 	+++ b/arch/x86/kvm/mmu/paging_tmpl.h
> 	-       #define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad)
> 	+       #define PT_HAVE_ACCESSED_DIRTY(mmu) (!(mmu)->cpu_role.base.ad_disabled)
> 
>   ca2a7c22a KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits
> 	- "No functional change intended" but it mentions EPT
> 	  violations.  Could something unintentional have happened here?
> 
>   4f4aa80e3 KVM: X86: Handle implicit supervisor access with SMAP
> 	- This is a small change, but maybe it would be worth a quick review
> 	
>   5b22bbe71 KVM: X86: Change the type of access u32 to u64
> 	- This is just a datatype change in 5.17-rc3, probably not it.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-10 23:58                                     ` Sean Christopherson
@ 2023-08-11 12:37                                       ` Amaan Cheval
  2023-08-11 18:02                                         ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Amaan Cheval @ 2023-08-11 12:37 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Eric Wheeler, brak, kvm

> There's a pretty big list, see the "failure" paths of do_numa_page() and
> migrate_misplaced_page().

Gotcha, thank you!

...

> Since it sounds like you can test with a custom kernel, try running with this
> patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> stuck.  The below expands said tracepoint to capture information about
> mmu_notifiers and memslots generation.  With luck, it will reveal a smoking
> gun.

Thanks for the patch there. We tried migrating a locked up guest to a host with
this modified kernel twice (logs below). The guest "fixed itself" post
migration, so the results may not have captured the "problematic" kind of
page-fault, but here they are.

Complete logs of kvm_page_fault tracepoint events, starting just before the
migration (with 0 guests before the migration, so the first logs should be of
the problematic guest) as it resolves the lockup:

1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log

Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:

1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
2. https://transfer.sh/LBFJryOfu7/trace-kvm.log

Note that for migration #2 in both sets above (trace-kvm-pf.log and
trace-kvm.log), we mistakenly didn't confirm that the guest was locked up before
migration. It most likely was, but if trace #2 doesn't present the same
symptoms, that's why.

Off an uneducated glance, it seems like `in_prog = 0x1` at least once for every
`seq` / kvm_page_fault that seems to be "looping" and staying unresolved -
indicating a lock contention, perhaps, in trying to invalidate/read/write the
same page range?

Any leads on where in the source code I could look to understand how that might
happen?

----

@Eric

> Does the VM make progress even if it is migrated to a kernel that presents the
> bug?

We're unsure which kernel versions do present the bug, so it's hard to say.
We've definitely seen it occur on kernels 5.15.49 to 6.1.38, but beyond that, we
don't know for certain. (Potentially as early as 5.10.103, though!)

> What was kernel version being migrated from and to?

The live migration that resolved the issue was from 6.1.12 to 6.5.0-rc2.

The traces above are for this live migration (source 6.1.x to target host
6.5.0-rc2).

Another migration was from 6.1.x to 6.1.39 (not for these traces). All of these
times the guest resumed/made progress post-migration.

> For example, was it from a >5.19 kernel to something earlier than 5.19?

No, we haven't tried migrating to < 5.19 yet - we have very few hosts running
kernels that old.

> For example, if the hung VM remains stuck after migrating to a >5.19 kernel
> but _not_ to a <5.19 kernel, then maybe bisect is an option.

From what Sean and I discussed above, we suspect that the VM remaining stuck is
likely due to the guest kernel soft-locking after stalling on the original bug.

We do know this issue _occurs_ as late as 6.1.38 at least (i.e. hosts running
6.1.38 have had guests lockup - we don't have hosts on more recent kernels, so
this isn't proof that it's been fixed since then, nor is migration proof of
that, IMO).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-11 12:37                                       ` Amaan Cheval
@ 2023-08-11 18:02                                         ` Sean Christopherson
  2023-08-12  0:50                                           ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-11 18:02 UTC (permalink / raw)
  To: Amaan Cheval; +Cc: Eric Wheeler, brak, kvm

On Fri, Aug 11, 2023, Amaan Cheval wrote:
> > Since it sounds like you can test with a custom kernel, try running with this
> > patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> > stuck.  The below expands said tracepoint to capture information about
> > mmu_notifiers and memslots generation.  With luck, it will reveal a smoking
> > gun.
> 
> Thanks for the patch there. We tried migrating a locked up guest to a host with
> this modified kernel twice (logs below). The guest "fixed itself" post
> migration, so the results may not have captured the "problematic" kind of
> page-fault, but here they are.

The traces need to be captured from the host where a vCPU is stuck.

> Complete logs of kvm_page_fault tracepoint events, starting just before the
> migration (with 0 guests before the migration, so the first logs should be of
> the problematic guest) as it resolves the lockup:
> 
> 1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
> 2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log
> 
> Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:
> 
> 1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
> 2. https://transfer.sh/LBFJryOfu7/trace-kvm.log
> 
> Note that for migration #2 in both sets above (trace-kvm-pf.log and
> trace-kvm.log), we mistakenly didn't confirm that the guest was locked up before
> migration. It most likely was, but if trace #2 doesn't present the same
> symptoms, that's why.
> 
> Off an uneducated glance, it seems like `in_prog = 0x1` at least once for every
> `seq` / kvm_page_fault that seems to be "looping" and staying unresolved -

This is completely expected.   The "in_prog" thing is just saying that a vCPU
took a fault while there was an mmu_notifier event in-progress.

> indicating a lock contention, perhaps, in trying to invalidate/read/write the
> same page range?

No, just a collision between the primary MMU invalidating something, e.g. to move
a page or do KSM stuff, and a vCPU accessing the page in question.

> We do know this issue _occurs_ as late as 6.1.38 at least (i.e. hosts running
> 6.1.38 have had guests lockup - we don't have hosts on more recent kernels, so
> this isn't proof that it's been fixed since then, nor is migration proof of
> that, IMO).

Note, if my hunch is correct, it's the act of migrating to a different *host* that
resolves the problem, not the fact that the migration is to a different kernel.
E.g. I would expect that migrating to the exact same kernel would still unstick
the vCPU.

What I suspect is happening is that the in-progress count gets left high, e.g.
because of a start() without a paired end(), and that causes KVM to refuse to
install mappings for the affected range of guest memory.  Or possibly that the
problematic host is generating an absolutely massive storm of invalidations and
unintentionally DoS's the guest.

Either way, migrating the VM to a new host and thus a new KVM instance essentially
resets all of that metadata and allows KVM to fault-in pages and establish mappings.
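
For reference, the gate that prevents KVM from installing a mapping looks
roughly like the sketch below (modeled on mmu_invalidate_retry() in
include/linux/kvm_host.h, simplified and with a made-up name; the real code also
considers the range being invalidated).  If the in-progress count is left
non-zero, every fault bails out here and the vCPU spins on EPT violations:

/* Simplified sketch, not the actual helper; assumes KVM-internal context. */
static inline bool fault_must_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	/* An invalidation is in flight; any SPTE installed now could be stale. */
	if (unlikely(kvm->mmu_invalidate_in_progress))
		return true;

	/* An invalidation completed after the fault snapshotted mmu_seq. */
	if (kvm->mmu_invalidate_seq != mmu_seq)
		return true;

	return false;
}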

Actually, one thing you could try to unstick a VM would be to do an intra-host
migration, i.e. migrate it to a new KVM instance on the same host.  If that "fixes"
the guest, then the bug is likely an mmu_notifier counting bug and not an
invalidation storm.

But the easiest thing would be to catch a host in the act, i.e. capture traces
with my debug patch from a host with a stuck vCPU.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-11 18:02                                         ` Sean Christopherson
@ 2023-08-12  0:50                                           ` Eric Wheeler
  2023-08-14 17:29                                             ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-12  0:50 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Fri, 11 Aug 2023, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Amaan Cheval wrote:
> > > Since it sounds like you can test with a custom kernel, try running with this
> > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> > > stuck.  The below expands said tracepoint to capture information about
> > > mmu_notifiers and memslots generation.  With luck, it will reveal a smoking
> > > gun.
> > 
> > Thanks for the patch there. We tried migrating a locked up guest to a host with
> > this modified kernel twice (logs below). The guest "fixed itself" post
> > migration, so the results may not have captured the "problematic" kind of
> > page-fault, but here they are.
> 
> The traces need to be captured from the host where a vCPU is stuck.
> 
> > Complete logs of kvm_page_fault tracepoint events, starting just before the
> > migration (with 0 guests before the migration, so the first logs should be of
> > the problematic guest) as it resolves the lockup:
> > 
> > 1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
> > 2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log
> > 
> > Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:
> > 
> > 1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
> > 2. https://transfer.sh/LBFJryOfu7/trace-kvm.log
> > 
> > Note that for migration #2 in both sets above (trace-kvm-pf.log and
> > trace-kvm.log), we mistakenly didn't confirm that the guest was locked up before
> > migration. It most likely was, but if trace #2 doesn't present the same
> > symptoms, that's why.
> > 
> > Off an uneducated glance, it seems like `in_prog = 0x1` at least once for every
> > `seq` / kvm_page_fault that seems to be "looping" and staying unresolved -
> 
> This is completely expected.   The "in_prog" thing is just saying that a vCPU
> took a fault while there was an mmu_notifier event in-progress.
> 
> > indicating a lock contention, perhaps, in trying to invalidate/read/write the
> > same page range?
> 
> No, just a collision between the primary MMU invalidating something, e.g. to move
> a page or do KSM stuff, and a vCPU accessing the page in question.
> 
> > We do know this issue _occurs_ as late as 6.1.38 at least (i.e. hosts running
> > 6.1.38 have had guests lockup - we don't have hosts on more recent kernels, so
> > this isn't proof that it's been fixed since then, nor is migration proof of
> > that, IMO).
> 
> Note, if my hunch is correct, it's the act of migrating to a different *host* that
> resolves the problem, not the fact that the migration is to a different kernel.
> E.g. I would expect that migrating to the exact same kernel would still unstick
> the vCPU.
> 
> What I suspect is happening is that the in-progress count gets left high, e.g.
> because of a start() without a paired end(), and that causes KVM to refuse to
> install mappings for the affected range of guest memory.  Or possibly that the
> problematic host is generating an absolutely massive storm of invalidations and
> unintentionally DoS's the guest.


It would be great to write a micro benchmark of sorts that generates
EPT page invalidation pressure, and run it on a test system inside a
virtual machine to see if we can get it to fault:

Can you suggest the type(s) of memory operations that could be written in
user space (or kernel space as a module) to find a test case that forces
it to fail within a reasonable period of time?

We were thinking of memory-mapping lots of page-sized mappings from
/dev/zero, then randomly writing to and freeing them once there are tons of
them allocated, doing this across multiple threads, while simultaneously
using `taskset` (or `virsh vcpupin`) on the host to move the guest vCPUs
across NUMA boundaries, and also with NUMA balancing turned on.
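
Roughly along these lines, for example (a quick, untested userspace sketch of
the idea; the thread and mapping counts are arbitrary, and it uses anonymous
mappings, which should behave the same as mapping /dev/zero):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NMAPS		4096
#define NTHREADS	8

/* Map, dirty, and unmap page-sized regions at random to keep the guest's
 * page tables (and therefore the host's EPT/NPT) churning. */
static void *churn(void *arg)
{
	void *maps[NMAPS] = { 0 };
	long pgsz = sysconf(_SC_PAGESIZE);
	unsigned int seed = (unsigned long)&seed;

	for (;;) {
		int i = rand_r(&seed) % NMAPS;

		if (maps[i]) {
			munmap(maps[i], pgsz);
			maps[i] = NULL;
			continue;
		}
		maps[i] = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (maps[i] == MAP_FAILED) {
			maps[i] = NULL;
			continue;
		}
		memset(maps[i], 0x5a, pgsz);	/* force the page to be faulted in */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, churn, NULL);
	pause();	/* run until killed; pin/migrate vCPUs from the host meanwhile */
	return 0;
}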

I have also considered passing a device like null_blk.ko into the guest,
and then doing memory mappings against it in the same way to put pressure
on the direct IO path from KVM into the guest user space.

If you (or anyone else) have other suggestions then I would love to hear 
it. Maybe we can make a reproducer for this.


--
Eric Wheeler


> 
> Either way, migrating the VM to a new host and thus a new KVM instance essentially
> resets all of that metadata and allows KVM to fault-in pages and establish mappings.
> 
> Actually, one thing you could try to unstick a VM would be to do an intra-host
> migration, i.e. migrate it to a new KVM instance on the same host.  If that "fixes"
> the guest, then the bug is likely an mmu_notifier counting bug and not an
> invalidation storm.
> 
> But the easiest thing would be to catch a host in the act, i.e. capture traces
> with my debug patch from a host with a stuck vCPU.
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-12  0:50                                           ` Eric Wheeler
@ 2023-08-14 17:29                                             ` Sean Christopherson
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-14 17:29 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Fri, Aug 11, 2023, Eric Wheeler wrote:
> On Fri, 11 Aug 2023, Sean Christopherson wrote:
> > What I suspect is happening is that the in-progress count gets left high, e.g.
> > because of a start() without a paired end(), and that causes KVM to refuse to
> > install mappings for the affected range of guest memory.  Or possibly that the
> > problematic host is generating an absolutely massive storm of invalidations and
> > unintentionally DoS's the guest.
> 
> 
> It would be great to write a micro benchmark of sorts that generates
> EPT page invalidation pressure, and run it on a test system inside a
> virtual machine to see if we can get it to fault:
> 
> Can you suggest the type(s) of memory operations that could be written in
> user space (or kernel space as a module) to find a test case that forces
> it to fail within a reasonable period of time?

Easiest thing would be to toggle PROT_EXEC via mprotect() on guest memory.  KVM
ignores PROT_EXEC so that guest memory doesn't need to be mapped executable in
the VMM, i.e. toggling PROT_EXEC won't cause spurious failures but it will still
trigger mmu_notifier invalidations.
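
E.g. a sidecar thread in the VMM could do something like the sketch below, where
"hva" and "len" stand in for the host virtual address range backing a memslot
(obtaining those is left out, and there's no error handling):

#include <sys/mman.h>

/* Sketch only: repeatedly toggle PROT_EXEC on the host mapping that backs
 * guest memory.  Each mprotect() should generate an mmu_notifier invalidation
 * (and thus SPTE zaps) while the guest keeps running, since KVM doesn't
 * require the host mapping to be executable. */
static void hammer_invalidations(void *hva, size_t len)
{
	for (;;) {
		mprotect(hva, len, PROT_READ | PROT_WRITE);
		mprotect(hva, len, PROT_READ | PROT_WRITE | PROT_EXEC);
	}
}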

Side topic, can you provide your host Kconfig?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-08 17:07                               ` Sean Christopherson
  2023-08-10  0:48                                 ` Eric Wheeler
@ 2023-08-15  0:30                                 ` Eric Wheeler
  2023-08-15 16:10                                   ` Sean Christopherson
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-15  0:30 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > If you have any suggestions on how modifying the host kernel (and then migrating
> > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > further, let me know!
> > 
> > Thanks for all your help so far!
> 
> Since it sounds like you can test with a custom kernel, try running with this
> patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> below expands said tracepoint to capture information about mmu_notifiers and
> memslots generation.  With luck, it will reveal a smoking gun.

Getting this patch into production systems is challenging; perhaps live
patching is an option.


Questions:

1. Do you know if this would be safe to insert as a live kernel patch?
For example, does adding to TRACE_EVENT modify a struct (which is not
live-patch-safe) or is it something that should plug in with simple
function redirection?
	

2. Before we try it, do you know off the top of your head if the patch
below relies on any code that Linux v6.1 would not have?


--
Eric Wheeler



> 
> ---
>  arch/x86/kvm/mmu/mmu.c          | 10 ----------
>  arch/x86/kvm/mmu/mmu_internal.h |  2 ++
>  arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
>  arch/x86/kvm/trace.h            | 28 ++++++++++++++++++++++++++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..122bfc884293 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
>  	return true;
>  }
>  
> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> -	if (sp->role.invalid)
> -		return true;
> -
> -	/* TDP MMU pages do not use the MMU generation. */
> -	return !is_tdp_mmu_page(sp) &&
> -	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> -}
> -
>  struct mmu_page_path {
>  	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
>  	unsigned int idx[PT64_ROOT_MAX_LEVEL];
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index f1ef670058e5..cf7ba0abaa8f 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
>  #include <linux/kvm_host.h>
>  #include <asm/kvm_host.h>
>  
> +#include "mmu.h"
> +
>  #ifdef CONFIG_KVM_PROVE_MMU
>  #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>  #else
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 0a63b1afabd3..a0d7c8acf78f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  #endif
>  
> +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	if (sp->role.invalid)
> +		return true;
> +
> +	/* TDP MMU pages do not use the MMU generation. */
> +	return !is_tdp_mmu_page(sp) &&
> +	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 83843379813e..ff4a384ab03a 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -8,6 +8,8 @@
>  #include <asm/clocksource.h>
>  #include <asm/pvclock-abi.h>
>  
> +#include "mmu/tdp_mmu.h"
> +
>  #undef TRACE_SYSTEM
>  #define TRACE_SYSTEM kvm
>  
> @@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault,
>  		__field(	unsigned long,	guest_rip	)
>  		__field(	u64,		fault_address	)
>  		__field(	u64,		error_code	)
> +		__field(	unsigned long,  mmu_invalidate_seq)
> +		__field(	long,  mmu_invalidate_in_progress)
> +		__field(	unsigned long,  mmu_invalidate_range_start)
> +		__field(	unsigned long,  mmu_invalidate_range_end)
> +		__field(	bool,		root_is_valid)
> +		__field(	bool,		root_has_sp)
> +		__field(	bool,		root_is_obsolete)
>  	),
>  
>  	TP_fast_assign(
> @@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault,
>  		__entry->guest_rip	= kvm_rip_read(vcpu);
>  		__entry->fault_address	= fault_address;
>  		__entry->error_code	= error_code;
> +		__entry->mmu_invalidate_seq		= vcpu->kvm->mmu_invalidate_seq;
> +		__entry->mmu_invalidate_in_progress	= vcpu->kvm->mmu_invalidate_in_progress;
> +		__entry->mmu_invalidate_range_start	= vcpu->kvm->mmu_invalidate_range_start;
> +		__entry->mmu_invalidate_range_end	= vcpu->kvm->mmu_invalidate_range_end;
> +		__entry->root_is_valid			= VALID_PAGE(vcpu->arch.mmu->root.hpa);
> +		__entry->root_has_sp			= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> +							  to_shadow_page(vcpu->arch.mmu->root.hpa);
> +		__entry->root_is_obsolete		= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> +							  to_shadow_page(vcpu->arch.mmu->root.hpa) &&
> +							  is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa));
>  	),
>  
> -	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
> +	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s",
>  		  __entry->vcpu_id, __entry->guest_rip,
> -		  __entry->fault_address, __entry->error_code)
> +		  __entry->fault_address, __entry->error_code,
> +		  __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress,
> +		  __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end,
> +		  !__entry->root_is_valid ? "invalid" :
> +		  !__entry->root_has_sp ? "no shadow page" :
> +		  __entry->root_is_obsolete ? "obsolete" : "fresh")
>  );
>  
>  /*
> 
> base-commit: 240f736891887939571854bd6d734b6c9291f22e
> -- 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-15  0:30                                 ` Eric Wheeler
@ 2023-08-15 16:10                                   ` Sean Christopherson
  2023-08-16 23:54                                     ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-15 16:10 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Mon, Aug 14, 2023, Eric Wheeler wrote:
> On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > further, let me know!
> > > 
> > > Thanks for all your help so far!
> > 
> > Since it sounds like you can test with a custom kernel, try running with this
> > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > below expands said tracepoint to capture information about mmu_notifiers and
> > memslots generation.  With luck, it will reveal a smoking gun.
> 
> Getting this patch into production systems is challenging; perhaps live
> patching is an option.

Ah, I take it that when you gathered information after a live migration you were
migrating VMs into a sidecar environment.

> Questions:
> 
> 1. Do you know if this would be safe to insert as a live kernel patch?

Hmm, probably not safe.

> For example, does adding to TRACE_EVENT modify a struct (which is not
> live-patch-safe) or is it something that should plug in with simple
> function redirection?

Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.

Looking back, I think I misinterpreted an earlier response regarding bpftrace and
unnecessarily abandoned that tactic. *sigh*

If your environment provides btf info, then this bpftrace program should provide
the mmu_notifier half of the tracepoint hack-a-patch.  If this yields nothing
interesting then we can try diving into whether or not the mmu_root is stale, but
let's cross that bridge when we have to.

I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.

kprobe:handle_ept_violation
{
	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
}

If you don't have BTF info, we can still use a bpf program, but to get at the
fields of interest, I think we'd have to resort to pointer arithmetic with struct
offsets grabbed from your build.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-15 16:10                                   ` Sean Christopherson
@ 2023-08-16 23:54                                     ` Eric Wheeler
  2023-08-17 18:21                                       ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-16 23:54 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Tue, 15 Aug 2023, Sean Christopherson wrote:
> On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > > further, let me know!
> > > > 
> > > > Thanks for all your help so far!
> > > 
> > > Since it sounds like you can test with a custom kernel, try running with this
> > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > > below expands said tracepoint to capture information about mmu_notifiers and
> > > memslots generation.  With luck, it will reveal a smoking gun.
> > 
> > Getting this patch into production systems is challenging, perhaps live
> > patching is an option:
> 
> Ah, I take when you gathered information after a live migration you were migrating
> VMs into a sidecar environment.
> 
> > Questions:
> > 
> > 1. Do you know if this would be safe to insert as a live kernel patch?
> 
> Hmm, probably not safe.
> 
> > For example, does adding to TRACE_EVENT modify a struct (which is not
> > live-patch-safe) or is it something that should plug in with simple
> > function redirection?
> 
> Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.
> 
> Looking back, I think I misinterpreted an earlier response regarding bpftrace and
> unnecessarily abandoned that tactic. *sigh*
> 
> If your environment provides btf info, then this bpftrace program should provide
> the mmu_notifier half of the tracepoint hack-a-patch.  If this yields nothing
> interesting then we can try diving into whether or not the mmu_root is stale, but
> let's cross that bridge when we have to.
> 
> I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.
> 
> kprobe:handle_ept_violation
> {
> 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> }
> 
> If you don't have BTF info, we can still use a bpf program, but to get at the
> fields of interested, I think we'd have to resort to pointer arithmetic with struct
> offsets grab from your build.

We have BTF, so hurray for not needing struct offsets!

I am testing this on a host that is not (yet) known to be stuck. Please do 
a quick sanity check for me and make sure this looks like the kind of 
output that you want to see:

I had to shrink the printf line because it was longer than 64 bytes. I put 
the process ID as the first item and changed %lx to %08lx for visual 
alignment. Aside from that, it is the same as what you provided.

We're piping it through `uniq -c` to only see interesting changes (and 
show counts) because it is extremely noisy. If this looks good to you then 
please confirm and I will run it on a production system after a lock-up:

	kprobe:handle_ept_violation
	{
		printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
		       ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
			arg0, 
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
	}

Questions:
  - Should pid be zero?  (Note this is not yet running on a host with a 
    locked-up guest, in case that is the reason.)

  - Can you think of any reason that this would be unsafe? (Forgive my 
    paranoia, but of course this will be running on a production
    hypervisor.)

  - Can you think of any adjustments to the bpf script above before 
    running this for real?


Here is an example trace on a test host that isn't locked up:

 ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
   1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
 215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
  66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
     30 ept[0] vcpu=ffff96955de90000 seq=001fa362 inprog=0 start=7fa25ef0f000 end=7fa25ef10000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa44e inprog=0 start=7fa23f789000 end=7fa23f78a000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa59f inprog=0 start=7fa23dfe8000 end=7fa23dfe9000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa5a0 inprog=0 start=7fa23b723000 end=7fa23b724000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5a1 inprog=0 start=7fa238d50000 end=7fa238d51000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5a5 inprog=0 start=7fa24d920000 end=7fa24d921000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5a6 inprog=0 start=7fa238a73000 end=7fa238a74000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5ea inprog=0 start=7fa244791000 end=7fa244792000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5eb inprog=0 start=7fa24c988000 end=7fa24c989000
      3 ept[0] vcpu=ffff96955de92340 seq=001fa5ec inprog=0 start=7fa23f78b000 end=7fa23f78c000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa5ed inprog=0 start=7fa24256a000 end=7fa24256b000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa5ee inprog=0 start=7fa24ed2b000 end=7fa24ed2c000

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-16 23:54                                     ` Eric Wheeler
@ 2023-08-17 18:21                                       ` Sean Christopherson
  2023-08-18  0:55                                         ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-17 18:21 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Wed, Aug 16, 2023, Eric Wheeler wrote:
> On Tue, 15 Aug 2023, Sean Christopherson wrote:
> > On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > > > further, let me know!
> > > > > 
> > > > > Thanks for all your help so far!
> > > > 
> > > > Since it sounds like you can test with a custom kernel, try running with this
> > > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > > > below expands said tracepoint to capture information about mmu_notifiers and
> > > > memslots generation.  With luck, it will reveal a smoking gun.
> > > 
> > > Getting this patch into production systems is challenging, perhaps live
> > > patching is an option:
> > 
> > Ah, I take when you gathered information after a live migration you were migrating
> > VMs into a sidecar environment.
> > 
> > > Questions:
> > > 
> > > 1. Do you know if this would be safe to insert as a live kernel patch?
> > 
> > Hmm, probably not safe.
> > 
> > > For example, does adding to TRACE_EVENT modify a struct (which is not
> > > live-patch-safe) or is it something that should plug in with simple
> > > function redirection?
> > 
> > Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.
> > 
> > Looking back, I think I misinterpreted an earlier response regarding bpftrace and
> > unnecessarily abandoned that tactic. *sigh*
> > 
> > If your environment provides btf info, then this bpftrace program should provide
> > the mmu_notifier half of the tracepoint hack-a-patch.  If this yields nothing
> > interesting then we can try diving into whether or not the mmu_root is stale, but
> > let's cross that bridge when we have to.
> > 
> > I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.
> > 
> > kprobe:handle_ept_violation
> > {
> > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > }
> > 
> > If you don't have BTF info, we can still use a bpf program, but to get at the
> > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > offsets grab from your build.
> 
> We have BTF, so hurray for not needing struct offsets!
> 
> I am testing this on a host that is not (yet) known to be stuck. Please do 
> a quick sanity check for me and make sure this looks like the kind of 
> output that you want to see:
> 
> I had to shrink the printf line because it was longer than 64 bytes. I put 
> the process ID as the first item and changed %lx to %08lx for visual 
> alignment. Aside from that, it is the same as what you provided.
> 
> We're piping it through `uniq -c` to only see interesting changes (and 
> show counts) because it is extremely noisy. If this looks good to you then 
> please confirm and I will run it on a production system after a lock-up:
> 
> 	kprobe:handle_ept_violation
> 	{
> 		printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
> 		       ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> 			arg0, 
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> 	}
> 
> Questions:
>   - Should pid be zero?  (Note this is not yet running on a host with a 
>     locked-up guest, in case that is the reason.)

No.  I'm not at all familiar with PID management; I just copy+pasted from
pid_nr(), which is what KVM uses when displaying the pid in debugfs.  I printed
the PID purely to be able to unambiguously correlate prints to vCPUs without
needing to cross-reference kernel addresses.  I.e. having the PID makes life
easier, but it shouldn't be strictly necessary.
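
For reference, pid_nr() is just a NULL-tolerant read of the pid number; paraphrased
from include/linux/pid.h, it is roughly:

static inline pid_t pid_nr(struct pid *pid)
{
	pid_t nr = 0;

	if (pid)
		nr = pid->numbers[0].nr;
	return nr;
}

So a NULL vcpu->pid, or a probe read that fails in bpftrace, would read back as 0;
I haven't dug into which of those (if either) is happening in your setup.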

>   - Can you think of any reason that this would be unsafe? (Forgive my 
>     paranoia, but of course this will be running on a production
>     hypervisor.)

Printing the raw address of the vCPU structure will effectively neuter KASLR, but
KASLR isn't all that much of a barrier, and whoever has permission to load a BPF
program on the system can do far, far more damage.

>   - Can you think of any adjustments to the bpf script above before 
>     running this for real?

You could try to make it less noisy or more precise, e.g. by tailoring it to
print only information on the vCPU that is stuck.  If the noise isn't a problem
though, I would keep it as-is; the more information the better.

> Here is an example trace on a test host that isn't locked up:
> 
>  ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
>    1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>  215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>   66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> 18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000

Woah.  That's over 2 *billion* invalidations for a single VM.  Even if that's a
long-lived VM, that still seems rather insane.  E.g. if the uptime of that VM
*on that host* is 6 months, my back-of-the-napkin math says that's nearly
100 invalidations every second for 6 months straight.

Bit 31 being set in relative isolation almost makes me wonder if mmu_invalidate_seq
got corrupted somehow.  Either that or you are thrashing that VM with a vengeance.
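
For reference, the notifier bookkeeping in virt/kvm/kvm_main.c is roughly the
following (a paraphrased sketch, barriers and comments trimmed); seq ticks once per
*completed* invalidation, so a value around 0x8009c7eb really does mean on the order
of 2 billion completed invalidations, assuming nothing corrupted the field:

void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, unsigned long end)
{
	kvm->mmu_invalidate_in_progress++;
	if (kvm->mmu_invalidate_in_progress == 1) {
		kvm->mmu_invalidate_range_start = start;
		kvm->mmu_invalidate_range_end = end;
	} else {
		/* concurrent invalidations are collapsed into one min/max range */
		kvm->mmu_invalidate_range_start =
			min(kvm->mmu_invalidate_range_start, start);
		kvm->mmu_invalidate_range_end =
			max(kvm->mmu_invalidate_range_end, end);
	}
}

void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start, unsigned long end)
{
	kvm->mmu_invalidate_seq++;		/* one tick per completed invalidation */
	smp_wmb();				/* seq bump must be visible before... */
	kvm->mmu_invalidate_in_progress--;	/* ...the in-progress count drops */
}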

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-17 18:21                                       ` Sean Christopherson
@ 2023-08-18  0:55                                         ` Eric Wheeler
  2023-08-18 14:33                                           ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-18  0:55 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Thu, 17 Aug 2023, Sean Christopherson wrote:
> On Wed, Aug 16, 2023, Eric Wheeler wrote:
> > On Tue, 15 Aug 2023, Sean Christopherson wrote:
> > > On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > > > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > > > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > > > > further, let me know!
> > > > > > 
> > > > > > Thanks for all your help so far!
> > > > > 
> > > > > Since it sounds like you can test with a custom kernel, try running with this
> > > > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > > > > below expands said tracepoint to capture information about mmu_notifiers and
> > > > > memslots generation.  With luck, it will reveal a smoking gun.
> > > > 
> > > > Getting this patch into production systems is challenging, perhaps live
> > > > patching is an option:
> > > 
> > > Ah, I take when you gathered information after a live migration you were migrating
> > > VMs into a sidecar environment.
> > > 
> > > > Questions:
> > > > 
> > > > 1. Do you know if this would be safe to insert as a live kernel patch?
> > > 
> > > Hmm, probably not safe.
> > > 
> > > > For example, does adding to TRACE_EVENT modify a struct (which is not
> > > > live-patch-safe) or is it something that should plug in with simple
> > > > function redirection?
> > > 
> > > Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.
> > > 
> > > Looking back, I think I misinterpreted an earlier response regarding bpftrace and
> > > unnecessarily abandoned that tactic. *sigh*
> > > 
> > > If your environment provides btf info, then this bpftrace program should provide
> > > the mmu_notifier half of the tracepoint hack-a-patch.  If this yields nothing
> > > interesting then we can try diving into whether or not the mmu_root is stale, but
> > > let's cross that bridge when we have to.
> > > 
> > > I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.
> > > 
> > > kprobe:handle_ept_violation
> > > {
> > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > }
> > > 
> > > If you don't have BTF info, we can still use a bpf program, but to get at the
> > > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > > offsets grab from your build.
> > 
> > We have BTF, so hurray for not needing struct offsets!

Well, I was only partly right: not all hosts have BTF.

What is involved in doing this with struct offsets for Linux v6.1.x?

> > I am testing this on a host that is not (yet) known to be stuck. Please do 
> > a quick sanity check for me and make sure this looks like the kind of 
> > output that you want to see:
> > 
> > I had to shrink the printf line because it was longer than 64 bytes. I put 
> > the process ID as the first item and changed %lx to %08lx for visual 
> > alignment. Aside from that, it is the same as what you provided.
> > 
> > We're piping it through `uniq -c` to only see interesting changes (and 
> > show counts) because it is extremely noisy. If this looks good to you then 
> > please confirm and I will run it on a production system after a lock-up:
> > 
> > 	kprobe:handle_ept_violation
> > 	{
> > 		printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
> > 		       ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > 			arg0, 
> > 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > 	}
> > 
> > Questions:
> >   - Should pid be zero?  (Note this is not yet running on a host with a 
> >     locked-up guest, in case that is the reason.)
> 
> No.  I'm not at all familiar with PID management, I just copy+pasted from
> pid_nr(), which is what KVM uses when displaying the pid in debugfs.  I printed
> the PID purely to be able to unambiguously correlated prints to vCPUs without
> needing to cross reference kernel addresses.  I.e. having the PID makes life
> easier, but it shouldn't be strictly necessary.

ok

> >   - Can you think of any reason that this would be unsafe? (Forgive my 
> >     paranoia, but of course this will be running on a production
> >     hypervisor.)
> 
> Printing the raw address of the vCPU structure will effectively neuter KASLR, but
> KASLR isn't all that much of a barrier, and whoever has permission to load a BPF
> program on the system can do far, far more damage.

agreed
 
> >   - Can you think of any adjustments to the bpf script above before 
> >     running this for real?
> 
> You could try and make it less noisy or more precise, e.g. by tailoring it to
> print only information on the vCPU that is stuck.  If the noise isn't a problem
> though, I would keep it as-is, the more information the better.

Ok, will leave it as-is

> > Here is an example trace on a test host that isn't locked up:
> > 
> >  ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
> >    1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> >  215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> >   66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> > 18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> 
> Woah.  That's over 2 *billion* invalidations for a single VM.  Even if that's a
> long-lived VM, that's still seems rather insane.  E.g. if the uptime of that VM
> *on that host* is 6 months, my back of the napkin math says that that's nearly
> 100 invalidations every second for 6 months straight.
> 
> Bit 31 being set in relative isolation almost makes me wonder if mmu_invalidate_seq
> got corrupted somehow.  Either that or you are thrashing that VM with a vengeance.
 
Not sure what is happening on that host; it could be being thrashed by
another dev trying to reproduce the bug for a bisect, but we don't have a
reproducer yet...



--
Eric Wheeler



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-18  0:55                                         ` Eric Wheeler
@ 2023-08-18 14:33                                           ` Sean Christopherson
  2023-08-18 23:06                                             ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-18 14:33 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Thu, Aug 17, 2023, Eric Wheeler wrote:
> On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > > kprobe:handle_ept_violation
> > > > {
> > > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > > }
> > > > 
> > > > If you don't have BTF info, we can still use a bpf program, but to get at the
> > > > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > > > offsets grab from your build.
> > > 
> > > We have BTF, so hurray for not needing struct offsets!
> 
> Well, I was part right: not all hosts have BTF.
> 
> What is involved in doing this with struct offsets for Linux v6.1.x?

Unless you are up for a challenge, I'd drop the PID entirely; getting that will
be ugly.

For the KVM info, you need the offset of "kvm" within struct kvm_vcpu (more than
likely it's '0'), and then the offset of each of the mmu_invalidate_* fields within
struct kvm.  These need to come from the exact kernel you're running, though unless
a field is added/removed to/from struct kvm between kernel versions, the offsets
should be stable.

A cheesy/easy way to get the offsets is to feed offsetof() into __aligned and
then compile.  So long as the offset doesn't happen to be a power-of-2, the
compiler will yell.  E.g. with this

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 92c50dc159e8..04ec37f7374a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -543,7 +543,13 @@ struct kvm_hva_range {
  */
 static void kvm_null_fn(void)
 {
+       int v __aligned(offsetof(struct kvm_vcpu, kvm));
+       int w __aligned(offsetof(struct kvm, mmu_invalidate_seq));
+       int x __aligned(offsetof(struct kvm, mmu_invalidate_in_progress));
+       int y __aligned(offsetof(struct kvm, mmu_invalidate_range_start));
+       int z __aligned(offsetof(struct kvm, mmu_invalidate_range_end));
 
+       v = w = x = y = z = 0;
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)

I get yelled at with (trimmed):

arch/x86/kvm/../../../virt/kvm/kvm_main.c:546:34: error: requested alignment ‘0’ is not a positive power of 2 [-Werror=attributes]
arch/x86/kvm/../../../virt/kvm/kvm_main.c:547:20: error: requested alignment ‘36960’ is not a positive power of 2
arch/x86/kvm/../../../virt/kvm/kvm_main.c:549:20: error: requested alignment ‘36968’ is not a positive power of 2
arch/x86/kvm/../../../virt/kvm/kvm_main.c:551:20: error: requested alignment ‘36976’ is not a positive power of 2
arch/x86/kvm/../../../virt/kvm/kvm_main.c:553:20: error: requested alignment ‘36984’ is not a positive power of 2

Then take those offsets and do math.  For me, this provides the same output as
the above pretty version.  Just use common sense and verify you're getting sane
data.

kprobe:handle_ept_violation
{
	$kvm = *((uint64 *)((uint64)arg0 + 0));

	printf("vcpu = %lx MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
	       arg0,
               *((uint64 *)($kvm + 36960)),
               *((uint64 *)($kvm + 36968)),
               *((uint64 *)($kvm + 36976)),
               *((uint64 *)($kvm + 36984)));
}


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-18 14:33                                           ` Sean Christopherson
@ 2023-08-18 23:06                                             ` Eric Wheeler
  2023-08-21 20:27                                               ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-18 23:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

[-- Attachment #1: Type: text/plain, Size: 8385 bytes --]

On Fri, 18 Aug 2023, Sean Christopherson wrote:
> On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > > > kprobe:handle_ept_violation
> > > > > {
> > > > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > > > }
> > > > > 
> > > > > If you don't have BTF info, we can still use a bpf program, but to get at the
> > > > > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > > > > offsets grab from your build.
> > > > 
> > > > We have BTF, so hurray for not needing struct offsets!
> > 
> > Well, I was part right: not all hosts have BTF.

First things first, we got a trace from a machine _with_ BTF!

These the only items showing in-prog=1 (first column is count from uniq -c):

      1 ept[0] vcpu=ffff9c436d26c680 seq=32524620 inprog=1 start=7f32477d7000 end=7f32477d8000
      1 ept[0] vcpu=ffff9c436d26c680 seq=32524aee inprog=1 start=7f3252209000 end=7f325220a000
      1 ept[0] vcpu=ffff9c436d26c680 seq=32527895 inprog=1 start=7f329504d000 end=7f329504e000
      1 ept[0] vcpu=ffff9c436d26c680 seq=325279eb inprog=1 start=7f3296f00000 end=7f3296f01000
      1 ept[0] vcpu=ffff9c436d26c680 seq=325279f5 inprog=1 start=7f3296fae000 end=7f3296faf000
      1 ept[0] vcpu=ffff9c436d26c680 seq=32527b4d inprog=1 start=7f329937e000 end=7f329937f000
      1 ept[0] vcpu=ffff9c436d26c680 seq=32525ef6 inprog=1 start=7f3272503000 end=7f3272504000
      1 ept[0] vcpu=ffff9c436d26c680 seq=32526517 inprog=1 start=7f327a568000 end=7f327a569000
      1 ept[0] vcpu=ffff9c436d26c680 seq=325268e8 inprog=1 start=7f327e4a4000 end=7f327e4a5000
      1 ept[0] vcpu=ffff9c436d26c680 seq=32527543 inprog=1 start=7f328f8ca000 end=7f328f8cb000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c861ab6 inprog=1 start=7fb4c67de000 end=7fb4c67df000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c862600 inprog=1 start=7fb48c132000 end=7fb48c133000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c862a8b inprog=1 start=7fb4f06b8000 end=7fb4f06b9000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c862b9f inprog=1 start=7fb4f1861000 end=7fb4f1862000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c862d33 inprog=1 start=7fb4e72f5000 end=7fb4e72f6000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c86415c inprog=1 start=7fb49fb5a000 end=7fb49fb5b000
      1 ept[0] vcpu=ffff9c43d5618000 seq=1c864162 inprog=1 start=7fb49fb59000 end=7fb49fb5a000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862ba1 inprog=1 start=7fb4f0e24000 end=7fb4f0e25000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862bab inprog=1 start=7fb4f0e26000 end=7fb4f0e27000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862bb1 inprog=1 start=7fb4f0e27000 end=7fb4f0e28000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862cbd inprog=1 start=7fb4efffd000 end=7fb4efffe000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862cc4 inprog=1 start=7fb4f0692000 end=7fb4f0693000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862d32 inprog=1 start=7fb4dd282000 end=7fb4dd283000
      1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862d36 inprog=1 start=7fb4e8e97000 end=7fb4e8e98000
      1 ept[0] vcpu=ffff9c436d26c680 seq=3252adeb inprog=1 start=7f326209b000 end=7f326209c000

The entire dump is 22,687 lines if you want to see it, here (expires in 1 week):

	https://privatebin.net/?9a3bff6b6fd2566f#BHjrt4NGpoXL12NWiUDpThifi9E46LNXCy7eWzGXgqYx

> > 
> > What is involved in doing this with struct offsets for Linux v6.1.x?
> 
> Unless you are up for a challenge, I'd drop the PID entirely, getting that will
> be ugly.
> 
> For the KVM info, you need the offset of "kvm" within struct kvm_vcpu (more than
> likely it's '0'), and then the offset of each of the mmu_invaliate_* fields within
> struct kvm.  These need to come from the exact kernel you're running, though unless
> a field is added/removed to/from struct kvm between kernel versions, the offsets
> should be stable.
> 
> A cheesy/easy way to get the offsets is to feed offsetof() into __aligned and
> then compile.  So long as the offset doesn't happen to be a power-of-2, the
> compiler will yell.  E.g. with this
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 92c50dc159e8..04ec37f7374a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -543,7 +543,13 @@ struct kvm_hva_range {
>   */
>  static void kvm_null_fn(void)
>  {
> +       int v __aligned(offsetof(struct kvm_vcpu, kvm));
> +       int w __aligned(offsetof(struct kvm, mmu_invalidate_seq));
> +       int x __aligned(offsetof(struct kvm, mmu_invalidate_in_progress));
> +       int y __aligned(offsetof(struct kvm, mmu_invalidate_range_start));
> +       int z __aligned(offsetof(struct kvm, mmu_invalidate_range_end));
>  
> +       v = w = x = y = z = 0;
>  }
>  #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
> 
> I get yelled at with (trimmed):
> 
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:546:34: error: requested alignment ‘0’ is not a positive power of 2 [-Werror=attributes]
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:547:20: error: requested alignment ‘36960’ is not a positive power of 2
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:549:20: error: requested alignment ‘36968’ is not a positive power of 2
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:551:20: error: requested alignment ‘36976’ is not a positive power of 2
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:553:20: error: requested alignment ‘36984’ is not a positive power of 2

Neat trick.

So here are my numbers:

# make modules  KDIR=virt 2>&1 | grep -A1 alignment |grep -v ^-
arch/x86/kvm/../../../virt/kvm/kvm_main.c:568:40: error: requested alignment ‘0’ is not a positive power of 2 [-Werror=attributes]
  568 |        int v __aligned(offsetof(struct kvm_vcpu, kvm));
arch/x86/kvm/../../../virt/kvm/kvm_main.c:569:40: error: requested alignment ‘39552’ is not a positive power of 2
  569 |        int w __aligned(offsetof(struct kvm, mmu_invalidate_seq));
arch/x86/kvm/../../../virt/kvm/kvm_main.c:570:40: error: requested alignment ‘39560’ is not a positive power of 2
  570 |        int x __aligned(offsetof(struct kvm, mmu_invalidate_in_progress));
arch/x86/kvm/../../../virt/kvm/kvm_main.c:571:40: error: requested alignment ‘39568’ is not a positive power of 2
  571 |        int y __aligned(offsetof(struct kvm, mmu_invalidate_range_start));
arch/x86/kvm/../../../virt/kvm/kvm_main.c:572:40: error: requested alignment ‘39576’ is not a positive power of 2
  572 |        int z __aligned(offsetof(struct kvm, mmu_invalidate_range_end));

and the resulting script:
	kprobe:handle_ept_violation
	{
		$kvm = *((uint64 *)((uint64)arg0 + 0));

		printf("vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
			arg0, 
		       *((uint64 *)($kvm + 39552)),
		       *((uint64 *)($kvm + 39560)),
		       *((uint64 *)($kvm + 39568)),
		       *((uint64 *)($kvm + 39576))
		       );
	}

... but the output shows all 0's except vcpu:

	# bpftrace ./handle_ept_violation.bt |grep ^vcpu | uniq -c
	     11 vcpu=ffff9d518541c680 seq=00000000 inprog=0 start=00000000 end=00000000
	     29 vcpu=ffff9d80cc120000 seq=00000000 inprog=0 start=00000000 end=00000000
	    331 vcpu=ffff9d5f1d1a2340 seq=00000000 inprog=0 start=00000000 end=00000000
	    858 vcpu=ffff9d80c7b98000 seq=00000000 inprog=0 start=00000000 end=00000000
	   2183 vcpu=ffff9d6033fb2340 seq=00000000 inprog=0 start=00000000 end=00000000

Did I do something wrong here?

-Eric

> 
> Then take those offsets and do math.  For me, this provides the same output as
> the above pretty version.  Just use common sense and verify you're getting sane
> data.
> 
> kprobe:handle_ept_violation
> {
> 	$kvm = *((uint64 *)((uint64)arg0 + 0));
> 
> 	printf("vcpu = %lx MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 	       arg0,
>                *((uint64 *)($kvm + 36960)),
>                *((uint64 *)($kvm + 36968)),
>                *((uint64 *)($kvm + 36976)),
>                *((uint64 *)($kvm + 36984)));
> }
> 
> 






--
Eric Wheeler


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-18 23:06                                             ` Eric Wheeler
@ 2023-08-21 20:27                                               ` Eric Wheeler
  2023-08-21 23:51                                                 ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-21 20:27 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

[-- Attachment #1: Type: text/plain, Size: 13628 bytes --]

On Fri, 18 Aug 2023, Eric Wheeler wrote:
> On Fri, 18 Aug 2023, Sean Christopherson wrote:
> > On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > > > > kprobe:handle_ept_violation
> > > > > > {
> > > > > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > > > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > > > > }
> > > > > > 
> > > > > > If you don't have BTF info, we can still use a bpf program, but to get at the
> > > > > > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > > > > > offsets grab from your build.
> > > > > 
> > > > > We have BTF, so hurray for not needing struct offsets!

We found a new sample in 6.1.38, right after a lockup, where _all_ log
entries show inprog=1, in case that is interesting. Here is a sample;
there are 500,000+ entries, so let me know if you want the whole log.

To me, these are opaque numbers.  What do they represent?  What are you looking for in them?

      1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000
      2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854237 inprog=1 start=7fa3183b1000 end=7fa3183b2000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854239 inprog=1 start=7fa3183b3000 end=7fa3183b4000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085423a inprog=1 start=7fa3183b3000 end=7fa3183b4000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085423d inprog=1 start=7fa3183b7000 end=7fa3183b8000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085423e inprog=1 start=7fa3183b7000 end=7fa3183b8000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085423f inprog=1 start=7fa3183b8000 end=7fa3183b9000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854240 inprog=1 start=7fa3183b8000 end=7fa3183b9000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854241 inprog=1 start=7fa3183b9000 end=7fa3183ba000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854242 inprog=1 start=7fa3183b9000 end=7fa3183ba000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854243 inprog=1 start=7fa3183ba000 end=7fa3183bb000
      2 ept[0] vcpu=ffff9964cdc48000 seq=80854244 inprog=1 start=7fa3183ba000 end=7fa3183bb000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854246 inprog=1 start=7fa3183bb000 end=7fa3183bc000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854247 inprog=1 start=7fa3183bc000 end=7fa3183bd000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854248 inprog=1 start=7fa3183bc000 end=7fa3183bd000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854249 inprog=1 start=7fa3183bd000 end=7fa3183be000
      2 ept[0] vcpu=ffff9964cdc48000 seq=8085424b inprog=1 start=7fa3183be000 end=7fa3183bf000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085424c inprog=1 start=7fa3183be000 end=7fa3183bf000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085424e inprog=1 start=7fa3183bf000 end=7fa3183c0000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854250 inprog=1 start=7fa3183c0000 end=7fa3183c1000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854251 inprog=1 start=7fa3183c1000 end=7fa3183c2000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854255 inprog=1 start=7fa3183c5000 end=7fa3183c6000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854256 inprog=1 start=7fa3183c5000 end=7fa3183c6000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854257 inprog=1 start=7fa3183c6000 end=7fa3183c7000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854259 inprog=1 start=7fa3183c7000 end=7fa3183c8000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085425a inprog=1 start=7fa3183c7000 end=7fa3183c8000
      2 ept[0] vcpu=ffff9964cdc48000 seq=8085425b inprog=1 start=7fa3183c8000 end=7fa3183c9000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085425c inprog=1 start=7fa3183c8000 end=7fa3183c9000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085425e inprog=1 start=7fa3183c9000 end=7fa3183ca000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854260 inprog=1 start=7fa3183ca000 end=7fa3183cb000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854261 inprog=1 start=7fa3183cb000 end=7fa3183cc000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854262 inprog=1 start=7fa3183cb000 end=7fa3183cc000
      2 ept[0] vcpu=ffff9964cdc48000 seq=80854263 inprog=1 start=7fa3183ce000 end=7fa3183cf000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854264 inprog=1 start=7fa3183ce000 end=7fa3183cf000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854265 inprog=1 start=7fa3183cf000 end=7fa3183d0000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854266 inprog=1 start=7fa3183cf000 end=7fa3183d0000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854268 inprog=1 start=7fa3183d4000 end=7fa3183d5000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085426b inprog=1 start=7fa3183d6000 end=7fa3183d7000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085426c inprog=1 start=7fa3183d6000 end=7fa3183d7000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085426d inprog=1 start=7fa3183d7000 end=7fa3183d8000

Thanks,

-Eric


> 
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32524620 inprog=1 start=7f32477d7000 end=7f32477d8000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32524aee inprog=1 start=7f3252209000 end=7f325220a000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32527895 inprog=1 start=7f329504d000 end=7f329504e000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=325279eb inprog=1 start=7f3296f00000 end=7f3296f01000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=325279f5 inprog=1 start=7f3296fae000 end=7f3296faf000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32527b4d inprog=1 start=7f329937e000 end=7f329937f000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32525ef6 inprog=1 start=7f3272503000 end=7f3272504000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32526517 inprog=1 start=7f327a568000 end=7f327a569000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=325268e8 inprog=1 start=7f327e4a4000 end=7f327e4a5000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=32527543 inprog=1 start=7f328f8ca000 end=7f328f8cb000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c861ab6 inprog=1 start=7fb4c67de000 end=7fb4c67df000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c862600 inprog=1 start=7fb48c132000 end=7fb48c133000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c862a8b inprog=1 start=7fb4f06b8000 end=7fb4f06b9000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c862b9f inprog=1 start=7fb4f1861000 end=7fb4f1862000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c862d33 inprog=1 start=7fb4e72f5000 end=7fb4e72f6000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c86415c inprog=1 start=7fb49fb5a000 end=7fb49fb5b000
>       1 ept[0] vcpu=ffff9c43d5618000 seq=1c864162 inprog=1 start=7fb49fb59000 end=7fb49fb5a000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862ba1 inprog=1 start=7fb4f0e24000 end=7fb4f0e25000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862bab inprog=1 start=7fb4f0e26000 end=7fb4f0e27000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862bb1 inprog=1 start=7fb4f0e27000 end=7fb4f0e28000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862cbd inprog=1 start=7fb4efffd000 end=7fb4efffe000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862cc4 inprog=1 start=7fb4f0692000 end=7fb4f0693000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862d32 inprog=1 start=7fb4dd282000 end=7fb4dd283000
>       1 ept[0] vcpu=ffff9c533e1dc680 seq=1c862d36 inprog=1 start=7fb4e8e97000 end=7fb4e8e98000
>       1 ept[0] vcpu=ffff9c436d26c680 seq=3252adeb inprog=1 start=7f326209b000 end=7f326209c000
> 
> The entire dump is 22,687 lines if you want to see it, here (expires in 1 week):
> 
> 	https://privatebin.net/?9a3bff6b6fd2566f#BHjrt4NGpoXL12NWiUDpThifi9E46LNXCy7eWzGXgqYx
> 
> > > 
> > > What is involved in doing this with struct offsets for Linux v6.1.x?
> > 
> > Unless you are up for a challenge, I'd drop the PID entirely, getting that will
> > be ugly.
> > 
> > For the KVM info, you need the offset of "kvm" within struct kvm_vcpu (more than
> > likely it's '0'), and then the offset of each of the mmu_invaliate_* fields within
> > struct kvm.  These need to come from the exact kernel you're running, though unless
> > a field is added/removed to/from struct kvm between kernel versions, the offsets
> > should be stable.
> > 
> > A cheesy/easy way to get the offsets is to feed offsetof() into __aligned and
> > then compile.  So long as the offset doesn't happen to be a power-of-2, the
> > compiler will yell.  E.g. with this
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 92c50dc159e8..04ec37f7374a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -543,7 +543,13 @@ struct kvm_hva_range {
> >   */
> >  static void kvm_null_fn(void)
> >  {
> > +       int v __aligned(offsetof(struct kvm_vcpu, kvm));
> > +       int w __aligned(offsetof(struct kvm, mmu_invalidate_seq));
> > +       int x __aligned(offsetof(struct kvm, mmu_invalidate_in_progress));
> > +       int y __aligned(offsetof(struct kvm, mmu_invalidate_range_start));
> > +       int z __aligned(offsetof(struct kvm, mmu_invalidate_range_end));
> >  
> > +       v = w = x = y = z = 0;
> >  }
> >  #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
> > 
> > I get yelled at with (trimmed):
> > 
> > arch/x86/kvm/../../../virt/kvm/kvm_main.c:546:34: error: requested alignment ‘0’ is not a positive power of 2 [-Werror=attributes]
> > arch/x86/kvm/../../../virt/kvm/kvm_main.c:547:20: error: requested alignment ‘36960’ is not a positive power of 2
> > arch/x86/kvm/../../../virt/kvm/kvm_main.c:549:20: error: requested alignment ‘36968’ is not a positive power of 2
> > arch/x86/kvm/../../../virt/kvm/kvm_main.c:551:20: error: requested alignment ‘36976’ is not a positive power of 2
> > arch/x86/kvm/../../../virt/kvm/kvm_main.c:553:20: error: requested alignment ‘36984’ is not a positive power of 2
> 
> Neat trick.
> 
> So here are my numbers:
> 
> # make modules  KDIR=virt 2>&1 | grep -A1 alignment |grep -v ^-
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:568:40: error: requested alignment ‘0’ is not a positive power of 2 [-Werror=attributes]
>   568 |        int v __aligned(offsetof(struct kvm_vcpu, kvm));
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:569:40: error: requested alignment ‘39552’ is not a positive power of 2
>   569 |        int w __aligned(offsetof(struct kvm, mmu_invalidate_seq));
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:570:40: error: requested alignment ‘39560’ is not a positive power of 2
>   570 |        int x __aligned(offsetof(struct kvm, mmu_invalidate_in_progress));
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:571:40: error: requested alignment ‘39568’ is not a positive power of 2
>   571 |        int y __aligned(offsetof(struct kvm, mmu_invalidate_range_start));
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:572:40: error: requested alignment ‘39576’ is not a positive power of 2
>   572 |        int z __aligned(offsetof(struct kvm, mmu_invalidate_range_end));
> 
> and the resulting script:
> 	kprobe:handle_ept_violation
> 	{
> 		$kvm = *((uint64 *)((uint64)arg0 + 0));
> 
> 		printf("vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
> 			arg0, 
> 		       *((uint64 *)($kvm + 39552)),
> 		       *((uint64 *)($kvm + 39560)),
> 		       *((uint64 *)($kvm + 39568)),
> 		       *((uint64 *)($kvm + 39576))
> 		       );
> 	}
> 
> ... but the output shows all 0's except vcpu:
> 
> 	# bpftrace ./handle_ept_violation.bt |grep ^vcpu | uniq -c
> 	     11 vcpu=ffff9d518541c680 seq=00000000 inprog=0 start=00000000 end=00000000
> 	     29 vcpu=ffff9d80cc120000 seq=00000000 inprog=0 start=00000000 end=00000000
> 	    331 vcpu=ffff9d5f1d1a2340 seq=00000000 inprog=0 start=00000000 end=00000000
> 	    858 vcpu=ffff9d80c7b98000 seq=00000000 inprog=0 start=00000000 end=00000000
> 	   2183 vcpu=ffff9d6033fb2340 seq=00000000 inprog=0 start=00000000 end=00000000
> 
> Did I do something wrong here?
> 
> -Eric
> 
> > 
> > Then take those offsets and do math.  For me, this provides the same output as
> > the above pretty version.  Just use common sense and verify you're getting sane
> > data.
> > 
> > kprobe:handle_ept_violation
> > {
> > 	$kvm = *((uint64 *)((uint64)arg0 + 0));
> > 
> > 	printf("vcpu = %lx MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > 	       arg0,
> >                *((uint64 *)($kvm + 36960)),
> >                *((uint64 *)($kvm + 36968)),
> >                *((uint64 *)($kvm + 36976)),
> >                *((uint64 *)($kvm + 36984)));
> > }
> > 
> > 
> 
> 
> 
> 
> 
> 
> --
> Eric Wheeler
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-21 20:27                                               ` Eric Wheeler
@ 2023-08-21 23:51                                                 ` Sean Christopherson
  2023-08-22  0:11                                                   ` Sean Christopherson
  2023-08-22  1:10                                                   ` Eric Wheeler
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-21 23:51 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Mon, Aug 21, 2023, Eric Wheeler wrote:
> On Fri, 18 Aug 2023, Eric Wheeler wrote:
> > On Fri, 18 Aug 2023, Sean Christopherson wrote:
> > > On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > > > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > > > > > kprobe:handle_ept_violation
> > > > > > > {
> > > > > > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > > > > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > > > > > }
> > > > > > > 
> > > > > > > If you don't have BTF info, we can still use a bpf program, but to get at the
> > > > > > > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > > > > > > offsets grab from your build.
> > > > > > 
> > > > > > We have BTF, so hurray for not needing struct offsets!
> 
> We found a new sample in 6.1.38, right after a lockup, where _all_ log 
> entries show inprog=1, in case that is interesting. Here is a sample, 
> there are 500,000+ entries so let me know if you want the whole log.
> 
> To me, these are opaque numbers.  What do they represent?  What are you looking for in them?

inprog is '1' if there is an in-progress mmu_notifier invalidation at the time
of the EPT violation.  start/end are the range that is being invalidated _if_
there is an in-progress invalidation.  If a vCPU were stuck with inprog=1, then
the most likely scenario is that there's an unpaired invalidation, i.e. something
started an invalidation but never completed it.

seq is a sequence count that is incremented when an invalidation completes, e.g.
if a vCPU was stuck and seq were constantly changing, then it would mean that
the primary MMU is invalidating the same range over and over so quickly that the
vCPU can't make forward progress.
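
In code terms, the consumer side is roughly the following check, which the page
fault path runs under mmu_lock before installing a new SPTE (a paraphrased sketch
of the 6.1-era mmu_invalidate_retry_hva() helper, not a verbatim copy):

static int mmu_invalidate_retry_hva(struct kvm *kvm, unsigned long mmu_seq,
				    unsigned long hva)
{
	/*
	 * Bail and retry the fault if the hva being faulted in lies inside a
	 * range that is currently being invalidated...
	 */
	if (kvm->mmu_invalidate_in_progress &&
	    hva >= kvm->mmu_invalidate_range_start &&
	    hva < kvm->mmu_invalidate_range_end)
		return 1;

	/* ...or if an invalidation completed after the fault snapshotted seq. */
	if (kvm->mmu_invalidate_seq != mmu_seq)
		return 1;

	return 0;
}

A vCPU that retries forever is therefore either always landing inside an
in-progress range, always observing a changed seq, or bailing out earlier in the
fault path for some other reason.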

>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000

...

> > The entire dump is 22,687 lines if you want to see it, here (expires in 1 week):
> > 
> > 	https://privatebin.net/?9a3bff6b6fd2566f#BHjrt4NGpoXL12NWiUDpThifi9E46LNXCy7eWzGXgqYx

Hrm, so it doesn't look like there's an mmu_notifier bug.  None of the traces show
inprog being '1' for more than a single EPT violation, i.e. there's no incomplete
invalidation.  And the stuck vCPU sees a stable sequence count, so it doesn't appear
to be racing with an invalidation, e.g. in theory an invalidation could start and
finish between the initial VM-Exit and when KVM tries to "fix" the fault.

4983329 ept[0] vcpu=ffff9c533e1da340 seq=8002c058 inprog=0 start=7fb47c1f5000 end=7fb47c1f6000
7585048 ept[0] vcpu=ffff9c533e1da340 seq=8002c058 inprog=0 start=7fb47c1f5000 end=7fb47c1f6000
1234546 ept[0] vcpu=ffff9c533e1da340 seq=8002c058 inprog=0 start=7fb47c1f5000 end=7fb47c1f6000
 865885 ept[0] vcpu=ffff9c533e1da340 seq=8002c058 inprog=0 start=7fb47c1f5000 end=7fb47c1f6000
4548972 ept[0] vcpu=ffff9c533e1da340 seq=8002c058 inprog=0 start=7fb47c1f5000 end=7fb47c1f6000
1091282 ept[0] vcpu=ffff9c533e1da340 seq=8002c058 inprog=0 start=7fb47c1f5000 end=7fb47c1f6000

Below is another bpftrace program that will hopefully shrink the haystack to
the point where we can find something via code inspection.

Just a heads up: if we don't find a smoking gun, I likely won't be able to help
beyond high-level guidance, as we're approaching the limit of what I can do with
bpftrace without a significant time investment (which I can't make).

---
struct kvm_page_fault {
	const gpa_t addr;
	const u32 error_code;
	const bool prefetch;

	const bool exec;
	const bool write;
	const bool present;
	const bool rsvd;
	const bool user;

	const bool is_tdp;
	const bool nx_huge_page_workaround_enabled;

	bool huge_page_disallowed;
	u8 max_level;

	u8 req_level;

	u8 goal_level;

	gfn_t gfn;

	struct kvm_memory_slot *slot;

	kvm_pfn_t pfn;
	unsigned long hva;
	bool map_writable;
};

kprobe:handle_ept_violation
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;

	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
		printf("PID = %u stuck, taken = %lu, emulated = %lu, fixed = %lu, spurious = %lu\n",
		       pid, $vcpu->stat.pf_taken, $vcpu->stat.pf_emulate, $vcpu->stat.pf_fixed, $vcpu->stat.pf_spurious);
	}
}

kprobe:kvm_faultin_pfn
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;

	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
		$fault = (struct kvm_page_fault *)arg1;
		$flags = 0;

		if ($fault->slot != 0) {
			$flags = $fault->slot->flags;
		}

		printf("PID = %u stuck, reached kvm_faultin_pfn(), slot = %lx, flags = %lx\n",
		       pid, arg1, $flags);
	}
}

kprobe:make_mmio_spte
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;

	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
		printf("PID %u stuck, reached make_mmio_spte()", pid);
	}
}

kprobe:make_spte
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;

	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
		printf("PID %u stuck, reached make_spte()", pid);
	}
}


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-21 23:51                                                 ` Sean Christopherson
@ 2023-08-22  0:11                                                   ` Sean Christopherson
  2023-08-22  1:10                                                   ` Eric Wheeler
  1 sibling, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-22  0:11 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Mon, Aug 21, 2023, Sean Christopherson wrote:
> Below is another bpftrace program that will hopefully shrink the haystack to
> the point where we can find something via code inspection.

Forgot to say what it actually does: it's essentially printf debugging to see how
far a stuck vCPU gets when trying to handle an EPT violation.  The program should
be silent until a vCPU gets stuck (though I would still wait until there's a stuck
vCPU to load it).  When a vCPU's "faults taken":"faults handled" ratio gets over
5:1, i.e. the vCPU appears to be taking EPT violations without doing anything,
the program will start printing.  Unfortunately, what can be traced via kprobe is
a bit limited because much of the page fault handling path gets inlined.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-21 23:51                                                 ` Sean Christopherson
  2023-08-22  0:11                                                   ` Sean Christopherson
@ 2023-08-22  1:10                                                   ` Eric Wheeler
  2023-08-22 15:11                                                     ` Sean Christopherson
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-22  1:10 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Mon, 21 Aug 2023, Sean Christopherson wrote:
> On Mon, Aug 21, 2023, Eric Wheeler wrote:
> > On Fri, 18 Aug 2023, Eric Wheeler wrote:
> > > On Fri, 18 Aug 2023, Sean Christopherson wrote:
> > > > On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > > > > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > > > > > > kprobe:handle_ept_violation
> > > > > > > > {
> > > > > > > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > > > > > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > > > > > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > > > > > > }
> > > > > > > > 
> > > > > > > > If you don't have BTF info, we can still use a bpf program, but to get at the
> > > > > > > > fields of interested, I think we'd have to resort to pointer arithmetic with struct
> > > > > > > > offsets grab from your build.
> > > > > > > 
> > > > > > > We have BTF, so hurray for not needing struct offsets!
> > 
> > We found a new sample in 6.1.38, right after a lockup, where _all_ log 
> > entries show inprog=1, in case that is interesting. Here is a sample, 
> > there are 500,000+ entries so let me know if you want the whole log.
> > 
> > To me, these are opaque numbers.  What do they represent?  What are you looking for in them?
> 
> inprog is '1' if there is an in-progress mmu_notifier invalidation at the time
> of the EPT violation.  start/end are the range that is being invalidated _if_
> there is an in-progress invalidation.  If a vCPU were stuck with inprog=1, then
> the most likely scenario is that there's an unpaired invalidation, i.e. something
> started an invalidation but never completed it.
> 
> seq is a sequence count that is incremented when an invalidation completes, e.g.
> if a vCPU was stuck and seq were constantly changing, then it would mean that
> the primary MMU is invalidating the same range over and over so quickly that the
> vCPU can't make forward progress.

Here is another one; I think you described exactly this: the vcpu is 
always the same, and the sequence increments forever:

      1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000
      2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000
      1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000
      1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000

Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
was first stuck on one vCPU for most of the time, and toward the end it 
was stuck on a different vCPU:

The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
see in the file, they are not interleaved:

	https://www.linuxglobal.com/out/handle_ept_violation.log2

  # awk '{print $3}' handle_ept_violation.log2 |uniq -c
   555596 vcpu=ffff9964cdc48000
    31784 vcpu=ffff9934ed50c680

> Below is another bpftrace program that will hopefully shrink the 
> haystack to the point where we can find something via code inspection.

Ok thanks, we'll give it a try.

> Just a heads up, if we don't find a smoking gun, I likely won't be able 
> to help beyond high level guidance as we're approaching what I can do 
> with bpftrace without a significant time investment (which I can't 
> make).

I understand; I'm grateful for your help, whatever you can offer.

-Eric

> 
> ---
> struct kvm_page_fault {
> 	const gpa_t addr;
> 	const u32 error_code;
> 	const bool prefetch;
> 
> 	const bool exec;
> 	const bool write;
> 	const bool present;
> 	const bool rsvd;
> 	const bool user;
> 
> 	const bool is_tdp;
> 	const bool nx_huge_page_workaround_enabled;
> 
> 	bool huge_page_disallowed;
> 	u8 max_level;
> 
> 	u8 req_level;
> 
> 	u8 goal_level;
> 
> 	gfn_t gfn;
> 
> 	struct kvm_memory_slot *slot;
> 
> 	kvm_pfn_t pfn;
> 	unsigned long hva;
> 	bool map_writable;
> };
> 
> kprobe:handle_ept_violation
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;
> 
> 	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
> 		printf("PID = %u stuck, taken = %lu, emulated = %lu, fixed = %lu, spurious = %lu\n",
> 		       pid, $vcpu->stat.pf_taken, $vcpu->stat.pf_emulate, $vcpu->stat.pf_fixed, $vcpu->stat.pf_spurious);
> 	}
> }
> 
> kprobe:kvm_faultin_pfn
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;
> 
> 	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
> 		$fault = (struct kvm_page_fault *)arg1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("PID = %u stuck, reached kvm_faultin_pfn(), slot = %lx, flags = %lx\n",
> 		       pid, arg1, $flags);
> 	}
> }
> 
> kprobe:make_mmio_spte
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;
> 
> 	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
> 		printf("PID %u stuck, reached make_mmio_spte()", pid);
> 	}
> }
> 
> kprobe:make_spte
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious;
> 
> 	if (($vcpu->stat.pf_taken / 5) > $nr_handled) {
> 		printf("PID %u stuck, reached make_spte()", pid);
> 	}
> }
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-22  1:10                                                   ` Eric Wheeler
@ 2023-08-22 15:11                                                     ` Sean Christopherson
  2023-08-22 21:23                                                       ` Eric Wheeler
  2023-08-23  0:39                                                       ` Eric Wheeler
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-22 15:11 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Mon, Aug 21, 2023, Eric Wheeler wrote:
> On Mon, 21 Aug 2023, Sean Christopherson wrote:
> > On Mon, Aug 21, 2023, Eric Wheeler wrote:
> > > On Fri, 18 Aug 2023, Eric Wheeler wrote:
> > > > On Fri, 18 Aug 2023, Sean Christopherson wrote:
> > > > > On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > > > > > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > To me, these are opaque numbers.  What do they represent?  What are you looking for in them?
> > 
> > inprog is '1' if there is an in-progress mmu_notifier invalidation at the time
> > of the EPT violation.  start/end are the range that is being invalidated _if_
> > there is an in-progress invalidation.  If a vCPU were stuck with inprog=1, then
> > the most likely scenario is that there's an unpaired invalidation, i.e. something
> > started an invalidation but never completed it.
> > 
> > seq is a sequence count that is incremented when an invalidation completes, e.g.
> > if a vCPU was stuck and seq were constantly changing, then it would mean that
> > the primary MMU is invalidating the same range over and over so quickly that the
> > vCPU can't make forward progress.
> 
> Here is another one; I think you described exactly this: the vcpu is 
> always the same, and the sequence increments forever:
> 
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000
>       2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000
> 
> Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
> was first stuck on one vCPU for most of the time, and toward the end it 
> was stuck on a different vCPU:
> 
> The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
> then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
> see in the file, they are not interleaved:
> 
> 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> 
>   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
>    555596 vcpu=ffff9964cdc48000
>     31784 vcpu=ffff9934ed50c680

Hrm, but the address range being invalidated is changing.  Without seeing the
guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
that it looks stuck.

Drat.
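
If you want to see what's driving the invalidations while a vCPU looks stuck,
e.g. NUMA balancing vs. KSM vs. something else, one small side sketch, assuming
kvm_mmu_notifier_invalidate_range_start is kprobe-able on your build, is to
aggregate the kernel stacks behind them:

kprobe:kvm_mmu_notifier_invalidate_range_start
{
        // record who is invalidating guest memory, and how often
        @invalidators[kstack] = count();
}

The stacks should make it fairly obvious whether migration or KSM is the source.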

> > Below is another bpftrace program that will hopefully shrink the 
> > haystack to the point where we can find something via code inspection.
> 
> Ok thanks, we'll give it a try.

Try this version instead.  It's more comprehensive and more precise, e.g. should
only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.

Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.

struct kvm_page_fault {
	const u64 addr;
	const u32 error_code;
	const bool prefetch;

	const bool exec;
	const bool write;
	const bool present;
	const bool rsvd;
	const bool user;

	const bool is_tdp;
	const bool nx_huge_page_workaround_enabled;

	bool huge_page_disallowed;
	u8 max_level;

	u8 req_level;

	u8 goal_level;

	u64 gfn;

	struct kvm_memory_slot *slot;

	u64 pfn;
	unsigned long hva;
	bool map_writable;
};

kprobe:kvm_faultin_pfn
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$kvm = $vcpu->kvm;
	$rip = $vcpu->arch.regs[16];

	if (@last_rip[tid] == $rip) {
		@same[tid]++
	} else {
		@same[tid] = 0;
	}
	@last_rip[tid] = $rip;

	if (@same[tid] > 1000) {
		$fault = (struct kvm_page_fault *)arg1;
		$hva = -1;
		$flags = 0;

		if ($fault->slot != 0) {
			$hva = $fault->slot->userspace_addr +
			       (($fault->gfn - $fault->slot->base_gfn) << 12);
			$flags = $fault->slot->flags;
		}

		printf("%s tid[%u] pid[%u] stuck @ rip %lx (%lu hits), gpa = %lx, hva = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva,
		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
	}
}

kprobe:make_mmio_spte
{
        if (@same[tid] > 1000) {
	        $vcpu = (struct kvm_vcpu *)arg0;
		$rip = $vcpu->arch.regs[16];

		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_mmio_spte()\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
	}
}

kprobe:make_spte
{
        if (@same[tid] > 1000) {
	        $vcpu = (struct kvm_vcpu *)arg0;
		$rip = $vcpu->arch.regs[16];

		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_spte()\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
	}
}
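
Optional: for periodic visibility into the counters without waiting for the
1000-hit threshold, you could bolt on an interval probe, e.g.:

interval:s:30
{
        // dump the per-thread "same RIP" counters every 30 seconds
        print(@same);
}

Not required, it just makes it easier to watch the counters climb in real time.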

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-22 15:11                                                     ` Sean Christopherson
@ 2023-08-22 21:23                                                       ` Eric Wheeler
  2023-08-22 21:32                                                         ` Sean Christopherson
  2023-08-23  0:39                                                       ` Eric Wheeler
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-22 21:23 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Tue, 22 Aug 2023, Sean Christopherson wrote:
> On Mon, Aug 21, 2023, Eric Wheeler wrote:
> > On Mon, 21 Aug 2023, Sean Christopherson wrote:
> > > On Mon, Aug 21, 2023, Eric Wheeler wrote:
> > > > On Fri, 18 Aug 2023, Eric Wheeler wrote:
> > > > > On Fri, 18 Aug 2023, Sean Christopherson wrote:
> > > > > > On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > > > > > > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > > To me, these are opaque numbers.  What do they represent?  What are you looking for in them?
> > > 
> > > inprog is '1' if there is an in-progress mmu_notifier invalidation at the time
> > > of the EPT violation.  start/end are the range that is being invalidated _if_
> > > there is an in-progress invalidation.  If a vCPU were stuck with inprog=1, then
> > > the most likely scenario is that there's an unpaired invalidation, i.e. something
> > > started an invalidation but never completed it.
> > > 
> > > seq is a sequence count that is incremented when an invalidation completes, e.g.
> > > if a vCPU was stuck and seq were constantly changing, then it would mean that
> > > the primary MMU is invalidating the same range over and over so quickly that the
> > > vCPU can't make forward progress.
> > 
> > Here is another one; I think you described exactly this: the vcpu is 
> > always the same, and the sequence increments forever:
> > 
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000
> >       2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000
> >       1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000
> > 
> > Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
> > was first stuck on one vCPU for most of the time, and toward the end it 
> > was stuck on a different vCPU:
> > 
> > The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
> > then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
> > see in the file, they are not interleaved:
> > 
> > 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> > 
> >   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
> >    555596 vcpu=ffff9964cdc48000
> >     31784 vcpu=ffff9934ed50c680
> 
> Hrm, but the address range being invalidated is changing.  Without seeing the
> guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
> truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
> that it looks stuck.
> 
> Drat.
> 
> > > Below is another bpftrace program that will hopefully shrink the 
> > > haystack to the point where we can find something via code inspection.
> > 
> > Ok thanks, we'll give it a try.
> 
> Try this version instead.  It's more comprehensive and more precise, e.g. should
> only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.
> 
> Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.

Thanks, we'll try it out.

To confirm, when you say "Enable trace_kvm_exit", is it this:

	echo 1 > /sys/kernel/tracing/events/kvm/kvm_exit/enable

or this (which might be the same):

	echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable

or something else?

--
Eric Wheeler



> 
> struct kvm_page_fault {
> 	const u64 addr;
> 	const u32 error_code;
> 	const bool prefetch;
> 
> 	const bool exec;
> 	const bool write;
> 	const bool present;
> 	const bool rsvd;
> 	const bool user;
> 
> 	const bool is_tdp;
> 	const bool nx_huge_page_workaround_enabled;
> 
> 	bool huge_page_disallowed;
> 	u8 max_level;
> 
> 	u8 req_level;
> 
> 	u8 goal_level;
> 
> 	u64 gfn;
> 
> 	struct kvm_memory_slot *slot;
> 
> 	u64 pfn;
> 	unsigned long hva;
> 	bool map_writable;
> };
> 
> kprobe:kvm_faultin_pfn
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$kvm = $vcpu->kvm;
> 	$rip = $vcpu->arch.regs[16];
> 
> 	if (@last_rip[tid] == $rip) {
> 		@same[tid]++
> 	} else {
> 		@same[tid] = 0;
> 	}
> 	@last_rip[tid] = $rip;
> 
> 	if (@same[tid] > 1000) {
> 		$fault = (struct kvm_page_fault *)arg1;
> 		$hva = -1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] stuck @ rip %lx (%lu hits), gpa = %lx, hva = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva,
> 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> 	}
> }
> 
> kprobe:make_mmio_spte
> {
>         if (@same[tid] > 1000) {
> 	        $vcpu = (struct kvm_vcpu *)arg0;
> 		$rip = $vcpu->arch.regs[16];
> 
> 		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_mmio_spte()\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
> 	}
> }
> 
> kprobe:make_spte
> {
>         if (@same[tid] > 1000) {
> 	        $vcpu = (struct kvm_vcpu *)arg0;
> 		$rip = $vcpu->arch.regs[16];
> 
> 		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_spte()\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
> 	}
> }
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-22 21:23                                                       ` Eric Wheeler
@ 2023-08-22 21:32                                                         ` Sean Christopherson
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-22 21:32 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Tue, Aug 22, 2023, Eric Wheeler wrote:
> On Tue, 22 Aug 2023, Sean Christopherson wrote:
> > On Mon, Aug 21, 2023, Eric Wheeler wrote:
> > Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> > from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.
> 
> Thanks, we'll try it out.
> 
> To confirm, when you say "Enable trace_kvm_exit", is it this:
> 
> 	echo 1 > /sys/kernel/tracing/events/kvm/kvm_exit/enable
> 
> or this (which might be the same):
> 
> 	echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable
> 
> or something else?

Yep, you got it.  They're the same thing; the two paths are just different (common)
places to mount the kernel's debugfs.
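
FWIW, another way to get the same effect is to have bpftrace itself attach to the
tracepoint; that should arm the event just like the echo does, and it also lets
you cross-check the RIP that the other probes report.  A sketch, assuming the
kvm_exit tracepoint exposes guest_rip on your kernel:

tracepoint:kvm:kvm_exit
{
        // cache the RIP reported at VM-Exit, keyed by vCPU thread
        @exit_rip[tid] = args->guest_rip;
}

Either approach is fine; the echo is simpler.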

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-22 15:11                                                     ` Sean Christopherson
  2023-08-22 21:23                                                       ` Eric Wheeler
@ 2023-08-23  0:39                                                       ` Eric Wheeler
  2023-08-23 17:54                                                         ` Sean Christopherson
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-23  0:39 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Tue, 22 Aug 2023, Sean Christopherson wrote:
> > Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
> > was first stuck on one vCPU for most of the time, and toward the end it 
> > was stuck on a different vCPU:
> > 
> > The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
> > then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
> > see in the file, they are not interleaved:
> > 
> > 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> > 
> >   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
> >    555596 vcpu=ffff9964cdc48000
> >     31784 vcpu=ffff9934ed50c680
> 
> Hrm, but the address range being invalidated is changing.  Without seeing the
> guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
> truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
> that it looks stuck.
> 
> Drat.
> 
> > > Below is another bpftrace program that will hopefully shrink the 
> > > haystack to the point where we can find something via code inspection.
> > 
> > Ok thanks, we'll give it a try.
> 
> Try this version instead.  It's more comprehensive and more precise, e.g. should
> only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.
> 
> Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.

Ok, we got a 740MB log (it zips down to 25MB); if you would like to see the
whole thing, it is here:
	http://linuxglobal.com/out/handle_ept_violation-v2.log.gz

For brevity, here is a sample of each 100,000th line:

# zcat handle_ept_violation-v2.log.gz | perl -lne '!($n++%100000) && print'
Attaching 3 probes...
00:30:31:347560 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (375972 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:32:047769 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (706308 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:32:746729 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (1039825 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:33:447298 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (1375881 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:34:160967 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (1715243 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:34:882187 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (2060501 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:35:597351 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (2402485 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:36:323613 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (2749250 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:37:039834 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (3090704 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:37:775801 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (3444375 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:38:564075 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (3824996 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:39:320611 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (4186268 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:40:006544 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (4514744 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:40:708219 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (4850395 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:41:424570 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (5195103 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:42:147032 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (5543824 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:42:878845 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (5887371 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:43:590424 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (6234881 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:44:308041 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (6581426 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:45:039868 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (6925844 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:45:773678 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (7278195 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:46:507469 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (7634480 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:47:252621 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (7997426 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:47:935297 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (8323684 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:48:626000 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (8654836 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:49:344852 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (8991281 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:50:239769 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (9414664 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:50:940747 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (9755243 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:51:666992 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (10101607 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:52:383241 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (10433521 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:53:105012 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (10785355 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:53:867494 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (11149981 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:54:632045 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (11515349 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:55:403662 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (11882055 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:56:182540 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (12251134 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:56:915633 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (12596230 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:57:658496 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (12939661 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:58:395829 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (13290046 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
00:30:59:186697 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (13670748 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000

...

@same[553909]: 14142928


--
Eric Wheeler



> 
> struct kvm_page_fault {
> 	const u64 addr;
> 	const u32 error_code;
> 	const bool prefetch;
> 
> 	const bool exec;
> 	const bool write;
> 	const bool present;
> 	const bool rsvd;
> 	const bool user;
> 
> 	const bool is_tdp;
> 	const bool nx_huge_page_workaround_enabled;
> 
> 	bool huge_page_disallowed;
> 	u8 max_level;
> 
> 	u8 req_level;
> 
> 	u8 goal_level;
> 
> 	u64 gfn;
> 
> 	struct kvm_memory_slot *slot;
> 
> 	u64 pfn;
> 	unsigned long hva;
> 	bool map_writable;
> };
> 
> kprobe:kvm_faultin_pfn
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$kvm = $vcpu->kvm;
> 	$rip = $vcpu->arch.regs[16];
> 
> 	if (@last_rip[tid] == $rip) {
> 		@same[tid]++
> 	} else {
> 		@same[tid] = 0;
> 	}
> 	@last_rip[tid] = $rip;
> 
> 	if (@same[tid] > 1000) {
> 		$fault = (struct kvm_page_fault *)arg1;
> 		$hva = -1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] stuck @ rip %lx (%lu hits), gpa = %lx, hva = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva,
> 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> 	}
> }
> 
> kprobe:make_mmio_spte
> {
>         if (@same[tid] > 1000) {
> 	        $vcpu = (struct kvm_vcpu *)arg0;
> 		$rip = $vcpu->arch.regs[16];
> 
> 		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_mmio_spte()\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
> 	}
> }
> 
> kprobe:make_spte
> {
>         if (@same[tid] > 1000) {
> 	        $vcpu = (struct kvm_vcpu *)arg0;
> 		$rip = $vcpu->arch.regs[16];
> 
> 		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_spte()\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
> 	}
> }
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-23  0:39                                                       ` Eric Wheeler
@ 2023-08-23 17:54                                                         ` Sean Christopherson
  2023-08-23 19:44                                                           ` Eric Wheeler
  2023-08-23 22:12                                                           ` Eric Wheeler
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2023-08-23 17:54 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Tue, Aug 22, 2023, Eric Wheeler wrote:
> On Tue, 22 Aug 2023, Sean Christopherson wrote:
> > > Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
> > > was first stuck on one vCPU for most of the time, and toward the end it 
> > > was stuck on a different vCPU:
> > > 
> > > The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
> > > then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
> > > see in the file, they are not interleaved:
> > > 
> > > 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> > > 
> > >   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
> > >    555596 vcpu=ffff9964cdc48000
> > >     31784 vcpu=ffff9934ed50c680
> > 
> > Hrm, but the address range being invalidated is changing.  Without seeing the
> > guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
> > truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
> > that it looks stuck.
> > 
> > Drat.
> > 
> > > > Below is another bpftrace program that will hopefully shrink the 
> > > > haystack to the point where we can find something via code inspection.
> > > 
> > > Ok thanks, we'll give it a try.
> > 
> > Try this version instead.  It's more comprehensive and more precise, e.g. should
> > only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.
> > 
> > Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> > from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.
> 
> Ok, we got a 740MB log (it zips down to 25MB); if you would like to see the
> whole thing, it is here:
> 	http://linuxglobal.com/out/handle_ept_violation-v2.log.gz
> 
> For brevity, here is a sample of each 100,000th line:
> 
> # zcat handle_ept_violation-v2.log.gz | perl -lne '!($n++%100000) && print'
> Attaching 3 probes...
> 00:30:31:347560 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (375972 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000

Argh.  I'm having a bit of a temper tantrum because I forgot to have the printf
spit out the memslot flags.  And I apparently gave you a version without a probe
on kvm_tdp_mmu_map().  Grr.

Can you capture one more trace?  Fingers crossed this is the last one.  Modified
program below.

On the plus side, I'm slowly learning how to effectively use bpf programs.
This version also prints return values and other relevant side effects from
kvm_faultin_pfn(), and I figured out a way to get at the vCPU's root MMU page.

And it stops printing after a vCPU (task) has been stuck for 100k exits, i.e. it
should self-limit the spam.  Even if the vCPU managed to unstick itself after
that point, which is *extremely* unlikely, being stuck for 100k exits all but
guarantees there's a bug somewhere.

So this *should* give us a smoking gun.  Depending on what the gun points at, a
full root cause may still be a long ways off, but I'm pretty sure this mess will
tell us exactly why KVM is refusing to fix the fault.

--

struct kvm_page_fault {
	const u64 addr;
	const u32 error_code;

	const bool prefetch;
	const bool exec;
	const bool write;
	const bool present;

	const bool rsvd;
	const bool user;
	const bool is_tdp;
	const bool nx_huge_page_workaround_enabled;

	bool huge_page_disallowed;

	u8 max_level;
	u8 req_level;
	u8 goal_level;

	u64 gfn;

	struct kvm_memory_slot *slot;

	u64 pfn;
	unsigned long hva;
	bool map_writable;
};

struct kvm_mmu_page {
	struct list_head link;
	struct hlist_node hash_link;

	bool tdp_mmu_page;
	bool unsync;
	u8 mmu_valid_gen;
	bool lpage_disallowed;

	u32 role;
	u64 gfn;

	u64 *spt;

	u64 *shadowed_translation;

	int root_count;
};

kprobe:kvm_faultin_pfn
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$kvm = $vcpu->kvm;
	$rip = $vcpu->arch.regs[16];

	if (@last_rip[tid] == $rip) {
		@same[tid]++
	} else {
		@same[tid] = 0;
	}
	@last_rip[tid] = $rip;

	if (@same[tid] > 1000 && @same[tid] < 100000) {
		$fault = (struct kvm_page_fault *)arg1;
		$hva = -1;
		$flags = 0;

		@__vcpu[tid] = arg0;
		@__fault[tid] = arg1;

		if ($fault->slot != 0) {
			$hva = $fault->slot->userspace_addr +
			       (($fault->gfn - $fault->slot->base_gfn) << 12);
			$flags = $fault->slot->flags;
		}

		printf("%s tid[%u] pid[%u] FAULTIN @ rip %lx (%lu hits), gpa = %lx, hva = %lx, flags = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva, $flags,
		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
	} else {
		@__vcpu[tid] = 0;
		@__fault[tid] = 0;
	}
}

kretprobe:kvm_faultin_pfn
{
	if (@__fault[tid] != 0) {
		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
		$kvm = $vcpu->kvm;
		$fault = (struct kvm_page_fault *)@__fault[tid];
		$hva = -1;
		$flags = 0;

		if ($fault->slot != 0) {
			$hva = $fault->slot->userspace_addr +
			       (($fault->gfn - $fault->slot->base_gfn) << 12);
			$flags = $fault->slot->flags;
		}

		printf("%s tid[%u] pid[%u] FAULTIN_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx (%lx), flags = %lx, pfn = %lx, ret = %lu : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
		       $fault->addr, $hva, $fault->hva, $flags, $fault->pfn, retval,
		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
		printf("%s tid[%u] pid[%u] FAULTIN_ERROR @ rip %lx (%lu hits), ret = %lu\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
	}
}

kprobe:kvm_tdp_mmu_map
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$rip = $vcpu->arch.regs[16];

	if (@last_rip[tid] == $rip) {
		@same[tid]++
	} else {
		@same[tid] = 0;
	}
	@last_rip[tid] = $rip;

	if (@__fault[tid] != 0) {
	        $vcpu = (struct kvm_vcpu *)arg0;
		$fault = (struct kvm_page_fault *)arg1;

                if (@__vcpu[tid] != arg0 || @__fault[tid] != arg1) {
                        printf("%s tid[%u] pid[%u] MAP_ERROR vcpu %lx vs. %lx, fault %lx vs. %lx\n",
                               strftime("%H:%M:%S:%f", nsecs), tid, pid, @__vcpu[tid], arg0, @__fault[tid], arg1);
                }

		printf("%s tid[%u] pid[%u] MAP @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
		       $fault->addr, $fault->hva, $fault->pfn);
	} else {
		@__vcpu[tid] = 0;
		@__fault[tid] = 0;
	}
}

kretprobe:kvm_tdp_mmu_map
{
	if (@__fault[tid] != 0) {
		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
		$fault = (struct kvm_page_fault *)@__fault[tid];
		$hva = -1;
		$flags = 0;

		if ($fault->slot != 0) {
			$hva = $fault->slot->userspace_addr +
			       (($fault->gfn - $fault->slot->base_gfn) << 12);
			$flags = $fault->slot->flags;
		}

		printf("%s tid[%u] pid[%u] MAP_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx, ret = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
		       $fault->addr, $fault->hva, $fault->pfn, retval);
	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
		printf("%s tid[%u] pid[%u] MAP_RET_ERROR @ rip %lx (%lu hits), ret = %lu\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
	}
}

kprobe:tdp_iter_start
{
	if (@__fault[tid] != 0) {
                $vcpu = (struct kvm_vcpu *)@__vcpu[tid];
		$fault = (struct kvm_page_fault *)@__fault[tid];
	        $root = (struct kvm_mmu_page *)arg1;

		printf("%s tid[%u] pid[%u] ITER @ rip %lx (%lu hits), gpa = %lx (%lx), hva = %lx, pfn = %lx, tdp_mmu = %u, role = %x, count = %d\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
		       $fault->addr, arg3 << 12, $fault->hva, $fault->pfn,
                       $root->tdp_mmu_page, $root->role, $root->root_count);
	} else {
		@__vcpu[tid] = 0;
		@__fault[tid] = 0;
	}
}

kprobe:make_mmio_spte
{
        if (@__fault[tid] != 0) {
		$fault = (struct kvm_page_fault *)@__fault[tid];

		printf("%s tid[%u] pid[%u] MMIO @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
		       $fault->addr, $fault->hva, $fault->pfn);
	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
		printf("%s tid[%u] pid[%u] MMIO_ERROR @ rip %lx (%lu hits)\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
	}
}

kprobe:make_spte
{
        if (@__fault[tid] != 0) {
		$fault = (struct kvm_page_fault *)@__fault[tid];

		printf("%s tid[%u] pid[%u] SPTE @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
		       $fault->addr, $fault->hva, $fault->pfn);
	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
		printf("%s tid[%u] pid[%u] SPTE_ERROR @ rip %lx (%lu hits)\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
	}
}
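
One purely cosmetic addition: bpftrace dumps every non-empty map when it exits,
which is where the trailing @same[...] / @last_rip[...] lines in your last log
came from.  If that noise is unwanted, an END block clears the maps on exit:

END
{
        clear(@last_rip);
        clear(@same);
        clear(@__vcpu);
        clear(@__fault);
}

Totally optional; the dump is harmless.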


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-23 17:54                                                         ` Sean Christopherson
@ 2023-08-23 19:44                                                           ` Eric Wheeler
  2023-08-23 22:12                                                           ` Eric Wheeler
  1 sibling, 0 replies; 48+ messages in thread
From: Eric Wheeler @ 2023-08-23 19:44 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Wed, 23 Aug 2023, Sean Christopherson wrote:
> On Tue, Aug 22, 2023, Eric Wheeler wrote:
> > On Tue, 22 Aug 2023, Sean Christopherson wrote:
> > > > Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
> > > > was first stuck on one vCPU for most of the time, and toward the end it 
> > > > was stuck on a different vCPU:
> > > > 
> > > > The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
> > > > then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
> > > > see in the file, they are not interleaved:
> > > > 
> > > > 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> > > > 
> > > >   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
> > > >    555596 vcpu=ffff9964cdc48000
> > > >     31784 vcpu=ffff9934ed50c680
> > > 
> > > Hrm, but the address range being invalidated is changing.  Without seeing the
> > > guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
> > > truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
> > > that it looks stuck.
> > > 
> > > Drat.
> > > 
> > > > > Below is another bpftrace program that will hopefully shrink the 
> > > > > haystack to the point where we can find something via code inspection.
> > > > 
> > > > Ok thanks, we'll give it a try.
> > > 
> > > Try this version instead.  It's more comprehensive and more precise, e.g. should
> > > only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.
> > > 
> > > Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> > > from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.
> > 
> > Ok, we got a 740MB log (it zips down to 25MB); if you would like to see the
> > whole thing, it is here:
> > 	http://linuxglobal.com/out/handle_ept_violation-v2.log.gz
> > 
> > For brevity, here is a sample of each 100,000th line:
> > 
> > # zcat handle_ept_violation-v2.log.gz | perl -lne '!($n++%100000) && print'
> > Attaching 3 probes...
> > 00:30:31:347560 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (375972 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
> 
> Argh.  I'm having a bit of a temper tantrum because I forgot to have the printf
> spit out the memslot flags.  And I apparently gave you a version without a probe
> on kvm_tdp_mmu_map().  Grr.
> 
> Can you capture one more trace?  Fingers crossed this is the last one.  Modified
> program below.

Sure!

> On the plus side, I'm slowly learning how to effectively use bpf programs.
> This version also prints return values and other relevant side effects from
> kvm_faultin_pfn(), and I figured out a way to get at the vCPU's root MMU page.
> 
> And it stops printing after a vCPU (task) has been stuck for 100k exits, i.e. it
> should self-limit the spam.  Even if the vCPU managed to unstick itself after
> that point, which is *extremely* unlikely, being stuck for 100k exits all but
> guarantees there's a bug somewhere.
> 
> So this *should* give us a smoking gun.  Depending on what the gun points at, a
> full root cause may still be a long ways off, but I'm pretty sure this mess will
> tell us exactly why KVM is refusing to fix the fault.

Installed; now we wait for it to trigger again.

--
Eric Wheeler


> 
> --
> 
> struct kvm_page_fault {
> 	const u64 addr;
> 	const u32 error_code;
> 
> 	const bool prefetch;
> 	const bool exec;
> 	const bool write;
> 	const bool present;
> 
> 	const bool rsvd;
> 	const bool user;
> 	const bool is_tdp;
> 	const bool nx_huge_page_workaround_enabled;
> 
> 	bool huge_page_disallowed;
> 
> 	u8 max_level;
> 	u8 req_level;
> 	u8 goal_level;
> 
> 	u64 gfn;
> 
> 	struct kvm_memory_slot *slot;
> 
> 	u64 pfn;
> 	unsigned long hva;
> 	bool map_writable;
> };
> 
> struct kvm_mmu_page {
> 	struct list_head link;
> 	struct hlist_node hash_link;
> 
> 	bool tdp_mmu_page;
> 	bool unsync;
> 	u8 mmu_valid_gen;
> 	bool lpage_disallowed;
> 
> 	u32 role;
> 	u64 gfn;
> 
> 	u64 *spt;
> 
> 	u64 *shadowed_translation;
> 
> 	int root_count;
> };
> 
> kprobe:kvm_faultin_pfn
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$kvm = $vcpu->kvm;
> 	$rip = $vcpu->arch.regs[16];
> 
> 	if (@last_rip[tid] == $rip) {
> 		@same[tid]++
> 	} else {
> 		@same[tid] = 0;
> 	}
> 	@last_rip[tid] = $rip;
> 
> 	if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		$fault = (struct kvm_page_fault *)arg1;
> 		$hva = -1;
> 		$flags = 0;
> 
> 		@__vcpu[tid] = arg0;
> 		@__fault[tid] = arg1;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] FAULTIN @ rip %lx (%lu hits), gpa = %lx, hva = %lx, flags = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva, $flags,
> 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> 	} else {
> 		@__vcpu[tid] = 0;
> 		@__fault[tid] = 0;
> 	}
> }
> 
> kretprobe:kvm_faultin_pfn
> {
> 	if (@__fault[tid] != 0) {
> 		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> 		$kvm = $vcpu->kvm;
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 		$hva = -1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] FAULTIN_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx (%lx), flags = %lx, pfn = %lx, ret = %lu : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $hva, $fault->hva, $flags, $fault->pfn, retval,
> 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] FAULTIN_ERROR @ rip %lx (%lu hits), ret = %lu\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
> 	}
> }
> 
> kprobe:kvm_tdp_mmu_map
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$rip = $vcpu->arch.regs[16];
> 
> 	if (@last_rip[tid] == $rip) {
> 		@same[tid]++
> 	} else {
> 		@same[tid] = 0;
> 	}
> 	@last_rip[tid] = $rip;
> 
> 	if (@__fault[tid] != 0) {
> 	        $vcpu = (struct kvm_vcpu *)arg0;
> 		$fault = (struct kvm_page_fault *)arg1;
> 
>                 if (@__vcpu[tid] != arg0 || @__fault[tid] != arg1) {
>                         printf("%s tid[%u] pid[%u] MAP_ERROR vcpu %lx vs. %lx, fault %lx vs. %lx\n",
>                                strftime("%H:%M:%S:%f", nsecs), tid, pid, @__vcpu[tid], arg0, @__fault[tid], arg1);
>                 }
> 
> 		printf("%s tid[%u] pid[%u] MAP @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn);
> 	} else {
> 		@__vcpu[tid] = 0;
> 		@__fault[tid] = 0;
> 	}
> }
> 
> kretprobe:kvm_tdp_mmu_map
> {
> 	if (@__fault[tid] != 0) {
> 		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 		$hva = -1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] MAP_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx, ret = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn, retval);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] MAP_RET_ERROR @ rip %lx (%lu hits), ret = %lu\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
> 	}
> }
> 
> kprobe:tdp_iter_start
> {
> 	if (@__fault[tid] != 0) {
>                 $vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 	        $root = (struct kvm_mmu_page *)arg1;
> 
> 		printf("%s tid[%u] pid[%u] ITER @ rip %lx (%lu hits), gpa = %lx (%lx), hva = %lx, pfn = %lx, tdp_mmu = %u, role = %x, count = %d\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, arg3 << 12, $fault->hva, $fault->pfn,
>                        $root->tdp_mmu_page, $root->role, $root->root_count);
> 	} else {
> 		@__vcpu[tid] = 0;
> 		@__fault[tid] = 0;
> 	}
> }
> 
> kprobe:make_mmio_spte
> {
>         if (@__fault[tid] != 0) {
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 
> 		printf("%s tid[%u] pid[%u] MMIO @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] MMIO_ERROR @ rip %lx (%lu hits)\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
> 	}
> }
> 
> kprobe:make_spte
> {
>         if (@__fault[tid] != 0) {
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 
> 		printf("%s tid[%u] pid[%u] SPTE @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] SPTE_ERROR @ rip %lx (%lu hits)\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
> 	}
> }
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-23 17:54                                                         ` Sean Christopherson
  2023-08-23 19:44                                                           ` Eric Wheeler
@ 2023-08-23 22:12                                                           ` Eric Wheeler
  2023-08-23 22:32                                                             ` Eric Wheeler
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-23 22:12 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Wed, 23 Aug 2023, Sean Christopherson wrote:
> On Tue, Aug 22, 2023, Eric Wheeler wrote:
> > On Tue, 22 Aug 2023, Sean Christopherson wrote:
> > > > Here is the whole log with 500,000+ lines over 5 minutes of recording; it 
> > > > was first stuck on one vCPU for most of the time, and toward the end it 
> > > > was stuck on a different vCPU:
> > > > 
> > > > The file starts with 555,596 occurrences of vcpu=ffff9964cdc48000 and is 
> > > > then followed by 31,784 occurrences of vcpu=ffff9934ed50c680.  As you can 
> > > > see in the file, they are not interleaved:
> > > > 
> > > > 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> > > > 
> > > >   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
> > > >    555596 vcpu=ffff9964cdc48000
> > > >     31784 vcpu=ffff9934ed50c680
> > > 
> > > Hrm, but the address range being invalidated is changing.  Without seeing the
> > > guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
> > > truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
> > > that it looks stuck.
> > > 
> > > Drat.
> > > 
> > > > > Below is another bpftrace program that will hopefully shrink the 
> > > > > haystack to the point where we can find something via code inspection.
> > > > 
> > > > Ok thanks, we'll give it a try.
> > > 
> > > Try this version instead.  It's more comprehensive and more precise, e.g. should
> > > only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.
> > > 
> > > Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> > > from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.
> > 
> > Ok, we got a 740MB log (it zips down to 25MB); if you would like to see the
> > whole thing, it is here:
> > 	http://linuxglobal.com/out/handle_ept_violation-v2.log.gz
> > 
> > For brevity, here is a sample of each 100,000th line:
> > 
> > # zcat handle_ept_violation-v2.log.gz | perl -lne '!($n++%100000) && print'
> > Attaching 3 probes...
> > 00:30:31:347560 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (375972 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
> 
> Argh.  I'm having a bit of a temper tantrum because I forgot to have the printf
> spit out the memslot flags.  And I apparently gave you a version without a probe
> on kvm_tdp_mmu_map().  Grr.
> 
> Can you capture one more trace?  Fingers crossed this is the last one.  

Here it is: http://linuxglobal.com/out/handle_ept_violation-v3.log.gz

Here are the highlights:

# zcat handle_ept_violation-v3.log.gz | grep 3484173 | tail -n30
21:25:50:282711 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92234 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282714 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92235 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282720 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92237 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282723 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92238 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282726 tid[3484173] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (92239 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282743 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92245 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282749 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92247 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282752 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92248 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282755 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92249 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282769 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92254 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282775 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92256 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282778 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92257 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282784 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92259 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282804 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92266 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282813 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92269 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282816 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92270 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282822 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92272 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282825 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92273 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282831 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92275 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282837 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92277 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282845 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92280 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282854 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92283 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282857 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92284 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282863 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92286 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282866 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92287 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282869 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92288 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
@__fault[3484173]: 0
@__vcpu[3484173]: 0
@last_rip[3484173]: 18446744071583984805
@same[3484173]: 11615602


# zcat handle_ept_violation-v3.log.gz | grep 3484174 | tail -n30
21:25:50:282354 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90073 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282418 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90087 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282475 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90100 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282507 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90107 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282524 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90111 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282528 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90112 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282557 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90118 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282571 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90121 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282660 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90141 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282670 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90143 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282684 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90146 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282693 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90148 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282731 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90156 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282735 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90157 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282739 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90158 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282759 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90162 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282764 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90163 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282786 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90168 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282791 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90169 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282795 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90170 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282800 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90171 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282809 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90173 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282832 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90178 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282841 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90180 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282851 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90182 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
21:25:50:282874 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90187 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
@__fault[3484174]: 0
@__vcpu[3484174]: 0
@last_rip[3484174]: 18446744071583984805
@same[3484174]: 11598962


--
Eric Wheeler




> 
> On the plus side, I'm slowly learning how to effectively use bpf programs.
> This version also prints return values and other relevant side effects from
> kvm_faultin_pfn(), and I figured out a way to get at the vCPU's root MMU page.
> 
> And it stops printing after a vCPU (task) has been stuck for 100k exits, i.e. it
> should self-limit the spam.  Even if the vCPU managed to unstick itself after
> that point, which is *extremely* unlikely, being stuck for 100k exits all but
> guarantees there's a bug somewhere.
> 
> So this *should* give us a smoking gun.  Depending on what the gun points at, a
> full root cause may still be a long ways off, but I'm pretty sure this mess will
> tell us exactly why KVM is refusing to fix the fault.
> 
> --
> 
> struct kvm_page_fault {
> 	const u64 addr;
> 	const u32 error_code;
> 
> 	const bool prefetch;
> 	const bool exec;
> 	const bool write;
> 	const bool present;
> 
> 	const bool rsvd;
> 	const bool user;
> 	const bool is_tdp;
> 	const bool nx_huge_page_workaround_enabled;
> 
> 	bool huge_page_disallowed;
> 
> 	u8 max_level;
> 	u8 req_level;
> 	u8 goal_level;
> 
> 	u64 gfn;
> 
> 	struct kvm_memory_slot *slot;
> 
> 	u64 pfn;
> 	unsigned long hva;
> 	bool map_writable;
> };
> 
> struct kvm_mmu_page {
> 	struct list_head link;
> 	struct hlist_node hash_link;
> 
> 	bool tdp_mmu_page;
> 	bool unsync;
> 	u8 mmu_valid_gen;
> 	bool lpage_disallowed;
> 
> 	u32 role;
> 	u64 gfn;
> 
> 	u64 *spt;
> 
> 	u64 *shadowed_translation;
> 
> 	int root_count;
> }
> 
> kprobe:kvm_faultin_pfn
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$kvm = $vcpu->kvm;
> 	$rip = $vcpu->arch.regs[16];
> 
> 	if (@last_rip[tid] == $rip) {
> 		@same[tid]++
> 	} else {
> 		@same[tid] = 0;
> 	}
> 	@last_rip[tid] = $rip;
> 
> 	if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		$fault = (struct kvm_page_fault *)arg1;
> 		$hva = -1;
> 		$flags = 0;
> 
> 		@__vcpu[tid] = arg0;
> 		@__fault[tid] = arg1;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] FAULTIN @ rip %lx (%lu hits), gpa = %lx, hva = %lx, flags = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva, $flags,
> 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> 	} else {
> 		@__vcpu[tid] = 0;
> 		@__fault[tid] = 0;
> 	}
> }
> 
> kretprobe:kvm_faultin_pfn
> {
> 	if (@__fault[tid] != 0) {
> 		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> 		$kvm = $vcpu->kvm;
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 		$hva = -1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] FAULTIN_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx (%lx), flags = %lx, pfn = %lx, ret = %lu : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $hva, $fault->hva, $flags, $fault->pfn, retval,
> 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] FAULTIN_ERROR @ rip %lx (%lu hits), ret = %lu\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
> 	}
> }
> 
> kprobe:kvm_tdp_mmu_map
> {
> 	$vcpu = (struct kvm_vcpu *)arg0;
> 	$rip = $vcpu->arch.regs[16];
> 
> 	if (@last_rip[tid] == $rip) {
> 		@same[tid]++
> 	} else {
> 		@same[tid] = 0;
> 	}
> 	@last_rip[tid] = $rip;
> 
> 	if (@__fault[tid] != 0) {
> 	        $vcpu = (struct kvm_vcpu *)arg0;
> 		$fault = (struct kvm_page_fault *)arg1;
> 
>                 if (@__vcpu[tid] != arg0 || @__fault[tid] != arg1) {
>                         printf("%s tid[%u] pid[%u] MAP_ERROR vcpu %lx vs. %lx, fault %lx vs. %lx\n",
>                                strftime("%H:%M:%S:%f", nsecs), tid, pid, @__vcpu[tid], arg0, @__fault[tid], arg1);
>                 }
> 
> 		printf("%s tid[%u] pid[%u] MAP @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn);
> 	} else {
> 		@__vcpu[tid] = 0;
> 		@__fault[tid] = 0;
> 	}
> }
> 
> kretprobe:kvm_tdp_mmu_map
> {
> 	if (@__fault[tid] != 0) {
> 		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 		$hva = -1;
> 		$flags = 0;
> 
> 		if ($fault->slot != 0) {
> 			$hva = $fault->slot->userspace_addr +
> 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> 			$flags = $fault->slot->flags;
> 		}
> 
> 		printf("%s tid[%u] pid[%u] MAP_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx, ret = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn, retval);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] MAP_RET_ERROR @ rip %lx (%lu hits), ret = %lu\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
> 	}
> }
> 
> kprobe:tdp_iter_start
> {
> 	if (@__fault[tid] != 0) {
>                 $vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 	        $root = (struct kvm_mmu_page *)arg1;
> 
> 		printf("%s tid[%u] pid[%u] ITER @ rip %lx (%lu hits), gpa = %lx (%lx), hva = %lx, pfn = %lx, tdp_mmu = %u, role = %x, count = %d\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, arg3 << 12, $fault->hva, $fault->pfn,
>                        $root->tdp_mmu_page, $root->role, $root->root_count);
> 	} else {
> 		@__vcpu[tid] = 0;
> 		@__fault[tid] = 0;
> 	}
> }
> 
> kprobe:make_mmio_spte
> {
>         if (@__fault[tid] != 0) {
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 
> 		printf("%s tid[%u] pid[%u] MMIO @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] MMIO_ERROR @ rip %lx (%lu hits)\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
> 	}
> }
> 
> kprobe:make_spte
> {
>         if (@__fault[tid] != 0) {
> 		$fault = (struct kvm_page_fault *)@__fault[tid];
> 
> 		printf("%s tid[%u] pid[%u] SPTE @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> 		       $fault->addr, $fault->hva, $fault->pfn);
> 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> 		printf("%s tid[%u] pid[%u] SPTE_ERROR @ rip %lx (%lu hits)\n",
> 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
> 	}
> }
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-23 22:12                                                           ` Eric Wheeler
@ 2023-08-23 22:32                                                             ` Eric Wheeler
  2023-08-23 23:21                                                               ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-23 22:32 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Wed, 23 Aug 2023, Eric Wheeler wrote:
> On Wed, 23 Aug 2023, Sean Christopherson wrote:
> > On Tue, Aug 22, 2023, Eric Wheeler wrote:
> > > On Tue, 22 Aug 2023, Sean Christopherson wrote:
> > > > > Here is the whole log with 500,000+ lines over 5 minutes of recording, it 
> > > > > was first stuck on one vcpu for most of the time, and toward the end it 
> > > > > was stuck on a different VCPU:
> > > > > 
> > > > > The file starts with 555,596 occurances of vcpu=ffff9964cdc48000 and is 
> > > > > then followed by 31,784 occurances of vcpu=ffff9934ed50c680.  As you can 
> > > > > see in the file, they are not interleaved:
> > > > > 
> > > > > 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> > > > > 
> > > > >   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
> > > > >    555596 vcpu=ffff9964cdc48000
> > > > >     31784 vcpu=ffff9934ed50c680
> > > > 
> > > > Hrm, but the address range being invalidated is changing.  Without seeing the
> > > > guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
> > > > truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
> > > > that it looks stuck.
> > > > 
> > > > Drat.
> > > > 
> > > > > > Below is another bpftrace program that will hopefully shrink the 
> > > > > > haystack to the point where we can find something via code inspection.
> > > > > 
> > > > > Ok thanks, we'll give it a try.
> > > > 
> > > > Try this version instead.  It's more comprehensive and more precise, e.g. should
> > > > only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.
> > > > 
> > > > Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
> > > > from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.
> > > 
> > > Ok, we got a 740MB log, zips down to 25MB if you would like to see the
> > > whole thing, it is here:
> > > 	http://linuxglobal.com/out/handle_ept_violation-v2.log.gz
> > > 
> > > For brevity, here is a sample of each 100,000th line:
> > > 
> > > # zcat handle_ept_violation-v2.log.gz | perl -lne '!($n++%100000) && print'
> > > Attaching 3 probes...
> > > 00:30:31:347560 tid[553909] pid[553848] stuck @ rip ffffffff80094a41 (375972 hits), gpa = 294a41, hva = 7efc5b094000 : MMU seq = 8000b139, in-prog = 0, start = 7efc6e10f000, end = 7efc6e110000
> > 
> > Argh.  I'm having a bit of a temper tantrum because I forgot to have the printf
> > spit out the memslot flags.  And I apparently gave you a version without a probe
> > on kvm_tdp_mmu_map().  Grr.
> > 
> > Can you capture one more trace?  Fingers crossed this is the last one.  


See the previous email too, because it had >90k hits:
	* >90k hits: http://linuxglobal.com/out/handle_ept_violation-v3.log.gz

However, this one has different content (not just FAULTs), but it has only 1k hits:
	* >1k hits: http://linuxglobal.com/out/handle_ept_violation-v3-2.log.gz

22:23:31:481644 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1196 hits), gpa = cf7b000, hva = 7f6d24f7b000, pfn = 1e35fc6
22:23:31:481645 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1196 hits), gpa = cf7b000, hva = 7f6d24f7b000, pfn = 1e35fc6, ret = 4
22:23:31:481650 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000, hva = 7f6d24f7c000, flags = 0 : MMU seq = 4724e, in-prog = 0, start = 7f6d24f7b000, end = 7f6d24f7c000
22:23:31:481653 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000 (cf7c000), hva = 7f6d24f7c000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
22:23:31:481658 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000 (cf7c000), hva = 7f6d24f7c000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481660 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000, hva = 7f6d24f7c000 (7f6d24f7c000), flags = 0, pfn = 1ee7439, ret = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
22:23:31:481665 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1198 hits), gpa = cf7c000, hva = 7f6d24f7c000, flags = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
22:23:31:481666 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1198 hits), gpa = cf7c000, hva = 7f6d24f7c000 (7f6d24f7c000), flags = 0, pfn = 1ee7439, ret = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
22:23:31:481667 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000, hva = 7f6d24f7c000, pfn = 1ee7439
22:23:31:481668 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000 (cf7c000), hva = 7f6d24f7c000, pfn = 1ee7439, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481669 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000, hva = 7f6d24f7c000, pfn = 1ee7439
22:23:31:481670 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000, hva = 7f6d24f7c000, pfn = 1ee7439, ret = 4
22:23:31:481673 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000, hva = 7f6d24f7d000, flags = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
22:23:31:481676 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000 (cf7d000), hva = 7f6d24f7d000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
22:23:31:481677 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000 (cf7d000), hva = 7f6d24f7d000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481678 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000, hva = 7f6d24f7d000 (7f6d24f7d000), flags = 0, pfn = 4087d5, ret = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
22:23:31:481682 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1201 hits), gpa = cf7d000, hva = 7f6d24f7d000, flags = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
22:23:31:481683 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1201 hits), gpa = cf7d000, hva = 7f6d24f7d000 (7f6d24f7d000), flags = 0, pfn = 4087d5, ret = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
22:23:31:481684 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000, hva = 7f6d24f7d000, pfn = 4087d5
22:23:31:481684 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000 (cf7d000), hva = 7f6d24f7d000, pfn = 4087d5, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481685 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000, hva = 7f6d24f7d000, pfn = 4087d5
22:23:31:481686 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000, hva = 7f6d24f7d000, pfn = 4087d5, ret = 4
22:23:31:481689 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000, hva = 7f6d24f7e000, flags = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
22:23:31:481691 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000 (cf7e000), hva = 7f6d24f7e000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
22:23:31:481692 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000 (cf7e000), hva = 7f6d24f7e000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481694 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000, hva = 7f6d24f7e000 (7f6d24f7e000), flags = 0, pfn = 51c5ee, ret = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
22:23:31:481697 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1204 hits), gpa = cf7e000, hva = 7f6d24f7e000, flags = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
22:23:31:481698 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1204 hits), gpa = cf7e000, hva = 7f6d24f7e000 (7f6d24f7e000), flags = 0, pfn = 51c5ee, ret = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
22:23:31:481699 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000, hva = 7f6d24f7e000, pfn = 51c5ee
22:23:31:481699 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000 (cf7e000), hva = 7f6d24f7e000, pfn = 51c5ee, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481700 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000, hva = 7f6d24f7e000, pfn = 51c5ee
22:23:31:481701 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000, hva = 7f6d24f7e000, pfn = 51c5ee, ret = 4
22:23:31:481703 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000, hva = 7f6d24f7f000, flags = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
22:23:31:481706 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
22:23:31:481707 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481708 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000, hva = 7f6d24f7f000 (7f6d24f7f000), flags = 0, pfn = 1ee50ee, ret = 0 : MMU seq = 47252, in-prog = 0, start = 7f6d24f7f000, end = 7f6d24f80000
22:23:31:481712 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1207 hits), gpa = cf7f000, hva = 7f6d24f7f000, flags = 0 : MMU seq = 47252, in-prog = 0, start = 7f6d24f7f000, end = 7f6d24f80000
22:23:31:481712 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1207 hits), gpa = cf7f000, hva = 7f6d24f7f000 (7f6d24f7f000), flags = 0, pfn = 1ee50ee, ret = 0 : MMU seq = 47252, in-prog = 0, start = 7f6d24f7f000, end = 7f6d24f80000
22:23:31:481714 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
22:23:31:481714 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 1ee50ee, tdp_mmu = 1, role = 3784, count = 1
22:23:31:481715 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
22:23:31:481716 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee, ret = 4
@__fault[142943]: 0
@__fault[212784]: 0
@__vcpu[142943]: 0
@__vcpu[212784]: 0
@last_rip[142943]: 4634265
@last_rip[212784]: 18446735293648697744
@same[212784]: 1
@same[142943]: 20



--
Eric Wheeler


> 
> Here it is: http://linuxglobal.com/out/handle_ept_violation-v3.log.gz
> 
> Here are the highlights:
> 
> # zcat handle_ept_violation-v3.log.gz | grep 3484173 | tail -n30
> 21:25:50:282711 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92234 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282714 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92235 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282720 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92237 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282723 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92238 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282726 tid[3484173] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (92239 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282743 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92245 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282749 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92247 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282752 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92248 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282755 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92249 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282769 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92254 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282775 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92256 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282778 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92257 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282784 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92259 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282804 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92266 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282813 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92269 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282816 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92270 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282822 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92272 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282825 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92273 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282831 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92275 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282837 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92277 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282845 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92280 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282854 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92283 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282857 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92284 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282863 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92286 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282866 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92287 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282869 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92288 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> @__fault[3484173]: 0
> @__vcpu[3484173]: 0
> @last_rip[3484173]: 18446744071583984805
> @same[3484173]: 11615602
> 
> 
> # zcat handle_ept_violation-v3.log.gz | grep 3484174 | tail -n30
> 21:25:50:282354 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90073 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282418 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90087 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282475 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90100 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282507 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90107 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282524 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90111 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282528 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90112 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282557 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90118 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282571 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90121 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282660 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90141 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282670 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90143 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282684 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90146 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282693 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90148 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282731 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90156 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282735 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90157 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282739 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90158 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282759 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90162 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282764 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90163 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282786 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90168 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282791 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90169 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282795 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90170 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282800 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90171 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282809 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90173 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282832 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90178 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282841 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90180 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282851 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90182 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> 21:25:50:282874 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90187 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> @__fault[3484174]: 0
> @__vcpu[3484174]: 0
> @last_rip[3484174]: 18446744071583984805
> @same[3484174]: 11598962
> 
> 
> --
> Eric Wheeler
> 
> 
> 
> 
> > 
> > On the plus side, I'm slowly learning how to effectively use bpf programs.
> > This version also prints return values and other relevant side effects from
> > kvm_faultin_pfn(), and I figured out a way to get at the vCPU's root MMU page.
> > 
> > And it stops printing after a vCPU (task) has been stuck for 100k exits, i.e. it
> > should self-limit the spam.  Even if the vCPU managed to unstick itself after
> > that point, which is *extremely* unlikely, being stuck for 100k exits all but
> > guarantees there's a bug somewhere.
> > 
> > So this *should* give us a smoking gun.  Depending on what the gun points at, a
> > full root cause may still be a long ways off, but I'm pretty sure this mess will
> > tell us exactly why KVM is refusing to fix the fault.
> > 
> > --
> > 
> > struct kvm_page_fault {
> > 	const u64 addr;
> > 	const u32 error_code;
> > 
> > 	const bool prefetch;
> > 	const bool exec;
> > 	const bool write;
> > 	const bool present;
> > 
> > 	const bool rsvd;
> > 	const bool user;
> > 	const bool is_tdp;
> > 	const bool nx_huge_page_workaround_enabled;
> > 
> > 	bool huge_page_disallowed;
> > 
> > 	u8 max_level;
> > 	u8 req_level;
> > 	u8 goal_level;
> > 
> > 	u64 gfn;
> > 
> > 	struct kvm_memory_slot *slot;
> > 
> > 	u64 pfn;
> > 	unsigned long hva;
> > 	bool map_writable;
> > };
> > 
> > struct kvm_mmu_page {
> > 	struct list_head link;
> > 	struct hlist_node hash_link;
> > 
> > 	bool tdp_mmu_page;
> > 	bool unsync;
> > 	u8 mmu_valid_gen;
> > 	bool lpage_disallowed;
> > 
> > 	u32 role;
> > 	u64 gfn;
> > 
> > 	u64 *spt;
> > 
> > 	u64 *shadowed_translation;
> > 
> > 	int root_count;
> > }
> > 
> > kprobe:kvm_faultin_pfn
> > {
> > 	$vcpu = (struct kvm_vcpu *)arg0;
> > 	$kvm = $vcpu->kvm;
> > 	$rip = $vcpu->arch.regs[16];
> > 
> > 	if (@last_rip[tid] == $rip) {
> > 		@same[tid]++
> > 	} else {
> > 		@same[tid] = 0;
> > 	}
> > 	@last_rip[tid] = $rip;
> > 
> > 	if (@same[tid] > 1000 && @same[tid] < 100000) {
> > 		$fault = (struct kvm_page_fault *)arg1;
> > 		$hva = -1;
> > 		$flags = 0;
> > 
> > 		@__vcpu[tid] = arg0;
> > 		@__fault[tid] = arg1;
> > 
> > 		if ($fault->slot != 0) {
> > 			$hva = $fault->slot->userspace_addr +
> > 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> > 			$flags = $fault->slot->flags;
> > 		}
> > 
> > 		printf("%s tid[%u] pid[%u] FAULTIN @ rip %lx (%lu hits), gpa = %lx, hva = %lx, flags = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva, $flags,
> > 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> > 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> > 	} else {
> > 		@__vcpu[tid] = 0;
> > 		@__fault[tid] = 0;
> > 	}
> > }
> > 
> > kretprobe:kvm_faultin_pfn
> > {
> > 	if (@__fault[tid] != 0) {
> > 		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> > 		$kvm = $vcpu->kvm;
> > 		$fault = (struct kvm_page_fault *)@__fault[tid];
> > 		$hva = -1;
> > 		$flags = 0;
> > 
> > 		if ($fault->slot != 0) {
> > 			$hva = $fault->slot->userspace_addr +
> > 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> > 			$flags = $fault->slot->flags;
> > 		}
> > 
> > 		printf("%s tid[%u] pid[%u] FAULTIN_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx (%lx), flags = %lx, pfn = %lx, ret = %lu : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> > 		       $fault->addr, $hva, $fault->hva, $flags, $fault->pfn, retval,
> > 		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
> > 		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
> > 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> > 		printf("%s tid[%u] pid[%u] FAULTIN_ERROR @ rip %lx (%lu hits), ret = %lu\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
> > 	}
> > }
> > 
> > kprobe:kvm_tdp_mmu_map
> > {
> > 	$vcpu = (struct kvm_vcpu *)arg0;
> > 	$rip = $vcpu->arch.regs[16];
> > 
> > 	if (@last_rip[tid] == $rip) {
> > 		@same[tid]++
> > 	} else {
> > 		@same[tid] = 0;
> > 	}
> > 	@last_rip[tid] = $rip;
> > 
> > 	if (@__fault[tid] != 0) {
> > 	        $vcpu = (struct kvm_vcpu *)arg0;
> > 		$fault = (struct kvm_page_fault *)arg1;
> > 
> >                 if (@__vcpu[tid] != arg0 || @__fault[tid] != arg1) {
> >                         printf("%s tid[%u] pid[%u] MAP_ERROR vcpu %lx vs. %lx, fault %lx vs. %lx\n",
> >                                strftime("%H:%M:%S:%f", nsecs), tid, pid, @__vcpu[tid], arg0, @__fault[tid], arg1);
> >                 }
> > 
> > 		printf("%s tid[%u] pid[%u] MAP @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> > 		       $fault->addr, $fault->hva, $fault->pfn);
> > 	} else {
> > 		@__vcpu[tid] = 0;
> > 		@__fault[tid] = 0;
> > 	}
> > }
> > 
> > kretprobe:kvm_tdp_mmu_map
> > {
> > 	if (@__fault[tid] != 0) {
> > 		$vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> > 		$fault = (struct kvm_page_fault *)@__fault[tid];
> > 		$hva = -1;
> > 		$flags = 0;
> > 
> > 		if ($fault->slot != 0) {
> > 			$hva = $fault->slot->userspace_addr +
> > 			       (($fault->gfn - $fault->slot->base_gfn) << 12);
> > 			$flags = $fault->slot->flags;
> > 		}
> > 
> > 		printf("%s tid[%u] pid[%u] MAP_RET @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx, ret = %lx\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> > 		       $fault->addr, $fault->hva, $fault->pfn, retval);
> > 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> > 		printf("%s tid[%u] pid[%u] MAP_RET_ERROR @ rip %lx (%lu hits), ret = %lu\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid], retval);
> > 	}
> > }
> > 
> > kprobe:tdp_iter_start
> > {
> > 	if (@__fault[tid] != 0) {
> >                 $vcpu = (struct kvm_vcpu *)@__vcpu[tid];
> > 		$fault = (struct kvm_page_fault *)@__fault[tid];
> > 	        $root = (struct kvm_mmu_page *)arg1;
> > 
> > 		printf("%s tid[%u] pid[%u] ITER @ rip %lx (%lu hits), gpa = %lx (%lx), hva = %lx, pfn = %lx, tdp_mmu = %u, role = %x, count = %d\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> > 		       $fault->addr, arg3 << 12, $fault->hva, $fault->pfn,
> >                        $root->tdp_mmu_page, $root->role, $root->root_count);
> > 	} else {
> > 		@__vcpu[tid] = 0;
> > 		@__fault[tid] = 0;
> > 	}
> > }
> > 
> > kprobe:make_mmio_spte
> > {
> >         if (@__fault[tid] != 0) {
> > 		$fault = (struct kvm_page_fault *)@__fault[tid];
> > 
> > 		printf("%s tid[%u] pid[%u] MMIO @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> > 		       $fault->addr, $fault->hva, $fault->pfn);
> > 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> > 		printf("%s tid[%u] pid[%u] MMIO_ERROR @ rip %lx (%lu hits)\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
> > 	}
> > }
> > 
> > kprobe:make_spte
> > {
> >         if (@__fault[tid] != 0) {
> > 		$fault = (struct kvm_page_fault *)@__fault[tid];
> > 
> > 		printf("%s tid[%u] pid[%u] SPTE @ rip %lx (%lu hits), gpa = %lx, hva = %lx, pfn = %lx\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid, @last_rip[tid], @same[tid],
> > 		       $fault->addr, $fault->hva, $fault->pfn);
> > 	} else if (@same[tid] > 1000 && @same[tid] < 100000) {
> > 		printf("%s tid[%u] pid[%u] SPTE_ERROR @ rip %lx (%lu hits)\n",
> > 		       strftime("%H:%M:%S:%f", nsecs), tid, pid,  @last_rip[tid], @same[tid]);
> > 	}
> > }
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-23 22:32                                                             ` Eric Wheeler
@ 2023-08-23 23:21                                                               ` Sean Christopherson
  2023-08-24  0:30                                                                 ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-23 23:21 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Wed, Aug 23, 2023, Eric Wheeler wrote:
> 22:23:31:481644 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1196 hits), gpa = cf7b000, hva = 7f6d24f7b000, pfn = 1e35fc6
> 22:23:31:481645 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1196 hits), gpa = cf7b000, hva = 7f6d24f7b000, pfn = 1e35fc6, ret = 4
> 22:23:31:481650 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000, hva = 7f6d24f7c000, flags = 0 : MMU seq = 4724e, in-prog = 0, start = 7f6d24f7b000, end = 7f6d24f7c000
> 22:23:31:481653 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000 (cf7c000), hva = 7f6d24f7c000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
> 22:23:31:481658 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000 (cf7c000), hva = 7f6d24f7c000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481660 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1197 hits), gpa = cf7c000, hva = 7f6d24f7c000 (7f6d24f7c000), flags = 0, pfn = 1ee7439, ret = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
> 22:23:31:481665 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1198 hits), gpa = cf7c000, hva = 7f6d24f7c000, flags = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
> 22:23:31:481666 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1198 hits), gpa = cf7c000, hva = 7f6d24f7c000 (7f6d24f7c000), flags = 0, pfn = 1ee7439, ret = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
> 22:23:31:481667 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000, hva = 7f6d24f7c000, pfn = 1ee7439
> 22:23:31:481668 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000 (cf7c000), hva = 7f6d24f7c000, pfn = 1ee7439, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481669 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000, hva = 7f6d24f7c000, pfn = 1ee7439
> 22:23:31:481670 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1199 hits), gpa = cf7c000, hva = 7f6d24f7c000, pfn = 1ee7439, ret = 4
> 22:23:31:481673 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000, hva = 7f6d24f7d000, flags = 0 : MMU seq = 4724f, in-prog = 0, start = 7f6d24f7c000, end = 7f6d24f7d000
> 22:23:31:481676 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000 (cf7d000), hva = 7f6d24f7d000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
> 22:23:31:481677 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000 (cf7d000), hva = 7f6d24f7d000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481678 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1200 hits), gpa = cf7d000, hva = 7f6d24f7d000 (7f6d24f7d000), flags = 0, pfn = 4087d5, ret = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
> 22:23:31:481682 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1201 hits), gpa = cf7d000, hva = 7f6d24f7d000, flags = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
> 22:23:31:481683 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1201 hits), gpa = cf7d000, hva = 7f6d24f7d000 (7f6d24f7d000), flags = 0, pfn = 4087d5, ret = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
> 22:23:31:481684 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000, hva = 7f6d24f7d000, pfn = 4087d5
> 22:23:31:481684 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000 (cf7d000), hva = 7f6d24f7d000, pfn = 4087d5, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481685 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000, hva = 7f6d24f7d000, pfn = 4087d5
> 22:23:31:481686 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1202 hits), gpa = cf7d000, hva = 7f6d24f7d000, pfn = 4087d5, ret = 4
> 22:23:31:481689 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000, hva = 7f6d24f7e000, flags = 0 : MMU seq = 47250, in-prog = 0, start = 7f6d24f7d000, end = 7f6d24f7e000
> 22:23:31:481691 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000 (cf7e000), hva = 7f6d24f7e000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
> 22:23:31:481692 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000 (cf7e000), hva = 7f6d24f7e000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481694 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1203 hits), gpa = cf7e000, hva = 7f6d24f7e000 (7f6d24f7e000), flags = 0, pfn = 51c5ee, ret = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
> 22:23:31:481697 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1204 hits), gpa = cf7e000, hva = 7f6d24f7e000, flags = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
> 22:23:31:481698 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1204 hits), gpa = cf7e000, hva = 7f6d24f7e000 (7f6d24f7e000), flags = 0, pfn = 51c5ee, ret = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
> 22:23:31:481699 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000, hva = 7f6d24f7e000, pfn = 51c5ee
> 22:23:31:481699 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000 (cf7e000), hva = 7f6d24f7e000, pfn = 51c5ee, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481700 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000, hva = 7f6d24f7e000, pfn = 51c5ee
> 22:23:31:481701 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1205 hits), gpa = cf7e000, hva = 7f6d24f7e000, pfn = 51c5ee, ret = 4
> 22:23:31:481703 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000, hva = 7f6d24f7f000, flags = 0 : MMU seq = 47251, in-prog = 0, start = 7f6d24f7e000, end = 7f6d24f7f000
> 22:23:31:481706 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 0, tdp_mmu = 1, role = 3784, count = 2
> 22:23:31:481707 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 0, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481708 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1206 hits), gpa = cf7f000, hva = 7f6d24f7f000 (7f6d24f7f000), flags = 0, pfn = 1ee50ee, ret = 0 : MMU seq = 47252, in-prog = 0, start = 7f6d24f7f000, end = 7f6d24f80000
> 22:23:31:481712 tid[142943] pid[142917] FAULTIN @ rip ffffffffa43ce877 (1207 hits), gpa = cf7f000, hva = 7f6d24f7f000, flags = 0 : MMU seq = 47252, in-prog = 0, start = 7f6d24f7f000, end = 7f6d24f80000
> 22:23:31:481712 tid[142943] pid[142917] FAULTIN_RET @ rip ffffffffa43ce877 (1207 hits), gpa = cf7f000, hva = 7f6d24f7f000 (7f6d24f7f000), flags = 0, pfn = 1ee50ee, ret = 0 : MMU seq = 47252, in-prog = 0, start = 7f6d24f7f000, end = 7f6d24f80000
> 22:23:31:481714 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
> 22:23:31:481714 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 1ee50ee, tdp_mmu = 1, role = 3784, count = 1
> 22:23:31:481715 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
> 22:23:31:481716 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee, ret = 4

This vCPU is making forward progress; it just happens to be taking lots of faults
on a single RIP.  MAP_RET's "ret = 4" means KVM did indeed "fix" the fault, and
the faulting GPA/HVA is changing on every fault.  Best guess is that the guest
is zeroing a hugepage, but the guest's 2MiB page is mapped with 4KiB EPT entries,
i.e. the vCPU is doing REP STOS on a 2MiB region.
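
(Back-of-the-envelope check, not something the log states directly: a 2MiB region
backed by 4KiB EPT entries covers 2MiB / 4KiB = 512 pages, and the MAP_RET lines
above advance the GPA by 0x1000, i.e. one 4KiB page, per fixed fault, so a hit
count in the low thousands at a single RIP is roughly what sweeping one 2MiB
hugepage would look like.)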

> > 21:25:50:282711 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92234 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282714 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92235 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282720 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92237 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282723 tid[3484173] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (92238 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282726 tid[3484173] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (92239 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000

...

> > 21:25:50:282354 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90073 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282418 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90087 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282475 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90100 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282507 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90107 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282524 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90111 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282528 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90112 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282557 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90118 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > 21:25:50:282571 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90121 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000

These vCPUs (which belong to the same VM) appear to be well and truly stuck.  The
fact that you got prints from MAP, MAP_RET, ITER, and SPTE for an unrelated (and
not stuck) vCPU is serendipitous, as it all but guarantees that the trace is
"good", i.e. that MAP prints aren't missing because the bpf program is bad.

Since the mmu_notifier info is stable (though the seq is still *insanely* high),
assuming there's no kernel memory corruption, that means that KVM is bailing
because is_page_fault_stale() returns true.  Based on v5.15 being the last known
good kernel for you, that places the blame squarely on commit
a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update").
Note, that commit had an unrelated bug that was fixed by 18c841e1f411 ("KVM: x86: Retry
page fault if MMU reload is pending and root has no sp"), but all flavors of v6.1
have said fix, and that bug caused a crash, not stuck vCPUs.

The "MMU seq = 8002dc25" value still gives me pause.  It's one hell of a coincidence
that all stuck vCPUs have had a sequence counter of 0x800xxxxx.

<time goes by as I keep staring>

Fudge (that's not actually what I said).  *sigh*

Not a coincidence, at all.  The bug is that, in v6.1, is_page_fault_stale() takes
the local @mmu_seq snapshot as an int, whereas the per-VM count is stored as an
unsigned long.  When the sequence sets bit 31, the local @mmu_seq value becomes
a signed *negative* value, and then when that gets passed to mmu_invalidate_retry_hva(),
which correctly takes an unsigned long, the negative value gets sign-extended and
so the comparison ends up being

	if (0x8002dc25 != 0xffffffff8002dc25)

and KVM thinks the sequence count is stale.  I missed it for so long because I
was stupidly looking mostly at upstream code (see below), and because of the subtle
sign-extension behavior (I was mostly on the lookout for a straight truncation
bug where bits[63:32] got dropped).
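
A standalone sketch of the round trip (hypothetical names, not the actual KVM
code), just to show how the int snapshot goes bad once bit 31 is set:

#include <stdio.h>

/* Stand-in for mmu_invalidate_retry_hva(): "stale" if the counts differ. */
static int seq_is_stale(unsigned long cur_seq, unsigned long snap_seq)
{
	return cur_seq != snap_seq;
}

int main(void)
{
	unsigned long mmu_seq = 0x8002dc25ul;	/* bit 31 set */
	int snapshot = mmu_seq;			/* the buggy "int" local */

	/*
	 * snapshot is negative, so it sign-extends to 0xffffffff8002dc25 when
	 * converted back to unsigned long, and the comparison never passes.
	 */
	printf("stale = %d\n", seq_is_stale(mmu_seq, snapshot));	/* prints 1 */
	return 0;
}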

I suspect others haven't hit this issue because no one else is generating anywhere
near the same number of mmu_notifier invalidations, and/or live migrates VMs more
regularly (which effectively resets the sequence count).

The real kicker to all this is that the bug was accidentally fixed in v6.3 by
commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()"),
as that refactoring correctly stored the "local" mmu_seq as an unsigned long.

I'll post the below as a proper patch for inclusion in stable kernels.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 230108a90cf3..beca03556379 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  * root was invalidated by a memslot update or a relevant mmu_notifier fired.
  */
 static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
-                               struct kvm_page_fault *fault, int mmu_seq)
+                               struct kvm_page_fault *fault,
+                               unsigned long mmu_seq)
 {
        struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
 
P.S. FWIW, it's probably worth taking a peek at your NUMA setup and/or KSM settings.
2 billion invalidations is still quite insane, even for a long-lived VM.  E.g.
we (Google) disable NUMA balancing and instead rely on other parts of the stack
to hit our SLOs for NUMA locality.  That certainly has its own challenges, and
might not be viable for your environment, but NUMA balancing is far from a free
lunch for VMs; NUMA balancing is a _lot_ more costly when KVM is on the receiving
end due to the way mmu_notifier invalidations work.   And for KSM, I personally
think KSM is a terrible tradeoff and should never be enabled.  It saves memory at
the cost of CPU cycles, guest performance, and guest security.


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-23 23:21                                                               ` Sean Christopherson
@ 2023-08-24  0:30                                                                 ` Eric Wheeler
  2023-08-24  0:52                                                                   ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Wheeler @ 2023-08-24  0:30 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm

On Wed, 23 Aug 2023, Sean Christopherson wrote:
> On Wed, Aug 23, 2023, Eric Wheeler wrote:

...

> > 22:23:31:481714 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
> > 22:23:31:481714 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 1ee50ee, tdp_mmu = 1, role = 3784, count = 1
> > 22:23:31:481715 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
> > 22:23:31:481716 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee, ret = 4
> 
> This vCPU is making forward progress, it just happens to be taking lots of faults
> on a single RIP.  MAP_RET's "ret = 4" means KVM did indeed "fix" the fault, and
> the faulting GPA/HVA is changing on every fault.  Best guess is that the guest
> is zeroing a hugepage, but the guest's 2MiB page is mapped with 4KiB EPT entries,
> i.e. the vCPU is doing REP STOS on a 2MiB region.

...

> > > 21:25:50:282726 tid[3484173] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (92239 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000

...

> > > 21:25:50:282354 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90073 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > > 21:25:50:282571 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90121 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000

...

> These vCPUs (which belong to the same VM) appear to be well and truly stuck.  The
> fact that you got prints from MAP, MAP_RET, ITER, and SPTE for an unrelated (and
> not stuck) vCPU is serendipitous, as it all but guarantees that the trace is
> "good", i.e. that MAP prints aren't missing because the bpf program is bad.
> 
> Since the mmu_notifier info is stable (though the seq is still *insanely* high),
> assuming there's no kernel memory corruption, that means that KVM is bailing
> because is_page_fault_stale() returns true.  Based on v5.15 being the last known
> good kernel for you,

I should note that v5.15 was speculative information that may not be
100% correct.  However, the occurrence of failures <= 5.15 is minuscule
compared to >5.15, as almost all occurrences are in 6.1.x (which could be
biased by the number of 6.1.x deployments).

It is entirely possible that the <= 5.15 issues are false-positives, but
we don't have BTF on anything <6.1.30, so no kprobe traces.

> that places the blame squarely on commit
> a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update").
> Note, that commit had an unrelated bug that was fixed by 18c841e1f411 ("KVM: x86: Retry
> page fault if MMU reload is pending and root has no sp"), but all flavors of v6.1
> have said fix, and that bug caused a crash, not stuck vCPUs.
> 
> The "MMU seq = 8002dc25" value still gives me pause.  It's one hell of a coincidence
> that all stuck vCPUs have had a sequence counter of 0x800xxxxx.
> 
> <time goes by as I keep staring>
> 
> Fudge (that's not actually what I said).  *sigh*
> 
> Not a coincidence, at all.  The bug is that, in v6.1, is_page_fault_stale() takes
> the local @mmu_seq snapshot as an int, whereas the per-VM count is stored as an
> unsigned long.

I'm surprised that there were no compiler warnings about signedness or
type precision.  What would have prevented such a compiler warning?

> When the sequence sets bit 31, the local @mmu_seq value becomes
> a signed *negative* value, and then when that gets passed to mmu_invalidate_retry_hva(),
> which correctly takes an unsigned long, the negative value gets sign-extended and
> so the comparison ends up being
> 
> 	if (0x8002dc25 != 0xffffffff8002dc25)
>
> and KVM thinks the sequence count is stale.  I missed it for so long because I
> was stupidly looking mostly at upstream code (see below), and because of the subtle
> sign-extension behavior (I was mostly on the lookout for a straight truncation
> bug where bits[63:32] got dropped).
> 
> I suspect others haven't hit this issue because no one else is generating anywhere
> near the same number of mmu_notifier invalidations, and/or live migrates VMs more
> regularly (which effectively resets the sequence count).
> 
> The real kicker to all this is that the bug was accidentally fixed in v6.3 by
> commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()"),
> as that refactoring correctly stored the "local" mmu_seq as an unsigned long.
> 
> I'll post the below as a proper patch for inclusion in stable kernels.

Awesome, and well done.  Can you think of a "simple" patch for the
6.1-series that would be live-patch safe?

-Eric

> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 230108a90cf3..beca03556379 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   * root was invalidated by a memslot update or a relevant mmu_notifier fired.
>   */
>  static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> -                               struct kvm_page_fault *fault, int mmu_seq)
> +                               struct kvm_page_fault *fault,
> +                               unsigned long mmu_seq)
>  {
>         struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
>  
> P.S. FWIW, it's probably worth taking a peek at your NUMA setup and/or KSM settings.
> 2 billion invalidations is still quite insane, even for a long-lived VM.  E.g.
> we (Google) disable NUMA balancing and instead rely on other parts of the stack
> to hit our SLOs for NUMA locality.  That certainly has its own challenges, and
> might not be viable for your environment, but NUMA balancing is far from a free
> lunch for VMs; NUMA balancing is a _lot_ more costly when KVM is on the receiving
> end due to the way mmu_notifier invalidations work.   And for KSM, I personally
> think KSM is a terrible tradeoff and should never be enabled.  It saves memory at
> the cost of CPU cycles, guest performance, and guest security.



--
Eric Wheeler



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-24  0:30                                                                 ` Eric Wheeler
@ 2023-08-24  0:52                                                                   ` Sean Christopherson
  2023-08-24 23:51                                                                     ` Eric Wheeler
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Christopherson @ 2023-08-24  0:52 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Amaan Cheval, brak, kvm

On Wed, Aug 23, 2023, Eric Wheeler wrote:
> On Wed, 23 Aug 2023, Sean Christopherson wrote:
> > Not a coincidence, at all.  The bug is that, in v6.1, is_page_fault_stale() takes
> > the local @mmu_seq snapshot as an int, whereas the per-VM count is stored as an
> > unsigned long.
> 
> I'm surprised that there were no compiler warnings about signedness or
> type precision.  What would have prevented such a compiler warning?

-Wconversion can detect this, but it detects freaking *everything*, i.e. its
signal to noise ratio is straight up awful.  It's so noisy in fact that it's not
even in the kernel's W=1 build, it's pushed down all the way to W=3.  W=1 is
basically "you'll get some noise, but it may find useful stuff".  W=3 is essentially
"don't bother wading through the warnings unless you're masochistic".

E.g. turning it on leads to:

linux/include/linux/kvm_host.h:891:60: error:
conversion to ‘long unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
  891 |                           (atomic_read(&kvm->online_vcpus) - 1))
      |                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~

which is completely asinine (suppressing the warning would require declaring the
above literal as 1u).
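
As a toy illustration of why the noise and the real bugs look the same to the
compiler (hypothetical code, not from the kernel), gcc -Wconversion flags both
of the functions below, even though only the second one is the kind of conversion
that actually bit us here:

/*
 * -Wsign-conversion: conversion to 'long unsigned int' from 'int' may change
 * the sign of the result.  Benign; the kvm_host.h warning above is this flavor.
 */
static unsigned long last_vcpu_idx(int online_vcpus)
{
	return online_vcpus - 1;
}

/*
 * -Wconversion: conversion from 'long unsigned int' to 'int' may change value.
 * This is exactly the shape of the is_page_fault_stale() bug.
 */
static int snapshot_seq(unsigned long seq)
{
	return seq;
}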

FWIW, I would love to be able to prevent these types of bugs as this isn't the
first implicit conversion bug that has hit KVM x86[*], but the signal to noise
ratio is just so, so bad.

[*] commit d5aaad6f8342 ("KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds")

> > When the sequence sets bit 31, the local @mmu_seq value becomes
> > a signed *negative* value, and then when that gets passed to mmu_invalidate_retry_hva(),
> > which correctly takes an unsigned long, the negative value gets sign-extended and
> > so the comparison ends up being
> > 
> > 	if (0x8002dc25 != 0xffffffff8002dc25)
> >
> > and KVM thinks the sequence count is stale.  I missed it for so long because I
> > was stupidly looking mostly at upstream code (see below), and because of the subtle
> > sign-extension behavior (I was mostly on the lookout for a straight truncation
> > bug where bits[63:32] got dropped).
> > 
> > I suspect others haven't hit this issue because no one else is generating anywhere
> > near the same number of mmu_notifier invalidations, and/or live migrates VMs more
> > regularly (which effectively resets the sequence count).
> > 
> > The real kicker to all this is that the bug was accidentally fixed in v6.3 by
> > commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()"),
> > as that refactoring correctly stored the "local" mmu_seq as an unsigned long.
> > 
> > I'll post the below as a proper patch for inclusion in stable kernels.
> 
> Awesome, and well done.  Can you think of a "simple" patch for the
> 6.1-series that would be live-patch safe?

This is what I'm going to post for 6.1, it's as safe and simple a patch as can
be.  The only potential hiccup for live-patching is that it's all but guaranteed
to be inlined, but the scope creep should be limited to one-level up, e.g. to
direct_page_fault().

Author: Sean Christopherson <seanjc@google.com>
Date:   Wed Aug 23 16:28:12 2023 -0700

    KVM: x86/mmu: Fix a sign-extension bug with mmu_seq that hangs vCPUs
    
    Take the vCPU's mmu_seq snapshot as an "unsigned long" instead of an "int"
    when checking to see if a page fault is stale, as the sequence count is
    stored as an "unsigned long" everywhere else in KVM.  This fixes a bug
    where KVM will effectively hang vCPUs due to always thinking page faults
    are stale, which results in KVM refusing to "fix" faults.

    mmu_invalidate_seq (née mmu_notifier_seq) is a sequence counter used when
    KVM is handling page faults to detect if a userspace mapping relevant to
    the guest was invalidated between snapshotting the counter and acquiring
    mmu_lock, to ensure that the host pfn that KVM retrieved is still fresh.
    If KVM sees that the counter has changed, KVM resumes the guest without
    fixing the fault.
    
    What _should_ happen is that the source of the mmu_notifier invalidations
    eventually goes away, mmu_invalidate_seq will become stable, and KVM can
    once again fix guest page fault(s).
    
    But for a long-lived VM and/or a VM that the host just doesn't particularly
    like, it's possible for a VM to be on the receiving end of 2 billion (with
    a B) mmu_notifier invalidations.  When that happens, bit 31 will be set in
    mmu_invalidate_seq.  This causes the value to be turned into a 32-bit
    negative value when implicitly cast to an "int" by is_page_fault_stale(),
    and then sign-extended into a 64-bit unsigned when the signed "int" is
    implicitly cast back to an "unsigned long" on the call to
    mmu_invalidate_retry_hva().
    
    As a result of the casting and sign-extension, given a sequence counter of
    e.g. 0x8002dc25, mmu_invalidate_retry_hva() ends up doing
    
            if (0x8002dc25 != 0xffffffff8002dc25)
    
    and signals that the page fault is stale and needs to be retried even
    though the sequence counter is stable, and KVM effectively hangs any vCPU
    that takes a page fault (EPT violation or #NPF when TDP is enabled).
    
    Note, upstream commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq
    in kvm_faultin_pfn()") unknowingly fixed the bug in v6.3 when refactoring
    how KVM tracks the sequence counter snapshot.
    
    Reported-by: Brian Rak <brak@vultr.com>
    Reported-by: Amaan Cheval <amaan.cheval@gmail.com>
    Reported-by: Eric Wheeler <kvm@lists.ewheeler.net>
    Closes: https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameservers.com
    Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
    Signed-off-by: Sean Christopherson <seanjc@google.com>

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 230108a90cf3..beca03556379 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  * root was invalidated by a memslot update or a relevant mmu_notifier fired.
  */
 static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
-                               struct kvm_page_fault *fault, int mmu_seq)
+                               struct kvm_page_fault *fault,
+                               unsigned long mmu_seq)
 {
        struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
 


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: Deadlock due to EPT_VIOLATION
  2023-08-24  0:52                                                                   ` Sean Christopherson
@ 2023-08-24 23:51                                                                     ` Eric Wheeler
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Wheeler @ 2023-08-24 23:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Amaan Cheval, brak, kvm


On Wed, 23 Aug 2023, Sean Christopherson wrote:
> On Wed, Aug 23, 2023, Eric Wheeler wrote:
> > On Wed, 23 Aug 2023, Sean Christopherson wrote:
> > > Not a coincidence, at all.  The bug is that, in v6.1, is_page_fault_stale() takes
> > > the local @mmu_seq snapshot as an int, whereas the per-VM count is stored as an
> > > unsigned long.
> > 
> > I'm surprised that there were no compiler warnings about signedness or
> > type precision.  What would have prevented such a compiler warning?
> 
> -Wconversion can detect this, but it detects freaking *everything*, i.e. its
> signal to noise ratio is straight up awful.  It's so noisy in fact that it's not
> even in the kernel's W=1 build, it's pushed down all the way to W=3.  W=1 is
> basically "you'll get some noise, but it may find useful stuff.  W=3 is essentially
> "don't bother wading through the warnings unless you're masochistic".
> 
> E.g. turning it on leads to:
> 
> linux/include/linux/kvm_host.h:891:60: error:
> conversion to ‘long unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
>   891 |                           (atomic_read(&kvm->online_vcpus) - 1))
>       |                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
> 
> which is completely asinine (suppressing the warning would require declaring the
> above literal as 1u).

I can see that.  I suppose we'll never see the kernel compile with -Wall 
-Werror!

 
> FWIW, I would love to be able to prevent these types of bugs as this isn't the
> first implicit conversion bug that has hit KVM x86[*], but the signal to noise
> ratio is just so, so bad.
> 
> [*] commit d5aaad6f8342 ("KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds")
> 
> > > When the sequence sets bit 31, the local @mmu_seq value becomes
> > > a signed *negative* value, and then when that gets passed to mmu_invalidate_retry_hva(),
> > > which correctly takes an unsigned long, the negative value gets sign-extended and
> > > so the comparison ends up being
> > > 
> > > 	if (0x8002dc25 != 0xffffffff8002dc25)
> > >
> > > and KVM thinks the sequence count is stale.  I missed it for so long because I
> > > was stupidly looking mostly at upstream code (see below), and because of the subtle
> > > sign-extension behavior (I was mostly on the lookout for a straight truncation
> > > bug where bits[63:32] got dropped).
> > > 
> > > I suspect others haven't hit this issue because no one else is generating anywhere
> > > near the same number of mmu_notifier invalidations, and/or live migrates VMs more
> > > regularly (which effectively resets the sequence count).
> > > 
> > > The real kicker to all this is that the bug was accidentally fixed in v6.3 by
> > > commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()"),
> > > as that refactoring correctly stored the "local" mmu_seq as an unsigned long.
> > > 
> > > I'll post the below as a proper patch for inclusion in stable kernels.
> > 
> > Awesome, and well done.  Can you think of a "simple" patch for the
> > 6.1-series that would be live-patch safe?
> 
> This is what I'm going to post for 6.1, it's as safe and simple a patch as can
> be.  The only potential hiccup for live-patching is that it's all but guaranteed
> to be inlined, but the scope creep should be limited to one-level up, e.g. to
> direct_page_fault().
> 
> Author: Sean Christopherson <seanjc@google.com>
> Date:   Wed Aug 23 16:28:12 2023 -0700
> 
>     KVM: x86/mmu: Fix a sign-extension bug with mmu_seq that hangs vCPUs
>     
>     Take the vCPU's mmu_seq snapshot as an "unsigned long" instead of an "int"
>     when checking to see if a page fault is stale, as the sequence count is
>     stored as an "unsigned long" everywhere else in KVM.  This fixes a bug
>     where KVM will effectively hang vCPUs due to always thinking page faults
>     are stale, which results in KVM refusing to "fix" faults.
>     
>     mmu_invalidate_seq (née mmu_notifier_seq) is a sequence counter used when
>     KVM is handling page faults to detect if a userspace mapping relevant to
>     the guest was invalidated between snapshotting the counter and acquiring
>     mmu_lock, to ensure that the host pfn that KVM retrieved is still fresh.
>     If KVM sees that the counter has changed, KVM resumes the guest without
>     fixing the fault.
>     
>     What _should_ happen is that the source of the mmu_notifier invalidations
>     eventually goes away, mmu_invalidate_seq will become stable, and KVM can
>     once again fix guest page fault(s).
>     
>     But for a long-lived VM and/or a VM that the host just doesn't particularly
>     like, it's possible for a VM to be on the receiving end of 2 billion (with
>     a B) mmu_notifier invalidations.  When that happens, bit 31 will be set in
>     mmu_invalidate_seq.  This causes the value to be turned into a 32-bit
>     negative value when implicitly cast to an "int" by is_page_fault_stale(),
>     and then sign-extended into a 64-bit unsigned when the signed "int" is
>     implicitly cast back to an "unsigned long" on the call to
>     mmu_invalidate_retry_hva().
>     
>     As a result of the casting and sign-extension, given a sequence counter of
>     e.g. 0x8002dc25, mmu_invalidate_retry_hva() ends up doing
>     
>             if (0x8002dc25 != 0xffffffff8002dc25)
>     
>     and signals that the page fault is stale and needs to be retried even
>     though the sequence counter is stable, and KVM effectively hangs any vCPU
>     that takes a page fault (EPT violation or #NPF when TDP is enabled).
>     
>     Note, upstream commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq
>     in kvm_faultin_pfn()") unknowingly fixed the bug in v6.3 when refactoring
>     how KVM tracks the sequence counter snapshot.
>     
>     Reported-by: Brian Rak <brak@vultr.com>
>     Reported-by: Amaan Cheval <amaan.cheval@gmail.com>
>     Reported-by: Eric Wheeler <kvm@lists.ewheeler.net>
>     Closes: https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameservers.com

Thanks again for all your help on this, I enjoyed working on it with you.

-Eric

>     Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
>     Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 230108a90cf3..beca03556379 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   * root was invalidated by a memslot update or a relevant mmu_notifier fired.
>   */
>  static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> -                               struct kvm_page_fault *fault, int mmu_seq)
> +                               struct kvm_page_fault *fault,
> +                               unsigned long mmu_seq)
>  {
>         struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
>  
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2023-08-24 23:52 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-23 14:02 Deadlock due to EPT_VIOLATION Brian Rak
2023-05-23 16:22 ` Sean Christopherson
2023-05-24 13:39   ` Brian Rak
2023-05-26 16:59     ` Brian Rak
2023-05-26 21:02       ` Sean Christopherson
2023-05-30 17:35         ` Brian Rak
2023-05-30 18:36           ` Sean Christopherson
2023-05-31 17:40             ` Brian Rak
2023-07-21 14:34             ` Amaan Cheval
2023-07-21 17:37               ` Sean Christopherson
2023-07-24 12:08                 ` Amaan Cheval
2023-07-25 17:30                   ` Sean Christopherson
2023-08-02 14:21                     ` Amaan Cheval
2023-08-02 15:34                       ` Sean Christopherson
2023-08-02 16:45                         ` Amaan Cheval
2023-08-02 17:52                           ` Sean Christopherson
2023-08-08 15:34                             ` Amaan Cheval
2023-08-08 17:07                               ` Sean Christopherson
2023-08-10  0:48                                 ` Eric Wheeler
2023-08-10  1:27                                   ` Eric Wheeler
2023-08-10 23:58                                     ` Sean Christopherson
2023-08-11 12:37                                       ` Amaan Cheval
2023-08-11 18:02                                         ` Sean Christopherson
2023-08-12  0:50                                           ` Eric Wheeler
2023-08-14 17:29                                             ` Sean Christopherson
2023-08-15  0:30                                 ` Eric Wheeler
2023-08-15 16:10                                   ` Sean Christopherson
2023-08-16 23:54                                     ` Eric Wheeler
2023-08-17 18:21                                       ` Sean Christopherson
2023-08-18  0:55                                         ` Eric Wheeler
2023-08-18 14:33                                           ` Sean Christopherson
2023-08-18 23:06                                             ` Eric Wheeler
2023-08-21 20:27                                               ` Eric Wheeler
2023-08-21 23:51                                                 ` Sean Christopherson
2023-08-22  0:11                                                   ` Sean Christopherson
2023-08-22  1:10                                                   ` Eric Wheeler
2023-08-22 15:11                                                     ` Sean Christopherson
2023-08-22 21:23                                                       ` Eric Wheeler
2023-08-22 21:32                                                         ` Sean Christopherson
2023-08-23  0:39                                                       ` Eric Wheeler
2023-08-23 17:54                                                         ` Sean Christopherson
2023-08-23 19:44                                                           ` Eric Wheeler
2023-08-23 22:12                                                           ` Eric Wheeler
2023-08-23 22:32                                                             ` Eric Wheeler
2023-08-23 23:21                                                               ` Sean Christopherson
2023-08-24  0:30                                                                 ` Eric Wheeler
2023-08-24  0:52                                                                   ` Sean Christopherson
2023-08-24 23:51                                                                     ` Eric Wheeler
