* RFC: Split EPT huge pages in advance of dirty logging
From: Zhoujian (jay) @ 2020-02-18 13:13 UTC
  To: kvm, qemu-devel
  Cc: pbonzini, peterx, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C), Zhoujian (jay)

Hi all,

We found that the guest will occasionally soft-lockup when live migrating a 60 vCPU,
512 GiB, huge-page-backed and memory-intensive VM. The reason is clear: almost all of
the vCPUs are waiting for the KVM MMU spin lock to create 4K SPTEs once the huge pages
are write protected. This phenomenon is also described in this patch set:
https://patchwork.kernel.org/cover/11163459/
which aims to handle page faults in parallel more efficiently.

Our idea is to use the migration thread to touch all of the guest memory at 4K
granularity before enabling dirty logging. To be more specific, we first split all the
PDPE_LEVEL SPTEs into DIRECTORY_LEVEL SPTEs, and then split all the
DIRECTORY_LEVEL SPTEs into PAGE_TABLE_LEVEL SPTEs.

However, there is a side effect: it takes more time to clear the D-bits of the last-level
SPTEs when enabling dirty logging, which is done while holding the QEMU BQL and the
KVM mmu_lock simultaneously. To solve this issue, the idea of enabling dirty logging
gradually in small chunks has been proposed as well; here is the link for v1:
https://patchwork.kernel.org/patch/11388227/

On an Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz host, some tests have been done with
a 60 vCPU, 256 GiB (60U256G) VM that has NUMA balancing enabled, using the demo we
wrote. We start a process with 60 threads that randomly touch most of the memory in
the VM, and meanwhile measure the function execution time inside the VM during live
migration. change_prot_numa() is chosen since it does not release the CPU until its
work is finished. Here are the numbers:

                    Original                   The demo we wrote
[1]                 > 9s (most of the time)    ~5ms
Hypervisor cost     > 90%                      ~3%

[1]: execution time of the change_prot_numa() function

If the time in [1] is bigger than 20s, it will result in a soft lockup.
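
For reference, here is a minimal sketch (not the actual demo) of the kind of guest
workload described above: a fixed number of threads randomly dirtying 4K pages of one
large anonymous allocation. The thread count, buffer size and access pattern are
illustrative assumptions.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS  60
#define BUF_SIZE  (200UL << 30)   /* touch most of a 256G guest */
#define PAGE_SZ   4096UL

static uint8_t *buf;

/* Each thread dirties one random 4K page per iteration, forever. */
static void *toucher(void *arg)
{
    unsigned int seed = (unsigned int)(uintptr_t)arg;

    for (;;) {
        size_t page = (size_t)rand_r(&seed) % (BUF_SIZE / PAGE_SZ);
        buf[page * PAGE_SZ] = (uint8_t)seed;
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    long i;

    buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, toucher, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}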

I know it is a little hacky to do so, but my question is: is it worth trying to split
EPT huge pages in advance of dirty logging?

Any advice will be appreciated, thanks.

Regards,
Jay Zhou

* Re: RFC: Split EPT huge pages in advance of dirty logging
From: Peter Xu @ 2020-02-18 17:43 UTC
  To: Zhoujian (jay)
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C)

On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> Hi all,
> 
> We found that the guest will be soft-lockup occasionally when live migrating a 60 vCPU,
> 512GiB huge page and memory sensitive VM. The reason is clear, almost all of the vCPUs
> are waiting for the KVM MMU spin-lock to create 4K SPTEs when the huge pages are
> write protected. This phenomenon is also described in this patch set:
> https://patchwork.kernel.org/cover/11163459/
> which aims to handle page faults in parallel more efficiently.
> 
> Our idea is to use the migration thread to touch all of the guest memory in the
> granularity of 4K before enabling dirty logging. To be more specific, we split all the
> PDPE_LEVEL SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then split all
> the DIRECTORY_LEVEL SPTEs into PAGE_TABLE_LEVEL SPTEs as the following step.

IIUC, QEMU will prefer to use huge pages for all the anonymous
ramblocks (please refer to ram_block_add):

        qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);

Another alternative I can think of is to add an extra parameter to
QEMU to explicitly disable huge pages (so that could even be
MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However, that would also
drag down the performance for the whole lifecycle of the VM.  A 3rd
option is to make a QMP command to dynamically turn huge pages on/off
for ramblocks globally.  I haven't thought deeply about any of them,
but they all seem doable.
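
To make the alternative concrete, it essentially boils down to a plain madvise(2)
hint on the guest RAM mapping, analogous to what ram_block_add() already does with
QEMU_MADV_HUGEPAGE. A minimal illustrative helper (not QEMU code) might look like:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Advise the kernel not to back this region with transparent huge pages. */
static int disable_thp_for_region(void *host, size_t len)
{
    return madvise(host, len, MADV_NOHUGEPAGE);
}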

Thanks,

-- 
Peter Xu


* RE: RFC: Split EPT huge pages in advance of dirty logging
From: Zhoujian (jay) @ 2020-02-19 13:19 UTC
  To: Peter Xu
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C)

Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, February 19, 2020 1:43 AM
> To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > Hi all,
> >
> > We found that the guest will be soft-lockup occasionally when live
> > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > reason is clear, almost all of the vCPUs are waiting for the KVM MMU
> > spin-lock to create 4K SPTEs when the huge pages are write protected. This
> phenomenon is also described in this patch set:
> > https://patchwork.kernel.org/cover/11163459/
> > which aims to handle page faults in parallel more efficiently.
> >
> > Our idea is to use the migration thread to touch all of the guest
> > memory in the granularity of 4K before enabling dirty logging. To be
> > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs into
> PAGE_TABLE_LEVEL SPTEs as the following step.
> 
> IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> (please refer to ram_block_add):
> 
>         qemu_madvise(new_block->host, new_block->max_length,
> QEMU_MADV_HUGEPAGE);

Yes, you're right

> 
> Another alternative I can think of is to add an extra parameter to QEMU to
> explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> instead of MADV_HUGEPAGE).  However that should also drag down the
> performance for the whole lifecycle of the VM.  

From the performance point of view, it is better to keep the huge pages
when the VM is not in the live migration state.

> A 3rd option is to make a QMP
> command to dynamically turn huge pages on/off for ramblocks globally.

We're searching for a dynamic method too.
We plan to add two new flags for each memory slot, say
KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
through the KVM_SET_USER_MEMORY_REGION ioctl.

mapping_level(), which is called by tdp_page_fault() on the kernel side,
will return PT_DIRECTORY_LEVEL if the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is
set, and PT_PAGE_TABLE_LEVEL if the
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
 
The key steps to split the huge pages in advance of enabling dirty logging
are as follows (a rough userspace sketch of the sequence is shown after the list):
1. The migration thread in user space uses the
KVM_SET_USER_MEMORY_REGION ioctl to set the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
2. The migration thread then uses the KVM_SPLIT_HUGE_PAGES
ioctl (which is newly added) to do the splitting of large pages on the
kernel side.
3. A new vCPU is created temporarily (it does some initialization but will
not run) to help to do the work, i.e. it is passed as the parameter of
tdp_page_fault.
4. Collect the GPA ranges of all the memory slots with the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
5. Split the 1G huge pages (collected in step 4) into 2M by calling
tdp_page_fault, since mapping_level will return
PT_DIRECTORY_LEVEL. Here is the main difference from the usual
path, which is triggered by the guest side (EPT violation/misconfig etc.):
we call it directly on the hypervisor side.
6. Do some cleanups, i.e. free the vCPU related resources.
7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
8. Repeat step 1 ~ step 7 using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES
instead of KVM_MEM_FORCE_PT_DIRECTORY_PAGES, so that in step 5
the 2M huge pages will be split into 4K pages.
9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
10. Then the migration thread calls the log_start ioctl to enable dirty
logging, and the rest proceeds as usual.
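
A rough userspace sketch of this sequence is below. The two KVM_MEM_FORCE_* flags,
their bit values, and the KVM_SPLIT_HUGE_PAGES ioctl number are all hypothetical
(they exist only in this proposal, not in upstream KVM), and the helper assumes the
caller keeps an array mirroring the current memslot layout:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical flag values and ioctl number, for illustration only. */
#define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1UL << 2)
#define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1UL << 3)
#define KVM_SPLIT_HUGE_PAGES               _IO(KVMIO, 0xd0)

static int split_all_slots(int vm_fd,
                           struct kvm_userspace_memory_region *slots,
                           int nslots, unsigned int force_flag)
{
    int i;

    /* Step 1 (or 8): tag every slot with the "force level" flag. */
    for (i = 0; i < nslots; i++) {
        slots[i].flags |= force_flag;
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
            return -1;
    }

    /* Steps 2-7: KVM walks the flagged slots and splits in the kernel. */
    if (ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0) < 0)
        return -1;

    /* Step 9: clear the flag again. */
    for (i = 0; i < nslots; i++) {
        slots[i].flags &= ~force_flag;
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
            return -1;
    }
    return 0;
}

/* Called from the migration thread before step 10 (log_start):
 *
 *   split_all_slots(vm_fd, slots, n, KVM_MEM_FORCE_PT_DIRECTORY_PAGES);
 *   split_all_slots(vm_fd, slots, n, KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
 */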

What's your take on this? Thanks.

Regards,
Jay Zhou

> Haven't thought deep into any of them, but seems doable.
> 
> Thanks,
> 
> --
> Peter Xu


* Re: RFC: Split EPT huge pages in advance of dirty logging
From: Peter Xu @ 2020-02-19 17:19 UTC
  To: Zhoujian (jay)
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C)

On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> Hi Peter,
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, February 19, 2020 1:43 AM
> > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > 
> > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > Hi all,
> > >
> > > We found that the guest will be soft-lockup occasionally when live
> > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > reason is clear, almost all of the vCPUs are waiting for the KVM MMU
> > > spin-lock to create 4K SPTEs when the huge pages are write protected. This
> > phenomenon is also described in this patch set:
> > > https://patchwork.kernel.org/cover/11163459/
> > > which aims to handle page faults in parallel more efficiently.
> > >
> > > Our idea is to use the migration thread to touch all of the guest
> > > memory in the granularity of 4K before enabling dirty logging. To be
> > > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs into
> > PAGE_TABLE_LEVEL SPTEs as the following step.
> > 
> > IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> > (please refer to ram_block_add):
> > 
> >         qemu_madvise(new_block->host, new_block->max_length,
> > QEMU_MADV_HUGEPAGE);
> 
> Yes, you're right
> 
> > 
> > Another alternative I can think of is to add an extra parameter to QEMU to
> > explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> > instead of MADV_HUGEPAGE).  However that should also drag down the
> > performance for the whole lifecycle of the VM.  
> 
> From the performance point of view, it is better to keep the huge pages
> when the VM is not in the live migration state.
> 
> > A 3rd option is to make a QMP
> > command to dynamically turn huge pages on/off for ramblocks globally.
> 
> We're searching a dynamic method too.
> We plan to add two new flags for each memory slot, say
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
> through KVM_SET_USER_MEMORY_REGION ioctl.
> 
> The mapping_level which is called by tdp_page_fault in the kernel side
> will return PT_DIRECTORY_LEVEL if the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is
> set, and return PT_PAGE_TABLE_LEVEL if the
> KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
>  
> The key steps to split the huge pages in advance of enabling dirty log is
> as follows:
> 1. The migration thread in user space uses
> KVM_SET_USER_MEMORY_REGION ioctl to set the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> 2. The migration thread continues to use the KVM_SPLIT_HUGE_PAGES
> ioctl (which is newly added) to do the splitting of large pages in the
> kernel side.
> 3. A new vCPU is created temporally(do some initialization but will not
> run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> 4. Collect the GPA ranges of all the memory slots with the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> 5. Split the 1G huge pages(collected in step 4) into 2M by calling
> tdp_page_fault, since the mapping_level will return
> PT_DIRECTORY_LEVEL. Here is the main difference from the usual
> path which is caused by the Guest side(EPT violation/misconfig etc),
> we call it directly in the hypervisor side.
> 6. Do some cleanups, i.e. free the vCPU related resources
> 7. The KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7,
> in step 5 the 2M huge pages will be splitted into 4K pages.
> 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> 10. Then the migration thread calls the log_start ioctl to enable the dirty
> logging, and the remaining thing is the same.

I'm not sure... I think it would be good if there were a way to have
finer-granularity control over using huge pages for any process; then
KVM could directly leverage that, because KVM page tables should always
respect the mm configuration (so e.g. when a huge page is split,
KVM gets notified via mmu notifiers).  Have you thought of such a
more general way?

(And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
 and probably won't split any huge page at all after madvise() returns..)

To tell the truth I'm still confused about how splitting the huge pages
helped in your case...  If I read it right, the test reduced some execution
time from 9s to a few ms after your splitting of huge pages.  The
thing is, I don't see how splitting huge pages could solve the mmu_lock
contention with the huge VM, because IMO even if we split the huge
pages into smaller ones, those pages should still be write-protected
and need basically the same number of page faults to resolve when
accessed/written?  And I thought that should only be fixed with
solutions like what Ben has proposed with the MMU rework.  Could you
show me what I've missed?

Thanks,

-- 
Peter Xu


* RE: RFC: Split EPT huge pages in advance of dirty logging
From: Zhoujian (jay) @ 2020-02-20 13:52 UTC
  To: Peter Xu
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C),
	bgardon



> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, February 20, 2020 1:19 AM
> To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > Hi Peter,
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> pbonzini@redhat.com;
> > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > <weidong.huang@huawei.com>
> > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > >
> > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > Hi all,
> > > >
> > > > We found that the guest will be soft-lockup occasionally when live
> > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > protected. This
> > > phenomenon is also described in this patch set:
> > > > https://patchwork.kernel.org/cover/11163459/
> > > > which aims to handle page faults in parallel more efficiently.
> > > >
> > > > Our idea is to use the migration thread to touch all of the guest
> > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > DIRECTORY_LEVEL SPTEs into
> > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > >
> > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > ramblocks (please refer to ram_block_add):
> > >
> > >         qemu_madvise(new_block->host, new_block->max_length,
> > > QEMU_MADV_HUGEPAGE);
> >
> > Yes, you're right
> >
> > >
> > > Another alternative I can think of is to add an extra parameter to
> > > QEMU to explicitly disable huge pages (so that can even be
> > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> should also
> > > drag down the performance for the whole lifecycle of the VM.
> >
> > From the performance point of view, it is better to keep the huge
> > pages when the VM is not in the live migration state.
> >
> > > A 3rd option is to make a QMP
> > > command to dynamically turn huge pages on/off for ramblocks globally.
> >
> > We're searching a dynamic method too.
> > We plan to add two new flags for each memory slot, say
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through
> > KVM_SET_USER_MEMORY_REGION ioctl.
> >
> > The mapping_level which is called by tdp_page_fault in the kernel side
> > will return PT_DIRECTORY_LEVEL if the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES
> > flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL if the
> > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
> >
> > The key steps to split the huge pages in advance of enabling dirty log
> > is as follows:
> > 1. The migration thread in user space uses
> KVM_SET_USER_MEMORY_REGION
> > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> memory
> > slot.
> > 2. The migration thread continues to use the KVM_SPLIT_HUGE_PAGES
> > ioctl (which is newly added) to do the splitting of large pages in the
> > kernel side.
> > 3. A new vCPU is created temporally(do some initialization but will
> > not
> > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > 4. Collect the GPA ranges of all the memory slots with the
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > 5. Split the 1G huge pages(collected in step 4) into 2M by calling
> > tdp_page_fault, since the mapping_level will return
> > PT_DIRECTORY_LEVEL. Here is the main difference from the usual path
> > which is caused by the Guest side(EPT violation/misconfig etc), we
> > call it directly in the hypervisor side.
> > 6. Do some cleanups, i.e. free the vCPU related resources 7. The
> > KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> > 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7, in step
> 5
> > the 2M huge pages will be splitted into 4K pages.
> > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > 10. Then the migration thread calls the log_start ioctl to enable the
> > dirty logging, and the remaining thing is the same.
> 
> I'm not sure... I think it would be good if there is a way to have finer granularity
> control on using huge pages for any process, then KVM can directly leverage
> that because KVM page tables should always respect the mm configurations on
> these (so e.g. when huge page split, KVM gets notifications via mmu notifiers).
> Have you thought of such a more general way?

I did think about this. If we split the huge pages of the process into 4K, I'm
afraid it will not work for the huge page sharing scenarios, e.g. DPDK,
SPDK, etc. So the goal is to split only the EPT page tables and keep the VM
process page tables (e.g. QEMU's) untouched.

> 
> (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> and probably won't split any huge page at all after madvise() returns..)
> To tell the truth I'm still confused on how split of huge pages helped in your
> case...  

I'm sorry if the meaning is not expressed clearly, and thanks for your patience.

> If I read it right the test reduced some execution time from 9s to a
> few ms after your splittion of huge pages.  

Yes

> The thing is I don't see how split of
> huge pages could solve the mmu_lock contention with the huge VM, because
> IMO even if we split the huge pages into smaller ones, those pages should still
> be write-protected and need merely the same number of page faults to resolve
> when accessed/written? And I thought that should only be fixed with
> solutions like what Ben has proposed with the MMU rework. Could you show
> me what I've missed?

Let me try to describe the reason for the mmu_lock contention more clearly, and
what we tried to do about it...
The huge VM only has EPT SPTEs at level >= 2; level 1 SPTEs don't
exist at the beginning. Write-protecting all the huge pages will trigger EPT
violations to create level 1 SPTEs for all the vCPUs which want to write
memory. Different vCPUs write different areas of
the memory, but they all need the same kvm->mmu_lock to create the level 1
SPTEs. This situation gets worse when the number of vCPUs and the memory of
the VM are large (in our case 60U512G) and the VM has
memory-write-intensive work to do at the same time. In order to reduce the
mmu_lock contention, we try to write-protect VM memory gradually in small
chunks, such as 1G or 2M, using a vCPU temporarily created by the migration
thread to split 1G into 2M as the first step, and 2M into 4K as the second step
(this is a little hacky... and I do not know whether any side effect will be
triggered).
Compared to write-protecting all VM memory in one go, the write-protected
range is limited in this way, and only the vCPUs writing to this limited
range are involved in taking the mmu_lock. The contention is reduced
since the memory range is small and the number of vCPUs involved is small
too.
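
To illustrate the chunked approach, here is a rough sketch of how the migration
thread could drive the splitting chunk by chunk; the range-carrying variant of the
splitting ioctl and its argument struct are assumptions for illustration only, not
existing (or proposed) KVM API:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical range-based variant of the splitting ioctl, so that the
 * migration thread can work through guest memory chunk by chunk. */
struct kvm_split_range {
    __u64 gpa;
    __u64 size;
};
#define KVM_SPLIT_HUGE_PAGES_RANGE  _IOW(KVMIO, 0xd1, struct kvm_split_range)

/* Split [start, start + len) in chunks of 'chunk' bytes (e.g. 1G), so
 * that only the vCPUs writing the current chunk contend on mmu_lock. */
static int split_in_chunks(int vm_fd, __u64 start, __u64 len, __u64 chunk)
{
    __u64 off;

    for (off = 0; off < len; off += chunk) {
        struct kvm_split_range r = {
            .gpa  = start + off,
            .size = (len - off < chunk) ? len - off : chunk,
        };

        if (ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES_RANGE, &r) < 0)
            return -1;
    }
    return 0;
}

/* One pass per split level: first with the slots flagged for 2M, then
 * again with the slots flagged for 4K, iterating each slot's GPA range
 * in 1G (or 2M) chunks. */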

Of course, it takes some extra time to split all the huge pages into 4K
pages before the real migration starts, about 60s for 512G in my experiment.

During the iterative memory copy phase, PML will do the dirty logging work
(the 4K pages are not write-protected in that case), or, IIRC, fast_page_fault
is used to mark pages dirty if PML is not supported, in which case the
mmu_lock is not needed.

Regards,
Jay Zhou

* Re: RFC: Split EPT huge pages in advance of dirty logging
From: Ben Gardon @ 2020-02-20 17:32 UTC
  To: Zhoujian (jay)
  Cc: Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C),
	Junaid Shahid

FWIW, we currently do this eager splitting at Google for live
migration. When the log-dirty-memory flag is set on a memslot we
eagerly split all pages in the slot down to 4k granularity.
As Jay said, this does not cause crippling lock contention because the
vCPU page faults generated by write protection / splitting can be
resolved in the fast page fault path without acquiring the MMU lock.
I believe +Junaid Shahid tried to upstream this approach at some point
in the past, but the patch set didn't make it in. (This was before my
time, so I'm hoping he has a link.)
I haven't done the analysis to know if eager splitting is more or less
efficient with parallel slow-path page faults, but it's definitely
faster under the MMU lock.

On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Thursday, February 20, 2020 1:19 AM
> > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> >
> > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > Hi Peter,
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > pbonzini@redhat.com;
> > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > <weidong.huang@huawei.com>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > Hi all,
> > > > >
> > > > > We found that the guest will be soft-lockup occasionally when live
> > > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > > protected. This
> > > > phenomenon is also described in this patch set:
> > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > which aims to handle page faults in parallel more efficiently.
> > > > >
> > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > > DIRECTORY_LEVEL SPTEs into
> > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > >
> > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > ramblocks (please refer to ram_block_add):
> > > >
> > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > QEMU_MADV_HUGEPAGE);
> > >
> > > Yes, you're right
> > >
> > > >
> > > > Another alternative I can think of is to add an extra parameter to
> > > > QEMU to explicitly disable huge pages (so that can even be
> > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> > should also
> > > > drag down the performance for the whole lifecycle of the VM.
> > >
> > > From the performance point of view, it is better to keep the huge
> > > pages when the VM is not in the live migration state.
> > >
> > > > A 3rd option is to make a QMP
> > > > command to dynamically turn huge pages on/off for ramblocks globally.
> > >
> > > We're searching a dynamic method too.
> > > We plan to add two new flags for each memory slot, say
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through
> > > KVM_SET_USER_MEMORY_REGION ioctl.
> > >
> > > The mapping_level which is called by tdp_page_fault in the kernel side
> > > will return PT_DIRECTORY_LEVEL if the
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES
> > > flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL if the
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
> > >
> > > The key steps to split the huge pages in advance of enabling dirty log
> > > is as follows:
> > > 1. The migration thread in user space uses
> > KVM_SET_USER_MEMORY_REGION
> > > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> > memory
> > > slot.
> > > 2. The migration thread continues to use the KVM_SPLIT_HUGE_PAGES
> > > ioctl (which is newly added) to do the splitting of large pages in the
> > > kernel side.
> > > 3. A new vCPU is created temporally(do some initialization but will
> > > not
> > > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > > 4. Collect the GPA ranges of all the memory slots with the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > 5. Split the 1G huge pages(collected in step 4) into 2M by calling
> > > tdp_page_fault, since the mapping_level will return
> > > PT_DIRECTORY_LEVEL. Here is the main difference from the usual path
> > > which is caused by the Guest side(EPT violation/misconfig etc), we
> > > call it directly in the hypervisor side.
> > > 6. Do some cleanups, i.e. free the vCPU related resources 7. The
> > > KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> > > 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7, in step
> > 5
> > > the 2M huge pages will be splitted into 4K pages.
> > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > 10. Then the migration thread calls the log_start ioctl to enable the
> > > dirty logging, and the remaining thing is the same.
> >
> > I'm not sure... I think it would be good if there is a way to have finer granularity
> > control on using huge pages for any process, then KVM can directly leverage
> > that because KVM page tables should always respect the mm configurations on
> > these (so e.g. when huge page split, KVM gets notifications via mmu notifiers).
> > Have you thought of such a more general way?
>
> I did have thought of this, if we split the huge pages into 4K of a process, I'm
> afraid it will not be workable for the huge pages sharing scenario, e.g. DPDK,
> SPDK etc. So, only split the EPT page table and keep the VM process page table
> (e.g. qemu) untouched is the goal.
>
> >
> > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > and probably won't split any huge page at all after madvise() returns..)
> > To tell the truth I'm still confused on how split of huge pages helped in your
> > case...
>
> I'm sorry if the meaning is not expressed clearly, and thanks for your patience.
>
> > If I read it right the test reduced some execution time from 9s to a
> > few ms after your splittion of huge pages.
>
> Yes
>
> > The thing is I don't see how split of
> > huge pages could solve the mmu_lock contention with the huge VM, because
> > IMO even if we split the huge pages into smaller ones, those pages should still
> > be write-protected and need merely the same number of page faults to resolve
> > when accessed/written? And I thought that should only be fixed with
> > solutions like what Ben has proposed with the MMU rework. Could you show
> > me what I've missed?
>
> Let me try to describe the reason for the mmu_lock contention more
> clearly, and the effort we tried to make...
> The huge VM only has EPT sptes at level >= 2; level 1 sptes don't exist
> at the beginning. Write-protecting all the huge pages will trigger EPT
> violations to create level 1 sptes for all the vCPUs which want to
> write the content of the memory. Different vCPUs write different areas
> of the memory, but they all need the same kvm->mmu_lock to create the
> level 1 sptes; this situation gets worse when the number of vCPUs and
> the memory of the VM are large (60 vCPUs/512G in our case) and the VM
> is doing memory-write-intensive work at the same time. In order to
> reduce the mmu_lock contention, we try to write protect VM memory
> gradually in small chunks, such as 1G or 2M, using a vCPU created
> temporarily by the migration thread to split 1G to 2M as the first
> step, and then 2M to 4K as the second step (this is a little hacky...
> and I do not know whether any side effects will be triggered).
> Compared to write-protecting all VM memory in one go, the
> write-protected range is limited this way, and only the vCPUs writing
> this limited range are involved in taking the mmu_lock. The contention
> is reduced since the memory range is small and the number of vCPUs
> involved is small too.
>
> Of course, it takes some extra time to split all the huge pages into 4K
> pages before the real migration starts, about 60s for 512G in my
> experiment.
>
> During the iterative memory copy phase, PML will do the dirty logging
> work (the 4K pages are not write-protected in that case), or, IIRC,
> fast_page_fault marks pages dirty if PML is not supported, in which
> case the mmu_lock is not needed.
>
> Regards,
> Jay Zhou
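
For concreteness, the flow in the quoted steps could look roughly like the
sketch below from the migration thread's point of view. This is only an
illustration: KVM_MEM_FORCE_PT_DIRECTORY_PAGES,
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES and KVM_SPLIT_HUGE_PAGES are the flag/ioctl
names proposed in this thread, not existing kernel ABI, so the values below
are placeholders.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Proposed in this thread, not upstream ABI; values are placeholders. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1U << 3)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1U << 4)
    #define KVM_SPLIT_HUGE_PAGES               _IO(KVMIO, 0xff)

    /* Split every huge mapping in the given slots one level down. */
    static void split_one_level(int vm_fd,
                                struct kvm_userspace_memory_region *slots,
                                int nslots, uint32_t force_flag)
    {
        for (int i = 0; i < nslots; i++) {
            slots[i].flags |= force_flag;              /* step 1 (or 8) */
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]);
        }

        ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);         /* steps 2-7 */

        for (int i = 0; i < nslots; i++) {
            slots[i].flags &= ~force_flag;             /* step 9 */
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]);
        }
    }

    static void presplit_before_dirty_logging(int vm_fd,
                                              struct kvm_userspace_memory_region *slots,
                                              int nslots)
    {
        /* 1G -> 2M first, then 2M -> 4K, as in the quoted steps. */
        split_one_level(vm_fd, slots, nslots, KVM_MEM_FORCE_PT_DIRECTORY_PAGES);
        split_one_level(vm_fd, slots, nslots, KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
        /* Step 10: the caller then enables dirty logging (log_start) as usual. */
    }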

^ permalink raw reply	[flat|nested] 28+ messages in thread
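
As an aside to the qemu_madvise()/MADV_NOHUGEPAGE exchange quoted in the
messages above and below: both hints boil down to a plain madvise() call on
the ramblock, and, as noted in the thread, MADV_NOHUGEPAGE only marks the
range as ineligible for new huge pages; it does not synchronously split
mappings that already exist. A minimal sketch (the function name is made up
for the example):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    /* Toggle THP eligibility for a guest RAM range. */
    static int set_thp_hint(void *host_addr, size_t len, int allow_huge)
    {
        return madvise(host_addr, len,
                       allow_huge ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    }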

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 13:52         ` Zhoujian (jay)
@ 2020-02-20 17:34           ` Ben Gardon
  -1 siblings, 0 replies; 28+ messages in thread
From: Ben Gardon @ 2020-02-20 17:34 UTC (permalink / raw)
  To: Zhoujian (jay), Junaid Shahid
  Cc: Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)

On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Thursday, February 20, 2020 1:19 AM
> > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> >
> > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > Hi Peter,
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > pbonzini@redhat.com;
> > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > <weidong.huang@huawei.com>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > Hi all,
> > > > >
> > > > > We found that the guest will be soft-lockup occasionally when live
> > > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > > protected. This
> > > > phenomenon is also described in this patch set:
> > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > which aims to handle page faults in parallel more efficiently.
> > > > >
> > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > > DIRECTORY_LEVEL SPTEs into
> > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > >
> > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > ramblocks (please refer to ram_block_add):
> > > >
> > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > QEMU_MADV_HUGEPAGE);
> > >
> > > Yes, you're right
> > >
> > > >
> > > > Another alternative I can think of is to add an extra parameter to
> > > > QEMU to explicitly disable huge pages (so that can even be
> > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> > should also
> > > > drag down the performance for the whole lifecycle of the VM.
> > >
> > > From the performance point of view, it is better to keep the huge
> > > pages when the VM is not in the live migration state.
> > >
> > > > A 3rd option is to make a QMP
> > > > command to dynamically turn huge pages on/off for ramblocks globally.
> > >
> > > We're searching for a dynamic method too.
> > > We plan to add two new flags for each memory slot, say
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through the
> > > KVM_SET_USER_MEMORY_REGION ioctl.
> > >
> > > The mapping_level function, which is called by tdp_page_fault on the
> > > kernel side, will return PT_DIRECTORY_LEVEL if the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set, and
> > > PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is
> > > set.
> > >
> > > The key steps to split the huge pages in advance of enabling the dirty
> > > log are as follows:
> > > 1. The migration thread in user space uses the
> > > KVM_SET_USER_MEMORY_REGION ioctl to set the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> > > 2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
> > > (which is newly added) to do the splitting of large pages on the kernel
> > > side.
> > > 3. A new vCPU is created temporarily (it does some initialization but
> > > will not run) to help do the work, i.e. as the parameter of
> > > tdp_page_fault.
> > > 4. Collect the GPA ranges of all the memory slots with the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > 5. Split the 1G huge pages (collected in step 4) into 2M by calling
> > > tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
> > > This is the main difference from the usual path, which is triggered by
> > > the guest side (EPT violation/misconfig etc.); here we call it directly
> > > on the hypervisor side.
> > > 6. Do some cleanups, i.e. free the vCPU-related resources.
> > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in step 5
> > > the 2M huge pages will be split into 4K pages.
> > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > 10. Then the migration thread calls the log_start ioctl to enable dirty
> > > logging, and the remaining steps are the same.
> >
> > I'm not sure... I think it would be good if there were a way to have
> > finer-granularity control over huge page usage for any process; KVM could
> > then directly leverage that, because KVM page tables should always respect
> > the mm configuration (so e.g. when a huge page is split, KVM gets notified
> > via MMU notifiers). Have you thought of such a more general way?
>
> I did think of this. If we split the huge pages of a process into 4K, I'm
> afraid it will not work for the huge page sharing scenarios, e.g. DPDK,
> SPDK, etc. So the goal is to split only the EPT page table and keep the VM
> process page table (e.g. QEMU's) untouched.
>
> >
> > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > and probably won't split any huge page at all after madvise() returns..)
> > To tell the truth I'm still confused about how splitting huge pages helped
> > in your case...
>
> I'm sorry if the meaning is not expressed clearly, and thanks for your patience.
>
> > If I read it right, the test reduced the execution time from 9s to a
> > few ms after your splitting of huge pages.
>
> Yes
>
> > The thing is I don't see how splitting
> > huge pages could solve the mmu_lock contention with the huge VM, because
> > IMO even if we split the huge pages into smaller ones, those pages should
> > still be write-protected and need nearly the same number of page faults to
> > resolve when accessed/written? And I thought that should only be fixed with
> > solutions like what Ben has proposed with the MMU rework. Could you show
> > me what I've missed?
>
> Let me try to describe the reason for the mmu_lock contention more
> clearly, and the effort we tried to make...
> The huge VM only has EPT sptes at level >= 2; level 1 sptes don't exist
> at the beginning. Write-protecting all the huge pages will trigger EPT
> violations to create level 1 sptes for all the vCPUs which want to
> write the content of the memory. Different vCPUs write different areas
> of the memory, but they all need the same kvm->mmu_lock to create the
> level 1 sptes; this situation gets worse when the number of vCPUs and
> the memory of the VM are large (60 vCPUs/512G in our case) and the VM
> is doing memory-write-intensive work at the same time. In order to
> reduce the mmu_lock contention, we try to write protect VM memory
> gradually in small chunks, such as 1G or 2M, using a vCPU created
> temporarily by the migration thread to split 1G to 2M as the first
> step, and then 2M to 4K as the second step (this is a little hacky...
> and I do not know whether any side effects will be triggered).
> Compared to write-protecting all VM memory in one go, the
> write-protected range is limited this way, and only the vCPUs writing
> this limited range are involved in taking the mmu_lock. The contention
> is reduced since the memory range is small and the number of vCPUs
> involved is small too.
>
> Of course, it takes some extra time to split all the huge pages into 4K
> pages before the real migration starts, about 60s for 512G in my
> experiment.
>
> During the iterative memory copy phase, PML will do the dirty logging
> work (the 4K pages are not write-protected in that case), or, IIRC,
> fast_page_fault marks pages dirty if PML is not supported, in which
> case the mmu_lock is not needed.
>
> Regards,
> Jay Zhou

(Ah I top-posted I'm sorry. Re-sending at the bottom.)

FWIW, we currently do this eager splitting at Google for live
migration. When the log-dirty-memory flag is set on a memslot we
eagerly split all pages in the slot down to 4k granularity.
As Jay said, this does not cause crippling lock contention because the
vCPU page faults generated by write protection / splitting can be
resolved in the fast page fault path without acquiring the MMU lock.
I believe +Junaid Shahid tried to upstream this approach at some point
in the past, but the patch set didn't make it in. (This was before my
time, so I'm hoping he has a link.)
I haven't done the analysis to know if eager splitting is more or less
efficient with parallel slow-path page faults, but it's definitely
faster under the MMU lock.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 17:34           ` Ben Gardon
@ 2020-02-20 18:17             ` Peter Xu
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2020-02-20 18:17 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Zhoujian (jay),
	Junaid Shahid, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)

On Thu, Feb 20, 2020 at 09:34:52AM -0800, Ben Gardon wrote:
> On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Thursday, February 20, 2020 1:19 AM
> > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > <weidong.huang@huawei.com>
> > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > >
> > > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > > Hi Peter,
> > > >
> > > > > -----Original Message-----
> > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > > pbonzini@redhat.com;
> > > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > > <weidong.huang@huawei.com>
> > > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > > >
> > > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We found that the guest will be soft-lockup occasionally when live
> > > > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > > > protected. This
> > > > > phenomenon is also described in this patch set:
> > > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > > which aims to handle page faults in parallel more efficiently.
> > > > > >
> > > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > > > DIRECTORY_LEVEL SPTEs into
> > > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > > >
> > > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > > ramblocks (please refer to ram_block_add):
> > > > >
> > > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > > QEMU_MADV_HUGEPAGE);
> > > >
> > > > Yes, you're right
> > > >
> > > > >
> > > > > Another alternative I can think of is to add an extra parameter to
> > > > > QEMU to explicitly disable huge pages (so that can even be
> > > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> > > should also
> > > > > drag down the performance for the whole lifecycle of the VM.
> > > >
> > > > From the performance point of view, it is better to keep the huge
> > > > pages when the VM is not in the live migration state.
> > > >
> > > > > A 3rd option is to make a QMP
> > > > > command to dynamically turn huge pages on/off for ramblocks globally.
> > > >
> > > > We're searching for a dynamic method too.
> > > > We plan to add two new flags for each memory slot, say
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through the
> > > > KVM_SET_USER_MEMORY_REGION ioctl.

[1]

> > > >
> > > > The mapping_level function, which is called by tdp_page_fault on the
> > > > kernel side, will return PT_DIRECTORY_LEVEL if the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set, and
> > > > PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is
> > > > set.
> > > >
> > > > The key steps to split the huge pages in advance of enabling the dirty
> > > > log are as follows:
> > > > 1. The migration thread in user space uses the
> > > > KVM_SET_USER_MEMORY_REGION ioctl to set the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> > > > 2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
> > > > (which is newly added) to do the splitting of large pages on the
> > > > kernel side.
> > > > 3. A new vCPU is created temporarily (it does some initialization but
> > > > will not run) to help do the work, i.e. as the parameter of
> > > > tdp_page_fault.
> > > > 4. Collect the GPA ranges of all the memory slots with the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > > 5. Split the 1G huge pages (collected in step 4) into 2M by calling
> > > > tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
> > > > This is the main difference from the usual path, which is triggered by
> > > > the guest side (EPT violation/misconfig etc.); here we call it directly
> > > > on the hypervisor side.
> > > > 6. Do some cleanups, i.e. free the vCPU-related resources.
> > > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in step 5
> > > > the 2M huge pages will be split into 4K pages.
> > > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > > 10. Then the migration thread calls the log_start ioctl to enable dirty
> > > > logging, and the remaining steps are the same.
> > >
> > > I'm not sure... I think it would be good if there were a way to have
> > > finer-granularity control over huge page usage for any process; KVM could
> > > then directly leverage that, because KVM page tables should always respect
> > > the mm configuration (so e.g. when a huge page is split, KVM gets notified
> > > via MMU notifiers). Have you thought of such a more general way?
> >
> > I did think of this. If we split the huge pages of a process into 4K, I'm
> > afraid it will not work for the huge page sharing scenarios, e.g. DPDK,
> > SPDK, etc. So the goal is to split only the EPT page table and keep the VM
> > process page table (e.g. QEMU's) untouched.

Ah I see your point now.

> >
> > >
> > > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > > and probably won't split any huge page at all after madvise() returns..)
> > > To tell the truth I'm still confused about how splitting huge pages helped
> > > in your case...
> >
> > I'm sorry if the meaning is not expressed clearly, and thanks for your patience.
> >
> > > If I read it right, the test reduced the execution time from 9s to a
> > > few ms after your splitting of huge pages.
> >
> > Yes
> >
> > > The thing is I don't see how splitting
> > > huge pages could solve the mmu_lock contention with the huge VM, because
> > > IMO even if we split the huge pages into smaller ones, those pages should
> > > still be write-protected and need nearly the same number of page faults to
> > > resolve when accessed/written? And I thought that should only be fixed with
> > > solutions like what Ben has proposed with the MMU rework. Could you show
> > > me what I've missed?
> >
> > Let me try to describe the reason for the mmu_lock contention more
> > clearly, and the effort we tried to make...
> > The huge VM only has EPT sptes at level >= 2; level 1 sptes don't exist
> > at the beginning. Write-protecting all the huge pages will trigger EPT
> > violations to create level 1 sptes for all the vCPUs which want to
> > write the content of the memory. Different vCPUs write different areas
> > of the memory, but they all need the same kvm->mmu_lock to create the
> > level 1 sptes; this situation gets worse when the number of vCPUs and
> > the memory of the VM are large (60 vCPUs/512G in our case) and the VM
> > is doing memory-write-intensive work at the same time. In order to
> > reduce the mmu_lock contention, we try to write protect VM memory
> > gradually in small chunks, such as 1G or 2M, using a vCPU created
> > temporarily by the migration thread to split 1G to 2M as the first
> > step, and then 2M to 4K as the second step (this is a little hacky...
> > and I do not know whether any side effects will be triggered).
> > Compared to write-protecting all VM memory in one go, the
> > write-protected range is limited this way, and only the vCPUs writing
> > this limited range are involved in taking the mmu_lock. The contention
> > is reduced since the memory range is small and the number of vCPUs
> > involved is small too.
> >
> > Of course, it takes some extra time to split all the huge pages into 4K
> > pages before the real migration starts, about 60s for 512G in my
> > experiment.
> >
> > During the iterative memory copy phase, PML will do the dirty logging
> > work (the 4K pages are not write-protected in that case), or, IIRC,
> > fast_page_fault marks pages dirty if PML is not supported, in which
> > case the mmu_lock is not needed.

Yes I missed both of these.  Thanks for explaining!

Then your idea makes sense, at least to me. Though instead of
the KVM_MEM_FORCE_PT_* naming [1], we could also embed the allowed page
sizes for the memslot into the flags using a few bits, with another
new kvm cap.

> >
> > Regards,
> > Jay Zhou
> 
> (Ah I top-posted I'm sorry. Re-sending at the bottom.)
> 
> FWIW, we currently do this eager splitting at Google for live
> migration. When the log-dirty-memory flag is set on a memslot we
> eagerly split all pages in the slot down to 4k granularity.
> As Jay said, this does not cause crippling lock contention because the
> vCPU page faults generated by write protection / splitting can be
> resolved in the fast page fault path without acquiring the MMU lock.
> I believe +Junaid Shahid tried to upstream this approach at some point
> in the past, but the patch set didn't make it in. (This was before my
> time, so I'm hoping he has a link.)
> I haven't done the analysis to know if eager splitting is more or less
> efficient with parallel slow-path page faults, but it's definitely
> faster under the MMU lock.

Yes, totally agreed.  Though compared to eager splitting (which might
still need a new capability for the changed behavior after all, not
sure...), the per-memslot hint solution looks slightly nicer to me,
imho, because it offers more mechanism than policy.

Thanks,

-- 
Peter Xu
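
(Purely as an illustration of the "embed allowed page sizes into the memslot
flags" idea above: none of the names or bit positions below exist, they are
made up for the sketch.)

    #include <stdint.h>

    /* Hypothetical per-memslot "allowed page size" bits. */
    #define KVM_MEM_ALLOW_PAGES_4K   (1U << 4)
    #define KVM_MEM_ALLOW_PAGES_2M   (1U << 5)
    #define KVM_MEM_ALLOW_PAGES_1G   (1U << 6)

    /* Largest level the fault path may install for this slot
     * (1 = 4K, 2 = 2M, 3 = 1G); default to 4K-only if no bit is set. */
    static int slot_max_level(uint32_t slot_flags)
    {
        if (slot_flags & KVM_MEM_ALLOW_PAGES_1G)
            return 3;
        if (slot_flags & KVM_MEM_ALLOW_PAGES_2M)
            return 2;
        return 1;
    }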


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 18:17             ` Peter Xu
@ 2020-02-21  6:51               ` Zhoujian (jay)
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-02-21  6:51 UTC (permalink / raw)
  To: Peter Xu, Ben Gardon
  Cc: Junaid Shahid, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)



> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, February 21, 2020 2:17 AM
> To: Ben Gardon <bgardon@google.com>
> Cc: Zhoujian (jay) <jianjay.zhou@huawei.com>; Junaid Shahid
> <junaids@google.com>; kvm@vger.kernel.org; qemu-devel@nongnu.org;
> pbonzini@redhat.com; dgilbert@redhat.com; quintela@redhat.com;
> Liujinsong (Paul) <liu.jinsong@huawei.com>; linfeng (M)
> <linfeng23@huawei.com>; wangxin (U) <wangxinxin.wang@huawei.com>;
> Huangweidong (C) <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Thu, Feb 20, 2020 at 09:34:52AM -0800, Ben Gardon wrote:
> > On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com>
> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Thursday, February 20, 2020 1:19 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > > > pbonzini@redhat.com; dgilbert@redhat.com; quintela@redhat.com;
> > > > Liujinsong (Paul) <liu.jinsong@huawei.com>; linfeng (M)
> > > > <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>;
> > > > Huangweidong (C) <weidong.huang@huawei.com>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > > > Hi Peter,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > > > pbonzini@redhat.com;
> > > > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > > > <weidong.huang@huawei.com>
> > > > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty
> > > > > > logging
> > > > > >
> > > > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We found that the guest will be soft-lockup occasionally
> > > > > > > when live migrating a 60 vCPU, 512GiB huge page and memory
> > > > > > > sensitive VM. The reason is clear, almost all of the vCPUs
> > > > > > > are waiting for the KVM MMU spin-lock to create 4K SPTEs
> > > > > > > when the huge pages are write protected. This
> > > > > > phenomenon is also described in this patch set:
> > > > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > > > which aims to handle page faults in parallel more efficiently.
> > > > > > >
> > > > > > > Our idea is to use the migration thread to touch all of the
> > > > > > > guest memory in the granularity of 4K before enabling dirty
> > > > > > > logging. To be more specific, we split all the PDPE_LEVEL
> > > > > > > SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then
> > > > > > > split all the DIRECTORY_LEVEL SPTEs into
> > > > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > > > >
> > > > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > > > ramblocks (please refer to ram_block_add):
> > > > > >
> > > > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > > > QEMU_MADV_HUGEPAGE);
> > > > >
> > > > > Yes, you're right
> > > > >
> > > > > >
> > > > > > Another alternative I can think of is to add an extra
> > > > > > parameter to QEMU to explicitly disable huge pages (so that
> > > > > > can even be MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).
> > > > > > However that
> > > > should also
> > > > > > drag down the performance for the whole lifecycle of the VM.
> > > > >
> > > > > From the performance point of view, it is better to keep the
> > > > > huge pages when the VM is not in the live migration state.
> > > > >
> > > > > > A 3rd option is to make a QMP
> > > > > > command to dynamically turn huge pages on/off for ramblocks
> globally.
> > > > >
> > > > > We're searching for a dynamic method too.
> > > > > We plan to add two new flags for each memory slot, say
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
> > > > > through the KVM_SET_USER_MEMORY_REGION ioctl.
> 
> [1]
> 
> > > > >
> > > > > The mapping_level function, which is called by tdp_page_fault on
> > > > > the kernel side, will return PT_DIRECTORY_LEVEL if the
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set,
> > > > > and PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES
> > > > > flag is set.
> > > > >
> > > > > The key steps to split the huge pages in advance of enabling
> > > > > dirty log is as follows:
> > > > > 1. The migration thread in user space uses
> > > > KVM_SET_USER_MEMORY_REGION
> > > > > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> > > > memory
> > > > > slot.
> > > > > 2. The migration thread continues to use the
> > > > > KVM_SPLIT_HUGE_PAGES ioctl (which is newly added) to do the
> > > > > splitting of large pages in the kernel side.
> > > > > 3. A new vCPU is created temporally(do some initialization but
> > > > > will not
> > > > > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > > > > 4. Collect the GPA ranges of all the memory slots with the
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > > > 5. Split the 1G huge pages (collected in step 4) into 2M by
> > > > > calling tdp_page_fault, since the mapping_level will return
> > > > > PT_DIRECTORY_LEVEL. Here is the main difference from the usual
> > > > > path, which is triggered by the guest side (EPT violation/misconfig
> > > > > etc.): we call it directly on the hypervisor side.
> > > > > 6. Do some cleanups, i.e. free the vCPU-related resources.
> > > > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in
> > > > > step 5 the 2M huge pages will be split into 4K pages.
> > > > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > > > 10. Then the migration thread calls the log_start ioctl to
> > > > > enable the dirty logging, and the rest proceeds as usual.
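
For concreteness, the userspace half of steps 1 and 2 above might look
roughly like the sketch below. Only KVM_SET_USER_MEMORY_REGION and struct
kvm_userspace_memory_region are existing KVM UAPI here; the two
KVM_MEM_FORCE_PT_* flags and the KVM_SPLIT_HUGE_PAGES ioctl exist only in
this proposal, so their values are placeholders.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1UL << 2)        /* proposed */
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1UL << 3)        /* proposed */
    #define KVM_SPLIT_HUGE_PAGES               _IO(KVMIO, 0xd0)  /* proposed */

    static int split_in_advance(int vm_fd,
                                struct kvm_userspace_memory_region *slots,
                                int nr_slots, unsigned long force_flag)
    {
            int i;

            for (i = 0; i < nr_slots; i++) {
                    /* step 1: tag every memory slot with the force flag */
                    slots[i].flags |= force_flag;
                    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
                            return -1;
            }
            /* step 2: ask KVM to split the huge pages of the tagged slots */
            return ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);
    }

The same helper would then be called a second time with the other flag for
the 2M -> 4K pass (step 8).
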
> > > >
> > > > I'm not sure... I think it would be good if there were a way to have
> > > > finer-granularity control on using huge pages for any process; then
> > > > KVM could directly leverage that, because KVM page tables should
> > > > always respect the mm configurations on these (so e.g. when a huge
> > > > page is split, KVM gets notifications via mmu notifiers).
> > > > Have you thought of such a more general way?
> > >
> > > I did think of this, but if we split the huge pages of a process
> > > into 4K, I'm afraid it will not be workable for the huge page
> > > sharing scenarios, e.g. DPDK, SPDK, etc. So the goal is to split
> > > only the EPT page table and keep the VM process page table
> > > (e.g. QEMU's) untouched.
> 
> Ah I see your point now.
> 
> > >
> > > >
> > > > (And I just noticed that MADV_NOHUGEPAGE is only a hint to
> > > > khugepaged and probably won't split any huge page at all after
> > > > madvise() returns..) To tell the truth I'm still confused on how
> > > > split of huge pages helped in your case...
> > >
> > > I'm sorry if the meaning is not expressed clearly, and thanks for your
> patience.
> > >
> > > > If I read it right, the test reduced some execution time from 9s to
> > > > a few ms after your splitting of huge pages.
> > >
> > > Yes
> > >
> > > > The thing is I don't see how split of huge pages could solve the
> > > > mmu_lock contention with the huge VM, because IMO even if we split
> > > > the huge pages into smaller ones, those pages should still be
> > > > write-protected and need merely the same number of page faults to
> > > > resolve when accessed/written? And I thought that should only be
> > > > fixed with solutions like what Ben has proposed with the MMU
> > > > rework. Could you show me what I've missed?
> > >
> > > Let me try to describe the reason for the mmu_lock contention more
> > > clearly, along with what we tried to do...
> > > The huge VM only has EPT SPTEs at level >= 2, and level 1 SPTEs
> > > don't exist at the beginning. Write-protecting all the huge pages
> > > will trigger EPT violations to create level 1 SPTEs for all the
> > > vCPUs that want to write the memory. Different vCPUs write
> > > different areas of the memory, but they need the same kvm->mmu_lock
> > > to create the level 1 SPTEs. This situation gets worse when the
> > > number of vCPUs and the amount of VM memory are large (in our case
> > > 60U512G) and the VM is doing memory-write-intensive work. In order
> > > to reduce the mmu_lock contention, we try to write protect VM
> > > memory gradually in small chunks, such as 1G or 2M, using a vCPU
> > > temporarily created by the migration thread to split 1G to 2M as
> > > the first step, and to split 2M to 4K as the second step (this is a
> > > little hacky... and I do not know what side effects it may trigger).
> > > Compared to write-protecting all VM memory in one go, the
> > > write-protected range is limited this way, and only the vCPUs
> > > writing this limited range are involved in taking the mmu_lock. The
> > > contention is reduced since the memory range is small and the
> > > number of vCPUs involved is small too.
> > >
> > > Of course, it will take some extra time to split all the huge pages
> > > into 4K pages before the real migration starts, about 60s for 512G
> > > in my experiment.
> > >
> > > During the iterative memory copy phase, PML will do the dirty
> > > logging work (the 4K pages are not write-protected in that case),
> > > or, IIRC, fast_page_fault is used to mark pages dirty if PML is not
> > > supported, in which case the mmu_lock is not needed.
> 
> Yes I missed both of these.  Thanks for explaining!
> 
> Then your idea makes sense, at least to me. Though instead of the
> KVM_MEM_FORCE_PT_* naming [1], we could also embed the allowed page sizes
> for the memslot into the flags using a few bits, with another new KVM cap.

Thanks for this suggestion.
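
As a rough illustration of that suggestion, the encoding could look
something like the snippet below. None of these names, bit positions or
level numbers are existing KVM UAPI; they only sketch the shape of the idea,
with mapping_level() clamping its result to the per-slot maximum instead of
checking two separate force flags.

    /* Hypothetical flag bits: each memslot advertises the largest page size
     * KVM may use to map it.  Names and values are illustrative only. */
    #define KVM_MEM_MAX_LEVEL_4K   (1UL << 2)   /* map with 4K pages only  */
    #define KVM_MEM_MAX_LEVEL_2M   (1UL << 3)   /* allow up to 2M mappings */
    /* neither bit set: no restriction, 1G mappings remain allowed */

    static int memslot_max_mapping_level(unsigned int slot_flags)
    {
            if (slot_flags & KVM_MEM_MAX_LEVEL_4K)
                    return 1;       /* PT_PAGE_TABLE_LEVEL */
            if (slot_flags & KVM_MEM_MAX_LEVEL_2M)
                    return 2;       /* PT_DIRECTORY_LEVEL  */
            return 3;               /* PT_PDPE_LEVEL       */
    }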

> > >
> > > Regards,
> > > Jay Zhou
> >
> > (Ah I top-posted I'm sorry. Re-sending at the bottom.)
> >
> > FWIW, we currently do this eager splitting at Google for live
> > migration. When the log-dirty-memory flag is set on a memslot we
> > eagerly split all pages in the slot down to 4k granularity.
> > As Jay said, this does not cause crippling lock contention because the
> > vCPU page faults generated by write protection / splitting can be
> > resolved in the fast page fault path without acquiring the MMU lock.
> > I believe +Junaid Shahid tried to upstream this approach at some point
> > in the past, but the patch set didn't make it in. (This was before my
> > time, so I'm hoping he has a link.) 

I tried to google the "eager splitting" approach using keywords like
"eager splitting", "eager splitting live migration", or adding the name of
the author, but unfortunately no relevant links were found...
But still, thanks for this information.

Regards,
Jay Zhou

> > I haven't done the analysis to
> > know if eager splitting is more or less efficient with parallel
> > slow-path page faults, but it's definitely faster under the MMU lock.
> 
> Yes, totally agreed.  Though compared to eager splitting (which might still
> need a new capability for the changed behavior after all, not sure...), the
> per-memslot hint solution looks slightly nicer to me, imho, because it can
> offer more mechanism than policy.
> 
> Thanks,
> 
> --
> Peter Xu


* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 17:34           ` Ben Gardon
@ 2020-02-21 22:08           ` Junaid Shahid
  2020-02-22  0:19               ` Peter Feiner
  -1 siblings, 1 reply; 28+ messages in thread
From: Junaid Shahid @ 2020-02-21 22:08 UTC (permalink / raw)
  To: Ben Gardon, Zhoujian (jay)
  Cc: Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C),
	pfeiner

On 2/20/20 9:34 AM, Ben Gardon wrote:
> 
> FWIW, we currently do this eager splitting at Google for live
> migration. When the log-dirty-memory flag is set on a memslot we
> eagerly split all pages in the slot down to 4k granularity.
> As Jay said, this does not cause crippling lock contention because the
> vCPU page faults generated by write protection / splitting can be
> resolved in the fast page fault path without acquiring the MMU lock.
> I believe +Junaid Shahid tried to upstream this approach at some point
> in the past, but the patch set didn't make it in. (This was before my
> time, so I'm hoping he has a link.)
> I haven't done the analysis to know if eager splitting is more or less
> efficient with parallel slow-path page faults, but it's definitely
> faster under the MMU lock.
> 

I am not sure if we ever posted those patches upstream. Peter Feiner would
know for sure. One notable difference in what we do compared to the approach
outlined by Jay is that we don't rely on tdp_page_fault() to do the
splitting. So we don't have to create a dummy VCPU and the specialized split
function is also much faster.

Thanks,
Junaid

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-21 22:08           ` Junaid Shahid
@ 2020-02-22  0:19               ` Peter Feiner
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Feiner @ 2020-02-22  0:19 UTC (permalink / raw)
  To: Junaid Shahid
  Cc: Ben Gardon, Zhoujian (jay),
	Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)

On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com> wrote:
>
> On 2/20/20 9:34 AM, Ben Gardon wrote:
> >
> > FWIW, we currently do this eager splitting at Google for live
> > migration. When the log-dirty-memory flag is set on a memslot we
> > eagerly split all pages in the slot down to 4k granularity.
> > As Jay said, this does not cause crippling lock contention because the
> > vCPU page faults generated by write protection / splitting can be
> > resolved in the fast page fault path without acquiring the MMU lock.
> > I believe +Junaid Shahid tried to upstream this approach at some point
> > in the past, but the patch set didn't make it in. (This was before my
> > time, so I'm hoping he has a link.)
> > I haven't done the analysis to know if eager splitting is more or less
> > efficient with parallel slow-path page faults, but it's definitely
> > faster under the MMU lock.
> >
>
> I am not sure if we ever posted those patches upstream. Peter Feiner would know for sure. One notable difference in what we do compared to the approach outlined by Jay is that we don't rely on tdp_page_fault() to do the splitting. So we don't have to create a dummy VCPU and the specialized split function is also much faster.

We've been carrying these patches since 2015. I've never posted them.
Getting them in shape for upstream consumption will take some work. I
can look into this next week.

Peter

* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-22  0:19               ` Peter Feiner
@ 2020-02-24  1:07                 ` Zhoujian (jay)
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-02-24  1:07 UTC (permalink / raw)
  To: Peter Feiner, Junaid Shahid
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C)



> -----Original Message-----
> From: Peter Feiner [mailto:pfeiner@google.com]
> Sent: Saturday, February 22, 2020 8:19 AM
> To: Junaid Shahid <junaids@google.com>
> Cc: Ben Gardon <bgardon@google.com>; Zhoujian (jay)
> <jianjay.zhou@huawei.com>; Peter Xu <peterx@redhat.com>;
> kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com> wrote:
> >
> > On 2/20/20 9:34 AM, Ben Gardon wrote:
> > >
> > > FWIW, we currently do this eager splitting at Google for live
> > > migration. When the log-dirty-memory flag is set on a memslot we
> > > eagerly split all pages in the slot down to 4k granularity.
> > > As Jay said, this does not cause crippling lock contention because
> > > the vCPU page faults generated by write protection / splitting can
> > > be resolved in the fast page fault path without acquiring the MMU lock.
> > > I believe +Junaid Shahid tried to upstream this approach at some
> > > point in the past, but the patch set didn't make it in. (This was
> > > before my time, so I'm hoping he has a link.) I haven't done the
> > > analysis to know if eager splitting is more or less efficient with
> > > parallel slow-path page faults, but it's definitely faster under the
> > > MMU lock.
> > >
> >
> > I am not sure if we ever posted those patches upstream. Peter Feiner would
> know for sure. One notable difference in what we do compared to the approach
> outlined by Jay is that we don't rely on tdp_page_fault() to do the splitting. So
> we don't have to create a dummy VCPU and the specialized split function is also
> much faster.

I'm curious and interested in the way you implemented it, especially since
you mentioned that the performance is much faster without a dummy vCPU.
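
For illustration only, a drastically simplified model of such a specialized
split routine is sketched below; the data structures are invented for the
example, and this is neither the Google patch set nor current KVM code. A
real implementation would also have to allocate a proper page-table page,
propagate accessed/dirty bits and flush TLBs, which this toy model skips.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define SPTES_PER_TABLE 512

    /* Toy SPTE: either a 2M leaf or a pointer to a table of 4K leaves. */
    struct toy_spte {
            uint64_t pfn;              /* base 4K frame number          */
            bool writable;
            bool huge;                 /* true: 2M leaf                 */
            struct toy_spte *child;    /* valid once demoted to a table */
    };

    /* Demote one 2M mapping into 512 4K mappings that keep the original
     * permissions, so that later write protection and dirty tracking only
     * ever touch 4K entries and no vCPU fault is needed to rebuild them. */
    static int split_2m_mapping(struct toy_spte *spte)
    {
            struct toy_spte *tbl;
            int i;

            if (!spte->huge)
                    return 0;                          /* already 4K */
            tbl = calloc(SPTES_PER_TABLE, sizeof(*tbl));
            if (!tbl)
                    return -1;
            for (i = 0; i < SPTES_PER_TABLE; i++) {
                    tbl[i].pfn = spte->pfn + i;        /* contiguous 4K frames */
                    tbl[i].writable = spte->writable;  /* inherit protections  */
                    tbl[i].huge = false;
            }
            spte->child = tbl;                         /* install the new table */
            spte->huge = false;
            return 0;
    }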

> We've been carrying these patches since 2015. I've never posted them.
> Getting them in shape for upstream consumption will take some work. I can
> look into this next week.

It would be nice if you could post it upstream.

Regards,
Jay Zhou

> 
> Peter

* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-22  0:19               ` Peter Feiner
@ 2020-03-02 13:38                 ` Zhoujian (jay)
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-03-02 13:38 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C),
	Junaid Shahid



> -----Original Message-----
> From: Peter Feiner [mailto:pfeiner@google.com]
> Sent: Saturday, February 22, 2020 8:19 AM
> To: Junaid Shahid <junaids@google.com>
> Cc: Ben Gardon <bgardon@google.com>; Zhoujian (jay)
> <jianjay.zhou@huawei.com>; Peter Xu <peterx@redhat.com>;
> kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com> wrote:
> >
> > On 2/20/20 9:34 AM, Ben Gardon wrote:
> > >
> > > FWIW, we currently do this eager splitting at Google for live
> > > migration. When the log-dirty-memory flag is set on a memslot we
> > > eagerly split all pages in the slot down to 4k granularity.
> > > As Jay said, this does not cause crippling lock contention because
> > > the vCPU page faults generated by write protection / splitting can
> > > be resolved in the fast page fault path without acquiring the MMU lock.
> > > I believe +Junaid Shahid tried to upstream this approach at some
> > > point in the past, but the patch set didn't make it in. (This was
> > > before my time, so I'm hoping he has a link.) I haven't done the
> > > analysis to know if eager splitting is more or less efficient with
> > > parallel slow-path page faults, but it's definitely faster under the
> > > MMU lock.
> > >
> >
> > I am not sure if we ever posted those patches upstream. Peter Feiner would
> know for sure. One notable difference in what we do compared to the approach
> outlined by Jay is that we don't rely on tdp_page_fault() to do the splitting. So we
> don't have to create a dummy VCPU and the specialized split function is also
> much faster.
> 
> We've been carrying these patches since 2015. I've never posted them.
> Getting them in shape for upstream consumption will take some work. I can look
> into this next week.

Hi Peter Feiner,

May I ask whether there are any new updates on your plan? Sorry to disturb you.

Regards,
Jay Zhou

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-03-02 13:38                 ` Zhoujian (jay)
@ 2020-03-02 16:28                 ` Peter Feiner
  2020-03-03  4:29                     ` Zhoujian (jay)
  -1 siblings, 1 reply; 28+ messages in thread
From: Peter Feiner @ 2020-03-02 16:28 UTC (permalink / raw)
  To: Zhoujian (jay)
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C),
	Junaid Shahid

On Mon, Mar 2, 2020, 5:38 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:

>
>
> > -----Original Message-----
> > From: Peter Feiner [mailto:pfeiner@google.com]
> > Sent: Saturday, February 22, 2020 8:19 AM
> > To: Junaid Shahid <junaids@google.com>
> > Cc: Ben Gardon <bgardon@google.com>; Zhoujian (jay)
> > <jianjay.zhou@huawei.com>; Peter Xu <peterx@redhat.com>;
> > kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin
> (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> >
> > On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com>
> wrote:
> > >
> > > On 2/20/20 9:34 AM, Ben Gardon wrote:
> > > >
> > > > FWIW, we currently do this eager splitting at Google for live
> > > > migration. When the log-dirty-memory flag is set on a memslot we
> > > > eagerly split all pages in the slot down to 4k granularity.
> > > > As Jay said, this does not cause crippling lock contention because
> > > > the vCPU page faults generated by write protection / splitting can
> > > > be resolved in the fast page fault path without acquiring the MMU
> lock.
> > > > I believe +Junaid Shahid tried to upstream this approach at some
> > > > point in the past, but the patch set didn't make it in. (This was
> > > > before my time, so I'm hoping he has a link.) I haven't done the
> > > > analysis to know if eager splitting is more or less efficient with
> > > > parallel slow-path page faults, but it's definitely faster under the
> > > > MMU lock.
> > > >
> > >
> > > I am not sure if we ever posted those patches upstream. Peter Feiner
> would
> > know for sure. One notable difference in what we do compared to the
> approach
> > outlined by Jay is that we don't rely on tdp_page_fault() to do the
> splitting. So we
> > don't have to create a dummy VCPU and the specialized split function is
> also
> > much faster.
> >
> > We've been carrying these patches since 2015. I've never posted them.
> > Getting them in shape for upstream consumption will take some work. I
> can look
> > into this next week.
>
> Hi Peter Feiner,
>
> May I ask whether there are any new updates on your plan? Sorry to disturb
> you.
>


Hi Jay,

I've been sick since I sent my last email, so I haven't gotten to this
patch set yet. I'll send it in the next week or two.

Peter


> Regards,
> Jay Zhou
>


* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-03-02 16:28                 ` Peter Feiner
@ 2020-03-03  4:29                     ` Zhoujian (jay)
  0 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-03-03  4:29 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C),
	Junaid Shahid



From: Peter Feiner [mailto:pfeiner@google.com] 
Sent: Tuesday, March 3, 2020 12:29 AM
To: Zhoujian (jay) <jianjay.zhou@huawei.com>
Cc: Ben Gardon <bgardon@google.com>; Peter Xu <peterx@redhat.com>; kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com; dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul) <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C) <weidong.huang@huawei.com>; Junaid Shahid <junaids@google.com>
Subject: Re: RFC: Split EPT huge pages in advance of dirty logging

> -----Original Message-----
> From: Peter Feiner [mailto:pfeiner@google.com]

[...]

>Hi Jay,
>I've been sick since I sent my last email, so I haven't gotten to this patch set yet. I'll send it in the next week or two. 

OK, please take care of yourself.


Regards,
Jay Zhou
