* RFC: Split EPT huge pages in advance of dirty logging
From: Zhoujian (jay) @ 2020-02-18 13:13 UTC
  To: kvm, qemu-devel
  Cc: pbonzini, peterx, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C), Zhoujian (jay)

Hi all,

We found that the guest will occasionally soft-lockup when live migrating a 60 vCPU,
512 GiB, huge-page-backed and memory-intensive VM. The reason is clear: almost all of
the vCPUs are waiting for the KVM MMU spin lock to create 4K SPTEs once the huge pages
are write protected. This phenomenon is also described in this patch set:
https://patchwork.kernel.org/cover/11163459/
which aims to handle page faults in parallel more efficiently.

Our idea is to use the migration thread to touch all of the guest memory at 4K
granularity before enabling dirty logging. To be more specific, we first split all the
PDPE_LEVEL SPTEs into DIRECTORY_LEVEL SPTEs, and then split all the
DIRECTORY_LEVEL SPTEs into PAGE_TABLE_LEVEL SPTEs.

However, there is a side effect: it takes more time to clear the D-bits of the last-level
SPTEs when enabling dirty logging, which is done while holding the QEMU BQL and the
KVM mmu_lock simultaneously. To solve this issue, the idea of enabling dirty logging
gradually in small chunks has been proposed as well; here is the link for v1:
https://patchwork.kernel.org/patch/11388227/

On an Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz host, some tests have been done with
a 60 vCPU, 256 GiB (60U256G) VM that has NUMA balancing enabled, using the demo we
wrote. We start a process with 60 threads that randomly touch most of the memory in
the VM, and meanwhile measure the function execution time inside the VM during live
migration. change_prot_numa() is chosen since it does not release the CPU until its
work is finished. Here are the numbers:

                    Original                   The demo we wrote
[1]                 > 9s (most of the time)    ~5ms
Hypervisor cost     > 90%                      ~3%

[1]: execution time of the change_prot_numa() function

If the time in [1] is bigger than 20s, it will result in a soft lockup.
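
For reference, here is a minimal sketch (not the actual demo) of the kind of guest
workload described above: a fixed number of threads randomly dirtying 4K pages of one
large anonymous allocation. The thread count, buffer size and access pattern are
illustrative assumptions.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS  60
#define BUF_SIZE  (200UL << 30)   /* touch most of a 256G guest */
#define PAGE_SZ   4096UL

static uint8_t *buf;

/* Each thread dirties one random 4K page per iteration, forever. */
static void *toucher(void *arg)
{
    unsigned int seed = (unsigned int)(uintptr_t)arg;

    for (;;) {
        size_t page = (size_t)rand_r(&seed) % (BUF_SIZE / PAGE_SZ);
        buf[page * PAGE_SZ] = (uint8_t)seed;
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    long i;

    buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, toucher, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}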

I know it is a little hacky to do so, but my question is: is it worth trying to split
EPT huge pages in advance of dirty logging?

Any advice will be appreciated, thanks.

Regards,
Jay Zhou

* Re: RFC: Split EPT huge pages in advance of dirty logging
From: Peter Xu @ 2020-02-18 17:43 UTC
  To: Zhoujian (jay)
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C)

On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> Hi all,
> 
> We found that the guest will be soft-lockup occasionally when live migrating a 60 vCPU,
> 512GiB huge page and memory sensitive VM. The reason is clear, almost all of the vCPUs
> are waiting for the KVM MMU spin-lock to create 4K SPTEs when the huge pages are
> write protected. This phenomenon is also described in this patch set:
> https://patchwork.kernel.org/cover/11163459/
> which aims to handle page faults in parallel more efficiently.
> 
> Our idea is to use the migration thread to touch all of the guest memory in the
> granularity of 4K before enabling dirty logging. To be more specific, we split all the
> PDPE_LEVEL SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then split all
> the DIRECTORY_LEVEL SPTEs into PAGE_TABLE_LEVEL SPTEs as the following step.

IIUC, QEMU will prefer to use huge pages for all the anonymous
ramblocks (please refer to ram_block_add):

        qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);

Another alternative I can think of is to add an extra parameter to
QEMU to explicitly disable huge pages (so that could even be
MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However, that would also
drag down the performance for the whole lifecycle of the VM.  A 3rd
option is to make a QMP command to dynamically turn huge pages on/off
for ramblocks globally.  I haven't thought deeply about any of them,
but they all seem doable.
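
To make the alternative concrete, it essentially boils down to a plain madvise(2)
hint on the guest RAM mapping, analogous to what ram_block_add() already does with
QEMU_MADV_HUGEPAGE. A minimal illustrative helper (not QEMU code) might look like:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Advise the kernel not to back this region with transparent huge pages. */
static int disable_thp_for_region(void *host, size_t len)
{
    return madvise(host, len, MADV_NOHUGEPAGE);
}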

Thanks,

-- 
Peter Xu


* RE: RFC: Split EPT huge pages in advance of dirty logging
From: Zhoujian (jay) @ 2020-02-19 13:19 UTC
  To: Peter Xu
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C)

Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, February 19, 2020 1:43 AM
> To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > Hi all,
> >
> > We found that the guest will be soft-lockup occasionally when live
> > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > reason is clear, almost all of the vCPUs are waiting for the KVM MMU
> > spin-lock to create 4K SPTEs when the huge pages are write protected. This
> phenomenon is also described in this patch set:
> > https://patchwork.kernel.org/cover/11163459/
> > which aims to handle page faults in parallel more efficiently.
> >
> > Our idea is to use the migration thread to touch all of the guest
> > memory in the granularity of 4K before enabling dirty logging. To be
> > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs into
> PAGE_TABLE_LEVEL SPTEs as the following step.
> 
> IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> (please refer to ram_block_add):
> 
>         qemu_madvise(new_block->host, new_block->max_length,
> QEMU_MADV_HUGEPAGE);

Yes, you're right

> 
> Another alternative I can think of is to add an extra parameter to QEMU to
> explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> instead of MADV_HUGEPAGE).  However that should also drag down the
> performance for the whole lifecycle of the VM.  

From the performance point of view, it is better to keep the huge pages
when the VM is not in the live migration state.

> A 3rd option is to make a QMP
> command to dynamically turn huge pages on/off for ramblocks globally.

We're searching for a dynamic method too.
We plan to add two new flags for each memory slot, say
KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
through the KVM_SET_USER_MEMORY_REGION ioctl.

mapping_level(), which is called by tdp_page_fault() on the kernel side,
will return PT_DIRECTORY_LEVEL if the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is
set, and PT_PAGE_TABLE_LEVEL if the
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
 
The key steps to split the huge pages in advance of enabling dirty logging
are as follows (a rough userspace sketch of the sequence is shown after the list):
1. The migration thread in user space uses the
KVM_SET_USER_MEMORY_REGION ioctl to set the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
2. The migration thread then uses the KVM_SPLIT_HUGE_PAGES
ioctl (which is newly added) to do the splitting of large pages on the
kernel side.
3. A new vCPU is created temporarily (it does some initialization but will
not run) to help to do the work, i.e. it is passed as the parameter of
tdp_page_fault.
4. Collect the GPA ranges of all the memory slots with the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
5. Split the 1G huge pages (collected in step 4) into 2M by calling
tdp_page_fault, since mapping_level will return
PT_DIRECTORY_LEVEL. Here is the main difference from the usual
path, which is triggered by the guest side (EPT violation/misconfig etc.):
we call it directly on the hypervisor side.
6. Do some cleanups, i.e. free the vCPU related resources.
7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
8. Repeat step 1 ~ step 7 using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES
instead of KVM_MEM_FORCE_PT_DIRECTORY_PAGES, so that in step 5
the 2M huge pages will be split into 4K pages.
9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
10. Then the migration thread calls the log_start ioctl to enable dirty
logging, and the rest proceeds as usual.
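
A rough userspace sketch of this sequence is below. The two KVM_MEM_FORCE_* flags,
their bit values, and the KVM_SPLIT_HUGE_PAGES ioctl number are all hypothetical
(they exist only in this proposal, not in upstream KVM), and the helper assumes the
caller keeps an array mirroring the current memslot layout:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical flag values and ioctl number, for illustration only. */
#define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1UL << 2)
#define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1UL << 3)
#define KVM_SPLIT_HUGE_PAGES               _IO(KVMIO, 0xd0)

static int split_all_slots(int vm_fd,
                           struct kvm_userspace_memory_region *slots,
                           int nslots, unsigned int force_flag)
{
    int i;

    /* Step 1 (or 8): tag every slot with the "force level" flag. */
    for (i = 0; i < nslots; i++) {
        slots[i].flags |= force_flag;
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
            return -1;
    }

    /* Steps 2-7: KVM walks the flagged slots and splits in the kernel. */
    if (ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0) < 0)
        return -1;

    /* Step 9: clear the flag again. */
    for (i = 0; i < nslots; i++) {
        slots[i].flags &= ~force_flag;
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
            return -1;
    }
    return 0;
}

/* Called from the migration thread before step 10 (log_start):
 *
 *   split_all_slots(vm_fd, slots, n, KVM_MEM_FORCE_PT_DIRECTORY_PAGES);
 *   split_all_slots(vm_fd, slots, n, KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
 */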

What's your take on this? Thanks.

Regards,
Jay Zhou

> Haven't thought deep into any of them, but seems doable.
> 
> Thanks,
> 
> --
> Peter Xu


* Re: RFC: Split EPT huge pages in advance of dirty logging
From: Peter Xu @ 2020-02-19 17:19 UTC
  To: Zhoujian (jay)
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C)

On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> Hi Peter,
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, February 19, 2020 1:43 AM
> > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > 
> > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > Hi all,
> > >
> > > We found that the guest will be soft-lockup occasionally when live
> > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > reason is clear, almost all of the vCPUs are waiting for the KVM MMU
> > > spin-lock to create 4K SPTEs when the huge pages are write protected. This
> > phenomenon is also described in this patch set:
> > > https://patchwork.kernel.org/cover/11163459/
> > > which aims to handle page faults in parallel more efficiently.
> > >
> > > Our idea is to use the migration thread to touch all of the guest
> > > memory in the granularity of 4K before enabling dirty logging. To be
> > > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs into
> > PAGE_TABLE_LEVEL SPTEs as the following step.
> > 
> > IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> > (please refer to ram_block_add):
> > 
> >         qemu_madvise(new_block->host, new_block->max_length,
> > QEMU_MADV_HUGEPAGE);
> 
> Yes, you're right
> 
> > 
> > Another alternative I can think of is to add an extra parameter to QEMU to
> > explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> > instead of MADV_HUGEPAGE).  However that should also drag down the
> > performance for the whole lifecycle of the VM.  
> 
> From the performance point of view, it is better to keep the huge pages
> when the VM is not in the live migration state.
> 
> > A 3rd option is to make a QMP
> > command to dynamically turn huge pages on/off for ramblocks globally.
> 
> We're searching a dynamic method too.
> We plan to add two new flags for each memory slot, say
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
> through KVM_SET_USER_MEMORY_REGION ioctl.
> 
> The mapping_level which is called by tdp_page_fault in the kernel side
> will return PT_DIRECTORY_LEVEL if the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is
> set, and return PT_PAGE_TABLE_LEVEL if the
> KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
>  
> The key steps to split the huge pages in advance of enabling dirty log is
> as follows:
> 1. The migration thread in user space uses
> KVM_SET_USER_MEMORY_REGION ioctl to set the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> 2. The migration thread continues to use the KVM_SPLIT_HUGE_PAGES
> ioctl (which is newly added) to do the splitting of large pages in the
> kernel side.
> 3. A new vCPU is created temporally(do some initialization but will not
> run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> 4. Collect the GPA ranges of all the memory slots with the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> 5. Split the 1G huge pages(collected in step 4) into 2M by calling
> tdp_page_fault, since the mapping_level will return
> PT_DIRECTORY_LEVEL. Here is the main difference from the usual
> path which is caused by the Guest side(EPT violation/misconfig etc),
> we call it directly in the hypervisor side.
> 6. Do some cleanups, i.e. free the vCPU related resources
> 7. The KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7,
> in step 5 the 2M huge pages will be splitted into 4K pages.
> 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> 10. Then the migration thread calls the log_start ioctl to enable the dirty
> logging, and the remaining thing is the same.

I'm not sure... I think it would be good if there were a way to have
finer-granularity control over using huge pages for any process; then
KVM could directly leverage that, because KVM page tables should always
respect the mm configuration (so e.g. when a huge page is split,
KVM gets notified via mmu notifiers).  Have you thought of such a
more general way?

(And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
 and probably won't split any huge page at all after madvise() returns..)

To tell the truth I'm still confused about how splitting the huge pages
helped in your case...  If I read it right, the test reduced some execution
time from 9s to a few ms after your splitting of huge pages.  The
thing is, I don't see how splitting huge pages could solve the mmu_lock
contention with the huge VM, because IMO even if we split the huge
pages into smaller ones, those pages should still be write-protected
and need basically the same number of page faults to resolve when
accessed/written?  And I thought that should only be fixed with
solutions like what Ben has proposed with the MMU rework.  Could you
show me what I've missed?

Thanks,

-- 
Peter Xu


* RE: RFC: Split EPT huge pages in advance of dirty logging
From: Zhoujian (jay) @ 2020-02-20 13:52 UTC
  To: Peter Xu
  Cc: kvm, qemu-devel, pbonzini, dgilbert, quintela, Liujinsong (Paul),
	linfeng (M), wangxin (U), Huangweidong (C),
	bgardon



> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, February 20, 2020 1:19 AM
> To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > Hi Peter,
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> pbonzini@redhat.com;
> > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > <weidong.huang@huawei.com>
> > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > >
> > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > Hi all,
> > > >
> > > > We found that the guest will be soft-lockup occasionally when live
> > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > protected. This
> > > phenomenon is also described in this patch set:
> > > > https://patchwork.kernel.org/cover/11163459/
> > > > which aims to handle page faults in parallel more efficiently.
> > > >
> > > > Our idea is to use the migration thread to touch all of the guest
> > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > DIRECTORY_LEVEL SPTEs into
> > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > >
> > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > ramblocks (please refer to ram_block_add):
> > >
> > >         qemu_madvise(new_block->host, new_block->max_length,
> > > QEMU_MADV_HUGEPAGE);
> >
> > Yes, you're right
> >
> > >
> > > Another alternative I can think of is to add an extra parameter to
> > > QEMU to explicitly disable huge pages (so that can even be
> > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> should also
> > > drag down the performance for the whole lifecycle of the VM.
> >
> > From the performance point of view, it is better to keep the huge
> > pages when the VM is not in the live migration state.
> >
> > > A 3rd option is to make a QMP
> > > command to dynamically turn huge pages on/off for ramblocks globally.
> >
> > We're searching a dynamic method too.
> > We plan to add two new flags for each memory slot, say
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through
> > KVM_SET_USER_MEMORY_REGION ioctl.
> >
> > The mapping_level which is called by tdp_page_fault in the kernel side
> > will return PT_DIRECTORY_LEVEL if the
> KVM_MEM_FORCE_PT_DIRECTORY_PAGES
> > flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL if the
> > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
> >
> > The key steps to split the huge pages in advance of enabling dirty log
> > is as follows:
> > 1. The migration thread in user space uses
> KVM_SET_USER_MEMORY_REGION
> > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> memory
> > slot.
> > 2. The migration thread continues to use the KVM_SPLIT_HUGE_PAGES
> > ioctl (which is newly added) to do the splitting of large pages in the
> > kernel side.
> > 3. A new vCPU is created temporally(do some initialization but will
> > not
> > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > 4. Collect the GPA ranges of all the memory slots with the
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > 5. Split the 1G huge pages(collected in step 4) into 2M by calling
> > tdp_page_fault, since the mapping_level will return
> > PT_DIRECTORY_LEVEL. Here is the main difference from the usual path
> > which is caused by the Guest side(EPT violation/misconfig etc), we
> > call it directly in the hypervisor side.
> > 6. Do some cleanups, i.e. free the vCPU related resources 7. The
> > KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> > 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7, in step
> 5
> > the 2M huge pages will be splitted into 4K pages.
> > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > 10. Then the migration thread calls the log_start ioctl to enable the
> > dirty logging, and the remaining thing is the same.
> 
> I'm not sure... I think it would be good if there is a way to have finer granularity
> control on using huge pages for any process, then KVM can directly leverage
> that because KVM page tables should always respect the mm configurations on
> these (so e.g. when huge page split, KVM gets notifications via mmu notifiers).
> Have you thought of such a more general way?

I did think about this. If we split the huge pages of the process into 4K, I'm
afraid it will not work for the huge page sharing scenarios, e.g. DPDK,
SPDK, etc. So the goal is to split only the EPT page tables and keep the VM
process page tables (e.g. QEMU's) untouched.

> 
> (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> and probably won't split any huge page at all after madvise() returns..)
> To tell the truth I'm still confused on how split of huge pages helped in your
> case...  

I'm sorry if the meaning is not expressed clearly, and thanks for your patience.

> If I read it right the test reduced some execution time from 9s to a
> few ms after your splittion of huge pages.  

Yes

> The thing is I don't see how split of
> huge pages could solve the mmu_lock contention with the huge VM, because
> IMO even if we split the huge pages into smaller ones, those pages should still
> be write-protected and need merely the same number of page faults to resolve
> when accessed/written? And I thought that should only be fixed with
> solutions like what Ben has proposed with the MMU rework. Could you show
> me what I've missed?

Let me try to describe the reason for the mmu_lock contention more clearly, and
what we tried to do about it...
The huge VM only has EPT SPTEs at level >= 2; level 1 SPTEs don't
exist at the beginning. Write-protecting all the huge pages will trigger EPT
violations to create level 1 SPTEs for all the vCPUs which want to write
memory. Different vCPUs write different areas of
the memory, but they all need the same kvm->mmu_lock to create the level 1
SPTEs. This situation gets worse when the number of vCPUs and the memory of
the VM are large (in our case 60U512G) and the VM has
memory-write-intensive work to do at the same time. In order to reduce the
mmu_lock contention, we try to write-protect VM memory gradually in small
chunks, such as 1G or 2M, using a vCPU temporarily created by the migration
thread to split 1G into 2M as the first step, and 2M into 4K as the second step
(this is a little hacky... and I do not know whether any side effect will be
triggered).
Compared to write-protecting all VM memory in one go, the write-protected
range is limited in this way, and only the vCPUs writing to this limited
range are involved in taking the mmu_lock. The contention is reduced
since the memory range is small and the number of vCPUs involved is small
too.
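
To illustrate the chunked approach, here is a rough sketch of how the migration
thread could drive the splitting chunk by chunk; the range-carrying variant of the
splitting ioctl and its argument struct are assumptions for illustration only, not
existing (or proposed) KVM API:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical range-based variant of the splitting ioctl, so that the
 * migration thread can work through guest memory chunk by chunk. */
struct kvm_split_range {
    __u64 gpa;
    __u64 size;
};
#define KVM_SPLIT_HUGE_PAGES_RANGE  _IOW(KVMIO, 0xd1, struct kvm_split_range)

/* Split [start, start + len) in chunks of 'chunk' bytes (e.g. 1G), so
 * that only the vCPUs writing the current chunk contend on mmu_lock. */
static int split_in_chunks(int vm_fd, __u64 start, __u64 len, __u64 chunk)
{
    __u64 off;

    for (off = 0; off < len; off += chunk) {
        struct kvm_split_range r = {
            .gpa  = start + off,
            .size = (len - off < chunk) ? len - off : chunk,
        };

        if (ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES_RANGE, &r) < 0)
            return -1;
    }
    return 0;
}

/* One pass per split level: first with the slots flagged for 2M, then
 * again with the slots flagged for 4K, iterating each slot's GPA range
 * in 1G (or 2M) chunks. */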

Of course, it takes some extra time to split all the huge pages into 4K
pages before the real migration starts, about 60s for 512G in my experiment.

During the iterative memory copy phase, PML will do the dirty logging work
(the 4K pages are not write-protected in that case), or, IIRC, fast_page_fault
is used to mark pages dirty if PML is not supported, in which case the
mmu_lock is not needed.

Regards,
Jay Zhou

* Re: RFC: Split EPT huge pages in advance of dirty logging
From: Ben Gardon @ 2020-02-20 17:32 UTC
  To: Zhoujian (jay)
  Cc: Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C),
	Junaid Shahid

FWIW, we currently do this eager splitting at Google for live
migration. When the log-dirty-memory flag is set on a memslot we
eagerly split all pages in the slot down to 4k granularity.
As Jay said, this does not cause crippling lock contention because the
vCPU page faults generated by write protection / splitting can be
resolved in the fast page fault path without acquiring the MMU lock.
I believe +Junaid Shahid tried to upstream this approach at some point
in the past, but the patch set didn't make it in. (This was before my
time, so I'm hoping he has a link.)
I haven't done the analysis to know if eager splitting is more or less
efficient with parallel slow-path page faults, but it's definitely
faster under the MMU lock.

On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Thursday, February 20, 2020 1:19 AM
> > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> >
> > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > Hi Peter,
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > pbonzini@redhat.com;
> > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > <weidong.huang@huawei.com>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > Hi all,
> > > > >
> > > > > We found that the guest will be soft-lockup occasionally when live
> > > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > > protected. This
> > > > phenomenon is also described in this patch set:
> > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > which aims to handle page faults in parallel more efficiently.
> > > > >
> > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > > DIRECTORY_LEVEL SPTEs into
> > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > >
> > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > ramblocks (please refer to ram_block_add):
> > > >
> > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > QEMU_MADV_HUGEPAGE);
> > >
> > > Yes, you're right
> > >
> > > >
> > > > Another alternative I can think of is to add an extra parameter to
> > > > QEMU to explicitly disable huge pages (so that can even be
> > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> > should also
> > > > drag down the performance for the whole lifecycle of the VM.
> > >
> > > From the performance point of view, it is better to keep the huge
> > > pages when the VM is not in the live migration state.
> > >
> > > > A 3rd option is to make a QMP
> > > > command to dynamically turn huge pages on/off for ramblocks globally.
> > >
> > > We're searching a dynamic method too.
> > > We plan to add two new flags for each memory slot, say
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through
> > > KVM_SET_USER_MEMORY_REGION ioctl.
> > >
> > > The mapping_level which is called by tdp_page_fault in the kernel side
> > > will return PT_DIRECTORY_LEVEL if the
> > KVM_MEM_FORCE_PT_DIRECTORY_PAGES
> > > flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL if the
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
> > >
> > > The key steps to split the huge pages in advance of enabling dirty log
> > > is as follows:
> > > 1. The migration thread in user space uses
> > KVM_SET_USER_MEMORY_REGION
> > > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> > memory
> > > slot.
> > > 2. The migration thread continues to use the KVM_SPLIT_HUGE_PAGES
> > > ioctl (which is newly added) to do the splitting of large pages in the
> > > kernel side.
> > > 3. A new vCPU is created temporally(do some initialization but will
> > > not
> > > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > > 4. Collect the GPA ranges of all the memory slots with the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > 5. Split the 1G huge pages(collected in step 4) into 2M by calling
> > > tdp_page_fault, since the mapping_level will return
> > > PT_DIRECTORY_LEVEL. Here is the main difference from the usual path
> > > which is caused by the Guest side(EPT violation/misconfig etc), we
> > > call it directly in the hypervisor side.
> > > 6. Do some cleanups, i.e. free the vCPU related resources 7. The
> > > KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> > > 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7, in step
> > 5
> > > the 2M huge pages will be splitted into 4K pages.
> > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > 10. Then the migration thread calls the log_start ioctl to enable the
> > > dirty logging, and the remaining thing is the same.
> >
> > I'm not sure... I think it would be good if there is a way to have finer granularity
> > control on using huge pages for any process, then KVM can directly leverage
> > that because KVM page tables should always respect the mm configurations on
> > these (so e.g. when huge page split, KVM gets notifications via mmu notifiers).
> > Have you thought of such a more general way?
>
> I did have thought of this, if we split the huge pages into 4K of a process, I'm
> afraid it will not be workable for the huge pages sharing scenario, e.g. DPDK,
> SPDK etc. So, only split the EPT page table and keep the VM process page table
> (e.g. qemu) untouched is the goal.
>
> >
> > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > and probably won't split any huge page at all after madvise() returns..)
> > To tell the truth I'm still confused on how split of huge pages helped in your
> > case...
>
> I'm sorry if the meaning is not expressed clearly, and thanks for your patience.
>
> > If I read it right the test reduced some execution time from 9s to a
> > few ms after your splittion of huge pages.
>
> Yes
>
> > The thing is I don't see how split of
> > huge pages could solve the mmu_lock contention with the huge VM, because
> > IMO even if we split the huge pages into smaller ones, those pages should still
> > be write-protected and need merely the same number of page faults to resolve
> > when accessed/written? And I thought that should only be fixed with
> > solutions like what Ben has proposed with the MMU rework. Could you show
> > me what I've missed?
>
> Let me try to describe the reason for the mmu_lock contention more
> clearly, and the effort we tried to make...
> The huge VM only has EPT sptes at level >= 2; level 1 sptes don't exist
> at the beginning. Write-protecting all the huge pages will trigger EPT
> violations to create level 1 sptes for all the vCPUs which want to
> write the content of the memory. Different vCPUs write different areas
> of the memory, but they all need the same kvm->mmu_lock to create the
> level 1 sptes; this situation gets worse when the number of vCPUs and
> the memory of the VM are large (60 vCPUs/512G in our case) and the VM
> is doing memory-write-intensive work at the same time. In order to
> reduce the mmu_lock contention, we try to write protect VM memory
> gradually in small chunks, such as 1G or 2M, using a vCPU created
> temporarily by the migration thread to split 1G to 2M as the first
> step, and then 2M to 4K as the second step (this is a little hacky...
> and I do not know whether any side effects will be triggered).
> Compared to write-protecting all VM memory in one go, the
> write-protected range is limited this way, and only the vCPUs writing
> this limited range are involved in taking the mmu_lock. The contention
> is reduced since the memory range is small and the number of vCPUs
> involved is small too.
>
> Of course, it takes some extra time to split all the huge pages into 4K
> pages before the real migration starts, about 60s for 512G in my
> experiment.
>
> During the iterative memory copy phase, PML will do the dirty logging
> work (the 4K pages are not write-protected in that case), or, IIRC,
> fast_page_fault marks pages dirty if PML is not supported, in which
> case the mmu_lock is not needed.
>
> Regards,
> Jay Zhou
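
For concreteness, the flow in the quoted steps could look roughly like the
sketch below from the migration thread's point of view. This is only an
illustration: KVM_MEM_FORCE_PT_DIRECTORY_PAGES,
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES and KVM_SPLIT_HUGE_PAGES are the flag/ioctl
names proposed in this thread, not existing kernel ABI, so the values below
are placeholders.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Proposed in this thread, not upstream ABI; values are placeholders. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1U << 3)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1U << 4)
    #define KVM_SPLIT_HUGE_PAGES               _IO(KVMIO, 0xff)

    /* Split every huge mapping in the given slots one level down. */
    static void split_one_level(int vm_fd,
                                struct kvm_userspace_memory_region *slots,
                                int nslots, uint32_t force_flag)
    {
        for (int i = 0; i < nslots; i++) {
            slots[i].flags |= force_flag;              /* step 1 (or 8) */
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]);
        }

        ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);         /* steps 2-7 */

        for (int i = 0; i < nslots; i++) {
            slots[i].flags &= ~force_flag;             /* step 9 */
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]);
        }
    }

    static void presplit_before_dirty_logging(int vm_fd,
                                              struct kvm_userspace_memory_region *slots,
                                              int nslots)
    {
        /* 1G -> 2M first, then 2M -> 4K, as in the quoted steps. */
        split_one_level(vm_fd, slots, nslots, KVM_MEM_FORCE_PT_DIRECTORY_PAGES);
        split_one_level(vm_fd, slots, nslots, KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
        /* Step 10: the caller then enables dirty logging (log_start) as usual. */
    }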

^ permalink raw reply	[flat|nested] 28+ messages in thread
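
As an aside to the qemu_madvise()/MADV_NOHUGEPAGE exchange quoted in the
messages above and below: both hints boil down to a plain madvise() call on
the ramblock, and, as noted in the thread, MADV_NOHUGEPAGE only marks the
range as ineligible for new huge pages; it does not synchronously split
mappings that already exist. A minimal sketch (the function name is made up
for the example):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    /* Toggle THP eligibility for a guest RAM range. */
    static int set_thp_hint(void *host_addr, size_t len, int allow_huge)
    {
        return madvise(host_addr, len,
                       allow_huge ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    }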

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 13:52         ` Zhoujian (jay)
@ 2020-02-20 17:34           ` Ben Gardon
  -1 siblings, 0 replies; 28+ messages in thread
From: Ben Gardon @ 2020-02-20 17:34 UTC (permalink / raw)
  To: Zhoujian (jay), Junaid Shahid
  Cc: Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)

On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Thursday, February 20, 2020 1:19 AM
> > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> >
> > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > Hi Peter,
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > pbonzini@redhat.com;
> > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > <weidong.huang@huawei.com>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > Hi all,
> > > > >
> > > > > We found that the guest will be soft-lockup occasionally when live
> > > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > > protected. This
> > > > phenomenon is also described in this patch set:
> > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > which aims to handle page faults in parallel more efficiently.
> > > > >
> > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > > DIRECTORY_LEVEL SPTEs into
> > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > >
> > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > ramblocks (please refer to ram_block_add):
> > > >
> > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > QEMU_MADV_HUGEPAGE);
> > >
> > > Yes, you're right
> > >
> > > >
> > > > Another alternative I can think of is to add an extra parameter to
> > > > QEMU to explicitly disable huge pages (so that can even be
> > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> > should also
> > > > drag down the performance for the whole lifecycle of the VM.
> > >
> > > From the performance point of view, it is better to keep the huge
> > > pages when the VM is not in the live migration state.
> > >
> > > > A 3rd option is to make a QMP
> > > > command to dynamically turn huge pages on/off for ramblocks globally.
> > >
> > > We're searching for a dynamic method too.
> > > We plan to add two new flags for each memory slot, say
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through the
> > > KVM_SET_USER_MEMORY_REGION ioctl.
> > >
> > > The mapping_level function, which is called by tdp_page_fault on the
> > > kernel side, will return PT_DIRECTORY_LEVEL if the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set, and
> > > PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is
> > > set.
> > >
> > > The key steps to split the huge pages in advance of enabling the dirty
> > > log are as follows:
> > > 1. The migration thread in user space uses the
> > > KVM_SET_USER_MEMORY_REGION ioctl to set the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> > > 2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
> > > (which is newly added) to do the splitting of large pages on the kernel
> > > side.
> > > 3. A new vCPU is created temporarily (it does some initialization but
> > > will not run) to help do the work, i.e. as the parameter of
> > > tdp_page_fault.
> > > 4. Collect the GPA ranges of all the memory slots with the
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > 5. Split the 1G huge pages (collected in step 4) into 2M by calling
> > > tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
> > > This is the main difference from the usual path, which is triggered by
> > > the guest side (EPT violation/misconfig etc.); here we call it directly
> > > on the hypervisor side.
> > > 6. Do some cleanups, i.e. free the vCPU-related resources.
> > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in step 5
> > > the 2M huge pages will be split into 4K pages.
> > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > 10. Then the migration thread calls the log_start ioctl to enable dirty
> > > logging, and the remaining steps are the same.
> >
> > I'm not sure... I think it would be good if there were a way to have
> > finer-granularity control over huge page usage for any process; KVM could
> > then directly leverage that, because KVM page tables should always respect
> > the mm configuration (so e.g. when a huge page is split, KVM gets notified
> > via MMU notifiers). Have you thought of such a more general way?
>
> I did think of this. If we split the huge pages of a process into 4K, I'm
> afraid it will not work for the huge page sharing scenarios, e.g. DPDK,
> SPDK, etc. So the goal is to split only the EPT page table and keep the VM
> process page table (e.g. QEMU's) untouched.
>
> >
> > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > and probably won't split any huge page at all after madvise() returns..)
> > To tell the truth I'm still confused about how splitting huge pages helped
> > in your case...
>
> I'm sorry if the meaning is not expressed clearly, and thanks for your patience.
>
> > If I read it right, the test reduced the execution time from 9s to a
> > few ms after your splitting of huge pages.
>
> Yes
>
> > The thing is I don't see how splitting
> > huge pages could solve the mmu_lock contention with the huge VM, because
> > IMO even if we split the huge pages into smaller ones, those pages should
> > still be write-protected and need nearly the same number of page faults to
> > resolve when accessed/written? And I thought that should only be fixed with
> > solutions like what Ben has proposed with the MMU rework. Could you show
> > me what I've missed?
>
> Let me try to describe the reason for the mmu_lock contention more
> clearly, and the effort we tried to make...
> The huge VM only has EPT sptes at level >= 2; level 1 sptes don't exist
> at the beginning. Write-protecting all the huge pages will trigger EPT
> violations to create level 1 sptes for all the vCPUs which want to
> write the content of the memory. Different vCPUs write different areas
> of the memory, but they all need the same kvm->mmu_lock to create the
> level 1 sptes; this situation gets worse when the number of vCPUs and
> the memory of the VM are large (60 vCPUs/512G in our case) and the VM
> is doing memory-write-intensive work at the same time. In order to
> reduce the mmu_lock contention, we try to write protect VM memory
> gradually in small chunks, such as 1G or 2M, using a vCPU created
> temporarily by the migration thread to split 1G to 2M as the first
> step, and then 2M to 4K as the second step (this is a little hacky...
> and I do not know whether any side effects will be triggered).
> Compared to write-protecting all VM memory in one go, the
> write-protected range is limited this way, and only the vCPUs writing
> this limited range are involved in taking the mmu_lock. The contention
> is reduced since the memory range is small and the number of vCPUs
> involved is small too.
>
> Of course, it takes some extra time to split all the huge pages into 4K
> pages before the real migration starts, about 60s for 512G in my
> experiment.
>
> During the iterative memory copy phase, PML will do the dirty logging
> work (the 4K pages are not write-protected in that case), or, IIRC,
> fast_page_fault marks pages dirty if PML is not supported, in which
> case the mmu_lock is not needed.
>
> Regards,
> Jay Zhou

(Ah I top-posted I'm sorry. Re-sending at the bottom.)

FWIW, we currently do this eager splitting at Google for live
migration. When the log-dirty-memory flag is set on a memslot we
eagerly split all pages in the slot down to 4k granularity.
As Jay said, this does not cause crippling lock contention because the
vCPU page faults generated by write protection / splitting can be
resolved in the fast page fault path without acquiring the MMU lock.
I believe +Junaid Shahid tried to upstream this approach at some point
in the past, but the patch set didn't make it in. (This was before my
time, so I'm hoping he has a link.)
I haven't done the analysis to know if eager splitting is more or less
efficient with parallel slow-path page faults, but it's definitely
faster under the MMU lock.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 17:34           ` Ben Gardon
@ 2020-02-20 18:17             ` Peter Xu
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2020-02-20 18:17 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Zhoujian (jay),
	Junaid Shahid, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)

On Thu, Feb 20, 2020 at 09:34:52AM -0800, Ben Gardon wrote:
> On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Thursday, February 20, 2020 1:19 AM
> > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> > > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > <weidong.huang@huawei.com>
> > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > >
> > > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > > Hi Peter,
> > > >
> > > > > -----Original Message-----
> > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > > pbonzini@redhat.com;
> > > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > > <weidong.huang@huawei.com>
> > > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > > >
> > > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We found that the guest will be soft-lockup occasionally when live
> > > > > > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > > > > > reason is clear, almost all of the vCPUs are waiting for the KVM
> > > > > > MMU spin-lock to create 4K SPTEs when the huge pages are write
> > > > > > protected. This
> > > > > phenomenon is also described in this patch set:
> > > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > > which aims to handle page faults in parallel more efficiently.
> > > > > >
> > > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > > memory in the granularity of 4K before enabling dirty logging. To
> > > > > > be more specific, we split all the PDPE_LEVEL SPTEs into
> > > > > > DIRECTORY_LEVEL SPTEs as the first step, and then split all the
> > > > > > DIRECTORY_LEVEL SPTEs into
> > > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > > >
> > > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > > ramblocks (please refer to ram_block_add):
> > > > >
> > > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > > QEMU_MADV_HUGEPAGE);
> > > >
> > > > Yes, you're right
> > > >
> > > > >
> > > > > Another alternative I can think of is to add an extra parameter to
> > > > > QEMU to explicitly disable huge pages (so that can even be
> > > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that
> > > should also
> > > > > drag down the performance for the whole lifecycle of the VM.
> > > >
> > > > From the performance point of view, it is better to keep the huge
> > > > pages when the VM is not in the live migration state.
> > > >
> > > > > A 3rd option is to make a QMP
> > > > > command to dynamically turn huge pages on/off for ramblocks globally.
> > > >
> > > > We're searching for a dynamic method too.
> > > > We plan to add two new flags for each memory slot, say
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through the
> > > > KVM_SET_USER_MEMORY_REGION ioctl.

[1]

> > > >
> > > > The mapping_level function, which is called by tdp_page_fault on the
> > > > kernel side, will return PT_DIRECTORY_LEVEL if the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set, and
> > > > PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is
> > > > set.
> > > >
> > > > The key steps to split the huge pages in advance of enabling the dirty
> > > > log are as follows:
> > > > 1. The migration thread in user space uses the
> > > > KVM_SET_USER_MEMORY_REGION ioctl to set the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> > > > 2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
> > > > (which is newly added) to do the splitting of large pages on the
> > > > kernel side.
> > > > 3. A new vCPU is created temporarily (it does some initialization but
> > > > will not run) to help do the work, i.e. as the parameter of
> > > > tdp_page_fault.
> > > > 4. Collect the GPA ranges of all the memory slots with the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > > 5. Split the 1G huge pages (collected in step 4) into 2M by calling
> > > > tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
> > > > This is the main difference from the usual path, which is triggered by
> > > > the guest side (EPT violation/misconfig etc.); here we call it directly
> > > > on the hypervisor side.
> > > > 6. Do some cleanups, i.e. free the vCPU-related resources.
> > > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in step 5
> > > > the 2M huge pages will be split into 4K pages.
> > > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > > 10. Then the migration thread calls the log_start ioctl to enable dirty
> > > > logging, and the remaining steps are the same.
> > >
> > > I'm not sure... I think it would be good if there were a way to have
> > > finer-granularity control over huge page usage for any process; KVM could
> > > then directly leverage that, because KVM page tables should always respect
> > > the mm configuration (so e.g. when a huge page is split, KVM gets notified
> > > via MMU notifiers). Have you thought of such a more general way?
> >
> > I did think of this. If we split the huge pages of a process into 4K, I'm
> > afraid it will not work for the huge page sharing scenarios, e.g. DPDK,
> > SPDK, etc. So the goal is to split only the EPT page table and keep the VM
> > process page table (e.g. QEMU's) untouched.

Ah I see your point now.

> >
> > >
> > > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > > and probably won't split any huge page at all after madvise() returns..)
> > > To tell the truth I'm still confused about how splitting huge pages helped
> > > in your case...
> >
> > I'm sorry if the meaning is not expressed clearly, and thanks for your patience.
> >
> > > If I read it right, the test reduced the execution time from 9s to a
> > > few ms after your splitting of huge pages.
> >
> > Yes
> >
> > > The thing is I don't see how splitting
> > > huge pages could solve the mmu_lock contention with the huge VM, because
> > > IMO even if we split the huge pages into smaller ones, those pages should
> > > still be write-protected and need nearly the same number of page faults to
> > > resolve when accessed/written? And I thought that should only be fixed with
> > > solutions like what Ben has proposed with the MMU rework. Could you show
> > > me what I've missed?
> >
> > Let me try to describe the reason for the mmu_lock contention more
> > clearly, and the effort we tried to make...
> > The huge VM only has EPT sptes at level >= 2; level 1 sptes don't exist
> > at the beginning. Write-protecting all the huge pages will trigger EPT
> > violations to create level 1 sptes for all the vCPUs which want to
> > write the content of the memory. Different vCPUs write different areas
> > of the memory, but they all need the same kvm->mmu_lock to create the
> > level 1 sptes; this situation gets worse when the number of vCPUs and
> > the memory of the VM are large (60 vCPUs/512G in our case) and the VM
> > is doing memory-write-intensive work at the same time. In order to
> > reduce the mmu_lock contention, we try to write protect VM memory
> > gradually in small chunks, such as 1G or 2M, using a vCPU created
> > temporarily by the migration thread to split 1G to 2M as the first
> > step, and then 2M to 4K as the second step (this is a little hacky...
> > and I do not know whether any side effects will be triggered).
> > Compared to write-protecting all VM memory in one go, the
> > write-protected range is limited this way, and only the vCPUs writing
> > this limited range are involved in taking the mmu_lock. The contention
> > is reduced since the memory range is small and the number of vCPUs
> > involved is small too.
> >
> > Of course, it takes some extra time to split all the huge pages into 4K
> > pages before the real migration starts, about 60s for 512G in my
> > experiment.
> >
> > During the iterative memory copy phase, PML will do the dirty logging
> > work (the 4K pages are not write-protected in that case), or, IIRC,
> > fast_page_fault marks pages dirty if PML is not supported, in which
> > case the mmu_lock is not needed.

Yes I missed both of these.  Thanks for explaining!

Then your idea makes sense, at least to me. Though instead of
the KVM_MEM_FORCE_PT_* naming [1], we could also embed the allowed page
sizes for the memslot into the flags using a few bits, with another
new kvm cap.

> >
> > Regards,
> > Jay Zhou
> 
> (Ah I top-posted I'm sorry. Re-sending at the bottom.)
> 
> FWIW, we currently do this eager splitting at Google for live
> migration. When the log-dirty-memory flag is set on a memslot we
> eagerly split all pages in the slot down to 4k granularity.
> As Jay said, this does not cause crippling lock contention because the
> vCPU page faults generated by write protection / splitting can be
> resolved in the fast page fault path without acquiring the MMU lock.
> I believe +Junaid Shahid tried to upstream this approach at some point
> in the past, but the patch set didn't make it in. (This was before my
> time, so I'm hoping he has a link.)
> I haven't done the analysis to know if eager splitting is more or less
> efficient with parallel slow-path page faults, but it's definitely
> faster under the MMU lock.

Yes, totally agreed.  Though compared to eager splitting (which might
still need a new capability for the changed behavior after all, not
sure...), the per-memslot hint solution looks slightly nicer to me,
imho, because it offers more mechanism than policy.

Thanks,

-- 
Peter Xu
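
(Purely as an illustration of the "embed allowed page sizes into the memslot
flags" idea above: none of the names or bit positions below exist, they are
made up for the sketch.)

    #include <stdint.h>

    /* Hypothetical per-memslot "allowed page size" bits. */
    #define KVM_MEM_ALLOW_PAGES_4K   (1U << 4)
    #define KVM_MEM_ALLOW_PAGES_2M   (1U << 5)
    #define KVM_MEM_ALLOW_PAGES_1G   (1U << 6)

    /* Largest level the fault path may install for this slot
     * (1 = 4K, 2 = 2M, 3 = 1G); default to 4K-only if no bit is set. */
    static int slot_max_level(uint32_t slot_flags)
    {
        if (slot_flags & KVM_MEM_ALLOW_PAGES_1G)
            return 3;
        if (slot_flags & KVM_MEM_ALLOW_PAGES_2M)
            return 2;
        return 1;
    }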


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 18:17             ` Peter Xu
@ 2020-02-21  6:51               ` Zhoujian (jay)
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-02-21  6:51 UTC (permalink / raw)
  To: Peter Xu, Ben Gardon
  Cc: Junaid Shahid, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)



> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, February 21, 2020 2:17 AM
> To: Ben Gardon <bgardon@google.com>
> Cc: Zhoujian (jay) <jianjay.zhou@huawei.com>; Junaid Shahid
> <junaids@google.com>; kvm@vger.kernel.org; qemu-devel@nongnu.org;
> pbonzini@redhat.com; dgilbert@redhat.com; quintela@redhat.com;
> Liujinsong (Paul) <liu.jinsong@huawei.com>; linfeng (M)
> <linfeng23@huawei.com>; wangxin (U) <wangxinxin.wang@huawei.com>;
> Huangweidong (C) <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Thu, Feb 20, 2020 at 09:34:52AM -0800, Ben Gardon wrote:
> > On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@huawei.com>
> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Thursday, February 20, 2020 1:19 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > > > pbonzini@redhat.com; dgilbert@redhat.com; quintela@redhat.com;
> > > > Liujinsong (Paul) <liu.jinsong@huawei.com>; linfeng (M)
> > > > <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>;
> > > > Huangweidong (C) <weidong.huang@huawei.com>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > > > Hi Peter,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > > > To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> > > > > > Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > > > pbonzini@redhat.com;
> > > > > > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > > > > > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>;
> > > > > > wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > > > > > <weidong.huang@huawei.com>
> > > > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty
> > > > > > logging
> > > > > >
> > > > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We found that the guest will be soft-lockup occasionally
> > > > > > > when live migrating a 60 vCPU, 512GiB huge page and memory
> > > > > > > sensitive VM. The reason is clear, almost all of the vCPUs
> > > > > > > are waiting for the KVM MMU spin-lock to create 4K SPTEs
> > > > > > > when the huge pages are write protected. This
> > > > > > phenomenon is also described in this patch set:
> > > > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > > > which aims to handle page faults in parallel more efficiently.
> > > > > > >
> > > > > > > Our idea is to use the migration thread to touch all of the
> > > > > > > guest memory in the granularity of 4K before enabling dirty
> > > > > > > logging. To be more specific, we split all the PDPE_LEVEL
> > > > > > > SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then
> > > > > > > split all the DIRECTORY_LEVEL SPTEs into
> > > > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > > > >
> > > > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > > > ramblocks (please refer to ram_block_add):
> > > > > >
> > > > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > > > QEMU_MADV_HUGEPAGE);
> > > > >
> > > > > Yes, you're right
> > > > >
> > > > > >
> > > > > > Another alternative I can think of is to add an extra
> > > > > > parameter to QEMU to explicitly disable huge pages (so that
> > > > > > can even be MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).
> > > > > > However that
> > > > should also
> > > > > > drag down the performance for the whole lifecycle of the VM.
> > > > >
> > > > > From the performance point of view, it is better to keep the
> > > > > huge pages when the VM is not in the live migration state.
> > > > >
> > > > > > A 3rd option is to make a QMP
> > > > > > command to dynamically turn huge pages on/off for ramblocks
> globally.
> > > > >
> > > > > We're searching for a dynamic method too.
> > > > > We plan to add two new flags for each memory slot, say
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
> > > > > through the KVM_SET_USER_MEMORY_REGION ioctl.
> 
> [1]
> 
> > > > >
> > > > > The mapping_level function, which is called by tdp_page_fault on
> > > > > the kernel side, will return PT_DIRECTORY_LEVEL if the
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set,
> > > > > and PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES
> > > > > flag is set.
> > > > >
> > > > > The key steps to split the huge pages in advance of enabling
> > > > > dirty log is as follows:
> > > > > 1. The migration thread in user space uses
> > > > KVM_SET_USER_MEMORY_REGION
> > > > > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> > > > memory
> > > > > slot.
> > > > > 2. The migration thread continues to use the
> > > > > KVM_SPLIT_HUGE_PAGES ioctl (which is newly added) to do the
> > > > > splitting of large pages in the kernel side.
> > > > > 3. A new vCPU is created temporally(do some initialization but
> > > > > will not
> > > > > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > > > > 4. Collect the GPA ranges of all the memory slots with the
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > > > 5. Split the 1G huge pages (collected in step 4) into 2M by
> > > > > calling tdp_page_fault, since the mapping_level will return
> > > > > PT_DIRECTORY_LEVEL. Here is the main difference from the usual
> > > > > path, which is triggered by the guest side (EPT violation/misconfig
> > > > > etc.): we call it directly on the hypervisor side.
> > > > > 6. Do some cleanups, i.e. free the vCPU-related resources.
> > > > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in
> > > > > step 5 the 2M huge pages will be split into 4K pages.
> > > > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > > > 10. Then the migration thread calls the log_start ioctl to
> > > > > enable the dirty logging, and the rest proceeds as usual.
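
For concreteness, the userspace half of steps 1 and 2 above might look
roughly like the sketch below. Only KVM_SET_USER_MEMORY_REGION and struct
kvm_userspace_memory_region are existing KVM UAPI here; the two
KVM_MEM_FORCE_PT_* flags and the KVM_SPLIT_HUGE_PAGES ioctl exist only in
this proposal, so their values are placeholders.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1UL << 2)        /* proposed */
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1UL << 3)        /* proposed */
    #define KVM_SPLIT_HUGE_PAGES               _IO(KVMIO, 0xd0)  /* proposed */

    static int split_in_advance(int vm_fd,
                                struct kvm_userspace_memory_region *slots,
                                int nr_slots, unsigned long force_flag)
    {
            int i;

            for (i = 0; i < nr_slots; i++) {
                    /* step 1: tag every memory slot with the force flag */
                    slots[i].flags |= force_flag;
                    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
                            return -1;
            }
            /* step 2: ask KVM to split the huge pages of the tagged slots */
            return ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);
    }

The same helper would then be called a second time with the other flag for
the 2M -> 4K pass (step 8).
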
> > > >
> > > > I'm not sure... I think it would be good if there were a way to have
> > > > finer-granularity control on using huge pages for any process; then
> > > > KVM could directly leverage that, because KVM page tables should
> > > > always respect the mm configurations on these (so e.g. when a huge
> > > > page is split, KVM gets notifications via mmu notifiers).
> > > > Have you thought of such a more general way?
> > >
> > > I did think of this, but if we split the huge pages of a process
> > > into 4K, I'm afraid it will not be workable for the huge page
> > > sharing scenarios, e.g. DPDK, SPDK, etc. So the goal is to split
> > > only the EPT page table and keep the VM process page table
> > > (e.g. QEMU's) untouched.
> 
> Ah I see your point now.
> 
> > >
> > > >
> > > > (And I just noticed that MADV_NOHUGEPAGE is only a hint to
> > > > khugepaged and probably won't split any huge page at all after
> > > > madvise() returns..) To tell the truth I'm still confused on how
> > > > split of huge pages helped in your case...
> > >
> > > I'm sorry if the meaning is not expressed clearly, and thanks for your
> patience.
> > >
> > > > If I read it right, the test reduced some execution time from 9s to
> > > > a few ms after your splitting of huge pages.
> > >
> > > Yes
> > >
> > > > The thing is I don't see how split of huge pages could solve the
> > > > mmu_lock contention with the huge VM, because IMO even if we split
> > > > the huge pages into smaller ones, those pages should still be
> > > > write-protected and need merely the same number of page faults to
> > > > resolve when accessed/written? And I thought that should only be
> > > > fixed with solutions like what Ben has proposed with the MMU
> > > > rework. Could you show me what I've missed?
> > >
> > > Let me try to describe the reason for the mmu_lock contention more
> > > clearly, along with what we tried to do...
> > > The huge VM only has EPT SPTEs at level >= 2, and level 1 SPTEs
> > > don't exist at the beginning. Write-protecting all the huge pages
> > > will trigger EPT violations to create level 1 SPTEs for all the
> > > vCPUs that want to write the memory. Different vCPUs write
> > > different areas of the memory, but they need the same kvm->mmu_lock
> > > to create the level 1 SPTEs. This situation gets worse when the
> > > number of vCPUs and the amount of VM memory are large (in our case
> > > 60U512G) and the VM is doing memory-write-intensive work. In order
> > > to reduce the mmu_lock contention, we try to write protect VM
> > > memory gradually in small chunks, such as 1G or 2M, using a vCPU
> > > temporarily created by the migration thread to split 1G to 2M as
> > > the first step, and to split 2M to 4K as the second step (this is a
> > > little hacky... and I do not know what side effects it may trigger).
> > > Compared to write-protecting all VM memory in one go, the
> > > write-protected range is limited this way, and only the vCPUs
> > > writing this limited range are involved in taking the mmu_lock. The
> > > contention is reduced since the memory range is small and the
> > > number of vCPUs involved is small too.
> > >
> > > Of course, it will take some extra time to split all the huge pages
> > > into 4K pages before the real migration starts, about 60s for 512G
> > > in my experiment.
> > >
> > > During the iterative memory copy phase, PML will do the dirty
> > > logging work (the 4K pages are not write-protected in that case),
> > > or, IIRC, fast_page_fault is used to mark pages dirty if PML is not
> > > supported, in which case the mmu_lock is not needed.
> 
> Yes I missed both of these.  Thanks for explaining!
> 
> Then your idea makes sense, at least to me. Though instead of the
> KVM_MEM_FORCE_PT_* naming [1], we could also embed the allowed page sizes
> for the memslot into the flags using a few bits, with another new KVM cap.

Thanks for this suggestion.
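
As a rough illustration of that suggestion, the encoding could look
something like the snippet below. None of these names, bit positions or
level numbers are existing KVM UAPI; they only sketch the shape of the idea,
with mapping_level() clamping its result to the per-slot maximum instead of
checking two separate force flags.

    /* Hypothetical flag bits: each memslot advertises the largest page size
     * KVM may use to map it.  Names and values are illustrative only. */
    #define KVM_MEM_MAX_LEVEL_4K   (1UL << 2)   /* map with 4K pages only  */
    #define KVM_MEM_MAX_LEVEL_2M   (1UL << 3)   /* allow up to 2M mappings */
    /* neither bit set: no restriction, 1G mappings remain allowed */

    static int memslot_max_mapping_level(unsigned int slot_flags)
    {
            if (slot_flags & KVM_MEM_MAX_LEVEL_4K)
                    return 1;       /* PT_PAGE_TABLE_LEVEL */
            if (slot_flags & KVM_MEM_MAX_LEVEL_2M)
                    return 2;       /* PT_DIRECTORY_LEVEL  */
            return 3;               /* PT_PDPE_LEVEL       */
    }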

> > >
> > > Regards,
> > > Jay Zhou
> >
> > (Ah I top-posted I'm sorry. Re-sending at the bottom.)
> >
> > FWIW, we currently do this eager splitting at Google for live
> > migration. When the log-dirty-memory flag is set on a memslot we
> > eagerly split all pages in the slot down to 4k granularity.
> > As Jay said, this does not cause crippling lock contention because the
> > vCPU page faults generated by write protection / splitting can be
> > resolved in the fast page fault path without acquiring the MMU lock.
> > I believe +Junaid Shahid tried to upstream this approach at some point
> > in the past, but the patch set didn't make it in. (This was before my
> > time, so I'm hoping he has a link.) 

I tried to google the "eager splitting" approach using keywords like
"eager splitting", "eager splitting live migration", or adding the name of
the author, but unfortunately no relevant links were found...
But still, thanks for this information.

Regards,
Jay Zhou

> > I haven't done the analysis to
> > know if eager splitting is more or less efficient with parallel
> > slow-path page faults, but it's definitely faster under the MMU lock.
> 
> Yes, totally agreed.  Though compared to eager splitting (which might still
> need a new capability for the changed behavior after all, not sure...), the
> per-memslot hint solution looks slightly nicer to me, imho, because it can
> offer more mechanism than policy.
> 
> Thanks,
> 
> --
> Peter Xu


* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-20 17:34           ` Ben Gardon
@ 2020-02-21 22:08           ` Junaid Shahid
  2020-02-22  0:19               ` Peter Feiner
  -1 siblings, 1 reply; 28+ messages in thread
From: Junaid Shahid @ 2020-02-21 22:08 UTC (permalink / raw)
  To: Ben Gardon, Zhoujian (jay)
  Cc: Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C),
	pfeiner

On 2/20/20 9:34 AM, Ben Gardon wrote:
> 
> FWIW, we currently do this eager splitting at Google for live
> migration. When the log-dirty-memory flag is set on a memslot we
> eagerly split all pages in the slot down to 4k granularity.
> As Jay said, this does not cause crippling lock contention because the
> vCPU page faults generated by write protection / splitting can be
> resolved in the fast page fault path without acquiring the MMU lock.
> I believe +Junaid Shahid tried to upstream this approach at some point
> in the past, but the patch set didn't make it in. (This was before my
> time, so I'm hoping he has a link.)
> I haven't done the analysis to know if eager splitting is more or less
> efficient with parallel slow-path page faults, but it's definitely
> faster under the MMU lock.
> 

I am not sure if we ever posted those patches upstream. Peter Feiner would
know for sure. One notable difference in what we do compared to the approach
outlined by Jay is that we don't rely on tdp_page_fault() to do the
splitting. So we don't have to create a dummy VCPU and the specialized split
function is also much faster.

Thanks,
Junaid

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-21 22:08           ` Junaid Shahid
@ 2020-02-22  0:19               ` Peter Feiner
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Feiner @ 2020-02-22  0:19 UTC (permalink / raw)
  To: Junaid Shahid
  Cc: Ben Gardon, Zhoujian (jay),
	Peter Xu, kvm, qemu-devel, pbonzini, dgilbert, quintela,
	Liujinsong (Paul), linfeng (M), wangxin (U), Huangweidong (C)

On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com> wrote:
>
> On 2/20/20 9:34 AM, Ben Gardon wrote:
> >
> > FWIW, we currently do this eager splitting at Google for live
> > migration. When the log-dirty-memory flag is set on a memslot we
> > eagerly split all pages in the slot down to 4k granularity.
> > As Jay said, this does not cause crippling lock contention because the
> > vCPU page faults generated by write protection / splitting can be
> > resolved in the fast page fault path without acquiring the MMU lock.
> > I believe +Junaid Shahid tried to upstream this approach at some point
> > in the past, but the patch set didn't make it in. (This was before my
> > time, so I'm hoping he has a link.)
> > I haven't done the analysis to know if eager splitting is more or less
> > efficient with parallel slow-path page faults, but it's definitely
> > faster under the MMU lock.
> >
>
> I am not sure if we ever posted those patches upstream. Peter Feiner would know for sure. One notable difference in what we do compared to the approach outlined by Jay is that we don't rely on tdp_page_fault() to do the splitting. So we don't have to create a dummy VCPU and the specialized split function is also much faster.

We've been carrying these patches since 2015. I've never posted them.
Getting them in shape for upstream consumption will take some work. I
can look into this next week.

Peter

* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-22  0:19               ` Peter Feiner
@ 2020-02-24  1:07                 ` Zhoujian (jay)
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-02-24  1:07 UTC (permalink / raw)
  To: Peter Feiner, Junaid Shahid
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C)



> -----Original Message-----
> From: Peter Feiner [mailto:pfeiner@google.com]
> Sent: Saturday, February 22, 2020 8:19 AM
> To: Junaid Shahid <junaids@google.com>
> Cc: Ben Gardon <bgardon@google.com>; Zhoujian (jay)
> <jianjay.zhou@huawei.com>; Peter Xu <peterx@redhat.com>;
> kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com> wrote:
> >
> > On 2/20/20 9:34 AM, Ben Gardon wrote:
> > >
> > > FWIW, we currently do this eager splitting at Google for live
> > > migration. When the log-dirty-memory flag is set on a memslot we
> > > eagerly split all pages in the slot down to 4k granularity.
> > > As Jay said, this does not cause crippling lock contention because
> > > the vCPU page faults generated by write protection / splitting can
> > > be resolved in the fast page fault path without acquiring the MMU lock.
> > > I believe +Junaid Shahid tried to upstream this approach at some
> > > point in the past, but the patch set didn't make it in. (This was
> > > before my time, so I'm hoping he has a link.) I haven't done the
> > > analysis to know if eager splitting is more or less efficient with
> > > parallel slow-path page faults, but it's definitely faster under the
> > > MMU lock.
> > >
> >
> > I am not sure if we ever posted those patches upstream. Peter Feiner would
> know for sure. One notable difference in what we do compared to the approach
> outlined by Jay is that we don't rely on tdp_page_fault() to do the splitting. So
> we don't have to create a dummy VCPU and the specialized split function is also
> much faster.

I'm curious and interested in the way you implemented it, especially since
you mentioned that the performance is much faster without a dummy vCPU.
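
For illustration only, a drastically simplified model of such a specialized
split routine is sketched below; the data structures are invented for the
example, and this is neither the Google patch set nor current KVM code. A
real implementation would also have to allocate a proper page-table page,
propagate accessed/dirty bits and flush TLBs, which this toy model skips.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define SPTES_PER_TABLE 512

    /* Toy SPTE: either a 2M leaf or a pointer to a table of 4K leaves. */
    struct toy_spte {
            uint64_t pfn;              /* base 4K frame number          */
            bool writable;
            bool huge;                 /* true: 2M leaf                 */
            struct toy_spte *child;    /* valid once demoted to a table */
    };

    /* Demote one 2M mapping into 512 4K mappings that keep the original
     * permissions, so that later write protection and dirty tracking only
     * ever touch 4K entries and no vCPU fault is needed to rebuild them. */
    static int split_2m_mapping(struct toy_spte *spte)
    {
            struct toy_spte *tbl;
            int i;

            if (!spte->huge)
                    return 0;                          /* already 4K */
            tbl = calloc(SPTES_PER_TABLE, sizeof(*tbl));
            if (!tbl)
                    return -1;
            for (i = 0; i < SPTES_PER_TABLE; i++) {
                    tbl[i].pfn = spte->pfn + i;        /* contiguous 4K frames */
                    tbl[i].writable = spte->writable;  /* inherit protections  */
                    tbl[i].huge = false;
            }
            spte->child = tbl;                         /* install the new table */
            spte->huge = false;
            return 0;
    }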

> We've been carrying these patches since 2015. I've never posted them.
> Getting them in shape for upstream consumption will take some work. I can
> look into this next week.

It would be nice if you could post it upstream.

Regards,
Jay Zhou

> 
> Peter

* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-02-22  0:19               ` Peter Feiner
@ 2020-03-02 13:38                 ` Zhoujian (jay)
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-03-02 13:38 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C),
	Junaid Shahid



> -----Original Message-----
> From: Peter Feiner [mailto:pfeiner@google.com]
> Sent: Saturday, February 22, 2020 8:19 AM
> To: Junaid Shahid <junaids@google.com>
> Cc: Ben Gardon <bgardon@google.com>; Zhoujian (jay)
> <jianjay.zhou@huawei.com>; Peter Xu <peterx@redhat.com>;
> kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com> wrote:
> >
> > On 2/20/20 9:34 AM, Ben Gardon wrote:
> > >
> > > FWIW, we currently do this eager splitting at Google for live
> > > migration. When the log-dirty-memory flag is set on a memslot we
> > > eagerly split all pages in the slot down to 4k granularity.
> > > As Jay said, this does not cause crippling lock contention because
> > > the vCPU page faults generated by write protection / splitting can
> > > be resolved in the fast page fault path without acquiring the MMU lock.
> > > I believe +Junaid Shahid tried to upstream this approach at some
> > > point in the past, but the patch set didn't make it in. (This was
> > > before my time, so I'm hoping he has a link.) I haven't done the
> > > analysis to know if eager splitting is more or less efficient with
> > > parallel slow-path page faults, but it's definitely faster under the
> > > MMU lock.
> > >
> >
> > I am not sure if we ever posted those patches upstream. Peter Feiner would
> know for sure. One notable difference in what we do compared to the approach
> outlined by Jay is that we don't rely on tdp_page_fault() to do the splitting. So we
> don't have to create a dummy VCPU and the specialized split function is also
> much faster.
> 
> We've been carrying these patches since 2015. I've never posted them.
> Getting them in shape for upstream consumption will take some work. I can look
> into this next week.

Hi Peter Feiner,

May I ask whether there are any new updates on your plan? Sorry to disturb you.

Regards,
Jay Zhou

* Re: RFC: Split EPT huge pages in advance of dirty logging
  2020-03-02 13:38                 ` Zhoujian (jay)
@ 2020-03-02 16:28                 ` Peter Feiner
  2020-03-03  4:29                     ` Zhoujian (jay)
  -1 siblings, 1 reply; 28+ messages in thread
From: Peter Feiner @ 2020-03-02 16:28 UTC (permalink / raw)
  To: Zhoujian (jay)
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C),
	Junaid Shahid

On Mon, Mar 2, 2020, 5:38 AM Zhoujian (jay) <jianjay.zhou@huawei.com> wrote:

>
>
> > -----Original Message-----
> > From: Peter Feiner [mailto:pfeiner@google.com]
> > Sent: Saturday, February 22, 2020 8:19 AM
> > To: Junaid Shahid <junaids@google.com>
> > Cc: Ben Gardon <bgardon@google.com>; Zhoujian (jay)
> > <jianjay.zhou@huawei.com>; Peter Xu <peterx@redhat.com>;
> > kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com;
> > dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul)
> > <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin
> (U)
> > <wangxinxin.wang@huawei.com>; Huangweidong (C)
> > <weidong.huang@huawei.com>
> > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> >
> > On Fri, Feb 21, 2020 at 2:08 PM Junaid Shahid <junaids@google.com>
> wrote:
> > >
> > > On 2/20/20 9:34 AM, Ben Gardon wrote:
> > > >
> > > > FWIW, we currently do this eager splitting at Google for live
> > > > migration. When the log-dirty-memory flag is set on a memslot we
> > > > eagerly split all pages in the slot down to 4k granularity.
> > > > As Jay said, this does not cause crippling lock contention because
> > > > the vCPU page faults generated by write protection / splitting can
> > > > be resolved in the fast page fault path without acquiring the MMU
> lock.
> > > > I believe +Junaid Shahid tried to upstream this approach at some
> > > > point in the past, but the patch set didn't make it in. (This was
> > > > before my time, so I'm hoping he has a link.) I haven't done the
> > > > analysis to know if eager splitting is more or less efficient with
> > > > parallel slow-path page faults, but it's definitely faster under the
> > > > MMU lock.
> > > >
> > >
> > > I am not sure if we ever posted those patches upstream. Peter Feiner
> would
> > know for sure. One notable difference in what we do compared to the
> approach
> > outlined by Jay is that we don't rely on tdp_page_fault() to do the
> splitting. So we
> > don't have to create a dummy VCPU and the specialized split function is
> also
> > much faster.
> >
> > We've been carrying these patches since 2015. I've never posted them.
> > Getting them in shape for upstream consumption will take some work. I
> can look
> > into this next week.
>
> Hi Peter Feiner,
>
> May I ask whether there are any new updates on your plan? Sorry to disturb
> you.
>


Hi Jay,

I've been sick since I sent my last email, so I haven't gotten to this
patch set yet. I'll send it in the next week or two.

Peter


> Regards,
> Jay Zhou
>


* RE: RFC: Split EPT huge pages in advance of dirty logging
  2020-03-02 16:28                 ` Peter Feiner
@ 2020-03-03  4:29                     ` Zhoujian (jay)
  0 siblings, 0 replies; 28+ messages in thread
From: Zhoujian (jay) @ 2020-03-03  4:29 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Ben Gardon, Peter Xu, kvm, qemu-devel, pbonzini, dgilbert,
	quintela, Liujinsong (Paul), linfeng (M), wangxin (U),
	Huangweidong (C),
	Junaid Shahid



From: Peter Feiner [mailto:pfeiner@google.com] 
Sent: Tuesday, March 3, 2020 12:29 AM
To: Zhoujian (jay) <jianjay.zhou@huawei.com>
Cc: Ben Gardon <bgardon@google.com>; Peter Xu <peterx@redhat.com>; kvm@vger.kernel.org; qemu-devel@nongnu.org; pbonzini@redhat.com; dgilbert@redhat.com; quintela@redhat.com; Liujinsong (Paul) <liu.jinsong@huawei.com>; linfeng (M) <linfeng23@huawei.com>; wangxin (U) <wangxinxin.wang@huawei.com>; Huangweidong (C) <weidong.huang@huawei.com>; Junaid Shahid <junaids@google.com>
Subject: Re: RFC: Split EPT huge pages in advance of dirty logging

> -----Original Message-----
> From: Peter Feiner [mailto:pfeiner@google.com]

[...]

>Hi Jay,
>I've been sick since I sent my last email, so I haven't gotten to this patch set yet. I'll send it in the next week or two. 

OK, please take care of yourself.


Regards,
Jay Zhou
