kvm.vger.kernel.org archive mirror
* RFC: KVM: x86/mmu: Eager Page Splitting
@ 2021-11-04 22:45 David Matlack
  2021-11-05  8:44 ` Paolo Bonzini
  2021-11-05 17:17 ` Janis Schoetterl-Glausch
  0 siblings, 2 replies; 8+ messages in thread
From: David Matlack @ 2021-11-04 22:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, Ben Gardon, Junaid Shahid, Sean Christopherson,
	Oliver Upton, Harish Barathvajasankar, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

The goal of this RFC is to get feedback on "Eager Page Splitting", an
optimization that has been in use in Google Cloud since 2016 to reduce
the performance impact of live migration on customer workloads. We
wanted to get feedback on the feature before delving too far into
porting it to the latest upstream kernel for submission. If there is
interest in adding this feature to KVM we plan to follow up in the
coming months with patches.

Background
==========
When KVM is tracking writes for dirty logging it write-protects any
2MiB or 1GiB pages that are mapped into the guest. When a vCPU writes
to such a page a write-protection fault will occur which KVM handles
by allocating lower level page tables, mapping in the faulting address
with a PT-level (4KiB) SPTE, and recording that the 4KiB page is dirty.
Handling these faults is done under the MMU lock. In the TDP MMU,
where the MMU lock is a rwlock rather than a spin lock, splitting is
done while holding the lock in read (shared) mode and atomic
compare-exchanges are used when modifying SPTEs to detect races and
retry.
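
For readers less familiar with the pattern, here is a minimal userspace
sketch of compare-exchange-and-retry using C11 atomics. It is not KVM
code: spte_t, set_spte_atomic() and set_bits_with_retry() are
hypothetical names chosen for illustration, and a plain 64-bit integer
stands in for an SPTE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one SPTE slot shared between threads. */
typedef _Atomic uint64_t spte_t;

/*
 * Install new_spte only if the slot still holds old_spte. Returns false
 * if another thread changed the slot first, i.e. the caller lost a race.
 */
bool set_spte_atomic(spte_t *sptep, uint64_t old_spte, uint64_t new_spte)
{
	assert(sptep);
	return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}

/*
 * Retry loop: re-read the slot and recompute the desired value until the
 * compare-exchange succeeds. Returns how many times it had to retry.
 */
int set_bits_with_retry(spte_t *sptep, uint64_t bits)
{
	int retries = 0;

	for (;;) {
		uint64_t old = atomic_load(sptep);

		if (set_spte_atomic(sptep, old, old | bits))
			return retries;
		retries++;
	}
}
```

With a single thread the loop never retries. The cost discussed in this
RFC shows up when many vCPUs run such loops against nearby memory at the
same time.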

Motivation
==========
The write-protection faults to break down 2MiB and 1GiB mappings into
4KiB mappings are taken on the critical path of guest execution, which
negatively impacts guest performance. The negative impact scales with
the number of vCPUs, since each vCPU contends for the MMU lock.

This overhead can be seen by running dirty_log_perf_test with 1GiB per
vCPU and comparing the time it takes the test to dirty memory after
enabling dirty logging when the backing source is `anonymous` (4KiB
pages) and `anonymous_hugetlb_1gb` (1GiB pages).

        |        First Pass Dirty Memory Time           |
        |       tdp_mmu=N       |     tdp_mmu=Y         |
vCPUs   | 4KiB      | 1GiB      | 4KiB      | 1GiB      |
------- | --------- | --------- | --------- | --------- |
1       | 0.063s    | 0.241s    | 0.061s    | 0.238s    |
2       | 0.066s    | 0.280s    | 0.066s    | 0.278s    |
4       | 0.069s    | 0.359s    | 0.070s    | 0.298s    |
8       | 0.112s    | 0.982s    | 0.109s    | 0.345s    |
16      | 0.197s    | 3.153s    | 0.183s    | 1.020s    |
32      | 0.368s    | 9.293s    | 0.425s    | 2.610s    |
64      | 0.456s    | 23.291s   | 0.354s    | 4.212s    |
128     | 0.334s    | 55.030s   | 0.419s    | 7.169s    |
256     | 0.576s    | 141.332s  | 0.492s    | 13.874s   |
416     | 0.881s    | 338.185s  | 0.785s    | 14.582s   |

The performance overhead is egregious with the legacy MMU, as
expected, since every fault requires contending for exclusive access
to the MMU lock. However, even with the TDP MMU, where the MMU lock is
held in read-mode, perf recording confirms there is still contention
due to hammering of atomic operations that scales with the number of
vCPUs:

+   28.16%  [k] _raw_read_lock
+   28.10%  [k] direct_page_fault
+   21.52%  [k] tdp_mmu_set_spte_atomic
+    6.90%  [k] __handle_changed_spte
+    3.93%  [k] __get_current_cr3_fast
+    3.47%  [k] lockless_pages_from_mm

_raw_read_lock, direct_page_fault, tdp_mmu_set_spte_atomic, and
__handle_changed_spte each spend 99+% of their time on atomic
operations. Note the _raw_read_lock path specifically is pure
reader/reader contention; it is not spinning waiting for writers.

As of Nov 2021, Google Cloud supports live migrating VMs with up to
416 vCPUs and 28 GiB per vCPU. The customers using large instances
tend to run applications that are sensitive to abrupt performance
degradations. For these VMs we use Eager Page Splitting in conjunction
with the Direct MMU (our internal predecessor to the TDP MMU).

Design
======
Eager Page Splitting occurs when dirty logging is enabled on a region
of memory. Before KVM write-protects all large page mappings, we
attempt the following two steps (*):

1. Iterate through all 1GiB SPTEs and for each:
    a. Allocate a new shadow page table.
    b. Populate it with 2MiB SPTEs mapping the 1GiB page.
    c. Replace the 1GiB SPTE with a link to the shadow page table.
2. Iterate through all 2MiB SPTEs and for each:
    a. Allocate a new shadow page table.
    b. Populate it with 4KiB SPTEs mapping the 2MiB page.
    c. Replace the 2MiB SPTE with a link to the shadow page table.

(*) We could split 1GiB pages directly down to 4KiB, and then split
2MiB pages down to 4KiB. But the two step approach is a bit simpler to
understand and implement.

The implementation of splitting a 1GiB SPTE into 512 2MiB SPTEs and a
2MiB SPTE into 512 4KiB SPTEs is generalized into a new function
that splits an SPTE into one level smaller.

The lower level SPTEs receive the same set of permissions as the upper
level SPTEs. The exception is execute permission, which is granted
for 4K SPTEs if the large SPTE was forced NX due to HugePage NX.
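
To make the split and permission rules above concrete, here is a
simplified userspace model of the "split one level" helper. It is only
a sketch under stated assumptions: struct spte, split_one_level() and
the PERM_* bits are hypothetical and do not reflect KVM's real SPTE
encoding, and step (c), linking the new table into the parent, is left
out. Keeping 2MiB children of a forced-NX 1GiB page non-executable is
my reading of the rule above, not something the design text spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTES_PER_TABLE 512	/* entries in one x86-64 page table */
#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/* Hypothetical, heavily simplified SPTE. */
struct spte {
	uint64_t addr;		/* physical address of the mapped region */
	int level;		/* 3 = 1GiB, 2 = 2MiB, 1 = 4KiB */
	unsigned int perms;
	bool nx_forced;		/* execute withheld only due to HugePage NX */
};

/* Bytes mapped by one SPTE at the given level. */
uint64_t level_size(int level)
{
	return 1ULL << (12 + 9 * (level - 1));
}

/*
 * Fill children[0..511] with SPTEs one level below huge. Each child
 * maps 1/512th of the parent's range and inherits its permissions.
 * 4KiB children regain execute permission if the parent was forced NX.
 */
void split_one_level(const struct spte *huge, struct spte *children)
{
	assert(huge->level == 2 || huge->level == 3);

	int child_level = huge->level - 1;
	uint64_t child_size = level_size(child_level);

	for (int i = 0; i < SPTES_PER_TABLE; i++) {
		struct spte *c = &children[i];

		c->addr = huge->addr + (uint64_t)i * child_size;
		c->level = child_level;
		c->perms = huge->perms;
		c->nx_forced = huge->nx_forced;
		if (child_level == 1 && huge->nx_forced) {
			c->perms |= PERM_X;
			c->nx_forced = false;
		}
	}
}
```

Calling split_one_level() on a 1GiB entry and then on each resulting
2MiB entry corresponds to steps 1 and 2 of the design.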

Eager Page Splitting can gracefully fall back to write-protection if
we decide splitting is not worth the complexity in certain scenarios,
since the existing write-protection logic runs after Eager Page
Splitting. This makes it possible to build Eager Page Splitting
iteratively (e.g. just supporting direct-map-only GFNs, then adding
support for rmap, etc.). This also means that it is trivial to make
Eager Page Splitting an optional capability if desired.

Allocations
-----------
In order to avoid allocating while holding the MMU lock, vCPUs
preallocate everything they need to handle the fault and store it in
kvm_mmu_memory_cache structs. Eager Page Splitting does the same thing
but since it runs outside of a vCPU thread it needs its own copies of
kvm_mmu_memory_cache structs. This requires refactoring the way
kvm_mmu_memory_cache structs are passed around in the MMU code and
adding kvm_mmu_memory_cache structs to kvm_arch.

Before splitting a large page, Eager Page Splitting checks if it has
enough memory in its caches to fully split the page. If it doesn't, it
flushes TLBs, drops the MMU lock, calls cond_resched(), tops up the
caches, reacquires the MMU lock, and continues where it left off.
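
The control flow of that loop can be sketched as follows. This is a
userspace model, not the proposed kernel code: struct split_ctx,
topup_cache(), split_huge_pages() and CACHE_CAPACITY are hypothetical,
the MMU lock is modelled as a boolean, and the TLB flush and
cond_resched() calls appear only as comments.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_CAPACITY 4	/* hypothetical: page tables per top-up */

struct split_ctx {
	int cache_objs;		/* preallocated page tables available */
	bool mmu_lock_held;	/* stand-in for the real MMU lock */
	int lock_drops;		/* times the lock was dropped to refill */
};

/* Refill the cache. Must never run while the MMU lock is held. */
void topup_cache(struct split_ctx *ctx)
{
	assert(!ctx->mmu_lock_held);
	ctx->cache_objs = CACHE_CAPACITY;
}

/*
 * Split nr_huge huge pages, each consuming tables_per_split page tables
 * from the cache. Returns the number of huge pages split.
 */
int split_huge_pages(struct split_ctx *ctx, int nr_huge, int tables_per_split)
{
	int done = 0;

	assert(tables_per_split <= CACHE_CAPACITY);
	ctx->mmu_lock_held = true;
	while (done < nr_huge) {
		if (ctx->cache_objs < tables_per_split) {
			/* A real implementation would flush TLBs here. */
			ctx->mmu_lock_held = false;	/* drop the lock */
			ctx->lock_drops++;
			/* cond_resched() would go here. */
			topup_cache(ctx);
			ctx->mmu_lock_held = true;	/* reacquire */
			continue;	/* resume at the same huge page */
		}
		ctx->cache_objs -= tables_per_split;
		done++;
	}
	ctx->mmu_lock_held = false;
	return done;
}
```

The point of checking before each split, rather than after a failed
allocation, is that a huge page is never left half split when the lock
is dropped.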

Pros/Cons
---------
The benefits of Eager Page Splitting are:

  * Eliminates the overhead of enabling dirty logging when using large
pages. vCPUs will only take write-protection faults on 4KiB pages,
which can be handled without acquiring the MMU lock (fast_page_fault),
regardless of which page sizes were used before the migration started.
  * Eager Page Splitting is more efficient overall. We do not have to
pay the VM-exit and fault handling costs for every 4KiB of guest
memory, the splitter benefits from cache locality, and TLB flushes can
be batched. When using the tdp_mmu, the new child page table can be
populated without atomics.

The downsides of Eager Page Splitting are:

  * Introduces a new and complex path for modifying KVM's page tables
outside of the vCPU fault path.
  * Increases the duration of the VM ioctls that enable dirty logging.
This does not affect customer performance but may have unintended
consequences depending on how userspace invokes the ioctl. For
example, eagerly splitting a 1.5TB memslot takes 30 seconds.
  * May increase memory usage since it allocates the worst case number
of page tables in order to map all memory at 4KiB.
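
To put a rough number on the memory cost, the worst case can be
estimated with a little arithmetic. The helpers below are hypothetical
and assume that the memslot starts out fully mapped with 1GiB pages,
that every page table is one 4KiB page, and that "1.5TB" means 1.5 TiB.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Round-up integer division. */
uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Number of new page tables needed to map `bytes` of guest memory at
 * 4KiB, starting from 1GiB mappings: one table of 2MiB entries per
 * 1GiB, plus one table of 4KiB entries per 2MiB.
 */
uint64_t split_page_tables(uint64_t bytes)
{
	assert(bytes % SZ_4K == 0);	/* memslots are page aligned */
	return div_round_up(bytes, SZ_1G) + div_round_up(bytes, SZ_2M);
}

/* Memory consumed by those page tables, in bytes. */
uint64_t split_page_table_bytes(uint64_t bytes)
{
	return split_page_tables(bytes) * SZ_4K;
}
```

For a 1.5 TiB memslot this gives 787968 page tables, or about 3 GiB,
which is roughly 0.2% of the memslot size.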

Alternatives
============
"RFC: Split EPT huge pages in advance of dirty logging" [1] was a
previous proposal to proactively split large pages off of the vCPU
threads. However it required faulting in every page in the migration
thread, a vCPU-like thread in QEMU, which requires extra userspace
support and also is less efficient since it requires faulting.

Another alternative is to modify the vCPU fault handler to map in the
entire large page when handling write-protection faults on large
pages, rather than just mapping in the faulting 4KiB region. This is a
middle ground approach that would be more efficient than the current
solution but has yet to be prototyped or proven in a production
environment.

The last alternative is to perform dirty tracking at a 2M granularity.
This would reduce the amount of splitting work required by 512x,
making the current approach of splitting on fault less impactful to
customer performance. We are in the early stages of investigating 2M
dirty tracking internally but it will be a while before it is proven
and ready for production. Furthermore there may be scenarios where
dirty tracking at 4K would be preferable to reduce the amount of
memory that needs to be demand-faulted during precopy.
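
The granularity trade-off can be illustrated with a small calculation:
given the guest frame numbers (GFNs) a workload wrote to, how many bytes
end up marked dirty at each granularity? The helpers are hypothetical
and assume the GFN list is sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_2M 512	/* 4KiB pages in one 2MiB region */

/* Bytes marked dirty when tracking at 4KiB: one page per distinct GFN. */
uint64_t dirty_bytes_4k(const uint64_t *gfns, size_t n)
{
	uint64_t pages = 0;

	for (size_t i = 0; i < n; i++) {
		assert(i == 0 || gfns[i] >= gfns[i - 1]);	/* sorted input */
		if (i == 0 || gfns[i] != gfns[i - 1])
			pages++;
	}
	return pages * 4096;
}

/* Bytes marked dirty when tracking at 2MiB: one region per distinct 2MiB frame. */
uint64_t dirty_bytes_2m(const uint64_t *gfns, size_t n)
{
	uint64_t regions = 0;

	for (size_t i = 0; i < n; i++) {
		assert(i == 0 || gfns[i] >= gfns[i - 1]);	/* sorted input */
		if (i == 0 ||
		    gfns[i] / PAGES_PER_2M != gfns[i - 1] / PAGES_PER_2M)
			regions++;
	}
	return regions * PAGES_PER_2M * 4096;
}
```

For example, writes to GFNs 0, 1, 512 and 2048 mark 16 KiB dirty at 4KiB
granularity but 6 MiB at 2MiB granularity, since they touch three
different 2MiB regions.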

Credit
======
Eager Page Splitting was originally designed and implemented by Peter
Feiner <pfeiner@google.com> for Google's internal kernel in 2016.

Appendix
========
In order to collect the performance results I ran the following
commands on an Intel Cascade Lake host that is used to run Google
Cloud's m2-ultramem-416 VMs:

  echo Y > /sys/module/kvm/parameters/tdp_mmu
  ./dirty_log_perf_test -v${vcpus} -s anonymous
  ./dirty_log_perf_test -v${vcpus} -s anonymous_hugetlb_1gb
  echo N > /sys/module/kvm/parameters/tdp_mmu
  ./dirty_log_perf_test -v${vcpus} -s anonymous
  ./dirty_log_perf_test -v${vcpus} -s anonymous_hugetlb_1gb

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg04774.html


* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-04 22:45 RFC: KVM: x86/mmu: Eager Page Splitting David Matlack
@ 2021-11-05  8:44 ` Paolo Bonzini
  2021-11-08 19:57   ` David Matlack
  2021-11-05 17:17 ` Janis Schoetterl-Glausch
  1 sibling, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2021-11-05  8:44 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm list, Ben Gardon, Junaid Shahid, Sean Christopherson,
	Oliver Upton, Harish Barathvajasankar, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

On 11/4/21 23:45, David Matlack wrote:
> The goal of this RFC is to get feedback on "Eager Page Splitting",
> an optimization that has been in use in Google Cloud since 2016 to 
> reduce the performance impact of live migration on customer 
> workloads. We wanted to get feedback on the feature before delving 
> too far into porting it to the latest upstream kernel for submission.
> If there is interest in adding this feature to KVM we plan to follow
> up in the coming months with patches.

Hi David!

I'm definitely interested in eager page splitting upstream, but with a
twist: in order to limit the proliferation of knobs, I would rather
enable it only when KVM_DIRTY_LOG_INITIALLY_SET is set, and do the split
on the first KVM_CLEAR_DIRTY_LOG ioctl.

Initially-all-set does not require write protection when dirty logging
is enabled; instead, it delays write protection to the first
KVM_CLEAR_DIRTY_LOG.  In fact, I believe that eager page splitting can
be enabled unconditionally for initial-all-set.  You would still have
the benefit of moving the page splitting out of the vCPU run
path; and because you can smear the cost of splitting over multiple
calls, most of the disadvantages go away.

Initially-all-set is already the best-performing method for bitmap-based
dirty page tracking, so it makes sense to focus on it.  Even if Google
might not be using initial-all-set internally, adding eager page
splitting to the upstream code would remove most of the delta related to
it.  The rest of the delta can be tackled later; I'm not super
interested in adding eager page splitting for the older methods (clear
on KVM_GET_DIRTY_LOG, and manual-clear without initially-all-set), but
it should be useful for the ring buffer method and that *should* share
most of the code with the older methods.

> In order to avoid allocating while holding the MMU lock, vCPUs 
> preallocate everything they need to handle the fault and store it in 
> kvm_mmu_memory_cache structs. Eager Page Splitting does the same 
> thing but since it runs outside of a vCPU thread it needs its own 
> copies of kvm_mmu_memory_cache structs. This requires refactoring the
> way kvm_mmu_memory_cache structs are passed around in the MMU code
> and adding kvm_mmu_memory_cache structs to kvm_arch.

That's okay, we can move more arguments to structs if needed in the same
way as struct kvm_page_fault; or we can use kvm_get_running_vcpu() if
it's easier or more appropriate.

> * Increases the duration of the VM ioctls that enable dirty logging. 
> This does not affect customer performance but may have unintended 
> consequences depending on how userspace invokes the ioctl. For 
> example, eagerly splitting a 1.5TB memslot takes 30 seconds.

This issue goes away (or becomes easier to manage) if it's done in
KVM_CLEAR_DIRTY_LOG.

> "RFC: Split EPT huge pages in advance of dirty logging" [1] was a 
> previous proposal to proactively split large pages off of the vCPU 
> threads. However it required faulting in every page in the migration 
> thread, a vCPU-like thread in QEMU, which requires extra userspace 
> support and also is less efficient since it requires faulting.

Yeah, this is best done on the kernel side.

> The last alternative is to perform dirty tracking at a 2M 
> granularity. This would reduce the amount of splitting work required
>  by 512x, making the current approach of splitting on fault less 
> impactful to customer performance. We are in the early stages of 
> investigating 2M dirty tracking internally but it will be a while 
> before it is proven and ready for production. Furthermore there may 
> be scenarios where dirty tracking at 4K would be preferable to reduce
> the amount of memory that needs to be demand-faulted during precopy.

Granularity of dirty tracking is somewhat orthogonal to this anyway,
since you'd have to split 1G pages down to 2M.  So please let me know if
you're okay with the above twist, and let's go ahead with the plan!

Paolo



* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-04 22:45 RFC: KVM: x86/mmu: Eager Page Splitting David Matlack
  2021-11-05  8:44 ` Paolo Bonzini
@ 2021-11-05 17:17 ` Janis Schoetterl-Glausch
  2021-11-08 21:07   ` David Matlack
  1 sibling, 1 reply; 8+ messages in thread
From: Janis Schoetterl-Glausch @ 2021-11-05 17:17 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: kvm list, Ben Gardon, Junaid Shahid, Sean Christopherson,
	Oliver Upton, Harish Barathvajasankar, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

On 11/4/21 23:45, David Matlack wrote:

[...]
> 
> The last alternative is to perform dirty tracking at a 2M granularity.
> This would reduce the amount of splitting work required by 512x,
> making the current approach of splitting on fault less impactful to
> customer performance. We are in the early stages of investigating 2M
> dirty tracking internally but it will be a while before it is proven
> and ready for production. Furthermore there may be scenarios where
> dirty tracking at 4K would be preferable to reduce the amount of
> memory that needs to be demand-faulted during precopy.

I'm curious how you're going about evaluating this, as I've experimented with
2M dirty tracking in the past, in a continuous checkpointing context however.
I suspect it's very sensitive to the workload. If the coarser granularity
leads to more memory being considered dirty, the length of pre-copy rounds
increases, giving the workload more time to dirty even more memory.
Ideally large pages would be used only for regions that won't be dirty or
regions that would also be pretty much completely dirty when tracking at 4K.
But deciding the granularity adaptively is hard, doing 2M tracking instead
of 4K robs you of the very information you'd need to judge that. 


* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-05  8:44 ` Paolo Bonzini
@ 2021-11-08 19:57   ` David Matlack
  2021-11-08 21:37     ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: David Matlack @ 2021-11-08 19:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, Ben Gardon, Junaid Shahid, Sean Christopherson,
	Oliver Upton, Harish Barathvajasankar, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

On Fri, Nov 05, 2021 at 09:44:14AM +0100, Paolo Bonzini wrote:
> On 11/4/21 23:45, David Matlack wrote:
> > The goal of this RFC is to get feedback on "Eager Page Splitting",
> > an optimization that has been in use in Google Cloud since 2016 to
> > reduce the performance impact of live migration on customer workloads.
> > We wanted to get feedback on the feature before delving too far into
> > porting it to the latest upstream kernel for submission.
> > If there is interest in adding this feature to KVM we plan to follow
> > up in the coming months with patches.
> 
> Hi David!
> 
> I'm definitely interested in eager page splitting upstream, but with a
> twist: in order to limit the proliferation of knobs, I would rather
> enable it only when KVM_DIRTY_LOG_INITIALLY_SET is set, and do the split
> on the first KVM_CLEAR_DIRTY_LOG ioctl.
> 
> Initially-all-set does not require write protection when dirty logging
> is enabled; instead, it delays write protection to the first
> KVM_CLEAR_DIRTY_LOG.  In fact, I believe that eager page splitting can
> be enabled unconditionally for initial-all-set.  You would still have
> the benefit of moving the page splitting out of the vCPU run
> path; and because you can smear the cost of splitting over multiple
> calls, most of the disadvantages go away.

Splitting on the first call to KVM_CLEAR_DIRTY_LOG when
initially-all-set is enabled sounds fine to me. But it does require
extra complexity versus unconditionally eager splitting the entire
memslot when dirty logging is enabled, which (I now realize) is needed
to support the ring buffer method. More below...

> 
> Initially-all-set is already the best-performing method for bitmap-based
> dirty page tracking, so it makes sense to focus on it.  Even if Google
> might not be using initial-all-set internally, adding eager page
> splitting to the upstream code would remove most of the delta related to
> it.  The rest of the delta can be tackled later;

Yeah we are still using the legacy clear-on-get-dirty interface.
Upstreaming eager page splitting for initially-all-set would address
most of the delta and give us extra motivation to switch off of
clear-on-get-dirty :).

> I'm not super
> interested in adding eager page splitting for the older methods (clear
> on KVM_GET_DIRTY_LOG, and manual-clear without initially-all-set), but
> it should be useful for the ring buffer method and that *should* share
> most of the code with the older methods.

Using Eager Page Splitting with the ring buffer method would require
splitting the entire memslot when dirty logging is enabled for that
memslot right? Are you saying we should do that?

i.e. in kvm_mmu_slot_apply_flags we'd have something like:

        if (kvm->dirty_ring_size)
                kvm_slot_split_large_pages(kvm, slot);

If so, maybe we should just unconditionally do eager page splitting for
the entire memslot, which would save us from having to add eager page
splitting in two places.

> 
> > In order to avoid allocating while holding the MMU lock, vCPUs
> > preallocate everything they need to handle the fault and store it in
> > kvm_mmu_memory_cache structs. Eager Page Splitting does the same thing
> > but since it runs outside of a vCPU thread it needs its own copies of
> > kvm_mmu_memory_cache structs. This requires refactoring the
> > way kvm_mmu_memory_cache structs are passed around in the MMU code
> > and adding kvm_mmu_memory_cache structs to kvm_arch.
> 
> That's okay, we can move more arguments to structs if needed in the same
> way as struct kvm_page_fault; or we can use kvm_get_running_vcpu() if
> it's easier or more appropriate.
> 
> > * Increases the duration of the VM ioctls that enable dirty logging.
> > This does not affect customer performance but may have unintended
> > consequences depending on how userspace invokes the ioctl. For example,
> > eagerly splitting a 1.5TB memslot takes 30 seconds.
> 
> This issue goes away (or becomes easier to manage) if it's done in
> KVM_CLEAR_DIRTY_LOG.
> 
> > "RFC: Split EPT huge pages in advance of dirty logging" [1] was a
> > previous proposal to proactively split large pages off of the vCPU
> > threads. However it required faulting in every page in the migration
> > thread, a vCPU-like thread in QEMU, which requires extra userspace
> > support and also is less efficient since it requires faulting.
> 
> Yeah, this is best done on the kernel side.
> 
> > The last alternative is to perform dirty tracking at a 2M granularity.
> > This would reduce the amount of splitting work required
> >  by 512x, making the current approach of splitting on fault less
> > impactful to customer performance. We are in the early stages of
> > investigating 2M dirty tracking internally but it will be a while before
> > it is proven and ready for production. Furthermore there may be
> > scenarios where dirty tracking at 4K would be preferable to reduce
> > the amount of memory that needs to be demand-faulted during precopy.
> 
> Granularity of dirty tracking is somewhat orthogonal to this anyway,
> since you'd have to split 1G pages down to 2M.  So please let me know if
> you're okay with the above twist, and let's go ahead with the plan!
> 
> Paolo
> 


* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-05 17:17 ` Janis Schoetterl-Glausch
@ 2021-11-08 21:07   ` David Matlack
  2021-11-23 12:15     ` Peter Xu
  0 siblings, 1 reply; 8+ messages in thread
From: David Matlack @ 2021-11-08 21:07 UTC (permalink / raw)
  To: Janis Schoetterl-Glausch
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Junaid Shahid,
	Sean Christopherson, Oliver Upton, Harish Barathvajasankar,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

On Fri, Nov 05, 2021 at 06:17:11PM +0100, Janis Schoetterl-Glausch wrote:
> On 11/4/21 23:45, David Matlack wrote:
> 
> [...]
> > 
> > The last alternative is to perform dirty tracking at a 2M granularity.
> > This would reduce the amount of splitting work required by 512x,
> > making the current approach of splitting on fault less impactful to
> > customer performance. We are in the early stages of investigating 2M
> > dirty tracking internally but it will be a while before it is proven
> > and ready for production. Furthermore there may be scenarios where
> > dirty tracking at 4K would be preferable to reduce the amount of
> > memory that needs to be demand-faulted during precopy.

Oops I meant to say "demand-faulted during post-copy" here.

> I'm curious how you're going about evaluating this, as I've experimented with
> 2M dirty tracking in the past, in a continuous checkpointing context however.
> I suspect it's very sensitive to the workload. If the coarser granularity
> leads to more memory being considered dirty, the length of pre-copy rounds
> increases, giving the workload more time to dirty even more memory.
> Ideally large pages would be used only for regions that won't be dirty or
> regions that would also be pretty much completely dirty when tracking at 4K.
> But deciding the granularity adaptively is hard, doing 2M tracking instead
> of 4K robs you of the very information you'd need to judge that.

We're planning to look at how 2M tracking affects the amount of memory
that needs to be demand-faulted during the post-copy phase for different
workloads.


* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-08 19:57   ` David Matlack
@ 2021-11-08 21:37     ` Paolo Bonzini
  2021-11-08 21:39       ` David Matlack
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2021-11-08 21:37 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm list, Ben Gardon, Junaid Shahid, Sean Christopherson,
	Oliver Upton, Harish Barathvajasankar, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

On 11/8/21 20:57, David Matlack wrote:
>> I'm not super
>> interested in adding eager page splitting for the older methods (clear
>> on KVM_GET_DIRTY_LOG, and manual-clear without initially-all-set), but
>> it should be useful for the ring buffer method and that *should* share
>> most of the code with the older methods.
> 
> Using Eager Page Splitting with the ring buffer method would require
> splitting the entire memslot when dirty logging is enabled for that
> memslot right? Are you saying we should do that?

Yeah, that's why I said it should share code with clear-on-get-dirty.

For initially-all-set, where it's possible to do it and even easy-ish, I 
would like to avoid paying the cost of splitting entirely upfront, when 
enabling dirty page tracking.  But you can already post an RFC that just 
splits always when dirty page tracking is enabled, so that I have a bit 
more of an idea of the new code, and of what it would entail to smear 
the cost over the calls to KVM_CLEAR_DIRTY_LOG.

Thanks,

Paolo



* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-08 21:37     ` Paolo Bonzini
@ 2021-11-08 21:39       ` David Matlack
  0 siblings, 0 replies; 8+ messages in thread
From: David Matlack @ 2021-11-08 21:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, Ben Gardon, Junaid Shahid, Sean Christopherson,
	Oliver Upton, Harish Barathvajasankar, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Peter Xu, Peter Shier

On Mon, Nov 8, 2021 at 1:37 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/8/21 20:57, David Matlack wrote:
> >> I'm not super
> >> interested in adding eager page splitting for the older methods (clear
> >> on KVM_GET_DIRTY_LOG, and manual-clear without initially-all-set), but
> >> it should be useful for the ring buffer method and that *should* share
> >> most of the code with the older methods.
> >
> > Using Eager Page Splitting with the ring buffer method would require
> > splitting the entire memslot when dirty logging is enabled for that
> > memslot right? Are you saying we should do that?
>
> Yeah, that's why I said it should share code with clear-on-get-dirty.
>
> For initially-all-set, where it's possible to do it and even easy-ish, I
> would like to avoid paying the cost of splitting entirely upfront, when
> enabling dirty page tracking.  But you can already post an RFC that just
> splits always when dirty page tracking is enabled, so that I have a bit
> more of an idea of the new code, and of what it would entail to smear
> the cost over the calls to KVM_CLEAR_DIRTY_LOG.

Ok makes sense. Thanks for the feedback!

>
> Thanks,
>
> Paolo
>


* Re: RFC: KVM: x86/mmu: Eager Page Splitting
  2021-11-08 21:07   ` David Matlack
@ 2021-11-23 12:15     ` Peter Xu
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Xu @ 2021-11-23 12:15 UTC (permalink / raw)
  To: David Matlack
  Cc: Janis Schoetterl-Glausch, Paolo Bonzini, kvm list, Ben Gardon,
	Junaid Shahid, Sean Christopherson, Oliver Upton,
	Harish Barathvajasankar, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Peter Shier

On Mon, Nov 08, 2021 at 09:07:51PM +0000, David Matlack wrote:
> On Fri, Nov 05, 2021 at 06:17:11PM +0100, Janis Schoetterl-Glausch wrote:
> > On 11/4/21 23:45, David Matlack wrote:
> > 
> > [...]
> > > 
> > > The last alternative is to perform dirty tracking at a 2M granularity.
> > > This would reduce the amount of splitting work required by 512x,
> > > making the current approach of splitting on fault less impactful to
> > > customer performance. We are in the early stages of investigating 2M
> > > dirty tracking internally but it will be a while before it is proven
> > > and ready for production. Furthermore there may be scenarios where
> > > dirty tracking at 4K would be preferable to reduce the amount of
> > > memory that needs to be demand-faulted during precopy.
> 
> Oops I meant to say "demand-faulted during post-copy" here.

Sorry to join late, but this does sound like an interesting topic, too.

Assuming postcopy will be enabled after just a few iterations of precopy,
write amplification could hopefully be a much smaller problem, so the
mostly-static pages can still be successfully migrated during precopy.

Please share more information when there is any; I'd be very interested to
learn.

Thanks!

-- 
Peter Xu




Thread overview: 8+ messages
2021-11-04 22:45 RFC: KVM: x86/mmu: Eager Page Splitting David Matlack
2021-11-05  8:44 ` Paolo Bonzini
2021-11-08 19:57   ` David Matlack
2021-11-08 21:37     ` Paolo Bonzini
2021-11-08 21:39       ` David Matlack
2021-11-05 17:17 ` Janis Schoetterl-Glausch
2021-11-08 21:07   ` David Matlack
2021-11-23 12:15     ` Peter Xu
