From: "Zhoujian (jay)" <jianjay.zhou@huawei.com>
To: Peter Xu <peterx@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"wangxin (U)" <wangxinxin.wang@huawei.com>,
"Huangweidong (C)" <weidong.huang@huawei.com>,
"sean.j.christopherson@intel.com"
<sean.j.christopherson@intel.com>,
"Liujinsong (Paul)" <liu.jinsong@huawei.com>
Subject: RE: [PATCH v2] KVM: x86: enable dirty log gradually in small chunks
Date: Fri, 21 Feb 2020 09:31:52 +0000 [thread overview]
Message-ID: <B2D15215269B544CADD246097EACE7474BB064B7@DGGEMM528-MBX.china.huawei.com> (raw)
In-Reply-To: <20200220191706.GF2905@xz-x1>
> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, February 21, 2020 3:17 AM
> To: Zhoujian (jay) <jianjay.zhou@huawei.com>
> Cc: kvm@vger.kernel.org; pbonzini@redhat.com; wangxin (U)
> <wangxinxin.wang@huawei.com>; Huangweidong (C)
> <weidong.huang@huawei.com>; sean.j.christopherson@intel.com; Liujinsong
> (Paul) <liu.jinsong@huawei.com>
> Subject: Re: [PATCH v2] KVM: x86: enable dirty log gradually in small chunks
>
> On Thu, Feb 20, 2020 at 12:28:28PM +0800, Jay Zhou wrote:
> > It could take kvm->mmu_lock for an extended period of time when
> > enabling dirty log for the first time. The main cost is to clear all
> > the D-bits of last level SPTEs. This situation can benefit from manual
> > dirty log protect as well, which can reduce the mmu_lock time taken.
> > The sequence is like this:
> >
> > 1. Initialize all the bits of the dirty bitmap to 1 when enabling
> > dirty log for the first time
> > 2. Only write protect the huge pages
> > 3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
> > 4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level
> >    SPTEs gradually in small chunks
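For illustration, the sequence above can be modeled in a few lines of userspace C. The bitmap helpers below are simplified stand-ins for the kernel's, not the real KVM code:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NPAGES        64UL
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS  ((NPAGES + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Stand-in for a memslot's dirty bitmap. */
static unsigned long dirty_bitmap[BITMAP_LONGS];

/* Step 1: on first enable, mark every page dirty up front instead of
 * clearing all leaf SPTE D-bits under mmu_lock. */
static void enable_dirty_log_initially_set(void)
{
	memset(dirty_bitmap, 0xff, sizeof(dirty_bitmap));
}

/* Step 3: KVM_GET_DIRTY_LOG just reports the bitmap. */
static int page_is_dirty(unsigned long nr)
{
	return (dirty_bitmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* Step 4: KVM_CLEAR_DIRTY_LOG clears one small chunk at a time; the
 * real code also clears the D-bits of the leaf SPTEs in that range,
 * so the mmu_lock hold time is bounded by the chunk size. */
static void clear_dirty_log_chunk(unsigned long first, unsigned long n)
{
	unsigned long i;

	for (i = first; i < first + n && i < NPAGES; i++)
		dirty_bitmap[i / BITS_PER_LONG] &=
			~(1UL << (i % BITS_PER_LONG));
}
```

Clearing in chunks is what keeps the per-call mmu_lock hold time small compared with write-protecting the whole slot at enable time.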
> >
> > Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment, I did
> > some tests with a 128G windows VM and counted the time taken of
> > memory_global_dirty_log_start, here is the numbers:
> >
> > VM Size    Before    After optimization
> > 128G       460ms     10ms
> >
> > Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
> > ---
> > v2:
> > * add new bit to KVM_ENABLE_CAP for KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 [Paolo]
> > * support non-PML path [Peter]
> > * delete the unnecessary ifdef and make the initialization of bitmap
> > more clear [Sean]
> > * document the new bits and tweak the testcase
> >
> > Documentation/virt/kvm/api.rst               | 23 +++++++++++++++++------
> > arch/x86/kvm/mmu/mmu.c                       |  8 ++++++--
> > arch/x86/kvm/vmx/vmx.c                       |  3 ++-
> > include/linux/kvm_host.h                     |  7 ++++++-
> > tools/testing/selftests/kvm/dirty_log_test.c |  3 ++-
> > virt/kvm/kvm_main.c                          | 25 ++++++++++++++++++-------
> > 6 files changed, 51 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 97a72a5..1afd310 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1267,9 +1267,11 @@ pages in the host.
> > The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > KVM_MEM_READONLY. The former can be set to instruct KVM to keep track
> > of writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only. In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +use it. It will be different if the KVM_DIRTY_LOG_INITIALLY_SET flag of
> > +KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is set. For more information, see
> > +the description of the capability. The latter can be set, if
> > +KVM_CAP_READONLY_MEM capability allows it, to make a new slot
> > +read-only. In this case, writes to this memory will be posted to userspace
> > +as KVM_EXIT_MMIO exits.
>
> Not sure about others, but my own preference is that we keep this part
> untouched... The changed document could be even more confusing to me...
I just wanted to point out that the code takes a different path when the
KVM_DIRTY_LOG_INITIALLY_SET flag is set, but I admit that the "difference"
is not described clearly here. Since it is described at [1] below, keeping
this part untouched may be the better choice.
> >
> > When the KVM_CAP_SYNC_MMU capability is available, changes in the
> > backing of the memory region are automatically reflected into the
> > guest. For example, an
> > @@ -5704,10 +5706,19 @@ and injected exceptions.
> > :Architectures: x86, arm, arm64, mips
> > :Parameters: args[0] whether feature should be enabled or not
> >
> > -With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
> > +Valid flags are::
> > +
> > +  #define KVM_DIRTY_LOG_MANUAL_PROTECT (1 << 0)
> > +  #define KVM_DIRTY_LOG_INITIALLY_SET  (1 << 1)
> > +
> > +With KVM_DIRTY_LOG_MANUAL_PROTECT set, KVM_GET_DIRTY_LOG will not automatically
> > clear and write-protect all pages that are returned as dirty.
> > Rather, userspace will have to do this operation separately using
> > -KVM_CLEAR_DIRTY_LOG.
> > +KVM_CLEAR_DIRTY_LOG. With KVM_DIRTY_LOG_INITIALLY_SET set, all the
> > +bits of the dirty bitmap will be initialized to 1 when created, dirty
> > +logging will be enabled gradually in small chunks using
> > +KVM_CLEAR_DIRTY_LOG ioctl. However, the KVM_DIRTY_LOG_INITIALLY_SET
> > +depends on KVM_DIRTY_LOG_MANUAL_PROTECT, it can not be set
> > +individually and supports x86 only for now.
[1]
>
> Need s/KVM_DIRTY_LOG_MANUAL_PROTECT/KVM_DIRTY_LOG_MANUAL_PROTECT2/?
Since the capability name is KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, this renaming
seems reasonable. If no one objects, I'll take it in the next version.
>
> >
> > At the cost of a slightly more complicated operation, this provides
> > better scalability and responsiveness for two reasons. First,
> > @@ -5716,7 +5727,7 @@ than requiring to sync a full memslot; this ensures
> > that KVM does not take spinlocks for an extended period of time.
> > Second, in some cases a large amount of time can pass between a call
> > to KVM_GET_DIRTY_LOG and userspace actually using the data in the
> > page. Pages can be modified
> > -during this time, which is inefficint for both the guest and userspace:
> > +during this time, which is inefficient for both the guest and userspace:
> > the guest will incur a higher penalty due to write protection faults,
> > while userspace can see false reports of dirty pages. Manual
> > reprotection helps reducing this time, improving guest performance
> > and reducing the
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 87e9ba2..f9c120e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5865,8 +5865,12 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > bool flush;
> >
> > spin_lock(&kvm->mmu_lock);
> > - flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
> > - false);
> > + if (kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET)
> > + flush = slot_handle_large_level(kvm, memslot,
> > + slot_rmap_write_protect, false);
> > + else
> > + flush = slot_handle_all_level(kvm, memslot,
> > + slot_rmap_write_protect, false);
> > spin_unlock(&kvm->mmu_lock);
> >
> > /*
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 3be25ec..fcc585a 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7201,7 +7201,8 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
> > static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> > 			struct kvm_memory_slot *slot)
> > {
> > - kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > + if (!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET))
> > +	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > }
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index e89eb67..a555b52 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -39,6 +39,11 @@
> > #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > #endif
> >
> > +#define KVM_DIRTY_LOG_MANUAL_PROTECT (1 << 0)
> > +#define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1)
> > +#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT | \
> > +				    KVM_DIRTY_LOG_INITIALLY_SET)
> > +
> > /*
> > * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
> > * in kvm, other bits are visible for userspace which are defined in
> > @@ -493,7 +498,7 @@ struct kvm {
> > #endif
> > long tlbs_dirty;
> > struct list_head devices;
> > - bool manual_dirty_log_protect;
> > + u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > struct kvm_stat_data **debugfs_stat_data;
> > struct srcu_struct srcu;
> > diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> > index 5614222..2a493c1 100644
> > --- a/tools/testing/selftests/kvm/dirty_log_test.c
> > +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> > @@ -317,10 +317,11 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
> > host_bmap_track = bitmap_alloc(host_num_pages);
> >
> > #ifdef USE_CLEAR_DIRTY_LOG
> > + int ret = kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
> > struct kvm_enable_cap cap = {};
> >
> > cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
> > - cap.args[0] = 1;
> > + cap.args[0] = ret;
>
> You enabled the initial-all-set but didn't really check it, so it didn't help much
> from the testcase pov...
vm_enable_cap is called afterwards, and its return value is checked inside
it. May I ask whether that check is enough, or whether the value needs to be
read back through something like vm_get_cap?
> I'd suggest you drop this change, and you can work
> on top after this patch can be accepted.
OK. I think the input values for cap.args[0] should be tested: 0, 1 and 3
should be accepted, while the other values should be rejected.
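As a standalone sketch of what that check would accept and reject (the flag values are copied from the patch; the helper name is made up for illustration, not kernel code):

```c
#include <stdint.h>

#define KVM_DIRTY_LOG_MANUAL_PROTECT (1 << 0)
#define KVM_DIRTY_LOG_INITIALLY_SET  (1 << 1)
#define KVM_DIRTY_LOG_MANUAL_CAPS    (KVM_DIRTY_LOG_MANUAL_PROTECT | \
				      KVM_DIRTY_LOG_INITIALLY_SET)

/* Hypothetical userspace version of the cap->args[0] validation:
 * 0, 1 and 3 pass, while 2 (INITIALLY_SET without MANUAL_PROTECT)
 * and any unknown bits are rejected.
 * Returns 0 on success, -1 on rejection. */
static int check_manual_dirty_log_arg(uint64_t arg0)
{
	if (arg0 & ~(uint64_t)KVM_DIRTY_LOG_MANUAL_CAPS)
		return -1;
	if (arg0 == KVM_DIRTY_LOG_INITIALLY_SET)
		return -1;
	return 0;
}
```

A testcase could loop over a handful of values and assert exactly this accept/reject pattern against the real ioctl's return code.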
>
> (Not to mention the original test actually verified that we don't break, which
> seems good..)
>
> > vm_enable_cap(vm, &cap);
> > #endif
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 70f03ce..f2631d0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -862,7 +862,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> > * Allocation size is twice as large as the actual dirty bitmap size.
> > * See x86's kvm_vm_ioctl_get_dirty_log() why this is needed.
> > */
> > -static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
> > +static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
>
> This change seems irrelevant..
>
> > {
> > unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot);
> >
> > @@ -1094,8 +1094,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >
> > /* Allocate page dirty bitmap if needed */
> > if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
> > - if (kvm_create_dirty_bitmap(&new) < 0)
> > + if (kvm_alloc_dirty_bitmap(&new))
>
> Same here.
Renaming kvm_create_dirty_bitmap to kvm_alloc_dirty_bitmap was Sean's
suggestion, to make it clear that the helper is only responsible for
allocation; all the bitmap bits are then set to 1 afterwards using
bitmap_set. That seems reasonable to me. Do you still think it's better to
keep the name untouched?
>
> > goto out_free;
> > +
> > + if (kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET)
>
> (Maybe time to introduce a helper to shorten this check. :)
Yeah, but could this be done on top of this patch?
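For what it's worth, such a helper could look something like this as a userspace sketch (the name and the plain-integer parameter are only illustrative; the kernel version would take a struct kvm * and whatever name gets merged may differ):

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_DIRTY_LOG_MANUAL_PROTECT (1 << 0)
#define KVM_DIRTY_LOG_INITIALLY_SET  (1 << 1)

/* In the kernel this would read kvm->manual_dirty_log_protect;
 * here it takes the field's value directly for illustration. */
static bool kvm_dirty_log_initially_set(uint64_t manual_dirty_log_protect)
{
	return manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET;
}
```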
>
> > + bitmap_set(new.dirty_bitmap, 0, new.npages);
> > }
> >
> > slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
> > @@ -1255,7 +1258,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
> >
> > n = kvm_dirty_bitmap_bytes(memslot);
> > *flush = false;
> > - if (kvm->manual_dirty_log_protect) {
> > + if (kvm->manual_dirty_log_protect &
> KVM_DIRTY_LOG_MANUAL_PROTECT) {
>
> Can also introduce a helper for this too.
>
> Side note: logically this line doesn't need a change, because bit 1 depends on bit
> 0 so if (kvm->manual_dirty_log_protect) it means bit 0 must be set.
Yes, I agree.
>
> > /*
> > * Unlike kvm_get_dirty_log, we always return false in *flush,
> > * because no flush is needed until KVM_CLEAR_DIRTY_LOG. There
> > @@ -3310,9 +3313,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > case KVM_CAP_IOEVENTFD_ANY_LENGTH:
> > case KVM_CAP_CHECK_EXTENSION_VM:
> > case KVM_CAP_ENABLE_CAP_VM:
> > -#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
> > - case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2:
> > -#endif
> > return 1;
> > #ifdef CONFIG_KVM_MMIO
> > case KVM_CAP_COALESCED_MMIO:
> > @@ -3320,6 +3320,10 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > case KVM_CAP_COALESCED_PIO:
> > return 1;
> > #endif
> > +#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
> > + case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2:
> > + return KVM_DIRTY_LOG_MANUAL_CAPS;
> > +#endif
> > #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> > case KVM_CAP_IRQ_ROUTING:
> > return KVM_MAX_IRQ_ROUTES;
> > @@ -3348,7 +3352,14 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> > switch (cap->cap) {
> > #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
> > case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2:
> > - if (cap->flags || (cap->args[0] & ~1))
> > + if (cap->flags ||
> > + (cap->args[0] & ~KVM_DIRTY_LOG_MANUAL_CAPS) ||
> > + /* The capability of KVM_DIRTY_LOG_INITIALLY_SET depends
> > + * on KVM_DIRTY_LOG_MANUAL_PROTECT, it should not be
> > + * set individually
> > + */
> > + ((cap->args[0] & KVM_DIRTY_LOG_MANUAL_CAPS) ==
> > + KVM_DIRTY_LOG_INITIALLY_SET))
>
> How about something easier to read? :)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70f03ce0e5c1..9dfbab2a9929 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3348,7 +3348,10 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> switch (cap->cap) {
> #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
> case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2:
> - if (cap->flags || (cap->args[0] & ~1))
> + if (cap->flags || (cap->args[0] & ~3))
> + return -EINVAL;
> + /* Allow 00, 01, and 11. */
> + if (cap->args[0] == KVM_DIRTY_LOG_INITIALLY_SET)
> return -EINVAL;
> kvm->manual_dirty_log_protect = cap->args[0];
> return 0;
Looks simpler than mine.
> Otherwise it looks good to me!
Thanks!
Regards,
Jay Zhou
>
> Thanks,
>
> --
> Peter Xu
2020-02-20 4:28 [PATCH v2] KVM: x86: enable dirty log gradually in small chunks Jay Zhou
2020-02-20 19:17 ` Peter Xu
2020-02-20 19:23 ` Sean Christopherson
2020-02-20 19:42 ` Peter Xu
2020-02-21 9:43 ` Zhoujian (jay)
2020-02-21 9:31 ` Zhoujian (jay) [this message]
2020-02-21 15:19 ` Peter Xu
2020-02-22 8:11 ` Zhoujian (jay)
2020-02-20 19:28 ` Peter Xu
2020-02-21 9:53 ` Zhoujian (jay)
2020-02-21 15:41 ` Peter Xu
2020-02-22 8:18 ` Zhoujian (jay)