Date: Tue, 24 May 2022 08:11:56 +0100
Message-ID: <87h75fmmkj.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Shivam Kumar <shivam.kumar1@nutanix.com>
Cc: pbonzini@redhat.com, seanjc@google.com, james.morse@arm.com,
    borntraeger@linux.ibm.com, david@redhat.com, kvm@vger.kernel.org,
    Shaju Abraham <shaju.abraham@nutanix.com>,
    Manish Mishra <manish.mishra@nutanix.com>,
    Anurag Madnawat <anurag.madnawat@nutanix.com>
Subject: Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
In-Reply-To: <20220521202937.184189-2-shivam.kumar1@nutanix.com>
References: <20220521202937.184189-1-shivam.kumar1@nutanix.com>
        <20220521202937.184189-2-shivam.kumar1@nutanix.com>

On Sat, 21 May 2022 21:29:36 +0100,
Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> 
> Define variables to track and throttle memory dirtying for every vcpu.
> 
> dirty_count:   Number of pages the vcpu has dirtied since its creation,
>                while dirty logging is enabled.
> dirty_quota:   Number of pages the vcpu is allowed to dirty. To dirty
>                more, it needs to request more quota by exiting to
>                userspace.
> 
> Implement the flow for throttling based on dirty quota.
> 
> i) Increment dirty_count for the vcpu whenever it dirties a page.
> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>     count equals/exceeds dirty quota) to request more dirty quota.
> 
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>  Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/spte.c        |  4 ++--
>  arch/x86/kvm/vmx/vmx.c         |  3 +++
>  arch/x86/kvm/x86.c             |  4 ++++

Please split the x86 support into its own patch.

>  include/linux/kvm_host.h       | 15 +++++++++++++++
>  include/linux/kvm_types.h      |  1 +
>  include/uapi/linux/kvm.h       | 12 ++++++++++++
>  virt/kvm/kvm_main.c            |  7 ++++++-
>  8 files changed, 75 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f3172376ec3..a9317ed31d06 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +        /* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +        struct {
> +                __u64 count;
> +                __u64 quota;
> +        } dirty_quota_exit;
> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
> +makes the following information available to the userspace:
> +    'count' field: the current count of pages dirtied by the VCPU, can be
> +                   skewed based on the size of the pages accessed by each vCPU.
> +    'quota' field: the observed dirty quota just before the exit to userspace.
> +The userspace can design a strategy to allocate the overall scope of dirtying
> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> +quota throttling, the userspace can make a decision to either update (increase)
> +the quota or to put the VCPU to sleep for some time.
> +
> ::
> 
>          /* Fix the size of the union. */
> @@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>  
>  ::
>  
> +        /*
> +         * Number of pages the vCPU is allowed to have dirtied over its entire
> +         * lifetime. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
> +         * is reached/exceeded.
> +         */
> +        __u64 dirty_quota;
> +Please note that enforcing the quota is best effort, as the guest may dirty
> +multiple pages before KVM can recheck the quota. However, unless KVM is using
> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> +KVM will detect quota exhaustion within a handful of dirtied pages. If a
> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
> +(512 entries for PML).
> +
> +::
>  };
> 
> 
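As an aside, the userspace behaviour this documentation describes would
look something like the below. This is an untested sketch of my own, not
part of the series, and it assumes the uapi additions from this patch;
handle_dirty_quota_exit(), QUOTA_INCREMENT and THROTTLE_DELAY_US are names
and policy values I'm inventing purely for illustration:

/*
 * Hypothetical VMM-side handler for KVM_EXIT_DIRTY_QUOTA_EXHAUSTED,
 * called from the KVM_RUN loop with the vcpu's mmap'ed kvm_run.
 * QUOTA_INCREMENT and THROTTLE_DELAY_US are made-up policy knobs,
 * not part of the proposed ABI.
 */
#include <linux/kvm.h>
#include <stdint.h>
#include <unistd.h>

#define QUOTA_INCREMENT		4096ULL	/* pages granted per top-up (made up) */
#define THROTTLE_DELAY_US	10000	/* sleep when over budget (made up) */

static void handle_dirty_quota_exit(struct kvm_run *run,
				    uint64_t vm_budget_left)
{
	/* Pages dirtied so far by this vCPU, as reported by KVM. */
	uint64_t count = run->dirty_quota_exit.count;

	if (vm_budget_left >= QUOTA_INCREMENT) {
		/* Budget available: extend the lifetime quota and resume. */
		run->dirty_quota = count + QUOTA_INCREMENT;
	} else {
		/* Over budget: throttle the vCPU, then hand out a minimal top-up. */
		usleep(THROTTLE_DELAY_US);
		run->dirty_quota = count + 1;
	}
	/* The caller then re-enters KVM_RUN. */
}

Note that the sketch always moves dirty_quota forward: with the check as
written in this patch, re-entering KVM_RUN without raising the quota would
immediately produce the same exit again.
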
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 73cfe62fdad1..01f0d2a04796 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		 "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>  		 get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
>  
> -	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
> +	if (spte & PT_WRITABLE_MASK) {
>  		/* Enforced by kvm_mmu_hugepage_adjust. */
> -		WARN_ON(level > PG_LEVEL_4K);
> +		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));
>  		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
>  	}
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b730d799c26e..5cbe4992692a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>  		 */
>  		if (__xfer_to_guest_mode_work_pending())
>  			return 1;
> +
> +		if (!kvm_vcpu_check_dirty_quota(vcpu))
> +			return 0;
>  	}
>  
>  	return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb4029660bd9..0b35b8cc0274 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  	vcpu->arch.l1tf_flush_l1d = true;
>  
>  	for (;;) {
> +		r = kvm_vcpu_check_dirty_quota(vcpu);

I really wonder why this has to be checked on each and every run. Why
isn't this a request set by the core code instead?

> +		if (!r)
> +			break;
> +
>  		if (kvm_vcpu_running(vcpu)) {
>  			r = vcpu_enter_guest(vcpu);
>  		} else {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f11039944c08..ca1ac970a6cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>  }
>  
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)

Why is this inline instead of a normal call?

> +{
> +	struct kvm_run *run = vcpu->run;
> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;

What happens when pages_dirtied becomes large and dirty_quota has to
wrap to allow further progress?

> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;
> +}
> +
>  /*
>   * Some of the bitops functions do not support too long bitmaps.
>   * This number must be determined not to exceed such limits.
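To expand on the request idea above: the dirty path would only flag the
vcpu, and the arch run loop would turn that into the userspace exit.
Roughly along these lines (completely untested, and KVM_REQ_DIRTY_QUOTA_EXIT
is a name I'm making up on the spot; kvm_make_request()/kvm_check_request()
are the existing helpers):

/* In mark_page_dirty_in_slot(), once pages_dirtied has been bumped: */
	u64 quota = READ_ONCE(vcpu->run->dirty_quota);

	if (quota && vcpu->stat.generic.pages_dirtied >= quota)
		kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);

/* In the arch request processing (vcpu_enter_guest() on x86, for example): */
	if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
		vcpu->run->dirty_quota_exit.count = vcpu->stat.generic.pages_dirtied;
		vcpu->run->dirty_quota_exit.quota = READ_ONCE(vcpu->run->dirty_quota);
		r = 0;		/* back to userspace */
		goto out;
	}

That keeps the check out of the hot loop and only costs something when a
page has actually been dirtied.
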
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index dceac12c1ce5..7f42486b0405 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
>  	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>  	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>  	u64 blocking;
> +	u64 pages_dirtied;
>  };
>  
>  #define KVM_STATS_NAME_SIZE 48
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 507ee1f2aa96..1d9531efe1fb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_X86_BUS_LOCK    33
>  #define KVM_EXIT_XEN             34
>  #define KVM_EXIT_RISCV_SBI       35
> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -487,6 +488,11 @@ struct kvm_run {
>  			unsigned long args[6];
>  			unsigned long ret[2];
>  		} riscv_sbi;
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -508,6 +514,12 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
> +	 * lifetime quota is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;

How is the feature detected from userspace? I don't see how it can make
use of it without knowing about it in the first place.

> };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0afc016cc54d..041ab464405d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>  		return;
>  #endif
>  
> -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> +	if (!memslot)
> +		return;
> +
> +	vcpu->stat.generic.pages_dirtied++;
> +
> +	if (kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
>  

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.