From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 91134C433EF for ; Thu, 26 May 2022 13:30:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347540AbiEZNaL (ORCPT ); Thu, 26 May 2022 09:30:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33616 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347570AbiEZNaJ (ORCPT ); Thu, 26 May 2022 09:30:09 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 09788D80B5 for ; Thu, 26 May 2022 06:30:07 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 9EEC261ACC for ; Thu, 26 May 2022 13:30:06 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id D2DA8C385B8; Thu, 26 May 2022 13:30:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1653571806; bh=laTEKCElwLiuRVLHjHLcomrSvcC/2/bqRlZtmfbHCwU=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=XbL1ilCufDtuaJ8Rwy4maEI9oVrCXNft9k+0K7GKddSl/onk1jTw8JqzJ56ItVzTm 0fB7fcwbjvW2Lt8GzdZrhd9MqzmgvRlGlfDiRKtRHuQqUhHOdwSUORpriwW/OKCl+F wCY0eWRjSCtM3z6kXUCsM27mOkPQioyeTCggcn5CsGiVTPhUHSAzTJQ/XModwC558n M8qPOoTF3dictBlwnFCxET0fy1y6gqh/8mXm3WgZnL9XqqeItwAMyIZlFi2f4PmxM3 BRloWb6xDV6C83j9oNHBWYcu3qJlvF3Vxs/nxJYS1x9EUdq1YEPEkUsa4FQPoEMdaq I/fx+hi/vBF3g== Received: from athedsl-4557779.home.otenet.gr ([94.70.87.219] helo=wait-a-minute.misterjones.org) by disco-boy.misterjones.org with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1nuDYw-00DrFD-Uu; Thu, 26 May 2022 14:30:03 +0100 Date: Thu, 26 May 2022 14:30:00 +0100 Message-ID: <878rqomnfr.wl-maz@kernel.org> From: Marc Zyngier To: Shivam Kumar Cc: pbonzini@redhat.com, seanjc@google.com, james.morse@arm.com, borntraeger@linux.ibm.com, david@redhat.com, kvm@vger.kernel.org, Shaju Abraham , Manish Mishra , Anurag Madnawat Subject: Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus In-Reply-To: References: <20220521202937.184189-1-shivam.kumar1@nutanix.com> <20220521202937.184189-2-shivam.kumar1@nutanix.com> <87h75fmmkj.wl-maz@kernel.org> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/27.1 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII X-SA-Exim-Connect-IP: 94.70.87.219 X-SA-Exim-Rcpt-To: shivam.kumar1@nutanix.com, pbonzini@redhat.com, seanjc@google.com, james.morse@arm.com, borntraeger@linux.ibm.com, david@redhat.com, kvm@vger.kernel.org, shaju.abraham@nutanix.com, manish.mishra@nutanix.com, anurag.madnawat@nutanix.com X-SA-Exim-Mail-From: maz@kernel.org X-SA-Exim-Scanned: No (on disco-boy.misterjones.org); SAEximRunCond expanded to false Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Thu, 26 May 2022 10:33:04 +0100, Shivam Kumar wrote: > > > On 24/05/22 12:41 pm, Marc Zyngier wrote: > > On Sat, 21 May 2022 21:29:36 +0100, > > Shivam Kumar wrote: > >> Define variables to track and throttle memory dirtying for every vcpu. > >> > >> dirty_count: Number of pages the vcpu has dirtied since its creation, > >> while dirty logging is enabled. > >> dirty_quota: Number of pages the vcpu is allowed to dirty. To dirty > >> more, it needs to request more quota by exiting to > >> userspace. > >> > >> Implement the flow for throttling based on dirty quota. > >> > >> i) Increment dirty_count for the vcpu whenever it dirties a page. > >> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty > >> count equals/exceeds dirty quota) to request more dirty quota. > >> > >> Suggested-by: Shaju Abraham > >> Suggested-by: Manish Mishra > >> Co-developed-by: Anurag Madnawat > >> Signed-off-by: Anurag Madnawat > >> Signed-off-by: Shivam Kumar > >> --- > >> Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++ > >> arch/x86/kvm/mmu/spte.c | 4 ++-- > >> arch/x86/kvm/vmx/vmx.c | 3 +++ > >> arch/x86/kvm/x86.c | 4 ++++ > > Please split the x86 support in its own patch. > I'd merged the changes into one commit based on the feedback received > on previous patchsets (as the change is very simple). I hadn't > received any review from arm and s390 maintainers, so I separated > those changes in different commits. Thanks. Well, you don't get much reviewing if you don't Cc people. I still think having the generic code one its own is the right thing to do, each architecture being added separately. That's irrespective of the simplicity of the patch (I have seen patches with only two lines that contained 3 bugs). > > > >> include/linux/kvm_host.h | 15 +++++++++++++++ > >> include/linux/kvm_types.h | 1 + > >> include/uapi/linux/kvm.h | 12 ++++++++++++ > >> virt/kvm/kvm_main.c | 7 ++++++- > >> 8 files changed, 75 insertions(+), 3 deletions(-) > >> > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > >> index 9f3172376ec3..a9317ed31d06 100644 > >> --- a/Documentation/virt/kvm/api.rst > >> +++ b/Documentation/virt/kvm/api.rst > >> @@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return > >> values of SBI call before resuming the VCPU. For more details on RISC-V SBI > >> spec refer, https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_riscv_riscv-2Dsbi-2Ddoc&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=gKJrc6bbnHvyaBs-J8ysjBmGRkXneAgEgVCMrWtZO4xdbqfmX-zF2y68oOWbuSuA&s=YkC-KO05Du0_A2-m3nWatY5ZPzMYoDbcWjI3k95obPQ&e= . > >> +:: > >> + > >> + /* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */ > >> + struct { > >> + __u64 count; > >> + __u64 quota; > >> + } dirty_quota_exit; > >> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has > >> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure > >> +makes the following information available to the userspace: > >> + 'count' field: the current count of pages dirtied by the VCPU, can be > >> + skewed based on the size of the pages accessed by each vCPU. > >> + 'quota' field: the observed dirty quota just before the exit to userspace. > >> +The userspace can design a strategy to allocate the overall scope of dirtying > >> +for the VM among the vcpus. Based on the strategy and the current state of dirty > >> +quota throttling, the userspace can make a decision to either update (increase) > >> +the quota or to put the VCPU to sleep for some time. > >> + > >> :: > >> /* Fix the size of the union. */ > >> @@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set. > >> :: > >> + /* > >> + * Number of pages the vCPU is allowed to have dirtied over its entire > >> + * lifetime. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota > >> + * is reached/exceeded. > >> + */ > >> + __u64 dirty_quota; > >> +Please note that enforcing the quota is best effort, as the guest may dirty > >> +multiple pages before KVM can recheck the quota. However, unless KVM is using > >> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging, > >> +KVM will detect quota exhaustion within a handful of dirtied page. If a > >> +hardware ring buffer is used, the overrun is bounded by the size of the buffer > >> +(512 entries for PML). > >> + > >> +:: > >> }; > >> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > >> index 73cfe62fdad1..01f0d2a04796 100644 > >> --- a/arch/x86/kvm/mmu/spte.c > >> +++ b/arch/x86/kvm/mmu/spte.c > >> @@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, > >> "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level, > >> get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level)); > >> - if ((spte & PT_WRITABLE_MASK) && > >> kvm_slot_dirty_track_enabled(slot)) { > >> + if (spte & PT_WRITABLE_MASK) { > >> /* Enforced by kvm_mmu_hugepage_adjust. */ > >> - WARN_ON(level > PG_LEVEL_4K); > >> + WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot)); > >> mark_page_dirty_in_slot(vcpu->kvm, slot, gfn); > >> } > >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > >> index b730d799c26e..5cbe4992692a 100644 > >> --- a/arch/x86/kvm/vmx/vmx.c > >> +++ b/arch/x86/kvm/vmx/vmx.c > >> @@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) > >> */ > >> if (__xfer_to_guest_mode_work_pending()) > >> return 1; > >> + > >> + if (!kvm_vcpu_check_dirty_quota(vcpu)) > >> + return 0; > >> } > >> return 1; > >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >> index eb4029660bd9..0b35b8cc0274 100644 > >> --- a/arch/x86/kvm/x86.c > >> +++ b/arch/x86/kvm/x86.c > >> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu) > >> vcpu->arch.l1tf_flush_l1d = true; > >> for (;;) { > >> + r = kvm_vcpu_check_dirty_quota(vcpu); > > I really wonder why this has to be checked on each and every run. Why > > isn't this a request set by the core code instead? > Ack. Thanks. > > > >> + if (!r) > >> + break; > >> + > >> if (kvm_vcpu_running(vcpu)) { > >> r = vcpu_enter_guest(vcpu); > >> } else { > >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > >> index f11039944c08..ca1ac970a6cf 100644 > >> --- a/include/linux/kvm_host.h > >> +++ b/include/linux/kvm_host.h > >> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu) > >> return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE); > >> } > >> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu > >> *vcpu) > > Why is this inline instead of a normal call? > It's a lightweight function and I found similar functions marked > inline in the code. Thanks. That's not a justification. If you mark something inline, this is either because it is performance critical (that's not the case here), or that it must be inlined by construction (not the case either). > >> +{ > >> + struct kvm_run *run = vcpu->run; > >> + u64 dirty_quota = READ_ONCE(run->dirty_quota); > >> + u64 pages_dirtied = vcpu->stat.generic.pages_dirtied; > >> + > >> + if (!dirty_quota || (pages_dirtied < dirty_quota)) > >> + return 1; > > What happens when page_dirtied becomes large and dirty_quota has to > > wrap to allow further progress? > Every time the quota is exhausted, userspace is expected to set it to > pages_dirtied + new quota. So, pages_dirtied will always follow dirty > quota. I'll be sending the qemu patches soon. Thanks. Right, so let's assume that page_dirtied=0xffffffffffffffff (yes, I have dirtied that many pages). I add two page of quota, so dirty_quota=1. pages_dirtied > dirty_quota, and we're back to userspace forever. Epic fail. > > > >> + > >> + run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED; > >> + run->dirty_quota_exit.count = pages_dirtied; > >> + run->dirty_quota_exit.quota = dirty_quota; > >> + return 0; > >> +} > >> + > >> /* > >> * Some of the bitops functions do not support too long bitmaps. > >> * This number must be determined not to exceed such limits. > >> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > >> index dceac12c1ce5..7f42486b0405 100644 > >> --- a/include/linux/kvm_types.h > >> +++ b/include/linux/kvm_types.h > >> @@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic { > >> u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT]; > >> u64 halt_wait_hist[HALT_POLL_HIST_COUNT]; > >> u64 blocking; > >> + u64 pages_dirtied; > >> }; > >> #define KVM_STATS_NAME_SIZE 48 > >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > >> index 507ee1f2aa96..1d9531efe1fb 100644 > >> --- a/include/uapi/linux/kvm.h > >> +++ b/include/uapi/linux/kvm.h > >> @@ -270,6 +270,7 @@ struct kvm_xen_exit { > >> #define KVM_EXIT_X86_BUS_LOCK 33 > >> #define KVM_EXIT_XEN 34 > >> #define KVM_EXIT_RISCV_SBI 35 > >> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36 > >> /* For KVM_EXIT_INTERNAL_ERROR */ > >> /* Emulate instruction failed. */ > >> @@ -487,6 +488,11 @@ struct kvm_run { > >> unsigned long args[6]; > >> unsigned long ret[2]; > >> } riscv_sbi; > >> + /* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */ > >> + struct { > >> + __u64 count; > >> + __u64 quota; > >> + } dirty_quota_exit; > >> /* Fix the size of the union. */ > >> char padding[256]; > >> }; > >> @@ -508,6 +514,12 @@ struct kvm_run { > >> struct kvm_sync_regs regs; > >> char padding[SYNC_REGS_SIZE_BYTES]; > >> } s; > >> + /* > >> + * Number of pages the vCPU is allowed to have dirtied over its entire > >> + * liftime. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the > > lifetime > Ack. Thanks. > >> + * quota is reached/exceeded. > >> + */ > >> + __u64 dirty_quota; > > How is the feature detected from userspace? I don't see how it can > > make use of it without knowing about it the first place. > I hope changing the kernel headers would suffice. We had started the > patch series with a KVM CAP for dirty quota but dropped it as per the > feedback recieved. Thanks. I don't see how anyone can make use of this feature without a capability. How do you know whether you're ever going to see the exit if you can't know if the kernel supports the feature? M. -- Without deviation from the norm, progress is not possible.