From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1820CC636CC for ; Wed, 15 Feb 2023 08:41:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229820AbjBOIlX (ORCPT ); Wed, 15 Feb 2023 03:41:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41056 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229605AbjBOIlW (ORCPT ); Wed, 15 Feb 2023 03:41:22 -0500 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F24E81721 for ; Wed, 15 Feb 2023 00:41:20 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 7EAB861A2D for ; Wed, 15 Feb 2023 08:41:20 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id BA61DC433D2; Wed, 15 Feb 2023 08:41:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1676450479; bh=EvfQc8jIINLXMq6IZVpXfIVBft3+KoVyE4+QKsWsEhE=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=HvELpAVG0BsdvtHBRQC9soKnBhRdMLdNu4Tp3m8XOYfwEwd3Vi5KBwXrmUHNqFMqX xvF8ffI+qLNCcvex6bXuizyV3zvA+WI6I7XHyrF1cfKmkBpUDP1csJPOlDlAio8K48 FbOw4pTnK85kdmc2WGv/19TRm2ir9wP7pBVCer5r4fW77wsHPU/0q6GL1jcDVtaTRZ gJAahtdnlwBzvFtyjCX4dBHCwuZc12u4VqQFoBp/grqNu+hFKLjM29bKehU82BWCRd //qAwj6l64lnq+DTNSHEv2/RChU/p/b3Tl0sveAu4cdibiyuDhqSfkIZSVfAjhQELU QiEDEImsfomvw== Received: from ip-185-104-136-29.ptr.icomera.net ([185.104.136.29] helo=wait-a-minute.misterjones.org) by disco-boy.misterjones.org with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.95) (envelope-from ) id 1pSDLn-00AXU9-Uj; Wed, 15 Feb 2023 08:41:16 +0000 Date: Wed, 15 Feb 2023 08:41:13 +0000 Message-ID: <87mt5fz5g6.wl-maz@kernel.org> From: Marc Zyngier To: Anish Moorthy Cc: Paolo Bonzini , Oliver Upton , Sean Christopherson , James Houghton , Ben Gardon , David Matlack , Ricardo Koller , Chao Peng , Axel Rasmussen , kvm@vger.kernel.org, kvmarm@lists.linux.dev Subject: Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits In-Reply-To: <20230215011614.725983-6-amoorthy@google.com> References: <20230215011614.725983-1-amoorthy@google.com> <20230215011614.725983-6-amoorthy@google.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/27.1 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII X-SA-Exim-Connect-IP: 185.104.136.29 X-SA-Exim-Rcpt-To: amoorthy@google.com, pbonzini@redhat.com, oliver.upton@linux.dev, seanjc@google.com, jthoughton@google.com, bgardon@google.com, dmatlack@google.com, ricarkol@google.com, chao.p.peng@linux.intel.com, axelrasmussen@google.com, kvm@vger.kernel.org, kvmarm@lists.linux.dev X-SA-Exim-Mail-From: maz@kernel.org X-SA-Exim-Scanned: No (on disco-boy.misterjones.org); SAEximRunCond expanded to false Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Wed, 15 Feb 2023 01:16:11 +0000, Anish Moorthy wrote: > > This new KVM exit allows userspace to handle missing memory. It > indicates that the pages in the range [gpa, gpa + size) must be mapped. > > The "flags" field actually goes unused in this series: it's included for > forward compatibility with [1], should this series happen to go in > first. > > [1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/ > > Signed-off-by: Anish Moorthy > Acked-by: James Houghton > --- > Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++ > include/linux/kvm_host.h | 13 +++++++++++ > include/uapi/linux/kvm.h | 13 ++++++++++- > tools/include/uapi/linux/kvm.h | 7 ++++++ > virt/kvm/kvm_main.c | 26 +++++++++++++++++++++ > 5 files changed, 100 insertions(+), 1 deletion(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 9807b05a1b571..4b06e60668686 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -5937,6 +5937,18 @@ delivery must be provided via the "reg_aen" struct. > The "pad" and "reserved" fields may be used for future extensions and should be > set to 0s by userspace. > > +4.137 KVM_SET_MEM_FAULT_NOWAIT > +------------------------------ > + > +:Capability: KVM_CAP_MEM_FAULT_NOWAIT > +:Architectures: x86, arm64 > +:Type: vm ioctl > +:Parameters: bool state (in) Huh. See towards the end of this patch why I think this is a pretty bad idea. > +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present. Please pick a meaningful error code instead of -1. And if you intended this as -EPERM, please explain the rationale (-EINVAL would seem more appropriate). > + > +Enables (state=true) or disables (state=false) waitless memory faults. For more > +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT. > + > 5. The kvm_run structure > ======================== > > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return > values of SBI call before resuming the VCPU. For more details on RISC-V SBI > spec refer, https://github.com/riscv/riscv-sbi-doc. > > +:: > + > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + __u64 gpa; > + __u64 size; > + } memory_fault; > + > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has > +encountered a memory error which is not handled by KVM kernel module and > +which userspace may choose to handle. No, the vcpu hasn't "encountered a memory error". The hypervisor has taken a page fault, which is very different. And it isn't that KVM couldn't handle it (or we wouldn't have a hypervisor at all). From what I understand of this series (possibly very little), userspace has to *ask* for these, and they are delivered in specific circumstances. Which are? > + > +'gpa' and 'size' indicate the memory range the error occurs at. Userspace > +may handle the error and return to KVM to retry the previous memory access. What are these *exactly*? In what unit? What guarantees that the process eventually converges? How is userspace supposed to handle the fault (which is *not* an error)? Don't you need to communicate other information, such as the type of fault (read, write, permission or translation fault...)? > + > :: > > /* KVM_EXIT_NOTIFY */ > @@ -7577,6 +7604,21 @@ This capability is aimed to mitigate the threat that malicious VMs can > cause CPU stuck (due to event windows don't open up) and make the CPU > unavailable to host or other VMs. > > +7.34 KVM_CAP_MEM_FAULT_NOWAIT > +----------------------------- > + > +:Architectures: x86, arm64 > +:Target: VM > +:Parameters: None > +:Returns: 0 on success, or -EINVAL if capability is not supported. > + > +The presence of this capability indicates that userspace can enable/disable > +waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl. > + > +When waitless memory faults are enabled, fast get_user_pages failures when > +handling EPT/Shadow Page Table violations will cause a vCPU exit > +(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages. Do you really expect a random VMM hacker to understand what GUP is? Also, the x86 verbiage makes zero sense to me. Please write this in a way that is useful for userspace people, because they are the ones consuming this documentation. I really can't imagine anyone being able to write something useful based on this. > + > 8. Other capabilities. > ====================== > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 109b18e2789c4..9352e7f8480fb 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -801,6 +801,9 @@ struct kvm { > bool vm_bugged; > bool vm_dead; > > + rwlock_t mem_fault_nowait_lock; > + bool mem_fault_nowait; A full-fat rwlock to protect a single bool? What benefits do you expect from a rwlock? Why is it preferable to an atomic access, or a simple bitop? > + > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER > struct notifier_block pm_notifier; > #endif > @@ -2278,4 +2281,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr) > /* Max number of entries allowed for each kvm dirty ring */ > #define KVM_DIRTY_RING_MAX_ENTRIES 65536 > > +static inline bool memory_faults_enabled(struct kvm *kvm) > +{ > + bool ret; > + > + read_lock(&kvm->mem_fault_nowait_lock); > + ret = kvm->mem_fault_nowait; > + read_unlock(&kvm->mem_fault_nowait_lock); > + return ret; > +} > + > #endif > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 55155e262646e..064fbfed97f01 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -264,6 +264,7 @@ struct kvm_xen_exit { > #define KVM_EXIT_RISCV_SBI 35 > #define KVM_EXIT_RISCV_CSR 36 > #define KVM_EXIT_NOTIFY 37 > +#define KVM_EXIT_MEMORY_FAULT 38 > > /* For KVM_EXIT_INTERNAL_ERROR */ > /* Emulate instruction failed. */ > @@ -505,6 +506,12 @@ struct kvm_run { > #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) > __u32 flags; > } notify; > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + __u64 flags; > + __u64 gpa; > + __u64 size; > + } memory_fault; Sigh... This doesn't even match the documentation. > /* Fix the size of the union. */ > char padding[256]; > }; > @@ -1175,6 +1182,7 @@ struct kvm_ppc_resize_hpt { > #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223 > #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224 > #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225 > +#define KVM_CAP_MEM_FAULT_NOWAIT 226 > > #ifdef KVM_CAP_IRQ_ROUTING > > @@ -1658,7 +1666,7 @@ struct kvm_enc_region { > /* Available with KVM_CAP_ARM_SVE */ > #define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int) > > -/* Available with KVM_CAP_S390_VCPU_RESETS */ > +/* Available with KVM_CAP_S390_VCPU_RESETS */ Unrelated change? > #define KVM_S390_NORMAL_RESET _IO(KVMIO, 0xc3) > #define KVM_S390_CLEAR_RESET _IO(KVMIO, 0xc4) > > @@ -2228,4 +2236,7 @@ struct kvm_s390_zpci_op { > /* flags for kvm_s390_zpci_op->u.reg_aen.flags */ > #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0) > > +/* Available with KVM_CAP_MEM_FAULT_NOWAIT */ > +#define KVM_SET_MEM_FAULT_NOWAIT _IOWR(KVMIO, 0xd2, bool) > + > #endif /* __LINUX_KVM_H */ > diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h > index 20522d4ba1e0d..5d9e3f48a9634 100644 > --- a/tools/include/uapi/linux/kvm.h > +++ b/tools/include/uapi/linux/kvm.h > @@ -264,6 +264,7 @@ struct kvm_xen_exit { > #define KVM_EXIT_RISCV_SBI 35 > #define KVM_EXIT_RISCV_CSR 36 > #define KVM_EXIT_NOTIFY 37 > +#define KVM_EXIT_MEMORY_FAULT 38 > > /* For KVM_EXIT_INTERNAL_ERROR */ > /* Emulate instruction failed. */ > @@ -505,6 +506,12 @@ struct kvm_run { > #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) > __u32 flags; > } notify; > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + __u64 flags; > + __u64 gpa; > + __u64 size; > + } memory_fault; > /* Fix the size of the union. */ > char padding[256]; > }; > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index dae5f48151032..8e5bfc00d1181 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1149,6 +1149,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > INIT_LIST_HEAD(&kvm->devices); > kvm->max_vcpus = KVM_MAX_VCPUS; > > + rwlock_init(&kvm->mem_fault_nowait_lock); > + kvm->mem_fault_nowait = false; > + > BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX); > > /* > @@ -2313,6 +2316,16 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, > } > #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */ > > +static int kvm_vm_ioctl_set_mem_fault_nowait(struct kvm *kvm, bool state) > +{ > + if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MEM_FAULT_NOWAIT)) > + return -1; > + write_lock(&kvm->mem_fault_nowait_lock); > + kvm->mem_fault_nowait = state; > + write_unlock(&kvm->mem_fault_nowait_lock); > + return 0; > +} > + > struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn) > { > return __gfn_to_memslot(kvm_memslots(kvm), gfn); > @@ -4675,6 +4688,10 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm, > > return r; > } > + case KVM_CAP_MEM_FAULT_NOWAIT: > + if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)) > + return -EINVAL; > + return 0; > default: > return kvm_vm_ioctl_enable_cap(kvm, cap); > } > @@ -4892,6 +4909,15 @@ static long kvm_vm_ioctl(struct file *filp, > r = 0; > break; > } > + case KVM_SET_MEM_FAULT_NOWAIT: { > + bool state; > + > + r = -EFAULT; > + if (copy_from_user(&state, argp, sizeof(state))) A copy_from_user for... a bool? Why don't you directly treat argp as the boolean itself? Of even better, make it a more interesting quantity so that the API can evolve in the future. As it stands, it is not extensible, which means it is already dead, if Linux APIs are anything to go by. On a related subject: is there a set of patches for any of the main VMMs making any use of this? M. -- Without deviation from the norm, progress is not possible.