From: "Maciej S. Szmigiero"
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
    Suzuki K Poulose, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
    Christian Borntraeger, Janosch Frank, David Hildenbrand,
    Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array
Message-ID: <8017cf9d-2b03-0c27-b78a-41b3d03c308b@maciej.szmigiero.name>
Date: Wed, 3 Nov 2021 12:59:27 +0100

On 01.11.2021 23:29, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Sean Christopherson wrote:
>> On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
>>> On 20.10.2021 00:24, Sean Christopherson wrote:
>>>> E.g. the whole thing can be
>>>>
>>>> 	if (!kvm->arch.n_requested_mmu_pages &&
>>>> 	    (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
>>>> 		unsigned long nr_mmu_pages;
>>>>
>>>> 		if (change == KVM_MR_CREATE) {
>>>> 			kvm->arch.n_memslots_pages += new->npages;
>>>> 		} else {
>>>> 			WARN_ON(kvm->arch.n_memslots_pages < old->npages);
>>>> 			kvm->arch.n_memslots_pages -= old->npages;
>>>> 		}
>>>>
>>>> 		nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
>>>> 		nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
>>>
>>> The above line will set nr_mmu_pages to zero since KVM_PERMILLE_MMU_PAGES
>>> is 20, so when integer-divided by 1000 will result in a multiplication
>>> coefficient of zero.
>>
>> Ugh, math. And thus do_div() to avoid the whole 64-bit divide issue on 32-bit KVM.
>> Bummer.
>
> I was revisiting this today because (a) simply making n_memslots_pages a u64 doesn't
> cleanly handle the case where the resulting nr_mmu_pages would wrap,

Handling this case without capping total n_memslots_pages would require
either capping memslots_pages on 32-bit KVM to make it fit in 32 bits or
changing kvm_mmu_change_mmu_pages() and all the logic further down to
accept u64's.
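(For reference, the u64 + do_div() variant would look roughly like the
below; names are taken from the quoted snippet, n_memslots_pages is assumed
to have been made a u64, and the whole thing is untested:)

	u64 nr_mmu_pages;

	nr_mmu_pages = (u64)kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
	/* do_div() instead of '/' so the 64-bit division also links on 32-bit hosts */
	do_div(nr_mmu_pages, 1000);
	nr_mmu_pages = max_t(u64, nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
	/*
	 * Still truncates here on 32-bit unless kvm_mmu_change_mmu_pages()
	 * and everything below it is converted to take a u64.
	 */
	kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);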
> (b) any fix
> in that area should really go in a separate patch to fix
> kvm_mmu_calculate_default_mmu_pages() and then carry that behavior forward
>
> But as I dove deeper (and deeper), I came to the conclusion that supporting a
> total number of memslot pages that doesn't fit in an unsigned long is a waste of
> our time. With a 32-bit kernel, userspace can at most address 3gb of virtual
> memory, whereas wrapping the total number of pages would require 4tb+ of guest
> physical memory. Even with x86's second address space for SMM, that means userspace
> would need to alias all of guest memory more than one _thousand_ times. And on
> older hardware with MAXPHYADDR < 43, the guest couldn't actually access any of those
> aliases even if userspace lied about guest.MAXPHYADDR.
>
> So unless I'm missing something, or PPC or MIPS has some crazy way for a 32-bit
> host to support 4TB of guest memory, my vote would be to explicitly disallow
> creating more memslot pages than can fit in an unsigned long. Then x86 KVM could
> reuse the cached nr_memslot_pages and x86's MMU wouldn't have to update a big pile
> of code to support a scenario that practically speaking is useless.
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 72b329e82089..acabdbdef5cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -552,6 +552,7 @@ struct kvm {
>  	 */
>  	struct mutex slots_arch_lock;
>  	struct mm_struct *mm; /* userspace tied to this vm */
> +	unsigned long nr_memslot_pages;
>  	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
>  	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8bf4b89cfa03..c63fc5c05322 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1567,6 +1567,15 @@ static void kvm_commit_memory_region(struct kvm *kvm,
>  				     const struct kvm_memory_slot *new,
>  				     enum kvm_mr_change change)
>  {
> +	/*
> +	 * Update the total number of memslot pages before calling the arch
> +	 * hook so that architectures can consume the result directly.
> +	 */
> +	if (change == KVM_MR_DELETE)
> +		kvm->nr_memslot_pages -= old->npages;
> +	else if (change == KVM_MR_CREATE)
> +		kvm->nr_memslot_pages += new->npages;
> +
>  	kvm_arch_commit_memory_region(kvm, old, new, change);
>
>  	/*
> @@ -1738,6 +1747,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	if (!old || !old->npages)
>  		return -EINVAL;
>
> +	if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
> +		return -EIO;
> +
>  	memset(&new, 0, sizeof(new));
>  	new.id = id;
>  	new.as_id = as_id;
> @@ -1756,6 +1768,13 @@ int __kvm_set_memory_region(struct kvm *kvm,
>
>  	if (!old || !old->npages) {
>  		change = KVM_MR_CREATE;
> +
> +		/*
> +		 * To simplify KVM internals, the total number of pages across
> +		 * all memslots must fit in an unsigned long.
> +		 */
> +		if ((kvm->nr_memslot_pages + new.npages) < kvm->nr_memslot_pages)
> +			return -EINVAL;
>  	} else { /* Modify an existing slot. */
>  		if ((new.userspace_addr != old->userspace_addr) ||
>  		    (new.npages != old->npages) ||
>

Capping total n_memslots_pages makes sense to me to avoid the (existing)
nr_mmu_pages wraparound issue; I will update the next patchset version
accordingly.

Thanks,
Maciej
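PS: With the cap in place the x86 commit hook should be able to drop the
64-bit math entirely and consume kvm->nr_memslot_pages directly; roughly
something like the below (only a sketch, it relies on 1000 being a whole
multiple of KVM_PERMILLE_MMU_PAGES):

	unsigned long nr_mmu_pages;

	/*
	 * Divide first: avoids overflowing an unsigned long on 32-bit hosts
	 * and sidesteps the KVM_PERMILLE_MMU_PAGES / 1000 == 0 problem
	 * discussed above.
	 */
	nr_mmu_pages = kvm->nr_memslot_pages / (1000 / KVM_PERMILLE_MMU_PAGES);
	nr_mmu_pages = max_t(unsigned long, nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
	kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);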