Re: [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array

From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>,
	Igor Mammedov <imammedo@redhat.com>,
	Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
	Julien Thierry <julien.thierry.kdev@gmail.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Huacai Chen <chenhuacai@kernel.org>,
	Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>,
	Paul Mackerras <paulus@ozlabs.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	David Hildenbrand <david@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array
Date: Fri, 21 May 2021 09:03:23 +0200	[thread overview]
Message-ID: <e3769513-d2d8-e3fb-7887-3c8872b0f00c@maciej.szmigiero.name> (raw)
In-Reply-To: <YKV8hHDS489g9JBS@google.com>

On 19.05.2021 23:00, Sean Christopherson wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> There is no point in recalculating from scratch the total number of pages
>> in all memslots each time a memslot is created or deleted.
>>
>> Just cache the value and update it accordingly on each such operation so
>> the code doesn't need to traverse the whole memslot array each time.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 5bd550eaf683..8c7738b75393 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11112,9 +11112,21 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>>   				const struct kvm_memory_slot *new,
>>   				enum kvm_mr_change change)
>>   {
>> -	if (!kvm->arch.n_requested_mmu_pages)
>> -		kvm_mmu_change_mmu_pages(kvm,
>> -				kvm_mmu_calculate_default_mmu_pages(kvm));
>> +	if (change == KVM_MR_CREATE)
>> +		kvm->arch.n_memslots_pages += new->npages;
>> +	else if (change == KVM_MR_DELETE) {
>> +		WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> 
> Heh, so I think this WARN can be triggered at will by userspace on 32-bit KVM by
> causing the running count to wrap.  KVM artificially caps the size of a single
> memslot at ((1UL << 31) - 1), but userspace could create multiple gigantic slots
> to overflow arch.n_memslots_pages.
> 
> I _think_ changing it to a u64 would fix the problem since KVM forbids overlapping
> memslots in the GPA space.

You are right, n_memslots_pages needs to be u64 so it does not overflow
on 32-bit KVM.

The memslot count is limited to 32k in each of 2 address spaces, so in
the worst case the variable should hold 15-bits + 1 bit + 31-bits = 47 bit number.

> Also, what about moving the check-and-WARN to prepare_memory_region() so that
> KVM can error out if the check fails?  Doesn't really matter, but an explicit
> error for userspace is preferable to underflowing the number of pages and getting
> weird MMU errors/behavior down the line.

In principle this seems like a possibility, however, it is a more
regression-risky option, in case something has (perhaps unintentionally)
relied on the fact that kvm_mmu_zap_oldest_mmu_pages() call from
kvm_mmu_change_mmu_pages() was being done only in the memslot commit
function.

>> +		kvm->arch.n_memslots_pages -= old->npages;
>> +	}
>> +
>> +	if (!kvm->arch.n_requested_mmu_pages) {
> 
> If we're going to bother caching the number of pages then we should also skip
> the update when the number pages isn't changing, e.g.
> 
> 	if (change == KVM_MR_CREATE || change == KVM_MR_DELETE) {
> 		if (change == KVM_MR_CREATE)
> 			kvm->arch.n_memslots_pages += new->npages;
> 		else
> 			kvm->arch.n_memslots_pages -= old->npages;
> 
> 		if (!kvm->arch.n_requested_mmu_pages) {
> 			unsigned long nr_mmu_pages;
> 
> 			nr_mmu_pages = kvm->arch.n_memslots_pages *
> 				       KVM_PERMILLE_MMU_PAGES / 1000;
> 			nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
> 			kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
> 		}
> 	}

The old code did it that way (unconditionally) and, as in the case above,
I didn't want to risk an regression.
If we are going to change this fact then I think it should happen in a
separate patch.

Thanks,
Maciej