From: "Nikunj A. Dadhania" <nikunj@amd.com>
To: Mingwei Zhang <mizhang@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Brijesh Singh <brijesh.singh@amd.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Peter Gonda <pgonda@google.com>, Bharata B Rao <bharata@amd.com>,
	"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
	David Hildenbrand <david@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning
Date: Mon, 21 Mar 2022 14:49:27 +0530
Message-ID: <22268ddb-5643-f35e-6c34-eb5c2b0ad4cb@amd.com>
In-Reply-To: <YjgXIyrcDA5+u8d+@google.com>

On 3/21/2022 11:41 AM, Mingwei Zhang wrote:
> On Wed, Mar 09, 2022, Nikunj A. Dadhania wrote:
>> On 3/9/2022 3:23 AM, Mingwei Zhang wrote:
>>> On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
>>>> Use the memslot metadata to store the pinned data along with the pfns.
>>>> This improves the SEV guest startup time from O(n) to a constant by
>>>> deferring guest page pinning until the pages are used to satisfy
>>>> nested page faults. The page reference will be dropped in the memslot
>>>> free path or deallocation path.
>>>>
>>>> Reuse enc_region structure definition as pinned_region to maintain
>>>> pages that are pinned outside of MMU demand pinning. Remove the rest of
>>>> the code that did upfront pinning, as it is no longer needed in
>>>> view of the demand pinning support.
>>>
>>> I don't quite understand why we still need the enc_region. I have
>>> several concerns. Details below.
>>
>> With patch 9 the enc_region is used only for memory that was pinned before
>> the vcpu is online (i.e. the MMU is not yet usable).
>>
>>>>
>>>> Retain svm_register_enc_region() and svm_unregister_enc_region() with
>>>> required checks for resource limit.
>>>>
>>>> Guest boot time comparison
>>>>   +---------------+----------------+-------------------+
>>>>   | Guest Memory  |   baseline     |  Demand Pinning   |
>>>>   | Size (GB)     |    (secs)      |     (secs)        |
>>>>   +---------------+----------------+-------------------+
>>>>   |      4        |     6.16       |      5.71         |
>>>>   +---------------+----------------+-------------------+
>>>>   |     16        |     7.38       |      5.91         |
>>>>   +---------------+----------------+-------------------+
>>>>   |     64        |    12.17       |      6.16         |
>>>>   +---------------+----------------+-------------------+
>>>>   |    128        |    18.20       |      6.50         |
>>>>   +---------------+----------------+-------------------+
>>>>   |    192        |    24.56       |      6.80         |
>>>>   +---------------+----------------+-------------------+
>>>>
>>>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>>>> ---
>>>>  arch/x86/kvm/svm/sev.c | 304 ++++++++++++++++++++++++++---------------
>>>>  arch/x86/kvm/svm/svm.c |   1 +
>>>>  arch/x86/kvm/svm/svm.h |   6 +-
>>>>  3 files changed, 200 insertions(+), 111 deletions(-)
>>>>

<SNIP>

>>>>  static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>>>  				    unsigned long ulen, unsigned long *n,
>>>>  				    int write)
>>>>  {
>>>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>> +	struct pinned_region *region;
>>>>  	unsigned long npages, size;
>>>>  	int npinned;
>>>> -	unsigned long locked, lock_limit;
>>>>  	struct page **pages;
>>>> -	unsigned long first, last;
>>>>  	int ret;
>>>>  
>>>>  	lockdep_assert_held(&kvm->lock);
>>>> @@ -395,15 +413,12 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>>>  	if (ulen == 0 || uaddr + ulen < uaddr)
>>>>  		return ERR_PTR(-EINVAL);
>>>>  
>>>> -	/* Calculate number of pages. */
>>>> -	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
>>>> -	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
>>>> -	npages = (last - first + 1);
>>>> +	npages = get_npages(uaddr, ulen);
>>>>  
>>>> -	locked = sev->pages_locked + npages;
>>>> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>>> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>>>> -		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
>>>> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
>>>> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
>>>> +			sev->pages_to_lock + npages,
>>>> +			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
>>>>  		return ERR_PTR(-ENOMEM);
>>>>  	}
>>>>  
>>>> @@ -429,7 +444,19 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>>>  	}
>>>>  
>>>>  	*n = npages;
>>>> -	sev->pages_locked = locked;
>>>> +	sev->pages_to_lock += npages;
>>>> +
>>>> +	/* Maintain region list that is pinned to be unpinned in vm destroy path */
>>>> +	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>>>> +	if (!region) {
>>>> +		ret = -ENOMEM;
>>>> +		goto err;
>>>> +	}
>>>> +	region->uaddr = uaddr;
>>>> +	region->size = ulen;
>>>> +	region->pages = pages;
>>>> +	region->npages = npages;
>>>> +	list_add_tail(&region->list, &sev->pinned_regions_list);
>>>
>>> Hmm. I see a duplication of the metadata. We already store the pfns in
>>> memslot. But now we also do it in regions. Is this one used for
>>> migration purpose?
>>
>> We are not duplicating: the enc_region list holds regions that are pinned by
>> paths other than svm_register_enc_region(). Later patches add infrastructure
>> to directly fault in those pages, which will use memslot->pfns.
>>
>>>
>>> I might miss some of the context here. 
>>
>> More context here:
>> https://lore.kernel.org/kvm/CAMkAt6p1-82LTRNB3pkPRwYh=wGpreUN=jcUeBj_dZt8ss9w0Q@mail.gmail.com/
> 
> hmm. I think I might have got the point. However, logically, I still think we
> might not need double data structures for pinning. When the vcpu is not
> online, we could use the array in the memslot to contain the pinned
> pages, right?

Yes.

> Since user-level code is not allowed to pin arbitrary regions of HVA, we
> could check that and bail out early if the region goes out of a memslot.
> 
> From that point, the only requirement is that we need a valid memslot
> before doing memory encryption and pinning. So enc_region is still not
> needed from this point.
> 
> This should save some time by avoiding double pinning and make the pinning
> information clearer.

Agreed, I think that should be possible:

* Check for addr/end being part of a memslot.
* Error out in case it is not part of any memslot.
* Add a __sev_pin_pfn() helper that does not depend on the vcpu arg
  (rough sketch further below).
* Iterate over the pages and use the __sev_pin_pfn() routine to pin:
	slots = kvm_memslots(kvm);
	kvm_for_each_memslot_in_hva_range(node, slots, addr, end) {
		slot = container_of(node, struct kvm_memory_slot,
				    hva_node[slots->node_idx]);
		slot_start = slot->userspace_addr;
		slot_end = slot_start + (slot->npages << PAGE_SHIFT);
		hva_start = max(addr, slot_start);
		hva_end = min(end, slot_end);
		for (uaddr = hva_start; uaddr < hva_end; uaddr += PAGE_SIZE) {
			__sev_pin_pfn(slot, uaddr, PG_LEVEL_4K);
		}
	}

This will make sure the memslot-based data structure is used and enc_region can be removed.
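
For illustration only, a rough sketch of what such a __sev_pin_pfn() helper
could look like. The pin_user_pages_fast() call, the slot->arch.pfns field
name and the omission of the RLIMIT_MEMLOCK accounting are assumptions made
for this sketch, not necessarily what the series will end up with:

	/*
	 * Sketch: pin the single 4K page backing @uaddr and record its pfn in
	 * the per-memslot pinning metadata. Only PG_LEVEL_4K is handled here;
	 * unpin/error handling and locked-pages accounting are omitted.
	 */
	static int __sev_pin_pfn(struct kvm_memory_slot *slot,
				 unsigned long uaddr, int level)
	{
		struct page *page;
		unsigned long idx;

		if (WARN_ON_ONCE(level != PG_LEVEL_4K))
			return -EINVAL;

		/* Pin with write access so the guest can modify the page. */
		if (pin_user_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, &page) != 1)
			return -ENOMEM;

		/* Index of the page within this memslot. */
		idx = (uaddr - slot->userspace_addr) >> PAGE_SHIFT;

		/* Placeholder for the metadata added in the arch memslot patch. */
		slot->arch.pfns[idx] = page_to_pfn(page);

		return 0;
	}

The real helper would plug into whatever pinning metadata patch 4 adds to the
arch memslot and keep the existing sev->pages_to_lock accounting.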

Regards
Nikunj