Re: [RFC PATCH] kvm: nv: Optimize the unmapping of shadow S2-MMU tables.

From: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
To: Marc Zyngier <maz@kernel.org>
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, oliver.upton@linux.dev,
	darren@os.amperecomputing.com,
	d.scott.phillips@amperecomputing.com
Subject: Re: [RFC PATCH] kvm: nv: Optimize the unmapping of shadow S2-MMU tables.
Date: Tue, 5 Mar 2024 18:59:08 +0530	[thread overview]
Message-ID: <6685c3a6-2017-4bc2-ad26-d11949097050@os.amperecomputing.com> (raw)
In-Reply-To: <86sf150w4t.wl-maz@kernel.org>

On 05-03-2024 04:43 pm, Marc Zyngier wrote:
> [re-sending with kvmarm@ fixed]
> 
> On Tue, 05 Mar 2024 05:46:06 +0000,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>
>> As per 'commit 178a6915434c ("KVM: arm64: nv: Unmap/flush shadow stage 2
> 
> $ git describe --contains 178a6915434c --match=v\*
> fatal: cannot describe '178a6915434c141edefd116b8da3d55555ea3e63'
> 

My bad(I would have been more verbose), I missed to mention that this 
patch is on top of NV-V11 patch series.

> This commit simply doesn't exist upstream. It only lives in a
> now deprecated branch that will never be merged.
> 
>> page tables")', when ever there is unmap of pages that
>> are mapped to L1, they are invalidated from both L1 S2-MMU and from
>> all the active shadow/L2 S2-MMU tables. Since there is no mapping
>> to invalidate the IPAs of Shadow S2 to a page, there is a complete
>> S2-MMU page table walk and invalidation is done covering complete
>> address space allocated to a L2. This has performance impacts and
>> even soft lockup for NV(L1 and L2) boots with higher number of
>> CPUs and large Memory.
>>
>> Adding a lookup table of mapping of Shadow IPA to Canonical IPA
>> whenever a page is mapped to any of the L2. While any page is
>> unmaped, this lookup is helpful to unmap only if it is mapped in
>> any of the shadow S2-MMU tables. Hence avoids unnecessary long
>> iterations of S2-MMU table walk-through and invalidation for the
>> complete address space.
> 
> All of this falls in the "premature optimisation" bucket. Why should
> we bother with any of this when not even 'AT S1' works correctly,

Hmm, I am not aware of this, is this something new issue of V11?

> making it trivial to prevent a guest from making forward progress? You
> also show no numbers that would hint at a measurable improvement under
> any particular workload.

This patch is avoiding long iterations of unmap which was resulting in 
soft-lockup, when tried L1 and L2 with 192 cores.
Fixing soft lockup isn't a required fix for feature enablement?

> 
> I am genuinely puzzled that you are wasting valuable engineering time
> on *this*.
> 
>>
>> Signed-off-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h |   5 ++
>>   arch/arm64/include/asm/kvm_host.h    |  14 ++++
>>   arch/arm64/include/asm/kvm_nested.h  |   4 +
>>   arch/arm64/kvm/mmu.c                 |  19 ++++-
>>   arch/arm64/kvm/nested.c              | 113 +++++++++++++++++++++++++++
>>   5 files changed, 152 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 5173f8cf2904..f503b2eaedc4 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -656,4 +656,9 @@ static inline bool kvm_is_shadow_s2_fault(struct kvm_vcpu *vcpu)
>>   		vcpu->arch.hw_mmu->nested_stage2_enabled);
>>   }
>>   
>> +static inline bool kvm_is_l1_using_shadow_s2(struct kvm_vcpu *vcpu)
>> +{
>> +	return (vcpu->arch.hw_mmu != &vcpu->kvm->arch.mmu);
>> +}
> 
> Isn't that the very definition of "!in_hyp_ctxt()"? You are abusing

"!in_hyp_ctxt()" isn't true for non-NV case also?
This function added to know that L1 is NV enabled and using shadow S2.

> the hw_mmu pointer to derive something, but the source of truth is the
> translation regime, as defined by HCR_EL2.{E2H,TGE} and PSTATE.M.
> 

OK, I can try HCR_EL2.{E2H,TGE} and PSTATE.M instead of hw_mmu in next 
version.

>> +
>>   #endif /* __ARM64_KVM_EMULATE_H__ */
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 8da3c9a81ae3..f61c674c300a 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -144,6 +144,13 @@ struct kvm_vmid {
>>   	atomic64_t id;
>>   };
>>   
>> +struct mapipa_node {
>> +	struct rb_node node;
>> +	phys_addr_t ipa;
>> +	phys_addr_t shadow_ipa;
>> +	long size;
>> +};
>> +
>>   struct kvm_s2_mmu {
>>   	struct kvm_vmid vmid;
>>   
>> @@ -216,6 +223,13 @@ struct kvm_s2_mmu {
>>   	 * >0: Somebody is actively using this.
>>   	 */
>>   	atomic_t refcnt;
>> +
>> +	/*
>> +	 * For a Canonical IPA to Shadow IPA mapping.
>> +	 */
>> +	struct rb_root nested_mapipa_root;
> 
> Why isn't this a maple tree? If there is no overlap between mappings
> (and it really shouldn't be any), why should we use a bare-bone rb-tree?
> 
>> +	rwlock_t mmu_lock;
> 
> Hell no. We have plenty of locking already, and there is no reason why
> this should gain its own locking. I can't see a case where you would
> take this lock outside of holding the *real* mmu_lock -- extra bonus
> point for the ill-chosen name.

OK, this should be avoided with maple tree.
> 
>> +
>>   };
>>   
>>   static inline bool kvm_s2_mmu_valid(struct kvm_s2_mmu *mmu)
>> diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
>> index da7ebd2f6e24..c31a59a1fdc6 100644
>> --- a/arch/arm64/include/asm/kvm_nested.h
>> +++ b/arch/arm64/include/asm/kvm_nested.h
>> @@ -65,6 +65,9 @@ extern void kvm_init_nested(struct kvm *kvm);
>>   extern int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu);
>>   extern void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu);
>>   extern struct kvm_s2_mmu *lookup_s2_mmu(struct kvm_vcpu *vcpu);
>> +extern void add_shadow_ipa_map_node(
>> +		struct kvm_s2_mmu *mmu,
>> +		phys_addr_t ipa, phys_addr_t shadow_ipa, long size);
>>   
>>   union tlbi_info;
>>   
>> @@ -123,6 +126,7 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
>>   extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
>>   extern void kvm_nested_s2_wp(struct kvm *kvm);
>>   extern void kvm_nested_s2_unmap(struct kvm *kvm);
>> +extern void kvm_nested_s2_unmap_range(struct kvm *kvm, struct kvm_gfn_range *range);
>>   extern void kvm_nested_s2_flush(struct kvm *kvm);
>>   int handle_wfx_nested(struct kvm_vcpu *vcpu, bool is_wfe);
>>   
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 61bdd8798f83..3948681426a0 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1695,6 +1695,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   					     memcache,
>>   					     KVM_PGTABLE_WALK_HANDLE_FAULT |
>>   					     KVM_PGTABLE_WALK_SHARED);
>> +		if ((nested || kvm_is_l1_using_shadow_s2(vcpu)) && !ret) {
> 
> I don't understand this condition. If nested is non-NULL, it's because
> we're using a shadow S2. So why the additional condition?

No, nested is set only for L2, for L1 it is not.
To handle L1 shadow S2 case, I have added this condition.

> 
>> +			struct kvm_s2_mmu *shadow_s2_mmu;
>> +
>> +			ipa &= ~(vma_pagesize - 1);
>> +			shadow_s2_mmu = lookup_s2_mmu(vcpu);
>> +			add_shadow_ipa_map_node(shadow_s2_mmu, ipa, fault_ipa, vma_pagesize);
>> +		}
>>   	}
>>   
>>   	/* Mark the page dirty only if the fault is handled successfully */
>> @@ -1918,7 +1925,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>>   			     (range->end - range->start) << PAGE_SHIFT,
>>   			     range->may_block);
>>   
>> -	kvm_nested_s2_unmap(kvm);
>> +	kvm_nested_s2_unmap_range(kvm, range);
>>   	return false;
>>   }
>>   
>> @@ -1953,7 +1960,7 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>   			       PAGE_SIZE, __pfn_to_phys(pfn),
>>   			       KVM_PGTABLE_PROT_R, NULL, 0);
>>   
>> -	kvm_nested_s2_unmap(kvm);
>> +	kvm_nested_s2_unmap_range(kvm, range);
>>   	return false;
>>   }
>>   
>> @@ -2223,12 +2230,18 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
>>   void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>>   				   struct kvm_memory_slot *slot)
>>   {
>> +	struct kvm_gfn_range range;
>> +
>>   	gpa_t gpa = slot->base_gfn << PAGE_SHIFT;
>>   	phys_addr_t size = slot->npages << PAGE_SHIFT;
>>   
>> +	range.start = gpa;
>> +	range.end = gpa + size;
>> +	range.may_block = true;
>> +
>>   	write_lock(&kvm->mmu_lock);
>>   	kvm_unmap_stage2_range(&kvm->arch.mmu, gpa, size);
>> -	kvm_nested_s2_unmap(kvm);
>> +	kvm_nested_s2_unmap_range(kvm, &range);
>>   	write_unlock(&kvm->mmu_lock);
>>   }
>>   
>> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
>> index f88d9213c6b3..888ec9fba4a0 100644
>> --- a/arch/arm64/kvm/nested.c
>> +++ b/arch/arm64/kvm/nested.c
>> @@ -565,6 +565,88 @@ void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
>>   	write_unlock(&kvm->mmu_lock);
>>   }
>>   
>> +/*
>> + * Create a node and add to lookup table, when a page is mapped to
>> + * Canonical IPA and also mapped to Shadow IPA.
>> + */
>> +void add_shadow_ipa_map_node(struct kvm_s2_mmu *mmu,
>> +			phys_addr_t ipa,
>> +			phys_addr_t shadow_ipa, long size)
>> +{
>> +	struct rb_root *ipa_root = &(mmu->nested_mapipa_root);
>> +	struct rb_node **node = &(ipa_root->rb_node), *parent = NULL;
>> +	struct mapipa_node *new;
>> +
>> +	new = kzalloc(sizeof(struct mapipa_node), GFP_KERNEL);
>> +	if (!new)
>> +		return;
>> +
>> +	new->shadow_ipa = shadow_ipa;
>> +	new->ipa = ipa;
>> +	new->size = size;
>> +
>> +	write_lock(&mmu->mmu_lock);
>> +
>> +	while (*node) {
>> +		struct mapipa_node *tmp;
>> +
>> +		tmp = container_of(*node, struct mapipa_node, node);
>> +		parent = *node;
>> +		if (new->ipa < tmp->ipa) {
>> +			node = &(*node)->rb_left;
>> +		} else if (new->ipa > tmp->ipa) {
>> +			node = &(*node)->rb_right;
>> +		} else {
>> +			write_unlock(&mmu->mmu_lock);
>> +			kfree(new);
>> +			return;
>> +		}
>> +	}
>> +
>> +	rb_link_node(&new->node, parent, node);
>> +	rb_insert_color(&new->node, ipa_root);
>> +	write_unlock(&mmu->mmu_lock);
> 
> All this should be removed in favour of simply using a maple tree.
> 

Thanks for the suggestion to use maple tree. I will use it in next 
version, which help to avoid the locks.

>> +}
>> +
>> +/*
>> + * Iterate over the lookup table of Canonical IPA to Shadow IPA.
>> + * Return Shadow IPA, if the page mapped to Canonical IPA is
>> + * also mapped to a Shadow IPA.
>> + *
>> + */
>> +bool get_shadow_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa, phys_addr_t *shadow_ipa, long *size)
> 
> static?

It should be, thanks.
> 
>> +{
>> +	struct rb_node *node;
>> +	struct mapipa_node *tmp = NULL;
>> +
>> +	read_lock(&mmu->mmu_lock);
>> +	node = mmu->nested_mapipa_root.rb_node;
>> +
>> +	while (node) {
>> +		tmp = container_of(node, struct mapipa_node, node);
>> +
>> +		if (tmp->ipa == ipa)
> 
> What guarantees that the mapping you have for L1 has the same starting
> address as the one you have for L2? L1 could have a 2MB mapping and L2
> only 4kB *in the middle*.

IIUC, when a page is mapped to 2MB in L1, it won't be
mapped to L2 and we iterate with the step of PAGE_SIZE and we should be 
hitting the L2's IPA in lookup table, provided the L2 page falls in 
unmap range.

> 
>> +			break;
>> +		else if (ipa > tmp->ipa)
>> +			node = node->rb_right;
>> +		else
>> +			node = node->rb_left;
>> +	}
>> +
>> +	read_unlock(&mmu->mmu_lock);
> 
> Why would you drop the lock here....
> 
>> +
>> +	if (tmp && tmp->ipa == ipa) {
>> +		*shadow_ipa = tmp->shadow_ipa;
>> +		*size = tmp->size;
>> +		write_lock(&mmu->mmu_lock);
> 
> ... if taking it again here? What could have changed in between?
> 
>> +		rb_erase(&tmp->node, &mmu->nested_mapipa_root);
>> +		write_unlock(&mmu->mmu_lock);
>> +		kfree(tmp);
>> +		return true;
>> +	}
>> +	return false;
>> +}
> 
> So simply hitting in the reverse mapping structure *frees* it? Meaning
> that you cannot use it as a way to update a mapping?

Freeing it since this page already unmapped/migrated on host and will be 
done on shadow S2 after this lookup. I should have considered other 
cases as well, as Oliver mentioned.

> 
>> +
>>   /* Must be called with kvm->mmu_lock held */
>>   struct kvm_s2_mmu *lookup_s2_mmu(struct kvm_vcpu *vcpu)
>>   {
>> @@ -674,6 +756,7 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
>>   	mmu->tlb_vttbr = 1;
>>   	mmu->nested_stage2_enabled = false;
>>   	atomic_set(&mmu->refcnt, 0);
>> +	mmu->nested_mapipa_root = RB_ROOT;
>>   }
>>   
>>   void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>> @@ -760,6 +843,36 @@ void kvm_nested_s2_unmap(struct kvm *kvm)
>>   	}
>>   }
>>   
>> +void kvm_nested_s2_unmap_range(struct kvm *kvm, struct kvm_gfn_range *range)
>> +{
>> +	int i;
>> +	long size;
>> +	bool ret;
>> +
>> +	for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
>> +		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
>> +
>> +		if (kvm_s2_mmu_valid(mmu)) {
>> +			phys_addr_t shadow_ipa, start, end;
>> +
>> +			start = range->start << PAGE_SHIFT;
>> +			end = range->end << PAGE_SHIFT;
>> +
>> +			while (start < end) {
>> +				size = PAGE_SIZE;
>> +				/*
>> +				 * get the Shadow IPA if the page is mapped
>> +				 * to L1 and also mapped to any of active L2.
>> +				 */
> 
> Why is L1 relevant here?

We do map while L1 boots(early stage) in shadow S2, at that moment
if the L1 mapped page is unmapped/migrated we do need to unmap from L1's 
S2 table also.

> 
>> +				ret = get_shadow_ipa(mmu, start, &shadow_ipa, &size);
>> +				if (ret)
>> +					kvm_unmap_stage2_range(mmu, shadow_ipa, size);
>> +				start += size;
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>   /* expects kvm->mmu_lock to be held */
>>   void kvm_nested_s2_flush(struct kvm *kvm)
>>   {
> 
> There are a bunch of worrying issues with this patch. But more
> importantly, this looks like a waste of effort until the core issues
> that NV still has are solved, and I will not consider anything of the
> sort until then.

OK thanks for letting us know, I will pause the work on V2 of this patch 
until then.

> 
> I get the ugly feeling that you are trying to make it look as if it
> was "production ready", which it won't be for another few years,
> specially if the few interested people (such as you) are ignoring the
> core issues in favour of marketing driven features ("make it fast").
> 

What are the core issues (please forgive me if you mentioned already)? 
certainly we will prioritise them than this.

> Thanks,
> 
> 	M.
> 

Thanks,
Ganapat