On 3/2/22 03:13, Sean Christopherson wrote:
> After typing up all of the below, I actually tried the novel idea of compiling
> the code... and we can't do xchg() on role.invalid because it occupies a single
> bit, it's not a standalone boolean.

Yeah I thought the same right after sending the email, but I'll just say it
was pseudocode. :)  We can do

static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
{
	union kvm_mmu_page_role role = page->role;
	role.invalid = true;

	/* No need to use cmpxchg, only the invalid bit can change.  */
	role.word = xchg(&page->role.word, role.word);
	return role.invalid;
}

Either way, barriers or xchg, it needs to be a separate function.

> by using refcount_dec_not_one() above, there's no guarantee that this
> task is the last one to see it as kvm_tdp_mmu_get_root() can succeed
> and bump the refcount between refcount_dec_not_one() and here.
Yep, I agree refcount_dec_and_test is needed.

>> Based on my own version, I guess you mean (without comments due to family
>> NMI):
>>
>>          if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
>>                  return;
>>
>> 	if (!xchg(&root->role.invalid, true) {
>> 		refcount_set(&root->tdp_mmu_root_count, 1);
>> 	 	tdp_mmu_zap_root(kvm, root, shared);
>> 	        if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
>>          	        return;
>> 	}
>>
>>          spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>>          list_del_rcu(&root->link);
>>          spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>>          call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> 
> That would work, though I'd prefer to recurse on kvm_tdp_mmu_put_root() instead
> of open coding refcount_dec_and_test() so that we get coverage of the xchg()
> doing the right thing.
> 
> I still slightly prefer having the "free" path be inside the xchg().  To me, even
> though the "free" path is the only one that's guaranteed to be reached for every root,
> the fall-through to resetting the refcount and zapping the root is the "normal" path,
> and the "free" path is the exception.

Hmm I can see how that makes especially sense once you add in the worker logic.
But it seems to me that the "basic" logic should be "do both the xchg and the
free", and coding the function with tail recursion obfuscates this.  Even with
the worker, you grow an

+	if (kvm_get_kvm_safe(kvm)) {
+		... let the worker do it ...
+		return;
+	}
+
	tdp_mmu_zap_root(kvm, root, shared);

but you still have a downwards flow that matches what happens even if multiple
threads pick up different parts of the job.

So, I tried a bunch of alternatives including with gotos and with if/else, but
really the above one remains my favorite.

My plan would be:

1) splice the mini series I'm attaching before this patch, and
remove patch 1 of this series.  next_invalidated_root() is a
bit yucky, but notably it doesn't need to add/remove a reference
in kvm_tdp_mmu_zap_invalidated_roots().

Together, these two steps ensure that readers never acquire a
reference to either refcount=0/valid or invalid pages".  In other
words, the three states of that kvm_tdp_mmu_put_root moves the root
through (refcount=0/valid -> refcount=0/invalid -> refcount=1/invalid)
are exactly the same to readers, and there are essentially no races
to worry about.

In other other words, it's trading slightly uglier code for simpler
invariants.

2) here, replace the change to kvm_tdp_mmu_put_root with the following:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b6ffa91fb9d7..aa0669f54d96 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -81,6 +81,16 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
  			     bool shared);
  
+static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
+{
+	union kvm_mmu_page_role role = page->role;
+	role.invalid = true;
+
+	/* No need to use cmpxchg, only the invalid bit can change.  */
+	role.word = xchg(&page->role.word, role.word);
+	return role.invalid;
+}
+
  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
  			  bool shared)
  {
@@ -91,20 +101,44 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
  
  	WARN_ON(!root->tdp_mmu_page);
  
+	/*
+	 * The root now has refcount=0 and is valid.  Readers cannot acquire
+	 * a reference to it (they all visit valid roots only, except for
+	 * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
+	 * any reference itself.
+	 *
+	 * Even though there are flows that need to visit all roots for
+	 * correctness, they all take mmu_lock for write, so they cannot yet
+	 * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
+	 * since the root still has refcount=0.
+	 *
+	 * However, tdp_mmu_zap_root can yield, and writers do not expect to
+	 * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
+	 * So the root temporarily gets an extra reference, going to refcount=1
+	 * while staying invalid.  Readers still cannot acquire any reference;
+	 * but writers are now allowed to run if tdp_mmu_zap_root yields and
+	 * they might take an extra reference is they themselves yield.  Therefore,
+	 * when the reference is given back after tdp_mmu_zap_root terminates,
+	 * there is no guarantee that the refcount is still 1.  If not, whoever
+	 * puts the last reference will free the page, but they will not have to
+	 * zap the root because a root cannot go from invalid to valid.
+	 */
+	if (!kvm_tdp_root_mark_invalid(root)) {
+		refcount_set(&root->tdp_mmu_root_count, 1);
+		tdp_mmu_zap_root(kvm, root, shared);
+
+		/*
+		 * Give back the reference that was added back above.  We now
+		 * know that the root is invalid, so go ahead and free it if
+		 * no one has taken a reference in the meanwhile.
+		 */
+		if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
+			return;
+	}
+
  	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
  	list_del_rcu(&root->link);
  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-
-	/*
-	 * A TLB flush is not necessary as KVM performs a local TLB flush when
-	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
-	 * to a different pCPU.  Note, the local TLB flush on reuse also
-	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
-	 * intermediate paging structures, that may be zapped, as such entries
-	 * are associated with the ASID on both VMX and SVM.
-	 */
-	tdp_mmu_zap_root(kvm, root, shared);
-
  	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
  }
  

3) for the worker patch, the idea would be

+static void tdp_mmu_zap_root_work(struct work_struct *work)
+{
+	...
+}
+
+
+static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+       root->tdp_mmu_async_data = kvm;
+       INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
+       schedule_work(&root->tdp_mmu_async_work);
+}
+
  static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
  {
         union kvm_mmu_page_role role = page->role;
@@ -125,13 +165,24 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
          */
         if (!kvm_tdp_root_mark_invalid(root)) {
                 refcount_set(&root->tdp_mmu_root_count, 1);
-               tdp_mmu_zap_root(kvm, root, shared);
  
                 /*
-                * Give back the reference that was added back above.  We now
+                * If the struct kvm is alive, we might as well zap the root
+                * in a worker.  The worker takes ownership of the reference we
+                * have just added to root as well as the new reference to kvm.
+                */
+               if (kvm_get_kvm_safe(kvm)) {
+                       tdp_mmu_schedule_zap_root(kvm, root);
+                       return;
+               }
+
+               /*
+                * The struct kvm is being destroyed, zap synchronously and give
+                * back immediately the reference that was added above.  We now
                  * know that the root is invalid, so go ahead and free it if
                  * no one has taken a reference in the meanwhile.
                  */
+               tdp_mmu_zap_root(kvm, root, shared);
                 if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
                         return;
         }


Again, I appreciate the idea behind the recursive call, but I think
overall it's clearer to have a clear flow from the beginning to the
end of the function, with the exceptions and optimizations noted as
early returns.

Let me know what you think.  Tomorrow I have a day off, but later
today I will have my changes tested and pushed to kvm/queue for you
to look at.

Thanks,

Paolo