From: Paolo Bonzini <pbonzini@redhat.com>
To: Junaid Shahid <junaids@google.com>
Cc: kvm@vger.kernel.org, andreslc@google.com, pfeiner@google.com,
	guangrong.xiao@linux.intel.com
Subject: Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
Date: Fri, 25 Nov 2016 10:45:28 +0100
Message-ID: <f393f6b8-c61a-a408-55ce-58a3cf7087a5@redhat.com>
In-Reply-To: <41116680.VFaKihePAT@js-desktop.mtv.corp.google.com>



On 24/11/2016 04:50, Junaid Shahid wrote:
> On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
>> Please introduce a new function spte_is_access_tracking_enabled(u64 
>> old_spte, u64 new_spte) and use it here:
>>
>> 	if (shadow_accessed_mask ?
>> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
>> 	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
>> 		flush |= !shadow_accessed_mask;
>> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
>> 	}
>>
> 
> I think we can just set flush = true in the then block instead of 
> flush |= !shadow_accessed_mask.
> 
> And while we are at it, is there any reason to flush the TLB when setting the
> A or D bit in the PTE? If not, we can remove this earlier block since the
> clearing case is now handled in the separate if blocks for accessed and dirty:

Hmm, flushing the TLB is expensive.  I would have thought that we want to avoid
a shootdown (kvm_flush_remote_tlbs) for the A and D bits.  But it's probably rare
enough that it doesn't matter, and the existing code has

        /*
         * Flush TLB when accessed/dirty bits are changed in the page tables,
         * to guarantee consistency between TLB and page tables.
         */
        if (spte_is_bit_changed(old_spte, new_spte,
                                shadow_accessed_mask | shadow_dirty_mask))
                ret = true;

so yeah.

> 	if (spte_is_bit_changed(old_spte, new_spte,
>                            shadow_accessed_mask | shadow_dirty_mask))
> 		flush = true;
> 
> Also, instead of spte_is_access_tracking_enabled(), I’ve added is_accessed_spte
> as you suggested later and used that here as well.

Yes, that makes more sense.  Thanks!
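For reference, a sketch of what such helpers could look like (my sketch, not
the patch code; it assumes the is_access_track_spte() predicate from the
series):

	static bool is_accessed_spte(u64 spte)
	{
		/* Without a hardware A bit, "not access-tracked" means accessed. */
		return shadow_accessed_mask ? (spte & shadow_accessed_mask)
					    : !is_access_track_spte(spte);
	}

	static bool is_dirty_spte(u64 spte)
	{
		/* Without a hardware D bit, writability stands in for dirty. */
		return shadow_dirty_mask ? (spte & shadow_dirty_mask)
					 : (spte & PT_WRITABLE_MASK);
	}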

>>> -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
>>> +
>>> +	if (shadow_dirty_mask ?
>>> +	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
>>> +	    writable_cleared)
>>
>> writable_cleared can be inline here and written as
>>
>> 	spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)
>>
>> so
>>
>> 	if (shadow_dirty_mask ?
>> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
>> 	    spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)) {
>> 		flush |= !shadow_dirty_mask;
>> 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>> 	}
>>>  		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>>>  
>>> -	return ret;
>>> +	return flush;
>>>  }
>>
>> But please anticipate the changes to this function, except for 
>> introducing is_access_track_spte of course, to a separate function.
> 
> Sorry, I didn’t exactly understand what you meant by the last line. But I have
> made it like this:
> 
> 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> 		flush = true;
> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> 	}
> 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> 		flush = true;
> 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> 	}

The idea is to split this patch in two.  You can first refactor the function to
have the above code and introduce is_accessed_spte/is_dirty_spte.  Then, you
add the lockless access tracking on top.  But it's okay if we leave it for a
subsequent review.  There are going to be many changes already between v2 and v3!

> I’ve replaced the checks with if (is_accessed_spte...) / if (is_dirty_spte...)

Good!

>>> +	/*
>>> +	 * Any PTE marked for access tracking should also be marked for dirty
>>> +	 * tracking (by being non-writable)
>>> +	 */
>>> +	spte &= ~PT_WRITABLE_MASK;
>>
>> This should be implicit in the definition of 
>> shadow_acc_track_mask/value, so it's not necessary (or it can be a 
>> WARN_ONCE).
> 
> It can't be handled by shadow_acc_track_mask/value, but I have changed
> shadow_acc_track_saved_bits_mask to save only the R/X bits, which achieves the
> same result.
> 
>>
>>> +	spte &= ~(shadow_acc_track_saved_bits_mask <<
>>> +		  shadow_acc_track_saved_bits_shift);
>>
>> Should there be a WARN_ON if these bits are not zero?
> 
> No, these bits can be non-zero from a previous instance of 
> mark_spte_for_access_track. They are not cleared when the PTE is restored to
> normal state. (Though we could do that and then have a WARN_ONCE here.)

Ok, if it's not too hard, do add it.  I think it's worth having
more self-checks.
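Assuming the restore path clears the saved-bit region, the check in
mark_spte_for_access_track() could look roughly like this (a sketch using the
mask/shift names from the patch, not the exact code):

	/*
	 * Verify the saved-bit slots are free before stashing the R/X bits
	 * there, then put the SPTE into the access-tracking state.
	 */
	WARN_ONCE(spte & (shadow_acc_track_saved_bits_mask <<
			  shadow_acc_track_saved_bits_shift),
		  "kvm: mmu: access-track saved-bit slots are not zero\n");

	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
		shadow_acc_track_saved_bits_shift;
	spte &= ~shadow_acc_track_mask;
	spte |= shadow_acc_track_value;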

>>> @@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>>  	 *
>>>  	 * Compare with set_spte where instead shadow_dirty_mask is set.
>>>  	 */
>>> -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
>>> +	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
>>>  		return false;
>>>  
>>> -	kvm_vcpu_mark_page_dirty(vcpu, gfn);
>>> +	if (remove_write_prot) {
>>
>> Stupid question ahead, why not call kvm_vcpu_mark_page_accessed in this 
>> function?
> 
> We could, but I kept that call in the same paths as before for consistency with
> the other cases. Plus, even though it is a cheap call, why add it to the fast 
> PF path if it is not necessary?

Makes sense, I said it was a stupid question. :)

>>> @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>>>  
>>>  		/*
>>> -		 * Check if it is a spurious fault caused by TLB lazily flushed.
>>> +		 * Check whether the memory access that caused the fault would
>>> +		 * still cause it if it were to be performed right now. If not,
>>> +		 * then this is a spurious fault caused by TLB lazily flushed,
>>> +		 * or some other CPU has already fixed the PTE after the
>>> +		 * current CPU took the fault.
>>> +		 *
>>> +		 * If Write-Only mappings ever become supported, then the
>>> +		 * condition below would need to be changed appropriately.
>>>  		 *
>>>  		 * Need not check the access of upper level table entries since
>>>  		 * they are always ACC_ALL.
>>>  		 */
>>> -		if (is_writable_pte(spte)) {
>>> -			ret = true;
>>> +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
>>> +		    valid_exec_access) {
>>> +			fault_handled = true;
>>>  			break;
>>>  		}
>>
>> Let's separate the three conditions (R/W/X):
>>
>> 		if (error_code & PFERR_FETCH_MASK) {
>> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
>> 			    == shadow_x_mask) {
>> 				fault_handled = true;
>> 				break;
>> 			}
>> 		}
>> 		if (error_code & PFERR_WRITE_MASK) {
>> 			if (is_writable_pte(spte)) {
>> 				fault_handled = true;
>> 				break;
>> 			}
>> 			remove_write_prot =
>> 				spte_can_locklessly_be_made_writable(spte);
>> 		}
>> 		if (!(error_code & PFERR_PRESENT_MASK)) {
>> 			if (!is_access_track_spte(spte)) {
>> 				fault_handled = true;
>> 				break;
>> 			}
>> 			remove_acc_track = true;
>> 		}
> 
> I think the third block is incorrect, e.g. it will set fault_handled = true
> even for a completely zero PTE.

A completely zero PTE would have been filtered before by the
is_shadow_present_pte check, wouldn't it?
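In other words, by the time the R/W/X blocks run, the retry loop has already
dealt with non-present SPTEs; roughly (a control-flow sketch only, not the
patch code):

	for (;;) {
		/* ... lockless walk down to the leaf SPTE ... */

		/*
		 * A completely zero PTE is dealt with here and never reaches
		 * the R/W/X blocks below.
		 */
		if (!is_shadow_present_pte(spte))
			break;

		/* R/W/X blocks from the snippet above, then the cmpxchg
		 * fix-up and the retry-count check. */
	}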

>>>  
>>> +/* This is only supposed to be used for non-EPT mappings */
>>
>> It's only used for non-EPT mappings, why is it only *supposed* to be 
>> used for non-EPT mappings?  It seems to me that it would work.
>>
>>>  static bool need_remote_flush(u64 old, u64 new)
>>>  {
> 
> It would work, but it will return true in at least one (probably the only)
> case where it doesn’t need to, e.g. going from an acc-track PTE to a zeroed
> PTE. Though maybe that is not a big deal. Of course, we could also just
> update it to handle acc-track.

Yeah, I think it's not a big deal.  But actually is it really used only
for non-EPT?  Nested virtualization uses kvm_mmu_pte_write for EPT as well.
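For context, need_remote_flush() currently reads roughly as follows; the
acc-track -> zeroed case you mention takes the second early return (a sketch
of the current shape, assuming acc-track SPTEs count as shadow-present):

	static bool need_remote_flush(u64 old, u64 new)
	{
		if (!is_shadow_present_pte(old))
			return false;
		if (!is_shadow_present_pte(new))
			return true;	/* acc-track -> zeroed PTE lands here */
		if ((old ^ new) & PT64_BASE_ADDR_MASK)
			return true;
		old ^= shadow_nx_mask;
		new ^= shadow_nx_mask;
		return (old & ~new & PT64_PERM_MASK) != 0;
	}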

Paolo


> 
> 
> Thanks,
> Junaid
> 
