From: Junaid Shahid <junaids@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org, andreslc@google.com, pfeiner@google.com,
	guangrong.xiao@linux.intel.com
Subject: Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
Date: Mon, 28 Nov 2016 18:43:07 -0800
Message-ID: <1857180.zXFPSRIdpL@js-desktop.mtv.corp.google.com>
In-Reply-To: <f393f6b8-c61a-a408-55ce-58a3cf7087a5@redhat.com>

On Friday, November 25, 2016 10:45:28 AM Paolo Bonzini wrote:
> 
> On 24/11/2016 04:50, Junaid Shahid wrote:
> > On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
> >> Please introduce a new function spte_is_access_tracking_enabled(u64 
> >> old_spte, u64 new_spte) and use it here:
> >>
> >> 	if (shadow_accessed_mask ?
> >> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> >> 	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
> >> 		flush |= !shadow_accessed_mask;
> >> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> >> 	}
> >>
> > 
> > I think we can just set flush = true in the then block instead of 
> > flush |= !shadow_accessed_mask.
> > 
> > And while we are at it, is there any reason to flush the TLB when setting the
> > A or D bit in the PTE? If not, we can remove this earlier block since the
> > clearing case is now handled in the separate if blocks for accessed and dirty:
> 
> Hmm, flushing the TLB is expensive.  I would have thought that we want to avoid
> a shootdown (kvm_flush_remote_tlbs) for the A and D bits.  But it's probably rare
> enough that it doesn't matter, and the existing code has
> 
>         /*
>          * Flush TLB when accessed/dirty bits are changed in the page tables,
>          * to guarantee consistency between TLB and page tables.
>          */
>         if (spte_is_bit_changed(old_spte, new_spte,
>                                 shadow_accessed_mask | shadow_dirty_mask))
>                 ret = true;
> 
> so yeah.

Ok. So I’ll remove the existing spte_is_bit_changed block and set flush = true
inside the separate blocks that check accessed and dirty masks.
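Roughly, the tail of mmu_spte_update() would then look like this (just a
sketch of the intent, combining the pieces above; not the final patch):

        /*
         * The spte_is_bit_changed() block goes away; whether to flush is
         * decided directly in the accessed/dirty checks below.
         */
        if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
                flush = true;
                kvm_set_pfn_accessed(spte_to_pfn(old_spte));
        }

        if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
                flush = true;
                kvm_set_pfn_dirty(spte_to_pfn(old_spte));
        }

        return flush;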

> >> But please anticipate the changes to this function, except for 
> >> introducing is_access_track_spte of course, to a separate function.
> > 
> > Sorry, I didn’t exactly understand what you meant by the last line. But I have
> > made it like this:
> > 
> > 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > 		flush = true;
> > 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > 	}
> > 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> > 		flush = true;
> > 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > 	}
> 
> The idea is to split this patch in two.  You can first refactor the function to
> have the above code and introduce is_accessed_spte/is_dirty_spte.  Then, you
> add the lockless access tracking on top.  But it's okay if we leave it for a
> subsequent review.  There are going to be many changes already between v2 and v3!

Thanks for the clarification. I guess I can just add the other patch now to
separate out the refactoring from the rest of the access tracking changes.

> >>> +	spte &= ~(shadow_acc_track_saved_bits_mask <<
> >>> +		  shadow_acc_track_saved_bits_shift);
> >>
> >> Should there be a WARN_ON if these bits are not zero?
> > 
> > No, these bits can be non-zero from a previous instance of 
> > mark_spte_for_access_track. They are not cleared when the PTE is restored to
> > normal state. (Though we could do that and then have a WARN_ONCE here.)
> 
> Ok, if it's not too hard, do add it.  I think it's worth having
> more self-checks.

Sure. Will do.
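Concretely, I'm thinking of clearing the saved bits when the PTE is restored
to its normal state, and then having something along these lines in
mark_spte_for_access_track(), in place of the masking above (just a sketch):

        /*
         * If the restore path clears the saved bits, they should always be
         * zero by the time we get here.
         */
        WARN_ONCE(spte & (shadow_acc_track_saved_bits_mask <<
                          shadow_acc_track_saved_bits_shift),
                  "kvm: saved bits in a non-access-tracked SPTE are not zero\n");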

> 
> >>> @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> >>>  
> >>>  		/*
> >>> -		 * Check if it is a spurious fault caused by TLB lazily flushed.
> >>> +		 * Check whether the memory access that caused the fault would
> >>> +		 * still cause it if it were to be performed right now. If not,
> >>> +		 * then this is a spurious fault caused by TLB lazily flushed,
> >>> +		 * or some other CPU has already fixed the PTE after the
> >>> +		 * current CPU took the fault.
> >>> +		 *
> >>> +		 * If Write-Only mappings ever become supported, then the
> >>> +		 * condition below would need to be changed appropriately.
> >>>  		 *
> >>>  		 * Need not check the access of upper level table entries since
> >>>  		 * they are always ACC_ALL.
> >>>  		 */
> >>> -		if (is_writable_pte(spte)) {
> >>> -			ret = true;
> >>> +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
> >>> +		    valid_exec_access) {
> >>> +			fault_handled = true;
> >>>  			break;
> >>>  		}
> >>
> >> Let's separate the three conditions (R/W/X):
> >>
> >> 		if (error_code & PFERR_FETCH_MASK) {
> >> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
> >> 			    == shadow_x_mask) {
> >> 				fault_handled = true;
> >> 				break;
> >> 			}
> >> 		}
> >> 		if (error_code & PFERR_WRITE_MASK) {
> >> 			if (is_writable_pte(spte)) {
> >> 				fault_handled = true;
> >> 				break;
> >> 			}
> >> 			remove_write_prot =
> >> 				spte_can_locklessly_be_made_writable(spte);
> >> 		}
> >> 		if (!(error_code & PFERR_PRESENT_MASK)) {
> >> 			if (!is_access_track_spte(spte)) {
> >> 				fault_handled = true;
> >> 				break;
> >> 			}
> >> 			remove_acc_track = true;
> >> 		}
> > 
> > I think the third block is incorrect e.g. it will set fault_handled = true even
> > for a completely zero PTE.
> 
> A completely zero PTE would have been filtered before by the
> is_shadow_present_pte check, wouldn't it?

Oh, the is_shadow_present_pte check was actually removed in the patch. We could
add it back, minus the ret = true statement, and then it would filter the zero 
PTE case. But I still think that the other form:

                if ((error_code & PFERR_USER_MASK) &&
                    (spte & PT_PRESENT_MASK)) {
                        fault_handled = true;
                        break;
                }

is simpler as it is directly analogous to the cases for fetch and write.
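For concreteness, the three cases would then read roughly like this (just a
sketch that puts your fetch and write blocks together with the read check
above; not the exact code from the patch):

                if (error_code & PFERR_FETCH_MASK) {
                        if ((spte & (shadow_x_mask | shadow_nx_mask)) ==
                            shadow_x_mask) {
                                fault_handled = true;
                                break;
                        }
                }

                if (error_code & PFERR_WRITE_MASK) {
                        if (is_writable_pte(spte)) {
                                fault_handled = true;
                                break;
                        }

                        remove_write_prot =
                                spte_can_locklessly_be_made_writable(spte);
                }

                if ((error_code & PFERR_USER_MASK) &&
                    (spte & PT_PRESENT_MASK)) {
                        fault_handled = true;
                        break;
                }
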
Please let me know if you think otherwise.

> >>> +/* This is only supposed to be used for non-EPT mappings */
> >>
> >> It's only used for non-EPT mappings, why is it only *supposed* to be 
> >> used for non-EPT mappings?  It seems to me that it would work.
> >>
> >>>  static bool need_remote_flush(u64 old, u64 new)
> >>>  {
> > 
> > It would work but it will return true in at least one (probably only) case 
> > where it doesn’t need to e.g. going from an acc-track PTE to a zeroed PTE.
> > Though maybe that is not a big deal. Of course, we could also just update it 
> > to handle acc-track. 
> 
> Yeah, I think it's not a big deal.  But actually is it really used only
> for non-EPT?  Nested virtualization uses kvm_mmu_pte_write for EPT as well.
 
Ok. I guess it might be more accurate to say indirect mappings instead of
non-EPT mappings. In any case, I’ll just remove the comment.

Thanks,
Junaid


Thread overview: 56+ messages
2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-10-27  2:19 ` [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
2016-11-02 18:03   ` Paolo Bonzini
2016-11-02 21:40     ` Junaid Shahid
2016-10-27  2:19 ` [PATCH 2/4] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
2016-10-27  2:19 ` [PATCH 3/4] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
2016-10-27  2:19 ` [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-11-02 18:01   ` Paolo Bonzini
2016-11-02 21:42     ` Junaid Shahid
2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
2016-11-08 23:00   ` [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
2016-11-21 13:06     ` Paolo Bonzini
2016-11-08 23:00   ` [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
2016-11-21 13:07     ` Paolo Bonzini
2016-11-08 23:00   ` [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
2016-11-21 13:13     ` Paolo Bonzini
2016-11-08 23:00   ` [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-11-21 14:42     ` Paolo Bonzini
2016-11-24  3:50       ` Junaid Shahid
2016-11-25  9:45         ` Paolo Bonzini
2016-11-29  2:43           ` Junaid Shahid [this message]
2016-11-29  8:09             ` Paolo Bonzini
2016-11-30  0:59               ` Junaid Shahid
2016-11-30 11:09                 ` Paolo Bonzini
2016-12-01 22:54       ` Junaid Shahid
2016-12-02  8:33         ` Paolo Bonzini
2016-12-05 22:57           ` Junaid Shahid
2016-11-08 23:00   ` [PATCH v2 5/5] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
2016-12-15  6:50     ` Xiao Guangrong
2016-12-15 23:06       ` Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
2016-12-15  6:51     ` Xiao Guangrong
2016-12-07  0:46   ` [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
2016-12-15  7:20     ` Xiao Guangrong
2016-12-15 23:36       ` Junaid Shahid
2016-12-16 13:13         ` Xiao Guangrong
2016-12-17  0:36           ` Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 4/8] kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 5/8] kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 6/8] kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-12-14 16:28     ` Paolo Bonzini
2016-12-14 22:36       ` Junaid Shahid
2016-12-14 23:35         ` Paolo Bonzini
2016-12-16 13:04     ` Xiao Guangrong
2016-12-16 15:23       ` Paolo Bonzini
2016-12-17  0:01         ` Junaid Shahid
2016-12-21  9:49         ` Xiao Guangrong
2016-12-21 18:00           ` Paolo Bonzini
2016-12-17  2:04       ` Junaid Shahid
2016-12-17 14:19         ` Paolo Bonzini
2016-12-20  3:36           ` Junaid Shahid
2016-12-20  9:01             ` Paolo Bonzini
2016-12-07  0:46   ` [PATCH v3 8/8] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
