Re: [PATCH v2] x86/mm: Suppresses vm_events caused by page-walks

From: "Jan Beulich" <JBeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Tamas K Lengyel <tamas@tklengyel.com>,
	Wei Liu <wei.liu2@citrix.com>,
	Razvan Cojocaru <rcojocaru@bitdefender.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Tim Deegan <tim@xen.org>,
	xen-devel@lists.xen.org, aisaila@bitdefender.com
Subject: Re: [PATCH v2] x86/mm: Suppresses vm_events caused by page-walks
Date: Wed, 19 Sep 2018 08:31:15 -0600	[thread overview]
Message-ID: <5BA25DB302000078001E9F54@prv1-mh.provo.novell.com> (raw)
In-Reply-To: <3eb2d0c9-09e3-fad4-e70d-dc3fa12492d7@citrix.com>

>>> On 19.09.18 at 15:41, <andrew.cooper3@citrix.com> wrote:
> On 19/09/18 09:53, Jan Beulich wrote:
>>>>> On 18.09.18 at 20:20, <andrew.cooper3@citrix.com> wrote:
>>> On 18/09/18 11:17, Jan Beulich wrote:
>>>>>>> On 18.09.18 at 11:47, <aisaila@bitdefender.com> wrote:
>>>>> On Thu, 2018-09-13 at 08:17 -0600, Jan Beulich wrote:
>>>>>>>>> On 12.09.18 at 11:47, <aisaila@bitdefender.com> wrote:
>>>>>>> The original version of the patch emulated the current instruction
>>>>>>> (which, as a side-effect, emulated the page-walk as well), however
>>>>>>> we
>>>>>>> need finer-grained control. We want to emulate the page-walk, but
>>>>>>> still
>>>>>>> get an EPT violation event if the current instruction would trigger
>>>>>>> one.
>>>>>>> This patch performs just the page-walk emulation.
>>>>>> Rather than making this basically a revision log, could you please
>>>>>> focus
>>>>>> on what you actually want to achieve?
>>>>>>
>>>>>> As to the title: "Suppress ..." please.
>>>>>>
>>>>>>> @@ -149,6 +151,10 @@ guest_walk_tables(struct vcpu *v, struct
>>>>>>> p2m_domain *p2m,
>>>>>>>      ar_and &= gflags;
>>>>>>>      ar_or  |= gflags;
>>>>>>>  
>>>>>>> +    if ( set_ad && set_ad_bits(&l4p[guest_l4_table_offset(va)].l4,
>>>>>>> +                               &gw->l4e.l4, false) )
>>>>>>> +        accessed = true;
>>>>>> It is in particular this seemingly odd (and redundant with what's
>>>>>> done
>>>>>> later in the function) which needs thorough explanation.
>>>>> On this patch I've followed Andrew Cooper's suggestion on how to set
>>>>> A/D Bits:
>>>>>
>>>>> "While walking down the levels, set any missing A bits and remember if we
>>>>> set any.  If we set A bits, consider ourselves complete and exit back to
>>>>> the guest.  If no A bits were set, and the access was a write (which we
>>>>> know from the EPT violation information), then set the leaf D bit."
>>>>>
>>>>> If I misunderstood the comment please clarify.
>>>> It doesn't look to me as if you misunderstood anything, but only Andrew
>>>> can say for sure. However, none of this was in the description of your
>>>> patch (neither as part of the description, nor as code comment), and I
>>>> think you'd even have to greatly extend on this in order to explain to
>>>> everyone why the resulting behavior is still architecturally correct. In no
>>>> case should you assume anyone reading your patch (now or in the
>>>> future) has participated in the earlier discussion.
>>> The problem we have is that, while we know the EPT Violation was for a
>>> write of an A or D bit to a write-protected guest pagetable, we don't
>>> know if it was the A or the D bit which was attempting to be set.
>>>
>>> Furthermore (without emulating the instruction, which is what we are
>>> trying to avoid), we can't reconstruct the access.
>>>
>>> Access bits are only written if they were missing before, but may be set
>>> speculatively.  Dirty bits are only set when a write is retired.  From a
>>> practical point of view, the pipeline sets A and D bits as separate actions.
>>>
>>> Following this logic (and assuming for now a single vcpu), if we get a
>>> GPT EPT Violation, and there are missing access bits on the walk, then
>>> the fault is definitely from setting an access bit.
>> Definitely?
> 
> Yes
> 
>>  Is there anything guaranteeing architecturally that an access
>> bit related EPT violation would be delivered earlier than any other one
>> on that same or a lower page table level?
> 
> No, but why does that matter?
> 
> Architecturally defined or not, we know that the action the processor
> was trying to perform was to set an A/D bit, because we got a vmexit
> telling us so.

Well, as per what I had written further down in my earlier reply (starting
with "Or wait ..."), nothing in what you said about the EPT violation made
explicit that this was a violation from a write attempt. Without that, it
very much matters (afaict), as there are other reasons for getting EPT
violations from page table accesses.

>>  It doesn't matter much for
>> the implementation (because of it being permissible to set the A bits
>> speculatively, as you also say further down, and any other violation
>> then re-occurring after exiting back to the guest once the A bits are
>> all set), but since we're discussing here what exactly the patch
>> description should contain, I think I'd prefer this to be fully correct there.
>>
>> Or wait - I think I can agree with "definitely", provided you further
>> restrict the context: "..., if we get a GPT EPT Write Violation ...". But
>> from what I can tell the patch'es change to p2m_mem_access_check()
>> doesn't apply (or pass on) any of these qualifications at all.
> 
> I've not looked at the patch in detail yet.  I'm tempted to suggest
> rearranging guest_walk_tables() to just set the access bits on the
> decent, rather than at the end.  This matches how some hardware behaves
> when pulling entries into the paging structure cache.

I'm not opposed to that; I believe this property has changed over the
years, as I certainly do recall that early (386ish, 486ish) descriptions
of page walks suggested that A bits indeed got set after the full walk
only, and I further believe the original implementation simply took this
behavior as reference.

What I am not very happy about is the addition of a (poorly described)
new parameter to the function, making it sort of bail in the middle.

>>>  Set all access bits
>>> and call it done.  If we get a GPT EPT Violation and all access bits
>>> were set, then it was definitely from setting the Dirty bit.
>>>
>>> For multi-vcpu scenarios, things get racy.  Setting all the Access bits
>>> is safe because its a speculative action, but a speculatively load on
>>> one vcpu can race with a write (to a read-only mapping) on the other
>>> vcpu, and would trick this algorithm into setting the dirty bit when the
>>> write would have faulted (and not set the dirty bit).
>>>
>>> Do we have numbers on how many the GPT EPT Violations are for (only)
>>> access sets, and how many are for dirty tsets?  Would the first half of
>>> the algorithm (which is definitely not racy) still be a net perf win?
>> Does Windows make use of A bits at all? I'd expect most OSes to
>> simply set them right away, and actively use of the D bits.
> 
> What gives you the expectation that OSes wouldn't use A bits?
> 
> For paging out, the best options are non-accessed non-dirty page because
> their contents can be discarded immediately and reread from disk at a
> later point.

Discard-immediately is solely bound to the D bit. The A bit only tells an
OS whether a page was accessed after the bit was cleared the last time.
I will surely agree this is a possible strategy, and I was merely wondering
whether Windows (other than iirc Linux) actually makes use of this. The
reason I wouldn't really expect their use is because repeatedly (and
perhaps frequently) clearing A bits from all PTEs is sort of work intensive
(as is locating a PTE with its A bit clear). Yet if you don't clear them
frequently, the information a set A bit carries does not look to be overly
useful.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel