linux-mm.kvack.org archive mirror
* The root cause of failure of access_tracking_perf_test in a nested guest
@ 2022-09-23 10:16 Maxim Levitsky
  2022-09-23 11:57 ` Emanuele Giuseppe Esposito
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Maxim Levitsky @ 2022-09-23 10:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Vladimir Davydov, linux-mm, Sean Christopherson,
	Emanuele Giuseppe Esposito

Hi!

Emanuele Giuseppe Esposito and I were working on trying to understand why access_tracking_perf_test
fails when run in a nested guest on Intel, and I was finally able to find the root cause.

So the access_tracking_perf_test tests the following:

- It opens /sys/kernel/mm/page_idle/bitmap, which is a special file, readable/writable
by root, that allows a process to set/clear the accessed bit in its page tables.
The interface of this file is inverted: it is a bitmap of 'idle' bits, where
idle bit set === accessed bit is clear.

- It then runs a KVM guest, and checks that when the guest accesses its memory
(through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.

In particular it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
then runs a guest which reads/writes all its memory, and then
checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
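
For reference, here is a minimal sketch of how userspace drives the page_idle
interface (illustrative only, assuming 4 KiB pages; mark_page_idle()/page_is_idle()
are hypothetical helpers, not the actual selftest code):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	/* /proc/self/pagemap: one 64-bit entry per page, PFN in bits 0-54 */
	static uint64_t virt_to_pfn(void *addr)
	{
		uint64_t entry;
		int fd = open("/proc/self/pagemap", O_RDONLY);

		pread(fd, &entry, sizeof(entry), ((uintptr_t)addr / 4096) * 8);
		close(fd);
		return entry & ((1ULL << 55) - 1);
	}

	/* bitmap_fd is an O_RDWR fd on /sys/kernel/mm/page_idle/bitmap.
	 * Marking a page idle clears its accessed bit via the mmu notifiers. */
	static void mark_page_idle(int bitmap_fd, uint64_t pfn)
	{
		uint64_t bits = 1ULL << (pfn % 64);

		/* the bitmap must be accessed in 8-byte aligned chunks */
		pwrite(bitmap_fd, &bits, sizeof(bits), (pfn / 64) * 8);
	}

	/* Still idle == the accessed bit was not set again since marking. */
	static int page_is_idle(int bitmap_fd, uint64_t pfn)
	{
		uint64_t bits;

		pread(bitmap_fd, &bits, sizeof(bits), (pfn / 64) * 8);
		return !!(bits & (1ULL << (pfn % 64)));
	}

The test marks every guest page idle this way, runs the guest workload, and then
expects most pages to read back as not idle.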



Now since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, and in particular
- kvm_mmu_notifier_clear_flush_young
- kvm_mmu_notifier_clear_young
- kvm_mmu_notifier_test_young

The first two clear the accessed bit from NPT/EPT, and the third only checks its value.

The difference between the first two notifiers is that the first one flushes the EPT/NPT TLB
and the second one doesn't, and /sys/kernel/mm/page_idle/bitmap apparently uses the second one.

This means that on bare metal, the TLB might still have the accessed bit set, and thus
the CPU might not set it again in the PTE when a memory access goes through that cached translation.
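
In rough terms the two paths differ only in the flush (a paraphrased sketch, not
the literal kvm_main.c code; kvm_age_range() stands in for the real aging helper):

	static int clear_flush_young(struct kvm *kvm, unsigned long start, unsigned long end)
	{
		int young = kvm_age_range(kvm, start, end); /* clear accessed bits in SPTEs */

		if (young)
			kvm_flush_remote_tlbs(kvm); /* drop cached A=1 translations */
		return young;
	}

	static int clear_young(struct kvm *kvm, unsigned long start, unsigned long end)
	{
		/* same aging, but the TLB flush is deliberately omitted, so a CPU may
		 * keep using a cached translation whose accessed bit is already set */
		return kvm_age_range(kvm, start, end);
	}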

There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
done on purpose.

I would like to hear your opinion on why it was done this way, and if the original reasons for
not doing the tlb flush are still valid.

Now, why does access_tracking_perf_test fail in a nested guest?
It is because KVM shadow paging is used to shadow the nested EPT, and it has a "TLB" which
is not bounded in size, because it is stored in the unsync SPTEs in memory.

Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
the memory are not intercepted and therefore don't set
the accessed bit again in the guest EPT tables.

(If a TLB flush were to happen, we would 'sync' the unsync SPTEs by zapping them, because we don't
keep SPTEs for GPTEs that have the accessed bit clear.)


Any comments are welcome!

If you think that the lack of the EPT flush is still the right thing to do,
I vote again to have at least some form of a blacklist of selftests which
are expected to fail, when run under KVM (fix_hypercall_test is the other test
I already know that fails in a KVM guest, also without a practical way to fix it).


Best regards,
	Maxim Levitsky


PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
means that L0 syncs all the page tables.

Also the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.

Not using the new tdp_mmu also 'helps' by letting the test pass much more often but it still
fails once in a while, likely because of timing and/or different implementation.






* Re: The root cause of failure of access_tracking_perf_test in a nested guest
  2022-09-23 10:16 The root cause of failure of access_tracking_perf_test in a nested guest Maxim Levitsky
@ 2022-09-23 11:57 ` Emanuele Giuseppe Esposito
  2022-09-23 17:30 ` David Matlack
  2022-09-23 19:25 ` Jim Mattson
  2 siblings, 0 replies; 7+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-23 11:57 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Paolo Bonzini, Vladimir Davydov, linux-mm, Sean Christopherson



Am 23/09/2022 um 12:16 schrieb Maxim Levitsky:
> Hi!
> 
> Emanuele Giuseppe Esposito and I were working on trying to understand why access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
> 
> So the access_tracking_perf_test tests the following:
> 
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special file, readable/writable
> by root, that allows a process to set/clear the accessed bit in its page tables.
> The interface of this file is inverted: it is a bitmap of 'idle' bits, where
> idle bit set === accessed bit is clear.
> 
> - It then runs a KVM guest, and checks that when the guest accesses its memory
> (through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.
> 
> In particular it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
> then runs a guest which reads/writes all its memory, and then
> checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
> 
> 
> 
> Now since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, and in particular
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
> 
> The first two clear the accessed bit from NPT/EPT, and the third only checks its value.
> 
> The difference between the first two notifiers is that the first one flushes the EPT/NPT TLB
> and the second one doesn't, and /sys/kernel/mm/page_idle/bitmap apparently uses the second one.
> 
> This means that on bare metal, the TLB might still have the accessed bit set, and thus
> the CPU might not set it again in the PTE when a memory access goes through that cached translation.
> 
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
> done on purpose.
> 
> I would like to hear your opinion on why it was done this way, and if the original reasons for
> not doing the tlb flush are still valid.
> 
> Now, why does access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging is used to shadow the nested EPT, and it has a "TLB" which
> is not bounded in size, because it is stored in the unsync SPTEs in memory.
> 
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
> the memory are not intercepted and therefore don't set
> the accessed bit again in the guest EPT tables.
> 
> (If a TLB flush were to happen, we would 'sync' the unsync SPTEs by zapping them, because we don't
> keep SPTEs for GPTEs that have the accessed bit clear.)

As suggested by Paolo, I also tried changing the page_idle.c implementation so that it would call kvm_mmu_notifier_clear_flush_young instead of its non-flushing counterpart:

diff --git a/mm/page_idle.c b/mm/page_idle.c
index edead6a8a5f9..ffc1b0182534 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -62,10 +62,10 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
                         * For PTE-mapped THP, one sub page is referenced,
                         * the whole THP is referenced.
                         */
-                       if (ptep_clear_young_notify(vma, addr, pvmw.pte))
+                       if (ptep_clear_flush_young_notify(vma, addr, pvmw.pte))
                                referenced = true;
                } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-                       if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
+                       if (pmdp_clear_flush_young_notify(vma, addr, pvmw.pmd))
                                referenced = true;
                } else {
                        /* unexpected pmd-mapped page? */

As expected, with the above patch the test does not fail anymore, proving Maxim's point.
As I understand it, an alternative would be to get rid of the test? Or at least move it out of KVM?

Thank you,
Emanuele

> 
> 
> Any comments are welcome!
> 
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again to have at least some form of a blacklist of selftests which
> are expected to fail, when run under KVM (fix_hypercall_test is the other test
> I already know that fails in a KVM guest, also without a practical way to fix it).
> 
> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
> means that L0 syncs all the page tables.
> 
> Also the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.
> 
> Not using the new tdp_mmu also 'helps' by letting the test pass much more often but it still
> fails once in a while, likely because of timing and/or different implementation.
> 
> 
> 




* Re: The root cause of failure of access_tracking_perf_test in a nested guest
  2022-09-23 10:16 The root cause of failure of access_tracking_perf_test in a nested guest Maxim Levitsky
  2022-09-23 11:57 ` Emanuele Giuseppe Esposito
@ 2022-09-23 17:30 ` David Matlack
  2022-09-23 19:25 ` Jim Mattson
  2 siblings, 0 replies; 7+ messages in thread
From: David Matlack @ 2022-09-23 17:30 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Vladimir Davydov, linux-mm,
	Sean Christopherson, Emanuele Giuseppe Esposito

On Fri, Sep 23, 2022 at 01:16:04PM +0300, Maxim Levitsky wrote:
> Hi!
> 
> Emanuele Giuseppe Esposito and I were working on trying to understand why access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
> 
> So the access_tracking_perf_test tests the following:
> 
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special file, readable/writable
> by root, that allows a process to set/clear the accessed bit in its page tables.
> The interface of this file is inverted: it is a bitmap of 'idle' bits, where
> idle bit set === accessed bit is clear.
> 
> - It then runs a KVM guest, and checks that when the guest accesses its memory
> (through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.
> 
> In particular it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
> then runs a guest which reads/writes all its memory, and then
> checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
> 
> 
> 
> Now since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, and in particular
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
> 
> The first two clear the accessed bit from NPT/EPT, and the third only checks its value.
> 
> The difference between the first two notifiers is that the first one flushes the EPT/NPT TLB
> and the second one doesn't, and /sys/kernel/mm/page_idle/bitmap apparently uses the second one.
> 
> This means that on bare metal, the TLB might still have the accessed bit set, and thus
> the CPU might not set it again in the PTE when a memory access goes through that cached translation.
> 
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
> done on purpose.
> 
> I would like to hear your opinion on why it was done this way, and if the original reasons for
> not doing the tlb flush are still valid.
> 
> Now, why does access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging is used to shadow the nested EPT, and it has a "TLB" which
> is not bounded in size, because it is stored in the unsync SPTEs in memory.
> 
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
> the memory are not intercepted and therefore don't set
> the accessed bit again in the guest EPT tables.
> 
> (If a TLB flush were to happen, we would 'sync' the unsync SPTEs by zapping them, because we don't
> keep SPTEs for GPTEs that have the accessed bit clear.)
> 
> 
> Any comments are welcome!
> 
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again to have at least some form of a blacklist of selftests which
> are expected to fail, when run under KVM (fix_hypercall_test is the other test
> I already know that fails in a KVM guest, also without a practical way to fix it).

Nice find. I don't recommend changing page_idle just for this test.

I added this test to evaluate the performance of KVM's access tracking
fault handling, e.g. for when eptad=N. page_idle just happens to be
the only userspace mechanism available today to exercise access
tracking. But it has serious downsides, as you discovered, which are
documented at the top of the test:

/*
 ...
 * Note that a deterministic correctness test of access tracking is not possible
 * by using page_idle as it exists today. This is for a few reasons:
 *
 * 1. page_idle only issues clear_young notifiers, which lack a TLB flush. This
 *    means subsequent guest accesses are not guaranteed to see page table
 *    updates made by KVM until some time in the future.
 *
 * 2. page_idle only operates on LRU pages. Newly allocated pages are not
 *    immediately allocated to LRU lists. Instead they are held in a "pagevec",
 *    which is drained to LRU lists some time in the future. There is no
 *    userspace API to force this drain to occur.
 *
 * These limitations are worked around in this test by using a large enough
 * region of memory for each vCPU such that the number of translations cached in
 * the TLB and the number of pages held in pagevecs are a small fraction of the
 * overall workload. And if either of those conditions are not true this test
 * will fail rather than silently passing.
 ...
 */

When I wrote the test, I did not realize that nested effectively has an
unlimited TLB since shadow pages can just be left unsync. So the comment
above does not hold for nested.

My recommendation to move forward would be to get rid of this
TEST_ASSERT():

	TEST_ASSERT(still_idle < pages / 10,
		    "vCPU%d: Too many pages still idle (%"PRIu64 " out of %"
		    PRIu64 ").\n",
		    vcpu_idx, still_idle, pages);

And instead just print a warning message telling the user that memory is
not being marked idle and that this will affect the performance results (not
as many accesses will actually go through access tracking). This will stop
the test from failing.
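
Something along these lines (an untested sketch of the idea, using a plain
fprintf rather than any particular selftest helper):

	if (still_idle >= pages / 10)
		fprintf(stderr,
			"WARNING: vCPU%d: %" PRIu64 " out of %" PRIu64 " pages still idle; "
			"access tracking was not exercised for them and the performance "
			"results will be skewed.\n",
			vcpu_idx, still_idle, pages);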

Long term, it would be great to switch to a more deterministic userspace
mechanism to trigger access tracking. My understanding is the new
multi-gen LRU that is slated for 6.1 or 6.2 might provide a better
option.



* Re: The root cause of failure of access_tracking_perf_test in a nested guest
  2022-09-23 10:16 The root cause of failure of access_tracking_perf_test in a nested guest Maxim Levitsky
  2022-09-23 11:57 ` Emanuele Giuseppe Esposito
  2022-09-23 17:30 ` David Matlack
@ 2022-09-23 19:25 ` Jim Mattson
  2022-09-23 20:28   ` David Matlack
  2 siblings, 1 reply; 7+ messages in thread
From: Jim Mattson @ 2022-09-23 19:25 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Vladimir Davydov, linux-mm,
	Sean Christopherson, Emanuele Giuseppe Esposito

On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on trying to understand why access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
>
> So the access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special file, readable/writable
> by root, that allows a process to set/clear the accessed bit in its page tables.
> The interface of this file is inverted: it is a bitmap of 'idle' bits, where
> idle bit set === accessed bit is clear.
>
> - It then runs a KVM guest, and checks that when the guest accesses its memory
> (through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.
>
> In particular it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
> then runs a guest which reads/writes all its memory, and then
> checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
>
>
>
> Now since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, and in particular
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit from NPT/EPT, and the third only checks its value.
>
> The difference between the first two notifiers is that the first one flushes the EPT/NPT TLB
> and the second one doesn't, and /sys/kernel/mm/page_idle/bitmap apparently uses the second one.
>
> This means that on bare metal, the TLB might still have the accessed bit set, and thus
> the CPU might not set it again in the PTE when a memory access goes through that cached translation.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
> done on purpose.
>
> I would like to hear your opinion on why it was done this way, and if the original reasons for
> not doing the tlb flush are still valid.
>
> Now, why does access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging is used to shadow the nested EPT, and it has a "TLB" which
> is not bounded in size, because it is stored in the unsync SPTEs in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
> the memory are not intercepted and therefore don't set
> the accessed bit again in the guest EPT tables.

Does the guest execute an INVEPT after clearing the accessed bit?

From volume 3 of the SDM, section 28.3.5 Accessed and Dirty Flags for EPT:

> A processor may cache information from the EPT paging-structure entries in TLBs and paging-structure caches (see Section 28.4). This fact implies that, if software changes an accessed flag or a dirty flag from 1 to 0, the processor might not set the corresponding bit in memory on a subsequent access using an affected guest-physical address.

> (If a TLB flush were to happen, we would 'sync' the unsync SPTEs by zapping them, because we don't
> keep SPTEs for GPTEs that have the accessed bit clear.)
>
>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again to have at least some form of a blacklist of selftests which
> are expected to fail, when run under KVM (fix_hypercall_test is the other test
> I already know that fails in a KVM guest, also without a practical way to fix it).
>
>
> Best regards,
>         Maxim Levitsky
>
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
> means that L0 syncs all the page tables.
>
> Also the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more often but it still
> fails once in a while, likely because of timing and/or different implementation.
>
>
>



* Re: The root cause of failure of access_tracking_perf_test in a nested guest
  2022-09-23 19:25 ` Jim Mattson
@ 2022-09-23 20:28   ` David Matlack
  2022-09-26  8:50     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 7+ messages in thread
From: David Matlack @ 2022-09-23 20:28 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Maxim Levitsky, kvm, Paolo Bonzini, Vladimir Davydov, linux-mm,
	Sean Christopherson, Emanuele Giuseppe Esposito

On Fri, Sep 23, 2022 at 12:25:00PM -0700, Jim Mattson wrote:
> On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> > notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
> > the memory are not intercepted and therefore don't set
> > the accessed bit again in the guest EPT tables.
> 
> Does the guest execute an INVEPT after clearing the accessed bit?

No, that's the problem. In L1, access_tracking_perf_test is using
page_idle to mark guest memory as idle, which results in clear_young()
notifiers being sent to KVM to clear the accessed bits. clear_young() is
explicitly allowed to omit flushes, so KVM happily obliges.

	/*
	 * clear_young is a lightweight version of clear_flush_young. Like the
	 * latter, it is supposed to test-and-clear the young/accessed bitflag
	 * in the secondary pte, but it may omit flushing the secondary tlb.
	 */
	int (*clear_young)(struct mmu_notifier *subscription,
			   struct mm_struct *mm,
			   unsigned long start,
			   unsigned long end);

We could modify page_idle so that KVM performs TLB flushes. For example,
add a mechanism for userspace to trigger a TLB flush. Or change
page_idle to use clear_flush_young() (although that would be incredibly
expensive since page_idle only allows clearing one pfn at a time). But
I'm not sure creating a new userspace API just for this test is really
worth it, especially with multigen LRU coming soon.



* Re: The root cause of failure of access_tracking_perf_test in a nested guest
  2022-09-23 20:28   ` David Matlack
@ 2022-09-26  8:50     ` Emanuele Giuseppe Esposito
  2022-10-04 18:52       ` Mingwei Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-26  8:50 UTC (permalink / raw)
  To: David Matlack, Jim Mattson
  Cc: Maxim Levitsky, kvm, Paolo Bonzini, Vladimir Davydov, linux-mm,
	Sean Christopherson



Am 23/09/2022 um 22:28 schrieb David Matlack:
> On Fri, Sep 23, 2022 at 12:25:00PM -0700, Jim Mattson wrote:
>> On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>>>
>>> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
>>> notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
>>> the memory are not intercepted and therefore don't set
>>> the accessed bit again in the guest EPT tables.
>>
>> Does the guest execute an INVEPT after clearing the accessed bit?
> 
> No, that's the problem. In L1, access_tracking_perf_test is using
> page_idle to mark guest memory as idle, which results in clear_young()
> notifiers being sent to KVM clear access bits. clear_young() is
> explicitly allowed to omit flushes, so KVM happily obliges.
> 
> 	/*
> 	 * clear_young is a lightweight version of clear_flush_young. Like the
> 	 * latter, it is supposed to test-and-clear the young/accessed bitflag
> 	 * in the secondary pte, but it may omit flushing the secondary tlb.
> 	 */
> 	int (*clear_young)(struct mmu_notifier *subscription,
> 			   struct mm_struct *mm,
> 			   unsigned long start,
> 			   unsigned long end);
> 
> We could modify page_idle so that KVM performs TLB flushes. For example,
> add a mechanism for userspace to trigger a TLB flush. Or change
> page_idle to use clear_flush_young() (although that would be incredibly
> expensive since page_idle only allows clearing one pfn at a time). But
> I'm not sure creating a new userspace API just for this test is really
> worth it, especially with multigen LRU coming soon.
> 

Thank you David and Jim for the feedback.
I sent a patch converting the assertion into a warning here:
https://lkml.org/lkml/2022/9/26/238

Thank you,
Emanuele




* Re: The root cause of failure of access_tracking_perf_test in a nested guest
  2022-09-26  8:50     ` Emanuele Giuseppe Esposito
@ 2022-10-04 18:52       ` Mingwei Zhang
  0 siblings, 0 replies; 7+ messages in thread
From: Mingwei Zhang @ 2022-10-04 18:52 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: David Matlack, Jim Mattson, Maxim Levitsky, kvm, Paolo Bonzini,
	Vladimir Davydov, linux-mm, Sean Christopherson

On Mon, Sep 26, 2022 at 1:50 AM Emanuele Giuseppe Esposito
<eesposit@redhat.com> wrote:
>
>
>
> Am 23/09/2022 um 22:28 schrieb David Matlack:
> > On Fri, Sep 23, 2022 at 12:25:00PM -0700, Jim Mattson wrote:
> >> On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >>>
> >>> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> >>> notice/intercept it and the corresponding EPT SPTEs remain the same; thus the guest's later accesses to
> >>> the memory are not intercepted and therefore don't set
> >>> the accessed bit again in the guest EPT tables.
> >>
> >> Does the guest execute an INVEPT after clearing the accessed bit?
> >
> > No, that's the problem. In L1, access_tracking_perf_test is using
> > page_idle to mark guest memory as idle, which results in clear_young()
> > notifiers being sent to KVM clear access bits. clear_young() is
> > explicitly allowed to omit flushes, so KVM happily obliges.
> >
> >       /*
> >        * clear_young is a lightweight version of clear_flush_young. Like the
> >        * latter, it is supposed to test-and-clear the young/accessed bitflag
> >        * in the secondary pte, but it may omit flushing the secondary tlb.
> >        */
> >       int (*clear_young)(struct mmu_notifier *subscription,
> >                          struct mm_struct *mm,
> >                          unsigned long start,
> >                          unsigned long end);
> >
> > We could modify page_idle so that KVM performs TLB flushes. For example,
> > add a mechanism for userspace to trigger a TLB flush. Or change
> > page_idle to use clear_flush_young() (although that would be incredibly
> > expensive since page_idle only allows clearing one pfn at a time). But
> > I'm not sure creating a new userspace API just for this test is really
> > worth it, especially with multigen LRU coming soon.

Can we add an operation that causes KVM to flush guest TLB explicitly?
For instance, we can use any operation that causes a change in
EPT/NPT, which would invoke an explicit TLB flush.  E.g., enabling
dirty logging will do the job. Alternatively, adding a memslot for the
guest, letting the guest touch it and then removing it at host level
will also flush the TLB. I believe both should be architecturally
neutral, and the latter seems more stable.

In any case, would an explicit TLB flush suffice in this case? I think this
will cause the zapping of PTEs in the L0 EPT/NPT.
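
For illustration, the memslot variant could look roughly like this in a selftest
(a sketch only; the slot number, GPA and size are arbitrary, and the helpers are
assumed to have their usual kvm_util.h signatures):

	#define SCRATCH_SLOT	10
	#define SCRATCH_GPA	(1ULL << 30)

	/* Adding and then deleting a scratch memslot forces KVM to zap the
	 * mappings and flush the guest TLB as a side effect. */
	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
				    SCRATCH_GPA, SCRATCH_SLOT, 1, 0);
	/* ... let the guest touch SCRATCH_GPA ... */
	vm_mem_region_delete(vm, SCRATCH_SLOT);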

Thanks.
-Mingwei


