The root cause of failure of access_tracking_perf_test in a nested guest

* The root cause of failure of access_tracking_perf_test in a nested guest
@ 2022-09-23 10:16 Maxim Levitsky
  2022-09-23 11:57 ` Emanuele Giuseppe Esposito
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Maxim Levitsky @ 2022-09-23 10:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Vladimir Davydov, linux-mm, Sean Christopherson,
	Emanuele Giuseppe Esposito

Hi!

Me and Emanuele Giuseppe Esposito were working on trying to understand why the access_tracking_perf_test
fails when run in a nested guest on Intel, and I finally was able to find the root casue.

So the access_tracking_perf_test tests the following:

- It opens /sys/kernel/mm/page_idle/bitmap which is a special root read/writiable
file which allows a process to set/clear the accessed bit in its page tables.
the interface of this file is inverted, it is a bitmap of 'idle' bits
Idle bit set === dirty bit is clear.

- It then runs a KVM guest, and checks that when the guest accesses its memory
(through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.

In particular it first clears the accesssed bit using /sys/kernel/mm/page_idle/bitmap,
and then runs a guest which reads/writes all its memory, and then
it checks that the accessed bit is set again by reading the /sys/kernel/mm/page_idle/bitmap.

Now since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, and in particular
- kvm_mmu_notifier_clear_flush_young
- kvm_mmu_notifier_clear_young
- kvm_mmu_notifier_test_young

First two clear the accessed bit from NPT/EPT, and the 3rd only checks its value.

The difference between the first two notifiers is that the first one flushes EPT/NPT,
and the second one doesn't, and apparently the /sys/kernel/mm/page_idle/bitmap uses the second one.

This means that on the bare metal, the tlb might still have the accessed bit set, and thus
it might not set it again in the PTE when a memory access is done through it.

There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
done on purpose.

I would like to hear your opinion on why it was done this way, and if the original reasons for
not doing the tlb flush are still valid.

Now why the access_tracking_perf_test fails in a nested guest?
It is because kvm shadow paging which is used to shadow the nested EPT, and it has a "TLB" which
is not bounded by size, because it is stored in the unsync sptes in memory.

Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
notice/intercept it and corresponding EPT sptes remain the same, thus later the guest access to
the memory is not intercepted and because of this doesn't turn back
the accessed bit in the guest EPT tables.

(If TLB flush were to happen, we would 'sync' the unsync sptes, by zapping them because we don't
keep sptes for gptes with no accessed bit)

Any comments are welcome!

If you think that the lack of the EPT flush is still the right thing to do,
I vote again to have at least some form of a blacklist of selftests which
are expected to fail, when run under KVM (fix_hypercall_test is the other test
I already know that fails in a KVM guest, also without a practical way to fix it).

Best regards,
	Maxim Levitsky

PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
means that L0 syncs all the page tables.

Also the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.

Not using the new tdp_mmu also 'helps' by letting the test pass much more often but it still
fails once in a while, likely because of timing and/or different implementation.

^ permalink raw reply	[flat|nested] 7+ messages in thread