kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Maxim Levitsky <mlevitsk@redhat.com>
To: Jim Mattson <jmattson@google.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	David Gilbert <dgilbert@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Peter Xu <peterx@redhat.com>
Subject: Re: Why do we need KVM_REQ_GET_NESTED_STATE_PAGES after all
Date: Sun, 30 Jan 2022 16:29:02 +0200	[thread overview]
Message-ID: <11324ba7075ce90dee9d424c78db0bf97b1c4444.camel@redhat.com> (raw)
In-Reply-To: <CALMp9eT2cP7kdptoP3=acJX+5_Wg6MXNwoDh42pfb21-wdXvJg@mail.gmail.com>

On Thu, 2022-01-27 at 11:39 -0800, Jim Mattson wrote:
> On Thu, Jan 27, 2022 at 8:04 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > I would like to raise a question about this elephant in the room which I wanted to understand for
> > quite a long time.
> > 
> > For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
> > again I am asking myself, maybe we can get rid of this code, after all?
> 
> We (GCE) use it so that, during post-copy, a vCPU thread can exit to

Thank you very much for a very detailed response!

> userspace and demand these pages from the source itself, rather than
> funneling all demands through a single "demand paging listener"
That is something I didn't think of!

The question is however, can that happen between setting the nested state
and running a vCPU first time. 

I guess it is possible in therory that you set the nested state, then run 1 vCPU, 
which faults and exits to userspace,
then userspace populates the faulted memory which triggers memslots update,
and only then you run another vCPU for first time.
needs will be need when this request is processed, and it will fail if they aren't.



> thread, which I believe is the equivalent of qemu's userfaultfd "fault
> handler" thread. Our (internal) post-copy mechanism scales quite well,
> because most demand paging requests are triggered by an EPT violation,
> which happens to be a convenient place to exit to userspace. Very few
> pages are typically demanded as a result of
> kvm_vcpu_{read,write}_guest, where the vCPU thread is so deep in the
> kernel call stack that it has to request the page via the demand
> paging listener thread. With nested virtualization, the various vmcs12
> pages consulted directly by kvm (bypassing the EPT tables) were a
> scalability issue.
I assume that you patched all these calls to exit to userspace for the
demand paging scheme you are using.

> 
> (Note that, unlike upstream, we don't call nested_get_vmcs12_pages
> directly from VMLAUNCH/VMRESUME emulation; we always call it as a
> result of this request that you don't like.)
Also I guess something specific to your downstream patches.


> 
> As we work on converting from our (hacky) demand paging scheme to
> userfaultfd, we will have to solve the scalability issue anyway
> (unless someone else beats us to it). Eventually, I expect that our
> need for this request will go away.

Great!

The question is, if we remove it now, will that affect you?

What if we depricate it (add option to keep the current behavier,
but keep an module param to revert back to old behavier, with
the eventual goal of removing it.


> 
> Honestly, without the exits to userspace, I don't really see how this
> request buys you anything upstream. When I originally submitted it, I
> was prepared for rejection, but Paolo said that qemu had a similar
> need for it, and I happily never questioned that assertion.

Exactly! I didn't questioned it as well, because I didn't knew MMU at all,
and it is one of the harderst KVM parts - who knows what it caches,
and what magic it needs to be up to date.

But now, I don't think there are still large areas of MMU that I don't understand,
thus I started asking myself why it is need.

That request is a ripe source of bugs. Just off my hand, Vitaly spent at least a week
understanding why after vmcb01/02 split, eVMCS stopped working, only to figure out that
KVM_REQ_GET_NESTED_STATE_PAGES might not be called after nested entry since there
could be nested VM exit before we even enter the guest, and since then one more
hack has to be added to work that around (nothing against the hack, its not the
root cause of the problem).

I also indirectly caused and fixed a CVE like issue, which started 
 with the patch that added KVM_REQ_GET_NESTED_STATE_PAGES to SVM - 
it made KVM switch to nested msr bitmap
and back then there was no vmcb01/02 split. Problem is that back then
we didn't cancel that request if we have VM exit right after VM entry,
so it would be still pending on VM entry, and it will switch to nested MSR bitmap
even if we are no longer nested - then I added patch to free the nested state
on demand, and boom - we have L1 using freed (and usualy zeroed) MSR bitmap - free
access to all host msrs from L1...

There is another hidden issue in this request, that it makes it impossible to
handle failure gracefully.

If for example, loading nested state pages needs to allocate memory and that
fails, we could just fail the nested VM entry if that request wasn't there.
While this is not ideal for the nested guest, it likely to survive, and
might even retry entering the nested guest.

On the other hand during the request if something fails, nested state is
already loaded, and all we can do is to kill the L1.


Thanks again!
Best regards,
	Maxim Levitsky

> 





  reply	other threads:[~2022-01-30 14:29 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-27 16:03 Why do we need KVM_REQ_GET_NESTED_STATE_PAGES after all Maxim Levitsky
2022-01-27 19:39 ` Jim Mattson
2022-01-30 14:29   ` Maxim Levitsky [this message]
2022-01-30 23:46     ` Jim Mattson
2022-01-31 10:31       ` Maxim Levitsky
2022-01-31 17:37         ` Jim Mattson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=11324ba7075ce90dee9d424c78db0bf97b1c4444.camel@redhat.com \
    --to=mlevitsk@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=jmattson@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=seanjc@google.com \
    --cc=vkuznets@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).