I would like to raise a question about an elephant in the room, something I have wanted to understand for quite a long time.

For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code, and once again I am asking myself: maybe we can get rid of this code after all? And of course, if we don't need it, getting rid of it would be a very happy day in my life (and likely in the life of all other KVM developers as well).

Needless to say, it has already caused a few CVE-worthy issues which we thankfully patched before they reached production. On top of that, it creates a different code flow for a very rare code path: nested migration only happens if you are lucky enough to have a nested guest actually running during the migration, which happens only once in a while even when L1 is fully loaded. In my testing I always disable HLT exits to ensure that the nested guest actually runs all the time, sans vmexits.

====================================================================================================

So first of all, what KVM_REQ_GET_NESTED_STATE_PAGES is:

KVM_REQ_GET_NESTED_STATE_PAGES exists to delay the reading/writing/pinning of guest memory that is pointed to by various fields of the current vmcs/vmcb until the next KVM_RUN, under the assumption that it is not safe to touch it when we do a non-natural VM entry (that is, either when we resume a nested guest that was interrupted by SMM, or when we set the nested state after migration). The alleged reason for that is that either KVM's MMU is not in the 'right state', or that the userspace VMM will still do some modifications to the VM after the entry and before we actually enter the nested guest.

====================================================================================================

There are two cases where KVM_REQ_GET_NESTED_STATE_PAGES is used:

1. Exit from SMM, back to an interrupted nested guest: note that on SVM we actually read the VMCB from guest memory and the HSAVE area, already violating the rule of not reading guest memory. On VMX at least we use the cached VMCS.

2. Loading the nested state. In this case both the vmcb and the vmcs indeed come from the migration stream, thus we don't touch guest memory when setting the nested state.

====================================================================================================

Now let's see how guest memory would be accessed on nested VM entry if we didn't use KVM_REQ_GET_NESTED_STATE_PAGES:

First of all, AFAIK all SVM/VMX memory structures, including the VMCB/VMCS itself, are accessed by their physical address, thus we don't need to worry about the guest's paging, nor about nested NPT/EPT paging. The shadow pages that the MMU created are not relevant either, because they are not used by KVM itself to read guest memory.

To access guest physical memory from KVM the following is done:

A. A set of memslots is chosen (by calling __kvm_memslots()).

KVM keeps two sets of memory slots: one for the 'Normal' address space and one for the SMM address space. These memslots are set up by the VMM, which usually updates them when the guest modifies the virtual chipset's SMM-related settings. Note that on the standard Q35/i440fx chipsets which qemu emulates, those settings themselves are accessible through PCI config space regardless of SMM, but they have a lock bit which prevents non-SMM code from changing them after the VM's BIOS has set them up and locked them.

Lots of places in KVM are hardcoded to use the 'Normal' memslots (everyone who uses kvm_memslots, for example). Others use kvm_vcpu_memslots, which chooses the memslot set based on 'arch.hflags & HF_SMM_MASK'.
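For reference, this is roughly how that selection looks in the kernel; the snippet below is paraphrased from memory (from include/linux/kvm_host.h and arch/x86/kvm/x86.c), with the lockdep/nospec details stripped, so don't quote it literally:

    /* each address space has its own set of memslots */
    static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
    {
            return srcu_dereference(kvm->memslots[as_id], &kvm->srcu);
    }

    /* hardcoded to the 'Normal' address space */
    static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
    {
            return __kvm_memslots(kvm, 0);
    }

    /* picks the address space that matches the vCPU's SMM state */
    static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
    {
            return __kvm_memslots(vcpu->kvm, kvm_arch_vcpu_memslots_id(vcpu));
    }

    /* x86: SMM uses address space 1, everything else address space 0 */
    int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
    {
            return (vcpu->arch.hflags & HF_SMM_MASK) ? 1 : 0;
    }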
Thankfully we don't allow VMX/SVM in SMM (and I really hope that we never will), thus all the guest memory structures, including the VMCB/VMCS, reside in main guest memory and can therefore be reached through the 'Normal' memslots.

From this we can also deduce that loading of the nested state will never happen while the vCPU is in SMM; KVM verifies this and denies setting the nested state if that is attempted.

On return from SMM to a nested guest, kvm_smm_changed is the first thing to be called, prior to the vendor code which resumes the nested guest. kvm_smm_changed clears the HF_SMM_MASK flag, which selects which memslots to use when doing guest-physical to host-virtual translation.

Based on this reasoning, we can say that nested guest entries will never access the SMM memslots, even if we access guest memory while setting the nested state.

B. From the 'Normal' memslots, the host virtual address of the page containing the guest physical memory is obtained. The page itself might be swapped out, or not present at all if post-copy migration is used.

AFAIK, I don't see a reason why qemu or any other KVM user would not have uploaded the correct memslot set by the time the nested state is loaded after a migration. In fact qemu rarely modifies the memory slots. It does have some memslots for devices with RAM-like memory, but if the guest attempts to place VM-related structures there, it is welcome to keep both pieces.

Actual RAM memslots should only be modified at runtime when RAM hotplug or similar happens, and even if that happens after a nested migration it would not be an issue: the fact that it happens after the migration means that the guest's VM structures cannot be in those memslots.

Qemu resets and loads the state of all the devices, which match the original devices on the migration source, and it seems to upload the nested state last. It might not migrate all RAM, but rather register it with userfaultfd, but that isn't an issue (see below).

For returns from SMM, the chances that this memslot map is not up to date are even lower. In fact, returning from SMM usually won't even cause a userspace vmexit that would let qemu mess with this map.

C. Once the host virtual address of the guest's page is obtained, stock kernel paging facilities are used to bring in the page (swap-in, userfaultfd, etc.) and then obtain its physical address.

Once again, on return from SMM there should be no issues. After nested migration the page might not be present, but that will lead either to swapping it in, or to asking qemu via the userfaultfd interface to bring it in while doing post-copy migration.

In regard to post-copy migration, other VMMs ought to work the same way (use userfaultfd). In theory a VMM could instead handle SIGSEGV as an indication of a page fault, but that would not really work, for many reasons.

So, having said all that, the only justification for KVM_REQ_GET_NESTED_STATE_PAGES would be a VMM which first sets the vCPU's nested state and only then sets up the VMAs for guest memory and uploads them to KVM via memslots. Since nested migration is relatively new, there is a good chance that nobody does this, and requiring that guest memory (or at least the memory pointed to by the VMX/SVM structs) be present before setting the nested state seems like a reasonable requirement. On top of that, this simplifies the API usage by not delaying the error to the next VM run if there truly is an issue with the guest memory pointed to by the nested state.
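To make the A/B/C steps above concrete, here is roughly what a guest-physical read boils down to. This is a heavily simplified paraphrase of the kvm_vcpu_read_guest() path in virt/kvm/kvm_main.c (page-crossing, error handling and dirty logging omitted), not the literal code:

    int kvm_vcpu_read_guest(struct kvm_vcpu *vcpu, gpa_t gpa, void *data,
                            unsigned long len)
    {
            gfn_t gfn = gpa >> PAGE_SHIFT;
            struct kvm_memory_slot *slot;
            unsigned long hva;

            /* (A) pick the memslot set and find the slot covering this gfn */
            slot = __gfn_to_memslot(kvm_vcpu_memslots(vcpu), gfn);

            /* (B) guest physical -> host virtual: a simple offset into the
             *     userspace mapping that the VMM registered for this slot
             */
            hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;

            /* (C) an ordinary user-memory access: any fault is resolved by the
             *     regular mm code (swap-in, userfaultfd, ...), exactly as it
             *     would be for an access done by qemu itself
             */
            if (copy_from_user(data, (void __user *)hva + offset_in_page(gpa), len))
                    return -EFAULT;

            return 0;
    }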
PS: A few technical notes:

====================================================================================================
A. Note on eVMCS usage of KVM_REQ_GET_NESTED_STATE_PAGES:
====================================================================================================

When an eVMCS is used (a Hyper-V specific PV feature: a guest memory page which represents a VM, similar to a VMCB), the VMCS fields are loaded from it. Thankfully, after nested migration or return from SMM, we already have an up-to-date vmcs12, either read from the migration stream or saved in kernel memory. We only need the guest physical address of this memory page in order to map and pin it, so that later, after the nested entry/exit, we can read/write it.

Its address is (indirectly) obtained from the HV_X64_MSR_VP_ASSIST_PAGE MSR, which qemu restores after it sets the nested state. Currently, to support this, setting nested state with an active eVMCS is allowed without this MSR being set, and later KVM_REQ_GET_NESTED_STATE_PAGES actually maps the eVMCS page.

There is also a workaround that was added relatively recently to map the eVMCS even when KVM_REQ_GET_NESTED_STATE_PAGES hasn't been processed after nested migration, which can happen if we take a nested VM exit before entering the guest even once after setting the nested state. The workaround was to also map the eVMCS on nested VM exit.

IMHO the right way to fix this, assuming that we drop KVM_REQ_GET_NESTED_STATE_PAGES, is to just map the eVMCS when HV_X64_MSR_VP_ASSIST_PAGE is set by qemu (a host write) after nested state with an active eVMCS was set (that is, we are nested but with vmptr == -1). A rough sketch of what I mean is below.
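This is only a sketch of the idea, not a patch: nested_map_evmcs() is a made-up name for whatever helper would end up doing the mapping (today that logic lives around nested_vmx_handle_enlightened_vmptrld()), and the checks that eVMCS is actually enabled are omitted:

    /* in the HV_X64_MSR_VP_ASSIST_PAGE write handler, after the assist page
     * gpa has been recorded ('host' is the host-initiated-write flag)
     */
    if (host && is_guest_mode(vcpu)) {
            /*
             * Nested state with an active eVMCS was restored earlier (we are
             * in guest mode but the current vmptr is -1): vmcs12 already came
             * from the migration stream, only the eVMCS page itself still
             * needs to be mapped and pinned.  Do it here, now that the assist
             * page address is known, instead of deferring it to
             * KVM_REQ_GET_NESTED_STATE_PAGES.
             */
            if (nested_map_evmcs(vcpu))     /* hypothetical helper */
                    return 1;
    }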
====================================================================================================
B: Some digital archaeology for reference:
====================================================================================================

First of all, we have these two commits:

    commit a7d5c7ce41ac1e2537d78ddb57ef0ac4f737aa19
    Author: Paolo Bonzini
    Date:   Tue Sep 22 07:43:14 2020 -0400

        KVM: nSVM: delay MSR permission processing to first nested VM run

        Allow userspace to set up the memory map after KVM_SET_NESTED_STATE;
        to do so, move the call to nested_svm_vmrun_msrpm inside the
        KVM_REQ_GET_NESTED_STATE_PAGES handler (which is currently not used
        by nSVM). This is similar to what VMX does already.

        Signed-off-by: Paolo Bonzini

    commit 729c15c20f1a7c9ad1d09a603ad1aa7fb60b1f88
    Author: Paolo Bonzini
    Date:   Tue Sep 22 06:53:57 2020 -0400

        KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES

        We are going to use it for SVM too, so use a more generic name.

        Signed-off-by: Paolo Bonzini

These two commits were added around the same time I started fixing SVM's nested migration, which had just been implemented and hadn't yet received any real-world testing. SVM's code used to not even read the nested MSR bitmap, which I fixed, and there is a good chance that the first commit above was added just in case, to make SVM's code do the same thing as VMX does.

If we go deeper, we get this commit, from which it all started:

    commit 7f7f1ba33cf2c21d001821313088c231db42ff40
    Author: Paolo Bonzini
    Date:   Wed Jul 18 18:49:01 2018 +0200

        KVM: x86: do not load vmcs12 pages while still in SMM

        If the vCPU enters system management mode while running a nested
        guest, RSM starts processing the vmentry while still in SMM.  In
        that case, however, the pages pointed to by the vmcs12 might be
        incorrectly loaded from SMRAM.  To avoid this, delay the handling
        of the pages until just before the next vmentry.  This is done
        with a new request and a new entry in kvm_x86_ops, which we will
        be able to reuse for nested VMX state migration.

        Extracted from a patch by Jim Mattson and KarimAllah Ahmed.

        Signed-off-by: Paolo Bonzini

That commit was part of the initial patch series which implemented nested migration, and it could very well be that it was just a precaution:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1740085.html

Looking back at the old code before this commit, I wonder if it was needed even back then. In the original code, just prior to this commit, we had this weird sequence:

    vcpu->arch.hflags &= ~HF_SMM_MASK;
    ret = enter_vmx_non_root_mode(vcpu, NULL);
    vcpu->arch.hflags |= HF_SMM_MASK;    <- why do we set it back

Yet it did clear HF_SMM_MASK just prior to the nested entry, thus the correct memslots should have been accessed even back then, and I don't think the MSR bitmap would have been loaded from SMM memory.

====================================================================================================
C: POC
====================================================================================================

I attached the patch which removes KVM_REQ_GET_NESTED_STATE_PAGES, which I lightly tested. I didn't test the eVMCS side of things. It works for me on SVM and VMX.

I also added debug prints to qemu and used virtio-mem, which was suspected of setting memslots after setting the nested state. It seems that it really doesn't.

Best regards,
	Maxim Levitsky