kvm.vger.kernel.org archive mirror
* RFC: Proposal to create a new version of the SVM nested state migration blob
@ 2021-08-16 12:54 Maxim Levitsky
  2021-08-16 16:50 ` Paolo Bonzini
  0 siblings, 1 reply; 5+ messages in thread
From: Maxim Levitsky @ 2021-08-16 12:54 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Sean Christopherson, Jim Mattson

Hi!

Today I was again studying the differences between the SVM and VMX handling of nested migration,
and I reached the conclusion that it might be worth it to create a new format for the SVM nested
state.
 
Since nested SVM mostly isn't in production yet, now might be a good time
to fix this and reduce the overall complexity and the divergence from the VMX implementation.
 
These are the issues with the current SVM nested state format:


1.


On VMX the nested state is very simple and straightforward: it is the VMCS12.
 
This works because, while nested, the VMCS01 doesn't need to contain a valid guest state at all,
since on nested VM exit the L1 guest state will be overwritten from the VMCS12's constant 'host state'.
 
Thus on VMX, while nested, the VMCS01 is almost fully created by KVM (sans things like the
TSC multiplier) and can be fully recreated on the target machine, without needing to carry
(parts of) it in the migration stream.
 
When loading the nested state on VMX, we work in a manner very similar to a regular nested entry.
Even the same function that is used for a regular entry, nested_vmx_enter_non_root_mode,
is used for loading the nested state.
 
We load both the control fields and the guest ('save') state from the VMCS12, thus overwriting
whatever L2 state currently exists in the CPU registers.
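
(For illustration, roughly how the VMX load path funnels into the entry path;
a sketch, not the exact upstream code:)

        /* sketch of the tail of VMX's KVM_SET_NESTED_STATE; most
         * details and error handling omitted */
        static int set_nested_state_sketch(struct kvm_vcpu *vcpu,
                                           struct vmcs12 *vmcs12)
        {
                /* vmcs12 was just copied in from the migration blob */
                if (nested_vmx_check_controls(vcpu, vmcs12) ||
                    nested_vmx_check_host_state(vcpu, vmcs12))
                        return -EINVAL;

                /* the very same function a real VMLAUNCH/VMRESUME uses;
                 * 'false' = not coming from an actual vmentry */
                return nested_vmx_enter_non_root_mode(vcpu, false);
        }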
 
 
On SVM, on the other hand, the VMCB01 contains the L1 'host' state
(after we 'adjust' it, as I explain later), and it has to be migrated as well.
 
The straightforward nested state for SVM would thus be this 'host' state
*and* the VMCB12.
 
However, for whatever reason, we didn't implement the nested state this way.
 
Instead, we don't store the save area of the VMCB12 in the migration stream,
because in theory the register state of L2 is restored before/after setting
the nested state using the standard KVM_SET_{S|}REGS and such.

We could have done the same for VMX as well.
 
And in that save area we store the *L1* save area (aka the L1 'host state'),
which, unlike on VMX, we have to preserve.
 
Thus in the end the nested state of SVM is also a VMCB, but one 'stitched' together
as explained above.
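
(This is visible in the current uapi layout, where the whole blob is a single
4K 'vmcb12' image; abbreviated from arch/x86/include/uapi/asm/kvm.h, with the
stitching noted in comments:)

        struct kvm_svm_nested_state_hdr {
                __u64 vmcb_pa;
        };

        struct kvm_svm_nested_state_data {
                /* the 4K 'stitched' image: the VMCB12 control area,
                 * but the *L1* (host) save area */
                __u8 vmcb12[KVM_STATE_NESTED_SVM_VMCB_SIZE];
        };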
 
This forces us to have quite different code for loading the nested state than
for a regular nested entry, which is prone to errors and various issues.
It is also quite confusing.
 
One related issue, which had to be worked around with another slight hack,
was recently found by Vitaly: it is possible to enter L1 via an 'SMM detour'.
If an SMI happens while L2 is running and L1 is not intercepting the SMI,
we need to enter L1, but (more similarly to VMX) we have to load the fixed SMM
host state and not touch the L1 'host' state stored in VMCB01.
 
Another issue I am trying to fix is related to the fact that VMX uses
KVM_REQ_GET_NESTED_STATE_PAGES for VM entries that happen on exit from SMM
while SVM doesn't, which is also a result of the differences between the implementations.



2.

 
On SVM, the host state is transparently saved 'somewhere', where 'somewhere' is
either the per-CPU HSAVE area or an internal CPU cache.
 
For nesting we opt for the 'internal CPU cache' and keep the L1 'host' state in VMCB01,
since it is already naturally saved there. However, we slightly modify this state, which
is also somewhat of a hack; it increases complexity and in this case should even decrease
performance a little bit:
 
We replace the CPU visible values of CR0/CR3/CR4/EFER
with the guest visible values (vcpu->arch.cr*/efer) to avoid losing them.
 
Then on nested VM exit we load the guest visible values back from VMCB01 and pass
them to the kvm_set_cr* functions, which add/remove bits as needed to get back the
CPU visible values, store those back into VMCB01, and store the guest visible
values back into vcpu->arch.cr*/efer.
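
(Schematically — a rough sketch of the current scheme, not the literal
upstream code:)

        /* on nested VMRUN: VMCB01's save area doubles as the stash, so
         * overwrite the CPU visible values with the guest visible ones
         * to avoid losing them while L2 runs */
        svm->vmcb01.ptr->save.cr0  = vcpu->arch.cr0;
        svm->vmcb01.ptr->save.cr3  = vcpu->arch.cr3;
        svm->vmcb01.ptr->save.cr4  = vcpu->arch.cr4;
        svm->vmcb01.ptr->save.efer = vcpu->arch.efer;

        /* on nested VM exit: feed the stashed guest visible values
         * through the normal setters, which recompute the CPU visible
         * values and write them back into VMCB01 */
        svm_set_efer(vcpu, svm->vmcb01.ptr->save.efer);
        svm_set_cr0(vcpu, svm->vmcb01.ptr->save.cr0);
        svm_set_cr4(vcpu, svm->vmcb01.ptr->save.cr4);
        kvm_set_cr3(vcpu, svm->vmcb01.ptr->save.cr3);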
 
In theory that restores both the guest visible and the CPU visible values of these
registers, but it is prone to errors.
 
It would be much cleaner to keep VMCB01 as is and store the guest visible CR* values
somewhere else, which would allow us to avoid updating VMCB01 twice.
 
Then on nested VM exit,
we would only restore KVM's guest visible values (vcpu->arch.cr*/efer)
from that 'somewhere else', and could do so without any checks etc., since these values already passed all checks.
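
(A sketch of what I mean; the svm->nested.hsave fields below are hypothetical,
they don't exist today:)

        /* nested VMRUN: stash the guest visible values somewhere else,
         * leaving VMCB01 completely untouched */
        svm->nested.hsave.cr0  = vcpu->arch.cr0;
        svm->nested.hsave.cr3  = vcpu->arch.cr3;
        svm->nested.hsave.cr4  = vcpu->arch.cr4;
        svm->nested.hsave.efer = vcpu->arch.efer;

        /* nested VM exit: just put them back; no kvm_set_cr* round
         * trip, since these values already passed all checks when L1
         * loaded them */
        vcpu->arch.cr0  = svm->nested.hsave.cr0;
        vcpu->arch.cr3  = svm->nested.hsave.cr3;
        vcpu->arch.cr4  = svm->nested.hsave.cr4;
        vcpu->arch.efer = svm->nested.hsave.efer;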
 
Of course, these values need to be saved in the migration stream as well.
 
Note that on VMX this issue doesn't exist, since the host state isn't preserved
but rather force-loaded from the VMCS on each VM exit.
 
 
Finally, I propose that the SVM nested state consist of (roughly sketched below):
 
* The L1 save area.
* The L1 guest visible CR*/EFER/etc. values (the vcpu->arch.cr* values)
* The full VMCB12 (save and control areas)
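
(Schematically, as a blob layout — hypothetical, all field names made up:)

        /* hypothetical v2 format, for illustration only */
        struct kvm_svm_nested_state_data_v2 {
                /* the L1 save area, exactly as VMCB01 holds it */
                struct vmcb_save_area l1_save;

                /* the L1 guest visible values (vcpu->arch.*) */
                __u64 l1_cr0, l1_cr3, l1_cr4, l1_efer;

                /* the full VMCB12, save and control areas */
                __u8 vmcb12[KVM_STATE_NESTED_SVM_VMCB_SIZE];
        };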
 
 
What do you think?
 
 
For backward compatibility we can have a function that 'converts' the old state to
the new state by guessing the L2 save area from the current CPU state, and by treating
the L1 save area in the old state as an area that was 'adjusted' with the guest visible
CR* values. This keeps that complexity in the conversion function, which can even go
away in the future.
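
(Roughly, with made-up names:)

        /* hypothetical converter, names made up for illustration */
        static void convert_old_to_new_nested_state(struct kvm_vcpu *vcpu,
                        struct kvm_svm_nested_state_data_v2 *v2,
                        const struct vmcb *old)
        {
                /* the old blob's save area is the 'adjusted' L1 host
                 * state, i.e. it already holds the guest visible CR*
                 * values, so it can fill both new fields */
                v2->l1_save = old->save;
                v2->l1_cr0  = old->save.cr0;
                v2->l1_cr3  = old->save.cr3;
                v2->l1_cr4  = old->save.cr4;
                v2->l1_efer = old->save.efer;

                /* guess the L2 save area from the current CPU state
                 * (hypothetical helper) */
                build_vmcb12_save_from_vcpu(vcpu,
                        &((struct vmcb *)v2->vmcb12)->save);
        }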
 
 
I did most of this research today while trying to fix a few bugs I introduced when I
added my SREGS2 ioctl, with regard to exiting back from SMM mode, which is also done
differently on SVM than on VMX.
 
 
Best regards,
	Maxim Levitsky



* Re: RFC: Proposal to create a new version of the SVM nested state migration blob
  2021-08-16 12:54 RFC: Proposal to create a new version of the SVM nested state migration blob Maxim Levitsky
@ 2021-08-16 16:50 ` Paolo Bonzini
  2021-08-17 16:40   ` Maxim Levitsky
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2021-08-16 16:50 UTC (permalink / raw)
  To: Maxim Levitsky, kvm; +Cc: Sean Christopherson, Jim Mattson

On 16/08/21 14:54, Maxim Levitsky wrote:
> Then on nested VM exit,
> we would only restore KVM's guest visible values (vcpu->arch.cr*/efer)
> from that 'somewhere else', and could do so without any checks etc., since these values already passed all checks.
>   
> Of course, these values need to be saved in the migration stream as well.

Note that there could be differences between the guest-visible values 
and the processor values of CRx.  In particular, say you have a 
hypervisor that uses PSE for its page tables.  The CR4 would have 
CR4.PSE=1 on a machine that uses NPT and CR4.PAE=1 on a machine that 
doesn't.

> Finally, I propose that the SVM nested state consist of (roughly sketched below):
>  
> * The L1 save area.
> * The L1 guest visible CR*/EFER/etc. values (the vcpu->arch.cr* values)
> * The full VMCB12 (save and control areas)

So your proposal would basically be to:

* do the equivalent of sync_vmcs02_to_vmcs12+sync_vmcs02_to_vmcs12_rare 
on KVM_GET_NESTED_STATE

* discard the current state on KVM_SET_NESTED_STATE.

That does make sense.  It wasn't done this way just because the "else" 
branch of

         if (is_guest_mode(vcpu)) {
                 sync_vmcs02_to_vmcs12(vcpu, vmcs12);
                 sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
         } else  {
                 copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
                 if (!vmx->nested.need_vmcs12_to_shadow_sync) {
                         if (vmx->nested.hv_evmcs)
                                 copy_enlightened_to_vmcs12(vmx);
                         else if (enable_shadow_vmcs)
                                 copy_shadow_to_vmcs12(vmx);
                 }
         }

isn't needed on SVM and thus it seemed "obvious" to remove the "then" 
branch as well.  I just focused on enter_svm_guest_mode when refactoring 
the common bits of vmentry and KVM_SET_NESTED_STATE, so there is not 
even a function like sync_vmcb02_to_vmcb12 (instead it's just done in 
nested_svm_vmexit).
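
If it existed, it would presumably just lift that logic out of the vmexit 
path; a rough sketch (not actual upstream code):

         /* sketch, mirroring what nested_svm_vmexit does inline today */
         static void sync_vmcb02_to_vmcb12(struct vcpu_svm *svm,
                                           struct vmcb *vmcb12)
         {
                 struct kvm_vcpu *vcpu = &svm->vcpu;
                 struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;

                 vmcb12->save.es     = vmcb02->save.es;
                 vmcb12->save.cs     = vmcb02->save.cs;
                 vmcb12->save.ss     = vmcb02->save.ss;
                 vmcb12->save.ds     = vmcb02->save.ds;
                 vmcb12->save.efer   = vcpu->arch.efer;
                 vmcb12->save.cr0    = kvm_read_cr0(vcpu);
                 vmcb12->save.cr3    = kvm_read_cr3(vcpu);
                 vmcb12->save.cr2    = vmcb02->save.cr2;
                 vmcb12->save.cr4    = vcpu->arch.cr4;
                 vmcb12->save.rflags = kvm_get_rflags(vcpu);
                 vmcb12->save.rip    = kvm_rip_read(vcpu);
                 vmcb12->save.rsp    = vmcb02->save.rsp;
                 vmcb12->save.rax    = vmcb02->save.rax;
                 vmcb12->save.dr7    = vmcb02->save.dr7;
         }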

It does have some extra complications, for example with SREGS2 and the 
ever more complicated ordering of KVM_SET_* ioctls.

On the other hand, issues like not migrating PDPTRs were not specific to 
nested SVM, and merely exposed by it.  The ordering issues with 
KVM_SET_VCPU_EVENTS and KVM_SET_MP_STATE's handling of INIT and SIPI are 
also not specific to nested SVM.  I am not sure that including the full 
VMCB12 in the SVM nested state would solve any of these.  Anything that 
would be solved probably would be solved just because migrating a large 
blob is easier than juggling a dozen ioctls.

Thanks,

Paolo



* Re: RFC: Proposal to create a new version of the SVM nested state migration blob
  2021-08-16 16:50 ` Paolo Bonzini
@ 2021-08-17 16:40   ` Maxim Levitsky
  2021-08-17 16:50     ` Paolo Bonzini
  0 siblings, 1 reply; 5+ messages in thread
From: Maxim Levitsky @ 2021-08-17 16:40 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: Sean Christopherson, Jim Mattson

On Mon, 2021-08-16 at 18:50 +0200, Paolo Bonzini wrote:
> On 16/08/21 14:54, Maxim Levitsky wrote:
> > Then on nested VM exit,
> > we would only restore KVM's guest visible values (vcpu->arch.cr*/efer)
> > from that 'somewhere else', and could do so without any checks etc., since these values already passed all checks.
> >   
> > Of course, these values need to be saved in the migration stream as well.
> 
> Note that there could be differences between the guest-visible values 
> and the processor values of CRx.  In particular, say you have a 
> hypervisor that uses PSE for its page tables.  The CR4 would have 
> CR4.PSE=1 on a machine that uses NPT and CR4.PAE=1 on a machine that 
> doesn't.

This is exactly what I said in the mail about a more minor problem,
one which kind of irritates me but isn't urgent:

I proposed that on nested entry we leave the processor values in vmcb01
as is, and back up the guest visible values in, say, 'svm->nested.hsave.cr*'
or something like that.
Later, on nested VM exit, we restore the vcpu.arch.cr* values from 'svm->nested.hsave.cr*'
and leave the vmcb01 values alone.
 
That isn't strictly related to the nested migration state, but it seemed
to me that it would also be nice to have both the guest visible
and the CPU visible values of the L1 save state in the migration state
as well, while we are redefining it anyway.

Maybe this is overkill.



> 
> > Finally, I propose that the SVM nested state consist of (roughly sketched below):
> >  
> > * The L1 save area.
> > * The L1 guest visible CR*/EFER/etc. values (the vcpu->arch.cr* values)
> > * The full VMCB12 (save and control areas)
> 
> So your proposal would basically be to:
> 
> * do the equivalent of sync_vmcs02_to_vmcs12+sync_vmcs02_to_vmcs12_rare 
> on KVM_GET_NESTED_STATE



> 
> * discard the current state on KVM_SET_NESTED_STATE.
> 
> That does make sense.  It wasn't done this way just because the "else" 
> branch of
> 
>          if (is_guest_mode(vcpu)) {
>                  sync_vmcs02_to_vmcs12(vcpu, vmcs12);
>                  sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
>          } else  {
>                  copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
>                  if (!vmx->nested.need_vmcs12_to_shadow_sync) {
>                          if (vmx->nested.hv_evmcs)
>                                  copy_enlightened_to_vmcs12(vmx);
>                          else if (enable_shadow_vmcs)
>                                  copy_shadow_to_vmcs12(vmx);
>                  }
>          }
> 
> isn't needed on SVM and thus it seemed "obvious" to remove the "then" 
> branch as well.  I just focused on enter_svm_guest_mode when refactoring 
> the common bits of vmentry and KVM_SET_NESTED_STATE, so there is not 
> even a function like sync_vmcb02_to_vmcb12 (instead it's just done in 
> nested_svm_vmexit).

Yes. The else branch is due to the fact that, even while L1 is running,
the vmcs12 is usually not up to date, which is hidden by the vmread/vmwrite
intercepts.

This isn't possible on SVM due to the memory-mapped nature of the VMCB,
which forces us to sync the vmcb12 after each nested VM exit.

 
I did indeed overlook the fact that the vmcb12 save area is not up to date;
in fact, I probably wouldn't even want to read it from guest memory
at KVM_GET_NESTED_STATE time.
 
But it can be constructed from KVM's guest visible CR* values
and the values in the VMCB02, roughly the way sync_vmcs02_to_vmcs12
or nested_svm_vmexit does it.
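
Roughly, a hypothetical sketch (the helper name is made up):

        /* hypothetical: construct a fresh vmcb12 save area at
         * KVM_GET_NESTED_STATE time instead of reading guest memory */
        static void build_vmcb12_save_from_vcpu(struct kvm_vcpu *vcpu,
                                                struct vmcb_save_area *save12)
        {
                struct vmcb *vmcb02 = to_svm(vcpu)->nested.vmcb02.ptr;

                /* the guest visible values come from KVM's own tracking */
                save12->cr0  = kvm_read_cr0(vcpu);
                save12->cr3  = kvm_read_cr3(vcpu);
                save12->cr4  = kvm_read_cr4(vcpu);
                save12->efer = vcpu->arch.efer;

                /* the rest can be taken from the VMCB02 */
                save12->rip = kvm_rip_read(vcpu);
                save12->rsp = vmcb02->save.rsp;
                save12->rax = vmcb02->save.rax;
        }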


> 
> It does have some extra complications, for example with SREGS2 and the 
> ever more complicated ordering of KVM_SET_* ioctls.
> 
> On the other hand, issues like not migrating PDPTRs were not specific to 
> nested SVM, and merely exposed by it.  The ordering issues with 
> KVM_SET_VCPU_EVENTS and KVM_SET_MP_STATE's handling of INIT and SIPI are 
> also not specific to nested SVM.  I am not sure that including the full 
> VMCB12 in the SVM nested state would solve any of these.  Anything that 
> would be solved probably would be solved just because migrating a large 
> blob is easier than juggling a dozen ioctls.


The core of my proposal is that, while it indeed makes retrieval of the
nested state a bit more complicated, it makes restore of the nested state
much simpler, since restore can be treated as if we were just doing a nested entry,
eliminating the various special cases we currently need in nested state load on SVM.
 
Security wise, a bug during retrieval isn't as bad as a bug during loading of the
state, so it makes sense to have the load of the state share as much code as possible
with normal nested entry.
 
That means that the whole VMCB12 image can be checked as a whole and loaded,
replacing most of the existing CPU state, in the same manner as a regular
nested entry.
 
This also makes nested state load less dependent on its ordering vs. setting
the other CPU state.
 
So what do you think? Is it worth it for me to write an RFC patch series for this?

Best regards,
	Maxim Levitsky



> 
> Thanks,
> 
> Paolo
> 




* Re: RFC: Proposal to create a new version of the SVM nested state migration blob
  2021-08-17 16:40   ` Maxim Levitsky
@ 2021-08-17 16:50     ` Paolo Bonzini
  2021-08-17 16:56       ` Maxim Levitsky
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2021-08-17 16:50 UTC (permalink / raw)
  To: Maxim Levitsky, kvm; +Cc: Sean Christopherson, Jim Mattson

On 17/08/21 18:40, Maxim Levitsky wrote:
> I proposed that on nested entry we leave the processor values in vmcb01
> as is, and back up the guest visible values in, say, 'svm->nested.hsave.cr*'
> or something like that.
> Later, on nested VM exit, we restore the vcpu.arch.cr* values from 'svm->nested.hsave.cr*'
> and leave the vmcb01 values alone.
>   
> That isn't strictly related to the nested migration state, but it seemed
> to me that it would also be nice to have both the guest visible
> and the CPU visible values of the L1 save state in the migration state
> as well, while we are redefining it anyway.

But the CPU visible values for L1 are useless, aren't they?  They are 
computed on another system.  So you have to compute them again on the 
destination.

So this idea, which is strictly speaking an optimization of vmexit, is 
unrelated to migration, and I would leave it aside.

>> So your proposal would basically be to:
>>
>> * do the equivalent of sync_vmcs02_to_vmcs12+sync_vmcs02_to_vmcs12_rare
>> on KVM_GET_NESTED_STATE
>>
>> * discard the current state on KVM_SET_NESTED_STATE.
>
> I did indeed overlook the fact that the vmcb12 save area is not up to date;
> in fact, I probably wouldn't even want to read it from guest memory
> at KVM_GET_NESTED_STATE time. But it can be constructed from KVM's
> guest visible CR* values and the values in the VMCB02, roughly the way
> sync_vmcs02_to_vmcs12 or nested_svm_vmexit does it.

Right.

> The core of my proposal is that, while it indeed makes retrieval of the
> nested state a bit more complicated, it makes restore of the nested state
> much simpler, since restore can be treated as if we were just doing a nested entry,
> eliminating the various special cases we currently need in nested state load on SVM.

This is true.

> Security wise, a bug during retrieval isn't as bad as a bug during loading of the
> state, so it makes sense to have the load of the state share as much code as possible
> with normal nested entry.
> 
> That means that the whole VMCB12 image can be checked as a whole and loaded,
> replacing most of the existing CPU state, in the same manner as a regular
> nested entry.
>   
> This also makes nested state load less dependent on its ordering vs. setting
> the other CPU state.
>
> So what do you think? Is it worth it for me to write an RFC patch series for this?

I have a slightly different idea.  Do you think it would make sense to 
use the current processor state to build the VMCB12?  Such a prototype 
would show what KVM_SET_NESTED_STATE would look like with your new blob 
format.  We could use that to see if it is worth proceeding further, with 
three possible answers:

1) the new KVM_SET_NESTED_STATE remains complex, so we scrap the idea

2) the new KVM_SET_NESTED_STATE is nice, and we decide it's already good 
enough

3) the new KVM_SET_NESTED_STATE is nice, but there is enough ugliness 
left that the new format seems worthwhile

Paolo



* Re: RFC: Proposal to create a new version of the SVM nested state migration blob
  2021-08-17 16:50     ` Paolo Bonzini
@ 2021-08-17 16:56       ` Maxim Levitsky
  0 siblings, 0 replies; 5+ messages in thread
From: Maxim Levitsky @ 2021-08-17 16:56 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: Sean Christopherson, Jim Mattson

On Tue, 2021-08-17 at 18:50 +0200, Paolo Bonzini wrote:
> On 17/08/21 18:40, Maxim Levitsky wrote:
> > I proposed that on nested entry we leave the processor values in vmcb01
> > as is, and back up the guest visible values in, say, 'svm->nested.hsave.cr*'
> > or something like that.
> > Later, on nested VM exit, we restore the vcpu.arch.cr* values from 'svm->nested.hsave.cr*'
> > and leave the vmcb01 values alone.
> >   
> > That isn't strictly related to the nested migration state, but it seemed
> > to me that it would also be nice to have both the guest visible
> > and the CPU visible values of the L1 save state in the migration state
> > as well, while we are redefining it anyway.
> 
> But the CPU visible values for L1 are useless, aren't they?  They are 
> computed on another system.  So you have to compute them again on the 
> destination.

I understand what you mean, and I agree with you.

> 
> So this idea, which is strictly speaking an optimization of vmexit, is 
> unrelated to migration, and I would leave it aside.
> 
> > > So your proposal would basically be to:
> > > 
> > > * do the equivalent of sync_vmcs02_to_vmcs12+sync_vmcs02_to_vmcs12_rare
> > > on KVM_GET_NESTED_STATE
> > > 
> > > * discard the current state on KVM_SET_NESTED_STATE.
> > 
> > I did indeed overlook the fact that the vmcb12 save area is not up to date;
> > in fact, I probably wouldn't even want to read it from guest memory
> > at KVM_GET_NESTED_STATE time. But it can be constructed from KVM's
> > guest visible CR* values and the values in the VMCB02, roughly the way
> > sync_vmcs02_to_vmcs12 or nested_svm_vmexit does it.
> 
> Right.
> 
> > The core of my proposal is that, while it indeed makes retrieval of the
> > nested state a bit more complicated, it makes restore of the nested state
> > much simpler, since restore can be treated as if we were just doing a nested entry,
> > eliminating the various special cases we currently need in nested state load on SVM.
> 
> This is true.
> 
> > Security wise, a bug during retrieval isn't as bad as a bug during loading of the
> > state, so it makes sense to have the load of the state share as much code as possible
> > with normal nested entry.
> > 
> > That means that the whole VMCB12 image can be checked as a whole and loaded,
> > replacing most of the existing CPU state, in the same manner as a regular
> > nested entry.
> >   
> > This also makes nested state load less dependent on its ordering vs. setting
> > the other CPU state.
> > 
> > So what do you think? Is it worth it for me to write an RFC patch series for this?
> 
> I have a slightly different idea.  Do you think it would make sense to 
> use the current processor state to build the VMCB12?  Such a prototype 
> would show what KVM_SET_NESTED_STATE would look like with your new blob 
> format.  We could use that to see if it is worth proceeding further, with 
> three possible answers:
> 
> 1) the new KVM_SET_NESTED_STATE remains complex, so we scrap the idea
> 
> 2) the new KVM_SET_NESTED_STATE is nice, and we decide it's already good 
> enough
> 
> 3) the new KVM_SET_NESTED_STATE is nice, but there is enough ugliness 
> left that the new format seems worthwhile

I like this idea! I will prepare a prototype for this soon.

Thanks a lot for the help!

Best regards,
	Maxim Levitsky

> 
> Paolo
> 



