Re: [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling

From: Maxim Levitsky <mlevitsk@redhat.com>
To: "Stamatis, Ilias" <ilstam@amazon.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"ilstam@mailbox.org" <ilstam@mailbox.org>
Cc: "jmattson@google.com" <jmattson@google.com>,
	"Woodhouse, David" <dwmw@amazon.co.uk>,
	"vkuznets@redhat.com" <vkuznets@redhat.com>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"mtosatti@redhat.com" <mtosatti@redhat.com>,
	"zamsden@gmail.com" <zamsden@gmail.com>,
	"seanjc@google.com" <seanjc@google.com>,
	"wanpengli@tencent.com" <wanpengli@tencent.com>
Subject: Re: [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
Date: Tue, 11 May 2021 15:44:48 +0300	[thread overview]
Message-ID: <875fa1ff9a85ff601a05030eaa24a1db45a71f36.camel@redhat.com> (raw)
In-Reply-To: <9e0973a1adef19158e0c09a642b8c733556e272c.camel@amazon.com>

On Mon, 2021-05-10 at 16:08 +0000, Stamatis, Ilias wrote:
> On Mon, 2021-05-10 at 16:54 +0300, Maxim Levitsky wrote:
> > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > From: Ilias Stamatis <ilstam@amazon.com>
> > > 
> > > Calculating the current TSC offset is done differently when nested TSC
> > > scaling is used.
> > > 
> > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > ---
> > >  arch/x86/kvm/vmx/vmx.c | 26 ++++++++++++++++++++------
> > >  1 file changed, 20 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 49241423b854..df7dc0e4c903 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -1797,10 +1797,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
> > >               vmx_update_msr_bitmap(&vmx->vcpu);
> > >  }
> > > 
> > > -static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > > +/*
> > > + * This function receives the requested offset for L1 as an argument but it
> > > + * actually writes the "current" tsc offset to the VMCS and returns it. The
> > > + * current offset might be different in case an L2 guest is currently running
> > > + * and its VMCS02 is loaded.
> > > + */
> > 
> > (Not related to this patch) It might be a good idea to rename this callback
> > instead of this comment, but I am not sure about it.
> > 
> 
> Yes! I was planning to do this on v2 anyway as the name of that function
> is completely misleading/wrong IMHO.
> 
> I think that even the comment inside it that explains that when L1
> doesn't trap WRMSR then L2 TSC writes overwrite L1's TSC is misplaced.
> It should go one or more levels above I believe.
> 
> This function simply
> updates the TSC offset in the current VMCS depending on whether L1 or L2
> is executing. 
> 
> > > +static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
> > >  {
> > >       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > > -     u64 g_tsc_offset = 0;
> > > +     u64 cur_offset = l1_offset;
> > > 
> > >       /*
> > >        * We're here if L1 chose not to trap WRMSR to TSC. According
> > > @@ -1809,11 +1815,19 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >        * to the newly set TSC to get L2's TSC.
> > >        */
> > >       if (is_guest_mode(vcpu) &&
> > > -         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
> > > -             g_tsc_offset = vmcs12->tsc_offset;
> > > +         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
> > > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> > > +                     cur_offset = kvm_compute_02_tsc_offset(
> > > +                                     l1_offset,
> > > +                                     vmcs12->tsc_multiplier,
> > > +                                     vmcs12->tsc_offset);
> > > +             } else {
> > > +                     cur_offset = l1_offset + vmcs12->tsc_offset;
> > > +             }
> > > +     }
> > > 
> > > -     vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
> > > -     return offset + g_tsc_offset;
> > > +     vmcs_write64(TSC_OFFSET, cur_offset);
> > > +     return cur_offset;
> > >  }
> > > 
> > >  /*
> > 
> > This code would be ideal to move to common code as SVM will do basically
> > the same thing.
> > Doesn't have to be done now though.
> > 
> 
> Hmm, how can we do the feature availability checking in common code?

We can add a vendor callback for this.

Just a few thoughts about how I think we can implement
the nested TSC scaling in (mostly) common code:

Assuming that the common code knows that:
1. Nested guest is running (already the case)

2. The format of the scaling multiplier is known 
(thankfully both SVM and VMX use fixed point binary number.

SVM is using 8.32 format and VMX using 16.48 format.

The common code already knows this via
kvm_max_tsc_scaling_ratio/kvm_tsc_scaling_ratio_frac_bits.

3. the value of nested TSC scaling multiplier 
is known to the common code.

(a callback to VMX/SVM code to query the TSC scaling value, 
and have it return 1 when TSC scaling is disabled should work)

Then the common code can do the whole thing, and only
let the SVM/VMX code write the actual multiplier.

As far as I know on the SVM, the TSC scaling works like that:

1. SVM has a CPUID bit to indicate that tsc scaling is supported.
(X86_FEATURE_TSCRATEMSR)

When this bit is set, TSC scale ratio is unconditionally enabled (but
can be just 1), and it is set via a special MSR (MSR_AMD64_TSC_RATIO)
rather than a field in VMCB (someone at AMD did cut corners...).

However since the TSC scaling is only effective when a guest is running,
that MSR can be treated almost as if it was just another VMCB field.

The TSC scale value is 32 bit fraction and another 8 bits the integer value
(as opposed to 48 bit fraction on VMX and 16 bits integer value).

I don't think that there are any other differences.

I should also note that I can do all of the above myself if 
I end up implementing the nested TSC scaling on AMD 
so I don't object much to the way that this patch series is done.

Best regards,
	Maxim Levitsky

> 
> > Best regards,
> >         Maxim Levitsky
> >