kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Intercepting RDTSC instruction by causing a VMEXIT
@ 2021-04-17  4:54 Arnabjyoti Kalita
  2021-04-19 16:46 ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Arnabjyoti Kalita @ 2021-04-17  4:54 UTC (permalink / raw)
  To: kvm

Hello all,

I'm having a requirement to record values obtained by reading tsc clock.

The command line I use to start QEMU in KVM mode is as below -

sudo ./qemu-system-x86_64 -m 1024 --machine pc-i440fx-2.5 -cpu
qemu64,-vme,-x2apic,-kvmclock,+lahf_lm,+3dnowprefetch,+vmx -enable-kvm
-netdev tap,id=tap1,ifname=tap0,script=no,downscript=no -device
virtio-net-pci,netdev=tap1,mac=00:00:00:00:00:00 -drive
file=~/os_images_for_qemu/ubuntu-16.04.server.qcow2,format=qcow2,if=none,id=img-direct
-device virtio-blk-pci,drive=img-direct

I am using QEMU version 2.11.92 and the guest kernel is a
4.4.0-116-generic. I use the CPU model "qemu64" because I have a
requirement to create a snapshot of this guest and load the snapshot
in TCG mode. The generic CPU model helps, in this regard.

Now when the guest is running, I want to intercept all rdtsc
instructions and record the tsc clock values. I know that for this to
happen, the CPU_BASED_RDTSC_EXITING flag needs to exist for the
particular CPU model.

How do I start adding support for causing VMEXIT upon rdtsc execution?

I see that a fairly recent commit in QEMU helps adding nested VMX
controls to named CPU models, but not "qemu64". Can I extend this
commit to add these controls to "qemu64" as well? Will making this
change immediately add support for intercepting VMEXITS for "qemu64"
CPU?

Thank you very much.

Best Regards,
Arnabjyoti Kalita

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Intercepting RDTSC instruction by causing a VMEXIT
  2021-04-17  4:54 Intercepting RDTSC instruction by causing a VMEXIT Arnabjyoti Kalita
@ 2021-04-19 16:46 ` Sean Christopherson
  2021-04-20  2:33   ` Arnabjyoti Kalita
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2021-04-19 16:46 UTC (permalink / raw)
  To: Arnabjyoti Kalita; +Cc: kvm

On Sat, Apr 17, 2021, Arnabjyoti Kalita wrote:
> Hello all,
> 
> I'm having a requirement to record values obtained by reading tsc clock.
> 
> The command line I use to start QEMU in KVM mode is as below -
> 
> sudo ./qemu-system-x86_64 -m 1024 --machine pc-i440fx-2.5 -cpu
> qemu64,-vme,-x2apic,-kvmclock,+lahf_lm,+3dnowprefetch,+vmx -enable-kvm
> -netdev tap,id=tap1,ifname=tap0,script=no,downscript=no -device
> virtio-net-pci,netdev=tap1,mac=00:00:00:00:00:00 -drive
> file=~/os_images_for_qemu/ubuntu-16.04.server.qcow2,format=qcow2,if=none,id=img-direct
> -device virtio-blk-pci,drive=img-direct
> 
> I am using QEMU version 2.11.92 and the guest kernel is a
> 4.4.0-116-generic. I use the CPU model "qemu64" because I have a
> requirement to create a snapshot of this guest and load the snapshot
> in TCG mode. The generic CPU model helps, in this regard.
> 
> Now when the guest is running, I want to intercept all rdtsc
> instructions and record the tsc clock values. I know that for this to
> happen, the CPU_BASED_RDTSC_EXITING flag needs to exist for the
> particular CPU model.
> 
> How do I start adding support for causing VMEXIT upon rdtsc execution?

This requires a KVM change.  The below should do the trick.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c05e6e2854b5..f000728e4319 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2453,7 +2453,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
              CPU_BASED_MWAIT_EXITING |
              CPU_BASED_MONITOR_EXITING |
              CPU_BASED_INVLPG_EXITING |
-             CPU_BASED_RDPMC_EXITING;
+             CPU_BASED_RDPMC_EXITING |
+             CPU_BASED_RDTSC_EXITING;

        opt = CPU_BASED_TPR_SHADOW |
              CPU_BASED_USE_MSR_BITMAPS |
@@ -5194,6 +5195,15 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
        return kvm_skip_emulated_instruction(vcpu);
 }

+static int handle_rdtsc(struct kvm_vcpu *vcpu)
+{
+       u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+
+       kvm_rax_write(vcpu, tsc & -1u);
+       kvm_rdx_write(vcpu, (tsc >> 32) & -1u);
+       return kvm_skip_emulated_instruction(vcpu);
+}
+
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
        if (likely(fasteoi)) {
@@ -5605,6 +5615,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
        [EXIT_REASON_INVD]                    = kvm_emulate_invd,
        [EXIT_REASON_INVLPG]                  = handle_invlpg,
        [EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
+       [EXIT_REASON_RDTSC]                   = handle_rdtsc,
        [EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
        [EXIT_REASON_VMCLEAR]                 = handle_vmx_instruction,
        [EXIT_REASON_VMLAUNCH]                = handle_vmx_instruction,
 
> I see that a fairly recent commit in QEMU helps adding nested VMX
> controls to named CPU models, but not "qemu64". Can I extend this
> commit to add these controls to "qemu64" as well? Will making this
> change immediately add support for intercepting VMEXITS for "qemu64"
> CPU?

Are you actually running a nested guest?

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Intercepting RDTSC instruction by causing a VMEXIT
  2021-04-19 16:46 ` Sean Christopherson
@ 2021-04-20  2:33   ` Arnabjyoti Kalita
       [not found]     ` <CAJGDS+GT1mKHz6K=qHQf54S_97ym=nRP12MfO6OSEOpLYGht=A@mail.gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Arnabjyoti Kalita @ 2021-04-20  2:33 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

Hello Sean,

Thank you very much for your answer. I'm hoping the inlined changes
should be enough to see RDTSC interception.

No, I'm actually not running a nested guest, even though vmx is enabled.

Best Regards,
Arnabjyoti Kalita

On Mon, Apr 19, 2021 at 10:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sat, Apr 17, 2021, Arnabjyoti Kalita wrote:
> > Hello all,
> >
> > I'm having a requirement to record values obtained by reading tsc clock.
> >
> > The command line I use to start QEMU in KVM mode is as below -
> >
> > sudo ./qemu-system-x86_64 -m 1024 --machine pc-i440fx-2.5 -cpu
> > qemu64,-vme,-x2apic,-kvmclock,+lahf_lm,+3dnowprefetch,+vmx -enable-kvm
> > -netdev tap,id=tap1,ifname=tap0,script=no,downscript=no -device
> > virtio-net-pci,netdev=tap1,mac=00:00:00:00:00:00 -drive
> > file=~/os_images_for_qemu/ubuntu-16.04.server.qcow2,format=qcow2,if=none,id=img-direct
> > -device virtio-blk-pci,drive=img-direct
> >
> > I am using QEMU version 2.11.92 and the guest kernel is a
> > 4.4.0-116-generic. I use the CPU model "qemu64" because I have a
> > requirement to create a snapshot of this guest and load the snapshot
> > in TCG mode. The generic CPU model helps, in this regard.
> >
> > Now when the guest is running, I want to intercept all rdtsc
> > instructions and record the tsc clock values. I know that for this to
> > happen, the CPU_BASED_RDTSC_EXITING flag needs to exist for the
> > particular CPU model.
> >
> > How do I start adding support for causing VMEXIT upon rdtsc execution?
>
> This requires a KVM change.  The below should do the trick.
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c05e6e2854b5..f000728e4319 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2453,7 +2453,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>               CPU_BASED_MWAIT_EXITING |
>               CPU_BASED_MONITOR_EXITING |
>               CPU_BASED_INVLPG_EXITING |
> -             CPU_BASED_RDPMC_EXITING;
> +             CPU_BASED_RDPMC_EXITING |
> +             CPU_BASED_RDTSC_EXITING;
>
>         opt = CPU_BASED_TPR_SHADOW |
>               CPU_BASED_USE_MSR_BITMAPS |
> @@ -5194,6 +5195,15 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
>         return kvm_skip_emulated_instruction(vcpu);
>  }
>
> +static int handle_rdtsc(struct kvm_vcpu *vcpu)
> +{
> +       u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> +       kvm_rax_write(vcpu, tsc & -1u);
> +       kvm_rdx_write(vcpu, (tsc >> 32) & -1u);
> +       return kvm_skip_emulated_instruction(vcpu);
> +}
> +
>  static int handle_apic_access(struct kvm_vcpu *vcpu)
>  {
>         if (likely(fasteoi)) {
> @@ -5605,6 +5615,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>         [EXIT_REASON_INVD]                    = kvm_emulate_invd,
>         [EXIT_REASON_INVLPG]                  = handle_invlpg,
>         [EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
> +       [EXIT_REASON_RDTSC]                   = handle_rdtsc,
>         [EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
>         [EXIT_REASON_VMCLEAR]                 = handle_vmx_instruction,
>         [EXIT_REASON_VMLAUNCH]                = handle_vmx_instruction,
>
> > I see that a fairly recent commit in QEMU helps adding nested VMX
> > controls to named CPU models, but not "qemu64". Can I extend this
> > commit to add these controls to "qemu64" as well? Will making this
> > change immediately add support for intercepting VMEXITS for "qemu64"
> > CPU?
>
> Are you actually running a nested guest?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Intercepting RDTSC instruction by causing a VMEXIT
       [not found]     ` <CAJGDS+GT1mKHz6K=qHQf54S_97ym=nRP12MfO6OSEOpLYGht=A@mail.gmail.com>
@ 2021-04-26  1:51       ` Arnabjyoti Kalita
  2021-04-26 16:05       ` Sean Christopherson
  1 sibling, 0 replies; 13+ messages in thread
From: Arnabjyoti Kalita @ 2021-04-26  1:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

Hello Sean,

Thank you very much for your answer again. I could actually see that
the RDTSC handler gets called successfully. I added a "printk"
statement to the handler and I can see the messages being printed out
to the kernel syslog.

However, I am not sure if a VMEXIT is happening in userspace. I use
QEMU and I do not see any notification from QEMU that tells me that a
VMEXIT happened due to RDTSC being executed. Is there a mechanism to
confirm this?

My actual requirement to record tsc values read out as a result of
RDTSC execution still stands.

Best Regards,
Arnabjyoti Kalita


On Mon, Apr 26, 2021 at 7:19 AM Arnabjyoti Kalita
<akalita@cs.stonybrook.edu> wrote:
>
> Hello Sean,
>
> Thank you very much for your answer again. I could actually see that the RDTSC handler gets called successfully. I added a "printk" statement to the handler and I can see the messages being printed out to the kernel syslog.
>
> However, I am not sure if a VMEXIT is happening in userspace. I use QEMU and I do not see any notification from QEMU that tells me that a VMEXIT happened due to RDTSC being executed. Is there a mechanism to confirm this?
>
> My actual requirement to record tsc values read out as a result of RDTSC execution still stands.
>
> Best Regards,
> Arnabjyoti Kalita
>
>
>
>
> On Tue, 20 Apr 2021, 08:03 Arnabjyoti Kalita, <akalita@cs.stonybrook.edu> wrote:
>>
>> Hello Sean,
>>
>> Thank you very much for your answer. I'm hoping the inlined changes
>> should be enough to see RDTSC interception.
>>
>> No, I'm actually not running a nested guest, even though vmx is enabled.
>>
>> Best Regards,
>> Arnabjyoti Kalita
>>
>> On Mon, Apr 19, 2021 at 10:16 PM Sean Christopherson <seanjc@google.com> wrote:
>> >
>> > On Sat, Apr 17, 2021, Arnabjyoti Kalita wrote:
>> > > Hello all,
>> > >
>> > > I'm having a requirement to record values obtained by reading tsc clock.
>> > >
>> > > The command line I use to start QEMU in KVM mode is as below -
>> > >
>> > > sudo ./qemu-system-x86_64 -m 1024 --machine pc-i440fx-2.5 -cpu
>> > > qemu64,-vme,-x2apic,-kvmclock,+lahf_lm,+3dnowprefetch,+vmx -enable-kvm
>> > > -netdev tap,id=tap1,ifname=tap0,script=no,downscript=no -device
>> > > virtio-net-pci,netdev=tap1,mac=00:00:00:00:00:00 -drive
>> > > file=~/os_images_for_qemu/ubuntu-16.04.server.qcow2,format=qcow2,if=none,id=img-direct
>> > > -device virtio-blk-pci,drive=img-direct
>> > >
>> > > I am using QEMU version 2.11.92 and the guest kernel is a
>> > > 4.4.0-116-generic. I use the CPU model "qemu64" because I have a
>> > > requirement to create a snapshot of this guest and load the snapshot
>> > > in TCG mode. The generic CPU model helps, in this regard.
>> > >
>> > > Now when the guest is running, I want to intercept all rdtsc
>> > > instructions and record the tsc clock values. I know that for this to
>> > > happen, the CPU_BASED_RDTSC_EXITING flag needs to exist for the
>> > > particular CPU model.
>> > >
>> > > How do I start adding support for causing VMEXIT upon rdtsc execution?
>> >
>> > This requires a KVM change.  The below should do the trick.
>> >
>> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> > index c05e6e2854b5..f000728e4319 100644
>> > --- a/arch/x86/kvm/vmx/vmx.c
>> > +++ b/arch/x86/kvm/vmx/vmx.c
>> > @@ -2453,7 +2453,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>> >               CPU_BASED_MWAIT_EXITING |
>> >               CPU_BASED_MONITOR_EXITING |
>> >               CPU_BASED_INVLPG_EXITING |
>> > -             CPU_BASED_RDPMC_EXITING;
>> > +             CPU_BASED_RDPMC_EXITING |
>> > +             CPU_BASED_RDTSC_EXITING;
>> >
>> >         opt = CPU_BASED_TPR_SHADOW |
>> >               CPU_BASED_USE_MSR_BITMAPS |
>> > @@ -5194,6 +5195,15 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
>> >         return kvm_skip_emulated_instruction(vcpu);
>> >  }
>> >
>> > +static int handle_rdtsc(struct kvm_vcpu *vcpu)
>> > +{
>> > +       u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> > +
>> > +       kvm_rax_write(vcpu, tsc & -1u);
>> > +       kvm_rdx_write(vcpu, (tsc >> 32) & -1u);
>> > +       return kvm_skip_emulated_instruction(vcpu);
>> > +}
>> > +
>> >  static int handle_apic_access(struct kvm_vcpu *vcpu)
>> >  {
>> >         if (likely(fasteoi)) {
>> > @@ -5605,6 +5615,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>> >         [EXIT_REASON_INVD]                    = kvm_emulate_invd,
>> >         [EXIT_REASON_INVLPG]                  = handle_invlpg,
>> >         [EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
>> > +       [EXIT_REASON_RDTSC]                   = handle_rdtsc,
>> >         [EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
>> >         [EXIT_REASON_VMCLEAR]                 = handle_vmx_instruction,
>> >         [EXIT_REASON_VMLAUNCH]                = handle_vmx_instruction,
>> >
>> > > I see that a fairly recent commit in QEMU helps adding nested VMX
>> > > controls to named CPU models, but not "qemu64". Can I extend this
>> > > commit to add these controls to "qemu64" as well? Will making this
>> > > change immediately add support for intercepting VMEXITS for "qemu64"
>> > > CPU?
>> >
>> > Are you actually running a nested guest?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Intercepting RDTSC instruction by causing a VMEXIT
       [not found]     ` <CAJGDS+GT1mKHz6K=qHQf54S_97ym=nRP12MfO6OSEOpLYGht=A@mail.gmail.com>
  2021-04-26  1:51       ` Arnabjyoti Kalita
@ 2021-04-26 16:05       ` Sean Christopherson
  2021-04-27  2:19         ` Arnabjyoti Kalita
  1 sibling, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2021-04-26 16:05 UTC (permalink / raw)
  To: Arnabjyoti Kalita; +Cc: kvm

On Mon, Apr 26, 2021, Arnabjyoti Kalita wrote:
> Hello Sean,
> 
> Thank you very much for your answer again. I could actually see that the
> RDTSC handler gets called successfully. I added a "printk" statement to the
> handler and I can see the messages being printed out to the kernel syslog.
> 
> However, I am not sure if a VMEXIT is happening in userspace. I use QEMU
> and I do not see any notification from QEMU that tells me that a VMEXIT
> happened due to RDTSC being executed. Is there a mechanism to confirm this?

Without further modification, KVM will _not_ exit to userspace in this case.

> My actual requirement to record tsc values read out as a result of RDTSC
> execution still stands.

Your requirement didn't clarify that userspace needed to record the values ;-)

Forcing an exit to userspace is very doable, but will tank guest performance and
possibly interfere with whatever you're trying to do.  I would try adding a
tracepoint and using that to capture the TSC values.  Adding guest RIP, etc...
as needed is trivial.

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a61c015870e3..e962e813ba04 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -821,6 +821,23 @@ TRACE_EVENT(
                  __entry->gpa_match ? "GPA" : "GVA")
 );

+TRACE_EVENT(kvm_emulate_rdtsc,
+       TP_PROTO(unsigned int vcpu_id, __u64 tsc),
+       TP_ARGS(vcpu_id, tsc),
+
+       TP_STRUCT__entry(
+               __field( unsigned int,  vcpu_id         )
+               __field(        __u64,  tsc             )
+       ),
+
+       TP_fast_assign(
+               __entry->vcpu_id                = vcpu_id;
+               __entry->tsc                    = tsc;
+       ),
+
+       TP_printk("vcpu=%u tsc=%llu", __entry->vcpu_id, __entry->tsc)
+);
+
 TRACE_EVENT(kvm_write_tsc_offset,
        TP_PROTO(unsigned int vcpu_id, __u64 previous_tsc_offset,
                 __u64 next_tsc_offset),
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 61bf0b86770b..1fbeef520349 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5254,6 +5254,8 @@ static int handle_rdtsc(struct kvm_vcpu *vcpu)
 {
        u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());

+       trace_kvm_emulate_rdtsc(vcpu->vcpu_id, tsc);
+
        kvm_rax_write(vcpu, tsc & -1u);
        kvm_rdx_write(vcpu, (tsc >> 32) & -1u);
        return kvm_skip_emulated_instruction(vcpu);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Intercepting RDTSC instruction by causing a VMEXIT
  2021-04-26 16:05       ` Sean Christopherson
@ 2021-04-27  2:19         ` Arnabjyoti Kalita
  2021-04-30 17:16           ` Arnabjyoti Kalita
  0 siblings, 1 reply; 13+ messages in thread
From: Arnabjyoti Kalita @ 2021-04-27  2:19 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

Hello Sean,

Thank you very much again. I didn't know about the tracepointing
methodology. This is very helpful, indeed.

In addition to my requirement of recording tsc values, I am trying to
track when the VMEXITs happen in userspace(QEMU). This allows me to
correlate recorded TSC values with the VMEXIT sequence. So, causing a
VMEXIT in userspace is absolutely essential to me and I would love to
have some pointers on how to do it.

I run a server within the guest and I'm okay with performance dropping
down to within a factor of 2 compared to running the server on native
hardware. I run this experiment for about 20-30 seconds.

Best Regards,
Arnabjyoti Kalita


On Mon, Apr 26, 2021 at 9:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 26, 2021, Arnabjyoti Kalita wrote:
> > Hello Sean,
> >
> > Thank you very much for your answer again. I could actually see that the
> > RDTSC handler gets called successfully. I added a "printk" statement to the
> > handler and I can see the messages being printed out to the kernel syslog.
> >
> > However, I am not sure if a VMEXIT is happening in userspace. I use QEMU
> > and I do not see any notification from QEMU that tells me that a VMEXIT
> > happened due to RDTSC being executed. Is there a mechanism to confirm this?
>
> Without further modification, KVM will _not_ exit to userspace in this case.
>
> > My actual requirement to record tsc values read out as a result of RDTSC
> > execution still stands.
>
> Your requirement didn't clarify that userspace needed to record the values ;-)
>
> Forcing an exit to userspace is very doable, but will tank guest performance and
> possibly interfere with whatever you're trying to do.  I would try adding a
> tracepoint and using that to capture the TSC values.  Adding guest RIP, etc...
> as needed is trivial.
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index a61c015870e3..e962e813ba04 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -821,6 +821,23 @@ TRACE_EVENT(
>                   __entry->gpa_match ? "GPA" : "GVA")
>  );
>
> +TRACE_EVENT(kvm_emulate_rdtsc,
> +       TP_PROTO(unsigned int vcpu_id, __u64 tsc),
> +       TP_ARGS(vcpu_id, tsc),
> +
> +       TP_STRUCT__entry(
> +               __field( unsigned int,  vcpu_id         )
> +               __field(        __u64,  tsc             )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->vcpu_id                = vcpu_id;
> +               __entry->tsc                    = tsc;
> +       ),
> +
> +       TP_printk("vcpu=%u tsc=%llu", __entry->vcpu_id, __entry->tsc)
> +);
> +
>  TRACE_EVENT(kvm_write_tsc_offset,
>         TP_PROTO(unsigned int vcpu_id, __u64 previous_tsc_offset,
>                  __u64 next_tsc_offset),
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 61bf0b86770b..1fbeef520349 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5254,6 +5254,8 @@ static int handle_rdtsc(struct kvm_vcpu *vcpu)
>  {
>         u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>
> +       trace_kvm_emulate_rdtsc(vcpu->vcpu_id, tsc);
> +
>         kvm_rax_write(vcpu, tsc & -1u);
>         kvm_rdx_write(vcpu, (tsc >> 32) & -1u);
>         return kvm_skip_emulated_instruction(vcpu);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Intercepting RDTSC instruction by causing a VMEXIT
  2021-04-27  2:19         ` Arnabjyoti Kalita
@ 2021-04-30 17:16           ` Arnabjyoti Kalita
  2024-01-01 22:06             ` obtain the timestamp counter of physical/host machine inside the VMs Tao Lyu
  0 siblings, 1 reply; 13+ messages in thread
From: Arnabjyoti Kalita @ 2021-04-30 17:16 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

Hello Sean,

Thank you for all your help. I've now managed to cause a VMEXIT in the
guest. I've also identified which guest registers to access to read
and record the tsc counter value when the vmexit handler is called in
userspace.

Best Regards,
Arnabjyoti Kalita

On Tue, Apr 27, 2021 at 7:49 AM Arnabjyoti Kalita
<akalita@cs.stonybrook.edu> wrote:
>
> Hello Sean,
>
> Thank you very much again. I didn't know about the tracepointing
> methodology. This is very helpful, indeed.
>
> In addition to my requirement of recording tsc values, I am trying to
> track when the VMEXITs happen in userspace(QEMU). This allows me to
> correlate recorded TSC values with the VMEXIT sequence. So, causing a
> VMEXIT in userspace is absolutely essential to me and I would love to
> have some pointers on how to do it.
>
> I run a server within the guest and I'm okay with performance dropping
> down to within a factor of 2 compared to running the server on native
> hardware. I run this experiment for about 20-30 seconds.
>
> Best Regards,
> Arnabjyoti Kalita
>
>
> On Mon, Apr 26, 2021 at 9:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Apr 26, 2021, Arnabjyoti Kalita wrote:
> > > Hello Sean,
> > >
> > > Thank you very much for your answer again. I could actually see that the
> > > RDTSC handler gets called successfully. I added a "printk" statement to the
> > > handler and I can see the messages being printed out to the kernel syslog.
> > >
> > > However, I am not sure if a VMEXIT is happening in userspace. I use QEMU
> > > and I do not see any notification from QEMU that tells me that a VMEXIT
> > > happened due to RDTSC being executed. Is there a mechanism to confirm this?
> >
> > Without further modification, KVM will _not_ exit to userspace in this case.
> >
> > > My actual requirement to record tsc values read out as a result of RDTSC
> > > execution still stands.
> >
> > Your requirement didn't clarify that userspace needed to record the values ;-)
> >
> > Forcing an exit to userspace is very doable, but will tank guest performance and
> > possibly interfere with whatever you're trying to do.  I would try adding a
> > tracepoint and using that to capture the TSC values.  Adding guest RIP, etc...
> > as needed is trivial.
> >
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index a61c015870e3..e962e813ba04 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -821,6 +821,23 @@ TRACE_EVENT(
> >                   __entry->gpa_match ? "GPA" : "GVA")
> >  );
> >
> > +TRACE_EVENT(kvm_emulate_rdtsc,
> > +       TP_PROTO(unsigned int vcpu_id, __u64 tsc),
> > +       TP_ARGS(vcpu_id, tsc),
> > +
> > +       TP_STRUCT__entry(
> > +               __field( unsigned int,  vcpu_id         )
> > +               __field(        __u64,  tsc             )
> > +       ),
> > +
> > +       TP_fast_assign(
> > +               __entry->vcpu_id                = vcpu_id;
> > +               __entry->tsc                    = tsc;
> > +       ),
> > +
> > +       TP_printk("vcpu=%u tsc=%llu", __entry->vcpu_id, __entry->tsc)
> > +);
> > +
> >  TRACE_EVENT(kvm_write_tsc_offset,
> >         TP_PROTO(unsigned int vcpu_id, __u64 previous_tsc_offset,
> >                  __u64 next_tsc_offset),
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 61bf0b86770b..1fbeef520349 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5254,6 +5254,8 @@ static int handle_rdtsc(struct kvm_vcpu *vcpu)
> >  {
> >         u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> >
> > +       trace_kvm_emulate_rdtsc(vcpu->vcpu_id, tsc);
> > +
> >         kvm_rax_write(vcpu, tsc & -1u);
> >         kvm_rdx_write(vcpu, (tsc >> 32) & -1u);
> >         return kvm_skip_emulated_instruction(vcpu);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* obtain the timestamp counter of physical/host machine inside the VMs.
  2021-04-30 17:16           ` Arnabjyoti Kalita
@ 2024-01-01 22:06             ` Tao Lyu
  2024-01-02  5:11               ` Dongli Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Lyu @ 2024-01-01 22:06 UTC (permalink / raw)
  To: akalita; +Cc: kvm, seanjc

Hello Arnabjyoti, Sean, and everyone,

I'm having a similiar but slightly differnt issue about the rdtsc in KVM.

I want to obtain the timestamp counter of physical/host machine inside the VMs.

Acccording to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
I specify the correspoding parameters in the qemu arguments.
The booting command is listed below:

qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"


But the rdtsc still returns the adjusted tsc.
The vmxcap script shows the TSC settings as below:
  
  Use TSC offsetting                       no
  RDTSC exiting                            no
  Enable RDTSCP                            no
  TSC scaling                              yes


I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine insdie the VM.

Thanks a lot.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: obtain the timestamp counter of physical/host machine inside the VMs.
  2024-01-01 22:06             ` obtain the timestamp counter of physical/host machine inside the VMs Tao Lyu
@ 2024-01-02  5:11               ` Dongli Zhang
  2024-01-02 10:10                 ` Tao Lyu
  0 siblings, 1 reply; 13+ messages in thread
From: Dongli Zhang @ 2024-01-02  5:11 UTC (permalink / raw)
  To: Tao Lyu; +Cc: kvm, seanjc, akalita

Hi Tao,

On 1/1/24 14:06, Tao Lyu wrote:
> Hello Arnabjyoti, Sean, and everyone,
> 
> I'm having a similiar but slightly differnt issue about the rdtsc in KVM.
> 
> I want to obtain the timestamp counter of physical/host machine inside the VMs.
> 
> Acccording to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
> I specify the correspoding parameters in the qemu arguments.
> The booting command is listed below:
> 
> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
> 
> 
> But the rdtsc still returns the adjusted tsc.
> The vmxcap script shows the TSC settings as below:
>   
>   Use TSC offsetting                       no
>   RDTSC exiting                            no
>   Enable RDTSCP                            no
>   TSC scaling                              yes
> 
> 
> I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine insdie the VM.

If the objective is to obtain the same tsc at both VM and host side (that is, to
avoid any offset or scaling), I can obtain quite close tsc at both VM and host
side with the below linux-6.6 change.

My env does not use tsc scaling.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce50..b102dcd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64
data)
        bool synchronizing = false;

        raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-       offset = kvm_compute_l1_tsc_offset(vcpu, data);
+       offset = 0;
        ns = get_kvmclock_base_ns();
        elapsed = ns - kvm->arch.last_tsc_nsec;

Dongli Zhang

> 
> Thanks a lot.
> 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: obtain the timestamp counter of physical/host machine inside the VMs.
  2024-01-02  5:11               ` Dongli Zhang
@ 2024-01-02 10:10                 ` Tao Lyu
  2024-01-02 17:53                   ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Lyu @ 2024-01-02 10:10 UTC (permalink / raw)
  To: Dongli Zhang; +Cc: kvm, seanjc, akalita


Hi Dongli,

> On 1/1/24 14:06, Tao Lyu wrote:
>> Hello Arnabjyoti, Sean, and everyone,
>> 
>> I'm having a similiar but slightly differnt issue about the rdtsc in KVM.
>> 
>> I want to obtain the timestamp counter of physical/host machine inside the VMs.
>> 
>> Acccording to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
>> I specify the correspoding parameters in the qemu arguments.
>> The booting command is listed below:
>> 
>> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off  -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
>> 
>> 
>> But the rdtsc still returns the adjusted tsc.
>> The vmxcap script shows the TSC settings as below:
>>   
>>   Use TSC offsetting                       no
>>   RDTSC exiting                            no
>>   Enable RDTSCP                            no
>>   TSC scaling                              yes
>> 
>> 
>> I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine insdie the VM.

> If the objective is to obtain the same tsc at both VM and host side (that is, to
> avoid any offset or scaling), I can obtain quite close tsc at both VM and host
> side with the below linux-6.6 change.

> My env does not use tsc scaling.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce50..b102dcd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
>@@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64
data)
>        bool synchronizing = false;
>
>       raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
> +       offset = 0;
>        ns = get_kvmclock_base_ns();
>        elapsed = ns - kvm->arch.last_tsc_nsec;
>
> Dongli Zhang


Hi Dongli,

Thank you so much for the explanation and for providing a patch.
It works for me now.

Wish you a happy new year.

Best,
Tao

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: obtain the timestamp counter of physical/host machine inside the VMs.
  2024-01-02 10:10                 ` Tao Lyu
@ 2024-01-02 17:53                   ` Sean Christopherson
  2024-01-02 18:20                     ` Tao Lyu
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2024-01-02 17:53 UTC (permalink / raw)
  To: Tao Lyu; +Cc: Dongli Zhang, kvm, akalita

On Tue, Jan 02, 2024, Tao Lyu wrote:
> 
> Hi Dongli,
> 
> > On 1/1/24 14:06, Tao Lyu wrote:
> >> Hello Arnabjyoti, Sean, and everyone,
> >> 
> >> I'm having a similiar but slightly differnt issue about the rdtsc in KVM.
> >> 
> >> I want to obtain the timestamp counter of physical/host machine inside the VMs.
> >> 
> >> Acccording to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
> >> I specify the correspoding parameters in the qemu arguments.
> >> The booting command is listed below:
> >> 
> >> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off  -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
> >> 
> >> 
> >> But the rdtsc still returns the adjusted tsc.
> >> The vmxcap script shows the TSC settings as below:
> >>   
> >>   Use TSC offsetting                       no
> >>   RDTSC exiting                            no
> >>   Enable RDTSCP                            no
> >>   TSC scaling                              yes
> >> 
> >> 
> >> I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine insdie the VM.
> 
> > If the objective is to obtain the same tsc at both VM and host side (that is, to
> > avoid any offset or scaling), I can obtain quite close tsc at both VM and host
> > side with the below linux-6.6 change.
> 
> > My env does not use tsc scaling.
> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 41cce50..b102dcd 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> >@@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64
> data)
> >        bool synchronizing = false;
> >
> >       raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> > -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
> > +       offset = 0;
> >        ns = get_kvmclock_base_ns();
> >        elapsed = ns - kvm->arch.last_tsc_nsec;
> >
> > Dongli Zhang
> 
> 
> Hi Dongli,
> 
> Thank you so much for the explanation and for providing a patch.
> It works for me now.

Yeah, during vCPU creation KVM sets a target guest TSC of '0', i.e. sets the TSC
offset to "0 - HOST_TSC".  As of commit 828ca89628bf ("KVM: x86: Expose TSC offset
controls to userspace"), userspace can explicitly set an offset of '0' via
KVM_VCPU_TSC_CTRL+KVM_VCPU_TSC_OFFSET, but AFAIK QEMU doesn't support that API.

All the other methods for setting the TSC offset are indirect, i.e. userspace
provides the target TSC and KVM computes the offset.  So even if QEMU provides a
way to specify an explicit TSC (or offset), there will be a healthy amount of slop.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: obtain the timestamp counter of physical/host machine inside the VMs.
  2024-01-02 17:53                   ` Sean Christopherson
@ 2024-01-02 18:20                     ` Tao Lyu
  2024-01-02 19:36                       ` Dongli Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Lyu @ 2024-01-02 18:20 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Dongli Zhang, kvm

>> 
>> Hi Dongli,
>> 
>> > On 1/1/24 14:06, Tao Lyu wrote:
>> >> Hello Arnabjyoti, Sean, and everyone,
>> >> 
>> >> I'm having a similiar but slightly differnt issue about the rdtsc in KVM.
>> >> 
>> >> I want to obtain the timestamp counter of physical/host machine inside the VMs.
>> >> 
>> >> Acccording to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
>> >> I specify the correspoding parameters in the qemu arguments.
>> >> The booting command is listed below:
>> >> 
>> >> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off   -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
>> >> 
>> >> 
>> >> But the rdtsc still returns the adjusted tsc.
>> >> The vmxcap script shows the TSC settings as below:
>> >>   
>> >>   Use TSC offsetting                       no
>> >>   RDTSC exiting                            no
>> >>   Enable RDTSCP                            no
>> >>   TSC scaling                              yes
>> >> 
>> >> 
>> >> I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine insdie the VM.
>> 
>> > If the objective is to obtain the same tsc at both VM and host side (that is, to
>> > avoid any offset or scaling), I can obtain quite close tsc at both VM and host
>> > side with the below linux-6.6 change.
>> 
>> > My env does not use tsc scaling.
>> 
>> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> > index 41cce50..b102dcd 100644
>> > --- a/arch/x86/kvm/x86.c
>> > +++ b/arch/x86/kvm/x86.c
>> >@@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64
>> data)
>> >        bool synchronizing = false;
>> >
>> >       raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>> > -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
>> > +       offset = 0;
>> >        ns = get_kvmclock_base_ns();
>> >        elapsed = ns - kvm->arch.last_tsc_nsec;
>> >
>> > Dongli Zhang
>> 
>> 
>> Hi Dongli,
>> 
>> Thank you so much for the explanation and for providing a patch.
>> It works for me now.
>
>Yeah, during vCPU creation KVM sets a target guest TSC of '0', i.e. sets the TSC
>offset to "0 - HOST_TSC".  As of commit 828ca89628bf ("KVM: x86: Expose TSC offset
>controls to userspace"), userspace can explicitly set an offset of '0' via
>KVM_VCPU_TSC_CTRL+KVM_VCPU_TSC_OFFSET, but AFAIK QEMU doesn't support that API.
>
>All the other methods for setting the TSC offset are indirect, i.e. userspace
>provides the target TSC and KVM computes the offset.  So even if QEMU provides a
way to specify an explicit TSC (or offset), there will be a healthy amount of slop.


Hi Sean and Dongli,

Thank you so much for the replies.

Unfortunately, after I adding the following patch to reset the TSC OFFSET forcefully,
I can get the host TSC value from guest.

However, when booting the host kernel, it has the following WARNINGS:


[  113.033750] ------------[ cut here ]------------
[  113.033768] NETDEV WATCHDOG: enxb03af61ad78a (rndis_host): transmit queue 0 timed out
[  113.033802] WARNING: CPU: 42 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x264/0x270
[  113.033829] Modules linked in: nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio socwatch2_15(OE) vtsspp(OE) vhost_net vhost vhost_iotlb tap sep5(OE) socperf3(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc overlay cuse pax(OE) ipmi_ssif zram intel_rapl_msr intel_rapl_common i10nm_edac x86_pkg_temp_thermal intel_powerclamp coretemp joydev input_leds nls_iso8859_1 kvm_intel ast hid_generic drm_vram_helper drm_ttm_helper kvm rndis_host ttm usbhid cdc_ether usbnet hid drm_kms_helper mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd cec i2c_algo_bit rapl fb_sys_fops syscopyarea intel_cstate sysfillrect i40e sysimgblt isst_if_mbox_pci mei_me ioatdma ahci
[  113.034307]  isst_if_mmio i2c_i801 mei intel_pch_thermal isst_if_common acpi_ipmi libahci dca i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler nfit acpi_pad acpi_power_meter mac_hid sch_fq_codel binfmt_misc ramoops drm reed_solomon efi_pstore sunrpc ip_tables x_tables autofs4
[  113.034473] CPU: 42 PID: 0 Comm: swapper/42 Kdump: loaded Tainted: G           OE     5.15.0+ #4
[  113.034486] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
[  113.034495] RIP: 0010:dev_watchdog+0x264/0x270
[  113.034511] Code: eb a6 48 8b 5d d0 c6 05 e6 47 0a 01 01 48 89 df e8 91 c4 f9 ff 44 89 e1 48 89 de 48 c7 c7 68 c2 a8 a9 48 89 c2 e8 90 3e 16 00 <0f> 0b eb 83 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
[  113.034522] RSP: 0018:ffffa6e88d79ce78 EFLAGS: 00010286
[  113.034541] RAX: 0000000000000000 RBX: ffff959763ddb000 RCX: 000000000000083f
[  113.034551] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[  113.034559] RBP: ffffa6e88d79ceb0 R08: 0000000000000000 R09: ffffa6e88d79cc60
[  113.034565] R10: ffffa6e88d79cc58 R11: ffff96163ff26c28 R12: 0000000000000000
[  113.034572] R13: ffff95976488ac80 R14: 0000000000000001 R15: ffff959763ddb4c0
[  113.034579] FS:  0000000000000000(0000) GS:ffff961541980000(0000) knlGS:0000000000000000
[  113.034588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  113.034595] CR2: 000055c5819e40f8 CR3: 0000004740c0a006 CR4: 0000000000772ee0
[  113.034604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  113.034614] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  113.034620] PKRU: 55555554
[  113.034629] Call Trace:
[  113.034636]  <IRQ>
[  113.034662]  call_timer_fn+0x29/0x100
[  113.034680]  __run_timers.part.0+0x1cf/0x240
[  113.034757]  run_timer_softirq+0x2a/0x50
[  113.034768]  __do_softirq+0xcb/0x274
[  113.034790]  irq_exit_rcu+0x8c/0xb0
[  113.034807]  sysvec_apic_timer_interrupt+0x7c/0x90
[  113.034823]  </IRQ>
[  113.034828]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  113.034841] RIP: 0010:cpuidle_enter_state+0xcc/0x360
[  113.034861] Code: 3d c1 f7 26 57 e8 c4 09 74 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 35 15 74 ff 80 7d d7 00 0f 85 01 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 0d 01 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89
[  113.034870] RSP: 0018:ffffa6e8809b7e68 EFLAGS: 00000246
[  113.034884] RAX: ffff9615419a8dc0 RBX: 0000000000000002 RCX: 000000000000001f
[  113.034892] RDX: 0000000000000000 RSI: 000000003158cc4a RDI: 0000000000000000
[  113.034900] RBP: ffffa6e8809b7ea0 R08: 0000001a5155ea28 R09: 0000000000000018
[  113.034909] R10: 000000000006e15c R11: ffffffffa9e4b960 R12: ffffc6e87a991800
[  113.034917] R13: ffffffffa9e4b960 R14: 0000001a5155ea28 R15: 0000000000000002
[  113.034942]  cpuidle_enter+0x2e/0x40
[  113.034953]  do_idle+0x1ff/0x2a0
[  113.034966]  cpu_startup_entry+0x20/0x30
[  113.034979]  start_secondary+0x11a/0x150
[  113.034991]  secondary_startup_64_no_verify+0xb0/0xbb
[  113.035008] ---[ end trace f39ffcbabd5dfe2e ]---

[  533.511262] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 89916ns, attempt 4, marking unstable
[  533.511295] tsc: Marking TSC unstable due to clocksource watchdog
[  533.511336] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[  533.511339] sched_clock: Marking unstable (533409196195, 102131418)<-(533549406780, -38078705)
[  533.512368] clocksource: Checking clocksource tsc synchronization from CPU 35 to CPUs 0,3,21-22,36,50,54.
[  533.513146] clocksource: Switched to clocksource hpet


And after a while, the guest kernel will have the following  error, and then the network doesn't work anymore.
If I reboot the guest VM, then it will stuck and cannot be rebooted successfully.

rcu: INFO: rcu_sched self-detected stall on CPU
[  336.374152] rcu: 	3-...!: (1 GPs behind) idle=bb3/0/0x1 softirq=3087/3087 fqs=0 
[  336.379018] rcu: rcu_sched kthread timer wakeup didn't happen for 39086 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  336.386045] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=871
[  336.390353] rcu: rcu_sched kthread starved for 39089 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[  336.395881] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  336.400375] rcu: RCU grace-period kthread stack dump:
[  336.404091] rcu: Stack dump where RCU GP kthread last ran:
[  566.795685] rcu: INFO: rcu_sched self-detected stall on CPU
[  566.799315] rcu: 	3-...!: (1 ticks this GP) idle=c65/0/0x1 softirq=3088/3088 fqs=1 
[  566.804170] rcu: rcu_sched kthread timer wakeup didn't happen for 229687 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  566.811259] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=872
[  566.815579] rcu: rcu_sched kthread starved for 229690 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[  566.821190] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  566.825813] rcu: RCU grace-period kthread stack dump:
[  566.829513] rcu: Stack dump where RCU GP kthread last ran:


Looks like it leads to kernel misbehavior if we don't adjust the guest TSC value.

Our goal is to get the almost synchronized TSC value among KVM VMs one the same host.

Now I fix the host CPU frequency. Then the TSC OFFFSET, which can be read under "/sys/kernel/debug/kvm/qemu-PID/vcpu0/tsc-offset", will always keep constant.

Every time when I execute the rdtsc inside the guest, I will subtract the offset to it get the TSC value, which can be close to the host TSC value.

Do you think this makes sense?

Thank you in advance

Best,
Tao

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: obtain the timestamp counter of physical/host machine inside the VMs.
  2024-01-02 18:20                     ` Tao Lyu
@ 2024-01-02 19:36                       ` Dongli Zhang
  0 siblings, 0 replies; 13+ messages in thread
From: Dongli Zhang @ 2024-01-02 19:36 UTC (permalink / raw)
  To: Tao Lyu; +Cc: kvm, Sean Christopherson

Hi Tao,

On 1/2/24 10:20, Tao Lyu wrote:
>>>
>>> Hi Dongli,
>>>
>>>> On 1/1/24 14:06, Tao Lyu wrote:
>>>>> Hello Arnabjyoti, Sean, and everyone,
>>>>>
>>>>> I'm having a similiar but slightly differnt issue about the rdtsc in KVM.
>>>>>
>>>>> I want to obtain the timestamp counter of physical/host machine inside the VMs.
>>>>>
>>>>> Acccording to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
>>>>> I specify the correspoding parameters in the qemu arguments.
>>>>> The booting command is listed below:
>>>>>
>>>>> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off   -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
>>>>>
>>>>>
>>>>> But the rdtsc still returns the adjusted tsc.
>>>>> The vmxcap script shows the TSC settings as below:
>>>>>    
>>>>>    Use TSC offsetting                       no
>>>>>    RDTSC exiting                            no
>>>>>    Enable RDTSCP                            no
>>>>>    TSC scaling                              yes
>>>>>
>>>>>
>>>>> I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine insdie the VM.
>>>
>>>> If the objective is to obtain the same tsc at both VM and host side (that is, to
>>>> avoid any offset or scaling), I can obtain quite close tsc at both VM and host
>>>> side with the below linux-6.6 change.
>>>
>>>> My env does not use tsc scaling.
>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 41cce50..b102dcd 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64
>>> data)
>>>>         bool synchronizing = false;
>>>>
>>>>        raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>>>> -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
>>>> +       offset = 0;
>>>>         ns = get_kvmclock_base_ns();
>>>>         elapsed = ns - kvm->arch.last_tsc_nsec;
>>>>
>>>> Dongli Zhang
>>>
>>>
>>> Hi Dongli,
>>>
>>> Thank you so much for the explanation and for providing a patch.
>>> It works for me now.
>>
>> Yeah, during vCPU creation KVM sets a target guest TSC of '0', i.e. sets the TSC
>> offset to "0 - HOST_TSC".  As of commit 828ca89628bf ("KVM: x86: Expose TSC offset
>> controls to userspace"), userspace can explicitly set an offset of '0' via
>> KVM_VCPU_TSC_CTRL+KVM_VCPU_TSC_OFFSET, but AFAIK QEMU doesn't support that API.
>>
>> All the other methods for setting the TSC offset are indirect, i.e. userspace
>> provides the target TSC and KVM computes the offset.  So even if QEMU provides a
> way to specify an explicit TSC (or offset), there will be a healthy amount of slop.
> 
> 
> Hi Sean and Dongli,
> 
> Thank you so much for the replies.
> 
> Unfortunately, after I adding the following patch to reset the TSC OFFSET forcefully,
> I can get the host TSC value from guest.
> 
> However, when booting the host kernel, it has the following WARNINGS:

My test patch will not impact the host time, when booting the host kernel. It
will not take effect until the VM is created.

Therefore, I guess the below is due to other reasons in your host kernel.

> 
> 
> [  113.033750] ------------[ cut here ]------------
> [  113.033768] NETDEV WATCHDOG: enxb03af61ad78a (rndis_host): transmit queue 0 timed out
> [  113.033802] WARNING: CPU: 42 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x264/0x270
> [  113.033829] Modules linked in: nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio socwatch2_15(OE) vtsspp(OE) vhost_net vhost vhost_iotlb tap sep5(OE) socperf3(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc overlay cuse pax(OE) ipmi_ssif zram intel_rapl_msr intel_rapl_common i10nm_edac x86_pkg_temp_thermal intel_powerclamp coretemp joydev input_leds nls_iso8859_1 kvm_intel ast hid_generic drm_vram_helper drm_ttm_helper kvm rndis_host ttm usbhid cdc_ether usbnet hid drm_kms_helper mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd cec i2c_algo_bit rapl fb_sys_fops syscopyarea intel_cstate sysfillrect i40e sysimgblt isst_if_mbox_pci mei_me ioatdma ahci
> [  113.034307]  isst_if_mmio i2c_i801 mei intel_pch_thermal isst_if_common acpi_ipmi libahci dca i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler nfit acpi_pad acpi_power_meter mac_hid sch_fq_codel binfmt_misc ramoops drm reed_solomon efi_pstore sunrpc ip_tables x_tables autofs4
> [  113.034473] CPU: 42 PID: 0 Comm: swapper/42 Kdump: loaded Tainted: G           OE     5.15.0+ #4
> [  113.034486] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
> [  113.034495] RIP: 0010:dev_watchdog+0x264/0x270
> [  113.034511] Code: eb a6 48 8b 5d d0 c6 05 e6 47 0a 01 01 48 89 df e8 91 c4 f9 ff 44 89 e1 48 89 de 48 c7 c7 68 c2 a8 a9 48 89 c2 e8 90 3e 16 00 <0f> 0b eb 83 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
> [  113.034522] RSP: 0018:ffffa6e88d79ce78 EFLAGS: 00010286
> [  113.034541] RAX: 0000000000000000 RBX: ffff959763ddb000 RCX: 000000000000083f
> [  113.034551] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
> [  113.034559] RBP: ffffa6e88d79ceb0 R08: 0000000000000000 R09: ffffa6e88d79cc60
> [  113.034565] R10: ffffa6e88d79cc58 R11: ffff96163ff26c28 R12: 0000000000000000
> [  113.034572] R13: ffff95976488ac80 R14: 0000000000000001 R15: ffff959763ddb4c0
> [  113.034579] FS:  0000000000000000(0000) GS:ffff961541980000(0000) knlGS:0000000000000000
> [  113.034588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  113.034595] CR2: 000055c5819e40f8 CR3: 0000004740c0a006 CR4: 0000000000772ee0
> [  113.034604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  113.034614] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  113.034620] PKRU: 55555554
> [  113.034629] Call Trace:
> [  113.034636]  <IRQ>
> [  113.034662]  call_timer_fn+0x29/0x100
> [  113.034680]  __run_timers.part.0+0x1cf/0x240
> [  113.034757]  run_timer_softirq+0x2a/0x50
> [  113.034768]  __do_softirq+0xcb/0x274
> [  113.034790]  irq_exit_rcu+0x8c/0xb0
> [  113.034807]  sysvec_apic_timer_interrupt+0x7c/0x90
> [  113.034823]  </IRQ>
> [  113.034828]  asm_sysvec_apic_timer_interrupt+0x12/0x20
> [  113.034841] RIP: 0010:cpuidle_enter_state+0xcc/0x360
> [  113.034861] Code: 3d c1 f7 26 57 e8 c4 09 74 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 35 15 74 ff 80 7d d7 00 0f 85 01 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 0d 01 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89
> [  113.034870] RSP: 0018:ffffa6e8809b7e68 EFLAGS: 00000246
> [  113.034884] RAX: ffff9615419a8dc0 RBX: 0000000000000002 RCX: 000000000000001f
> [  113.034892] RDX: 0000000000000000 RSI: 000000003158cc4a RDI: 0000000000000000
> [  113.034900] RBP: ffffa6e8809b7ea0 R08: 0000001a5155ea28 R09: 0000000000000018
> [  113.034909] R10: 000000000006e15c R11: ffffffffa9e4b960 R12: ffffc6e87a991800
> [  113.034917] R13: ffffffffa9e4b960 R14: 0000001a5155ea28 R15: 0000000000000002
> [  113.034942]  cpuidle_enter+0x2e/0x40
> [  113.034953]  do_idle+0x1ff/0x2a0
> [  113.034966]  cpu_startup_entry+0x20/0x30
> [  113.034979]  start_secondary+0x11a/0x150
> [  113.034991]  secondary_startup_64_no_verify+0xb0/0xbb
> [  113.035008] ---[ end trace f39ffcbabd5dfe2e ]---
> 
> [  533.511262] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 89916ns, attempt 4, marking unstable
> [  533.511295] tsc: Marking TSC unstable due to clocksource watchdog
> [  533.511336] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> [  533.511339] sched_clock: Marking unstable (533409196195, 102131418)<-(533549406780, -38078705)
> [  533.512368] clocksource: Checking clocksource tsc synchronization from CPU 35 to CPUs 0,3,21-22,36,50,54.
> [  533.513146] clocksource: Switched to clocksource hpet
> 
> 
> And after a while, the guest kernel will have the following  error, and then the network doesn't work anymore.
> If I reboot the guest VM, then it will stuck and cannot be rebooted successfully.
> 
> rcu: INFO: rcu_sched self-detected stall on CPU
> [  336.374152] rcu: 	3-...!: (1 GPs behind) idle=bb3/0/0x1 softirq=3087/3087 fqs=0 
> [  336.379018] rcu: rcu_sched kthread timer wakeup didn't happen for 39086 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [  336.386045] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=871
> [  336.390353] rcu: rcu_sched kthread starved for 39089 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> [  336.395881] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [  336.400375] rcu: RCU grace-period kthread stack dump:
> [  336.404091] rcu: Stack dump where RCU GP kthread last ran:
> [  566.795685] rcu: INFO: rcu_sched self-detected stall on CPU
> [  566.799315] rcu: 	3-...!: (1 ticks this GP) idle=c65/0/0x1 softirq=3088/3088 fqs=1 
> [  566.804170] rcu: rcu_sched kthread timer wakeup didn't happen for 229687 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [  566.811259] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=872
> [  566.815579] rcu: rcu_sched kthread starved for 229690 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> [  566.821190] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [  566.825813] rcu: RCU grace-period kthread stack dump:
> [  566.829513] rcu: Stack dump where RCU GP kthread last ran:
> 
> 
> Looks like it leads to kernel misbehavior if we don't adjust the guest TSC value.
> 
> Our goal is to get the almost synchronized TSC value among KVM VMs one the same host.
> 
> Now I fix the host CPU frequency. Then the TSC OFFFSET, which can be read under "/sys/kernel/debug/kvm/qemu-PID/vcpu0/tsc-offset", will always keep constant.
> 
> Every time when I execute the rdtsc inside the guest, I will subtract the offset to it get the TSC value, which can be close to the host TSC value.
> 
> Do you think this makes sense?

I assume you do not use TSC scaling or any nested virtualization.

Therefore, the value in the debugfs should be the same as the one computed by
"kvm_compute_l1_tsc_offset(vcpu, data)". You may use printk to double confirm.

I think it makes sense.

However, I do not think the patch may cause the issue. It works in my environment.

Thank you very much!

Dongli Zhang

> 
> Thank you in advance
> 
> Best,
> Tao

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-01-02 19:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-17  4:54 Intercepting RDTSC instruction by causing a VMEXIT Arnabjyoti Kalita
2021-04-19 16:46 ` Sean Christopherson
2021-04-20  2:33   ` Arnabjyoti Kalita
     [not found]     ` <CAJGDS+GT1mKHz6K=qHQf54S_97ym=nRP12MfO6OSEOpLYGht=A@mail.gmail.com>
2021-04-26  1:51       ` Arnabjyoti Kalita
2021-04-26 16:05       ` Sean Christopherson
2021-04-27  2:19         ` Arnabjyoti Kalita
2021-04-30 17:16           ` Arnabjyoti Kalita
2024-01-01 22:06             ` obtain the timestamp counter of physical/host machine inside the VMs Tao Lyu
2024-01-02  5:11               ` Dongli Zhang
2024-01-02 10:10                 ` Tao Lyu
2024-01-02 17:53                   ` Sean Christopherson
2024-01-02 18:20                     ` Tao Lyu
2024-01-02 19:36                       ` Dongli Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).