Re: [PATCH] Move VMEnter and VMExit tracepoints closer to the actual event

From: Sean Christopherson <seanjc@google.com>
To: Stefano De Venuto <stefano.devenuto99@gmail.com>
Cc: linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	vkuznets@redhat.com, wanpengli@tencent.com, jmattson@google.com,
	x86@kernel.org, hpa@zytor.com, kvm@vger.kernel.org,
	rostedt@goodmis.org, y.karadz@gmail.com,
	Dario Faggioli <dfaggioli@suse.com>
Subject: Re: [PATCH] Move VMEnter and VMExit tracepoints closer to the actual event
Date: Thu, 20 May 2021 15:32:34 +0000	[thread overview]
Message-ID: <YKaBEn6oUXaVAb0K@google.com> (raw)
In-Reply-To: <20210519182303.2790-1-stefano.devenuto99@gmail.com>

On Wed, May 19, 2021, Stefano De Venuto wrote:
> The kvm_entry and kvm_exit tracepoints are still quite far from the
> actual VMEnters/VMExits. This means that in a trace we can find host
> events after a kvm_entry event and before a kvm_exit one, as in this
> example:
> 
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167191: kvm_entry:
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167192: write_msr: 48, value 0
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167192: rcu_utilization: Start context switch
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167192: rcu_utilization: End context switch
> trace-tumbleweed.dat:     <idle>-0     [000]  2.167196: hrtimer_cancel:
> trace-tumbleweed.dat:     <idle>-0     [000]  2.167197: hrtimer_expire_entry:
> trace-tumbleweed.dat:     <idle>-0     [000]  2.167201: hrtimer_expire_exit:
> trace-tumbleweed.dat:     <idle>-0     [000]  2.167201: hrtimer_start:
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167203: read_msr: 48, value 0
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167203: write_msr: 48, value 4
>            trace.dat:  CPU 0/KVM-4594  [001]  2.167204: kvm_exit: 
> 
> This patch moves the tracepoints closer to the events, for both Intel
> and AMD, so that a combined host-guest trace will offer a more
> realistic representation of what is really happening, as shown here:
> 
>            trace.dat:  CPU 0/KVM-2553  [000]  2.190290: write_msr: 48, value 0

I'm not sure this is a good thing, as it's not clear to me that invoking tracing
with the guest's SPEC_CTRL loaded is desirable.  Maybe it's a non-issue, but it
should be explicitly called out and discussed.

And to some extent, the current behavior is _more_ accurate because it shows that
KVM started its VM-Enter sequence and then the WRMSR occured as part of that
sequence.  It is writing the guest's value after all.  Ditto for XCR0, XSS, PKRU,
Intel PT, etc...

A more concrete example would be perf; on VMX, if a perf NMI happens after KVM
invokes atomic_switch_perf_msrs() then I absolutely want to see that reflected
in the trace, e.g. to help debug the PEBS mess[*].  If the VM-Enter tracepoint
is moved closer to VM-Enter, that may or may not hold true depending on where the
NMI lands.

On VMX, I think the tracepoint can be moved below the VMWRITEs without much
contention (though doing so is likely a nop), but moving it below
kvm_load_guest_xsave_state() requires a bit more discussion.

I 100% agree that the current behavior can be a bit confusing, but I wonder if
we'd be better off "solving" that problem through documentation.

[*] https://lkml.kernel.org/r/20210209225653.1393771-1-jmattson@google.com

>            trace.dat:  CPU 0/KVM-2553  [000]  2.190290: rcu_utilization: Start context switch
>            trace.dat:  CPU 0/KVM-2553  [000]  2.190290: rcu_utilization: End context switch
>            trace.dat:  CPU 0/KVM-2553  [000]  2.190290: kvm_entry:
> trace-tumbleweed.dat:     <idle>-0     [000]  2.190290: write_msr:
> trace-tumbleweed.dat:     <idle>-0     [000]  2.190290: cpu_idle:
>            trace.dat:  CPU 0/KVM-2553  [000]  2.190291: kvm_exit:
>            trace.dat:  CPU 0/KVM-2553  [000]  2.190291: read_msr: 48, value 0
>            trace.dat:  CPU 0/KVM-2553  [000]  2.190291: write_msr: 48, value 4 
> 
> Signed-off-by: Stefano De Venuto <stefano.devenuto99@gmail.com>
> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
> ---

...

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4bceb5ca3a89..33c732101b83 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6661,6 +6661,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  {
>  	kvm_guest_enter_irqoff();
>  
> +	trace_kvm_entry(vcpu);
> +
>  	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>  		vmx_l1d_flush(vcpu);
> @@ -6675,6 +6677,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  
>  	vcpu->arch.cr2 = native_read_cr2();
>  
> +	vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
> +	trace_kvm_exit(vmx->exit_reason.full, vcpu, KVM_ISA_VMX);

This is wrong in the 'vmx->fail == true' case.

> +
>  	kvm_guest_exit_irqoff();
>  }
>  
> @@ -6693,8 +6698,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	if (vmx->emulation_required)
>  		return EXIT_FASTPATH_NONE;
>  
> -	trace_kvm_entry(vcpu);
> -
>  	if (vmx->ple_window_dirty) {
>  		vmx->ple_window_dirty = false;
>  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
> @@ -6814,15 +6817,12 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  		return EXIT_FASTPATH_NONE;
>  	}
>  
> -	vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
>  	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
>  		kvm_machine_check();
>  
>  	if (likely(!vmx->exit_reason.failed_vmentry))
>  		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>  
> -	trace_kvm_exit(vmx->exit_reason.full, vcpu, KVM_ISA_VMX);
> -
>  	if (unlikely(vmx->exit_reason.failed_vmentry))
>  		return EXIT_FASTPATH_NONE;
>  
> -- 
> 2.31.1
>