* [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
@ 2023-05-04 12:00 Roman Kagan
  2023-05-23 12:40 ` Like Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Roman Kagan @ 2023-05-04 12:00 UTC (permalink / raw)
  To: kvm
  Cc: Dave Hansen, Jim Mattson, Like Xu, Paolo Bonzini, x86,
	Thomas Gleixner, Eric Hankland, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Sean Christopherson

Performance counters are defined to have width less than 64 bits.  The
vPMU code maintains the counters in u64 variables but assumes the value
to fit within the defined width.  However, for Intel non-full-width
counters (MSR_IA32_PERFCTRx) the value received from the guest is
truncated to 32 bits and then sign-extended to full 64 bits.  If a
negative value is set, it's sign-extended to 64 bits, but then in
kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
previous value for overflow detection.
That previous value is not truncated, so it always compares as greater than
the truncated new one, and a PMI is injected.  If the PMI handler writes
a negative counter value itself, the vCPU never quits the PMI loop.

It turns out that the Linux PMI handler actually writes the counter with
the value just read via RDPMC, so when no full-width support is exposed
via MSR_IA32_PERF_CAPABILITIES and the guest initializes the counter to
a negative value, it locks up.
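
As a minimal standalone illustration of the sequence above (assuming a
hypothetical 48-bit counter width; the code is just a sketch, not KVM code):

	#include <stdint.h>
	#include <stdio.h>

	#define PMC_BITMASK ((1ULL << 48) - 1)	/* assumed 48-bit counter width */

	int main(void)
	{
		/* Guest writes a negative 32-bit value, sign-extended to 64 bits. */
		uint64_t counter = (uint64_t)(int64_t)(int32_t)0xffffff00u;

		/* Like kvm_pmu_incr_counter(): remember the old value, then
		 * increment and truncate the new one. */
		uint64_t prev_counter = counter;		/* not truncated */
		counter = (counter + 1) & PMC_BITMASK;		/* truncated */

		/* prev_counter still has bits 63:48 set, so the new value always
		 * compares as smaller and a spurious overflow is detected. */
		if (counter < prev_counter)
			printf("spurious overflow -> PMI injected\n");
		return 0;
	}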

We observed this in the field, for example, when the guest configures
atop to use perfevents and runs two instances of it simultaneously.

To address the problem, maintain the invariant that the counter value
always fits in the defined bit width, by truncating the received value
in the respective set_msr methods.  For better readability, factor this
out into a helper function, pmc_set_counter, shared by vmx and svm
parts.

Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Roman Kagan <rkagan@amazon.de>
---
 arch/x86/kvm/pmu.h           | 6 ++++++
 arch/x86/kvm/svm/pmu.c       | 2 +-
 arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 5c7bbf03b599..6a91e1afef5a 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 	return counter & pmc_bitmask(pmc);
 }
 
+static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
+{
+	pmc->counter += val - pmc_read_counter(pmc);
+	pmc->counter &= pmc_bitmask(pmc);
+}
+
 static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
 {
 	if (pmc->perf_event) {
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 5fa939e411d8..f93543d84cfe 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -151,7 +151,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	/* MSR_PERFCTRn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
 	if (pmc) {
-		pmc->counter += data - pmc_read_counter(pmc);
+		pmc_set_counter(pmc, data);
 		pmc_update_sample_period(pmc);
 		return 0;
 	}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 741efe2c497b..51354e3935d4 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -467,11 +467,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (!msr_info->host_initiated &&
 			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
 				data = (s64)(s32)data;
-			pmc->counter += data - pmc_read_counter(pmc);
+			pmc_set_counter(pmc, data);
 			pmc_update_sample_period(pmc);
 			break;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
-			pmc->counter += data - pmc_read_counter(pmc);
+			pmc_set_counter(pmc, data);
 			pmc_update_sample_period(pmc);
 			break;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
-- 
2.34.1





* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-05-04 12:00 [PATCH] KVM: x86: vPMU: truncate counter value to allowed width Roman Kagan
@ 2023-05-23 12:40 ` Like Xu
  2023-05-23 16:42   ` Roman Kagan
  2023-06-06  0:51 ` Sean Christopherson
  2023-09-28 16:41 ` Sean Christopherson
  2 siblings, 1 reply; 25+ messages in thread
From: Like Xu @ 2023-05-23 12:40 UTC (permalink / raw)
  To: Roman Kagan
  Cc: Jim Mattson, Paolo Bonzini, x86, Eric Hankland, linux-kernel,
	Sean Christopherson, kvm list

On 4/5/2023 8:00 pm, Roman Kagan wrote:
> Performance counters are defined to have width less than 64 bits.  The
> vPMU code maintains the counters in u64 variables but assumes the value
> to fit within the defined width.  However, for Intel non-full-width
> counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
> truncated to 32 bits and then sign-extended to full 64 bits.  If a
> negative value is set, it's sign-extended to 64 bits, but then in
> kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> previous value for overflow detection.

Thanks for reporting this issue. An easier-to-understand fix could be:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e17be25de6ca..51e75f121234 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)

  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
  {
-	pmc->prev_counter = pmc->counter;
+	pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
  	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
  	kvm_pmu_request_counter_reprogram(pmc);
  }

Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
around, I would prefer to use this fix above first and then do a more thorough
cleanup based on your diff below. What do you think?

> That previous value is not truncated, so it always evaluates bigger than
> the truncated new one, and a PMI is injected.  If the PMI handler writes
> a negative counter value itself, the vCPU never quits the PMI loop.
> 
> Turns out that Linux PMI handler actually does write the counter with
> the value just read with RDPMC, so when no full-width support is exposed
> via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
> a negative value, it locks up.

Not really sure what the behavioral difference is between "it locks up" and
"the vCPU never quits the PMI loop".

> 
> We observed this in the field, for example, when the guest configures
> atop to use perfevents and runs two instances of it simultaneously.

A more fundamental case I found is this:

  kvm_msr: msr_write CTR1 = 0xffffffffffffffea
  rdpmc on host: CTR1, value 0xffffffffffe3
  kvm_exit: vcpu 0 reason EXCEPTION_NMI
  kvm_msr: msr_read CTR1 = 0x83 // nmi_handler

There are two typical issues here:
- the emulated counter value changed from 0xffffffffffffffea to 0xffffffffffffe3,
  triggering __kvm_perf_overflow(pmc, false);
- the PMI handler should not reset the counter to a value that is easily overflowed,
  in order to avoid an overflow here before the iret;

Please confirm whether your usage scenarios consist of these two types, or more.

> 
> To address the problem, maintain the invariant that the counter value
> always fits in the defined bit width, by truncating the received value
> in the respective set_msr methods.  For better readability, factor this
> out into a helper function, pmc_set_counter, shared by vmx and svm
> parts.
> 
> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> Signed-off-by: Roman Kagan <rkagan@amazon.de>

Tested-by: Like Xu <likexu@tencent.com>
I prefer to use pmc_bitmask(pmc) to wrap around pmc->prev_counter as the first step.

> ---
>   arch/x86/kvm/pmu.h           | 6 ++++++
>   arch/x86/kvm/svm/pmu.c       | 2 +-
>   arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
>   3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 5c7bbf03b599..6a91e1afef5a 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>   	return counter & pmc_bitmask(pmc);
>   }
>   
> +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc->counter += val - pmc_read_counter(pmc);
> +	pmc->counter &= pmc_bitmask(pmc);
> +}
> +
>   static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>   {
>   	if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 5fa939e411d8..f93543d84cfe 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -151,7 +151,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   	/* MSR_PERFCTRn */
>   	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>   	if (pmc) {
> -		pmc->counter += data - pmc_read_counter(pmc);
> +		pmc_set_counter(pmc, data);
>   		pmc_update_sample_period(pmc);
>   		return 0;
>   	}
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 741efe2c497b..51354e3935d4 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -467,11 +467,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   			if (!msr_info->host_initiated &&
>   			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
>   				data = (s64)(s32)data;
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>   			pmc_update_sample_period(pmc);
>   			break;
>   		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>   			pmc_update_sample_period(pmc);
>   			break;
>   		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-05-23 12:40 ` Like Xu
@ 2023-05-23 16:42   ` Roman Kagan
  2023-06-06  0:26     ` Sean Christopherson
  0 siblings, 1 reply; 25+ messages in thread
From: Roman Kagan @ 2023-05-23 16:42 UTC (permalink / raw)
  To: Like Xu
  Cc: Jim Mattson, Paolo Bonzini, x86, Eric Hankland, linux-kernel,
	Sean Christopherson, kvm list

On Tue, May 23, 2023 at 08:40:53PM +0800, Like Xu wrote:
> On 4/5/2023 8:00 pm, Roman Kagan wrote:
> > Performance counters are defined to have width less than 64 bits.  The
> > vPMU code maintains the counters in u64 variables but assumes the value
> > to fit within the defined width.  However, for Intel non-full-width
> > counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
> > truncated to 32 bits and then sign-extended to full 64 bits.  If a
> > negative value is set, it's sign-extended to 64 bits, but then in
> > kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> > previous value for overflow detection.
> 
> Thanks for reporting this issue. An easier-to-understand fix could be:
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index e17be25de6ca..51e75f121234 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> 
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> -       pmc->prev_counter = pmc->counter;
> +       pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
>        pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
>        kvm_pmu_request_counter_reprogram(pmc);
>  }
> 
> Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
> around, I would prefer to use this fix above first and then do a more thorough
> cleanup based on your below diff. What do you think ?

I did exactly this at first.  However, it felt more natural, easier to
reason about, and less error-prone going forward to maintain the
invariant that pmc->counter always fits in the assumed width.

> > That previous value is not truncated, so it always evaluates bigger than
> > the truncated new one, and a PMI is injected.  If the PMI handler writes
> > a negative counter value itself, the vCPU never quits the PMI loop.
> > 
> > Turns out that Linux PMI handler actually does write the counter with
> > the value just read with RDPMC, so when no full-width support is exposed
> > via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
> > a negative value, it locks up.
> 
> Not really sure what the behavioral difference is between "it locks up" and
> "the vCPU never quits the PMI loop".

No difference; the second paragraph just illustrates the specific case of
the Linux PMI handler on a system that doesn't expose full-width counter
support, so the problematic code path is triggered.

> > We observed this in the field, for example, when the guest configures
> > atop to use perfevents and runs two instances of it simultaneously.
> 
> A more essential case I found is this:
> 
>  kvm_msr: msr_write CTR1 = 0xffffffffffffffea
>  rdpmc on host: CTR1, value 0xffffffffffe3
>  kvm_exit: vcpu 0 reason EXCEPTION_NMI
>  kvm_msr: msr_read CTR1 = 0x83 // nmi_handler
> 
> There are two typical issues here:
> - the emulated counter value changed from 0xffffffffffffffffea to 0xffffffffffffe3,

Strange, why would the counter go backwards (disregarding the extra high
bits)?

>  triggering __kvm_perf_overflow(pmc, false);
> - PMI-handler should not reset the counter to a value that is easily overflowed,
>  in order to avoid overflow here before iret;

This is a legitimate guest behavior, isn't it?  The problem I'm trying
to fix is when the value written is sufficiently far from overflowing,
but it still ends up looking like an overflow and triggers a PMI.

> Please confirm whether your usage scenarios consist of these two types, or more.

The *usage* scenario is the one I wrote above.  I've identified an
easier repro since then:

  perf stat -e instructions -C 0 &
  sleep 1 && perf stat -e instructions,cycles -C 0 sleep 0.1 && kill -INT %

Please also see the kvm-unit-test I posted at
https://lore.kernel.org/kvm/20230504134908.830041-1-rkagan@amazon.de

> > To address the problem, maintain the invariant that the counter value
> > always fits in the defined bit width, by truncating the received value
> > in the respective set_msr methods.  For better readability, factor this
> > out into a helper function, pmc_set_counter, shared by vmx and svm
> > parts.
> > 
> > Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > Signed-off-by: Roman Kagan <rkagan@amazon.de>
> 
> Tested-by: Like Xu <likexu@tencent.com>
> I prefer to use pmc_bitmask(pmc) to wrap around pmc->prev_counter as the first step.

My preference is to keep the counter within the limits all the time, so
I'd leave it up to the maintainers to decide.

Thanks,
Roman.




* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-05-23 16:42   ` Roman Kagan
@ 2023-06-06  0:26     ` Sean Christopherson
  0 siblings, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2023-06-06  0:26 UTC (permalink / raw)
  To: Roman Kagan, Like Xu, Jim Mattson, Paolo Bonzini, x86,
	Eric Hankland, linux-kernel, kvm list

On Tue, May 23, 2023, Roman Kagan wrote:
> On Tue, May 23, 2023 at 08:40:53PM +0800, Like Xu wrote:
> > On 4/5/2023 8:00 pm, Roman Kagan wrote:
> > > Performance counters are defined to have width less than 64 bits.  The
> > > vPMU code maintains the counters in u64 variables but assumes the value
> > > to fit within the defined width.  However, for Intel non-full-width
> > > counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
> > > truncated to 32 bits and then sign-extended to full 64 bits.  If a
> > > negative value is set, it's sign-extended to 64 bits, but then in
> > > kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> > > previous value for overflow detection.
> > 
> > Thanks for reporting this issue. An easier-to-understand fix could be:
> > 
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index e17be25de6ca..51e75f121234 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> > 
> >  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
> >  {
> > -       pmc->prev_counter = pmc->counter;
> > +       pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
> >        pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> >        kvm_pmu_request_counter_reprogram(pmc);
> >  }
> > 
> > Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
> > around, I would prefer to use this fix above first and then do a more thorough
> > cleanup based on your below diff. What do you think ?
> 
> I did exactly this at first.  However, it felt more natural and easier
> to reason about and less error-prone going forward, to maintain the
> invariant that pmc->counter always fits in the assumed width.

Agreed, KVM shouldn't store information that's not supposed to exist.


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-05-04 12:00 [PATCH] KVM: x86: vPMU: truncate counter value to allowed width Roman Kagan
  2023-05-23 12:40 ` Like Xu
@ 2023-06-06  0:51 ` Sean Christopherson
  2023-06-29 21:16   ` Jim Mattson
  2023-09-28 16:41 ` Sean Christopherson
  2 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2023-06-06  0:51 UTC (permalink / raw)
  To: Roman Kagan
  Cc: kvm, Dave Hansen, Jim Mattson, Like Xu, Paolo Bonzini, x86,
	Thomas Gleixner, Eric Hankland, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar

On Thu, May 04, 2023, Roman Kagan wrote:
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 5c7bbf03b599..6a91e1afef5a 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  	return counter & pmc_bitmask(pmc);
>  }
>  
> +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc->counter += val - pmc_read_counter(pmc);

Ugh, not your code, but I don't see how the current code can possibly be correct.

The above unpacks to

	counter = pmc->counter;
	if (pmc->perf_event && !pmc->is_paused)
		counter += perf_event_read_value(pmc->perf_event,
						 &enabled, &running);
	pmc->counter += val - (counter & pmc_bitmask(pmc));

which distills down to

	counter = 0;
	if (pmc->perf_event && !pmc->is_paused)
		counter += perf_event_read_value(pmc->perf_event,
						 &enabled, &running);
	pmc->counter = val - (counter & pmc_bitmask(pmc));

or more succinctly

	if (pmc->perf_event && !pmc->is_paused)
		val -= perf_event_read_value(pmc->perf_event, &enabled, &running);

	pmc->counter = val;

which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
value, and thus quickly overflow after a write of '0'.

I assume the intent is to purge any accumulated counts that occurred since the
last read, but *before* the write, but then shouldn't this just be:

	/* Purge any events that were accumulated before the write. */
	if (pmc->perf_event && !pmc->is_paused)
		(void)perf_event_read_value(pmc->perf_event, &enabled, &running);

	pmc->counter = val & pmc_bitmask(pmc);

Am I missing something?

I'd like to get this sorted out before applying this patch, because I really want
to document what on earth this new helper is doing.  Seeing a bizarre partial
RMW operation in a helper with "set" as the verb is super weird.


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-06  0:51 ` Sean Christopherson
@ 2023-06-29 21:16   ` Jim Mattson
  2023-06-30  0:11     ` Sean Christopherson
  0 siblings, 1 reply; 25+ messages in thread
From: Jim Mattson @ 2023-06-29 21:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Eric Hankland
  Cc: Roman Kagan, kvm, Dave Hansen, Like Xu, x86, Thomas Gleixner,
	linux-kernel, H. Peter Anvin, Borislav Petkov, Ingo Molnar

On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, May 04, 2023, Roman Kagan wrote:
> > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > index 5c7bbf03b599..6a91e1afef5a 100644
> > --- a/arch/x86/kvm/pmu.h
> > +++ b/arch/x86/kvm/pmu.h
> > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> >       return counter & pmc_bitmask(pmc);
> >  }
> >
> > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> > +{
> > +     pmc->counter += val - pmc_read_counter(pmc);
>
> Ugh, not your code, but I don't see how the current code can possibly be correct.
>
> The above unpacks to
>
>         counter = pmc->counter;
>         if (pmc->perf_event && !pmc->is_paused)
>                 counter += perf_event_read_value(pmc->perf_event,
>                                                  &enabled, &running);
>         pmc->counter += val - (counter & pmc_bitmask(pmc));
>
> which distills down to
>
>         counter = 0;
>         if (pmc->perf_event && !pmc->is_paused)
>                 counter += perf_event_read_value(pmc->perf_event,
>                                                  &enabled, &running);
>         pmc->counter = val - (counter & pmc_bitmask(pmc));
>
> or more succinctly
>
>         if (pmc->perf_event && !pmc->is_paused)
>                 val -= perf_event_read_value(pmc->perf_event, &enabled, &running);
>
>         pmc->counter = val;
>
> which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
> adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
> value, and thus quickly overflow after a write of '0'.

This weird construct goes all the way back to commit f5132b01386b
("KVM: Expose a version 2 architectural PMU to a guests"). Paolo
killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value
that is written to fixed PMUs"), perhaps by accident. Eric then
resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR
for running counters").

It makes no sense to me. WRMSR should just set the new value of the
counter, regardless of the old value or whether or not it is running.

> I assume the intent it to purge any accumulated counts that occurred since the
> last read, but *before* the write, but then shouldn't this just be:
>
>         /* Purge any events that were acculumated before the write. */
>         if (pmc->perf_event && !pmc->is_paused)
>                 (void)perf_event_read_value(pmc->perf_event, &enabled, &running);
>
>         pmc->counter = val & pmc_bitmask(pmc);
>
> Am I missing something?
>
> I'd like to get this sorted out before applying this patch, because I really want
> to document what on earth this new helper is doing.  Seeing a bizarre partial
> RMW operation in a helper with "set" as the verb is super weird.


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-29 21:16   ` Jim Mattson
@ 2023-06-30  0:11     ` Sean Christopherson
  2023-06-30  0:31       ` Jim Mattson
  2023-06-30 11:14       ` Roman Kagan
  0 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2023-06-30  0:11 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Eric Hankland, Roman Kagan, kvm, Dave Hansen,
	Like Xu, x86, Thomas Gleixner, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Mingwei Zhang

+Mingwei

On Thu, Jun 29, 2023, Jim Mattson wrote:
> On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, May 04, 2023, Roman Kagan wrote:
> > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > > index 5c7bbf03b599..6a91e1afef5a 100644
> > > --- a/arch/x86/kvm/pmu.h
> > > +++ b/arch/x86/kvm/pmu.h
> > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > >       return counter & pmc_bitmask(pmc);
> > >  }
> > >
> > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> > > +{
> > > +     pmc->counter += val - pmc_read_counter(pmc);
> >
> > Ugh, not your code, but I don't see how the current code can possibly be correct.
> >
> > The above unpacks to
> >
> >         counter = pmc->counter;
> >         if (pmc->perf_event && !pmc->is_paused)
> >                 counter += perf_event_read_value(pmc->perf_event,
> >                                                  &enabled, &running);
> >         pmc->counter += val - (counter & pmc_bitmask(pmc));
> >
> > which distills down to
> >
> >         counter = 0;
> >         if (pmc->perf_event && !pmc->is_paused)
> >                 counter += perf_event_read_value(pmc->perf_event,
> >                                                  &enabled, &running);
> >         pmc->counter = val - (counter & pmc_bitmask(pmc));
> >
> > or more succinctly
> >
> >         if (pmc->perf_event && !pmc->is_paused)
> >                 val -= perf_event_read_value(pmc->perf_event, &enabled, &running);
> >
> >         pmc->counter = val;
> >
> > which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
> > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
> > value, and thus quickly overflow after a write of '0'.
> 
> This weird construct goes all the way back to commit f5132b01386b
> ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo
> killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value
> that is written to fixed PMUs"), perhaps by accident. Eric then
> resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR
> for running counters").
> 
> It makes no sense to me. WRMSR should just set the new value of the
> counter, regardless of the old value or whether or not it is running.

Heh, didn't stop you from giving Eric's patch a thumbs-up[*] :-)

Thanks to Eric's testcase [Wow, tests do help!  We should try writing more of them!],
I finally figured out what's going on.  I wrongly assumed perf_event_read_value()
is destructive, but it's not, it just reads the current value.  So on a WRMSR,
KVM offsets the value with the current perf event, and then *mostly* adjusts for
it when reading the counter.

But that is obviously super fragile because it means pmc->counter must never be
read directly unless the perf event is paused and the accumulated counter has been
propagated to pmc->counter.  Blech.
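
To make that scheme concrete, here is a toy userspace model of the offsetting
(the helper names are made up and this is not kernel code; a single variable
stands in for the live perf count):

	#include <stdint.h>
	#include <stdio.h>

	static uint64_t perf_count;	/* stands in for perf_event_read_value() */
	static uint64_t stored;		/* stands in for pmc->counter */

	static uint64_t pmc_read(void)
	{
		return stored + perf_count;	/* un-bias by the live perf count */
	}

	static void pmc_write(uint64_t val)
	{
		stored += val - pmc_read();	/* bias the stored value by -perf_count */
	}

	int main(void)
	{
		perf_count = 100;		/* events accumulated so far */
		pmc_write(0);			/* guest writes 0 */
		printf("stored=%llu read=%llu\n",
		       (unsigned long long)stored, (unsigned long long)pmc_read());
		/* stored now holds a huge unsigned value (0 minus the perf count),
		 * yet reads still return 0; anything that looks at 'stored'
		 * directly gets fooled. */
		return 0;
	}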

I fiddled with a variety of things, but AFAICT the easiest solution is also the
most obviously correct: set perf's count to the guest's count.  Lightly tested
patch below.

On a related topic, Mingwei also appears to have found another bug: prev_counter
needs to be set when the counter is written, i.e. my proposed pmc_write_counter()
also needs to update prev_counter.

Though that also raises the question of whether or not zeroing prev_counter in
reprogram_counter() is correct.  Unless I'm missing something, reprogram_counter()
should also set pmc->prev_counter to pmc->counter when the counter is successfully
(re)enabled.

And Jim also pointed out that prev_counter needs to be set even when KVM fails
to enable a perf event (software counting should still work).

[*] https://lore.kernel.org/all/CALMp9eRfeFFb6n22Uf4R2Pf8WW7BVLX_Vuf04WFwiMtrk14Y-Q@mail.gmail.com

---
 arch/x86/kvm/pmu.h           |  8 ++++++++
 arch/x86/kvm/svm/pmu.c       |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c |  4 ++--
 include/linux/perf_event.h   |  2 ++
 kernel/events/core.c         | 11 +++++++++++
 5 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..ba91a78e4dc1 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 	return counter & pmc_bitmask(pmc);
 }
 
+static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
+{
+	if (pmc->perf_event && !pmc->is_paused)
+		perf_event_set_count(pmc->perf_event, val);
+
+	pmc->counter = val;
+}
+
 static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
 {
 	if (pmc->perf_event) {
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index cef5a3d0abd0..373ff6a6687b 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -160,7 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	/* MSR_PERFCTRn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
 	if (pmc) {
-		pmc->counter += data - pmc_read_counter(pmc);
+		pmc_write_counter(pmc, data);
 		pmc_update_sample_period(pmc);
 		return 0;
 	}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 80c769c58a87..18a658aa2a8d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -406,11 +406,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (!msr_info->host_initiated &&
 			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
 				data = (s64)(s32)data;
-			pmc->counter += data - pmc_read_counter(pmc);
+			pmc_write_counter(pmc, data);
 			pmc_update_sample_period(pmc);
 			break;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
-			pmc->counter += data - pmc_read_counter(pmc);
+			pmc_write_counter(pmc, data);
 			pmc_update_sample_period(pmc);
 			break;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..8fcd52a87ba2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1677,6 +1677,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event);
 extern void perf_event_task_tick(void);
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
+extern void perf_event_set_count(struct perf_event *event, u64 count);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
@@ -1760,6 +1761,7 @@ static inline int perf_event_period(struct perf_event *event, u64 value)
 {
 	return -EINVAL;
 }
+static inline void perf_event_set_count(struct perf_event *event, u64 count) { }
 static inline u64 perf_event_pause(struct perf_event *event, bool reset)
 {
 	return 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index db016e418931..d368c283eba5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5646,6 +5646,17 @@ static void _perf_event_reset(struct perf_event *event)
 	perf_event_update_userpage(event);
 }
 
+void perf_event_set_count(struct perf_event *event, u64 count)
+{
+	struct perf_event_context *ctx;
+
+	ctx = perf_event_ctx_lock(event);
+	(void)perf_event_read(event, false);
+	local64_set(&event->count, count);
+	perf_event_ctx_unlock(event, ctx);
+}
+EXPORT_SYMBOL_GPL(perf_event_set_count);
+
 /* Assume it's not an event with inherit set. */
 u64 perf_event_pause(struct perf_event *event, bool reset)
 {

base-commit: 5ae85a1bd17b959796f6cc4c1153ceada2cf8f24
-- 


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30  0:11     ` Sean Christopherson
@ 2023-06-30  0:31       ` Jim Mattson
  2023-06-30 11:14       ` Roman Kagan
  1 sibling, 0 replies; 25+ messages in thread
From: Jim Mattson @ 2023-06-30  0:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Eric Hankland, Roman Kagan, kvm, Dave Hansen,
	Like Xu, x86, Thomas Gleixner, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Thu, Jun 29, 2023 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +Mingwei
>
> On Thu, Jun 29, 2023, Jim Mattson wrote:
> > On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, May 04, 2023, Roman Kagan wrote:
> > > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > > > index 5c7bbf03b599..6a91e1afef5a 100644
> > > > --- a/arch/x86/kvm/pmu.h
> > > > +++ b/arch/x86/kvm/pmu.h
> > > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > >       return counter & pmc_bitmask(pmc);
> > > >  }
> > > >
> > > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> > > > +{
> > > > +     pmc->counter += val - pmc_read_counter(pmc);
> > >
> > > Ugh, not your code, but I don't see how the current code can possibly be correct.
> > >
> > > The above unpacks to
> > >
> > >         counter = pmc->counter;
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 counter += perf_event_read_value(pmc->perf_event,
> > >                                                  &enabled, &running);
> > >         pmc->counter += val - (counter & pmc_bitmask(pmc));
> > >
> > > which distills down to
> > >
> > >         counter = 0;
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 counter += perf_event_read_value(pmc->perf_event,
> > >                                                  &enabled, &running);
> > >         pmc->counter = val - (counter & pmc_bitmask(pmc));
> > >
> > > or more succinctly
> > >
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 val -= perf_event_read_value(pmc->perf_event, &enabled, &running);
> > >
> > >         pmc->counter = val;
> > >
> > > which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
> > > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
> > > value, and thus quickly overflow after a write of '0'.
> >
> > This weird construct goes all the way back to commit f5132b01386b
> > ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo
> > killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value
> > that is written to fixed PMUs"), perhaps by accident. Eric then
> > resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR
> > for running counters").
> >
> > It makes no sense to me. WRMSR should just set the new value of the
> > counter, regardless of the old value or whether or not it is running.
>
> Heh, didn't stop you from giving Eric's patch a thumbs-up[*] :-)

I was smarter then, and probably understood what was going on.

> Thanks to Eric's testcase [Wow, tests do help!  We should try writing more of them!],
> I finally figured out what's going on.  I wrongly assumed perf_event_read_value()
> is destructive, but it's not, it just reads the current value.  So on a WRMSR,
> KVM offsets the value with the current perf event, and then *mostly* adjusts for
> it when reading the counter.
>
> But that is obviously super fragile because it means pmc->counter must never be
> read directly unless the perf event is paused and the accumulated counter has been
> propagated to pmc->counter.  Blech.
>
> I fiddled with a variety of things, but AFAICT the easiest solution is also the
> most obviously correct: set perf's count to the guest's count.  Lightly tested
> patch below.

That seems correct to me. :)

> On a related topic, Mingwei also appears to have found another bug: prev_counter
> needs to be set when the counter is written, i.e. my proposed pmc_write_counter()
> also needs to update prev_counter.
>
> Though that also raises the question of whether or not zeroing prev_counter in
> reprogram_counter() is correct.  Unless I'm missing something, reprogram_counter()
> should also set pmc->prev_counter to pmc->counter when the counter is successfully
> (re)enabled.
>
> And Jim also pointed out that prev_counter needs to be set even when KVM fails
> to enable a perf event (software counting should still work).

This whole prev_counter business is a mess. The software counting
increments by at most one per VM-exit. The only time we need to
concern ourselves about overflow is when there has been a software
increment and the paused counter value is -1. That is the only
situation in which software counting can trigger a PMI, even if the
VM-exit was for a WRMSR to the PMC, assuming that the value just
written by the guest is the value we see when we pause the counter. Or
am I missing something?

> [*] https://lore.kernel.org/all/CALMp9eRfeFFb6n22Uf4R2Pf8WW7BVLX_Vuf04WFwiMtrk14Y-Q@mail.gmail.com
>
> ---
>  arch/x86/kvm/pmu.h           |  8 ++++++++
>  arch/x86/kvm/svm/pmu.c       |  2 +-
>  arch/x86/kvm/vmx/pmu_intel.c |  4 ++--
>  include/linux/perf_event.h   |  2 ++
>  kernel/events/core.c         | 11 +++++++++++
>  5 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 7d9ba301c090..ba91a78e4dc1 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>         return counter & pmc_bitmask(pmc);
>  }
>
> +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +       if (pmc->perf_event && !pmc->is_paused)
> +               perf_event_set_count(pmc->perf_event, val);
> +
> +       pmc->counter = val;
> +}
> +
>  static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>  {
>         if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index cef5a3d0abd0..373ff6a6687b 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -160,7 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>         /* MSR_PERFCTRn */
>         pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>         if (pmc) {
> -               pmc->counter += data - pmc_read_counter(pmc);
> +               pmc_write_counter(pmc, data);
>                 pmc_update_sample_period(pmc);
>                 return 0;
>         }
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 80c769c58a87..18a658aa2a8d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -406,11 +406,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                         if (!msr_info->host_initiated &&
>                             !(msr & MSR_PMC_FULL_WIDTH_BIT))
>                                 data = (s64)(s32)data;
> -                       pmc->counter += data - pmc_read_counter(pmc);
> +                       pmc_write_counter(pmc, data);
>                         pmc_update_sample_period(pmc);
>                         break;
>                 } else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -                       pmc->counter += data - pmc_read_counter(pmc);
> +                       pmc_write_counter(pmc, data);
>                         pmc_update_sample_period(pmc);
>                         break;
>                 } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d5628a7b5eaa..8fcd52a87ba2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1677,6 +1677,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event);
>  extern void perf_event_task_tick(void);
>  extern int perf_event_account_interrupt(struct perf_event *event);
>  extern int perf_event_period(struct perf_event *event, u64 value);
> +extern void perf_event_set_count(struct perf_event *event, u64 count);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
> @@ -1760,6 +1761,7 @@ static inline int perf_event_period(struct perf_event *event, u64 value)
>  {
>         return -EINVAL;
>  }
> +static inline perf_event_set_count(struct perf_event *event, u64 count) { }
>  static inline u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
>         return 0;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index db016e418931..d368c283eba5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5646,6 +5646,17 @@ static void _perf_event_reset(struct perf_event *event)
>         perf_event_update_userpage(event);
>  }
>
> +void perf_event_set_count(struct perf_event *event, u64 count)
> +{
> +       struct perf_event_context *ctx;
> +
> +       ctx = perf_event_ctx_lock(event);
> +       (void)perf_event_read(event, false);
> +       local64_set(&event->count, count);
> +       perf_event_ctx_unlock(event, ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_event_set_count);
> +
>  /* Assume it's not an event with inherit set. */
>  u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
>
> base-commit: 5ae85a1bd17b959796f6cc4c1153ceada2cf8f24
> --


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30  0:11     ` Sean Christopherson
  2023-06-30  0:31       ` Jim Mattson
@ 2023-06-30 11:14       ` Roman Kagan
  2023-06-30 14:28         ` Sean Christopherson
  1 sibling, 1 reply; 25+ messages in thread
From: Roman Kagan @ 2023-06-30 11:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Paolo Bonzini, Eric Hankland, kvm, Dave Hansen,
	Like Xu, x86, Thomas Gleixner, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> +Mingwei
> 
> On Thu, Jun 29, 2023, Jim Mattson wrote:
> > On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, May 04, 2023, Roman Kagan wrote:
> > > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > > > index 5c7bbf03b599..6a91e1afef5a 100644
> > > > --- a/arch/x86/kvm/pmu.h
> > > > +++ b/arch/x86/kvm/pmu.h
> > > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > >       return counter & pmc_bitmask(pmc);
> > > >  }
> > > >
> > > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> > > > +{
> > > > +     pmc->counter += val - pmc_read_counter(pmc);
> > >
> > > Ugh, not your code, but I don't see how the current code can possibly be correct.
> > >
> > > The above unpacks to
> > >
> > >         counter = pmc->counter;
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 counter += perf_event_read_value(pmc->perf_event,
> > >                                                  &enabled, &running);
> > >         pmc->counter += val - (counter & pmc_bitmask(pmc));
> > >
> > > which distills down to
> > >
> > >         counter = 0;
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 counter += perf_event_read_value(pmc->perf_event,
> > >                                                  &enabled, &running);
> > >         pmc->counter = val - (counter & pmc_bitmask(pmc));
> > >
> > > or more succinctly
> > >
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 val -= perf_event_read_value(pmc->perf_event, &enabled, &running);
> > >
> > >         pmc->counter = val;
> > >
> > > which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
> > > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
> > > value, and thus quickly overflow after a write of '0'.
> >
> > This weird construct goes all the way back to commit f5132b01386b
> > ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo
> > killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value
> > that is written to fixed PMUs"), perhaps by accident. Eric then
> > resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR
> > for running counters").
> >
> > It makes no sense to me. WRMSR should just set the new value of the
> > counter, regardless of the old value or whether or not it is running.
> 
> Heh, didn't stop you from giving Eric's patch a thumbs-up[*] :-)
> 
> Thanks to Eric's testcase [Wow, tests do help!  We should try writing more of them!],
> I finally figured out what's going on.  I wrongly assumed perf_event_read_value()
> is destructive, but it's not, it just reads the current value.  So on a WRMSR,
> KVM offsets the value with the current perf event, and then *mostly* adjusts for
> it when reading the counter.
> 
> But that is obviously super fragile because it means pmc->counter must never be
> read directly unless the perf event is paused and the accumulated counter has been
> propagated to pmc->counter.  Blech.
> 
> I fiddled with a variety of things, but AFAICT the easiest solution is also the
> most obviously correct: set perf's count to the guest's count.  Lightly tested
> patch below.
> 
> On a related topic, Mingwei also appears to have found another bug: prev_counter
> needs to be set when the counter is written, i.e. my proposed pmc_write_counter()
> also needs to update prev_counter.
> 
> Though that also raises the question of whether or not zeroing prev_counter in
> reprogram_counter() is correct.  Unless I'm missing something, reprogram_counter()
> should also set pmc->prev_counter to pmc->counter when the counter is successfully
> (re)enabled.
> 
> And Jim also pointed out that prev_counter needs to be set even when KVM fails
> to enable a perf event (software counting should still work).
> 
> [*] https://lore.kernel.org/all/CALMp9eRfeFFb6n22Uf4R2Pf8WW7BVLX_Vuf04WFwiMtrk14Y-Q@mail.gmail.com
> 
> ---
>  arch/x86/kvm/pmu.h           |  8 ++++++++
>  arch/x86/kvm/svm/pmu.c       |  2 +-
>  arch/x86/kvm/vmx/pmu_intel.c |  4 ++--
>  include/linux/perf_event.h   |  2 ++
>  kernel/events/core.c         | 11 +++++++++++
>  5 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 7d9ba301c090..ba91a78e4dc1 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>         return counter & pmc_bitmask(pmc);
>  }
> 
> +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +       if (pmc->perf_event && !pmc->is_paused)
> +               perf_event_set_count(pmc->perf_event, val);
> +
> +       pmc->counter = val;

Doesn't this still have the original problem of storing a wider value than
allowed?

Thanks,
Roman.

> +}
> +
>  static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>  {
>         if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index cef5a3d0abd0..373ff6a6687b 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -160,7 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>         /* MSR_PERFCTRn */
>         pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>         if (pmc) {
> -               pmc->counter += data - pmc_read_counter(pmc);
> +               pmc_write_counter(pmc, data);
>                 pmc_update_sample_period(pmc);
>                 return 0;
>         }
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 80c769c58a87..18a658aa2a8d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -406,11 +406,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                         if (!msr_info->host_initiated &&
>                             !(msr & MSR_PMC_FULL_WIDTH_BIT))
>                                 data = (s64)(s32)data;
> -                       pmc->counter += data - pmc_read_counter(pmc);
> +                       pmc_write_counter(pmc, data);
>                         pmc_update_sample_period(pmc);
>                         break;
>                 } else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -                       pmc->counter += data - pmc_read_counter(pmc);
> +                       pmc_write_counter(pmc, data);
>                         pmc_update_sample_period(pmc);
>                         break;
>                 } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d5628a7b5eaa..8fcd52a87ba2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1677,6 +1677,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event);
>  extern void perf_event_task_tick(void);
>  extern int perf_event_account_interrupt(struct perf_event *event);
>  extern int perf_event_period(struct perf_event *event, u64 value);
> +extern void perf_event_set_count(struct perf_event *event, u64 count);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
> @@ -1760,6 +1761,7 @@ static inline int perf_event_period(struct perf_event *event, u64 value)
>  {
>         return -EINVAL;
>  }
> +static inline perf_event_set_count(struct perf_event *event, u64 count) { }
>  static inline u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
>         return 0;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index db016e418931..d368c283eba5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5646,6 +5646,17 @@ static void _perf_event_reset(struct perf_event *event)
>         perf_event_update_userpage(event);
>  }
> 
> +void perf_event_set_count(struct perf_event *event, u64 count)
> +{
> +       struct perf_event_context *ctx;
> +
> +       ctx = perf_event_ctx_lock(event);
> +       (void)perf_event_read(event, false);
> +       local64_set(&event->count, count);
> +       perf_event_ctx_unlock(event, ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_event_set_count);
> +
>  /* Assume it's not an event with inherit set. */
>  u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
> 
> base-commit: 5ae85a1bd17b959796f6cc4c1153ceada2cf8f24
> --




* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 11:14       ` Roman Kagan
@ 2023-06-30 14:28         ` Sean Christopherson
  2023-06-30 15:21           ` Roman Kagan
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2023-06-30 14:28 UTC (permalink / raw)
  To: Roman Kagan, Jim Mattson, Paolo Bonzini, Eric Hankland, kvm,
	Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Fri, Jun 30, 2023, Roman Kagan wrote:
> On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> >         return counter & pmc_bitmask(pmc);
> >  }
> > 
> > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > +{
> > +       if (pmc->perf_event && !pmc->is_paused)
> > +               perf_event_set_count(pmc->perf_event, val);
> > +
> > +       pmc->counter = val;
> 
> Doesn't this still have the original problem of storing wider value than
> allowed?

Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
patch on top.  Sorry for not making that clear.  


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 14:28         ` Sean Christopherson
@ 2023-06-30 15:21           ` Roman Kagan
  2023-06-30 15:45             ` Sean Christopherson
  2023-06-30 16:40             ` Jim Mattson
  0 siblings, 2 replies; 25+ messages in thread
From: Roman Kagan @ 2023-06-30 15:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Paolo Bonzini, Eric Hankland, kvm, Dave Hansen,
	Like Xu, x86, Thomas Gleixner, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Roman Kagan wrote:
> > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > >         return counter & pmc_bitmask(pmc);
> > >  }
> > >
> > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > +{
> > > +       if (pmc->perf_event && !pmc->is_paused)
> > > +               perf_event_set_count(pmc->perf_event, val);
> > > +
> > > +       pmc->counter = val;
> >
> > Doesn't this still have the original problem of storing wider value than
> > allowed?
> 
> Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> patch on top.  Sorry for not making that clear.

Ah, got it, thanks!

Also I'm now chasing a problem that we occasionally see

[3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
[3939579.462836] Do you have a strange power saving mode enabled?
[3939579.462836] Dazed and confused, but trying to continue

in the guests when perf is used.  These messages disappear when
9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
reverted.  I haven't yet figured out where exactly the culprit is.

Thanks,
Roman.




* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 15:21           ` Roman Kagan
@ 2023-06-30 15:45             ` Sean Christopherson
  2023-06-30 17:07               ` Mingwei Zhang
                                 ` (2 more replies)
  2023-06-30 16:40             ` Jim Mattson
  1 sibling, 3 replies; 25+ messages in thread
From: Sean Christopherson @ 2023-06-30 15:45 UTC (permalink / raw)
  To: Roman Kagan, Jim Mattson, Paolo Bonzini, Eric Hankland, kvm,
	Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Fri, Jun 30, 2023, Roman Kagan wrote:
> On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > >         return counter & pmc_bitmask(pmc);
> > > >  }
> > > >
> > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > +{
> > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > +
> > > > +       pmc->counter = val;
> > >
> > > Doesn't this still have the original problem of storing wider value than
> > > allowed?
> > 
> > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > patch on top.  Sorry for not making that clear.
> 
> Ah, got it, thanks!
> 
> Also I'm now chasing a problem that we occasionally see
> 
> [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> [3939579.462836] Do you have a strange power saving mode enabled?
> [3939579.462836] Dazed and confused, but trying to continue
> 
> in the guests when perf is used.  These messages disappear when
> 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> reverted.  I haven't yet figured out where exactly the culprit is.

Can you try reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.


* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 15:21           ` Roman Kagan
  2023-06-30 15:45             ` Sean Christopherson
@ 2023-06-30 16:40             ` Jim Mattson
  2023-06-30 23:25               ` Jim Mattson
  1 sibling, 1 reply; 25+ messages in thread
From: Jim Mattson @ 2023-06-30 16:40 UTC (permalink / raw)
  To: Roman Kagan, Sean Christopherson, Jim Mattson, Paolo Bonzini,
	Eric Hankland, kvm, Dave Hansen, Like Xu, x86, Thomas Gleixner,
	linux-kernel, H. Peter Anvin, Borislav Petkov, Ingo Molnar,
	Mingwei Zhang

On Fri, Jun 30, 2023 at 8:21 AM Roman Kagan <rkagan@amazon.de> wrote:
>
> On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > >         return counter & pmc_bitmask(pmc);
> > > >  }
> > > >
> > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > +{
> > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > +
> > > > +       pmc->counter = val;
> > >
> > > Doesn't this still have the original problem of storing wider value than
> > > allowed?
> >
> > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > patch on top.  Sorry for not making that clear.
>
> Ah, got it, thanks!
>
> Also I'm now chasing a problem that we occasionally see
>
> [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> [3939579.462836] Do you have a strange power saving mode enabled?
> [3939579.462836] Dazed and confused, but trying to continue
>
> in the guests when perf is used.  These messages disappear when
> 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> reverted.  I haven't yet figured out where exactly the culprit is.

Maybe this is because KVM doesn't virtualize
IA32_DEBUGCTL.Freeze_PerfMon_On_PMI?

Consider:

1. PMC0 overflows, GLOBAL_STATUS[0] is set, and an NMI is delivered.
2. Before the guest's PMI handler clears GLOBAL_CTRL, PMC1 overflows,
GLOBAL_STATUS[1] is set, and an NMI is queued for delivery after the
next IRET.
3. The guest's PMI handler clears GLOBAL_CTRL, reads 3 from
GLOBAL_STATUS, writes 3 to GLOBAL_OVF_CTRL, re-enables GLOBAL_CTRL,
and IRETs.
4. The queued NMI is delivered, but GLOBAL_STATUS is now 0. No one
claims the NMI, so we get the spurious NMI message.

I don't know why this would require counting the retirement of
emulated instructions. It seems that hardware PMC overflow in the
early part of the guest's PMI handler would also be a problem.
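
For illustration, a stripped-down sketch of the handler flow in step 3 (this is
not the actual Linux PMI handler; the <asm/msr.h> wrmsrl()/rdmsrl() helpers and
the enabled_mask value are only stand-ins for the example):

/* Illustrative sketch of step 3, not the real guest PMI handler. */
static void guest_pmi_handler_sketch(u64 enabled_mask)
{
	u64 status;

	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);		/* disable all counters */
	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);	/* reads 3 (PMC0 | PMC1) */

	/* ... handle both overflows, reprogram the counters ... */

	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);	/* clears both status bits */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, enabled_mask);	/* re-enable counting */

	/*
	 * The IRET that ends the NMI handler unblocks NMIs; the NMI queued in
	 * step 2 is then delivered, but GLOBAL_STATUS is already 0, so nobody
	 * claims it.
	 */
}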

> Thanks,
> Roman.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 15:45             ` Sean Christopherson
@ 2023-06-30 17:07               ` Mingwei Zhang
  2023-06-30 17:16                 ` Jim Mattson
  2023-06-30 21:26               ` Sean Christopherson
  2023-07-03 13:33               ` Roman Kagan
  2 siblings, 1 reply; 25+ messages in thread
From: Mingwei Zhang @ 2023-06-30 17:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Roman Kagan, Jim Mattson, Paolo Bonzini, Eric Hankland, kvm,
	Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar

On Fri, Jun 30, 2023 at 8:45 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jun 30, 2023, Roman Kagan wrote:
> > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > >         return counter & pmc_bitmask(pmc);
> > > > >  }
> > > > >
> > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > +{
> > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > +
> > > > > +       pmc->counter = val;
> > > >
> > > > Doesn't this still have the original problem of storing wider value than
> > > > allowed?
> > >
> > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > patch on top.  Sorry for not making that clear.
> >
> > Ah, got it, thanks!
> >
> > Also I'm now chasing a problem that we occasionally see
> >
> > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > [3939579.462836] Do you have a strange power saving mode enabled?
> > [3939579.462836] Dazed and confused, but trying to continue
> >
> > in the guests when perf is used.  These messages disappear when
> > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > reverted.  I haven't yet figured out where exactly the culprit is.
>
> Can you try reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.

For sure it is a prev_counter issue; I have done some instrumentation as follows:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 48a0528080ab..946663a42326 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -322,8 +322,11 @@ static void reprogram_counter(struct kvm_pmc *pmc)
        if (!pmc_event_is_allowed(pmc))
                goto reprogram_complete;

-       if (pmc->counter < pmc->prev_counter)
+       if (pmc->counter < pmc->prev_counter) {
+               pr_info("pmc->counter: %llx\tpmc->prev_counter: %llx\n",
+                       pmc->counter, pmc->prev_counter);
                __kvm_perf_overflow(pmc, false);
+       }

        if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
                printk_once("kvm pmu: pin control bit is ignored\n");

I found some interesting changes in prev_counter:

[  +7.295348] pmc->counter: 12 pmc->prev_counter: fffffffffb3d
[  +0.622991] pmc->counter: 3 pmc->prev_counter: fffffffffb1a
[  +6.943282] pmc->counter: 1 pmc->prev_counter: fffffffff746
[  +4.483523] pmc->counter: 0 pmc->prev_counter: ffffffffffff
[ +12.817772] pmc->counter: 0 pmc->prev_counter: ffffffffffff
[ +21.721233] pmc->counter: 0 pmc->prev_counter: ffffffffffff

The first 3 logs will generate this:

[ +11.811925] Uhhuh. NMI received for unknown reason 20 on CPU 2.
[  +0.000003] Dazed and confused, but trying to continue

While the last 3 logs won't. This is quite reasonable: looking at
de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow via
pmc->prev_counter"), counter and prev_counter should only differ by 1.
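
For context, the emulated-count path added by de0f619564f4 looks roughly like
this (simplified; the same lines show up in the diffs later in this thread),
which is why counter and prev_counter are expected to differ by at most 1
right after an emulated increment:

static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
{
	pmc->prev_counter = pmc->counter;
	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
	kvm_pmu_request_counter_reprogram(pmc);
}

/* ... and reprogram_counter() then checks: */
	if (pmc->counter < pmc->prev_counter)
		__kvm_perf_overflow(pmc, false);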

So the prime suspect is a stale prev_counter. There might be several
potential reasons behind this. Jim's theory is highly plausible, as I did
another experiment and found that KVM may leave pmu->global_status as '0'
when injecting an NMI.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 17:07               ` Mingwei Zhang
@ 2023-06-30 17:16                 ` Jim Mattson
  2023-06-30 17:32                   ` Mingwei Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Jim Mattson @ 2023-06-30 17:16 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Roman Kagan, Paolo Bonzini, Eric Hankland,
	kvm, Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar

On Fri, Jun 30, 2023 at 10:08 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Fri, Jun 30, 2023 at 8:45 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > > >         return counter & pmc_bitmask(pmc);
> > > > > >  }
> > > > > >
> > > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > > +{
> > > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > > +
> > > > > > +       pmc->counter = val;
> > > > >
> > > > > Doesn't this still have the original problem of storing wider value than
> > > > > allowed?
> > > >
> > > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > > patch on top.  Sorry for not making that clear.
> > >
> > > Ah, got it, thanks!
> > >
> > > Also I'm now chasing a problem that we occasionally see
> > >
> > > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > > [3939579.462836] Do you have a strange power saving mode enabled?
> > > [3939579.462836] Dazed and confused, but trying to continue
> > >
> > > in the guests when perf is used.  These messages disappear when
> > > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > > reverted.  I haven't yet figured out where exactly the culprit is.
> >
> > Can you try reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> > via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.
>
> For sure it is a prev_counter issue; I have done some instrumentation as follows:
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 48a0528080ab..946663a42326 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -322,8 +322,11 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>         if (!pmc_event_is_allowed(pmc))
>                 goto reprogram_complete;
>
> -       if (pmc->counter < pmc->prev_counter)
> +       if (pmc->counter < pmc->prev_counter) {
> +               pr_info("pmc->counter: %llx\tpmc->prev_counter: %llx\n",
> +                       pmc->counter, pmc->prev_counter);
>                 __kvm_perf_overflow(pmc, false);
> +       }
>
>         if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
>                 printk_once("kvm pmu: pin control bit is ignored\n");
>
> I found some interesting changes in prev_counter:
>
> [  +7.295348] pmc->counter: 12 pmc->prev_counter: fffffffffb3d
> [  +0.622991] pmc->counter: 3 pmc->prev_counter: fffffffffb1a
> [  +6.943282] pmc->counter: 1 pmc->prev_counter: fffffffff746
> [  +4.483523] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> [ +12.817772] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> [ +21.721233] pmc->counter: 0 pmc->prev_counter: ffffffffffff
>
> The first 3 logs will generate this:
>
> [ +11.811925] Uhhuh. NMI received for unknown reason 20 on CPU 2.
> [  +0.000003] Dazed and confused, but trying to continue
>
> While the last 3 logs won't. This is quite reasonable: looking at
> de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow via
> pmc->prev_counter"), counter and prev_counter should only differ by 1.

prev_counter isn't actually sync'ed at this point, is it? This comes
back to that "setting a running counter" nonsense. We want to add 1 to
the current counter, but we don't actually know what the current
counter is.

My interpretation of the above is that, in the first three cases, PMU
hardware has already detected an overflow. In the last three cases,
software counting has detected an overflow.

If the last three occur while executing the guest's PMI handler (i.e.
NMIs are blocked), then this could corroborate my conjecture about
IA32_DEBUGCTL.Freeze_PerfMon_On_PMI.

> So the prime suspect is a stale prev_counter. There might be several
> potential reasons behind this. Jim's theory is highly plausible, as I did
> another experiment and found that KVM may leave pmu->global_status as '0'
> when injecting an NMI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 17:16                 ` Jim Mattson
@ 2023-06-30 17:32                   ` Mingwei Zhang
  2023-06-30 18:03                     ` Mingwei Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Mingwei Zhang @ 2023-06-30 17:32 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Roman Kagan, Paolo Bonzini, Eric Hankland,
	kvm, Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar

On Fri, Jun 30, 2023 at 10:16 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Fri, Jun 30, 2023 at 10:08 AM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Fri, Jun 30, 2023 at 8:45 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > > > >         return counter & pmc_bitmask(pmc);
> > > > > > >  }
> > > > > > >
> > > > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > > > +{
> > > > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > > > +
> > > > > > > +       pmc->counter = val;
> > > > > >
> > > > > > Doesn't this still have the original problem of storing wider value than
> > > > > > allowed?
> > > > >
> > > > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > > > patch on top.  Sorry for not making that clear.
> > > >
> > > > Ah, got it, thanks!
> > > >
> > > > Also I'm now chasing a problem that we occasionally see
> > > >
> > > > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > > > [3939579.462836] Do you have a strange power saving mode enabled?
> > > > [3939579.462836] Dazed and confused, but trying to continue
> > > >
> > > > in the guests when perf is used.  These messages disappear when
> > > > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > > > reverted.  I haven't yet figured out where exactly the culprit is.
> > >
> > > Can you try reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> > > via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.
> >
> > For sure it is a prev_counter issue; I have done some instrumentation as follows:
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 48a0528080ab..946663a42326 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -322,8 +322,11 @@ static void reprogram_counter(struct kvm_pmc *pmc)
> >         if (!pmc_event_is_allowed(pmc))
> >                 goto reprogram_complete;
> >
> > -       if (pmc->counter < pmc->prev_counter)
> > +       if (pmc->counter < pmc->prev_counter) {
> > +               pr_info("pmc->counter: %llx\tpmc->prev_counter: %llx\n",
> > +                       pmc->counter, pmc->prev_counter);
> >                 __kvm_perf_overflow(pmc, false);
> > +       }
> >
> >         if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
> >                 printk_once("kvm pmu: pin control bit is ignored\n");
> >
> > I found some interesting changes in prev_counter:
> >
> > [  +7.295348] pmc->counter: 12 pmc->prev_counter: fffffffffb3d
> > [  +0.622991] pmc->counter: 3 pmc->prev_counter: fffffffffb1a
> > [  +6.943282] pmc->counter: 1 pmc->prev_counter: fffffffff746
> > [  +4.483523] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> > [ +12.817772] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> > [ +21.721233] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> >
> > The first 3 logs will generate this:
> >
> > [ +11.811925] Uhhuh. NMI received for unknown reason 20 on CPU 2.
> > [  +0.000003] Dazed and confused, but trying to continue
> >
> > While the last 3 logs won't. This is quite reasonable: looking at
> > de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow via
> > pmc->prev_counter"), counter and prev_counter should only differ by 1.
>
> prev_counter isn't actually sync'ed at this point, is it? This comes
> back to that "setting a running counter" nonsense. We want to add 1 to
> the current counter, but we don't actually know what the current
> counter is.
>
> My interpretation of the above is that, in the first three cases, PMU
> hardware has already detected an overflow. In the last three cases,
> software counting has detected an overflow.
>
> If the last three occur while executing the guest's PMI handler (i.e.
> NMIs are blocked), then this could corroborate my conjecture about
> IA32_DEBUGCTL.Freeze_PerfMon_On_PMI.
>

I see. I wonder if we can just do this:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 48a0528080ab..8d28158e58f2 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -322,7 +322,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
        if (!pmc_event_is_allowed(pmc))
                goto reprogram_complete;

-       if (pmc->counter < pmc->prev_counter)
+       if (pmc->counter == 0)
                __kvm_perf_overflow(pmc, false);

        if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)

Since this is software emulation, we (KVM) should only handle the overflow
caused by adding one?

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 17:32                   ` Mingwei Zhang
@ 2023-06-30 18:03                     ` Mingwei Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Mingwei Zhang @ 2023-06-30 18:03 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Roman Kagan, Paolo Bonzini, Eric Hankland,
	kvm, Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar

> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 48a0528080ab..8d28158e58f2 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -322,7 +322,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>         if (!pmc_event_is_allowed(pmc))
>                 goto reprogram_complete;
>
> -       if (pmc->counter < pmc->prev_counter)
> +       if (pmc->counter == 0)
>                 __kvm_perf_overflow(pmc, false);
>
>         if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
>
> Since this is software emulation, we (KVM) should only handle the overflow
> caused by adding one?

Sigh. Please ignore this as it is not only hacky but also not working.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 15:45             ` Sean Christopherson
  2023-06-30 17:07               ` Mingwei Zhang
@ 2023-06-30 21:26               ` Sean Christopherson
  2023-07-01 19:51                 ` Mingwei Zhang
                                   ` (2 more replies)
  2023-07-03 13:33               ` Roman Kagan
  2 siblings, 3 replies; 25+ messages in thread
From: Sean Christopherson @ 2023-06-30 21:26 UTC (permalink / raw)
  To: Roman Kagan, Jim Mattson, Paolo Bonzini, Eric Hankland, kvm,
	Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Fri, Jun 30, 2023, Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Roman Kagan wrote:
> > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > >         return counter & pmc_bitmask(pmc);
> > > > >  }
> > > > >
> > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > +{
> > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > +
> > > > > +       pmc->counter = val;
> > > >
> > > > Doesn't this still have the original problem of storing wider value than
> > > > allowed?
> > > 
> > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > patch on top.  Sorry for not making that clear.
> > 
> > Ah, got it, thanks!
> > 
> > Also I'm now chasing a problem that we occasionally see
> > 
> > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > [3939579.462836] Do you have a strange power saving mode enabled?
> > [3939579.462836] Dazed and confused, but trying to continue
> > 
> > in the guests when perf is used.  These messages disappear when
> > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > reverted.  I haven't yet figured out where exactly the culprit is.
> 
> Can you try reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.

Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
was solving is that perf_event_read_value() and friends might sleep (yay mutex),
and so can't be called from KVM's fastpath (IRQs disabled).

However, detecting overflow requires calling perf_event_read_value() to gather
the accumulated count from the hardware event in order to add it to the emulated
count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
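
A minimal sketch of that arithmetic (the helper name and signature are made up
for illustration; bitmask stands for pmc_bitmask(pmc)):

/*
 * x is the software-visible base (pmc->counter), y is the count accumulated
 * in the perf event since it was last paused.
 */
static bool emulated_increment_overflows(u64 x, u64 y, u64 bitmask)
{
	u64 cur = (x + y) & bitmask;

	/* x + 1 may not wrap even though x + y + 1 does. */
	return ((cur + 1) & bitmask) < cur;
}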

Trying to snapshot the previous counter value is a bit of a mess.  It could probably
be made to work, but it's hard to reason about what the snapshot actually contains
and when it should be cleared, especially when factoring in the wrapping logic.

Rather than snapshot the previous counter, I think it makes sense to:

  1) Track the number of emulated counter events
  2) Accumulate and reset the counts from perf_event and emulated_counter into
     pmc->counter when pausing the PMC
  3) Pause and reprogram the PMC on writes (instead of the current approach of
     blindly updating the sample period)
  4) Pause the counter when stopping the perf_event to ensure pmc->counter is
     fresh (instead of manually updating pmc->counter)

IMO, that yields more intuitive logic, and makes it easier to reason about
correctness since the behavior is easily defined: pmc->counter holds the counts
that have been gathered and processed, perf_event and emulated_counter hold
outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
counter and the hardware counter are reset, because whatever counts existed
previously are irrelevant.

Pausing the counter _might_ make WRMSR slower, but we need to get this all
functionally correct before worrying too much about performance.

Diff below for what I'm thinking (needs to be split into multiple patches).  It's
*very* lightly tested.

I'm about to disappear for a week; I'll pick this back up when I return.  In
the meantime, any testing and/or input would be much appreciated!

---
 arch/x86/include/asm/kvm-x86-pmu-ops.h |  2 +-
 arch/x86/include/asm/kvm_host.h        | 11 ++-
 arch/x86/kvm/pmu.c                     | 94 ++++++++++++++++++++++----
 arch/x86/kvm/pmu.h                     | 53 +++------------
 arch/x86/kvm/svm/pmu.c                 | 19 +-----
 arch/x86/kvm/vmx/pmu_intel.c           | 26 +------
 6 files changed, 103 insertions(+), 102 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 6c98f4bb4228..058bc636356a 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -22,7 +22,7 @@ KVM_X86_PMU_OP(get_msr)
 KVM_X86_PMU_OP(set_msr)
 KVM_X86_PMU_OP(refresh)
 KVM_X86_PMU_OP(init)
-KVM_X86_PMU_OP(reset)
+KVM_X86_PMU_OP_OPTIONAL(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28bd38303d70..337f5e1da57c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -492,8 +492,17 @@ struct kvm_pmc {
 	u8 idx;
 	bool is_paused;
 	bool intr;
+	/*
+	 * Value of the PMC counter that has been gathered from the associated
+	 * perf_event and from emulated_counter.  This is *not* the current
+	 * value as seen by the guest or userspace.
+	 */
 	u64 counter;
-	u64 prev_counter;
+	/*
+	 * PMC events triggered by KVM emulation that haven't been fully
+	 * processed, e.g. haven't undergone overflow detection.
+	 */
+	u64 emulated_counter;
 	u64 eventsel;
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index bf653df86112..472e45f5993f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -148,9 +148,9 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 
 	/*
-	 * Ignore overflow events for counters that are scheduled to be
-	 * reprogrammed, e.g. if a PMI for the previous event races with KVM's
-	 * handling of a related guest WRMSR.
+	 * Ignore asynchronous overflow events for counters that are scheduled
+	 * to be reprogrammed, e.g. if a PMI for the previous event races with
+	 * KVM's handling of a related guest WRMSR.
 	 */
 	if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
 		return;
@@ -182,6 +182,21 @@ static u64 pmc_get_pebs_precise_level(struct kvm_pmc *pmc)
 	return 1;
 }
 
+static u64 pmc_get_sample_period(struct kvm_pmc *pmc)
+{
+	u64 sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
+	/*
+	 * Verify pmc->counter is fresh, i.e. that the perf event is paused and
+	 * emulated events have been gathered.
+	 */
+	WARN_ON_ONCE(pmc->emulated_counter || (pmc->perf_event && !pmc->is_paused));
+
+	if (!sample_period)
+		sample_period = pmc_bitmask(pmc) + 1;
+	return sample_period;
+}
+
 static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 				 bool exclude_user, bool exclude_kernel,
 				 bool intr)
@@ -200,7 +215,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 	};
 	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
 
-	attr.sample_period = get_sample_period(pmc, pmc->counter);
+	attr.sample_period = pmc_get_sample_period(pmc);
 
 	if ((attr.config & HSW_IN_TX_CHECKPOINTED) &&
 	    guest_cpuid_is_intel(pmc->vcpu)) {
@@ -238,13 +253,19 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 
 static void pmc_pause_counter(struct kvm_pmc *pmc)
 {
-	u64 counter = pmc->counter;
+	/*
+	 * Accumulate emulated events, even if the PMC was already paused, e.g.
+	 * if KVM emulated an event after a WRMSR, but before reprogramming, or
+	 * if KVM couldn't create a perf event.
+	 */
+	u64 counter = pmc->counter + pmc->emulated_counter;
 
-	if (!pmc->perf_event || pmc->is_paused)
-		return;
+	pmc->emulated_counter = 0;
 
 	/* update counter, reset event value to avoid redundant accumulation */
-	counter += perf_event_pause(pmc->perf_event, true);
+	if (pmc->perf_event && !pmc->is_paused)
+		counter += perf_event_pause(pmc->perf_event, true);
+
 	pmc->counter = counter & pmc_bitmask(pmc);
 	pmc->is_paused = true;
 }
@@ -256,8 +277,7 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
 
 	/* recalibrate sample period and check if it's accepted by perf core */
 	if (is_sampling_event(pmc->perf_event) &&
-	    perf_event_period(pmc->perf_event,
-			      get_sample_period(pmc, pmc->counter)))
+	    perf_event_period(pmc->perf_event, pmc_get_sample_period(pmc)))
 		return false;
 
 	if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
@@ -395,6 +415,32 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
 	return is_fixed_event_allowed(filter, pmc->idx);
 }
 
+void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
+{
+	pmc_pause_counter(pmc);
+	pmc->counter = val & pmc_bitmask(pmc);
+	kvm_pmu_request_counter_reprogram(pmc);
+}
+EXPORT_SYMBOL_GPL(pmc_write_counter);
+
+static void pmc_release_perf_event(struct kvm_pmc *pmc)
+{
+	if (pmc->perf_event) {
+		perf_event_release_kernel(pmc->perf_event);
+		pmc->perf_event = NULL;
+		pmc->current_config = 0;
+		pmc_to_pmu(pmc)->event_count--;
+	}
+}
+
+static void pmc_stop_counter(struct kvm_pmc *pmc)
+{
+	if (pmc->perf_event) {
+		pmc_pause_counter(pmc);
+		pmc_release_perf_event(pmc);
+	}
+}
+
 static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
 {
 	return pmc_is_globally_enabled(pmc) && pmc_speculative_in_use(pmc) &&
@@ -404,6 +450,7 @@ static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
 static void reprogram_counter(struct kvm_pmc *pmc)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+	u64 prev_counter = pmc->counter;
 	u64 eventsel = pmc->eventsel;
 	u64 new_config = eventsel;
 	u8 fixed_ctr_ctrl;
@@ -413,7 +460,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
 	if (!pmc_event_is_allowed(pmc))
 		goto reprogram_complete;
 
-	if (pmc->counter < pmc->prev_counter)
+	if (pmc->counter < prev_counter)
 		__kvm_perf_overflow(pmc, false);
 
 	if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
@@ -453,7 +500,6 @@ static void reprogram_counter(struct kvm_pmc *pmc)
 
 reprogram_complete:
 	clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
-	pmc->prev_counter = 0;
 }
 
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
@@ -678,9 +724,28 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	int i;
 
 	irq_work_sync(&pmu->irq_work);
-	static_call(kvm_x86_pmu_reset)(vcpu);
+
+	bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX);
+
+	for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
+		pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);
+		if (!pmc)
+			continue;
+
+		pmc_stop_counter(pmc);
+		pmc->counter = 0;
+
+		if (pmc_is_gp(pmc))
+			pmc->eventsel = 0;
+	};
+
+	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
+
+	static_call_cond(kvm_x86_pmu_reset)(vcpu);
 }
 
 void kvm_pmu_init(struct kvm_vcpu *vcpu)
@@ -727,8 +792,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 
 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
-	pmc->prev_counter = pmc->counter;
-	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
+	pmc->emulated_counter++;
 	kvm_pmu_request_counter_reprogram(pmc);
 }
 
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..0ac60ffae944 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -55,6 +55,12 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 	return pmu->version > 1;
 }
 
+static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
+{
+	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
+	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+}
+
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
@@ -66,31 +72,17 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 {
 	u64 counter, enabled, running;
 
-	counter = pmc->counter;
+	counter = pmc->counter + pmc->emulated_counter;
+
 	if (pmc->perf_event && !pmc->is_paused)
 		counter += perf_event_read_value(pmc->perf_event,
 						 &enabled, &running);
+
 	/* FIXME: Scaling needed? */
 	return counter & pmc_bitmask(pmc);
 }
 
-static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
-{
-	if (pmc->perf_event) {
-		perf_event_release_kernel(pmc->perf_event);
-		pmc->perf_event = NULL;
-		pmc->current_config = 0;
-		pmc_to_pmu(pmc)->event_count--;
-	}
-}
-
-static inline void pmc_stop_counter(struct kvm_pmc *pmc)
-{
-	if (pmc->perf_event) {
-		pmc->counter = pmc_read_counter(pmc);
-		pmc_release_perf_event(pmc);
-	}
-}
+void pmc_write_counter(struct kvm_pmc *pmc, u64 val);
 
 static inline bool pmc_is_gp(struct kvm_pmc *pmc)
 {
@@ -140,25 +132,6 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
 	return NULL;
 }
 
-static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
-{
-	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
-
-	if (!sample_period)
-		sample_period = pmc_bitmask(pmc) + 1;
-	return sample_period;
-}
-
-static inline void pmc_update_sample_period(struct kvm_pmc *pmc)
-{
-	if (!pmc->perf_event || pmc->is_paused ||
-	    !is_sampling_event(pmc->perf_event))
-		return;
-
-	perf_event_period(pmc->perf_event,
-			  get_sample_period(pmc, pmc->counter));
-}
-
 static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
@@ -214,12 +187,6 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 					     KVM_PMC_MAX_FIXED);
 }
 
-static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
-{
-	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
-	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
-}
-
 static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
 {
 	int bit;
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index cef5a3d0abd0..b6a7ad4d6914 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -160,8 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	/* MSR_PERFCTRn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
 	if (pmc) {
-		pmc->counter += data - pmc_read_counter(pmc);
-		pmc_update_sample_period(pmc);
+		pmc_write_counter(pmc, data);
 		return 0;
 	}
 	/* MSR_EVNTSELn */
@@ -233,21 +232,6 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void amd_pmu_reset(struct kvm_vcpu *vcpu)
-{
-	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	int i;
-
-	for (i = 0; i < KVM_AMD_PMC_MAX_GENERIC; i++) {
-		struct kvm_pmc *pmc = &pmu->gp_counters[i];
-
-		pmc_stop_counter(pmc);
-		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
-	}
-
-	pmu->global_ctrl = pmu->global_status = 0;
-}
-
 struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.hw_event_available = amd_hw_event_available,
 	.pmc_idx_to_pmc = amd_pmc_idx_to_pmc,
@@ -259,7 +243,6 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.set_msr = amd_pmu_set_msr,
 	.refresh = amd_pmu_refresh,
 	.init = amd_pmu_init,
-	.reset = amd_pmu_reset,
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 80c769c58a87..ce49d060bc96 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -406,12 +406,10 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (!msr_info->host_initiated &&
 			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
 				data = (s64)(s32)data;
-			pmc->counter += data - pmc_read_counter(pmc);
-			pmc_update_sample_period(pmc);
+			pmc_write_counter(pmc, data);
 			break;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
-			pmc->counter += data - pmc_read_counter(pmc);
-			pmc_update_sample_period(pmc);
+			pmc_write_counter(pmc, data);
 			break;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
 			reserved_bits = pmu->reserved_bits;
@@ -603,26 +601,6 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 
 static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 {
-	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	struct kvm_pmc *pmc = NULL;
-	int i;
-
-	for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) {
-		pmc = &pmu->gp_counters[i];
-
-		pmc_stop_counter(pmc);
-		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
-	}
-
-	for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
-		pmc = &pmu->fixed_counters[i];
-
-		pmc_stop_counter(pmc);
-		pmc->counter = pmc->prev_counter = 0;
-	}
-
-	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
-
 	intel_pmu_release_guest_lbr_event(vcpu);
 }
 

base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60
-- 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 16:40             ` Jim Mattson
@ 2023-06-30 23:25               ` Jim Mattson
  0 siblings, 0 replies; 25+ messages in thread
From: Jim Mattson @ 2023-06-30 23:25 UTC (permalink / raw)
  To: Roman Kagan, Sean Christopherson, Jim Mattson, Paolo Bonzini,
	Eric Hankland, kvm, Dave Hansen, Like Xu, x86, Thomas Gleixner,
	linux-kernel, H. Peter Anvin, Borislav Petkov, Ingo Molnar,
	Mingwei Zhang

On Fri, Jun 30, 2023 at 9:40 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Fri, Jun 30, 2023 at 8:21 AM Roman Kagan <rkagan@amazon.de> wrote:
> >
> > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > >         return counter & pmc_bitmask(pmc);
> > > > >  }
> > > > >
> > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > +{
> > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > +
> > > > > +       pmc->counter = val;
> > > >
> > > > Doesn't this still have the original problem of storing wider value than
> > > > allowed?
> > >
> > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > patch on top.  Sorry for not making that clear.
> >
> > Ah, got it, thanks!
> >
> > Also I'm now chasing a problem that we occasionally see
> >
> > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > [3939579.462836] Do you have a strange power saving mode enabled?
> > [3939579.462836] Dazed and confused, but trying to continue
> >
> > in the guests when perf is used.  These messages disappear when
> > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > reverted.  I haven't yet figured out where exactly the culprit is.
>
> Maybe this is because KVM doesn't virtualize
> IA32_DEBUGCTL.Freeze_PerfMon_On_PMI?

Never mind. Linux doesn't set IA32_DEBUGCTL.Freeze_PerfMon_On_PMI.

> Consider:
>
> 1. PMC0 overflows, GLOBAL_STATUS[0] is set, and an NMI is delivered.
> 2. Before the guest's PMI handler clears GLOBAL_CTRL, PMC1 overflows,
> GLOBAL_STATUS[1] is set, and an NMI is queued for delivery after the
> next IRET.
> 3. The guest's PMI handler clears GLOBAL_CTRL, reads 3 from
> GLOBAL_STATUS, writes 3 to GLOBAL_OVF_CTRL, re-enables GLOBAL_CTRL,
> and IRETs.
> 4. The queued NMI is delivered, but GLOBAL_STATUS is now 0. No one
> claims the NMI, so we get the spurious NMI message.
>
> I don't know why this would require counting the retirement of
> emulated instructions. It seems that hardware PMC overflow in the
> early part of the guest's PMI handler would also be a problem.
>
> > Thanks,
> > Roman.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 21:26               ` Sean Christopherson
@ 2023-07-01 19:51                 ` Mingwei Zhang
  2023-08-11  8:30                 ` Dapeng Mi
  2023-08-22  9:29                 ` Like Xu
  2 siblings, 0 replies; 25+ messages in thread
From: Mingwei Zhang @ 2023-07-01 19:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Roman Kagan, Jim Mattson, Paolo Bonzini, Eric Hankland, kvm,
	Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar

On Fri, Jun 30, 2023, Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Sean Christopherson wrote:
> > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > > >         return counter & pmc_bitmask(pmc);
> > > > > >  }
> > > > > >
> > > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > > +{
> > > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > > +
> > > > > > +       pmc->counter = val;
> > > > >
> > > > > Doesn't this still have the original problem of storing wider value than
> > > > > allowed?
> > > > 
> > > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > > patch on top.  Sorry for not making that clear.
> > > 
> > > Ah, got it, thanks!
> > > 
> > > Also I'm now chasing a problem that we occasionally see
> > > 
> > > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > > [3939579.462836] Do you have a strange power saving mode enabled?
> > > [3939579.462836] Dazed and confused, but trying to continue
> > > 
> > > in the guests when perf is used.  These messages disappear when
> > > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > > reverted.  I haven't yet figured out where exactly the culprit is.
> > 
> > Can you try reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> > via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.
> 
> Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
> was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> and so can't be called from KVM's fastpath (IRQs disabled).
> 
> However, detecting overflow requires calling perf_event_read_value() to gather
> the accumulated count from the hardware event in order to add it to the emulated
> count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
> KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
> 
> Trying to snapshot the previous counter value is a bit of a mess.  It could probably
> be made to work, but it's hard to reason about what the snapshot actually contains
> and when it should be cleared, especially when factoring in the wrapping logic.
> 
> Rather than snapshot the previous counter, I think it makes sense to:
> 
>   1) Track the number of emulated counter events
>   2) Accumulate and reset the counts from perf_event and emulated_counter into
>      pmc->counter when pausing the PMC
>   3) Pause and reprogram the PMC on writes (instead of the current approach of
>      blindly updating the sample period)
>   4) Pause the counter when stopping the perf_event to ensure pmc->counter is
>      fresh (instead of manually updating pmc->counter)
> 
> IMO, that yields more intuitive logic, and makes it easier to reason about
> correctness since the behavior is easily defined: pmc->counter holds the counts
> that have been gathered and processed, perf_event and emulated_counter hold
> outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
> counter and the hardware counter are reset, because whatever counts existed
> previously are irrelevant.
> 
> Pausing the counter _might_ make WRMSR slower, but we need to get this all
> functionally correct before worrying too much about performance.
> 
> Diff below for what I'm thinking (needs to be split into multiple patches).  It's
> *very* lightly tested.
> 
> I'm about to disappear for a week; I'll pick this back up when I return.  In
> the meantime, any testing and/or input would be much appreciated!
> 

Sean, I have tested this change with QEMU on the tip of kvmx86/next.
Hmm, my observation is that this one still causes spurious NMIs. The
volume of spurious NMIs is even higher...

Command I am using inside VM:
/usr/bin/perf record -N -B -T --sample-cpu  --compression-level=1 --threads --clockid=CLOCK_MONOTONIC_RAW -e $events

Events:
cpu/period=0xcdfe60,event=0x3c,name='CPU_CLK_UNHALTED.THREAD'/Duk
cpu/period=0xcdfe60,umask=0x3,name='CPU_CLK_UNHALTED.REF_TSC'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk
cpu/period=0xcdfe60,event=0xec,umask=0x2,name='CPU_CLK_UNHALTED.DISTRIBUTED'/uk
cpu/period=0x98968f,event=0x3c,name='CPU_CLK_UNHALTED.THREAD_P'/uk
cpu/period=0x98968f,event=0xa6,umask=0x21,cmask=0x5,name='EXE_ACTIVITY.BOUND_ON_LOADS'/uk
cpu/period=0x4c4b4f,event=0x9c,umask=0x1,name='IDQ_UOPS_NOT_DELIVERED.CORE'/uk
cpu/period=0x4c4b4f,event=0xad,umask=0x10,name='INT_MISC.UOP_DROPPING'/uk
cpu/period=0x4c4b4f,event=0x47,umask=0x5,cmask=0x5,name='MEMORY_ACTIVITY.STALLS_L2_MISS'/uk
cpu/period=0x7a143,event=0xd1,umask=0x40,name='MEM_LOAD_RETIRED.FB_HIT'/uk
cpu/period=0xf424f,event=0xd1,umask=0x8,name='MEM_LOAD_RETIRED.L1_MISS'/uk
cpu/period=0x4c4b4f,event=0x20,umask=0x8,cmask=0x4,name='OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD_cmask_4'/uk
cpu/period=0x2faf090,event=0xa4,umask=0x2,name='TOPDOWN.BACKEND_BOUND_SLOTS'/uk
cpu/period=0x2faf090,event=0xa4,umask=0x10,name='TOPDOWN.MEMORY_BOUND_SLOTS'/uk
cpu/period=0x98968f,event=0xc2,umask=0x2,name='UOPS_RETIRED.SLOTS'/uk
cpu/period=0x7a12f,event=0xd0,umask=0x82,name='MEM_INST_RETIRED.ALL_STORES'/uk
cpu/period=0x7a12f,event=0xd0,umask=0x81,name='MEM_INST_RETIRED.ALL_LOADS'/uk

In addition, I noticed a periodic performance overhead at the host level
(invisible to the guest). It seems all vCPU threads are spending 60% of the
time on the mutex inside perf_event_context. So a significant amount of
time was spent in kvm_pmu_handle_event(), and in particular in
perf_event_ctx_lock{,_nested}() and perf_event_ctx_unlock(), for a couple
of seconds periodically.

Looking deeper, the perf_event_context might be
the perf_cpu_context->task_ctx. But I still cannot convince myself that
all vCPU threads may be competing for this one, since it should still
be a per-task_struct mutex...

Anyway, the performance issue may be off the current topic, but it
persists across many kernel versions when using vPMU with the above event
set in sampling mode.

> ---
>  arch/x86/include/asm/kvm-x86-pmu-ops.h |  2 +-
>  arch/x86/include/asm/kvm_host.h        | 11 ++-
>  arch/x86/kvm/pmu.c                     | 94 ++++++++++++++++++++++----
>  arch/x86/kvm/pmu.h                     | 53 +++------------
>  arch/x86/kvm/svm/pmu.c                 | 19 +-----
>  arch/x86/kvm/vmx/pmu_intel.c           | 26 +------
>  6 files changed, 103 insertions(+), 102 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index 6c98f4bb4228..058bc636356a 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -22,7 +22,7 @@ KVM_X86_PMU_OP(get_msr)
>  KVM_X86_PMU_OP(set_msr)
>  KVM_X86_PMU_OP(refresh)
>  KVM_X86_PMU_OP(init)
> -KVM_X86_PMU_OP(reset)
> +KVM_X86_PMU_OP_OPTIONAL(reset)
>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 28bd38303d70..337f5e1da57c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -492,8 +492,17 @@ struct kvm_pmc {
>  	u8 idx;
>  	bool is_paused;
>  	bool intr;
> +	/*
> +	 * Value of the PMC counter that has been gathered from the associated
> +	 * perf_event and from emulated_counter.  This is *not* the current
> +	 * value as seen by the guest or userspace.
> +	 */
>  	u64 counter;
> -	u64 prev_counter;
> +	/*
> +	 * PMC events triggered by KVM emulation that haven't been fully
> +	 * processed, e.g. haven't undergone overflow detection.
> +	 */
> +	u64 emulated_counter;
>  	u64 eventsel;
>  	struct perf_event *perf_event;
>  	struct kvm_vcpu *vcpu;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index bf653df86112..472e45f5993f 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -148,9 +148,9 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
>  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
>  
>  	/*
> -	 * Ignore overflow events for counters that are scheduled to be
> -	 * reprogrammed, e.g. if a PMI for the previous event races with KVM's
> -	 * handling of a related guest WRMSR.
> +	 * Ignore asynchronous overflow events for counters that are scheduled
> +	 * to be reprogrammed, e.g. if a PMI for the previous event races with
> +	 * KVM's handling of a related guest WRMSR.
>  	 */
>  	if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
>  		return;
> @@ -182,6 +182,21 @@ static u64 pmc_get_pebs_precise_level(struct kvm_pmc *pmc)
>  	return 1;
>  }
>  
> +static u64 pmc_get_sample_period(struct kvm_pmc *pmc)
> +{
> +	u64 sample_period = (-pmc->counter) & pmc_bitmask(pmc);
> +
> +	/*
> +	 * Verify pmc->counter is fresh, i.e. that the perf event is paused and
> +	 * emulated events have been gathered.
> +	 */
> +	WARN_ON_ONCE(pmc->emulated_counter || (pmc->perf_event && !pmc->is_paused));
> +
> +	if (!sample_period)
> +		sample_period = pmc_bitmask(pmc) + 1;
> +	return sample_period;
> +}
> +
>  static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
>  				 bool exclude_user, bool exclude_kernel,
>  				 bool intr)
> @@ -200,7 +215,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
>  	};
>  	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>  
> -	attr.sample_period = get_sample_period(pmc, pmc->counter);
> +	attr.sample_period = pmc_get_sample_period(pmc);
>  
>  	if ((attr.config & HSW_IN_TX_CHECKPOINTED) &&
>  	    guest_cpuid_is_intel(pmc->vcpu)) {
> @@ -238,13 +253,19 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
>  
>  static void pmc_pause_counter(struct kvm_pmc *pmc)
>  {
> -	u64 counter = pmc->counter;
> +	/*
> +	 * Accumulate emulated events, even if the PMC was already paused, e.g.
> +	 * if KVM emulated an event after a WRMSR, but before reprogramming, or
> +	 * if KVM couldn't create a perf event.
> +	 */
> +	u64 counter = pmc->counter + pmc->emulated_counter;
>  
> -	if (!pmc->perf_event || pmc->is_paused)
> -		return;
> +	pmc->emulated_counter = 0;
>  
>  	/* update counter, reset event value to avoid redundant accumulation */
> -	counter += perf_event_pause(pmc->perf_event, true);
> +	if (pmc->perf_event && !pmc->is_paused)
> +		counter += perf_event_pause(pmc->perf_event, true);
> +
>  	pmc->counter = counter & pmc_bitmask(pmc);
>  	pmc->is_paused = true;
>  }
> @@ -256,8 +277,7 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
>  
>  	/* recalibrate sample period and check if it's accepted by perf core */
>  	if (is_sampling_event(pmc->perf_event) &&
> -	    perf_event_period(pmc->perf_event,
> -			      get_sample_period(pmc, pmc->counter)))
> +	    perf_event_period(pmc->perf_event, pmc_get_sample_period(pmc)))
>  		return false;
>  
>  	if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
> @@ -395,6 +415,32 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
>  	return is_fixed_event_allowed(filter, pmc->idx);
>  }
>  
> +void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc_pause_counter(pmc);
> +	pmc->counter = val & pmc_bitmask(pmc);
> +	kvm_pmu_request_counter_reprogram(pmc);
> +}
> +EXPORT_SYMBOL_GPL(pmc_write_counter);
> +
> +static void pmc_release_perf_event(struct kvm_pmc *pmc)
> +{
> +	if (pmc->perf_event) {
> +		perf_event_release_kernel(pmc->perf_event);
> +		pmc->perf_event = NULL;
> +		pmc->current_config = 0;
> +		pmc_to_pmu(pmc)->event_count--;
> +	}
> +}
> +
> +static void pmc_stop_counter(struct kvm_pmc *pmc)
> +{
> +	if (pmc->perf_event) {
> +		pmc_pause_counter(pmc);
> +		pmc_release_perf_event(pmc);
> +	}
> +}
> +
>  static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
>  {
>  	return pmc_is_globally_enabled(pmc) && pmc_speculative_in_use(pmc) &&
> @@ -404,6 +450,7 @@ static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
>  static void reprogram_counter(struct kvm_pmc *pmc)
>  {
>  	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> +	u64 prev_counter = pmc->counter;
>  	u64 eventsel = pmc->eventsel;
>  	u64 new_config = eventsel;
>  	u8 fixed_ctr_ctrl;
> @@ -413,7 +460,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>  	if (!pmc_event_is_allowed(pmc))
>  		goto reprogram_complete;
>  
> -	if (pmc->counter < pmc->prev_counter)
> +	if (pmc->counter < prev_counter)
>  		__kvm_perf_overflow(pmc, false);
>  
>  	if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
> @@ -453,7 +500,6 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>  
>  reprogram_complete:
>  	clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
> -	pmc->prev_counter = 0;
>  }
>  
>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> @@ -678,9 +724,28 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	int i;
>  
>  	irq_work_sync(&pmu->irq_work);
> -	static_call(kvm_x86_pmu_reset)(vcpu);
> +
> +	bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX);
> +
> +	for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
> +		pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);
> +		if (!pmc)
> +			continue;
> +
> +		pmc_stop_counter(pmc);
> +		pmc->counter = 0;
> +
> +		if (pmc_is_gp(pmc))
> +			pmc->eventsel = 0;
> +	};
> +
> +	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
> +
> +	static_call_cond(kvm_x86_pmu_reset)(vcpu);
>  }
>  
>  void kvm_pmu_init(struct kvm_vcpu *vcpu)
> @@ -727,8 +792,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>  
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> -	pmc->prev_counter = pmc->counter;
> -	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> +	pmc->emulated_counter++;
>  	kvm_pmu_request_counter_reprogram(pmc);
>  }
>  
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 7d9ba301c090..0ac60ffae944 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -55,6 +55,12 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
>  	return pmu->version > 1;
>  }
>  
> +static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> +{
> +	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> +	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> +}
> +
>  static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
>  {
>  	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> @@ -66,31 +72,17 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  {
>  	u64 counter, enabled, running;
>  
> -	counter = pmc->counter;
> +	counter = pmc->counter + pmc->emulated_counter;
> +
>  	if (pmc->perf_event && !pmc->is_paused)
>  		counter += perf_event_read_value(pmc->perf_event,
>  						 &enabled, &running);
> +
>  	/* FIXME: Scaling needed? */
>  	return counter & pmc_bitmask(pmc);
>  }
>  
> -static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
> -{
> -	if (pmc->perf_event) {
> -		perf_event_release_kernel(pmc->perf_event);
> -		pmc->perf_event = NULL;
> -		pmc->current_config = 0;
> -		pmc_to_pmu(pmc)->event_count--;
> -	}
> -}
> -
> -static inline void pmc_stop_counter(struct kvm_pmc *pmc)
> -{
> -	if (pmc->perf_event) {
> -		pmc->counter = pmc_read_counter(pmc);
> -		pmc_release_perf_event(pmc);
> -	}
> -}
> +void pmc_write_counter(struct kvm_pmc *pmc, u64 val);
>  
>  static inline bool pmc_is_gp(struct kvm_pmc *pmc)
>  {
> @@ -140,25 +132,6 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
>  	return NULL;
>  }
>  
> -static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> -{
> -	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
> -
> -	if (!sample_period)
> -		sample_period = pmc_bitmask(pmc) + 1;
> -	return sample_period;
> -}
> -
> -static inline void pmc_update_sample_period(struct kvm_pmc *pmc)
> -{
> -	if (!pmc->perf_event || pmc->is_paused ||
> -	    !is_sampling_event(pmc->perf_event))
> -		return;
> -
> -	perf_event_period(pmc->perf_event,
> -			  get_sample_period(pmc, pmc->counter));
> -}
> -
>  static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
>  {
>  	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> @@ -214,12 +187,6 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>  					     KVM_PMC_MAX_FIXED);
>  }
>  
> -static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> -{
> -	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> -	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> -}
> -
>  static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
>  {
>  	int bit;
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index cef5a3d0abd0..b6a7ad4d6914 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -160,8 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	/* MSR_PERFCTRn */
>  	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>  	if (pmc) {
> -		pmc->counter += data - pmc_read_counter(pmc);
> -		pmc_update_sample_period(pmc);
> +		pmc_write_counter(pmc, data);
>  		return 0;
>  	}
>  	/* MSR_EVNTSELn */
> @@ -233,21 +232,6 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> -static void amd_pmu_reset(struct kvm_vcpu *vcpu)
> -{
> -	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> -	int i;
> -
> -	for (i = 0; i < KVM_AMD_PMC_MAX_GENERIC; i++) {
> -		struct kvm_pmc *pmc = &pmu->gp_counters[i];
> -
> -		pmc_stop_counter(pmc);
> -		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
> -	}
> -
> -	pmu->global_ctrl = pmu->global_status = 0;
> -}
> -
>  struct kvm_pmu_ops amd_pmu_ops __initdata = {
>  	.hw_event_available = amd_hw_event_available,
>  	.pmc_idx_to_pmc = amd_pmc_idx_to_pmc,
> @@ -259,7 +243,6 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
>  	.set_msr = amd_pmu_set_msr,
>  	.refresh = amd_pmu_refresh,
>  	.init = amd_pmu_init,
> -	.reset = amd_pmu_reset,
>  	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
>  	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 80c769c58a87..ce49d060bc96 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -406,12 +406,10 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (!msr_info->host_initiated &&
>  			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
>  				data = (s64)(s32)data;
> -			pmc->counter += data - pmc_read_counter(pmc);
> -			pmc_update_sample_period(pmc);
> +			pmc_write_counter(pmc, data);
>  			break;
>  		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -			pmc->counter += data - pmc_read_counter(pmc);
> -			pmc_update_sample_period(pmc);
> +			pmc_write_counter(pmc, data);
>  			break;
>  		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
>  			reserved_bits = pmu->reserved_bits;
> @@ -603,26 +601,6 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
>  
>  static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>  {
> -	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> -	struct kvm_pmc *pmc = NULL;
> -	int i;
> -
> -	for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) {
> -		pmc = &pmu->gp_counters[i];
> -
> -		pmc_stop_counter(pmc);
> -		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
> -	}
> -
> -	for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
> -		pmc = &pmu->fixed_counters[i];
> -
> -		pmc_stop_counter(pmc);
> -		pmc->counter = pmc->prev_counter = 0;
> -	}
> -
> -	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
> -
>  	intel_pmu_release_guest_lbr_event(vcpu);
>  }
>  
> 
> base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60
> -- 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 15:45             ` Sean Christopherson
  2023-06-30 17:07               ` Mingwei Zhang
  2023-06-30 21:26               ` Sean Christopherson
@ 2023-07-03 13:33               ` Roman Kagan
  2 siblings, 0 replies; 25+ messages in thread
From: Roman Kagan @ 2023-07-03 13:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Paolo Bonzini, Eric Hankland, kvm, Dave Hansen,
	Like Xu, x86, Thomas Gleixner, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Fri, Jun 30, 2023 at 08:45:04AM -0700, Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Roman Kagan wrote:
> > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > >         return counter & pmc_bitmask(pmc);
> > > > >  }
> > > > >
> > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > +{
> > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > +
> > > > > +       pmc->counter = val;
> > > >
> > > > Doesn't this still have the original problem of storing wider value than
> > > > allowed?
> > >
> > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > patch on top.  Sorry for not making that clear.
> >
> > Ah, got it, thanks!
> >
> > Also I'm now chasing a problem that we occasionally see
> >
> > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > [3939579.462836] Do you have a strange power saving mode enabled?
> > [3939579.462836] Dazed and confused, but trying to continue
> >
> > in the guests when perf is used.  These messages disappear when
> > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > reverted.  I haven't yet figured out where exactly the culprit is.
> 
> Can you reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.

We observe the problem on a branch that predates this commit

Thanks,
Roman.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 21:26               ` Sean Christopherson
  2023-07-01 19:51                 ` Mingwei Zhang
@ 2023-08-11  8:30                 ` Dapeng Mi
  2023-08-22  9:29                 ` Like Xu
  2 siblings, 0 replies; 25+ messages in thread
From: Dapeng Mi @ 2023-08-11  8:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Roman Kagan, Jim Mattson, Paolo Bonzini, Eric Hankland, kvm,
	Dave Hansen, Like Xu, x86, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Mingwei Zhang

On Fri, Jun 30, 2023 at 02:26:02PM -0700, Sean Christopherson wrote:
> Date: Fri, 30 Jun 2023 14:26:02 -0700
> From: Sean Christopherson <seanjc@google.com>
> Subject: Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
> 
> On Fri, Jun 30, 2023, Sean Christopherson wrote:
> > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > > >         return counter & pmc_bitmask(pmc);
> > > > > >  }
> > > > > >
> > > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > > +{
> > > > > > +       if (pmc->perf_event && !pmc->is_paused)
> > > > > > +               perf_event_set_count(pmc->perf_event, val);
> > > > > > +
> > > > > > +       pmc->counter = val;
> > > > >
> > > > > Doesn't this still have the original problem of storing wider value than
> > > > > allowed?
> > > > 
> > > > Yes, this was just to fix the counter offset weirdness.  My plan is to apply your
> > > > patch on top.  Sorry for not making that clear.
> > > 
> > > Ah, got it, thanks!
> > > 
> > > Also I'm now chasing a problem that we occasionally see
> > > 
> > > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > > [3939579.462836] Do you have a strange power saving mode enabled?
> > > [3939579.462836] Dazed and confused, but trying to continue
> > > 
> > > in the guests when perf is used.  These messages disappear when
> > > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > > reverted.  I haven't yet figured out where exactly the culprit is.
> > 
> > Can you reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> > via pmc->prev_counter")?  I suspect the problem is the prev_counter mess.
> 
> Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
> was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> and so can't be called from KVM's fastpath (IRQs disabled).
> 
> However, detecting overflow requires reading perf_event_read_value() to gather
> the accumulated count from the hardware event in order to add it to the emulated
> count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
> KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
> 
> Trying to snapshot the previous counter value is a bit of a mess.  It could probably
> be made to work, but it's hard to reason about what the snapshot actually contains
> and when it should be cleared, especially when factoring in the wrapping logic.
> 
> Rather than snapshot the previous counter, I think it makes sense to:
> 
>   1) Track the number of emulated counter events
>   2) Accumulate and reset the counts from perf_event and emulated_counter into
>      pmc->counter when pausing the PMC
>   3) Pause and reprogram the PMC on writes (instead of the current approach of
>      blindly updating the sample period)
>   4) Pause the counter when stopping the perf_event to ensure pmc->counter is
>      fresh (instead of manually updating pmc->counter)
> 
> IMO, that yields more intuitive logic, and makes it easier to reason about
> correctness since the behavior is easily defined: pmc->counter holds the counts
> that have been gathered and processed, perf_event and emulated_counter hold
> outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
> counter and the hardware counter are reset, because whatever counts existed
> previously are irrelevant.
> 
> Pausing the counter _might_ make WRMSR slower, but we need to get this all
> functionally correct before worrying too much about performance.
> 
> Diff below for what I'm thinking (needs to be split into multiple patches).  It's
> *very* lightly tested.
> 
> I'm about to disappear for a week, I'll pick this back up when I return.  In
> the meantime, any testing and/or input would be much appreciated!

We also observed this PMI injection deadloop issue. It is quite easy to
trigger when the counters are in non-full-width mode and the MSB of the
counter value is 1. The PMI injection deadloop causes a PMI storm and
eventually exhausts all CPU resources.

We verified that either Roman's or Sean's patch fixes this PMI deadloop
issue on an Intel Sapphire Rapids platform.
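
To illustrate the arithmetic behind the deadloop, here is a minimal
standalone sketch (not the KVM code itself; the 48-bit counter width and
the sample value are assumptions for the example):

#include <stdint.h>
#include <stdio.h>

#define PMC_BITMASK ((1ULL << 48) - 1)	/* assumed 48-bit counter width */

int main(void)
{
	/* A non-full-width write of 0x80000001 is truncated to 32 bits and
	 * then sign-extended, so the stored value has the high bits set. */
	uint64_t counter = (uint64_t)(int64_t)(int32_t)0x80000001u;

	/* Old kvm_pmu_incr_counter() logic: snapshot, increment, truncate. */
	uint64_t prev_counter = counter;
	counter = (counter + 1) & PMC_BITMASK;

	/* prev_counter keeps the untruncated sign-extended value, so the
	 * overflow check always fires and another PMI is injected. */
	if (counter < prev_counter)
		printf("spurious overflow -> PMI\n");

	return 0;
}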

The perf command used for validation comes from Mingwei; it is:

sudo ./perf record -N -B -T --sample-cpu                                                                            \
--clockid=CLOCK_MONOTONIC_RAW                                                                                            \
-e cpu/period=0xcdfe60,event=0x3c,name='CPU_CLK_UNHALTED.THREAD'/Duk                                                     \
-e cpu/period=0xcdfe60,umask=0x3,name='CPU_CLK_UNHALTED.REF_TSC'/Duk                                                     \
-e cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY'/Duk                                                            \
-e cpu/period=0xcdfe60,event=0xec,umask=0x2,name='CPU_CLK_UNHALTED.DISTRIBUTED'/uk                                           \
-e cpu/period=0x98968f,event=0x3c,name='CPU_CLK_UNHALTED.THREAD_P'/uk                                                    \
-e cpu/period=0x98968f,event=0xa6,umask=0x21,cmask=0x5,name='EXE_ACTIVITY.BOUND_ON_LOADS'/uk                             \
-e cpu/period=0x4c4b4f,event=0x9c,umask=0x1,name='IDQ_UOPS_NOT_DELIVERED.CORE'/uk                                        \
-e cpu/period=0x4c4b4f,event=0xad,umask=0x10,name='INT_MISC.UOP_DROPPING'/uk                                             \
-e cpu/period=0x4c4b4f,event=0x47,umask=0x3,cmask=0x3,name='MEMORY_ACTIVITY.STALLS_L1D_MISS'/uk                          \
-e cpu/period=0x4c4b4f,event=0x47,umask=0x5,cmask=0x5,name='MEMORY_ACTIVITY.STALLS_L2_MISS'/uk                           \
-e cpu/period=0x4c4b4f,event=0x47,umask=0x9,cmask=0x9,name='MEMORY_ACTIVITY.STALLS_L3_MISS'/uk                           \
-e cpu/period=0x7a143,event=0xd3,umask=0x1,name='MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM'/uk                                 \
-e cpu/period=0x4c4b4f,event=0xd3,umask=0x2,name='MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM'/uk                               \
-e cpu/period=0x7a143,event=0xd3,umask=0x8,name='MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD'/uk                                 \
-e cpu/period=0x4c4b4f,event=0xd3,umask=0x4,name='MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM'/uk                               \
-e cpu/period=0x7a143,event=0xd3,umask=0x10,name='MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM'/uk                                \
-e cpu/period=0x7a143,event=0xd1,umask=0x40,name='MEM_LOAD_RETIRED.FB_HIT'/uk                                            \
-e cpu/period=0x4c4b4f,event=0xd1,umask=0x1,name='MEM_LOAD_RETIRED.L1_HIT'/uk                                            \
-e cpu/period=0xf424f,event=0xd1,umask=0x8,name='MEM_LOAD_RETIRED.L1_MISS'/uk                                            \
-e cpu/period=0xf424f,event=0xd1,umask=0x2,name='MEM_LOAD_RETIRED.L2_HIT'/uk                                             \
-e cpu/period=0x7a189,event=0xd1,umask=0x4,name='MEM_LOAD_RETIRED.L3_HIT'/uk                                             \
-e cpu/period=0x3d0f9,event=0xd1,umask=0x20,name='MEM_LOAD_RETIRED.L3_MISS'/uk                                           \
-e cpu/period=0x4c4b4f,event=0xd1,umask=0x80,name='MEM_LOAD_RETIRED.LOCAL_PMM'/uk                                        \
-e cpu/period=0x4c4b4f,event=0x20,umask=0x8,cmask=0x4,name='OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD_cmask_4'/uk         \
-e cpu/period=0x4c4b4f,event=0x20,umask=0x8,cmask=0x1,name='OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD'/uk         \
-e cpu/period=0x2faf090,event=0xa4,umask=0x2,name='TOPDOWN.BACKEND_BOUND_SLOTS'/uk                                       \
-e cpu/period=0x2faf090,event=0xa4,umask=0x10,name='TOPDOWN.MEMORY_BOUND_SLOTS'/uk                                       \
-e cpu/period=0x98968f,event=0xc2,umask=0x2,name='UOPS_RETIRED.SLOTS'/uk                                                 \
-e cpu/period=0x7a12f,event=0xd0,umask=0x82,name='MEM_INST_RETIRED.ALL_STORES'/uk                                        \
-e cpu/period=0x7a12f,event=0xd0,umask=0x81,name='MEM_INST_RETIRED.ALL_LOADS'/uk                                         \
-e cpu/period=0x98968f,event=0xc7,umask=0x2,name='FP_ARITH_INST_RETIRED.SCALAR_SINGLE'/uk                                \
-e cpu/period=0x98968f,event=0xc7,umask=0x8,name='FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE'/uk                           \
-e cpu/period=0x98968f,event=0xc7,umask=0x20,name='FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE'/uk                          \
-e cpu/period=0x98968f,event=0xc7,umask=0x1,name='FP_ARITH_INST_RETIRED.SCALAR_DOUBLE'/uk                                \
-e cpu/period=0x98968f,event=0xc7,umask=0x4,name='FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE'/uk                           \
-e cpu/period=0x98968f,event=0xc7,umask=0x10,name='FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE'/uk                          \
-e cpu/period=0x98968f,event=0xb1,umask=0x10,name='UOPS_EXECUTED.X87'/uk                                                 \
-e cpu/period=0x98968f,event=0xb1,umask=0x1,name='UOPS_EXECUTED.THREAD'/uk                                               \
-e cpu/period=0x98968f,event=0xc7,umask=0x40,name='FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE'/uk                          \
-e cpu/period=0x98968f,event=0xc7,umask=0x80,name='FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE'/uk                          \
-e cpu/period=0x98968f,event=0xce,umask=0x1,name='AMX_OPS_RETIRED.INT8'/uk                                               \
-e cpu/period=0x98968f,event=0xce,umask=0x2,name='AMX_OPS_RETIRED.BF16'/uk                                               \
-e cpu/period=0x98968f,event=0xcf,umask=0x1,name='FP_ARITH_INST_RETIRED2.SCALAR_HALF'/uk                                 \
-e cpu/period=0x98968f,event=0xcf,umask=0x4,name='FP_ARITH_INST_RETIRED2.128B_PACKED_HALF'/uk                            \
-e cpu/period=0x98968f,event=0xcf,umask=0x8,name='FP_ARITH_INST_RETIRED2.256B_PACKED_HALF'/uk                            \
-e cpu/period=0x98968f,event=0xcf,umask=0x10,name='FP_ARITH_INST_RETIRED2.512B_PACKED_HALF'/uk                           \
-e cpu/period=0xcdfe60,event=0xc0,name='INST_RETIRED.ANY_P'/uk                                                           \
-e cpu/period=0x2faf090,event=0xa4,umask=0x1,name='TOPDOWN.SLOTS_P'/uk


> 
> ---
>  arch/x86/include/asm/kvm-x86-pmu-ops.h |  2 +-
>  arch/x86/include/asm/kvm_host.h        | 11 ++-
>  arch/x86/kvm/pmu.c                     | 94 ++++++++++++++++++++++----
>  arch/x86/kvm/pmu.h                     | 53 +++------------
>  arch/x86/kvm/svm/pmu.c                 | 19 +-----
>  arch/x86/kvm/vmx/pmu_intel.c           | 26 +------
>  6 files changed, 103 insertions(+), 102 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index 6c98f4bb4228..058bc636356a 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -22,7 +22,7 @@ KVM_X86_PMU_OP(get_msr)
>  KVM_X86_PMU_OP(set_msr)
>  KVM_X86_PMU_OP(refresh)
>  KVM_X86_PMU_OP(init)
> -KVM_X86_PMU_OP(reset)
> +KVM_X86_PMU_OP_OPTIONAL(reset)
>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 28bd38303d70..337f5e1da57c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -492,8 +492,17 @@ struct kvm_pmc {
>  	u8 idx;
>  	bool is_paused;
>  	bool intr;
> +	/*
> +	 * Value of the PMC counter that has been gathered from the associated
> +	 * perf_event and from emulated_counter.  This is *not* the current
> +	 * value as seen by the guest or userspace.
> +	 */
>  	u64 counter;
> -	u64 prev_counter;
> +	/*
> +	 * PMC events triggered by KVM emulation that haven't been fully
> +	 * processed, e.g. haven't undergone overflow detection.
> +	 */
> +	u64 emulated_counter;
>  	u64 eventsel;
>  	struct perf_event *perf_event;
>  	struct kvm_vcpu *vcpu;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index bf653df86112..472e45f5993f 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -148,9 +148,9 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
>  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
>  
>  	/*
> -	 * Ignore overflow events for counters that are scheduled to be
> -	 * reprogrammed, e.g. if a PMI for the previous event races with KVM's
> -	 * handling of a related guest WRMSR.
> +	 * Ignore asynchronous overflow events for counters that are scheduled
> +	 * to be reprogrammed, e.g. if a PMI for the previous event races with
> +	 * KVM's handling of a related guest WRMSR.
>  	 */
>  	if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
>  		return;
> @@ -182,6 +182,21 @@ static u64 pmc_get_pebs_precise_level(struct kvm_pmc *pmc)
>  	return 1;
>  }
>  
> +static u64 pmc_get_sample_period(struct kvm_pmc *pmc)
> +{
> +	u64 sample_period = (-pmc->counter) & pmc_bitmask(pmc);
> +
> +	/*
> +	 * Verify pmc->counter is fresh, i.e. that the perf event is paused and
> +	 * emulated events have been gathered.
> +	 */
> +	WARN_ON_ONCE(pmc->emulated_counter || (pmc->perf_event && !pmc->is_paused));
> +
> +	if (!sample_period)
> +		sample_period = pmc_bitmask(pmc) + 1;
> +	return sample_period;
> +}
> +
>  static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
>  				 bool exclude_user, bool exclude_kernel,
>  				 bool intr)
> @@ -200,7 +215,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
>  	};
>  	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>  
> -	attr.sample_period = get_sample_period(pmc, pmc->counter);
> +	attr.sample_period = pmc_get_sample_period(pmc);
>  
>  	if ((attr.config & HSW_IN_TX_CHECKPOINTED) &&
>  	    guest_cpuid_is_intel(pmc->vcpu)) {
> @@ -238,13 +253,19 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
>  
>  static void pmc_pause_counter(struct kvm_pmc *pmc)
>  {
> -	u64 counter = pmc->counter;
> +	/*
> +	 * Accumulate emulated events, even if the PMC was already paused, e.g.
> +	 * if KVM emulated an event after a WRMSR, but before reprogramming, or
> +	 * if KVM couldn't create a perf event.
> +	 */
> +	u64 counter = pmc->counter + pmc->emulated_counter;
>  
> -	if (!pmc->perf_event || pmc->is_paused)
> -		return;
> +	pmc->emulated_counter = 0;
>  
>  	/* update counter, reset event value to avoid redundant accumulation */
> -	counter += perf_event_pause(pmc->perf_event, true);
> +	if (pmc->perf_event && !pmc->is_paused)
> +		counter += perf_event_pause(pmc->perf_event, true);
> +
>  	pmc->counter = counter & pmc_bitmask(pmc);
>  	pmc->is_paused = true;
>  }
> @@ -256,8 +277,7 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
>  
>  	/* recalibrate sample period and check if it's accepted by perf core */
>  	if (is_sampling_event(pmc->perf_event) &&
> -	    perf_event_period(pmc->perf_event,
> -			      get_sample_period(pmc, pmc->counter)))
> +	    perf_event_period(pmc->perf_event, pmc_get_sample_period(pmc)))
>  		return false;
>  
>  	if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
> @@ -395,6 +415,32 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
>  	return is_fixed_event_allowed(filter, pmc->idx);
>  }
>  
> +void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc_pause_counter(pmc);
> +	pmc->counter = val & pmc_bitmask(pmc);
> +	kvm_pmu_request_counter_reprogram(pmc);
> +}
> +EXPORT_SYMBOL_GPL(pmc_write_counter);
> +
> +static void pmc_release_perf_event(struct kvm_pmc *pmc)
> +{
> +	if (pmc->perf_event) {
> +		perf_event_release_kernel(pmc->perf_event);
> +		pmc->perf_event = NULL;
> +		pmc->current_config = 0;
> +		pmc_to_pmu(pmc)->event_count--;
> +	}
> +}
> +
> +static void pmc_stop_counter(struct kvm_pmc *pmc)
> +{
> +	if (pmc->perf_event) {
> +		pmc_pause_counter(pmc);
> +		pmc_release_perf_event(pmc);
> +	}
> +}
> +
>  static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
>  {
>  	return pmc_is_globally_enabled(pmc) && pmc_speculative_in_use(pmc) &&
> @@ -404,6 +450,7 @@ static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
>  static void reprogram_counter(struct kvm_pmc *pmc)
>  {
>  	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> +	u64 prev_counter = pmc->counter;
>  	u64 eventsel = pmc->eventsel;
>  	u64 new_config = eventsel;
>  	u8 fixed_ctr_ctrl;
> @@ -413,7 +460,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>  	if (!pmc_event_is_allowed(pmc))
>  		goto reprogram_complete;
>  
> -	if (pmc->counter < pmc->prev_counter)
> +	if (pmc->counter < prev_counter)
>  		__kvm_perf_overflow(pmc, false);
>  
>  	if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
> @@ -453,7 +500,6 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>  
>  reprogram_complete:
>  	clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
> -	pmc->prev_counter = 0;
>  }
>  
>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> @@ -678,9 +724,28 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	int i;
>  
>  	irq_work_sync(&pmu->irq_work);
> -	static_call(kvm_x86_pmu_reset)(vcpu);
> +
> +	bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX);
> +
> +	for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
> +		pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);
> +		if (!pmc)
> +			continue;
> +
> +		pmc_stop_counter(pmc);
> +		pmc->counter = 0;
> +
> +		if (pmc_is_gp(pmc))
> +			pmc->eventsel = 0;
> +	};
> +
> +	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
> +
> +	static_call_cond(kvm_x86_pmu_reset)(vcpu);
>  }
>  
>  void kvm_pmu_init(struct kvm_vcpu *vcpu)
> @@ -727,8 +792,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>  
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> -	pmc->prev_counter = pmc->counter;
> -	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> +	pmc->emulated_counter++;
>  	kvm_pmu_request_counter_reprogram(pmc);
>  }
>  
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 7d9ba301c090..0ac60ffae944 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -55,6 +55,12 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
>  	return pmu->version > 1;
>  }
>  
> +static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> +{
> +	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> +	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> +}
> +
>  static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
>  {
>  	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> @@ -66,31 +72,17 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  {
>  	u64 counter, enabled, running;
>  
> -	counter = pmc->counter;
> +	counter = pmc->counter + pmc->emulated_counter;
> +
>  	if (pmc->perf_event && !pmc->is_paused)
>  		counter += perf_event_read_value(pmc->perf_event,
>  						 &enabled, &running);
> +
>  	/* FIXME: Scaling needed? */
>  	return counter & pmc_bitmask(pmc);
>  }
>  
> -static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
> -{
> -	if (pmc->perf_event) {
> -		perf_event_release_kernel(pmc->perf_event);
> -		pmc->perf_event = NULL;
> -		pmc->current_config = 0;
> -		pmc_to_pmu(pmc)->event_count--;
> -	}
> -}
> -
> -static inline void pmc_stop_counter(struct kvm_pmc *pmc)
> -{
> -	if (pmc->perf_event) {
> -		pmc->counter = pmc_read_counter(pmc);
> -		pmc_release_perf_event(pmc);
> -	}
> -}
> +void pmc_write_counter(struct kvm_pmc *pmc, u64 val);
>  
>  static inline bool pmc_is_gp(struct kvm_pmc *pmc)
>  {
> @@ -140,25 +132,6 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
>  	return NULL;
>  }
>  
> -static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> -{
> -	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
> -
> -	if (!sample_period)
> -		sample_period = pmc_bitmask(pmc) + 1;
> -	return sample_period;
> -}
> -
> -static inline void pmc_update_sample_period(struct kvm_pmc *pmc)
> -{
> -	if (!pmc->perf_event || pmc->is_paused ||
> -	    !is_sampling_event(pmc->perf_event))
> -		return;
> -
> -	perf_event_period(pmc->perf_event,
> -			  get_sample_period(pmc, pmc->counter));
> -}
> -
>  static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
>  {
>  	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> @@ -214,12 +187,6 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>  					     KVM_PMC_MAX_FIXED);
>  }
>  
> -static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> -{
> -	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> -	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> -}
> -
>  static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
>  {
>  	int bit;
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index cef5a3d0abd0..b6a7ad4d6914 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -160,8 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	/* MSR_PERFCTRn */
>  	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>  	if (pmc) {
> -		pmc->counter += data - pmc_read_counter(pmc);
> -		pmc_update_sample_period(pmc);
> +		pmc_write_counter(pmc, data);
>  		return 0;
>  	}
>  	/* MSR_EVNTSELn */
> @@ -233,21 +232,6 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> -static void amd_pmu_reset(struct kvm_vcpu *vcpu)
> -{
> -	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> -	int i;
> -
> -	for (i = 0; i < KVM_AMD_PMC_MAX_GENERIC; i++) {
> -		struct kvm_pmc *pmc = &pmu->gp_counters[i];
> -
> -		pmc_stop_counter(pmc);
> -		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
> -	}
> -
> -	pmu->global_ctrl = pmu->global_status = 0;
> -}
> -
>  struct kvm_pmu_ops amd_pmu_ops __initdata = {
>  	.hw_event_available = amd_hw_event_available,
>  	.pmc_idx_to_pmc = amd_pmc_idx_to_pmc,
> @@ -259,7 +243,6 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
>  	.set_msr = amd_pmu_set_msr,
>  	.refresh = amd_pmu_refresh,
>  	.init = amd_pmu_init,
> -	.reset = amd_pmu_reset,
>  	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
>  	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 80c769c58a87..ce49d060bc96 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -406,12 +406,10 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (!msr_info->host_initiated &&
>  			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
>  				data = (s64)(s32)data;
> -			pmc->counter += data - pmc_read_counter(pmc);
> -			pmc_update_sample_period(pmc);
> +			pmc_write_counter(pmc, data);
>  			break;
>  		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -			pmc->counter += data - pmc_read_counter(pmc);
> -			pmc_update_sample_period(pmc);
> +			pmc_write_counter(pmc, data);
>  			break;
>  		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
>  			reserved_bits = pmu->reserved_bits;
> @@ -603,26 +601,6 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
>  
>  static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>  {
> -	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> -	struct kvm_pmc *pmc = NULL;
> -	int i;
> -
> -	for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) {
> -		pmc = &pmu->gp_counters[i];
> -
> -		pmc_stop_counter(pmc);
> -		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
> -	}
> -
> -	for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
> -		pmc = &pmu->fixed_counters[i];
> -
> -		pmc_stop_counter(pmc);
> -		pmc->counter = pmc->prev_counter = 0;
> -	}
> -
> -	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
> -
>  	intel_pmu_release_guest_lbr_event(vcpu);
>  }
>  
> 
> base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60
> -- 
> 

-- 
Thanks,
Dapeng Mi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-06-30 21:26               ` Sean Christopherson
  2023-07-01 19:51                 ` Mingwei Zhang
  2023-08-11  8:30                 ` Dapeng Mi
@ 2023-08-22  9:29                 ` Like Xu
  2023-08-23 18:28                   ` Mingwei Zhang
  2 siblings, 1 reply; 25+ messages in thread
From: Like Xu @ 2023-08-22  9:29 UTC (permalink / raw)
  To: Sean Christopherson, Roman Kagan
  Cc: Jim Mattson, Paolo Bonzini, Eric Hankland, kvm, linux-kernel,
	Mingwei Zhang

On 1/7/2023 5:26 am, Sean Christopherson wrote:
> Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
> was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> and so can't be called from KVM's fastpath (IRQs disabled).
Updating PMU counters for emulated instructions causes trouble.

> 
> However, detecting overflow requires reading perf_event_read_value() to gather
> the accumulated count from the hardware event in order to add it to the emulated
> count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
> KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
> 
> Trying to snapshot the previous counter value is a bit of a mess.  It could probably
> made to work, but it's hard to reason about what the snapshot actually contains
> and when it should be cleared, especially when factoring in the wrapping logic.
> 
> Rather than snapshot the previous counter, I think it makes sense to:
> 
>    1) Track the number of emulated counter events

If events are counted separately, the challenge here is to correctly time
the emulation of counter overflows, which can be triggered from either
side while the two counter values are out of sync.
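
For illustration, a minimal model of folding both sources before the
overflow check (hypothetical names and an assumed 48-bit width; not the
actual KVM code):

#include <stdint.h>
#include <stdbool.h>

#define PMC_BITMASK ((1ULL << 48) - 1)	/* assumed width */

struct pmc_model {
	uint64_t counter;		/* folded value, already masked      */
	uint64_t emulated_counter;	/* emulated events not yet folded in */
};

/* Overflow is decided only after both the hardware delta and the emulated
 * events are folded in, so it does not matter which side produced the
 * final increment while the two were out of sync. */
bool pmc_fold_and_check_overflow(struct pmc_model *pmc, uint64_t hw_delta)
{
	uint64_t prev = pmc->counter;

	pmc->counter = (prev + pmc->emulated_counter + hw_delta) & PMC_BITMASK;
	pmc->emulated_counter = 0;

	return pmc->counter < prev;	/* wrapped past the width => PMI */
}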

>    2) Accumulate and reset the counts from perf_event and emulated_counter into
>       pmc->counter when pausing the PMC
>    3) Pause and reprogram the PMC on writes (instead of the current approach of
>       blindly updating the sample period)

Updating the sample period is the only interface KVM has to configure the
hardware behaviour of a hardware counter. I note that perf_event_set_count()
will be proposed, and I'm pessimistic about this change.
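
For reference, the sample period encodes "events remaining until overflow",
as in get_sample_period()/pmc_get_sample_period() in the diff above; a small
standalone illustration (48-bit width assumed):

#include <stdint.h>
#include <stdio.h>

#define PMC_BITMASK ((1ULL << 48) - 1)	/* assumed width */

int main(void)
{
	uint64_t counter = 0xFFFFFFFFFFF0ULL;		/* 16 events before wrap */
	uint64_t period  = (-counter) & PMC_BITMASK;	/* 0x10 */

	if (!period)
		period = PMC_BITMASK + 1;

	/* perf arms the hardware to fire after 'period' more events, which
	 * is exactly when the guest-visible counter would overflow. */
	printf("sample period = 0x%llx\n", (unsigned long long)period);
	return 0;
}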

>    4) Pause the counter when stopping the perf_event to ensure pmc->counter is
>       fresh (instead of manually updating pmc->counter)
> 
> IMO, that yields more intuitive logic, and makes it easier to reason about
> correctness since the behavior is easily define: pmc->counter holds the counts
> that have been gathered and processed, perf_event and emulated_counter hold
> outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
> counter and the hardware counter are reset, because whatever counts existed
> previously are irrelevant.

If we take the hardware view, a counter, emulated or not, just increments
and overflows at the threshold. The missing logic here is what happens when
the counter is truncated on a write of a high-bit-width value, and how to
deal with the value pmc->prev_counter held before pmc->counter was truncated.

> 
> Pausing the counter _might_ make WRMSR slower, but we need to get this all
> functionally correct before worrying too much about performance.

Performance, security and correctness should all be considered at the beginning.

> 
> Diff below for what I'm thinking (needs to be split into multiple patches).  It's
> *very*  lightly tested.

It saddens me that no one has come up with an actual low-level counter-test for 
this issue.

> 
> I'm about to disappear for a week, I'll pick this back up when I return.  In
> the meantime, any testing and/or input would be much appreciated!

How about accepting Roman's original fix and then exercising the rewriting genius?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-08-22  9:29                 ` Like Xu
@ 2023-08-23 18:28                   ` Mingwei Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Mingwei Zhang @ 2023-08-23 18:28 UTC (permalink / raw)
  To: Like Xu
  Cc: Sean Christopherson, Roman Kagan, Jim Mattson, Paolo Bonzini,
	Eric Hankland, kvm, linux-kernel

On Tue, Aug 22, 2023 at 2:30 AM Like Xu <like.xu.linux@gmail.com> wrote:
>
> On 1/7/2023 5:26 am, Sean Christopherson wrote:
> > Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
> > was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> > and so can't be called from KVM's fastpath (IRQs disabled).
> Updating pmu counters for emulated instructions cause troubles.
>
> >
> > However, detecting overflow requires reading perf_event_read_value() to gather
> > the accumulated count from the hardware event in order to add it to the emulated
> > count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
> > KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
> >
> > Trying to snapshot the previous counter value is a bit of a mess.  It could probably
> > be made to work, but it's hard to reason about what the snapshot actually contains
> > and when it should be cleared, especially when factoring in the wrapping logic.
> >
> > Rather than snapshot the previous counter, I think it makes sense to:
> >
> >    1) Track the number of emulated counter events
>
> If events are counted separately, the challenge here is to correctly time
> the emulation of counter overflows, which can occur on both sides of the
> counter values out of sync.
>
> >    2) Accumulate and reset the counts from perf_event and emulated_counter into
> >       pmc->counter when pausing the PMC
> >    3) Pause and reprogram the PMC on writes (instead of the current approach of
> >       blindly updating the sample period)
>
> Updating the sample period is the only interface for KVM to configure hw
> behaviour on hw-ctr. I note that perf_event_set_count() will be proposed,
> and I'm pessimistic about this change.
>
> >    4) Pause the counter when stopping the perf_event to ensure pmc->counter is
> >       fresh (instead of manually updating pmc->counter)
> >
> > IMO, that yields more intuitive logic, and makes it easier to reason about
> > correctness since the behavior is easily defined: pmc->counter holds the counts
> > that have been gathered and processed, perf_event and emulated_counter hold
> > outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
> > counter and the hardware counter are reset, because whatever counts existed
> > previously are irrelevant.
>
> If we take the hardware view, a counter, emulated or not, just increments
> and overflows at the threshold. The missing logic here is when the counter
> is truncated when writing high bit-width values, and how to deal with the
> value of pmc->prev_counter was before pmc->counter was truncated.
>
> >
> > Pausing the counter _might_ make WRMSR slower, but we need to get this all
> > functionally correct before worrying too much about performance.
>
> Performance, security and correctness should all be considered at the beginning.
>

+1 on the performance part.

I did several rounds of performance testing: pausing the counter is fast,
but restarting the counter is *super* slow. The extra overhead might just
make the vPMU useless, especially when the guest workload takes all of the
CPU/memory resources in a VM (as SPEC2017 does).

> >
> > Diff below for what I'm thinking (needs to be split into multiple patches).  It's
> > *very*  lightly tested.
>
> It saddens me that no one has come up with an actual low-level counter-test for
> this issue.
>
> >
> > I'm about to disappear for a week, I'll pick this back up when I get return.  In
> > the meantime, any testing and/or input would be much appreciated!
>
> How about accepting Roman's original fix and then exercising the rewriting genius ?

+1

I think the best option would be to just apply the fix in the short term
and put the refactoring of the emulated counter into the next series.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
  2023-05-04 12:00 [PATCH] KVM: x86: vPMU: truncate counter value to allowed width Roman Kagan
  2023-05-23 12:40 ` Like Xu
  2023-06-06  0:51 ` Sean Christopherson
@ 2023-09-28 16:41 ` Sean Christopherson
  2 siblings, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2023-09-28 16:41 UTC (permalink / raw)
  To: Sean Christopherson, kvm, Roman Kagan
  Cc: Dave Hansen, Jim Mattson, Like Xu, Paolo Bonzini, x86,
	Thomas Gleixner, Eric Hankland, linux-kernel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar

On Thu, 04 May 2023 14:00:42 +0200, Roman Kagan wrote:
> Performance counters are defined to have width less than 64 bits.  The
> vPMU code maintains the counters in u64 variables but assumes the value
> to fit within the defined width.  However, for Intel non-full-width
> counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
> truncated to 32 bits and then sign-extended to full 64 bits.  If a
> negative value is set, it's sign-extended to 64 bits, but then in
> kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> previous value for overflow detection.
> That previous value is not truncated, so it always evaluates bigger than
> the truncated new one, and a PMI is injected.  If the PMI handler writes
> a negative counter value itself, the vCPU never quits the PMI loop.
> 
> [...]

Applied to kvm-x86 pmu, with a slightly massaged changelog.  Thanks!  And sorry
for the horrendous delay...

[1/1] KVM: x86/pmu: Truncate counter value to allowed width on write
      https://github.com/kvm-x86/linux/commit/b29a2acd36dd

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-09-28 16:42 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-04 12:00 [PATCH] KVM: x86: vPMU: truncate counter value to allowed width Roman Kagan
2023-05-23 12:40 ` Like Xu
2023-05-23 16:42   ` Roman Kagan
2023-06-06  0:26     ` Sean Christopherson
2023-06-06  0:51 ` Sean Christopherson
2023-06-29 21:16   ` Jim Mattson
2023-06-30  0:11     ` Sean Christopherson
2023-06-30  0:31       ` Jim Mattson
2023-06-30 11:14       ` Roman Kagan
2023-06-30 14:28         ` Sean Christopherson
2023-06-30 15:21           ` Roman Kagan
2023-06-30 15:45             ` Sean Christopherson
2023-06-30 17:07               ` Mingwei Zhang
2023-06-30 17:16                 ` Jim Mattson
2023-06-30 17:32                   ` Mingwei Zhang
2023-06-30 18:03                     ` Mingwei Zhang
2023-06-30 21:26               ` Sean Christopherson
2023-07-01 19:51                 ` Mingwei Zhang
2023-08-11  8:30                 ` Dapeng Mi
2023-08-22  9:29                 ` Like Xu
2023-08-23 18:28                   ` Mingwei Zhang
2023-07-03 13:33               ` Roman Kagan
2023-06-30 16:40             ` Jim Mattson
2023-06-30 23:25               ` Jim Mattson
2023-09-28 16:41 ` Sean Christopherson
