linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/3] KVM: arm64: Assorted PMU emulation fixes
@ 2019-10-06 10:46 maz
  2019-10-06 10:46 ` [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation maz
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: maz @ 2019-10-06 10:46 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm
  Cc: Mark Rutland, Suzuki K Poulose, Marc Zyngier, James Morse,
	Andrew Murray, Julien Thierry

From: Marc Zyngier <maz@kernel.org>

I recently came across a number of PMU emulation bugs, all of which
can result in unexpected behaviours in an unsuspecting guest. The first
two patches have already been discussed on the list, but I'm including
them here as part of a slightly longer series. The last patch fixes an
issue that has been there since day one, where we confuse the
architectural overflow of a counter with the perf sampling period.

If nobody disagrees, I'll send them upstream shortly.

Marc Zyngier (3):
  KVM: arm64: pmu: Fix cycle counter truncation
  arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems
  KVM: arm64: pmu: Reset sample period on overflow handling

 arch/arm64/kvm/sys_regs.c |  4 ++++
 virt/kvm/arm/pmu.c        | 34 ++++++++++++++++++++++++----------
 2 files changed, 28 insertions(+), 10 deletions(-)

-- 
2.20.1


* [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation
  2019-10-06 10:46 [PATCH 0/3] KVM: arm64: Assorted PMU emulation fixes maz
@ 2019-10-06 10:46 ` maz
  2019-10-07  8:48   ` Andrew Murray
  2019-10-06 10:46 ` [PATCH 2/3] arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems maz
  2019-10-06 10:46 ` [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling maz
  2 siblings, 1 reply; 9+ messages in thread
From: maz @ 2019-10-06 10:46 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm
  Cc: Mark Rutland, Suzuki K Poulose, Marc Zyngier, James Morse,
	Andrew Murray, Julien Thierry

From: Marc Zyngier <maz@kernel.org>

When a counter is disabled, its value is sampled before the event
is disabled, and the value is written back to the shadow register.

In that process, the value gets truncated to 32 bits, which is adequate
for any counter but the cycle counter (defined as a 64-bit counter).

This obviously results in a corrupted counter, and things like
"perf record -e cycles" not working at all when run in a guest...
A similar, but less critical bug exists in kvm_pmu_get_counter_value.

Make the truncation conditional on the counter not being the cycle
counter, which results in a minor code reorganisation.

Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Cc: Andrew Murray <andrew.murray@arm.com>
Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 virt/kvm/arm/pmu.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 362a01886bab..c30c3a74fc7f 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -146,8 +146,7 @@ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx)
 	if (kvm_pmu_pmc_is_chained(pmc) &&
 	    kvm_pmu_idx_is_high_counter(select_idx))
 		counter = upper_32_bits(counter);
-
-	else if (!kvm_pmu_idx_is_64bit(vcpu, select_idx))
+	else if (select_idx != ARMV8_PMU_CYCLE_IDX)
 		counter = lower_32_bits(counter);
 
 	return counter;
@@ -193,7 +192,7 @@ static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc)
  */
 static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
 {
-	u64 counter, reg;
+	u64 counter, reg, val;
 
 	pmc = kvm_pmu_get_canonical_pmc(pmc);
 	if (!pmc->perf_event)
@@ -201,16 +200,19 @@ static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
 
 	counter = kvm_pmu_get_pair_counter_value(vcpu, pmc);
 
-	if (kvm_pmu_pmc_is_chained(pmc)) {
-		reg = PMEVCNTR0_EL0 + pmc->idx;
-		__vcpu_sys_reg(vcpu, reg) = lower_32_bits(counter);
-		__vcpu_sys_reg(vcpu, reg + 1) = upper_32_bits(counter);
+	if (pmc->idx == ARMV8_PMU_CYCLE_IDX) {
+		reg = PMCCNTR_EL0;
+		val = counter;
 	} else {
-		reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX)
-		       ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + pmc->idx;
-		__vcpu_sys_reg(vcpu, reg) = lower_32_bits(counter);
+		reg = PMEVCNTR0_EL0 + pmc->idx;
+		val = lower_32_bits(counter);
 	}
 
+	__vcpu_sys_reg(vcpu, reg) = val;
+
+	if (kvm_pmu_pmc_is_chained(pmc))
+		__vcpu_sys_reg(vcpu, reg + 1) = upper_32_bits(counter);
+
 	kvm_pmu_release_perf_event(pmc);
 }
 
-- 
2.20.1


* [PATCH 2/3] arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems
  2019-10-06 10:46 [PATCH 0/3] KVM: arm64: Assorted PMU emulation fixes maz
  2019-10-06 10:46 ` [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation maz
@ 2019-10-06 10:46 ` maz
  2019-10-06 10:46 ` [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling maz
  2 siblings, 0 replies; 9+ messages in thread
From: maz @ 2019-10-06 10:46 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm
  Cc: Mark Rutland, Suzuki K Poulose, Marc Zyngier, James Morse,
	Andrew Murray, Julien Thierry

From: Marc Zyngier <maz@kernel.org>

Of PMCR_EL0.LC, the ARMv8 ARM says:

	"In an AArch64 only implementation, this field is RES 1."

So be it.

Fixes: ab9468340d2bc ("arm64: KVM: Add access handler for PMCR register")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/sys_regs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 2071260a275b..46822afc57e0 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -632,6 +632,8 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 	 */
 	val = ((pmcr & ~ARMV8_PMU_PMCR_MASK)
 	       | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
+	if (!system_supports_32bit_el0())
+		val |= ARMV8_PMU_PMCR_LC;
 	__vcpu_sys_reg(vcpu, r->reg) = val;
 }
 
@@ -682,6 +684,8 @@ static bool access_pmcr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 		val = __vcpu_sys_reg(vcpu, PMCR_EL0);
 		val &= ~ARMV8_PMU_PMCR_MASK;
 		val |= p->regval & ARMV8_PMU_PMCR_MASK;
+		if (!system_supports_32bit_el0())
+			val |= ARMV8_PMU_PMCR_LC;
 		__vcpu_sys_reg(vcpu, PMCR_EL0) = val;
 		kvm_pmu_handle_pmcr(vcpu, val);
 		kvm_vcpu_pmu_restore_guest(vcpu);
-- 
2.20.1


* [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling
  2019-10-06 10:46 [PATCH 0/3] KVM: arm64: Assorted PMU emulation fixes maz
  2019-10-06 10:46 ` [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation maz
  2019-10-06 10:46 ` [PATCH 2/3] arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems maz
@ 2019-10-06 10:46 ` maz
  2019-10-07  9:43   ` Andrew Murray
  2 siblings, 1 reply; 9+ messages in thread
From: maz @ 2019-10-06 10:46 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm
  Cc: Mark Rutland, Suzuki K Poulose, Marc Zyngier, James Morse,
	Andrew Murray, Julien Thierry

From: Marc Zyngier <maz@kernel.org>

The PMU emulation code uses the perf event sample period to trigger
the overflow detection. This works fine for the *first* overflow,
but results in a huge number of interrupts on the host, unrelated
to the number of interrupts handled in the guest (a 20x factor is
pretty common for the cycle counter). On a slow system (such as a
SW model), this can result in the guest only making forward
progress at a glacial pace.

It turns out that the clue is in the name. The sample period is
exactly that: a period. And once an overflow has occurred, the
following period should be the full width of the associated
counter, instead of whatever the guest had initially programmed.

Reset the sample period to the architected value in the overflow
handler, which now results in a number of host interrupts that is
much closer to the number of interrupts in the guest.

Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 virt/kvm/arm/pmu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index c30c3a74fc7f..3ca4761fc0f5 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -444,6 +444,18 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
 	int idx = pmc->idx;
+	u64 val, period;
+
+	/* Start by resetting the sample period to the architectural limit */
+	val = kvm_pmu_get_pair_counter_value(vcpu, pmc);
+
+	if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
+		period = (-val) & GENMASK(63, 0);
+	else
+		period = (-val) & GENMASK(31, 0);
+
+	pmc->perf_event->attr.sample_period = period;
+	pmc->perf_event->hw.sample_period = period;
 
 	__vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(idx);
 
-- 
2.20.1


* Re: [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation
  2019-10-06 10:46 ` [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation maz
@ 2019-10-07  8:48   ` Andrew Murray
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Murray @ 2019-10-07  8:48 UTC (permalink / raw)
  To: maz
  Cc: Mark Rutland, kvm, Suzuki K Poulose, James Morse,
	linux-arm-kernel, kvmarm, Julien Thierry

On Sun, Oct 06, 2019 at 11:46:34AM +0100, maz@kernel.org wrote:
> From: Marc Zyngier <maz@kernel.org>
> 
> When a counter is disabled, its value is sampled before the event
> is being disabled, and the value written back in the shadow register.
> 
> In that process, the value gets truncated to 32bit, which is adequate
> for any counter but the cycle counter (defined as a 64bit counter).
> 
> This obviously results in a corrupted counter, and things like
> "perf record -e cycles" not working at all when run in a guest...
> A similar, but less critical bug exists in kvm_pmu_get_counter_value.
> 
> Make the truncation conditional on the counter not being the cycle
> counter, which results in a minor code reorganisation.
> 
> Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
> Cc: Andrew Murray <andrew.murray@arm.com>
> Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---

Reviewed-by: Andrew Murray <andrew.murray@arm.com>

>  virt/kvm/arm/pmu.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> index 362a01886bab..c30c3a74fc7f 100644
> --- a/virt/kvm/arm/pmu.c
> +++ b/virt/kvm/arm/pmu.c
> @@ -146,8 +146,7 @@ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx)
>  	if (kvm_pmu_pmc_is_chained(pmc) &&
>  	    kvm_pmu_idx_is_high_counter(select_idx))
>  		counter = upper_32_bits(counter);
> -
> -	else if (!kvm_pmu_idx_is_64bit(vcpu, select_idx))
> +	else if (select_idx != ARMV8_PMU_CYCLE_IDX)
>  		counter = lower_32_bits(counter);
>  
>  	return counter;
> @@ -193,7 +192,7 @@ static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc)
>   */
>  static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
>  {
> -	u64 counter, reg;
> +	u64 counter, reg, val;
>  
>  	pmc = kvm_pmu_get_canonical_pmc(pmc);
>  	if (!pmc->perf_event)
> @@ -201,16 +200,19 @@ static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
>  
>  	counter = kvm_pmu_get_pair_counter_value(vcpu, pmc);
>  
> -	if (kvm_pmu_pmc_is_chained(pmc)) {
> -		reg = PMEVCNTR0_EL0 + pmc->idx;
> -		__vcpu_sys_reg(vcpu, reg) = lower_32_bits(counter);
> -		__vcpu_sys_reg(vcpu, reg + 1) = upper_32_bits(counter);
> +	if (pmc->idx == ARMV8_PMU_CYCLE_IDX) {
> +		reg = PMCCNTR_EL0;
> +		val = counter;
>  	} else {
> -		reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX)
> -		       ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + pmc->idx;
> -		__vcpu_sys_reg(vcpu, reg) = lower_32_bits(counter);
> +		reg = PMEVCNTR0_EL0 + pmc->idx;
> +		val = lower_32_bits(counter);
>  	}
>  
> +	__vcpu_sys_reg(vcpu, reg) = val;
> +
> +	if (kvm_pmu_pmc_is_chained(pmc))
> +		__vcpu_sys_reg(vcpu, reg + 1) = upper_32_bits(counter);
> +
>  	kvm_pmu_release_perf_event(pmc);
>  }
>  
> -- 
> 2.20.1
> 

* Re: [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling
  2019-10-06 10:46 ` [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling maz
@ 2019-10-07  9:43   ` Andrew Murray
  2019-10-07 10:48     ` Marc Zyngier
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Murray @ 2019-10-07  9:43 UTC (permalink / raw)
  To: maz
  Cc: Mark Rutland, kvm, Suzuki K Poulose, James Morse,
	linux-arm-kernel, kvmarm, Julien Thierry

On Sun, Oct 06, 2019 at 11:46:36AM +0100, maz@kernel.org wrote:
> From: Marc Zyngier <maz@kernel.org>
> 
> The PMU emulation code uses the perf event sample period to trigger
> the overflow detection. This works fine  for the *first* overflow
> handling

Although the first overflow is timed correctly, the value
the guest reads may be wrong...

Assuming a Linux guest with the arm_pmu.c driver, if I recall correctly
this writes -(remaining period) to the counter upon stopping/starting.
In the case of a perf_event that is pinned to a task, this will happen
upon every context switch of that task. If the counter was getting close
to overflow before the context switch, then the value written to the
guest counter will be very high and thus the sample_period written in KVM
will be very low...
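
For reference, the period setup looks roughly like this (a simplified
sketch of armpmu_event_set_period() from memory, with the max_period
clamping and edge cases elided):

	/*
	 * The counter is programmed with the negated remaining period,
	 * so that it overflows after 'left' more events.
	 */
	s64 left = local64_read(&hwc->period_left);

	if (left <= 0) {
		/* previous period elapsed, start a new one */
		left += hwc->sample_period;
		local64_set(&hwc->period_left, left);
	}

	local64_set(&hwc->prev_count, (u64)-left);
	armpmu->write_counter(event, (u64)(-left) & max_period);

When the guest driver does this, the write_counter() above is what lands
in the guest's PMEVCNTR/PMCCNTR, and that value is what KVM turns into a
(possibly tiny) sample_period for the host perf_event.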

The best scenario is when the host handles the overflow, the guest
handles its overflow and rewrites the guest counter (resulting in a new
host perf_event) - all before the first host perf_event fires again. This
is clearly the assumption the code makes.

Or - the host handles its overflow and kicks the guest, but the guest
doesn't respond in time, so we end up endlessly and pointlessly kicking it
for each host overflow - thus resulting in the large difference in the
number of interrupts between host and guest. This isn't ideal, because
when the guest does read its counter, the value isn't correct (because it
overflowed a zillion times at a value less than the architected max).

Worse still is when the sample_period is so small that the host can't
even keep up.

> , but results in a huge number of interrupts on the host,
> unrelated to the number of interrupts handled in the guest (a x20
> factor is pretty common for the cycle counter). On a slow system
> (such as a SW model), this can result in the guest only making
> forward progress at a glacial pace.
> 
> It turns out that the clue is in the name. The sample period is
> exactly that: a period. And once the an overflow has occured,
> the following period should be the full width of the associated
> counter, instead of whatever the guest had initially programed.
> 
> Reset the sample period to the architected value in the overflow
> handler, which now results in a number of host interrupts that is
> much closer to the number of interrupts in the guest.

This seems a reasonable pragmatic approach - though of course you will end
up counting slightly slower due to the host interrupt latency. But that's
better than the status quo.

It may be possible with perf to have a single-fire counter (this mitigates
against my third scenario but you still end up with a loss of precision) -
See PERF_EVENT_IOC_REFRESH.

Ideally the PERF_EVENT_IOC_REFRESH type of functionality could be updated
to reload to a different value after the first hit.

This problem also exists on arch/x86/kvm/pmu.c (though I'm not sure what
their PMU drivers do with respect to the value they write).

> 
> Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  virt/kvm/arm/pmu.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> index c30c3a74fc7f..3ca4761fc0f5 100644
> --- a/virt/kvm/arm/pmu.c
> +++ b/virt/kvm/arm/pmu.c
> @@ -444,6 +444,18 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
>  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
>  	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
>  	int idx = pmc->idx;
> +	u64 val, period;
> +
> +	/* Start by resetting the sample period to the architectural limit */
> +	val = kvm_pmu_get_pair_counter_value(vcpu, pmc);
> +
> +	if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))

This is correct, because in this case we *do* care about _PMCR_LC.

> +		period = (-val) & GENMASK(63, 0);
> +	else
> +		period = (-val) & GENMASK(31, 0);
> +
> +	pmc->perf_event->attr.sample_period = period;
> +	pmc->perf_event->hw.sample_period = period;

I'm not sure about the above line - does direct manipulation of sample_period
work on a running perf event? As far as I can tell this is already done in the
kernel with __perf_event_period - however this also does other stuff (such as
disable and re-enable the event).
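
For reference, for a non-freq event the core helper boils down to roughly
this (abridged from memory, not verbatim kernel code):

	/* __perf_event_period(), abridged */
	bool active = (event->state == PERF_EVENT_STATE_ACTIVE);

	event->attr.sample_period = value;
	event->hw.sample_period = value;

	if (active) {
		perf_pmu_disable(event->ctx->pmu);
		event->pmu->stop(event, PERF_EF_UPDATE);
	}

	local64_set(&event->hw.period_left, 0);

	if (active) {
		event->pmu->start(event, PERF_EF_RELOAD);
		perf_pmu_enable(event->ctx->pmu);
	}

i.e. it is the stop(PERF_EF_UPDATE)/start(PERF_EF_RELOAD) pair, together
with the period_left reset, that actually pushes the new period into the
hardware.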

>  

Thanks,

Andrew Murray

>  	__vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(idx);
>  
> -- 
> 2.20.1
> 

* Re: [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling
  2019-10-07  9:43   ` Andrew Murray
@ 2019-10-07 10:48     ` Marc Zyngier
  2019-10-07 13:04       ` Andrew Murray
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2019-10-07 10:48 UTC (permalink / raw)
  To: Andrew Murray, Mark Rutland
  Cc: kvm, Suzuki K Poulose, James Morse, linux-arm-kernel, kvmarm,
	Julien Thierry

On Mon, 07 Oct 2019 10:43:27 +0100,
Andrew Murray <andrew.murray@arm.com> wrote:
> 
> On Sun, Oct 06, 2019 at 11:46:36AM +0100, maz@kernel.org wrote:
> > From: Marc Zyngier <maz@kernel.org>
> > 
> > The PMU emulation code uses the perf event sample period to trigger
> > the overflow detection. This works fine  for the *first* overflow
> > handling
> 
> Although, even though the first overflow is timed correctly, the value
> the guest reads may be wrong...
> 
> Assuming a Linux guest with the arm_pmu.c driver, if I recall correctly
> this writes the -remainingperiod to the counter upon stopping/starting.
> In the case of a perf_event that is pinned to a task, this will happen
> upon every context switch of that task. If the counter was getting close
> to overflow before the context switch, then the value written to the
> guest counter will be very high and thus the sample_period written in KVM
> will be very low...
> 
> The best scenario is when the host handles the overflow, the guest
> handles its overflow and rewrites the guest counter (resulting in a new
> host perf_event) - all before the first host perf_event fires again. This
> is clearly the assumption the code makes.
> 
> Or - the host handles its overflow and kicks the guest, but the guest
> doesn't respond in time, so we end up endlessly and pointlessly kicking it
> for each host overflow - thus resulting in the large difference between number
> of interrupts between host and guest. This isn't ideal, because when the
> guest does read its counter, the value isn't correct (because it overflowed
> a zillion times at a value less than the arrchitected max).
> 
> Worse still is when the sample_period is so small, the host doesn't
> even keep up.

Well, there are plenty of ways to make this code go mad. The
overarching reason is that we abuse the notion of sampling period to
generate interrupts, while what we'd really like is something that
says "call be back in that many events", rather than the sampling
period which doesn't match the architecture.

Yes, small values will result in large drifts. Nothing we can do
about it.

> 
> > , but results in a huge number of interrupts on the host,
> > unrelated to the number of interrupts handled in the guest (a x20
> > factor is pretty common for the cycle counter). On a slow system
> > (such as a SW model), this can result in the guest only making
> > forward progress at a glacial pace.
> > 
> > It turns out that the clue is in the name. The sample period is
> > exactly that: a period. And once the an overflow has occured,
> > the following period should be the full width of the associated
> > counter, instead of whatever the guest had initially programed.
> > 
> > Reset the sample period to the architected value in the overflow
> > handler, which now results in a number of host interrupts that is
> > much closer to the number of interrupts in the guest.
> 
> This seems a reasonable pragmatic approach - though of course you will end
> up counting slightly slower due to the host interrupt latency. But that's
> better than the status quo.

Slower than what?

> 
> It may be possible with perf to have a single-fire counter (this mitigates
> against my third scenario but you still end up with a loss of precision) -
> See PERF_EVENT_IOC_REFRESH.

Unfortunately, that's a userspace interface, not something that's
available to the kernel at large...

> Ideally the PERF_EVENT_IOC_REFRESH type of functionality could be updated
> to reload to a different value after the first hit.

Which is what I was hinting at above. I'd like a way to reload the
next period on each expiration, much like a timer.

> 
> This problem also exists on arch/x86/kvm/pmu.c (though I'm not sure what
> their PMU drivers do with respect to the value they write).
> 
> > 
> > Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  virt/kvm/arm/pmu.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> > index c30c3a74fc7f..3ca4761fc0f5 100644
> > --- a/virt/kvm/arm/pmu.c
> > +++ b/virt/kvm/arm/pmu.c
> > @@ -444,6 +444,18 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
> >  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> >  	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
> >  	int idx = pmc->idx;
> > +	u64 val, period;
> > +
> > +	/* Start by resetting the sample period to the architectural limit */
> > +	val = kvm_pmu_get_pair_counter_value(vcpu, pmc);
> > +
> > +	if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
> 
> This is correct, because in this case we *do* care about _PMCR_LC.
> 
> > +		period = (-val) & GENMASK(63, 0);
> > +	else
> > +		period = (-val) & GENMASK(31, 0);
> > +
> > +	pmc->perf_event->attr.sample_period = period;
> > +	pmc->perf_event->hw.sample_period = period;
> 
> I'm not sure about the above line - does direct manipulation of sample_period
> work on a running perf event? As far as I can tell this is already done in the
> kernel with __perf_event_period - however this also does other stuff (such as
> disable and re-enable the event).

I'm not sure you could do that in the handler, which is run in atomic
context. It doesn't look like anything bad happens when updating the
sample period directly (the whole thing has stopped getting crazy),
but I'd really like someone who understands the perf internals to help
here (hence Mark being on cc).

Thanks,

	M.

-- 
Jazz is not dead, it just smells funny.

* Re: [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling
  2019-10-07 10:48     ` Marc Zyngier
@ 2019-10-07 13:04       ` Andrew Murray
  2019-10-07 17:17         ` Marc Zyngier
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Murray @ 2019-10-07 13:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Mark Rutland, kvm, Suzuki K Poulose, James Morse,
	linux-arm-kernel, kvmarm, Julien Thierry

On Mon, Oct 07, 2019 at 11:48:33AM +0100, Marc Zyngier wrote:
> On Mon, 07 Oct 2019 10:43:27 +0100,
> Andrew Murray <andrew.murray@arm.com> wrote:
> > 
> > On Sun, Oct 06, 2019 at 11:46:36AM +0100, maz@kernel.org wrote:
> > > From: Marc Zyngier <maz@kernel.org>
> > > 
> > > The PMU emulation code uses the perf event sample period to trigger
> > > the overflow detection. This works fine  for the *first* overflow
> > > handling
> > 
> > Although, even though the first overflow is timed correctly, the value
> > the guest reads may be wrong...
> > 
> > Assuming a Linux guest with the arm_pmu.c driver, if I recall correctly
> > this writes the -remainingperiod to the counter upon stopping/starting.
> > In the case of a perf_event that is pinned to a task, this will happen
> > upon every context switch of that task. If the counter was getting close
> > to overflow before the context switch, then the value written to the
> > guest counter will be very high and thus the sample_period written in KVM
> > will be very low...
> > 
> > The best scenario is when the host handles the overflow, the guest
> > handles its overflow and rewrites the guest counter (resulting in a new
> > host perf_event) - all before the first host perf_event fires again. This
> > is clearly the assumption the code makes.
> > 
> > Or - the host handles its overflow and kicks the guest, but the guest
> > doesn't respond in time, so we end up endlessly and pointlessly kicking it
> > for each host overflow - thus resulting in the large difference between number
> > of interrupts between host and guest. This isn't ideal, because when the
> > guest does read its counter, the value isn't correct (because it overflowed
> > a zillion times at a value less than the arrchitected max).
> > 
> > Worse still is when the sample_period is so small, the host doesn't
> > even keep up.
> 
> Well, there are plenty of ways to make this code go mad. The
> overarching reason is that we abuse the notion of sampling period to
> generate interrupts, while what we'd really like is something that
> says "call be back in that many events", rather than the sampling
> period which doesn't match the architecture.
> 
> Yes, small values will results in large drifts. Nothing we can do
> about it.
> 
> > 
> > > , but results in a huge number of interrupts on the host,
> > > unrelated to the number of interrupts handled in the guest (a x20
> > > factor is pretty common for the cycle counter). On a slow system
> > > (such as a SW model), this can result in the guest only making
> > > forward progress at a glacial pace.
> > > 
> > > It turns out that the clue is in the name. The sample period is
> > > exactly that: a period. And once the an overflow has occured,
> > > the following period should be the full width of the associated
> > > counter, instead of whatever the guest had initially programed.
> > > 
> > > Reset the sample period to the architected value in the overflow
> > > handler, which now results in a number of host interrupts that is
> > > much closer to the number of interrupts in the guest.
> > 
> > This seems a reasonable pragmatic approach - though of course you will end
> > up counting slightly slower due to the host interrupt latency. But that's
> > better than the status quo.
> 
> Slower than what?
> 

Slower than the guest should expect. Assuming a cycle counter (with LC) is
initially programmed to 0, you'd target a guest interrupt period of 2^64 x cycle
period...

But I'm wrong in saying that you end up counting slightly slower - as
you're not restarting the perf counter or changing its value, there
should be no change in the interrupt period seen by the guest.

I was considering the case where the kernel perf event is recreated in
the overflow handler, in which case, unless you account for the time
elapsed between the event firing and the sample_period being changed,
you end up with a larger period.

> > 
> > It may be possible with perf to have a single-fire counter (this mitigates
> > against my third scenario but you still end up with a loss of precision) -
> > See PERF_EVENT_IOC_REFRESH.
> 
> Unfortunately, that's a userspace interface, not something that's
> available to the kernel at large...

The mechanism to change the value of event->event_limit is only available via
ioctl, though I was implying that an in-kernel mechanism could be provided.
This would be trivial. (But it doesn't help, as I don't think you could create
another perf kernel event in that context).
 
> 
> > Ideally the PERF_EVENT_IOC_REFRESH type of functionality could be updated
> > to reload to a different value after the first hit.
> 
> Which is what I was hinting at above. I'd like a way to reload the
> next period on each expiration, much like a timer.
> 
> > 
> > This problem also exists on arch/x86/kvm/pmu.c (though I'm not sure what
> > their PMU drivers do with respect to the value they write).
> > 
> > > 
> > > Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
> > > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > > ---
> > >  virt/kvm/arm/pmu.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> > > index c30c3a74fc7f..3ca4761fc0f5 100644
> > > --- a/virt/kvm/arm/pmu.c
> > > +++ b/virt/kvm/arm/pmu.c
> > > @@ -444,6 +444,18 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
> > >  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > >  	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
> > >  	int idx = pmc->idx;
> > > +	u64 val, period;
> > > +
> > > +	/* Start by resetting the sample period to the architectural limit */
> > > +	val = kvm_pmu_get_pair_counter_value(vcpu, pmc);
> > > +
> > > +	if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
> > 
> > This is correct, because in this case we *do* care about _PMCR_LC.
> > 
> > > +		period = (-val) & GENMASK(63, 0);
> > > +	else
> > > +		period = (-val) & GENMASK(31, 0);
> > > +
> > > +	pmc->perf_event->attr.sample_period = period;
> > > +	pmc->perf_event->hw.sample_period = period;
> > 
> > I'm not sure about the above line - does direct manipulation of sample_period
> > work on a running perf event? As far as I can tell this is already done in the
> > kernel with __perf_event_period - however this also does other stuff (such as
> > disable and re-enable the event).
> 
> I'm not sure you could do that in the handler, which is run in atomic
> context. It doesn't look like anything bad happens when updating the
> sample period directly (the whole thing has stopped getting crazy),
> but I'd really like someone who understands the perf internals to help
> here (hence Mark being on cc).

I suspect this is working lazily - when you want to change the underlying pmu
period, you need to write the new period to the host PMU counters. This is done
in armpmu_start. __perf_event_period would normally stop and then start the
PMU to achieve this (hence the PERF_EF_RELOAD flag). Your code doesn't do this.

However, the perf counter set up in KVM is always pinned to the guest process
and thus when switching to/from this task the counters are stopped and started.
Therefore I suspect the sample_period you change goes into effect at this point
in time. So it probably stops going crazy - but not immediately.

I think the underlying counter also gets reset to the new period just before it
calls perf_event_overflow (see armv8pmu_handle_irq) - so worst case you'll wait
until it overflows for the second time.
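
That is, the per-counter handling in the IRQ handler is roughly this
(simplified):

	armpmu_event_update(event);
	perf_sample_data_init(&data, 0, hwc->last_period);

	/* The counter is reprogrammed *before* the overflow callback... */
	if (!armpmu_event_set_period(event))
		continue;

	/*
	 * ...so a sample_period updated inside kvm_pmu_perf_overflow()
	 * is only picked up on a later reload.
	 */
	if (perf_event_overflow(event, &data, regs))
		cpu_pmu->disable(event);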

In any case this is still better than the status quo.

Thanks,

Andrew Murray

> 
> Thanks,
> 
> 	M.
> 
> -- 
> Jazz is not dead, it just smells funny.

* Re: [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling
  2019-10-07 13:04       ` Andrew Murray
@ 2019-10-07 17:17         ` Marc Zyngier
  0 siblings, 0 replies; 9+ messages in thread
From: Marc Zyngier @ 2019-10-07 17:17 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Mark Rutland, kvm, Suzuki K Poulose, James Morse,
	linux-arm-kernel, kvmarm, Julien Thierry

On Mon, 7 Oct 2019 14:04:58 +0100
Andrew Murray <andrew.murray@arm.com> wrote:

> On Mon, Oct 07, 2019 at 11:48:33AM +0100, Marc Zyngier wrote:
> > On Mon, 07 Oct 2019 10:43:27 +0100,
> > Andrew Murray <andrew.murray@arm.com> wrote:  
> > > 
> > > On Sun, Oct 06, 2019 at 11:46:36AM +0100, maz@kernel.org wrote:  
> > > > From: Marc Zyngier <maz@kernel.org>
> > > > 
> > > > The PMU emulation code uses the perf event sample period to trigger
> > > > the overflow detection. This works fine  for the *first* overflow
> > > > handling  
> > > 
> > > Although, even though the first overflow is timed correctly, the value
> > > the guest reads may be wrong...
> > > 
> > > Assuming a Linux guest with the arm_pmu.c driver, if I recall correctly
> > > this writes the -remainingperiod to the counter upon stopping/starting.
> > > In the case of a perf_event that is pinned to a task, this will happen
> > > upon every context switch of that task. If the counter was getting close
> > > to overflow before the context switch, then the value written to the
> > > guest counter will be very high and thus the sample_period written in KVM
> > > will be very low...
> > > 
> > > The best scenario is when the host handles the overflow, the guest
> > > handles its overflow and rewrites the guest counter (resulting in a new
> > > host perf_event) - all before the first host perf_event fires again. This
> > > is clearly the assumption the code makes.
> > > 
> > > Or - the host handles its overflow and kicks the guest, but the guest
> > > doesn't respond in time, so we end up endlessly and pointlessly kicking it
> > > for each host overflow - thus resulting in the large difference between number
> > > of interrupts between host and guest. This isn't ideal, because when the
> > > guest does read its counter, the value isn't correct (because it overflowed
> > > a zillion times at a value less than the arrchitected max).
> > > 
> > > Worse still is when the sample_period is so small, the host doesn't
> > > even keep up.  
> > 
> > Well, there are plenty of ways to make this code go mad. The
> > overarching reason is that we abuse the notion of sampling period to
> > generate interrupts, while what we'd really like is something that
> > says "call be back in that many events", rather than the sampling
> > period which doesn't match the architecture.
> > 
> > Yes, small values will results in large drifts. Nothing we can do
> > about it.
> >   
> > >   
> > > > , but results in a huge number of interrupts on the host,
> > > > unrelated to the number of interrupts handled in the guest (a x20
> > > > factor is pretty common for the cycle counter). On a slow system
> > > > (such as a SW model), this can result in the guest only making
> > > > forward progress at a glacial pace.
> > > > 
> > > > It turns out that the clue is in the name. The sample period is
> > > > exactly that: a period. And once the an overflow has occured,
> > > > the following period should be the full width of the associated
> > > > counter, instead of whatever the guest had initially programed.
> > > > 
> > > > Reset the sample period to the architected value in the overflow
> > > > handler, which now results in a number of host interrupts that is
> > > > much closer to the number of interrupts in the guest.  
> > > 
> > > This seems a reasonable pragmatic approach - though of course you will end
> > > up counting slightly slower due to the host interrupt latency. But that's
> > > better than the status quo.  
> > 
> > Slower than what?
> >   
> 
> Slower than the guest should expect. Assuming a cycle counter (with LC) is
> initially programmed to 0, you'd target a guest interrupt period of 2^64 x cycle
> period...

Which is exactly what is expected, isn't it?

> But I'm wrong in saying that you end up counting slightly slower - as you're
> not restarting the perf counter or changing the value so there should be no change
> in the interrupt period to the guest.
> 
> I was considering the case where the kernel perf event is recreated in the
> overflow handler, in which case unless you consider the time elapsed between the
> event firing and changing the sample_period then you end up with a larger period.

The only thing that changes is the point at which the next period will
end, matching the expected overflow.

> > > 
> > > It may be possible with perf to have a single-fire counter (this mitigates
> > > against my third scenario but you still end up with a loss of precision) -
> > > See PERF_EVENT_IOC_REFRESH.  
> > 
> > Unfortunately, that's a userspace interface, not something that's
> > available to the kernel at large...  
> 
> The mechanism to change the value of event->event_limit is only available via
> ioctl, though I was implying that an in-kernel mechansim could be provided.
> This would be trivial. (But it doesn't help, as I don't think you could create
> another perf kernel event in that context).
>  
> >   
> > > Ideally the PERF_EVENT_IOC_REFRESH type of functionality could be updated
> > > to reload to a different value after the first hit.  
> > 
> > Which is what I was hinting at above. I'd like a way to reload the
> > next period on each expiration, much like a timer.
> >   
> > > 
> > > This problem also exists on arch/x86/kvm/pmu.c (though I'm not sure what
> > > their PMU drivers do with respect to the value they write).
> > >   
> > > > 
> > > > Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
> > > > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > > > ---
> > > >  virt/kvm/arm/pmu.c | 12 ++++++++++++
> > > >  1 file changed, 12 insertions(+)
> > > > 
> > > > diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> > > > index c30c3a74fc7f..3ca4761fc0f5 100644
> > > > --- a/virt/kvm/arm/pmu.c
> > > > +++ b/virt/kvm/arm/pmu.c
> > > > @@ -444,6 +444,18 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
> > > >  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > > >  	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
> > > >  	int idx = pmc->idx;
> > > > +	u64 val, period;
> > > > +
> > > > +	/* Start by resetting the sample period to the architectural limit */
> > > > +	val = kvm_pmu_get_pair_counter_value(vcpu, pmc);
> > > > +
> > > > +	if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))  
> > > 
> > > This is correct, because in this case we *do* care about _PMCR_LC.
> > >   
> > > > +		period = (-val) & GENMASK(63, 0);
> > > > +	else
> > > > +		period = (-val) & GENMASK(31, 0);
> > > > +
> > > > +	pmc->perf_event->attr.sample_period = period;
> > > > +	pmc->perf_event->hw.sample_period = period;  
> > > 
> > > I'm not sure about the above line - does direct manipulation of sample_period
> > > work on a running perf event? As far as I can tell this is already done in the
> > > kernel with __perf_event_period - however this also does other stuff (such as
> > > disable and re-enable the event).  
> > 
> > I'm not sure you could do that in the handler, which is run in atomic
> > context. It doesn't look like anything bad happens when updating the
> > sample period directly (the whole thing has stopped getting crazy),
> > but I'd really like someone who understands the perf internals to help
> > here (hence Mark being on cc).  
> 
> I suspect this is working lazily - when you want to change the underlying pmu
> period, you need to write the new period to the host PMU counters. This is done
> in armpmu_start. __perf_event_period would normally stop and then start the
> PMU to achieve this (hence the PERF_EF_RELOAD flag). Your code doesn't do this.

And yet I don't get these extra interrupts, so something must be
happening.

> However, the perf counter set up in KVM is always pinned to the guest process
> and thus when switching to/from this task the counter are stopped and started.
> Therefore I suspect the sample_period you change goes into effect at this point
> in time. So it probably stops going crazy - but not immediately.

Fair enough. I wonder if we can tell perf to always stop the event
before calling the handler, and resume it on return from the handler.
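
Or the handler could simply do it itself. Something along these lines,
perhaps (a completely untested sketch on top of this patch, assuming the
pmu::stop/start callbacks may be called from the overflow handler and
that ->count is up to date after stop(PERF_EF_UPDATE)):

static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
				  struct perf_sample_data *data,
				  struct pt_regs *regs)
{
	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
	struct arm_pmu *cpu_pmu = to_arm_pmu(perf_event->pmu);
	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
	int idx = pmc->idx;
	u64 period;

	/* Stop the event so that ->count is up to date */
	cpu_pmu->pmu.stop(perf_event, PERF_EF_UPDATE);

	/* Next period is whatever is left until the counter wraps */
	period = -(local64_read(&perf_event->count));
	if (!kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
		period &= GENMASK(31, 0);

	local64_set(&perf_event->hw.period_left, 0);
	perf_event->attr.sample_period = period;
	perf_event->hw.sample_period = period;

	__vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(idx);

	if (kvm_pmu_overflow_status(vcpu)) {
		kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
		kvm_vcpu_kick(vcpu);
	}

	/* Restart the event, reprogramming the HW with the new period */
	cpu_pmu->pmu.start(perf_event, PERF_EF_RELOAD);
}

That would push the new period into the hardware immediately, instead of
waiting for the next context switch of the vcpu thread.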

> I think the underlying counter also gets reset to the new period just before it
> calls perf_event_overflow (see armv8pmu_handle_irq) - so worse case you'll wait
> until it overflows for the second time.
> 
> In any case this is still better than the status quo.

Well, I'd still like to have something that is in line with the perf
usage model... It's been broken forever, so I guess it can wait another
few weeks to be correctly solved.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


Thread overview: 9 messages
2019-10-06 10:46 [PATCH 0/3] KVM: arm64: Assorted PMU emulation fixes maz
2019-10-06 10:46 ` [PATCH 1/3] KVM: arm64: pmu: Fix cycle counter truncation maz
2019-10-07  8:48   ` Andrew Murray
2019-10-06 10:46 ` [PATCH 2/3] arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems maz
2019-10-06 10:46 ` [PATCH 3/3] KVM: arm64: pmu: Reset sample period on overflow handling maz
2019-10-07  9:43   ` Andrew Murray
2019-10-07 10:48     ` Marc Zyngier
2019-10-07 13:04       ` Andrew Murray
2019-10-07 17:17         ` Marc Zyngier
