All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] ARM/arm64: KVM: Yield CPU when vcpu executes a WFE
@ 2013-10-07 15:40 ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-07 15:40 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm; +Cc: Christoffer Dall

This is a respin of a patch I posted a long while ago, this time with
numbers that I hope are convincing enough.

The basic idea is that spinning on WFE in a guest is a waste of
resources, and that we're better off running another vcpu instead. This
shows up especially when the system is oversubscribed: the guest vcpus
can be seen spinning, waiting for a lock to be released, while the lock
holder is nowhere near a physical CPU.

This patch series just enables WFE trapping on both ARM and arm64, and
calls kvm_vcpu_on_spin(). This is enough to boost other vcpus, and
dramatically reduce the overhead.

Branch available at:
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/wfe-trap

Marc Zyngier (2):
  ARM: KVM: Yield CPU when vcpu executes a WFE
  arm64: KVM: Yield CPU when vcpu executes a WFE

 arch/arm/include/asm/kvm_arm.h   |  4 +++-
 arch/arm/kvm/handle_exit.c       |  6 +++++-
 arch/arm64/include/asm/kvm_arm.h |  8 ++++++--
 arch/arm64/kvm/handle_exit.c     | 18 +++++++++++++-----
 4 files changed, 27 insertions(+), 9 deletions(-)

-- 
1.8.2.3



^ permalink raw reply	[flat|nested] 50+ messages in thread


* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 15:40 ` Marc Zyngier
@ 2013-10-07 15:40   ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-07 15:40 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm; +Cc: Christoffer Dall

On an (even slightly) oversubscribed system, spinlocks quickly become
a bottleneck, as some vcpus spin waiting for a lock to be released
while the vcpu holding the lock may not be running at all.

This creates contention, and the observed slowdown is 40x for
hackbench. No, this isn't a typo.

The solution is to trap blocking WFEs and tell KVM that we're
now spinning. This ensures that other vcpus will get a scheduling
boost, allowing the lock to be released more quickly.

From a performance point of view: hackbench 1 process 1000

2xA15 host (baseline):	1.843s

2xA15 guest w/o patch:	2.083s
4xA15 guest w/o patch:	80.212s

2xA15 guest w/ patch:	2.072s
4xA15 guest w/ patch:	3.202s

So we go from a 40x degradation to 1.5x, which is vaguely more
acceptable.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_arm.h | 4 +++-
 arch/arm/kvm/handle_exit.c     | 6 +++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index 64e9696..693d5b2 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -67,7 +67,7 @@
  */
 #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
 			HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-			HCR_SWIO | HCR_TIDCP)
+			HCR_TWE | HCR_SWIO | HCR_TIDCP)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
 /* System Control Register (SCTLR) bits */
@@ -208,6 +208,8 @@
 #define HSR_EC_DABT	(0x24)
 #define HSR_EC_DABT_HYP	(0x25)
 
+#define HSR_WFI_IS_WFE		(1U << 0)
+
 #define HSR_HVC_IMM_MASK	((1UL << 16) - 1)
 
 #define HSR_DABT_S1PTW		(1U << 7)
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index df4c82d..c4c496f 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run)
 static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	trace_kvm_wfi(*vcpu_pc(vcpu));
-	kvm_vcpu_block(vcpu);
+	if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
+		kvm_vcpu_on_spin(vcpu);
+	else
+		kvm_vcpu_block(vcpu);
+
 	return 1;
 }
 
-- 
1.8.2.3



^ permalink raw reply related	[flat|nested] 50+ messages in thread


* [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 15:40 ` Marc Zyngier
@ 2013-10-07 15:40   ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-07 15:40 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, kvm; +Cc: Christoffer Dall

On an (even slightly) oversubscribed system, spinlocks quickly become
a bottleneck, as some vcpus spin waiting for a lock to be released
while the vcpu holding the lock may not be running at all.

The solution is to trap blocking WFEs and tell KVM that we're
now spinning. This ensures that other vcpus will get a scheduling
boost, allowing the lock to be released more quickly.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm64/include/asm/kvm_arm.h |  8 ++++++--
 arch/arm64/kvm/handle_exit.c     | 18 +++++++++++++-----
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index a5f28e2..c98ef47 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -63,6 +63,7 @@
  * TAC:		Trap ACTLR
  * TSC:		Trap SMC
  * TSW:		Trap cache operations by set/way
+ * TWE:		Trap WFE
  * TWI:		Trap WFI
  * TIDCP:	Trap L2CTLR/L2ECTLR
  * BSU_IS:	Upgrade barriers to the inner shareable domain
@@ -72,8 +73,9 @@
  * FMO:		Override CPSR.F and enable signaling with VF
  * SWIO:	Turn set/way invalidates into set/way clean+invalidate
  */
-#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
-			 HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
+#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
+			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
+			 HCR_AMO | HCR_IMO | HCR_FMO | \
 			 HCR_SWIO | HCR_TIDCP | HCR_RW)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
@@ -242,4 +244,6 @@
 
 #define ESR_EL2_EC_xABT_xFSR_EXTABT	0x10
 
+#define ESR_EL2_EC_WFI_ISS_WFE	(1 << 0)
+
 #endif /* __ARM64_KVM_ARM_H__ */
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 9beaca03..8da5606 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -47,21 +47,29 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run)
 }
 
 /**
- * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a guest
+ * kvm_handle_wfx - handle a wait-for-interrupts or wait-for-event
+ *		    instruction executed by a guest
+ *
  * @vcpu:	the vcpu pointer
  *
- * Simply call kvm_vcpu_block(), which will halt execution of
+ * WFE: Yield the CPU and come back to this vcpu when the scheduler
+ * decides to.
+ * WFI: Simply call kvm_vcpu_block(), which will halt execution of
  * world-switches and schedule other host processes until there is an
  * incoming IRQ or FIQ to the VM.
  */
-static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
+static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	kvm_vcpu_block(vcpu);
+	if (kvm_vcpu_get_hsr(vcpu) & ESR_EL2_EC_WFI_ISS_WFE)
+		kvm_vcpu_on_spin(vcpu);
+	else
+		kvm_vcpu_block(vcpu);
+
 	return 1;
 }
 
 static exit_handle_fn arm_exit_handlers[] = {
-	[ESR_EL2_EC_WFI]	= kvm_handle_wfi,
+	[ESR_EL2_EC_WFI]	= kvm_handle_wfx,
 	[ESR_EL2_EC_CP15_32]	= kvm_handle_cp15_32,
 	[ESR_EL2_EC_CP15_64]	= kvm_handle_cp15_64,
 	[ESR_EL2_EC_CP14_MR]	= kvm_handle_cp14_access,
-- 
1.8.2.3



^ permalink raw reply related	[flat|nested] 50+ messages in thread


* RE: [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 15:40   ` Marc Zyngier
@ 2013-10-07 15:52     ` Bhushan Bharat-R65777
  -1 siblings, 0 replies; 50+ messages in thread
From: Bhushan Bharat-R65777 @ 2013-10-07 15:52 UTC (permalink / raw)
  To: Marc Zyngier, linux-arm-kernel, kvmarm, kvm



> -----Original Message-----
> From: Marc Zyngier [mailto:marc.zyngier@arm.com]
> Sent: Monday, October 07, 2013 9:11 PM
> To: linux-arm-kernel@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> kvm@vger.kernel.org
> Subject: [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE
> 
> On an (even slightly) oversubscribed system, spinlocks are quickly becoming a
> bottleneck, as some vcpus are spinning, waiting for a lock to be released, while
> the vcpu holding the lock may not be running at all.
> 
> The solution is to trap blocking WFEs and tell KVM that we're now spinning. This
> ensures that other vpus will get a scheduling boost, allowing the lock to be
> released more quickly.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm64/include/asm/kvm_arm.h |  8 ++++++--
>  arch/arm64/kvm/handle_exit.c     | 18 +++++++++++++-----
>  2 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index a5f28e2..c98ef47 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -63,6 +63,7 @@
>   * TAC:		Trap ACTLR
>   * TSC:		Trap SMC
>   * TSW:		Trap cache operations by set/way
> + * TWE:		Trap WFE
>   * TWI:		Trap WFI
>   * TIDCP:	Trap L2CTLR/L2ECTLR
>   * BSU_IS:	Upgrade barriers to the inner shareable domain
> @@ -72,8 +73,9 @@
>   * FMO:		Override CPSR.F and enable signaling with VF
>   * SWIO:	Turn set/way invalidates into set/way clean+invalidate
>   */
> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
> -			 HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
> +#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
> +			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
> +			 HCR_AMO | HCR_IMO | HCR_FMO | \
>  			 HCR_SWIO | HCR_TIDCP | HCR_RW)
>  #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
> 
> @@ -242,4 +244,6 @@
> 
>  #define ESR_EL2_EC_xABT_xFSR_EXTABT	0x10
> 
> +#define ESR_EL2_EC_WFI_ISS_WFE	(1 << 0)

In the other patch this is named HSR_WFI_IS_WFE whereas here it is
ESR_EL2_EC_WFI_ISS_WFE; it looks like a typo. Anyway, what I am
interested to understand is: what does this macro mean?

Thanks
-Bharat

> +
>  #endif /* __ARM64_KVM_ARM_H__ */
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index
> 9beaca03..8da5606 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -47,21 +47,29 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run
> *run)  }
> 
>  /**
> - * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a
> guest
> + * kvm_handle_wfx - handle a wait-for-interrupts or wait-for-event
> + *		    instruction executed by a guest
> + *
>   * @vcpu:	the vcpu pointer
>   *
> - * Simply call kvm_vcpu_block(), which will halt execution of
> + * WFE: Yield the CPU and come back to this vcpu when the scheduler
> + * decides to.
> + * WFI: Simply call kvm_vcpu_block(), which will halt execution of
>   * world-switches and schedule other host processes until there is an
>   * incoming IRQ or FIQ to the VM.
>   */
> -static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> -	kvm_vcpu_block(vcpu);
> +	if (kvm_vcpu_get_hsr(vcpu) & ESR_EL2_EC_WFI_ISS_WFE)
> +		kvm_vcpu_on_spin(vcpu);
> +	else
> +		kvm_vcpu_block(vcpu);
> +
>  	return 1;
>  }
> 
>  static exit_handle_fn arm_exit_handlers[] = {
> -	[ESR_EL2_EC_WFI]	= kvm_handle_wfi,
> +	[ESR_EL2_EC_WFI]	= kvm_handle_wfx,
>  	[ESR_EL2_EC_CP15_32]	= kvm_handle_cp15_32,
>  	[ESR_EL2_EC_CP15_64]	= kvm_handle_cp15_64,
>  	[ESR_EL2_EC_CP14_MR]	= kvm_handle_cp14_access,
> --
> 1.8.2.3
> 
> 
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm



^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 15:52     ` Bhushan Bharat-R65777
@ 2013-10-07 16:00       ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-07 16:00 UTC (permalink / raw)
  To: Bhushan Bharat-R65777; +Cc: linux-arm-kernel, kvmarm, kvm

On 07/10/13 16:52, Bhushan Bharat-R65777 wrote:
> 
> 
>> -----Original Message-----
>> From: Marc Zyngier [mailto:marc.zyngier@arm.com]
>> Sent: Monday, October 07, 2013 9:11 PM
>> To: linux-arm-kernel@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
>> kvm@vger.kernel.org
>> Subject: [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE
>>
>> On an (even slightly) oversubscribed system, spinlocks are quickly becoming a
>> bottleneck, as some vcpus are spinning, waiting for a lock to be released, while
>> the vcpu holding the lock may not be running at all.
>>
>> The solution is to trap blocking WFEs and tell KVM that we're now spinning. This
>> ensures that other vpus will get a scheduling boost, allowing the lock to be
>> released more quickly.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_arm.h |  8 ++++++--
>>  arch/arm64/kvm/handle_exit.c     | 18 +++++++++++++-----
>>  2 files changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index a5f28e2..c98ef47 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -63,6 +63,7 @@
>>   * TAC:		Trap ACTLR
>>   * TSC:		Trap SMC
>>   * TSW:		Trap cache operations by set/way
>> + * TWE:		Trap WFE
>>   * TWI:		Trap WFI
>>   * TIDCP:	Trap L2CTLR/L2ECTLR
>>   * BSU_IS:	Upgrade barriers to the inner shareable domain
>> @@ -72,8 +73,9 @@
>>   * FMO:		Override CPSR.F and enable signaling with VF
>>   * SWIO:	Turn set/way invalidates into set/way clean+invalidate
>>   */
>> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
>> -			 HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
>> +#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>> +			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
>> +			 HCR_AMO | HCR_IMO | HCR_FMO | \
>>  			 HCR_SWIO | HCR_TIDCP | HCR_RW)
>>  #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
>>
>> @@ -242,4 +244,6 @@
>>
>>  #define ESR_EL2_EC_xABT_xFSR_EXTABT	0x10
>>
>> +#define ESR_EL2_EC_WFI_ISS_WFE	(1 << 0)
> 
> In another patch this is named as WHI_IS_WFE whereas here it is WFI_ISS_WFE, looks like typo. Anyways, what I am interested to understand is what does this macro means?

Not a typo. It decodes as:
Exception Syndrome Register, Exception Level 2, Exception Class Wait
For Interrupt, Instruction Specific Syndrome Wait For Event.

The ARM code doesn't have such a convention, so I didn't bother. It just
reads "Hyp Syndrome Register, Wait For Interrupt Is Wait For Event".

	M.

> Thanks
> -Bharat
> 
>> +
>>  #endif /* __ARM64_KVM_ARM_H__ */
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index
>> 9beaca03..8da5606 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -47,21 +47,29 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run
>> *run)  }
>>
>>  /**
>> - * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a
>> guest
>> + * kvm_handle_wfx - handle a wait-for-interrupts or wait-for-event
>> + *		    instruction executed by a guest
>> + *
>>   * @vcpu:	the vcpu pointer
>>   *
>> - * Simply call kvm_vcpu_block(), which will halt execution of
>> + * WFE: Yield the CPU and come back to this vcpu when the scheduler
>> + * decides to.
>> + * WFI: Simply call kvm_vcpu_block(), which will halt execution of
>>   * world-switches and schedule other host processes until there is an
>>   * incoming IRQ or FIQ to the VM.
>>   */
>> -static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> +static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>  {
>> -	kvm_vcpu_block(vcpu);
>> +	if (kvm_vcpu_get_hsr(vcpu) & ESR_EL2_EC_WFI_ISS_WFE)
>> +		kvm_vcpu_on_spin(vcpu);
>> +	else
>> +		kvm_vcpu_block(vcpu);
>> +
>>  	return 1;
>>  }
>>
>>  static exit_handle_fn arm_exit_handlers[] = {
>> -	[ESR_EL2_EC_WFI]	= kvm_handle_wfi,
>> +	[ESR_EL2_EC_WFI]	= kvm_handle_wfx,
>>  	[ESR_EL2_EC_CP15_32]	= kvm_handle_cp15_32,
>>  	[ESR_EL2_EC_CP15_64]	= kvm_handle_cp15_64,
>>  	[ESR_EL2_EC_CP14_MR]	= kvm_handle_cp14_access,
>> --
>> 1.8.2.3
>>
>>
>>
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
> 
> 
> 


-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 15:40   ` Marc Zyngier
@ 2013-10-07 16:04     ` Alexander Graf
  -1 siblings, 0 replies; 50+ messages in thread
From: Alexander Graf @ 2013-10-07 16:04 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: linux-arm-kernel, kvmarm, kvm@vger.kernel.org mailing list


On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:

> On an (even slightly) oversubscribed system, spinlocks are quickly
> becoming a bottleneck, as some vcpus are spinning, waiting for a
> lock to be released, while the vcpu holding the lock may not be
> running at all.
> 
> This creates contention, and the observed slowdown is 40x for
> hackbench. No, this isn't a typo.
> 
> The solution is to trap blocking WFEs and tell KVM that we're
> now spinning. This ensures that other vcpus will get a scheduling
> boost, allowing the lock to be released more quickly.
> 
>> From a performance point of view: hackbench 1 process 1000
> 
> 2xA15 host (baseline):	1.843s
> 
> 2xA15 guest w/o patch:	2.083s
> 4xA15 guest w/o patch:	80.212s
> 
> 2xA15 guest w/ patch:	2.072s
> 4xA15 guest w/ patch:	3.202s

I'm confused. You went from 2.083s when not exiting on spin locks to 2.072s when exiting on _every_ spin lock that didn't immediately succeed. I would've expected the second number to be worse rather than better. I assume it's within jitter; I'm still puzzled why you don't see any significant drop in performance.


Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 16:04     ` Alexander Graf
@ 2013-10-07 16:16       ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-07 16:16 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linux-arm-kernel, kvmarm, kvm@vger.kernel.org mailing list

On 07/10/13 17:04, Alexander Graf wrote:
> 
> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
> 
>> On an (even slightly) oversubscribed system, spinlocks are quickly 
>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
>> lock to be released, while the vcpu holding the lock may not be 
>> running at all.
>> 
>> This creates contention, and the observed slowdown is 40x for 
>> hackbench. No, this isn't a typo.
>> 
>> The solution is to trap blocking WFEs and tell KVM that we're now
>> spinning. This ensures that other vcpus will get a scheduling boost,
>> allowing the lock to be released more quickly.
>> 
>>> From a performance point of view: hackbench 1 process 1000
>> 
>> 2xA15 host (baseline):	1.843s
>> 
>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
>> 
>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
> 
> I'm confused. You got from 2.083s when not exiting on spin locks to
> 2.072 when exiting on _every_ spin lock that didn't immediately
> succeed. I would've expected the second number to be worse rather than
> better. I assume it's within jitter, I'm still puzzled why you don't
> see any significant drop in performance.

The key is in the ARM ARM:

B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
mode other than Hyp mode, execution of a WFE instruction generates a Hyp
Trap exception if, ignoring the value of the HCR.TWE bit, conditions
permit the processor to suspend execution."

So, on a non-overcommitted system, you rarely hit a blocking spinlock,
hence not trapping. Otherwise, performance would go down the drain very
quickly.
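The rule from B1.14.9 boils down to a simple predicate, restated here as a hypothetical C helper (not kernel code) just to make the decision explicit:

```c
#include <stdbool.h>

/* HCR.TWE trap condition per ARM ARM B1.14.9, as a predicate:
 * a guest WFE traps to Hyp only when it would actually have blocked.
 * A WFE executed while an event is pending is effectively a NOP,
 * so it does not trap either. */
static bool wfe_traps_to_hyp(bool hcr_twe, bool event_pending)
{
	return hcr_twe && !event_pending;
}
```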

And yes, the difference is pretty much noise.

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 16:16       ` Marc Zyngier
@ 2013-10-07 16:30         ` Alexander Graf
  -1 siblings, 0 replies; 50+ messages in thread
From: Alexander Graf @ 2013-10-07 16:30 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvmarm, linux-arm-kernel, kvm@vger.kernel.org mailing list


On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:

> On 07/10/13 17:04, Alexander Graf wrote:
>> 
>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> 
>>> On an (even slightly) oversubscribed system, spinlocks are quickly 
>>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
>>> lock to be released, while the vcpu holding the lock may not be 
>>> running at all.
>>> 
>>> This creates contention, and the observed slowdown is 40x for 
>>> hackbench. No, this isn't a typo.
>>> 
>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>> allowing the lock to be released more quickly.
>>> 
>>>> From a performance point of view: hackbench 1 process 1000
>>> 
>>> 2xA15 host (baseline):	1.843s
>>> 
>>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
>>> 
>>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
>> 
>> I'm confused. You got from 2.083s when not exiting on spin locks to
>> 2.072 when exiting on _every_ spin lock that didn't immediately
>> succeed. I would've expected the second number to be worse rather than
>> better. I assume it's within jitter, I'm still puzzled why you don't
>> see any significant drop in performance.
> 
> The key is in the ARM ARM:
> 
> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
> permit the processor to suspend execution."
> 
> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
> hence not trapping. Otherwise, performance would go down the drain very
> quickly.

Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.

I assume you simply don't contend on spin locks much yet. Once you have more guest cores things would look different. So once you have a system with more cores available, it might make sense to measure it again.

Until then, the numbers are impressive.


Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 16:30         ` Alexander Graf
@ 2013-10-07 16:53           ` Gleb Natapov
  -1 siblings, 0 replies; 50+ messages in thread
From: Gleb Natapov @ 2013-10-07 16:53 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Marc Zyngier, kvmarm, linux-arm-kernel, kvm@vger.kernel.org mailing list

On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
> 
> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
> 
> > On 07/10/13 17:04, Alexander Graf wrote:
> >> 
> >> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >> 
> >>> On an (even slightly) oversubscribed system, spinlocks are quickly 
> >>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
> >>> lock to be released, while the vcpu holding the lock may not be 
> >>> running at all.
> >>> 
> >>> This creates contention, and the observed slowdown is 40x for 
> >>> hackbench. No, this isn't a typo.
> >>> 
> >>> The solution is to trap blocking WFEs and tell KVM that we're now
> >>> spinning. This ensures that other vcpus will get a scheduling boost,
> >>> allowing the lock to be released more quickly.
> >>> 
> >>>> From a performance point of view: hackbench 1 process 1000
> >>> 
> >>> 2xA15 host (baseline):	1.843s
> >>> 
> >>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
> >>> 
> >>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
> >> 
> >> I'm confused. You got from 2.083s when not exiting on spin locks to
> >> 2.072 when exiting on _every_ spin lock that didn't immediately
> >> succeed. I would've expected the second number to be worse rather than
> >> better. I assume it's within jitter, I'm still puzzled why you don't
> >> see any significant drop in performance.
> > 
> > The key is in the ARM ARM:
> > 
> > B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
> > mode other than Hyp mode, execution of a WFE instruction generates a Hyp
> > Trap exception if, ignoring the value of the HCR.TWE bit, conditions
> > permit the processor to suspend execution."
> > 
> > So, on a non-overcommitted system, you rarely hit a blocking spinlock,
> > hence not trapping. Otherwise, performance would go down the drain very
> > quickly.
> 
> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
> 
It will hurt performance if the vcpu that holds the lock is running.
Ideally you want to exit to the hypervisor only if the lock holder is
preempted, but there is no way to know that, so you spin for a short time,
and if the lock is not released it means the lock holder is preempted
(a spinlock should not be held for a long time after all), so you exit.
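That spin-then-yield heuristic can be sketched in portable C, standing in for what x86 PLE does in hardware. The threshold value is an arbitrary illustrative constant, and `sched_yield()` stands in for the exit to the hypervisor:

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the heuristic: spin for a bounded window first; only if the
 * lock is still held afterwards, assume the holder is preempted and
 * yield the CPU. SPIN_THRESHOLD is an arbitrary illustrative value. */
#define SPIN_THRESHOLD 1024

static void lock_slowpath(atomic_bool *locked)
{
	for (;;) {
		for (int i = 0; i < SPIN_THRESHOLD; i++) {
			bool expected = false;
			if (atomic_compare_exchange_weak(locked, &expected, true))
				return;	/* holder was running: lock freed quickly */
		}
		sched_yield();		/* holder likely preempted: give up the CPU */
	}
}
```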

--
			Gleb.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 16:30         ` Alexander Graf
@ 2013-10-07 16:55           ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-07 16:55 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linux-arm-kernel, kvmarm, kvm@vger.kernel.org mailing list

On 07/10/13 17:30, Alexander Graf wrote:
> 
> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
> 
>> On 07/10/13 17:04, Alexander Graf wrote:
>>> 
>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com>
>>> wrote:
>>> 
>>>> On an (even slightly) oversubscribed system, spinlocks are
>>>> quickly becoming a bottleneck, as some vcpus are spinning,
>>>> waiting for a lock to be released, while the vcpu holding the
>>>> lock may not be running at all.
>>>> 
>>>> This creates contention, and the observed slowdown is 40x for 
>>>> hackbench. No, this isn't a typo.
>>>> 
>>>> The solution is to trap blocking WFEs and tell KVM that we're
>>>> now spinning. This ensures that other vcpus will get a
>>>> scheduling boost, allowing the lock to be released more
>>>> quickly.
>>>> 
>>>>> From a performance point of view: hackbench 1 process 1000
>>>> 
>>>> 2xA15 host (baseline):	1.843s
>>>> 
>>>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
>>>> 
>>>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
>>> 
>>> I'm confused. You got from 2.083s when not exiting on spin locks
>>> to 2.072 when exiting on _every_ spin lock that didn't
>>> immediately succeed. I would've expected the second number to be
>>> worse rather than better. I assume it's within jitter, I'm still
>>> puzzled why you don't see any significant drop in performance.
>> 
>> The key is in the ARM ARM:
>> 
>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a
>> Non-secure mode other than Hyp mode, execution of a WFE instruction
>> generates a Hyp Trap exception if, ignoring the value of the
>> HCR.TWE bit, conditions permit the processor to suspend
>> execution."
>> 
>> So, on a non-overcommitted system, you rarely hit a blocking
>> spinlock, hence not trapping. Otherwise, performance would go down
>> the drain very quickly.
> 
> Well, it's the same as pause/loop exiting on x86, but there we have
> special hardware features to only ever exit after n number of
> turnarounds. I wonder why we have those when we could just as easily
> exit on every blocking path.

My understanding of x86 is extremely patchy (and of the non-existent
flavour), so I can't really comment on that.

On ARM, WFE normally blocks if no event is pending for this CPU. We use
it on the spinlock slow path, and have a SEV (Send EVent) on release.

Even in the case of a race between entering the slow path and releasing
the spinlock, you may end up executing a non-blocking WFE. In this case,
no trap will occur.
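A minimal sketch of that WFE/SEV slow path, with the ARM instructions stubbed out as no-op helpers so the control flow is portable (on real hardware `wfe()`/`sev()` would be the WFE and SEV instructions, and the exclusive-monitor state makes the race above benign):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stubs for the ARM instructions, for illustration only. */
static inline void wfe(void) { /* __asm__("wfe") on ARM: wait for event */ }
static inline void sev(void) { /* __asm__("sev") on ARM: send event */ }

static void spin_lock(atomic_bool *locked)
{
	bool expected = false;
	while (!atomic_compare_exchange_weak(locked, &expected, true)) {
		expected = false;
		wfe();	/* blocks only when no event is pending, so with
			 * HCR.TWE set this traps to KVM only on a genuinely
			 * contended (blocking) WFE */
	}
}

static void spin_unlock(atomic_bool *locked)
{
	atomic_store(locked, false);
	sev();		/* wake any CPUs waiting in WFE */
}
```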

> I assume you simply don't contend and spin locks yet. Once you have
> more guest cores things would look differently. So once you have a
> system with more cores available, it might make sense to measure it
> again.

Indeed. Though the above should probably stay valid even if we have a
different locking strategy. Entering a blocking WFE always means you're
going to block for some time (and no, you don't know how long).

> Until then, the numbers are impressive.

I thought as much...

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 15:40   ` Marc Zyngier
@ 2013-10-08 11:26     ` Raghavendra KT
  -1 siblings, 0 replies; 50+ messages in thread
From: Raghavendra KT @ 2013-10-08 11:26 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-arm-kernel, kvmarm, kvm, Christoffer Dall, Raghavendra KT

On Mon, Oct 7, 2013 at 9:10 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On an (even slightly) oversubscribed system, spinlocks are quickly
> becoming a bottleneck, as some vcpus are spinning, waiting for a
> lock to be released, while the vcpu holding the lock may not be
> running at all.
>
> This creates contention, and the observed slowdown is 40x for
> hackbench. No, this isn't a typo.
>
> The solution is to trap blocking WFEs and tell KVM that we're
> now spinning. This ensures that other vcpus will get a scheduling
> boost, allowing the lock to be released more quickly.
>
> From a performance point of view: hackbench 1 process 1000
>
> 2xA15 host (baseline):  1.843s
>
> 2xA15 guest w/o patch:  2.083s
> 4xA15 guest w/o patch:  80.212s
>
> 2xA15 guest w/ patch:   2.072s
> 4xA15 guest w/ patch:   3.202s
>
> So we go from a 40x degradation to 1.5x, which is vaguely more
> acceptable.
>
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm/include/asm/kvm_arm.h | 4 +++-
>  arch/arm/kvm/handle_exit.c     | 6 +++++-
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> index 64e9696..693d5b2 100644
> --- a/arch/arm/include/asm/kvm_arm.h
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -67,7 +67,7 @@
>   */
>  #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
>                         HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
> -                       HCR_SWIO | HCR_TIDCP)
> +                       HCR_TWE | HCR_SWIO | HCR_TIDCP)
>  #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
>
>  /* System Control Register (SCTLR) bits */
> @@ -208,6 +208,8 @@
>  #define HSR_EC_DABT    (0x24)
>  #define HSR_EC_DABT_HYP        (0x25)
>
> +#define HSR_WFI_IS_WFE         (1U << 0)
> +
>  #define HSR_HVC_IMM_MASK       ((1UL << 16) - 1)
>
>  #define HSR_DABT_S1PTW         (1U << 7)
> diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
> index df4c82d..c4c496f 100644
> --- a/arch/arm/kvm/handle_exit.c
> +++ b/arch/arm/kvm/handle_exit.c
> @@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
>         trace_kvm_wfi(*vcpu_pc(vcpu));
> -       kvm_vcpu_block(vcpu);
> +       if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
> +               kvm_vcpu_on_spin(vcpu);

Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for ARM and
check if the PLE handler logic helps further?
We would ideally get one more optimization folded into the PLE handler if
you enable that.
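Roughly, what that config adds to kvm_vcpu_on_spin() is directed yield: remember that this vcpu was intercepted while spinning, and prefer to boost vcpus that were NOT last seen spinning, since a spinner cannot be the lock holder we want to run. A much-simplified sketch, with invented stand-in structures rather than the real kvm structs:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct kvm_vcpu; not the real layout. */
struct vcpu {
	bool runnable;		/* has work to do */
	bool in_spin_loop;	/* was last intercepted while spinning */
};

/* Pick a directed-yield target: a runnable vcpu other than ourselves
 * that was not itself spinning (it may be the preempted lock holder). */
static struct vcpu *pick_yield_target(struct vcpu *v, size_t n, size_t self)
{
	for (size_t i = 0; i < n; i++) {
		if (i == self || !v[i].runnable || v[i].in_spin_loop)
			continue;
		return &v[i];
	}
	return NULL;	/* nobody worth boosting */
}
```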

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-08 11:26     ` Raghavendra KT
@ 2013-10-08 12:43       ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-08 12:43 UTC (permalink / raw)
  To: Raghavendra KT
  Cc: linux-arm-kernel, kvmarm, kvm, Christoffer Dall, Raghavendra KT

On 08/10/13 12:26, Raghavendra KT wrote:
> On Mon, Oct 7, 2013 at 9:10 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On an (even slightly) oversubscribed system, spinlocks are quickly
>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>> lock to be released, while the vcpu holding the lock may not be
>> running at all.
>>
>> This creates contention, and the observed slowdown is 40x for
>> hackbench. No, this isn't a typo.
>>
>> The solution is to trap blocking WFEs and tell KVM that we're
>> now spinning. This ensures that other vcpus will get a scheduling
>> boost, allowing the lock to be released more quickly.
>>
>> From a performance point of view: hackbench 1 process 1000
>>
>> 2xA15 host (baseline):  1.843s
>>
>> 2xA15 guest w/o patch:  2.083s
>> 4xA15 guest w/o patch:  80.212s
>>
>> 2xA15 guest w/ patch:   2.072s
>> 4xA15 guest w/ patch:   3.202s
>>
>> So we go from a 40x degradation to 1.5x, which is vaguely more
>> acceptable.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>  arch/arm/include/asm/kvm_arm.h | 4 +++-
>>  arch/arm/kvm/handle_exit.c     | 6 +++++-
>>  2 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
>> index 64e9696..693d5b2 100644
>> --- a/arch/arm/include/asm/kvm_arm.h
>> +++ b/arch/arm/include/asm/kvm_arm.h
>> @@ -67,7 +67,7 @@
>>   */
>>  #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
>>                         HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
>> -                       HCR_SWIO | HCR_TIDCP)
>> +                       HCR_TWE | HCR_SWIO | HCR_TIDCP)
>>  #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
>>
>>  /* System Control Register (SCTLR) bits */
>> @@ -208,6 +208,8 @@
>>  #define HSR_EC_DABT    (0x24)
>>  #define HSR_EC_DABT_HYP        (0x25)
>>
>> +#define HSR_WFI_IS_WFE         (1U << 0)
>> +
>>  #define HSR_HVC_IMM_MASK       ((1UL << 16) - 1)
>>
>>  #define HSR_DABT_S1PTW         (1U << 7)
>> diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
>> index df4c82d..c4c496f 100644
>> --- a/arch/arm/kvm/handle_exit.c
>> +++ b/arch/arm/kvm/handle_exit.c
>> @@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>  static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>  {
>>         trace_kvm_wfi(*vcpu_pc(vcpu));
>> -       kvm_vcpu_block(vcpu);
>> +       if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
>> +               kvm_vcpu_on_spin(vcpu);
> 
> Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and
> check if ple handler logic helps further?
> we would ideally get one more optimization folded into ple handler if
> you enable that.

Just gave it a go, and the results are slightly (but consistently)
worse. Over 10 runs:

Without RELAX_INTERCEPT: Average run 3.3623s
With RELAX_INTERCEPT: Average run 3.4226s

Not massive, but still noticeable. Any clue?

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-08 12:43       ` Marc Zyngier
@ 2013-10-08 15:02         ` Raghavendra K T
  -1 siblings, 0 replies; 50+ messages in thread
From: Raghavendra K T @ 2013-10-08 15:02 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Raghavendra KT, linux-arm-kernel, kvmarm, kvm, Christoffer Dall

[...]
>>> +               kvm_vcpu_on_spin(vcpu);
>>
>> Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and
>> check if ple handler logic helps further?
>> we would ideally get one more optimization folded into ple handler if
>> you enable that.
>
> Just gave it a go, and the results are slightly (but consistently)
> worse. Over 10 runs:
>
> Without RELAX_INTERCEPT: Average run 3.3623s
> With RELAX_INTERCEPT: Average run 3.4226s
>
> Not massive, but still noticeable. Any clue?

Is it a 4x overcommit? If these are small guests, we have probably
just hit the overhead of the extra code.

RELAX_INTERCEPT is worth enabling for large guests with
overcommits.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-08 15:02         ` Raghavendra K T
@ 2013-10-08 15:06           ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-08 15:06 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Raghavendra KT, linux-arm-kernel, kvmarm, kvm, Christoffer Dall

On 08/10/13 16:02, Raghavendra K T wrote:
> [...]
>>>> +               kvm_vcpu_on_spin(vcpu);
>>>
>>> Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and
>>> check if ple handler logic helps further?
>>> we would ideally get one more optimization folded into ple handler if
>>> you enable that.
>>
>> Just gave it a go, and the results are slightly (but consistently)
>> worse. Over 10 runs:
>>
>> Without RELAX_INTERCEPT: Average run 3.3623s
>> With RELAX_INTERCEPT: Average run 3.4226s
>>
>> Not massive, but still noticeable. Any clue?
> 
> Is it  a 4x overcommit? Probably we would have hit the code
> overhead if it were small guests.

Only 2x overcommit (dual core host, quad vcpu guests).

> RELAX_INTERCEPT is worth enabling for large guests with
> overcommits.

I'll try something more aggressive as soon as I get the time. What do
you call a large guest? So far, the hard limit on ARM is 8 vcpus.

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-08 15:06           ` Marc Zyngier
@ 2013-10-08 15:13             ` Raghavendra K T
  -1 siblings, 0 replies; 50+ messages in thread
From: Raghavendra K T @ 2013-10-08 15:13 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Raghavendra KT, linux-arm-kernel, kvmarm, kvm, Christoffer Dall

On 10/08/2013 08:36 PM, Marc Zyngier wrote:
>>> Just gave it a go, and the results are slightly (but consistently)
>>> worse. Over 10 runs:
>>>
>>> Without RELAX_INTERCEPT: Average run 3.3623s
>>> With RELAX_INTERCEPT: Average run 3.4226s
>>>
>>> Not massive, but still noticeable. Any clue?
>>
>> Is it  a 4x overcommit? Probably we would have hit the code
>> overhead if it were small guests.
>
> Only 2x overcommit (dual core host, quad vcpu guests).

Okay, a quad-vcpu guest would seem to explain it.

>
>> RELAX_INTERCEPT is worth enabling for large guests with
>> overcommits.
>
> I'll try something more aggressive as soon as I get the time. What do
> you call a large guest? So far, the hard limit on ARM is 8 vcpus.
>

Okay. I was referring to guests with >= 32 vcpus.
Maybe 8-vcpu guests with 2x/4x overcommit are worth trying. If we still
do not see a benefit, then it is not worth enabling.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-08 15:13             ` Raghavendra K T
@ 2013-10-08 16:09               ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-08 16:09 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Raghavendra KT, linux-arm-kernel, kvmarm, kvm, Christoffer Dall

On 08/10/13 16:13, Raghavendra K T wrote:
> On 10/08/2013 08:36 PM, Marc Zyngier wrote:
>>>> Just gave it a go, and the results are slightly (but consistently)
>>>> worse. Over 10 runs:
>>>>
>>>> Without RELAX_INTERCEPT: Average run 3.3623s
>>>> With RELAX_INTERCEPT: Average run 3.4226s
>>>>
>>>> Not massive, but still noticeable. Any clue?
>>>
>>> Is it  a 4x overcommit? Probably we would have hit the code
>>> overhead if it were small guests.
>>
>> Only 2x overcommit (dual core host, quad vcpu guests).
> 
> Okay. quad vcpu seem to explain.
> 
>>
>>> RELAX_INTERCEPT is worth enabling for large guests with
>>> overcommits.
>>
>> I'll try something more aggressive as soon as I get the time. What do
>> you call a large guest? So far, the hard limit on ARM is 8 vcpus.
>>
> 
> Okay. I was referring to guests >= 32 vcpus.
> May be 8vcpu guests with 2x/4x is worth trying. If we still do not
> see benefit, then it is not worth enabling.

I've just tried the worst case I can construct, which is an 8-vcpu
guest limited to one physical CPU:

Over 10 runs:

Without RELAX_INTERCEPT:
Time: 6.793
Time: 7.619
Time: 6.690
Time: 7.198
Time: 7.659
Time: 7.054
Time: 7.728
Time: 8.546
Time: 7.306
Time: 7.219

Average: 7.381

With RELAX_INTERCEPT:
Time: 6.850
Time: 6.889
Time: 7.170
Time: 6.938
Time: 6.756
Time: 7.341
Time: 6.707
Time: 7.452
Time: 6.617
Time: 8.095

Average: 7.082

We're now starting to see some (small) benefits: slightly faster with
RELAX_INTERCEPT, and less jitter (the heuristic is better at picking the
target vcpu than the default behaviour).

I'll enable it in the next version of the series.

Thanks!

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-07 16:53           ` Gleb Natapov
@ 2013-10-09 13:09             ` Alexander Graf
  -1 siblings, 0 replies; 50+ messages in thread
From: Alexander Graf @ 2013-10-09 13:09 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marc Zyngier, kvmarm, linux-arm-kernel, kvm@vger.kernel.org mailing list


On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:

> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>> 
>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> 
>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>> 
>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>> 
>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly 
>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
>>>>> lock to be released, while the vcpu holding the lock may not be 
>>>>> running at all.
>>>>> 
>>>>> This creates contention, and the observed slowdown is 40x for 
>>>>> hackbench. No, this isn't a typo.
>>>>> 
>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>> allowing the lock to be released more quickly.
>>>>> 
>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>> 
>>>>> 2xA15 host (baseline):	1.843s
>>>>> 
>>>>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
>>>>> 
>>>>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
>>>> 
>>>> I'm confused. You went from 2.083s when not exiting on spin locks to
>>>> 2.072s when exiting on _every_ spin lock that didn't immediately
>>>> succeed. I would've expected the second number to be worse rather than
>>>> better. I assume it's within jitter; I'm still puzzled why you don't
>>>> see any significant drop in performance.
>>> 
>>> The key is in the ARM ARM:
>>> 
>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>> permit the processor to suspend execution."
>>> 
>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>> hence not trapping. Otherwise, performance would go down the drain very
>>> quickly.
>> 
>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>> 
> It will hurt performance if vcpu that holds the lock is running.

Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.


Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 13:09             ` Alexander Graf
@ 2013-10-09 13:26               ` Gleb Natapov
  -1 siblings, 0 replies; 50+ messages in thread
From: Gleb Natapov @ 2013-10-09 13:26 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Marc Zyngier, kvmarm, linux-arm-kernel, kvm@vger.kernel.org mailing list

On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
> 
> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
> >> 
> >> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >> 
> >>> On 07/10/13 17:04, Alexander Graf wrote:
> >>>> 
> >>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >>>> 
> >>>>> On an (even slightly) oversubscribed system, spinlocks are quickly 
> >>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
> >>>>> lock to be released, while the vcpu holding the lock may not be 
> >>>>> running at all.
> >>>>> 
> >>>>> This creates contention, and the observed slowdown is 40x for 
> >>>>> hackbench. No, this isn't a typo.
> >>>>> 
> >>>>> The solution is to trap blocking WFEs and tell KVM that we're now
> >>>>> spinning. This ensures that other vpus will get a scheduling boost,
> >>>>> allowing the lock to be released more quickly.
> >>>>> 
> >>>>>> From a performance point of view: hackbench 1 process 1000
> >>>>> 
> >>>>> 2xA15 host (baseline):	1.843s
> >>>>> 
> >>>>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
> >>>>> 
> >>>>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
> >>>> 
> >>>> I'm confused. You got from 2.083s when not exiting on spin locks to
> >>>> 2.072 when exiting on _every_ spin lock that didn't immediately
> >>>> succeed. I would've expected to second number to be worse rather than
> >>>> better. I assume it's within jitter, I'm still puzzled why you don't
> >>>> see any significant drop in performance.
> >>> 
> >>> The key is in the ARM ARM:
> >>> 
> >>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
> >>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
> >>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
> >>> permit the processor to suspend execution."
> >>> 
> >>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
> >>> hence not trapping. Otherwise, performance would go down the drain very
> >>> quickly.
> >> 
> >> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
> >> 
> > It will hurt performance if vcpu that holds the lock is running.
> 
> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
> 
> 
For non-contended locks it makes sense. We need to recheck whether the
x86 assumption is still true there, but the x86 lock is a ticket lock,
which has not only the lock-holder preemption problem but also the
lock-waiter preemption problem, making overcommit even worse.

--
			Gleb.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 13:26               ` Gleb Natapov
@ 2013-10-09 14:18                 ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-09 14:18 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Alexander Graf, kvmarm, linux-arm-kernel,
	kvm@vger.kernel.org mailing list

On 09/10/13 14:26, Gleb Natapov wrote:
> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>
>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>
>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>
>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>
>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>
>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly 
>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
>>>>>>> lock to be released, while the vcpu holding the lock may not be 
>>>>>>> running at all.
>>>>>>>
>>>>>>> This creates contention, and the observed slowdown is 40x for 
>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>
>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>> spinning. This ensures that other vpus will get a scheduling boost,
>>>>>>> allowing the lock to be released more quickly.
>>>>>>>
>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>
>>>>>>> 2xA15 host (baseline):	1.843s
>>>>>>>
>>>>>>> 2xA15 guest w/o patch:	2.083s 4xA15 guest w/o patch:	80.212s
>>>>>>>
>>>>>>> 2xA15 guest w/ patch:	2.072s 4xA15 guest w/ patch:	3.202s
>>>>>>
>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>> succeed. I would've expected to second number to be worse rather than
>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>> see any significant drop in performance.
>>>>>
>>>>> The key is in the ARM ARM:
>>>>>
>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>> permit the processor to suspend execution."
>>>>>
>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>> quickly.
>>>>
>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>
>>> It will hurt performance if vcpu that holds the lock is running.
>>
>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.

Yes. I basically assume that contention should be rare, and that ending
up in a *blocking* WFE is a sign that we're in thrashing mode already
(no event is pending).

>>
> For not contended locks it makes sense. We need to recheck if the x86
> assumption is still true there, but the x86 lock is ticketing, which
> has not only lock holder preemption, but also the lock waiter
> preemption problem, which makes the overcommit problem even worse.

Locks are ticketing on ARM as well. But there is one key difference here
with x86 (or at least what I understand of it, which is very close to
none): We only trap if we would have blocked anyway. In our case, it is
almost always better to give up the CPU to someone else rather than
waiting for some event to take the CPU out of sleep.

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 14:18                 ` Marc Zyngier
@ 2013-10-09 14:50                   ` Anup Patel
  -1 siblings, 0 replies; 50+ messages in thread
From: Anup Patel @ 2013-10-09 14:50 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Gleb Natapov, kvm@vger.kernel.org mailing list, linux-arm-kernel, kvmarm

On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 09/10/13 14:26, Gleb Natapov wrote:
>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>
>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>
>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>
>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>
>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>
>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>
>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>> running at all.
>>>>>>>>
>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>
>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>
>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>
>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>
>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>
>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>
>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>> see any significant drop in performance.
>>>>>>
>>>>>> The key is in the ARM ARM:
>>>>>>
>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>> permit the processor to suspend execution."
>>>>>>
>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>> quickly.
>>>>>
>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>
>>>> It will hurt performance if vcpu that holds the lock is running.
>>>
>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>
> Yes. I basically assume that contention should be rare, and that ending
> up in a *blocking* WFE is a sign that we're in thrashing mode already
> (no event is pending).
>
>>>
>> For not contended locks it makes sense. We need to recheck if the x86
>> assumption is still true there, but the x86 lock is ticketing, which
>> has not only lock holder preemption, but also the lock waiter
>> preemption problem, which makes the overcommit problem even worse.
>
> Locks are ticketing on ARM as well. But there is one key difference here
> with x86 (or at least what I understand of it, which is very close to
> none): We only trap if we would have blocked anyway. In our case, it is
> almost always better to give up the CPU to someone else rather than
> waiting for some event to take the CPU out of sleep.

Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
1. How spin lock is implemented in Guest OS?
we cannot assume
    that underlying Guest OS is always Linux.
2. How bad/good is spin

It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE


>
>         M.
> --
> Jazz is not dead. It just smells funny...
>
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 14:50                   ` Anup Patel
@ 2013-10-09 14:52                     ` Anup Patel
  -1 siblings, 0 replies; 50+ messages in thread
From: Anup Patel @ 2013-10-09 14:52 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Gleb Natapov, kvm@vger.kernel.org mailing list, linux-arm-kernel, kvmarm

On Wed, Oct 9, 2013 at 8:20 PM, Anup Patel <anup@brainfault.org> wrote:
> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/10/13 14:26, Gleb Natapov wrote:
>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>
>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>
>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>
>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>> running at all.
>>>>>>>>>
>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>
>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>
>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>
>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>>
>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>> see any significant drop in performance.
>>>>>>>
>>>>>>> The key is in the ARM ARM:
>>>>>>>
>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>> permit the processor to suspend execution."
>>>>>>>
>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>> quickly.
>>>>>>
>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>
>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>
>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>
>> Yes. I basically assume that contention should be rare, and that ending
>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>> (no event is pending).
>>
>>>>
>>> For not contended locks it makes sense. We need to recheck if the x86
>>> assumption is still true there, but the x86 lock is ticketing, which
>>> has not only lock holder preemption, but also the lock waiter
>>> preemption problem, which makes the overcommit problem even worse.
>>
>> Locks are ticketing on ARM as well. But there is one key difference here
>> with x86 (or at least what I understand of it, which is very close to
>> none): We only trap if we would have blocked anyway. In our case, it is
>> almost always better to give up the CPU to someone else rather than
>> waiting for some event to take the CPU out of sleep.
>
> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
> 1. How spin lock is implemented in Guest OS?
> we cannot assume
>     that underlying Guest OS is always Linux.
> 2. How bad/good is spin
>
> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE

(Please ignore previous incomplete reply ....)

Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
1. How spin locks are implemented in the Guest OS
(Note: we cannot assume that the underlying Guest OS is always Linux)
2. How bad/good spin lock contention is in the Guest
(Note: here too we cannot assume which loads run on the Guest)

It would be good if we could enable/disable "Yield CPU when vcpu executes a WFE"
via Kconfig.
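
Such a switch could look like the fragment below. This is purely
hypothetical: the option name KVM_ARM_WFE_TRAP is made up for illustration,
and the series as posted adds no such knob:

```
config KVM_ARM_WFE_TRAP
	bool "Yield the CPU when a vcpu executes a blocking WFE"
	depends on KVM
	default y
	help
	  Trap guest WFE instructions that would block and call
	  kvm_vcpu_on_spin() so that another runnable vcpu gets
	  scheduled instead of busy-sleeping on the physical CPU.
```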

--Anup

>
>
>>
>>         M.
>> --
>> Jazz is not dead. It just smells funny...
>>
>>
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 14:50                   ` Anup Patel
@ 2013-10-09 14:59                     ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-09 14:59 UTC (permalink / raw)
  To: Anup Patel
  Cc: Gleb Natapov, kvm@vger.kernel.org mailing list, linux-arm-kernel, kvmarm

On 09/10/13 15:50, Anup Patel wrote:
> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/10/13 14:26, Gleb Natapov wrote:
>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>
>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>
>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>
>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>> running at all.
>>>>>>>>>
>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>
>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>
>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>
>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>>
>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>> see any significant drop in performance.
>>>>>>>
>>>>>>> The key is in the ARM ARM:
>>>>>>>
>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>> permit the processor to suspend execution."
>>>>>>>
>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>> quickly.
>>>>>>
>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>
>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>
>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>
>> Yes. I basically assume that contention should be rare, and that ending
>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>> (no event is pending).
>>
>>>>
>>> For not contended locks it makes sense. We need to recheck if the x86
>>> assumption is still true there, but the x86 lock is ticketing, which
>>> has not only lock holder preemption, but also the lock waiter
>>> preemption problem, which makes the overcommit problem even worse.
>>
>> Locks are ticketing on ARM as well. But there is one key difference here
>> with x86 (or at least what I understand of it, which is very close to
>> none): We only trap if we would have blocked anyway. In our case, it is
>> almost always better to give up the CPU to someone else rather than
>> waiting for some event to take the CPU out of sleep.
> 
> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
> 1. How spin lock is implemented in Guest OS?
> we cannot assume
>     that underlying Guest OS is always Linux.
> 2. How bad/good is spin

We do *not* spin. We *sleep*. So instead of taking a nap on a physical
CPU (which is slightly less than useful), we go and run some real
workload. If your guest OS is executing WFE (I'm not implying a lock
here), *and* that WFE is blocking, then I maintain it will be a gain in
the vast majority of the cases.

> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE

Not until someone has shown me a (real) workload when this is actually
detrimental.

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
@ 2013-10-09 14:59                     ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-09 14:59 UTC (permalink / raw)
  To: linux-arm-kernel

On 09/10/13 15:50, Anup Patel wrote:
> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/10/13 14:26, Gleb Natapov wrote:
>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>
>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>
>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>
>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>> running at all.
>>>>>>>>>
>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>
>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>> spinning. This ensures that other vpus will get a scheduling boost,
>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>
>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>
>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>>
>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>> see any significant drop in performance.
>>>>>>>
>>>>>>> The key is in the ARM ARM:
>>>>>>>
>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>> permit the processor to suspend execution."
>>>>>>>
>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>> quickly.
>>>>>>
>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>
>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>
>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>
>> Yes. I basically assume that contention should be rare, and that ending
>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>> (no event is pending).
>>
>>>>
>>> For not contended locks it make sense. We need to recheck if x86
>>> assumption is still true there, but x86 lock is ticketing which
>>> has not only lock holder preemption, but also lock waiter
>>> preemption problem which make overcommit problem even worse.
>>
>> Locks are ticketing on ARM as well. But there is one key difference here
>> with x86 (or at least what I understand of it, which is very close to
>> none): We only trap if we would have blocked anyway. In our case, it is
>> almost always better to give up the CPU to someone else rather than
>> waiting for some event to take the CPU out of sleep.
> 
> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
> 1. How spin lock is implemented in Guest OS? we cannot assume
>    that underlying Guest OS is always Linux.
> 2. How bad/good is spin

We do *not* spin. We *sleep*. So instead of taking a nap on a physical
CPU (which is slightly less than useful), we go and run some real
workload. If your guest OS is executing WFE (I'm not implying a lock
here), *and* that WFE is blocking, then I maintain it will be a gain in
the vast majority of the cases.

> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE

Not until someone has shown me a (real) workload where this is actually
detrimental.

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 14:59                     ` Marc Zyngier
@ 2013-10-09 15:10                       ` Anup Patel
  -1 siblings, 0 replies; 50+ messages in thread
From: Anup Patel @ 2013-10-09 15:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Gleb Natapov, kvm@vger.kernel.org mailing list, linux-arm-kernel, kvmarm

On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 09/10/13 15:50, Anup Patel wrote:
>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>
>>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>>
>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>
>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>
>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>
>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>>
>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>> running at all.
>>>>>>>>>>
>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>
>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>
>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>
>>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>>
>>>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>>>
>>>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>>>
>>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>>> see any significant drop in performance.
>>>>>>>>
>>>>>>>> The key is in the ARM ARM:
>>>>>>>>
>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>>> permit the processor to suspend execution."
>>>>>>>>
>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>>> quickly.
>>>>>>>
>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>>
>>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>>
>>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>>
>>> Yes. I basically assume that contention should be rare, and that ending
>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>> (no event is pending).
>>>
>>>>>
>>>> For not contended locks it make sense. We need to recheck if x86
>>>> assumption is still true there, but x86 lock is ticketing which
>>>> has not only lock holder preemption, but also lock waiter
>>>> preemption problem which make overcommit problem even worse.
>>>
>>> Locks are ticketing on ARM as well. But there is one key difference here
>>> with x86 (or at least what I understand of it, which is very close to
>>> none): We only trap if we would have blocked anyway. In our case, it is
>>> almost always better to give up the CPU to someone else rather than
>>> waiting for some event to take the CPU out of sleep.
>>
>> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
>> 1. How spin lock is implemented in Guest OS? we cannot assume
>>    that underlying Guest OS is always Linux.
>> 2. How bad/good is spin
>
> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
> CPU (which is slightly less than useful), we go and run some real
> workload. If your guest OS is executing WFE (I'm not implying a lock
> here), *and* that WFE is blocking, then I maintain it will be a gain in
> the vast majority of the cases.

What if VCPU A was about to release the lock and VCPU B tries to grab the
same lock? In this case VCPU B gets yielded due to WFE, causing an
unnecessary delay for VCPU B in acquiring the lock. This situation can
happen quite often because spin locks are generally used for protecting
very small portions of code.

>
>> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE
>
> Not until someone has shown me a (real) workload when this is actually
> detrimental.

The gains from "Yield CPU when vcpu executes a WFE" are not significant,
and we don't get a consistent improvement when trying multiple times. Please
look at the numbers you reported for multiple runs. Due to this fact it makes
more sense to have a Kconfig option for this.

--Anup

>
>         M.
> --
> Jazz is not dead. It just smells funny...
>

^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 15:10                       ` Anup Patel
@ 2013-10-09 15:17                         ` Marc Zyngier
  -1 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2013-10-09 15:17 UTC (permalink / raw)
  To: Anup Patel
  Cc: Gleb Natapov, kvm@vger.kernel.org mailing list, linux-arm-kernel, kvmarm

On 09/10/13 16:10, Anup Patel wrote:
> On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/10/13 15:50, Anup Patel wrote:
>>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>>>
>>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>>> running at all.
>>>>>>>>>>>
>>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>>
>>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>>
>>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>>>>
>>>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>>>> see any significant drop in performance.
>>>>>>>>>
>>>>>>>>> The key is in the ARM ARM:
>>>>>>>>>
>>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>>>> permit the processor to suspend execution."
>>>>>>>>>
>>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>>>> quickly.
>>>>>>>>
>>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>>>
>>>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>>>
>>>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>>>
>>>> Yes. I basically assume that contention should be rare, and that ending
>>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>>> (no event is pending).
>>>>
>>>>>>
>>>>> For not contended locks it make sense. We need to recheck if x86
>>>>> assumption is still true there, but x86 lock is ticketing which
>>>>> has not only lock holder preemption, but also lock waiter
>>>>> preemption problem which make overcommit problem even worse.
>>>>
>>>> Locks are ticketing on ARM as well. But there is one key difference here
>>>> with x86 (or at least what I understand of it, which is very close to
>>>> none): We only trap if we would have blocked anyway. In our case, it is
>>>> almost always better to give up the CPU to someone else rather than
>>>> waiting for some event to take the CPU out of sleep.
>>>
>>> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
>>> 1. How spin lock is implemented in Guest OS? we cannot assume
>>>    that underlying Guest OS is always Linux.
>>> 2. How bad/good is spin
>>
>> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
>> CPU (which is slightly less than useful), we go and run some real
>> workload. If your guest OS is executing WFE (I'm not implying a lock
>> here), *and* that WFE is blocking, then I maintain it will be a gain in
>> the vast majority of the cases.
> 
> What if VCPU A was about to release lock and VCPU B tries to grab
> same lock. In this case VCPU B gets Yielded due to WFE causing
> unnecessary delay for VCPU B in acquiring lock. This situation can
> happen quite often because spin locks are generally used for protecting
> very small portion of code.
> 
>>
>>> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE
>>
>> Not until someone has shown me a (real) workload when this is actually
>> detrimental.
> 
> The gains by "Yield CPU when vcpu executes a WFE" are not-significant
> and we dont have consistent improvement when tried multiple times. Please
> look at number you reported for multiple runs. Due to this fact it makes
> more sense to have Kconfig option for this.

Not significant? I don't know if I should cry or laugh here...

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
  2013-10-09 15:10                       ` Anup Patel
@ 2013-10-09 15:17                         ` Anup Patel
  -1 siblings, 0 replies; 50+ messages in thread
From: Anup Patel @ 2013-10-09 15:17 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Gleb Natapov, kvm@vger.kernel.org mailing list, linux-arm-kernel, kvmarm

On Wed, Oct 9, 2013 at 8:40 PM, Anup Patel <anup@brainfault.org> wrote:
> On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/10/13 15:50, Anup Patel wrote:
>>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>>>
>>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>>> running at all.
>>>>>>>>>>>
>>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>>
>>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>>
>>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/o patch:  2.083s 4xA15 guest w/o patch:   80.212s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/ patch:   2.072s 4xA15 guest w/ patch:    3.202s
>>>>>>>>>>
>>>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>>>> see any significant drop in performance.
>>>>>>>>>
>>>>>>>>> The key is in the ARM ARM:
>>>>>>>>>
>>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>>>> permit the processor to suspend execution."
>>>>>>>>>
>>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>>>> quickly.
>>>>>>>>
>>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>>>
>>>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>>>
>>>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>>>
>>>> Yes. I basically assume that contention should be rare, and that ending
>>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>>> (no event is pending).
>>>>
>>>>>>
>>>>> For not contended locks it make sense. We need to recheck if x86
>>>>> assumption is still true there, but x86 lock is ticketing which
>>>>> has not only lock holder preemption, but also lock waiter
>>>>> preemption problem which make overcommit problem even worse.
>>>>
>>>> Locks are ticketing on ARM as well. But there is one key difference here
>>>> with x86 (or at least what I understand of it, which is very close to
>>>> none): We only trap if we would have blocked anyway. In our case, it is
>>>> almost always better to give up the CPU to someone else rather than
>>>> waiting for some event to take the CPU out of sleep.
>>>
>>> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
>>> 1. How spin lock is implemented in Guest OS? we cannot assume
>>>    that underlying Guest OS is always Linux.
>>> 2. How bad/good is spin
>>
>> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
>> CPU (which is slightly less than useful), we go and run some real
>> workload. If your guest OS is executing WFE (I'm not implying a lock
>> here), *and* that WFE is blocking, then I maintain it will be a gain in
>> the vast majority of the cases.
>
> What if VCPU A was about to release lock and VCPU B tries to grab
> same lock. In this case VCPU B gets Yielded due to WFE causing
> unnecessary delay for VCPU B in acquiring lock. This situation can
> happen quite often because spin locks are generally used for protecting
> very small portion of code.

It will be interesting to see what hackbench numbers you get if you
don't restrict all Guest VCPUs to the same Host CPU. Let's say a Guest
with 8 VCPUs running on a Host (with > 2 CPUs).

>
>>
>>> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE
>>
>> Not until someone has shown me a (real) workload when this is actually
>> detrimental.
>
> The gains by "Yield CPU when vcpu executes a WFE" are not-significant
> and we dont have consistent improvement when tried multiple times. Please
> look at number you reported for multiple runs. Due to this fact it makes
> more sense to have Kconfig option for this.
>
> --Anup
>
>>
>>         M.
>> --
>> Jazz is not dead. It just smells funny...
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
@ 2013-10-09 15:17                         ` Anup Patel
  0 siblings, 0 replies; 50+ messages in thread
From: Anup Patel @ 2013-10-09 15:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Oct 9, 2013 at 8:40 PM, Anup Patel <anup@brainfault.org> wrote:
> On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/10/13 15:50, Anup Patel wrote:
>>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
>>>>>>
>>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>>> running at all.
>>>>>>>>>>>
>>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>>
>>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>>
>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/o patch:  2.083s
>>>>>>>>>>> 4xA15 guest w/o patch:  80.212s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/ patch:   2.072s
>>>>>>>>>>> 4xA15 guest w/ patch:   3.202s
>>>>>>>>>>
>>>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>>>> see any significant drop in performance.
>>>>>>>>>
>>>>>>>>> The key is in the ARM ARM:
>>>>>>>>>
>>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>>>> permit the processor to suspend execution."
>>>>>>>>>
>>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>>>> quickly.
>>>>>>>>
>>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>>>
>>>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>>>
>>>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>>>
>>>> Yes. I basically assume that contention should be rare, and that ending
>>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>>> (no event is pending).
>>>>
>>>>>>
>>>>> For uncontended locks it makes sense. We need to recheck whether the
>>>>> x86 assumption is still true there, but the x86 lock is ticketing,
>>>>> which has not only the lock holder preemption problem but also the
>>>>> lock waiter preemption problem, which makes overcommit even worse.
>>>>
>>>> Locks are ticketing on ARM as well. But there is one key difference here
>>>> with x86 (or at least what I understand of it, which is very close to
>>>> none): We only trap if we would have blocked anyway. In our case, it is
>>>> almost always better to give up the CPU to someone else rather than
>>>> waiting for some event to take the CPU out of sleep.
>>>
>>> Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
>>> 1. How the spin lock is implemented in the Guest OS? We cannot assume
>>>     that the underlying Guest OS is always Linux.
>>> 2. How bad/good is spin
>>
>> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
>> CPU (which is slightly less than useful), we go and run some real
>> workload. If your guest OS is executing WFE (I'm not implying a lock
>> here), *and* that WFE is blocking, then I maintain it will be a gain in
>> the vast majority of the cases.
>
> What if VCPU A was about to release the lock and VCPU B tries to grab the
> same lock? In this case VCPU B gets yielded due to WFE, causing an
> unnecessary delay for VCPU B in acquiring the lock. This situation can
> happen quite often because spin locks are generally used for protecting
> very small portions of code.

It will be interesting to see what hackbench numbers you get if you
don't restrict all Guest VCPUs to the same Host CPU. Let's say a Guest
with 8 VCPUs running on a Host (with > 2 CPUs).

>
>>
>>> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE".
>>
>> Not until someone has shown me a (real) workload where this is actually
>> detrimental.
>
> The gains by "Yield CPU when vcpu executes a WFE" are not significant,
> and we don't see consistent improvement when trying multiple times. Please
> look at the numbers you reported for multiple runs. Due to this fact it
> makes more sense to have a Kconfig option for this.
>
> --Anup
>
>>
>>         M.
>> --
>> Jazz is not dead. It just smells funny...
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2013-10-09 15:17 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-07 15:40 [PATCH 0/2] ARM/arm64: KVM: Yield CPU when vcpu executes a WFE Marc Zyngier
2013-10-07 15:40 ` [PATCH 1/2] ARM: " Marc Zyngier
2013-10-07 16:04   ` Alexander Graf
2013-10-07 16:16     ` Marc Zyngier
2013-10-07 16:30       ` Alexander Graf
2013-10-07 16:53         ` Gleb Natapov
2013-10-09 13:09           ` Alexander Graf
2013-10-09 13:26             ` Gleb Natapov
2013-10-09 14:18               ` Marc Zyngier
2013-10-09 14:50                 ` Anup Patel
2013-10-09 14:52                   ` Anup Patel
2013-10-09 14:59                   ` Marc Zyngier
2013-10-09 15:10                     ` Anup Patel
2013-10-09 15:17                       ` Marc Zyngier
2013-10-09 15:17                       ` Anup Patel
2013-10-07 16:55         ` Marc Zyngier
2013-10-08 11:26   ` Raghavendra KT
2013-10-08 12:43     ` Marc Zyngier
2013-10-08 15:02       ` Raghavendra K T
2013-10-08 15:06         ` Marc Zyngier
2013-10-08 15:13           ` Raghavendra K T
2013-10-08 16:09             ` Marc Zyngier
2013-10-07 15:40 ` [PATCH 2/2] arm64: " Marc Zyngier
2013-10-07 15:52   ` Bhushan Bharat-R65777
2013-10-07 16:00     ` Marc Zyngier
