All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC/PATCH  0/3] Add support for stop instruction inside KVM guest
@ 2020-03-31 12:22 ` Gautham R. Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:10 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>


 *** RFC Only. Not intended for inclusion ************
 
Motivation
~~~~~~~~~~~~~~~

The POWER ISA v3.0 allows stop instruction to be executed from a Guest
Kernel (HV=0,PR=0) context. If the hypervisor has cleared
PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause
the vCPU thread to "pause", thereby donating its cycles to the other
threads in the core until the paused thread is woken up by an
interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then
execution of the "stop" instruction will raise a Hypervisor Facility
Unavailable exception.

The stop idle state in the guest (henceforth referred to as stop0lite)
when enabled

* has a very small wakeup latency (1-3us) comparable to that of
  snooze and considerably better compared the Shared CEDE state
  (25-30us).  Results are provided below for wakeup latency measured
  by waking up an idle CPU in a given state using a timer as well as
  using an IPI.

  ======================================================================
  Wakeup Latency measured using a timer (in ns) [Lower is better]
  ======================================================================
  Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
  ======================================================================
  snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
  ======================================================================
  stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
  ======================================================================
  Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
  ======================================================================

  ======================================================================
  Wakeup Latency measured using a timer (in ns) [Lower is better]
  ======================================================================
  Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
  ======================================================================
  snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
  ======================================================================
  stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
  ======================================================================
  Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
  ======================================================================

* provides an improved single threaded performance compared to snooze
  since the idle state completely relinquishes the core cycles. The
  single threaded performance is observed to be better even when
  compared to "Shared CEDE", since in the latter case something else
  can scheduled on the ceded CPU, while "stop0lite" doesn't give up
  the CPU.

  On a KVM guest with smp 8,sockets=1,cores=2,threads=4 with vCPUs of
  a vCore bound to a physical core, we run a single-threaded ebizzy
  pinned to one of the guest vCPUs while the sibling vCPUs in the core
  are idling. We enable only one guest idle state at a time to measure
  the single-threaded performance benefit that the idle state provides
  by giving up the core resources to the non-idle thread. we obtain
  ~13% improvement in the throughput compared to that with "snooze"
  and ~8% improvement in the throughput compared to "Shared CEDE".
   
   =======================================================================
   | ebizzy records/s : [Higher the better]                              |
   =======================================================================
   |Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
   |           |samples |         |        |        |         |          |
   =======================================================================	   
   |snooze     |   10   |  1378988| 1379358| 1379032|1379067.3|    113.47|
   =======================================================================
   |stop0lite  |   10   |  1561836| 1562058| 1561906|1561927.5|     81.87|
   =======================================================================
   |Shared CEDE|   10   |  1446584| 1447383| 1447037|1447009.0|    244.16|
   =======================================================================

Is stop0lite a replacement for snooze ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Not yet. snooze is a polling state, and can respond much faster to a
need_resched() compared to stop0lite which needs an IPI to wakeup from
idle state. This can be seen in the results below:

With the context_switch2 pipe test, we can see that with stop0lite,
the number of context switches are 32.47% lesser than with
snooze. This is due to the fact that snooze is a polling state which
polls for TIF_NEED_RESCHED. Thus it does not require an interrupt to
exit the state and start executing the scheduler code. However,
stop0lite needs an IPI.

Compared to the "Shared CEDE" state, we see that with stop0lite, we
have 82.7% improvement in the number of context switches. This is due
to the low wakeup latency compared to Shared CEDE.

======================================================================
context switch2 : Number of context switches/s [Higher the better]
======================================================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
======================================================================	   
snooze     |  100   |   210480|  221578|  219860|219684.88|   1344.97|
======================================================================
stop0lite  |  100   |   146730|  150266|  148258|148331.70|    871.50|
======================================================================
Shared CEDE|  100   |    75812|   82792|   81232| 81187.16|    832.99|
======================================================================


Is stop0lite a replacement for Shared CEDE ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. For longer idle durations, Shared CEDE is a better option compared
to "stop0lite", both from a performance (CEDEd CPUs can be put into
deeper idle states such as stop2, which can provide SMT folding
benefits) and utilization (Hypervisor can utilize the idle CPUs for
running something useful).


What this patch-set does:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The patchset has 3 patches

Patch 1: Allows the guest to run "stop" instruction without crashing
         even if the hypervisor has set the PSSCR[ESL|EC] bits. This
         is done by handling the Hypervisor Facility Unavailable
         exception and incrementing the program counter by 4 bytes,
         thus emulating the wakeup from a PSSCR[ESL = EC = 0] stop.

Patch 2: Clears the PSSCR[ESL|EC] bits unconditionally before
         dispatching a vCPU, thereby allowing the vCPU to execute a
         "stop" instruction.

Patch 3: Defines a cpuidle state for pseries guest named "stop0lite"
         to be invoked by the cpuidle driver.


What this patch-set does not do:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* It does not define an interface by which the guest discovers the
  stop-capability. Should this be defined via device-tree?

* It does address the problem of guest migration. i.e, a guest started
  on a hypervisor which supports guest stop state, if migrated to a
  hypervisor which does not support guest stop state will crash,
  unless it has Patch 1 above.
  

I would like to seek feedback and comments with respect to how to go
about implementing the issues that have not been addressed in this
patchset.

Gautham R. Shenoy (3):
  powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
  pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  cpuidle/pseries: Add stop0lite state

 arch/powerpc/include/asm/reg.h          |  1 +
 arch/powerpc/kvm/book3s_hv.c            |  8 ++++++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
 drivers/cpuidle/cpuidle-pseries.c       | 27 +++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 14 deletions(-)

-- 
1.9.4


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
  2020-03-31 12:22 ` Gautham R. Shenoy
@ 2020-03-31 12:22   ` Gautham R. Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:10 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

If a guest executes a stop instruction when the hypervisor has set the
PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility
Unavailable exception. Currently when we receive this exception, we
only check if the exeception is generated due to a doorbell
instruction, in which case we emulate it. For all other cases,
including the case when the guest executes a stop-instruction, the
hypervisor sends a PROGILL to the guest program, which results in a
guest crash.

This patch adds code to handle the case when the hypervisor receives a
H_FAC_UNAVAIL exception due to guest executing the stop
instruction. The hypervisor increments the pc to the next instruction
and resumes the guest as expected by the semantics of the
PSSCR[ESL|EC] = 0 stop instruction.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/reg.h | 1 +
 arch/powerpc/kvm/book3s_hv.c   | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index da5cab0..2568c18 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -399,6 +399,7 @@
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_SCV_LG	12	/* Enable System Call Vectored */
 #define FSCR_MSGP_LG	10	/* Enable MSGP */
+#define FSCR_STOP_LG    9       /* Enable stop states */
 #define FSCR_TAR_LG	8	/* Enable Target Address Register */
 #define FSCR_EBB_LG	7	/* Enable Event Based Branching */
 #define FSCR_TM_LG	5	/* Enable Transactional Memory */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 33be4d9..cdb7224 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1419,7 +1419,11 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		if (((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG) &&
 		    cpu_has_feature(CPU_FTR_ARCH_300))
 			r = kvmppc_emulate_doorbell_instr(vcpu);
-		if (r == EMULATE_FAIL) {
+		else if (((vcpu->arch.hfscr >> 56) == FSCR_STOP_LG) &&
+			cpu_has_feature(CPU_FTR_ARCH_300)) {
+			kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+			r = RESUME_GUEST;
+		} else if (r == EMULATE_FAIL) {
 			kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
 			r = RESUME_GUEST;
 		}
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-03-31 12:22 ` Gautham R. Shenoy
@ 2020-03-31 12:22   ` Gautham R. Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:10 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

ISA v3.0 allows the guest to execute a stop instruction. For this, the
PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
scheduling in the guest vCPU.

Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
set. This patch changes the behaviour to enter the guest with
PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
unconditionally clear these bits. Ideally this should be done
conditionally on platforms where the guest stop instruction has no
Bugs (starting POWER9 DD2.3).

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c            |  2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cdb7224..36d059a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	mtspr(SPRN_IC, vcpu->arch.ic);
 	mtspr(SPRN_PID, vcpu->arch.pid);
 
-	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
 	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
 	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index dbc2fec..c2daec3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 	mtspr	SPRN_PID, r7
 	mtspr	SPRN_WORT, r8
 BEGIN_FTR_SECTION
+	/* POWER9-only registers */
+	ld	r5, VCPU_TID(r4)
+	ld	r6, VCPU_PSSCR(r4)
+	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
+	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+	andc	r6, r6, r7
+	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
+	ld	r7, VCPU_HFSCR(r4)
+	mtspr	SPRN_TIDR, r5
+	mtspr	SPRN_PSSCR, r6
+	mtspr	SPRN_HFSCR, r7
+FTR_SECTION_ELSE
 	/* POWER8-only registers */
 	ld	r5, VCPU_TCSCR(r4)
 	ld	r6, VCPU_ACOP(r4)
@@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
 	mtspr	SPRN_CSIGR, r7
 	mtspr	SPRN_TACR, r8
 	nop
-FTR_SECTION_ELSE
-	/* POWER9-only registers */
-	ld	r5, VCPU_TID(r4)
-	ld	r6, VCPU_PSSCR(r4)
-	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
-	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
-	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
-	ld	r7, VCPU_HFSCR(r4)
-	mtspr	SPRN_TIDR, r5
-	mtspr	SPRN_PSSCR, r6
-	mtspr	SPRN_HFSCR, r7
-ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 8:
 
 	ld	r5, VCPU_SPRG0(r4)
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/PATCH  3/3] cpuidle/pseries: Add stop0lite state
  2020-03-31 12:22 ` Gautham R. Shenoy
@ 2020-03-31 12:22   ` Gautham R. Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:10 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

The POWER ISA v3.0 allows stop instruction to be executed from a
HV=0,PR=0 context. If the PSSCR[ESL|EC] bits are cleared, then the
stop instruction thus executed will cause the thread to pause, thereby
donating its cycles to the other threads in the core until the paused
thread is woken up by an interrupt.

In this patch we define a cpuidle state for pseries guests named
stop0lite. This has a latency and residency intermediate to that of
snooze and CEDE. While snooze has non-existent latency, it consumes
the CPU cycles without contributing to anything useful. CEDE on the
other hand requires a full VM exit, which can result in some other
vCPU being scheduled on this physical CPU thereby delaying the
scheduling of the CEDEd vCPU back. In such cases, when the expected
idle duration is small (1-20us), the vCPU can go to this stop0lite
state which provides a nice intermediate state between snooze and
CEDE.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-pseries.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..9c8c18d 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -20,6 +20,7 @@
 #include <asm/firmware.h>
 #include <asm/runlatch.h>
 #include <asm/plpar_wrappers.h>
+#include <asm/processor.h>
 
 struct cpuidle_driver pseries_idle_driver = {
 	.name             = "pseries_idle",
@@ -170,6 +171,26 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 		.enter = &dedicated_cede_loop },
 };
 
+
+
+static int stop_loop(struct cpuidle_device *dev,
+		     struct cpuidle_driver *drv,
+		     int index)
+{
+	unsigned long srr1 = 0;
+
+	if (!prep_irq_for_idle_irqsoff())
+		return index;
+
+	__ppc64_runlatch_off();
+	asm volatile("stop");
+	__ppc64_runlatch_on();
+	fini_irq_for_idle_irqsoff();
+	irq_set_pending_from_srr1(srr1);
+
+	return index;
+}
+
 /*
  * States for shared partition case.
  */
@@ -180,6 +201,12 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
+	{ /* stop0_lite */
+		.name = "stop0lite",
+		.desc = "Pauses the CPU",
+		.exit_latency = 2,
+		.target_residency=20,
+		.enter = &stop_loop },
 	{ /* Shared Cede */
 		.name = "Shared Cede",
 		.desc = "Shared Cede",
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  0/3] Add support for stop instruction inside KVM guest
  2020-03-31 12:22 ` Gautham R. Shenoy
@ 2020-03-31 12:26   ` Gautham R Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-03-31 12:14 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev, David Gibson

On Tue, Mar 31, 2020 at 05:40:55PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> 
>  *** RFC Only. Not intended for inclusion ************
> 
> Motivation
> ~~~~~~~~~~~~~~~
> 
> The POWER ISA v3.0 allows stop instruction to be executed from a Guest
> Kernel (HV=0,PR=0) context. If the hypervisor has cleared
> PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause
> the vCPU thread to "pause", thereby donating its cycles to the other
> threads in the core until the paused thread is woken up by an
> interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then
> execution of the "stop" instruction will raise a Hypervisor Facility
> Unavailable exception.
> 
> The stop idle state in the guest (henceforth referred to as stop0lite)
> when enabled
> 
> * has a very small wakeup latency (1-3us) comparable to that of
>   snooze and considerably better compared the Shared CEDE state
>   (25-30us).  Results are provided below for wakeup latency measured
>   by waking up an idle CPU in a given state using a timer as well as
>   using an IPI.
> 
>   ======================================================================
>   Wakeup Latency measured using a timer (in ns) [Lower is better]
>   ======================================================================
>   Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
>   ======================================================================
>   snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
>   ======================================================================
>   stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
>   ======================================================================
>   Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
>   ======================================================================
>

Posted two copies of Wakeup latency measured by timer. Here is the
wakeup latency measured using an IPI.


======================================================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
======================================================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
----------------------------------------------------------------------
snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
----------------------------------------------------------------------
stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
----------------------------------------------------------------------
Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
======================================================================

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC/PATCH  0/3] Add support for stop instruction inside KVM guest
@ 2020-03-31 12:22 ` Gautham R. Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:22 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>


 *** RFC Only. Not intended for inclusion ************
 
Motivation
~~~~~~~~~~~~~~~

The POWER ISA v3.0 allows stop instruction to be executed from a Guest
Kernel (HV=0,PR=0) context. If the hypervisor has cleared
PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause
the vCPU thread to "pause", thereby donating its cycles to the other
threads in the core until the paused thread is woken up by an
interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then
execution of the "stop" instruction will raise a Hypervisor Facility
Unavailable exception.

The stop idle state in the guest (henceforth referred to as stop0lite)
when enabled

* has a very small wakeup latency (1-3us) comparable to that of
  snooze and considerably better compared the Shared CEDE state
  (25-30us).  Results are provided below for wakeup latency measured
  by waking up an idle CPU in a given state using a timer as well as
  using an IPI.

  ===================================
  Wakeup Latency measured using a timer (in ns) [Lower is better]
  ===================================
  Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
  ===================================
  snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
  ===================================
  stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
  ===================================
  Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
  ===================================

  ===================================
  Wakeup Latency measured using a timer (in ns) [Lower is better]
  ===================================
  Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
  ===================================
  snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
  ===================================
  stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
  ===================================
  Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
  ===================================

* provides an improved single threaded performance compared to snooze
  since the idle state completely relinquishes the core cycles. The
  single threaded performance is observed to be better even when
  compared to "Shared CEDE", since in the latter case something else
  can scheduled on the ceded CPU, while "stop0lite" doesn't give up
  the CPU.

  On a KVM guest with smp 8,sockets=1,cores=2,threads=4 with vCPUs of
  a vCore bound to a physical core, we run a single-threaded ebizzy
  pinned to one of the guest vCPUs while the sibling vCPUs in the core
  are idling. We enable only one guest idle state at a time to measure
  the single-threaded performance benefit that the idle state provides
  by giving up the core resources to the non-idle thread. we obtain
  ~13% improvement in the throughput compared to that with "snooze"
  and ~8% improvement in the throughput compared to "Shared CEDE".
   
   ===================================   | ebizzy records/s : [Higher the better]                              |
   ===================================   |Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
   |           |samples |         |        |        |         |          |
   ====================================	   
   |snooze     |   10   |  1378988| 1379358| 1379032|1379067.3|    113.47|
   ===================================   |stop0lite  |   10   |  1561836| 1562058| 1561906|1561927.5|     81.87|
   ===================================   |Shared CEDE|   10   |  1446584| 1447383| 1447037|1447009.0|    244.16|
   ===================================
Is stop0lite a replacement for snooze ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Not yet. snooze is a polling state, and can respond much faster to a
need_resched() compared to stop0lite which needs an IPI to wakeup from
idle state. This can be seen in the results below:

With the context_switch2 pipe test, we can see that with stop0lite,
the number of context switches are 32.47% lesser than with
snooze. This is due to the fact that snooze is a polling state which
polls for TIF_NEED_RESCHED. Thus it does not require an interrupt to
exit the state and start executing the scheduler code. However,
stop0lite needs an IPI.

Compared to the "Shared CEDE" state, we see that with stop0lite, we
have 82.7% improvement in the number of context switches. This is due
to the low wakeup latency compared to Shared CEDE.

===================================
context switch2 : Number of context switches/s [Higher the better]
===================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
===================================	   
snooze     |  100   |   210480|  221578|  219860|219684.88|   1344.97|
===================================
stop0lite  |  100   |   146730|  150266|  148258|148331.70|    871.50|
===================================
Shared CEDE|  100   |    75812|   82792|   81232| 81187.16|    832.99|
===================================


Is stop0lite a replacement for Shared CEDE ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. For longer idle durations, Shared CEDE is a better option compared
to "stop0lite", both from a performance (CEDEd CPUs can be put into
deeper idle states such as stop2, which can provide SMT folding
benefits) and utilization (Hypervisor can utilize the idle CPUs for
running something useful).


What this patch-set does:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The patchset has 3 patches

Patch 1: Allows the guest to run "stop" instruction without crashing
         even if the hypervisor has set the PSSCR[ESL|EC] bits. This
         is done by handling the Hypervisor Facility Unavailable
         exception and incrementing the program counter by 4 bytes,
         thus emulating the wakeup from a PSSCR[ESL = EC = 0] stop.

Patch 2: Clears the PSSCR[ESL|EC] bits unconditionally before
         dispatching a vCPU, thereby allowing the vCPU to execute a
         "stop" instruction.

Patch 3: Defines a cpuidle state for pseries guest named "stop0lite"
         to be invoked by the cpuidle driver.


What this patch-set does not do:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* It does not define an interface by which the guest discovers the
  stop-capability. Should this be defined via device-tree?

* It does address the problem of guest migration. i.e, a guest started
  on a hypervisor which supports guest stop state, if migrated to a
  hypervisor which does not support guest stop state will crash,
  unless it has Patch 1 above.
  

I would like to seek feedback and comments with respect to how to go
about implementing the issues that have not been addressed in this
patchset.

Gautham R. Shenoy (3):
  powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
  pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  cpuidle/pseries: Add stop0lite state

 arch/powerpc/include/asm/reg.h          |  1 +
 arch/powerpc/kvm/book3s_hv.c            |  8 ++++++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
 drivers/cpuidle/cpuidle-pseries.c       | 27 +++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 14 deletions(-)

-- 
1.9.4

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC/PATCH  1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
@ 2020-03-31 12:22   ` Gautham R. Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:22 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

If a guest executes a stop instruction when the hypervisor has set the
PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility
Unavailable exception. Currently when we receive this exception, we
only check if the exeception is generated due to a doorbell
instruction, in which case we emulate it. For all other cases,
including the case when the guest executes a stop-instruction, the
hypervisor sends a PROGILL to the guest program, which results in a
guest crash.

This patch adds code to handle the case when the hypervisor receives a
H_FAC_UNAVAIL exception due to guest executing the stop
instruction. The hypervisor increments the pc to the next instruction
and resumes the guest as expected by the semantics of the
PSSCR[ESL|EC] = 0 stop instruction.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/reg.h | 1 +
 arch/powerpc/kvm/book3s_hv.c   | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index da5cab0..2568c18 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -399,6 +399,7 @@
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_SCV_LG	12	/* Enable System Call Vectored */
 #define FSCR_MSGP_LG	10	/* Enable MSGP */
+#define FSCR_STOP_LG    9       /* Enable stop states */
 #define FSCR_TAR_LG	8	/* Enable Target Address Register */
 #define FSCR_EBB_LG	7	/* Enable Event Based Branching */
 #define FSCR_TM_LG	5	/* Enable Transactional Memory */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 33be4d9..cdb7224 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1419,7 +1419,11 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		if (((vcpu->arch.hfscr >> 56) = FSCR_MSGP_LG) &&
 		    cpu_has_feature(CPU_FTR_ARCH_300))
 			r = kvmppc_emulate_doorbell_instr(vcpu);
-		if (r = EMULATE_FAIL) {
+		else if (((vcpu->arch.hfscr >> 56) = FSCR_STOP_LG) &&
+			cpu_has_feature(CPU_FTR_ARCH_300)) {
+			kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+			r = RESUME_GUEST;
+		} else if (r = EMULATE_FAIL) {
 			kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
 			r = RESUME_GUEST;
 		}
-- 
1.9.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-03-31 12:22   ` Gautham R. Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:22 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

ISA v3.0 allows the guest to execute a stop instruction. For this, the
PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
scheduling in the guest vCPU.

Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
set. This patch changes the behaviour to enter the guest with
PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
unconditionally clear these bits. Ideally this should be done
conditionally on platforms where the guest stop instruction has no
Bugs (starting POWER9 DD2.3).

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c            |  2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cdb7224..36d059a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	mtspr(SPRN_IC, vcpu->arch.ic);
 	mtspr(SPRN_PID, vcpu->arch.pid);
 
-	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
 	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
 	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index dbc2fec..c2daec3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 	mtspr	SPRN_PID, r7
 	mtspr	SPRN_WORT, r8
 BEGIN_FTR_SECTION
+	/* POWER9-only registers */
+	ld	r5, VCPU_TID(r4)
+	ld	r6, VCPU_PSSCR(r4)
+	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
+	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+	andc	r6, r6, r7
+	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
+	ld	r7, VCPU_HFSCR(r4)
+	mtspr	SPRN_TIDR, r5
+	mtspr	SPRN_PSSCR, r6
+	mtspr	SPRN_HFSCR, r7
+FTR_SECTION_ELSE
 	/* POWER8-only registers */
 	ld	r5, VCPU_TCSCR(r4)
 	ld	r6, VCPU_ACOP(r4)
@@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
 	mtspr	SPRN_CSIGR, r7
 	mtspr	SPRN_TACR, r8
 	nop
-FTR_SECTION_ELSE
-	/* POWER9-only registers */
-	ld	r5, VCPU_TID(r4)
-	ld	r6, VCPU_PSSCR(r4)
-	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
-	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
-	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
-	ld	r7, VCPU_HFSCR(r4)
-	mtspr	SPRN_TIDR, r5
-	mtspr	SPRN_PSSCR, r6
-	mtspr	SPRN_HFSCR, r7
-ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 8:
 
 	ld	r5, VCPU_SPRG0(r4)
-- 
1.9.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/PATCH  3/3] cpuidle/pseries: Add stop0lite state
@ 2020-03-31 12:22   ` Gautham R. Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R. Shenoy @ 2020-03-31 12:22 UTC (permalink / raw)
  To: Paul Mackerras, Michael Neuling, Nicholas Piggin,
	Michael Ellerman, David Gibson, Vaidyanathan Srinivasan,
	Bharata B Rao
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

The POWER ISA v3.0 allows stop instruction to be executed from a
HV=0,PR=0 context. If the PSSCR[ESL|EC] bits are cleared, then the
stop instruction thus executed will cause the thread to pause, thereby
donating its cycles to the other threads in the core until the paused
thread is woken up by an interrupt.

In this patch we define a cpuidle state for pseries guests named
stop0lite. This has a latency and residency intermediate to that of
snooze and CEDE. While snooze has non-existent latency, it consumes
the CPU cycles without contributing to anything useful. CEDE on the
other hand requires a full VM exit, which can result in some other
vCPU being scheduled on this physical CPU thereby delaying the
scheduling of the CEDEd vCPU back. In such cases, when the expected
idle duration is small (1-20us), the vCPU can go to this stop0lite
state which provides a nice intermediate state between snooze and
CEDE.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-pseries.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..9c8c18d 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -20,6 +20,7 @@
 #include <asm/firmware.h>
 #include <asm/runlatch.h>
 #include <asm/plpar_wrappers.h>
+#include <asm/processor.h>
 
 struct cpuidle_driver pseries_idle_driver = {
 	.name             = "pseries_idle",
@@ -170,6 +171,26 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 		.enter = &dedicated_cede_loop },
 };
 
+
+
+static int stop_loop(struct cpuidle_device *dev,
+		     struct cpuidle_driver *drv,
+		     int index)
+{
+	unsigned long srr1 = 0;
+
+	if (!prep_irq_for_idle_irqsoff())
+		return index;
+
+	__ppc64_runlatch_off();
+	asm volatile("stop");
+	__ppc64_runlatch_on();
+	fini_irq_for_idle_irqsoff();
+	irq_set_pending_from_srr1(srr1);
+
+	return index;
+}
+
 /*
  * States for shared partition case.
  */
@@ -180,6 +201,12 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
+	{ /* stop0_lite */
+		.name = "stop0lite",
+		.desc = "Pauses the CPU",
+		.exit_latency = 2,
+		.target_residency ,
+		.enter = &stop_loop },
 	{ /* Shared Cede */
 		.name = "Shared Cede",
 		.desc = "Shared Cede",
-- 
1.9.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  0/3] Add support for stop instruction inside KVM guest
@ 2020-03-31 12:26   ` Gautham R Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-03-31 12:26 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev, David Gibson

On Tue, Mar 31, 2020 at 05:40:55PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> 
>  *** RFC Only. Not intended for inclusion ************
> 
> Motivation
> ~~~~~~~~~~~~~~~
> 
> The POWER ISA v3.0 allows stop instruction to be executed from a Guest
> Kernel (HV=0,PR=0) context. If the hypervisor has cleared
> PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause
> the vCPU thread to "pause", thereby donating its cycles to the other
> threads in the core until the paused thread is woken up by an
> interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then
> execution of the "stop" instruction will raise a Hypervisor Facility
> Unavailable exception.
> 
> The stop idle state in the guest (henceforth referred to as stop0lite)
> when enabled
> 
> * has a very small wakeup latency (1-3us) comparable to that of
>   snooze and considerably better compared the Shared CEDE state
>   (25-30us).  Results are provided below for wakeup latency measured
>   by waking up an idle CPU in a given state using a timer as well as
>   using an IPI.
> 
>   ===================================
>   Wakeup Latency measured using a timer (in ns) [Lower is better]
>   ===================================
>   Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
>   ===================================
>   snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
>   ===================================
>   stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
>   ===================================
>   Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
>   ===================================
>

Posted two copies of Wakeup latency measured by timer. Here is the
wakeup latency measured using an IPI.


===================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
===================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
----------------------------------------------------------------------
snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
----------------------------------------------------------------------
stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
----------------------------------------------------------------------
Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
===================================

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
  2020-03-31 12:22   ` Gautham R. Shenoy
@ 2020-04-03  2:15     ` Nicholas Piggin
  -1 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2020-04-03  2:15 UTC (permalink / raw)
  To: Bharata B Rao, David Gibson, Gautham R. Shenoy, Michael Neuling,
	Michael Ellerman, Paul Mackerras, Vaidyanathan Srinivasan
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc

Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> If a guest executes a stop instruction when the hypervisor has set the
> PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility
> Unavailable exception. Currently when we receive this exception, we
> only check if the exeception is generated due to a doorbell
> instruction, in which case we emulate it. For all other cases,
> including the case when the guest executes a stop-instruction, the
> hypervisor sends a PROGILL to the guest program, which results in a
> guest crash.
> 
> This patch adds code to handle the case when the hypervisor receives a
> H_FAC_UNAVAIL exception due to guest executing the stop
> instruction. The hypervisor increments the pc to the next instruction
> and resumes the guest as expected by the semantics of the
> PSSCR[ESL|EC] = 0 stop instruction.

This seems reasonable, I don't think we need to crash the guest here.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.
@ 2020-04-03  2:15     ` Nicholas Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2020-04-03  2:15 UTC (permalink / raw)
  To: Bharata B Rao, David Gibson, Gautham R. Shenoy, Michael Neuling,
	Michael Ellerman, Paul Mackerras, Vaidyanathan Srinivasan
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc

Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> If a guest executes a stop instruction when the hypervisor has set the
> PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility
> Unavailable exception. Currently when we receive this exception, we
> only check if the exeception is generated due to a doorbell
> instruction, in which case we emulate it. For all other cases,
> including the case when the guest executes a stop-instruction, the
> hypervisor sends a PROGILL to the guest program, which results in a
> guest crash.
> 
> This patch adds code to handle the case when the hypervisor receives a
> H_FAC_UNAVAIL exception due to guest executing the stop
> instruction. The hypervisor increments the pc to the next instruction
> and resumes the guest as expected by the semantics of the
> PSSCR[ESL|EC] = 0 stop instruction.

This seems reasonable, I don't think we need to crash the guest here.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-03-31 12:22   ` Gautham R. Shenoy
@ 2020-04-03  2:20     ` Nicholas Piggin
  -1 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2020-04-03  2:20 UTC (permalink / raw)
  To: Bharata B Rao, David Gibson, Gautham R. Shenoy, Michael Neuling,
	Michael Ellerman, Paul Mackerras, Vaidyanathan Srinivasan
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc

Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> ISA v3.0 allows the guest to execute a stop instruction. For this, the
> PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> scheduling in the guest vCPU.
> 
> Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> set. This patch changes the behaviour to enter the guest with
> PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> unconditionally clear these bits. Ideally this should be done
> conditionally on platforms where the guest stop instruction has no
> Bugs (starting POWER9 DD2.3).

How will guests know that they can use this facility safely after your
series? You need both DD2.3 and a patched KVM.

> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c            |  2 +-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
>  2 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index cdb7224..36d059a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
>  	mtspr(SPRN_IC, vcpu->arch.ic);
>  	mtspr(SPRN_PID, vcpu->arch.pid);
>  
> -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
>  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
>  
>  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index dbc2fec..c2daec3 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>  	mtspr	SPRN_PID, r7
>  	mtspr	SPRN_WORT, r8
>  BEGIN_FTR_SECTION
> +	/* POWER9-only registers */
> +	ld	r5, VCPU_TID(r4)
> +	ld	r6, VCPU_PSSCR(r4)
> +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> +	andc	r6, r6, r7
> +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> +	ld	r7, VCPU_HFSCR(r4)
> +	mtspr	SPRN_TIDR, r5
> +	mtspr	SPRN_PSSCR, r6
> +	mtspr	SPRN_HFSCR, r7
> +FTR_SECTION_ELSE

Why did you move these around? Just because the POWER9 section became
larger than the other?

That's a real wart in the instruction patching implementation, I think
we can fix it by padding with nops in the macros.

Can you just add the additional required nops to the top branch without
changing them around for this patch, so it's easier to see what's going
on? The end result will be the same after patching. Actually changing
these around can have a slight unintended consequence in that code that
runs before features were patched will execute the IF code. Not a
problem here, but another reason why the instruction patching 
restriction is annoying.

Thanks,
Nick

>  	/* POWER8-only registers */
>  	ld	r5, VCPU_TCSCR(r4)
>  	ld	r6, VCPU_ACOP(r4)
> @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
>  	mtspr	SPRN_CSIGR, r7
>  	mtspr	SPRN_TACR, r8
>  	nop
> -FTR_SECTION_ELSE
> -	/* POWER9-only registers */
> -	ld	r5, VCPU_TID(r4)
> -	ld	r6, VCPU_PSSCR(r4)
> -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> -	ld	r7, VCPU_HFSCR(r4)
> -	mtspr	SPRN_TIDR, r5
> -	mtspr	SPRN_PSSCR, r6
> -	mtspr	SPRN_HFSCR, r7
> -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>  8:
>  
>  	ld	r5, VCPU_SPRG0(r4)
> -- 
> 1.9.4
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-03  2:20     ` Nicholas Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2020-04-03  2:20 UTC (permalink / raw)
  To: Bharata B Rao, David Gibson, Gautham R. Shenoy, Michael Neuling,
	Michael Ellerman, Paul Mackerras, Vaidyanathan Srinivasan
  Cc: linuxppc-dev, linuxppc-dev, kvm-ppc

Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> ISA v3.0 allows the guest to execute a stop instruction. For this, the
> PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> scheduling in the guest vCPU.
> 
> Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> set. This patch changes the behaviour to enter the guest with
> PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> unconditionally clear these bits. Ideally this should be done
> conditionally on platforms where the guest stop instruction has no
> Bugs (starting POWER9 DD2.3).

How will guests know that they can use this facility safely after your
series? You need both DD2.3 and a patched KVM.

> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c            |  2 +-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
>  2 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index cdb7224..36d059a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
>  	mtspr(SPRN_IC, vcpu->arch.ic);
>  	mtspr(SPRN_PID, vcpu->arch.pid);
>  
> -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
>  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
>  
>  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index dbc2fec..c2daec3 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>  	mtspr	SPRN_PID, r7
>  	mtspr	SPRN_WORT, r8
>  BEGIN_FTR_SECTION
> +	/* POWER9-only registers */
> +	ld	r5, VCPU_TID(r4)
> +	ld	r6, VCPU_PSSCR(r4)
> +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> +	andc	r6, r6, r7
> +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> +	ld	r7, VCPU_HFSCR(r4)
> +	mtspr	SPRN_TIDR, r5
> +	mtspr	SPRN_PSSCR, r6
> +	mtspr	SPRN_HFSCR, r7
> +FTR_SECTION_ELSE

Why did you move these around? Just because the POWER9 section became
larger than the other?

That's a real wart in the instruction patching implementation, I think
we can fix it by padding with nops in the macros.

Can you just add the additional required nops to the top branch without
changing them around for this patch, so it's easier to see what's going
on? The end result will be the same after patching. Actually changing
these around can have a slight unintended consequence in that code that
runs before features were patched will execute the IF code. Not a
problem here, but another reason why the instruction patching 
restriction is annoying.

Thanks,
Nick

>  	/* POWER8-only registers */
>  	ld	r5, VCPU_TCSCR(r4)
>  	ld	r6, VCPU_ACOP(r4)
> @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
>  	mtspr	SPRN_CSIGR, r7
>  	mtspr	SPRN_TACR, r8
>  	nop
> -FTR_SECTION_ELSE
> -	/* POWER9-only registers */
> -	ld	r5, VCPU_TID(r4)
> -	ld	r6, VCPU_PSSCR(r4)
> -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> -	ld	r7, VCPU_HFSCR(r4)
> -	mtspr	SPRN_TIDR, r5
> -	mtspr	SPRN_PSSCR, r6
> -	mtspr	SPRN_HFSCR, r7
> -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>  8:
>  
>  	ld	r5, VCPU_SPRG0(r4)
> -- 
> 1.9.4
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-03  2:20     ` Nicholas Piggin
@ 2020-04-03  9:43       ` Gautham R Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-03  9:31 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Gautham R. Shenoy, Michael Neuling, kvm-ppc, Bharata B Rao,
	linuxppc-dev, Vaidyanathan Srinivasan, linuxppc-dev,
	David Gibson

On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > scheduling in the guest vCPU.
> > 
> > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > set. This patch changes the behaviour to enter the guest with
> > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > unconditionally clear these bits. Ideally this should be done
> > conditionally on platforms where the guest stop instruction has no
> > Bugs (starting POWER9 DD2.3).
> 
> How will guests know that they can use this facility safely after your
> series? You need both DD2.3 and a patched KVM.


Yes, this is something that isn't addressed in this series (mentioned
in the cover letter), which is a POC demonstrating that the stop0lite
state in guest works.

However, to answer your question, this is the scheme that I had in
mind :

OPAL:
   On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:
    1. If "idle-stop-guest" dt-cpu-feature is discovered, then
       we set bool enable_guest_stop = true;

    2. During KVM guest entry, clear PSSCR[ESL|EC] iff
       enable_guest_stop == true.

    3. In kvm_vm_ioctl_check_extension(), for a new capability
       KVM_CAP_STOP, return true iff enable_guest_top == true.

QEMU:
   Check with the hypervisor if KVM_CAP_STOP is present. If so,
   indicate the presence to the guest via device tree.

Guest Kernel:
   Check for the presence of guest stop state support in
   device-tree. If available, enable the stop0lite in the cpuidle
   driver. 
   

We still have a challenge of migrating a guest which started on a
hypervisor supporting guest stop state to a hypervisor without it.
The target hypervisor should atleast have Patch 1 of this series, so
that we don't crash the guest.

> 
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> >  2 files changed, 14 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index cdb7224..36d059a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> >  	mtspr(SPRN_IC, vcpu->arch.ic);
> >  	mtspr(SPRN_PID, vcpu->arch.pid);
> >  
> > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> >  
> >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index dbc2fec..c2daec3 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> >  	mtspr	SPRN_PID, r7
> >  	mtspr	SPRN_WORT, r8
> >  BEGIN_FTR_SECTION
> > +	/* POWER9-only registers */
> > +	ld	r5, VCPU_TID(r4)
> > +	ld	r6, VCPU_PSSCR(r4)
> > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > +	andc	r6, r6, r7
> > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > +	ld	r7, VCPU_HFSCR(r4)
> > +	mtspr	SPRN_TIDR, r5
> > +	mtspr	SPRN_PSSCR, r6
> > +	mtspr	SPRN_HFSCR, r7
> > +FTR_SECTION_ELSE
> 
> Why did you move these around? Just because the POWER9 section became
> larger than the other?

Yes.

> 
> That's a real wart in the instruction patching implementation, I think
> we can fix it by padding with nops in the macros.
> 
> Can you just add the additional required nops to the top branch without
> changing them around for this patch, so it's easier to see what's going
> on? The end result will be the same after patching. Actually changing
> these around can have a slight unintended consequence in that code that
> runs before features were patched will execute the IF code. Not a
> problem here, but another reason why the instruction patching 
> restriction is annoying.

Sure, I will repost this patch with additional nops instead of
moving them around.

> 
> Thanks,
> Nick
> 
> >  	/* POWER8-only registers */
> >  	ld	r5, VCPU_TCSCR(r4)
> >  	ld	r6, VCPU_ACOP(r4)
> > @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
> >  	mtspr	SPRN_CSIGR, r7
> >  	mtspr	SPRN_TACR, r8
> >  	nop
> > -FTR_SECTION_ELSE
> > -	/* POWER9-only registers */
> > -	ld	r5, VCPU_TID(r4)
> > -	ld	r6, VCPU_PSSCR(r4)
> > -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> > -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > -	ld	r7, VCPU_HFSCR(r4)
> > -	mtspr	SPRN_TIDR, r5
> > -	mtspr	SPRN_PSSCR, r6
> > -	mtspr	SPRN_HFSCR, r7
> > -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> >  8:
> >  
> >  	ld	r5, VCPU_SPRG0(r4)
> > -- 
> > 1.9.4
> > 
> > 

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-03  9:43       ` Gautham R Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-03  9:43 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Gautham R. Shenoy, Michael Neuling, kvm-ppc, Bharata B Rao,
	linuxppc-dev, Vaidyanathan Srinivasan, linuxppc-dev,
	David Gibson

On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > scheduling in the guest vCPU.
> > 
> > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > set. This patch changes the behaviour to enter the guest with
> > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > unconditionally clear these bits. Ideally this should be done
> > conditionally on platforms where the guest stop instruction has no
> > Bugs (starting POWER9 DD2.3).
> 
> How will guests know that they can use this facility safely after your
> series? You need both DD2.3 and a patched KVM.


Yes, this is something that isn't addressed in this series (mentioned
in the cover letter), which is a POC demonstrating that the stop0lite
state in guest works.

However, to answer your question, this is the scheme that I had in
mind :

OPAL:
   On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:
    1. If "idle-stop-guest" dt-cpu-feature is discovered, then
       we set bool enable_guest_stop = true;

    2. During KVM guest entry, clear PSSCR[ESL|EC] iff
       enable_guest_stop = true.

    3. In kvm_vm_ioctl_check_extension(), for a new capability
       KVM_CAP_STOP, return true iff enable_guest_top = true.

QEMU:
   Check with the hypervisor if KVM_CAP_STOP is present. If so,
   indicate the presence to the guest via device tree.

Guest Kernel:
   Check for the presence of guest stop state support in
   device-tree. If available, enable the stop0lite in the cpuidle
   driver. 
   

We still have a challenge of migrating a guest which started on a
hypervisor supporting guest stop state to a hypervisor without it.
The target hypervisor should atleast have Patch 1 of this series, so
that we don't crash the guest.

> 
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> >  2 files changed, 14 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index cdb7224..36d059a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> >  	mtspr(SPRN_IC, vcpu->arch.ic);
> >  	mtspr(SPRN_PID, vcpu->arch.pid);
> >  
> > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> >  
> >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index dbc2fec..c2daec3 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> >  	mtspr	SPRN_PID, r7
> >  	mtspr	SPRN_WORT, r8
> >  BEGIN_FTR_SECTION
> > +	/* POWER9-only registers */
> > +	ld	r5, VCPU_TID(r4)
> > +	ld	r6, VCPU_PSSCR(r4)
> > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > +	andc	r6, r6, r7
> > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > +	ld	r7, VCPU_HFSCR(r4)
> > +	mtspr	SPRN_TIDR, r5
> > +	mtspr	SPRN_PSSCR, r6
> > +	mtspr	SPRN_HFSCR, r7
> > +FTR_SECTION_ELSE
> 
> Why did you move these around? Just because the POWER9 section became
> larger than the other?

Yes.

> 
> That's a real wart in the instruction patching implementation, I think
> we can fix it by padding with nops in the macros.
> 
> Can you just add the additional required nops to the top branch without
> changing them around for this patch, so it's easier to see what's going
> on? The end result will be the same after patching. Actually changing
> these around can have a slight unintended consequence in that code that
> runs before features were patched will execute the IF code. Not a
> problem here, but another reason why the instruction patching 
> restriction is annoying.

Sure, I will repost this patch with additional nops instead of
moving them around.

> 
> Thanks,
> Nick
> 
> >  	/* POWER8-only registers */
> >  	ld	r5, VCPU_TCSCR(r4)
> >  	ld	r6, VCPU_ACOP(r4)
> > @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
> >  	mtspr	SPRN_CSIGR, r7
> >  	mtspr	SPRN_TACR, r8
> >  	nop
> > -FTR_SECTION_ELSE
> > -	/* POWER9-only registers */
> > -	ld	r5, VCPU_TID(r4)
> > -	ld	r6, VCPU_PSSCR(r4)
> > -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> > -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > -	ld	r7, VCPU_HFSCR(r4)
> > -	mtspr	SPRN_TIDR, r5
> > -	mtspr	SPRN_PSSCR, r6
> > -	mtspr	SPRN_HFSCR, r7
> > -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> >  8:
> >  
> >  	ld	r5, VCPU_SPRG0(r4)
> > -- 
> > 1.9.4
> > 
> > 

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-03  9:43       ` Gautham R Shenoy
@ 2020-04-06  9:58         ` David Gibson
  -1 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2020-04-06  9:58 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6206 bytes --]

On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > 
> > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > scheduling in the guest vCPU.
> > > 
> > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > set. This patch changes the behaviour to enter the guest with
> > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > unconditionally clear these bits. Ideally this should be done
> > > conditionally on platforms where the guest stop instruction has no
> > > Bugs (starting POWER9 DD2.3).
> > 
> > How will guests know that they can use this facility safely after your
> > series? You need both DD2.3 and a patched KVM.
> 
> 
> Yes, this is something that isn't addressed in this series (mentioned
> in the cover letter), which is a POC demonstrating that the stop0lite
> state in guest works.
> 
> However, to answer your question, this is the scheme that I had in
> mind :
> 
> OPAL:
>    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> 
> Hypervisor Kernel:
>     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
>        we set bool enable_guest_stop = true;
> 
>     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
>        enable_guest_stop == true.
> 
>     3. In kvm_vm_ioctl_check_extension(), for a new capability
>        KVM_CAP_STOP, return true iff enable_guest_top == true.
> 
> QEMU:
>    Check with the hypervisor if KVM_CAP_STOP is present. If so,
>    indicate the presence to the guest via device tree.

Nack.  Presenting different capabilities to the guest depending on
host capabilities (rather than explicit options) is never ok.  It
means that depending on the system you start on you may or may not be
able to migrate to other systems that you're supposed to be able to,

> Guest Kernel:
>    Check for the presence of guest stop state support in
>    device-tree. If available, enable the stop0lite in the cpuidle
>    driver. 
>    
> 
> We still have a challenge of migrating a guest which started on a
> hypervisor supporting guest stop state to a hypervisor without it.
> The target hypervisor should atleast have Patch 1 of this series, so
> that we don't crash the guest.
> 
> > 
> > > 
> > > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> > >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> > >  2 files changed, 14 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index cdb7224..36d059a 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > >  	mtspr(SPRN_IC, vcpu->arch.ic);
> > >  	mtspr(SPRN_PID, vcpu->arch.pid);
> > >  
> > > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> > >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> > >  
> > >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > index dbc2fec..c2daec3 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > >  	mtspr	SPRN_PID, r7
> > >  	mtspr	SPRN_WORT, r8
> > >  BEGIN_FTR_SECTION
> > > +	/* POWER9-only registers */
> > > +	ld	r5, VCPU_TID(r4)
> > > +	ld	r6, VCPU_PSSCR(r4)
> > > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > > +	andc	r6, r6, r7
> > > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > +	ld	r7, VCPU_HFSCR(r4)
> > > +	mtspr	SPRN_TIDR, r5
> > > +	mtspr	SPRN_PSSCR, r6
> > > +	mtspr	SPRN_HFSCR, r7
> > > +FTR_SECTION_ELSE
> > 
> > Why did you move these around? Just because the POWER9 section became
> > larger than the other?
> 
> Yes.
> 
> > 
> > That's a real wart in the instruction patching implementation, I think
> > we can fix it by padding with nops in the macros.
> > 
> > Can you just add the additional required nops to the top branch without
> > changing them around for this patch, so it's easier to see what's going
> > on? The end result will be the same after patching. Actually changing
> > these around can have a slight unintended consequence in that code that
> > runs before features were patched will execute the IF code. Not a
> > problem here, but another reason why the instruction patching 
> > restriction is annoying.
> 
> Sure, I will repost this patch with additional nops instead of
> moving them around.
> 
> > 
> > Thanks,
> > Nick
> > 
> > >  	/* POWER8-only registers */
> > >  	ld	r5, VCPU_TCSCR(r4)
> > >  	ld	r6, VCPU_ACOP(r4)
> > > @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
> > >  	mtspr	SPRN_CSIGR, r7
> > >  	mtspr	SPRN_TACR, r8
> > >  	nop
> > > -FTR_SECTION_ELSE
> > > -	/* POWER9-only registers */
> > > -	ld	r5, VCPU_TID(r4)
> > > -	ld	r6, VCPU_PSSCR(r4)
> > > -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> > > -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > -	ld	r7, VCPU_HFSCR(r4)
> > > -	mtspr	SPRN_TIDR, r5
> > > -	mtspr	SPRN_PSSCR, r6
> > > -	mtspr	SPRN_HFSCR, r7
> > > -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> > > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> > >  8:
> > >  
> > >  	ld	r5, VCPU_SPRG0(r4)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-06  9:58         ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2020-04-06  9:58 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6206 bytes --]

On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > 
> > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > scheduling in the guest vCPU.
> > > 
> > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > set. This patch changes the behaviour to enter the guest with
> > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > unconditionally clear these bits. Ideally this should be done
> > > conditionally on platforms where the guest stop instruction has no
> > > Bugs (starting POWER9 DD2.3).
> > 
> > How will guests know that they can use this facility safely after your
> > series? You need both DD2.3 and a patched KVM.
> 
> 
> Yes, this is something that isn't addressed in this series (mentioned
> in the cover letter), which is a POC demonstrating that the stop0lite
> state in guest works.
> 
> However, to answer your question, this is the scheme that I had in
> mind :
> 
> OPAL:
>    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> 
> Hypervisor Kernel:
>     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
>        we set bool enable_guest_stop = true;
> 
>     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
>        enable_guest_stop == true.
> 
>     3. In kvm_vm_ioctl_check_extension(), for a new capability
>        KVM_CAP_STOP, return true iff enable_guest_top == true.
> 
> QEMU:
>    Check with the hypervisor if KVM_CAP_STOP is present. If so,
>    indicate the presence to the guest via device tree.

Nack.  Presenting different capabilities to the guest depending on
host capabilities (rather than explicit options) is never ok.  It
means that depending on the system you start on you may or may not be
able to migrate to other systems that you're supposed to be able to,

> Guest Kernel:
>    Check for the presence of guest stop state support in
>    device-tree. If available, enable the stop0lite in the cpuidle
>    driver. 
>    
> 
> We still have a challenge of migrating a guest which started on a
> hypervisor supporting guest stop state to a hypervisor without it.
> The target hypervisor should atleast have Patch 1 of this series, so
> that we don't crash the guest.
> 
> > 
> > > 
> > > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> > >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> > >  2 files changed, 14 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index cdb7224..36d059a 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > >  	mtspr(SPRN_IC, vcpu->arch.ic);
> > >  	mtspr(SPRN_PID, vcpu->arch.pid);
> > >  
> > > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> > >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> > >  
> > >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > index dbc2fec..c2daec3 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > >  	mtspr	SPRN_PID, r7
> > >  	mtspr	SPRN_WORT, r8
> > >  BEGIN_FTR_SECTION
> > > +	/* POWER9-only registers */
> > > +	ld	r5, VCPU_TID(r4)
> > > +	ld	r6, VCPU_PSSCR(r4)
> > > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > > +	andc	r6, r6, r7
> > > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > +	ld	r7, VCPU_HFSCR(r4)
> > > +	mtspr	SPRN_TIDR, r5
> > > +	mtspr	SPRN_PSSCR, r6
> > > +	mtspr	SPRN_HFSCR, r7
> > > +FTR_SECTION_ELSE
> > 
> > Why did you move these around? Just because the POWER9 section became
> > larger than the other?
> 
> Yes.
> 
> > 
> > That's a real wart in the instruction patching implementation, I think
> > we can fix it by padding with nops in the macros.
> > 
> > Can you just add the additional required nops to the top branch without
> > changing them around for this patch, so it's easier to see what's going
> > on? The end result will be the same after patching. Actually changing
> > these around can have a slight unintended consequence in that code that
> > runs before features were patched will execute the IF code. Not a
> > problem here, but another reason why the instruction patching 
> > restriction is annoying.
> 
> Sure, I will repost this patch with additional nops instead of
> moving them around.
> 
> > 
> > Thanks,
> > Nick
> > 
> > >  	/* POWER8-only registers */
> > >  	ld	r5, VCPU_TCSCR(r4)
> > >  	ld	r6, VCPU_ACOP(r4)
> > > @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
> > >  	mtspr	SPRN_CSIGR, r7
> > >  	mtspr	SPRN_TACR, r8
> > >  	nop
> > > -FTR_SECTION_ELSE
> > > -	/* POWER9-only registers */
> > > -	ld	r5, VCPU_TID(r4)
> > > -	ld	r6, VCPU_PSSCR(r4)
> > > -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> > > -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > -	ld	r7, VCPU_HFSCR(r4)
> > > -	mtspr	SPRN_TIDR, r5
> > > -	mtspr	SPRN_PSSCR, r6
> > > -	mtspr	SPRN_HFSCR, r7
> > > -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> > > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> > >  8:
> > >  
> > >  	ld	r5, VCPU_SPRG0(r4)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-03  9:43       ` Gautham R Shenoy
@ 2020-04-07 12:45         ` Gautham R Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-07 12:33 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, kvm-ppc, Bharata B Rao, linuxppc-dev,
	Nicholas Piggin, Vaidyanathan Srinivasan, linuxppc-dev,
	David Gibson

Hello Nicholas,

On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:


[..snip..]
> > > 
> > > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> > >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> > >  2 files changed, 14 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index cdb7224..36d059a 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > >  	mtspr(SPRN_IC, vcpu->arch.ic);
> > >  	mtspr(SPRN_PID, vcpu->arch.pid);
> > >  
> > > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> > >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> > >  
> > >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > index dbc2fec..c2daec3 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > >  	mtspr	SPRN_PID, r7
> > >  	mtspr	SPRN_WORT, r8
> > >  BEGIN_FTR_SECTION
> > > +	/* POWER9-only registers */
> > > +	ld	r5, VCPU_TID(r4)
> > > +	ld	r6, VCPU_PSSCR(r4)
> > > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > > +	andc	r6, r6, r7
> > > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > +	ld	r7, VCPU_HFSCR(r4)
> > > +	mtspr	SPRN_TIDR, r5
> > > +	mtspr	SPRN_PSSCR, r6
> > > +	mtspr	SPRN_HFSCR, r7
> > > +FTR_SECTION_ELSE
> > 
> > Why did you move these around? Just because the POWER9 section became
> > larger than the other?
> 
> Yes.
> 
> > 
> > That's a real wart in the instruction patching implementation, I think
> > we can fix it by padding with nops in the macros.
> > 
> > Can you just add the additional required nops to the top branch without
> > changing them around for this patch, so it's easier to see what's going
> > on? The end result will be the same after patching. Actually changing
> > these around can have a slight unintended consequence in that code that
> > runs before features were patched will execute the IF code. Not a
> > problem here, but another reason why the instruction patching 
> > restriction is annoying.
> 
> Sure, I will repost this patch with additional nops instead of
> moving them around.
> 

Below is the same patch without rearranging the FTR_SECTION blocks,
but with an extra nop. 

---
 arch/powerpc/kvm/book3s_hv.c            | 2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c52871c..efa7d3e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3433,7 +3433,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	mtspr(SPRN_IC, vcpu->arch.ic);
 	mtspr(SPRN_PID, vcpu->arch.pid);
 
-	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
 	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
 	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 780a499..83a69dc 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -833,12 +833,14 @@ BEGIN_FTR_SECTION
 	mtspr	SPRN_CSIGR, r7
 	mtspr	SPRN_TACR, r8
 	nop
+	nop
 FTR_SECTION_ELSE
 	/* POWER9-only registers */
 	ld	r5, VCPU_TID(r4)
 	ld	r6, VCPU_PSSCR(r4)
 	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
-	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
+	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+	andc	r6, r6, r7
 	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
 	ld	r7, VCPU_HFSCR(r4)
 	mtspr	SPRN_TIDR, r5
-- 
Thanks and Regards
gautham.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-07 12:45         ` Gautham R Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-07 12:45 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, kvm-ppc, Bharata B Rao, linuxppc-dev,
	Nicholas Piggin, Vaidyanathan Srinivasan, linuxppc-dev,
	David Gibson

Hello Nicholas,

On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:


[..snip..]
> > > 
> > > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> > >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> > >  2 files changed, 14 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index cdb7224..36d059a 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > >  	mtspr(SPRN_IC, vcpu->arch.ic);
> > >  	mtspr(SPRN_PID, vcpu->arch.pid);
> > >  
> > > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> > >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> > >  
> > >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > index dbc2fec..c2daec3 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > >  	mtspr	SPRN_PID, r7
> > >  	mtspr	SPRN_WORT, r8
> > >  BEGIN_FTR_SECTION
> > > +	/* POWER9-only registers */
> > > +	ld	r5, VCPU_TID(r4)
> > > +	ld	r6, VCPU_PSSCR(r4)
> > > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > > +	andc	r6, r6, r7
> > > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > +	ld	r7, VCPU_HFSCR(r4)
> > > +	mtspr	SPRN_TIDR, r5
> > > +	mtspr	SPRN_PSSCR, r6
> > > +	mtspr	SPRN_HFSCR, r7
> > > +FTR_SECTION_ELSE
> > 
> > Why did you move these around? Just because the POWER9 section became
> > larger than the other?
> 
> Yes.
> 
> > 
> > That's a real wart in the instruction patching implementation, I think
> > we can fix it by padding with nops in the macros.
> > 
> > Can you just add the additional required nops to the top branch without
> > changing them around for this patch, so it's easier to see what's going
> > on? The end result will be the same after patching. Actually changing
> > these around can have a slight unintended consequence in that code that
> > runs before features were patched will execute the IF code. Not a
> > problem here, but another reason why the instruction patching 
> > restriction is annoying.
> 
> Sure, I will repost this patch with additional nops instead of
> moving them around.
> 

Below is the same patch without rearranging the FTR_SECTION blocks,
but with an extra nop. 

---
 arch/powerpc/kvm/book3s_hv.c            | 2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c52871c..efa7d3e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3433,7 +3433,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	mtspr(SPRN_IC, vcpu->arch.ic);
 	mtspr(SPRN_PID, vcpu->arch.pid);
 
-	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
 	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
 	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 780a499..83a69dc 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -833,12 +833,14 @@ BEGIN_FTR_SECTION
 	mtspr	SPRN_CSIGR, r7
 	mtspr	SPRN_TACR, r8
 	nop
+	nop
 FTR_SECTION_ELSE
 	/* POWER9-only registers */
 	ld	r5, VCPU_TID(r4)
 	ld	r6, VCPU_PSSCR(r4)
 	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
-	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
+	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+	andc	r6, r6, r7
 	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
 	ld	r7, VCPU_HFSCR(r4)
 	mtspr	SPRN_TIDR, r5
-- 
Thanks and Regards
gautham.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-06  9:58         ` David Gibson
@ 2020-04-07 13:37           ` Gautham R Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-07 13:25 UTC (permalink / raw)
  To: David Gibson
  Cc: Gautham R Shenoy, Michael Neuling, Nicholas Piggin,
	Bharata B Rao, linuxppc-dev, kvm-ppc, Vaidyanathan Srinivasan,
	linuxppc-dev

Hello David,

On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > 
> > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > scheduling in the guest vCPU.
> > > > 
> > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > set. This patch changes the behaviour to enter the guest with
> > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > unconditionally clear these bits. Ideally this should be done
> > > > conditionally on platforms where the guest stop instruction has no
> > > > Bugs (starting POWER9 DD2.3).
> > > 
> > > How will guests know that they can use this facility safely after your
> > > series? You need both DD2.3 and a patched KVM.
> > 
> > 
> > Yes, this is something that isn't addressed in this series (mentioned
> > in the cover letter), which is a POC demonstrating that the stop0lite
> > state in guest works.
> > 
> > However, to answer your question, this is the scheme that I had in
> > mind :
> > 
> > OPAL:
> >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> >        we set bool enable_guest_stop = true;
> > 
> >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> >        enable_guest_stop == true.
> > 
> >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > 
> > QEMU:
> >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> >    indicate the presence to the guest via device tree.
> 
> Nack.  Presenting different capabilities to the guest depending on
> host capabilities (rather than explicit options) is never ok.  It
> means that depending on the system you start on you may or may not be
> able to migrate to other systems that you're supposed to be able to,

I agree that blocking migration for the unavailability of this feature
is not desirable. Could you point me to some other capabilities in KVM
which have been implemented via explicit options?

The ISA 3.0 allows the guest to execute the "stop" instruction. If the
Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
"stop" instruction in the causes a Hypervisor Facility Unavailable
Exception, thus giving the hypervisor a chance to emulate the
instruction. However, in the current code, when the hypervisor
receives this exception, it sends a PROGKILL to the guest which
results in crashing the guest.

Patch 1 of this series emulates wakeup from the "stop"
instruction. Would the following scheme be ok?

OPAL:
	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:

	   If "idle-stop-guest" dt feature is available, then, before
	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
	   bits allowing the guest to safely execute stop instruction.

	   If "idle-stop-guest" dt feature is not available, then, the
	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
	   guest "stop" instruction execution to trap back into the
	   hypervisor. We then emulate a wakeup from the stop
	   instruction (Patch 1 of this series).

Guest Kernel:
      If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
      stop0lite cpuidle state.

This allows us to migrate the KVM guest across any POWER9
Hypervisor. The minimal addition that the Hypervisor would need is
Patch 1 of this series.
	   



--
Thanks and Regards
gautham.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-07 13:37           ` Gautham R Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-07 13:37 UTC (permalink / raw)
  To: David Gibson
  Cc: Gautham R Shenoy, Michael Neuling, Nicholas Piggin,
	Bharata B Rao, linuxppc-dev, kvm-ppc, Vaidyanathan Srinivasan,
	linuxppc-dev

Hello David,

On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > 
> > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > scheduling in the guest vCPU.
> > > > 
> > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > set. This patch changes the behaviour to enter the guest with
> > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > unconditionally clear these bits. Ideally this should be done
> > > > conditionally on platforms where the guest stop instruction has no
> > > > Bugs (starting POWER9 DD2.3).
> > > 
> > > How will guests know that they can use this facility safely after your
> > > series? You need both DD2.3 and a patched KVM.
> > 
> > 
> > Yes, this is something that isn't addressed in this series (mentioned
> > in the cover letter), which is a POC demonstrating that the stop0lite
> > state in guest works.
> > 
> > However, to answer your question, this is the scheme that I had in
> > mind :
> > 
> > OPAL:
> >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> >        we set bool enable_guest_stop = true;
> > 
> >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> >        enable_guest_stop = true.
> > 
> >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> >        KVM_CAP_STOP, return true iff enable_guest_top = true.
> > 
> > QEMU:
> >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> >    indicate the presence to the guest via device tree.
> 
> Nack.  Presenting different capabilities to the guest depending on
> host capabilities (rather than explicit options) is never ok.  It
> means that depending on the system you start on you may or may not be
> able to migrate to other systems that you're supposed to be able to,

I agree that blocking migration for the unavailability of this feature
is not desirable. Could you point me to some other capabilities in KVM
which have been implemented via explicit options?

The ISA 3.0 allows the guest to execute the "stop" instruction. If the
Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
"stop" instruction in the causes a Hypervisor Facility Unavailable
Exception, thus giving the hypervisor a chance to emulate the
instruction. However, in the current code, when the hypervisor
receives this exception, it sends a PROGKILL to the guest which
results in crashing the guest.

Patch 1 of this series emulates wakeup from the "stop"
instruction. Would the following scheme be ok?

OPAL:
	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:

	   If "idle-stop-guest" dt feature is available, then, before
	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
	   bits allowing the guest to safely execute stop instruction.

	   If "idle-stop-guest" dt feature is not available, then, the
	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
	   guest "stop" instruction execution to trap back into the
	   hypervisor. We then emulate a wakeup from the stop
	   instruction (Patch 1 of this series).

Guest Kernel:
      If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
      stop0lite cpuidle state.

This allows us to migrate the KVM guest across any POWER9
Hypervisor. The minimal addition that the Hypervisor would need is
Patch 1 of this series.
	   



--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-07 13:37           ` Gautham R Shenoy
@ 2020-04-08  2:29             ` David Gibson
  -1 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2020-04-08  2:29 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 5241 bytes --]

On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> Hello David,
> 
> On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > 
> > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > scheduling in the guest vCPU.
> > > > > 
> > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > set. This patch changes the behaviour to enter the guest with
> > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > unconditionally clear these bits. Ideally this should be done
> > > > > conditionally on platforms where the guest stop instruction has no
> > > > > Bugs (starting POWER9 DD2.3).
> > > > 
> > > > How will guests know that they can use this facility safely after your
> > > > series? You need both DD2.3 and a patched KVM.
> > > 
> > > 
> > > Yes, this is something that isn't addressed in this series (mentioned
> > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > state in guest works.
> > > 
> > > However, to answer your question, this is the scheme that I had in
> > > mind :
> > > 
> > > OPAL:
> > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > 
> > > Hypervisor Kernel:
> > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > >        we set bool enable_guest_stop = true;
> > > 
> > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > >        enable_guest_stop == true.
> > > 
> > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > 
> > > QEMU:
> > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > >    indicate the presence to the guest via device tree.
> > 
> > Nack.  Presenting different capabilities to the guest depending on
> > host capabilities (rather than explicit options) is never ok.  It
> > means that depending on the system you start on you may or may not be
> > able to migrate to other systems that you're supposed to be able to,
> 
> I agree that blocking migration for the unavailability of this feature
> is not desirable. Could you point me to some other capabilities in KVM
> which have been implemented via explicit options?

TBH, most of the options for the 'pseries' machine type are in this
category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
resizing extension), ic-mode (which irq controllers are available to
the guest).

> The ISA 3.0 allows the guest to execute the "stop" instruction.

So, this was a bug in DD2.2's implementation of the architecture?

> If the
> Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> "stop" instruction in the causes a Hypervisor Facility Unavailable
> Exception, thus giving the hypervisor a chance to emulate the
> instruction. However, in the current code, when the hypervisor
> receives this exception, it sends a PROGKILL to the guest which
> results in crashing the guest.
> 
> Patch 1 of this series emulates wakeup from the "stop"
> instruction. Would the following scheme be ok?
> 
> OPAL:
> 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> 
> Hypervisor Kernel:
> 
> 	   If "idle-stop-guest" dt feature is available, then, before
> 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> 	   bits allowing the guest to safely execute stop instruction.
> 
> 	   If "idle-stop-guest" dt feature is not available, then, the
> 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> 	   guest "stop" instruction execution to trap back into the
> 	   hypervisor. We then emulate a wakeup from the stop
> 	   instruction (Patch 1 of this series).
> 
> Guest Kernel:
>       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
>       stop0lite cpuidle state.
> 
> This allows us to migrate the KVM guest across any POWER9
> Hypervisor. The minimal addition that the Hypervisor would need is
> Patch 1 of this series.

That could be workable.  Some caveats, though:

 * How does the latency of the trap-and-emulate compare to the guest
   using H_CEDE in the first place?  i.e. how big a negative impact
   will this have for guests running on DD2.2 hosts?

 * We'll only be able to enable this in a new qemu machine type
   version (say, pseries-5.1.0).  Otherwise a guest could start
   thinking it can use stop states, then be migrated to an older qemu
   or host kernel without the support and crash.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-08  2:29             ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2020-04-08  2:29 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 5241 bytes --]

On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> Hello David,
> 
> On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > 
> > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > scheduling in the guest vCPU.
> > > > > 
> > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > set. This patch changes the behaviour to enter the guest with
> > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > unconditionally clear these bits. Ideally this should be done
> > > > > conditionally on platforms where the guest stop instruction has no
> > > > > Bugs (starting POWER9 DD2.3).
> > > > 
> > > > How will guests know that they can use this facility safely after your
> > > > series? You need both DD2.3 and a patched KVM.
> > > 
> > > 
> > > Yes, this is something that isn't addressed in this series (mentioned
> > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > state in guest works.
> > > 
> > > However, to answer your question, this is the scheme that I had in
> > > mind :
> > > 
> > > OPAL:
> > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > 
> > > Hypervisor Kernel:
> > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > >        we set bool enable_guest_stop = true;
> > > 
> > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > >        enable_guest_stop == true.
> > > 
> > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > 
> > > QEMU:
> > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > >    indicate the presence to the guest via device tree.
> > 
> > Nack.  Presenting different capabilities to the guest depending on
> > host capabilities (rather than explicit options) is never ok.  It
> > means that depending on the system you start on you may or may not be
> > able to migrate to other systems that you're supposed to be able to,
> 
> I agree that blocking migration for the unavailability of this feature
> is not desirable. Could you point me to some other capabilities in KVM
> which have been implemented via explicit options?

TBH, most of the options for the 'pseries' machine type are in this
category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
resizing extension), ic-mode (which irq controllers are available to
the guest).

> The ISA 3.0 allows the guest to execute the "stop" instruction.

So, this was a bug in DD2.2's implementation of the architecture?

> If the
> Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> "stop" instruction in the causes a Hypervisor Facility Unavailable
> Exception, thus giving the hypervisor a chance to emulate the
> instruction. However, in the current code, when the hypervisor
> receives this exception, it sends a PROGKILL to the guest which
> results in crashing the guest.
> 
> Patch 1 of this series emulates wakeup from the "stop"
> instruction. Would the following scheme be ok?
> 
> OPAL:
> 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> 
> Hypervisor Kernel:
> 
> 	   If "idle-stop-guest" dt feature is available, then, before
> 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> 	   bits allowing the guest to safely execute stop instruction.
> 
> 	   If "idle-stop-guest" dt feature is not available, then, the
> 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> 	   guest "stop" instruction execution to trap back into the
> 	   hypervisor. We then emulate a wakeup from the stop
> 	   instruction (Patch 1 of this series).
> 
> Guest Kernel:
>       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
>       stop0lite cpuidle state.
> 
> This allows us to migrate the KVM guest across any POWER9
> Hypervisor. The minimal addition that the Hypervisor would need is
> Patch 1 of this series.

That could be workable.  Some caveats, though:

 * How does the latency of the trap-and-emulate compare to the guest
   using H_CEDE in the first place?  i.e. how big a negative impact
   will this have for guests running on DD2.2 hosts?

 * We'll only be able to enable this in a new qemu machine type
   version (say, pseries-5.1.0).  Otherwise a guest could start
   thinking it can use stop states, then be migrated to an older qemu
   or host kernel without the support and crash.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-08  2:29             ` David Gibson
@ 2020-04-13 10:37               ` Gautham R Shenoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-13 10:25 UTC (permalink / raw)
  To: David Gibson
  Cc: Gautham R Shenoy, Michael Neuling, Nicholas Piggin,
	Bharata B Rao, linuxppc-dev, kvm-ppc, Vaidyanathan Srinivasan,
	linuxppc-dev

Hello David,

On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > Hello David,
> > 
> > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > 
> > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > scheduling in the guest vCPU.
> > > > > > 
> > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > Bugs (starting POWER9 DD2.3).
> > > > > 
> > > > > How will guests know that they can use this facility safely after your
> > > > > series? You need both DD2.3 and a patched KVM.
> > > > 
> > > > 
> > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > state in guest works.
> > > > 
> > > > However, to answer your question, this is the scheme that I had in
> > > > mind :
> > > > 
> > > > OPAL:
> > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > 
> > > > Hypervisor Kernel:
> > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > >        we set bool enable_guest_stop = true;
> > > > 
> > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > >        enable_guest_stop == true.
> > > > 
> > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > > 
> > > > QEMU:
> > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > >    indicate the presence to the guest via device tree.
> > > 
> > > Nack.  Presenting different capabilities to the guest depending on
> > > host capabilities (rather than explicit options) is never ok.  It
> > > means that depending on the system you start on you may or may not be
> > > able to migrate to other systems that you're supposed to be able to,
> > 
> > I agree that blocking migration for the unavailability of this feature
> > is not desirable. Could you point me to some other capabilities in KVM
> > which have been implemented via explicit options?
> 
> TBH, most of the options for the 'pseries' machine type are in this
> category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> resizing extension), ic-mode (which irq controllers are available to
> the guest).


Thanks. I will follow this suit.

> 
> > The ISA 3.0 allows the guest to execute the "stop" instruction.
> 
> So, this was a bug in DD2.2's implementation of the architecture?

Yes, the previous versions could miss wakeup events when stop was
executed in HV=0,PR=0 mode. So, the hypervisor had to block that.


> 
> > If the
> > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > Exception, thus giving the hypervisor a chance to emulate the
> > instruction. However, in the current code, when the hypervisor
> > receives this exception, it sends a PROGKILL to the guest which
> > results in crashing the guest.
> > 
> > Patch 1 of this series emulates wakeup from the "stop"
> > instruction. Would the following scheme be ok?
> > 
> > OPAL:
> > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> > 
> > 	   If "idle-stop-guest" dt feature is available, then, before
> > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > 	   bits allowing the guest to safely execute stop instruction.
> > 
> > 	   If "idle-stop-guest" dt feature is not available, then, the
> > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > 	   guest "stop" instruction execution to trap back into the
> > 	   hypervisor. We then emulate a wakeup from the stop
> > 	   instruction (Patch 1 of this series).
> > 
> > Guest Kernel:
> >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> >       stop0lite cpuidle state.
> > 
> > This allows us to migrate the KVM guest across any POWER9
> > Hypervisor. The minimal addition that the Hypervisor would need is
> > Patch 1 of this series.
> 
> That could be workable.  Some caveats, though:
> 
>  * How does the latency of the trap-and-emulate compare to the guest
>    using H_CEDE in the first place?  i.e. how big a negative impact
>    will this have for guests running on DD2.2 hosts?


The wakeup latency of trap-and-emulated stop0lite (referred to as
"stop0lite Emulated" in the tables below) the compares favorably
compared to H_CEDE. It is in the order of 5-6us while the wakeup
latency of H_CEDE is ~25-30us.

======================================================================
Wakeup Latency measured using a timer (in ns) [Lower is better]
======================================================================
Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
----------------------------------------------------------------------
snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
----------------------------------------------------------------------
stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
----------------------------------------------------------------------
stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
Emulated   |             |         |        |        |       |       |
----------------------------------------------------------------------
Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
======================================================================


======================================================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
======================================================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
----------------------------------------------------------------------
snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
----------------------------------------------------------------------
stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
----------------------------------------------------------------------
stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
Emulated   |        |         |        |        |         |          |
----------------------------------------------------------------------
Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
======================================================================

> 
>  * We'll only be able to enable this in a new qemu machine type
>    version (say, pseries-5.1.0).  Otherwise a guest could start
>    thinking it can use stop states, then be migrated to an older qemu
>    or host kernel without the support and crash.

That makese sense. In fact in the case of not being able to backport
Patch 1 to all the older hypervisor kernels, we will need a way of
gating the guest from using stop-states and then migrating onto an
older hypervisor kernel. Associating this with a new qemu machine type
version should solve this problem, assuming that all the newer qemus
will also be running on newer hypervisor kernels.



> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-13 10:37               ` Gautham R Shenoy
  0 siblings, 0 replies; 28+ messages in thread
From: Gautham R Shenoy @ 2020-04-13 10:37 UTC (permalink / raw)
  To: David Gibson
  Cc: Gautham R Shenoy, Michael Neuling, Nicholas Piggin,
	Bharata B Rao, linuxppc-dev, kvm-ppc, Vaidyanathan Srinivasan,
	linuxppc-dev

Hello David,

On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > Hello David,
> > 
> > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > 
> > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > scheduling in the guest vCPU.
> > > > > > 
> > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > Bugs (starting POWER9 DD2.3).
> > > > > 
> > > > > How will guests know that they can use this facility safely after your
> > > > > series? You need both DD2.3 and a patched KVM.
> > > > 
> > > > 
> > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > state in guest works.
> > > > 
> > > > However, to answer your question, this is the scheme that I had in
> > > > mind :
> > > > 
> > > > OPAL:
> > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > 
> > > > Hypervisor Kernel:
> > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > >        we set bool enable_guest_stop = true;
> > > > 
> > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > >        enable_guest_stop = true.
> > > > 
> > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > >        KVM_CAP_STOP, return true iff enable_guest_top = true.
> > > > 
> > > > QEMU:
> > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > >    indicate the presence to the guest via device tree.
> > > 
> > > Nack.  Presenting different capabilities to the guest depending on
> > > host capabilities (rather than explicit options) is never ok.  It
> > > means that depending on the system you start on you may or may not be
> > > able to migrate to other systems that you're supposed to be able to,
> > 
> > I agree that blocking migration for the unavailability of this feature
> > is not desirable. Could you point me to some other capabilities in KVM
> > which have been implemented via explicit options?
> 
> TBH, most of the options for the 'pseries' machine type are in this
> category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> resizing extension), ic-mode (which irq controllers are available to
> the guest).


Thanks. I will follow this suit.

> 
> > The ISA 3.0 allows the guest to execute the "stop" instruction.
> 
> So, this was a bug in DD2.2's implementation of the architecture?

Yes, the previous versions could miss wakeup events when stop was
executed in HV=0,PR=0 mode. So, the hypervisor had to block that.


> 
> > If the
> > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > Exception, thus giving the hypervisor a chance to emulate the
> > instruction. However, in the current code, when the hypervisor
> > receives this exception, it sends a PROGKILL to the guest which
> > results in crashing the guest.
> > 
> > Patch 1 of this series emulates wakeup from the "stop"
> > instruction. Would the following scheme be ok?
> > 
> > OPAL:
> > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> > 
> > 	   If "idle-stop-guest" dt feature is available, then, before
> > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > 	   bits allowing the guest to safely execute stop instruction.
> > 
> > 	   If "idle-stop-guest" dt feature is not available, then, the
> > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > 	   guest "stop" instruction execution to trap back into the
> > 	   hypervisor. We then emulate a wakeup from the stop
> > 	   instruction (Patch 1 of this series).
> > 
> > Guest Kernel:
> >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> >       stop0lite cpuidle state.
> > 
> > This allows us to migrate the KVM guest across any POWER9
> > Hypervisor. The minimal addition that the Hypervisor would need is
> > Patch 1 of this series.
> 
> That could be workable.  Some caveats, though:
> 
>  * How does the latency of the trap-and-emulate compare to the guest
>    using H_CEDE in the first place?  i.e. how big a negative impact
>    will this have for guests running on DD2.2 hosts?


The wakeup latency of trap-and-emulated stop0lite (referred to as
"stop0lite Emulated" in the tables below) the compares favorably
compared to H_CEDE. It is in the order of 5-6us while the wakeup
latency of H_CEDE is ~25-30us.

===================================
Wakeup Latency measured using a timer (in ns) [Lower is better]
===================================
Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
----------------------------------------------------------------------
snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
----------------------------------------------------------------------
stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
----------------------------------------------------------------------
stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
Emulated   |             |         |        |        |       |       |
----------------------------------------------------------------------
Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
===================================


===================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
===================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
----------------------------------------------------------------------
snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
----------------------------------------------------------------------
stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
----------------------------------------------------------------------
stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
Emulated   |        |         |        |        |         |          |
----------------------------------------------------------------------
Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
===================================

> 
>  * We'll only be able to enable this in a new qemu machine type
>    version (say, pseries-5.1.0).  Otherwise a guest could start
>    thinking it can use stop states, then be migrated to an older qemu
>    or host kernel without the support and crash.

That makese sense. In fact in the case of not being able to backport
Patch 1 to all the older hypervisor kernels, we will need a way of
gating the guest from using stop-states and then migrating onto an
older hypervisor kernel. Associating this with a new qemu machine type
version should solve this problem, assuming that all the newer qemus
will also be running on newer hypervisor kernels.



> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
  2020-04-13 10:37               ` Gautham R Shenoy
@ 2020-04-14  2:17                 ` David Gibson
  -1 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2020-04-14  2:17 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 9541 bytes --]

On Mon, Apr 13, 2020 at 03:55:49PM +0530, Gautham R Shenoy wrote:
> Hello David,
> 
> On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> > On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > > Hello David,
> > > 
> > > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > > scheduling in the guest vCPU.
> > > > > > > 
> > > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > > Bugs (starting POWER9 DD2.3).
> > > > > > 
> > > > > > How will guests know that they can use this facility safely after your
> > > > > > series? You need both DD2.3 and a patched KVM.
> > > > > 
> > > > > 
> > > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > > state in guest works.
> > > > > 
> > > > > However, to answer your question, this is the scheme that I had in
> > > > > mind :
> > > > > 
> > > > > OPAL:
> > > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > > 
> > > > > Hypervisor Kernel:
> > > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > > >        we set bool enable_guest_stop = true;
> > > > > 
> > > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > > >        enable_guest_stop == true.
> > > > > 
> > > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > > > 
> > > > > QEMU:
> > > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > > >    indicate the presence to the guest via device tree.
> > > > 
> > > > Nack.  Presenting different capabilities to the guest depending on
> > > > host capabilities (rather than explicit options) is never ok.  It
> > > > means that depending on the system you start on you may or may not be
> > > > able to migrate to other systems that you're supposed to be able to,
> > > 
> > > I agree that blocking migration for the unavailability of this feature
> > > is not desirable. Could you point me to some other capabilities in KVM
> > > which have been implemented via explicit options?
> > 
> > TBH, most of the options for the 'pseries' machine type are in this
> > category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> > Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> > guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> > resizing extension), ic-mode (which irq controllers are available to
> > the guest).
> 
> 
> Thanks. I will follow this suit.
> 
> > 
> > > The ISA 3.0 allows the guest to execute the "stop" instruction.
> > 
> > So, this was a bug in DD2.2's implementation of the architecture?
> 
> Yes, the previous versions could miss wakeup events when stop was
> executed in HV=0,PR=0 mode. So, the hypervisor had to block that.
> 
> 
> > 
> > > If the
> > > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > > Exception, thus giving the hypervisor a chance to emulate the
> > > instruction. However, in the current code, when the hypervisor
> > > receives this exception, it sends a PROGKILL to the guest which
> > > results in crashing the guest.
> > > 
> > > Patch 1 of this series emulates wakeup from the "stop"
> > > instruction. Would the following scheme be ok?
> > > 
> > > OPAL:
> > > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > 
> > > Hypervisor Kernel:
> > > 
> > > 	   If "idle-stop-guest" dt feature is available, then, before
> > > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > > 	   bits allowing the guest to safely execute stop instruction.
> > > 
> > > 	   If "idle-stop-guest" dt feature is not available, then, the
> > > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > > 	   guest "stop" instruction execution to trap back into the
> > > 	   hypervisor. We then emulate a wakeup from the stop
> > > 	   instruction (Patch 1 of this series).
> > > 
> > > Guest Kernel:
> > >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> > >       stop0lite cpuidle state.
> > > 
> > > This allows us to migrate the KVM guest across any POWER9
> > > Hypervisor. The minimal addition that the Hypervisor would need is
> > > Patch 1 of this series.
> > 
> > That could be workable.  Some caveats, though:
> > 
> >  * How does the latency of the trap-and-emulate compare to the guest
> >    using H_CEDE in the first place?  i.e. how big a negative impact
> >    will this have for guests running on DD2.2 hosts?
> 
> 
> The wakeup latency of trap-and-emulated stop0lite (referred to as
> "stop0lite Emulated" in the tables below) the compares favorably
> compared to H_CEDE. It is in the order of 5-6us while the wakeup
> latency of H_CEDE is ~25-30us.

Ok.  So allowing the guest to use stop0lite everywhere, but having it
emulated on the older CPUs should work reasonably well.

> 
> ======================================================================
> Wakeup Latency measured using a timer (in ns) [Lower is better]
> ======================================================================
> Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
> ----------------------------------------------------------------------
> snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
> ----------------------------------------------------------------------
> stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
> ----------------------------------------------------------------------
> stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
> Emulated   |             |         |        |        |       |       |
> ----------------------------------------------------------------------
> Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
> ======================================================================
> 
> 
> ======================================================================
> Wakeup latency measured using an IPI (in ns) [Lower is better]
> ======================================================================
> Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
>            |samples |         |        |        |         |          |
> ----------------------------------------------------------------------
> snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
> ----------------------------------------------------------------------
> stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
> ----------------------------------------------------------------------
> stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
> Emulated   |        |         |        |        |         |          |
> ----------------------------------------------------------------------
> Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
> ======================================================================
> 
> > 
> >  * We'll only be able to enable this in a new qemu machine type
> >    version (say, pseries-5.1.0).  Otherwise a guest could start
> >    thinking it can use stop states, then be migrated to an older qemu
> >    or host kernel without the support and crash.
> 
> That makese sense. In fact in the case of not being able to backport
> Patch 1 to all the older hypervisor kernels, we will need a way of
> gating the guest from using stop-states and then migrating onto an
> older hypervisor kernel. Associating this with a new qemu machine type
> version should solve this problem, assuming that all the newer qemus
> will also be running on newer hypervisor kernels.

We can't assume that automatically, but we can enforce it with a
pretty standard mechanism.  The way to do it is this:

 * Make a new spapr capability flag that enables guest use of the lite
   stop instructions
 * In the capability's '.apply' hook, verify that the host kernel can
   support this, and if not fail
 * Enable the new capability by default in the new machine type

So, running the new machine type with default options on an old kernel
will fail with a meaningful error, but existing setups with an old
qemu, or old machine type.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
@ 2020-04-14  2:17                 ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2020-04-14  2:17 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Michael Neuling, Nicholas Piggin, Bharata B Rao, linuxppc-dev,
	kvm-ppc, Vaidyanathan Srinivasan, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 9541 bytes --]

On Mon, Apr 13, 2020 at 03:55:49PM +0530, Gautham R Shenoy wrote:
> Hello David,
> 
> On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> > On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > > Hello David,
> > > 
> > > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > > scheduling in the guest vCPU.
> > > > > > > 
> > > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > > Bugs (starting POWER9 DD2.3).
> > > > > > 
> > > > > > How will guests know that they can use this facility safely after your
> > > > > > series? You need both DD2.3 and a patched KVM.
> > > > > 
> > > > > 
> > > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > > state in guest works.
> > > > > 
> > > > > However, to answer your question, this is the scheme that I had in
> > > > > mind :
> > > > > 
> > > > > OPAL:
> > > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > > 
> > > > > Hypervisor Kernel:
> > > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > > >        we set bool enable_guest_stop = true;
> > > > > 
> > > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > > >        enable_guest_stop == true.
> > > > > 
> > > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > > > 
> > > > > QEMU:
> > > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > > >    indicate the presence to the guest via device tree.
> > > > 
> > > > Nack.  Presenting different capabilities to the guest depending on
> > > > host capabilities (rather than explicit options) is never ok.  It
> > > > means that depending on the system you start on you may or may not be
> > > > able to migrate to other systems that you're supposed to be able to,
> > > 
> > > I agree that blocking migration for the unavailability of this feature
> > > is not desirable. Could you point me to some other capabilities in KVM
> > > which have been implemented via explicit options?
> > 
> > TBH, most of the options for the 'pseries' machine type are in this
> > category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> > Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> > guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> > resizing extension), ic-mode (which irq controllers are available to
> > the guest).
> 
> 
> Thanks. I will follow this suit.
> 
> > 
> > > The ISA 3.0 allows the guest to execute the "stop" instruction.
> > 
> > So, this was a bug in DD2.2's implementation of the architecture?
> 
> Yes, the previous versions could miss wakeup events when stop was
> executed in HV=0,PR=0 mode. So, the hypervisor had to block that.
> 
> 
> > 
> > > If the
> > > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > > Exception, thus giving the hypervisor a chance to emulate the
> > > instruction. However, in the current code, when the hypervisor
> > > receives this exception, it sends a PROGKILL to the guest which
> > > results in crashing the guest.
> > > 
> > > Patch 1 of this series emulates wakeup from the "stop"
> > > instruction. Would the following scheme be ok?
> > > 
> > > OPAL:
> > > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > 
> > > Hypervisor Kernel:
> > > 
> > > 	   If "idle-stop-guest" dt feature is available, then, before
> > > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > > 	   bits allowing the guest to safely execute stop instruction.
> > > 
> > > 	   If "idle-stop-guest" dt feature is not available, then, the
> > > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > > 	   guest "stop" instruction execution to trap back into the
> > > 	   hypervisor. We then emulate a wakeup from the stop
> > > 	   instruction (Patch 1 of this series).
> > > 
> > > Guest Kernel:
> > >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> > >       stop0lite cpuidle state.
> > > 
> > > This allows us to migrate the KVM guest across any POWER9
> > > Hypervisor. The minimal addition that the Hypervisor would need is
> > > Patch 1 of this series.
> > 
> > That could be workable.  Some caveats, though:
> > 
> >  * How does the latency of the trap-and-emulate compare to the guest
> >    using H_CEDE in the first place?  i.e. how big a negative impact
> >    will this have for guests running on DD2.2 hosts?
> 
> 
> The wakeup latency of trap-and-emulated stop0lite (referred to as
> "stop0lite Emulated" in the tables below) the compares favorably
> compared to H_CEDE. It is in the order of 5-6us while the wakeup
> latency of H_CEDE is ~25-30us.

Ok.  So allowing the guest to use stop0lite everywhere, but having it
emulated on the older CPUs should work reasonably well.

> 
> ======================================================================
> Wakeup Latency measured using a timer (in ns) [Lower is better]
> ======================================================================
> Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
> ----------------------------------------------------------------------
> snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
> ----------------------------------------------------------------------
> stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
> ----------------------------------------------------------------------
> stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
> Emulated   |             |         |        |        |       |       |
> ----------------------------------------------------------------------
> Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
> ======================================================================
> 
> 
> ======================================================================
> Wakeup latency measured using an IPI (in ns) [Lower is better]
> ======================================================================
> Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
>            |samples |         |        |        |         |          |
> ----------------------------------------------------------------------
> snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
> ----------------------------------------------------------------------
> stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
> ----------------------------------------------------------------------
> stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
> Emulated   |        |         |        |        |         |          |
> ----------------------------------------------------------------------
> Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
> ======================================================================
> 
> > 
> >  * We'll only be able to enable this in a new qemu machine type
> >    version (say, pseries-5.1.0).  Otherwise a guest could start
> >    thinking it can use stop states, then be migrated to an older qemu
> >    or host kernel without the support and crash.
> 
> That makese sense. In fact in the case of not being able to backport
> Patch 1 to all the older hypervisor kernels, we will need a way of
> gating the guest from using stop-states and then migrating onto an
> older hypervisor kernel. Associating this with a new qemu machine type
> version should solve this problem, assuming that all the newer qemus
> will also be running on newer hypervisor kernels.

We can't assume that automatically, but we can enforce it with a
pretty standard mechanism.  The way to do it is this:

 * Make a new spapr capability flag that enables guest use of the lite
   stop instructions
 * In the capability's '.apply' hook, verify that the host kernel can
   support this, and if not fail
 * Enable the new capability by default in the new machine type

So, running the new machine type with default options on an old kernel
will fail with a meaningful error, but existing setups with an old
qemu, or old machine type.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-04-14  2:51 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-31 12:10 [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest Gautham R. Shenoy
2020-03-31 12:22 ` Gautham R. Shenoy
2020-03-31 12:10 ` [RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop Gautham R. Shenoy
2020-03-31 12:22   ` Gautham R. Shenoy
2020-04-03  2:15   ` Nicholas Piggin
2020-04-03  2:15     ` Nicholas Piggin
2020-03-31 12:10 ` [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry Gautham R. Shenoy
2020-03-31 12:22   ` Gautham R. Shenoy
2020-04-03  2:20   ` Nicholas Piggin
2020-04-03  2:20     ` Nicholas Piggin
2020-04-03  9:31     ` Gautham R Shenoy
2020-04-03  9:43       ` Gautham R Shenoy
2020-04-06  9:58       ` David Gibson
2020-04-06  9:58         ` David Gibson
2020-04-07 13:25         ` Gautham R Shenoy
2020-04-07 13:37           ` Gautham R Shenoy
2020-04-08  2:29           ` David Gibson
2020-04-08  2:29             ` David Gibson
2020-04-13 10:25             ` Gautham R Shenoy
2020-04-13 10:37               ` Gautham R Shenoy
2020-04-14  2:17               ` David Gibson
2020-04-14  2:17                 ` David Gibson
2020-04-07 12:33       ` Gautham R Shenoy
2020-04-07 12:45         ` Gautham R Shenoy
2020-03-31 12:10 ` [RFC/PATCH 3/3] cpuidle/pseries: Add stop0lite state Gautham R. Shenoy
2020-03-31 12:22   ` Gautham R. Shenoy
2020-03-31 12:14 ` [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest Gautham R Shenoy
2020-03-31 12:26   ` Gautham R Shenoy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.