linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management
@ 2014-11-25 11:17 Shreyas B. Prabhu
  2014-11-25 11:17 ` [PATCH v2 1/4] powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode Shreyas B. Prabhu
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Shreyas B. Prabhu @ 2014-11-25 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Shreyas B. Prabhu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Rafael J. Wysocki, linux-pm, linuxppc-dev,
	Vaidyanathan Srinivasan, Preeti U Murthy

Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the particular
idle state or a deeper one. There are tasks like fastsleep hardware bug
workaround and hypervisor core state save which have to be done only by
the last thread of the core entering deep idle state and similarly tasks
like timebase resync, hypervisor core register restore that have to be
done only by the first thread waking up from these states. 

The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only
suboptimal, but can cause functionality issues when subcores are
involved.

Winkle is a deeper idle state than fastsleep. In this state the power
supply to the chiplet, i.e. the core, private L2 and private L3, is
turned off. This results in a total hypervisor state loss. This patch
set adds support for winkle and provides a way to track the idle states
of the threads of the core, using that to manage the sleep and winkle
idle states.
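
For illustration, the per-core bookkeeping this series introduces looks
roughly like the C sketch below. The PNV_* constants match patch 3; the
function itself is only a sketch, since the kernel does this in assembly
with a lwarx/stwcx. loop.

#include <stdint.h>

/* Constants as defined in patch 3 (asm/cpuidle.h). */
#define PNV_THREAD_RUNNING        0
#define PNV_THREAD_NAP            1
#define PNV_THREAD_SLEEP          2
#define PNV_THREAD_WINKLE         3
#define PNV_CORE_IDLE_LOCK_BIT    0x100
#define PNV_CORE_IDLE_THREAD_BITS 0x0FF

/*
 * One u32 is shared by all threads of a core: bits 0-7 hold one
 * "awake" bit per thread, bit 8 is a lock taken around the
 * once-per-core work. On entry a thread clears its bit; the return
 * value tells it whether it is the last one in, i.e. whether it must
 * run the fastsleep workaround before the core sleeps.
 */
static int core_idle_enter(uint32_t *core_state, int thread_id)
{
	uint32_t bit = 1u << thread_id;
	uint32_t old = __atomic_fetch_and(core_state, ~bit,
					  __ATOMIC_ACQ_REL);

	return ((old & ~bit) & PNV_CORE_IDLE_THREAD_BITS) == 0;
}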


Changes in v2:
--------------
-Using PNV_THREAD_NAP/SLEEP defines while calling power7_powersave_common
-Comment changes based on review
-Rebased on top of 3.18-rc6


Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Paul Mackerras (1):
  powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle
    mode

Preeti U. Murthy (1):
  powerpc/powernv: Enable Offline CPUs to enter deep idle states

Shreyas B. Prabhu (2):
  powernv: cpuidle: Redesign idle states management
  powernv: powerpc: Add winkle support for offline cpus

 arch/powerpc/include/asm/cpuidle.h             |  14 ++
 arch/powerpc/include/asm/opal.h                |  13 +
 arch/powerpc/include/asm/paca.h                |   6 +
 arch/powerpc/include/asm/ppc-opcode.h          |   2 +
 arch/powerpc/include/asm/processor.h           |   1 +
 arch/powerpc/include/asm/reg.h                 |   4 +
 arch/powerpc/kernel/asm-offsets.c              |   6 +
 arch/powerpc/kernel/cpu_setup_power.S          |   4 +
 arch/powerpc/kernel/exceptions-64s.S           |  30 ++-
 arch/powerpc/kernel/idle_power7.S              | 332 +++++++++++++++++++++----
 arch/powerpc/platforms/powernv/opal-wrappers.S |  39 +++
 arch/powerpc/platforms/powernv/powernv.h       |   2 +
 arch/powerpc/platforms/powernv/setup.c         | 160 ++++++++++++
 arch/powerpc/platforms/powernv/smp.c           |  10 +-
 arch/powerpc/platforms/powernv/subcore.c       |  34 +++
 arch/powerpc/platforms/powernv/subcore.h       |   1 +
 drivers/cpuidle/cpuidle-powernv.c              |  10 +-
 17 files changed, 608 insertions(+), 60 deletions(-)
 create mode 100644 arch/powerpc/include/asm/cpuidle.h

-- 
1.9.3



* [PATCH v2 1/4] powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode
  2014-11-25 11:17 [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
@ 2014-11-25 11:17 ` Shreyas B. Prabhu
  2014-11-25 11:17 ` [PATCH v2 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states Shreyas B. Prabhu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Shreyas B. Prabhu @ 2014-11-25 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Mackerras, Shreyas B. Prabhu, Benjamin Herrenschmidt,
	Michael Ellerman, linuxppc-dev

From: Paul Mackerras <paulus@samba.org>

Currently, when going idle, we set the flag indicating that we are in
nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
(or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
for two reasons: (a) the architecture specifies that those instructions
must be executed with the MMU off, and in fact with only the SF, HV, ME
and possibly RI bits set, and (b) this introduces a race, because as
soon as we set the flag, another thread can switch the MMU to a guest
context.  If the race is lost, this thread will typically start looping
on relocation-on ISIs at 0xc...4400.

This fixes it by setting the MSR as required by the architecture before
setting the flag or executing the nap/sleep/rvwinkle instruction.
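
To make the ordering concrete, here is a compilable C sketch of the
before/after sequences. All helpers here are stand-ins rather than
kernel APIs; the real sequence lives in power7_powersave_common in
idle_power7.S.

#include <stdatomic.h>

#define KVM_HWTHREAD_IN_KERNEL	0
#define KVM_HWTHREAD_IN_NAP	1

static _Atomic int hwthread_state = KVM_HWTHREAD_IN_KERNEL;

static void switch_to_real_mode(void) { /* mtmsrd + rfid in the real code */ }
static void execute_nap(void)         { /* the nap instruction */ }

/* Buggy ordering: once the flag is visible, KVM may switch the MMU to
 * a guest context -- but nap must be executed with the MMU off. */
static void idle_entry_old(void)
{
	atomic_store(&hwthread_state, KVM_HWTHREAD_IN_NAP);
	/* window: another thread can flip our MMU context here */
	execute_nap();
}

/* Fixed ordering: enter real mode first, then advertise the state. */
static void idle_entry_new(void)
{
	switch_to_real_mode();
	atomic_store(&hwthread_state, KVM_HWTHREAD_IN_NAP);
	execute_nap();
}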

[ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/reg.h    |  2 ++
 arch/powerpc/kernel/idle_power7.S | 18 +++++++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c998279..a68ee15 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -118,8 +118,10 @@
 #define __MSR		(MSR_ME | MSR_RI | MSR_IR | MSR_DR | MSR_ISF |MSR_HV)
 #ifdef __BIG_ENDIAN__
 #define MSR_		__MSR
+#define MSR_IDLE	(MSR_ME | MSR_SF | MSR_HV)
 #else
 #define MSR_		(__MSR | MSR_LE)
+#define MSR_IDLE	(MSR_ME | MSR_SF | MSR_HV | MSR_LE)
 #endif
 #define MSR_KERNEL	(MSR_ | MSR_64BIT)
 #define MSR_USER32	(MSR_ | MSR_PR | MSR_EE)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index c0754bb..283c603 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -101,7 +101,23 @@ _GLOBAL(power7_powersave_common)
 	std	r9,_MSR(r1)
 	std	r1,PACAR1(r13)
 
-_GLOBAL(power7_enter_nap_mode)
+	/*
+	 * Go to real mode to do the nap, as required by the architecture.
+	 * Also, we need to be in real mode before setting hwthread_state,
+	 * because as soon as we do that, another thread can switch
+	 * the MMU context to the guest.
+	 */
+	LOAD_REG_IMMEDIATE(r5, MSR_IDLE)
+	li	r6, MSR_RI
+	andc	r6, r9, r6
+	LOAD_REG_ADDR(r7, power7_enter_nap_mode)
+	mtmsrd	r6, 1		/* clear RI before setting SRR0/1 */
+	mtspr	SPRN_SRR0, r7
+	mtspr	SPRN_SRR1, r5
+	rfid
+
+	.globl	power7_enter_nap_mode
+power7_enter_nap_mode:
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	/* Tell KVM we're napping */
 	li	r4,KVM_HWTHREAD_IN_NAP
-- 
1.9.3



* [PATCH v2 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states
  2014-11-25 11:17 [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
  2014-11-25 11:17 ` [PATCH v2 1/4] powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode Shreyas B. Prabhu
@ 2014-11-25 11:17 ` Shreyas B. Prabhu
  2014-11-25 11:17 ` [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Shreyas B. Prabhu @ 2014-11-25 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Preeti U. Murthy, Shreyas B. Prabhu, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Rafael J. Wysocki, linux-pm,
	linuxppc-dev

From: "Preeti U. Murthy" <preeti@linux.vnet.ibm.com>

The secondary threads should enter deep idle states so as to gain maximum
powersavings when the entire core is offline. To do so the offline path
must be made aware of the available deepest idle state. Hence probe the
device tree for the possible idle states in powernv core code and
expose the deepest idle state through flags.

Since the device tree is probed by the cpuidle driver as well, move
the parameters required to discover the idle states into a place
common to both the driver and the powernv core code.

Another point is that the fastsleep idle state may require a workaround
in the kernel to function properly. This workaround is introduced in a
subsequent patch. However, neither the cpuidle driver nor the hotplug
path need be bothered about it; it is taken care of by the core powernv
code.
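
As a simplified C view of the consumer side (mirroring the smp.c hunk
below), the offline loop just picks the deepest state the platform
advertised; the two stubs stand in for the real idle entry routines.

#include <stdint.h>

#define OPAL_PM_NAP_ENABLED	0x00010000
#define OPAL_PM_SLEEP_ENABLED	0x00020000

static void power7_sleep(void)      { /* stub for illustration */ }
static void power7_nap(int chk_irq) { (void)chk_irq; /* stub */ }

static void offline_idle_once(uint32_t idle_states)
{
	if (idle_states & OPAL_PM_SLEEP_ENABLED)
		power7_sleep();
	else
		power7_nap(1);
}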

Originally-by: Srivatsa S. Bhat <srivatsa@mit.edu>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/opal.h          |  8 ++++++
 arch/powerpc/platforms/powernv/powernv.h |  2 ++
 arch/powerpc/platforms/powernv/setup.c   | 49 ++++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/smp.c     |  7 ++++-
 drivers/cpuidle/cpuidle-powernv.c        |  9 ++----
 5 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9124b0e..f8b95c0 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -155,6 +155,14 @@ struct opal_sg_list {
 #define OPAL_REGISTER_DUMP_REGION		101
 #define OPAL_UNREGISTER_DUMP_REGION		102
 
+/* Device tree flags */
+
+/* Flags set in power-mgmt nodes in device tree if
+ * respective idle states are supported in the platform.
+ */
+#define OPAL_PM_NAP_ENABLED	0x00010000
+#define OPAL_PM_SLEEP_ENABLED	0x00020000
+
 #ifndef __ASSEMBLY__
 
 #include <linux/notifier.h>
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index 6c8e2d1..604c48e 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -29,6 +29,8 @@ static inline u64 pnv_pci_dma_get_required_mask(struct pci_dev *pdev)
 }
 #endif
 
+extern u32 pnv_get_supported_cpuidle_states(void);
+
 extern void pnv_lpc_init(void);
 
 bool cpu_core_split_required(void);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 3f9546d..34c6665 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -290,6 +290,55 @@ static void __init pnv_setup_machdep_rtas(void)
 }
 #endif /* CONFIG_PPC_POWERNV_RTAS */
 
+static u32 supported_cpuidle_states;
+
+u32 pnv_get_supported_cpuidle_states(void)
+{
+	return supported_cpuidle_states;
+}
+
+static int __init pnv_init_idle_states(void)
+{
+	struct device_node *power_mgt;
+	int dt_idle_states;
+	const __be32 *idle_state_flags;
+	u32 len_flags, flags;
+	int i;
+
+	supported_cpuidle_states = 0;
+
+	if (cpuidle_disable != IDLE_NO_OVERRIDE)
+		return 0;
+
+	if (!firmware_has_feature(FW_FEATURE_OPALv3))
+		return 0;
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("opal: PowerMgmt Node not found\n");
+		return 0;
+	}
+
+	idle_state_flags = of_get_property(power_mgt,
+			"ibm,cpu-idle-state-flags", &len_flags);
+	if (!idle_state_flags) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+		return 0;
+	}
+
+	dt_idle_states = len_flags / sizeof(u32);
+
+	for (i = 0; i < dt_idle_states; i++) {
+		flags = be32_to_cpu(idle_state_flags[i]);
+		supported_cpuidle_states |= flags;
+	}
+
+	return 0;
+}
+
+subsys_initcall(pnv_init_idle_states);
+
+
 static int __init pnv_probe(void)
 {
 	unsigned long root = of_get_flat_dt_root();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 4753958..3dc4cec 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -149,6 +149,7 @@ static int pnv_smp_cpu_disable(void)
 static void pnv_smp_cpu_kill_self(void)
 {
 	unsigned int cpu;
+	u32 idle_states;
 
 	/* Standard hot unplug procedure */
 	local_irq_disable();
@@ -159,13 +160,17 @@ static void pnv_smp_cpu_kill_self(void)
 	generic_set_cpu_dead(cpu);
 	smp_wmb();
 
+	idle_states = pnv_get_supported_cpuidle_states();
 	/* We don't want to take decrementer interrupts while we are offline,
 	 * so clear LPCR:PECE1. We keep PECE2 enabled.
 	 */
 	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
 	while (!generic_check_cpu_restart(cpu)) {
 		ppc64_runlatch_off();
-		power7_nap(1);
+		if (idle_states & OPAL_PM_SLEEP_ENABLED)
+			power7_sleep();
+		else
+			power7_nap(1);
 		ppc64_runlatch_on();
 
 		/* Clear the IPI that woke us up */
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 7d3a349..0a7d827 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -16,13 +16,10 @@
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
+#include <asm/opal.h>
 #include <asm/runlatch.h>
 
-/* Flags and constants used in PowerNV platform */
-
 #define MAX_POWERNV_IDLE_STATES	8
-#define IDLE_USE_INST_NAP	0x00010000 /* Use nap instruction */
-#define IDLE_USE_INST_SLEEP	0x00020000 /* Use sleep instruction */
 
 struct cpuidle_driver powernv_idle_driver = {
 	.name             = "powernv_idle",
@@ -198,7 +195,7 @@ static int powernv_add_idle_states(void)
 		 * target residency to be 10x exit_latency
 		 */
 		latency_ns = be32_to_cpu(idle_state_latency[i]);
-		if (flags & IDLE_USE_INST_NAP) {
+		if (flags & OPAL_PM_NAP_ENABLED) {
 			/* Add NAP state */
 			strcpy(powernv_states[nr_idle_states].name, "Nap");
 			strcpy(powernv_states[nr_idle_states].desc, "Nap");
@@ -211,7 +208,7 @@ static int powernv_add_idle_states(void)
 			nr_idle_states++;
 		}
 
-		if (flags & IDLE_USE_INST_SLEEP) {
+		if (flags & OPAL_PM_SLEEP_ENABLED) {
 			/* Add FASTSLEEP state */
 			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
 			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
-- 
1.9.3



* [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
  2014-11-25 11:17 [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
  2014-11-25 11:17 ` [PATCH v2 1/4] powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode Shreyas B. Prabhu
  2014-11-25 11:17 ` [PATCH v2 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states Shreyas B. Prabhu
@ 2014-11-25 11:17 ` Shreyas B. Prabhu
  2014-11-27  0:37   ` Benjamin Herrenschmidt
  2014-11-27 23:50   ` Paul Mackerras
  2014-11-25 11:17 ` [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus Shreyas B. Prabhu
  2014-11-26  5:15 ` [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Preeti U Murthy
  4 siblings, 2 replies; 12+ messages in thread
From: Shreyas B. Prabhu @ 2014-11-25 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Shreyas B. Prabhu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Rafael J. Wysocki, linux-pm, linuxppc-dev

Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the
particular idle state or a deeper one. There are tasks like fastsleep
hardware bug workaround and hypervisor core state save which have to be
done only by the last thread of the core entering deep idle state and
similarly tasks like timebase resync, hypervisor core register restore
that have to be done only by the first thread waking up from these
states.

The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only
suboptimal, but can cause functionality issues when subcores and KVM
are involved.

This patch adds the necessary infrastructure to track idle states of
threads in a per-core structure. It uses this info to perform tasks like
fastsleep workaround and timebase resync only once per core.
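
The wakeup side of that bookkeeping is the subtle part; a C rendering
of the patch's lwarx_loop2 is sketched below. Names mirror the patch,
but the function is illustrative, not the implementation.

#include <stdbool.h>
#include <stdint.h>

#define PNV_CORE_IDLE_LOCK_BIT    0x100u
#define PNV_CORE_IDLE_THREAD_BITS 0x0ffu

/*
 * On wakeup: spin while the lock bit is held, set our thread bit, and
 * if we are the first thread of the core to wake, take the lock bit
 * too. The first waker then resyncs the timebase (and undoes the
 * fastsleep workaround) and finally clears the lock with a plain
 * store, exactly as the clear_lock label in the patch does.
 */
static bool core_idle_exit_is_first(uint32_t *core_state,
				    uint32_t thread_bit)
{
	uint32_t old, newval;

	for (;;) {
		old = __atomic_load_n(core_state, __ATOMIC_ACQUIRE);
		if (old & PNV_CORE_IDLE_LOCK_BIT)
			continue;	/* once-per-core work in progress */
		newval = old | thread_bit;
		if (old == 0)		/* first waker takes the lock */
			newval |= PNV_CORE_IDLE_LOCK_BIT;
		if (__atomic_compare_exchange_n(core_state, &old, newval,
						false, __ATOMIC_ACQ_REL,
						__ATOMIC_ACQUIRE))
			return old == 0;
	}
}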

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/cpuidle.h             |  14 ++
 arch/powerpc/include/asm/opal.h                |   2 +
 arch/powerpc/include/asm/paca.h                |   4 +
 arch/powerpc/kernel/asm-offsets.c              |   4 +
 arch/powerpc/kernel/exceptions-64s.S           |  20 ++-
 arch/powerpc/kernel/idle_power7.S              | 186 +++++++++++++++++++------
 arch/powerpc/platforms/powernv/opal-wrappers.S |  37 +++++
 arch/powerpc/platforms/powernv/setup.c         |  42 +++++-
 arch/powerpc/platforms/powernv/smp.c           |   3 +-
 drivers/cpuidle/cpuidle-powernv.c              |   3 +-
 10 files changed, 258 insertions(+), 57 deletions(-)
 create mode 100644 arch/powerpc/include/asm/cpuidle.h

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
new file mode 100644
index 0000000..8c82850
--- /dev/null
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_POWERPC_CPUIDLE_H
+#define _ASM_POWERPC_CPUIDLE_H
+
+#ifdef CONFIG_PPC_POWERNV
+/* Used in powernv idle state management */
+#define PNV_THREAD_RUNNING              0
+#define PNV_THREAD_NAP                  1
+#define PNV_THREAD_SLEEP                2
+#define PNV_THREAD_WINKLE               3
+#define PNV_CORE_IDLE_LOCK_BIT          0x100
+#define PNV_CORE_IDLE_THREAD_BITS       0x0FF
+#endif
+
+#endif
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index f8b95c0..bef7fbc 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -152,6 +152,7 @@ struct opal_sg_list {
 #define OPAL_PCI_ERR_INJECT			96
 #define OPAL_PCI_EEH_FREEZE_SET			97
 #define OPAL_HANDLE_HMI				98
+#define OPAL_CONFIG_CPU_IDLE_STATE		99
 #define OPAL_REGISTER_DUMP_REGION		101
 #define OPAL_UNREGISTER_DUMP_REGION		102
 
@@ -162,6 +163,7 @@ struct opal_sg_list {
  */
 #define OPAL_PM_NAP_ENABLED	0x00010000
 #define OPAL_PM_SLEEP_ENABLED	0x00020000
+#define OPAL_PM_SLEEP_ENABLED_ER1	0x00080000
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index a5139ea..85aeedb 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -158,6 +158,10 @@ struct paca_struct {
 	 * early exception handler for use by high level C handler
 	 */
 	struct opal_machine_check_event *opal_mc_evt;
+
+	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
+	u32 *core_idle_state_ptr;
+	u8 thread_idle_state;		/* ~Idle[0]/Nap[1]/Sleep[2]/Winkle[3] */
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
 	/* Exclusive emergency stack pointer for machine check exception. */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 9d7dede..50f299e 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -731,6 +731,10 @@ int main(void)
 	DEFINE(OPAL_MC_SRR0, offsetof(struct opal_machine_check_event, srr0));
 	DEFINE(OPAL_MC_SRR1, offsetof(struct opal_machine_check_event, srr1));
 	DEFINE(PACA_OPAL_MC_EVT, offsetof(struct paca_struct, opal_mc_evt));
+	DEFINE(PACA_CORE_IDLE_STATE_PTR,
+			offsetof(struct paca_struct, core_idle_state_ptr));
+	DEFINE(PACA_THREAD_IDLE_STATE,
+			offsetof(struct paca_struct, thread_idle_state));
 #endif
 
 	return 0;
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 72e783e..3311c8d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -15,6 +15,7 @@
 #include <asm/hw_irq.h>
 #include <asm/exception-64s.h>
 #include <asm/ptrace.h>
+#include <asm/cpuidle.h>
 
 /*
  * We layout physical memory as follows:
@@ -109,15 +110,19 @@ BEGIN_FTR_SECTION
 	rlwinm.	r13,r13,47-31,30,31
 	beq	9f
 
-	/* waking up from powersave (nap) state */
 	cmpwi	cr1,r13,2
-	/* Total loss of HV state is fatal, we could try to use the
-	 * PIR to locate a PACA, then use an emergency stack etc...
-	 * OPAL v3 based powernv platforms have new idle states
-	 * which fall in this catagory.
-	 */
-	bgt	cr1,8f
+
 	GET_PACA(r13)
+	lbz	r0,PACA_THREAD_IDLE_STATE(r13)
+	cmpwi   cr2,r0,PNV_THREAD_NAP
+	bgt     cr2,8f				/* Either sleep or Winkle */
+
+	/* Waking up from nap should not cause hypervisor state loss */
+	bgt	cr1,.
+
+	/* Waking up from nap */
+	li	r0,PNV_THREAD_RUNNING
+	stb	r0,PACA_THREAD_IDLE_STATE(r13)	/* Clear thread state */
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	li	r0,KVM_HWTHREAD_IN_KERNEL
@@ -1386,6 +1391,7 @@ machine_check_handle_early:
 	MACHINE_CHECK_HANDLER_WINDUP
 	GET_PACA(r13)
 	ld	r1,PACAR1(r13)
+	li	r3,PNV_THREAD_NAP
 	b	power7_enter_nap_mode
 4:
 #endif
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 283c603..c1d590f 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -18,6 +18,7 @@
 #include <asm/hw_irq.h>
 #include <asm/kvm_book3s_asm.h>
 #include <asm/opal.h>
+#include <asm/cpuidle.h>
 
 #undef DEBUG
 
@@ -37,8 +38,7 @@
 
 /*
  * Pass requested state in r3:
- * 	0 - nap
- * 	1 - sleep
+ *	r3 - PNV_THREAD_NAP/SLEEP/WINKLE
  *
  * To check IRQ_HAPPENED in r4
  * 	0 - don't check
@@ -123,12 +123,62 @@ power7_enter_nap_mode:
 	li	r4,KVM_HWTHREAD_IN_NAP
 	stb	r4,HSTATE_HWTHREAD_STATE(r13)
 #endif
-	cmpwi	cr0,r3,1
-	beq	2f
+	stb	r3,PACA_THREAD_IDLE_STATE(r13)
+	cmpwi	cr1,r3,PNV_THREAD_SLEEP
+	bge	cr1,2f
 	IDLE_STATE_ENTER_SEQ(PPC_NAP)
 	/* No return */
-2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
-	/* No return */
+2:
+	/* Sleep or winkle */
+	li	r7,1
+	mfspr	r8,SPRN_PIR
+	/*
+	 * The last 3 bits of PIR represent the thread id of a cpu
+	 * in power8. This will need adjusting for power7.
+	 */
+	andi.	r8,r8,0x07			/* Get thread id into r8 */
+	rotld	r7,r7,r8
+
+	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
+lwarx_loop1:
+	lwarx	r15,0,r14
+	andc	r15,r15,r7			/* Clear thread bit */
+
+	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
+	beq	last_thread
+
+	/* Not the last thread to go to sleep */
+	stwcx.	r15,0,r14
+	bne-	lwarx_loop1
+	b	common_enter
+
+last_thread:
+	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
+	lbz	r3,0(r3)
+	cmpwi	r3,1
+	bne	common_enter
+	/*
+	 * Last thread of the core entering sleep. Last thread needs to execute
+	 * the hardware bug workaround code. Before that, set the lock bit to
+	 * avoid the race of other threads waking up and undoing workaround
+	 * before workaround is applied.
+	 */
+	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
+	stwcx.	r15,0,r14
+	bne-	lwarx_loop1
+
+	/* Fast sleep workaround */
+	li	r3,1
+	li	r4,1
+	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
+	bl	opal_call_realmode
+
+	/* Clear Lock bit */
+	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
+	stw	r15,0(r14)
+
+common_enter: /* common code for all the threads entering sleep */
+	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
 
 _GLOBAL(power7_idle)
 	/* Now check if user or arch enabled NAP mode */
@@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
 
 _GLOBAL(power7_nap)
 	mr	r4,r3
-	li	r3,0
+	li	r3,PNV_THREAD_NAP
 	b	power7_powersave_common
 	/* No return */
 
 _GLOBAL(power7_sleep)
-	li	r3,1
+	li	r3,PNV_THREAD_SLEEP
 	li	r4,1
 	b	power7_powersave_common
 	/* No return */
 
-/*
- * Make opal call in realmode. This is a generic function to be called
- * from realmode from reset vector. It handles endianess.
- *
- * r13 - paca pointer
- * r1  - stack pointer
- * r3  - opal token
- */
-opal_call_realmode:
-	mflr	r12
-	std	r12,_LINK(r1)
-	ld	r2,PACATOC(r13)
-	/* Set opal return address */
-	LOAD_REG_ADDR(r0,return_from_opal_call)
-	mtlr	r0
-	/* Handle endian-ness */
-	li	r0,MSR_LE
-	mfmsr	r12
-	andc	r12,r12,r0
-	mtspr	SPRN_HSRR1,r12
-	mr	r0,r3			/* Move opal token to r0 */
-	LOAD_REG_ADDR(r11,opal)
-	ld	r12,8(r11)
-	ld	r2,0(r11)
-	mtspr	SPRN_HSRR0,r12
-	hrfid
-
-return_from_opal_call:
-	FIXUP_ENDIAN
-	ld	r0,_LINK(r1)
-	mtlr	r0
-	blr
-
 #define CHECK_HMI_INTERRUPT						\
 	mfspr	r0,SPRN_SRR1;						\
 BEGIN_FTR_SECTION_NESTED(66);						\
@@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
 	/* Invoke opal call to handle hmi */				\
 	ld	r2,PACATOC(r13);					\
 	ld	r1,PACAR1(r13);						\
-	std	r3,ORIG_GPR3(r1);	/* Save original r3 */		\
-	li	r3,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
+	li	r0,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
 	bl	opal_call_realmode;					\
-	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
 20:	nop;
 
 
@@ -210,12 +225,91 @@ _GLOBAL(power7_wakeup_tb_loss)
 BEGIN_FTR_SECTION
 	CHECK_HMI_INTERRUPT
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
+
+	li	r7,1
+	mfspr	r8,SPRN_PIR
+	/*
+	 * The last 3 bits of PIR represent the thread id of a cpu
+	 * in power8. This will need adjusting for power7.
+	 */
+	andi.	r8,r8,0x07		/* Get thread id into r8 */
+	rotld	r7,r7,r8
+	/* r7 now has 'thread_id'th bit set */
+
+	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
+lwarx_loop2:
+	lwarx	r15,0,r14
+	andi.	r9,r15,PNV_CORE_IDLE_LOCK_BIT
+	/*
+	 * The lock bit is set in one of 2 cases:
+	 * a. In the sleep/winkle enter path, the last thread is executing
+	 * fastsleep workaround code.
+	 * b. In the wake up path, another thread is executing fastsleep
+	 * workaround undo code or resyncing timebase or restoring context
+	 * In either case loop until the lock bit is cleared.
+	 */
+	bne	lwarx_loop2
+
+	cmpwi	cr2,r15,0
+	or	r15,r15,r7		/* Set thread bit */
+
+	beq	cr2,first_thread
+
+	/* Not first thread in core to wake up */
+	stwcx.	r15,0,r14
+	bne-	lwarx_loop2
+	b	common_exit
+
+first_thread:
+	/* First thread in core to wakeup */
+	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
+	stwcx.	r15,0,r14
+	bne-	lwarx_loop2
+
+	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
+	lbz	r3,0(r3)
+	cmpwi	r3,1
+	/* Skip the fastsleep workaround if it's not needed */
+	bne	timebase_resync
+
+	/* Undo fast sleep workaround */
+	mfcr	r16	/* Backup CR into a non-volatile register */
+	li	r3,1
+	li	r4,0
+	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
+	bl	opal_call_realmode
+	mtcr	r16	/* Restore CR */
+
+	/* Do timebase resync if we are waking up from sleep. Use cr1 value
+	 * set in exceptions-64s.S */
+	ble	cr1,clear_lock
+
+timebase_resync:
 	/* Time base re-sync */
-	li	r3,OPAL_RESYNC_TIMEBASE
+	li	r0,OPAL_RESYNC_TIMEBASE
 	bl	opal_call_realmode;
-
 	/* TODO: Check r3 for failure */
 
+clear_lock:
+	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
+	stw	r15,0(r14)
+
+common_exit:
+	li	r5,PNV_THREAD_RUNNING
+	stb     r5,PACA_THREAD_IDLE_STATE(r13)
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	li      r0,KVM_HWTHREAD_IN_KERNEL
+	stb     r0,HSTATE_HWTHREAD_STATE(r13)
+	/* Order setting hwthread_state vs. testing hwthread_req */
+	sync
+	lbz     r0,HSTATE_HWTHREAD_REQ(r13)
+	cmpwi   r0,0
+	beq     6f
+	b       kvm_start_guest
+6:
+#endif
+
 	REST_NVGPRS(r1)
 	REST_GPR(2, r1)
 	ld	r3,_CCR(r1)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index feb549a..b2aa93b 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -158,6 +158,43 @@ opal_tracepoint_return:
 	blr
 #endif
 
+/*
+ * Make opal call in realmode. This is a generic function to be called
+ * from realmode. It handles endianness.
+ *
+ * r13 - paca pointer
+ * r1  - stack pointer
+ * r0  - opal token
+ */
+_GLOBAL(opal_call_realmode)
+	mflr	r12
+	std	r12,_LINK(r1)
+	ld	r2,PACATOC(r13)
+	/* Set opal return address */
+	LOAD_REG_ADDR(r12,return_from_opal_call)
+	mtlr	r12
+
+	mfmsr	r12
+#ifdef __LITTLE_ENDIAN__
+	/* Handle endian-ness */
+	li	r11,MSR_LE
+	andc	r12,r12,r11
+#endif
+	mtspr	SPRN_HSRR1,r12
+	LOAD_REG_ADDR(r11,opal)
+	ld	r12,8(r11)
+	ld	r2,0(r11)
+	mtspr	SPRN_HSRR0,r12
+	hrfid
+
+return_from_opal_call:
+#ifdef __LITTLE_ENDIAN__
+	FIXUP_ENDIAN
+#endif
+	ld	r12,_LINK(r1)
+	mtlr	r12
+	blr
+
 OPAL_CALL(opal_invalid_call,			OPAL_INVALID_CALL);
 OPAL_CALL(opal_console_write,			OPAL_CONSOLE_WRITE);
 OPAL_CALL(opal_console_read,			OPAL_CONSOLE_READ);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 34c6665..17fb98c 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -36,6 +36,8 @@
 #include <asm/opal.h>
 #include <asm/kexec.h>
 #include <asm/smp.h>
+#include <asm/cputhreads.h>
+#include <asm/cpuidle.h>
 
 #include "powernv.h"
 
@@ -292,11 +294,45 @@ static void __init pnv_setup_machdep_rtas(void)
 
 static u32 supported_cpuidle_states;
 
+static void pnv_alloc_idle_core_states(void)
+{
+	int i, j;
+	int nr_cores = cpu_nr_cores();
+	u32 *core_idle_state;
+
+	/*
+	 * core_idle_state - The lower 8 bits track the idle state of each
+	 * thread of the core; bit 8 is the lock bit. Initially all thread
+	 * bits are set. They are cleared when the thread enters a deep idle
+	 * state like sleep or winkle. Initially the lock bit is cleared.
+	 * The lock bit has 2 purposes:
+	 * a. While the first thread is restoring core state, it prevents
+	 * other threads in the core from switching to process context.
+	 * b. While the last thread in the core is saving the core state,
+	 * it prevents a different thread from waking up.
+	 */
+	for (i = 0; i < nr_cores; i++) {
+		int first_cpu = i * threads_per_core;
+		int node = cpu_to_node(first_cpu);
+
+		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
+		*core_idle_state = PNV_CORE_IDLE_THREAD_BITS;
+		for (j = 0; j < threads_per_core; j++) {
+			int cpu = first_cpu + j;
+
+			paca[cpu].core_idle_state_ptr = core_idle_state;
+			paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
+
+		}
+	}
+}
+
 u32 pnv_get_supported_cpuidle_states(void)
 {
 	return supported_cpuidle_states;
 }
+EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);
 
+u8 pnv_need_fastsleep_workaround;
 static int __init pnv_init_idle_states(void)
 {
 	struct device_node *power_mgt;
@@ -306,6 +342,7 @@ static int __init pnv_init_idle_states(void)
 	int i;
 
 	supported_cpuidle_states = 0;
+	pnv_need_fastsleep_workaround = 0;
 
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		return 0;
@@ -332,13 +369,14 @@ static int __init pnv_init_idle_states(void)
 		flags = be32_to_cpu(idle_state_flags[i]);
 		supported_cpuidle_states |= flags;
 	}
-
+	if (supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)
+		pnv_need_fastsleep_workaround = 1;
+	pnv_alloc_idle_core_states();
 	return 0;
 }
 
 subsys_initcall(pnv_init_idle_states);
 
-
 static int __init pnv_probe(void)
 {
 	unsigned long root = of_get_flat_dt_root();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 3dc4cec..12b761a 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -167,7 +167,8 @@ static void pnv_smp_cpu_kill_self(void)
 	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
 	while (!generic_check_cpu_restart(cpu)) {
 		ppc64_runlatch_off();
-		if (idle_states & OPAL_PM_SLEEP_ENABLED)
+		if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
+				(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
 			power7_sleep();
 		else
 			power7_nap(1);
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 0a7d827..a489b56 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -208,7 +208,8 @@ static int powernv_add_idle_states(void)
 			nr_idle_states++;
 		}
 
-		if (flags & OPAL_PM_SLEEP_ENABLED) {
+		if (flags & OPAL_PM_SLEEP_ENABLED ||
+			flags & OPAL_PM_SLEEP_ENABLED_ER1) {
 			/* Add FASTSLEEP state */
 			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
 			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
-- 
1.9.3



* [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus
  2014-11-25 11:17 [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
                   ` (2 preceding siblings ...)
  2014-11-25 11:17 ` [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
@ 2014-11-25 11:17 ` Shreyas B. Prabhu
  2014-11-27  1:55   ` Benjamin Herrenschmidt
  2014-11-26  5:15 ` [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Preeti U Murthy
  4 siblings, 1 reply; 12+ messages in thread
From: Shreyas B. Prabhu @ 2014-11-25 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Shreyas B. Prabhu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, linuxppc-dev

Winkle is a deep idle state supported in power8 chips. A core enters
winkle when all the threads of the core enter winkle. In this state the
power supply to the entire chiplet, i.e. the core, private L2 and
private L3, is turned off. As a result it gives higher powersavings
compared to sleep.

But entering winkle results in a total hypervisor state loss. Hence the
hypervisor context has to be preserved before entering winkle and
restored upon wake up.

The Power-on Reset Engine (PORE) is a dedicated engine which is
responsible for powering on the chiplet during wake up. It can be
programmed to restore the contents of a few specific registers. This
patch uses PORE to restore register state wherever possible and uses
the stack to save and restore the rest of the necessary registers.

For hypervisor state restore, things fall into three categories:
per-core state, per-subcore state and per-thread state. To manage this,
extend the infrastructure introduced for sleep. Mainly we add a paca
variable, subcore_sibling_mask. Using this and the core_idle_state we
can distinguish the first thread in the core and in the subcore.
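
The trick used to tell a winkle wakeup apart from sleep at the 0x100
vector deserves a note: paca entries are at least 8-byte aligned, so
bit 0 of the pointer programmed into HSPRG0 is always free and can be
used as a "woke from winkle" flag. A hedged C sketch follows; types and
helper names are illustrative, the real code is in setup.c and
exceptions-64s.S.

#include <assert.h>
#include <stdint.h>

struct paca_struct { uint64_t pad[8]; } __attribute__((aligned(16)));

/* Value programmed into the SLW engine for the HSPRG0 restore. */
static uint64_t slw_hsprg0_value(struct paca_struct *paca)
{
	uint64_t val = (uint64_t)(uintptr_t)paca;

	assert((val & 1) == 0);	/* alignment guarantees a free bit */
	return val | 1;		/* PORE restores this on winkle exit */
}

/* At wakeup: test the bit, then clear it to recover the paca pointer. */
static int woke_from_winkle(uint64_t *hsprg0)
{
	int winkle = (int)(*hsprg0 & 1);

	*hsprg0 &= ~1ULL;
	return winkle;
}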

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/opal.h                |   3 +
 arch/powerpc/include/asm/paca.h                |   2 +
 arch/powerpc/include/asm/ppc-opcode.h          |   2 +
 arch/powerpc/include/asm/processor.h           |   1 +
 arch/powerpc/include/asm/reg.h                 |   2 +
 arch/powerpc/kernel/asm-offsets.c              |   2 +
 arch/powerpc/kernel/cpu_setup_power.S          |   4 +
 arch/powerpc/kernel/exceptions-64s.S           |  10 ++
 arch/powerpc/kernel/idle_power7.S              | 156 ++++++++++++++++++++++---
 arch/powerpc/platforms/powernv/opal-wrappers.S |   2 +
 arch/powerpc/platforms/powernv/setup.c         |  73 ++++++++++++
 arch/powerpc/platforms/powernv/smp.c           |   4 +-
 arch/powerpc/platforms/powernv/subcore.c       |  34 ++++++
 arch/powerpc/platforms/powernv/subcore.h       |   1 +
 14 files changed, 282 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index bef7fbc..f0ca2d9 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -153,6 +153,7 @@ struct opal_sg_list {
 #define OPAL_PCI_EEH_FREEZE_SET			97
 #define OPAL_HANDLE_HMI				98
 #define OPAL_CONFIG_CPU_IDLE_STATE		99
+#define OPAL_SLW_SET_REG			100
 #define OPAL_REGISTER_DUMP_REGION		101
 #define OPAL_UNREGISTER_DUMP_REGION		102
 
@@ -163,6 +164,7 @@ struct opal_sg_list {
  */
 #define OPAL_PM_NAP_ENABLED	0x00010000
 #define OPAL_PM_SLEEP_ENABLED	0x00020000
+#define OPAL_PM_WINKLE_ENABLED	0x00040000
 #define OPAL_PM_SLEEP_ENABLED_ER1	0x00080000
 
 #ifndef __ASSEMBLY__
@@ -972,6 +974,7 @@ int64_t opal_sensor_read(uint32_t sensor_hndl, int token, __be32 *sensor_data);
 int64_t opal_handle_hmi(void);
 int64_t opal_register_dump_region(uint32_t id, uint64_t start, uint64_t end);
 int64_t opal_unregister_dump_region(uint32_t id);
+int64_t opal_slw_set_reg(uint64_t cpu_pir, uint64_t sprn, uint64_t val);
 int64_t opal_pci_set_phb_cxl_mode(uint64_t phb_id, uint64_t mode, uint64_t pe_number);
 
 /* Internal functions */
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 85aeedb..c2e51b7 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -162,6 +162,8 @@ struct paca_struct {
 	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
 	u32 *core_idle_state_ptr;
 	u8 thread_idle_state;		/* ~Idle[0]/Nap[1]/Sleep[2]/Winkle[3] */
+	/* Mask to denote subcore sibling threads */
+	u8 subcore_sibling_mask;
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
 	/* Exclusive emergency stack pointer for machine check exception. */
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 6f85362..5155be7 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -194,6 +194,7 @@
 
 #define PPC_INST_NAP			0x4c000364
 #define PPC_INST_SLEEP			0x4c0003a4
+#define PPC_INST_WINKLE			0x4c0003e4
 
 /* A2 specific instructions */
 #define PPC_INST_ERATWE			0x7c0001a6
@@ -374,6 +375,7 @@
 
 #define PPC_NAP			stringify_in_c(.long PPC_INST_NAP)
 #define PPC_SLEEP		stringify_in_c(.long PPC_INST_SLEEP)
+#define PPC_WINKLE		stringify_in_c(.long PPC_INST_WINKLE)
 
 /* BHRB instructions */
 #define PPC_CLRBHRB		stringify_in_c(.long PPC_INST_CLRBHRB)
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index dda7ac4..c076842 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -453,6 +453,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern void power7_nap(int check_irq);
 extern void power7_sleep(void);
+extern void power7_winkle(void);
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
 extern void poweroff_now(void);
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index a68ee15..1c874fb 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -373,6 +373,7 @@
 #define SPRN_DBAT7L	0x23F	/* Data BAT 7 Lower Register */
 #define SPRN_DBAT7U	0x23E	/* Data BAT 7 Upper Register */
 #define SPRN_PPR	0x380	/* SMT Thread status Register */
+#define SPRN_TSCR	0x399	/* Thread Switch Control Register */
 
 #define SPRN_DEC	0x016		/* Decrement Register */
 #define SPRN_DER	0x095		/* Debug Enable Regsiter */
@@ -730,6 +731,7 @@
 #define SPRN_BESCR	806	/* Branch event status and control register */
 #define   BESCR_GE	0x8000000000000000ULL /* Global Enable */
 #define SPRN_WORT	895	/* Workload optimization register - thread */
+#define SPRN_WORC	863	/* Workload optimization register - core */
 
 #define SPRN_PMC1	787
 #define SPRN_PMC2	788
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 50f299e..b7fb63a 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -735,6 +735,8 @@ int main(void)
 			offsetof(struct paca_struct, core_idle_state_ptr));
 	DEFINE(PACA_THREAD_IDLE_STATE,
 			offsetof(struct paca_struct, thread_idle_state));
+	DEFINE(PACA_SUBCORE_SIBLING_MASK,
+			offsetof(struct paca_struct, subcore_sibling_mask));
 #endif
 
 	return 0;
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index 4673353..66874aa 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8)
 	beqlr
 	li	r0,0
 	mtspr	SPRN_LPID,r0
+	mtspr	SPRN_WORT,r0
+	mtspr	SPRN_WORC,r0
 	mfspr	r3,SPRN_LPCR
 	ori	r3, r3, LPCR_PECEDH
 	bl	__init_LPCR
@@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8)
 	li	r0,0
 	mtspr	SPRN_LPID,r0
 	mfspr   r3,SPRN_LPCR
+	mtspr	SPRN_WORT,r0
+	mtspr	SPRN_WORC,r0
 	ori	r3, r3, LPCR_PECEDH
 	bl	__init_LPCR
 	bl	__init_HFSCR
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3311c8d..c9897cb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -112,6 +112,16 @@ BEGIN_FTR_SECTION
 
 	cmpwi	cr1,r13,2
 
+	/* Check if the last bit of HSPRG0 is set. This indicates whether we
+	 * are waking up from winkle */
+	li	r3,1
+	mfspr	r4,SPRN_HSPRG0
+	and	r5,r4,r3
+	cmpwi	cr4,r5,1	/* Store result in cr4 for later use */
+
+	andc	r4,r4,r3
+	mtspr	SPRN_HSPRG0,r4
+
 	GET_PACA(r13)
 	lbz	r0,PACA_THREAD_IDLE_STATE(r13)
 	cmpwi   cr2,r0,PNV_THREAD_NAP
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index c1d590f..78c30b0 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -19,8 +19,22 @@
 #include <asm/kvm_book3s_asm.h>
 #include <asm/opal.h>
 #include <asm/cpuidle.h>
+#include <asm/mmu-hash64.h>
 
 #undef DEBUG
+/*
+ * Use unused space in the interrupt stack to save and restore
+ * registers for winkle support.
+ */
+#define _SDR1	GPR3
+#define _RPR	GPR4
+#define _SPURR	GPR5
+#define _PURR	GPR6
+#define _TSCR	GPR7
+#define _DSCR	GPR8
+#define _AMOR	GPR9
+#define _PMC5	GPR10
+#define _PMC6	GPR11
 
 /* Idle state entry routines */
 
@@ -153,32 +167,60 @@ lwarx_loop1:
 	b	common_enter
 
 last_thread:
-	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
-	lbz	r3,0(r3)
-	cmpwi	r3,1
-	bne	common_enter
 	/*
 	 * Last thread of the core entering sleep. Last thread needs to execute
 	 * the hardware bug workaround code. Before that, set the lock bit to
 	 * avoid the race of other threads waking up and undoing workaround
 	 * before workaround is applied.
 	 */
+	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
+	lbz	r3,0(r3)
+	cmpwi	r3,1
+	bne	common_enter
+
 	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
 	stwcx.	r15,0,r14
 	bne-	lwarx_loop1
 
 	/* Fast sleep workaround */
+	mfcr	r16	/* Backup CR to a non-volatile register */
 	li	r3,1
 	li	r4,1
 	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
 	bl	opal_call_realmode
+	mtcr	r16	/* Restore CR */
 
 	/* Clear Lock bit */
 	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
 	stw	r15,0(r14)
 
-common_enter: /* common code for all the threads entering sleep */
+common_enter: /* common code for all the threads entering sleep or winkle */
+	bgt	cr1,enter_winkle
 	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+enter_winkle:
+	/*
+	 * Note all registers, i.e. per-core, per-subcore or per-thread, are
+	 * saved here since any thread in the core might wake up first
+	 */
+	mfspr	r3,SPRN_SDR1
+	std	r3,_SDR1(r1)
+	mfspr	r3,SPRN_RPR
+	std	r3,_RPR(r1)
+	mfspr	r3,SPRN_SPURR
+	std	r3,_SPURR(r1)
+	mfspr	r3,SPRN_PURR
+	std	r3,_PURR(r1)
+	mfspr	r3,SPRN_TSCR
+	std	r3,_TSCR(r1)
+	mfspr	r3,SPRN_DSCR
+	std	r3,_DSCR(r1)
+	mfspr	r3,SPRN_AMOR
+	std	r3,_AMOR(r1)
+	mfspr	r3,SPRN_PMC5
+	std	r3,_PMC5(r1)
+	mfspr	r3,SPRN_PMC6
+	std	r3,_PMC6(r1)
+	IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
 
 _GLOBAL(power7_idle)
 	/* Now check if user or arch enabled NAP mode */
@@ -201,6 +243,12 @@ _GLOBAL(power7_sleep)
 	b	power7_powersave_common
 	/* No return */
 
+_GLOBAL(power7_winkle)
+	li	r3,PNV_THREAD_WINKLE
+	li	r4,1
+	b	power7_powersave_common
+	/* No return */
+
 #define CHECK_HMI_INTERRUPT						\
 	mfspr	r0,SPRN_SRR1;						\
 BEGIN_FTR_SECTION_NESTED(66);						\
@@ -250,22 +298,54 @@ lwarx_loop2:
 	 */
 	bne	lwarx_loop2
 
-	cmpwi	cr2,r15,0
+	cmpwi	cr2,r15,0	/* Check if first in core */
+	lbz	r4,PACA_SUBCORE_SIBLING_MASK(r13)
+	and	r4,r4,r15
+	cmpwi	cr3,r4,0	/* Check if first in subcore */
+
+	/*
+	 * At this stage
+	 * cr1 - 01 if waking up from sleep or winkle
+	 * cr2 - 10 if first thread to wakeup in core
+	 * cr3 - 10 if first thread to wakeup in subcore
+	 * cr4 - 10 if waking up from winkle
+	 */
+
 	or	r15,r15,r7		/* Set thread bit */
 
-	beq	cr2,first_thread
+	beq	cr3,first_thread_in_subcore
 
-	/* Not first thread in core to wake up */
+	/* Not first thread in subcore to wake up */
 	stwcx.	r15,0,r14
 	bne-	lwarx_loop2
 	b	common_exit
 
-first_thread:
-	/* First thread in core to wakeup */
+first_thread_in_subcore:
+	/* First thread in subcore to wakeup set the lock bit */
 	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
 	stwcx.	r15,0,r14
 	bne-	lwarx_loop2
 
+	/*
+	 * If waking up from sleep, subcore state is not lost. Hence
+	 * skip subcore state restore
+	 */
+	bne	cr4,subcore_state_restored
+
+	/* Restore per-subcore state */
+	ld      r4,_SDR1(r1)
+	mtspr   SPRN_SDR1,r4
+	ld      r4,_RPR(r1)
+	mtspr   SPRN_RPR,r4
+	ld	r4,_AMOR(r1)
+	mtspr	SPRN_AMOR,r4
+
+subcore_state_restored:
+	/* Check if the thread is also the first thread in the core. If not,
+	 * skip to clear_lock */
+	bne	cr2,clear_lock
+
+first_thread_in_core:
 	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 	lbz	r3,0(r3)
 	cmpwi	r3,1
@@ -280,21 +360,71 @@ first_thread:
 	bl	opal_call_realmode
 	mtcr	r16	/* Restore CR */
 
-	/* Do timebase resync if we are waking up from sleep. Use cr1 value
-	 * set in exceptions-64s.S */
+timebase_resync:
+	/* Do timebase resync only if the core truly woke up from
+	 * sleep/winkle */
 	ble	cr1,clear_lock
 
-timebase_resync:
 	/* Time base re-sync */
+	mfcr	r16	/* Backup CR into a non-volatile register */
 	li	r0,OPAL_RESYNC_TIMEBASE
 	bl	opal_call_realmode;
 	/* TODO: Check r3 for failure */
+	mtcr	r16	/* Restore CR */
+
+	/*
+	 * If waking up from sleep, per core state is not lost, skip to
+	 * clear_lock.
+	 */
+	bne	cr4,clear_lock
+
+	/* Restore per core state */
+	ld	r4,_TSCR(r1)
+	mtspr	SPRN_TSCR,r4
 
 clear_lock:
 	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
 	stw	r15,0(r14)
 
 common_exit:
+	/* Common to all threads
+	 *
+	 * If waking up from sleep, hypervisor state is not lost. Hence
+	 * skip hypervisor state restore.
+	 */
+	bne	cr4,hypervisor_state_restored
+
+	/* Waking up from winkle */
+
+	/* Restore per thread state */
+	bl	__restore_cpu_power8
+
+	/* Restore SLB  from PACA */
+	ld	r8,PACA_SLBSHADOWPTR(r13)
+
+	.rept	SLB_NUM_BOLTED
+	li	r3, SLBSHADOW_SAVEAREA
+	LDX_BE	r5, r8, r3
+	addi	r3, r3, 8
+	LDX_BE	r6, r8, r3
+	andis.	r7,r5,SLB_ESID_V@h
+	beq	1f
+	slbmte	r6,r5
+1:	addi	r8,r8,16
+	.endr
+
+	ld	r4,_SPURR(r1)
+	mtspr	SPRN_SPURR,r4
+	ld	r4,_PURR(r1)
+	mtspr	SPRN_PURR,r4
+	ld	r4,_DSCR(r1)
+	mtspr	SPRN_DSCR,r4
+	ld	r4,_PMC5(r1)
+	mtspr	SPRN_PMC5,r4
+	ld	r4,_PMC6(r1)
+	mtspr	SPRN_PMC6,r4
+
+hypervisor_state_restored:
 	li	r5,PNV_THREAD_RUNNING
 	stb     r5,PACA_THREAD_IDLE_STATE(r13)
 
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index b2aa93b..e1e91e0 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -191,6 +191,7 @@ return_from_opal_call:
 #ifdef __LITTLE_ENDIAN__
 	FIXUP_ENDIAN
 #endif
+	ld	r2,PACATOC(r13)
 	ld	r12,_LINK(r1)
 	mtlr	r12
 	blr
@@ -284,6 +285,7 @@ OPAL_CALL(opal_sensor_read,			OPAL_SENSOR_READ);
 OPAL_CALL(opal_get_param,			OPAL_GET_PARAM);
 OPAL_CALL(opal_set_param,			OPAL_SET_PARAM);
 OPAL_CALL(opal_handle_hmi,			OPAL_HANDLE_HMI);
+OPAL_CALL(opal_slw_set_reg,			OPAL_SLW_SET_REG);
 OPAL_CALL(opal_register_dump_region,		OPAL_REGISTER_DUMP_REGION);
 OPAL_CALL(opal_unregister_dump_region,		OPAL_UNREGISTER_DUMP_REGION);
 OPAL_CALL(opal_pci_set_phb_cxl_mode,		OPAL_PCI_SET_PHB_CXL_MODE);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 17fb98c..4a886a1 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -40,6 +40,7 @@
 #include <asm/cpuidle.h>
 
 #include "powernv.h"
+#include "subcore.h"
 
 static void __init pnv_setup_arch(void)
 {
@@ -293,6 +294,74 @@ static void __init pnv_setup_machdep_rtas(void)
 #endif /* CONFIG_PPC_POWERNV_RTAS */
 
 static u32 supported_cpuidle_states;
+int pnv_save_sprs_for_winkle(void)
+{
+	int cpu;
+	int rc;
+
+	/*
+	 * hid0, hid1, hid4, hid5, hmeer and lpcr values are symmetric across
+	 * all cpus at boot. Get these reg values from the current cpu and
+	 * use the same across all cpus.
+	 */
+	uint64_t lpcr_val = mfspr(SPRN_LPCR);
+	uint64_t hid0_val = mfspr(SPRN_HID0);
+	uint64_t hid1_val = mfspr(SPRN_HID1);
+	uint64_t hid4_val = mfspr(SPRN_HID4);
+	uint64_t hid5_val = mfspr(SPRN_HID5);
+	uint64_t hmeer_val = mfspr(SPRN_HMEER);
+
+	for_each_possible_cpu(cpu) {
+		uint64_t pir = get_hard_smp_processor_id(cpu);
+		uint64_t hsprg0_val = (uint64_t)&paca[cpu];
+
+		/*
+		 * HSPRG0 is used to store the cpu's pointer to paca. Hence last
+		 * 3 bits are guaranteed to be 0. Program slw to restore HSPRG0
+		 * with 63rd bit set, so that when a thread wakes up at 0x100 we
+		 * can use this bit to distinguish between fastsleep and
+		 * deep winkle.
+		 */
+		hsprg0_val |= 1;
+
+		rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);
+		if (rc != 0)
+			return rc;
+
+		rc = opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
+		if (rc != 0)
+			return rc;
+
+		/* HIDs are per core registers */
+		if (cpu_thread_in_core(cpu) == 0) {
+
+			rc = opal_slw_set_reg(pir, SPRN_HMEER, hmeer_val);
+			if (rc != 0)
+				return rc;
+
+			rc = opal_slw_set_reg(pir, SPRN_HID0, hid0_val);
+			if (rc != 0)
+				return rc;
+
+			rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
+			if (rc != 0)
+				return rc;
+
+			rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
+			if (rc != 0)
+				return rc;
+
+			rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
+			if (rc != 0)
+				return rc;
+
+		}
+
+	}
+
+	return 0;
+
+}
 
 static void pnv_alloc_idle_core_states(void)
 {
@@ -324,6 +393,10 @@ static void pnv_alloc_idle_core_states(void)
 
 		}
 	}
+	update_subcore_sibling_mask();
+	if (supported_cpuidle_states & OPAL_PM_WINKLE_ENABLED)
+		pnv_save_sprs_for_winkle();
+
 }
 
 u32 pnv_get_supported_cpuidle_states(void)
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 12b761a..5e35857 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -167,7 +167,9 @@ static void pnv_smp_cpu_kill_self(void)
 	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
 	while (!generic_check_cpu_restart(cpu)) {
 		ppc64_runlatch_off();
-		if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
+		if (idle_states & OPAL_PM_WINKLE_ENABLED)
+			power7_winkle();
+		else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
 				(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
 			power7_sleep();
 		else
diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
index c87f96b..f60f80a 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -160,6 +160,18 @@ static void wait_for_sync_step(int step)
 	mb();
 }
 
+static void update_hid_in_slw(u64 hid0)
+{
+	u64 idle_states = pnv_get_supported_cpuidle_states();
+
+	if (idle_states & OPAL_PM_WINKLE_ENABLED) {
+		/* OPAL call to patch slw with the new HID0 value */
+		u64 cpu_pir = hard_smp_processor_id();
+
+		opal_slw_set_reg(cpu_pir, SPRN_HID0, hid0);
+	}
+}
+
 static void unsplit_core(void)
 {
 	u64 hid0, mask;
@@ -179,6 +191,7 @@ static void unsplit_core(void)
 	hid0 = mfspr(SPRN_HID0);
 	hid0 &= ~HID0_POWER8_DYNLPARDIS;
 	mtspr(SPRN_HID0, hid0);
+	update_hid_in_slw(hid0);
 
 	while (mfspr(SPRN_HID0) & mask)
 		cpu_relax();
@@ -215,6 +228,7 @@ static void split_core(int new_mode)
 	hid0  = mfspr(SPRN_HID0);
 	hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
 	mtspr(SPRN_HID0, hid0);
+	update_hid_in_slw(hid0);
 
 	/* Wait for it to happen */
 	while (!(mfspr(SPRN_HID0) & split_parms[i].mask))
@@ -251,6 +265,25 @@ bool cpu_core_split_required(void)
 	return true;
 }
 
+void update_subcore_sibling_mask(void)
+{
+	int cpu;
+	/*
+	 * sibling mask for the first cpu. Left shift this by required bits
+	 * to get sibling mask for the rest of the cpus.
+	 */
+	int sibling_mask_first_cpu =  (1 << threads_per_subcore) - 1;
+
+	for_each_possible_cpu(cpu) {
+		int tid = cpu_thread_in_core(cpu);
+		int offset = (tid / threads_per_subcore) * threads_per_subcore;
+		int mask = sibling_mask_first_cpu << offset;
+
+		paca[cpu].subcore_sibling_mask = mask;
+
+	}
+}
+
 static int cpu_update_split_mode(void *data)
 {
 	int cpu, new_mode = *(int *)data;
@@ -284,6 +317,7 @@ static int cpu_update_split_mode(void *data)
 		/* Make the new mode public */
 		subcores_per_core = new_mode;
 		threads_per_subcore = threads_per_core / subcores_per_core;
+		update_subcore_sibling_mask();
 
 		/* Make sure the new mode is written before we exit */
 		mb();
diff --git a/arch/powerpc/platforms/powernv/subcore.h b/arch/powerpc/platforms/powernv/subcore.h
index 148abc9..604eb40 100644
--- a/arch/powerpc/platforms/powernv/subcore.h
+++ b/arch/powerpc/platforms/powernv/subcore.h
@@ -15,4 +15,5 @@
 
 #ifndef __ASSEMBLY__
 void split_core_secondary_loop(u8 *state);
+extern void update_subcore_sibling_mask(void);
 #endif
-- 
1.9.3



* Re: [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management
  2014-11-25 11:17 [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
                   ` (3 preceding siblings ...)
  2014-11-25 11:17 ` [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus Shreyas B. Prabhu
@ 2014-11-26  5:15 ` Preeti U Murthy
  4 siblings, 0 replies; 12+ messages in thread
From: Preeti U Murthy @ 2014-11-26  5:15 UTC (permalink / raw)
  To: Shreyas B. Prabhu, linux-kernel, Benjamin Herrenschmidt
  Cc: Paul Mackerras, Michael Ellerman, Rafael J. Wysocki, linux-pm,
	linuxppc-dev, Vaidyanathan Srinivasan

Hi,

I ran hackbench to evaluate this patchset and found good improvements in
the results.

I modified hackbench to take in a 'loops' parameter along with
num_groups, which ensures that the test runs long enough to observe and
debug issues. The idea was to find out how latency-sensitive workloads
are affected by modifications to the cpuidle heuristics, since it is
easy to measure the impact on these workloads.

The experiment was conducted on a Power8 system with 1 socket and 6
cores on it.

The first experiment was carried out by pinning hackbench to the first
thread in each core while the rest of the SMT threads were idle; the
results are below. This ensured the cores entered deep idle states
more often.

num_grps     %improvement with patchset
3                    3.6
6                   10.6
12                   5.0
24                   5.0

The second experiment was carried out by allowing hackbench to run on
the SMT threads of two cores, and the improvement with the patchset was
in the range of 4-7%.

I ran the experiments against the vanilla kernel as the baseline. This
means the performance improvement is primarily due to avoiding a
timebase sync by every thread in the core. The power numbers show very
little variation between the runs with and without the patchset.

Thanks

Regards
Preeti U Murthy

On 11/25/2014 04:47 PM, Shreyas B. Prabhu wrote:
> Deep idle states like sleep and winkle are per core idle states. A core
> enters these states only when all the threads enter either the particular
> idle state or a deeper one. There are tasks like fastsleep hardware bug
> workaround and hypervisor core state save which have to be done only by
> the last thread of the core entering deep idle state and similarly tasks
> like timebase resync, hypervisor core register restore that have to be
> done only by the first thread waking up from these states. 
> [...]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
  2014-11-25 11:17 ` [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
@ 2014-11-27  0:37   ` Benjamin Herrenschmidt
  2014-11-27  5:55     ` Shreyas B Prabhu
  2014-11-27 23:50   ` Paul Mackerras
  1 sibling, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2014-11-27  0:37 UTC (permalink / raw)
  To: Shreyas B. Prabhu
  Cc: linux-kernel, Paul Mackerras, Michael Ellerman,
	Rafael J. Wysocki, linux-pm, linuxppc-dev


>  
> @@ -37,8 +38,7 @@
>  
>  /*
>   * Pass requested state in r3:
> - * 	0 - nap
> - * 	1 - sleep
> + *	r3 - PNV_THREAD_NAP/SLEEP/WINKLE
>   *
>   * To check IRQ_HAPPENED in r4
>   * 	0 - don't check
> @@ -123,12 +123,62 @@ power7_enter_nap_mode:
>  	li	r4,KVM_HWTHREAD_IN_NAP
>  	stb	r4,HSTATE_HWTHREAD_STATE(r13)
>  #endif
> -	cmpwi	cr0,r3,1
> -	beq	2f
> +	stb	r3,PACA_THREAD_IDLE_STATE(r13)
> +	cmpwi	cr1,r3,PNV_THREAD_SLEEP
> +	bge	cr1,2f
>  	IDLE_STATE_ENTER_SEQ(PPC_NAP)
>  	/* No return */
> -2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
> -	/* No return */
> +2:
> +	/* Sleep or winkle */
> +	li	r7,1
> +	mfspr	r8,SPRN_PIR
> +	/*
> +	 * The last 3 bits of PIR represent the thread id of a cpu
> +	 * in power8. This will need adjusting for power7.
> +	 */
> +	andi.	r8,r8,0x07			/* Get thread id into r8 */
> +	rotld	r7,r7,r8
> +
> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)

I assume we have already saved all non-volatile registers ? Because you
are clobbering one here and more below.

> +lwarx_loop1:
> +	lwarx	r15,0,r14
> +	andc	r15,r15,r7			/* Clear thread bit */
> +
> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +	beq	last_thread
> +
> +	/* Not the last thread to goto sleep */
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop1
> +	b	common_enter
> +
> +last_thread:
> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
> +	lbz	r3,0(r3)
> +	cmpwi	r3,1
> +	bne	common_enter

This looks wrong. If the workaround is 0, we don't do the stwcx. at
all... Did you try with pnv_need_fastsleep_workaround set to 0 ? It
should work most of the time as long as you don't hit the fairly
rare race window :)

Also it would be nice to make the above a dynamically patched feature
section, though that means pnv_need_fastsleep_workaround needs to turn
into a CPU feature bit, and that needs to be done *very* early on.

Another option is to patch out manually from the pnv code the pair:

	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
 	beq	last_thread

To turn them into nops by hand rather than using the feature system.
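
In C terms, what the enter path has to guarantee is roughly the
following (a sketch only, not the patch's code — cmpxchg() stands in
for the lwarx/stwcx. pair, and the PNV_* constants are the patch's):

	static void core_idle_enter(u32 *core_idle_state, u32 my_thread_bit)
	{
		u32 old, new;

		do {
			old = *core_idle_state;			/* lwarx  */
			new = old & ~my_thread_bit;		/* clear our thread bit */
			if (!(new & PNV_CORE_IDLE_THREAD_BITS) &&
			    pnv_need_fastsleep_workaround)
				new |= PNV_CORE_IDLE_LOCK_BIT;	/* last thread takes the lock */
		} while (cmpxchg(core_idle_state, old, new) != old);	/* stwcx. */

		/*
		 * Note the store-back happens on every path; that is the
		 * step the v2 code skips when the workaround is disabled.
		 */
	}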

> +	/*
> +	 * Last thread of the core entering sleep. Last thread needs to execute
> +	 * the hardware bug workaround code. Before that, set the lock bit to
> +	 * avoid the race of other threads waking up and undoing workaround
> +	 * before workaround is applied.
> +	 */
> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop1
> +
> +	/* Fast sleep workaround */
> +	li	r3,1
> +	li	r4,1
> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
> +	bl	opal_call_realmode
> +
> +	/* Clear Lock bit */

It's a lock, I would add a lwsync here to be safe, and I would add an
isync before the bne- above. Just to ensure that whatever is done
inside that locked section remains in there.
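
Concretely, the ordering being asked for is roughly this (a hedged
sketch in kernel C with the barriers spelled out via inline asm, not
the patch's code):

	/* acquire: once the stwcx. that sets PNV_CORE_IDLE_LOCK_BIT succeeds */
	__asm__ __volatile__("isync" : : : "memory");

	/* ... locked section: fastsleep workaround via opal_call_realmode ... */

	/* release: make the locked section's effects visible before unlocking */
	__asm__ __volatile__("lwsync" : : : "memory");
	/* plain store, like the stw that clears the lock bit */
	ACCESS_ONCE(*core_idle_state) = state & PNV_CORE_IDLE_THREAD_BITS;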

> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +	stw	r15,0(r14)
> +
> +common_enter: /* common code for all the threads entering sleep */
> +	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>  
>  _GLOBAL(power7_idle)
>  	/* Now check if user or arch enabled NAP mode */
> @@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
>  
>  _GLOBAL(power7_nap)
>  	mr	r4,r3
> -	li	r3,0
> +	li	r3,PNV_THREAD_NAP
>  	b	power7_powersave_common
>  	/* No return */
>  
>  _GLOBAL(power7_sleep)
> -	li	r3,1
> +	li	r3,PNV_THREAD_SLEEP
>  	li	r4,1
>  	b	power7_powersave_common
>  	/* No return */
>  
> -/*
> - * Make opal call in realmode. This is a generic function to be called
> - * from realmode from reset vector. It handles endianess.
> - *
> - * r13 - paca pointer
> - * r1  - stack pointer
> - * r3  - opal token
> - */
> -opal_call_realmode:
> -	mflr	r12
> -	std	r12,_LINK(r1)
> -	ld	r2,PACATOC(r13)
> -	/* Set opal return address */
> -	LOAD_REG_ADDR(r0,return_from_opal_call)
> -	mtlr	r0
> -	/* Handle endian-ness */
> -	li	r0,MSR_LE
> -	mfmsr	r12
> -	andc	r12,r12,r0
> -	mtspr	SPRN_HSRR1,r12
> -	mr	r0,r3			/* Move opal token to r0 */
> -	LOAD_REG_ADDR(r11,opal)
> -	ld	r12,8(r11)
> -	ld	r2,0(r11)
> -	mtspr	SPRN_HSRR0,r12
> -	hrfid
> -
> -return_from_opal_call:
> -	FIXUP_ENDIAN
> -	ld	r0,_LINK(r1)
> -	mtlr	r0
> -	blr
> -
>  #define CHECK_HMI_INTERRUPT						\
>  	mfspr	r0,SPRN_SRR1;						\
>  BEGIN_FTR_SECTION_NESTED(66);						\
> @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
>  	/* Invoke opal call to handle hmi */				\
>  	ld	r2,PACATOC(r13);					\
>  	ld	r1,PACAR1(r13);						\
> -	std	r3,ORIG_GPR3(r1);	/* Save original r3 */		\
> -	li	r3,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
> +	li	r0,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
>  	bl	opal_call_realmode;					\
> -	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
>  20:	nop;
>  
> 
> @@ -210,12 +225,91 @@ _GLOBAL(power7_wakeup_tb_loss)
>  BEGIN_FTR_SECTION
>  	CHECK_HMI_INTERRUPT
>  END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
> +
> +	li	r7,1
> +	mfspr	r8,SPRN_PIR
> +	/*
> +	 * The last 3 bits of PIR represent the thread id of a cpu
> +	 * in power8. This will need adjusting for power7.
> +	 */
> +	andi.	r8,r8,0x07		/* Get thread id into r8 */

I'd be more comfortable if we patched that instruction at boot with the
right mask.

> +	rotld	r7,r7,r8
> +	/* r7 now has 'thread_id'th bit set */
> +
> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
> +lwarx_loop2:
> +	lwarx	r15,0,r14
> +	andi.	r9,r15,PNV_CORE_IDLE_LOCK_BIT
> +	/*
> +	 * Lock bit is set in one of the 2 cases-
> +	 * a. In the sleep/winkle enter path, the last thread is executing
> +	 * fastsleep workaround code.
> +	 * b. In the wake up path, another thread is executing fastsleep
> +	 * workaround undo code or resyncing timebase or restoring context
> +	 * In either case loop until the lock bit is cleared.
> +	 */
> +	bne	lwarx_loop2

We should do some smt priority games here, otherwise the spinning
threads are going to slow down the one with the lock. Basically, if we
see the lock held, go out of line, drop to smt_low, and spin on a
normal load; once the lock is clear, go back to smt_medium and retry
the lwarx.
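
i.e. something along these lines, sketched in C using the kernel's
HMT_low()/HMT_medium() priority hints from asm/processor.h (the
variable names are otherwise assumptions):

	/* out-of-line slow path for a held lock */
	while (ACCESS_ONCE(*core_idle_state) & PNV_CORE_IDLE_LOCK_BIT)
		HMT_low();	/* drop SMT priority while another thread holds the lock */
	HMT_medium();		/* restore priority, then retry the lwarx loop */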

> +	cmpwi	cr2,r15,0
> +	or	r15,r15,r7		/* Set thread bit */
> +
> +	beq	cr2,first_thread
> +
> +	/* Not first thread in core to wake up */
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop2
> +	b	common_exit
> +
> +first_thread:
> +	/* First thread in core to wakeup */
> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop2
> +
> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
> +	lbz	r3,0(r3)
> +	cmpwi	r3,1

Same comment about dynamic patching.

> +	/* skip fastsleep workaround if it's not needed */
> +	bne	timebase_resync
> +
> +	/* Undo fast sleep workaround */
> +	mfcr	r16	/* Backup CR into a non-volatile register */

Why ? If you have your non-volatiles saved you can use an NV CR like CR2
no ? You need to restore those anyway...

Also same comments as on the way down vs barriers when doing
lock/unlock.

> +	li	r3,1
> +	li	r4,0
> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
> +	bl	opal_call_realmode
> +	mtcr	r16	/* Restore CR */
> +
> +	/* Do timebase resync if we are waking up from sleep. Use cr1 value
> +	 * set in exceptions-64s.S */
> +	ble	cr1,clear_lock
> +
> +timebase_resync:
>  	/* Time base re-sync */
> -	li	r3,OPAL_RESYNC_TIMEBASE
> +	li	r0,OPAL_RESYNC_TIMEBASE
>  	bl	opal_call_realmode;
> -
>  	/* TODO: Check r3 for failure */
>  
> +clear_lock:
> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +	stw	r15,0(r14)
> +
> +common_exit:
> +	li	r5,PNV_THREAD_RUNNING
> +	stb     r5,PACA_THREAD_IDLE_STATE(r13)
> +
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +	li      r0,KVM_HWTHREAD_IN_KERNEL
> +	stb     r0,HSTATE_HWTHREAD_STATE(r13)
> +	/* Order setting hwthread_state vs. testing hwthread_req */
> +	sync
> +	lbz     r0,HSTATE_HWTHREAD_REQ(r13)
> +	cmpwi   r0,0
> +	beq     6f
> +	b       kvm_start_guest
> +6:
> +#endif
> +
>  	REST_NVGPRS(r1)
>  	REST_GPR(2, r1)
>  	ld	r3,_CCR(r1)
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index feb549a..b2aa93b 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -158,6 +158,43 @@ opal_tracepoint_return:
>  	blr
>  #endif
>  
> +/*
> + * Make opal call in realmode. This is a generic function to be called
> + * from realmode. It handles endianness.
> + *
> + * r13 - paca pointer
> + * r1  - stack pointer
> + * r0  - opal token
> + */
> +_GLOBAL(opal_call_realmode)
> +	mflr	r12
> +	std	r12,_LINK(r1)
> +	ld	r2,PACATOC(r13)
> +	/* Set opal return address */
> +	LOAD_REG_ADDR(r12,return_from_opal_call)
> +	mtlr	r12
> +
> +	mfmsr	r12
> +#ifdef __LITTLE_ENDIAN__
> +	/* Handle endian-ness */
> +	li	r11,MSR_LE
> +	andc	r12,r12,r11
> +#endif
> +	mtspr	SPRN_HSRR1,r12
> +	LOAD_REG_ADDR(r11,opal)
> +	ld	r12,8(r11)
> +	ld	r2,0(r11)
> +	mtspr	SPRN_HSRR0,r12
> +	hrfid
> +
> +return_from_opal_call:
> +#ifdef __LITTLE_ENDIAN__
> +	FIXUP_ENDIAN
> +#endif
> +	ld	r12,_LINK(r1)
> +	mtlr	r12
> +	blr
> +
>  OPAL_CALL(opal_invalid_call,			OPAL_INVALID_CALL);
>  OPAL_CALL(opal_console_write,			OPAL_CONSOLE_WRITE);
>  OPAL_CALL(opal_console_read,			OPAL_CONSOLE_READ);
> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index 34c6665..17fb98c 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -36,6 +36,8 @@
>  #include <asm/opal.h>
>  #include <asm/kexec.h>
>  #include <asm/smp.h>
> +#include <asm/cputhreads.h>
> +#include <asm/cpuidle.h>
>  
>  #include "powernv.h"
>  
> @@ -292,11 +294,45 @@ static void __init pnv_setup_machdep_rtas(void)
>  
>  static u32 supported_cpuidle_states;
>  
> +static void pnv_alloc_idle_core_states(void)
> +{
> +	int i, j;
> +	int nr_cores = cpu_nr_cores();
> +	u32 *core_idle_state;
> +
> +	/*
> +	 * core_idle_state - First 8 bits track the idle state of each thread
> +	 * of the core. The 8th bit is the lock bit. Initially all thread bits
> +	 * are set. They are cleared when the thread enters deep idle state
> +	 * like sleep and winkle. Initially the lock bit is cleared.
> +	 * The lock bit has 2 purposes
> +	 * a. While the first thread is restoring core state, it prevents
> +	 * other threads in the core from switching to process context.
> +	 * b. While the last thread in the core is saving the core state, it
> +	 * prevents a different thread from waking up.
> +	 */
> +	for (i = 0; i < nr_cores; i++) {
> +		int first_cpu = i * threads_per_core;
> +		int node = cpu_to_node(first_cpu);
> +
> +		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
> +		for (j = 0; j < threads_per_core; j++) {
> +			int cpu = first_cpu + j;
> +
> +			paca[cpu].core_idle_state_ptr = core_idle_state;
> +			paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
> +
> +		}
> +	}
> +}
> +
>  u32 pnv_get_supported_cpuidle_states(void)
>  {
>  	return supported_cpuidle_states;
>  }
> +EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);
>  
> +u8 pnv_need_fastsleep_workaround;
>  static int __init pnv_init_idle_states(void)
>  {
>  	struct device_node *power_mgt;
> @@ -306,6 +342,7 @@ static int __init pnv_init_idle_states(void)
>  	int i;
>  
>  	supported_cpuidle_states = 0;
> +	pnv_need_fastsleep_workaround = 0;
>  
>  	if (cpuidle_disable != IDLE_NO_OVERRIDE)
>  		return 0;
> @@ -332,13 +369,14 @@ static int __init pnv_init_idle_states(void)
>  		flags = be32_to_cpu(idle_state_flags[i]);
>  		supported_cpuidle_states |= flags;
>  	}
> -
> +	if (supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)
> +		pnv_need_fastsleep_workaround = 1;
> +	pnv_alloc_idle_core_states();
>  	return 0;
>  }
>  
>  subsys_initcall(pnv_init_idle_states);
>  
> -
>  static int __init pnv_probe(void)
>  {
>  	unsigned long root = of_get_flat_dt_root();
> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
> index 3dc4cec..12b761a 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -167,7 +167,8 @@ static void pnv_smp_cpu_kill_self(void)
>  	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
>  	while (!generic_check_cpu_restart(cpu)) {
>  		ppc64_runlatch_off();
> -		if (idle_states & OPAL_PM_SLEEP_ENABLED)
> +		if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
> +				(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
>  			power7_sleep();
>  		else
>  			power7_nap(1);
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 0a7d827..a489b56 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -208,7 +208,8 @@ static int powernv_add_idle_states(void)
>  			nr_idle_states++;
>  		}
>  
> -		if (flags & OPAL_PM_SLEEP_ENABLED) {
> +		if (flags & OPAL_PM_SLEEP_ENABLED ||
> +			flags & OPAL_PM_SLEEP_ENABLED_ER1) {
>  			/* Add FASTSLEEP state */
>  			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
>  			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus
  2014-11-25 11:17 ` [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus Shreyas B. Prabhu
@ 2014-11-27  1:55   ` Benjamin Herrenschmidt
  2014-11-27  6:24     ` Shreyas B Prabhu
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2014-11-27  1:55 UTC (permalink / raw)
  To: Shreyas B. Prabhu
  Cc: linux-kernel, Paul Mackerras, Michael Ellerman, linuxppc-dev

On Tue, 2014-11-25 at 16:47 +0530, Shreyas B. Prabhu wrote:

> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
> index 4673353..66874aa 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8)
>  	beqlr
>  	li	r0,0
>  	mtspr	SPRN_LPID,r0
> +	mtspr	SPRN_WORT,r0
> +	mtspr	SPRN_WORC,r0
>  	mfspr	r3,SPRN_LPCR
>  	ori	r3, r3, LPCR_PECEDH
>  	bl	__init_LPCR
> @@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8)
>  	li	r0,0
>  	mtspr	SPRN_LPID,r0
>  	mfspr   r3,SPRN_LPCR
> +	mtspr	SPRN_WORT,r0
> +	mtspr	SPRN_WORC,r0
>  	ori	r3, r3, LPCR_PECEDH
>  	bl	__init_LPCR
>  	bl	__init_HFSCR

Clearing WORT and WORC might not be the best thing. We know the HW folks
have been trying to tune those values and we might need to preserve what
the boot FW has set.

Can you get in touch with them and double check what we should do here ?
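
If it turns out the firmware values must be kept, a minimal alternative
(a sketch — the function names are assumptions; mfspr()/mtspr() and the
SPRN_* constants are the kernel's) would be to capture them once at
boot and restore those instead of zero:

	static u64 wort_boot, worc_boot;

	void __init save_wort_worc_boot_values(void)	/* assumed name */
	{
		wort_boot = mfspr(SPRN_WORT);
		worc_boot = mfspr(SPRN_WORC);
	}

	void restore_wort_worc(void)			/* assumed name */
	{
		mtspr(SPRN_WORT, wort_boot);
		mtspr(SPRN_WORC, worc_boot);
	}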

> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 3311c8d..c9897cb 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -112,6 +112,16 @@ BEGIN_FTR_SECTION
>  
>  	cmpwi	cr1,r13,2
>  
> +	/* Check if last bit of HSPRG0 is set. This indicates whether we are
> +	 * waking up from winkle */
> +	li	r3,1
> +	mfspr	r4,SPRN_HSPRG0
> +	and	r5,r4,r3
> +	cmpwi	cr4,r5,1	/* Store result in cr4 for later use */
> +
> +	andc	r4,r4,r3
> +	mtspr	SPRN_HSPRG0,r4
> +

There is an open question here whether adding a beq cr4,+8 after the
cmpwi (or a +4 after the andc) is worthwhile. Can you check ? (either
measure or talk to HW folks). Also we could write directly to r13...

>  	GET_PACA(r13)
>  	lbz	r0,PACA_THREAD_IDLE_STATE(r13)
>  	cmpwi   cr2,r0,PNV_THREAD_NAP
> diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
> index c1d590f..78c30b0 100644
> --- a/arch/powerpc/kernel/idle_power7.S
> +++ b/arch/powerpc/kernel/idle_power7.S
> @@ -19,8 +19,22 @@
>  #include <asm/kvm_book3s_asm.h>
>  #include <asm/opal.h>
>  #include <asm/cpuidle.h>
> +#include <asm/mmu-hash64.h>
>  
>  #undef DEBUG
> +/*
> + * Use unused space in the interrupt stack to save and restore
> + * registers for winkle support.
> + */
> +#define _SDR1	GPR3
> +#define _RPR	GPR4
> +#define _SPURR	GPR5
> +#define _PURR	GPR6
> +#define _TSCR	GPR7
> +#define _DSCR	GPR8
> +#define _AMOR	GPR9
> +#define _PMC5	GPR10
> +#define _PMC6	GPR11

WORT/WORTC need saving restoring

>  /* Idle state entry routines */
>  
> @@ -153,32 +167,60 @@ lwarx_loop1:
>  	b	common_enter
>  
>  last_thread:
> -	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
> -	lbz	r3,0(r3)
> -	cmpwi	r3,1
> -	bne	common_enter
>  	/*
>  	 * Last thread of the core entering sleep. Last thread needs to execute
>  	 * the hardware bug workaround code. Before that, set the lock bit to
>  	 * avoid the race of other threads waking up and undoing workaround
>  	 * before workaround is applied.
>  	 */
> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
> +	lbz	r3,0(r3)
> +	cmpwi	r3,1
> +	bne	common_enter
> +
>  	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>  	stwcx.	r15,0,r14
>  	bne-	lwarx_loop1
>  
>  	/* Fast sleep workaround */
> +	mfcr	r16	/* Backup CR to a non-volatile register */
>  	li	r3,1
>  	li	r4,1
>  	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
>  	bl	opal_call_realmode
> +	mtcr	r16	/* Restore CR */

Why isn't the above already in the previous patch ? Also see my comment
about using a non-volatile CR instead.

>  	/* Clear Lock bit */
>  	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>  	stw	r15,0(r14)
>  
> -common_enter: /* common code for all the threads entering sleep */
> +common_enter: /* common code for all the threads entering sleep or winkle*/
> +	bgt	cr1,enter_winkle
>  	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
> +enter_winkle:
> +	/*
> +	 * Note all registers, i.e. per-core, per-subcore and per-thread, are
> +	 * saved here since any thread in the core might wake up first
> +	 */
> +	mfspr	r3,SPRN_SDR1
> +	std	r3,_SDR1(r1)
> +	mfspr	r3,SPRN_RPR
> +	std	r3,_RPR(r1)
> +	mfspr	r3,SPRN_SPURR
> +	std	r3,_SPURR(r1)
> +	mfspr	r3,SPRN_PURR
> +	std	r3,_PURR(r1)
> +	mfspr	r3,SPRN_TSCR
> +	std	r3,_TSCR(r1)
> +	mfspr	r3,SPRN_DSCR
> +	std	r3,_DSCR(r1)
> +	mfspr	r3,SPRN_AMOR
> +	std	r3,_AMOR(r1)
> +	mfspr	r3,SPRN_PMC5
> +	std	r3,_PMC5(r1)
> +	mfspr	r3,SPRN_PMC6
> +	std	r3,_PMC6(r1)
> +	IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
>  
>  _GLOBAL(power7_idle)
>  	/* Now check if user or arch enabled NAP mode */
> @@ -201,6 +243,12 @@ _GLOBAL(power7_sleep)
>  	b	power7_powersave_common
>  	/* No return */
>  
> +_GLOBAL(power7_winkle)
> +	li	r3,PNV_THREAD_WINKLE
> +	li	r4,1
> +	b	power7_powersave_common
> +	/* No return */
> +
>  #define CHECK_HMI_INTERRUPT						\
>  	mfspr	r0,SPRN_SRR1;						\
>  BEGIN_FTR_SECTION_NESTED(66);						\
> @@ -250,22 +298,54 @@ lwarx_loop2:
>  	 */
>  	bne	lwarx_loop2
>  
> -	cmpwi	cr2,r15,0
> +	cmpwi	cr2,r15,0	/* Check if first in core */
> +	lbz	r4,PACA_SUBCORE_SIBLING_MASK(r13)
> +	and	r4,r4,r15
> +	cmpwi	cr3,r4,0	/* Check if first in subcore */
> +
> +	/*
> +	 * At this stage
> +	 * cr1 - 01 if waking up from sleep or winkle
> +	 * cr2 - 10 if first thread to wakeup in core
> +	 * cr3 - 10 if first thread to wakeup in subcore
> +	 * cr4 - 10 if waking up from winkle
> +	 */
> +
>  	or	r15,r15,r7		/* Set thread bit */
>  
> -	beq	cr2,first_thread
> +	beq	cr3,first_thread_in_subcore
>  
> -	/* Not first thread in core to wake up */
> +	/* Not first thread in subcore to wake up */
>  	stwcx.	r15,0,r14
>  	bne-	lwarx_loop2
>  	b	common_exit
>  
> -first_thread:
> -	/* First thread in core to wakeup */
> +first_thread_in_subcore:
> +	/* First thread in subcore to wake up: set the lock bit */
>  	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>  	stwcx.	r15,0,r14
>  	bne-	lwarx_loop2
>  
> +	/*
> +	 * If waking up from sleep, subcore state is not lost. Hence
> +	 * skip subcore state restore
> +	 */
> +	bne	cr4,subcore_state_restored
> +
> +	/* Restore per-subcore state */
> +	ld      r4,_SDR1(r1)
> +	mtspr   SPRN_SDR1,r4
> +	ld      r4,_RPR(r1)
> +	mtspr   SPRN_RPR,r4
> +	ld	r4,_AMOR(r1)
> +	mtspr	SPRN_AMOR,r4
> +
> +subcore_state_restored:
> +	/* Check if the thread is also the first thread in the core. If not,
> +	 * skip to clear_lock */
> +	bne	cr2,clear_lock
> +
> +first_thread_in_core:
>  	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>  	lbz	r3,0(r3)
>  	cmpwi	r3,1
> @@ -280,21 +360,71 @@ first_thread:
>  	bl	opal_call_realmode
>  	mtcr	r16	/* Restore CR */
>  
> -	/* Do timebase resync if we are waking up from sleep. Use cr1 value
> -	 * set in exceptions-64s.S */
> +timebase_resync:
> +	/* Do timebase resync only if the core truly woke up from
> +	 * sleep/winkle */
>  	ble	cr1,clear_lock
>  
> -timebase_resync:
>  	/* Time base re-sync */
> +	mfcr	r16	/* Backup CR into a non-volatile register */
>  	li	r0,OPAL_RESYNC_TIMEBASE
>  	bl	opal_call_realmode;
>  	/* TODO: Check r3 for failure */
> +	mtcr	r16	/* Restore CR */
> +
> +	/*
> +	 * If waking up from sleep, per core state is not lost, skip to
> +	 * clear_lock.
> +	 */
> +	bne	cr4,clear_lock
> +
> +	/* Restore per core state */
> +	ld	r4,_TSCR(r1)
> +	mtspr	SPRN_TSCR,r4
>  
>  clear_lock:
>  	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>  	stw	r15,0(r14)
>  
>  common_exit:
> +	/* Common to all threads
> +	 *
> +	 * If waking up from sleep, hypervisor state is not lost. Hence
> +	 * skip hypervisor state restore.
> +	 */
> +	bne	cr4,hypervisor_state_restored
> +
> +	/* Waking up from winkle */
> +
> +	/* Restore per thread state */
> +	bl	__restore_cpu_power8
> +
> +	/* Restore SLB  from PACA */
> +	ld	r8,PACA_SLBSHADOWPTR(r13)
> +
> +	.rept	SLB_NUM_BOLTED
> +	li	r3, SLBSHADOW_SAVEAREA
> +	LDX_BE	r5, r8, r3
> +	addi	r3, r3, 8
> +	LDX_BE	r6, r8, r3
> +	andis.	r7,r5,SLB_ESID_V@h
> +	beq	1f
> +	slbmte	r6,r5
> +1:	addi	r8,r8,16
> +	.endr
> +
> +	ld	r4,_SPURR(r1)
> +	mtspr	SPRN_SPURR,r4
> +	ld	r4,_PURR(r1)
> +	mtspr	SPRN_PURR,r4
> +	ld	r4,_DSCR(r1)
> +	mtspr	SPRN_DSCR,r4
> +	ld	r4,_PMC5(r1)
> +	mtspr	SPRN_PMC5,r4
> +	ld	r4,_PMC6(r1)
> +	mtspr	SPRN_PMC6,r4
> +
> +hypervisor_state_restored:
>  	li	r5,PNV_THREAD_RUNNING
>  	stb     r5,PACA_THREAD_IDLE_STATE(r13)
>  
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index b2aa93b..e1e91e0 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -191,6 +191,7 @@ return_from_opal_call:
>  #ifdef __LITTLE_ENDIAN__
>  	FIXUP_ENDIAN
>  #endif
> +	ld	r2,PACATOC(r13)
>  	ld	r12,_LINK(r1)
>  	mtlr	r12
>  	blr
> @@ -284,6 +285,7 @@ OPAL_CALL(opal_sensor_read,			OPAL_SENSOR_READ);
>  OPAL_CALL(opal_get_param,			OPAL_GET_PARAM);
>  OPAL_CALL(opal_set_param,			OPAL_SET_PARAM);
>  OPAL_CALL(opal_handle_hmi,			OPAL_HANDLE_HMI);
> +OPAL_CALL(opal_slw_set_reg,			OPAL_SLW_SET_REG);
>  OPAL_CALL(opal_register_dump_region,		OPAL_REGISTER_DUMP_REGION);
>  OPAL_CALL(opal_unregister_dump_region,		OPAL_UNREGISTER_DUMP_REGION);
>  OPAL_CALL(opal_pci_set_phb_cxl_mode,		OPAL_PCI_SET_PHB_CXL_MODE);
> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index 17fb98c..4a886a1 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -40,6 +40,7 @@
>  #include <asm/cpuidle.h>
>  
>  #include "powernv.h"
> +#include "subcore.h"
>  
>  static void __init pnv_setup_arch(void)
>  {
> @@ -293,6 +294,74 @@ static void __init pnv_setup_machdep_rtas(void)
>  #endif /* CONFIG_PPC_POWERNV_RTAS */
>  
>  static u32 supported_cpuidle_states;
> +int pnv_save_sprs_for_winkle(void)
> +{
> +	int cpu;
> +	int rc;
> +
> +	/*
> +	 * hid0, hid1, hid4, hid5, hmeer and lpcr values are symmetric across
> +	 * all cpus at boot. Get these reg values from the current cpu and use
> +	 * the same across all cpus.
> +	 */
> +	uint64_t lpcr_val = mfspr(SPRN_LPCR);
> +	uint64_t hid0_val = mfspr(SPRN_HID0);
> +	uint64_t hid1_val = mfspr(SPRN_HID1);
> +	uint64_t hid4_val = mfspr(SPRN_HID4);
> +	uint64_t hid5_val = mfspr(SPRN_HID5);
> +	uint64_t hmeer_val = mfspr(SPRN_HMEER);
> +
> +	for_each_possible_cpu(cpu) {
> +		uint64_t pir = get_hard_smp_processor_id(cpu);
> +		uint64_t hsprg0_val = (uint64_t)&paca[cpu];
> +
> +		/*
> +		 * HSPRG0 is used to store the cpu's pointer to paca. Hence last
> +		 * 3 bits are guaranteed to be 0. Program slw to restore HSPRG0
> +		 * with 63rd bit set, so that when a thread wakes up at 0x100 we
> +		 * can use this bit to distinguish between fastsleep and
> +		 * deep winkle.
> +		 */
> +		hsprg0_val |= 1;
> +
> +		rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);
> +		if (rc != 0)
> +			return rc;
> +
> +		rc = opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
> +		if (rc != 0)
> +			return rc;
> +
> +		/* HIDs are per core registers */
> +		if (cpu_thread_in_core(cpu) == 0) {
> +
> +			rc = opal_slw_set_reg(pir, SPRN_HMEER, hmeer_val);
> +			if (rc != 0)
> +				return rc;
> +
> +			rc = opal_slw_set_reg(pir, SPRN_HID0, hid0_val);
> +			if (rc != 0)
> +				return rc;
> +
> +			rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
> +			if (rc != 0)
> +				return rc;
> +
> +			rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
> +			if (rc != 0)
> +				return rc;
> +
> +			rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
> +			if (rc != 0)
> +				return rc;
> +
> +		}
> +
> +	}
> +
> +	return 0;
> +
> +}
>  
>  static void pnv_alloc_idle_core_states(void)
>  {
> @@ -324,6 +393,10 @@ static void pnv_alloc_idle_core_states(void)
>  
>  		}
>  	}
> +	update_subcore_sibling_mask();
> +	if (supported_cpuidle_states & OPAL_PM_WINKLE_ENABLED)
> +		pnv_save_sprs_for_winkle();
> +
>  }
>  
>  u32 pnv_get_supported_cpuidle_states(void)
> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
> index 12b761a..5e35857 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -167,7 +167,9 @@ static void pnv_smp_cpu_kill_self(void)
>  	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
>  	while (!generic_check_cpu_restart(cpu)) {
>  		ppc64_runlatch_off();
> -		if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
> +		if (idle_states & OPAL_PM_WINKLE_ENABLED)
> +			power7_winkle();
> +		else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
>  				(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
>  			power7_sleep();
>  		else
> diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
> index c87f96b..f60f80a 100644
> --- a/arch/powerpc/platforms/powernv/subcore.c
> +++ b/arch/powerpc/platforms/powernv/subcore.c
> @@ -160,6 +160,18 @@ static void wait_for_sync_step(int step)
>  	mb();
>  }
>  
> +static void update_hid_in_slw(u64 hid0)
> +{
> +	u64 idle_states = pnv_get_supported_cpuidle_states();
> +
> +	if (idle_states & OPAL_PM_WINKLE_ENABLED) {
> +		/* OPAL call to patch slw with the new HID0 value */
> +		u64 cpu_pir = hard_smp_processor_id();
> +
> +		opal_slw_set_reg(cpu_pir, SPRN_HID0, hid0);
> +	}
> +}
> +
>  static void unsplit_core(void)
>  {
>  	u64 hid0, mask;
> @@ -179,6 +191,7 @@ static void unsplit_core(void)
>  	hid0 = mfspr(SPRN_HID0);
>  	hid0 &= ~HID0_POWER8_DYNLPARDIS;
>  	mtspr(SPRN_HID0, hid0);
> +	update_hid_in_slw(hid0);
>  
>  	while (mfspr(SPRN_HID0) & mask)
>  		cpu_relax();
> @@ -215,6 +228,7 @@ static void split_core(int new_mode)
>  	hid0  = mfspr(SPRN_HID0);
>  	hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
>  	mtspr(SPRN_HID0, hid0);
> +	update_hid_in_slw(hid0);
>  
>  	/* Wait for it to happen */
>  	while (!(mfspr(SPRN_HID0) & split_parms[i].mask))
> @@ -251,6 +265,25 @@ bool cpu_core_split_required(void)
>  	return true;
>  }
>  
> +void update_subcore_sibling_mask(void)
> +{
> +	int cpu;
> +	/*
> +	 * sibling mask for the first cpu. Left shift this by required bits
> +	 * to get sibling mask for the rest of the cpus.
> +	 */
> +	int sibling_mask_first_cpu =  (1 << threads_per_subcore) - 1;
> +
> +	for_each_possible_cpu(cpu) {
> +		int tid = cpu_thread_in_core(cpu);
> +		int offset = (tid / threads_per_subcore) * threads_per_subcore;
> +		int mask = sibling_mask_first_cpu << offset;
> +
> +		paca[cpu].subcore_sibling_mask = mask;
> +
> +	}
> +}
> +
>  static int cpu_update_split_mode(void *data)
>  {
>  	int cpu, new_mode = *(int *)data;
> @@ -284,6 +317,7 @@ static int cpu_update_split_mode(void *data)
>  		/* Make the new mode public */
>  		subcores_per_core = new_mode;
>  		threads_per_subcore = threads_per_core / subcores_per_core;
> +		update_subcore_sibling_mask();
>  
>  		/* Make sure the new mode is written before we exit */
>  		mb();
> diff --git a/arch/powerpc/platforms/powernv/subcore.h b/arch/powerpc/platforms/powernv/subcore.h
> index 148abc9..604eb40 100644
> --- a/arch/powerpc/platforms/powernv/subcore.h
> +++ b/arch/powerpc/platforms/powernv/subcore.h
> @@ -15,4 +15,5 @@
>  
>  #ifndef __ASSEMBLY__
>  void split_core_secondary_loop(u8 *state);
> +extern void update_subcore_sibling_mask(void);
>  #endif



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
  2014-11-27  0:37   ` Benjamin Herrenschmidt
@ 2014-11-27  5:55     ` Shreyas B Prabhu
  0 siblings, 0 replies; 12+ messages in thread
From: Shreyas B Prabhu @ 2014-11-27  5:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Paul Mackerras, Michael Ellerman,
	Rafael J. Wysocki, linux-pm, linuxppc-dev

Hi Ben,

On Thursday 27 November 2014 06:07 AM, Benjamin Herrenschmidt wrote:
> 
>>  
>> @@ -37,8 +38,7 @@
>>  
>>  /*
>>   * Pass requested state in r3:
>> - * 	0 - nap
>> - * 	1 - sleep
>> + *	r3 - PNV_THREAD_NAP/SLEEP/WINKLE
>>   *
>>   * To check IRQ_HAPPENED in r4
>>   * 	0 - don't check
>> @@ -123,12 +123,62 @@ power7_enter_nap_mode:
>>  	li	r4,KVM_HWTHREAD_IN_NAP
>>  	stb	r4,HSTATE_HWTHREAD_STATE(r13)
>>  #endif
>> -	cmpwi	cr0,r3,1
>> -	beq	2f
>> +	stb	r3,PACA_THREAD_IDLE_STATE(r13)
>> +	cmpwi	cr1,r3,PNV_THREAD_SLEEP
>> +	bge	cr1,2f
>>  	IDLE_STATE_ENTER_SEQ(PPC_NAP)
>>  	/* No return */
>> -2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>> -	/* No return */
>> +2:
>> +	/* Sleep or winkle */
>> +	li	r7,1
>> +	mfspr	r8,SPRN_PIR
>> +	/*
>> +	 * The last 3 bits of PIR represent the thread id of a cpu
>> +	 * in power8. This will need adjusting for power7.
>> +	 */
>> +	andi.	r8,r8,0x07			/* Get thread id into r8 */
>> +	rotld	r7,r7,r8
>> +
>> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
> 
> I assume we have already saved all non-volatile registers ? Because you
> are clobbering one here and more below.

Yes. At this stage all the non-volatile registers are already saved on
the stack.
> 
>> +lwarx_loop1:
>> +	lwarx	r15,0,r14
>> +	andc	r15,r15,r7			/* Clear thread bit */
>> +
>> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +	beq	last_thread
>> +
>> +	/* Not the last thread to goto sleep */
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop1
>> +	b	common_enter
>> +
>> +last_thread:
>> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>> +	lbz	r3,0(r3)
>> +	cmpwi	r3,1
>> +	bne	common_enter
> 
> This looks wrong. If the workaround is 0, we don't do the stwcx. at
> all... Did you try with pnv_need_fastsleep_workaround set to 0 ? It
> should work most of the time as long as you don't hit the fairly
> rare race window :)
> 

My bad. I missed the stwcx. in the pnv_need_fastsleep_workaround = 0 path.
> Also it would be nice to make the above a dynamically patches feature
> section, though that means pnv_need_fastsleep_workaround needs to turn
> into a CPU feature bit and that needs to be done *very* early on.
> 
> Another option is to patch out manually from the pnv code the pair:
> 
> 	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>  	beq	last_thread
> 
> To turn them into nops by hand rather than using the feature system.
> 

Okay. I'll see which works out best here.
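
For reference, the manual nop-patching option could look roughly like
this (a sketch assuming an exported label over the andi./beq pair;
patch_instruction() and PPC_INST_NOP are existing kernel helpers from
asm/code-patching.h and asm/ppc-opcode.h):

	extern unsigned int fastsleep_workaround_check[2];	/* assumed label */

	static void __init patch_out_fastsleep_check(void)
	{
		if (!pnv_need_fastsleep_workaround) {
			patch_instruction(&fastsleep_workaround_check[0], PPC_INST_NOP);
			patch_instruction(&fastsleep_workaround_check[1], PPC_INST_NOP);
		}
	}
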
>> +	/*
>> +	 * Last thread of the core entering sleep. Last thread needs to execute
>> +	 * the hardware bug workaround code. Before that, set the lock bit to
>> +	 * avoid the race of other threads waking up and undoing workaround
>> +	 * before workaround is applied.
>> +	 */
>> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop1
>> +
>> +	/* Fast sleep workaround */
>> +	li	r3,1
>> +	li	r4,1
>> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
>> +	bl	opal_call_realmode
>> +
>> +	/* Clear Lock bit */
> 
> It's a lock, I would add a lwsync here to be safe, and I would add an
> isync before the bne- above. Just to ensure that whatever is done
> inside that locked section remains in there.
>

Okay. Will add it.
>> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +	stw	r15,0(r14)
>> +
>> +common_enter: /* common code for all the threads entering sleep */
>> +	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>>  
>>  _GLOBAL(power7_idle)
>>  	/* Now check if user or arch enabled NAP mode */
>> @@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
>>  
>>  _GLOBAL(power7_nap)
>>  	mr	r4,r3
>> -	li	r3,0
>> +	li	r3,PNV_THREAD_NAP
>>  	b	power7_powersave_common
>>  	/* No return */
>>  
>>  _GLOBAL(power7_sleep)
>> -	li	r3,1
>> +	li	r3,PNV_THREAD_SLEEP
>>  	li	r4,1
>>  	b	power7_powersave_common
>>  	/* No return */
>>  
>> -/*
>> - * Make opal call in realmode. This is a generic function to be called
>> - * from realmode from reset vector. It handles endianess.
>> - *
>> - * r13 - paca pointer
>> - * r1  - stack pointer
>> - * r3  - opal token
>> - */
>> -opal_call_realmode:
>> -	mflr	r12
>> -	std	r12,_LINK(r1)
>> -	ld	r2,PACATOC(r13)
>> -	/* Set opal return address */
>> -	LOAD_REG_ADDR(r0,return_from_opal_call)
>> -	mtlr	r0
>> -	/* Handle endian-ness */
>> -	li	r0,MSR_LE
>> -	mfmsr	r12
>> -	andc	r12,r12,r0
>> -	mtspr	SPRN_HSRR1,r12
>> -	mr	r0,r3			/* Move opal token to r0 */
>> -	LOAD_REG_ADDR(r11,opal)
>> -	ld	r12,8(r11)
>> -	ld	r2,0(r11)
>> -	mtspr	SPRN_HSRR0,r12
>> -	hrfid
>> -
>> -return_from_opal_call:
>> -	FIXUP_ENDIAN
>> -	ld	r0,_LINK(r1)
>> -	mtlr	r0
>> -	blr
>> -
>>  #define CHECK_HMI_INTERRUPT						\
>>  	mfspr	r0,SPRN_SRR1;						\
>>  BEGIN_FTR_SECTION_NESTED(66);						\
>> @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
>>  	/* Invoke opal call to handle hmi */				\
>>  	ld	r2,PACATOC(r13);					\
>>  	ld	r1,PACAR1(r13);						\
>> -	std	r3,ORIG_GPR3(r1);	/* Save original r3 */		\
>> -	li	r3,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
>> +	li	r0,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
>>  	bl	opal_call_realmode;					\
>> -	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
>>  20:	nop;
>>  
>>
>> @@ -210,12 +225,91 @@ _GLOBAL(power7_wakeup_tb_loss)
>>  BEGIN_FTR_SECTION
>>  	CHECK_HMI_INTERRUPT
>>  END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>> +
>> +	li	r7,1
>> +	mfspr	r8,SPRN_PIR
>> +	/*
>> +	 * The last 3 bits of PIR represent the thread id of a cpu
>> +	 * in power8. This will need adjusting for power7.
>> +	 */
>> +	andi.	r8,r8,0x07		/* Get thread id into r8 */
> 
> I'd be more comfortable if we patched that instruction at boot with the
> right mask.
> 

Okay. I'll make the change.
>> +	rotld	r7,r7,r8
>> +	/* r7 now has 'thread_id'th bit set */
>> +
>> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
>> +lwarx_loop2:
>> +	lwarx	r15,0,r14
>> +	andi.	r9,r15,PNV_CORE_IDLE_LOCK_BIT
>> +	/*
>> +	 * Lock bit is set in one of the 2 cases-
>> +	 * a. In the sleep/winkle enter path, the last thread is executing
>> +	 * fastsleep workaround code.
>> +	 * b. In the wake up path, another thread is executing fastsleep
>> +	 * workaround undo code or resyncing timebase or restoring context
>> +	 * In either case loop until the lock bit is cleared.
>> +	 */
>> +	bne	lwarx_loop2
> 
> We should do some smt priority games here otherwise the spinning threads
> are going to slow down the one with the lock. Basically, if we see the
> lock held, go out of line, smt_low, spin on a normal load, and when
> smt_medium and go back to lwarx
> 

Okay.
>> +	cmpwi	cr2,r15,0
>> +	or	r15,r15,r7		/* Set thread bit */
>> +
>> +	beq	cr2,first_thread
>> +
>> +	/* Not first thread in core to wake up */
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop2
>> +	b	common_exit
>> +
>> +first_thread:
>> +	/* First thread in core to wakeup */
>> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop2
>> +
>> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>> +	lbz	r3,0(r3)
>> +	cmpwi	r3,1
> 
> Same comment about dynamic patching.

Okay.
> 
>> +	/* skip fastsleep workaround if it's not needed */
>> +	bne	timebase_resync
>> +
>> +	/* Undo fast sleep workaround */
>> +	mfcr	r16	/* Backup CR into a non-volatile register */
> 
> Why ? If you have your non-volatiles saved you can use an NV CR like CR2
> no ? You need to restore those anyway...
> 

I wasn't sure CRs were preserved across the OPAL call. I'll make the
change to use CR[234].
> Also same comments as on the way down vs barriers when doing
> lock/unlock.
> 

Okay.
>> +	li	r3,1
>> +	li	r4,0
>> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
>> +	bl	opal_call_realmode
>> +	mtcr	r16	/* Restore CR */
>> +
>> +	/* Do timebase resync if we are waking up from sleep. Use cr1 value
>> +	 * set in exceptions-64s.S */
>> +	ble	cr1,clear_lock
>> +
>> +timebase_resync:
>>  	/* Time base re-sync */
>> -	li	r3,OPAL_RESYNC_TIMEBASE
>> +	li	r0,OPAL_RESYNC_TIMEBASE
>>  	bl	opal_call_realmode;
>> -
>>  	/* TODO: Check r3 for failure */
>>  
>> +clear_lock:
>> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +	stw	r15,0(r14)
>> +
>> +common_exit:
>> +	li	r5,PNV_THREAD_RUNNING
>> +	stb     r5,PACA_THREAD_IDLE_STATE(r13)
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +	li      r0,KVM_HWTHREAD_IN_KERNEL
>> +	stb     r0,HSTATE_HWTHREAD_STATE(r13)
>> +	/* Order setting hwthread_state vs. testing hwthread_req */
>> +	sync
>> +	lbz     r0,HSTATE_HWTHREAD_REQ(r13)
>> +	cmpwi   r0,0
>> +	beq     6f
>> +	b       kvm_start_guest
>> +6:
>> +#endif
>> +
>>  	REST_NVGPRS(r1)
>>  	REST_GPR(2, r1)
>>  	ld	r3,_CCR(r1)
>> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> index feb549a..b2aa93b 100644
>> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
>> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> @@ -158,6 +158,43 @@ opal_tracepoint_return:
>>  	blr
>>  #endif
>>  
>> +/*
>> + * Make opal call in realmode. This is a generic function to be called
>> + * from realmode. It handles endianness.
>> + *
>> + * r13 - paca pointer
>> + * r1  - stack pointer
>> + * r0  - opal token
>> + */
>> +_GLOBAL(opal_call_realmode)
>> +	mflr	r12
>> +	std	r12,_LINK(r1)
>> +	ld	r2,PACATOC(r13)
>> +	/* Set opal return address */
>> +	LOAD_REG_ADDR(r12,return_from_opal_call)
>> +	mtlr	r12
>> +
>> +	mfmsr	r12
>> +#ifdef __LITTLE_ENDIAN__
>> +	/* Handle endian-ness */
>> +	li	r11,MSR_LE
>> +	andc	r12,r12,r11
>> +#endif
>> +	mtspr	SPRN_HSRR1,r12
>> +	LOAD_REG_ADDR(r11,opal)
>> +	ld	r12,8(r11)
>> +	ld	r2,0(r11)
>> +	mtspr	SPRN_HSRR0,r12
>> +	hrfid
>> +
>> +return_from_opal_call:
>> +#ifdef __LITTLE_ENDIAN__
>> +	FIXUP_ENDIAN
>> +#endif
>> +	ld	r12,_LINK(r1)
>> +	mtlr	r12
>> +	blr
>> +
>>  OPAL_CALL(opal_invalid_call,			OPAL_INVALID_CALL);
>>  OPAL_CALL(opal_console_write,			OPAL_CONSOLE_WRITE);
>>  OPAL_CALL(opal_console_read,			OPAL_CONSOLE_READ);
>> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
>> index 34c6665..17fb98c 100644
>> --- a/arch/powerpc/platforms/powernv/setup.c
>> +++ b/arch/powerpc/platforms/powernv/setup.c
>> @@ -36,6 +36,8 @@
>>  #include <asm/opal.h>
>>  #include <asm/kexec.h>
>>  #include <asm/smp.h>
>> +#include <asm/cputhreads.h>
>> +#include <asm/cpuidle.h>
>>  
>>  #include "powernv.h"
>>  
>> @@ -292,11 +294,45 @@ static void __init pnv_setup_machdep_rtas(void)
>>  
>>  static u32 supported_cpuidle_states;
>>  
>> +static void pnv_alloc_idle_core_states(void)
>> +{
>> +	int i, j;
>> +	int nr_cores = cpu_nr_cores();
>> +	u32 *core_idle_state;
>> +
>> +	/*
>> +	 * core_idle_state - First 8 bits track the idle state of each thread
>> +	 * of the core. The 8th bit is the lock bit. Initially all thread bits
>> +	 * are set. They are cleared when the thread enters deep idle state
>> +	 * like sleep and winkle. Initially the lock bit is cleared.
>> +	 * The lock bit has 2 purposes
>> +	 * a. While the first thread is restoring core state, it prevents
>> +	 * other threads in the core from switching to process context.
>> +	 * b. While the last thread in the core is saving the core state, it
>> +	 * prevents a different thread from waking up.
>> +	 */
>> +	for (i = 0; i < nr_cores; i++) {
>> +		int first_cpu = i * threads_per_core;
>> +		int node = cpu_to_node(first_cpu);
>> +
>> +		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
>> +		for (j = 0; j < threads_per_core; j++) {
>> +			int cpu = first_cpu + j;
>> +
>> +			paca[cpu].core_idle_state_ptr = core_idle_state;
>> +			paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
>> +
>> +		}
>> +	}
>> +}
>> +
>>  u32 pnv_get_supported_cpuidle_states(void)
>>  {
>>  	return supported_cpuidle_states;
>>  }
>> +EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);
>>  
>> +u8 pnv_need_fastsleep_workaround;
>>  static int __init pnv_init_idle_states(void)
>>  {
>>  	struct device_node *power_mgt;
>> @@ -306,6 +342,7 @@ static int __init pnv_init_idle_states(void)
>>  	int i;
>>  
>>  	supported_cpuidle_states = 0;
>> +	pnv_need_fastsleep_workaround = 0;
>>  
>>  	if (cpuidle_disable != IDLE_NO_OVERRIDE)
>>  		return 0;
>> @@ -332,13 +369,14 @@ static int __init pnv_init_idle_states(void)
>>  		flags = be32_to_cpu(idle_state_flags[i]);
>>  		supported_cpuidle_states |= flags;
>>  	}
>> -
>> +	if (supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)
>> +		pnv_need_fastsleep_workaround = 1;
>> +	pnv_alloc_idle_core_states();
>>  	return 0;
>>  }
>>  
>>  subsys_initcall(pnv_init_idle_states);
>>  
>> -
>>  static int __init pnv_probe(void)
>>  {
>>  	unsigned long root = of_get_flat_dt_root();
>> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
>> index 3dc4cec..12b761a 100644
>> --- a/arch/powerpc/platforms/powernv/smp.c
>> +++ b/arch/powerpc/platforms/powernv/smp.c
>> @@ -167,7 +167,8 @@ static void pnv_smp_cpu_kill_self(void)
>>  	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
>>  	while (!generic_check_cpu_restart(cpu)) {
>>  		ppc64_runlatch_off();
>> -		if (idle_states & OPAL_PM_SLEEP_ENABLED)
>> +		if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
>> +				(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
>>  			power7_sleep();
>>  		else
>>  			power7_nap(1);
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 0a7d827..a489b56 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -208,7 +208,8 @@ static int powernv_add_idle_states(void)
>>  			nr_idle_states++;
>>  		}
>>  
>> -		if (flags & OPAL_PM_SLEEP_ENABLED) {
>> +		if (flags & OPAL_PM_SLEEP_ENABLED ||
>> +			flags & OPAL_PM_SLEEP_ENABLED_ER1) {
>>  			/* Add FASTSLEEP state */
>>  			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
>>  			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
> 
> 
Thanks,
Shreyas


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus
  2014-11-27  1:55   ` Benjamin Herrenschmidt
@ 2014-11-27  6:24     ` Shreyas B Prabhu
  0 siblings, 0 replies; 12+ messages in thread
From: Shreyas B Prabhu @ 2014-11-27  6:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Paul Mackerras, Michael Ellerman, linuxppc-dev

Hi Ben,

On Thursday 27 November 2014 07:25 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-11-25 at 16:47 +0530, Shreyas B. Prabhu wrote:
> 
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
>> index 4673353..66874aa 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8)
>>  	beqlr
>>  	li	r0,0
>>  	mtspr	SPRN_LPID,r0
>> +	mtspr	SPRN_WORT,r0
>> +	mtspr	SPRN_WORC,r0
>>  	mfspr	r3,SPRN_LPCR
>>  	ori	r3, r3, LPCR_PECEDH
>>  	bl	__init_LPCR
>> @@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8)
>>  	li	r0,0
>>  	mtspr	SPRN_LPID,r0
>>  	mfspr   r3,SPRN_LPCR
>> +	mtspr	SPRN_WORT,r0
>> +	mtspr	SPRN_WORC,r0
>>  	ori	r3, r3, LPCR_PECEDH
>>  	bl	__init_LPCR
>>  	bl	__init_HFSCR
> 
> Clearing WORT and WORC might not be the best thing. We know the HW folks
> have been trying to tune those values and we might need to preserve what
> the boot FW has set.
> 
> Can you get in touch with them and double check what we should do here ?
> 

I observed these were always 0. I'll speak to HW folks as you suggested.

>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 3311c8d..c9897cb 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -112,6 +112,16 @@ BEGIN_FTR_SECTION
>>  
>>  	cmpwi	cr1,r13,2
>>  
>> +	/* Check if last bit of HSPRG0 is set. This indicates whether we are
>> +	 * waking up from winkle */
>> +	li	r3,1
>> +	mfspr	r4,SPRN_HSPRG0
>> +	and	r5,r4,r3
>> +	cmpwi	cr4,r5,1	/* Store result in cr4 for later use */
>> +
>> +	andc	r4,r4,r3
>> +	mtspr	SPRN_HSPRG0,r4
>> +
> 
> There is an open question here whether adding a beq cr4,+8 after the
> cmpwi (or a +4 after the andc) is worthwhile. Can you check ? (either
> measure or talk to HW folks). 

Okay. Is this because mtspr is a heavier op than beq?
> Also we could write directly to r13...

You mean use mr r13,r4 instead of GET_PACA?

>>  	GET_PACA(r13)
>>  	lbz	r0,PACA_THREAD_IDLE_STATE(r13)
>>  	cmpwi   cr2,r0,PNV_THREAD_NAP
>> diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
>> index c1d590f..78c30b0 100644
>> --- a/arch/powerpc/kernel/idle_power7.S
>> +++ b/arch/powerpc/kernel/idle_power7.S
>> @@ -19,8 +19,22 @@
>>  #include <asm/kvm_book3s_asm.h>
>>  #include <asm/opal.h>
>>  #include <asm/cpuidle.h>
>> +#include <asm/mmu-hash64.h>
>>  
>>  #undef DEBUG
>> +/*
>> + * Use unused space in the interrupt stack to save and restore
>> + * registers for winkle support.
>> + */
>> +#define _SDR1	GPR3
>> +#define _RPR	GPR4
>> +#define _SPURR	GPR5
>> +#define _PURR	GPR6
>> +#define _TSCR	GPR7
>> +#define _DSCR	GPR8
>> +#define _AMOR	GPR9
>> +#define _PMC5	GPR10
>> +#define _PMC6	GPR11
> 
> WORT/WORTC need saving restoring

The reason I skipped these was that they were always 0. But since
they're set by FW, I'll save and restore them.
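
Extending the existing stack-slot map should be enough — e.g. (the slot
choice here is an assumption; any two unused GPR save areas in the
frame would do), followed by the same mfspr/std and ld/mtspr pattern
used for the other SPRs:

	/* hypothetical additions alongside the _SDR1.._PMC6 defines */
	#define _WORT	GPR12
	#define _WORC	GPR13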

> 
>>  /* Idle state entry routines */
>>  
>> @@ -153,32 +167,60 @@ lwarx_loop1:
>>  	b	common_enter
>>  
>>  last_thread:
>> -	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>> -	lbz	r3,0(r3)
>> -	cmpwi	r3,1
>> -	bne	common_enter
>>  	/*
>>  	 * Last thread of the core entering sleep. Last thread needs to execute
>>  	 * the hardware bug workaround code. Before that, set the lock bit to
>>  	 * avoid the race of other threads waking up and undoing workaround
>>  	 * before workaround is applied.
>>  	 */
>> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>> +	lbz	r3,0(r3)
>> +	cmpwi	r3,1
>> +	bne	common_enter
>> +
>>  	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>>  	stwcx.	r15,0,r14
>>  	bne-	lwarx_loop1
>>  
>>  	/* Fast sleep workaround */
>> +	mfcr	r16	/* Backup CR to a non-volatile register */
>>  	li	r3,1
>>  	li	r4,1
>>  	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
>>  	bl	opal_call_realmode
>> +	mtcr	r16	/* Restore CR */
> 
> Why isn't the above already in the previous patch ? Also see my comment
> about using a non-volatile CR instead.

In the previous patch I wasn't using any CR after this OPAL call. Hence
I had skipped it. As you suggested I'll avoid this by using CR[234].
> 
>>  	/* Clear Lock bit */
>>  	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>>  	stw	r15,0(r14)
>>  
>> -common_enter: /* common code for all the threads entering sleep */
>> +common_enter: /* common code for all the threads entering sleep or winkle*/
>> +	bgt	cr1,enter_winkle
>>  	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>> +enter_winkle:
>> +	/*
>> +	 * Note all registers, i.e. per-core, per-subcore and per-thread, are
>> +	 * saved here since any thread in the core might wake up first
>> +	 */
>> +	mfspr	r3,SPRN_SDR1
>> +	std	r3,_SDR1(r1)
>> +	mfspr	r3,SPRN_RPR
>> +	std	r3,_RPR(r1)
>> +	mfspr	r3,SPRN_SPURR
>> +	std	r3,_SPURR(r1)
>> +	mfspr	r3,SPRN_PURR
>> +	std	r3,_PURR(r1)
>> +	mfspr	r3,SPRN_TSCR
>> +	std	r3,_TSCR(r1)
>> +	mfspr	r3,SPRN_DSCR
>> +	std	r3,_DSCR(r1)
>> +	mfspr	r3,SPRN_AMOR
>> +	std	r3,_AMOR(r1)
>> +	mfspr	r3,SPRN_PMC5
>> +	std	r3,_PMC5(r1)
>> +	mfspr	r3,SPRN_PMC6
>> +	std	r3,_PMC6(r1)
>> +	IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
>>  
>>  _GLOBAL(power7_idle)
>>  	/* Now check if user or arch enabled NAP mode */
>> @@ -201,6 +243,12 @@ _GLOBAL(power7_sleep)
>>  	b	power7_powersave_common
>>  	/* No return */
>>  
>> +_GLOBAL(power7_winkle)
>> +	li	r3,PNV_THREAD_WINKLE
>> +	li	r4,1
>> +	b	power7_powersave_common
>> +	/* No return */
>> +
>>  #define CHECK_HMI_INTERRUPT						\
>>  	mfspr	r0,SPRN_SRR1;						\
>>  BEGIN_FTR_SECTION_NESTED(66);						\
>> @@ -250,22 +298,54 @@ lwarx_loop2:
>>  	 */
>>  	bne	lwarx_loop2
>>  
>> -	cmpwi	cr2,r15,0
>> +	cmpwi	cr2,r15,0	/* Check if first in core */
>> +	lbz	r4,PACA_SUBCORE_SIBLING_MASK(r13)
>> +	and	r4,r4,r15
>> +	cmpwi	cr3,r4,0	/* Check if first in subcore */
>> +
>> +	/*
>> +	 * At this stage
>> +	 * cr1 - 01 if waking up from sleep or winkle
>> +	 * cr2 - 10 if first thread to wakeup in core
>> +	 * cr3 - 10 if first thread to wakeup in subcore
>> +	 * cr4 - 10 if waking up from winkle
>> +	 */
>> +
>>  	or	r15,r15,r7		/* Set thread bit */
>>  
>> -	beq	cr2,first_thread
>> +	beq	cr3,first_thread_in_subcore
>>  
>> -	/* Not first thread in core to wake up */
>> +	/* Not first thread in subcore to wake up */
>>  	stwcx.	r15,0,r14
>>  	bne-	lwarx_loop2
>>  	b	common_exit
>>  
>> -first_thread:
>> -	/* First thread in core to wakeup */
>> +first_thread_in_subcore:
>> +	/* First thread in subcore to wake up: set the lock bit */
>>  	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>>  	stwcx.	r15,0,r14
>>  	bne-	lwarx_loop2
>>  
>> +	/*
>> +	 * If waking up from sleep, subcore state is not lost. Hence
>> +	 * skip subcore state restore
>> +	 */
>> +	bne	cr4,subcore_state_restored
>> +
>> +	/* Restore per-subcore state */
>> +	ld      r4,_SDR1(r1)
>> +	mtspr   SPRN_SDR1,r4
>> +	ld      r4,_RPR(r1)
>> +	mtspr   SPRN_RPR,r4
>> +	ld	r4,_AMOR(r1)
>> +	mtspr	SPRN_AMOR,r4
>> +
>> +subcore_state_restored:
>> +	/* Check if the thread is also the first thread in the core. If not,
>> +	 * skip to clear_lock */
>> +	bne	cr2,clear_lock
>> +
>> +first_thread_in_core:
>>  	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>>  	lbz	r3,0(r3)
>>  	cmpwi	r3,1
>> @@ -280,21 +360,71 @@ first_thread:
>>  	bl	opal_call_realmode
>>  	mtcr	r16	/* Restore CR */
>>  
>> -	/* Do timebase resync if we are waking up from sleep. Use cr1 value
>> -	 * set in exceptions-64s.S */
>> +timebase_resync:
>> +	/* Do timebase resync only if the core truly woke up from
>> +	 * sleep/winkle */
>>  	ble	cr1,clear_lock
>>  
>> -timebase_resync:
>>  	/* Time base re-sync */
>> +	mfcr	r16	/* Backup CR into a non-volatile register */
>>  	li	r0,OPAL_RESYNC_TIMEBASE
>>  	bl	opal_call_realmode;
>>  	/* TODO: Check r3 for failure */
>> +	mtcr	r16	/* Restore CR */
>> +
>> +	/*
>> +	 * If waking up from sleep, per core state is not lost, skip to
>> +	 * clear_lock.
>> +	 */
>> +	bne	cr4,clear_lock
>> +
>> +	/* Restore per core state */
>> +	ld	r4,_TSCR(r1)
>> +	mtspr	SPRN_TSCR,r4
>>  
>>  clear_lock:
>>  	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>>  	stw	r15,0(r14)
>>  
>>  common_exit:
>> +	/* Common to all threads
>> +	 *
>> +	 * If waking up from sleep, hypervisor state is not lost. Hence
>> +	 * skip hypervisor state restore.
>> +	 */
>> +	bne	cr4,hypervisor_state_restored
>> +
>> +	/* Waking up from winkle */
>> +
>> +	/* Restore per thread state */
>> +	bl	__restore_cpu_power8
>> +
>> +	/* Restore SLB  from PACA */
>> +	ld	r8,PACA_SLBSHADOWPTR(r13)
>> +
>> +	.rept	SLB_NUM_BOLTED
>> +	li	r3, SLBSHADOW_SAVEAREA
>> +	LDX_BE	r5, r8, r3
>> +	addi	r3, r3, 8
>> +	LDX_BE	r6, r8, r3
>> +	andis.	r7,r5,SLB_ESID_V@h
>> +	beq	1f
>> +	slbmte	r6,r5
>> +1:	addi	r8,r8,16
>> +	.endr
>> +
>> +	ld	r4,_SPURR(r1)
>> +	mtspr	SPRN_SPURR,r4
>> +	ld	r4,_PURR(r1)
>> +	mtspr	SPRN_PURR,r4
>> +	ld	r4,_DSCR(r1)
>> +	mtspr	SPRN_DSCR,r4
>> +	ld	r4,_PMC5(r1)
>> +	mtspr	SPRN_PMC5,r4
>> +	ld	r4,_PMC6(r1)
>> +	mtspr	SPRN_PMC6,r4
>> +
>> +hypervisor_state_restored:
>>  	li	r5,PNV_THREAD_RUNNING
>>  	stb     r5,PACA_THREAD_IDLE_STATE(r13)
>>  
>> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> index b2aa93b..e1e91e0 100644
>> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
>> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> @@ -191,6 +191,7 @@ return_from_opal_call:
>>  #ifdef __LITTLE_ENDIAN__
>>  	FIXUP_ENDIAN
>>  #endif
>> +	ld	r2,PACATOC(r13)
>>  	ld	r12,_LINK(r1)
>>  	mtlr	r12
>>  	blr
>> @@ -284,6 +285,7 @@ OPAL_CALL(opal_sensor_read,			OPAL_SENSOR_READ);
>>  OPAL_CALL(opal_get_param,			OPAL_GET_PARAM);
>>  OPAL_CALL(opal_set_param,			OPAL_SET_PARAM);
>>  OPAL_CALL(opal_handle_hmi,			OPAL_HANDLE_HMI);
>> +OPAL_CALL(opal_slw_set_reg,			OPAL_SLW_SET_REG);
>>  OPAL_CALL(opal_register_dump_region,		OPAL_REGISTER_DUMP_REGION);
>>  OPAL_CALL(opal_unregister_dump_region,		OPAL_UNREGISTER_DUMP_REGION);
>>  OPAL_CALL(opal_pci_set_phb_cxl_mode,		OPAL_PCI_SET_PHB_CXL_MODE);
>> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
>> index 17fb98c..4a886a1 100644
>> --- a/arch/powerpc/platforms/powernv/setup.c
>> +++ b/arch/powerpc/platforms/powernv/setup.c
>> @@ -40,6 +40,7 @@
>>  #include <asm/cpuidle.h>
>>  
>>  #include "powernv.h"
>> +#include "subcore.h"
>>  
>>  static void __init pnv_setup_arch(void)
>>  {
>> @@ -293,6 +294,74 @@ static void __init pnv_setup_machdep_rtas(void)
>>  #endif /* CONFIG_PPC_POWERNV_RTAS */
>>  
>>  static u32 supported_cpuidle_states;
>> +int pnv_save_sprs_for_winkle(void)
>> +{
>> +	int cpu;
>> +	int rc;
>> +
>> +	/*
>> +	 * hid0, hid1, hid4, hid5, hmeer and lpcr values are symmetric across
>> +	 * all cpus at boot. Get these reg values from the current cpu and use
>> +	 * the same across all cpus.
>> +	 */
>> +	uint64_t lpcr_val = mfspr(SPRN_LPCR);
>> +	uint64_t hid0_val = mfspr(SPRN_HID0);
>> +	uint64_t hid1_val = mfspr(SPRN_HID1);
>> +	uint64_t hid4_val = mfspr(SPRN_HID4);
>> +	uint64_t hid5_val = mfspr(SPRN_HID5);
>> +	uint64_t hmeer_val = mfspr(SPRN_HMEER);
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		uint64_t pir = get_hard_smp_processor_id(cpu);
>> +		uint64_t hsprg0_val = (uint64_t)&paca[cpu];
>> +
>> +		/*
>> +		 * HSPRG0 is used to store the cpu's pointer to paca. Hence last
>> +		 * 3 bits are guaranteed to be 0. Program slw to restore HSPRG0
>> +		 * with 63rd bit set, so that when a thread wakes up at 0x100 we
>> +		 * can use this bit to distinguish between fastsleep and
>> +		 * deep winkle.
>> +		 */
>> +		hsprg0_val |= 1;
>> +
>> +		rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);
>> +		if (rc != 0)
>> +			return rc;
>> +
>> +		rc = opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
>> +		if (rc != 0)
>> +			return rc;
>> +
>> +		/* HIDs are per core registers */
>> +		if (cpu_thread_in_core(cpu) == 0) {
>> +
>> +			rc = opal_slw_set_reg(pir, SPRN_HMEER, hmeer_val);
>> +			if (rc != 0)
>> +				return rc;
>> +
>> +			rc = opal_slw_set_reg(pir, SPRN_HID0, hid0_val);
>> +			if (rc != 0)
>> +				return rc;
>> +
>> +			rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>> +			if (rc != 0)
>> +				return rc;
>> +
>> +			rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
>> +			if (rc != 0)
>> +				return rc;
>> +
>> +			rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
>> +			if (rc != 0)
>> +				return rc;
>> +
>> +		}
>> +
>> +	}
>> +
>> +	return 0;
>> +
>> +}
>>  
>>  static void pnv_alloc_idle_core_states(void)
>>  {
>> @@ -324,6 +393,10 @@ static void pnv_alloc_idle_core_states(void)
>>  
>>  		}
>>  	}
>> +	update_subcore_sibling_mask();
>> +	if (supported_cpuidle_states & OPAL_PM_WINKLE_ENABLED)
>> +		pnv_save_sprs_for_winkle();
>> +
>>  }
>>  
>>  u32 pnv_get_supported_cpuidle_states(void)
>> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
>> index 12b761a..5e35857 100644
>> --- a/arch/powerpc/platforms/powernv/smp.c
>> +++ b/arch/powerpc/platforms/powernv/smp.c
>> @@ -167,7 +167,9 @@ static void pnv_smp_cpu_kill_self(void)
>>  	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
>>  	while (!generic_check_cpu_restart(cpu)) {
>>  		ppc64_runlatch_off();
>> -		if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
>> +		if (idle_states & OPAL_PM_WINKLE_ENABLED)
>> +			power7_winkle();
>> +		else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
>>  				(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
>>  			power7_sleep();
>>  		else
>> diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
>> index c87f96b..f60f80a 100644
>> --- a/arch/powerpc/platforms/powernv/subcore.c
>> +++ b/arch/powerpc/platforms/powernv/subcore.c
>> @@ -160,6 +160,18 @@ static void wait_for_sync_step(int step)
>>  	mb();
>>  }
>>  
>> +static void update_hid_in_slw(u64 hid0)
>> +{
>> +	u64 idle_states = pnv_get_supported_cpuidle_states();
>> +
>> +	if (idle_states & OPAL_PM_WINKLE_ENABLED) {
>> +		/* OPAL call to patch slw with the new HID0 value */
>> +		u64 cpu_pir = hard_smp_processor_id();
>> +
>> +		opal_slw_set_reg(cpu_pir, SPRN_HID0, hid0);
>> +	}
>> +}
>> +
>>  static void unsplit_core(void)
>>  {
>>  	u64 hid0, mask;
>> @@ -179,6 +191,7 @@ static void unsplit_core(void)
>>  	hid0 = mfspr(SPRN_HID0);
>>  	hid0 &= ~HID0_POWER8_DYNLPARDIS;
>>  	mtspr(SPRN_HID0, hid0);
>> +	update_hid_in_slw(hid0);
>>  
>>  	while (mfspr(SPRN_HID0) & mask)
>>  		cpu_relax();
>> @@ -215,6 +228,7 @@ static void split_core(int new_mode)
>>  	hid0  = mfspr(SPRN_HID0);
>>  	hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
>>  	mtspr(SPRN_HID0, hid0);
>> +	update_hid_in_slw(hid0);
>>  
>>  	/* Wait for it to happen */
>>  	while (!(mfspr(SPRN_HID0) & split_parms[i].mask))
>> @@ -251,6 +265,25 @@ bool cpu_core_split_required(void)
>>  	return true;
>>  }
>>  
>> +void update_subcore_sibling_mask(void)
>> +{
>> +	int cpu;
>> +	/*
>> +	 * Sibling mask for the first cpu. Left-shift this by the required
>> +	 * number of bits to get the sibling mask for the rest of the cpus.
>> +	 */
>> +	int sibling_mask_first_cpu =  (1 << threads_per_subcore) - 1;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		int tid = cpu_thread_in_core(cpu);
>> +		int offset = (tid / threads_per_subcore) * threads_per_subcore;
>> +		int mask = sibling_mask_first_cpu << offset;
>> +
>> +		paca[cpu].subcore_sibling_mask = mask;
>> +
>> +	}
>> +}
>> +
>>  static int cpu_update_split_mode(void *data)
>>  {
>>  	int cpu, new_mode = *(int *)data;
>> @@ -284,6 +317,7 @@ static int cpu_update_split_mode(void *data)
>>  		/* Make the new mode public */
>>  		subcores_per_core = new_mode;
>>  		threads_per_subcore = threads_per_core / subcores_per_core;
>> +		update_subcore_sibling_mask();
>>  
>>  		/* Make sure the new mode is written before we exit */
>>  		mb();
>> diff --git a/arch/powerpc/platforms/powernv/subcore.h b/arch/powerpc/platforms/powernv/subcore.h
>> index 148abc9..604eb40 100644
>> --- a/arch/powerpc/platforms/powernv/subcore.h
>> +++ b/arch/powerpc/platforms/powernv/subcore.h
>> @@ -15,4 +15,5 @@
>>  
>>  #ifndef __ASSEMBLY__
>>  void split_core_secondary_loop(u8 *state);
>> +extern void update_subcore_sibling_mask(void);
>>  #endif
> 
> 
Thanks,
Shreyas


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
  2014-11-25 11:17 ` [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
  2014-11-27  0:37   ` Benjamin Herrenschmidt
@ 2014-11-27 23:50   ` Paul Mackerras
  2014-12-01  6:27     ` Shreyas B Prabhu
  1 sibling, 1 reply; 12+ messages in thread
From: Paul Mackerras @ 2014-11-27 23:50 UTC (permalink / raw)
  To: Shreyas B. Prabhu
  Cc: linux-kernel, Benjamin Herrenschmidt, Michael Ellerman,
	Rafael J. Wysocki, linux-pm, linuxppc-dev

On Tue, Nov 25, 2014 at 04:47:58PM +0530, Shreyas B. Prabhu wrote:
[snip]
> +2:
> +	/* Sleep or winkle */
> +	li	r7,1
> +	mfspr	r8,SPRN_PIR
> +	/*
> +	 * The last 3 bits of PIR represent the thread id of a cpu
> +	 * in power8. This will need adjusting for power7.
> +	 */
> +	andi.	r8,r8,0x07			/* Get thread id into r8 */
> +	rotld	r7,r7,r8

I would suggest adding another u8 field to the paca to store our
thread bit, and initialize it to 1 << (cpu_id % threads_per_core)
early on.  That will handle the POWER7 case correctly and reduce these
four instructions to one.
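
Roughly, and only as a sketch (the field and offset names here are
invented, not final):

	/* in struct paca_struct, arch/powerpc/include/asm/paca.h */
	u8 thread_mask;			/* 1 << thread number within the core */

	/* at paca initialisation time */
	paca[cpu].thread_mask = 1 << cpu_thread_in_core(cpu);

after which the four instructions above collapse into a single

	lbz	r7,PACA_THREAD_MASK(r13)	/* offset generated via asm-offsets.c */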

> +
> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
> +lwarx_loop1:
> +	lwarx	r15,0,r14
> +	andc	r15,r15,r7			/* Clear thread bit */
> +
> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +	beq	last_thread
> +
> +	/* Not the last thread to go to sleep */
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop1
> +	b	common_enter
> +
> +last_thread:
> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
> +	lbz	r3,0(r3)
> +	cmpwi	r3,1
> +	bne	common_enter
> +	/*
> +	 * Last thread of the core entering sleep. It needs to execute the
> +	 * hardware bug workaround code. Before that, set the lock bit to
> +	 * avoid the race of other threads waking up and undoing the
> +	 * workaround before it is applied.
> +	 */
> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop1
> +
> +	/* Fast sleep workaround */
> +	li	r3,1
> +	li	r4,1
> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
> +	bl	opal_call_realmode
> +
> +	/* Clear Lock bit */
> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +	stw	r15,0(r14)

In this case we know the result of the andi. will be 0, so this could
be just li r0,0; stw r0,0(r14).

> +
> +common_enter: /* common code for all the threads entering sleep */
> +	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>  
>  _GLOBAL(power7_idle)
>  	/* Now check if user or arch enabled NAP mode */
> @@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
>  
>  _GLOBAL(power7_nap)
>  	mr	r4,r3
> -	li	r3,0
> +	li	r3,PNV_THREAD_NAP
>  	b	power7_powersave_common
>  	/* No return */
>  
>  _GLOBAL(power7_sleep)
> -	li	r3,1
> +	li	r3,PNV_THREAD_SLEEP
>  	li	r4,1
>  	b	power7_powersave_common
>  	/* No return */
>  
> -/*
> - * Make opal call in realmode. This is a generic function to be called
> - * from realmode from reset vector. It handles endianess.
> - *
> - * r13 - paca pointer
> - * r1  - stack pointer
> - * r3  - opal token
> - */
> -opal_call_realmode:
> -	mflr	r12
> -	std	r12,_LINK(r1)
> -	ld	r2,PACATOC(r13)
> -	/* Set opal return address */
> -	LOAD_REG_ADDR(r0,return_from_opal_call)
> -	mtlr	r0
> -	/* Handle endian-ness */
> -	li	r0,MSR_LE
> -	mfmsr	r12
> -	andc	r12,r12,r0
> -	mtspr	SPRN_HSRR1,r12
> -	mr	r0,r3			/* Move opal token to r0 */
> -	LOAD_REG_ADDR(r11,opal)
> -	ld	r12,8(r11)
> -	ld	r2,0(r11)
> -	mtspr	SPRN_HSRR0,r12
> -	hrfid
> -
> -return_from_opal_call:
> -	FIXUP_ENDIAN
> -	ld	r0,_LINK(r1)
> -	mtlr	r0
> -	blr
> -
>  #define CHECK_HMI_INTERRUPT						\
>  	mfspr	r0,SPRN_SRR1;						\
>  BEGIN_FTR_SECTION_NESTED(66);						\
> @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
>  	/* Invoke opal call to handle hmi */				\
>  	ld	r2,PACATOC(r13);					\
>  	ld	r1,PACAR1(r13);						\
> -	std	r3,ORIG_GPR3(r1);	/* Save original r3 */		\
> -	li	r3,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
> +	li	r0,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
>  	bl	opal_call_realmode;					\
> -	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
>  20:	nop;
>  
>  
> @@ -210,12 +225,91 @@ _GLOBAL(power7_wakeup_tb_loss)
>  BEGIN_FTR_SECTION
>  	CHECK_HMI_INTERRUPT
>  END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
> +
> +	li	r7,1
> +	mfspr	r8,SPRN_PIR
> +	/*
> +	 * The last 3 bits of PIR represent the thread id of a cpu
> +	 * in power8. This will need adjusting for power7.
> +	 */
> +	andi.	r8,r8,0x07		/* Get thread id into r8 */
> +	rotld	r7,r7,r8
> +	/* r7 now has 'thread_id'th bit set */
> +
> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
> +lwarx_loop2:
> +	lwarx	r15,0,r14
> +	andi.	r9,r15,PNV_CORE_IDLE_LOCK_BIT
> +	/*
> +	 * The lock bit is set in one of two cases:
> +	 * a. In the sleep/winkle enter path, the last thread is executing
> +	 * the fastsleep workaround code.
> +	 * b. In the wakeup path, another thread is executing the fastsleep
> +	 * workaround undo code, resyncing the timebase or restoring context.
> +	 * In either case, loop until the lock bit is cleared.
> +	 */
> +	bne	lwarx_loop2
> +
> +	cmpwi	cr2,r15,0
> +	or	r15,r15,r7		/* Set thread bit */
> +
> +	beq	cr2,first_thread
> +
> +	/* Not first thread in core to wake up */
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop2
> +	b	common_exit
> +
> +first_thread:
> +	/* First thread in the core to wake up */
> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
> +	stwcx.	r15,0,r14
> +	bne-	lwarx_loop2
> +
> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
> +	lbz	r3,0(r3)
> +	cmpwi	r3,1
> +	/* Skip the fastsleep workaround if it's not needed */
> +	bne	timebase_resync
> +
> +	/* Undo fast sleep workaround */
> +	mfcr	r16	/* Backup CR into a non-volatile register */
> +	li	r3,1
> +	li	r4,0
> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
> +	bl	opal_call_realmode
> +	mtcr	r16	/* Restore CR */
> +
> +	/* Do a timebase resync if we are waking up from sleep. Use the cr1
> +	 * value set in exceptions-64s.S */
> +	ble	cr1,clear_lock
> +
> +timebase_resync:
>  	/* Time base re-sync */
> -	li	r3,OPAL_RESYNC_TIMEBASE
> +	li	r0,OPAL_RESYNC_TIMEBASE
>  	bl	opal_call_realmode;

So if pnv_need_fastsleep_workaround is zero, we always do the timebase
resync, but if pnv_need_fastsleep_workaround is one, we only do the
timebase resync if we had a loss of state.  Is that really what you
meant?
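
As written, the logic is effectively this (C-like pseudocode, helper
names invented just to show the flow):

	if (!pnv_need_fastsleep_workaround) {
		resync_timebase();		/* done unconditionally */
	} else {
		undo_fastsleep_workaround();
		if (state_loss)			/* cr1 test from exceptions-64s.S */
			resync_timebase();
	}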

> -
>  	/* TODO: Check r3 for failure */
>  
> +clear_lock:
> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +	stw	r15,0(r14)
> +
> +common_exit:
> +	li	r5,PNV_THREAD_RUNNING
> +	stb     r5,PACA_THREAD_IDLE_STATE(r13)
> +
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +	li      r0,KVM_HWTHREAD_IN_KERNEL
> +	stb     r0,HSTATE_HWTHREAD_STATE(r13)
> +	/* Order setting hwthread_state vs. testing hwthread_req */
> +	sync
> +	lbz     r0,HSTATE_HWTHREAD_REQ(r13)
> +	cmpwi   r0,0
> +	beq     6f
> +	b       kvm_start_guest
> +6:
> +#endif

I'd prefer not to duplicate this code.  Could you instead branch back
to the code in exceptions-64s.S?  Or call this code via a bl and get
back to exceptions-64s.S via a blr.
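
For the second option, the caller side would look roughly like this
(sketch only; it assumes power7_wakeup_tb_loss saves and restores the
link register so it can return):

	/* exceptions-64s.S */
	bl	power7_wakeup_tb_loss
	/* fall through to the existing KVM hwthread check, unduplicated */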

> +
>  	REST_NVGPRS(r1)
>  	REST_GPR(2, r1)
>  	ld	r3,_CCR(r1)
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index feb549a..b2aa93b 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -158,6 +158,43 @@ opal_tracepoint_return:
>  	blr
>  #endif
>  
> +/*
> + * Make opal call in realmode. This is a generic function to be called
> + * from realmode. It handles endianness.
> + *
> + * r13 - paca pointer
> + * r1  - stack pointer
> + * r0  - opal token
> + */
> +_GLOBAL(opal_call_realmode)
> +	mflr	r12
> +	std	r12,_LINK(r1)

This is a bug waiting to happen.  Using _LINK(r1) was OK in this
code's previous location, since there we know there is a
INT_FRAME_SIZE-sized stack frame and the _LINK field is basically
unused.  Now that you're making this available to call from anywhere,
you can't trash the caller's stack frame like this.  You need to use
PPC_LR_STKOFF(r1) instead.
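
I.e., in both the save and the restore:

	_GLOBAL(opal_call_realmode)
		mflr	r12
		std	r12,PPC_LR_STKOFF(r1)
		...
	return_from_opal_call:
		...
		ld	r12,PPC_LR_STKOFF(r1)
		mtlr	r12
		blr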

> +	ld	r2,PACATOC(r13)
> +	/* Set opal return address */
> +	LOAD_REG_ADDR(r12,return_from_opal_call)
> +	mtlr	r12
> +
> +	mfmsr	r12
> +#ifdef __LITTLE_ENDIAN__
> +	/* Handle endian-ness */
> +	li	r11,MSR_LE
> +	andc	r12,r12,r11
> +#endif
> +	mtspr	SPRN_HSRR1,r12
> +	LOAD_REG_ADDR(r11,opal)
> +	ld	r12,8(r11)
> +	ld	r2,0(r11)
> +	mtspr	SPRN_HSRR0,r12
> +	hrfid
> +
> +return_from_opal_call:
> +#ifdef __LITTLE_ENDIAN__
> +	FIXUP_ENDIAN
> +#endif
> +	ld	r12,_LINK(r1)
> +	mtlr	r12
> +	blr

Paul.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
  2014-11-27 23:50   ` Paul Mackerras
@ 2014-12-01  6:27     ` Shreyas B Prabhu
  0 siblings, 0 replies; 12+ messages in thread
From: Shreyas B Prabhu @ 2014-12-01  6:27 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-kernel, Benjamin Herrenschmidt, Michael Ellerman,
	Rafael J. Wysocki, linux-pm, linuxppc-dev



On Friday 28 November 2014 05:20 AM, Paul Mackerras wrote:
> On Tue, Nov 25, 2014 at 04:47:58PM +0530, Shreyas B. Prabhu wrote:
> [snip]
>> +2:
>> +	/* Sleep or winkle */
>> +	li	r7,1
>> +	mfspr	r8,SPRN_PIR
>> +	/*
>> +	 * The last 3 bits of PIR represent the thread id of a cpu
>> +	 * in power8. This will need adjusting for power7.
>> +	 */
>> +	andi.	r8,r8,0x07			/* Get thread id into r8 */
>> +	rotld	r7,r7,r8
> 
> I would suggest adding another u8 field to the paca to store our
> thread bit, and initialize it to 1 << (cpu_id % threads_per_core)
> early on.  That will handle the POWER7 case correctly and reduce these
> four instructions to one.
> 
Okay. I'll make the change. 
>> +
>> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
>> +lwarx_loop1:
>> +	lwarx	r15,0,r14
>> +	andc	r15,r15,r7			/* Clear thread bit */
>> +
>> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +	beq	last_thread
>> +
>> +	/* Not the last thread to go to sleep */
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop1
>> +	b	common_enter
>> +
>> +last_thread:
>> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>> +	lbz	r3,0(r3)
>> +	cmpwi	r3,1
>> +	bne	common_enter
>> +	/*
>> +	 * Last thread of the core entering sleep. It needs to execute the
>> +	 * hardware bug workaround code. Before that, set the lock bit to
>> +	 * avoid the race of other threads waking up and undoing the
>> +	 * workaround before it is applied.
>> +	 */
>> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop1
>> +
>> +	/* Fast sleep workaround */
>> +	li	r3,1
>> +	li	r4,1
>> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
>> +	bl	opal_call_realmode
>> +
>> +	/* Clear Lock bit */
>> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +	stw	r15,0(r14)
> 
> In this case we know the result of the andi. will be 0, so this could
> be just li r0,0; stw r0,0(r14).
> 

Yes. I'll make the change.
>> +
>> +common_enter: /* common code for all the threads entering sleep */
>> +	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>>  
>>  _GLOBAL(power7_idle)
>>  	/* Now check if user or arch enabled NAP mode */
>> @@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
>>  
>>  _GLOBAL(power7_nap)
>>  	mr	r4,r3
>> -	li	r3,0
>> +	li	r3,PNV_THREAD_NAP
>>  	b	power7_powersave_common
>>  	/* No return */
>>  
>>  _GLOBAL(power7_sleep)
>> -	li	r3,1
>> +	li	r3,PNV_THREAD_SLEEP
>>  	li	r4,1
>>  	b	power7_powersave_common
>>  	/* No return */
>>  
>> -/*
>> - * Make opal call in realmode. This is a generic function to be called
>> - * from realmode from reset vector. It handles endianess.
>> - *
>> - * r13 - paca pointer
>> - * r1  - stack pointer
>> - * r3  - opal token
>> - */
>> -opal_call_realmode:
>> -	mflr	r12
>> -	std	r12,_LINK(r1)
>> -	ld	r2,PACATOC(r13)
>> -	/* Set opal return address */
>> -	LOAD_REG_ADDR(r0,return_from_opal_call)
>> -	mtlr	r0
>> -	/* Handle endian-ness */
>> -	li	r0,MSR_LE
>> -	mfmsr	r12
>> -	andc	r12,r12,r0
>> -	mtspr	SPRN_HSRR1,r12
>> -	mr	r0,r3			/* Move opal token to r0 */
>> -	LOAD_REG_ADDR(r11,opal)
>> -	ld	r12,8(r11)
>> -	ld	r2,0(r11)
>> -	mtspr	SPRN_HSRR0,r12
>> -	hrfid
>> -
>> -return_from_opal_call:
>> -	FIXUP_ENDIAN
>> -	ld	r0,_LINK(r1)
>> -	mtlr	r0
>> -	blr
>> -
>>  #define CHECK_HMI_INTERRUPT						\
>>  	mfspr	r0,SPRN_SRR1;						\
>>  BEGIN_FTR_SECTION_NESTED(66);						\
>> @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
>>  	/* Invoke opal call to handle hmi */				\
>>  	ld	r2,PACATOC(r13);					\
>>  	ld	r1,PACAR1(r13);						\
>> -	std	r3,ORIG_GPR3(r1);	/* Save original r3 */		\
>> -	li	r3,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
>> +	li	r0,OPAL_HANDLE_HMI;	/* Pass opal token argument*/	\
>>  	bl	opal_call_realmode;					\
>> -	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
>>  20:	nop;
>>  
>>  
>> @@ -210,12 +225,91 @@ _GLOBAL(power7_wakeup_tb_loss)
>>  BEGIN_FTR_SECTION
>>  	CHECK_HMI_INTERRUPT
>>  END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>> +
>> +	li	r7,1
>> +	mfspr	r8,SPRN_PIR
>> +	/*
>> +	 * The last 3 bits of PIR represent the thread id of a cpu
>> +	 * in power8. This will need adjusting for power7.
>> +	 */
>> +	andi.	r8,r8,0x07		/* Get thread id into r8 */
>> +	rotld	r7,r7,r8
>> +	/* r7 now has 'thread_id'th bit set */
>> +
>> +	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
>> +lwarx_loop2:
>> +	lwarx	r15,0,r14
>> +	andi.	r9,r15,PNV_CORE_IDLE_LOCK_BIT
>> +	/*
>> +	 * The lock bit is set in one of two cases:
>> +	 * a. In the sleep/winkle enter path, the last thread is executing
>> +	 * the fastsleep workaround code.
>> +	 * b. In the wakeup path, another thread is executing the fastsleep
>> +	 * workaround undo code, resyncing the timebase or restoring context.
>> +	 * In either case, loop until the lock bit is cleared.
>> +	 */
>> +	bne	lwarx_loop2
>> +
>> +	cmpwi	cr2,r15,0
>> +	or	r15,r15,r7		/* Set thread bit */
>> +
>> +	beq	cr2,first_thread
>> +
>> +	/* Not first thread in core to wake up */
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop2
>> +	b	common_exit
>> +
>> +first_thread:
>> +	/* First thread in the core to wake up */
>> +	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
>> +	stwcx.	r15,0,r14
>> +	bne-	lwarx_loop2
>> +
>> +	LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
>> +	lbz	r3,0(r3)
>> +	cmpwi	r3,1
>> +	/* Skip the fastsleep workaround if it's not needed */
>> +	bne	timebase_resync
>> +
>> +	/* Undo fast sleep workaround */
>> +	mfcr	r16	/* Backup CR into a non-volatile register */
>> +	li	r3,1
>> +	li	r4,0
>> +	li	r0,OPAL_CONFIG_CPU_IDLE_STATE
>> +	bl	opal_call_realmode
>> +	mtcr	r16	/* Restore CR */
>> +
>> +	/* Do a timebase resync if we are waking up from sleep. Use the cr1
>> +	 * value set in exceptions-64s.S */
>> +	ble	cr1,clear_lock
>> +
>> +timebase_resync:
>>  	/* Time base re-sync */
>> -	li	r3,OPAL_RESYNC_TIMEBASE
>> +	li	r0,OPAL_RESYNC_TIMEBASE
>>  	bl	opal_call_realmode;
> 
> So if pnv_need_fastsleep_workaround is zero, we always do the timebase
> resync, but if pnv_need_fastsleep_workaround is one, we only do the
> timebase resync if we had a loss of state.  Is that really what you
> meant?
> 
A timebase resync is needed only if we have a loss of state, irrespective of
pnv_need_fastsleep_workaround.
But I see that in the pnv_need_fastsleep_workaround = 0 path, I am doing the
timebase resync unconditionally. I'll fix that.

The correct flow is, if it's the first thread in the core waking up from sleep:
- call OPAL_CONFIG_CPU_IDLE_STATE if pnv_need_fastsleep_workaround = 1
- if we had a state loss, call OPAL_RESYNC_TIMEBASE to resync the timebase.
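
In C-like pseudocode (helper names invented, just to illustrate):

	/* first thread in the core to wake up; lock bit already held */
	if (pnv_need_fastsleep_workaround)
		undo_fastsleep_workaround();	/* OPAL_CONFIG_CPU_IDLE_STATE */
	if (state_loss)				/* cr1 test from exceptions-64s.S */
		resync_timebase();		/* OPAL_RESYNC_TIMEBASE */
	clear_lock();				/* drop PNV_CORE_IDLE_LOCK_BIT */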
>> -
>>  	/* TODO: Check r3 for failure */
>>  
>> +clear_lock:
>> +	andi.	r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +	stw	r15,0(r14)
>> +
>> +common_exit:
>> +	li	r5,PNV_THREAD_RUNNING
>> +	stb     r5,PACA_THREAD_IDLE_STATE(r13)
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +	li      r0,KVM_HWTHREAD_IN_KERNEL
>> +	stb     r0,HSTATE_HWTHREAD_STATE(r13)
>> +	/* Order setting hwthread_state vs. testing hwthread_req */
>> +	sync
>> +	lbz     r0,HSTATE_HWTHREAD_REQ(r13)
>> +	cmpwi   r0,0
>> +	beq     6f
>> +	b       kvm_start_guest
>> +6:
>> +#endif
> 
> I'd prefer not to duplicate this code.  Could you instead branch back
> to the code in exceptions-64s.S?  Or call this code via a bl and get
> back to exceptions-64s.S via a blr.
> 

Okay.
>> +
>>  	REST_NVGPRS(r1)
>>  	REST_GPR(2, r1)
>>  	ld	r3,_CCR(r1)
>> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> index feb549a..b2aa93b 100644
>> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
>> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> @@ -158,6 +158,43 @@ opal_tracepoint_return:
>>  	blr
>>  #endif
>>  
>> +/*
>> + * Make opal call in realmode. This is a generic function to be called
>> + * from realmode. It handles endianness.
>> + *
>> + * r13 - paca pointer
>> + * r1  - stack pointer
>> + * r0  - opal token
>> + */
>> +_GLOBAL(opal_call_realmode)
>> +	mflr	r12
>> +	std	r12,_LINK(r1)
> 
> This is a bug waiting to happen.  Using _LINK(r1) was OK in this
> code's previous location, since there we know there is a
> INT_FRAME_SIZE-sized stack frame and the _LINK field is basically
> unused.  Now that you're making this available to call from anywhere,
> you can't trash the caller's stack frame like this.  You need to use
> PPC_LR_STKOFF(r1) instead.
>
 
Okay.
>> +	ld	r2,PACATOC(r13)
>> +	/* Set opal return address */
>> +	LOAD_REG_ADDR(r12,return_from_opal_call)
>> +	mtlr	r12
>> +
>> +	mfmsr	r12
>> +#ifdef __LITTLE_ENDIAN__
>> +	/* Handle endian-ness */
>> +	li	r11,MSR_LE
>> +	andc	r12,r12,r11
>> +#endif
>> +	mtspr	SPRN_HSRR1,r12
>> +	LOAD_REG_ADDR(r11,opal)
>> +	ld	r12,8(r11)
>> +	ld	r2,0(r11)
>> +	mtspr	SPRN_HSRR0,r12
>> +	hrfid
>> +
>> +return_from_opal_call:
>> +#ifdef __LITTLE_ENDIAN__
>> +	FIXUP_ENDIAN
>> +#endif
>> +	ld	r12,_LINK(r1)
>> +	mtlr	r12
>> +	blr
> 
> Paul.
> 

Thanks,
Shreyas


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-12-01  6:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-25 11:17 [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
2014-11-25 11:17 ` [PATCH v2 1/4] powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode Shreyas B. Prabhu
2014-11-25 11:17 ` [PATCH v2 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states Shreyas B. Prabhu
2014-11-25 11:17 ` [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management Shreyas B. Prabhu
2014-11-27  0:37   ` Benjamin Herrenschmidt
2014-11-27  5:55     ` Shreyas B Prabhu
2014-11-27 23:50   ` Paul Mackerras
2014-12-01  6:27     ` Shreyas B Prabhu
2014-11-25 11:17 ` [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus Shreyas B. Prabhu
2014-11-27  1:55   ` Benjamin Herrenschmidt
2014-11-27  6:24     ` Shreyas B Prabhu
2014-11-26  5:15 ` [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management Preeti U Murthy
