linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
@ 2019-09-12 10:35 Gautham R. Shenoy
  2019-09-12 10:35 ` [PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off" Gautham R. Shenoy
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Gautham R. Shenoy @ 2019-09-12 10:35 UTC (permalink / raw)
  To: Nathan Lynch, Michael Ellerman, Nicholas Piggin, Tyrel Datwyler
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since the CPUs are
still under the control fo the Linux Guests. Hence, when we change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that the for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.

In the following data (Generated using
https://github.com/gautshen/misc/blob/master/purr_tb.py):


delta tb = tb ticks elapsed in 1 second.
delta purr = sum of PURR increments on online CPUs of that core in 1
       	     second
      
SMT=off
===========================================
Core        	delta tb(apprx)  delta purr	
===========================================
core00 [  0]	512000000	69883784	
core01 [  8]	512000000	88782536	
core02 [ 16]	512000000	94296824	
core03 [ 24]	512000000	80951968	

SMT=2
===========================================
Core            delta tb(apprx)  delta purr	
===========================================
core00 [  0,1]	512000000	136147792	
core01 [  8,9]	512000000	128636784	
core02 [ 16,17]	512000000	135426488	
core03 [ 24,25]	512000000	153027520	

SMT=4
===================================================
Core                   	delta tb(apprx)  delta purr	
===================================================
core00 [  0,1,2,3]	512000000	258331616	
core01 [  8,9,10,11]	512000000	274220072	
core02 [ 16,17,18,19]	512000000	260013736	
core03 [ 24,25,26,27]	512000000	260079672	

SMT=on
===================================================================
Core                                   	delta tb(apprx)  delta purr	
===================================================================
core00 [  0,1,2,3,4,5,6,7]		512000000	512941248	
core01 [  8,9,10,11,12,13,14,15]	512000000	512936544	
core02 [ 16,17,18,19,20,21,22,23]	512000000	512931544	
core03 [ 24,25,26,27,28,29,30,31]	512000000	512923800

This patchset addresses this issue by ensuring that by default, the
offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call
by changing the default value of "cede_offline_enabled" to false.

The patchset also defines a new sysfs attribute
"/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
to allow userspace programs to change the state into which the
offlined CPU need to be put to at runtime. This is intended for
userspace programs that fold CPUs for the purpose of saving energy
when the utilization is low. Setting the value of this attribute
ensures that subsequent CPU offline operations will put the offlined
CPUs to extended cede. However, it will cause inconsistencies in the
PURR accounting. Clearing the attribute will make the offlined CPUs
call the RTAS "stop-self" call thereby returning the CPU to the
hypervisor.

With the patches,

SMT=off
===========================================
Core        	delta tb(apprx)	 delta purr	
===========================================
core00 [  0]	512000000	 512527568	
core01 [  8]	512000000	 512556128	
core02 [ 16]	512000000	 512590016	
core03 [ 24]	512000000	 512589440	

SMT=2
===========================================
Core            delta tb(apprx)	 delta purr	
===========================================
core00 [  0,1]	512000000	512635328
core01 [  8,9]	512000000	512610416	
core02 [ 16,17]	512000000	512639360	
core03 [ 24,25]	512000000	512638720	

SMT=4
===================================================
Core                    delta tb(apprx)  delta purr	
===================================================
core00 [  0,1,2,3]	512000000	512757328	
core01 [  8,9,10,11]	512000000	512727920	
core02 [ 16,17,18,19]	512000000	512754712	
core03 [ 24,25,26,27]	512000000	512739040	

SMT=on
==============================================================
Core                               delta tb(apprx)  delta purr	
==============================================================
core00 [  0,1,2,3,4,5,6,7]	   512000000	   512920936	
core01 [  8,9,10,11,12,13,14,15]   512000000	   512878728	
core02 [ 16,17,18,19,20,21,22,23]  512000000	   512921192	
core03 [ 24,25,26,27,28,29,30,31]  512000000	   512924816	

Gautham R. Shenoy (2):
  pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"
  pseries/hotplug-cpu: Add sysfs attribute for cede_offline

 Documentation/ABI/testing/sysfs-devices-system-cpu | 14 ++++
 Documentation/core-api/cpu_hotplug.rst             |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c       | 80 ++++++++++++++++++++--
 3 files changed, 88 insertions(+), 8 deletions(-)

-- 
1.9.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"
  2019-09-12 10:35 [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Gautham R. Shenoy
@ 2019-09-12 10:35 ` Gautham R. Shenoy
  2019-09-12 10:35 ` [PATCH 2/2] pseries/hotplug-cpu: Add sysfs attribute for cede_offline Gautham R. Shenoy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Gautham R. Shenoy @ 2019-09-12 10:35 UTC (permalink / raw)
  To: Nathan Lynch, Michael Ellerman, Nicholas Piggin, Tyrel Datwyler
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since they are still
under the control fo the Linux Guests. Hence, when we change the SMT
modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will
continue to count the values for offlined CPUs in extended cede as if
they are online. This breaks the accounting in tools such as lparstat.

To fix this, ensure that by default the offlined CPUs are returned to
the Hypervisor via RTAS "stop-self" call by changing the default value
of "cede_offline_enabled" to false.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 Documentation/core-api/cpu_hotplug.rst       |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7..5319593 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -53,7 +53,7 @@ Command Line Switches
 ``cede_offline={"off","on"}``
   Use this option to disable/enable putting offlined processors to an extended
   ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
-  ``cede_offline`` is set to "on".
+  ``cede_offline`` is set to "off".
 
   This option is limited to the PowerPC architecture.
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bbda646..f9d0366 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) =
 
 static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
 
-static bool cede_offline_enabled __read_mostly = true;
+/*
+ * Determines whether the offlined CPUs should be put to a long term
+ * processor cede (called extended cede) for power-saving
+ * purposes. The CPUs in extended cede are still with the Linux Guest
+ * and are not returned to the Hypervisor.
+ *
+ * By default, the offlined CPUs are returned to the hypervisor via
+ * RTAS "stop-self". This behaviour can be changed by passing the
+ * kernel commandline parameter "cede_offline=on".
+ */
+static bool cede_offline_enabled __read_mostly;
 
 /*
  * Enable/disable cede_offline when available.
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] pseries/hotplug-cpu: Add sysfs attribute for cede_offline
  2019-09-12 10:35 [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Gautham R. Shenoy
  2019-09-12 10:35 ` [PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off" Gautham R. Shenoy
@ 2019-09-12 10:35 ` Gautham R. Shenoy
  2019-09-12 15:39 ` [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Nathan Lynch
  2019-09-18  5:14 ` Michael Ellerman
  3 siblings, 0 replies; 15+ messages in thread
From: Gautham R. Shenoy @ 2019-09-12 10:35 UTC (permalink / raw)
  To: Nathan Lynch, Michael Ellerman, Nicholas Piggin, Tyrel Datwyler
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Define a new sysfs attribute
"/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
to allow userspace programs to change the state into which the
offlined CPU need to be put to at runtime. This is intended for
userspace programs that fold CPUs for the purpose of saving energy
when the utilization is low.

Setting the value of this attribute ensures that subsequent CPU
offline operations will put the offlined CPUs to extended
cede. However, it will cause inconsistencies in the PURR accounting.

Clearing the attribute will make the offlined CPUs call the RTAS
"stop-self" call thereby returning the CPU to the hypervisor.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 14 +++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c       | 68 ++++++++++++++++++++--
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 06d0931..b3c52cd 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -572,3 +572,17 @@ Description:	Secure Virtual Machine
 		If 1, it means the system is using the Protected Execution
 		Facility in POWER9 and newer processors. i.e., it is a Secure
 		Virtual Machine.
+
+What:		/sys/devices/system/cpu/cede_offline_enabled
+Date:		August 2019
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+		Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description:	Offline CPU state control
+
+		If 1, it means that offline CPUs on PSeries guests
+		will be made to call an extended CEDE which provides
+		energy savings but at the expense of accuracy of PURR
+		accounting. If 0, the offline CPUs on PSeries guests
+		will be made to call RTAS "stop-self" call which will
+		return the CPUs to the Hypervisor and provide accurate
+		values of PURR. The value is 0 by default.
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index f9d0366..4a04cf7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -943,9 +943,64 @@ static int parse_cede_parameters(void)
 			 CEDE_LATENCY_PARAM_MAX_LENGTH);
 }
 
-static int __init pseries_cpu_hotplug_init(void)
+/*
+ * Must be guarded by
+ * cpu_maps_update_begin()...cpu_maps_update_done()
+ */
+static void update_default_offline_state(void)
 {
 	int cpu;
+
+	if (cede_offline_enabled)
+		default_offline_state = CPU_STATE_INACTIVE;
+	else
+		default_offline_state = CPU_STATE_OFFLINE;
+
+	for_each_possible_cpu(cpu)
+		set_default_offline_state(cpu);
+}
+
+static ssize_t show_cede_offline_enabled(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	unsigned long ret = 0;
+
+	if (cede_offline_enabled)
+		ret = 1;
+
+	return sprintf(buf, "%lx\n", ret);
+}
+
+static ssize_t store_cede_offline_enabled(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	bool val;
+	int ret = 0;
+
+	ret = kstrtobool(buf, &val);
+	if (ret)
+		return -EINVAL;
+
+	cpu_maps_update_begin();
+	/* Check if anything needs to be done */
+	if (val == cede_offline_enabled)
+		goto done;
+	cede_offline_enabled = val;
+	update_default_offline_state();
+done:
+	cpu_maps_update_done();
+
+	return count;
+}
+
+static DEVICE_ATTR(cede_offline_enabled, 0600,
+		   show_cede_offline_enabled,
+		   store_cede_offline_enabled);
+
+static int __init pseries_cpu_hotplug_init(void)
+{
 	int qcss_tok;
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
@@ -971,11 +1026,12 @@ static int __init pseries_cpu_hotplug_init(void)
 	if (firmware_has_feature(FW_FEATURE_LPAR)) {
 		of_reconfig_notifier_register(&pseries_smp_nb);
 		cpu_maps_update_begin();
-		if (cede_offline_enabled && parse_cede_parameters() == 0) {
-			default_offline_state = CPU_STATE_INACTIVE;
-			for_each_online_cpu(cpu)
-				set_default_offline_state(cpu);
-		}
+		if (parse_cede_parameters() == 0)
+			device_create_file(cpu_subsys.dev_root,
+					   &dev_attr_cede_offline_enabled);
+		else /* Extended cede is not supported */
+			cede_offline_enabled = false;
+		update_default_offline_state();
 		cpu_maps_update_done();
 	}
 
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-12 10:35 [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Gautham R. Shenoy
  2019-09-12 10:35 ` [PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off" Gautham R. Shenoy
  2019-09-12 10:35 ` [PATCH 2/2] pseries/hotplug-cpu: Add sysfs attribute for cede_offline Gautham R. Shenoy
@ 2019-09-12 15:39 ` Nathan Lynch
  2019-09-15  7:42   ` Gautham R Shenoy
  2019-09-18  5:14 ` Michael Ellerman
  3 siblings, 1 reply; 15+ messages in thread
From: Nathan Lynch @ 2019-09-12 15:39 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Michael Ellerman, Nicholas Piggin, Tyrel Datwyler

"Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> The patchset also defines a new sysfs attribute
> "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> to allow userspace programs to change the state into which the
> offlined CPU need to be put to at runtime.

A boolean sysfs interface will become awkward if we need to add another
mode in the future.

What do you think about naming the attribute something like
'offline_mode', with the possible values 'extended-cede' and
'rtas-stopped'?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-12 15:39 ` [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Nathan Lynch
@ 2019-09-15  7:42   ` Gautham R Shenoy
  2019-09-17 17:36     ` Nathan Lynch
  0 siblings, 1 reply; 15+ messages in thread
From: Gautham R Shenoy @ 2019-09-15  7:42 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Gautham R. Shenoy, linux-kernel, linuxppc-dev,
	Vaidyanathan Srinivasan, Kamalesh Babulal, Naveen N . Rao,
	Aneesh Kumar K.V, Michael Ellerman, Nicholas Piggin,
	Tyrel Datwyler

Hello Nathan,

On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> > The patchset also defines a new sysfs attribute
> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> > to allow userspace programs to change the state into which the
> > offlined CPU need to be put to at runtime.
> 
> A boolean sysfs interface will become awkward if we need to add another
> mode in the future.
> 
> What do you think about naming the attribute something like
> 'offline_mode', with the possible values 'extended-cede' and
> 'rtas-stopped'?

We can do that. However, IMHO in the longer term, on PSeries guests,
we should have only one offline state - rtas-stopped.  The reason for
this being, that on Linux, SMT switch is brought into effect through
the CPU Hotplug interface. The only state in which the SMT switch will
recognized by the hypervisors such as PHYP is rtas-stopped.

All other states (such as extended-cede) should in the long-term be
exposed via the cpuidle interface.

With this in mind, I made the sysfs interface boolean to mirror the
current "cede_offline" commandline parameter. Eventually when we have
only one offline-state, we can deprecate the commandline parameter as
well as the sysfs interface.

Thoughts?

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-15  7:42   ` Gautham R Shenoy
@ 2019-09-17 17:36     ` Nathan Lynch
  2019-09-18  5:17       ` Michael Ellerman
  2019-09-18 12:30       ` Gautham R Shenoy
  0 siblings, 2 replies; 15+ messages in thread
From: Nathan Lynch @ 2019-09-17 17:36 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Michael Ellerman, Nicholas Piggin, Tyrel Datwyler

Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
>> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>> > The patchset also defines a new sysfs attribute
>> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
>> > to allow userspace programs to change the state into which the
>> > offlined CPU need to be put to at runtime.
>> 
>> A boolean sysfs interface will become awkward if we need to add another
>> mode in the future.
>> 
>> What do you think about naming the attribute something like
>> 'offline_mode', with the possible values 'extended-cede' and
>> 'rtas-stopped'?
>
> We can do that. However, IMHO in the longer term, on PSeries guests,
> we should have only one offline state - rtas-stopped.  The reason for
> this being, that on Linux, SMT switch is brought into effect through
> the CPU Hotplug interface. The only state in which the SMT switch will
> recognized by the hypervisors such as PHYP is rtas-stopped.

OK. Why "longer term" though, instead of doing it now?


> All other states (such as extended-cede) should in the long-term be
> exposed via the cpuidle interface.
>
> With this in mind, I made the sysfs interface boolean to mirror the
> current "cede_offline" commandline parameter. Eventually when we have
> only one offline-state, we can deprecate the commandline parameter as
> well as the sysfs interface.

I don't care for adding a sysfs interface that is intended from the
beginning to become vestigial...

This strikes me as unnecessarily incremental if you're changing the
default offline state. Any user space programs depending on the current
behavior will have to change anyway (and why is it OK to break them?)

Why isn't the plan:

  1. Add extended cede support to the pseries cpuidle driver
  2. Make stop-self the only cpu offline state for pseries (no sysfs
     interface necessary)

?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-12 10:35 [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Gautham R. Shenoy
                   ` (2 preceding siblings ...)
  2019-09-12 15:39 ` [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Nathan Lynch
@ 2019-09-18  5:14 ` Michael Ellerman
  2019-09-18  6:52   ` Naveen N. Rao
  2019-09-18 12:51   ` Gautham R Shenoy
  3 siblings, 2 replies; 15+ messages in thread
From: Michael Ellerman @ 2019-09-18  5:14 UTC (permalink / raw)
  To: Gautham R. Shenoy, Nathan Lynch, Nicholas Piggin, Tyrel Datwyler
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Gautham R. Shenoy

"Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> Currently on Pseries Linux Guests, the offlined CPU can be put to one
> of the following two states:
>    - Long term processor cede (also called extended cede)
>    - Returned to the Hypervisor via RTAS "stop-self" call.
>
> This is controlled by the kernel boot parameter "cede_offline=on/off".
>
> By default the offlined CPUs enter extended cede.

Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") (Nov 2009)

Which you wrote :)

Why was that wrong?

> The PHYP hypervisor considers CPUs in extended cede to be "active"
> since the CPUs are still under the control fo the Linux Guests. Hence, when we change the
> SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
> will continue to count the values for offlined CPUs in extended cede
> as if they are online.
>
> One of the expectations with PURR is that the for an interval of time,
> the sum of the PURR increments across the online CPUs of a core should
> equal the number of timebase ticks for that interval.
>
> This is currently not the case.

But why does that matter? It's just some accounting stuff, does it
actually break something meaningful?

Also what does this do to the latency of CPU online/offline.
And what does this do on KVM?


> In the following data (Generated using
> https://github.com/gautshen/misc/blob/master/purr_tb.py):
>
>
> delta tb = tb ticks elapsed in 1 second.
> delta purr = sum of PURR increments on online CPUs of that core in 1
>        	     second
>       
> SMT=off
> ===========================================
> Core        	delta tb(apprx)  delta purr	
> ===========================================
> core00 [  0]	512000000	69883784	
> core01 [  8]	512000000	88782536	
> core02 [ 16]	512000000	94296824	
> core03 [ 24]	512000000	80951968	

Showing the expected value in another column would make this much
clearer.

cheers


> SMT=2
> ===========================================
> Core            delta tb(apprx)  delta purr	
> ===========================================
> core00 [  0,1]	512000000	136147792	
> core01 [  8,9]	512000000	128636784	
> core02 [ 16,17]	512000000	135426488	
> core03 [ 24,25]	512000000	153027520	
>
> SMT=4
> ===================================================
> Core                   	delta tb(apprx)  delta purr	
> ===================================================
> core00 [  0,1,2,3]	512000000	258331616	
> core01 [  8,9,10,11]	512000000	274220072	
> core02 [ 16,17,18,19]	512000000	260013736	
> core03 [ 24,25,26,27]	512000000	260079672	
>
> SMT=on
> ===================================================================
> Core                                   	delta tb(apprx)  delta purr	
> ===================================================================
> core00 [  0,1,2,3,4,5,6,7]		512000000	512941248	
> core01 [  8,9,10,11,12,13,14,15]	512000000	512936544	
> core02 [ 16,17,18,19,20,21,22,23]	512000000	512931544	
> core03 [ 24,25,26,27,28,29,30,31]	512000000	512923800
>
> This patchset addresses this issue by ensuring that by default, the
> offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call
> by changing the default value of "cede_offline_enabled" to false.
>
> The patchset also defines a new sysfs attribute
> "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> to allow userspace programs to change the state into which the
> offlined CPU need to be put to at runtime. This is intended for
> userspace programs that fold CPUs for the purpose of saving energy
> when the utilization is low. Setting the value of this attribute
> ensures that subsequent CPU offline operations will put the offlined
> CPUs to extended cede. However, it will cause inconsistencies in the
> PURR accounting. Clearing the attribute will make the offlined CPUs
> call the RTAS "stop-self" call thereby returning the CPU to the
> hypervisor.
>
> With the patches,
>
> SMT=off
> ===========================================
> Core        	delta tb(apprx)	 delta purr	
> ===========================================
> core00 [  0]	512000000	 512527568	
> core01 [  8]	512000000	 512556128	
> core02 [ 16]	512000000	 512590016	
> core03 [ 24]	512000000	 512589440	
>
> SMT=2
> ===========================================
> Core            delta tb(apprx)	 delta purr	
> ===========================================
> core00 [  0,1]	512000000	512635328
> core01 [  8,9]	512000000	512610416	
> core02 [ 16,17]	512000000	512639360	
> core03 [ 24,25]	512000000	512638720	
>
> SMT=4
> ===================================================
> Core                    delta tb(apprx)  delta purr	
> ===================================================
> core00 [  0,1,2,3]	512000000	512757328	
> core01 [  8,9,10,11]	512000000	512727920	
> core02 [ 16,17,18,19]	512000000	512754712	
> core03 [ 24,25,26,27]	512000000	512739040	
>
> SMT=on
> ==============================================================
> Core                               delta tb(apprx)  delta purr	
> ==============================================================
> core00 [  0,1,2,3,4,5,6,7]	   512000000	   512920936	
> core01 [  8,9,10,11,12,13,14,15]   512000000	   512878728	
> core02 [ 16,17,18,19,20,21,22,23]  512000000	   512921192	
> core03 [ 24,25,26,27,28,29,30,31]  512000000	   512924816	
>
> Gautham R. Shenoy (2):
>   pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"
>   pseries/hotplug-cpu: Add sysfs attribute for cede_offline
>
>  Documentation/ABI/testing/sysfs-devices-system-cpu | 14 ++++
>  Documentation/core-api/cpu_hotplug.rst             |  2 +-
>  arch/powerpc/platforms/pseries/hotplug-cpu.c       | 80 ++++++++++++++++++++--
>  3 files changed, 88 insertions(+), 8 deletions(-)
>
> -- 
> 1.9.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-17 17:36     ` Nathan Lynch
@ 2019-09-18  5:17       ` Michael Ellerman
  2019-09-18 12:30       ` Gautham R Shenoy
  1 sibling, 0 replies; 15+ messages in thread
From: Michael Ellerman @ 2019-09-18  5:17 UTC (permalink / raw)
  To: Nathan Lynch, Gautham R Shenoy
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Nicholas Piggin, Tyrel Datwyler

Nathan Lynch <nathanl@linux.ibm.com> writes:
> Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
>> On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
>>> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>>> > The patchset also defines a new sysfs attribute
>>> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
>>> > to allow userspace programs to change the state into which the
>>> > offlined CPU need to be put to at runtime.
>>> 
>>> A boolean sysfs interface will become awkward if we need to add another
>>> mode in the future.
>>> 
>>> What do you think about naming the attribute something like
>>> 'offline_mode', with the possible values 'extended-cede' and
>>> 'rtas-stopped'?
>>
>> We can do that. However, IMHO in the longer term, on PSeries guests,
>> we should have only one offline state - rtas-stopped.  The reason for
>> this being, that on Linux, SMT switch is brought into effect through
>> the CPU Hotplug interface. The only state in which the SMT switch will
>> recognized by the hypervisors such as PHYP is rtas-stopped.
>
> OK. Why "longer term" though, instead of doing it now?
>
>
>> All other states (such as extended-cede) should in the long-term be
>> exposed via the cpuidle interface.
>>
>> With this in mind, I made the sysfs interface boolean to mirror the
>> current "cede_offline" commandline parameter. Eventually when we have
>> only one offline-state, we can deprecate the commandline parameter as
>> well as the sysfs interface.
>
> I don't care for adding a sysfs interface that is intended from the
> beginning to become vestigial...
>
> This strikes me as unnecessarily incremental if you're changing the
> default offline state. Any user space programs depending on the current
> behavior will have to change anyway (and why is it OK to break them?)
>
> Why isn't the plan:
>
>   1. Add extended cede support to the pseries cpuidle driver
>   2. Make stop-self the only cpu offline state for pseries (no sysfs
>      interface necessary)

I agree, that would be preferable. Adding more sysfs tunables sucks,
especially if they're going to be deprecated in the not too distant
future.

Another option would be to add extended cede to the cpuidle driver,
and retain the cede_offline boot parameter but make it false by default.

cheers

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-18  5:14 ` Michael Ellerman
@ 2019-09-18  6:52   ` Naveen N. Rao
  2019-09-18 11:31     ` Michael Ellerman
  2019-09-18 12:51   ` Gautham R Shenoy
  1 sibling, 1 reply; 15+ messages in thread
From: Naveen N. Rao @ 2019-09-18  6:52 UTC (permalink / raw)
  To: Gautham R. Shenoy, Michael Ellerman, Nathan Lynch,
	Nicholas Piggin, Tyrel Datwyler
  Cc: Aneesh Kumar K.V, Kamalesh Babulal, linux-kernel, linuxppc-dev,
	Vaidyanathan Srinivasan

Michael Ellerman wrote:
> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>>
>> Currently on Pseries Linux Guests, the offlined CPU can be put to one
>> of the following two states:
>>    - Long term processor cede (also called extended cede)
>>    - Returned to the Hypervisor via RTAS "stop-self" call.
>>
>> This is controlled by the kernel boot parameter "cede_offline=on/off".
>>
>> By default the offlined CPUs enter extended cede.
> 
> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") (Nov 2009)
> 
> Which you wrote :)
> 
> Why was that wrong?
> 
>> The PHYP hypervisor considers CPUs in extended cede to be "active"
>> since the CPUs are still under the control fo the Linux Guests. Hence, when we change the
>> SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
>> will continue to count the values for offlined CPUs in extended cede
>> as if they are online.
>>
>> One of the expectations with PURR is that the for an interval of time,
>> the sum of the PURR increments across the online CPUs of a core should
>> equal the number of timebase ticks for that interval.
>>
>> This is currently not the case.
> 
> But why does that matter? It's just some accounting stuff, does it
> actually break something meaningful?

Yes, this broke lparstat at the very least (though its quite unfortunate 
we took so long to notice).

With SMT disabled, and under load:
  $ sudo lparstat 1 10

  System Configuration
  type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 

  %user  %sys %wait    %idle    physc %entc lbusy  vcsw phint
  ----- ----- -----    -----    ----- ----- ----- ----- -----
  100.00  0.00  0.00     0.00     1.10 110.00 100.00 128784460     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128784860     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128785260     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128785662     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786062     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786462     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786862     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128787262     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128787664     0
  100.00  0.00  0.00     0.00     1.07 107.00 100.00 128788064     0


With cede_offline=off:
  $ sudo lparstat 1 10

  System Configuration
  type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 

  %user  %sys %wait    %idle    physc %entc lbusy  vcsw phint
  ----- ----- -----    -----    ----- ----- ----- ----- -----
  100.00  0.00  0.00     0.00     1.94 194.00 100.00 128961588     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128961988     0
  100.00  0.00  0.00     0.00      inf   inf 100.00 128962392     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128962792     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963192     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963592     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963992     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128964392     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128964792     0
  100.00  0.00  0.00     0.00     1.91 191.00 100.00 128965194     0

[The 'inf' values there show a different bug]

Also, since we expose [S]PURR through sysfs, any tools that make use of 
that directly are also affected due to this.


- Naveen


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-18  6:52   ` Naveen N. Rao
@ 2019-09-18 11:31     ` Michael Ellerman
  2019-09-18 13:38       ` Aneesh Kumar K.V
  2019-09-18 16:24       ` Naveen N. Rao
  0 siblings, 2 replies; 15+ messages in thread
From: Michael Ellerman @ 2019-09-18 11:31 UTC (permalink / raw)
  To: Naveen N. Rao, Gautham R. Shenoy, Nathan Lynch, Nicholas Piggin,
	Tyrel Datwyler
  Cc: Aneesh Kumar K.V, Kamalesh Babulal, linux-kernel, linuxppc-dev,
	Vaidyanathan Srinivasan

"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> Michael Ellerman wrote:
>> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>>> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>>>
>>> Currently on Pseries Linux Guests, the offlined CPU can be put to one
>>> of the following two states:
>>>    - Long term processor cede (also called extended cede)
>>>    - Returned to the Hypervisor via RTAS "stop-self" call.
>>>
>>> This is controlled by the kernel boot parameter "cede_offline=on/off".
>>>
>>> By default the offlined CPUs enter extended cede.
>> 
>> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") (Nov 2009)
>> 
>> Which you wrote :)
>> 
>> Why was that wrong?
>> 
>>> The PHYP hypervisor considers CPUs in extended cede to be "active"
>>> since the CPUs are still under the control fo the Linux Guests. Hence, when we change the
>>> SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
>>> will continue to count the values for offlined CPUs in extended cede
>>> as if they are online.
>>>
>>> One of the expectations with PURR is that the for an interval of time,
>>> the sum of the PURR increments across the online CPUs of a core should
>>> equal the number of timebase ticks for that interval.
>>>
>>> This is currently not the case.
>> 
>> But why does that matter? It's just some accounting stuff, does it
>> actually break something meaningful?
>
> Yes, this broke lparstat at the very least (though its quite unfortunate 
> we took so long to notice).

By "so long" you mean 10 years?

Also I've never heard of lparstat, but I assume it's one of these tools
that's meant to behave like the AIX equivalent?

If it's been "broken" for 10 years and no one noticed, I'd argue the
current behaviour is now "correct" and fixing it would actually be a
breakage :)

> With SMT disabled, and under load:
>   $ sudo lparstat 1 10
>
>   System Configuration
>   type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 
>
>   %user  %sys %wait    %idle    physc %entc lbusy  vcsw phint
>   ----- ----- -----    -----    ----- ----- ----- ----- -----
>   100.00  0.00  0.00     0.00     1.10 110.00 100.00 128784460     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128784860     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128785260     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128785662     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786062     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786462     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786862     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128787262     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128787664     0
>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128788064     0

What about that is wrong?

> With cede_offline=off:
>   $ sudo lparstat 1 10
>
>   System Configuration
>   type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 
>
>   %user  %sys %wait    %idle    physc %entc lbusy  vcsw phint
>   ----- ----- -----    -----    ----- ----- ----- ----- -----
>   100.00  0.00  0.00     0.00     1.94 194.00 100.00 128961588     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128961988     0
>   100.00  0.00  0.00     0.00      inf   inf 100.00 128962392     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128962792     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963192     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963592     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963992     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128964392     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128964792     0
>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128965194     0
>
> [The 'inf' values there show a different bug]
>
> Also, since we expose [S]PURR through sysfs, any tools that make use of 
> that directly are also affected due to this.

But again if we've had the current behaviour for 10 years then arguably
that's now the correct behaviour.

cheers

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-17 17:36     ` Nathan Lynch
  2019-09-18  5:17       ` Michael Ellerman
@ 2019-09-18 12:30       ` Gautham R Shenoy
  2019-09-18 17:08         ` Nathan Lynch
  1 sibling, 1 reply; 15+ messages in thread
From: Gautham R Shenoy @ 2019-09-18 12:30 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Gautham R Shenoy, linux-kernel, linuxppc-dev,
	Vaidyanathan Srinivasan, Kamalesh Babulal, Naveen N . Rao,
	Aneesh Kumar K.V, Michael Ellerman, Nicholas Piggin,
	Tyrel Datwyler

Hello Nathan, Michael,

On Tue, Sep 17, 2019 at 12:36:35PM -0500, Nathan Lynch wrote:
> Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> > On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> >> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> >> > The patchset also defines a new sysfs attribute
> >> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> >> > to allow userspace programs to change the state into which the
> >> > offlined CPU need to be put to at runtime.
> >> 
> >> A boolean sysfs interface will become awkward if we need to add another
> >> mode in the future.
> >> 
> >> What do you think about naming the attribute something like
> >> 'offline_mode', with the possible values 'extended-cede' and
> >> 'rtas-stopped'?
> >
> > We can do that. However, IMHO in the longer term, on PSeries guests,
> > we should have only one offline state - rtas-stopped.  The reason for
> > this being, that on Linux, SMT switch is brought into effect through
> > the CPU Hotplug interface. The only state in which the SMT switch will
> > recognized by the hypervisors such as PHYP is rtas-stopped.
> 
> OK. Why "longer term" though, instead of doing it now?

Because adding extended-cede into a cpuidle state is non-trivial since
a CPU in that state is non responsive to external interrupts. We will
additional changes in the IPI, Timer and the Interrupt code to ensure
that these get translated to a H_PROD in order to wake-up the target
CPU in extended CEDE.

Timer: is relatively easy since the cpuidle infrastructure has the
       timer-offload framework (used for fastsleep in POWER8) where we
       can offload the timers of an idling CPU to another CPU which
       can wakeup the CPU when the timer expires via an IPI.

IPIs: We need to ensure that icp_hv_set_qirr() correctly sends H_IPI
      or H_PROD depending on whether or not the target CPU is in
      extended CEDE.

Interrupts: Either we migrate away the interrupts from the CPU that is
            entering extended CEDE or we prevent a CPU that is the
            sole target for an interrupt from entering extended CEDE.

The accounting problem in tools such as lparstat with
"cede_offline=on" is affecting customers who are using these tools for
capacity-planning. That problem needs a fix in the short-term, for
which Patch 1 changes the default behaviour of cede_offline from "on"
to "off". Since this patch would break the existing userspace tools
that use the CPU-Offline infrastructure to fold CPUs for saving power,
the sysfs interface allowing a runtime change of cede_offline_enabled
was provided to enable these userspace tools to cope with minimal
change.

> 
> 
> > All other states (such as extended-cede) should in the long-term be
> > exposed via the cpuidle interface.
> >
> > With this in mind, I made the sysfs interface boolean to mirror the
> > current "cede_offline" commandline parameter. Eventually when we have
> > only one offline-state, we can deprecate the commandline parameter as
> > well as the sysfs interface.
> 
> I don't care for adding a sysfs interface that is intended from the
> beginning to become vestigial...

Fair point. Come to think of it, in case the cpuidle menu governor
behaviour doesn't match the expectations provided by the current
userspace solutions for folding idle CPUs for power-savings, it would
be useful to have this option around so that existing users who prefer
the userspace solution can still have that option.

> 
> This strikes me as unnecessarily incremental if you're changing the
> default offline state. Any user space programs depending on the current
> behavior will have to change anyway (and why is it OK to break them?)
>

Yes, the current userspace program will need to be modified to check
for the sysfs interface and change the value to
cede_offline_enabled=1.

> Why isn't the plan:
> 
>   1. Add extended cede support to the pseries cpuidle driver
>   2. Make stop-self the only cpu offline state for pseries (no sysfs
>      interface necessary)

This is the plan, except that 1. requires some additional work and
this patchset is proposed as a short-term mitigation until we get
1. right.

> 
> ?

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-18  5:14 ` Michael Ellerman
  2019-09-18  6:52   ` Naveen N. Rao
@ 2019-09-18 12:51   ` Gautham R Shenoy
  1 sibling, 0 replies; 15+ messages in thread
From: Gautham R Shenoy @ 2019-09-18 12:51 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Gautham R. Shenoy, Nathan Lynch, Nicholas Piggin, Tyrel Datwyler,
	linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V

On Wed, Sep 18, 2019 at 03:14:15PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> >
> > Currently on Pseries Linux Guests, the offlined CPU can be put to one
> > of the following two states:
> >    - Long term processor cede (also called extended cede)
> >    - Returned to the Hypervisor via RTAS "stop-self" call.
> >
> > This is controlled by the kernel boot parameter "cede_offline=on/off".
> >
> > By default the offlined CPUs enter extended cede.
> 
> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") (Nov 2009)
> 
> Which you wrote :)

Mea Culpa! I forgot to include the "Fixes commit 3aa565f53c39" into
Patch 1 of the series.

> 
> Why was that wrong?

It was wrong from the definition of what PHYP considers as
"not-active" CPU. From the point of view of that hypervisor, a CPU is
not-active iff it is in RTAS "stop-self". Thus if a CPU is offline via
extended cede, and not using any cycles, it is still considered to be
active, by PHYP. This causes PURR accounting is broken. 

> 
> > The PHYP hypervisor considers CPUs in extended cede to be "active"
> > since the CPUs are still under the control fo the Linux Guests. Hence, when we change the
> > SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
> > will continue to count the values for offlined CPUs in extended cede
> > as if they are online.
> >
> > One of the expectations with PURR is that the for an interval of time,
> > the sum of the PURR increments across the online CPUs of a core should
> > equal the number of timebase ticks for that interval.
> >
> > This is currently not the case.
> 
> But why does that matter? It's just some accounting stuff, does it
> actually break something meaningful?

As Naveen mentioned, it breaks lparstat which the customers are using
for capacity planning. Unfortunately we discovered this 10 years after
the feature was written.

> 
> Also what does this do to the latency of CPU online/offline.

It will have a slightly higher latency compared to extended cede,
since it involves an additional rtas-call for both the start and
stopping of CPU. Will measure the exact difference and post it in the
next version.

> And what does this do on KVM?

KVM doesn't seem to depend on the state of the offline VCPU as it has
an explicit way of signalling whether a CPU is online or not, via
KVM_REG_PPC_ONLINE. In commit 7aa15842c15f ("KVM: PPC: Book3S HV: Set
RWMR on POWER8 so PURR/SPURR count correctly") we use this KVM reg to
update the count of online vCPUs in a core, and use this count to set
the RWMR correctly before dispatching the core.

So, this patchset doesn't affect KVM.

> 
> 
> > In the following data (Generated using
> > https://github.com/gautshen/misc/blob/master/purr_tb.py):
> >
> >
> > delta tb = tb ticks elapsed in 1 second.
> > delta purr = sum of PURR increments on online CPUs of that core in 1
> >        	     second
> >       
> > SMT=off
> > ===========================================
> > Core        	delta tb(apprx)  delta purr	
> > ===========================================
> > core00 [  0]	512000000	69883784	
> > core01 [  8]	512000000	88782536	
> > core02 [ 16]	512000000	94296824	
> > core03 [ 24]	512000000	80951968	
> 
> Showing the expected value in another column would make this much
> clearer.

Thanks. Will update the testcase to call out the expected value.
> 
> cheers
> 


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-18 11:31     ` Michael Ellerman
@ 2019-09-18 13:38       ` Aneesh Kumar K.V
  2019-09-18 16:24       ` Naveen N. Rao
  1 sibling, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-09-18 13:38 UTC (permalink / raw)
  To: Michael Ellerman, Naveen N. Rao, Gautham R. Shenoy, Nathan Lynch,
	Nicholas Piggin, Tyrel Datwyler
  Cc: Kamalesh Babulal, linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan

On 9/18/19 5:01 PM, Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>> Michael Ellerman wrote:
>>> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>>>> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>

....

>> Also, since we expose [S]PURR through sysfs, any tools that make use of
>> that directly are also affected due to this.
> 
> But again if we've had the current behaviour for 10 years then arguably
> that's now the correct behaviour.

Or nobody is looking at PURR based accounting/usage metric on Linux. We 
got some customers complaining about this recently once they started 
using the metric.

-aneesh


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-18 11:31     ` Michael Ellerman
  2019-09-18 13:38       ` Aneesh Kumar K.V
@ 2019-09-18 16:24       ` Naveen N. Rao
  1 sibling, 0 replies; 15+ messages in thread
From: Naveen N. Rao @ 2019-09-18 16:24 UTC (permalink / raw)
  To: Gautham R. Shenoy, Michael Ellerman, Nathan Lynch,
	Nicholas Piggin, Tyrel Datwyler
  Cc: Aneesh Kumar K.V, Kamalesh Babulal, linux-kernel, linuxppc-dev,
	Vaidyanathan Srinivasan

Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>> Michael Ellerman wrote:
>>> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>>>> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>>>>
>>>> Currently on Pseries Linux Guests, the offlined CPU can be put to one
>>>> of the following two states:
>>>>    - Long term processor cede (also called extended cede)
>>>>    - Returned to the Hypervisor via RTAS "stop-self" call.
>>>>
>>>> This is controlled by the kernel boot parameter "cede_offline=on/off".
>>>>
>>>> By default the offlined CPUs enter extended cede.
>>> 
>>> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") (Nov 2009)
>>> 
>>> Which you wrote :)
>>> 
>>> Why was that wrong?
>>> 
>>>> The PHYP hypervisor considers CPUs in extended cede to be "active"
>>>> since the CPUs are still under the control fo the Linux Guests. Hence, when we change the
>>>> SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
>>>> will continue to count the values for offlined CPUs in extended cede
>>>> as if they are online.
>>>>
>>>> One of the expectations with PURR is that the for an interval of time,
>>>> the sum of the PURR increments across the online CPUs of a core should
>>>> equal the number of timebase ticks for that interval.
>>>>
>>>> This is currently not the case.
>>> 
>>> But why does that matter? It's just some accounting stuff, does it
>>> actually break something meaningful?
>>
>> Yes, this broke lparstat at the very least (though its quite unfortunate 
>> we took so long to notice).
> 
> By "so long" you mean 10 years?
> 
> Also I've never heard of lparstat, but I assume it's one of these tools
> that's meant to behave like the AIX equivalent?

Yes, and yes. lparstat is part of powerpc-utils.

> 
> If it's been "broken" for 10 years and no one noticed, I'd argue the
> current behaviour is now "correct" and fixing it would actually be a
> breakage :)

:)
More on this below...

> 
>> With SMT disabled, and under load:
>>   $ sudo lparstat 1 10
>>
>>   System Configuration
>>   type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 
>>
>>   %user  %sys %wait    %idle    physc %entc lbusy  vcsw phint
>>   ----- ----- -----    -----    ----- ----- ----- ----- -----
>>   100.00  0.00  0.00     0.00     1.10 110.00 100.00 128784460     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128784860     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128785260     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128785662     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786062     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786462     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128786862     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128787262     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128787664     0
>>   100.00  0.00  0.00     0.00     1.07 107.00 100.00 128788064     0
> 
> What about that is wrong?

The 'physc' column represents cpu usage in units of physical cores.  
With 2 virtual cores ('lcpu=2') in uncapped, shared processor mode, we 
expect this to be closer to 2 when fully loaded (and spare capacity 
elsewhere in the system).

> 
>> With cede_offline=off:
>>   $ sudo lparstat 1 10
>>
>>   System Configuration
>>   type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 
>>
>>   %user  %sys %wait    %idle    physc %entc lbusy  vcsw phint
>>   ----- ----- -----    -----    ----- ----- ----- ----- -----
>>   100.00  0.00  0.00     0.00     1.94 194.00 100.00 128961588     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128961988     0
>>   100.00  0.00  0.00     0.00      inf   inf 100.00 128962392     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128962792     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963192     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963592     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128963992     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128964392     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128964792     0
>>   100.00  0.00  0.00     0.00     1.91 191.00 100.00 128965194     0
>>
>> [The 'inf' values there show a different bug]
>>
>> Also, since we expose [S]PURR through sysfs, any tools that make use of 
>> that directly are also affected due to this.
> 
> But again if we've had the current behaviour for 10 years then arguably
> that's now the correct behaviour.

That's a fair point, and probably again points to this area getting less 
tested. One of the main reasons for this being caught now though, is 
that there are workloads being tested under lower SMT levels now. So, I 
suspect no one has been relying on this behavior and we can consider 
this to be a bug.


Thanks,
Naveen


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
  2019-09-18 12:30       ` Gautham R Shenoy
@ 2019-09-18 17:08         ` Nathan Lynch
  0 siblings, 0 replies; 15+ messages in thread
From: Nathan Lynch @ 2019-09-18 17:08 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: linux-kernel, linuxppc-dev, Vaidyanathan Srinivasan,
	Kamalesh Babulal, Naveen N . Rao, Aneesh Kumar K.V,
	Michael Ellerman, Nicholas Piggin, Tyrel Datwyler

Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> The accounting problem in tools such as lparstat with
> "cede_offline=on" is affecting customers who are using these tools for
> capacity-planning. That problem needs a fix in the short-term, for
> which Patch 1 changes the default behaviour of cede_offline from "on"
> to "off". Since this patch would break the existing userspace tools
> that use the CPU-Offline infrastructure to fold CPUs for saving power,
> the sysfs interface allowing a runtime change of cede_offline_enabled
> was provided to enable these userspace tools to cope with minimal
> change.

So we would be putting users in the position of choosing between folding
and correct accounting. :-(

Actually how does changing the offline mechanism to stop-self break the
folding utility?

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-09-18 17:32 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-12 10:35 [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Gautham R. Shenoy
2019-09-12 10:35 ` [PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off" Gautham R. Shenoy
2019-09-12 10:35 ` [PATCH 2/2] pseries/hotplug-cpu: Add sysfs attribute for cede_offline Gautham R. Shenoy
2019-09-12 15:39 ` [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline Nathan Lynch
2019-09-15  7:42   ` Gautham R Shenoy
2019-09-17 17:36     ` Nathan Lynch
2019-09-18  5:17       ` Michael Ellerman
2019-09-18 12:30       ` Gautham R Shenoy
2019-09-18 17:08         ` Nathan Lynch
2019-09-18  5:14 ` Michael Ellerman
2019-09-18  6:52   ` Naveen N. Rao
2019-09-18 11:31     ` Michael Ellerman
2019-09-18 13:38       ` Aneesh Kumar K.V
2019-09-18 16:24       ` Naveen N. Rao
2019-09-18 12:51   ` Gautham R Shenoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).