* [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code
From: Gautham R. Shenoy @ 2021-07-19  6:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Daniel Lezcano, Michael Ellerman,
	Aneesh Kumar K.V, Vaidyanathan Srinivasan, Michal Suchanek
  Cc: linux-pm, joedecke, linuxppc-dev, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>


Hi,

This is v5 of the patchset to fix up the CEDE0 latency only from
POWER10 onwards.


The previous versions of the patches are
v4 : https://lore.kernel.org/linux-pm/1623048014-16451-1-git-send-email-ego@linux.vnet.ibm.com/
v3 : https://lore.kernel.org/linuxppc-dev/1619697982-28461-1-git-send-email-ego@linux.vnet.ibm.com/
v2 : https://lore.kernel.org/linuxppc-dev/1619673517-10853-1-git-send-email-ego@linux.vnet.ibm.com/
v1 : https://lore.kernel.org/linuxppc-dev/1619104049-5118-1-git-send-email-ego@linux.vnet.ibm.com/

v4 --> v5 changes
 * Patch 1: Unchanged. Rebased it against the latest powerpc/merge
   tree. With this patch, on processors older than POWER10, the CEDE
   latency is set to the hardcoded value of 10us, which is closer to
   the measured value (details of the measurement are in Patch 1).

 * Added a Patch 2/2 titled "cpuidle/pseries: Do not cap the CEDE0
   latency in fixup_cede0_latency()", which ensures that from POWER10
   onwards we simply take the latency value exposed by the firmware
   without keeping an upper cap of 10us. This upper cap was previously
   required to prevent a regression on POWER8, which advertised
   latency values higher than 10us while the measured values were
   lower. With Patch 1, we don't need the upper cap any longer.


Tested the series on POWER8, POWER9 and POWER10 with the
cpuidle-smt-performance test case
(https://github.com/gautshen/misc/tree/master/cpuidle-smt-performance) .
  
This test has three classes of threads:
1. A workload thread which computes Fibonacci numbers (pinned to the
   primary thread, CPU 8). We are interested in the throughput of this
   workload thread.

2. Three irritator threads which are pinned to the secondary CPUs of
   the core on which the workload thread is running (CPUs 10, 12,
   14). These irritators block on a pipe until woken up by a
   waker. After being woken up, they again go back to sleep by
   blocking on a pipe read. We are interested in the wakeup latency of
   the irritator threads.

3. A waker thread which, pinned to a different core (CPU 16) from
   where the workload and the irritators are running, periodically
   wakes up the three irritator threads by writing to their respective
   pipes. The purpose of these periodic wakeups is to prime the
   cpuidle governor on the irritator CPUs to pick the idle state
   corresponding to the wakeup period.

We measure the wakeup latency of the irritator threads, which tells us
the impact of entering a particular combination of idle states. Thus,
the shallower the state, the lower the wakeup latency should be.

We also measure the throughput of the fibonacci workload to gauge the
single-thread performance in the presence of the waking irritators on
the sibling threads. Entering an idle state which enables SMT folding
should yield greater throughput.

There is no observable difference in the behaviour on POWER8 and
POWER10 with and without the patch series, since the CEDE latencies on
both of them with and without the patch are 10us.

However, on POWER9, without the patch, the CEDE latency is 1us based
on the value returned by the firmware (which is not accurate), while
with the patch it is set to the default value of 10us which is closer
to the accurate measure.

The wakeup latency, throughput, and snooze/CEDE idle-state residency
percentage results on POWER9 with and without the patch are as
follows.

We observe that for a wakeup period between 20us - 100us, the wakeup
latency of the irritator threads with the patch improves by 40-45%.

Note, though, that with the patch, the throughput of the fibonacci
workload drops by 5-10% when the wakeup period of the irritator
threads is between 20us-100us. This is an acceptable tradeoff, since
there are certain benchmarks on POWER9 which are very sensitive to
wakeup latency and have sleep durations of less than 100us.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg Wakeup Latency of the irritator threads
(lower the better)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator  |                   |
wakeup     |   Without         |   With
period     |   Patch           |   Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 us   |   3.703 us        |   3.632 us ( -1.91%)
    2 us   |   3.843 us        |   3.925 us ( +2.13%)
    5 us   |   8.575 us        |   8.656 us ( +0.94%)
   10 us   |   8.264 us        |   8.242 us ( -0.27%)
   20 us   |   8.672 us        |   8.256 us ( -4.80%)
   50 us   |  15.552 us        |   8.257 us (-46.90%)
   80 us   |  15.603 us        |   8.803 us (-43.58%)
  100 us   |  15.617 us        |   8.328 us (-46.67%)
  120 us   |  15.612 us        |  14.505 us ( -7.09%)
  500 us   |  15.957 us        |  15.723 us ( -1.47%)
 1000 us   |  16.526 us        |  16.502 us ( -0.14%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fibonacci workload throughput in Million Operations
per second (higher the better)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator  |                   |
wakeup     |   Without         |   With
period     |   Patch           |   Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 us   |  44.234 Mops/s    |   44.305 Mops/s ( +0.16%)
    2 us   |  44.290 Mops/s    |   44.233 Mops/s ( -0.13%)
    5 us   |  44.757 Mops/s    |   44.759 Mops/s ( -0.01%)
   10 us   |  46.169 Mops/s    |   46.049 Mops/s ( -0.25%)
   20 us   |  48.263 Mops/s    |   49.647 Mops/s ( +2.87%)
   50 us   |  52.817 Mops/s    |   52.310 Mops/s ( -0.96%)
   80 us   |  57.338 Mops/s    |   53.216 Mops/s ( -7.19%)
  100 us   |  58.958 Mops/s    |   53.497 Mops/s ( -9.26%)
  120 us   |  60.060 Mops/s    |   58.980 Mops/s ( -1.80%)
  500 us   |  64.484 Mops/s    |   64.460 Mops/s ( -0.04%)
 1000 us   |  65.200 Mops/s    |   65.188 Mops/s ( -0.02%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(snooze, CEDE Residency Percentage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator  |                   |
wakeup     |   Without         |   With
period     |   Patch           |   Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 us   |  ( 0.40%,  0.00%) |  (0.28%,   0.00%)
    2 us   |  ( 0.42%,  0.00%) |  (0.33%,   0.00%)
    5 us   |  ( 3.94%,  0.00%) |  (3.89%,   0.00%)
   10 us   |  (21.85%,  0.00%) |  (21.62%,  0.00%)
   20 us   |  (43.44%,  0.00%) |  (50.90%,  0.00%)
   50 us   |  ( 0.03%, 76.07%) |  (76.85%,  0.00%)
   80 us   |  ( 0.07%, 84.14%) |  (84.85%,  0.00%)
  100 us   |  ( 0.03%, 87.18%) |  (87.61%,  0.02%)
  120 us   |  ( 0.02%, 89.21%) |  (14.71%, 74.40%)
  500 us   |  ( 0.00%, 97.27%) |  ( 3.70%, 93.53%)
 1000 us   |  ( 0.00%, 98.57%) |  ( 0.17%, 98.40%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 



Gautham R. Shenoy (2):
  cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
  cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()

 drivers/cpuidle/cpuidle-pseries.c | 75 +++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 30 deletions(-)

-- 
1.9.4



* [PATCH v5 1/2] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
From: Gautham R. Shenoy @ 2021-07-19  6:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Daniel Lezcano, Michael Ellerman,
	Aneesh Kumar K.V, Vaidyanathan Srinivasan, Michal Suchanek
  Cc: linux-pm, joedecke, linuxppc-dev, Gautham R. Shenoy,
	Vaidyanathan Srinivasan

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform.

On POWER9 LPARs, the firmware advertises a very low value of 2us for
the CEDE1 exit latency on a dedicated LPAR. The latency advertised by
the PHYP hypervisor corresponds to the latency required to wake up
from the underlying hardware idle state. However, the wakeup latency
from the LPAR perspective should include

1. The time taken to transition the CPU from the Hypervisor into the
   LPAR post wakeup from platform idle state

2. Time taken to send the IPI from the source CPU (waker) to the idle
   target CPU (wakee).

1. can be measured via a timer idle test, where we queue a timer, say
for 1ms, and enter the CEDE state. When the timer fires, in the timer
handler we compute how much extra time beyond the expected 1ms we have
consumed. On a POWER9 LPAR the numbers are

CEDE latency measured using a timer (numbers in ns)
N       Min      Median   Avg       90%ile  99%ile    Max    Stddev
400     2601     5677     5668.74    5917    6413     9299   455.01

1. and 2. combined can be determined via an IPI latency test, where we
send an IPI to an idle CPU and, in the handler, compute the time
difference between when the IPI was sent and when the handler ran. We
see the following numbers on a POWER9 LPAR.

CEDE latency measured using an IPI (numbers in ns)
N       Min      Median   Avg       90%ile  99%ile    Max    Stddev
400     711      7564     7369.43   8559    9514      9698   1200.01

If we consider the 99th percentile latency value measured using the
IPI to be the wakeup latency, the value would be 9.5us, which is in
the ballpark of the default value of 10us.

Hence, use the exit latency of CEDE(0) based on the latency values
advertised by the platform only from POWER10 onwards. The values
advertised on POWER10 platforms are more realistic and informed by
latency measurements. For earlier platforms, stick to the default
value of 10us. The fix was suggested by Michael Ellerman.

Reported-by: Enrico Joedecke <joedecke@de.ibm.com>
Fixes: d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)")
Cc: Michal Suchanek <msuchanek@suse.de>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-pseries.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index a2b5c6f..e592280d 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -419,7 +419,21 @@ static int pseries_idle_probe(void)
 			cpuidle_state_table = shared_states;
 			max_idle_state = ARRAY_SIZE(shared_states);
 		} else {
-			fixup_cede0_latency();
+			/*
+			 * Use firmware provided latency values
+			 * starting with POWER10 platforms. In the
+			 * case that we are running on a POWER10
+			 * platform but in an earlier compat mode, we
+			 * can still use the firmware provided values.
+			 *
+			 * However, on platforms prior to POWER10, we
+			 * cannot rely on the accuracy of the firmware
+			 * provided latency values. On such platforms,
+			 * go with the conservative default estimate
+			 * of 10us.
+			 */
+			if (cpu_has_feature(CPU_FTR_ARCH_31) || pvr_version_is(PVR_POWER10))
+				fixup_cede0_latency();
 			cpuidle_state_table = dedicated_states;
 			max_idle_state = NR_DEDICATED_STATES;
 		}
-- 
1.9.4



* [PATCH v5 2/2] cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()
From: Gautham R. Shenoy @ 2021-07-19  6:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Daniel Lezcano, Michael Ellerman,
	Aneesh Kumar K.V, Vaidyanathan Srinivasan, Michal Suchanek
  Cc: linux-pm, joedecke, linuxppc-dev, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Currently, the fixup_cede0_latency() code performs the fixup of the
CEDE(0) exit latency value only if the minimum advertised extended
CEDE latency value is less than 10us. This was done so as to not break
the expected behaviour on POWER8 platforms, where the advertised
latency was higher than the default 10us, which would delay SMT
folding on the core.

However, after the earlier patch "cpuidle/pseries: Fixup CEDE0 latency
only for POWER10 onwards", we can be sure that the fixup of CEDE0
latency is going to happen only from POWER10 onwards. Hence
unconditionally use the minimum exit latency provided by the platform.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-pseries.c | 59 ++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index e592280d..bba449b 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -346,11 +346,9 @@ static int pseries_cpuidle_driver_init(void)
 static void __init fixup_cede0_latency(void)
 {
 	struct xcede_latency_payload *payload;
-	u64 min_latency_us;
+	u64 min_xcede_latency_us = UINT_MAX;
 	int i;
 
-	min_latency_us = dedicated_states[1].exit_latency; // CEDE latency
-
 	if (parse_cede_parameters())
 		return;
 
@@ -358,42 +356,45 @@ static void __init fixup_cede0_latency(void)
 		nr_xcede_records);
 
 	payload = &xcede_latency_parameter.payload;
+
+	/*
+	 * The CEDE idle state maps to CEDE(0). While the hypervisor
+	 * does not advertise CEDE(0) exit latency values, it does
+	 * advertise the latency values of the extended CEDE states.
+	 * We use the lowest advertised exit latency value as a proxy
+	 * for the exit latency of CEDE(0).
+	 */
 	for (i = 0; i < nr_xcede_records; i++) {
 		struct xcede_latency_record *record = &payload->records[i];
+		u8 hint = record->hint;
 		u64 latency_tb = be64_to_cpu(record->latency_ticks);
 		u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC);
 
-		if (latency_us == 0)
-			pr_warn("cpuidle: xcede record %d has an unrealistic latency of 0us.\n", i);
-
-		if (latency_us < min_latency_us)
-			min_latency_us = latency_us;
-	}
-
-	/*
-	 * By default, we assume that CEDE(0) has exit latency 10us,
-	 * since there is no way for us to query from the platform.
-	 *
-	 * However, if the wakeup latency of an Extended CEDE state is
-	 * smaller than 10us, then we can be sure that CEDE(0)
-	 * requires no more than that.
-	 *
-	 * Perform the fix-up.
-	 */
-	if (min_latency_us < dedicated_states[1].exit_latency) {
 		/*
-		 * We set a minimum of 1us wakeup latency for cede0 to
-		 * distinguish it from snooze
+		 * We expect the exit latency of an extended CEDE
+		 * state to be non-zero, since it takes at least a
+		 * few nanoseconds to wake up the idle CPU and
+		 * dispatch the virtual processor into the Linux
+		 * guest.
+		 *
+		 * So we consider only non-zero values when
+		 * performing the fixup of the CEDE(0) latency.
 		 */
-		u64 cede0_latency = 1;
+		if (latency_us == 0) {
+			pr_warn("cpuidle: Skipping xcede record %d [hint=%d]. Exit latency = 0us\n",
+				i, hint);
+			continue;
+		}
 
-		if (min_latency_us > cede0_latency)
-			cede0_latency = min_latency_us - 1;
+		if (latency_us < min_xcede_latency_us)
+			min_xcede_latency_us = latency_us;
+	}
 
-		dedicated_states[1].exit_latency = cede0_latency;
-		dedicated_states[1].target_residency = 10 * (cede0_latency);
+	if (min_xcede_latency_us != UINT_MAX) {
+		dedicated_states[1].exit_latency = min_xcede_latency_us;
+		dedicated_states[1].target_residency = 10 * (min_xcede_latency_us);
 		pr_info("cpuidle: Fixed up CEDE exit latency to %llu us\n",
-			cede0_latency);
+			min_xcede_latency_us);
 	}
 
 }
-- 
1.9.4



* Re: [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code
  2021-07-19  6:33 ` Gautham R. Shenoy
@ 2021-08-03 10:20   ` Michael Ellerman
  -1 siblings, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2021-08-03 10:20 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Rafael J. Wysocki, Michael Ellerman,
	Michal Suchanek, Gautham R. Shenoy, Aneesh Kumar K.V,
	Daniel Lezcano
  Cc: linux-pm, joedecke, linuxppc-dev

On Mon, 19 Jul 2021 12:03:17 +0530, Gautham R. Shenoy wrote:
> 
> Hi,
> 
> This is the v5 of the patchset to fixup CEDE0 latency only from
> POWER10 onwards.
> 
> 
> [...]

Applied to powerpc/next.

[1/2] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
      https://git.kernel.org/powerpc/c/7cbd631d4decfd78f8a17196dce9abcd4e7e1e94
[2/2] cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()
      https://git.kernel.org/powerpc/c/4bceb03859c1a508669ce542c649c8d4f5d4bd93

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code
  2021-08-03 10:20   ` Michael Ellerman
@ 2021-08-03 12:50     ` Michael Ellerman
  -1 siblings, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2021-08-03 12:50 UTC (permalink / raw)
  To: Michael Ellerman, Vaidyanathan Srinivasan, Rafael J. Wysocki,
	Michal Suchanek, Gautham R. Shenoy, Aneesh Kumar K.V,
	Daniel Lezcano
  Cc: linux-pm, joedecke, linuxppc-dev

Michael Ellerman <patch-notifications@ellerman.id.au> writes:
> On Mon, 19 Jul 2021 12:03:17 +0530, Gautham R. Shenoy wrote:
>> 
>> Hi,
>> 
>> This is the v5 of the patchset to fixup CEDE0 latency only from
>> POWER10 onwards.
>> 
>> 
>> [...]
>
> Applied to powerpc/next.
>
> [1/2] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
>       https://git.kernel.org/powerpc/c/7cbd631d4decfd78f8a17196dce9abcd4e7e1e94
> [2/2] cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()
>       https://git.kernel.org/powerpc/c/4bceb03859c1a508669ce542c649c8d4f5d4bd93

First commit had a bad fixes tag, so now these are:

[1/2] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
      https://git.kernel.org/powerpc/c/50741b70b0cbbafbd9199f5180e66c0c53783a4a
[2/2] cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()
      https://git.kernel.org/powerpc/c/71737a6c2a8f801622d2b71567d1ec1e4c5b40b8

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-08-03 12:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-19  6:33 [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code Gautham R. Shenoy
2021-07-19  6:33 ` Gautham R. Shenoy
2021-07-19  6:33 ` [PATCH v5 1/2] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards Gautham R. Shenoy
2021-07-19  6:33   ` Gautham R. Shenoy
2021-07-19  6:33 ` [PATCH v5 2/2] cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency() Gautham R. Shenoy
2021-07-19  6:33   ` Gautham R. Shenoy
2021-08-03 10:20 ` [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code Michael Ellerman
2021-08-03 10:20   ` Michael Ellerman
2021-08-03 12:50   ` Michael Ellerman
2021-08-03 12:50     ` Michael Ellerman
