From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
To: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
	Michal Suchanek <msuchanek@suse.de>
Cc: linux-pm@vger.kernel.org, joedecke@de.ibm.com,
	linuxppc-dev@lists.ozlabs.org,
	"Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
Subject: [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code
Date: Mon, 19 Jul 2021 12:03:17 +0530
Message-ID: <1626676399-15975-1-git-send-email-ego@linux.vnet.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>


Hi,

This is v5 of the patchset to fix up the CEDE0 latency only from
POWER10 onwards.


The previous versions of the patchset are:
v4 : https://lore.kernel.org/linux-pm/1623048014-16451-1-git-send-email-ego@linux.vnet.ibm.com/
v3 : https://lore.kernel.org/linuxppc-dev/1619697982-28461-1-git-send-email-ego@linux.vnet.ibm.com/
v2 : https://lore.kernel.org/linuxppc-dev/1619673517-10853-1-git-send-email-ego@linux.vnet.ibm.com/
v1 : https://lore.kernel.org/linuxppc-dev/1619104049-5118-1-git-send-email-ego@linux.vnet.ibm.com/

v4 --> v5 changes
 * Patch 1 : Unchanged; rebased against the latest powerpc/merge
   tree. With this patch, on processors older than POWER10, the CEDE
   latency is set to the hardcoded value of 10us, which is closer to
   the measured value (details of the measurement are in Patch 1).

 * Added a Patch 2/2 titled "cpuidle/pseries: Do not cap the CEDE0
   latency in fixup_cede0_latency()", which ensures that from POWER10
   onwards we simply take the latency value exposed by the firmware
   without keeping an upper cap of 10us. This upper cap was previously
   required to prevent a regression on POWER8, which advertised
   latency values higher than 10us while the measured values were
   lower. With Patch 1, we no longer need the upper cap; a hedged
   sketch of the combined policy follows below.
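
For clarity, below is a minimal, hedged sketch of the combined policy
of Patches 1 and 2. It is an illustration only, not the actual patch:
cpu_is_power10_or_later() and firmware_advertised_cede_latency_us()
are hypothetical stand-ins for the driver's real CPU-feature check
and firmware query.

    #include <stdbool.h>
    #include <stdint.h>

    #define CEDE_LATENCY_DEFAULT_US 10  /* measured value, see Patch 1 */

    /* Hypothetical stand-ins, for illustration only. */
    extern bool cpu_is_power10_or_later(void);
    extern uint64_t firmware_advertised_cede_latency_us(void);

    static uint64_t cede0_exit_latency_us(void)
    {
            /* Patch 1: pre-POWER10 firmware values are inaccurate,
             * so use the hardcoded 10us default instead. */
            if (!cpu_is_power10_or_later())
                    return CEDE_LATENCY_DEFAULT_US;

            /* Patch 2: from POWER10 onwards, take the firmware value
             * as-is, with no 10us upper cap. */
            return firmware_advertised_cede_latency_us();
    }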


Tested the series on POWER8, POWER9 and POWER10 with the
cpuidle-smt-performance test case
(https://github.com/gautshen/misc/tree/master/cpuidle-smt-performance).

This test has three classes of threads:
1. A workload thread which computes Fibonacci numbers, pinned to the
   primary thread of the core (CPU 8). We are interested in the
   throughput of this workload thread.

2. Three irritator threads which are pinned to the secondary CPUs of
   the core on which the workload thread is running (CPUs 10, 12,
   14). These irritators block on a pipe read until woken up by the
   waker; after being woken up, they go back to sleep by blocking on
   the pipe read again. We are interested in the wakeup latency of
   the irritator threads.

3. A waker thread which, pinned to a different core (CPU 16) from
   where the workload and the irritators are running, periodically
   wakes up the three irritator threads by writing to their respective
   pipes. The purpose of these periodic wakeups is to prime the
   cpuidle governor on the irritator CPUs to pick the idle state
   matching the wakeup period. A sketch of this handshake follows
   below.
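
The waker/irritator handshake is essentially a pipe-based ping. The
following is a minimal sketch under the assumptions described above
(a single pipe and fixed CPU numbers for brevity); it is an
illustration, not the actual cpuidle-smt-performance source:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <unistd.h>

    static int pipefd[2];   /* the real test uses one pipe per irritator */

    static void pin_to_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *irritator_fn(void *arg)
    {
            char c;

            pin_to_cpu(10); /* a secondary SMT thread of the workload core */
            /* Block on the pipe (letting the CPU enter an idle state)
             * until the waker writes, then immediately block again. */
            while (read(pipefd[0], &c, 1) == 1)
                    ;
            return NULL;
    }

    static void *waker_fn(void *arg)
    {
            pin_to_cpu(16); /* a core different from the irritators' */
            for (;;) {
                    usleep(50); /* the irritator wakeup period, e.g. 50us */
                    if (write(pipefd[1], "x", 1) != 1)
                            break;
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t irritator, waker;

            if (pipe(pipefd))
                    return 1;
            pthread_create(&irritator, NULL, irritator_fn, NULL);
            pthread_create(&waker, NULL, waker_fn, NULL);
            pthread_join(irritator, NULL);
            pthread_join(waker, NULL);
            return 0;
    }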

We measure the wakeup latency of the irritator threads, which tells us
the impact of entering a particular combination of idle states. Thus,
the shallower the state, the lower the wakeup latency should be.
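
One plausible way to sample this latency (an assumption about the
method, not taken from the test's source) is for the waker to record
a timestamp just before the pipe write, and for the irritator to read
the clock as soon as read() returns:

    #include <stdint.h>
    #include <time.h>

    static inline uint64_t now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* waker:     wake_stamp = now_ns(); write(pipefd[1], "x", 1);       */
    /* irritator: read(pipefd[0], &c, 1); delay = now_ns() - wake_stamp; */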

We also measure the throughput of the Fibonacci workload to gauge the
single-thread performance in the presence of the waking irritators
on the sibling threads. Entering an idle state which performs SMT
folding should yield greater throughput.

There is no observable difference in the behaviour on POWER8 and
POWER10 with and without the patch series, since on both processors
the CEDE latency is 10us either way.

However, on POWER9, without the patch, the CEDE latency is 1us, based
on the (inaccurate) value returned by the firmware, while with the
patch it is set to the default value of 10us, which is closer to the
measured value.

The wakeup latency, throughput, and snooze/CEDE idle residency
percentage results on POWER9 with and without the patch are as
follows.

We observe that for wakeup periods between 20us and 100us, the wakeup
latency of the irritator threads improves by 40-45% with the patch.

Note, however, that with the patch, the throughput of the Fibonacci
workload drops by 5-10% when the wakeup period of the irritator
threads is between 20us and 100us. This is an acceptable tradeoff,
since certain benchmarks on POWER9 are very sensitive to wakeup
latency and have sleep durations of less than 100us.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg Wakeup Latency of the irritator threads
(the lower the better)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator  |                   |
wakeup     |   Without         |   With
period     |   Patch           |   Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 us   |   3.703 us        |   3.632 us ( -1.91%)
    2 us   |   3.843 us        |   3.925 us ( +2.13%)
    5 us   |   8.575 us        |   8.656 us ( +0.94%)
   10 us   |   8.264 us        |   8.242 us ( -0.27%)
   20 us   |   8.672 us        |   8.256 us ( -4.80%)
   50 us   |  15.552 us        |   8.257 us (-46.90%)
   80 us   |  15.603 us        |   8.803 us (-43.58%)
  100 us   |  15.617 us        |   8.328 us (-46.67%)
  120 us   |  15.612 us        |  14.505 us ( -7.09%)
  500 us   |  15.957 us        |  15.723 us ( -1.47%)
 1000 us   |  16.526 us        |  16.502 us ( -0.14%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fibonacci workload throughput in Million Operations
per second (the higher the better)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator  |                   |
wakeup     |   Without         |   With
period     |   Patch           |   Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 us   |  44.234 Mops/s    |   44.305 Mops/s ( +0.16%)
    2 us   |  44.290 Mops/s    |   44.233 Mops/s ( -0.13%)
    5 us   |  44.757 Mops/s    |   44.759 Mops/s ( -0.01%)
   10 us   |  46.169 Mops/s    |   46.049 Mops/s ( -0.25%)
   20 us   |  48.263 Mops/s    |   49.647 Mops/s ( +2.87%)
   50 us   |  52.817 Mops/s    |   52.310 Mops/s ( -0.96%)
   80 us   |  57.338 Mops/s    |   53.216 Mops/s ( -7.19%)
  100 us   |  58.958 Mops/s    |   53.497 Mops/s ( -9.26%)
  120 us   |  60.060 Mops/s    |   58.980 Mops/s ( -1.80%)
  500 us   |  64.484 Mops/s    |   64.460 Mops/s ( -0.04%)
 1000 us   |  65.200 Mops/s    |   65.188 Mops/s ( -0.02%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(snooze, CEDE Residency Percentage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator  |                   |
wakeup     |   Without         |   With
period     |   Patch           |   Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 us   |  ( 0.40%,  0.00%) |  ( 0.28%,  0.00%)
    2 us   |  ( 0.42%,  0.00%) |  ( 0.33%,  0.00%)
    5 us   |  ( 3.94%,  0.00%) |  ( 3.89%,  0.00%)
   10 us   |  (21.85%,  0.00%) |  (21.62%,  0.00%)
   20 us   |  (43.44%,  0.00%) |  (50.90%,  0.00%)
   50 us   |  ( 0.03%, 76.07%) |  (76.85%,  0.00%)
   80 us   |  ( 0.07%, 84.14%) |  (84.85%,  0.00%)
  100 us   |  ( 0.03%, 87.18%) |  (87.61%,  0.02%)
  120 us   |  ( 0.02%, 89.21%) |  (14.71%, 74.40%)
  500 us   |  ( 0.00%, 97.27%) |  ( 3.70%, 93.53%)
 1000 us   |  ( 0.00%, 98.57%) |  ( 0.17%, 98.40%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Gautham R. Shenoy (2):
  cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
  cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()

 drivers/cpuidle/cpuidle-pseries.c | 75 +++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 30 deletions(-)

-- 
1.9.4

