From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
To: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
	Michal Suchanek <msuchanek@suse.de>
Cc: linux-pm@vger.kernel.org, joedecke@de.ibm.com,
	linuxppc-dev@lists.ozlabs.org,
	"Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
Subject: [PATCH v5 0/2] cpuidle/pseries: cleanup of the CEDE0 latency fixup code
Date: Mon, 19 Jul 2021 12:03:17 +0530
Message-ID: <1626676399-15975-1-git-send-email-ego@linux.vnet.ibm.com> (raw)

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Hi,

This is the v5 of the patchset to fix up the CEDE0 latency only from
POWER10 onwards.

The previous versions of the patches are:
v4 : https://lore.kernel.org/linux-pm/1623048014-16451-1-git-send-email-ego@linux.vnet.ibm.com/
v3 : https://lore.kernel.org/linuxppc-dev/1619697982-28461-1-git-send-email-ego@linux.vnet.ibm.com/
v2 : https://lore.kernel.org/linuxppc-dev/1619673517-10853-1-git-send-email-ego@linux.vnet.ibm.com/
v1 : https://lore.kernel.org/linuxppc-dev/1619104049-5118-1-git-send-email-ego@linux.vnet.ibm.com/

v4 --> v5 changes:

* Patch 1: Unchanged; rebased against the latest powerpc/merge tree.
  With this patch, on processors older than POWER10, the CEDE latency
  is set to the hardcoded value of 10us, which is closer to the
  measured value (details of the measurement in Patch 1).

* Added Patch 2/2, "cpuidle/pseries: Do not cap the CEDE0 latency in
  fixup_cede0_latency()", which ensures that on POWER10 onwards we
  simply take the latency value exposed by the firmware without
  keeping an upper cap of 10us. This upper cap was previously required
  to prevent a regression on POWER8, which advertised latency values
  higher than 10us while the measured values were lower. With Patch 1,
  we no longer need the upper cap.
Tested the series on POWER8, POWER9 and POWER10 with the
cpuidle-smt-performance test case
(https://github.com/gautshen/misc/tree/master/cpuidle-smt-performance).

This test has three classes of threads:

1. A workload thread which computes Fibonacci numbers, pinned to the
   primary thread of a core (CPU 8). We are interested in the
   throughput of this workload thread.

2. Three irritator threads which are pinned to the secondary CPUs of
   the core on which the workload thread is running (CPUs 10, 12, 14).
   These irritators block on a pipe read until woken up by a waker.
   After being woken up, they go back to sleep by blocking on the
   pipe read again. We are interested in the wakeup latency of the
   irritator threads.

3. A waker thread, pinned to a different core (CPU 16) from the one
   where the workload and the irritators are running, which
   periodically wakes up the three irritator threads by writing to
   their respective pipes. The purpose of these periodic wakeups is to
   prime the cpuidle governor on the irritator CPUs to pick the idle
   state matching the wakeup period.

We measure the wakeup latency of the irritator threads, which tells us
the impact of entering a particular combination of idle states. Thus,
the shallower the state, the lower the wakeup latency should be. We
also measure the throughput of the Fibonacci workload, which gives the
single-thread performance in the presence of the waking irritators on
the sibling threads. Entering an idle state which performs SMT folding
should yield greater throughput.

There is no observable difference in the behaviour on POWER8 and
POWER10 with and without the patch series, since the CEDE latencies on
both of them are 10us with and without the patch. However, on POWER9,
without the patch the CEDE latency is 1us, based on the (inaccurate)
value returned by the firmware, while with the patch it is set to the
default value of 10us, which is closer to the measured value.
The wakeup latency, throughput, and snooze/CEDE idle residency
percentage results on POWER9 with and without the patch are as
follows. We observe that for a wakeup period between 20us and 100us,
the wakeup latency of the irritator threads improves by 40-45% with
the patch. Note, though, that with the patch the throughput of the
Fibonacci workload drops by 5-10% when the wakeup period of the
irritator threads is between 20us and 100us. This is an acceptable
tradeoff, since there are certain benchmarks on POWER9 which are very
sensitive to wakeup latency and have a sleeping duration of less than
100us.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg Wakeup Latency of the irritator threads
(lower the better)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator |           |
wakeup    | Without   | With
period    | Patch     | Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 us   |  3.703 us |  3.632 us ( -1.91%)
   2 us   |  3.843 us |  3.925 us ( +2.13%)
   5 us   |  8.575 us |  8.656 us ( +0.94%)
  10 us   |  8.264 us |  8.242 us ( -0.27%)
  20 us   |  8.672 us |  8.256 us ( -4.80%)
  50 us   | 15.552 us |  8.257 us (-46.90%)
  80 us   | 15.603 us |  8.803 us (-43.58%)
 100 us   | 15.617 us |  8.328 us (-46.67%)
 120 us   | 15.612 us | 14.505 us ( -7.09%)
 500 us   | 15.957 us | 15.723 us ( -1.47%)
1000 us   | 16.526 us | 16.502 us ( -0.14%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fibonacci workload throughput in Million Operations per second
(higher the better)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator |               |
wakeup    | Without       | With
period    | Patch         | Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 us   | 44.234 Mops/s | 44.305 Mops/s ( +0.16%)
   2 us   | 44.290 Mops/s | 44.233 Mops/s ( -0.13%)
   5 us   | 44.757 Mops/s | 44.759 Mops/s ( -0.01%)
  10 us   | 46.169 Mops/s | 46.049 Mops/s ( -0.25%)
  20 us   | 48.263 Mops/s | 49.647 Mops/s ( +2.87%)
  50 us   | 52.817 Mops/s | 52.310 Mops/s ( -0.96%)
  80 us   | 57.338 Mops/s | 53.216 Mops/s ( -7.19%)
 100 us   | 58.958 Mops/s | 53.497 Mops/s ( -9.26%)
 120 us   | 60.060 Mops/s | 58.980 Mops/s ( -1.80%)
 500 us   | 64.484 Mops/s | 64.460 Mops/s ( -0.04%)
1000 us   | 65.200 Mops/s | 65.188 Mops/s ( -0.02%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(snooze, CEDE) Residency Percentage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Irritator |                  |
wakeup    | Without          | With
period    | Patch            | Patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 us   | ( 0.40%,  0.00%) | ( 0.28%,  0.00%)
   2 us   | ( 0.42%,  0.00%) | ( 0.33%,  0.00%)
   5 us   | ( 3.94%,  0.00%) | ( 3.89%,  0.00%)
  10 us   | (21.85%,  0.00%) | (21.62%,  0.00%)
  20 us   | (43.44%,  0.00%) | (50.90%,  0.00%)
  50 us   | ( 0.03%, 76.07%) | (76.85%,  0.00%)
  80 us   | ( 0.07%, 84.14%) | (84.85%,  0.00%)
 100 us   | ( 0.03%, 87.18%) | (87.61%,  0.02%)
 120 us   | ( 0.02%, 89.21%) | (14.71%, 74.40%)
 500 us   | ( 0.00%, 97.27%) | ( 3.70%, 93.53%)
1000 us   | ( 0.00%, 98.57%) | ( 0.17%, 98.40%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gautham R. Shenoy (2):
  cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
  cpuidle/pseries: Do not cap the CEDE0 latency in fixup_cede0_latency()

 drivers/cpuidle/cpuidle-pseries.c | 75 +++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 30 deletions(-)

-- 
1.9.4