linux-pm.vger.kernel.org archive mirror
* [PATCH 0/1] intel_idle: improve Xeon C1 interrupt response time
@ 2021-09-17  7:20 Artem Bityutskiy
  2021-09-17  7:20 ` [PATCH 1/1] intel_idle: enable interrupts before C1 on Xeons Artem Bityutskiy
  0 siblings, 1 reply; 3+ messages in thread
From: Artem Bityutskiy @ 2021-09-17  7:20 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux PM Mailing List, Artem Bityutskiy

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

This patch improves C1 interrupt latency by about 5-10% on the following
Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
In other words, if a CPU is in the C1 idle state and an interrupt arrives, on
average the CPU reaches the interrupt handler about 10% earlier with this
patch applied.

Today, on Intel CPUs, all idle states except for 'POLL' are entered with local
interrupts disabled. If the CPU is woken up by an interrupt, it does not jump
to the interrupt handler, but instead, it continues executing some amount of
housekeeping code in the cpuidle subsystem. Then cpuidle enables the interrupts
and the CPU jumps to the interrupt handler(s).

This patch enables interrupts for C1 before entering the idle state. Therefore,
when an interrupt wakes the CPU, the CPU will first run the interrupt
handler(s), and only then the cpuidle housekeeping code. As a result, the
interrupt handler runs a bit earlier, and it may wake up a thread on another
CPU a bit earlier, which, depending on the workload, may end up improving the
overall system responsiveness.

We enable interrupts only before C1 because it only makes a measurable
difference for fast C-states (C1 has about 1 microsecond latency). Data centers
are the users most sensitive to C1 latency, so we only apply this change to
Xeons.
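
The reasoning in the two paragraphs above can be illustrated with a toy
user-space model. This is only a sketch, not kernel code: the function names
are hypothetical, and the housekeeping cost passed in by the caller is an
assumed figure, since this thread does not state one. The model treats the
time from wake-up to the interrupt handler as the C-state exit latency, plus
the cpuidle housekeeping time when interrupts are still disabled on wake-up.

```c
#include <assert.h>

/*
 * Toy model (hypothetical, not kernel code). All times are in nanoseconds.
 * Returns the delay between the wake-up event and the first instruction of
 * the interrupt handler.
 */
double handler_delay_ns(double exit_latency_ns, double housekeeping_ns,
			int irq_enabled_before_idle)
{
	assert(exit_latency_ns >= 0.0 && housekeeping_ns >= 0.0);

	/* Interrupts already enabled: the handler runs right after wake-up. */
	if (irq_enabled_before_idle)
		return exit_latency_ns;

	/* Interrupts disabled: housekeeping runs first, then the handler. */
	return exit_latency_ns + housekeeping_ns;
}

/* Relative improvement, in percent, from enabling interrupts before idle. */
double relative_gain_percent(double exit_latency_ns, double housekeeping_ns)
{
	double before = handler_delay_ns(exit_latency_ns, housekeeping_ns, 0);
	double after = handler_delay_ns(exit_latency_ns, housekeeping_ns, 1);

	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With an assumed housekeeping cost of 100 ns, relative_gain_percent(1000, 100)
gives about 9% for a state with 1 microsecond exit latency, while
relative_gain_percent(100000, 100) gives about 0.1% for a hypothetical state
with 100 microseconds exit latency. The saving stays the same in absolute
terms, but it only stands out against a short exit latency.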

I measured all 4 above mentioned Xeons with 'wult', 'cyclictest', and 'dbench'.
All three showed improvements. Below are Ice Lake test results, but on SKX,
CLX, and CPX the results were similar.

In all 3 tests below I made the following configuration changes:
1. All C-states but C1 disabled ('POLL' also disabled).
2. CPU frequency pinned to HFM (base frequency).
3. Uncore frequency pinned to the maximum value.

#1 was done to test only C1. #2 and #3 were done to exclude frequency-related
test variations.
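
The thread does not say how the C-states were disabled. One common way,
assumed here, is the per-CPU cpuidle sysfs interface: writing "1" to
/sys/devices/system/cpu/cpuN/cpuidle/stateM/disable. The sketch below shows
that approach in C. The helper names are hypothetical, state indices differ
between platforms (check the 'name' file next to each 'disable' file), and
writing requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Build the sysfs path of the 'disable' knob for one idle state of one CPU. */
int cpuidle_disable_path(char *buf, size_t size, int cpu, int state)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, size,
		     "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/disable",
		     cpu, state);

	/* Fail on encoding errors and on truncation. */
	return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Write "1" (disable) or "0" (enable) to the knob. Returns 0 on success. */
int cpuidle_set_disabled(int cpu, int state, int disabled)
{
	char path[128];
	FILE *f;
	int ok;

	if (cpuidle_disable_path(path, sizeof(path), cpu, state) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(disabled ? "1" : "0", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

To leave only C1 enabled, a caller would loop over all CPUs and call
cpuidle_set_disabled(cpu, state, 1) for every state whose 'name' file does not
read "C1", including the 'POLL' state.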

The results below were obtained on my specific system. I did not optimize my
systems to get the best possible test scores, and for the most part just used
OS defaults. The goal was to make an A/B comparison, not to get the best
possible scores.

Wult
----

Wult is a tool for measuring C-state wake latency:
https://github.com/intel/wult

I used the 'hrt' wult method for measuring C1 latency with and without this
patch. The median latency improved about 10%. Here is the HTML report:

https://git.infradead.org/~dedekind/wult/patches/intr_on_xeons/v1/

Cyclictest
----------

I ran cyclictest on a single CPU for 4 hours. I made sure that the CPU had a
lot of C1 residency while cyclictest was running. I was comparing the average
latency.

cyclictest --mlockall --priority=80 --duration 4h --interval=100 --distance=0 \
           --threads=1 --affinity=0 -N --latency=1000

The '-N' option was used to get the results in nanoseconds. The '--latency'
option was used to prevent cyclictest from blocking C1 via the Linux PM QoS
interface.

Before: average latency: 2417 ns
After:  average latency: 2293 ns

This is a 5.1% improvement.
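
The percentage can be re-derived from the two averages above. The helper below
is a minimal sketch (the function name is hypothetical) that computes the
relative change between a "before" and an "after" value. A negative result
means the value went down, which is an improvement for latency.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

percent_change(2417, 2293) evaluates to roughly -5.13, that is, the average
latency dropped by about 5.1%. The same helper works for throughput figures,
where a positive result is the improvement.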

Dbench
------

I ran dbench for 4 hours on all CPUs, using a simple consumer-grade SSD with
ext4. I was comparing the average throughput.

Before: average throughput: 4527.12 MB/sec
After:  average throughput: 4562.52 MB/sec

This is a 0.8% improvement. It may be a tie, but I ran this test twice for 4
hours on 4 Xeon platforms, and the patched kernel consistently gave a
fraction of a percent improvement. I believe this is because this patch
slightly reduces I/O wait time.

Artem.


* [PATCH 1/1] intel_idle: enable interrupts before C1 on Xeons
  2021-09-17  7:20 [PATCH 0/1] intel_idle: improve Xeon C1 interrupt response time Artem Bityutskiy
@ 2021-09-17  7:20 ` Artem Bityutskiy
  2021-09-24 16:45   ` Rafael J. Wysocki
  0 siblings, 1 reply; 3+ messages in thread
From: Artem Bityutskiy @ 2021-09-17  7:20 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux PM Mailing List, Artem Bityutskiy

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Enable local interrupts before requesting C1 on the last two generations of
Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured with
the 'wult' tool.

The '->enter()' function of the driver enters C-states with local interrupts
disabled by executing the 'monitor' and 'mwait' pair of instructions. If an
interrupt happens, the CPU exits the C-state and continues executing
instructions after 'mwait'. It does not jump to the interrupt handler, because
local interrupts are disabled. The cpuidle subsystem enables interrupts a bit
later, after doing some housekeeping.

With this patch, we enable local interrupts before requesting C1. In this case,
if the CPU wakes up because of an interrupt, it will jump to the interrupt
handler right away. The cpuidle housekeeping will be done after the pending
interrupt(s) are handled.

Enabling interrupts before entering a C-state has a measurable impact for
faster C-states, like C1. Deeper, but slower C-states like C6 do not really
benefit from this sort of change, because their latency is a lot higher
compared to the delay added by cpuidle housekeeping.

This change was also tested with cyclictest and dbench. In case of Ice Lake,
the average cyclictest latency decreased by 5.1%, and the average 'dbench'
throughput increased by about 0.8%. Both tests were run for 4 hours with only
C1 enabled (all other idle states, including 'POLL', were disabled). CPU
frequency was pinned to HFM, and uncore frequency was pinned to the maximum
value. The other platforms had similar single-digit percentage improvements.

It is worth noting that this patch affects 'cpuidle' statistics a tiny bit.
Before this patch, C1 residency did not include the interrupt handling time,
but with this patch, it does. This is similar to what happens in the case of
the 'POLL' state, which also runs with interrupts enabled.

Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
 drivers/idle/intel_idle.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index e6c543b5ee1d..0b66e25c0e2d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -88,6 +88,12 @@ static struct cpuidle_state *cpuidle_state_table __initdata;
 
 static unsigned int mwait_substates __initdata;
 
+/*
+ * Enable interrupts before entering the C-state. On some platforms and for
+ * some C-states, this may measurably decrease interrupt latency.
+ */
+#define CPUIDLE_FLAG_IRQ_ENABLE		BIT(14)
+
 /*
  * Enable this state by default even if the ACPI _CST does not list it.
  */
@@ -127,6 +133,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	unsigned long eax = flg2MWAIT(state->flags);
 	unsigned long ecx = 1; /* break on interrupt flag */
 
+	if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE)
+		local_irq_enable();
+
 	mwait_idle_with_hints(eax, ecx);
 
 	return index;
@@ -698,7 +707,7 @@ static struct cpuidle_state skx_cstates[] __initdata = {
 	{
 		.name = "C1",
 		.desc = "MWAIT 0x00",
-		.flags = MWAIT2flg(0x00),
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
 		.exit_latency = 2,
 		.target_residency = 2,
 		.enter = &intel_idle,
@@ -727,7 +736,7 @@ static struct cpuidle_state icx_cstates[] __initdata = {
 	{
 		.name = "C1",
 		.desc = "MWAIT 0x00",
-		.flags = MWAIT2flg(0x00),
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
 		.exit_latency = 1,
 		.target_residency = 1,
 		.enter = &intel_idle,
-- 
2.31.1



* Re: [PATCH 1/1] intel_idle: enable interrupts before C1 on Xeons
  2021-09-17  7:20 ` [PATCH 1/1] intel_idle: enable interrupts before C1 on Xeons Artem Bityutskiy
@ 2021-09-24 16:45   ` Rafael J. Wysocki
  0 siblings, 0 replies; 3+ messages in thread
From: Rafael J. Wysocki @ 2021-09-24 16:45 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Rafael J. Wysocki, Linux PM Mailing List

On Fri, Sep 17, 2021 at 9:20 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>
> Enable local interrupts before requesting C1 on the last two generations of
> Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
> This decreases average C1 interrupt latency by about 5-10%, as measured with
> the 'wult' tool.
>
> The '->enter()' function of the driver enters C-states with local interrupts
> disabled by executing the 'monitor' and 'mwait' pair of instructions. If an
> interrupt happens, the CPU exits the C-state and continues executing
> instructions after 'mwait'. It does not jump to the interrupt handler, because
> local interrupts are disabled. The cpuidle subsystem enables interrupts a bit
> later, after doing some housekeeping.
>
> With this patch, we enable local interrupts before requesting C1. In this case,
> if the CPU wakes up because of an interrupt, it will jump to the interrupt
> handler right away. The cpuidle housekeeping will be done after the pending
> interrupt(s) are handled.
>
> Enabling interrupts before entering a C-state has measurable impact for faster
> C-states, like C1. Deeper, but slower C-states like C6 do not really benefit
> from this sort of change, because their latency is a lot higher comparing to
> the delay added by cpuidle housekeeping.
>
> This change was also tested with cyclictest and dbench. In case of Ice Lake,
> the average cyclictest latency decreased by 5.1%, and the average 'dbench'
> throughput increased by about 0.8%. Both tests were run for 4 hours with only
> C1 enabled (all other idle states, including 'POLL', were disabled). CPU
> frequency was pinned to HFM, and uncore frequency was pinned to the maximum
> value. The other platforms had similar single-digit percentage improvements.
>
> It is worth noting, that this patch affects 'cpuidle' statistics a tiny bit.
> Before this patch, C1 residency did not include the interrupt handling time, but
> with this patch, it will include it. This is similar to what happens in case of
> the 'POLL' state, which also runs with interrupts enabled.
>
> Suggested-by: Len Brown <len.brown@intel.com>
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> ---
>  drivers/idle/intel_idle.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index e6c543b5ee1d..0b66e25c0e2d 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -88,6 +88,12 @@ static struct cpuidle_state *cpuidle_state_table __initdata;
>
>  static unsigned int mwait_substates __initdata;
>
> +/*
> + * Enable interrupts before entering the C-state. On some platforms and for
> + * some C-states, this may measurably decrease interrupt latency.
> + */
> +#define CPUIDLE_FLAG_IRQ_ENABLE                BIT(14)
> +
>  /*
>   * Enable this state by default even if the ACPI _CST does not list it.
>   */
> @@ -127,6 +133,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
>         unsigned long eax = flg2MWAIT(state->flags);
>         unsigned long ecx = 1; /* break on interrupt flag */
>
> +       if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE)
> +               local_irq_enable();
> +
>         mwait_idle_with_hints(eax, ecx);
>
>         return index;
> @@ -698,7 +707,7 @@ static struct cpuidle_state skx_cstates[] __initdata = {
>         {
>                 .name = "C1",
>                 .desc = "MWAIT 0x00",
> -               .flags = MWAIT2flg(0x00),
> +               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
>                 .exit_latency = 2,
>                 .target_residency = 2,
>                 .enter = &intel_idle,
> @@ -727,7 +736,7 @@ static struct cpuidle_state icx_cstates[] __initdata = {
>         {
>                 .name = "C1",
>                 .desc = "MWAIT 0x00",
> -               .flags = MWAIT2flg(0x00),
> +               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
>                 .exit_latency = 1,
>                 .target_residency = 1,
>                 .enter = &intel_idle,
> --

Applied as 5.16 material, thanks!

