* [PATCH 0/1] intel_idle: improve Xeon C1 interrupt response time
@ 2021-09-17  7:20 Artem Bityutskiy
From: Artem Bityutskiy @ 2021-09-17  7:20 UTC
To: Rafael J. Wysocki; +Cc: Linux PM Mailing List, Artem Bityutskiy

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

This patch improves C1 interrupt latency by about 5-10% on the following Xeon
platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. In other words, if a
CPU is in the C1 idle state and an interrupt happens, on average the CPU will
reach the interrupt handler about 10% earlier if this patch is applied.

Today, on Intel CPUs, all idle states except for 'POLL' are entered with local
interrupts disabled. If the CPU is woken up by an interrupt, it does not jump
to the interrupt handler. Instead, it continues executing some amount of
housekeeping code in the cpuidle subsystem. Then cpuidle enables interrupts and
the CPU jumps to the interrupt handler(s).

This patch enables interrupts for C1 before entering the idle state. In that
situation the CPU will first run the interrupt handler(s), and only then the
cpuidle housekeeping code.

As a result, the interrupt handler runs a bit earlier, and it may wake up a
thread on another CPU a bit earlier, which, depending on the workload, may end
up improving the overall system responsiveness.

We enable interrupts only before C1 because this only makes a measurable
difference for fast C-states (C1 has about 1 microsecond latency).

It is mostly data centers that are very sensitive to C1 latency, so we only
apply this change to Xeons.

I measured all four of the Xeons mentioned above with 'wult', 'cyclictest', and
'dbench'. All three showed improvements. Below are the Ice Lake test results,
but on SKX, CLX, and CPX the results were similar.

In all 3 tests below I made the following configuration changes.

1. All C-states but C1 disabled ('POLL' also disabled).
2. CPU frequency pinned to HFM (base frequency).
3. Uncore frequency pinned to the maximum value.

#1 was done to test only C1. #2 and #3 were done to exclude frequency-related
test variations.

The results below are my results on my specific system. I did not optimize my
systems to get the best possible test scores, and for the most part just used
OS defaults. The goal was to make an A/B comparison, not to get the best
possible scores.

Wult
----

Wult is a tool for measuring C-state wake latency: https://github.com/intel/wult

I used the 'hrt' wult method for measuring C1 latency with and without this
patch. The median latency improved by about 10%. Here is the HTML report:

https://git.infradead.org/~dedekind/wult/patches/intr_on_xeons/v1/

Cyclictest
----------

I ran cyclictest on a single CPU for 4 hours. I made sure that the CPU had a
lot of C1 residency while cyclictest was running. I was looking to compare the
average latency.

cyclictest --mlockall --priority=80 --duration 4h --interval=100 --distance=0 \
           --threads=1 --affinity=0 -N --latency=1000

The '-N' option was used to get the results in nanoseconds. The '--latency'
option was used to prevent cyclictest from blocking C1 via the Linux PM QoS
interface.

Before: average latency: 2417 ns
After:  average latency: 2293 ns

This is a 5.1% improvement.

Dbench
------

I ran dbench for 4 hours on all CPUs. I was using a simple consumer-grade SSD
with ext4. I was looking to compare the average throughput.

Before: average throughput: 4527.12 MB/sec
After:  average throughput: 4562.52 MB/sec

This is a 0.8% improvement. It may be a tie, but I ran this test twice for 4
hours on 4 Xeon platforms, and the patched kernel consistently gave a fraction
of a percent improvement. I believe this is because this patch slightly
reduces I/O wait time.

Artem.
* [PATCH 1/1] intel_idle: enable interrupts before C1 on Xeons
@ 2021-09-17  7:20 Artem Bityutskiy
From: Artem Bityutskiy @ 2021-09-17  7:20 UTC
To: Rafael J. Wysocki; +Cc: Linux PM Mailing List, Artem Bityutskiy

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Enable local interrupts before requesting C1 on the last two generations of
Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured with
the 'wult' tool.

The '->enter()' function of the driver enters C-states with local interrupts
disabled by executing the 'monitor' and 'mwait' pair of instructions. If an
interrupt happens, the CPU exits the C-state and continues executing
instructions after 'mwait'. It does not jump to the interrupt handler, because
local interrupts are disabled. The cpuidle subsystem enables interrupts a bit
later, after doing some housekeeping.

With this patch, we enable local interrupts before requesting C1. In this case,
if the CPU wakes up because of an interrupt, it will jump to the interrupt
handler right away. The cpuidle housekeeping will be done after the pending
interrupt(s) are handled.

Enabling interrupts before entering a C-state has a measurable impact for
faster C-states, like C1. Deeper but slower C-states like C6 do not really
benefit from this sort of change, because their latency is a lot higher
compared to the delay added by cpuidle housekeeping.

This change was also tested with cyclictest and dbench. On Ice Lake, the
average cyclictest latency decreased by 5.1%, and the average 'dbench'
throughput increased by about 0.8%. Both tests were run for 4 hours with only
C1 enabled (all other idle states, including 'POLL', were disabled). CPU
frequency was pinned to HFM, and uncore frequency was pinned to the maximum
value. The other platforms had similar single-digit percentage improvements.

It is worth noting that this patch affects 'cpuidle' statistics a tiny bit.
Before this patch, C1 residency did not include the interrupt handling time, but
with this patch, it will include it. This is similar to what happens in case of
the 'POLL' state, which also runs with interrupts enabled.

Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
 drivers/idle/intel_idle.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index e6c543b5ee1d..0b66e25c0e2d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -88,6 +88,12 @@ static struct cpuidle_state *cpuidle_state_table __initdata;
 
 static unsigned int mwait_substates __initdata;
 
+/*
+ * Enable interrupts before entering the C-state. On some platforms and for
+ * some C-states, this may measurably decrease interrupt latency.
+ */
+#define CPUIDLE_FLAG_IRQ_ENABLE BIT(14)
+
 /*
  * Enable this state by default even if the ACPI _CST does not list it.
  */
@@ -127,6 +133,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	unsigned long eax = flg2MWAIT(state->flags);
 	unsigned long ecx = 1; /* break on interrupt flag */
 
+	if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE)
+		local_irq_enable();
+
 	mwait_idle_with_hints(eax, ecx);
 
 	return index;
@@ -698,7 +707,7 @@ static struct cpuidle_state skx_cstates[] __initdata = {
 	{
 		.name = "C1",
 		.desc = "MWAIT 0x00",
-		.flags = MWAIT2flg(0x00),
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
 		.exit_latency = 2,
 		.target_residency = 2,
 		.enter = &intel_idle,
@@ -727,7 +736,7 @@ static struct cpuidle_state icx_cstates[] __initdata = {
 	{
 		.name = "C1",
 		.desc = "MWAIT 0x00",
-		.flags = MWAIT2flg(0x00),
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
 		.exit_latency = 1,
 		.target_residency = 1,
 		.enter = &intel_idle,
-- 
2.31.1
* Re: [PATCH 1/1] intel_idle: enable interrupts before C1 on Xeons
@ 2021-09-24 16:45 Rafael J. Wysocki
From: Rafael J. Wysocki @ 2021-09-24 16:45 UTC
To: Artem Bityutskiy; +Cc: Rafael J. Wysocki, Linux PM Mailing List

On Fri, Sep 17, 2021 at 9:20 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>
> Enable local interrupts before requesting C1 on the last two generations of
> Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
> This decreases average C1 interrupt latency by about 5-10%, as measured with
> the 'wult' tool.
>
> The '->enter()' function of the driver enters C-states with local interrupts
> disabled by executing the 'monitor' and 'mwait' pair of instructions. If an
> interrupt happens, the CPU exits the C-state and continues executing
> instructions after 'mwait'. It does not jump to the interrupt handler, because
> local interrupts are disabled. The cpuidle subsystem enables interrupts a bit
> later, after doing some housekeeping.
>
> With this patch, we enable local interrupts before requesting C1. In this case,
> if the CPU wakes up because of an interrupt, it will jump to the interrupt
> handler right away. The cpuidle housekeeping will be done after the pending
> interrupt(s) are handled.
>
> Enabling interrupts before entering a C-state has a measurable impact for
> faster C-states, like C1. Deeper but slower C-states like C6 do not really
> benefit from this sort of change, because their latency is a lot higher
> compared to the delay added by cpuidle housekeeping.
>
> This change was also tested with cyclictest and dbench. On Ice Lake, the
> average cyclictest latency decreased by 5.1%, and the average 'dbench'
> throughput increased by about 0.8%. Both tests were run for 4 hours with only
> C1 enabled (all other idle states, including 'POLL', were disabled). CPU
> frequency was pinned to HFM, and uncore frequency was pinned to the maximum
> value. The other platforms had similar single-digit percentage improvements.
>
> It is worth noting that this patch affects 'cpuidle' statistics a tiny bit.
> Before this patch, C1 residency did not include the interrupt handling time, but
> with this patch, it will include it. This is similar to what happens in case of
> the 'POLL' state, which also runs with interrupts enabled.
>
> Suggested-by: Len Brown <len.brown@intel.com>
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> ---
>  drivers/idle/intel_idle.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index e6c543b5ee1d..0b66e25c0e2d 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -88,6 +88,12 @@ static struct cpuidle_state *cpuidle_state_table __initdata;
>
>  static unsigned int mwait_substates __initdata;
>
> +/*
> + * Enable interrupts before entering the C-state. On some platforms and for
> + * some C-states, this may measurably decrease interrupt latency.
> + */
> +#define CPUIDLE_FLAG_IRQ_ENABLE BIT(14)
> +
>  /*
>   * Enable this state by default even if the ACPI _CST does not list it.
>   */
> @@ -127,6 +133,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
>  	unsigned long eax = flg2MWAIT(state->flags);
>  	unsigned long ecx = 1; /* break on interrupt flag */
>
> +	if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE)
> +		local_irq_enable();
> +
>  	mwait_idle_with_hints(eax, ecx);
>
>  	return index;
> @@ -698,7 +707,7 @@ static struct cpuidle_state skx_cstates[] __initdata = {
>  	{
>  		.name = "C1",
>  		.desc = "MWAIT 0x00",
> -		.flags = MWAIT2flg(0x00),
> +		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
>  		.exit_latency = 2,
>  		.target_residency = 2,
>  		.enter = &intel_idle,
> @@ -727,7 +736,7 @@ static struct cpuidle_state icx_cstates[] __initdata = {
>  	{
>  		.name = "C1",
>  		.desc = "MWAIT 0x00",
> -		.flags = MWAIT2flg(0x00),
> +		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
>  		.exit_latency = 1,
>  		.target_residency = 1,
>  		.enter = &intel_idle,
> --

Applied as 5.16 material, thanks!