* [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
@ 2020-11-26 17:18 Mel Gorman
  2020-11-26 18:24 ` Rafael J. Wysocki
  2020-11-28  5:41 ` kernel test robot
  0 siblings, 2 replies; 8+ messages in thread
From: Mel Gorman @ 2020-11-26 17:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Daniel Lezcano, Marcelo Tosatti, Linux Kernel Mailing List, Linux PM

It was noted that a few workloads that idle rapidly regressed when commit
36fcb4292473 ("cpuidle: use first valid target residency as poll time")
was merged. The workloads in question were heavy communicators that idle
rapidly and were impacted by the c-state exit latency as the active CPUs
were not polling at the time of wakeup. As they were not particularly
realistic workloads, it was not considered to be a major problem.

Unfortunately, a bug was then reported for a real workload in a production
environment that relied on large numbers of threads operating in a worker
pool pattern. These threads would idle for periods of time slightly
longer than the C1 exit latency and so incurred the c-state exit latency.
The application is sensitive to wakeup latency and appears to indirectly
rely on behaviour prior to commit a37b969a61c1 ("cpuidle: poll_state:
Add time limit to poll_idle()") to poll for long enough to avoid the exit
latency cost.

The current behaviour favours power consumption over wakeup latency
and it is reasonable behaviour but it should be tunable. In theory
applications could use /dev/cpu_dma_latency but not all applications
are aware of cpu_dma_latency. Similarly, a tool could be installed
that opens cpu_dma_latency for the whole system but such a tool is not
always available, is not always known to the sysadmin or the tool can have
unexpected side-effects if it tunes more than cpu_dma_latency. In practice,
it is more common for sysadmins to try idle=poll (which is x86 specific)
or try disabling c-states and hope for the best.

This patch makes it straight-forward to configure how long a CPU should
poll before entering a c-state. By default, there is no behaviour change.
At build time a decision can be made to favour performance over power
by default even if that potentially impacts turbo boosting for workloads
that are sensitive to wakeup latency. In the event the kernel default is
not suitable, the kernel command line can be used as a substitute for
implementing cpu_dma_latency support in an application or requiring an
additional tool to be installed.

Note that it is not expected that tuning for longer polling times will be a
universal win. For example, extra polling might prevent a turbo state being
used or prevent hyperthread resources being released to an SMT sibling.

By default, nothing has changed but here is an example of tbench4
comparing the default "poll based on the min cstate" vs "poll based on
the max cstate"

tbench4
			  min-cstate		 max-cstate
Hmean     1        512.88 (   0.00%)      566.74 *  10.50%*
Hmean     2        999.47 (   0.00%)     1090.01 *   9.06%*
Hmean     4       1960.83 (   0.00%)     2106.62 *   7.44%*
Hmean     8       3828.61 (   0.00%)     4097.93 *   7.03%*
Hmean     16      6899.44 (   0.00%)     7120.38 *   3.20%*
Hmean     32     10718.38 (   0.00%)    10672.44 *  -0.43%*
Hmean     64     12672.21 (   0.00%)    12608.15 *  -0.51%*
Hmean     128    20744.83 (   0.00%)    21147.02 *   1.94%*
Hmean     256    20646.60 (   0.00%)    20608.48 *  -0.18%*
Hmean     320    20892.89 (   0.00%)    20831.99 *  -0.29%*

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/admin-guide/kernel-parameters.txt | 25 +++++++++
 drivers/cpuidle/Kconfig                         | 57 +++++++++++++++++++++
 drivers/cpuidle/cpuidle.c                       | 67 +++++++++++++++++++++++--
 include/linux/moduleparam.h                     |  5 ++
 kernel/params.c                                 |  1 +
 5 files changed, 151 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 526d65d8573a..382126781b0e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -719,6 +719,31 @@
 			policy to use. This governor must be registered in the
 			kernel before the cpufreq driver probes.
 
+	cpuidle.poll=
+			[CPU_IDLE]
+			Format: <int>
+			Set the time in microseconds a CPU should poll
+			in cpuidle for a new task before entering
+			a sleep state. The default is usually fine
+			but may need to be tuned for workloads that
+			are both latency-sensitive and idle rapidly
+			for short durations. Limiting c-states can
+			be insufficient if the polling time is still
+			too short, the application has no knowledge
+			of /dev/cpu_dma_latency, there are multiple
+			applications or the environment does not allow
+			the installation of a userspace tool that controls
+			cpu_dma_latency.
+
+			-1: default, poll based on the shallowest enabled c-state
+			-2: poll based on the deepest enabled c-state. If no
+			    c-states are enabled then poll based on the tick
+			-3: poll up until a tick would force a resched event
+			>0: poll for a fixed duration in microseconds.
+			    The poll range is limited. The lower bound is
+			    the shallowest enabled c-state. The upper
+			    bound is based on the tick
+
 	cpu_init_udelay=N
 			[X86] Delay for N microsec between assert and de-assert
 			of APIC INIT to start processors.  This delay occurs
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index c0aeedd66f02..7c9d2073db85 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -47,6 +47,63 @@ config CPU_IDLE_GOV_HALTPOLL
 config DT_IDLE_STATES
 	bool
 
+choice
+	prompt "CPU Idle default polling interval before entering an idle state"
+	default CPUIDLE_DEFAULT_POLL_MINCSTATE
+	help
+	  CPU Idle polls briefly waiting for a new task to be runnable. This
+	  setting determines what the default polling interval is. Lower
+	  polling intervals will optimise for power consumption while higher
+	  intervals will favour wakeup latency in cases where a task is
+	  queued before the CPU enters a c-state. If c-states are unavailable
+	  at runtime, the CPU will poll for up to a tick.
+
+	config CPUIDLE_DEFAULT_POLL_MINCSTATE
+		bool "interval mincstate"
+	help
+	  Polling interval will be the same as the shallowest enabled c-state
+	  e.g. C1 exit latency
+
+	config CPUIDLE_DEFAULT_POLL_MAXCSTATE
+		bool "interval maxcstate"
+	help
+	  Polling interval will be the same as the deepest enabled c-state
+	  e.g. C3 exit latency
+
+	config CPUIDLE_DEFAULT_POLL_TICK
+		bool "interval tick"
+	help
+	  Polling interval will be up to a tick interval or until the idle
+	  task needs to reschedule. e.g. a kernel with HZ of 250 could
+	  potentially poll up to 4ms.
+
+	config CPUIDLE_DEFAULT_POLL_FIXED
+		bool "interval fixed"
+	help
+	  Polling can be specified as a fixed interval regardless of
+	  c-states. This option is only useful in the case where there
+	  is a known class of applications that have a fixed latency
+	  that is not necessarily related to the underlying c-state
+	  capabilities and there is a need for the polling interval
+	  to be hardware invariant.
+endchoice
+
+config CPUIDLE_DEFAULT_POLL_FIXED_INTERVAL
+	depends on CPUIDLE_DEFAULT_POLL_FIXED
+	int "Default polling interval for CPU Idle"
+	default 2
+	help
+	  This is the fixed interval used for cpuidle polling if
+	  CPUIDLE_DEFAULT_POLL_FIXED is selected.
+
+config CPUIDLE_DEFAULT_POLL
+	int
+	default -1 if CPUIDLE_DEFAULT_POLL_MINCSTATE
+	default -2 if CPUIDLE_DEFAULT_POLL_MAXCSTATE
+	default -3 if CPUIDLE_DEFAULT_POLL_TICK
+	default CPUIDLE_DEFAULT_POLL_FIXED_INTERVAL if CPUIDLE_DEFAULT_POLL_FIXED
+
+
 menu "ARM CPU Idle Drivers"
 depends on ARM || ARM64
 source "drivers/cpuidle/Kconfig.arm"
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 83af15f77f66..32a745bb7eb8 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -368,6 +368,25 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
 		cpuidle_curr_governor->reflect(dev, index);
 }
 
+static const struct kernel_param_ops poll_param_ops = {
+	.set =          param_set_llong,
+	.get =          param_get_llong,
+};
+
+/* Poll based on the shallowest enabled cstate */
+#define CPUIDLE_POLL_MINCSTATE	-1
+
+/* Poll based on the deepest enabled cstate */
+#define CPUIDLE_POLL_MAXCSTATE	-2
+
+/* Poll up to when a tick forces a resched event */
+#define CPUIDLE_POLL_TICK	-3
+
+/* Sanity check to guard against future options */
+#define CPUIDLE_POLL_BOUND	CPUIDLE_POLL_TICK
+
+static long long param_poll = CONFIG_CPUIDLE_DEFAULT_POLL;
+
 /**
  * cpuidle_poll_time - return amount of time to poll for,
  * governors can override dev->poll_limit_ns if necessary
@@ -380,21 +399,60 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
 		      struct cpuidle_device *dev)
 {
 	int i;
-	u64 limit_ns;
+	u64 min_limit_ns, max_limit_ns;
 
 	if (dev->poll_limit_ns)
 		return dev->poll_limit_ns;
 
-	limit_ns = TICK_NSEC;
+	/* Sanitise param_poll. */
+	param_poll = clamp_t(long long, param_poll, CPUIDLE_POLL_BOUND, TICK_NSEC);
+	if (param_poll > 0)
+		param_poll *= NSEC_PER_USEC;
+
+	/* Find the min and max exit latencies for c-states */
+	max_limit_ns = 0;
+	min_limit_ns = TICK_NSEC;
 	for (i = 1; i < drv->state_count; i++) {
+		u64 state_limit;
+
 		if (dev->states_usage[i].disable)
 			continue;
 
-		limit_ns = drv->states[i].target_residency_ns;
+		state_limit = drv->states[i].target_residency_ns;
+		min_limit_ns = min_t(u64, min_limit_ns, state_limit);
+		max_limit_ns = max_t(u64, max_limit_ns, state_limit);
+	}
+
+	/* Select a polling interval */
+	switch (param_poll) {
+	case CPUIDLE_POLL_MINCSTATE:
+		dev->poll_limit_ns = min_limit_ns;
+		break;
+	case CPUIDLE_POLL_MAXCSTATE:
+		/* If there are no enabled c-states, use the tick interval */
+		if (!max_limit_ns)
+			max_limit_ns = TICK_NSEC;
+
+		dev->poll_limit_ns = max_limit_ns;
+		break;
+	case CPUIDLE_POLL_TICK:
+		dev->poll_limit_ns = TICK_NSEC;
+		break;
+	default:
+		/* If there are no enabled c-states, make sure param_poll is used */
+		if (!max_limit_ns)
+			min_limit_ns = 1;
+
+		dev->poll_limit_ns = clamp_t(u64, param_poll, min_limit_ns, TICK_NSEC);
 		break;
 	}
 
-	dev->poll_limit_ns = limit_ns;
+
+	/*
+	 * Polling parameter reported as usec to match the values reported
+	 * for c-state exit latencies in sysfs.
+	 */
+	param_poll = dev->poll_limit_ns / NSEC_PER_USEC;
 
 	return dev->poll_limit_ns;
 }
@@ -755,4 +813,5 @@ static int __init cpuidle_init(void)
 
 module_param(off, int, 0444);
 module_param_string(governor, param_governor, CPUIDLE_NAME_LEN, 0444);
+module_param_cb(poll, &poll_param_ops, &param_poll, 0444);
 core_initcall(cpuidle_init);
diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index 6388eb9734a5..418cf714580d 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -438,6 +438,11 @@ extern int param_set_long(const char *val, const struct kernel_param *kp);
 extern int param_get_long(char *buffer, const struct kernel_param *kp);
 #define param_check_long(name, p) __param_check(name, p, long)
 
+extern const struct kernel_param_ops param_ops_llong;
+extern int param_set_llong(const char *val, const struct kernel_param *kp);
+extern int param_get_llong(char *buffer, const struct kernel_param *kp);
+#define param_check_llong(name, p) __param_check(name, p, long long)
+
 extern const struct kernel_param_ops param_ops_ulong;
 extern int param_set_ulong(const char *val, const struct kernel_param *kp);
 extern int param_get_ulong(char *buffer, const struct kernel_param *kp);
diff --git a/kernel/params.c b/kernel/params.c
index 164d79330849..a2b0e5b1b26b 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -239,6 +239,7 @@ STANDARD_PARAM_DEF(ushort,	unsigned short,		"%hu",		kstrtou16);
 STANDARD_PARAM_DEF(int,		int,			"%i",		kstrtoint);
 STANDARD_PARAM_DEF(uint,	unsigned int,		"%u",		kstrtouint);
 STANDARD_PARAM_DEF(long,	long,			"%li",		kstrtol);
+STANDARD_PARAM_DEF(llong,	long long,		"%lli",		kstrtoll);
 STANDARD_PARAM_DEF(ulong,	unsigned long,		"%lu",		kstrtoul);
 STANDARD_PARAM_DEF(ullong,	unsigned long long,	"%llu",		kstrtoull);
 STANDARD_PARAM_DEF(hexint,	unsigned int,		"%#08x", 	kstrtouint);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-26 17:18 [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state Mel Gorman
@ 2020-11-26 18:24 ` Rafael J. Wysocki
  2020-11-26 20:31   ` Mel Gorman
  2020-11-27 14:08   ` Marcelo Tosatti
  2020-11-28  5:41 ` kernel test robot
  1 sibling, 2 replies; 8+ messages in thread
From: Rafael J. Wysocki @ 2020-11-26 18:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Daniel Lezcano, Marcelo Tosatti,
	Linux Kernel Mailing List, Linux PM

On Thu, Nov 26, 2020 at 6:25 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> It was noted that a few workloads that idle rapidly regressed when commit
> 36fcb4292473 ("cpuidle: use first valid target residency as poll time")
> was merged. The workloads in question were heavy communicators that idle
> rapidly and were impacted by the c-state exit latency as the active CPUs
> were not polling at the time of wakeup. As they were not particularly
> realistic workloads, it was not considered to be a major problem.
>
> Unfortunately, a bug was then reported for a real workload in a production
> environment that relied on large numbers of threads operating in a worker
> pool pattern. These threads would idle for periods of time slightly
> longer than the C1 exit latency and so incurred the c-state exit latency.
> The application is sensitive to wakeup latency and appears to indirectly
> rely on behaviour prior to commit on a37b969a61c1 ("cpuidle: poll_state:
> Add time limit to poll_idle()") to poll for long enough to avoid the exit
> latency cost.

Well, this means that it depends on the governor to mispredict short
idle durations (so it selects "poll" over "C1" when it should select
"C1" often enough) and on the lack of a polling limit (or a large
enough one).

While the latter can be kind of addressed by increasing the polling
limit, the misprediction in the governor really isn't guaranteed to
happen and it really is necessary to have a PM QoS request in place to
ensure a suitable latency.

> The current behaviour favours power consumption over wakeup latency
> and it is reasonable behaviour but it should be tunable.

Only if there is no way to cover all of the relevant use cases in a
generally acceptable way without adding more module params etc.

In this particular case, it should be possible to determine a polling
limit acceptable to everyone.

BTW, I admit that using the exit latency of the lowest enabled C-state
was kind of arbitrary and it was based on the assumption that it would
make more sense to try to enter C1 instead of polling for that much
time, but C1 is an exception, because it is often artificially made
particularly attractive to the governors (by reducing its target
residency as much as possible).  Also making the polling limit that
short distorts the governor statistics somewhat.

So the polling limit equal to the target residency of C1 really may be
overly aggressive and something tick-based may work better in general
(e.g. 1/8 or 1/16 of the tick period).

In principle, a multiple of C1 target residency could be used too.

> In theory applications could use /dev/cpu_dma_latency but not all applications
> are aware of cpu_dma_latency. Similarly, a tool could be installed
> that opens cpu_dma_latency for the whole system but such a tool is not
> always available, is not always known to the sysadmin or the tool can have
> unexpected side-effects if it tunes more than cpu_dma_latency. In practice,
> it is more common for sysadmins to try idle=poll (which is x86 specific)

And really should be avoided if one cares about turbo or wants to
avoid thermal issues.

> or try disabling c-states and hope for the best.
>
> This patch makes it straight-forward to configure how long a CPU should
> poll before entering a c-state.

Well, IMV this is not straightforward at all.

It requires the admin to know how cpuidle works and why this
particular polling limit is likely to be suitable for the given
workload.  And whether or not the default polling limit should be
changed at all.

Honestly, nobody knows that in advance (with all due respect) and this
would cause people to try various settings at random and stick to the
one that they feel works best for them without much understanding.

> By default, there is no behaviour change.
> At build time a decision can be made to favour performance over power
> by default even if that potentially impacts turbo boosting for workloads
> that are sensitive to wakeup latency. In the event the kernel default is
> not suitable, the kernel command line can be used as a substitute for
> implementing cpu_dma_latency support in an application or requiring an
> additional tool to be installed.
>
> Note that it is not expected that tuning for longer polling times will be a
> universal win. For example, extra polling might prevent a turbo state being
> used or prevent hyperthread resources being released to an SMT sibling.
>
> By default, nothing has changed but here is an example of tbench4
> comparing the default "poll based on the min cstate" vs "poll based on
> the max cstate"
>
> tbench4
>                           min-cstate             max-cstate
> Hmean     1        512.88 (   0.00%)      566.74 *  10.50%*
> Hmean     2        999.47 (   0.00%)     1090.01 *   9.06%*
> Hmean     4       1960.83 (   0.00%)     2106.62 *   7.44%*
> Hmean     8       3828.61 (   0.00%)     4097.93 *   7.03%*
> Hmean     16      6899.44 (   0.00%)     7120.38 *   3.20%*
> Hmean     32     10718.38 (   0.00%)    10672.44 *  -0.43%*
> Hmean     64     12672.21 (   0.00%)    12608.15 *  -0.51%*
> Hmean     128    20744.83 (   0.00%)    21147.02 *   1.94%*
> Hmean     256    20646.60 (   0.00%)    20608.48 *  -0.18%*
> Hmean     320    20892.89 (   0.00%)    20831.99 *  -0.29%*

I'm wondering if you have similar results for "poll based on 2 x min
cstate" (or 4 x min cstate for that matter).


* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-26 18:24 ` Rafael J. Wysocki
@ 2020-11-26 20:31   ` Mel Gorman
  2020-11-27 10:53     ` Mel Gorman
  2020-11-27 14:08   ` Marcelo Tosatti
  1 sibling, 1 reply; 8+ messages in thread
From: Mel Gorman @ 2020-11-26 20:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Daniel Lezcano, Marcelo Tosatti, Linux Kernel Mailing List, Linux PM

On Thu, Nov 26, 2020 at 07:24:41PM +0100, Rafael J. Wysocki wrote:
> On Thu, Nov 26, 2020 at 6:25 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> > It was noted that a few workloads that idle rapidly regressed when commit
> > 36fcb4292473 ("cpuidle: use first valid target residency as poll time")
> > was merged. The workloads in question were heavy communicators that idle
> > rapidly and were impacted by the c-state exit latency as the active CPUs
> > were not polling at the time of wakeup. As they were not particularly
> > realistic workloads, it was not considered to be a major problem.
> >
> > Unfortunately, a bug was then reported for a real workload in a production
> > environment that relied on large numbers of threads operating in a worker
> > pool pattern. These threads would idle for periods of time slightly
> > longer than the C1 exit latency and so incurred the c-state exit latency.
> > The application is sensitive to wakeup latency and appears to indirectly
> > rely on behaviour prior to commit on a37b969a61c1 ("cpuidle: poll_state:
> > Add time limit to poll_idle()") to poll for long enough to avoid the exit
> > latency cost.
> 
> Well, this means that it depends on the governor to mispredict short
> idle durations (so it selects "poll" over "C1" when it should select
> "C1" often enough) and on the lack of a polling limit (or a large
> enough one).
> 

Potentially. I was limited in the degree to which the load could be
analysed, unfortunately, other than noting that the polling time should
have been longer.

> While the latter can be kind of addressed by increasing the polling
> limit, the misprediction in the governor really isn't guaranteed to
> happen and it really is necessary to have a PM QoS request in place to
> ensure a suitable latency.
> 

Indeed, so yes, the application should be using cpu_dma_latency but not
all of them do.

> > The current behaviour favours power consumption over wakeup latency
> > and it is reasonable behaviour but it should be tunable.
> 
> Only if there is no way to cover all of the relevant use cases in a
> generally acceptable way without adding more module params etc.
> 
> In this particular case, it should be possible to determine a polling
> limit acceptable to everyone.
> 

Potentially yes. cpuidle is not my strong suit but it could try adapting
the polling similar to how the menu governor tries to guess the typical
interval. Basically it would have to pick a polling interval between 2
and TICK_NSEC. Superficially, if a task is queued before polling
finishes, decrease the interval and increase it otherwise. That is a mess
though because then it may be polling for ages with nothing arriving. It
would have to start tracking when the CPU exited idle to see if polling
is even worthwhile.

I felt that starting with anything that tried adapting the polling
interval based on heuristics would meet higher resistance than making it
tunable. Hence, make it tunable so at least the problem can be addressed
when it's encountered.

> BTW, I admit that using the exit latency of the lowest enabled C-state
> was kind of arbitrary and it was based on the assumption that it would
> make more sense to try to enter C1 instead of polling for that much
> time, but C1 is an exception, because it is often artificially made
> particularly attractive to the governors (by reducing its target
> residency as much as possible).  Also making the polling limit that
> short distorts the governor statistics somewhat.
> 
> So the polling limit equal to the target residency of C1 really may be
> overly aggressive and something tick-based may work better in general
> (e.g. 1/8 or 1/16 of the tick period).
> 

Really, tying it to the c-states at all is a little counter-intuitive
because a task wakeup interarrival time has nothing to do with c-states.
It just happened to have some nice numbers that were smaller than the
tick and as you say, c1 may be artifically attractive to the governor if
the polling interval affects target residency estimations.

> In principle, a multiple of C1 target residency could be used too.
> 

Yes but then you hit the problem that C1 exit latencies can be very
different or not even exist. A fixed scaling factor might be great on
one machine and useless on another.

> > In theory applications could use /dev/cpu_dma_latency but not all applications
> > are aware of cpu_dma_latency. Similarly, a tool could be installed
> > that opens cpu_dma_latency for the whole system but such a tool is not
> > always available, is not always known to the sysadmin or the tool can have
> > unexpected side-effects if it tunes more than cpu_dma_latency. In practice,
> > it is more common for sysadmins to try idle=poll (which is x86 specific)
> 
> And really should be avoided if one cares about turbo or wants to
> avoid thermal issues.
> 

Yes.

> > or try disabling c-states and hope for the best.
> >
> > This patch makes it straight-forward to configure how long a CPU should
> > poll before entering a c-state.
> 
> Well, IMV this is not straightforward at all.
> 
> It requires the admin to know how cpuidle works and why this
> particular polling limit is likely to be suitable for the given
> workload.  And whether or not the default polling limit should be
> changed at all.
> 

I tried to make the parameter documentation as clear as possible so a
decision can be made without necessarily knowing how cpuidle works. If
they don't detect a need to change it, they'll never bother.

> Honestly, nobody knows that in advance (with all due respect) and this
> would cause people to try various settings at random and stick to the
> one that they feel works best for them without much understanding.
> 

More than likely, yes. However, I would point out that quite a lot of
"tuning" efforts are based on people randomly flipping tunables and hoping
for the best.

> > By default, there is no behaviour change.
> > At build time a decision can be made to favour performance over power
> > by default even if that potentially impacts turbo boosting for workloads
> > that are sensitive to wakeup latency. In the event the kernel default is
> > not suitable, the kernel command line can be used as a substitute for
> > implementing cpu_dma_latency support in an application or requiring an
> > additional tool to be installed.
> >
> > Note that it is not expected that tuning for longer polling times will be a
> > universal win. For example, extra polling might prevent a turbo state being
> > used or prevent hyperthread resources being released to an SMT sibling.
> >
> > By default, nothing has changed but here is an example of tbench4
> > comparing the default "poll based on the min cstate" vs "poll based on
> > the max cstate"
> >
> > tbench4
> >                           min-cstate             max-cstate
> > Hmean     1        512.88 (   0.00%)      566.74 *  10.50%*
> > Hmean     2        999.47 (   0.00%)     1090.01 *   9.06%*
> > Hmean     4       1960.83 (   0.00%)     2106.62 *   7.44%*
> > Hmean     8       3828.61 (   0.00%)     4097.93 *   7.03%*
> > Hmean     16      6899.44 (   0.00%)     7120.38 *   3.20%*
> > Hmean     32     10718.38 (   0.00%)    10672.44 *  -0.43%*
> > Hmean     64     12672.21 (   0.00%)    12608.15 *  -0.51%*
> > Hmean     128    20744.83 (   0.00%)    21147.02 *   1.94%*
> > Hmean     256    20646.60 (   0.00%)    20608.48 *  -0.18%*
> > Hmean     320    20892.89 (   0.00%)    20831.99 *  -0.29%*
> 
> I'm wondering if you have similar results for "poll based on 2 x min
> cstate" (or 4 x min cstate for that matter).

A fixed value of 30 generally worked out to roughly the same performance
as setting the polling interval based on the shallowest c-state. I did
not do many experiments with different possible values as I felt the
optimal value would be both workload and machine dependent.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-26 20:31   ` Mel Gorman
@ 2020-11-27 10:53     ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2020-11-27 10:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Daniel Lezcano, Marcelo Tosatti, Linux Kernel Mailing List, Linux PM

On Thu, Nov 26, 2020 at 08:31:51PM +0000, Mel Gorman wrote:
> > > and it is reasonable behaviour but it should be tunable.
> > 
> > Only if there is no way to cover all of the relevant use cases in a
> > generally acceptable way without adding more module params etc.
> > 
> > In this particular case, it should be possible to determine a polling
> > limit acceptable to everyone.
> > 
> 
> Potentially yes. cpuidle is not my strong suit but it could try adapting
> the polling similar to how the menu governor tries to guess the typical
> interval. Basically it would have to pick a polling interval between 2
> and TICK_NSEC. Superficially, if a task is queued before polling
> finishes, decrease the interval and increase it otherwise. That is a mess
> though because then it may be polling for ages with nothing arriving. It
> would have to start tracking when the CPU exited idle to see if polling
> is even worthwhile.
> 
> I felt that starting with anything that tried adapting the polling
> interval based on heuristics would meet higher resistance than making it
> tunable. Hence, make it tunable so at least the problem can be addressed
> when it's encountered.
> 

I looked at this again and determining a "polling limit acceptable
to everyone" looks like reimplementing haltpoll in the core or adding
haltpoll-like logic to each governor. I doubt that'll be a popular
approach.

The C1 exit latency as a hint is definitely too low though. On one of
the test machines, I measured the granularity of the time checks in
poll_idle() at boot time with something like this:

        for (i = 0; i < POLL_IDLE_RELAX_COUNT; i++) {
                cpu_relax();
        }

This takes roughly 1100ns on a test machine where the C1 exit latency is
2000ns. Let's say you have a basic pair of tasks communicating over a pipe
on the same machine (e.g. perf bench pipe). The time for a round-trip on
the same machine is roughly 7000ns meaning that polling is almost never
useful for a basic workload.


-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-26 18:24 ` Rafael J. Wysocki
  2020-11-26 20:31   ` Mel Gorman
@ 2020-11-27 14:08   ` Marcelo Tosatti
  2020-11-27 14:34     ` Mel Gorman
  2020-11-27 14:45     ` Mel Gorman
  1 sibling, 2 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2020-11-27 14:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Daniel Lezcano, Linux Kernel Mailing List, Linux PM

On Thu, Nov 26, 2020 at 07:24:41PM +0100, Rafael J. Wysocki wrote:
> On Thu, Nov 26, 2020 at 6:25 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > It was noted that a few workloads that idle rapidly regressed when commit
> > 36fcb4292473 ("cpuidle: use first valid target residency as poll time")
> > was merged. The workloads in question were heavy communicators that idle
> > rapidly and were impacted by the c-state exit latency as the active CPUs
> > were not polling at the time of wakeup. As they were not particularly
> > realistic workloads, it was not considered to be a major problem.
> >
> > Unfortunately, a bug was then reported for a real workload in a production
> > environment that relied on large numbers of threads operating in a worker
> > pool pattern. These threads would idle for periods of time slightly
> > longer than the C1 exit latency and so incurred the c-state exit latency.
> > The application is sensitive to wakeup latency and appears to indirectly
> > rely on behaviour prior to commit on a37b969a61c1 ("cpuidle: poll_state:
> > Add time limit to poll_idle()") to poll for long enough to avoid the exit
> > latency cost.
> 
> Well, this means that it depends on the governor to mispredict short
> idle durations (so it selects "poll" over "C1" when it should select
> "C1" often enough) and on the lack of a polling limit (or a large
> enough one).
> 
> While the latter can be kind of addressed by increasing the polling
> limit, the misprediction in the governor really isn't guaranteed to
> happen and it really is necessary to have a PM QoS request in place to
> ensure a suitable latency.
> 
> > The current behaviour favours power consumption over wakeup latency
> > and it is reasonable behaviour but it should be tunable.
> 
> Only if there is no way to cover all of the relevant use cases in a
> generally acceptable way without adding more module params etc.
> 
> In this particular case, it should be possible to determine a polling
> limit acceptable to everyone.
> 
> BTW, I admit that using the exit latency of the lowest enabled C-state
> was kind of arbitrary and it was based on the assumption that it would
> make more sense to try to enter C1 instead of polling for that much
> time, but C1 is an exception, because it is often artificially made
> particularly attractive to the governors (by reducing its target
> residency as much as possible).  Also making the polling limit that
> short distorts the governor statistics somewhat.
> 
> So the polling limit equal to the target residency of C1 really may be
> overly aggressive and something tick-based may work better in general
> (e.g. 1/8 or 1/16 of the tick period).
> 
> In principle, a multiple of C1 target residency could be used too.
> 
> > In theory applications could use /dev/cpu_dma_latency but not all applications
> > are aware of cpu_dma_latency. Similarly, a tool could be installed
> > that opens cpu_dma_latency for the whole system but such a tool is not
> > always available, is not always known to the sysadmin or the tool can have
> > unexpected side-effects if it tunes more than cpu_dma_latency. In practice,
> > it is more common for sysadmins to try idle=poll (which is x86 specific)
> 
> And really should be avoided if one cares about turbo or wants to
> avoid thermal issues.
> 
> > or try disabling c-states and hope for the best.
> >
> > This patch makes it straightforward to configure how long a CPU should
> > poll before entering a c-state.
> 
> Well, IMV this is not straightforward at all.
> 
> It requires the admin to know how cpuidle works and why this
> particular polling limit is likely to be suitable for the given
> workload.  And whether or not the default polling limit should be
> changed at all.

KVM polling (virt/kvm/kvm_main.c grow_halt_poll_ns/shrink_halt_poll_ns)
tries to adjust the polling window based on poll success/failure. 

The cpuidle haltpoll governor (for KVM guests) uses the same adjustment
logic.

Perhaps a similar (or improved) scheme can be adapted to bare metal.

https://www.kernel.org/doc/Documentation/virtual/kvm/halt-polling.txt
> 
> Honestly, nobody knows that in advance (with all due respect) and this
> would cause people to try various settings at random and stick to the
> one that they feel works best for them without much understanding.
> 
> > By default, there is no behaviour change.
> > At build time a decision can be made to favour performance over power
> > by default even if that potentially impacts turbo boosting for workloads
> > that are sensitive to wakeup latency. In the event the kernel default is
> > not suitable, the kernel command line can be used as a substitute for
> > implementing cpu_dma_latency support in an application or requiring an
> > additional tool to be installed.
> >
> > Note that it is not expected that tuning for longer polling times will be a
> > universal win. For example, extra polling might prevent a turbo state being
> > used or prevent hyperthread resources being released to an SMT sibling.
> >
> > By default, nothing has changed but here is an example of tbench4
> > comparing the default "poll based on the min cstate" vs "poll based on
> > the max cstate"
> >
> > tbench4
> >                           min-cstate             max-cstate
> > Hmean     1        512.88 (   0.00%)      566.74 *  10.50%*
> > Hmean     2        999.47 (   0.00%)     1090.01 *   9.06%*
> > Hmean     4       1960.83 (   0.00%)     2106.62 *   7.44%*
> > Hmean     8       3828.61 (   0.00%)     4097.93 *   7.03%*
> > Hmean     16      6899.44 (   0.00%)     7120.38 *   3.20%*
> > Hmean     32     10718.38 (   0.00%)    10672.44 *  -0.43%*
> > Hmean     64     12672.21 (   0.00%)    12608.15 *  -0.51%*
> > Hmean     128    20744.83 (   0.00%)    21147.02 *   1.94%*
> > Hmean     256    20646.60 (   0.00%)    20608.48 *  -0.18%*
> > Hmean     320    20892.89 (   0.00%)    20831.99 *  -0.29%*
> 
> I'm wondering if you have similar results for "poll based on 2 x min
> cstate" (or 4 x min cstate for that matter).



* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-27 14:08   ` Marcelo Tosatti
@ 2020-11-27 14:34     ` Mel Gorman
  2020-11-27 14:45     ` Mel Gorman
  1 sibling, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2020-11-27 14:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Rafael J. Wysocki, Daniel Lezcano, Linux Kernel Mailing List, Linux PM

On Fri, Nov 27, 2020 at 11:08:11AM -0300, Marcelo Tosatti wrote:
> > Well, IMV this is not straightforward at all.
> > 
> > It requires the admin to know how cpuidle works and why this
> > particular polling limit is likely to be suitable for the given
> > workload.  And whether or not the default polling limit should be
> > changed at all.
> 
> KVM polling (virt/kvm/kvm_main.c grow_halt_poll_ns/shrink_halt_poll_ns)
> tries to adjust the polling window based on poll success/failure. 
> 
> The cpuidle haltpoll governor (for KVM guests) uses the same adjustment
> logic.
> 
> Perhaps a similar (or improved) scheme can be adapted to bare metal.
> 
> https://www.kernel.org/doc/Documentation/virtual/kvm/halt-polling.txt

I'm aware of the haltpoll driver and why it's needed there. Given that
it adds cost to the reflect callback and increases exit latency, I
assumed it would be unpopular on bare metal.

I'm prototyping a v2 that simply picks a different balance point to see
if that gets a better reception.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-27 14:08   ` Marcelo Tosatti
  2020-11-27 14:34     ` Mel Gorman
@ 2020-11-27 14:45     ` Mel Gorman
  1 sibling, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2020-11-27 14:45 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Rafael J. Wysocki, Daniel Lezcano, Linux Kernel Mailing List, Linux PM

On Fri, Nov 27, 2020 at 11:08:11AM -0300, Marcelo Tosatti wrote:
> > It requires the admin to know how cpuidle works and why this
> > particular polling limit is likely to be suitable for the given
> > workload.  And whether or not the default polling limit should be
> > changed at all.
> 
> KVM polling (virt/kvm/kvm_main.c grow_halt_poll_ns/shrink_halt_poll_ns)
> tries to adjust the polling window based on poll success/failure. 
> 
> The cpuidle haltpoll governor (for KVM guests) uses the same adjustment
> logic.
> 
> Perhaps a similar (or improved) scheme can be adapted to bare metal.
> 
> https://www.kernel.org/doc/Documentation/virtual/kvm/halt-polling.txt

I'm aware of the haltpoll governor and why it makes sense for KVM. As
adaptive polling would increase the exit latency, I assumed it would be
an unpopular approach on bare metal. If that is not the case, it would
make more sense for the haltpoll adaptive logic to be part of the core
rather than a per-governor decision.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state
  2020-11-26 17:18 [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state Mel Gorman
  2020-11-26 18:24 ` Rafael J. Wysocki
@ 2020-11-28  5:41 ` kernel test robot
  1 sibling, 0 replies; 8+ messages in thread
From: kernel test robot @ 2020-11-28  5:41 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2081 bytes --]

Hi Mel,

I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on linux/master linus/master v5.10-rc5 next-20201127]
[cannot apply to daniel.lezcano/clockevents/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Mel-Gorman/cpuidle-Allow-configuration-of-the-polling-interval-before-cpuidle-enters-a-c-state/20201127-012003
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: arm-oxnas_v6_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/d1527e7760686f12ddc196853410773843ff9a44
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Mel-Gorman/cpuidle-Allow-configuration-of-the-polling-interval-before-cpuidle-enters-a-c-state/20201127-012003
        git checkout d1527e7760686f12ddc196853410773843ff9a44
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: drivers/cpuidle/cpuidle.o: in function `cpuidle_poll_time':
>> cpuidle.c:(.text+0xd10): undefined reference to `__aeabi_uldivmod'
>> arm-linux-gnueabi-ld: cpuidle.c:(.text+0xdb8): undefined reference to `__aeabi_uldivmod'
   arm-linux-gnueabi-ld: cpuidle.c:(.text+0xde4): undefined reference to `__aeabi_uldivmod'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org




Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 17:18 [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state Mel Gorman
2020-11-26 18:24 ` Rafael J. Wysocki
2020-11-26 20:31   ` Mel Gorman
2020-11-27 10:53     ` Mel Gorman
2020-11-27 14:08   ` Marcelo Tosatti
2020-11-27 14:34     ` Mel Gorman
2020-11-27 14:45     ` Mel Gorman
2020-11-28  5:41 ` kernel test robot
