* [PATCH 0/4] Reduce migrations and unnecessary spreading of load to multiple CPUs
@ 2018-01-30 10:45 Mel Gorman
  2018-01-30 10:45 ` [PATCH 1/4] sched/fair: Remove unnecessary parameters from wake_affine_idle Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 10:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman

It has been observed recently that the interaction between scheduler
decisions and cpufreq decisions can be unfavourable. Scheduler decisions
are biased towards reducing the latency of searches, which is great on
one hand, but tends to spread load across an entire socket unnecessarily.
At low utilisation, this means the load on each individual CPU is low,
which can be good, but cpufreq then decides that utilisation on individual
CPUs is too low to increase the P-state and overall throughput suffers.

When a cpufreq driver is completely under the control of the OS, this
can be compensated for. For example, intel_pstate can decide to boost
the apparent utilisation of a CPU where a task recently slept. However,
if hardware-based cpufreq is in play (e.g. HWP) then very poor decisions
can be made and the OS cannot do much about it. This only gets worse as HWP
becomes more prevalent, sockets get larger and the P-state for individual
cores can be controlled. Just setting the performance governor is not an
answer given that plenty of people really do worry about power utilisation
and still want a reasonable balance between performance and power.

Patches 0-3 of this series reduce the number of migrations due to interrupts.
	Specifically, if prev CPU and the current CPU share cache and prev
	CPU is idle then use it in preference to migrating the load. The full
	reasoning why is in the changelog.

Patch 4 observes that co-operating tasks, particularly an application and
	a kworker, can push a load around a socket very quickly. This is
	particularly true if interrupts are involved (e.g. IO completions)
	and are delivered round-robin (e.g. due to MSI-X). It tracks
	what CPU was used for a recent wakeup and reuses that CPU if it's
	still idle when woken later. This reduces the number of cores that
	are active for a workload and can give a big boost in throughput
	without a hit to wakeup latency or stacking multiple tasks on one
	CPU when a machine is lightly loaded.
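
As an overview, below is a condensed sketch of the wake-target order the
series converges on. It paraphrases patches 3 and 4; the LLC scan fallback
and the cpumask checks are omitted, so it is not the exact kernel code.

static int sketch_pick_wake_cpu(struct task_struct *p, int prev, int target)
{
	if (idle_cpu(target))
		return target;

	/* Patches 1-3: prefer an idle, cache-affine prev CPU over migrating. */
	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;

	/* Patch 4: reuse a recently used CPU if it is still idle. */
	if (p->recent_used_cpu != prev && p->recent_used_cpu != target &&
	    cpus_share_cache(p->recent_used_cpu, target) &&
	    idle_cpu(p->recent_used_cpu))
		return p->recent_used_cpu;

	/* Otherwise fall back to the wrapped walk of the LLC domain. */
	return target;
}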

 include/linux/sched.h |  8 ++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 69 +++++++++++++++++++++++++++++++++------------------
 3 files changed, 54 insertions(+), 24 deletions(-)

-- 
2.15.1

* [PATCH 1/4] sched/fair: Remove unnecessary parameters from wake_affine_idle
  2018-01-30 10:45 [PATCH 0/4] Reduce migrations and unnecessary spreading of load to multiple CPUs Mel Gorman
@ 2018-01-30 10:45 ` Mel Gorman
  2018-02-06 11:55   ` [tip:sched/urgent] sched/fair: Remove unnecessary parameters from wake_affine_idle() tip-bot for Mel Gorman
  2018-01-30 10:45 ` [PATCH 2/4] sched/fair: Restructure wake_affine to return a CPU id Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 10:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman

wake_affine_idle takes parameters it never uses so clean it up.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5aa629021ceb..9a5d8462c13c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5696,10 +5696,8 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-
 static bool
-wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
-		 int this_cpu, int prev_cpu, int sync)
+wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5756,8 +5754,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	int this_cpu = smp_processor_id();
 	bool affine = false;
 
-	if (sched_feat(WA_IDLE) && !affine)
-		affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_IDLE))
+		affine = wake_affine_idle(this_cpu, prev_cpu, sync);
 
 	if (sched_feat(WA_WEIGHT) && !affine)
 		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
-- 
2.15.1

* [PATCH 2/4] sched/fair: Restructure wake_affine to return a CPU id
  2018-01-30 10:45 [PATCH 0/4] Reduce migrations and unnecessary spreading of load to multiple CPUs Mel Gorman
  2018-01-30 10:45 ` [PATCH 1/4] sched/fair: Remove unnecessary parameters from wake_affine_idle Mel Gorman
@ 2018-01-30 10:45 ` Mel Gorman
  2018-02-06 11:56   ` [tip:sched/urgent] sched/fair: Restructure wake_affine*() " tip-bot for Mel Gorman
  2018-01-30 10:45 ` [PATCH 3/4] sched/fair: Do not migrate if the prev_cpu is idle Mel Gorman
  2018-01-30 10:45 ` [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS Mel Gorman
  3 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 10:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman

This is a preparation patch that has wake_affine return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5d8462c13c..1aebe79da2ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5696,7 +5696,7 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-static bool
+static int
 wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
 	/*
@@ -5706,15 +5706,15 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * node depending on the IO topology or IRQ affinity settings.
 	 */
 	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return true;
+		return this_cpu;
 
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
-		return true;
+		return this_cpu;
 
-	return false;
+	return nr_cpumask_bits;
 }
 
-static bool
+static int
 wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		   int this_cpu, int prev_cpu, int sync)
 {
@@ -5728,7 +5728,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		unsigned long current_load = task_h_load(current);
 
 		if (current_load > this_eff_load)
-			return true;
+			return this_cpu;
 
 		this_eff_load -= current_load;
 	}
@@ -5745,28 +5745,28 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= capacity_of(this_cpu);
 
-	return this_eff_load <= prev_eff_load;
+	return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine = false;
+	int target = nr_cpumask_bits;
 
 	if (sched_feat(WA_IDLE))
-		affine = wake_affine_idle(this_cpu, prev_cpu, sync);
+		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
-	if (sched_feat(WA_WEIGHT) && !affine)
-		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
+		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
-	if (affine) {
-		schedstat_inc(sd->ttwu_move_affine);
-		schedstat_inc(p->se.statistics.nr_wakeups_affine);
-	}
+	if (target == nr_cpumask_bits)
+		return prev_cpu;
 
-	return affine;
+	schedstat_inc(sd->ttwu_move_affine);
+	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	return target;
 }
 
 static inline int task_util(struct task_struct *p);
@@ -6359,8 +6359,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu == prev_cpu)
 			goto pick_cpu;
 
-		if (wake_affine(affine_sd, p, prev_cpu, sync))
-			new_cpu = cpu;
+		new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
 	}
 
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {
-- 
2.15.1

* [PATCH 3/4] sched/fair: Do not migrate if the prev_cpu is idle
  2018-01-30 10:45 [PATCH 0/4] Reduce migrations and unnecessary spreading of load to multiple CPUs Mel Gorman
  2018-01-30 10:45 ` [PATCH 1/4] sched/fair: Remove unnecessary parameters from wake_affine_idle Mel Gorman
  2018-01-30 10:45 ` [PATCH 2/4] sched/fair: Restructure wake_affine to return a CPU id Mel Gorman
@ 2018-01-30 10:45 ` Mel Gorman
  2018-02-06 11:56   ` [tip:sched/urgent] " tip-bot for Mel Gorman
  2018-01-30 10:45 ` [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS Mel Gorman
  3 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 10:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman

wake_affine_idle prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task, and that moving there
avoids a search. However, there is no way to determine whether cache hot
data on the previous CPU outweighs the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised. This can interact badly with cpufreq, which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that P-states are not boosted to their maximum levels even though
the workload is latency- and throughput-sensitive.

This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU, even if the current CPU is idle due to the wakeup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.

A variety of workloads were tested on various machines and no adverse
impact was noticed outside of noise. dbench on ext4 on UMA showed
roughly a 10% reduction in the number of CPU migrations and is a case
where interrupts are frequent for IO completions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.

                         4.15.0-rc9             4.15.0-rc9
                           baseline              waprev-v1
Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)

Note that the difference in performance is marginal but, at low utilisation,
there is less variability.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1aebe79da2ab..3b732caa6fba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5704,9 +5704,15 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * context. Only allow the move if cache is shared. Otherwise an
 	 * interrupt intensive workload could force all tasks onto one
 	 * node depending on the IO topology or IRQ affinity settings.
+	 *
+	 * If the prev_cpu is idle and cache affine then avoid a migration.
+	 * There is no guarantee that the cache hot data from an interrupt
+	 * is more important than cache hot data on the prev_cpu and from
+	 * a cpufreq perspective, it's better to have higher utilisation
+	 * on one CPU.
 	 */
 	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return this_cpu;
+		return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
-- 
2.15.1

* [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 10:45 [PATCH 0/4] Reduce migrations and unnecessary spreading of load to multiple CPUs Mel Gorman
                   ` (2 preceding siblings ...)
  2018-01-30 10:45 ` [PATCH 3/4] sched/fair: Do not migrate if the prev_cpu is idle Mel Gorman
@ 2018-01-30 10:45 ` Mel Gorman
  2018-01-30 11:50   ` Peter Zijlstra
                     ` (2 more replies)
  3 siblings, 3 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 10:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman

The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
Rewrite and improve select_idle_siblings()") replaced a domain iteration
with a search that broadly speaking does a wrapped walk of the scheduler
domain sharing a last-level-cache. While this had a number of improvements,
one consequence is that two tasks that share a waker/wakee relationship push
each other around a socket. Even though two tasks may be active, all cores
are evenly used. This is great from a search perspective and spreads a load
across individual cores but it has adverse consequences for cpufreq. As each
CPU has relatively low utilisation, cpufreq may decide the utilisation is
too low to use a higher P-state and overall computation throughput suffers.
While individual cpufreq and cpuidle drivers may compensate by artificially
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.

This patch tracks a recently used CPU based on what CPU a task was
running on when it last was a waker. That is a CPU the task was recently
using when it is later a wakee. During SIS, the recently used CPU is used
as a target if it's still allowed by the task and is idle.

The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion.  With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but a similar
principle applies).

A (cpu 0)	wake	B (wakes on cpu 1)
B (cpu 1)	wake	A (wakes on cpu 2)
A (cpu 2)	wake	B (wakes on cpu 3)
etc.

A careful reader may wonder why cpu 0 was not idle when B woke A the
first time. It's simply because A can be rescheduled to another CPU, so
the pattern is that prev == target when B tries to wake A and the
information about CPU 0 has been lost.

With this patch, the pattern is more likely to be

A (cpu 0)	wake	B (wakes on cpu 1)
B (cpu 1)	wake	A (wakes on cpu 0)
A (cpu 0)	wake	B (wakes on cpu 1)
etc

i.e. two communicating tasks are more likely to use just two cores instead
of all available cores sharing an LLC.

The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different.

                         4.15.0-rc9             4.15.0-rc9
                          waprev-v1        biasancestor-v1
Hmean     1       287.54 (   0.00%)      817.01 ( 184.14%)
Hmean     2      1268.12 (   0.00%)     1781.24 (  40.46%)
Hmean     4      1739.68 (   0.00%)     1594.47 (  -8.35%)
Hmean     8      2464.12 (   0.00%)     2479.56 (   0.63%)
Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)

The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also sees small improvements
(6-11% depending on machine and thread count). The facebook schbench was also
tested but in most cases showed little or no difference in wakeup latencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched.h |  8 ++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 22 ++++++++++++++++++++--
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2588263a989..d9140ddaa4e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -551,6 +551,14 @@ struct task_struct {
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
 
+	/*
+	 * recent_used_cpu is initially set as the last CPU used by a task
+	 * that wakes affine another task. Waker/wakee relationships can
+	 * push tasks around a CPU where each wakeup moves to the next one.
+	 * Tracking a recently used CPU allows a quick search for a recently
+	 * used CPU that may be idle.
+	 */
+	int				recent_used_cpu;
 	int				wake_cpu;
 #endif
 	int				on_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32aabfda..68d7bcaf0fc7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2460,6 +2460,7 @@ void wake_up_new_task(struct task_struct *p)
 	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
 	 * as we're not fully set-up yet.
 	 */
+	p->recent_used_cpu = task_cpu(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b732caa6fba..e96b0c1b43ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6201,7 +6201,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
-	int i;
+	int i, recent_used_cpu;
 
 	if (idle_cpu(target))
 		return target;
@@ -6212,6 +6212,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
 		return prev;
 
+	/* Check a recently used CPU as a potential idle candidate */
+	recent_used_cpu = p->recent_used_cpu;
+	if (recent_used_cpu != prev &&
+	    recent_used_cpu != target &&
+	    cpus_share_cache(recent_used_cpu, target) &&
+	    idle_cpu(recent_used_cpu) &&
+	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
+		/*
+		 * Replace recent_used_cpu with prev as it is a potential
+		 * candidate for the next wake.
+		 */
+		p->recent_used_cpu = prev;
+		return recent_used_cpu;
+	}
+
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	if (!sd)
 		return target;
@@ -6379,9 +6394,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (!sd) {
 pick_cpu:
-		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
+			if (want_affine)
+				current->recent_used_cpu = cpu;
+		}
 	} else {
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	}
-- 
2.15.1

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 10:45 ` [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS Mel Gorman
@ 2018-01-30 11:50   ` Peter Zijlstra
  2018-01-30 12:57     ` Mel Gorman
  2018-01-30 11:53   ` Peter Zijlstra
  2018-02-06 11:56   ` [tip:sched/urgent] " tip-bot for Mel Gorman
  2 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-30 11:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 10:45:55AM +0000, Mel Gorman wrote:
> The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
> Rewrite and improve select_idle_siblings()") replaced a domain iteration
> with a search that broadly speaking does a wrapped walk of the scheduler
> domain sharing a last-level-cache. While this had a number of improvements,
> one consequence is that two tasks that share a waker/wakee relationship push
> each other around a socket. Even though two tasks may be active, all cores
> are evenly used. This is great from a search perspective and spreads a load
> across individual cores but it has adverse consequences for cpufreq. As each
> CPU has relatively low utilisation, cpufreq may decide the utilisation is
> too low to use a higher P-state and overall computation throughput suffers.

> While individual cpufreq and cpuidle drivers may compensate by artificially
> boosting P-state (at c0) or avoiding lower C-states (during idle), it does
> not help if hardware-based cpufreq (e.g. HWP) is used.

Not saying this patch is bad; but Rafael / Srinivas we really should do
better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
still give it hints, and it looks like we're not doing that.

Mel, what hardware are you testing this on?

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 10:45 ` [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS Mel Gorman
  2018-01-30 11:50   ` Peter Zijlstra
@ 2018-01-30 11:53   ` Peter Zijlstra
  2018-01-30 12:59     ` Mel Gorman
  2018-01-30 13:06     ` Peter Zijlstra
  2018-02-06 11:56   ` [tip:sched/urgent] " tip-bot for Mel Gorman
  2 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-30 11:53 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML

On Tue, Jan 30, 2018 at 10:45:55AM +0000, Mel Gorman wrote:
> The results can be less dramatic on NUMA where automatic balancing interferes
> with the test. It's also known that network benchmarks running on localhost
> also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> and TCP depending on the machine). Hackbench also sees small improvements
> (6-11% depending on machine and thread count). The facebook schbench was also
> tested but in most cases showed little or no difference in wakeup latencies.

What cpufreq setting were you using for these tests?

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 11:50   ` Peter Zijlstra
@ 2018-01-30 12:57     ` Mel Gorman
  2018-01-30 13:15       ` Peter Zijlstra
  2018-01-30 13:15       ` Mike Galbraith
  0 siblings, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +0000, Mel Gorman wrote:
> > The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
> > Rewrite and improve select_idle_siblings()") replaced a domain iteration
> > with a search that broadly speaking does a wrapped walk of the scheduler
> > domain sharing a last-level-cache. While this had a number of improvements,
> > one consequence is that two tasks that share a waker/wakee relationship push
> > each other around a socket. Even though two tasks may be active, all cores
> > are evenly used. This is great from a search perspective and spreads a load
> > across individual cores but it has adverse consequences for cpufreq. As each
> > CPU has relatively low utilisation, cpufreq may decide the utilisation is
> > too low to use a higher P-state and overall computation throughput suffers.
> 
> > While individual cpufreq and cpuidle drivers may compensate by artificially
> > boosting P-state (at c0) or avoiding lower C-states (during idle), it does
> > not help if hardware-based cpufreq (e.g. HWP) is used.
> 
> Not saying this patch is bad; but Rafael / Srinivas we really should do
> better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> still give it hints, and it looks like we're not doing that.
> 

I'm not sure if HWP can fix it because of the per-cpu nature of its
decisions. I believe it can only give the most basic of hints to hardware
like an energy performance profile or bias (EPP and EPB respectively).
Of course HWP can be turned off but not many people can detect that it's
an appropriate decision, or even desirable, and there is always the caveat
that disabling it increases the system CPU footprint.

> Mel, what hardware are you testing this on?

The primary one was a single socket skylake machine with 8 threads (HT
enabled). However, 11 machines were used in total across multiple
generations to reduce the chance of a regression slipping in that was
machine-specific.


-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 11:53   ` Peter Zijlstra
@ 2018-01-30 12:59     ` Mel Gorman
  2018-01-30 13:06     ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 12:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML

On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +0000, Mel Gorman wrote:
> > The results can be less dramatic on NUMA where automatic balancing interferes
> > with the test. It's also known that network benchmarks running on localhost
> > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > and TCP depending on the machine). Hackbench also sees small improvements
> > (6-11% depending on machine and thread count). The facebook schbench was also
> > tested but in most cases showed little or no difference in wakeup latencies.
> 
> What cpufreq setting were you using for these tests?

Default powersave settings in general -- intel_pstate driver is the one
used most often but HWP was not always available. By and large, no special
tuning was applied. Two sets of tests were run, one with turbostat and perf
keeping track of p-states and migrations and one without to make sure the
measured increase was real. The irony when hitting cpufreq-related problems
is that monitoring can mask the issue by increasing overall utilisation.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 11:53   ` Peter Zijlstra
  2018-01-30 12:59     ` Mel Gorman
@ 2018-01-30 13:06     ` Peter Zijlstra
  2018-01-30 13:18       ` Mel Gorman
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-30 13:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML

On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +0000, Mel Gorman wrote:
> > The results can be less dramatic on NUMA where automatic balancing interferes
> > with the test. It's also known that network benchmarks running on localhost
> > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > and TCP depending on the machine). Hackbench also sees small improvements
> > (6-11% depending on machine and thread count). The facebook schbench was also
> > tested but in most cases showed little or no difference in wakeup latencies.
> 
> What cpufreq setting were you using for these tests?

I cannot measure any hackbench variation one way or the other with these
patches using 'performance' mode. So I'll assume you've been running
things with HWP or something.

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 12:57     ` Mel Gorman
@ 2018-01-30 13:15       ` Peter Zijlstra
  2018-01-30 13:25         ` Mel Gorman
  2018-01-31  9:22         ` Rafael J. Wysocki
  2018-01-30 13:15       ` Mike Galbraith
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-30 13:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 12:57:18PM +0000, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:

> > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > still give it hints, and it looks like we're not doing that.
> > 
> 
> I'm not sure if HWP can fix it because of the per-cpu nature of its
> decisions. I believe it can only give the most basic of hints to hardware
> like an energy performance profile or bias (EPP and EPB respectively).
> Of course HWP can be turned off but not many people can detect that it's
> an appropriate decision, or even desirable, and there is always the caveat
> that disabling it increases the system CPU footprint.

IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
"Desired_Performance" fields which can be used to give explicit
frequency hints. And we really _should_ be doing that.

Because, esp. in this scenario; a task migrating; the hardware really
can't do anything sensible, whereas the OS _knows_.
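
As an illustration of those fields, here is a minimal sketch of writing
the Desired_Performance field, based on the IA32_HWP_REQUEST layout in
the SDM and the HWP_DESIRED_PERF() helper in
arch/x86/include/asm/msr-index.h. It is illustrative only, not code
proposed in this thread.

#include <asm/msr.h>
#include <asm/msr-index.h>

/* Set Desired_Performance (bits 23:16) of IA32_HWP_REQUEST on the
 * local CPU. A non-zero value disables autonomous HWP selection. */
static void hwp_set_desired_local(u8 desired)
{
	u64 value;

	rdmsrl(MSR_HWP_REQUEST, value);
	value &= ~HWP_DESIRED_PERF(0xffUL);	/* clear the field */
	value |= HWP_DESIRED_PERF(desired);
	wrmsrl(MSR_HWP_REQUEST, value);
}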

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 12:57     ` Mel Gorman
  2018-01-30 13:15       ` Peter Zijlstra
@ 2018-01-30 13:15       ` Mike Galbraith
  2018-01-30 13:25         ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2018-01-30 13:15 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra; +Cc: Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, 2018-01-30 at 12:57 +0000, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > Mel, what hardware are you testing this on?
> 
> The primary one was a single socket skylake machine with 8 threads (HT
> enabled).

I took it for a spin in a 2 x Gold 6152 box.

	-Mike

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:06     ` Peter Zijlstra
@ 2018-01-30 13:18       ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 13:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML

On Tue, Jan 30, 2018 at 02:06:06PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 30, 2018 at 10:45:55AM +0000, Mel Gorman wrote:
> > > The results can be less dramatic on NUMA where automatic balancing interferes
> > > with the test. It's also known that network benchmarks running on localhost
> > > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > > and TCP depending on the machine). Hackbench also sees small improvements
> > > (6-11% depending on machine and thread count). The facebook schbench was also
> > > tested but in most cases showed little or no difference in wakeup latencies.
> > 
> > What cpufreq setting were you using for these tests?
> 
> I cannot measure any hackbench variation one way or the other with these
> patches using 'performance' mode. So I'll assume you've been running
> things with HWP or something.

Correct, I am not using the performance governor as it's not the default and
not necessarily desirable as a default. HWP is enabled in some cases, but
even machines using the intel_pstate driver without HWP benefit.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:15       ` Peter Zijlstra
@ 2018-01-30 13:25         ` Mel Gorman
  2018-01-30 13:40           ` Peter Zijlstra
  2018-01-31  9:22         ` Rafael J. Wysocki
  1 sibling, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 02:15:31PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +0000, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> > > 
> > 
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
> 
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
> 

They can be, although these are usually set by the BIOS or set up early
in boot and then left alone. It's not clear how or if these should be
tuned on the fly, or what variables would drive dynamic tuning. The data
collected would still be per-cpu, so if all CPUs have low utilisation
the decisions will still be poor, except maybe for things like IO boosting.

> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.
> 

Potentially yes. One option without HWP would be to track utilisation
for a task or artificially boost it for a short period after migration so
a higher p-state is potentially selected. With HWP, a hint would have to
be given to the hardware to try to select a higher frequency but I've no idea
how expensive that is or how it would behave on different implementations
of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
configuration correct. One advantage of using fewer cores is that it should
work regardless of cpufreq driver.
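
To make the non-HWP option concrete, here is a purely hypothetical
sketch; the names and the boost policy are invented for illustration and
are not taken from any posted patch:

/*
 * Hypothetical: report a utilisation floor for a short window after a
 * task migrates so the cpufreq governor selects a higher P-state.
 */
#define MIGRATION_BOOST_NS	(2 * NSEC_PER_MSEC)

static unsigned long migration_boosted_util(unsigned long util,
					    unsigned long capacity,
					    u64 migrated_at, u64 now)
{
	/* Within the boost window, pretend at least half capacity is used. */
	if (now - migrated_at < MIGRATION_BOOST_NS)
		return max(util, capacity >> 1);

	return util;
}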


-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:15       ` Mike Galbraith
@ 2018-01-30 13:25         ` Peter Zijlstra
  2018-01-30 13:35           ` Mike Galbraith
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-30 13:25 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Mel Gorman, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> On Tue, 2018-01-30 at 12:57 +0000, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > 
> > > Mel, what hardware are you testing this on?
> > 
> > The primary one was a single socket skylake machine with 8 threads (HT
> > enabled).
> 
> I took it for a spin in a 2 x Gold 6152 box.

Can you translate that marketing speak for a simpleton like me? That's
what we should've called SKL-EP, right?

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:25         ` Peter Zijlstra
@ 2018-01-30 13:35           ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2018-01-30 13:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mel Gorman, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, 2018-01-30 at 14:25 +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> > On Tue, 2018-01-30 at 12:57 +0000, Mel Gorman wrote:
> > > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > > 
> > > > Mel, what hardware are you testing this on?
> > > 
> > > The primary one was a single socket skylake machine with 8 threads (HT
> > > enabled).
> > 
> > I took it for a spin in a 2 x Gold 6152 box.
> 
> Can you translate that marketing speak for a simpleton like me? That's
> what we should've called SKL-EP, right?

Um um.. danged if I know.  It's SKL, w. 22 cores each + SMT.

	-Mike

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:25         ` Mel Gorman
@ 2018-01-30 13:40           ` Peter Zijlstra
  2018-01-30 14:06             ` Mel Gorman
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-30 13:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 01:25:27PM +0000, Mel Gorman wrote:
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
> 
> Potentially yes. One option without HWP would be to track utilisation
> for a task or artificially boost it for a short period after migration so
> a higher p-state is potentially selected. With HWP, a hint would have to
> be given to the hardware to try to select a higher frequency but I've no idea
> how expensive that is or how it would behave on different implementations
> of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
> configuration correct.

We have much of this infrastructure and have been working on improving
it [1]. We already track per task utilization and feed it into a cpufreq
governor (schedutil).

> One advantage of using fewer cores is that it should
> work regardless of cpufreq driver.

Sure, and I started out by saying this patch isn't necessarily bad; but
I think our 'use' [2] of HWP _is_.


[1] https://lkml.kernel.org/r/20180123180847.4477-1-patrick.bellasi@arm.com
[2] IIRC we basically don't do _anything_ when HWP.

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:40           ` Peter Zijlstra
@ 2018-01-30 14:06             ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-30 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Matt Fleming, LKML, rjw, srinivas.pandruvada

On Tue, Jan 30, 2018 at 02:40:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 01:25:27PM +0000, Mel Gorman wrote:
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > Potentially yes. One option without HWP would be to track utilisation
> > for a task or artificially boost it for a short period after migration so
> > a higher p-state is potentially selected. With HWP, a hint would have to
> > be given to the hardware to try to select a higher frequency but I've no idea
> > how expensive that is or how it would behave on different implementations
> > of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
> > configuration correct.
> 
> We have much of this infrastructure and have been working on improving
> it [1]. We already track per task utilization and feed it into a cpufreq
> governor (schedutil).
> 

Which is potentially great without HWP but ...

> > One advantage of using fewer cores is that it should
> > work regardless of cpufreq driver.
> 
> Sure, and I started out by saying this patch isn't necessarily bad; but
> I think our 'use' [2] of HWP _is_.
> 

This is where we get caught -- HWP means we hand full responsibility to
the hardware with limited feedback options. As far as I can tell, any
recent Intel CPU is going to have HWP and we've enabled it by default if
available since 4.6.

Even if further hints can be given from the OS, it does not necessarily
mean this patch is unhelpful. At low utilisation, it's still somewhat
sensible to use the minimum number of CPUs necessary (less cache traffic,
higher p-states, maybe beneficial to turbo boost etc.).

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 13:15       ` Peter Zijlstra
  2018-01-30 13:25         ` Mel Gorman
@ 2018-01-31  9:22         ` Rafael J. Wysocki
  2018-01-31 10:17           ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Rafael J. Wysocki @ 2018-01-31  9:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +0000, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> > > 
> > 
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
> 
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
> 
> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.

But IA32_HWP_REQUEST is not a cheap MSR to write to.

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-31  9:22         ` Rafael J. Wysocki
@ 2018-01-31 10:17           ` Peter Zijlstra
  2018-01-31 11:54             ` Mel Gorman
                               ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-01-31 10:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:

> > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > "Desired_Performance" fields which can be used to give explicit
> > frequency hints. And we really _should_ be doing that.
> > 
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
> 
> But IA32_HWP_REQUEST is not a cheap MSR to write to.

That just means we might need to throttle writing to it, like it already
does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
cheap msr?)

Not touching it at all seems silly.

But now that you made me look, intel_pstate_hwp_set() is horrible crap.
You should _never_ do things like:

  rdmsr_on_cpu()
  /* frob value */
  wrmsr_on_cpu()

That's insane.

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-31 10:17           ` Peter Zijlstra
@ 2018-01-31 11:54             ` Mel Gorman
  2018-01-31 17:44             ` Srinivas Pandruvada
  2018-02-01  7:50             ` Rafael J. Wysocki
  2 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2018-01-31 11:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Mike Galbraith, Matt Fleming, LKML,
	srinivas.pandruvada

On Wed, Jan 31, 2018 at 11:17:10AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> 
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
> 
> Not touching it at all seems silly.
> 

Note that even if we do such programming, it would still be desirable to
minimise the number of times we have to reprogram it so the series as it
stands would still be useful.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-31 10:17           ` Peter Zijlstra
  2018-01-31 11:54             ` Mel Gorman
@ 2018-01-31 17:44             ` Srinivas Pandruvada
  2018-02-01  9:11               ` Peter Zijlstra
  2018-02-01  7:50             ` Rafael J. Wysocki
  2 siblings, 1 reply; 48+ messages in thread
From: Srinivas Pandruvada @ 2018-01-31 17:44 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML

On Wed, 2018-01-31 at 11:17 +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance"
> > > and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware
> > > really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it
> already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is
> that a
> cheap msr?)
Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
much slower compared to PERF_CTL (as high as 10:1). 

> 
> Not touching it at all seems silly.
> 
> But now that you made me look, intel_pstate_hwp_set() is horrible
> crap.
> You should _never_ do things like:
> 
>   rdmsr_on_cpu()
>   /* frob value */
>   wrmsr_on_cpu()
> 
> That's insane.

Since the cpufreq callback is not guaranteed to be called on the same
CPU, we have to use rd/wrmsr_on_cpu().
But we can use smp_call_function_single() and optimize this.
This function is called only during init, when usermode changes
frequency limits and from thermal, so very few times.

Thanks,
Srinivas

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-31 10:17           ` Peter Zijlstra
  2018-01-31 11:54             ` Mel Gorman
  2018-01-31 17:44             ` Srinivas Pandruvada
@ 2018-02-01  7:50             ` Rafael J. Wysocki
  2018-02-01  9:11               ` Peter Zijlstra
  2 siblings, 1 reply; 48+ messages in thread
From: Rafael J. Wysocki @ 2018-02-01  7:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> 
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
> 
> Not touching it at all seems silly.

OK

So what field precisely would you touch?  "desired"?  If so, does that actually
guarantee anything to happen?

> But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> You should _never_ do things like:
> 
>   rdmsr_on_cpu()
>   /* frob value */
>   wrmsr_on_cpu()
> 
> That's insane.

I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
at all?

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-01  7:50             ` Rafael J. Wysocki
@ 2018-02-01  9:11               ` Peter Zijlstra
  2018-02-01 13:18                 ` Srinivas Pandruvada
  2018-02-02 11:42                 ` Rafael J. Wysocki
  0 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-01  9:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > 
> > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > > "Desired_Performance" fields which can be used to give explicit
> > > > frequency hints. And we really _should_ be doing that.
> > > > 
> > > > Because, esp. in this scenario; a task migrating; the hardware really
> > > > can't do anything sensible, whereas the OS _knows_.
> > > 
> > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > 
> > That just means we might need to throttle writing to it, like it already
> > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> > cheap msr?)
> > 
> > Not touching it at all seems silly.
> 
> OK
> 
> So what field precisely would you touch?  "desired"?  If so, does that actually
> guarantee anything to happen?

No idea, desired would be the one I would start with, it matches with
the intent here. But I've no idea what our current HWP implementation
actually does with it.

> > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > You should _never_ do things like:
> > 
> >   rdmsr_on_cpu()
> >   /* frob value */
> >   wrmsr_on_cpu()
> > 
> > That's insane.
> 
> I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> at all?

Yes, too many synchronous IPIs, which themselves are typically already
more expensive than the MSR access.

At one point I looked at getting rid of the *msr_on_cpu() crud entirely,
but there are just too many users out there I didn't feel like touching.

If you really care you can do async IPIs and do a custom serialization
that only waits when you do back-to-back things, which should be fairly
uncommon I'd think.

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-31 17:44             ` Srinivas Pandruvada
@ 2018-02-01  9:11               ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-01  9:11 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Mel Gorman, Mike Galbraith, Matt Fleming, LKML

On Wed, Jan 31, 2018 at 09:44:18AM -0800, Srinivas Pandruvada wrote:

> Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
> much slower compared to PERF_CTL (as high as 10:1). 

Still much better than what other architectures have to deal with ;-)

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-01  9:11               ` Peter Zijlstra
@ 2018-02-01 13:18                 ` Srinivas Pandruvada
  2018-02-02 11:00                   ` Rafael J. Wysocki
  2018-02-02 11:42                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 48+ messages in thread
From: Srinivas Pandruvada @ 2018-02-01 13:18 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML

On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > 
> > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > wrote:
> > > 
> > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > wrote:
> > > > 
> > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > wrote:
> > > > 
> > > > > 
> > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > "Maximum_Performance" and
> > > > > "Desired_Performance" fields which can be used to give
> > > > > explicit
> > > > > frequency hints. And we really _should_ be doing that.
> > > > > 
> > > > > Because, esp. in this scenario; a task migrating; the
> > > > > hardware really
> > > > > can't do anything sensible, whereas the OS _knows_.
> > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > That just means we might need to throttle writing to it, like it
> > > already
> > > does for the regular pstate (PERF_CTRL) msr in any case (also, is
> > > that a
> > > cheap msr?)
> > > 
> > > Not touching it at all seems silly.
> > OK
> > 
> > So what field precisely would you touch?  "desired"?  If so, does
> > that actually
> > guarantee anything to happen?
> No idea, desired would be the one I would start with, it matches with
> the intent here. But I've no idea what our current HWP implementation
> actually does with it.
Desired !=0 will disable HWP autonomous mode of frequency selection.

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-01 13:18                 ` Srinivas Pandruvada
@ 2018-02-02 11:00                   ` Rafael J. Wysocki
  2018-02-02 14:54                     ` Srinivas Pandruvada
  0 siblings, 1 reply; 48+ messages in thread
From: Rafael J. Wysocki @ 2018-02-02 11:00 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Peter Zijlstra, Mel Gorman, Mike Galbraith, Matt Fleming, LKML

On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada wrote:
> On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > > 
> > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > wrote:
> > > > 
> > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > 
> > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > wrote:
> > > > > 
> > > > > > 
> > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > "Maximum_Performance" and
> > > > > > "Desired_Performance" fields which can be used to give
> > > > > > explicit
> > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > 
> > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > hardware really
> > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > That just means we might need to throttle writing to it, like it
> > > > already
> > > > does for the regular pstate (PERF_CTRL) msr in any case (also, is
> > > > that a
> > > > cheap msr?)
> > > > 
> > > > Not touching it at all seems silly.
> > > OK
> > > 
> > > So what field precisely would you touch?  "desired"?  If so, does
> > > that actually
> > > guarantee anything to happen?
> > No idea, desired would be the one I would start with, it matches with
> > the intent here. But I've no idea what our current HWP implementation
> > actually does with it.
> Desired !=0 will disable HWP autonomous mode of frequency selection.

But I don't think it will just run at "desired" then, will it?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-01  9:11               ` Peter Zijlstra
  2018-02-01 13:18                 ` Srinivas Pandruvada
@ 2018-02-02 11:42                 ` Rafael J. Wysocki
  2018-02-02 12:46                   ` Peter Zijlstra
                                     ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Rafael J. Wysocki @ 2018-02-02 11:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Thursday, February 1, 2018 10:11:04 AM CET Peter Zijlstra wrote:
> On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > 
> > > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > > > "Desired_Performance" fields which can be used to give explicit
> > > > > frequency hints. And we really _should_ be doing that.
> > > > > 
> > > > > Because, esp. in this scenario; a task migrating; the hardware really
> > > > > can't do anything sensible, whereas the OS _knows_.
> > > > 
> > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > 
> > > That just means we might need to throttle writing to it, like it already
> > > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> > > cheap msr?)
> > > 
> > > Not touching it at all seems silly.
> > 
> > OK
> > 
> > So what field precisely would you touch?  "desired"?  If so, does that actually
> > guarantee anything to happen?
> 
> No idea, desired would be the one I would start with, it matches with
> the intent here. But I've no idea what our current HWP implementation
> actually does with it.
> 
> > > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > > You should _never_ do things like:
> > > 
> > >   rdmsr_on_cpu()
> > >   /* frob value */
> > >   wrmsr_on_cpu()
> > > 
> > > That's insane.
> > 
> > I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> > at all?
> 
> Yes, too many synchronous IPIs, which themselves are typically already
> more expensive than the MSR access.

We could do all of the updates in one IPI (as Srinivas said), but it would be
more code, and custom code for that matter.

Is this really worth it for a slow path like this one?

> At one point I looked at getting rid of the *msr_on_cpu() crud entirely,
> but there are just too many users out there that I didn't feel like touching.
> 
> If you really care you can do async IPIs and do a custom serialization
> that only waits when you do back-to-back things, which should be fairly
> uncommon I'd think.

In this particular case we don't want to return to user space before the
MSR is actually written with the new value.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 11:42                 ` Rafael J. Wysocki
@ 2018-02-02 12:46                   ` Peter Zijlstra
  2018-02-02 12:55                     ` Peter Zijlstra
  2018-02-02 14:08                     ` Peter Zijlstra
  2018-02-02 12:58                   ` Peter Zijlstra
  2018-02-02 13:27                   ` Mel Gorman
  2 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-02 12:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > If you really care you can do async IPIs and do a custom serialization
> > that only waits when you do back-to-back things, which should be fairly
> > uncommon I'd think.
> 
> In this particular case we don't want to return to user space before the
> MSR is actually written with the new value.

Why not?

I was thinking of something like the below, which would in fact do
exactly that.

---
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 7edf7a0e5a96..f0caa5cc7adb 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -29,6 +29,7 @@
 #include <linux/debugfs.h>
 #include <linux/acpi.h>
 #include <linux/vmalloc.h>
+#include <linux/smp.h>
 #include <trace/events/power.h>
 
 #include <asm/div64.h>
@@ -767,6 +768,40 @@ static void intel_pstate_hwp_set(unsigned int cpu)
 	wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
 }
 
+static void __intel_pstate_hwp_set_desired(int val)
+{
+	u64 value;
+
+	rdmsrl(MSR_HWP_REQUEST, value);
+	value &= ~GENMASK_ULL(23, 16);
+	value |= (val & 0xff) << 16;
+	wrmsrl(MSR_HWP_REQUEST, val);
+}
+
+static void __intel_pstate_hwp_func(void *data)
+{
+	__intel_pstate_hwp_set_desired((int)(unsigned long)data);
+}
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct __call_single_data, csd_data);
+
+static void intel_pstate_hwp_set_desired(int cpu, int val)
+{
+	call_single_data_t *csd;
+
+	preempt_disable();
+	csd = this_cpu_ptr(&csd_data);
+	/* wait for previous invocation to complete */
+	csd_lock_wait(csd);
+
+	csd->func = __intel_pstate_hwp_func;
+	csd->info = (void *)(unsigned long)val;
+
+	smp_call_function_single_async(cpu, csd);
+	preempt_enable();
+}
+
+
 static int intel_pstate_hwp_save_state(struct cpufreq_policy *policy)
 {
 	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..2bc125ec6146 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -14,6 +14,11 @@
 #include <linux/init.h>
 #include <linux/llist.h>
 
+enum {
+	CSD_FLAG_LOCK		= 0x01,
+	CSD_FLAG_SYNCHRONOUS	= 0x02,
+};
+
 typedef void (*smp_call_func_t)(void *info);
 struct __call_single_data {
 	struct llist_node llist;
@@ -26,6 +31,11 @@ struct __call_single_data {
 typedef struct __call_single_data call_single_data_t
 	__aligned(sizeof(struct __call_single_data));
 
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
+{
+	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
+}
+
 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;
 
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..af0ef9eb7679 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -22,11 +22,6 @@
 
 #include "smpboot.h"
 
-enum {
-	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_SYNCHRONOUS	= 0x02,
-};
-
 struct call_function_data {
 	call_single_data_t	__percpu *csd;
 	cpumask_var_t		cpumask;
@@ -103,11 +98,6 @@ void __init call_function_init(void)
  * previous function call. For multi-cpu calls its even more interesting
  * as we'll have to ensure no other cpu is observing our csd.
  */
-static __always_inline void csd_lock_wait(call_single_data_t *csd)
-{
-	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
-}
-
 static __always_inline void csd_lock(call_single_data_t *csd)
 {
 	csd_lock_wait(csd);

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 12:46                   ` Peter Zijlstra
@ 2018-02-02 12:55                     ` Peter Zijlstra
  2018-02-02 14:08                     ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-02 12:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> +static void __intel_pstate_hwp_set_desired(int val)
> +{
> +	u64 value;
> +
> +	rdmsrl(MSR_HWP_REQUEST, value);
> +	value &= ~GENMASK_ULL(23, 16);
> +	value |= (val & 0xff) << 16;
> +	wrmsrl(MSR_HWP_REQUEST, val);

D'0h: s/val/value/

> +}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 11:42                 ` Rafael J. Wysocki
  2018-02-02 12:46                   ` Peter Zijlstra
@ 2018-02-02 12:58                   ` Peter Zijlstra
  2018-02-02 13:27                   ` Mel Gorman
  2 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-02 12:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > Yes, too many synchronous IPIs, which themselves are typically already
> > more expensive than the MSR access.
> 
> We could do all of the updates in one IPI (as Srinivas said), but it would be
> more code, and custom code for that matter.
> 
> Is this really worth it for a slow path like this one?

Probably not, still, the pattern really annoys me ;-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 11:42                 ` Rafael J. Wysocki
  2018-02-02 12:46                   ` Peter Zijlstra
  2018-02-02 12:58                   ` Peter Zijlstra
@ 2018-02-02 13:27                   ` Mel Gorman
  2 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2018-02-02 13:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > > > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > > > You should _never_ do things like:
> > > > 
> > > >   rdmsr_on_cpu()
> > > >   /* frob value */
> > > >   wrmsr_on_cpu()
> > > > 
> > > > That's insane.
> > > 
> > > I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> > > at all?
> > 
> > Yes, too many synchronous IPIs, which themselves are typically already
> > more expensive than the MSR access.
> 
> We could do all of the updates in one IPI (as Srinivas said), but it would be
> more code, and custom code for that matter.
> 
> Is this really worth it for a slow path like this one?
> 

Maybe it's a slow path at the moment but don't forget that one motivation
for this series is that HWP does not properly react when utilisation of a
CPU is artificially low because a task recently slept on IO or recently
migrated. In both cases, the task may be busy and sensitive to either
latency, throughput or both but HWP will use a low p-state. A standard
driver can do io-boosting and, while it currently does not do so, it could
also trivially do idle-boosting -- HWP does neither.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 12:46                   ` Peter Zijlstra
  2018-02-02 12:55                     ` Peter Zijlstra
@ 2018-02-02 14:08                     ` Peter Zijlstra
  2018-02-03 16:30                       ` Srinivas Pandruvada
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-02 14:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, srinivas.pandruvada

On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:

> +static void __intel_pstate_hwp_set_desired(int val)
> +{
> +	u64 value;
> +
> +	rdmsrl(MSR_HWP_REQUEST, value);
> +	value &= ~GENMASK_ULL(23, 16);
> +	value |= (val & 0xff) << 16;
> +	wrmsrl(MSR_HWP_REQUEST, val);
> +}

Also, if we keep a software shadow of that MSR, we can avoid the
rdmsr, which might also help.
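
Something like the below, say. A completely untested sketch: it assumes
nothing else (firmware, or userspace poking at /dev/msr) ever writes the
MSR behind our back, and it ignores how the shadow gets initialised:

  static DEFINE_PER_CPU(u64, hwp_req_shadow);

  static void __intel_pstate_hwp_set_desired(int val)
  {
  	u64 value = __this_cpu_read(hwp_req_shadow);

  	value &= ~GENMASK_ULL(23, 16);
  	value |= ((u64)(val & 0xff)) << 16;
  	__this_cpu_write(hwp_req_shadow, value);
  	wrmsrl(MSR_HWP_REQUEST, value);
  }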

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 11:00                   ` Rafael J. Wysocki
@ 2018-02-02 14:54                     ` Srinivas Pandruvada
  2018-02-02 19:48                       ` Mel Gorman
  2018-02-04  8:38                       ` Rafael J. Wysocki
  0 siblings, 2 replies; 48+ messages in thread
From: Srinivas Pandruvada @ 2018-02-02 14:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Mel Gorman, Mike Galbraith, Matt Fleming, LKML

On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> wrote:
> > 
> > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > 
> > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > wrote:
> > > > 
> > > > 
> > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > wrote:
> > > > > 
> > > > > 
> > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > > wrote:
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > "Maximum_Performance" and
> > > > > > > "Desired_Performance" fields which can be used to give
> > > > > > > explicit
> > > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > > 
> > > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > > hardware really
> > > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > That just means we might need to throttle writing to it, like
> > > > > it
> > > > > already
> > > > > does for the regular pstate (PERF_CTRL) msr in any case
> > > > > (also, is
> > > > > that a
> > > > > cheap msr?)
> > > > > 
> > > > > Not touching it at all seems silly.
> > > > OK
> > > > 
> > > > So what field precisely would you touch?  "desired"?  If so,
> > > > does
> > > > that actually
> > > > guarantee anything to happen?
> > > No idea, desired would be the one I would start with, it matches
> > > with
> > > the intent here. But I've no idea what our current HWP
> > > implementation
> > > actually does with it.
> > Desired !=0 will disable HWP autonomous mode of frequency
> > selection.
> But I don't think it will just run at "desired" then, will it?
All of these HWP fields are hints only, not a guarantee.
There are totally different ways HWP is handled on clients and servers.
If you set desired, all the heuristics they collected will be dumped, so
they suggest not setting desired when you are in autonomous mode. If we
really want a boost, set the EPP. We know that EPP makes a lot of
measurable difference.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 14:54                     ` Srinivas Pandruvada
@ 2018-02-02 19:48                       ` Mel Gorman
  2018-02-02 20:01                         ` Srinivas Pandruvada
  2018-02-04  8:42                         ` Rafael J. Wysocki
  2018-02-04  8:38                       ` Rafael J. Wysocki
  1 sibling, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2018-02-02 19:48 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > No idea, desired would be the one I would start with, it matches
> > > > with
> > > > the intent here. But I've no idea what our current HWP
> > > > implementation
> > > > actually does with it.
> > > Desired !=0 will disable HWP autonomous mode of frequency
> > > selection.
> > But I don't think it will just run at "desired" then, will it?
> All of these HWP fields are hints only, not a guarantee.

Sure, but the lack of detection when tasks are low utilisation but still
latency/throughput sensitive is problematic. Users shouldn't have to
know they need to disable HWP or set the performance governor out of the
box. It's only going to get worse as sockets get larger.

> There are totally different ways HWP is handled on clients and servers.
> If you set desired, all the heuristics they collected will be dumped, so
> they suggest not setting desired when you are in autonomous mode. If we
> really want a boost, set the EPP. We know that EPP makes a lot of
> measurable difference.
> 

Sure, boosting EPP makes a difference -- it's essentially what the performance
governor does and I know that can be done by a user, but it's still basically
a cop-out. Default performance for low-utilisation or lightly loaded machines
is poor. Maybe it should be set based on the ACPI preferred profile but
that information is not always available. *Some* sort of hint about new
migrations or tasks waking from IO would be desirable.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 19:48                       ` Mel Gorman
@ 2018-02-02 20:01                         ` Srinivas Pandruvada
  2018-02-05 11:10                           ` Mel Gorman
  2018-02-04  8:42                         ` Rafael J. Wysocki
  1 sibling, 1 reply; 48+ messages in thread
From: Srinivas Pandruvada @ 2018-02-02 20:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Fri, 2018-02-02 at 19:48 +0000, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it
> > > > > matches
> > > > > with
> > > > > the intent here. But I've no idea what our current HWP
> > > > > implementation
> > > > > actually does with it.
> > > > 
> > > > Desired !=0 will disable HWP autonomous mode of frequency
> > > > selection.
> > > 
> > > But I don't think it will just run at "desired" then, will it?
> > 
> > All of these HWP fields are hints only, not a guarantee.
> 
> Sure, but the lack of detection when tasks are low utilisation but
> still latency/throughput sensitive is problematic. Users shouldn't
> have to know they need to disable HWP or set the performance governor
> out of the box.
> It's only going to get worse as sockets get larger.
I am not saying that we shouldn't do anything. Can you give me some
workloads that you care about the most?

> 
> > There are totally different ways HWP is handled on clients and
> > servers.
> > If you set desired, all the heuristics they collected will be dumped,
> > so they suggest not setting desired when you are in autonomous mode.
> > If we really want a boost, set the EPP. We know that EPP makes a lot
> > of measurable difference.
> > 
> 
> Sure, boosting EPP makes a difference -- it's essentially what the
> performance governor does and I know that can be done by a user, but
> it's still basically a cop-out. Default performance for low-utilisation
> or lightly loaded machines is poor. Maybe it should be set based on the
> ACPI preferred profile but that information is not always available.
> *Some* sort of hint about new migrations or tasks waking from IO would
> be desirable.
EPP is a range, not a single value, so you don't need to set EPP=0 the
way the performance governor does. PeterZ gave me some scheduler changes
to experiment with, which can be used as a hint to play with EPP.
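
As an illustration only, such a hint could end up bumping just the EPP
field, which lives in bits 31:24 of MSR_HWP_REQUEST (0 = maximum
performance, 255 = maximum energy saving). Untested sketch; the helper
name is made up and it uses the simple (IPI-heavy) accessors for clarity:

  static void intel_pstate_hwp_set_epp(int cpu, u8 epp)
  {
  	u64 value;

  	rdmsrl_on_cpu(cpu, MSR_HWP_REQUEST, &value);
  	value &= ~GENMASK_ULL(31, 24);
  	value |= (u64)epp << 24;
  	wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
  }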

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 14:08                     ` Peter Zijlstra
@ 2018-02-03 16:30                       ` Srinivas Pandruvada
  2018-02-05 10:44                         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Srinivas Pandruvada @ 2018-02-03 16:30 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML, Len Brown

On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> 
> > +static void __intel_pstate_hwp_set_desired(int val)
> > +{
> > +	u64 value;
> > +
> > +	rdmsrl(MSR_HWP_REQUEST, value);
> > +	value &= ~GENMASK_ULL(23, 16);
> > +	value |= (val & 0xff) << 16;
> > +	wrmsrl(MSR_HWP_REQUEST, val);
> > +}
> 
> Also, if we keep a software shadow of that MSR, we can avoid the
> rdmsr, which might also help.
The reason we don't keep a software shadow is that users can adjust it
with the x86-energy-perf utility or via a BMC.  CCed Len.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 14:54                     ` Srinivas Pandruvada
  2018-02-02 19:48                       ` Mel Gorman
@ 2018-02-04  8:38                       ` Rafael J. Wysocki
  1 sibling, 0 replies; 48+ messages in thread
From: Rafael J. Wysocki @ 2018-02-04  8:38 UTC (permalink / raw)
  To: Srinivas Pandruvada, Peter Zijlstra
  Cc: Mel Gorman, Mike Galbraith, Matt Fleming, LKML

On Friday, February 2, 2018 3:54:24 PM CET Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> > On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> > wrote:
> > > 
> > > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > > 
> > > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > 
> > > > > 
> > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > > wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > > "Maximum_Performance" and
> > > > > > > > "Desired_Performance" fields which can be used to give
> > > > > > > > explicit
> > > > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > > > 
> > > > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > > > hardware really
> > > > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > > That just means we might need to throttle writing to it, like
> > > > > > it
> > > > > > already
> > > > > > does for the regular pstate (PERF_CTRL) msr in any case
> > > > > > (also, is
> > > > > > that a
> > > > > > cheap msr?)
> > > > > > 
> > > > > > Not touching it at all seems silly.
> > > > > OK
> > > > > 
> > > > > So what field precisely would you touch?  "desired"?  If so,
> > > > > does
> > > > > that actually
> > > > > guarantee anything to happen?
> > > > No idea, desired would be the one I would start with, it matches
> > > > with
> > > > the intent here. But I've no idea what our current HWP
> > > > implementation
> > > > actually does with it.
> > > Desired !=0 will disable HWP autonomous mode of frequency
> > > selection.
> > But I don't think it will just run at "desired" then, will it?
> All of these HWP fields are hints only, not a guarantee.
> There are totally different ways HWP is handled on clients and servers.
> If you set desired, all the heuristics they collected will be dumped, so
> they suggest not setting desired when you are in autonomous mode. If we
> really want a boost, set the EPP. We know that EPP makes a lot of
> measurable difference.

Yeah, we've been consistently told to use EPP rather than "desired" for hints
like that.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 19:48                       ` Mel Gorman
  2018-02-02 20:01                         ` Srinivas Pandruvada
@ 2018-02-04  8:42                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 48+ messages in thread
From: Rafael J. Wysocki @ 2018-02-04  8:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srinivas Pandruvada, Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Friday, February 2, 2018 8:48:01 PM CET Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it matches
> > > > > with
> > > > > the intent here. But I've no idea what our current HWP
> > > > > implementation
> > > > > actually does with it.
> > > > Desired !=0 will disable HWP autonomous mode of frequency
> > > > selection.
> > > But I don't think it will just run at "desired" then, will it?
> > All of these HWP fields are hints only, not a guarantee.
> 
> Sure, but the lack of detection when tasks are low utilisation but still
> latency/throughput sensitive is problematic. Users shouldn't have to
> know they need to disable HWP or set the performance governor out of the
> box. It's only going to get worse as sockets get larger.

OK, so that's the case I was thinking about when the kernel actually knows
more about what's going on than the HW-based logic.  That's when it makes
sense to provide hints, of course.

> > There are totally different ways HWP is handled on clients and servers.
> > If you set desired, all the heuristics they collected will be dumped, so
> > they suggest not setting desired when you are in autonomous mode. If we
> > really want a boost, set the EPP. We know that EPP makes a lot of
> > measurable difference.
> > 
> 
> Sure, boosting EPP makes a difference -- it's essentially what the performance
> governor does and I know that can be done by a user, but it's still basically
> a cop-out. Default performance for low-utilisation or lightly loaded machines
> is poor. Maybe it should be set based on the ACPI preferred profile but
> that information is not always available. *Some* sort of hint about new
> migrations or tasks waking from IO would be desirable.

Agreed, we only need to figure out a clean way to do that.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-03 16:30                       ` Srinivas Pandruvada
@ 2018-02-05 10:44                         ` Peter Zijlstra
  2018-02-05 10:58                           ` Ingo Molnar
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2018-02-05 10:44 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Mel Gorman, Mike Galbraith, Matt Fleming,
	LKML, Len Brown, Thomas Gleixner, Ingo Molnar

On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > 
> > > +static void __intel_pstate_hwp_set_desired(int val)
> > > +{
> > > +	u64 value;
> > > +
> > > +	rdmsrl(MSR_HWP_REQUEST, value);
> > > +	value &= ~GENMASK_ULL(23, 16);
> > > +	value |= (val & 0xff) << 16;
> > > +	wrmsrl(MSR_HWP_REQUEST, val);
> > > +}
> > 
> > Also, if we keep a software shadow of that MSR, we can avoid the
> > rdmsr, which might also help.
> The reason we don't keep a software shadow is that users can adjust it
> with the x86-energy-perf utility or via a BMC.  CCed Len.

Total NAK on that. People using /dev/msr to prod at state get to keep
whatever pieces they end up with.

We _really_ should make /dev/msr taint the kernel on write. Ingo,
Thomas, can we pretty please just do that?
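
Something as simple as the below would do (untested sketch against
arch/x86/kernel/msr.c; TAINT_USER is a stand-in, a dedicated taint flag
may be more appropriate). Call it from msr_write() before the actual
wrmsr happens:

  static void msr_taint_on_write(void)
  {
  	pr_warn_once("msr: direct MSR write from userspace, tainting kernel\n");
  	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
  }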

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-05 10:44                         ` Peter Zijlstra
@ 2018-02-05 10:58                           ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2018-02-05 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srinivas Pandruvada, Rafael J. Wysocki, Mel Gorman,
	Mike Galbraith, Matt Fleming, LKML, Len Brown, Thomas Gleixner


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > > 
> > > > +static void __intel_pstate_hwp_set_desired(int val)
> > > > +{
> > > > +	u64 value;
> > > > +
> > > > +	rdmsrl(MSR_HWP_REQUEST, value);
> > > > +	value &= ~GENMASK_ULL(23, 16);
> > > > +	value |= (val & 0xff) << 16;
> > > > +	wrmsrl(MSR_HWP_REQUEST, val);
> > > > +}
> > > 
> > > Also, if we keep a software shadow of that MSR, we can avoid the
> > > rdmsr, which might also help.
> > The reason we don't keep a software shadow is that users can adjust it
> > with the x86-energy-perf utility or via a BMC.  CCed Len.
> 
> Total NAK on that. People using /dev/msr to prod at state get to keep
> whatever pieces they end up with.
> 
> We _really_ should make /dev/msr taint the kernel on write. Ingo,
> Thomas, can we pretty please just do that?

Sure - mind sending a patch?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-02 20:01                         ` Srinivas Pandruvada
@ 2018-02-05 11:10                           ` Mel Gorman
  2018-02-05 17:04                             ` Srinivas Pandruvada
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2018-02-05 11:10 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > Sure, but the lack of detection when tasks are low utilisation but
> > still latency/throughput sensitive is problematic. Users shouldn't
> > have to know they need to disable HWP or set the performance governor
> > out of the box.
> > It's only going to get worse as sockets get larger.
>
> I am not saying that we shouldn't do anything. Can you give me some
> workloads that you care about the most?
> 

The proprietary workloads I'm aware of are useless to the discussion
as they cannot be trivially reproduced and are typically only available
under NDA. However, hints can be gotten by looking at the number of cases
where recommended tunings limit C-states, set the performance governor,
alter the intel_pstate setpoint (if not HWP), etc.

For the purposes of illustration, dbench at low thread counts does
a reasonable job even though it's not that interesting a workload in
general. With ext4 in particular, the journalling thread interactions
bounce tasks around the machine and the short sleeps for IO combine
to leave relatively low utilisation on individual CPUs. It's less pronounced
on xfs as it bounces less due to using kworkers instead of kthreads.

> > 
> > > There are totally different ways HWP is handled on clients and
> > > servers.
> > > If you set desired, all the heuristics they collected will be
> > > dumped, so they suggest not setting desired when you are in
> > > autonomous mode. If we really want a boost, set the EPP. We know
> > > that EPP makes a lot of measurable difference.
> > > 
> > 
> > Sure boosting EPP makes a difference -- it's essentially what the
> > performance
> > goveror does and I know that can be done by a user but it's still
> > basically a
> > cop-out. Default performance for low utilisation or lightly loaded
> > machines
> > is poor. Maybe it should be set based on the ACPI preferred profile
> > but
> > that information is not always available. It would be nice if *some*
> > sort of hint about new migrations or tasks waking from IO would be
> > desirable.
> EPP is a range not a single value. So you don't need to make EPP=0 as a
> performance governor. PeterZ gave me some scheduler change to
> experiment, which can be used as hint to play with EPP. 
> 

I know EPP is a range; the default from the BIOS usually appears to be 6
or 7 but I didn't do much experimentation to see if there is another value
that works better. Even if there is, the default may need to change as not
many people even know what EPP is or how it should be tuned.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-05 11:10                           ` Mel Gorman
@ 2018-02-05 17:04                             ` Srinivas Pandruvada
  2018-02-05 17:50                               ` Mel Gorman
  0 siblings, 1 reply; 48+ messages in thread
From: Srinivas Pandruvada @ 2018-02-05 17:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Mon, 2018-02-05 at 11:10 +0000, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > > Sure, but the lack of detection when tasks are low utilisation
> > > but still latency/throughput sensitive is problematic. Users
> > > shouldn't have to know they need to disable HWP or set the
> > > performance governor out of the box.
> > > It's only going to get worse as sockets get larger.
> > 
> > I am not saying that we shouldn't do anything. Can you give me some
> > workloads that you care about the most?
> > 
> 
> The proprietary workloads I'm aware of are useless to the discussion
> as they cannot be trivially reproduced and are typically only
> available under NDA. However, hints can be gotten by looking at the
> number of cases where recommended tunings limit C-states, set the
> performance governor, alter the intel_pstate setpoint (if not HWP),
> etc.
> 
> For the purposes of illustration, dbench at low thread counts does
> a reasonable job even though it's not that interesting a workload in
> general. With ext4 in particular, the journalling thread interactions
> bounce tasks around the machine and the short sleeps for IO combine
> to leave relatively low utilisation on individual CPUs. It's less
> pronounced on xfs as it bounces less due to using kworkers instead of
> kthreads.
> 
> > > 
> > > > There are totally different ways HWP is handled on clients and
> > > > servers.
> > > > If you set desired, all the heuristics they collected will be
> > > > dumped, so they suggest not setting desired when you are in
> > > > autonomous mode. If we really want a boost, set the EPP. We know
> > > > that EPP makes a lot of measurable difference.
> > > > 
> > > 
> > > Sure, boosting EPP makes a difference -- it's essentially what the
> > > performance governor does and I know that can be done by a user,
> > > but it's still basically a cop-out. Default performance for
> > > low-utilisation or lightly loaded machines is poor. Maybe it should
> > > be set based on the ACPI preferred profile but that information is
> > > not always available. *Some* sort of hint about new migrations or
> > > tasks waking from IO would be desirable.
> > 
> > EPP is a range, not a single value, so you don't need to set EPP=0
> > the way the performance governor does. PeterZ gave me some scheduler
> > changes to experiment with, which can be used as a hint to play with
> > EPP.
> > 
> 
> I know EPP is a range; the default from the BIOS usually appears to be
> 6 or 7 but I didn't do much experimentation to see if there is another
> value that works better. Even if there is, the default may need to
> change as not many people even know what EPP is or how it should be
> tuned.
I think you are talking about EPB, not EPP, because of the ranges you
mentioned here. EPP is a value from 0 to 255 and is part of the
HWP_REQUEST MSR.
EPB with HWP is used only on Broadwell servers. I think you are using
Skylake here.

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-02-05 17:04                             ` Srinivas Pandruvada
@ 2018-02-05 17:50                               ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2018-02-05 17:50 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Mon, Feb 05, 2018 at 09:04:25AM -0800, Srinivas Pandruvada wrote:
> > I know EPP is a range; the default from the BIOS usually appears to
> > be 6 or 7 but I didn't do much experimentation to see if there is
> > another value that works better. Even if there is, the default may
> > need to change as not many people even know what EPP is or how it
> > should be tuned.
> I think you are talking about EPB, not EPP, because of the ranges you
> mentioned here. EPP is a value from 0 to 255 and is part of the
> HWP_REQUEST MSR.
> EPB with HWP is used only on Broadwell servers. I think you are using
> Skylake here.
> 

You're right in that I mixed up EPB (0-15 IIRC) and EPP, which has a
different range whose exact meaning and semantics are not super-clear to
me. The machine in question is Skylake but for the purposes of the current
discussion, the issue is the same. There appears to be a lack of decent
hints to the hardware, based on knowledge of migrations or short periods
of idleness, that the processor may be very busy for very short bursts
that are important even if overall utilisation is low.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [tip:sched/urgent] sched/fair: Remove unnecessary parameters from wake_affine_idle()
  2018-01-30 10:45 ` [PATCH 1/4] sched/fair: Remove unnecessary parameters from wake_affine_idle Mel Gorman
@ 2018-02-06 11:55   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-06 11:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, efault, mingo, hpa, matt, torvalds, mgorman, linux-kernel, peterz

Commit-ID:  89a55f56fd1cdbe7e69d4693fc5790af9a6e1501
Gitweb:     https://git.kernel.org/tip/89a55f56fd1cdbe7e69d4693fc5790af9a6e1501
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 30 Jan 2018 10:45:52 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Feb 2018 10:20:35 +0100

sched/fair: Remove unnecessary parameters from wake_affine_idle()

wake_affine_idle() takes parameters it never uses so clean it up.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6b8157..0a551dfe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5692,10 +5692,8 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-
 static bool
-wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
-		 int this_cpu, int prev_cpu, int sync)
+wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5752,8 +5750,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	int this_cpu = smp_processor_id();
 	bool affine = false;
 
-	if (sched_feat(WA_IDLE) && !affine)
-		affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_IDLE))
+		affine = wake_affine_idle(this_cpu, prev_cpu, sync);
 
 	if (sched_feat(WA_WEIGHT) && !affine)
 		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:sched/urgent] sched/fair: Restructure wake_affine*() to return a CPU id
  2018-01-30 10:45 ` [PATCH 2/4] sched/fair: Restructure wake_affine to return a CPU id Mel Gorman
@ 2018-02-06 11:56   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-06 11:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, peterz, mgorman, linux-kernel, torvalds, mingo, matt, hpa, efault

Commit-ID:  3b76c4a33959ca98a573cd9c94c8690d123912ca
Gitweb:     https://git.kernel.org/tip/3b76c4a33959ca98a573cd9c94c8690d123912ca
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 30 Jan 2018 10:45:53 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Feb 2018 10:20:35 +0100

sched/fair: Restructure wake_affine*() to return a CPU id

This is a preparation patch that has wake_affine*() return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a551dfe..4c400d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5692,7 +5692,7 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-static bool
+static int
 wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
 	/*
@@ -5702,15 +5702,15 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * node depending on the IO topology or IRQ affinity settings.
 	 */
 	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return true;
+		return this_cpu;
 
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
-		return true;
+		return this_cpu;
 
-	return false;
+	return nr_cpumask_bits;
 }
 
-static bool
+static int
 wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		   int this_cpu, int prev_cpu, int sync)
 {
@@ -5724,7 +5724,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		unsigned long current_load = task_h_load(current);
 
 		if (current_load > this_eff_load)
-			return true;
+			return this_cpu;
 
 		this_eff_load -= current_load;
 	}
@@ -5741,28 +5741,28 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= capacity_of(this_cpu);
 
-	return this_eff_load <= prev_eff_load;
+	return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine = false;
+	int target = nr_cpumask_bits;
 
 	if (sched_feat(WA_IDLE))
-		affine = wake_affine_idle(this_cpu, prev_cpu, sync);
+		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
-	if (sched_feat(WA_WEIGHT) && !affine)
-		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
+		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
-	if (affine) {
-		schedstat_inc(sd->ttwu_move_affine);
-		schedstat_inc(p->se.statistics.nr_wakeups_affine);
-	}
+	if (target == nr_cpumask_bits)
+		return prev_cpu;
 
-	return affine;
+	schedstat_inc(sd->ttwu_move_affine);
+	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	return target;
 }
 
 static inline unsigned long task_util(struct task_struct *p);
@@ -6355,8 +6355,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu == prev_cpu)
 			goto pick_cpu;
 
-		if (wake_affine(affine_sd, p, prev_cpu, sync))
-			new_cpu = cpu;
+		new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
 	}
 
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:sched/urgent] sched/fair: Do not migrate if the prev_cpu is idle
  2018-01-30 10:45 ` [PATCH 3/4] sched/fair: Do not migrate if the prev_cpu is idle Mel Gorman
@ 2018-02-06 11:56   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-06 11:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, linux-kernel, matt, peterz, efault, hpa, tglx, mgorman, mingo

Commit-ID:  806486c377e33ab662de6d47902e9e2a32b79368
Gitweb:     https://git.kernel.org/tip/806486c377e33ab662de6d47902e9e2a32b79368
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 30 Jan 2018 10:45:54 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Feb 2018 10:20:36 +0100

sched/fair: Do not migrate if the prev_cpu is idle

wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task as well as avoiding
a search. However, there is no way to determine if there was cache hot
data on the previous CPU that may exceed the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised.  This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that p-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.

This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU even if the current CPU is idle due to the wakeup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.

A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly 10% reduction in the number of CPU migrations and it is a case
where interrupts are frequent for IO completions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.

                          4.15.0-rc9             4.15.0-rc9
                            baseline              waprev-v1
 Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
 Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
 Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
 Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
 Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
 Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
 Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
 Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
 Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
 Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
 CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)

Note that the difference in performance is marginal but for low utilisation,
there is less variability.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c400d7..db45b35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5700,9 +5700,15 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * context. Only allow the move if cache is shared. Otherwise an
 	 * interrupt intensive workload could force all tasks onto one
 	 * node depending on the IO topology or IRQ affinity settings.
+	 *
+	 * If the prev_cpu is idle and cache affine then avoid a migration.
+	 * There is no guarantee that the cache hot data from an interrupt
+	 * is more important than cache hot data on the prev_cpu and from
+	 * a cpufreq perspective, it's better to have higher utilisation
+	 * on one CPU.
 	 */
 	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return this_cpu;
+		return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:sched/urgent] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
  2018-01-30 10:45 ` [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS Mel Gorman
  2018-01-30 11:50   ` Peter Zijlstra
  2018-01-30 11:53   ` Peter Zijlstra
@ 2018-02-06 11:56   ` tip-bot for Mel Gorman
  2 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-06 11:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: matt, hpa, torvalds, mingo, linux-kernel, peterz, efault, mgorman, tglx

Commit-ID:  32e839dda3ba576943365f0f5817ce5c843137dc
Gitweb:     https://git.kernel.org/tip/32e839dda3ba576943365f0f5817ce5c843137dc
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 30 Jan 2018 10:45:55 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Feb 2018 10:20:37 +0100

sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

The select_idle_sibling() (SIS) rewrite in commit:

  10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")

... replaced a domain iteration with a search that broadly speaking
does a wrapped walk of the scheduler domain sharing a last-level-cache.

While this had a number of improvements, one consequence is that two tasks
that share a waker/wakee relationship push each other around a socket. Even
though two tasks may be active, all cores are evenly used. This is great from
a search perspective and spreads a load across individual cores, but it has
adverse consequences for cpufreq. As each CPU has relatively low utilisation,
cpufreq may decide the utilisation is too low to use a higher P-state and
overall computation throughput suffers.

While individual cpufreq and cpuidle drivers may compensate by artificially
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.

This patch tracks a recently used CPU: the CPU a task was running on when
it last was a waker, or a CPU it was recently using when it was last a
wakee. During SIS, the recently used CPU is used as a target if it's still
allowed by the task and is idle.

The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion.  With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but similar
principle applies).

 A (cpu 0)	wake	B (wakes on cpu 1)
 B (cpu 1)	wake	A (wakes on cpu 2)
 A (cpu 2)	wake	B (wakes on cpu 3)
 etc.

A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time. It's simply because A can be rescheduled to another CPU, so
prev == target when B tries to wake up A and the information about CPU 0
has been lost.

With this patch, the pattern is more likely to be:

 A (cpu 0)	wake	B (wakes on cpu 1)
 B (cpu 1)	wake	A (wakes on cpu 0)
 A (cpu 0)	wake	B (wakes on cpu 1)
 etc

i.e. two communicating tasks are more likely to use just two cores instead
of all available cores sharing a LLC.

The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different:

                          4.15.0-rc9             4.15.0-rc9
                           waprev-v1        biasancestor-v1
 Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
 Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
 Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
 Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
 Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)

The results can be less dramatic on NUMA where automatic balancing interferes
with the test. Network benchmarks running on localhost are also known to
benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also sees small improvements
(6-11% depending on machine and thread count). The facebook schbench was also
tested but in most cases showed little or no difference in wakeup latencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  8 ++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 22 ++++++++++++++++++++--
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 166144c..92744e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -555,6 +555,14 @@ struct task_struct {
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
 
+	/*
+	 * recent_used_cpu is initially set as the last CPU used by a task
+	 * that wakes affine another task. Waker/wakee relationships can
+	 * push tasks around a CPU where each wakeup moves to the next one.
+	 * Tracking a recently used CPU allows a quick search for a recently
+	 * used CPU that may be idle.
+	 */
+	int				recent_used_cpu;
 	int				wake_cpu;
 #endif
 	int				on_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b40540e..36f113a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2461,6 +2461,7 @@ void wake_up_new_task(struct task_struct *p)
 	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
 	 * as we're not fully set-up yet.
 	 */
+	p->recent_used_cpu = task_cpu(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db45b35..5eb3ffc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6197,7 +6197,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
-	int i;
+	int i, recent_used_cpu;
 
 	if (idle_cpu(target))
 		return target;
@@ -6208,6 +6208,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
 		return prev;
 
+	/* Check a recently used CPU as a potential idle candidate */
+	recent_used_cpu = p->recent_used_cpu;
+	if (recent_used_cpu != prev &&
+	    recent_used_cpu != target &&
+	    cpus_share_cache(recent_used_cpu, target) &&
+	    idle_cpu(recent_used_cpu) &&
+	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
+		/*
+		 * Replace recent_used_cpu with prev as it is a potential
+		 * candidate for the next wake.
+		 */
+		p->recent_used_cpu = prev;
+		return recent_used_cpu;
+	}
+
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	if (!sd)
 		return target;
@@ -6375,9 +6390,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (!sd) {
 pick_cpu:
-		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
+			if (want_affine)
+				current->recent_used_cpu = cpu;
+		}
 	} else {
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	}
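
To make the new fast path easier to follow outside of diff context, below is
a minimal userspace C sketch of the candidate ordering select_idle_sibling()
uses after this patch: target first, then prev, then the recently used CPU.
The idle_cpu() and cpus_share_cache() stubs, the fixed 8-CPU topology and the
elided cpus_allowed check are illustrative assumptions, not kernel code.

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical stand-in: which CPUs are currently idle. */
static bool cpu_is_idle[NR_CPUS] = { false, true, false, true,
				     false, false, true, false };

static bool idle_cpu(int cpu)
{
	return cpu_is_idle[cpu];
}

/* Hypothetical stand-in: assume two 4-CPU LLCs, 0-3 and 4-7. */
static bool cpus_share_cache(int a, int b)
{
	return (a / 4) == (b / 4);
}

struct task {
	int recent_used_cpu;
};

/*
 * Simplified mirror of the select_idle_sibling() fast path after the
 * patch: try target, then prev, then the recently used CPU before
 * falling back to a full LLC scan (elided here, as is the
 * cpus_allowed check).
 */
static int pick_idle_sibling(struct task *p, int prev, int target)
{
	int recent = p->recent_used_cpu;

	if (idle_cpu(target))
		return target;

	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;

	if (recent != prev && recent != target &&
	    cpus_share_cache(recent, target) && idle_cpu(recent)) {
		/* prev becomes the candidate for the next wakeup. */
		p->recent_used_cpu = prev;
		return recent;
	}

	return -1;	/* would scan the LLC for any idle CPU here */
}

int main(void)
{
	struct task p = { .recent_used_cpu = 3 };

	/* target 0 and prev 2 are busy, so the recent CPU 3 is reused. */
	printf("picked CPU %d\n", pick_idle_sibling(&p, 2, 0));
	printf("recent_used_cpu is now %d\n", p.recent_used_cpu);
	return 0;
}

Run against the sample topology above, the busy target (CPU 0) and busy
prev (CPU 2) are skipped, the idle recent CPU 3 is returned, and prev is
recorded as the candidate for the next wakeup, matching the swap the
patch performs in select_idle_sibling().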


Thread overview: 48+ messages
2018-01-30 10:45 [PATCH 0/4] Reduce migrations and unnecessary spreading of load to multiple CPUs Mel Gorman
2018-01-30 10:45 ` [PATCH 1/4] sched/fair: Remove unnecessary parameters from wake_affine_idle Mel Gorman
2018-02-06 11:55   ` [tip:sched/urgent] sched/fair: Remove unnecessary parameters from wake_affine_idle() tip-bot for Mel Gorman
2018-01-30 10:45 ` [PATCH 2/4] sched/fair: Restructure wake_affine to return a CPU id Mel Gorman
2018-02-06 11:56   ` [tip:sched/urgent] sched/fair: Restructure wake_affine*() " tip-bot for Mel Gorman
2018-01-30 10:45 ` [PATCH 3/4] sched/fair: Do not migrate if the prev_cpu is idle Mel Gorman
2018-02-06 11:56   ` [tip:sched/urgent] " tip-bot for Mel Gorman
2018-01-30 10:45 ` [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS Mel Gorman
2018-01-30 11:50   ` Peter Zijlstra
2018-01-30 12:57     ` Mel Gorman
2018-01-30 13:15       ` Peter Zijlstra
2018-01-30 13:25         ` Mel Gorman
2018-01-30 13:40           ` Peter Zijlstra
2018-01-30 14:06             ` Mel Gorman
2018-01-31  9:22         ` Rafael J. Wysocki
2018-01-31 10:17           ` Peter Zijlstra
2018-01-31 11:54             ` Mel Gorman
2018-01-31 17:44             ` Srinivas Pandruvada
2018-02-01  9:11               ` Peter Zijlstra
2018-02-01  7:50             ` Rafael J. Wysocki
2018-02-01  9:11               ` Peter Zijlstra
2018-02-01 13:18                 ` Srinivas Pandruvada
2018-02-02 11:00                   ` Rafael J. Wysocki
2018-02-02 14:54                     ` Srinivas Pandruvada
2018-02-02 19:48                       ` Mel Gorman
2018-02-02 20:01                         ` Srinivas Pandruvada
2018-02-05 11:10                           ` Mel Gorman
2018-02-05 17:04                             ` Srinivas Pandruvada
2018-02-05 17:50                               ` Mel Gorman
2018-02-04  8:42                         ` Rafael J. Wysocki
2018-02-04  8:38                       ` Rafael J. Wysocki
2018-02-02 11:42                 ` Rafael J. Wysocki
2018-02-02 12:46                   ` Peter Zijlstra
2018-02-02 12:55                     ` Peter Zijlstra
2018-02-02 14:08                     ` Peter Zijlstra
2018-02-03 16:30                       ` Srinivas Pandruvada
2018-02-05 10:44                         ` Peter Zijlstra
2018-02-05 10:58                           ` Ingo Molnar
2018-02-02 12:58                   ` Peter Zijlstra
2018-02-02 13:27                   ` Mel Gorman
2018-01-30 13:15       ` Mike Galbraith
2018-01-30 13:25         ` Peter Zijlstra
2018-01-30 13:35           ` Mike Galbraith
2018-01-30 11:53   ` Peter Zijlstra
2018-01-30 12:59     ` Mel Gorman
2018-01-30 13:06     ` Peter Zijlstra
2018-01-30 13:18       ` Mel Gorman
2018-02-06 11:56   ` [tip:sched/urgent] " tip-bot for Mel Gorman
