* [RFC PATCH 0/3] sched: delayed thread migration
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

This patch series proposes a tweak to the thread placement strategy of CFS
to make better use of cores running at a high frequency. It is the result
of work published at ATC'20, available here:
https://www.usenix.org/conference/atc20/presentation/gouicern

We address the frequency inversion problem that stems from the new DVFS
capabilities of CPUs, where the frequency can differ between cores of the
same chip. In applications with a large number of fork/wait patterns, we
see that CFS tends to place new threads on idle cores running at a low
frequency, while the core where the parent is running is at a high
frequency and becomes idle immediately after. This means that a low
frequency core is used while a high frequency core becomes idle. A more
detailed analysis of this behavior is available in the paper, but here is
a small example with 2 cores c0 and c1 and 2 threads A and B.

Legend: - means high frequency
        . means low frequency
        nothing means nothing is running.

c0:   A------fork()-wait()
                  \
c1:                B................--------

c0 becomes idle while at a high frequency and c1 becomes busy and executes
B at a low frequency for a long time.

The fix we propose is to delay the migration of threads in order to see
whether the parent's core becomes available quickly, before using another
core. When choosing a core for a newly created or waking thread (in
select_task_rq_fair()), we check whether the destination core is idle or
busy frequency-wise. If it is busy, we keep the current behavior and move
the thread to this new core. If it is idle, we place the thread locally
(parent's core on fork, waker's core on wakeup). We also arm a
high-resolution timer that will trigger a migration of this thread to the
core chosen by the original CFS algorithm. If the thread is able to run
quickly on the local core (e.g. because the parent thread calls the wait()
syscall), we cancel the timer. Otherwise, the timer expires and the thread
is moved to the core chosen by CFS, where it can run immediately. When the
parent thread waits, the previous example becomes:

c0:   A------fork()-wait()-B---------------
                  \/
c1:

If the parent thread keeps running, the example becomes:

c0:   A------fork()---------timer------------
                  \/           \
c1:                             B.............--------

In the first case, we use a high frequency core (c0) and let c1 stay idle.
In the second case, we lose a few microseconds of execution time for B.
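
For reference, here is a condensed sketch of the placement decision, in
simplified pseudo-C. It mirrors the logic of patch 3 of this series (the
helpers and fields shown are the ones introduced there):

    /* in select_task_rq_fair(), after CFS has picked new_cpu as usual */
    if (is_cpu_low_freq(new_cpu) && new_cpu != task_cpu(p) &&
        (sd_flag & (SD_BALANCE_FORK | SD_BALANCE_WAKE | SD_BALANCE_EXEC))) {
            /* remember CFS's choice and run locally for now */
            p->delay_placement_cpu = new_cpu;
            new_cpu = task_cpu(current);
            /* if the thread is still waiting after the delay, migrate it */
            hrtimer_start(&p->delay_placement_timer,
                          ktime_set(0, sysctl_sched_delayed_placement),
                          HRTIMER_MODE_REL);
    }
    /* the timer is cancelled when the thread is picked to run (patch 3) */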

There are two configurable parameters in this patch:
     - the frequency threshold above which we consider a core busy
       frequency-wise. By default, the threshold is set to 0, which
       disables the delayed migrations. On our test server (an Intel Xeon
       E7-8870 v4), an efficient value was the minimal frequency of the
       CPU. On another test machine (AMD Ryzen 5 3400G), it was a frequency
       a little bit higher than the minimal one.
     - the delay of the high-resolution timer. We choose a default value of
       50 us since it is the mean delay we observed between a fork and a
       wait during a kernel build.
Both parameters can be modified through files in /proc/sys/kernel/. Since
this is still a work in progress, we are looking for better ways to choose
the threshold and the delay, as both values probably depend on the
machine.
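
For example, on a machine whose minimal frequency is 1.2 GHz, the feature
could be enabled with something like the following (the values are purely
illustrative; sched_lowfreq is in kHz and sched_delayed_placement in
nanoseconds):

    # echo 1200000 > /proc/sys/kernel/sched_lowfreq
    # echo 50000 > /proc/sys/kernel/sched_delayed_placement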

In terms of performance, we benchmarked 60 applications on a 5.4 kernel at
the time of the paper's publication. The detailed results are available in
the paper but overall, we improve the performance of 23 applications, with
a best improvement of 56%. Only 3 applications suffer a performance
degradation larger than 5%, and even for those the degradation remains
limited (at most 8%). We also think there might be energy savings since we
try to avoid waking idle cores. We did see small improvements on our
machines, but nothing as large as the performance gains (at most a 5%
improvement).

On the current development kernel (on the sched/core branch, commit
f166b111e049), we run a few of these benchmarks to confirm our results on
an up-to-date kernel. We run the kernel build (complete build and
scheduler-only build), hackbench, 2 applications from the NAS benchmark
suite, and openssl. We select these applications because they had either
very good or
very bad results on the 5.4 kernel. We present the mean of 10 runs, with
the powersave scaling governor and the intel_pstate driver in active mode.

 Benchmark    | unit | 5.9-f166b111e049 |          patched
 -------------+------+------------------+------------------
                      LOWER IS BETTER
 -------------+------+------------------+------------------
 hackbench    |    s |            5.661 |    5.612 (- 0.9%)
 kbuild-sched |    s |            7.370 |    6.309 (-14.4%)
 kbuild       |    s |           34.798 |   34.788 (- 0.0%)
 nas_ft.C     |    s |           10.080 |   10.012 (- 0.7%)
 nas_sp.B     |    s |            6.653 |    7.033 (+ 5.7%)
 -------------+------+------------------+------------------
                      HIGHER IS BETTER
 -------------+------+------------------+------------------
 openssl      |sign/s|         15086.89 | 15071.08 (- 0.1%)

On this small subset, the only negatively impacted application is an HPC
workload (nas_sp.B), which is less of a problem since HPC users usually
bypass the scheduler altogether. We still see a large beneficial impact on
the scheduler-only kernel build, which prominently exhibits the fork/wait
pattern and frequency inversions.



The first patch of the series is not specific to scheduling. It allows us
(or anyone else) to use the cpufreq infrastructure at a different sampling
rate without compromising the cpufreq subsystem and applications that
depend on it.

The main idea behind this patch series is to bring to light the frequency
inversion problem that will become more and more prominent with new CPUs
that feature per-core DVFS. The solution proposed is a first idea for
solving this problem that still needs to be tested across more CPUs and
with more applications.


Redha Gouicem (3):
  cpufreq: x86: allow external frequency measures
  sched: core: x86: query frequency at each tick
  sched/fair: delay thread migration on fork/wakeup/exec

 arch/x86/kernel/cpu/aperfmperf.c | 31 ++++++++++++++++++++---
 drivers/cpufreq/cpufreq.c        |  5 ++++
 include/linux/cpufreq.h          |  1 +
 include/linux/sched.h            |  4 +++
 include/linux/sched/sysctl.h     |  3 +++
 kernel/sched/core.c              | 43 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c              | 42 ++++++++++++++++++++++++++++++-
 kernel/sched/sched.h             |  7 ++++++
 kernel/sysctl.c                  | 14 +++++++++++
 9 files changed, 145 insertions(+), 5 deletions(-)

-- 
2.28.0


* [PATCH 1/3] cpufreq: x86: allow external frequency measures
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Guilherme G. Piccoli, Al Viro, Qais Yousef,
	linux-kernel, linux-pm, linux-fsdevel

Allow other subsystems to query the current frequency from the CPU without
affecting the cached value of cpufreq. The subsystems doing this will need
to maintain their own version of struct aperfmperf_sample.

This is useful if you need to query the frequency more often than once per
APERFMPERF_CACHE_THRESHOLD_MS but don't want to interfere with the current
caching behavior.

Even though querying too frequently may render the measurements
inaccurate, it is still useful if your subsystem tolerates this inaccuracy.
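
As an illustration, a subsystem keeps its own per-cpu sample and passes it
to the new helper. This is only a sketch of the intended usage (patch 2 of
this series does exactly this in the scheduler); the structure below
simply mirrors the layout of struct aperfmperf_sample:

    /* caller-owned copy, laid out like struct aperfmperf_sample */
    struct my_freq_sample {
            unsigned int    khz;
            ktime_t         time;
            u64             aperf;
            u64             mperf;
    };
    static DEFINE_PER_CPU(struct my_freq_sample, my_freq_sample);

    /* then, e.g. from the caller's own tick/poll path: */
    unsigned int khz = arch_freq_get_on_cpu_from_sample(cpu,
                                    per_cpu_ptr(&my_freq_sample, cpu));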

Co-developed-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Redha Gouicem <redha.gouicem@gmail.com>
---
 arch/x86/kernel/cpu/aperfmperf.c | 31 +++++++++++++++++++++++++++----
 drivers/cpufreq/cpufreq.c        |  5 +++++
 include/linux/cpufreq.h          |  1 +
 3 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index e2f319dc992d..c3be81d689f4 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -36,11 +36,11 @@ static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
  * unless we already did it within 10ms
  * calculate kHz, save snapshot
  */
-static void aperfmperf_snapshot_khz(void *dummy)
+static void aperfmperf_snapshot_khz(void *prev_sample)
 {
 	u64 aperf, aperf_delta;
 	u64 mperf, mperf_delta;
-	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+	struct aperfmperf_sample *s = prev_sample;
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -72,7 +72,8 @@ static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
 	if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
 		return true;
 
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz,
+				 per_cpu_ptr(&samples, cpu), wait);
 
 	/* Return false if the previous iteration was too long ago. */
 	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
@@ -131,7 +132,29 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 		return per_cpu(samples.khz, cpu);
 
 	msleep(APERFMPERF_REFRESH_DELAY_MS);
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz,
+				 per_cpu_ptr(&samples, cpu), 1);
 
 	return per_cpu(samples.khz, cpu);
 }
+
+unsigned int arch_freq_get_on_cpu_from_sample(int cpu, void *sample)
+{
+	struct aperfmperf_sample *s = sample;
+
+	if (!sample)
+		return 0;
+
+	if (!cpu_khz)
+		return 0;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	if (!housekeeping_cpu(cpu, HK_FLAG_MISC))
+		return 0;
+
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz, s, 1);
+
+	return s->khz;
+}
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 02ab56b2a0d8..36e6dbd87317 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -695,6 +695,11 @@ __weak unsigned int arch_freq_get_on_cpu(int cpu)
 	return 0;
 }
 
+__weak unsigned int arch_freq_get_on_cpu_from_sample(int cpu, void *sample)
+{
+	return 0;
+}
+
 static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
 {
 	ssize_t ret;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 8f141d4c859c..129083684ca0 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1005,6 +1005,7 @@ static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
 
 extern void arch_freq_prepare_all(void);
 extern unsigned int arch_freq_get_on_cpu(int cpu);
+extern unsigned int arch_freq_get_on_cpu_from_sample(int cpu, void *sample);
 
 extern void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
 				unsigned long max_freq);
-- 
2.28.0


* [PATCH 2/3] sched: core: x86: query frequency at each tick
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Andrew Morton, Qais Yousef, Al Viro,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

Query the current frequency of the core during the scheduler tick.
The scheduler subsystem maintains its own copy of the aperf/mperf sample
structure because it queries the frequency more frequently than the
cpufreq subsystem does. This can lead to inaccurate measurements, but that
is not a problem here.

Co-developed-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Redha Gouicem <redha.gouicem@gmail.com>
---
 kernel/sched/core.c  | 11 +++++++++++
 kernel/sched/sched.h |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c36dc1ae58be..d6d27a6fc23c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3976,6 +3976,15 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
+struct freq_sample {
+	unsigned int  khz;
+	ktime_t       time;
+	u64           aperf;
+	u64           mperf;
+};
+
+DEFINE_PER_CPU(struct freq_sample, freq_sample);
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -3996,6 +4005,8 @@ void scheduler_tick(void)
 	update_rq_clock(rq);
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
+	rq->freq = arch_freq_get_on_cpu_from_sample(cpu,
+						    this_cpu_ptr(&freq_sample));
 	curr->sched_class->task_tick(rq, curr, 0);
 	calc_global_load_tick(rq);
 	psi_task_tick(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..7d794ab756d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1048,6 +1048,10 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+	/* Frequency measured at the last tick */
+	unsigned int            freq;
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.28.0


* [PATCH 3/3] sched/fair: delay thread migration on fork/wakeup/exec
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Andrey Ignatov, Qais Yousef, Al Viro,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On CPUs that implement per-core frequency scaling (and not per-socket), it
is beneficial to try to use cores running at a higher frequency. One way to
do this on fork/exec/wakeup is to keep the targeted thread on its
parent/previous/waker core, since it should already be running at a high
frequency.

This is how it works:
   - choose the new_cpu as usual in select_task_rq_fair()
   - if we are in a fork/exec/wakeup and new_cpu runs at a low frequency,
     start a timer in 50us and place the
     thread on its parent/previous/waker core
   - if the thread is scheduled, cancel the timer
   - if the timer expires, migrate the thread to new_cpu

This way, if the parent/previous/waker cpu is too busy to be used, the
thread will end up on another cpu when the timer expires. This is
particularly useful in fork/wait patterns where the child thread is placed
on an idle core (low frequency) while the parent thread waits and leaves
its core idle (high frequency). This patch avoids using a low frequency
core if a higher frequency one is available.

There are two configuration parameters for this feature:
   - the frequency threshold under which we consider the core to be running
     at a low frequency, in kHz (/proc/sys/kernel/sched_lowfreq). By
     default, this is set to 0, which disables the delayed thread migration
     feature.
   - the delay of the timer, in ns
     (/proc/sys/kernel/sched_delayed_placement)

Co-developed-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Redha Gouicem <redha.gouicem@gmail.com>
---
 include/linux/sched.h        |  4 ++++
 include/linux/sched/sysctl.h |  3 +++
 kernel/sched/core.c          | 32 +++++++++++++++++++++++++++
 kernel/sched/fair.c          | 42 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  3 +++
 kernel/sysctl.c              | 14 ++++++++++++
 6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2bf0af19a62a..ae823d458f94 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -662,6 +662,10 @@ struct task_struct {
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
 
+	/* Delayed placement */
+	struct hrtimer                  delay_placement_timer;
+	int                             delay_placement_cpu;
+
 	/*
 	 * recent_used_cpu is initially set as the last CPU used by a task
 	 * that wakes affine another task. Waker/wakee relationships can
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..97a1f4489910 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -52,6 +52,9 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+extern __read_mostly unsigned int sysctl_sched_delayed_placement;
+extern __read_mostly unsigned int sysctl_sched_lowfreq;
+
 /*
  *  control realtime throttling:
  *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d6d27a6fc23c..9958b38a5b6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3217,6 +3217,33 @@ int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
 static inline void init_schedstats(void) {}
 #endif /* CONFIG_SCHEDSTATS */
 
+static enum hrtimer_restart delayed_placement_fn(struct hrtimer *data)
+{
+	struct task_struct *p = container_of(data, struct task_struct,
+					     delay_placement_timer);
+	struct rq *rq;
+	struct rq_flags rf;
+	bool queued, running;
+
+	/*
+	 * If, by chance, p was already migrated to this cpu, no need to do
+	 * anything. This can happen because of load balancing for example.
+	 */
+	if (task_cpu(p) == p->delay_placement_cpu)
+		return HRTIMER_NORESTART;
+
+	rq = task_rq_lock(p, &rf);
+
+	queued = task_on_rq_queued(p);
+	running = task_current(rq, p);
+	if (queued && !running)
+		rq = __migrate_task(rq, &rf, p, p->delay_placement_cpu);
+
+	task_rq_unlock(rq, p, &rf);
+
+	return HRTIMER_NORESTART;
+}
+
 /*
  * fork()/clone()-time setup:
  */
@@ -3299,6 +3326,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
+
+	hrtimer_init(&p->delay_placement_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL);
+	p->delay_placement_timer.function = delayed_placement_fn;
+
 	return 0;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33699db27ed5..99c42c215477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -84,6 +84,15 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * After fork, exec or wakeup, thread placement is delayed. This option gives
+ * the delay in nanoseconds.
+ *
+ * (default: 50us)
+ */
+unsigned int sysctl_sched_delayed_placement = 50000;
+unsigned int sysctl_sched_lowfreq;
+
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -6656,6 +6665,13 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return -1;
 }
 
+static bool is_cpu_low_freq(int cpu)
+{
+	if (!sysctl_sched_lowfreq)
+		return false;
+	return cpu_rq(cpu)->freq <= sysctl_sched_lowfreq;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6683,7 +6699,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
-				return new_cpu;
+				goto local;
 			new_cpu = prev_cpu;
 		}
 
@@ -6724,6 +6740,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 	rcu_read_unlock();
 
+local:
+	if (!is_cpu_low_freq(new_cpu))
+		goto end;
+	/*
+	 * If fork/wake/exec, trigger an interrupt in 50us (default) to eventually steal the thread
+	 * and place the thread locally.
+	 */
+	if (new_cpu == task_cpu(p))
+		goto end;
+
+	if (sd_flag & (SD_BALANCE_FORK | SD_BALANCE_WAKE | SD_BALANCE_EXEC)) {
+		p->delay_placement_cpu = new_cpu;
+		new_cpu = task_cpu(current);
+
+		/* Arm timer in 50us */
+		hrtimer_start(&p->delay_placement_timer,
+			      ktime_set(0,
+					sysctl_sched_delayed_placement),
+			      HRTIMER_MODE_REL);
+	}
+
+end:
 	return new_cpu;
 }
 
@@ -7085,6 +7123,8 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	hrtimer_try_to_cancel(&p->delay_placement_timer);
+
 	update_misfit_status(p, rq);
 
 	return p;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7d794ab756d2..02da9ca69b4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2018,6 +2018,9 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
+extern unsigned int sysctl_sched_delayed_placement;
+extern unsigned int sysctl_sched_lowfreq;
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 287862f91717..e8cc36624330 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1712,6 +1712,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "sched_delayed_placement",
+		.data           = &sysctl_sched_delayed_placement,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "sched_lowfreq",
+		.data           = &sysctl_sched_lowfreq,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.28.0


* Re: [RFC PATCH 0/3] sched: delayed thread migration
From: Peter Zijlstra @ 2020-10-21  7:26 UTC (permalink / raw)
  To: Redha Gouicem
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Rafael J. Wysocki, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On Tue, Oct 20, 2020 at 05:44:38PM +0200, Redha Gouicem wrote:
> 
> The first patch of the series is not specific to scheduling. It allows us
> (or anyone else) to use the cpufreq infrastructure at a different sampling
> rate without compromising the cpufreq subsystem and applications that
> depend on it.

It's also completely redundant as the scheduler already reads aperf/mperf
on every tick. Clearly you didn't do your homework ;-)

> The main idea behind this patch series is to bring to light the frequency
> inversion problem that will become more and more prominent with new CPUs
> that feature per-core DVFS. The solution proposed is a first idea for
> solving this problem that still needs to be tested across more CPUs and
> with more applications.

Which is why schedutil (the only cpufreq gov anybody should be using) is
integrated with the scheduler and closes the loop and tells the CPU
about the expected load.


* Re: [RFC PATCH 0/3] sched: delayed thread migration
From: Redha @ 2020-10-21 10:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Rafael J. Wysocki, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On 21/10/2020 09:26, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 05:44:38PM +0200, Redha Gouicem wrote:
>> The first patch of the series is not specific to scheduling. It allows us
>> (or anyone else) to use the cpufreq infrastructure at a different sampling
>> rate without compromising the cpufreq subsystem and applications that
>> depend on it.
> It's also completely redundant as the scheduler already reads aperf/mperf
> on every tick. Clearly you didn't do your homework ;-)
My bad. I worked on this a year ago, just never got the time to submit it
to the LKML, and I should have re-done my homework more thoroughly before
submitting. The paper was submitted at approximately the same time as the
patch introducing support for frequency invariance and frequency reading
at every tick (1 week apart!).
Again, my bad.

>
>> The main idea behind this patch series is to bring to light the frequency
>> inversion problem that will become more and more prominent with new CPUs
>> that feature per-core DVFS. The solution proposed is a first idea for
>> solving this problem that still needs to be tested across more CPUs and
>> with more applications.
> Which is why schedutil (the only cpufreq gov anybody should be using) is
> integrated with the scheduler and closes the loop and tells the CPU
> about the expected load.
>
While I agree that schedutil is probably a good option, I'm not sure we
treat exactly the same problem. schedutil aims at mapping the frequency of
the CPU to the actual load. What I'm saying is that since it takes some
time for the frequency to match the load, why not account for the frequency
when making placement/migration decisions. I know that with the frequency
invariance code, capacity accounts for frequency, which means that thread
placement decisions do account for frequency indirectly. However, we still
have performance improvements with our patch for the workloads with
fork/wait patterns. I really believe that we can still gain performance if
we make decisions while accounting for the frequency more directly.

* Re: [RFC PATCH 0/3] sched: delayed thread migration
From: Peter Zijlstra @ 2020-10-21 11:48 UTC (permalink / raw)
  To: Redha
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Rafael J. Wysocki, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On Wed, Oct 21, 2020 at 12:40:44PM +0200, Redha wrote:

> >> The main idea behind this patch series is to bring to light the frequency
> >> inversion problem that will become more and more prominent with new CPUs
> >> that feature per-core DVFS. The solution proposed is a first idea for
> >> solving this problem that still needs to be tested across more CPUs and
> >> with more applications.
> > Which is why schedutil (the only cpufreq gov anybody should be using) is
> > integrated with the scheduler and closes the loop and tells the CPU
> > about the expected load.
> >
> While I agree that schedutil is probably a good option, I'm not sure we
> treat exactly the same problem. schedutil aims at mapping the frequency of
> the CPU to the actual load. What I'm saying is that since it takes some
> time for the frequency to match the load, why not account for the frequency
> when making placement/migration decisions.

Because overhead, mostly :/ EAS does some of that. Mostly wakeup CPU
selection is already a bottle-neck for some applications (see the fight
over select_idle_sibling()).

Programming a timer is out of budget for high rate wakeup workloads.
Worse, you also don't prime the CPU to ramp up during the enforced
delay.

Also, two new config knobs :-(

> I know that with the frequency invariance code, capacity accounts for
> frequency, which means that thread placement decisions do account for
> frequency indirectly. However, we still have performance improvements
> with our patch for the workloads with fork/wait patterns. I really
> believe that we can still gain performance if we make decisions while
> accounting for the frequency more directly.

So I don't think that's fundamentally a DVFS problem though, just
something that's exacerbated by it. There's a long history with trying
to detect this pattern, see for example WF_SYNC and wake_affine().

(we even had some code in the early CFS days that measured overlap
between tasks, to account for the period between waking up the recipient
and blocking on the answer, but that never worked reliably either, so
that went the way of the dodo)

The classical micro-benchmark is pipe-bench, which ping-pongs a single
byte between two tasks over a pipe. If you run that on a single CPU it
is _much_ faster than when the tasks get split up. DVFS is just like
caches here, yet another reason.

schedutil does solve the problem where, when we migrate a task, the CPU
would have to individually re-learn the DVFS state. By using the
scheduler statistics we can program the DVFS state up-front, on
migration. Instead of waiting for it.

So from that PoV, schedutil fundamentally solves the individual DVFS
problem as best is possible. It closes the control loop; we no longer
have individually operating control loops that are unaware of one
another.
