* [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
@ 2019-05-15 13:53 Parth Shah
  2019-05-15 13:53 ` [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface Parth Shah
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

Abstract
========

Modern servers allow multiple cores to run at frequencies above the rated
frequency range. But the power budget of the system inhibits sustaining
these higher frequencies for longer durations.

However, when certain cores are put into idle states, the power can be
effectively channelled to the remaining busy cores, allowing them to sustain
the higher frequency.

One way to achieve this is to pack tasks onto fewer cores and keep the others
idle, but that may penalize the packed tasks, in which case sustaining higher
frequencies is of no benefit. However, if one can identify unimportant,
low-utilization tasks which can be packed onto the already active cores, then
waking up new cores can be avoided. Such short and/or bursty tasks are the
"jitter tasks", and waking up a new core for them is expensive.

The current CFS algorithm in the kernel scheduler is performance oriented and
hence tries to assign an idle CPU first when waking up new tasks. This policy
works well for the major categories of workloads, but for jitter tasks one can
save energy by packing them onto the active cores and allowing the other cores
to run at higher frequencies.

This patch set tunes the task wake-up logic in the scheduler to pack
exclusively classified jitter tasks onto busy cores. The work uses additional
attributes inside the "cpu" cgroup controller to manually classify tasks as
jitter.


Implementation
==============

These patches use the UCLAMP mechanism of the "cpu" cgroup controller to
classify jitter tasks. The task wake-up logic uses this information to pack
such tasks onto cores which are already busy running other workloads. The
task packing is done only in `select_task_rq_fair`, so that in case of a
wrong decision the load balancer can still pull the classified jitter tasks
to a CPU offering better performance.
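
For orientation, this is the hook in the wake-up fast path as added by
patch 5 of this series:
```
/*
 * In the SD_BALANCE_WAKE fast path of select_task_rq_fair() (patch 5):
 * jitter-classified tasks first try to land on an already busy core;
 * everything else keeps the existing idle-CPU search.
 */
if (is_turbosched_enabled())
	new_cpu = turbosched_select_idle_sibling(p, prev_cpu, new_cpu);
else
	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
```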

Any task added to a "cpu" cgroup tagged with cpu.util.max=1 is classified as
jitter. We define a core to be non-idle if it is more than 12.5% utilized;
the jitter tasks are packed onto such cores using a first-fit approach.

To demonstrate/benchmark, one can use the synthetic workload generator
`turbo_bench.c`, available at
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c

The following snippet demonstrates the use of the TurboSched feature:
```
mkdir -p /sys/fs/cgroup/cpu/jitter
echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
i=8;
./turbo_bench -t 30 -h $i -n $i &
./turbo_bench -t 30 -h 0 -n $i &
echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs
```
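
As a sanity check (assuming the cgroup attribute files from the UCLAMP RFCv8
series [3] are present), the classification can be verified with:
```
cat /sys/fs/cgroup/cpu/jitter/cpu.util.max    # expected: 0
cat /sys/fs/cgroup/cpu/jitter/cgroup.procs    # PID of the second turbo_bench instance
```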

The current implementation packs only the jitter-classified tasks onto busy
cores, but it can be further optimized by taking userspace input about
important tasks and keeping track of them. This would make the search for
non-idle cores faster and more accurate, since userspace hints are safer
than automatically classified busy cores/tasks.


Result
======

The patch set proves to be useful for systems and workloads where the
frequency boost outweighs the cost of packing tasks onto fewer cores. On an
IBM POWER9 system the benefit for a workload can be up to 13%.

                   Performance benefit of TurboSched over CFS                  
     +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+   
     | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |   
  15 +-+                                  Performance benefit in %       +-+   
     |                    **                                               |   
     |                    **                                               |   
  10 +-+                ********                                         +-+   
     |                  ********                                           |   
     |              ************   *                                       |   
   5 +-+            ************   *                                     +-+   
     |            * ************   * ****                                  |   
     |       ** * * ************ * * ******                                |   
     |       ** * * ************ * * ************ *                        |   
   0 +-******** * * ************ * * ************ * * * ********** * * * **+   
     |     **                                           ****               |   
     |     **                                                              |   
  -5 +-+   **                                                            +-+   
     | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |   
     +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+   
       1 2 3 4  5 6 7 8 9101112 1314151617181920 2122232425262728 29303132     
                             Workload threads count                            


                     Frequency benefit of TurboSched over CFS                   
  20 +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+   
     | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |   
     |                                      Frequency benefit in %         |   
  15 +-+                  **                                             +-+   
     |                    **                                               |   
     |              ********                                               |   
  10 +-+          * ************                                         +-+   
     |            * ************                                           |   
     |            * ************                                           |   
   5 +-+        * * ************   *                                     +-+   
     |       ** * * ************ * *                                       |   
     |     **** * * ************ * * ******                 **             |   
   0 +-******** * * ************ * * ************ * * * ********** * * * **+   
     |   **                                                                |   
     |   **                                                                |   
  -5 +-+ **                                                              +-+   
     | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |   
     +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+   
       1 2 3 4  5 6 7 8 9101112 1314151617181920 2122232425262728 29303132     
                             Workload threads count                            

These numbers are w.r.t. the `turbo_bench.c` benchmark, which spawns a mix of
high-utilization and low-utilization (jitter) threads. The X-axis represents
the count of each category of tasks spawned.


Series organization
===================
- Patches [01-03]: Cgroup based jitter tasks classification
- Patches [04]: Defines Core Capacity to limit task packing
- Patches [05-06]: Tune CFS task wakeup logic to pack tasks onto busy
  cores

The series applies on top of Patrick Bellasi's UCLAMP RFCv8 [3] patches,
based on the tip/sched/core branch, with the UCLAMP_TASK_GROUP config option
enabled.
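
A possible way to build such a tree (illustrative only; the remote name and
the patch directories below are placeholders):
```
git checkout -b turbosched tip/sched/core     # assumes a 'tip' remote
git am uclamp-rfcv8/*.patch                   # Patrick Bellasi's UCLAMP RFCv8 series [3]
git am turbosched-rfcv2/*.patch               # this series
./scripts/config --enable UCLAMP_TASK_GROUP   # plus its dependencies (e.g. UCLAMP_TASK)
make olddefconfig
```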


Changelog
=========
This patch set is a respin of TurboSched RFCv1 [1]
(https://lwn.net/Articles/783959/)
which includes the following main changes:

- No WOF task classification; only jitter tasks are classified, via the
  cpu cgroup controller
- Use of a spinlock rather than a mutex to count the number of jitter tasks
  classified from the cgroup
- Architecture-specific implementation of the core capacity multiplication
  factor, which changes dynamically based on the number of active threads
  in the core
- Selection of a non-idle core in the system is bounded by the DIE domain
- Use of the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask"; instead uses the sd of the DIE domain to find
  a better fit


References
==========

[1] "TurboSched : A scheduler for sustaining Turbo frequency for longer
durations" https://lwn.net/Articles/783959/

[2] "Turbo_bench: Synthetic workload generator"
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c

[3] "Patrick Bellasi, Add utilization clamping support"
https://lore.kernel.org/lkml/20190402104153.25404-1-patrick.bellasi@arm.com/


Parth Shah (6):
  sched/core: Add manual jitter classification from cgroup interface
  sched: Introduce switch to enable TurboSched mode
  sched/core: Update turbo_sched count only when required
  sched/fair: Define core capacity to limit task packing
  sched/fair: Tune task wake-up logic to pack jitter tasks
  sched/fair: Bound non idle core search by DIE domain

 arch/powerpc/include/asm/topology.h |   7 ++
 arch/powerpc/kernel/smp.c           |  37 ++++++++
 kernel/sched/core.c                 |  32 +++++++
 kernel/sched/fair.c                 | 127 +++++++++++++++++++++++++++-
 kernel/sched/sched.h                |   8 ++
 5 files changed, 210 insertions(+), 1 deletion(-)

-- 
2.17.1


* [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
@ 2019-05-15 13:53 ` Parth Shah
  2019-05-15 16:29   ` Peter Zijlstra
  2019-05-15 13:53 ` [RFCv2 2/6] sched: Introduce switch to enable TurboSched mode Parth Shah
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

Jitter tasks are usually less important in terms of performance and are
short/bursty in nature. TurboSched uses this jitter classification to pack
jitter tasks onto the already running busy cores, keeping the total idle
core count high.

This patch describes the use of the UCLAMP mechanism to classify tasks.
Patrick Bellasi came up with a mechanism to classify tasks from userspace:
https://lore.kernel.org/lkml/20190402104153.25404-1-patrick.bellasi@arm.com/

This UCLAMP mechanism can be used to classify tasks as jitter. Jitter tasks
can be classified for a cgroup by keeping util.max of its tasks at the
lowest value (=0). This also provides the benefit of giving the lowest
frequency to those jitter tasks, which is useful if all jitter tasks are
packed onto a separate core.

Use Case with UCLAMP
====================
To create a cgroup with all of its tasks classified as jitter:

```
mkdir -p /sys/fs/cgroup/cpu/jitter
echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
i=8;
./turbo_bench -t 30 -h $i -n $i &
./turbo_bench -t 30 -h 0 -n $i &
echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs;
```

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 9 +++++++++
 kernel/sched/sched.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d42c0f5eefa9..77aa4aee4478 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7192,6 +7192,15 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
 	tg->uclamp_req[UCLAMP_MAX].value = max_value;
 	tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
 
+	/*
+	 * Classify the tasks belonging to the last bucket of MAX UCLAMP as
+	 * jitters
+	 */
+	if (uclamp_bucket_id(max_value) == 0)
+		tg->turbo_sched_enabled = 1;
+	else if (tg->turbo_sched_enabled)
+		tg->turbo_sched_enabled = 0;
+
 	/* Update effective clamps to track the most restrictive value */
 	cpu_util_update_eff(css, UCLAMP_MAX);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b4019012d84b..e75ffaf3ff34 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -407,6 +407,7 @@ struct task_group {
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
 
+	bool			turbo_sched_enabled;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.17.1


* [RFCv2 2/6] sched: Introduce switch to enable TurboSched mode
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
  2019-05-15 13:53 ` [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface Parth Shah
@ 2019-05-15 13:53 ` Parth Shah
  2019-05-15 16:30   ` Peter Zijlstra
  2019-05-15 13:53 ` [RFCv2 3/6] sched/core: Update turbo_sched count only when required Parth Shah
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

This patch creates a static key which allows enabling or disabling the
TurboSched feature at runtime.

The static key helps optimize the scheduler fast path when the TurboSched
feature is disabled.

The patch also provides get/put methods to keep track of the cgroups using
the TurboSched feature. This allows enabling the feature when the first
cgroup classified as jitter is added, and disabling it when the last such
cgroup is removed.
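
The intended usage, added by the next patch in cpu_util_max_write_u64(),
looks like:
```
if (uclamp_bucket_id(max_value) == 0) {
	tg->turbo_sched_enabled = 1;
	turbo_sched_get();	/* take a reference; first jitter cgroup enables the key */
} else if (tg->turbo_sched_enabled) {
	tg->turbo_sched_enabled = 0;
	turbo_sched_put();	/* drop the reference; last jitter cgroup disables the key */
}
```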

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 20 ++++++++++++++++++++
 kernel/sched/sched.h |  7 +++++++
 2 files changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77aa4aee4478..facbedd2554e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,26 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+static DEFINE_SPINLOCK(turbo_sched_lock);
+static int turbo_sched_count;
+
+void turbo_sched_get(void)
+{
+	spin_lock(&turbo_sched_lock);
+	if (!turbo_sched_count++)
+		static_branch_enable(&__turbo_sched_enabled);
+	spin_unlock(&turbo_sched_lock);
+}
+
+void turbo_sched_put(void)
+{
+	spin_lock(&turbo_sched_lock);
+	if (!--turbo_sched_count)
+		static_branch_disable(&__turbo_sched_enabled);
+	spin_unlock(&turbo_sched_lock);
+}
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e75ffaf3ff34..0339964cdf43 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2437,3 +2437,10 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+DECLARE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+
+static inline bool is_turbosched_enabled(void)
+{
+	return static_branch_unlikely(&__turbo_sched_enabled);
+}
-- 
2.17.1


* [RFCv2 3/6] sched/core: Update turbo_sched count only when required
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
  2019-05-15 13:53 ` [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface Parth Shah
  2019-05-15 13:53 ` [RFCv2 2/6] sched: Introduce switch to enable TurboSched mode Parth Shah
@ 2019-05-15 13:53 ` Parth Shah
  2019-05-15 16:31   ` Peter Zijlstra
  2019-05-15 13:53 ` [RFCv2 4/6] sched/fair: Define core capacity to limit task packing Parth Shah
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

Use the get/put methods to register/unregister the use of TurboSched support
when a cgroup starts/stops being classified as jitter.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index facbedd2554e..4c55b5399985 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7216,10 +7216,13 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
 	 * Classify the tasks belonging to the last bucket of MAX UCLAMP as
 	 * jitters
 	 */
-	if (uclamp_bucket_id(max_value) == 0)
+	if (uclamp_bucket_id(max_value) == 0) {
 		tg->turbo_sched_enabled = 1;
-	else if (tg->turbo_sched_enabled)
+		turbo_sched_get();
+	} else if (tg->turbo_sched_enabled) {
 		tg->turbo_sched_enabled = 0;
+		turbo_sched_put();
+	}
 
 	/* Update effective clamps to track the most restrictive value */
 	cpu_util_update_eff(css, UCLAMP_MAX);
-- 
2.17.1


* [RFCv2 4/6] sched/fair: Define core capacity to limit task packing
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (2 preceding siblings ...)
  2019-05-15 13:53 ` [RFCv2 3/6] sched/core: Update turbo_sched count only when required Parth Shah
@ 2019-05-15 13:53 ` Parth Shah
  2019-05-15 16:37   ` Peter Zijlstra
  2019-05-15 13:53 ` [RFCv2 5/6] sched/fair: Tune task wake-up logic to pack jitter tasks Parth Shah
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

Task packing on a core needs to be bounded by the core's capacity. This
patch defines a new method which acts as a tipping point for task packing.

The core capacity is what limits task packing beyond a certain point. In
general, the capacity of a core is defined as the aggregated sum of the
capacities of all the CPUs in the core.

Some architectures do not have core capacity increasing linearly with the
number of threads (or CPUs) in the core. For such cases, architecture-specific
calculations need to be done to find the core capacity.

`arch_scale_core_capacity` is currently tuned for the `powerpc` arch by
scaling capacity w.r.t. the number of online SMT threads in the core.
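
For illustration, the scaling used by the powerpc implementation below works
out as follows (the helper name is illustrative; a per-CPU capacity of 1024
is assumed):
```
/* Mirrors powerpc_scale_core_capacity() from this patch:
 * core_capacity = cap + (cap * smt_mode) / 8, i.e. (1 + 0.125 * smt_mode) * cap
 * With cap = 1024: SMT-1 -> 1024, SMT-4 -> 1536 (1.5x), SMT-8 -> 2048 (2x)
 */
static unsigned long scale_core_capacity(unsigned long cap, int smt_mode)
{
	return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
}
```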

The patch provides a default handler for other architectures by scaling core
capacity w.r.t. the capacity of all the threads in the core.

ToDo: the SMT mode is calculated each time a jitter task wakes up, leading to
redundant decision time; this can be eliminated by keeping track of online
CPUs during CPU hotplug.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h |  4 ++++
 arch/powerpc/kernel/smp.c           | 32 +++++++++++++++++++++++++++++
 kernel/sched/fair.c                 | 19 +++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..1c777ee67180 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -132,6 +132,10 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
+#define arch_scale_core_capacity	powerpc_scale_core_capacity
+
+unsigned long powerpc_scale_core_capacity(int first_smt,
+					  unsigned long smt_cap);
 
 int dlpar_cpu_readd(int cpu);
 #endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index e784342bdaa1..256ab2a50f6e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1173,6 +1173,38 @@ static void remove_cpu_from_masks(int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Calculate capacity of a core based on the active threads in the core
+ * Scale the capacity of first SM-thread based on total number of
+ * active threads in the respective smt_mask.
+ *
+ * The scaling is done such that for
+ * SMT-4, core_capacity = 1.5x first_cpu_capacity
+ * and for SMT-8, core_capacity multiplication factor is 2x
+ *
+ * So, core_capacity multiplication factor = (1 + smt_mode*0.125)
+ *
+ * @first_cpu: First/any CPU id in the core
+ * @cap: Capacity of the first_cpu
+ */
+inline unsigned long powerpc_scale_core_capacity(int first_cpu,
+		unsigned long cap) {
+	struct cpumask select_idles;
+	struct cpumask *cpus = &select_idles;
+	int cpu, smt_mode = 0;
+
+	cpumask_and(cpus, cpu_smt_mask(first_cpu), cpu_online_mask);
+
+	/* Find SMT mode from active SM-threads */
+	for_each_cpu(cpu, cpus)
+		smt_mode++;
+
+	/* Scale core capacity based on smt mode */
+	return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
+}
+#endif
+
 static inline void add_cpu_to_smallcore_masks(int cpu)
 {
 	struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7eea9dc4644..2578e6bdf85b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6231,6 +6231,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+#ifdef CONFIG_SCHED_SMT
+
+#ifndef arch_scale_core_capacity
+static inline unsigned long arch_scale_core_capacity(int first_thread,
+						     unsigned long smt_cap)
+{
+	/* Default capacity of core is sum of cap of all the threads */
+	unsigned long ret = 0;
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(first_thread))
+		ret += cpu_rq(sibling)->cpu_capacity;
+
+	return ret;
+}
+#endif
+
+#endif
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-- 
2.17.1


* [RFCv2 5/6] sched/fair: Tune task wake-up logic to pack jitter tasks
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (3 preceding siblings ...)
  2019-05-15 13:53 ` [RFCv2 4/6] sched/fair: Define core capacity to limit task packing Parth Shah
@ 2019-05-15 13:53 ` Parth Shah
  2019-05-15 16:43   ` Peter Zijlstra
  2019-05-15 13:53 ` [RFCv2 6/6] sched/fair: Bound non idle core search by DIE domain Parth Shah
  2019-05-15 16:48 ` [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Peter Zijlstra
  6 siblings, 1 reply; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

The algorithm finds the first non-idle core in the system and tries to place
the task on the least utilized CPU of the chosen core. To maintain cache
hotness, the search for a non-idle core starts from prev_cpu, which also
reduces task ping-pong behaviour inside the core.

A CPU/core is defined as under-utilized when the aggregated utilization of
the given CPUs is less than 12.5% of the capacity. The function is named
core_underutilized because of its specific use in finding a non-idle core.

This patch uses the core_underutilized method to decide whether a core should
be considered sufficiently busy for packing. Since cores with low utilization
should not be selected for packing, the under-utilization margin is kept at
12.5% of the core capacity. This number is experimental and can be modified
as needed; the larger the number, the more aggressive the task packing
becomes.

For task packing, the algorithm should select the best core on which the task
can be accommodated without waking up an idle core, but jitter tasks should
also not be placed on a core which is about to go idle. A core with an
aggregated utilization below 12.5% may go idle soon, so packing on such a
core is avoided. Experiments showed that this threshold gives good decisions
about not selecting cores which will idle out soon.
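
For reference, the check as introduced below, with worked numbers (assuming a
core capacity of 2048):
```
/*
 * A core is under-utilized when its aggregated utilization is below 1/8th
 * (12.5%) of its capacity; e.g. core_capacity = 2048 gives a cut-off of
 * 2048 >> 3 = 256.
 */
static inline bool core_underutilized(unsigned long core_util,
				      unsigned long core_capacity)
{
	return core_util < (core_capacity >> 3);
}
```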

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/fair.c | 100 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 99 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2578e6bdf85b..d2d556eb6d0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5323,6 +5323,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+/* A cpumask to find active cores in the system. */
+DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 /*
@@ -6248,6 +6250,73 @@ static inline unsigned long arch_scale_core_capacity(int first_thread,
 }
 #endif
 
+/*
+ * Core is defined as under-utilized in case if the aggregated utilization of a
+ * all the CPUs in a core is less than 12.5%
+ */
+static inline bool core_underutilized(unsigned long core_util,
+				      unsigned long core_capacity)
+{
+	return core_util < (core_capacity >> 3);
+}
+
+/*
+ * Try to find a non idle core in the system  with spare capacity
+ * available for task packing, thereby keeping minimal cores active.
+ * Uses first fit algorithm to pack low util jitter tasks on active cores.
+ */
+static int select_non_idle_core(struct task_struct *p, int prev_cpu)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
+	int iter_cpu, sibling;
+
+	cpumask_and(cpus, cpu_online_mask, &p->cpus_allowed);
+
+	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
+		unsigned long core_util = 0;
+		unsigned long core_cap = arch_scale_core_capacity(iter_cpu,
+				capacity_of(iter_cpu));
+		unsigned long est_util = 0, est_util_enqueued = 0;
+		unsigned long util_best_cpu = (unsigned int)-1;
+		int best_cpu = iter_cpu;
+		struct cfs_rq *cfs_rq;
+
+		for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
+			__cpumask_clear_cpu(sibling, cpus);
+			core_util += cpu_util(sibling);
+
+			/*
+			 * Keep track of least utilized CPU in the core
+			 */
+			if (cpu_util(sibling) < util_best_cpu) {
+				util_best_cpu = cpu_util(sibling);
+				best_cpu = sibling;
+			}
+		}
+
+		/*
+		 * Find if the selected task will fit into the tracked minutil
+		 * CPU or not by estimating the utilization of the CPU.
+		 */
+		cfs_rq = &cpu_rq(best_cpu)->cfs;
+		est_util = READ_ONCE(cfs_rq->avg.util_avg) + task_util(p);
+		est_util_enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+		est_util_enqueued += _task_util_est(p);
+		est_util = max(est_util, est_util_enqueued);
+
+		if (!core_underutilized(core_util, core_cap) && est_util < core_cap) {
+			/*
+			 * Try to bias towards prev_cpu to avoid task ping-pong
+			 * behaviour inside the core.
+			 */
+			if (cpumask_test_cpu(prev_cpu, cpu_smt_mask(iter_cpu)))
+				return prev_cpu;
+
+			return best_cpu;
+		}
+	}
+	return -1;
+}
 #endif
 
 /*
@@ -6704,6 +6773,31 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return -1;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Select all tasks of type 1(jitter) for task packing
+ */
+static int turbosched_select_idle_sibling(struct task_struct *p, int prev_cpu,
+					  int target)
+{
+	int new_cpu;
+
+	if (unlikely(task_group(p)->turbo_sched_enabled)) {
+		new_cpu = select_non_idle_core(p, prev_cpu);
+		if (new_cpu >= 0)
+			return new_cpu;
+	}
+
+	return select_idle_sibling(p, prev_cpu, target);
+}
+#else
+static int turbosched_select_idle_sibling(struct task_struct *p, int prev_cpu,
+					  int target)
+{
+	return select_idle_sibling(p, prev_cpu, target);
+}
+#endif
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6769,7 +6863,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
 
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		if (is_turbosched_enabled())
+			new_cpu = turbosched_select_idle_sibling(p, prev_cpu,
+								 new_cpu);
+		else
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;
-- 
2.17.1


* [RFCv2 6/6] sched/fair: Bound non idle core search by DIE domain
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (4 preceding siblings ...)
  2019-05-15 13:53 ` [RFCv2 5/6] sched/fair: Tune task wake-up logic to pack jitter tasks Parth Shah
@ 2019-05-15 13:53 ` Parth Shah
  2019-05-15 16:44   ` Peter Zijlstra
  2019-05-15 16:48 ` [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Peter Zijlstra
  6 siblings, 1 reply; 18+ messages in thread
From: Parth Shah @ 2019-05-15 13:53 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: mingo, peterz, dietmar.eggemann, dsmythies

This patch specifies the sched domain within which to search for a non-idle
core.

select_non_idle_core currently searches for non-idle cores across the whole
system. But on systems with multiple NUMA domains, the Turbo frequency can be
sustained within a NUMA domain without being affected by the other NUMA
domains.

This patch provides an architecture-specific hook for defining the turbo
domain, so that the search for a core is bounded within the NUMA domain.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h |  3 +++
 arch/powerpc/kernel/smp.c           |  5 +++++
 kernel/sched/fair.c                 | 10 +++++++++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 1c777ee67180..410b94c9e1a2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,10 +133,13 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
 #define arch_scale_core_capacity	powerpc_scale_core_capacity
+#define arch_turbo_domain		powerpc_turbo_domain
 
 unsigned long powerpc_scale_core_capacity(int first_smt,
 					  unsigned long smt_cap);
 
+struct cpumask *powerpc_turbo_domain(int cpu);
+
 int dlpar_cpu_readd(int cpu);
 #endif
 #endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 256ab2a50f6e..e13ba3981891 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1203,6 +1203,11 @@ inline unsigned long powerpc_scale_core_capacity(int first_cpu,
 	/* Scale core capacity based on smt mode */
 	return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
 }
+
+inline struct cpumask *powerpc_turbo_domain(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
 #endif
 
 static inline void add_cpu_to_smallcore_masks(int cpu)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2d556eb6d0f..bd9985775db4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6260,6 +6260,13 @@ static inline bool core_underutilized(unsigned long core_util,
 	return core_util < (core_capacity >> 3);
 }
 
+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+	return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
 /*
  * Try to find a non idle core in the system  with spare capacity
  * available for task packing, thereby keeping minimal cores active.
@@ -6270,7 +6277,8 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu)
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
 	int iter_cpu, sibling;
 
-	cpumask_and(cpus, cpu_online_mask, &p->cpus_allowed);
+	cpumask_and(cpus, cpu_online_mask, arch_turbo_domain(prev_cpu));
+	cpumask_and(cpus, cpus, &p->cpus_allowed);
 
 	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
 		unsigned long core_util = 0;
-- 
2.17.1


* Re: [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface
  2019-05-15 13:53 ` [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface Parth Shah
@ 2019-05-15 16:29   ` Peter Zijlstra
  2019-05-16 16:12     ` Parth Shah
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:29 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:17PM +0530, Parth Shah wrote:

> Subject: [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface

How can this be v2 ?! I've never seen v1.

> Jitter tasks are usually of less important in terms of performance
> and are short/bursty in characteristics. TurboSched uses this jitter
> classification to pack jitters into the already running busy cores to
> keep the total idle core count high.
> 
> The patch describes the use of UCLAMP mechanism to classify tasks. Patrick
> Bellasi came up with a mechanism to classify tasks from the userspace
> https://lore.kernel.org/lkml/20190402104153.25404-1-patrick.bellasi@arm.com/

The canonical form is:

	https://lkml.kernel.org/r/$MSGID

> This UCLAMP mechanism can be useful in classifying tasks as jitter. Jitters
> can be classified for the cgroup by keeping util.max of the tasks as the
> least(=0). This also provides benefit of giving the least frequency to
> those jitter tasks, which is useful if all jitters are packed onto a
> separate core.
> 
> Use Case with UCLAMP
> ===================
> To create a cgroup with all the tasks classified as jitters;
> 
> ```
> mkdir -p /sys/fs/cgroup/cpu/jitter
> echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
> i=8;
> ./turbo_bench -t 30 -h $i -n $i &
> ./turbo_bench -t 30 -h 0 -n $i &
> echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs;
> ```
> 
> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> ---
>  kernel/sched/core.c  | 9 +++++++++
>  kernel/sched/sched.h | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d42c0f5eefa9..77aa4aee4478 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7192,6 +7192,15 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>  	tg->uclamp_req[UCLAMP_MAX].value = max_value;
>  	tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
>  
> +	/*
> +	 * Classify the tasks belonging to the last bucket of MAX UCLAMP as
> +	 * jitters
> +	 */
> +	if (uclamp_bucket_id(max_value) == 0)
> +		tg->turbo_sched_enabled = 1;
> +	else if (tg->turbo_sched_enabled)
> +		tg->turbo_sched_enabled = 0;
> +
>  	/* Update effective clamps to track the most restrictive value */
>  	cpu_util_update_eff(css, UCLAMP_MAX);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b4019012d84b..e75ffaf3ff34 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -407,6 +407,7 @@ struct task_group {
>  	struct uclamp_se	uclamp[UCLAMP_CNT];
>  #endif
>  
> +	bool			turbo_sched_enabled;
>  };

Your simple patch has 3 problems:

 - it limits itself; for no apparent reason; to the cgroup interface.

 - it is inconsistent in the terminology; pick either jitter or
   turbo-sched, and I think the latter is a horrid name, it wants to be
   'pack' or something similar. Also, jitter really doesn't make sense
   given the classification.

 - you use '_Bool' in a composite type.


* Re: [RFCv2 2/6] sched: Introduce switch to enable TurboSched mode
  2019-05-15 13:53 ` [RFCv2 2/6] sched: Introduce switch to enable TurboSched mode Parth Shah
@ 2019-05-15 16:30   ` Peter Zijlstra
  2019-05-16 16:15     ` Parth Shah
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:30 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:18PM +0530, Parth Shah wrote:
> +void turbo_sched_get(void)
> +{
> +	spin_lock(&turbo_sched_lock);
> +	if (!turbo_sched_count++)
> +		static_branch_enable(&__turbo_sched_enabled);
> +	spin_unlock(&turbo_sched_lock);
> +}

Muwhahaha, you didn't test this code, did you?

* Re: [RFCv2 3/6] sched/core: Update turbo_sched count only when required
  2019-05-15 13:53 ` [RFCv2 3/6] sched/core: Update turbo_sched count only when required Parth Shah
@ 2019-05-15 16:31   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:31 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:19PM +0530, Parth Shah wrote:
> Use the get/put methods to add/remove the use of TurboSched support from
> the cgroup.

Didn't anybody tell you that cgroup only interfaces are frowned upon?

* Re: [RFCv2 4/6] sched/fair: Define core capacity to limit task packing
  2019-05-15 13:53 ` [RFCv2 4/6] sched/fair: Define core capacity to limit task packing Parth Shah
@ 2019-05-15 16:37   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:37 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:20PM +0530, Parth Shah wrote:
> The task packing on a core needs to be bounded based on its capacity. This
> patch defines a new method which acts as a tipping point for task packing.
> 
> The Core capacity is the method which limits task packing above certain
> point. In general, the capacity of a core is defined to be the aggregated
> sum of all the CPUs in the Core.
> 
> Some architectures does not have core capacity linearly increasing with the
> number of threads( or CPUs) in the core. For such cases, architecture
> specific calculations needs to be done to find core capacity.
> 
> The `arch_scale_core_capacity` is currently tuned for `powerpc` arch by
> scaling capacity w.r.t to the number of online SMT in the core.
> 
> The patch provides default handler for other architecture by scaling core
> capacity w.r.t. to the capacity of all the threads in the core.
> 
> ToDo: SMT mode is calculated each time a jitter task wakes up leading to
> redundant decision time which can be eliminated by keeping track of online
> CPUs during hotplug task.

Urgh, we just got rid of capacity for SMT. Also I don't think the above
clearly defines your metric.

* Re: [RFCv2 5/6] sched/fair: Tune task wake-up logic to pack jitter tasks
  2019-05-15 13:53 ` [RFCv2 5/6] sched/fair: Tune task wake-up logic to pack jitter tasks Parth Shah
@ 2019-05-15 16:43   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:43 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:21PM +0530, Parth Shah wrote:
> @@ -6704,6 +6773,31 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	return -1;
>  }
>  
> +#ifdef CONFIG_SCHED_SMT
> +/*
> + * Select all tasks of type 1(jitter) for task packing
> + */
> +static int turbosched_select_idle_sibling(struct task_struct *p, int prev_cpu,
> +					  int target)
> +{
> +	int new_cpu;
> +
> +	if (unlikely(task_group(p)->turbo_sched_enabled)) {

So if you build without cgroups, this is a NULL dereference.

Also, this really should not be group based.

> +		new_cpu = select_non_idle_core(p, prev_cpu);
> +		if (new_cpu >= 0)
> +			return new_cpu;
> +	}
> +
> +	return select_idle_sibling(p, prev_cpu, target);
> +}
> +#else

* Re: [RFCv2 6/6] sched/fair: Bound non idle core search by DIE domain
  2019-05-15 13:53 ` [RFCv2 6/6] sched/fair: Bound non idle core search by DIE domain Parth Shah
@ 2019-05-15 16:44   ` Peter Zijlstra
  2019-05-16 16:26     ` Parth Shah
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:44 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:22PM +0530, Parth Shah wrote:
> This patch specifies the sched domain to search for a non idle core.
> 
> The select_non_idle_core searches for the non idle cores across whole
> system. But in the systems with multiple NUMA domains, the Turbo frequency
> can be sustained within the NUMA domain without being affected from other
> NUMA.
> 
> This patch provides an architecture specific implementation for defining
> the turbo domain to make searching of the core to be bound within the NUMA.

NAK, this is insane. You don't need arch hooks to find the numa domain.

* Re: [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
  2019-05-15 13:53 [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (5 preceding siblings ...)
  2019-05-15 13:53 ` [RFCv2 6/6] sched/fair: Bound non idle core search by DIE domain Parth Shah
@ 2019-05-15 16:48 ` Peter Zijlstra
  2019-05-16 16:05   ` Parth Shah
  6 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2019-05-15 16:48 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies

On Wed, May 15, 2019 at 07:23:16PM +0530, Parth Shah wrote:
> Abstract
> ========
> 
> The modern servers allows multiple cores to run at range of
> frequencies higher than rated range of frequencies. But the power budget
> of the system inhibits sustaining these higher frequencies for
> longer durations.
> 
> However when certain cores are put to idle states, the power can be
> effectively channelled to other busy cores, allowing them to sustain
> the higher frequency.
> 
> One way to achieve this is to pack tasks onto fewer cores keeping others idle,
> but it may lead to performance penalty for such tasks and sustaining higher
> frequencies proves to be of no benefit. But if one can identify unimportant low
> utilization tasks which can be packed on the already active cores then waking up
> of new cores can be avoided. Such tasks are short and/or bursty "jitter tasks"
> and waking up new core is expensive for such case.
> 
> Current CFS algorithm in kernel scheduler is performance oriented and hence
> tries to assign any idle CPU first for the waking up of new tasks. This policy
> is perfect for major categories of the workload, but for jitter tasks, one
> can save energy by packing it onto active cores and allow other cores to run at
> higher frequencies.
> 
> These patch-set tunes the task wake up logic in scheduler to pack exclusively
> classified jitter tasks onto busy cores. The work involves the use of additional
> attributes inside "cpu" cgroup controller to manually classify tasks as jitter. 

Why does this make sense? Don't these higher freq bins burn power like
stupid? That is, it makes sense to use turbo-bins for single threaded
workloads that are CPU-bound and need performance.

But why pack a bunch of 'crap' tasks onto a core and give it turbo;
that's just burning power without getting anything back for it.



* Re: [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
  2019-05-15 16:48 ` [RFCv2 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Peter Zijlstra
@ 2019-05-16 16:05   ` Parth Shah
  0 siblings, 0 replies; 18+ messages in thread
From: Parth Shah @ 2019-05-16 16:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies



On 5/15/19 10:18 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:16PM +0530, Parth Shah wrote:
>> Abstract
>> ========
>>
>> The modern servers allows multiple cores to run at range of
>> frequencies higher than rated range of frequencies. But the power budget
>> of the system inhibits sustaining these higher frequencies for
>> longer durations.
>>
>> However when certain cores are put to idle states, the power can be
>> effectively channelled to other busy cores, allowing them to sustain
>> the higher frequency.
>>
>> One way to achieve this is to pack tasks onto fewer cores keeping others idle,
>> but it may lead to performance penalty for such tasks and sustaining higher
>> frequencies proves to be of no benefit. But if one can identify unimportant low
>> utilization tasks which can be packed on the already active cores then waking up
>> of new cores can be avoided. Such tasks are short and/or bursty "jitter tasks"
>> and waking up new core is expensive for such case.
>>
>> Current CFS algorithm in kernel scheduler is performance oriented and hence
>> tries to assign any idle CPU first for the waking up of new tasks. This policy
>> is perfect for major categories of the workload, but for jitter tasks, one
>> can save energy by packing it onto active cores and allow other cores to run at
>> higher frequencies.
>>
>> These patch-set tunes the task wake up logic in scheduler to pack exclusively
>> classified jitter tasks onto busy cores. The work involves the use of additional
>> attributes inside "cpu" cgroup controller to manually classify tasks as jitter. 
> 
> Why does this make sense? Don't these higher freq bins burn power like
> stupid? That is, it makes sense to use turbo-bins for single threaded
> workloads that are CPU-bound and need performance.
> 
> But why pack a bunch of 'crap' tasks onto a core and give it turbo;
> that's just burning power without getting anything back for it.
> 

Thanks for taking interest in my patch series.
I will try my best to answer your question.

This patch series tries to pack jitter tasks on busier cores to avoid waking
up any idle core for as long as possible. This approach is supposed to give
more performance to the CPU-bound tasks by sustaining Turbo for a longer
duration.

The current task wake-up implementation is biased towards waking an idle CPU
first, which in turn consumes power as the CPU leaves the idle state.
For systems supporting Turbo frequencies, the power budget is fixed, and to
stay within this budget the system may throttle the frequency.

So the idea is: if we can pack the jitter tasks on already running cores, then
we can avoid waking up new cores and save power, thereby sustaining Turbo for
a longer duration.


* Re: [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface
  2019-05-15 16:29   ` Peter Zijlstra
@ 2019-05-16 16:12     ` Parth Shah
  0 siblings, 0 replies; 18+ messages in thread
From: Parth Shah @ 2019-05-16 16:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies



On 5/15/19 9:59 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:17PM +0530, Parth Shah wrote:
> 
>> Subject: [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface
> 
> How can this be v2 ?! I've never seen v1.
> 

Actually, I sent out v1 on the linux-pm@vger.kernel.org mailing list. The
patch set was then refined and re-organized to get comments from a larger
audience. You can find v1 at https://lwn.net/Articles/783959/

>> Jitter tasks are usually of less important in terms of performance
>> and are short/bursty in characteristics. TurboSched uses this jitter
>> classification to pack jitters into the already running busy cores to
>> keep the total idle core count high.
>>
>> The patch describes the use of UCLAMP mechanism to classify tasks. Patrick
>> Bellasi came up with a mechanism to classify tasks from the userspace
>> https://lore.kernel.org/lkml/20190402104153.25404-1-patrick.bellasi@arm.com/
> 
> The canonical form is:
> 
> 	https://lkml.kernel.org/r/$MSGID
> 

Thanks for pointing that out. I will use the above form from now on.

>> This UCLAMP mechanism can be useful in classifying tasks as jitter. Jitters
>> can be classified for the cgroup by keeping util.max of the tasks as the
>> least(=0). This also provides benefit of giving the least frequency to
>> those jitter tasks, which is useful if all jitters are packed onto a
>> separate core.
>>
>> Use Case with UCLAMP
>> ===================
>> To create a cgroup with all the tasks classified as jitters;
>>
>> ```
>> mkdir -p /sys/fs/cgroup/cpu/jitter
>> echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
>> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
>> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
>> i=8;
>> ./turbo_bench -t 30 -h $i -n $i &
>> ./turbo_bench -t 30 -h 0 -n $i &
>> echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs;
>> ```
>>
>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>> ---
>>  kernel/sched/core.c  | 9 +++++++++
>>  kernel/sched/sched.h | 1 +
>>  2 files changed, 10 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d42c0f5eefa9..77aa4aee4478 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7192,6 +7192,15 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>>  	tg->uclamp_req[UCLAMP_MAX].value = max_value;
>>  	tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
>>  
>> +	/*
>> +	 * Classify the tasks belonging to the last bucket of MAX UCLAMP as
>> +	 * jitters
>> +	 */
>> +	if (uclamp_bucket_id(max_value) == 0)
>> +		tg->turbo_sched_enabled = 1;
>> +	else if (tg->turbo_sched_enabled)
>> +		tg->turbo_sched_enabled = 0;
>> +
>>  	/* Update effective clamps to track the most restrictive value */
>>  	cpu_util_update_eff(css, UCLAMP_MAX);
>>  
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index b4019012d84b..e75ffaf3ff34 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -407,6 +407,7 @@ struct task_group {
>>  	struct uclamp_se	uclamp[UCLAMP_CNT];
>>  #endif
>>  
>> +	bool			turbo_sched_enabled;
>>  };
> 
> Your simple patch has 3 problems:
> 
>  - it limits itself; for no apparent reason; to the cgroup interface.

Maybe I can add other interfaces, such as a syscall, to allow per-entity
classification.

> 
>  - it is inconsistent in the terminology; pick either jitter or
>    turbo-sched, and I think the latter is a horrid name, it wants to be
>    'pack' or something similar. Also, jitter really doesn't make sense
>    given the classification.
> 

Yes, I will be happy to rename any variables/functions to improve readability.

>  - you use '_Bool' in a composite type.
> 

Maybe I can switch to int.


Thanks,
Parth


* Re: [RFCv2 2/6] sched: Introduce switch to enable TurboSched mode
  2019-05-15 16:30   ` Peter Zijlstra
@ 2019-05-16 16:15     ` Parth Shah
  0 siblings, 0 replies; 18+ messages in thread
From: Parth Shah @ 2019-05-16 16:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies



On 5/15/19 10:00 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:18PM +0530, Parth Shah wrote:
>> +void turbo_sched_get(void)
>> +{
>> +	spin_lock(&turbo_sched_lock);
>> +	if (!turbo_sched_count++)
>> +		static_branch_enable(&__turbo_sched_enabled);
>> +	spin_unlock(&turbo_sched_lock);
>> +}
> 
> Muwhahaha, you didn't test this code, did you?
> 

Yeah, I didn't see the task-sleep problem coming.
I will change to a mutex.
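
Something roughly along these lines (a sketch only; static_branch_enable()
can sleep, so the refcount must not be updated under a spinlock):
```
static DEFINE_MUTEX(turbo_sched_mutex);
static int turbo_sched_count;

void turbo_sched_get(void)
{
	mutex_lock(&turbo_sched_mutex);
	if (!turbo_sched_count++)
		static_branch_enable(&__turbo_sched_enabled);
	mutex_unlock(&turbo_sched_mutex);
}

void turbo_sched_put(void)
{
	mutex_lock(&turbo_sched_mutex);
	if (!--turbo_sched_count)
		static_branch_disable(&__turbo_sched_enabled);
	mutex_unlock(&turbo_sched_mutex);
}
```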

Thanks for pointing out.


* Re: [RFCv2 6/6] sched/fair: Bound non idle core search by DIE domain
  2019-05-15 16:44   ` Peter Zijlstra
@ 2019-05-16 16:26     ` Parth Shah
  0 siblings, 0 replies; 18+ messages in thread
From: Parth Shah @ 2019-05-16 16:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-pm, mingo, dietmar.eggemann, dsmythies



On 5/15/19 10:14 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:22PM +0530, Parth Shah wrote:
>> This patch specifies the sched domain to search for a non idle core.
>>
>> The select_non_idle_core searches for the non idle cores across whole
>> system. But in the systems with multiple NUMA domains, the Turbo frequency
>> can be sustained within the NUMA domain without being affected from other
>> NUMA.
>>
>> This patch provides an architecture specific implementation for defining
>> the turbo domain to make searching of the core to be bound within the NUMA.
> 
> NAK, this is insane. You don't need arch hooks to find the numa domain.
> 

The aim here is to limit the search for non-idle cores to within a NUMA node
(or the DIE sched-domain), because some systems can sustain the Turbo
frequency by packing tasks inside a NUMA node. Hence the turbo domain for
them should be DIE.

Since not all systems have a DIE domain, adding arch hooks allows each arch
to override the turbo domain within which task packing is allowed.

Thanks

