linux-kernel.vger.kernel.org archive mirror
* [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
@ 2020-01-21  6:33 Parth Shah
  2020-01-21  6:33 ` [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing Parth Shah
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Parth Shah @ 2020-01-21  6:33 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret,
	tim.c.chen

This is the 6th version of the patch series to sustain Turbo frequencies
for longer durations.

The previous versions can be found here:
v5: https://lkml.org/lkml/2019/10/7/118
v4: https://lkml.org/lkml/2019/7/25/296
v3: https://lkml.org/lkml/2019/6/25/25
v2: https://lkml.org/lkml/2019/5/15/1258
v1: https://lwn.net/Articles/783959/

The changes in this version are:
v5 -> v6:
- Addressed comments from Vincent Guittot and Hillf Danton
- Rebased the series on top of the latency_nice patch series posted at
  https://lkml.org/lkml/2020/1/16/319. This allows the use cases in [1] to
  use the latency_nice framework for classifying small background tasks
  from userspace.
ToDo:
- Add documentation for TurboSched, including possible regressions, as per
  the comment from Pavel Machek

v4 -> v5:
- Removed the core capacity calculation for finding a non-idle core
- Use idle_cpu() and cpu_overutilized() to find the core for task packing
- This changes the functionality a bit; updated results for the POWER9
  system
- Renamed the ambiguous term "jitter" to "small background" tasks

v3 -> v4:
- Based on Patrick Bellasi's comments, removed the UCLAMP-based mechanism
  for classifying tasks as jitter
- Added support in sched_setattr to mark a task as jitter by adding a new
  flag to the existing task_struct->flags attribute. This was done to avoid
  adding any new variable inside task_struct and thus prevent size bloat.
- No functional changes

v2 -> v3:
- Added a new attribute in task_struct to allow per-task jitter
  classification so that the scheduler can use it as a request to change
  the wakeup path for task packing
- Use a syscall for jitter classification; removed cgroup-based task
  classification
- Use a mutex instead of a spinlock to get rid of the task-sleeping problem
- Changed _Bool -> int everywhere
- Split a few patches to keep arch-specific code separate from core
  scheduler code

v1 -> v2:
- No classification of CPU-bound tasks; only jitter tasks are classified
  from the cpu cgroup controller
- Use a spinlock rather than a mutex to count the number of jitter tasks
  classified from cgroups
- The architecture-specific core capacity multiplication factor changes
  dynamically based on the number of active threads in the core
- Selection of a non-idle core in the system is bounded by the DIE domain
- Use the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask"; instead use the sd for the DIE domain to
  find a better fit



Abstract
========

Modern servers allow multiple cores to run at frequencies higher than the
rated frequency range, but the power budget of the system inhibits
sustaining these higher frequencies for longer durations.

However, when certain cores are put into idle states, the power can be
effectively channelled to the other busy cores, allowing them to sustain
higher frequencies.

One way to achieve this is to pack tasks onto fewer cores and keep the
others idle, but this may impose a performance penalty on the packed tasks,
in which case sustaining higher frequencies brings no benefit. However, if
one can identify unimportant low-utilization tasks which can be packed onto
the already active cores, then waking up new cores can be avoided. Such
tasks are short and/or bursty "background tasks", for which waking up a new
core is expensive.

The current CFS algorithm in the kernel scheduler is performance oriented
and hence tries to assign an idle CPU first when waking up a task. This
policy works well for the majority of workloads, but for background tasks
one can save energy by packing them onto the already active cores and
allowing those cores to run at higher frequencies.

This patch-set tunes the task wake-up logic in the scheduler to pack only
explicitly classified background tasks onto busy cores. The classification
of such tasks is done through a syscall-based mechanism.

In brief, if we can pack such small background tasks onto busy cores, then
we can save power by keeping the other cores idle and allowing the busier
cores to run at turbo frequencies. The patch-set implements this in the
simplest manner by packing only tasks with latency_nice == 19 and
utilization <= 12.5%.


Implementation
==============

These patches use the latency_nice [3] syscall-based mechanism to classify
tasks as small background noise. The task wakeup logic uses this
information to pack such tasks onto cores which are already busy running
CPU-intensive tasks. Task packing is done only in `select_task_rq_fair` so
that, in case of a wrong decision, the load balancer may pull the
classified background tasks away to maximize performance.
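
For illustration, below is a minimal userspace sketch for classifying a
task as small background noise. The sched_attr layout, the
SCHED_FLAG_LATENCY_NICE flag and its value are assumptions taken from the
latency_nice proposal [3], not something defined by this patch-set, and may
differ from the final ABI:
```
/* Hedged sketch: struct layout and flag value follow the proposal in [3]. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr_ln {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* utilization clamps */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	/* proposed in [3] */
	int32_t  sched_latency_nice;
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value, see [3] */

int main(void)
{
	struct sched_attr_ln attr = {
		.size		    = sizeof(attr),
		.sched_policy	    = 0,	/* SCHED_NORMAL */
		.sched_flags	    = SCHED_FLAG_LATENCY_NICE,
		.sched_latency_nice = 19,	/* small background task */
	};

	/* pid 0 == calling task, flags == 0 */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}
```
With this series applied, such a task becomes a packing candidate once its
utilization also stays at or below 12.5%.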

We define a core to be non-idle if at least one of its CPUs has >12.5%
utilization and not more than one CPU is overutilized (>80% utilization);
the background tasks are packed onto such cores using a first-fit approach.

The 12.5% utilization value indicates that the CPU is sufficiently busy
not to enter deeper idle states (target_residency >= 10ms), so tasks can be
packed there.
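
As a quick reference, the same cutoffs expressed on the kernel's
0..SCHED_CAPACITY_SCALE (1024) utilization scale (illustration only; the
macro names below are made up and not part of the patches):
```
/* 12.5% and 80% of SCHED_CAPACITY_SCALE (1024) */
#define TS_BUSY_UTIL		(1024 >> 3)		/* 1024 / 8 = 128 */
#define TS_OVERUTIL_UTIL	(1024 * 80 / 100)	/* 819 */
```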

To demonstrate/benchmark this, the patches use turbo_bench, a synthetic
workload generator [2].

Following snippet demonstrates the use of TurboSched feature:
```
i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
```
This spawns 2*i threads in total: i CPU-bound threads and i low-utilization
threads.

The current implementation packs only tasks classified as small background
tasks onto the first busy cores found, but it can be further optimized by
taking userspace input about important tasks and keeping track of them.
This would make the search for non-idle cores faster and also more
accurate, since userspace hints are safer than automatic classification of
busy cores/tasks.


Result
======

The patch-set proves useful for systems and workloads where a frequency
boost is more beneficial than spreading tasks across cores. An IBM POWER9
system shows that the frequency benefit can be up to 18%, which translates
to a maximum workload benefit of up to 14%.

(higher is better)

                 Frequency benefit of TurboSched w.r.t. CFS               
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
20 +-+ + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + +-+
   |                    **                Frequency benefit in %         |
   |                    **                                               |
15 +-+                  **                                             +-+
   |              ****  **  **                                           |
   |            * ****  ******                                           |
10 +-+          * ****  ******                                         +-+
   |            * ****  ******                                           |
   |          * * ************   *                                       |
 5 +-+        * * ************ * *   **                                +-+
   |       ** * * ************ * *   ****                                |
 0 +-******** * * ************ * * ************ * * * ********** * * * **+
   |   **                                                                |
   |                                                                     |
-5 +-+                                                                 +-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
     0 1 2 3  4 5 6 7 8 91011 1213141516171819 2021222324252627 28293031  
                           No. of workload threads                        

                 Performance benefit of TurboSched w.r.t. CFS             
20 +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   |                                    Performance benefit in %         |
15 +-+                  **                                             +-+
   |                    **                                               |
   |                    ******                                           |
10 +-+                  ******                                         +-+
   |                **********                                           |
   |              ************                                           |
 5 +-+            ************   *     **                              +-+
   |              ************   *   ****                                |
   |            * ************ * *   ******  **                          |
 0 +-******** * * ************ * * ************ * * * ********** * * * **+
   |                                       **             **     *       |
   |                                                                     |
-5 +-+                                                                 +-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
     0 1 2 3  4 5 6 7 8 91011 1213141516171819 2021222324252627 28293031  
                           No. of workload threads                        
                                                                       

These numbers are from the `turbo_bench.c` multi-threaded benchmark, which
can create two kinds of tasks: CPU-bound (high utilization) and background
(low utilization). N on the X-axis represents N CPU-bound and N background
tasks spawned. The performance (operations per second) graph indicates that
the benefit with TurboSched can be up to 14% compared to the CFS task
placement strategy for such background-classified tasks.


Series organization
===================
- Patches 1-2: Small background tasks classification using syscall
- Patch   3  : Tune CFS task wakeup logic to pack tasks onto busy cores
- Patches 4-5: Change non-idle core search domain to LLC by default and
  	       provide arch hooks to change to NUMA for powerpc.

The series can be applied on top of the latency_nice attribute
introduction patches [3].


References
==========
[1]. Usecases for the per-task latency-nice attribute,
     https://lkml.org/lkml/2019/9/30/215
[2]. Test Benchmark: turbobench,
     https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c
[3]. Introduce per-task latency_nice for scheduler hints,
     https://lkml.org/lkml/2020/1/16/319


Parth Shah (5):
  sched: Introduce switch to enable TurboSched for task packing
  sched/core: Update turbo_sched count only when required
  sched/fair: Tune task wake-up logic to pack small background tasks on
    fewer cores
  sched/fair: Provide arch hook to find domain for non idle core search
    scan
  powerpc: Set turbo domain to NUMA node for task packing

 arch/powerpc/include/asm/topology.h |  3 +
 arch/powerpc/kernel/smp.c           |  7 +++
 kernel/sched/core.c                 | 37 +++++++++++
 kernel/sched/fair.c                 | 95 ++++++++++++++++++++++++++++-
 kernel/sched/sched.h                | 15 +++++
 5 files changed, 156 insertions(+), 1 deletion(-)

-- 
2.17.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing
  2020-01-21  6:33 [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
@ 2020-01-21  6:33 ` Parth Shah
  2020-01-22 21:37   ` Tim Chen
  2020-01-21  6:33 ` [RFC v6 2/5] sched/core: Update turbo_sched count only when required Parth Shah
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Parth Shah @ 2020-01-21  6:33 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret,
	tim.c.chen

Create a static key which allows the TurboSched feature to be enabled or
disabled at runtime.

This key is added in order to enable the TurboSched feature only when
required. This helps in optimizing the scheduler fast-path when the
TurboSched feature is disabled.

Also provide get/put methods to keep track of the tasks using the
TurboSched feature by refcounting classified background tasks. This allows
the feature to be enabled when the first task is classified as background
noise and, similarly, disabled when the classification is removed from the
last such task.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 25 +++++++++++++++++++++++++
 kernel/sched/sched.h | 12 ++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a9e5d157b1a5..dfbb52d66b29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,31 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+static DEFINE_MUTEX(turbo_sched_lock);
+static int turbo_sched_count;
+
+void turbo_sched_get(void)
+{
+	mutex_lock(&turbo_sched_lock);
+	if (!turbo_sched_count++)
+		static_branch_enable(&__turbo_sched_enabled);
+	mutex_unlock(&turbo_sched_lock);
+}
+
+void turbo_sched_put(void)
+{
+	mutex_lock(&turbo_sched_lock);
+	if (!--turbo_sched_count)
+		static_branch_disable(&__turbo_sched_enabled);
+	mutex_unlock(&turbo_sched_lock);
+}
+#else
+void turbo_sched_get(void) { return ; }
+void turbo_sched_get(void) { return ; }
+#endif
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index edae9277e48d..f841297b7d56 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2497,3 +2497,15 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+void turbo_sched_get(void);
+void turbo_sched_put(void);
+
+#ifdef CONFIG_SCHED_SMT
+DECLARE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+
+static inline bool is_turbosched_enabled(void)
+{
+	return static_branch_unlikely(&__turbo_sched_enabled);
+}
+#endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC v6 2/5] sched/core: Update turbo_sched count only when required
  2020-01-21  6:33 [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
  2020-01-21  6:33 ` [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing Parth Shah
@ 2020-01-21  6:33 ` Parth Shah
  2020-01-21  6:33 ` [RFC v6 3/5] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Parth Shah @ 2020-01-21  6:33 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret,
	tim.c.chen

Use the get/put methods to add/remove the use of TurboSched support, such
that the feature is turned on only in the presence of at least one task
classified as a small background task.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 9 +++++++++
 kernel/sched/sched.h | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dfbb52d66b29..629c2589d727 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3272,6 +3272,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		mmdrop(mm);
 	}
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (unlikely(is_bg_task(prev)))
+			turbo_sched_put();
+
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
@@ -4800,6 +4803,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	int reset_on_fork;
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
 	struct rq *rq;
+	bool attr_leniency = bgtask_latency(attr->sched_latency_nice);
+
 
 	/* The pi code expects interrupts enabled */
 	BUG_ON(pi && in_interrupt());
@@ -5024,6 +5029,10 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	prev_class = p->sched_class;
 
+	/* Refcount tasks classified as a small background task */
+	if (is_bg_task(p) != attr_leniency)
+		attr_leniency ? turbo_sched_get() : turbo_sched_put();
+
 	__setscheduler(rq, p, attr, pi);
 	__setscheduler_uclamp(p, attr);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f841297b7d56..0a00e16e033a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2498,6 +2498,9 @@ static inline void membarrier_switch_mm(struct rq *rq,
 }
 #endif
 
+#define bgtask_latency(lat)	((lat) == MAX_LATENCY_NICE)
+#define is_bg_task(p)		(bgtask_latency((p)->latency_nice))
+
 void turbo_sched_get(void);
 void turbo_sched_put(void);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC v6 3/5] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2020-01-21  6:33 [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
  2020-01-21  6:33 ` [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing Parth Shah
  2020-01-21  6:33 ` [RFC v6 2/5] sched/core: Update turbo_sched count only when required Parth Shah
@ 2020-01-21  6:33 ` Parth Shah
  2020-01-21  6:33 ` [RFC v6 4/5] sched/fair: Provide arch hook to find domain for non idle core search scan Parth Shah
  2020-01-21  6:33 ` [RFC v6 5/5] powerpc: Set turbo domain to NUMA node for task packing Parth Shah
  4 siblings, 0 replies; 8+ messages in thread
From: Parth Shah @ 2020-01-21  6:33 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret,
	tim.c.chen

The algorithm finds the first non-idle core in the system and tries to
place the task on an idle CPU of the chosen core. To maintain cache
hotness, the search for a non-idle core starts from prev_cpu, which also
reduces task ping-pong behaviour inside the core.

Define a new method, select_non_idle_core, which keeps track of the idle
and non-idle CPUs in a core and, based on heuristics, determines whether
the core is sufficiently busy to place the waking background task on. The
heuristic further classifies a non-idle CPU as either busy (>12.5% util) or
overutilized (>80% util).
- A core with more idle CPUs and no busy CPUs is not selected for packing
- A core with more than one overutilized CPU is exempted from task packing
- Pack if there is at least one busy CPU and the overutilized CPU count
  is < 2

A 12.5% utilization value for a busy CPU is a sufficient heuristic that the
CPU is doing enough work and will not become idle in the near future.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c |  3 ++
 kernel/sched/fair.c | 87 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 629c2589d727..a34a5589ae16 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6617,6 +6617,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 void __init sched_init(void)
 {
@@ -6657,6 +6658,8 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d170b5da0e3..8643e6309451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5379,6 +5379,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+/* A cpumask to find active cores in the system. */
+DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 
@@ -5883,6 +5885,81 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+#ifdef CONFIG_SCHED_SMT
+
+/* Define non-idle CPU as the one with the utilization >= 12.5% */
+#define merely_used_cpu(util) ((cpu_util(util)) > (100 >> 3))
+
+/*
+ * Classify small background tasks with higher latency_nice value for task
+ * packing.
+ */
+static inline bool is_small_bg_task(struct task_struct *p)
+{
+	if (is_bg_task(p) && (task_util(p) > (SCHED_CAPACITY_SCALE >> 3)))
+		return true;
+
+	return false;
+}
+
+/*
+ * Try to find a non idle core in the system  based on few heuristics:
+ * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
+ * - If none CPUs are busy then do not select the core for task packing
+ * - If atleast one CPU is busy then do task packing unless overutilized CPUs
+ *   count is < busy/2 CPU count
+ * - Always select idle CPU for task packing
+ */
+static int select_non_idle_core(struct task_struct *p, int prev_cpu)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
+	int iter_cpu, sibling;
+
+	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+
+	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
+		int idle_cpu_count = 0, non_idle_cpu_count = 0;
+		int overutil_cpu_count = 0;
+		int busy_cpu_count = 0;
+		int best_cpu = iter_cpu;
+
+		for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
+			__cpumask_clear_cpu(sibling, cpus);
+			if (idle_cpu(sibling)) {
+				idle_cpu_count++;
+				best_cpu = sibling;
+			} else {
+				non_idle_cpu_count++;
+				if (cpu_overutilized(sibling))
+					overutil_cpu_count++;
+				if (merely_used_cpu(sibling))
+					busy_cpu_count++;
+			}
+		}
+
+		/*
+		 * Pack tasks to this core if
+		 * 1. Idle CPU count is higher and atleast one is busy
+		 * 2. If idle_cpu_count < non_idle_cpu_count then ideally do
+		 * packing but if there are more CPUs overutilized then don't
+		 * overload it.
+		 */
+		if (idle_cpu_count > non_idle_cpu_count) {
+			if (busy_cpu_count)
+				return best_cpu;
+		} else {
+			/*
+			 * Pack tasks if at max 1 CPU is overutilized
+			 */
+			if (overutil_cpu_count < 2)
+				return best_cpu;
+		}
+	}
+
+	return -1;
+}
+#endif /* CONFIG_SCHED_SMT */
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -6367,6 +6444,15 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}
 
+#ifdef CONFIG_SCHED_SMT
+		if (is_turbosched_enabled() && unlikely(is_small_bg_task(p))) {
+			new_cpu = select_non_idle_core(p, prev_cpu);
+			if (new_cpu >= 0)
+				return new_cpu;
+			new_cpu = prev_cpu;
+		}
+#endif
+
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
 			      cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
@@ -6400,7 +6486,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
-
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 		if (want_affine)
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC v6 4/5] sched/fair: Provide arch hook to find domain for non idle core search scan
  2020-01-21  6:33 [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (2 preceding siblings ...)
  2020-01-21  6:33 ` [RFC v6 3/5] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
@ 2020-01-21  6:33 ` Parth Shah
  2020-01-21  6:33 ` [RFC v6 5/5] powerpc: Set turbo domain to NUMA node for task packing Parth Shah
  4 siblings, 0 replies; 8+ messages in thread
From: Parth Shah @ 2020-01-21  6:33 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret,
	tim.c.chen

Specify a method which returns the cpumask within which to limit the
search for a non-idle core. By default, limit the search to the LLC domain,
which usually includes some or all of the cores in the processor chip.

select_non_idle_core searches for non-idle cores in the LLC domain. But on
systems with multiple NUMA domains, the turbo frequency can be sustained
within a NUMA domain without being affected by other NUMA domains. For such
cases, arch_turbo_domain can be overridden to change the domain used for
the non-idle core search.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8643e6309451..af19e1f9d56d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5890,6 +5890,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 /* Define non-idle CPU as the one with the utilization >= 12.5% */
 #define merely_used_cpu(util) ((cpu_util(util)) > (100 >> 3))
 
+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+	return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
 /*
  * Classify small background tasks with higher latency_nice value for task
  * packing.
@@ -5916,6 +5923,7 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu)
 	int iter_cpu, sibling;
 
 	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+	cpumask_and(cpus, cpus, arch_turbo_domain(prev_cpu));
 
 	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
 		int idle_cpu_count = 0, non_idle_cpu_count = 0;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC v6 5/5] powerpc: Set turbo domain to NUMA node for task packing
  2020-01-21  6:33 [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (3 preceding siblings ...)
  2020-01-21  6:33 ` [RFC v6 4/5] sched/fair: Provide arch hook to find domain for non idle core search scan Parth Shah
@ 2020-01-21  6:33 ` Parth Shah
  4 siblings, 0 replies; 8+ messages in thread
From: Parth Shah @ 2020-01-21  6:33 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret,
	tim.c.chen

Provide a powerpc architecture-specific implementation for defining the
turbo domain, so that the search for a core is bound within the NUMA node.

POWER9 systems have a pair of cores in the LLC domain. Hence, to make
TurboSched more effective, widen the domain for the task-packing search to
the NUMA node.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 3 +++
 arch/powerpc/kernel/smp.c           | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..83adfb99f8ba 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -138,6 +138,9 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
+#define arch_turbo_domain		powerpc_turbo_domain
+
+struct cpumask *powerpc_turbo_domain(int cpu);
 
 int dlpar_cpu_readd(int cpu);
 #endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..0fc4443a3f27 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1169,6 +1169,13 @@ static void remove_cpu_from_masks(int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+inline struct cpumask *powerpc_turbo_domain(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+#endif
+
 static inline void add_cpu_to_smallcore_masks(int cpu)
 {
 	struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing
  2020-01-21  6:33 ` [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing Parth Shah
@ 2020-01-22 21:37   ` Tim Chen
  2020-01-23  6:35     ` Parth Shah
  0 siblings, 1 reply; 8+ messages in thread
From: Tim Chen @ 2020-01-22 21:37 UTC (permalink / raw)
  To: Parth Shah, linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret

On 1/20/20 10:33 PM, Parth Shah wrote:
> Create a static key which allows the TurboSched feature to be enabled or
> disabled at runtime.
> 
> This key is added in order to enable the TurboSched feature only when
> required. This helps in optimizing the scheduler fast-path when the
> TurboSched feature is disabled.
> 
> Also provide get/put methods to keep track of the tasks using the
> TurboSched feature by refcounting classified background tasks. This allows
> the feature to be enabled when the first task is classified as background
> noise and, similarly, disabled when the classification is removed from the
> last such task.
> 
> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> ---
>  kernel/sched/core.c  | 25 +++++++++++++++++++++++++
>  kernel/sched/sched.h | 12 ++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a9e5d157b1a5..dfbb52d66b29 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -73,6 +73,31 @@ __read_mostly int scheduler_running;
>   */
>  int sysctl_sched_rt_runtime = 950000;
>  
> +#ifdef CONFIG_SCHED_SMT
> +DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
> +static DEFINE_MUTEX(turbo_sched_lock);
> +static int turbo_sched_count;
> +
> +void turbo_sched_get(void)
> +{
> +	mutex_lock(&turbo_sched_lock);
> +	if (!turbo_sched_count++)
> +		static_branch_enable(&__turbo_sched_enabled);

If you use static_branch_inc(&__turbo_sched_enabled) and
static_branch_dec(&__turbo_sched_enabled),  you don't have
to define turbo_sched_count. And turbo_sched_lock is
also unnecessary as static_branch_inc/dec are atomic.
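
Something like this (untested sketch):

	void turbo_sched_get(void)
	{
		static_branch_inc(&__turbo_sched_enabled);
	}

	void turbo_sched_put(void)
	{
		static_branch_dec(&__turbo_sched_enabled);
	}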

> +	mutex_unlock(&turbo_sched_lock);
> +}
> +
> +void turbo_sched_put(void)
> +{
> +	mutex_lock(&turbo_sched_lock);
> +	if (!--turbo_sched_count)
> +		static_branch_disable(&__turbo_sched_enabled);
> +	mutex_unlock(&turbo_sched_lock);
> +}
> +#else
> +void turbo_sched_get(void) { return ; }
> +void turbo_sched_get(void) { return ; }

Double definition of turbo_sched_get.
You probably meant turbo_sched_put in the second definition.

Tim

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing
  2020-01-22 21:37   ` Tim Chen
@ 2020-01-23  6:35     ` Parth Shah
  0 siblings, 0 replies; 8+ messages in thread
From: Parth Shah @ 2020-01-23  6:35 UTC (permalink / raw)
  To: Tim Chen, linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies, qperret



On 1/23/20 3:07 AM, Tim Chen wrote:
> On 1/20/20 10:33 PM, Parth Shah wrote:
>> Create a static key which allows the TurboSched feature to be enabled or
>> disabled at runtime.
>>
>> This key is added in order to enable the TurboSched feature only when
>> required. This helps in optimizing the scheduler fast-path when the
>> TurboSched feature is disabled.
>>
>> Also provide get/put methods to keep track of the tasks using the
>> TurboSched feature by refcounting classified background tasks. This allows
>> the feature to be enabled when the first task is classified as background
>> noise and, similarly, disabled when the classification is removed from the
>> last such task.
>>
>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>> ---
>>  kernel/sched/core.c  | 25 +++++++++++++++++++++++++
>>  kernel/sched/sched.h | 12 ++++++++++++
>>  2 files changed, 37 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index a9e5d157b1a5..dfbb52d66b29 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -73,6 +73,31 @@ __read_mostly int scheduler_running;
>>   */
>>  int sysctl_sched_rt_runtime = 950000;
>>  
>> +#ifdef CONFIG_SCHED_SMT
>> +DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
>> +static DEFINE_MUTEX(turbo_sched_lock);
>> +static int turbo_sched_count;
>> +
>> +void turbo_sched_get(void)
>> +{
>> +	mutex_lock(&turbo_sched_lock);
>> +	if (!turbo_sched_count++)
>> +		static_branch_enable(&__turbo_sched_enabled);
> 
> If you use static_branch_inc(&__turbo_sched_enabled) and
> static_branch_dec(&__turbo_sched_enabled),  you don't have
> to define turbo_sched_count. And turbo_sched_lock is
> also unnecessary as static_branch_inc/dec are atomic.
> 

That's a good suggestion. I will make those changes in the next version.

>> +	mutex_unlock(&turbo_sched_lock);
>> +}
>> +
>> +void turbo_sched_put(void)
>> +{
>> +	mutex_lock(&turbo_sched_lock);
>> +	if (!--turbo_sched_count)
>> +		static_branch_disable(&__turbo_sched_enabled);
>> +	mutex_unlock(&turbo_sched_lock);
>> +}
>> +#else
>> +void turbo_sched_get(void) { return ; }
>> +void turbo_sched_get(void) { return ; }
> 
> Double definition of turbo_sched_get.
> You probably meant turbo_sched_put in the second definition.

yes, my bad. I meant turbo_sched_put() instead.


Thanks,
Parth

> 
> Tim
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-01-23  6:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-21  6:33 [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
2020-01-21  6:33 ` [RFC v6 1/5] sched: Introduce switch to enable TurboSched for task packing Parth Shah
2020-01-22 21:37   ` Tim Chen
2020-01-23  6:35     ` Parth Shah
2020-01-21  6:33 ` [RFC v6 2/5] sched/core: Update turbo_sched count only when required Parth Shah
2020-01-21  6:33 ` [RFC v6 3/5] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
2020-01-21  6:33 ` [RFC v6 4/5] sched/fair: Provide arch hook to find domain for non idle core search scan Parth Shah
2020-01-21  6:33 ` [RFC v6 5/5] powerpc: Set turbo domain to NUMA node for task packing Parth Shah
