* [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
@ 2019-10-07  8:30 Parth Shah
  2019-10-07  8:30 ` [RFC v5 1/6] sched/core: Add manual background task classification using sched_setattr syscall Parth Shah
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

This is the 5th version of the patchset to sustain Turbo frequencies for
longer durations.

The previous versions can be found here:
v4: https://lkml.org/lkml/2019/7/25/296
v3: https://lkml.org/lkml/2019/6/25/25
v2: https://lkml.org/lkml/2019/5/15/1258
v1: https://lwn.net/Articles/783959/

The changes in this version are:
v[4] -> v[5]:
- Remove core capacity calculation for finding a non-idle core
- Use idle_cpu() and cpu_overutilized() to find the core for task packing
- This changes the functionality a bit. Updated results for a POWER9 system
- Renamed the ambiguous term "jitter" to "small background" tasks
ToDo:
- Hopefully, once an attribute like per-task latency-nice [1] is added,
  TurboSched can leverage it as a means of background task classification
- Add documentation for TurboSched, including possible regressions, as per
  the comment from Pavel Machek

v[3] -> v[4]:
- Based on Patrick Bellasi's comments, removed the use of the UCLAMP-based
  mechanism to classify tasks as jitter
- Added support to sched_setattr to mark a task as jitter by adding a new
  flag to the existing task_struct->flags attribute. This was done to avoid
  adding any new variable inside task_struct and thus prevent size bloat.
- No functional changes

v[2] -> v[3]:
- Added a new attribute in task_struct to allow per-task jitter
  classification so that the scheduler can use this as a request to change
  the wakeup path for task packing
- Use a syscall for jitter classification; removed cgroup-based task
  classification
- Use a mutex instead of a spinlock to get rid of the task sleeping problem
- Changed _Bool->int everywhere
- Split a few patches to keep arch-specific code separate from core
  scheduler code

v[1] -> v[2]:
- No classification of CPU-bound tasks; only jitter tasks are classified
  from the cpu cgroup controller
- Use of a spinlock rather than a mutex to count the number of jitter tasks
  in the system classified from the cgroup
- Architecture-specific implementation of the core capacity multiplication
  factor, which changes dynamically based on the number of active threads
  in the core
- Selection of a non-idle core in the system is bounded by the DIE domain
- Use of the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask" and instead use the sd for the DIE domain to
  find a better fit



Abstract
========

Modern servers allow multiple cores to run at a range of frequencies higher
than the rated range. But the power budget of the system inhibits
sustaining these higher frequencies for longer durations.

However, when certain cores are put into idle states, the power can be
effectively channelled to the other busy cores, allowing them to sustain
higher frequencies.

One way to achieve this is to pack tasks onto fewer cores and keep the
others idle, but that may penalize the packed tasks, in which case
sustaining higher frequencies brings no benefit. However, if one can
identify unimportant, low-utilization tasks which can be packed onto the
already active cores, then waking up new cores can be avoided. Such tasks
are short and/or bursty "background tasks", for which waking up a new core
is expensive.

The current CFS algorithm in the kernel scheduler is performance oriented
and hence tries to assign an idle CPU first when a task wakes up. This
policy is ideal for the majority of workloads, but for background tasks one
can save energy by packing them onto the already active cores and allowing
those cores to run at higher frequencies.

This patch-set tunes the task wake-up logic in the scheduler to pack
explicitly classified background tasks onto busy cores. Such tasks are
classified through a syscall-based mechanism.

In brief, if we can pack such small background tasks on busy cores, then we
can save power by keeping the other cores idle and allow the busier cores
to run at turbo frequencies. The patch-set tries to achieve this in the
simplest manner. There are still some challenges in implementing it (like
smt_capacity and unneeded arch hooks) that I am trying to work around, so
it would be great to have a discussion around these patches.


Implementation
==============

These patches use a syscall-based mechanism to classify tasks as small
background noise. The task wake-up logic uses this information to pack such
tasks onto cores which are already busy running CPU-intensive tasks. Task
packing is done only in `select_task_rq_fair` so that, in case of a wrong
decision, the load balancer may pull the classified background tasks away
to maximize performance.
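
For illustration only, below is a minimal, hypothetical userspace sketch
(not part of the series) showing how a thread could classify itself through
the sched_setattr syscall. struct sched_attr is declared locally since libc
does not expose it, and the SCHED_FLAG_TASK_PACKING value is assumed to
match the one introduced in patch 1/6:
```c
/*
 * Hypothetical sketch: the calling thread marks itself as a small
 * background task so the scheduler may pack it onto an already busy core.
 * Only meaningful on a kernel with this series applied.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_FLAG_TASK_PACKING	0x80	/* mirrors patch 1/6 */

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;				/* SCHED_OTHER */
	attr.sched_flags = SCHED_FLAG_TASK_PACKING;	/* classify as background */

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* ... run the housekeeping/background work here ... */
	return 0;
}
```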

We define a core to be non-idle if any of its CPUs has >12.5% utilization
and not more than one CPU is overutilized (>80% utilization); the
background tasks are packed onto these cores using a first-fit approach.

The 12.5% utilization value indicates that the CPU is sufficiently busy not
to go into deeper idle states (target_residency >= 10ms), so tasks can be
packed there.
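
The following is an illustrative standalone sketch (not kernel code) of
this first-fit selection rule, assuming the per-core busy/overutilized CPU
counts have already been gathered:
```c
/*
 * Illustration of the first-fit core selection described above: a core
 * qualifies for packing when at least one of its CPUs is busy (>12.5%
 * util) and at most one CPU is overutilized (>80% util). The per-core
 * counts below are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

struct core_stat {
	int nr_busy;		/* CPUs above 12.5% utilization */
	int nr_overutilized;	/* CPUs above 80% utilization */
};

static bool can_pack(const struct core_stat *c)
{
	return c->nr_busy > 0 && c->nr_overutilized <= 1;
}

int main(void)
{
	struct core_stat cores[] = {
		{ 0, 0 },	/* fully idle core: leave it in deep idle */
		{ 3, 2 },	/* two CPUs overutilized: do not overload */
		{ 1, 0 },	/* busy but not overloaded: first fit */
	};
	int n = sizeof(cores) / sizeof(cores[0]);

	for (int i = 0; i < n; i++) {
		if (can_pack(&cores[i])) {
			printf("pack the background task on core %d\n", i);
			return 0;
		}
	}
	printf("no suitable core; fall back to the normal idle search\n");
	return 0;
}
```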

To demonstrate/benchmark this, the patches use turbo_bench, a synthetic
workload generator [2].

The following snippet demonstrates the use of the TurboSched feature:
```
i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
```
This spawns 2*i threads in total: i CPU-bound and i low-utilization threads.

The current implementation packs only the classified small background tasks
onto the first busy cores found, but it can be further optimized by taking
userspace input about important tasks and keeping track of them. This would
make the search for non-idle cores faster and also more accurate, since
userspace hints are safer than automatically classified busy cores/tasks.


Result
======

The patch-set proves to be useful for systems and workloads where the
frequency boost outweighs the cost of packing tasks onto fewer cores. An
IBM POWER9 system shows that the benefit for a workload can be up to 15%.

                 Frequency benefit of TurboSched w.r.t. CFS
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
20 +-+ + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + +-+
   |                                      Frequency benefit in %         |
   |                  **                                                 |
15 +-+                **    **                                         +-+
   |              ************                                           |
   |       **     ************ *                                         |
10 +-+     ** *   ************ *                                       +-+
   |       ** * * ************ * *                                       |
   |     **** * * ************ * * **                                    |
 5 +-+   **** * * ************ * * ******                              +-+
   | **  **** * * ************ * * ********                              |
 0 +-******** * * ************ * * ************ * * * ********** * * * **+
   |                                                                     |
   |                                                                     |
-5 +-+                                                                 +-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
     1 2 3 4  5 6 7 8 9101112 1314151617181920 2122232425262728 29303132  
                           No. of workload threads                        

                                                                          
                Performance (OPS) benefit of TurboSched w.r.t. CFS       
20 +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   |                                    Performance benefit in %         |
15 +-+                      **                                         +-+
   |                  **  **** *                                         |
   |                  **  **** *   **                                    |
10 +-+                ******** * * **                                  +-+
   |                ********** * * **                                    |
   |                ********** * * ******                                |
 5 +-+            ************ * * ******                              +-+
   |              ************ * * **********                            |
   |            * ************ * * ************                          |
 0 +-******** * * ************ * * ************ * * * ********** * * * **+
   |                                              * * ********** * * * **|
   |                                                                   **|
-5 +-+                                                                 +-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
     1 2 3 4  5 6 7 8 9101112 1314151617181920 2122232425262728 29303132  
                           No. of workload threads                        
 

These numbers are for the `turbo_bench.c` multi-threaded benchmark, which
can create two kinds of tasks: CPU bound (high utilization) and background
(low utilization). N on the X-axis represents N CPU-bound plus N background
tasks spawned. The performance (operations per second) graph indicates that
the benefit with TurboSched can be up to 14% compared to the CFS task
placement strategy for such background-classified tasks.


Series organization
===================
- Patches [01-03]: Small background task classification using a syscall
- Patch   [04]   : Tune CFS task wake-up logic to pack tasks onto busy cores
- Patches [05-06]: Change the non-idle core search domain to LLC by default
                   and provide arch hooks to change it to NUMA for powerpc.

The series can be applied on top of tip/sched/core at
commit af24bde8df20 ("sched/uclamp: Add uclamp support to energy_compute()")

References
==========
[1]. https://lkml.org/lkml/2019/9/30/215
[2]. https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c


Parth Shah (6):
  sched/core: Add manual background task classification using
    sched_setattr syscall
  sched: Introduce switch to enable TurboSched for task packing
  sched/core: Update turbo_sched count only when required
  sched/fair: Tune task wake-up logic to pack small background tasks on
    fewer cores
  sched/fair: Provide arch hook to find domain for non idle core search
    scan
  powerpc: Set turbo domain to NUMA node for task packing

 arch/powerpc/include/asm/topology.h |   3 +
 arch/powerpc/kernel/smp.c           |   7 ++
 include/linux/sched.h               |   1 +
 include/uapi/linux/sched.h          |   4 +-
 kernel/sched/core.c                 |  39 +++++++++++
 kernel/sched/fair.c                 | 104 +++++++++++++++++++++++++++-
 kernel/sched/sched.h                |   9 +++
 7 files changed, 165 insertions(+), 2 deletions(-)

-- 
2.17.1



* [RFC v5 1/6] sched/core: Add manual background task classification using sched_setattr syscall
  2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
@ 2019-10-07  8:30 ` Parth Shah
  2019-10-07  8:30 ` [RFC v5 2/6] sched: Introduce switch to enable TurboSched for task packing Parth Shah
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

Small background tasks typically perform some housekeeping work and are
less important in the overall scheme of load balancing and scheduling.

So provide a way to mark tasks which are small background noise by adding
an additional flag to the existing task attributes. Also provide a
userspace interface which uses the sched_setattr syscall to mark such
tasks.

The scheduler may use this as a hint to pack such tasks onto fewer cores.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 include/linux/sched.h      | 1 +
 include/uapi/linux/sched.h | 4 +++-
 kernel/sched/core.c        | 9 +++++++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1113dd4706ae..e03b85166e34 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1463,6 +1463,7 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_CAN_BE_PACKED	0x20000000	/* Provide hints to the scheduler to pack such tasks */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 617bb59aa8ba..fccb1c57d037 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -55,6 +55,7 @@
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_TASK_PACKING		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -66,6 +67,7 @@
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_TASK_PACKING)
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..e7cda4aa8696 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4498,6 +4498,8 @@ static void __setscheduler_params(struct task_struct *p,
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
+	if (attr->sched_flags & SCHED_FLAG_TASK_PACKING)
+		p->flags |= PF_CAN_BE_PACKED;
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
@@ -4557,6 +4559,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	struct rq_flags rf;
 	int reset_on_fork;
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+	unsigned long long task_packing_flag =
+				attr->sched_flags & SCHED_FLAG_TASK_PACKING;
 	struct rq *rq;
 
 	/* The pi code expects interrupts enabled */
@@ -4686,6 +4690,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (task_packing_flag)
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		task_rq_unlock(rq, p, &rf);
@@ -5181,6 +5187,9 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	attr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
 
+	if (p->flags & PF_CAN_BE_PACKED)
+		attr.sched_flags |= SCHED_FLAG_TASK_PACKING;
+
 	rcu_read_unlock();
 
 	retval = sched_read_attr(uattr, &attr, size);
-- 
2.17.1



* [RFC v5 2/6] sched: Introduce switch to enable TurboSched for task packing
  2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
  2019-10-07  8:30 ` [RFC v5 1/6] sched/core: Add manual background task classification using sched_setattr syscall Parth Shah
@ 2019-10-07  8:30 ` Parth Shah
  2019-10-07  8:30 ` [RFC v5 3/6] sched/core: Update turbo_sched count only when required Parth Shah
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

Create a static key which allows the TurboSched feature to be enabled or
disabled at runtime.

This key is added in order to enable the TurboSched feature only when
required. This helps in optimizing the scheduler fast path when the
TurboSched feature is disabled.

Also provide get/put methods to keep track of the tasks using the
TurboSched feature by refcounting the classified background tasks. This
allows the feature to be enabled when the first task is classified as
background noise and, similarly, disabled when the last such task is
unclassified.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 20 ++++++++++++++++++++
 kernel/sched/sched.h |  9 +++++++++
 2 files changed, 29 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7cda4aa8696..ee5980b4e150 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,26 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+static DEFINE_MUTEX(turbo_sched_lock);
+static int turbo_sched_count;
+
+void turbo_sched_get(void)
+{
+	mutex_lock(&turbo_sched_lock);
+	if (!turbo_sched_count++)
+		static_branch_enable(&__turbo_sched_enabled);
+	mutex_unlock(&turbo_sched_lock);
+}
+
+void turbo_sched_put(void)
+{
+	mutex_lock(&turbo_sched_lock);
+	if (!--turbo_sched_count)
+		static_branch_disable(&__turbo_sched_enabled);
+	mutex_unlock(&turbo_sched_lock);
+}
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..4a0b90ea8652 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2423,3 +2423,12 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+void turbo_sched_get(void);
+void turbo_sched_put(void);
+DECLARE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+
+static inline bool is_turbosched_enabled(void)
+{
+	return static_branch_unlikely(&__turbo_sched_enabled);
+}
-- 
2.17.1



* [RFC v5 3/6] sched/core: Update turbo_sched count only when required
  2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
  2019-10-07  8:30 ` [RFC v5 1/6] sched/core: Add manual background task classification using sched_setattr syscall Parth Shah
  2019-10-07  8:30 ` [RFC v5 2/6] sched: Introduce switch to enable TurboSched for task packing Parth Shah
@ 2019-10-07  8:30 ` Parth Shah
  2019-10-07  8:30 ` [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

Use the get/put methods to add/remove the use of TurboSched support, such
that the feature is turned on only in the presence of at least one
classified small background task.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee5980b4e150..6e1ae8046fe0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3141,6 +3141,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		mmdrop(mm);
 	}
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (unlikely(prev->flags & PF_CAN_BE_PACKED))
+			turbo_sched_put();
+
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
@@ -4793,6 +4796,10 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	prev_class = p->sched_class;
 
+	/* Refcount tasks classified as a small background task */
+	if (task_packing_flag != (p->flags & PF_CAN_BE_PACKED))
+		(task_packing_flag) ? turbo_sched_get() : turbo_sched_put();
+
 	__setscheduler(rq, p, attr, pi);
 	__setscheduler_uclamp(p, attr);
 
-- 
2.17.1



* [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (2 preceding siblings ...)
  2019-10-07  8:30 ` [RFC v5 3/6] sched/core: Update turbo_sched count only when required Parth Shah
@ 2019-10-07  8:30 ` Parth Shah
  2019-10-07 12:19   ` Vincent Guittot
  2019-10-07  8:30 ` [RFC v5 5/6] sched/fair: Provide arch hook to find domain for non idle core search scan Parth Shah
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

The algorithm finds the first non-idle core in the system and tries to
place the task on an idle CPU of the chosen core. To maintain cache
hotness, the search for a non-idle core starts from prev_cpu, which also
reduces task ping-pong behaviour inside the core.

Define a new method, select_non_idle_core, which keeps track of the idle
and non-idle CPUs in a core and, based on heuristics, determines whether
the core is sufficiently busy to place the incoming background task. The
heuristic further divides the non-idle CPUs into busy (>12.5% util) CPUs
and overutilized (>80% util) CPUs.
- A core containing more idle CPUs and no busy CPUs is not selected for
  packing
- A core containing more than 1 overutilized CPU is exempted from task
  packing
- Pack if there is at least one busy CPU and the overutilized CPU count
  is <2

A 12.5% utilization value for a busy CPU is a sufficient heuristic that the
CPU is doing enough work and will not become idle in the near future.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c |  3 ++
 kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e1ae8046fe0..7e3aff59540a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6402,6 +6402,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 void __init sched_init(void)
 {
@@ -6442,6 +6443,8 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b798fe7ff7cd..d4a1b6474338 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,6 +5353,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+/* A cpumask to find active cores in the system. */
+DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 
@@ -5964,6 +5966,76 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+#ifdef CONFIG_SCHED_SMT
+static inline bool is_background_task(struct task_struct *p)
+{
+	if (p->flags & PF_CAN_BE_PACKED)
+		return true;
+
+	return false;
+}
+
+#define busyness_threshold	(100 >> 3)
+#define is_cpu_busy(util) ((util) > busyness_threshold)
+
+/*
+ * Try to find a non idle core in the system  based on few heuristics:
+ * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
+ * - If none CPUs are busy then do not select the core for task packing
+ * - If atleast one CPU is busy then do task packing unless overutilized CPUs
+ *   count is < busy/2 CPU count
+ * - Always select idle CPU for task packing
+ */
+static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
+	int iter_cpu, sibling;
+
+	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+
+	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
+		int idle_cpu_count = 0, non_idle_cpu_count = 0;
+		int overutil_cpu_count = 0;
+		int busy_cpu_count = 0;
+		int best_cpu = iter_cpu;
+
+		for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
+			__cpumask_clear_cpu(sibling, cpus);
+			if (idle_cpu(iter_cpu)) {
+				idle_cpu_count++;
+				best_cpu = iter_cpu;
+			} else {
+				non_idle_cpu_count++;
+				if (cpu_overutilized(iter_cpu))
+					overutil_cpu_count++;
+				if (is_cpu_busy(cpu_util(iter_cpu)))
+					busy_cpu_count++;
+			}
+		}
+
+		/*
+		 * Pack tasks to this core if
+		 * 1. Idle CPU count is higher and atleast one is busy
+		 * 2. If idle_cpu_count < non_idle_cpu_count then ideally do
+		 * packing but if there are more CPUs overutilized then don't
+		 * overload it.
+		 */
+		if (idle_cpu_count > non_idle_cpu_count) {
+			if (busy_cpu_count)
+				return best_cpu;
+		} else {
+			/*
+			 * Pack tasks if at max 1 CPU is overutilized
+			 */
+			if (overutil_cpu_count < 2)
+				return best_cpu;
+		}
+	}
+
+	return select_idle_sibling(p, prev_cpu, target);
+}
+#endif /* CONFIG_SCHED_SMT */
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -6418,6 +6490,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return -1;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Select all classified background tasks for task packing
+ */
+static inline int turbosched_select_non_idle_core(struct task_struct *p,
+						  int prev_cpu, int target)
+{
+	return select_non_idle_core(p, prev_cpu, target);
+}
+#else
+static inline int turbosched_select_non_idle_core(struct task_struct *p,
+						  int prev_cpu, int target)
+{
+	return select_idle_sibling(p, prev_cpu, target);
+}
+#endif
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
 
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		if (is_turbosched_enabled() && unlikely(is_background_task(p)))
+			new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
+								  new_cpu);
+		else
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;
-- 
2.17.1



* [RFC v5 5/6] sched/fair: Provide arch hook to find domain for non idle core search scan
  2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (3 preceding siblings ...)
  2019-10-07  8:30 ` [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
@ 2019-10-07  8:30 ` Parth Shah
  2019-10-07  8:30 ` [RFC v5 6/6] powerpc: Set turbo domain to NUMA node for task packing Parth Shah
       [not found] ` <20191008132842.6612-1-hdanton@sina.com>
  6 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

Specify a method which returns the cpumask within which to limit the search
for a non-idle core. By default, limit the search to the LLC domain, which
usually includes few/all of the cores in the processor chip.

select_non_idle_core searches for the non-idle cores in the LLC domain. But
on systems with multiple NUMA domains, the turbo frequency can be sustained
within a NUMA domain without being affected by the other NUMA domains. For
such cases, arch_turbo_domain can be overridden to change the domain for
the non-idle core search.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4a1b6474338..731d1aaa6bc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5967,6 +5967,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 }
 
 #ifdef CONFIG_SCHED_SMT
+
+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+	return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
 static inline bool is_background_task(struct task_struct *p)
 {
 	if (p->flags & PF_CAN_BE_PACKED)
@@ -5991,7 +5999,8 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
 	int iter_cpu, sibling;
 
-	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+	cpumask_and(cpus, cpu_online_mask, arch_turbo_domain(prev_cpu));
+	cpumask_and(cpus, cpus, p->cpus_ptr);
 
 	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
 		int idle_cpu_count = 0, non_idle_cpu_count = 0;
-- 
2.17.1



* [RFC v5 6/6] powerpc: Set turbo domain to NUMA node for task packing
  2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
                   ` (4 preceding siblings ...)
  2019-10-07  8:30 ` [RFC v5 5/6] sched/fair: Provide arch hook to find domain for non idle core search scan Parth Shah
@ 2019-10-07  8:30 ` Parth Shah
       [not found] ` <20191008132842.6612-1-hdanton@sina.com>
  6 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07  8:30 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann,
	patrick.bellasi, valentin.schneider, pavel, dsmythies,
	quentin.perret, rafael.j.wysocki, tim.c.chen, daniel.lezcano

Provide a powerpc architecture-specific implementation defining the turbo
domain, so that the search for a core is bound within the NUMA node.

POWER9 systems have a pair of cores in the LLC domain. Hence, to make
TurboSched more effective, widen the task packing search domain to the
NUMA node.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 3 +++
 arch/powerpc/kernel/smp.c           | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..b2493bb11653 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -132,6 +132,9 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
+#define arch_turbo_domain		powerpc_turbo_domain
+
+struct cpumask *powerpc_turbo_domain(int cpu);
 
 int dlpar_cpu_readd(int cpu);
 #endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..0fc4443a3f27 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1169,6 +1169,13 @@ static void remove_cpu_from_masks(int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+inline struct cpumask *powerpc_turbo_domain(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+#endif
+
 static inline void add_cpu_to_smallcore_masks(int cpu)
 {
 	struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
-- 
2.17.1



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-07  8:30 ` [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
@ 2019-10-07 12:19   ` Vincent Guittot
  2019-10-07 16:53     ` Parth Shah
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2019-10-07 12:19 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Patrick Bellasi, Valentin Schneider,
	Pavel Machek, Doug Smythies, Quentin Perret, Rafael J. Wysocki,
	Tim Chen, Daniel Lezcano

On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
>
> The algorithm finds the first non idle core in the system and tries to
> place a task in the idle CPU in the chosen core. To maintain
> cache hotness, work of finding non idle core starts from the prev_cpu,
> which also reduces task ping-pong behaviour inside of the core.
>
> Define a new method to select_non_idle_core which keep tracks of the idle
> and non-idle CPUs in the core and based on the heuristics determines if the
> core is sufficiently busy to place the incoming backgroung task. The
> heuristic further defines the non-idle CPU into either busy (>12.5% util)
> CPU and overutilized (>80% util) CPU.
> - The core containing more idle CPUs and no busy CPUs is not selected for
>   packing
> - The core if contains more than 1 overutilized CPUs are exempted from
>   task packing
> - Pack if there is atleast one busy CPU and overutilized CPUs count is <2
>
> Value of 12.5% utilization for busy CPU gives sufficient heuristics for CPU
> doing enough work and not become idle in nearby timeframe.
>
> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> ---
>  kernel/sched/core.c |  3 ++
>  kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 97 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6e1ae8046fe0..7e3aff59540a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6402,6 +6402,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
>
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>  DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
> +DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
>
>  void __init sched_init(void)
>  {
> @@ -6442,6 +6443,8 @@ void __init sched_init(void)
>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>                 per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> +               per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
> +                       cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>         }
>  #endif /* CONFIG_CPUMASK_OFFSTACK */
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b798fe7ff7cd..d4a1b6474338 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5353,6 +5353,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  /* Working cpumask for: load_balance, load_balance_newidle. */
>  DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
>  DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
> +/* A cpumask to find active cores in the system. */
> +DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
>
>  #ifdef CONFIG_NO_HZ_COMMON
>
> @@ -5964,6 +5966,76 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>         return cpu;
>  }
>
> +#ifdef CONFIG_SCHED_SMT
> +static inline bool is_background_task(struct task_struct *p)
> +{
> +       if (p->flags & PF_CAN_BE_PACKED)
> +               return true;
> +
> +       return false;
> +}
> +
> +#define busyness_threshold     (100 >> 3)
> +#define is_cpu_busy(util) ((util) > busyness_threshold)
> +
> +/*
> + * Try to find a non idle core in the system  based on few heuristics:
> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
> + * - If none CPUs are busy then do not select the core for task packing
> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
> + *   count is < busy/2 CPU count
> + * - Always select idle CPU for task packing
> + */
> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
> +{
> +       struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
> +       int iter_cpu, sibling;
> +
> +       cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
> +
> +       for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
> +               int idle_cpu_count = 0, non_idle_cpu_count = 0;
> +               int overutil_cpu_count = 0;
> +               int busy_cpu_count = 0;
> +               int best_cpu = iter_cpu;
> +
> +               for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
> +                       __cpumask_clear_cpu(sibling, cpus);
> +                       if (idle_cpu(iter_cpu)) {
> +                               idle_cpu_count++;
> +                               best_cpu = iter_cpu;
> +                       } else {
> +                               non_idle_cpu_count++;
> +                               if (cpu_overutilized(iter_cpu))
> +                                       overutil_cpu_count++;
> +                               if (is_cpu_busy(cpu_util(iter_cpu)))
> +                                       busy_cpu_count++;
> +                       }
> +               }
> +
> +               /*
> +                * Pack tasks to this core if
> +                * 1. Idle CPU count is higher and atleast one is busy
> +                * 2. If idle_cpu_count < non_idle_cpu_count then ideally do
> +                * packing but if there are more CPUs overutilized then don't
> +                * overload it.

Could you give details about the rationale behind these conditions ?

> +                */
> +               if (idle_cpu_count > non_idle_cpu_count) {
> +                       if (busy_cpu_count)
> +                               return best_cpu;
> +               } else {
> +                       /*
> +                        * Pack tasks if at max 1 CPU is overutilized
> +                        */
> +                       if (overutil_cpu_count < 2)
> +                               return best_cpu;
> +               }
> +       }
> +
> +       return select_idle_sibling(p, prev_cpu, target);
> +}
> +#endif /* CONFIG_SCHED_SMT */
> +
>  /*
>   * Try and locate an idle core/thread in the LLC cache domain.
>   */
> @@ -6418,6 +6490,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>         return -1;
>  }
>
> +#ifdef CONFIG_SCHED_SMT
> +/*
> + * Select all classified background tasks for task packing
> + */
> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
> +                                                 int prev_cpu, int target)
> +{
> +       return select_non_idle_core(p, prev_cpu, target);
> +}
> +#else
> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
> +                                                 int prev_cpu, int target)
> +{
> +       return select_idle_sibling(p, prev_cpu, target);

should be better to make turbosched_select_non_idle_core empty and
make sure that __turbo_sched_enabled is never enabled if
CONFIG_SCHED_SMT is disabled

> +}
> +#endif
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>         } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>                 /* Fast path */
>
> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +               if (is_turbosched_enabled() && unlikely(is_background_task(p)))
> +                       new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
> +                                                                 new_cpu);

Could you add turbosched_select_non_idle_core() similarly to
find_energy_efficient_cpu() ?
Add it at the beg select_task_rq_fair()
Return immediately with theCPU if you have found one
Or let the normal path select a CPU if the
turbosched_select_non_idle_core() has not been able to find a suitable
CPU for packing


> +               else
> +                       new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>
>                 if (want_affine)
>                         current->recent_used_cpu = cpu;
> --
> 2.17.1
>


* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-07 12:19   ` Vincent Guittot
@ 2019-10-07 16:53     ` Parth Shah
  2019-10-08 16:20       ` Vincent Guittot
  2019-10-08 16:52       ` Dietmar Eggemann
  0 siblings, 2 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-07 16:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Patrick Bellasi, Valentin Schneider,
	Pavel Machek, Doug Smythies, Quentin Perret, Rafael J. Wysocki,
	Tim Chen, Daniel Lezcano



On 10/7/19 5:49 PM, Vincent Guittot wrote:
> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
>>
>> The algorithm finds the first non idle core in the system and tries to
>> place a task in the idle CPU in the chosen core. To maintain
>> cache hotness, work of finding non idle core starts from the prev_cpu,
>> which also reduces task ping-pong behaviour inside of the core.
>>
>> Define a new method to select_non_idle_core which keep tracks of the idle
>> and non-idle CPUs in the core and based on the heuristics determines if the
>> core is sufficiently busy to place the incoming backgroung task. The
>> heuristic further defines the non-idle CPU into either busy (>12.5% util)
>> CPU and overutilized (>80% util) CPU.
>> - The core containing more idle CPUs and no busy CPUs is not selected for
>>   packing
>> - The core if contains more than 1 overutilized CPUs are exempted from
>>   task packing
>> - Pack if there is atleast one busy CPU and overutilized CPUs count is <2
>>
>> Value of 12.5% utilization for busy CPU gives sufficient heuristics for CPU
>> doing enough work and not become idle in nearby timeframe.
>>
>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>> ---
>>  kernel/sched/core.c |  3 ++
>>  kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 97 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 6e1ae8046fe0..7e3aff59540a 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6402,6 +6402,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
>>
>>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>>  DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
>> +DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
>>
>>  void __init sched_init(void)
>>  {
>> @@ -6442,6 +6443,8 @@ void __init sched_init(void)
>>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>                 per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
>>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>> +               per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
>> +                       cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>         }
>>  #endif /* CONFIG_CPUMASK_OFFSTACK */
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b798fe7ff7cd..d4a1b6474338 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5353,6 +5353,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  /* Working cpumask for: load_balance, load_balance_newidle. */
>>  DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
>>  DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
>> +/* A cpumask to find active cores in the system. */
>> +DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
>>
>>  #ifdef CONFIG_NO_HZ_COMMON
>>
>> @@ -5964,6 +5966,76 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>         return cpu;
>>  }
>>
>> +#ifdef CONFIG_SCHED_SMT
>> +static inline bool is_background_task(struct task_struct *p)
>> +{
>> +       if (p->flags & PF_CAN_BE_PACKED)
>> +               return true;
>> +
>> +       return false;
>> +}
>> +
>> +#define busyness_threshold     (100 >> 3)
>> +#define is_cpu_busy(util) ((util) > busyness_threshold)
>> +
>> +/*
>> + * Try to find a non idle core in the system  based on few heuristics:
>> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
>> + * - If none CPUs are busy then do not select the core for task packing
>> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
>> + *   count is < busy/2 CPU count
>> + * - Always select idle CPU for task packing
>> + */
>> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
>> +{
>> +       struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
>> +       int iter_cpu, sibling;
>> +
>> +       cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
>> +
>> +       for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
>> +               int idle_cpu_count = 0, non_idle_cpu_count = 0;
>> +               int overutil_cpu_count = 0;
>> +               int busy_cpu_count = 0;
>> +               int best_cpu = iter_cpu;
>> +
>> +               for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
>> +                       __cpumask_clear_cpu(sibling, cpus);
>> +                       if (idle_cpu(iter_cpu)) {
>> +                               idle_cpu_count++;
>> +                               best_cpu = iter_cpu;
>> +                       } else {
>> +                               non_idle_cpu_count++;
>> +                               if (cpu_overutilized(iter_cpu))
>> +                                       overutil_cpu_count++;
>> +                               if (is_cpu_busy(cpu_util(iter_cpu)))
>> +                                       busy_cpu_count++;
>> +                       }
>> +               }
>> +
>> +               /*
>> +                * Pack tasks to this core if
>> +                * 1. Idle CPU count is higher and atleast one is busy
>> +                * 2. If idle_cpu_count < non_idle_cpu_count then ideally do
>> +                * packing but if there are more CPUs overutilized then don't
>> +                * overload it.
> 
> Could you give details about the rationale behind these conditions ?

sure. but first maybe some background is required for busy_cpu.
Task packing needs to be done across cores minimizing number of busy cores
in the chip. Hence when picking a core for packing a background task it
will be good to not select a core which is in deeper idle-states.

Usually deeper idle states have target_residency >= 10ms which are really
power saving states and saved power can be channeled to active cores.
A CPU with utilization of 12.5% is most probably not going to those deeper
idle states and picking a CPU with >= 12.5% util seems to be a good
approximation.

Now going to the _main point_, task packing needs to take care of the
following scenarios:
1. Not select a core having all the CPUs idle or <= 12.5% util
2. Do not select a core with 2 or more CPUs overloaded (>=80% util)
3. Select a core even if 1 CPU is overloaded as background tasks are
usually short running and spending time for selecting better alternative is
not worth the investment here
4. Select a core if at least one CPU is busy (>=12.5% util)
5. On selecting a core, select an idle CPU in it.

Hence, to satisfy these scenarios for SMT-1/2/4 (POWER9) or SMT-8 (POWER8
has 8 threads per core; POWER9 has a feature to create fake SMT-8), the
approach keeps track of the idle, non-idle, busy and overloaded CPU counts
in the core and uses the above rules to find a _sufficiently_ non-idle
core, which seems to be a good heuristic for task packing without much
regression on CPU-intensive threads.

So, as per the comments in this patch, the first point covers scenarios 1
and 4 (the "if" part in the code), and the second point covers scenarios 2
and 3 (the "else" part in the code).
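
To make the mapping concrete, here is a tiny standalone illustration (not
kernel code, not part of the patch) of how these scenarios fall onto the
two branches quoted above; the per-core counts are hypothetical:
```c
/*
 * Sketch of the decision rule in select_non_idle_core(): pack on a mostly
 * idle core only if at least one CPU is busy (scenarios 1 and 4),
 * otherwise pack unless two or more CPUs are overutilized (scenarios 2
 * and 3). All counts are per SMT core.
 */
#include <stdbool.h>
#include <stdio.h>

static bool pack_here(int idle, int non_idle, int busy, int overutil)
{
	if (idle > non_idle)
		return busy > 0;	/* "if" branch: scenarios 1 and 4 */
	return overutil < 2;		/* "else" branch: scenarios 2 and 3 */
}

int main(void)
{
	/* idle, non_idle, busy (>12.5%), overutilized (>80%) */
	printf("all CPUs idle       -> %d\n", pack_here(4, 0, 0, 0)); /* 0: skip */
	printf("one busy sibling    -> %d\n", pack_here(3, 1, 1, 0)); /* 1: pack */
	printf("one CPU overloaded  -> %d\n", pack_here(1, 3, 3, 1)); /* 1: pack */
	printf("two CPUs overloaded -> %d\n", pack_here(1, 3, 3, 2)); /* 0: skip */
	return 0;
}
```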

>> +                */
>> +               if (idle_cpu_count > non_idle_cpu_count) {
>> +                       if (busy_cpu_count)
>> +                               return best_cpu;
>> +               } else {
>> +                       /*
>> +                        * Pack tasks if at max 1 CPU is overutilized
>> +                        */
>> +                       if (overutil_cpu_count < 2)
>> +                               return best_cpu;
>> +               }
>> +       }
>> +
>> +       return select_idle_sibling(p, prev_cpu, target);
>> +}
>> +#endif /* CONFIG_SCHED_SMT */
>> +
>>  /*
>>   * Try and locate an idle core/thread in the LLC cache domain.
>>   */
>> @@ -6418,6 +6490,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>         return -1;
>>  }
>>
>> +#ifdef CONFIG_SCHED_SMT
>> +/*
>> + * Select all classified background tasks for task packing
>> + */
>> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
>> +                                                 int prev_cpu, int target)
>> +{
>> +       return select_non_idle_core(p, prev_cpu, target);
>> +}
>> +#else
>> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
>> +                                                 int prev_cpu, int target)
>> +{
>> +       return select_idle_sibling(p, prev_cpu, target);
> 
> should be better to make turbosched_select_non_idle_core empty and
> make sure that __turbo_sched_enabled is never enabled if
> CONFIG_SCHED_SMT is disabled
> 

Totally agreed. I thought of keeping it like this so as not to have any
"#def.." in the select_task_rq_fair method.
So, can I do this by adding a new method like __select_idle_sibling(),
which will call turbosched_select_non_idle_core() when SCHED_SMT is present
and otherwise call the regular select_idle_sibling()?

>> +}
>> +#endif
>> +
>>  /*
>>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> @@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>         } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>>                 /* Fast path */
>>
>> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>> +               if (is_turbosched_enabled() && unlikely(is_background_task(p)))
>> +                       new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
>> +                                                                 new_cpu);
> 
> Could you add turbosched_select_non_idle_core() similarly to
> find_energy_efficient_cpu() ?
> Add it at the beg select_task_rq_fair()
> Return immediately with theCPU if you have found one
> Or let the normal path select a CPU if the
> turbosched_select_non_idle_core() has not been able to find a suitable
> CPU for packing
> 

Of course, I can do that.
I was just not aware of the effect of wake_affine and so was waiting for
such comments to be sure. Thanks for this.
Maybe I can add it just below the sched_energy_present(){...} construct,
giving precedence to EAS? I'm asking this because I remember Patrick
telling me to leverage task packing for Android as well.


Thank you very much for looking at the patches.
Parth

> 
>> +               else
>> +                       new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>>
>>                 if (want_affine)
>>                         current->recent_used_cpu = cpu;
>> --
>> 2.17.1
>>



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-07 16:53     ` Parth Shah
@ 2019-10-08 16:20       ` Vincent Guittot
  2019-10-09  8:46         ` Parth Shah
  2019-10-08 16:52       ` Dietmar Eggemann
  1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2019-10-08 16:20 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Patrick Bellasi, Valentin Schneider,
	Pavel Machek, Doug Smythies, Quentin Perret, Rafael J. Wysocki,
	Tim Chen, Daniel Lezcano

On Mon, 7 Oct 2019 at 18:54, Parth Shah <parth@linux.ibm.com> wrote:
>
>
>
> On 10/7/19 5:49 PM, Vincent Guittot wrote:
> > On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
> >>
> >> The algorithm finds the first non idle core in the system and tries to
> >> place a task in the idle CPU in the chosen core. To maintain
> >> cache hotness, work of finding non idle core starts from the prev_cpu,
> >> which also reduces task ping-pong behaviour inside of the core.
> >>
> >> Define a new method to select_non_idle_core which keep tracks of the idle
> >> and non-idle CPUs in the core and based on the heuristics determines if the
> >> core is sufficiently busy to place the incoming backgroung task. The
> >> heuristic further defines the non-idle CPU into either busy (>12.5% util)
> >> CPU and overutilized (>80% util) CPU.
> >> - The core containing more idle CPUs and no busy CPUs is not selected for
> >>   packing
> >> - The core if contains more than 1 overutilized CPUs are exempted from
> >>   task packing
> >> - Pack if there is atleast one busy CPU and overutilized CPUs count is <2
> >>
> >> Value of 12.5% utilization for busy CPU gives sufficient heuristics for CPU
> >> doing enough work and not become idle in nearby timeframe.
> >>
> >> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> >> ---
> >>  kernel/sched/core.c |  3 ++
> >>  kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 97 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 6e1ae8046fe0..7e3aff59540a 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -6402,6 +6402,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
> >>
> >>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> >>  DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
> >> +DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
> >>
> >>  void __init sched_init(void)
> >>  {
> >> @@ -6442,6 +6443,8 @@ void __init sched_init(void)
> >>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> >>                 per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
> >>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> >> +               per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
> >> +                       cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> >>         }
> >>  #endif /* CONFIG_CPUMASK_OFFSTACK */
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b798fe7ff7cd..d4a1b6474338 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5353,6 +5353,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>  /* Working cpumask for: load_balance, load_balance_newidle. */
> >>  DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
> >>  DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
> >> +/* A cpumask to find active cores in the system. */
> >> +DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
> >>
> >>  #ifdef CONFIG_NO_HZ_COMMON
> >>
> >> @@ -5964,6 +5966,76 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>         return cpu;
> >>  }
> >>
> >> +#ifdef CONFIG_SCHED_SMT
> >> +static inline bool is_background_task(struct task_struct *p)
> >> +{
> >> +       if (p->flags & PF_CAN_BE_PACKED)
> >> +               return true;
> >> +
> >> +       return false;
> >> +}
> >> +
> >> +#define busyness_threshold     (100 >> 3)
> >> +#define is_cpu_busy(util) ((util) > busyness_threshold)
> >> +
> >> +/*
> >> + * Try to find a non idle core in the system  based on few heuristics:
> >> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
> >> + * - If none CPUs are busy then do not select the core for task packing
> >> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
> >> + *   count is < busy/2 CPU count
> >> + * - Always select idle CPU for task packing
> >> + */
> >> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
> >> +{
> >> +       struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
> >> +       int iter_cpu, sibling;
> >> +
> >> +       cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
> >> +
> >> +       for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
> >> +               int idle_cpu_count = 0, non_idle_cpu_count = 0;
> >> +               int overutil_cpu_count = 0;
> >> +               int busy_cpu_count = 0;
> >> +               int best_cpu = iter_cpu;
> >> +
> >> +               for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
> >> +                       __cpumask_clear_cpu(sibling, cpus);
> >> +                       if (idle_cpu(iter_cpu)) {
> >> +                               idle_cpu_count++;
> >> +                               best_cpu = iter_cpu;
> >> +                       } else {
> >> +                               non_idle_cpu_count++;
> >> +                               if (cpu_overutilized(iter_cpu))
> >> +                                       overutil_cpu_count++;
> >> +                               if (is_cpu_busy(cpu_util(iter_cpu)))
> >> +                                       busy_cpu_count++;
> >> +                       }
> >> +               }
> >> +
> >> +               /*
> >> +                * Pack tasks to this core if
> >> +                * 1. Idle CPU count is higher and atleast one is busy
> >> +                * 2. If idle_cpu_count < non_idle_cpu_count then ideally do
> >> +                * packing but if there are more CPUs overutilized then don't
> >> +                * overload it.
> >
> > Could you give details about the rationale behind these conditions ?
>
> sure. but first maybe some background is required for busy_cpu.
> Task packing needs to be done across cores minimizing number of busy cores
> in the chip. Hence when picking a core for packing a background task it
> will be good to not select a core which is in deeper idle-states.

Makes sense.
find_idlest_group_cpu() is doing something similar with the help of cpuidle.
Don't you have such information available in your cpuidle driver?

>
> Usually deeper idle states have target_residency >= 10ms which are really
> power saving states and saved power can be channeled to active cores.
> A CPU with utilization of 12.5% is most probably not going to those deeper
> idle states and picking a CPU with >= 12.5% util seems to be a good
> approximation.

You should rather use idle_get_state(rq).

>
> Now going to the _main point_, task packing needs to take care of the
> following scenarios:
> 1. Not select a core having all the CPUs idle or <= 12.5% util
> 2. Do not select a core with 2 or more CPUs overloaded (>=80% util)

Why is it always 2 CPUs? It seems that you can have 1/2/4/8 CPUs per core,
but you keep using 2 CPUs as the threshold.

> 3. Select a core even if 1 CPU is overloaded as background tasks are
> usually short running and spending time for selecting better alternative is
> not worth the investment here
> 4. Select a core if at least one CPU is busy (>=12.5% util)
> 5. On selecting a core, select an idle CPU in it.
>
> Hence to satisfy this scenarios for SMT-1/2/4 (POWER9) or 8 (POWER8 has
> 8-threads per core/ POWER9 has feature to make fake SMT-8), the approach
> keeps track of idle, non-idle, busy and overloaded CPU count in the core
> and uses above approach to find _sufficiently_ non-idle core, which seems
> to be a good heuristics to do task packing without much of regression on
> CPU intensive threads.
>
> So as per the comments in this patch, first point covers the scenario 1 and
> 4 (if part in the code), and second point covers scenario 2 and 3 (else
> part in the code).
>
> >> +                */
> >> +               if (idle_cpu_count > non_idle_cpu_count) {
> >> +                       if (busy_cpu_count)
> >> +                               return best_cpu;
> >> +               } else {
> >> +                       /*
> >> +                        * Pack tasks if at max 1 CPU is overutilized
> >> +                        */
> >> +                       if (overutil_cpu_count < 2)
> >> +                               return best_cpu;
> >> +               }
> >> +       }
> >> +
> >> +       return select_idle_sibling(p, prev_cpu, target);
> >> +}
> >> +#endif /* CONFIG_SCHED_SMT */
> >> +
> >>  /*
> >>   * Try and locate an idle core/thread in the LLC cache domain.
> >>   */
> >> @@ -6418,6 +6490,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >>         return -1;
> >>  }
> >>
> >> +#ifdef CONFIG_SCHED_SMT
> >> +/*
> >> + * Select all classified background tasks for task packing
> >> + */
> >> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
> >> +                                                 int prev_cpu, int target)
> >> +{
> >> +       return select_non_idle_core(p, prev_cpu, target);
> >> +}
> >> +#else
> >> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
> >> +                                                 int prev_cpu, int target)
> >> +{
> >> +       return select_idle_sibling(p, prev_cpu, target);
> >
> > should be better to make turbosched_select_non_idle_core empty and
> > make sure that __turbo_sched_enabled is never enabled if
> > CONFIG_SCHED_SMT is disabled
> >
>
> Totally agreed. I thought keeping like this so as to not have any "#def.."
> in select_task_rq_fair method.
> So can I do this by adding a new method like __select_idle_sibling() which
> will call turbosched_select_non_idle_core() in case of SCHED_SMT present
> and otherwise will call the regular select_idle_sibling()?
>
> >> +}
> >> +#endif
> >> +
> >>  /*
> >>   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> >> @@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>         } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
> >>                 /* Fast path */
> >>
> >> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> >> +               if (is_turbosched_enabled() && unlikely(is_background_task(p)))
> >> +                       new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
> >> +                                                                 new_cpu);
> >
> > Could you add turbosched_select_non_idle_core() similarly to
> > find_energy_efficient_cpu() ?
> > Add it at the beg select_task_rq_fair()
> > Return immediately with theCPU if you have found one
> > Or let the normal path select a CPU if the
> > turbosched_select_non_idle_core() has not been able to find a suitable
> > CPU for packing
> >
>
> of course. I can do that.
> I was just not aware about the effect of wake_affine and so was waiting for
> such comments to be sure of. Thanks for this.
> Maybe I can add just below the sched_energy_present(){...} construct giving
> precedence to EAS? I'm asking this because I remember Patrick telling me to
> leverage task packing for android as well?

After the sched_energy_present(){...} construct seems to be a good place.

Leveraging task packing for Android means that task packing should
collaborate with EAS and find_energy_efficient_cpu().
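
i.e. something like the below rough sketch (assuming
turbosched_select_non_idle_core() is reworked to return -1 when it cannot
find a suitable CPU for packing; the surrounding code is written from
memory and may not match mainline exactly):

	if (sd_flag & SD_BALANCE_WAKE) {
		record_wakee(p);

		if (sched_energy_enabled()) {
			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
			if (new_cpu >= 0)
				return new_cpu;
			new_cpu = prev_cpu;
		}

		/* Sketch: try packing right after the EAS fast path */
		if (is_turbosched_enabled() && unlikely(is_background_task(p))) {
			new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
								  new_cpu);
			if (new_cpu >= 0)
				return new_cpu;
			new_cpu = prev_cpu;
		}

		/* ... want_affine handling continues unchanged ... */
	}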

>
>
> Thank you very much for looking at the patches.
> Parth
>
> >
> >> +               else
> >> +                       new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> >>
> >>                 if (want_affine)
> >>                         current->recent_used_cpu = cpu;
> >> --
> >> 2.17.1
> >>
>


* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-07 16:53     ` Parth Shah
  2019-10-08 16:20       ` Vincent Guittot
@ 2019-10-08 16:52       ` Dietmar Eggemann
  2019-10-09  8:57         ` Parth Shah
  1 sibling, 1 reply; 19+ messages in thread
From: Dietmar Eggemann @ 2019-10-08 16:52 UTC (permalink / raw)
  To: Parth Shah, Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Patrick Bellasi, Valentin Schneider, Pavel Machek, Doug Smythies,
	Quentin Perret, Rafael J. Wysocki, Tim Chen, Daniel Lezcano

[- Quentin Perret <quentin.perret@arm.com>]
[+ Quentin Perret <qperret@qperret.net>]

See commit c193a3ffc282 ("mailmap: Update email address for Quentin Perret")

On 07/10/2019 18:53, Parth Shah wrote:
> 
> 
> On 10/7/19 5:49 PM, Vincent Guittot wrote:
>> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
>>>
>>> The algorithm finds the first non idle core in the system and tries to
>>> place a task in the idle CPU in the chosen core. To maintain
>>> cache hotness, work of finding non idle core starts from the prev_cpu,
>>> which also reduces task ping-pong behaviour inside of the core.
>>>
>>> Define a new method to select_non_idle_core which keep tracks of the idle
>>> and non-idle CPUs in the core and based on the heuristics determines if the
>>> core is sufficiently busy to place the incoming backgroung task. The
>>> heuristic further defines the non-idle CPU into either busy (>12.5% util)
>>> CPU and overutilized (>80% util) CPU.
>>> - The core containing more idle CPUs and no busy CPUs is not selected for
>>>   packing
>>> - The core if contains more than 1 overutilized CPUs are exempted from
>>>   task packing
>>> - Pack if there is atleast one busy CPU and overutilized CPUs count is <2
>>>
>>> Value of 12.5% utilization for busy CPU gives sufficient heuristics for CPU
>>> doing enough work an

[...]

>>> @@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>         } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>>>                 /* Fast path */
>>>
>>> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>>> +               if (is_turbosched_enabled() && unlikely(is_background_task(p)))
>>> +                       new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
>>> +                                                                 new_cpu);
>>
>> Could you add turbosched_select_non_idle_core() similarly to
>> find_energy_efficient_cpu() ?
>> Add it at the beg select_task_rq_fair()
>> Return immediately with theCPU if you have found one
>> Or let the normal path select a CPU if the
>> turbosched_select_non_idle_core() has not been able to find a suitable
>> CPU for packing
>>
> 
> of course. I can do that.
> I was just not aware about the effect of wake_affine and so was waiting for
> such comments to be sure of. Thanks for this.
> Maybe I can add just below the sched_energy_present(){...} construct giving
> precedence to EAS? I'm asking this because I remember Patrick telling me to
> leverage task packing for android as well?

I have a hard time imagining that TurboSched will be used in Android next
to EAS in the foreseeable future.

First of all, EAS already provides task packing at the Performance Domain
(PD) level (a.k.a. cluster on traditional 2-cluster Arm/Arm64 big.LITTLE,
or DynamIQ with Phantom domains (an out-of-tree solution)).
This is where we can save energy without harming latency.

See the test results under '2.1 Energy test case' in

https://lore.kernel.org/r/20181203095628.11858-1-quentin.perret@arm.com

There are 10 to 50 small (classified solely by task utilization) tasks
per test case and EAS shows an effect on energy consumption by packing
them onto the PD (cluster) of the small CPUs.

And second, the supported CPU topology is different from the one you're
testing on.

[...]


* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-08 16:20       ` Vincent Guittot
@ 2019-10-09  8:46         ` Parth Shah
  0 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-09  8:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Patrick Bellasi, Valentin Schneider,
	Pavel Machek, Doug Smythies, Quentin Perret, Rafael J. Wysocki,
	Tim Chen, Daniel Lezcano



On 10/8/19 9:50 PM, Vincent Guittot wrote:
> On Mon, 7 Oct 2019 at 18:54, Parth Shah <parth@linux.ibm.com> wrote:
>>
>>
>>
>> On 10/7/19 5:49 PM, Vincent Guittot wrote:
>>> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
>>>>
>>>> The algorithm finds the first non idle core in the system and tries to
>>>> place a task in the idle CPU in the chosen core. To maintain
>>>> cache hotness, work of finding non idle core starts from the prev_cpu,
>>>> which also reduces task ping-pong behaviour inside of the core.
>>>>
>>>> Define a new method to select_non_idle_core which keep tracks of the idle
>>>> and non-idle CPUs in the core and based on the heuristics determines if the
>>>> core is sufficiently busy to place the incoming backgroung task. The
>>>> heuristic further defines the non-idle CPU into either busy (>12.5% util)
>>>> CPU and overutilized (>80% util) CPU.
>>>> - The core containing more idle CPUs and no busy CPUs is not selected for
>>>>   packing
>>>> - The core if contains more than 1 overutilized CPUs are exempted from
>>>>   task packing
>>>> - Pack if there is atleast one busy CPU and overutilized CPUs count is <2
>>>>
>>>> Value of 12.5% utilization for busy CPU gives sufficient heuristics for CPU
>>>> doing enough work and not become idle in nearby timeframe.
>>>>
>>>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>>>> ---
>>>>  kernel/sched/core.c |  3 ++
>>>>  kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 97 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 6e1ae8046fe0..7e3aff59540a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -6402,6 +6402,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
>>>>
>>>>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>>>>  DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
>>>> +DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
>>>>
>>>>  void __init sched_init(void)
>>>>  {
>>>> @@ -6442,6 +6443,8 @@ void __init sched_init(void)
>>>>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>>>                 per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
>>>>                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>>> +               per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
>>>> +                       cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>>>         }
>>>>  #endif /* CONFIG_CPUMASK_OFFSTACK */
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b798fe7ff7cd..d4a1b6474338 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5353,6 +5353,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>  /* Working cpumask for: load_balance, load_balance_newidle. */
>>>>  DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
>>>>  DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
>>>> +/* A cpumask to find active cores in the system. */
>>>> +DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
>>>>
>>>>  #ifdef CONFIG_NO_HZ_COMMON
>>>>
>>>> @@ -5964,6 +5966,76 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>         return cpu;
>>>>  }
>>>>
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> +static inline bool is_background_task(struct task_struct *p)
>>>> +{
>>>> +       if (p->flags & PF_CAN_BE_PACKED)
>>>> +               return true;
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>>>> +#define busyness_threshold     (100 >> 3)
>>>> +#define is_cpu_busy(util) ((util) > busyness_threshold)
>>>> +
>>>> +/*
>>>> + * Try to find a non idle core in the system  based on few heuristics:
>>>> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
>>>> + * - If none CPUs are busy then do not select the core for task packing
>>>> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
>>>> + *   count is < busy/2 CPU count
>>>> + * - Always select idle CPU for task packing
>>>> + */
>>>> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
>>>> +{
>>>> +       struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
>>>> +       int iter_cpu, sibling;
>>>> +
>>>> +       cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
>>>> +
>>>> +       for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
>>>> +               int idle_cpu_count = 0, non_idle_cpu_count = 0;
>>>> +               int overutil_cpu_count = 0;
>>>> +               int busy_cpu_count = 0;
>>>> +               int best_cpu = iter_cpu;
>>>> +
>>>> +               for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
>>>> +                       __cpumask_clear_cpu(sibling, cpus);
>>>> +                       if (idle_cpu(iter_cpu)) {
>>>> +                               idle_cpu_count++;
>>>> +                               best_cpu = iter_cpu;
>>>> +                       } else {
>>>> +                               non_idle_cpu_count++;
>>>> +                               if (cpu_overutilized(iter_cpu))
>>>> +                                       overutil_cpu_count++;
>>>> +                               if (is_cpu_busy(cpu_util(iter_cpu)))
>>>> +                                       busy_cpu_count++;
>>>> +                       }
>>>> +               }
>>>> +
>>>> +               /*
>>>> +                * Pack tasks to this core if
>>>> +                * 1. Idle CPU count is higher and atleast one is busy
>>>> +                * 2. If idle_cpu_count < non_idle_cpu_count then ideally do
>>>> +                * packing but if there are more CPUs overutilized then don't
>>>> +                * overload it.
>>>
>>> Could you give details about the rationale behind these conditions ?
>>
>> sure. but first maybe some background is required for busy_cpu.
>> Task packing needs to be done across cores minimizing number of busy cores
>> in the chip. Hence when picking a core for packing a background task it
>> will be good to not select a core which is in deeper idle-states.
> 
> Make sense.
> find_idlest_group_cpu() is doing something similar with the help of cpuidle
> Don't you have such information available in your cpuidle driver ?
> 

Yes, that can be done, but 12.5% utilization is a value derived from PELT
decay and seems to be a good predictor of a CPU not going into deep idle
states, whereas...

>>
>> Usually deeper idle states have target_residency >= 10ms which are really
>> power saving states and saved power can be channeled to active cores.
>> A CPU with utilization of 12.5% is most probably not going to those deeper
>> idle states and picking a CPU with >= 12.5% util seems to be a good
>> approximation.
> 
> you should better use idle_get_state(rq)

... idle_get_state(rq) is a point-in-time value and may not give better
decision capability.
>>
>>
>> Thank you very much for looking at the patches.
>> Parth
>>

Though 12.5% is an experimental value, it can be backed by some explanation
as stated above. I wish to do task packing on a core which really is busy,
and 12.5% is a better predictor that it won't go into deep idle in the near
future, so we can pack on it. Whereas with idle_get_state(rq), we might
read the rq as busy at the instant we look at it, but that still cannot
predict the future, right?

Hope that explains the choice of 12.5% util, but I would be happy to hear
your thoughts on using something generic like idle_get_state().
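
For reference, a rough, untested sketch of what an idle_get_state() based
check might look like (assuming CONFIG_CPU_IDLE and that the cpuidle state
definition is visible in fair.c; the 10ms cutoff is just the deep-state
target residency mentioned above):

/*
 * Treat a CPU as packable-next-to if it is either non-idle or idle in a
 * shallow state, i.e. one whose target residency is below the ~10ms of
 * the deep power-saving states.
 */
static inline bool cpu_busy_or_shallow_idle(int cpu)
{
	struct cpuidle_state *idle = idle_get_state(cpu_rq(cpu));

	if (!idle_cpu(cpu))
		return true;

	/* No cpuidle information: fall back to the 12.5% util heuristic. */
	if (!idle)
		return is_cpu_busy(cpu_util(cpu));

	return idle->target_residency < 10000; /* 10ms, in us */
}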

> 
>>
>> Now going to the _main point_, task packing needs to take care of the
>> following scenarios:
>> 1. Not select a core having all the CPUs idle or <= 12.5% util
>> 2. Do not select a core with 2 or more CPUs overloaded (>=80% util)
> 
> Why is it always 2 CPUs ? it seems that you can have 1/2/4/8 CPUs but
> you keep using 2 CPUs as a threshold

I went with an absolute value here for no good reason other than to avoid
computing the number of online siblings in a core (similar was done in
RFC v4).
But now I think this can be handled here by simply checking:
"overutil_cpu_count < (idle_cpu_count + non_idle_cpu_count)/2"
*unless* we can get rid of any of the counters here.
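
i.e., the else branch of the packing check would then read roughly (sketch
only; smt_weight is just a local name used here):

		} else {
			int smt_weight = idle_cpu_count + non_idle_cpu_count;

			/*
			 * Pack only if fewer than half of the core's CPUs
			 * are overutilized.
			 */
			if (overutil_cpu_count < smt_weight / 2)
				return best_cpu;
		}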

> 
>> 3. Select a core even if 1 CPU is overloaded as background tasks are
>> usually short running and spending time for selecting better alternative is
>> not worth the investment here
>> 4. Select a core if at least one CPU is busy (>=12.5% util)
>> 5. On selecting a core, select an idle CPU in it.
>>
>> Hence to satisfy this scenarios for SMT-1/2/4 (POWER9) or 8 (POWER8 has
>> 8-threads per core/ POWER9 has feature to make fake SMT-8), the approach
>> keeps track of idle, non-idle, busy and overloaded CPU count in the core
>> and uses above approach to find _sufficiently_ non-idle core, which seems
>> to be a good heuristics to do task packing without much of regression on
>> CPU intensive threads.
>>
>> So as per the comments in this patch, first point covers the scenario 1 and
>> 4 (if part in the code), and second point covers scenario 2 and 3 (else
>> part in the code).
>>
>>>> +                */
>>>> +               if (idle_cpu_count > non_idle_cpu_count) {
>>>> +                       if (busy_cpu_count)
>>>> +                               return best_cpu;
>>>> +               } else {
>>>> +                       /*
>>>> +                        * Pack tasks if at max 1 CPU is overutilized
>>>> +                        */
>>>> +                       if (overutil_cpu_count < 2)
>>>> +                               return best_cpu;
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return select_idle_sibling(p, prev_cpu, target);
>>>> +}
>>>> +#endif /* CONFIG_SCHED_SMT */
>>>> +
>>>>  /*
>>>>   * Try and locate an idle core/thread in the LLC cache domain.
>>>>   */
>>>> @@ -6418,6 +6490,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>>>         return -1;
>>>>  }
>>>>
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> +/*
>>>> + * Select all classified background tasks for task packing
>>>> + */
>>>> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
>>>> +                                                 int prev_cpu, int target)
>>>> +{
>>>> +       return select_non_idle_core(p, prev_cpu, target);
>>>> +}
>>>> +#else
>>>> +static inline int turbosched_select_non_idle_core(struct task_struct *p,
>>>> +                                                 int prev_cpu, int target)
>>>> +{
>>>> +       return select_idle_sibling(p, prev_cpu, target);
>>>
>>> should be better to make turbosched_select_non_idle_core empty and
>>> make sure that __turbo_sched_enabled is never enabled if
>>> CONFIG_SCHED_SMT is disabled
>>>
>>
>> Totally agreed. I thought keeping like this so as to not have any "#def.."
>> in select_task_rq_fair method.
>> So can I do this by adding a new method like __select_idle_sibling() which
>> will call turbosched_select_non_idle_core() in case of SCHED_SMT present
>> and otherwise will call the regular select_idle_sibling()?
>>
>>>> +}
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>>>>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>>>> @@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>         } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>>>>                 /* Fast path */
>>>>
>>>> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>>>> +               if (is_turbosched_enabled() && unlikely(is_background_task(p)))
>>>> +                       new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
>>>> +                                                                 new_cpu);
>>>
>>> Could you add turbosched_select_non_idle_core() similarly to
>>> find_energy_efficient_cpu() ?
>>> Add it at the beg select_task_rq_fair()
>>> Return immediately with theCPU if you have found one
>>> Or let the normal path select a CPU if the
>>> turbosched_select_non_idle_core() has not been able to find a suitable
>>> CPU for packing
>>>
>>
>> of course. I can do that.
>> I was just not aware about the effect of wake_affine and so was waiting for
>> such comments to be sure of. Thanks for this.
>> Maybe I can add just below the sched_energy_present(){...} construct giving
>> precedence to EAS? I'm asking this because I remember Patrick telling me to
>> leverage task packing for android as well?
> 
> After  sched_energy_present(){...} seems to be a good place.
> 
> Leveraging task packing for android means that it task pacing should
> collaborate with EAS and find_energy_efficient_cpu()

ok, noted.

Thanks,
Parth

>>>
>>>> +               else
>>>> +                       new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>>>>
>>>>                 if (want_affine)
>>>>                         current->recent_used_cpu = cpu;
>>>> --
>>>> 2.17.1
>>>>
>>



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-08 16:52       ` Dietmar Eggemann
@ 2019-10-09  8:57         ` Parth Shah
  2019-10-09 14:26           ` Dietmar Eggemann
  0 siblings, 1 reply; 19+ messages in thread
From: Parth Shah @ 2019-10-09  8:57 UTC (permalink / raw)
  To: Dietmar Eggemann, Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Patrick Bellasi, Valentin Schneider, Pavel Machek, Doug Smythies,
	Quentin Perret, Rafael J. Wysocki, Tim Chen, Daniel Lezcano



On 10/8/19 10:22 PM, Dietmar Eggemann wrote:
> [- Quentin Perret <quentin.perret@arm.com>]
> [+ Quentin Perret <qperret@qperret.net>]
> 
> See commit c193a3ffc282 ("mailmap: Update email address for Quentin Perret")
> 

noted. thanks for notifying me.

> On 07/10/2019 18:53, Parth Shah wrote:
>>
>>
>> On 10/7/19 5:49 PM, Vincent Guittot wrote:
>>> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
>>>>
>>>> The algorithm finds the first non idle core in the system and tries to
>>>> place a task in the idle CPU in the chosen core. To maintain
>>>> cache hotness, work of finding non idle core starts from the prev_cpu,
>>>> which also reduces task ping-pong behaviour inside of the core.
>>>>
>>>> Define a new method to select_non_idle_core which keep tracks of the idle
>>>> and non-idle CPUs in the core and based on the heuristics determines if the
>>>> core is sufficiently busy to place the incoming backgroung task. The
>>>> heuristic further defines the non-idle CPU into either busy (>12.5% util)
>>>> CPU and overutilized (>80% util) CPU.
>>>> - The core containing more idle CPUs and no busy CPUs is not selected for
>>>>   packing
>>>> - The core if contains more than 1 overutilized CPUs are exempted from
>>>>   task packing
>>>> - Pack if there is atleast one busy CPU and overutilized CPUs count is <2
>>>>
>>>> Value of 12.5% utilization for busy CPU gives sufficient heuristics for CPU
>>>> doing enough work an
> 
> [...]
> 
>>>> @@ -6483,7 +6572,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>         } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>>>>                 /* Fast path */
>>>>
>>>> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>>>> +               if (is_turbosched_enabled() && unlikely(is_background_task(p)))
>>>> +                       new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
>>>> +                                                                 new_cpu);
>>>
>>> Could you add turbosched_select_non_idle_core() similarly to
>>> find_energy_efficient_cpu() ?
>>> Add it at the beg select_task_rq_fair()
>>> Return immediately with theCPU if you have found one
>>> Or let the normal path select a CPU if the
>>> turbosched_select_non_idle_core() has not been able to find a suitable
>>> CPU for packing
>>>
>>
>> of course. I can do that.
>> I was just not aware about the effect of wake_affine and so was waiting for
>> such comments to be sure of. Thanks for this.
>> Maybe I can add just below the sched_energy_present(){...} construct giving
>> precedence to EAS? I'm asking this because I remember Patrick telling me to
>> leverage task packing for android as well?
> 
> I have a hard time imaging that Turbosched will be used in Android next
> to EAS in the foreseeable future.
> 
> First of all, EAS provides task packing already on Performance Domain
> (PD) level (a.k.a. as cluster on traditional 2-cluster Arm/Arm64
> big.LITTLE or DynamIQ (with Phantom domains (out of tree solution)).
> This is where we can safe energy without harming latency.
> 
> See the tests results under '2.1 Energy test case' in
> 
> https://lore.kernel.org/r/20181203095628.11858-1-quentin.perret@arm.com
> 
> There are 10 to 50 small (classified solely by task utilization) tasks
> per test case and EAS shows an effect on energy consumption by packing
> them onto the PD (cluster) of the small CPUs.
> 
> And second, the CPU supported topology is different to the one you're
> testing on.
> 

cool. I was just keeping in mind the following quote
" defining a generic spread-vs-pack wakeup policy which is something
Android also could benefit from " (https://lkml.org/lkml/2019/6/28/628)

BTW, IIUC that does task consolidation only on a single CPU unless
rd->overload is set, right?

> [...]
> 



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
       [not found] ` <20191008132842.6612-1-hdanton@sina.com>
@ 2019-10-09  9:22   ` Parth Shah
  2019-10-09 11:34     ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Parth Shah @ 2019-10-09  9:22 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-kernel, linux-pm, peterz, mingo, vincent.guittot,
	dietmar.eggemann, patrick.bellasi, valentin.schneider, pavel,
	dsmythies, Quentin Perret, rafael.j.wysocki, tim.c.chen,
	daniel.lezcano



On 10/8/19 6:58 PM, Hillf Danton wrote:
> 
> On Mon,  7 Oct 2019 14:00:49 +0530 Parth Shah wrote:
>> +/*
>> + * Try to find a non idle core in the system  based on few heuristics:
>> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
>> + * - If none CPUs are busy then do not select the core for task packing
>> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
>> + *   count is < busy/2 CPU count
>> + * - Always select idle CPU for task packing
>> + */
>> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
>> +{
>> +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
>> +	int iter_cpu, sibling;
>> +
>> +	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
>> +
>> +	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
>> +		int idle_cpu_count = 0, non_idle_cpu_count = 0;
>> +		int overutil_cpu_count = 0;
>> +		int busy_cpu_count = 0;
>> +		int best_cpu = iter_cpu;
>> +
>> +		for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
>> +			__cpumask_clear_cpu(sibling, cpus);
>> +			if (idle_cpu(iter_cpu)) {
> 
> Would you please elaborate the reasons that the iter cpu is checked idle
> more than once for finding a busy core?
> 

Thanks for looking at the patches.
Could you please point out where iter_cpu is checked more than once?

>> +				idle_cpu_count++;
>> +				best_cpu = iter_cpu;
>> +			} else {
>> +				non_idle_cpu_count++;
>> +				if (cpu_overutilized(iter_cpu))
>> +					overutil_cpu_count++;
>> +				if (is_cpu_busy(cpu_util(iter_cpu)))
>> +					busy_cpu_count++;
>> +			}
>> +		}
>> +
> 



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-09  9:22   ` [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
@ 2019-10-09 11:34     ` Vincent Guittot
  2019-10-09 16:55       ` Parth Shah
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2019-10-09 11:34 UTC (permalink / raw)
  To: Parth Shah
  Cc: Hillf Danton, linux-kernel, open list:THERMAL, Peter Zijlstra,
	Ingo Molnar, Dietmar Eggemann, Patrick Bellasi,
	Valentin Schneider, Pavel Machek, Doug Smythies, Quentin Perret,
	Rafael J. Wysocki, Tim Chen, Daniel Lezcano

On Wed, 9 Oct 2019 at 11:23, Parth Shah <parth@linux.ibm.com> wrote:
>
>
>
> On 10/8/19 6:58 PM, Hillf Danton wrote:
> >
> > On Mon,  7 Oct 2019 14:00:49 +0530 Parth Shah wrote:
> >> +/*
> >> + * Try to find a non idle core in the system  based on few heuristics:
> >> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
> >> + * - If none CPUs are busy then do not select the core for task packing
> >> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
> >> + *   count is < busy/2 CPU count
> >> + * - Always select idle CPU for task packing
> >> + */
> >> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
> >> +{
> >> +    struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
> >> +    int iter_cpu, sibling;
> >> +
> >> +    cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
> >> +
> >> +    for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
> >> +            int idle_cpu_count = 0, non_idle_cpu_count = 0;
> >> +            int overutil_cpu_count = 0;
> >> +            int busy_cpu_count = 0;
> >> +            int best_cpu = iter_cpu;
> >> +
> >> +            for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
> >> +                    __cpumask_clear_cpu(sibling, cpus);
> >> +                    if (idle_cpu(iter_cpu)) {
> >
> > Would you please elaborate the reasons that the iter cpu is checked idle
> > more than once for finding a busy core?
> >
>
> Thanks for looking at the patches.
> Could you please point me out where iter_cpu is checked more than once?

I think the point is that although you iterate with
for_each_cpu(sibling, cpu_smt_mask(iter_cpu)), you never use sibling
in the loop except for clearing it in the cpumask cpus.
All the tests are done with iter_cpu, so you will test iter_cpu several
times but never the other siblings.
Should you use sibling instead?


>
> >> +                            idle_cpu_count++;
> >> +                            best_cpu = iter_cpu;
> >> +                    } else {
> >> +                            non_idle_cpu_count++;
> >> +                            if (cpu_overutilized(iter_cpu))
> >> +                                    overutil_cpu_count++;
> >> +                            if (is_cpu_busy(cpu_util(iter_cpu)))
> >> +                                    busy_cpu_count++;
> >> +                    }
> >> +            }
> >> +
> >
>


* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-09  8:57         ` Parth Shah
@ 2019-10-09 14:26           ` Dietmar Eggemann
  2019-10-09 17:02             ` Parth Shah
  0 siblings, 1 reply; 19+ messages in thread
From: Dietmar Eggemann @ 2019-10-09 14:26 UTC (permalink / raw)
  To: Parth Shah, Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Patrick Bellasi, Valentin Schneider, Pavel Machek, Doug Smythies,
	Quentin Perret, Rafael J. Wysocki, Tim Chen, Daniel Lezcano

On 09/10/2019 10:57, Parth Shah wrote:

[...]

>> On 07/10/2019 18:53, Parth Shah wrote:
>>>
>>>
>>> On 10/7/19 5:49 PM, Vincent Guittot wrote:
>>>> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:

[...]

>>> Maybe I can add just below the sched_energy_present(){...} construct giving
>>> precedence to EAS? I'm asking this because I remember Patrick telling me to
>>> leverage task packing for android as well?
>>
>> I have a hard time imaging that Turbosched will be used in Android next
>> to EAS in the foreseeable future.
>>
>> First of all, EAS provides task packing already on Performance Domain
>> (PD) level (a.k.a. as cluster on traditional 2-cluster Arm/Arm64
>> big.LITTLE or DynamIQ (with Phantom domains (out of tree solution)).
>> This is where we can safe energy without harming latency.
>>
>> See the tests results under '2.1 Energy test case' in
>>
>> https://lore.kernel.org/r/20181203095628.11858-1-quentin.perret@arm.com
>>
>> There are 10 to 50 small (classified solely by task utilization) tasks
>> per test case and EAS shows an effect on energy consumption by packing
>> them onto the PD (cluster) of the small CPUs.
>>
>> And second, the CPU supported topology is different to the one you're
>> testing on.
>>
> 
> cool. I was just keeping in mind the following quote
> " defining a generic spread-vs-pack wakeup policy which is something
> Android also could benefit from " (https://lkml.org/lkml/2019/6/28/628)

The main thing is that if we want to introduce new functionality into CFS,
we should try hard to use existing infrastructure (or infrastructure there
is agreement we will need) as much as possible.

If I understand Patrick here correctly, he suggested not to use uclamp
but the task latency nice approach. There is agreement that we would
need something like this as infrastructure:

https://lore.kernel.org/r/20190830174944.21741-1-subhra.mazumdar@oracle.com

So p->latency_nice is suitable to include your p->flags |=
PF_CAN_BE_PACKED concept nicely.
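
E.g. a rough sketch (LATENCY_NICE_MAX and the exact semantics are
assumptions taken from the latency-nice proposal, not existing mainline
code):

static inline bool is_background_task(struct task_struct *p)
{
	/* Tasks that fully tolerate latency are candidates for packing */
	return p->latency_nice == LATENCY_NICE_MAX;
}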

> 
> BTW, IIUC that does task consolidation only on single CPU unless
> rd->overload is set, right?

Task consolidation on Performance Domains (PDs) w/ multiple CPUs (e.g.
on a per-cluster PD big.LITTLE system) only works when the system is not
overutilized:

6326 int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
6327 {
...
6337         if (!pd || *READ_ONCE(rd->overutilized)*)
6338                 goto fail;
...

[...]


* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-09 11:34     ` Vincent Guittot
@ 2019-10-09 16:55       ` Parth Shah
  0 siblings, 0 replies; 19+ messages in thread
From: Parth Shah @ 2019-10-09 16:55 UTC (permalink / raw)
  To: Vincent Guittot, Hillf Danton
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Patrick Bellasi, Valentin Schneider,
	Pavel Machek, Doug Smythies, Quentin Perret, Rafael J. Wysocki,
	Tim Chen, Daniel Lezcano



On 10/9/19 5:04 PM, Vincent Guittot wrote:
> On Wed, 9 Oct 2019 at 11:23, Parth Shah <parth@linux.ibm.com> wrote:
>>
>>
>>
>> On 10/8/19 6:58 PM, Hillf Danton wrote:
>>>
>>> On Mon,  7 Oct 2019 14:00:49 +0530 Parth Shah wrote:
>>>> +/*
>>>> + * Try to find a non idle core in the system  based on few heuristics:
>>>> + * - Keep track of overutilized (>80% util) and busy (>12.5% util) CPUs
>>>> + * - If none CPUs are busy then do not select the core for task packing
>>>> + * - If atleast one CPU is busy then do task packing unless overutilized CPUs
>>>> + *   count is < busy/2 CPU count
>>>> + * - Always select idle CPU for task packing
>>>> + */
>>>> +static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
>>>> +{
>>>> +    struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
>>>> +    int iter_cpu, sibling;
>>>> +
>>>> +    cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
>>>> +
>>>> +    for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
>>>> +            int idle_cpu_count = 0, non_idle_cpu_count = 0;
>>>> +            int overutil_cpu_count = 0;
>>>> +            int busy_cpu_count = 0;
>>>> +            int best_cpu = iter_cpu;
>>>> +
>>>> +            for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
>>>> +                    __cpumask_clear_cpu(sibling, cpus);
>>>> +                    if (idle_cpu(iter_cpu)) {
>>>
>>> Would you please elaborate the reasons that the iter cpu is checked idle
>>> more than once for finding a busy core?
>>>
>>
>> Thanks for looking at the patches.
>> Could you please point me out where iter_cpu is checked more than once?
> 
> I think that point is that you have a sibling that there is
> for_each_cpu(sibling, cpu_smt_mask(iter_cpu) but you never use sibling
> in the loop except for clearing it on the cpumask cpus
> All the tests are done with iter_cpu so you will test several time
> iter_cpus but never the other sibling
> Should you use sibling instead ?
> 

Oh, got it. It was unintentional, my bad.
Good find.

I did s/iter_cpu/sibling/ at the required places:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4a1b6474338..a75c2b382771 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6001,14 +6001,14 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)

                for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
                        __cpumask_clear_cpu(sibling, cpus);
-                       if (idle_cpu(iter_cpu)) {
+                       if (idle_cpu(sibling)) {
                                idle_cpu_count++;
-                               best_cpu = iter_cpu;
+                               best_cpu = sibling;
                        } else {
                                non_idle_cpu_count++;
-                               if (cpu_overutilized(iter_cpu))
+                               if (cpu_overutilized(sibling))
                                        overutil_cpu_count++;
-                               if (is_cpu_busy(cpu_util(iter_cpu)))
+                               if (is_cpu_busy(cpu_util(sibling)))
                                        busy_cpu_count++;
                        }
                }

and then took the results again to check for functionality changes.
Results are still within bounds, with a maximum of 15% gain in performance
and <2% regression.

		Frequency benefit of TurboSched w.r.t. CFS
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
20 +-+ + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + +-+
   |                                      Frequency benefit in %         |
   |                    **                                               |
15 +-+        * * **    ******                                         +-+
   |          * * ************                                           |
   |       ** * * ************ *                                         |
10 +-+     ** * * ************ *                                       +-+
   |       ** * * ************ * * ****                                  |
   |     **** * * ************ * * ****                                  |
 5 +-+ ****** * * ************ * * ******                              +-+
   |   ****** * * ************ * * **********                            |
 0 +-******** * * ************ * * ************ * * * ********** * * * **+
   |                                                                     |
   |                                                                     |
-5 +-+                                                                 +-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
     0 1 2 3  4 5 6 7 8 91011 1213141516171819 2021222324252627 28293031
                           No. of workload threads

		 Performance benefit of TurboSched w.r.t. CFS
20 +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   |                                    Performance benefit in %         |
15 +-+                    **                                           +-+
   |                      **                                             |
   |                  ******** *                                         |
10 +-+                ******** *   **                                  +-+
   |                  ******** * * **                                    |
   |                  ******** * * **                                    |
 5 +-+              ********** * * ******                              +-+
   |                ********** * * **********                            |
   |              ************ * * **********   *       **               |
 0 +-******** * * ************ * * ************ * * * ********** * * * **+
   | ******** *                                                          |
   |                                                                     |
-5 +-+                                                                 +-+
   | + + + +  + + + + + + + +  + + + + + + + +  + + + + + + + +  + + + + |
   +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
     0 1 2 3  4 5 6 7 8 91011 1213141516171819 2021222324252627 28293031
                           No. of workload threads

Thanks,
Parth

> 
>>
>>>> +                            idle_cpu_count++;
>>>> +                            best_cpu = iter_cpu;
>>>> +                    } else {
>>>> +                            non_idle_cpu_count++;
>>>> +                            if (cpu_overutilized(iter_cpu))
>>>> +                                    overutil_cpu_count++;
>>>> +                            if (is_cpu_busy(cpu_util(iter_cpu)))
>>>> +                                    busy_cpu_count++;
>>>> +                    }
>>>> +            }
>>>> +
>>>
>>



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-09 14:26           ` Dietmar Eggemann
@ 2019-10-09 17:02             ` Parth Shah
  2019-10-10 14:53               ` Dietmar Eggemann
  0 siblings, 1 reply; 19+ messages in thread
From: Parth Shah @ 2019-10-09 17:02 UTC (permalink / raw)
  To: Dietmar Eggemann, Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Patrick Bellasi, Valentin Schneider, Pavel Machek, Doug Smythies,
	Quentin Perret, Rafael J. Wysocki, Tim Chen, Daniel Lezcano



On 10/9/19 7:56 PM, Dietmar Eggemann wrote:
> On 09/10/2019 10:57, Parth Shah wrote:
> 
> [...]
> 
>>> On 07/10/2019 18:53, Parth Shah wrote:
>>>>
>>>>
>>>> On 10/7/19 5:49 PM, Vincent Guittot wrote:
>>>>> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:
> 
> [...]
> 
>>>> Maybe I can add just below the sched_energy_present(){...} construct giving
>>>> precedence to EAS? I'm asking this because I remember Patrick telling me to
>>>> leverage task packing for android as well?
>>>
>>> I have a hard time imaging that Turbosched will be used in Android next
>>> to EAS in the foreseeable future.
>>>
>>> First of all, EAS provides task packing already on Performance Domain
>>> (PD) level (a.k.a. as cluster on traditional 2-cluster Arm/Arm64
>>> big.LITTLE or DynamIQ (with Phantom domains (out of tree solution)).
>>> This is where we can safe energy without harming latency.
>>>
>>> See the tests results under '2.1 Energy test case' in
>>>
>>> https://lore.kernel.org/r/20181203095628.11858-1-quentin.perret@arm.com
>>>
>>> There are 10 to 50 small (classified solely by task utilization) tasks
>>> per test case and EAS shows an effect on energy consumption by packing
>>> them onto the PD (cluster) of the small CPUs.
>>>
>>> And second, the CPU supported topology is different to the one you're
>>> testing on.
>>>
>>
>> cool. I was just keeping in mind the following quote
>> " defining a generic spread-vs-pack wakeup policy which is something
>> Android also could benefit from " (https://lkml.org/lkml/2019/6/28/628)
> 
> The main thing is that in case we want to introduce a new functionality
> into CFS, we should try hard to use existing infrastructure (or
> infrastructure there is agreement on that we'll need it) as much as
> possible.
> 
> If I understand Patrick here correctly, he suggested not to use uclamp
> but the task latency nice approach. There is agreement that we would
> need something like this as infrastructure:
> 
> https://lore.kernel.org/r/20190830174944.21741-1-subhra.mazumdar@oracle.com
> 

got it.

> So p->latency_nice is suitable to include your p->flags |=
> PF_CAN_BE_PACKED concept nicely.

yeah, I'm working on that part too as a bigger goal.

> 
>>
>> BTW, IIUC that does task consolidation only on single CPU unless
>> rd->overload is set, right?
> 
> Task consolidation on Performance Domains (PDs) w/ multiple CPUs (e.g.
> on a per-cluster PD big.LITTLE system) only works when the system is not
> overutilized:
> 
> 6326 int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> 6327 {
> ...
> 6337         if (!pd || *READ_ONCE(rd->overutilized)*)
> 6338                 goto fail;
> ...

OK, so does that mean TurboSched can still do some good in such systems as
well?
I mean, save energy even when rd->overutilized==1 by not waking
user-classified bg tasks on an idle core.

> 
> [...]
> 

Thanks,
Parth



* Re: [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores
  2019-10-09 17:02             ` Parth Shah
@ 2019-10-10 14:53               ` Dietmar Eggemann
  0 siblings, 0 replies; 19+ messages in thread
From: Dietmar Eggemann @ 2019-10-10 14:53 UTC (permalink / raw)
  To: Parth Shah, Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Peter Zijlstra, Ingo Molnar,
	Patrick Bellasi, Valentin Schneider, Pavel Machek, Doug Smythies,
	Quentin Perret, Rafael J. Wysocki, Tim Chen, Daniel Lezcano

On 09/10/2019 19:02, Parth Shah wrote:
> 
> 
> On 10/9/19 7:56 PM, Dietmar Eggemann wrote:
>> On 09/10/2019 10:57, Parth Shah wrote:
>>
>> [...]
>>
>>>> On 07/10/2019 18:53, Parth Shah wrote:
>>>>>
>>>>>
>>>>> On 10/7/19 5:49 PM, Vincent Guittot wrote:
>>>>>> On Mon, 7 Oct 2019 at 10:31, Parth Shah <parth@linux.ibm.com> wrote:

[...]

> ok. so does that mean TurboSched can still do some good in such systems as
> well ?
> I mean save energy even when rd->overutilized==1 by not waking user
> classified bg tasks on idle core.

I wouldn't say it is impossible but how likely would it be?

The Android runtime already classifies its tasks into groups such as
background, foreground, top-app, etc. It uses existing infrastructure
like cpusets, taskgroups, util_clamp (or its out-of-tree predecessor
schedtune) as well as EAS/Energy Model on asymmetric CPU capacity
systems to map them (differently) onto the CPU topology to achieve the
best possible performance/energy consumption trade-off.

Moreover, Google and Arm are keen on getting the concept of 'latency nice'
upstream so we can map the Android Common Kernel's 'prefer idle' feature
onto the mainline energy-aware wake-up path.

So I'm afraid the question whether TurboSched could make sense on an
Android system can only be answered by people responsible for future
Android runtime architecture.


Thread overview: 19+ messages
2019-10-07  8:30 [RFC v5 0/6] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations Parth Shah
2019-10-07  8:30 ` [RFC v5 1/6] sched/core: Add manual background task classification using sched_setattr syscall Parth Shah
2019-10-07  8:30 ` [RFC v5 2/6] sched: Introduce switch to enable TurboSched for task packing Parth Shah
2019-10-07  8:30 ` [RFC v5 3/6] sched/core: Update turbo_sched count only when required Parth Shah
2019-10-07  8:30 ` [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
2019-10-07 12:19   ` Vincent Guittot
2019-10-07 16:53     ` Parth Shah
2019-10-08 16:20       ` Vincent Guittot
2019-10-09  8:46         ` Parth Shah
2019-10-08 16:52       ` Dietmar Eggemann
2019-10-09  8:57         ` Parth Shah
2019-10-09 14:26           ` Dietmar Eggemann
2019-10-09 17:02             ` Parth Shah
2019-10-10 14:53               ` Dietmar Eggemann
2019-10-07  8:30 ` [RFC v5 5/6] sched/fair: Provide arch hook to find domain for non idle core search scan Parth Shah
2019-10-07  8:30 ` [RFC v5 6/6] powerpc: Set turbo domain to NUMA node for task packing Parth Shah
     [not found] ` <20191008132842.6612-1-hdanton@sina.com>
2019-10-09  9:22   ` [RFC v5 4/6] sched/fair: Tune task wake-up logic to pack small background tasks on fewer cores Parth Shah
2019-10-09 11:34     ` Vincent Guittot
2019-10-09 16:55       ` Parth Shah
