* [RFC v4 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

This is the 4th version of the patchset to sustain Turbo frequencies for
longer durations.

The previous versions can be found here:
v3: https://lkml.org/lkml/2019/6/25/25
v2: https://lkml.org/lkml/2019/5/15/1258
v1: https://lwn.net/Articles/783959/

The changes in this version are:
v[3] -> v[4]:
- Based on Patrick Bellasi's comments, removed the use of the UCLAMP based
  mechanism to classify tasks as jitter
- Added support to sched_setattr to mark a task as jitter by adding a new
  flag to the existing task_struct->flags attribute. This was done to avoid
  adding any new variable inside task_struct and thus to avoid bloating its
  size.
- No functional changes

v[2] -> v[3]:
- Added a new attribute in task_struct to allow per-task jitter
  classification so that the scheduler can use it as a request to change
  the wakeup path for task packing
- Use a syscall for jitter classification, removed cgroup based task
  classification
- Use a mutex instead of a spinlock to get rid of the task-sleeping problem
- Changed _Bool->int everywhere
- Split a few patches to keep arch specific code separate from core
  scheduler code
ToDo:
- Recompute core capacity only during CPU-Hotplug operations
- Remove smt capacity

v[1] -> v[2]:
- No classification of CPU bound tasks; only jitter tasks are classified,
  from the cpu cgroup controller
- Use of a spinlock rather than a mutex to count the number of jitters in
  the system classified from the cgroup
- The architecture specific implementation of the core capacity
  multiplication factor changes dynamically based on the number of active
  threads in the core
- Selection of a non-idle core in the system is bounded by the DIE domain
- Use of the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask", and instead use the sd for the DIE domain to
  find a better fit



Abstract
========

Modern servers allow multiple cores to run at a range of frequencies
higher than the rated frequency range. But the power budget of the system
inhibits sustaining these higher frequencies for longer durations.

However, when certain cores are put into idle states, the power can be
effectively channelled to other busy cores, allowing them to sustain the
higher frequency.

One way to achieve this is to pack tasks onto fewer cores and keep the
others idle, but this may impose a performance penalty on the packed tasks,
and then sustaining higher frequencies proves to be of no benefit. However,
if one can identify unimportant low-utilization tasks which can be packed
onto the already active cores, then waking up new cores can be avoided.
Such short and/or bursty tasks are "jitter tasks", and waking up a new core
for them is expensive.

The current CFS algorithm in the kernel scheduler is performance oriented
and hence tries to assign an idle CPU first when waking up tasks. This
policy is fine for the major categories of workloads, but for jitter tasks
one can save energy by packing them onto already active cores and allowing
those cores to run at higher frequencies.

This patch-set tunes the task wake-up logic in the scheduler to pack
explicitly classified jitter tasks onto busy cores. The work involves
classifying jitter tasks using a syscall based mechanism.

In brief, if we can pack jitter tasks onto busy cores then we can save
power by keeping other cores idle and allowing the busier cores to run at
turbo frequencies; the patch-set tries to achieve this in the simplest
manner. Though there are some challenges in implementing it (like
smt_capacity, un-needed arch hooks, etc.) and I'm trying to work around
them, it would be great to have a discussion around these patches.


Implementation
==============

These patches use a syscall based mechanism to classify tasks as jitter.
The task wakeup logic uses this information to pack such tasks onto cores
which are already busy running CPU intensive tasks. The task packing is
done only in `select_task_rq_fair`, so that in case of a wrong decision the
load balancer may pull the classified jitter tasks away to maximize
performance.
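
For reference, a minimal userspace sketch of the classification step could
look like the snippet below. This is a hypothetical example, not part of the
series: the struct layout is hand-rolled because glibc provides no
sched_setattr() wrapper, error handling is trimmed, and
SCHED_FLAG_TASK_PACKING is the 0x80 flag added in patch 1/8, so the call
only succeeds on a kernel with this series applied.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_FLAG_TASK_PACKING
#define SCHED_FLAG_TASK_PACKING	0x80	/* added by this series */
#endif

/* Hand-rolled sched_attr; only the fields up to sched_period are needed. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= 0,	/* SCHED_OTHER, i.e. stay in CFS */
		.sched_flags	= SCHED_FLAG_TASK_PACKING, /* mark self as jitter */
	};

	/* pid 0 means the calling task; the last argument is flags (unused). */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	/* ... the task's actual housekeeping work would run here ... */
	return 0;
}
```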

We define a core to be non-idle if it is utilized beyond 12.5% of its
capacity; the jitters are packed onto these cores using a first-fit
approach.

To demonstrate/benchmark, one can use a synthetic workload generator
`turbo_bench.c`[1] available at
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c

Following snippet demonstrates the use of TurboSched feature:
```
i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
```
This spawns 2*i threads in total: i CPU-bound threads and i jitter threads.

The current implementation packs only jitter classified tasks onto the
first busy cores, but it can be further optimized by taking userspace input
about important tasks and keeping track of them. This makes the search for
non-idle cores more optimized and also more accurate, as userspace hints
are safer than auto-classified busy cores/tasks.


Result
======

The patch-set proves to be useful for systems and workloads where the
frequency boost is found to be more beneficial than spreading the tasks
across more cores. On an IBM POWER9 system, the benefit for a workload can
be up to 13%.

                Performance benefit of TurboSched w.r.t. CFS 
   +--+--+--+--+--+--+-+--+--+--+--+--+--+--+--+--+--+--+-+--+--+--+--+--+
   |  +  +  +  +  +  + +  +  +  +  +  +  +  +  +  +  +  + +  +  +  +  +  |
15 +-+                                  Performance benefit in %       +-+
   |                         **                                          |
   |                         ** **                                       |
10 +-+                       ** ** **                                  +-+
   |                         ** ** **                                    |
   |                         ** ** **                                    |
 5 +-+                 ** ** ** ** **    **                            +-+
   |                   ** ** ** ** ** ** ** **                           |
   |                   ** ** ** ** ** ** ** ** ** **                     |
   |                 * ** ** ** ** ** ** ** ** ** ** ** *                |
 0 +-+** ** ** ** ** * ** ** ** ** ** ** ** ** ** ** ** * ** ** ** ** **-+
   |  ** ** ** **                                                        |
   |  **                                                                 |
-5 +-+                                                                 +-+
   |  +  +  +  +  +  + +  +  +  +  +  +  +  +  +  +  +  + +  +  +  +  +  |
   +--+--+--+--+--+--+-+--+--+--+--+--+--+--+--+--+--+--+-+--+--+--+--+--+
      2  3  4  5  6  7 8  9 10 11 12 13 14 15 16 17 18 1920 21 22 23 24   
                           No. of workload threads                        


                      Frequency benefit of TurboSched w.r.t. CFS
   +--+--+--+--+--+--+-+--+--+--+--+--+--+--+--+--+--+--+-+--+--+--+--+--+
   |  +  +  +  +  +  + +  +  +  +  +  +  +  +  +  +  +  + +  +  +  +  +  |
15 +-+                                    Frequency benefit in %       +-+
   |                         **                                          |
   |                         **                                          |
10 +-+            **         **                                        +-+
   |              **         ** **                                       |
   |        **    ** * **    ** **                                       |
 5 +-+      ** ** ** * ** ** ** **                                     +-+
   |     ** ** ** ** * ** ** ** **    **                                 |
   |  ** ** ** ** ** * ** ** ** ** ** **                                 |
   |  ** ** ** ** ** * ** ** ** ** ** ** ** ** **                        |
 0 +-+** ** ** ** ** * ** ** ** ** ** ** ** ** ** ** ** * ** ** ** ** **-+
   |                                                                     |
   |                                                                     |
-5 +-+                                                                 +-+
   |  +  +  +  +  +  + +  +  +  +  +  +  +  +  +  +  +  + +  +  +  +  +  |
   +--+--+--+--+--+--+-+--+--+--+--+--+--+--+--+--+--+--+-+--+--+--+--+--+
      2  3  4  5  6  7 8  9 10 11 12 13 14 15 16 17 18 1920 21 22 23 24   
                             No. of workload threads                      

 
These numbers are with respect to the `turbo_bench.c` multi-threaded test
benchmark, which can create two kinds of tasks: CPU bound (high utilization)
and jitters (low utilization). N on the X-axis represents N CPU-bound and
N jitter tasks spawned.


Series organization
===================
- Patches [01-03]: Jitter tasks classification using syscall
- Patches [04-05]: Defines Core Capacity to limit task packing
- Patches [06-08]: Tune CFS task wakeup logic to pack tasks onto busy
  cores

The series can be applied on top of tip/sched/core at
commit af24bde8df20 ("sched/uclamp: Add uclamp support to energy_compute()")


Parth Shah (8):
  sched/core: Add manual jitter classification using sched_setattr
    syscall
  sched: Introduce switch to enable TurboSched mode
  sched/core: Update turbo_sched count only when required
  sched/fair: Define core capacity to limit task packing
  powerpc: Define Core Capacity for POWER systems
  sched/fair: Tune task wake-up logic to pack jitter tasks
  sched/fair: Bound non idle core search within LLC domain
  powerpc: Set turbo domain to NUMA node for task packing

 arch/powerpc/include/asm/topology.h |   7 ++
 arch/powerpc/kernel/smp.c           |  38 ++++++++
 include/linux/sched.h               |   1 +
 include/uapi/linux/sched.h          |   4 +-
 kernel/sched/core.c                 |  39 ++++++++
 kernel/sched/fair.c                 | 135 +++++++++++++++++++++++++++-
 kernel/sched/sched.h                |   9 ++
 7 files changed, 231 insertions(+), 2 deletions(-)

-- 
2.17.1



* [RFC v4 1/8] sched/core: Add manual jitter classification using sched_setattr syscall
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Jitter tasks are short/bursty tasks, typically performing some housekeeping
work, and are less important in the overall scheme of things.

So provide a way to mark a task as jitter by adding a new flag to the
existing task attributes. Also provide a userspace interface which uses the
sched_setattr syscall to mark tasks as jitter.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 include/linux/sched.h      | 1 +
 include/uapi/linux/sched.h | 4 +++-
 kernel/sched/core.c        | 9 +++++++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1113dd4706ae..e03b85166e34 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1463,6 +1463,7 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_CAN_BE_PACKED	0x20000000	/* Provide hints to the scheduler to pack such tasks */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 617bb59aa8ba..fccb1c57d037 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -55,6 +55,7 @@
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_TASK_PACKING		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -66,6 +67,7 @@
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_TASK_PACKING)
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..e7cda4aa8696 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4498,6 +4498,8 @@ static void __setscheduler_params(struct task_struct *p,
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
+	if (attr->sched_flags & SCHED_FLAG_TASK_PACKING)
+		p->flags |= PF_CAN_BE_PACKED;
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
@@ -4557,6 +4559,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	struct rq_flags rf;
 	int reset_on_fork;
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+	unsigned long long task_packing_flag =
+				attr->sched_flags & SCHED_FLAG_TASK_PACKING;
 	struct rq *rq;
 
 	/* The pi code expects interrupts enabled */
@@ -4686,6 +4690,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (task_packing_flag)
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		task_rq_unlock(rq, p, &rf);
@@ -5181,6 +5187,9 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	attr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
 
+	if (p->flags & PF_CAN_BE_PACKED)
+		attr.sched_flags |= SCHED_FLAG_TASK_PACKING;
+
 	rcu_read_unlock();
 
 	retval = sched_read_attr(uattr, &attr, size);
-- 
2.17.1



* [RFC v4 2/8] sched: Introduce switch to enable TurboSched mode
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Create a static key which allows the TurboSched feature to be enabled or
disabled at runtime.

This key is added in order to enable the TurboSched feature only when
required. This helps keep the scheduler fast-path optimized when the
TurboSched feature is disabled.

Also provide get/put methods to keep track of the tasks using the
TurboSched feature, i.e. to refcount the jitter tasks. This allows the
feature to be enabled when the first task is classified as jitter and,
similarly, disabled when the last such task is unset.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 20 ++++++++++++++++++++
 kernel/sched/sched.h |  9 +++++++++
 2 files changed, 29 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7cda4aa8696..ee5980b4e150 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,26 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+static DEFINE_MUTEX(turbo_sched_lock);
+static int turbo_sched_count;
+
+void turbo_sched_get(void)
+{
+	mutex_lock(&turbo_sched_lock);
+	if (!turbo_sched_count++)
+		static_branch_enable(&__turbo_sched_enabled);
+	mutex_unlock(&turbo_sched_lock);
+}
+
+void turbo_sched_put(void)
+{
+	mutex_lock(&turbo_sched_lock);
+	if (!--turbo_sched_count)
+		static_branch_disable(&__turbo_sched_enabled);
+	mutex_unlock(&turbo_sched_lock);
+}
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..4a0b90ea8652 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2423,3 +2423,12 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+void turbo_sched_get(void);
+void turbo_sched_put(void);
+DECLARE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+
+static inline bool is_turbosched_enabled(void)
+{
+	return static_branch_unlikely(&__turbo_sched_enabled);
+}
-- 
2.17.1



* [RFC v4 3/8] sched/core: Update turbo_sched count only when required
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Use the get/put methods to add/remove the use of TurboSched support, such
that the feature is turned on only if there is at least one jitter task.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee5980b4e150..60340fa18abb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3141,6 +3141,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		mmdrop(mm);
 	}
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (unlikely(prev->flags & PF_CAN_BE_PACKED))
+			turbo_sched_put();
+
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
@@ -4793,6 +4796,10 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	prev_class = p->sched_class;
 
+	/* Refcount tasks classified as jitter task */
+	if (task_packing_flag != (p->flags & PF_CAN_BE_PACKED))
+		(task_packing_flag) ? turbo_sched_get() : turbo_sched_put();
+
 	__setscheduler(rq, p, attr, pi);
 	__setscheduler_uclamp(p, attr);
 
-- 
2.17.1



* [RFC v4 4/8] sched/fair: Define core capacity to limit task packing
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Define a method named arch_scale_core_capacity which returns the capacity
of a core. This method will be used in later patches to determine whether
there is spare capacity left in the core to pack jitter tasks.

For some architectures, core capacity does not increase much with the
number of threads (or CPUs) in the core. For such cases, architecture
specific calculations need to be done to find the core capacity.

In addition to this, provide a default implementation for scaling the core
capacity.

ToDo: As per Peter's comments, if we are getting rid of SMT capacity then
we need to find another way of limiting task packing. I'm working on
finding a solution for this, but would like to get community feedback first
to have a better view.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b798fe7ff7cd..793e1172afc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5964,6 +5964,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+#ifdef CONFIG_SCHED_SMT
+
+#ifndef arch_scale_core_capacity
+static inline unsigned long arch_scale_core_capacity(int first_thread,
+						     unsigned long smt_cap)
+{
+	/* Default capacity of core is sum of cap of all the threads */
+	unsigned long ret = 0;
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(first_thread))
+		ret += cpu_rq(sibling)->cpu_capacity;
+
+	return ret;
+}
+#endif
+
+#endif
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-- 
2.17.1



* [RFC v4 5/8] powerpc: Define Core Capacity for POWER systems
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Tune arch_scale_core_capacity for the powerpc architecture by scaling
capacity w.r.t. the number of online SMT threads in the core, such that for
SMT-4 the core capacity is 1.5x the capacity of a sibling thread.
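
As a rough illustration of that scaling (the multiplication factor is
1 + smt_mode * 0.125, as noted in the code comment below), assuming a purely
illustrative per-thread capacity of 1024:

```c
/* cap = 1024 is an assumed value, used only for illustration          */
/* smt_mode == 1:                         1024   -> 1x (returned as-is) */
/* smt_mode == 2: (1024 * 2 >> 3) + 1024 = 1280  -> 1.25x               */
/* smt_mode == 4: (1024 * 4 >> 3) + 1024 = 1536  -> 1.5x                */
/* smt_mode == 8: (1024 * 8 >> 3) + 1024 = 2048  -> 2x                  */
```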

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h |  4 ++++
 arch/powerpc/kernel/smp.c           | 33 +++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..1c777ee67180 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -132,6 +132,10 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
+#define arch_scale_core_capacity	powerpc_scale_core_capacity
+
+unsigned long powerpc_scale_core_capacity(int first_smt,
+					  unsigned long smt_cap);
 
 int dlpar_cpu_readd(int cpu);
 #endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..149a3fbf8ed3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1169,6 +1169,39 @@ static void remove_cpu_from_masks(int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Calculate capacity of a core based on the active threads in the core
+ * Scale the capacity of the first SMT thread based on the total number of
+ * active threads in the respective smt_mask.
+ *
+ * The scaling is done such that for
+ * SMT-4, core_capacity = 1.5x first_cpu_capacity
+ * and for SMT-8, core_capacity multiplication factor is 2x
+ *
+ * So, core_capacity multiplication factor = (1 + smt_mode*0.125)
+ *
+ * @first_cpu: First/any CPU id in the core
+ * @cap: Capacity of the first_cpu
+ */
+unsigned long powerpc_scale_core_capacity(int first_cpu,
+					  unsigned long cap)
+{
+	struct cpumask select_idles;
+	struct cpumask *cpus = &select_idles;
+	int cpu, smt_mode = 0;
+
+	cpumask_and(cpus, cpu_smt_mask(first_cpu), cpu_online_mask);
+
+	/* Find SMT mode from the active SMT threads */
+	for_each_cpu(cpu, cpus)
+		smt_mode++;
+
+	/* Scale core capacity based on smt mode */
+	return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
+}
+#endif
+
 static inline void add_cpu_to_smallcore_masks(int cpu)
 {
 	struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
-- 
2.17.1



* [RFC v4 6/8] sched/fair: Tune task wake-up logic to pack jitter tasks
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

The algorithm finds the first non-idle core in the system and tries to
place the task on the least utilized CPU of the chosen core. To maintain
cache hotness, the search for a non-idle core starts from prev_cpu, which
also reduces task ping-pong behaviour inside the core.

A new method named core_underutilized() determines whether the core
utilization is less than 12.5% of its capacity. Since a core with such low
utilization should not be selected for packing, the margin of
under-utilization is kept at 12.5% of core capacity.

12.5% is an experimentally derived number which identifies whether a core
should be considered idle or not. For task packing, the algorithm should
select the best core where the task can be accommodated such that it does
not wake up an idle core. But jitter tasks should not be placed on a core
which is about to go idle either. If a core has an aggregated utilization
of <12.5%, it may go idle soon, and hence packing on such a core should be
avoided. Experiments showed that keeping this threshold at 12.5% gives
better decision capability for not selecting a core which will idle out
soon.
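
As a rough numeric illustration (the capacity value is only an assumption),
with UNDERUTILIZED_THRESHOLD = 3 as defined below:

```c
/* Assume an SMT-4 core capacity of 1536 (illustrative value only)          */
/* threshold = core_capacity >> UNDERUTILIZED_THRESHOLD = 1536 >> 3 = 192   */
/* core_util <  192 (12.5%): core treated as under-utilized, not packed on  */
/* core_util >= 192        : core is a candidate for first-fit packing      */
```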

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c |   3 ++
 kernel/sched/fair.c | 108 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60340fa18abb..fcfd0ab187ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6402,6 +6402,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 void __init sched_init(void)
 {
@@ -6442,6 +6443,8 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(turbo_sched_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 793e1172afc7..3ba2dc44cba4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,6 +5353,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+/* A cpumask to find active cores in the system. */
+DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 
@@ -5964,6 +5966,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+static inline bool is_task_jitter(struct task_struct *p)
+{
+	if (p->flags & PF_CAN_BE_PACKED)
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_SCHED_SMT
 
 #ifndef arch_scale_core_capacity
@@ -5981,6 +5991,81 @@ static inline unsigned long arch_scale_core_capacity(int first_thread,
 }
 #endif
 
+/*
+ * A core is defined as under-utilized if the aggregated utilization of
+ * all the CPUs in the core is less than 12.5% of its capacity
+ */
+#define UNDERUTILIZED_THRESHOLD 3
+static inline bool core_underutilized(unsigned long core_util,
+				      unsigned long core_capacity)
+{
+	return core_util < (core_capacity >> UNDERUTILIZED_THRESHOLD);
+}
+
+/*
+ * Try to find a non idle core in the system  with spare capacity
+ * available for task packing, thereby keeping minimal cores active.
+ * Uses first fit algorithm to pack low util jitter tasks on active cores.
+ */
+static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
+	int iter_cpu, sibling;
+
+	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+
+	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
+		unsigned long core_util = 0;
+		unsigned long core_cap = arch_scale_core_capacity(iter_cpu,
+				capacity_of(iter_cpu));
+		unsigned long est_util = 0, est_util_enqueued = 0;
+		unsigned long util_best_cpu = ULONG_MAX;
+		int best_cpu = iter_cpu;
+		struct cfs_rq *cfs_rq;
+
+		for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
+			__cpumask_clear_cpu(sibling, cpus);
+			core_util += cpu_util(sibling);
+
+			/*
+			 * Keep track of least utilized CPU in the core
+			 */
+			if (cpu_util(sibling) < util_best_cpu) {
+				util_best_cpu = cpu_util(sibling);
+				best_cpu = sibling;
+			}
+		}
+
+		/*
+		 * Find if the selected task will fit into this core or not by
+		 * estimating the utilization of the core.
+		 */
+		if (!core_underutilized(core_util, core_cap)) {
+			cfs_rq = &cpu_rq(best_cpu)->cfs;
+			est_util =
+				READ_ONCE(cfs_rq->avg.util_avg) + task_util(p);
+			est_util_enqueued =
+				READ_ONCE(cfs_rq->avg.util_est.enqueued);
+			est_util_enqueued += _task_util_est(p);
+			est_util = max(est_util, est_util_enqueued);
+			est_util = core_util - util_best_cpu + est_util;
+
+			if (est_util < core_cap) {
+				/*
+				 * Try to bias towards prev_cpu to avoid task
+				 * ping-pong behaviour inside the core.
+				 */
+				if (cpumask_test_cpu(prev_cpu,
+						     cpu_smt_mask(iter_cpu)))
+					return prev_cpu;
+
+				return best_cpu;
+			}
+		}
+	}
+
+	return select_idle_sibling(p, prev_cpu, target);
+}
 #endif
 
 /*
@@ -6437,6 +6522,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return -1;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Select all jitters for task packing
+ */
+static inline int turbosched_select_non_idle_core(struct task_struct *p,
+						  int prev_cpu, int target)
+{
+	return select_non_idle_core(p, prev_cpu, target);
+}
+#else
+static inline int turbosched_select_non_idle_core(struct task_struct *p,
+						  int prev_cpu, int target)
+{
+	return select_idle_sibling(p, prev_cpu, target);
+}
+#endif
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6502,7 +6604,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
 
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		if (is_turbosched_enabled() && unlikely(is_task_jitter(p)))
+			new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
+								  new_cpu);
+		else
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;
-- 
2.17.1



* [RFC v4 7/8] sched/fair: Bound non idle core search within LLC domain
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Specify a method which returns the sched domain used to limit the search
for a non-idle core. By default, the search is limited to the LLC domain,
which usually includes all the cores across the system.

select_non_idle_core() searches for non-idle cores across the whole system.
But on systems with multiple NUMA domains, the turbo frequency can be
sustained within one NUMA domain without being affected by the others. For
such cases, arch_turbo_domain can be tuned to change the domain used for
the non-idle core search.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ba2dc44cba4..e09e7546abeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6002,6 +6002,13 @@ static inline bool core_underutilized(unsigned long core_util,
 	return core_util < (core_capacity >> UNDERUTILIZED_THRESHOLD);
 }
 
+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+	return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
 /*
  * Try to find a non idle core in the system  with spare capacity
  * available for task packing, thereby keeping minimal cores active.
@@ -6012,7 +6019,8 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
 	int iter_cpu, sibling;
 
-	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+	cpumask_and(cpus, cpu_online_mask, arch_turbo_domain(prev_cpu));
+	cpumask_and(cpus, cpus, p->cpus_ptr);
 
 	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
 		unsigned long core_util = 0;
-- 
2.17.1



* [RFC v4 8/8] powerpc: Set turbo domain to NUMA node for task packing
From: Parth Shah @ 2019-07-25  7:08 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, linux-pm, patrick.bellasi, dietmar.eggemann,
	daniel.lezcano, subhra.mazumdar

Provide a powerpc architecture specific implementation for defining the
turbo domain, so that the search for a non-idle core is bounded within the
NUMA node. This provides a way to decrease the search time on architectures
where we know the domain for the power budget.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 3 +++
 arch/powerpc/kernel/smp.c           | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 1c777ee67180..410b94c9e1a2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,10 +133,13 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
 #define arch_scale_core_capacity	powerpc_scale_core_capacity
+#define arch_turbo_domain		powerpc_turbo_domain
 
 unsigned long powerpc_scale_core_capacity(int first_smt,
 					  unsigned long smt_cap);
 
+struct cpumask *powerpc_turbo_domain(int cpu);
+
 int dlpar_cpu_readd(int cpu);
 #endif
 #endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 149a3fbf8ed3..856f7233190e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1200,6 +1200,11 @@ unsigned long powerpc_scale_core_capacity(int first_cpu,
 	/* Scale core capacity based on smt mode */
 	return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
 }
+
+inline struct cpumask *powerpc_turbo_domain(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
 #endif
 
 static inline void add_cpu_to_smallcore_masks(int cpu)
-- 
2.17.1



* Re: [RFC v4 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
From: Pavel Machek @ 2019-07-28 13:31 UTC (permalink / raw)
  To: Parth Shah
  Cc: peterz, mingo, linux-kernel, linux-pm, patrick.bellasi,
	dietmar.eggemann, daniel.lezcano, subhra.mazumdar

Hi!

> Abstract
> ========
> 
> The modern servers allows multiple cores to run at range of frequencies
> higher than rated range of frequencies. But the power budget of the system
> inhibits sustaining these higher frequencies for longer durations.

Thermal budget?

Should this go to documentation somewhere?

> Current CFS algorithm in kernel scheduler is performance oriented and hence
> tries to assign any idle CPU first for the waking up of new tasks. This
> policy is perfect for major categories of the workload, but for jitter
> tasks, one can save energy by packing them onto the active cores and allow
> those cores to run at higher frequencies.
> 
> These patch-set tunes the task wake up logic in scheduler to pack
> exclusively classified jitter tasks onto busy cores. The work involves the
> jitter tasks classifications by using syscall based mechanisms.
> 
> In brief, if we can pack jitter tasks on busy cores then we can save power
> by keeping other cores idle and allow busier cores to run at turbo
> frequencies, patch-set tries to meet this solution in simplest manner.
> Though, there are some challenges in implementing it(like smt_capacity,

Space before (.

> These numbers are w.r.t. `turbo_bench.c` multi-threaded test benchmark
> which can create two kinds of tasks: CPU bound (High Utilization) and
> Jitters (Low Utilization). N in X-axis represents N-CPU bound and N-Jitter
> tasks spawned.

Ok, so you have description how it causes 13% improvements. Do you also have metrics how
it harms performance.. how much delay is added to unimportant tasks etc...?

Thanks,
										Pavel


* Re: [RFC v4 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
From: Parth Shah @ 2019-07-31 16:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: peterz, mingo, linux-kernel, linux-pm, patrick.bellasi,
	dietmar.eggemann, daniel.lezcano, subhra.mazumdar



On 7/28/19 7:01 PM, Pavel Machek wrote:
> Hi!
> 
>> Abstract
>> ========
>>
>> The modern servers allows multiple cores to run at range of frequencies
>> higher than rated range of frequencies. But the power budget of the system
>> inhibits sustaining these higher frequencies for longer durations.
> 
> Thermal budget?

Right, it is a good point, and there can be a possibility of thermal
throttling, which is not covered here.
But thermal throttling is seen less often in servers than throttling due to
power budget constraints. Also, one can change the power cap, which can lead
to an increase in throttling, and task packing can help in such cases.

BTW, task packing allows a few more cores to remain idle for a longer time,
so shouldn't this decrease thermal throttling to a certain extent?

> 
> Should this go to documentation somewhere?
> 

Sure, I can add it to Documentation/scheduler or under selftests.

>> Current CFS algorithm in kernel scheduler is performance oriented and hence
>> tries to assign any idle CPU first for the waking up of new tasks. This
>> policy is perfect for major categories of the workload, but for jitter
>> tasks, one can save energy by packing them onto the active cores and allow
>> those cores to run at higher frequencies.
>>
>> These patch-set tunes the task wake up logic in scheduler to pack
>> exclusively classified jitter tasks onto busy cores. The work involves the
>> jitter tasks classifications by using syscall based mechanisms.
>>
>> In brief, if we can pack jitter tasks on busy cores then we can save power
>> by keeping other cores idle and allow busier cores to run at turbo
>> frequencies, patch-set tries to meet this solution in simplest manner.
>> Though, there are some challenges in implementing it(like smt_capacity,
> 
> Space before (.

My bad, somehow missed it. Thanks for pointing out.

> >> These numbers are w.r.t. `turbo_bench.c` multi-threaded test benchmark
>> which can create two kinds of tasks: CPU bound (High Utilization) and
>> Jitters (Low Utilization). N in X-axis represents N-CPU bound and N-Jitter
>> tasks spawned.
> 
> Ok, so you have description how it causes 13% improvements. Do you also have metrics how
> it harms performance.. how much delay is added to unimportant tasks etc...?
> 

Yes, if we try to pack the tasks even when there is no frequency throttling, we see a regression
of around 5%. For instance, in the synthetic benchmark I used to show the performance benefit,
for a lower count of CPU intensive threads (N=2) there is a -5% performance drop.

Talking about the delay added to unimportant tasks, the result can be lower throughput
or higher latency for such tasks.

1. Throughput
For instance, when classifying 8 running tasks as jitters, we can see a performance
drop "based on the task characteristics".

The table below shows the performance (total operations performed) drop observed when
jitters have different utilization, on a CPU set at max frequency.
+-------------------+-------------+
| Utilization(in %) | Performance |
+-------------------+-------------+
| 10-20             | -0.32%      |
| 30-40             | -0.003%     |
+-------------------+-------------+

Jitters here are frequency insensitive and do only X operations in an N period of time. Hence
they don't show much drop in throughput.

2. Latency
The wakeup latency of the jitter tasks gives the results below.
Test-1:
- 8 CPU intensive tasks, 40 jitter low utilization tasks
+-------+-------------+--------------+
| %ile  | w/o patches | with patches |
+-------+-------------+--------------+
| Min   |           3 | 5 (-66%)     |
| 50    |          64 | 64 (0%)      |
| 90    |          66 | 67 (-1.5%)   |
| 99    |          67 | 68 (-1.4%)   |
| 99.99 |          78 | 439 (-462%)  |
| Max   |         159 | 1023 (-543%) |
+-------+-------------+--------------+

Test-2:
- 8 CPU intensive tasks, 8 jitter tasks
+-------+-------------+--------------+
| %ile  | w/o patches | with patches |
+-------+-------------+--------------+
| Min   |           4 | 6 (-50%)     |
| 50    |          65 | 55 (+15%)    |
| 90    |          65 | 55 (+15%)    |
| 99    |          66 | 56 (+15%)    |
| 99.99 |          76 | 69 (+9%)     |
| Max   |          78 | 672 (-761%)  |
+-------+-------------+--------------+

Note: I used the synthetic workload generator to compute the wakeup latency for jitter tasks;
the source code for it can be found at
https://github.com/parthsl/tools/blob/master/benchmarks/turbosched_delay.c


Also, the jitter tasks would create a regression for CPU intensive tasks when placed
on a sibling thread, but the performance gain from the sustained frequency is more than
enough here to overcome this regression. Hence, if there is no throttling, there
will be a performance penalty for both types of tasks.


Thanks,
Parth



* Re: [RFC v4 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
From: Pavel Machek @ 2019-07-31 17:32 UTC (permalink / raw)
  To: Parth Shah
  Cc: peterz, mingo, linux-kernel, linux-pm, patrick.bellasi,
	dietmar.eggemann, daniel.lezcano, subhra.mazumdar


Hi!

> >> Abstract
> >> ========
> >>
> >> The modern servers allows multiple cores to run at range of frequencies
> >> higher than rated range of frequencies. But the power budget of the system
> >> inhibits sustaining these higher frequencies for longer durations.
> > 
> > Thermal budget?
> 
> Right, it is a good point, and there can be possibility of Thermal throttling
> which is not covered here.
> But the thermal throttling is less often seen in the servers than the throttling
> due to the Power budget constraints. Also one can change the power cap which leads
> to increase in the throttling and task packing can handle in such
> cases.

Ok. I thought you were doing this due to thermals. If I understand
things correctly, you can go over thermal limits for a few seconds
before the silicon heats up. What is the timescale for the power budget?

> BTW, Task packing allows few more cores to remain idle for longer time, so
> shouldn't this decrease thermal throttles upto certain extent?

I guess so, yes.

> > >> These numbers are w.r.t. `turbo_bench.c` multi-threaded test benchmark
> >> which can create two kinds of tasks: CPU bound (High Utilization) and
> >> Jitters (Low Utilization). N in X-axis represents N-CPU bound and N-Jitter
> >> tasks spawned.
> > 
> > Ok, so you have description how it causes 13% improvements. Do you also have metrics how
> > it harms performance.. how much delay is added to unimportant tasks etc...?
> > 
> 
> Yes, if we try to pack the tasks despite of no frequency throttling, we see a regression
> around 5%. For instance, in the synthetic benchmark I used to show performance benefit,
> for lower count of CPU intensive threads (N=2) there is -5% performance drop.
> 
> Talking about the delay added to an unimportant tasks, the result can be lower throughput
> or higher latency for such tasks.

Thanks. I believe it would be good to mention disadvantages in the
documentation, too.

Best regards,
							Pavel
							
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



* Re: [RFC v4 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations
From: Parth Shah @ 2019-08-02 11:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: peterz, mingo, linux-kernel, linux-pm, patrick.bellasi,
	dietmar.eggemann, daniel.lezcano, subhra.mazumdar



On 7/31/19 11:02 PM, Pavel Machek wrote:
> Hi!
> 
>>>> Abstract
>>>> ========
>>>>
>>>> The modern servers allows multiple cores to run at range of frequencies
>>>> higher than rated range of frequencies. But the power budget of the system
>>>> inhibits sustaining these higher frequencies for longer durations.
>>>
>>> Thermal budget?
>>
>> Right, it is a good point, and there can be possibility of Thermal throttling
>> which is not covered here.
>> But the thermal throttling is less often seen in the servers than the throttling
>> due to the Power budget constraints. Also one can change the power cap which leads
>> to increase in the throttling and task packing can handle in such
>> cases.
> 
> Ok. I thought you are doing this due to thermals. If I understand
> things correctly, you can go over thermal limits for a few seconds
> before the silicon heats up. What is the timescale for power budget?
> 

I guess it varies across architectures.
AFAIK, in POWER systems the frequency is throttled down instantaneously as we exceed
the power budget.
If an idle core is woken up and the power budget is exceeded, then the system throttles
down to the frequency value that is known to be sustainable with that many busy cores.

>> BTW, Task packing allows few more cores to remain idle for longer time, so
>> shouldn't this decrease thermal throttles upto certain extent?
> 
> I guess so, yes.
> 
>>>>> These numbers are w.r.t. `turbo_bench.c` multi-threaded test benchmark
>>>> which can create two kinds of tasks: CPU bound (High Utilization) and
>>>> Jitters (Low Utilization). N in X-axis represents N-CPU bound and N-Jitter
>>>> tasks spawned.
>>>
>>> Ok, so you have description how it causes 13% improvements. Do you also have metrics how
>>> it harms performance.. how much delay is added to unimportant tasks etc...?
>>>
>>
>> Yes, if we try to pack the tasks despite of no frequency throttling, we see a regression
>> around 5%. For instance, in the synthetic benchmark I used to show performance benefit,
>> for lower count of CPU intensive threads (N=2) there is -5% performance drop.
>>
>> Talking about the delay added to an unimportant tasks, the result can be lower throughput
>> or higher latency for such tasks.
> 
> Thanks. I believe it would be good to mention disadvantages in the
> documentation, too.

Sure, I will add the mentioned possible regression on jitter tasks in the documentation
somewhere.


Thanks,
Parth


