All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/9] Task latency-nice
@ 2019-08-30 17:49 subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
                   ` (10 more replies)
  0 siblings, 11 replies; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Introduce new per task property latency-nice for controlling scalability
in scheduler idle CPU search path. Valid latency-nice values are from 1 to
100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
CPU cgroup file cpu.latency-nice is added as an interface to set and get.
All tasks in the same cgroup share the same latency-nice value. Using a
lower latency-nice value can help latency intolerant tasks e.g very short
running OLTP threads where full LLC search cost can be significant compared
to run time of the threads. The default latency-nice value is 5.

In addition to latency-nice, it also adds a new sched feature SIS_CORE to
be able to disable idle core search altogether which is costly and hurts
more than it helps in short running workloads.

Finally it also introduces a new per-cpu variable next_cpu to track
the limit of search so that every time search starts from where it ended.
This rotating search window over cpus in LLC domain ensures that idle
cpus are eventually found in case of high load.

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline   latency-nice=5,SIS_CORE     latency-nice=5,NO_SIS_CORE 
8       64.66      64.38 (-0.43%)              64.79 (0.2%)
16      123.34     122.88 (-0.37%)             125.87 (2.05%)
32      215.18     215.55 (0.17%)              247.77 (15.15%)
48      278.56     321.6 (15.45%)              321.2 (15.3%)
64      259.99     319.45 (22.87%)             333.95 (28.44%)
128     431.1      437.69 (1.53%)              431.09 (0%)

subhra mazumdar (9):
  sched,cgroup: Add interface for latency-nice
  sched: add search limit as per latency-nice
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  sched: Define macro for number of CPUs in core
  x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  sched: search SMT before LLC domain
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 include/linux/sched.h           |  1 +
 include/linux/topology.h        |  4 ++++
 kernel/sched/core.c             | 42 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c             | 34 +++++++++++++++++++++------------
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  9 +++++++++
 9 files changed, 97 insertions(+), 13 deletions(-)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-09-04 17:32   ` Tim Chen
                     ` (2 more replies)
  2019-08-30 17:49 ` [RFC PATCH 2/9] sched: add search limit as per latency-nice subhra mazumdar
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
"latency-nice" which is shared by all the threads in that Cgroup.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  8 ++++++++
 4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	u64				latency_nice;
 
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
 		init_dl_rq(&rq->dl);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+		root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
 		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
 		/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	 */
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
+	tsk->latency_nice = tg->latency_nice;
 	tg = autogroup_task_group(tsk, tg);
 	tsk->sched_task_group = tg;
 
@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cft, u64 latency_nice)
+{
+	struct task_group *tg = css_tg(css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+		return -ERANGE;
+
+	tg->latency_nice = latency_nice;
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it)))
+		p->latency_nice = latency_nice;
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	{
+		.name = "latency-nice",
+		.read_u64 = cpu_latency_nice_read_u64,
+		.write_u64 = cpu_latency_nice_write_u64,
+	},
 	{ }	/* Terminate */
 };
 
@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
 		.write = cpu_max_write,
 	},
 #endif
+	{
+		.name = "latency-nice",
+		.read_u64 = cpu_latency_nice_read_u64,
+		.write_u64 = cpu_latency_nice_write_u64,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	tg->latency_nice = LATENCY_NICE_DEFAULT;
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
+ * Latency-nice default value
+ */
+#define	LATENCY_NICE_DEFAULT	5
+#define	LATENCY_NICE_MIN	1
+#define	LATENCY_NICE_MAX	100
+
+/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
+	u64 latency_nice;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 2/9] sched: add search limit as per latency-nice
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-09-05  6:22   ` Parth Shah
  2019-08-30 17:49 ` [RFC PATCH 3/9] sched: add sched feature to disable idle core search subhra mazumdar
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Put upper and lower limit on CPU search in select_idle_cpu. The lower limit
is set to amount of CPUs in a core  while upper limit is derived from the
latency-nice of the thread. This ensures for any architecture we will
usually search beyond a core. Changing the latency-nice value by user will
change the search cost making it appropriate for given workload.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;
 
 	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+		if (nr < floor)
+			nr = floor;
 	}
 
 	time = local_clock();
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 3/9] sched: add sched feature to disable idle core search
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 2/9] sched: add search limit as per latency-nice subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-09-05 10:17   ` Patrick Bellasi
  2019-08-30 17:49 ` [RFC PATCH 4/9] sched: SIS_CORE " subhra mazumdar
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 4/9] sched: SIS_CORE to disable idle core search
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (2 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 3/9] sched: add sched feature to disable idle core search subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-09-05 10:19   ` Patrick Bellasi
  2019-08-30 17:49 ` [RFC PATCH 5/9] sched: Define macro for number of CPUs in core subhra mazumdar
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c31082d..23ec9c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 5/9] sched: Define macro for number of CPUs in core
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (3 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 4/9] sched: SIS_CORE " subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86 subhra mazumdar
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Introduce macro topology_sibling_weight for number of sibling CPUs in a
core and use in select_idle_cpu

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 include/linux/topology.h | 4 ++++
 kernel/sched/fair.c      | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)	\
+		cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ec9c6..8856503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;
 
 	if (sched_feat(SIS_PROP)) {
-		floor = cpumask_weight(topology_sibling_cpumask(target));
+		floor = topology_sibling_weight(target);
 		if (floor < 2)
 			floor = 2;
 		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (4 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 5/9] sched: Define macro for number of CPUs in core subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 7/9] sched: search SMT before LLC domain subhra mazumdar
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Use per-CPU variable for cpumask_weight_sibling macro in case of x86 for
fast lookup in select_idle_cpu. This avoids reading multiple cache lines
in case of systems with large numbers of CPUs where bitmask can span
multiple cache lines. Even if bitmask spans only one cache line this avoids
looping through it to find the number of bits and gets it in O(1).

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)	(per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()			(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..57ad88d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,9 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu)
 
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+		per_cpu(cpumask_weight_sibling, cpu) =
+		    cpumask_weight(topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		c->booted_cores = 1;
@@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu)
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);
 
-		if ((i == cpu) || (has_smt && match_smt(c, o)))
+		if ((i == cpu) || (has_smt && match_smt(c, o))) {
 			link_mask(topology_sibling_cpumask, cpu, i);
+			per_cpu(cpumask_weight_sibling, cpu) =
+			    cpumask_weight(topology_sibling_cpumask(cpu));
+			per_cpu(cpumask_weight_sibling, i) =
+			    cpumask_weight(topology_sibling_cpumask(i));
+		}
 
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
 	else
 		physid_set_mask_of_physid(0, &phys_cpu_present_map);
 	cpumask_set_cpu(0, topology_sibling_cpumask(0));
+	per_cpu(cpumask_weight_sibling, 0) =
+	    cpumask_weight(topology_sibling_cpumask(0));
 	cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+		per_cpu(cpumask_weight_sibling, sibling) =
+		    cpumask_weight(topology_sibling_cpumask(sibling));
 		/*/
 		 * last thread sibling in this cpu core going down
 		 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
+	per_cpu(cpumask_weight_sibling, cpu) = 0;
 	cpumask_clear(topology_core_cpumask(cpu));
 	c->cpu_core_id = 0;
 	c->booted_cores = 0;
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 7/9] sched: search SMT before LLC domain
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (5 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86 subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-09-05  9:31   ` Peter Zijlstra
  2019-08-30 17:49 ` [RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 			return i;
 	}
 
-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_smt(p, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, target);
+	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (6 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 7/9] sched: search SMT before LLC domain subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-08-30 17:49 ` [RFC PATCH 9/9] sched: rotate the cpu search window for better spread subhra mazumdar
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Introduce a per-cpu variable to track the limit upto which idle cpu search
was done in select_idle_cpu(). This will help to start the search next time
from there. This is necessary for rotating the search window over entire
LLC domain.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47969bc..5862d54 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include <trace/events/sched.h>
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
+		per_cpu(next_cpu, i) = -1;
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 365c928..cca2b09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1002,6 +1002,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH 9/9] sched: rotate the cpu search window for better spread
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (7 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
@ 2019-08-30 17:49 ` subhra mazumdar
  2019-09-05  6:37   ` Parth Shah
  2019-09-05  5:55 ` [RFC PATCH 0/9] Task latency-nice Parth Shah
  2019-09-05 10:31 ` Patrick Bellasi
  10 siblings, 1 reply; 55+ messages in thread
From: subhra mazumdar @ 2019-08-30 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94dd4a32..7419b47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, floor, nr = INT_MAX;
+	int cpu, floor, target_tmp, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = floor;
 	}
 
+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
@ 2019-09-04 17:32   ` Tim Chen
  2019-09-05  6:15     ` Parth Shah
  2019-09-05  8:31   ` Peter Zijlstra
  2019-09-05 10:05   ` Patrick Bellasi
  2 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2019-09-04 17:32 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, mgorman, parth,
	patrick.bellasi

On 8/30/19 10:49 AM, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.


Subhra,

Thanks for posting the patchset.  Having a latency nice hint
is useful beyond idle load balancing.  I can think of other
application scenarios, like scheduling batch machine learning AVX 512
processes with latency sensitive processes.  AVX512 limits the frequency
of the CPU and it is best to avoid latency sensitive task on the
same core with AVX512.  So latency nice hint allows the scheduler
to have a criteria to determine the latency sensitivity of a task
and arrange latency sensitive tasks away from AVX512 tasks.

You configure the latency hint on a cgroup basis.
But I think not all tasks in a cgroup necessarily have the same
latency sensitivity.

For example, I can see that cgroup can be applied on a per user basis,
and the user could run different tasks that have different latency sensitivity.
We may also need a way to configure latency sensitivity on a per task basis instead on
a per cgroup basis.

Tim


> @@ -631,6 +631,7 @@ struct task_struct {
>  	int				static_prio;
>  	int				normal_prio;
>  	unsigned int			rt_priority;
> +	u64				latency_nice;

Does it need to be 64 bit?  Max latency nice is only 100.

>  
>  	const struct sched_class	*sched_class;
>  	struct sched_entity		se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>  #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
>  
>  /*
> + * Latency-nice default value
> + */

Will be useful to add comments to let reader know 
that higher latency nice number means a task is more 
latency tolerant.

Is there a reason for setting the default to be a low
value of 5?

Seems like we will default to only to search the
same core for idle cpu on a smaller system, 
as we only search 5% of the cpu span of the target sched domain.

> +#define	LATENCY_NICE_DEFAULT	5
> +#define	LATENCY_NICE_MIN	1
> +#define	LATENCY_NICE_MAX	100
> +

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 0/9] Task latency-nice
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (8 preceding siblings ...)
  2019-08-30 17:49 ` [RFC PATCH 9/9] sched: rotate the cpu search window for better spread subhra mazumdar
@ 2019-09-05  5:55 ` Parth Shah
  2019-09-05 10:31 ` Patrick Bellasi
  10 siblings, 0 replies; 55+ messages in thread
From: Parth Shah @ 2019-09-05  5:55 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, patrick.bellasi

Hi Subhra,

On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path. Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
> 
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.
> 
> Finally it also introduces a new per-cpu variable next_cpu to track
> the limit of search so that every time search starts from where it ended.
> This rotating search window over cpus in LLC domain ensures that idle
> cpus are eventually found in case of high load.
> 
> Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
> message size = 8k (higher is better):
> threads baseline   latency-nice=5,SIS_CORE     latency-nice=5,NO_SIS_CORE 
> 8       64.66      64.38 (-0.43%)              64.79 (0.2%)
> 16      123.34     122.88 (-0.37%)             125.87 (2.05%)
> 32      215.18     215.55 (0.17%)              247.77 (15.15%)
> 48      278.56     321.6 (15.45%)              321.2 (15.3%)
> 64      259.99     319.45 (22.87%)             333.95 (28.44%)
> 128     431.1      437.69 (1.53%)              431.09 (0%)
> 

The result seems to be appealing with your experimental setup.
BTW, do you have any plans of load balancing as well based on latency niceness
of the tasks? It seems to be a more interesting case when we give pack the lower
latency sensitive tasks on fewer CPUs.

Also, do you see any workload results showing performance regression with NO_SIS_CORE?


Thanks,
Parth


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-04 17:32   ` Tim Chen
@ 2019-09-05  6:15     ` Parth Shah
  2019-09-05 10:11       ` Patrick Bellasi
  0 siblings, 1 reply; 55+ messages in thread
From: Parth Shah @ 2019-09-05  6:15 UTC (permalink / raw)
  To: Tim Chen, subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, mgorman,
	patrick.bellasi



On 9/4/19 11:02 PM, Tim Chen wrote:
> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
> 
> 
> Subhra,
> 
> Thanks for posting the patchset.  Having a latency nice hint
> is useful beyond idle load balancing.  I can think of other
> application scenarios, like scheduling batch machine learning AVX 512
> processes with latency sensitive processes.  AVX512 limits the frequency
> of the CPU and it is best to avoid latency sensitive task on the
> same core with AVX512.  So latency nice hint allows the scheduler
> to have a criteria to determine the latency sensitivity of a task
> and arrange latency sensitive tasks away from AVX512 tasks.
> 


Hi Tim and Subhra,

This patchset seems to be interesting for my TurboSched patches as well
where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
Well, the problem I face is that we sometime end up putting multiple jitter tasks on a core
running some latency sensitive application which may see performance degradation.
So my plan was to classify such tasks to be latency sensitive thereby hinting the load
balancer to not put tasks on such cores.

TurboSched: https://lkml.org/lkml/2019/7/25/296

> You configure the latency hint on a cgroup basis.
> But I think not all tasks in a cgroup necessarily have the same
> latency sensitivity.
> 
> For example, I can see that cgroup can be applied on a per user basis,
> and the user could run different tasks that have different latency sensitivity.
> We may also need a way to configure latency sensitivity on a per task basis instead on
> a per cgroup basis.
> 

AFAIU, the problem defined above intersects with my patches as well where the interface
is required to classify the jitter tasks. I have already tried few methods like
syscall and cgroup to classify such tasks and maybe something like that can be adopted
with these patchset as well.


Thanks,
Parth


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 2/9] sched: add search limit as per latency-nice
  2019-08-30 17:49 ` [RFC PATCH 2/9] sched: add search limit as per latency-nice subhra mazumdar
@ 2019-09-05  6:22   ` Parth Shah
  0 siblings, 0 replies; 55+ messages in thread
From: Parth Shah @ 2019-09-05  6:22 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, patrick.bellasi



On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Put upper and lower limit on CPU search in select_idle_cpu. The lower limit
> is set to amount of CPUs in a core  while upper limit is derived from the
> latency-nice of the thread. This ensures for any architecture we will
> usually search beyond a core. Changing the latency-nice value by user will
> change the search cost making it appropriate for given workload.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b08d00c..c31082d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, nr = INT_MAX;
> +	int cpu, floor, nr = INT_MAX;
> 
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  		return -1;
> 
>  	if (sched_feat(SIS_PROP)) {
> -		u64 span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> -			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> +		floor = cpumask_weight(topology_sibling_cpumask(target));
> +		if (floor < 2)
> +			floor = 2;
> +		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;

I see you defined LATENCY_NICE_MAX = 100,
So is the value 100 an experimental value?
I was hoping to be something in the power of 2 resulting in just ">>>" rather than
the heavy division operation.

> +		if (nr < floor)
> +			nr = floor;
>  	}
> 
>  	time = local_clock();
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 9/9] sched: rotate the cpu search window for better spread
  2019-08-30 17:49 ` [RFC PATCH 9/9] sched: rotate the cpu search window for better spread subhra mazumdar
@ 2019-09-05  6:37   ` Parth Shah
  0 siblings, 0 replies; 55+ messages in thread
From: Parth Shah @ 2019-09-05  6:37 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, patrick.bellasi



On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Rotate the cpu search window for better spread of threads. This will ensure
> an idle cpu will quickly be found if one exists.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 94dd4a32..7419b47 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, floor, nr = INT_MAX;
> +	int cpu, floor, target_tmp, nr = INT_MAX;
>  
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  			nr = floor;
>  	}
>  
> +	if (per_cpu(next_cpu, target) != -1)
> +		target_tmp = per_cpu(next_cpu, target);
> +	else
> +		target_tmp = target;
> +
>  	time = local_clock();
>  
> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> +		per_cpu(next_cpu, target) = cpu;

Is it possible that two simultaneous select_idle_cpu call have the same target value?

>  		if (!--nr)
>  			return -1;
>  		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
  2019-09-04 17:32   ` Tim Chen
@ 2019-09-05  8:31   ` Peter Zijlstra
  2019-09-05  9:45     ` Patrick Bellasi
  2019-09-05 10:05   ` Patrick Bellasi
  2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05  8:31 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.

*sigh*, no. We start with a normal per task attribute, and then later,
if it is needed and makes sense, we add it to cgroups.

Also, your Changelog fails on pretty much every point. It doesn't
explain why, it doesn't describe anything and so on.

From just reading the above, I would expect it to have the range
[-20,19] just like normal nice. Apparently this is not so.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 7/9] sched: search SMT before LLC domain
  2019-08-30 17:49 ` [RFC PATCH 7/9] sched: search SMT before LLC domain subhra mazumdar
@ 2019-09-05  9:31   ` Peter Zijlstra
  2019-09-05 20:40     ` Subhra Mazumdar
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05  9:31 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi

On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
> Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
> in L1 cache locality.
> ---
>  kernel/sched/fair.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8856503..94dd4a32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  			return i;
>  	}
>  
> -	i = select_idle_cpu(p, sd, target);
> +	i = select_idle_smt(p, target);
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  
> -	i = select_idle_smt(p, target);
> +	i = select_idle_cpu(p, sd, target);
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  

But it is absolutely conceptually wrong. An idle core is a much better
target than an idle sibling.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05  8:31   ` Peter Zijlstra
@ 2019-09-05  9:45     ` Patrick Bellasi
  2019-09-05 10:46       ` Peter Zijlstra
  2019-09-06 12:31       ` Parth Shah
  0 siblings, 2 replies; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05  9:45 UTC (permalink / raw)
  To: Subhra Mazumdar, Peter Zijlstra
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth


On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...

> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> *sigh*, no. We start with a normal per task attribute, and then later,
> if it is needed and makes sense, we add it to cgroups.

FWIW, to add on top of what Peter says, we used this same approach for
uclamp and it proved to be a very effective way to come up with a good
design. General principles have been:

 - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
   default values for all tasks affected by that feature.
   This interface has to define also upper bounds for task specific
   values. Thus, in the case of latency-nice, it should be set by
   default to the MIN value, since that's the current mainline
   behaviour: all tasks are latency sensitive.

 - a per-task API [2] (via the sched_setattr() syscall) can be used to
   relax the system wide setting thus implementing a "nice" policy.

 - a per-taskgroup API [3] (via cpu controller's attributes) can be used
   to relax the system-wide settings and restrict the per-task API.

The above features are worth to be added in that exact order.

> Also, your Changelog fails on pretty much every point. It doesn't
> explain why, it doesn't describe anything and so on.

On the description side, I guess it's worth to mention somewhere to
which scheduling classes this feature can be useful for. It's worth to
mention that it can apply only to:

 - CFS tasks: for example, at wakeup time a task with an high
   latency-nice should avoid to preempt a low latency-nice task.
   Maybe by mapping the latency nice value into proper vruntime
   normalization value?

 - RT tasks: for example, at wakeup time a task with an high
   latency-nice value could avoid to preempt a CFS task.

I'm sure there will be discussion about some of these features, that's
why it's important in the proposal presentation to keep a well defined
distinction among the "mechanisms and API" and how we use the new
concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

  - [-20..19], which makes it similar to priorities
  downside: we quite likely end up with a kernel space representation
  which does not match the user-space one, e.g. look at
  task_struct::prio.

  - [0..1024], which makes it more similar to a "percentage"

Being latency-nice a new concept, we are not constrained by POSIX and
IMHO the [0..1024] scale is a better fit.

That will translate into:

  latency-nice=0 : default (current mainline) behaviour, all "biasing"
  policies are disabled and we wakeup up as fast as possible

  latency-nice=1024 : maximum niceness, where for example we can imaging
  to turn switch a CFS task to be SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
[3] 5 patches in today's tip/sched/core up to:
    commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
  2019-09-04 17:32   ` Tim Chen
  2019-09-05  8:31   ` Peter Zijlstra
@ 2019-09-05 10:05   ` Patrick Bellasi
  2019-09-05 10:48     ` Peter Zijlstra
  2 siblings, 1 reply; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 10:05 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth


We already commented on adding the cgroup API after the per-task API.

However, for the cgroup bits will be super important to have

 [ +tejun ]

in CC since here we are at discussing the idea to add a new cpu
controller's attribute.

There are opinions about which kind of attributes can be added to
cgroups and I'm sure a "latency-nice" attribute will generate an
interesting discussion. :)

LPC is coming up, perhaps we can get the chance to have a chat with
Tejun about the manoeuvring space in this area.

On Fri, Aug 30, 2019 at 18:49:36 +0100, subhra mazumdar wrote...

> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c   |  1 +
>  kernel/sched/sched.h  |  8 ++++++++
>  4 files changed, 50 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1183741..b4a79c3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -631,6 +631,7 @@ struct task_struct {
>  	int				static_prio;
>  	int				normal_prio;
>  	unsigned int			rt_priority;
> +	u64				latency_nice;

I guess we can save some bit here... or, if we are very brave, maybe we
can explore the possibility to pack all prios into a single u64?

 ( ( (tomatoes target here) ) )

>  	const struct sched_class	*sched_class;
>  	struct sched_entity		se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5976,6 +5976,7 @@ void __init sched_init(void)
>  		init_dl_rq(&rq->dl);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> +		root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
>  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
>  		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
>  		/*
> @@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
>  	 */
>  	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>  			  struct task_group, css);
> +	tsk->latency_nice = tg->latency_nice;
>  	tg = autogroup_task_group(tsk, tg);
>  	tsk->sched_task_group = tg;
>  
> @@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
> +static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
> +				     struct cftype *cft)
> +{
> +	struct task_group *tg = css_tg(css);
> +
> +	return tg->latency_nice;
> +}
> +
> +static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
> +				      struct cftype *cft, u64 latency_nice)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct css_task_iter it;
> +	struct task_struct *p;
> +
> +	if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
> +		return -ERANGE;
> +
> +	tg->latency_nice = latency_nice;
> +
> +	css_task_iter_start(css, 0, &it);
> +	while ((p = css_task_iter_next(&it)))
> +		p->latency_nice = latency_nice;

Once (and if) the cgroup API is added we can avoid this (potentially
massive) "update on write" in favour of an "on demand composition at
wakeup-time".

We don't care about updating the latency-nice of NON RUNNABLE tasks,
do we?

AFAIK, we need that value only (or mostly) at wakeup time. Thus, when a
task wakeup up we can easily compose (and eventually cache) it's
current latency-nice value by considering, in priority order:

  - the system wide upper-bound
  - the task group restriction
  - the task specific relaxation

Something similar to what we already do for uclamp composition with this
patch currently in tip/sched/core:

   commit 3eac870a3247 ("sched/uclamp: Use TG's clamps to restrict TASK's clamps")


> +	css_task_iter_end(&it);
> +
> +	return 0;
> +}
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> @@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
>  		.write_u64 = cpu_rt_period_write_uint,
>  	},
>  #endif
> +	{
> +		.name = "latency-nice",
> +		.read_u64 = cpu_latency_nice_read_u64,
> +		.write_u64 = cpu_latency_nice_write_u64,
> +	},
>  	{ }	/* Terminate */
>  };
>  
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
>  		.write = cpu_max_write,
>  	},
>  #endif
> +	{
> +		.name = "latency-nice",
> +		.read_u64 = cpu_latency_nice_read_u64,
> +		.write_u64 = cpu_latency_nice_write_u64,
> +	},
>  	{ }	/* terminate */
>  };
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b08d00c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>  		goto err;
>  
>  	tg->shares = NICE_0_LOAD;
> +	tg->latency_nice = LATENCY_NICE_DEFAULT;
                           ^^^^^^^^^^^^^^^^^^^^
Maybe better NICE_0_LATENCY to be more consistent?


>  	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>  #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
>  
>  /*
> + * Latency-nice default value
> + */
> +#define	LATENCY_NICE_DEFAULT	5
> +#define	LATENCY_NICE_MIN	1
> +#define	LATENCY_NICE_MAX	100

Values 1 and 5 looks kind of arbitrary.
For the range specifically, I already commented in this other message:

   Message-ID: <87r24v2i14.fsf@arm.com>
   https://lore.kernel.org/lkml/87r24v2i14.fsf@arm.com/

> +
> +/*
>   * Single value that decides SCHED_DEADLINE internal math precision.
>   * 10 -> just above 1us
>   * 9  -> just above 0.5us
> @@ -362,6 +369,7 @@ struct cfs_bandwidth {
>  /* Task group related information */
>  struct task_group {
>  	struct cgroup_subsys_state css;
> +	u64 latency_nice;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	/* schedulable entities of this group on each CPU */


Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05  6:15     ` Parth Shah
@ 2019-09-05 10:11       ` Patrick Bellasi
  2019-09-06 12:22         ` Parth Shah
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 10:11 UTC (permalink / raw)
  To: Parth Shah
  Cc: Tim Chen, subhra mazumdar, linux-kernel, peterz, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, mgorman


On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...

> On 9/4/19 11:02 PM, Tim Chen wrote:
>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>> 
>> 
>> Subhra,
>> 
>> Thanks for posting the patchset.  Having a latency nice hint
>> is useful beyond idle load balancing.  I can think of other
>> application scenarios, like scheduling batch machine learning AVX 512
>> processes with latency sensitive processes.  AVX512 limits the frequency
>> of the CPU and it is best to avoid latency sensitive task on the
>> same core with AVX512.  So latency nice hint allows the scheduler
>> to have a criteria to determine the latency sensitivity of a task
>> and arrange latency sensitive tasks away from AVX512 tasks.
>> 
>
>
> Hi Tim and Subhra,
>
> This patchset seems to be interesting for my TurboSched patches as well
> where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
> Well, the problem I face is that we sometime end up putting multiple jitter tasks on a core
> running some latency sensitive application which may see performance degradation.
> So my plan was to classify such tasks to be latency sensitive thereby hinting the load
> balancer to not put tasks on such cores.
>
> TurboSched: https://lkml.org/lkml/2019/7/25/296
>
>> You configure the latency hint on a cgroup basis.
>> But I think not all tasks in a cgroup necessarily have the same
>> latency sensitivity.
>> 
>> For example, I can see that cgroup can be applied on a per user basis,
>> and the user could run different tasks that have different latency sensitivity.
>> We may also need a way to configure latency sensitivity on a per task basis instead on
>> a per cgroup basis.
>> 
>
> AFAIU, the problem defined above intersects with my patches as well where the interface
> is required to classify the jitter tasks. I have already tried few methods like
> syscall and cgroup to classify such tasks and maybe something like that can be adopted
> with these patchset as well.

Agree, these two patchest are definitively overlapping in terms of
mechanisms and APIs to expose to userspace. You to guys seems to target
different goals but the general approach should be:

 - expose a single and abstract concept to user-space
   latency-nice or latency-tolerant as PaulT proposed at OSPM

 - map this concept in kernel-space to different kind of bias, both at
   wakeup time and load-balance time, and use both for RT and CFS tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search
  2019-08-30 17:49 ` [RFC PATCH 3/9] sched: add sched feature to disable idle core search subhra mazumdar
@ 2019-09-05 10:17   ` Patrick Bellasi
  2019-09-05 22:02     ` Subhra Mazumdar
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 10:17 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth


On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...

> Add a new sched feature SIS_CORE to have an option to disable idle core
> search (select_idle_core).
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/features.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 858589b..de4d506 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_AVG_CPU, false)
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in
which the usage of latency-nice does not make sense for in "Select Idle
Sibling", then we should probably better add a new Kconfig option.

If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:

  commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")

>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 4/9] sched: SIS_CORE to disable idle core search
  2019-08-30 17:49 ` [RFC PATCH 4/9] sched: SIS_CORE " subhra mazumdar
@ 2019-09-05 10:19   ` Patrick Bellasi
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 10:19 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth


On Fri, Aug 30, 2019 at 18:49:39 +0100, subhra mazumdar wrote...

> Use SIS_CORE to disable idle core search. For some workloads
> select_idle_core becomes a scalability bottleneck, removing it improves
> throughput. Also there are workloads where disabling it can hurt latency,
> so need to have an option.
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c31082d..23ec9c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if (!sd)
>  		return target;
>  
> -	i = select_idle_core(p, sd, target);
> -	if ((unsigned)i < nr_cpumask_bits)
> -		return i;
> +	if (sched_feat(SIS_CORE)) {
> +		i = select_idle_core(p, sd, target);
> +		if ((unsigned)i < nr_cpumask_bits)
> +			return i;
> +	}
>  
>  	i = select_idle_cpu(p, sd, target);
>  	if ((unsigned)i < nr_cpumask_bits)

This looks like should be squashed with the previous one, or whatever
code you'll add to define when this "biasing" is to be used or not.

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 0/9] Task latency-nice
  2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
                   ` (9 preceding siblings ...)
  2019-09-05  5:55 ` [RFC PATCH 0/9] Task latency-nice Parth Shah
@ 2019-09-05 10:31 ` Patrick Bellasi
  10 siblings, 0 replies; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 10:31 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth


On Fri, Aug 30, 2019 at 18:49:35 +0100, subhra mazumdar wrote...

> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path.

As per my comments in other message, we should try to better split the
"latency nice" concept introduction (API and mechanisms) from its usage
in different scheduler code paths.

This distinction should be evident from the cover letter too. What you
use "latency nice" for is just one possible use-case, thus it's
important to make sure it's generic enough to fit other usages too.

> Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.

I don't see why you latency-nice cannot be used to implement what you
get with NO_SIS_CORE.

IMHO, "latency nice" should be exposed as a pretty generic concept which
progressively enables different "levels of biasing" both at wake-up time
and load-balance time.

Why it's not possible to have SIS_CORE/NO_SIS_CORE switch implemented
just as different threshold values for the latency-nice value of a task?

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05  9:45     ` Patrick Bellasi
@ 2019-09-05 10:46       ` Peter Zijlstra
  2019-09-05 11:13         ` Qais Yousef
  2019-09-05 11:18         ` Patrick Bellasi
  2019-09-06 12:31       ` Parth Shah
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05 10:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Subhra Mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:

> > From just reading the above, I would expect it to have the range
> > [-20,19] just like normal nice. Apparently this is not so.
> 
> Regarding the range for the latency-nice values, I guess we have two
> options:
> 
>   - [-20..19], which makes it similar to priorities
>   downside: we quite likely end up with a kernel space representation
>   which does not match the user-space one, e.g. look at
>   task_struct::prio.
> 
>   - [0..1024], which makes it more similar to a "percentage"
> 
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
> 
> That will translate into:
> 
>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>   policies are disabled and we wakeup up as fast as possible
> 
>   latency-nice=1024 : maximum niceness, where for example we can imaging
>   to turn switch a CFS task to be SCHED_IDLE?

There's a few things wrong there; I really feel that if we call it nice,
it should be like nice. Otherwise we should call it latency-bias and not
have the association with nice to confuse people.

Secondly; the default should be in the middle of the range. Naturally
this would be a signed range like nice [-(x+1),x] for some x. but if you
want [0,1024], then the default really should be 512, but personally I
like 0 better as a default, in which case we need negative numbers.

This is important because we want to be able to bias towards less
importance to (tail) latency as well as more importantance to (tail)
latency.

Specifically, Oracle wants to sacrifice (some) latency for throughput.
Facebook OTOH seems to want to sacrifice (some) throughput for latency.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 10:05   ` Patrick Bellasi
@ 2019-09-05 10:48     ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05 10:48 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: subhra mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 11:05:18AM +0100, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..365c928 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
> >  #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
> >  
> >  /*
> > + * Latency-nice default value
> > + */
> > +#define	LATENCY_NICE_DEFAULT	5
> > +#define	LATENCY_NICE_MIN	1
> > +#define	LATENCY_NICE_MAX	100
> 
> Values 1 and 5 looks kind of arbitrary.

Yes, and like I just wrote, completely and utterly wrong.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 10:46       ` Peter Zijlstra
@ 2019-09-05 11:13         ` Qais Yousef
  2019-09-05 11:30           ` Peter Zijlstra
  2019-09-05 11:30           ` Patrick Bellasi
  2019-09-05 11:18         ` Patrick Bellasi
  1 sibling, 2 replies; 55+ messages in thread
From: Qais Yousef @ 2019-09-05 11:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Patrick Bellasi, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On 09/05/19 12:46, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
> 
> > > From just reading the above, I would expect it to have the range
> > > [-20,19] just like normal nice. Apparently this is not so.
> > 
> > Regarding the range for the latency-nice values, I guess we have two
> > options:
> > 
> >   - [-20..19], which makes it similar to priorities
> >   downside: we quite likely end up with a kernel space representation
> >   which does not match the user-space one, e.g. look at
> >   task_struct::prio.
> > 
> >   - [0..1024], which makes it more similar to a "percentage"
> > 
> > Being latency-nice a new concept, we are not constrained by POSIX and
> > IMHO the [0..1024] scale is a better fit.
> > 
> > That will translate into:
> > 
> >   latency-nice=0 : default (current mainline) behaviour, all "biasing"
> >   policies are disabled and we wakeup up as fast as possible
> > 
> >   latency-nice=1024 : maximum niceness, where for example we can imaging
> >   to turn switch a CFS task to be SCHED_IDLE?
> 
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
> 
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
> 
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
> 
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Another use case I'm considering is using latency-nice to prefer an idle CPU if
latency-nice is set otherwise go for the most energy efficient CPU.

Ie: sacrifice (some) energy for latency.

The way I see interpreting latency-nice here as a binary switch. But maybe we
can use the range to select what (some) energy to sacrifice mean here. Hmmm.

--
Qais Yousef

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 10:46       ` Peter Zijlstra
  2019-09-05 11:13         ` Qais Yousef
@ 2019-09-05 11:18         ` Patrick Bellasi
  2019-09-05 11:40           ` Peter Zijlstra
  2019-09-05 11:46           ` Valentin Schneider
  1 sibling, 2 replies; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 11:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Subhra Mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, parth


On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
>> > From just reading the above, I would expect it to have the range
>> > [-20,19] just like normal nice. Apparently this is not so.
>> 
>> Regarding the range for the latency-nice values, I guess we have two
>> options:
>> 
>>   - [-20..19], which makes it similar to priorities
>>   downside: we quite likely end up with a kernel space representation
>>   which does not match the user-space one, e.g. look at
>>   task_struct::prio.
>> 
>>   - [0..1024], which makes it more similar to a "percentage"
>> 
>> Being latency-nice a new concept, we are not constrained by POSIX and
>> IMHO the [0..1024] scale is a better fit.
>> 
>> That will translate into:
>> 
>>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>>   policies are disabled and we wakeup up as fast as possible
>> 
>>   latency-nice=1024 : maximum niceness, where for example we can imaging
>>   to turn switch a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Right, we have this dualism to deal with and current mainline behaviour
is somehow in the middle.

BTW, the FB requirement is the same we have in Android.
We want some CFS tasks to have very small latency and a low chance
to be preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that since we are introducing a new (non POSIX) concept we can get the
chance to make it more human friendly.

Give the two extremes above, would not be much simpler and intuitive to
have 0 implementing the FB/Android (no latency) case and 1024 the
(max latency) Oracle case?

Moreover, we will never match completely the nice semantic, give that
a 1 nice unit has a proper math meaning, isn't something like 10% CPU
usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by marking
its difference wrt prior/nice levels? AFAIR, that was also the original
proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:13         ` Qais Yousef
@ 2019-09-05 11:30           ` Peter Zijlstra
  2019-09-05 11:40             ` Patrick Bellasi
  2019-09-05 11:47             ` Qais Yousef
  2019-09-05 11:30           ` Patrick Bellasi
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05 11:30 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Patrick Bellasi, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> On 09/05/19 12:46, Peter Zijlstra wrote:

> > This is important because we want to be able to bias towards less
> > importance to (tail) latency as well as more importantance to (tail)
> > latency.
> > 
> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> 
> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> latency-nice is set otherwise go for the most energy efficient CPU.
> 
> Ie: sacrifice (some) energy for latency.
> 
> The way I see interpreting latency-nice here as a binary switch. But
> maybe we can use the range to select what (some) energy to sacrifice
> mean here. Hmmm.

It cannot be binary, per definition is must be ternary, that is, <0, ==0
and >0 (or middle value if you're of that persuasion).

In your case, I'm thinking you mean >0, we want to lower the latency.

Anyway; there were a number of things mentioned at OSPM that we could
tie into this thing and finding sensible mappings is going to be a bit
of trial and error I suppose.

But as patrick said; we're very much exporting a BIAS knob, not a set of
behaviours.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:13         ` Qais Yousef
  2019-09-05 11:30           ` Peter Zijlstra
@ 2019-09-05 11:30           ` Patrick Bellasi
  2019-09-05 11:47             ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 11:30 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth


On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote...

> On 09/05/19 12:46, Peter Zijlstra wrote:
>> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>> 
>> > > From just reading the above, I would expect it to have the range
>> > > [-20,19] just like normal nice. Apparently this is not so.
>> > 
>> > Regarding the range for the latency-nice values, I guess we have two
>> > options:
>> > 
>> >   - [-20..19], which makes it similar to priorities
>> >   downside: we quite likely end up with a kernel space representation
>> >   which does not match the user-space one, e.g. look at
>> >   task_struct::prio.
>> > 
>> >   - [0..1024], which makes it more similar to a "percentage"
>> > 
>> > Being latency-nice a new concept, we are not constrained by POSIX and
>> > IMHO the [0..1024] scale is a better fit.
>> > 
>> > That will translate into:
>> > 
>> >   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> >   policies are disabled and we wakeup up as fast as possible
>> > 
>> >   latency-nice=1024 : maximum niceness, where for example we can imaging
>> >   to turn switch a CFS task to be SCHED_IDLE?
>> 
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>> 
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>> 
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>> 
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But maybe we
> can use the range to select what (some) energy to sacrifice mean here. Hmmm.

I see this concept possibly evolving into something more then just a
binary switch. Not yet convinced if it make sense and/or it's possible
but, in principle, I was thinking about these possible usages for CFS
tasks:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured threshold of latency
   niceness.

 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.

 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:30           ` Peter Zijlstra
@ 2019-09-05 11:40             ` Patrick Bellasi
  2019-09-05 11:48               ` Peter Zijlstra
  2019-09-05 11:47             ` Qais Yousef
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 11:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth


On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>
>> > This is important because we want to be able to bias towards less
>> > importance to (tail) latency as well as more importantance to (tail)
>> > latency.
>> > 
>> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>> latency-nice is set otherwise go for the most energy efficient CPU.
>> 
>> Ie: sacrifice (some) energy for latency.
>> 
>> The way I see interpreting latency-nice here as a binary switch. But
>> maybe we can use the range to select what (some) energy to sacrifice
>> mean here. Hmmm.
>
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).
>
> In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Right, although I think behaviours could still be exported but via a
different and configurable interface, using thresholds.

Either at compile time or via procfs maybe we can expose and properly
document what happen in the scheduler if/when a task has a "latency
niceness" crossing a given threshold.

For example, by setting something like:

   /proc/sys/kernel/sched_cfs_latency_idle = 1000

we state that the task is going to be scheduled according to the
SCHED_IDLE policy.

  ( ( (tomatoes target here) ) )

Not sure also if we wanna commit to user-space APIs how we internally
map/translate a "latency niceness" value into a scheduler behaviour
bias. Maybe better not at least at the very beginning.

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:18         ` Patrick Bellasi
@ 2019-09-05 11:40           ` Peter Zijlstra
  2019-09-05 11:46             ` Patrick Bellasi
  2019-09-05 11:46           ` Valentin Schneider
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05 11:40 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Subhra Mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:

> Right, we have this dualism to deal with and current mainline behaviour
> is somehow in the middle.
> 
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
> 
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.

I'm arguing that signed _is_ more human friendly ;-)

> Give the two extremes above, would not be much simpler and intuitive to
> have 0 implementing the FB/Android (no latency) case and 1024 the
> (max latency) Oracle case?

See, I find the signed thing more natural, negative is a bias away from
latency sensitive, positive is a bias towards latency sensitive.

Also; 0 is a good default value ;-)

> Moreover, we will never match completely the nice semantic, give that
> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
> usage change for each step?

Only because we were nice when implementing it. Posix leaves it
unspecified and we could change it at any time. The only real semantics
is a relative 'weight' (opengroup uses the term 'favourable').

> Could changing the name to "latency-tolerance" break the tie by marking
> its difference wrt prior/nice levels? AFAIR, that was also the original
> proposal [1] by PaulT during the OSPM discussion.

latency torrerance could still be a signed entity, positive would
signify we're more tolerant of latency (ie. less sensitive) while
negative would be less tolerant (ie. more sensitive).

> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.

I'm not quite sure; yes, for some of these things, like the idle search
on wakeup, certainly. But say for wakeup-preemption, we could definitely
make it a task relative attribute.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:18         ` Patrick Bellasi
  2019-09-05 11:40           ` Peter Zijlstra
@ 2019-09-05 11:46           ` Valentin Schneider
  2019-09-05 13:07             ` Patrick Bellasi
  1 sibling, 1 reply; 55+ messages in thread
From: Valentin Schneider @ 2019-09-05 11:46 UTC (permalink / raw)
  To: Patrick Bellasi, Peter Zijlstra
  Cc: Subhra Mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, parth

On 05/09/2019 12:18, Patrick Bellasi wrote:
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>>
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>>
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>>
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> 
> Right, we have this dualism to deal with and current mainline behaviour
> is somehow in the middle.
> 
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
> 
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.
> 
> Give the two extremes above, would not be much simpler and intuitive to
> have 0 implementing the FB/Android (no latency) case and 1024 the
> (max latency) Oracle case?
> 

For something like latency-<whatever>, I don't see the point of having
such a wide range. The nice range is probably more than enough - and before
even bothering about the range, we should probably agree on what the range
should represent.

If it's niceness, I read it as: positive latency-nice value means we're
nice to latency, means we reduce it. So the further up you go, the more you
restrict your wakeup scan. I think it's quite easy to map that into the
code: current behaviour at 0, with a decreasing scan mask size as we go
towards +19. I don't think anyone needs 512 steps to tune this.

I don't know what logic we'd follow for negative values though. Maybe
latency-nice -20 means always going through the slowpath, but what of the
intermediate values?

AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
applied elsewhere (e.g. load-balance, something like task_hot() and its
use of sysctl_sched_migration_cost).

> Moreover, we will never match completely the nice semantic, give that
> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
> usage change for each step?
> 
> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.
> 
> Could changing the name to "latency-tolerance" break the tie by marking
> its difference wrt prior/nice levels? AFAIR, that was also the original
> proposal [1] by PaulT during the OSPM discussion.
> 
> Best,
> Patrick
> 
> [1] https://youtu.be/oz43thSFqmk?t=1302
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:40           ` Peter Zijlstra
@ 2019-09-05 11:46             ` Patrick Bellasi
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 11:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Subhra Mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, parth


On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>
> See, I find the signed thing more natural, negative is a bias away from
> latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match completely the nice semantic, give that
>> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
>> usage change for each step?
>
> Only because we were nice when implementing it. Posix leaves it
> unspecified and we could change it at any time. The only real semantics
> is a relative 'weight' (opengroup uses the term 'favourable').

Good to know, I was considering it a POXIS requirement.

>> Could changing the name to "latency-tolerance" break the tie by marking
>> its difference wrt prior/nice levels? AFAIR, that was also the original
>> proposal [1] by PaulT during the OSPM discussion.
>
> latency torrerance could still be a signed entity, positive would
> signify we're more tolerant of latency (ie. less sensitive) while
> negative would be less tolerant (ie. more sensitive).

Right.

>> For latency-nice instead we will likely base our biasing strategies on
>> some predefined (maybe system-wide configurable) const thresholds.
>
> I'm not quite sure; yes, for some of these things, like the idle search
> on wakeup, certainly. But say for wakeup-preemption, we could definitely
> make it a task relative attribute.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:30           ` Patrick Bellasi
@ 2019-09-05 11:47             ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05 11:47 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Qais Yousef, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 12:30:52PM +0100, Patrick Bellasi wrote:

> I see this concept possibly evolving into something more then just a
> binary switch. Not yet convinced if it make sense and/or it's possible
> but, in principle, I was thinking about these possible usages for CFS
> tasks:
> 
>  - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
>    depending on crossing certain pre-configured threshold of latency
>    niceness.

A big part of BATCH is wakeup preemption (batch doesn't preempt itself),
and wakeup preemption is a task-task propery and can thus be completely
relative.

>  - dynamically bias the vruntime updates we do in place_entity()
>    depending on the actual latency niceness of a task.

That is dangerous; theory says we should keep track of the 0-lag point
and place it back where we found it. BFQ does this correctly IIRC, but
for CFS I've never done that because 'expensive'.

But yes, we could (carefully) fumble a bit there.

>  - bias the decisions we take in check_preempt_tick() still depending
>    on a relative comparison of the current and wakeup task latency
>    niceness values.

Ack.

Placing relative and absolute behaviour on the same scale is going to be
'fun' :-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:30           ` Peter Zijlstra
  2019-09-05 11:40             ` Patrick Bellasi
@ 2019-09-05 11:47             ` Qais Yousef
  2020-04-16  0:02               ` Joel Fernandes
  1 sibling, 1 reply; 55+ messages in thread
From: Qais Yousef @ 2019-09-05 11:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Patrick Bellasi, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On 09/05/19 13:30, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> > On 09/05/19 12:46, Peter Zijlstra wrote:
> 
> > > This is important because we want to be able to bias towards less
> > > importance to (tail) latency as well as more importantance to (tail)
> > > latency.
> > > 
> > > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> > 
> > Another use case I'm considering is using latency-nice to prefer an idle CPU if
> > latency-nice is set otherwise go for the most energy efficient CPU.
> > 
> > Ie: sacrifice (some) energy for latency.
> > 
> > The way I see interpreting latency-nice here as a binary switch. But
> > maybe we can use the range to select what (some) energy to sacrifice
> > mean here. Hmmm.
> 
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).

I meant I want to use it as a binary.

> 
> In your case, I'm thinking you mean >0, we want to lower the latency.

Yes. As long as there's an easy way to say: does this task care about latency
or not I'm good.

> 
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
> 
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Agreed. I just wanted to say that the way this range is going to be
interpreted will differ from path to path and we need to consider that in the
final mapping. Especially from the final user's perspective of what setting
this value ultimately means to them.

--
Qais Yousef

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:40             ` Patrick Bellasi
@ 2019-09-05 11:48               ` Peter Zijlstra
  2019-09-05 13:32                 ` Qais Yousef
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2019-09-05 11:48 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Qais Yousef, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> Right, although I think behaviours could still be exported but via a
> different and configurable interface, using thresholds.

I would try _really_ hard to avoid pinning down behaviour. The more you
do that, the less you can change.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:46           ` Valentin Schneider
@ 2019-09-05 13:07             ` Patrick Bellasi
  2019-09-05 14:48               ` Valentin Schneider
  2019-09-06 12:45               ` Parth Shah
  0 siblings, 2 replies; 55+ messages in thread
From: Patrick Bellasi @ 2019-09-05 13:07 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth


On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...

> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>> There's a few things wrong there; I really feel that if we call it nice,
>>> it should be like nice. Otherwise we should call it latency-bias and not
>>> have the association with nice to confuse people.
>>>
>>> Secondly; the default should be in the middle of the range. Naturally
>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>> want [0,1024], then the default really should be 512, but personally I
>>> like 0 better as a default, in which case we need negative numbers.
>>>
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importantance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>> 
>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>> 
>
> For something like latency-<whatever>, I don't see the point of having
> such a wide range. The nice range is probably more than enough - and before
> even bothering about the range, we should probably agree on what the range
> should represent.
>
> If it's niceness, I read it as: positive latency-nice value means we're
> nice to latency, means we reduce it. So the further up you go, the more you
> restrict your wakeup scan. I think it's quite easy to map that into the
> code: current behaviour at 0, with a decreasing scan mask size as we go
> towards +19. I don't think anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of the
> intermediate values?

Yep, I think so fare we are all converging towards the idea to use the
a signed range. Regarding the range itself, yes: 1024 looks very
oversized, but +-20 is still something which leave room for a bit of
flexibility and it also better matches the idea that we don't want to
"enumerate behaviours" but just expose a knob. To map certain "bias" we
could benefit from a slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be

For the wakeup path there is also the TurboSched proposal by Parth:

   Message-ID: <20190725070857.6639-1-parth@linux.ibm.com> 
   https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/

we should keep in mind.

> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For LB can you come up with some better description of what usages you
see could benefit from a "per task" or "per task-group" latency niceness?

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:48               ` Peter Zijlstra
@ 2019-09-05 13:32                 ` Qais Yousef
  0 siblings, 0 replies; 55+ messages in thread
From: Qais Yousef @ 2019-09-05 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Patrick Bellasi, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On 09/05/19 13:48, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> > Right, although I think behaviours could still be exported but via a
> > different and configurable interface, using thresholds.
> 
> I would try _really_ hard to avoid pinning down behaviour. The more you
> do that, the less you can change.

While I agree with that but I find there's a contradiction between not
'pinning down behavior' and 'easy and clear way to bias latency sensitive from
end user's perspective'.

Maybe we should protect this with a kconfig + experimental tag until trial
and error show the best way forward?

--
Qais Yousef

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 13:07             ` Patrick Bellasi
@ 2019-09-05 14:48               ` Valentin Schneider
  2019-09-06 12:45               ` Parth Shah
  1 sibling, 0 replies; 55+ messages in thread
From: Valentin Schneider @ 2019-09-05 14:48 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman, parth

On 05/09/2019 14:07, Patrick Bellasi wrote:
> Yep, I think so fare we are all converging towards the idea to use the
> a signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leave room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
> 
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
> 
> For the wakeup path there is also the TurboSched proposal by Parth:
> 
>    Message-ID: <20190725070857.6639-1-parth@linux.ibm.com> 
>    https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/
> 
> we should keep in mind.
> 
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
> 
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
> 

task_hot() "ratelimits" migrations of tasks that were running up until
very recently (hence "cache hot"), but the knob is system wide. It might
make sense to exploit a per-task attribute to tune this (e.g. go wild if
not latency sensitive, otherwise stay away for longer).

We could perhaps even apply it to active load balance to similarly stay
away from latency sensitive tasks. Right now this is gated by a
sched_domain-wide attribute (nr_balance_failed). We could tweak this to
requiring more (less) failed attempts before interrupting latency (in)
sensitive tasks.

I'm sure we can come up with even more creative ways to pour even more
heuristics in there ;)

> Best,
> Patrick
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 7/9] sched: search SMT before LLC domain
  2019-09-05  9:31   ` Peter Zijlstra
@ 2019-09-05 20:40     ` Subhra Mazumdar
  0 siblings, 0 replies; 55+ messages in thread
From: Subhra Mazumdar @ 2019-09-05 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth, patrick.bellasi


On 9/5/19 2:31 AM, Peter Zijlstra wrote:
> On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
>> Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
>> in L1 cache locality.
>> ---
>>   kernel/sched/fair.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8856503..94dd4a32 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>   			return i;
>>   	}
>>   
>> -	i = select_idle_cpu(p, sd, target);
>> +	i = select_idle_smt(p, target);
>>   	if ((unsigned)i < nr_cpumask_bits)
>>   		return i;
>>   
>> -	i = select_idle_smt(p, target);
>> +	i = select_idle_cpu(p, sd, target);
>>   	if ((unsigned)i < nr_cpumask_bits)
>>   		return i;
>>   
> But it is absolutely conceptually wrong. An idle core is a much better
> target than an idle sibling.
This is select_idle_smt not select_idle_core.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search
  2019-09-05 10:17   ` Patrick Bellasi
@ 2019-09-05 22:02     ` Subhra Mazumdar
  0 siblings, 0 replies; 55+ messages in thread
From: Subhra Mazumdar @ 2019-09-05 22:02 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, parth


On 9/5/19 3:17 AM, Patrick Bellasi wrote:
> On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...
>
>> Add a new sched feature SIS_CORE to have an option to disable idle core
>> search (select_idle_core).
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>   kernel/sched/features.h | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> index 858589b..de4d506 100644
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>    */
>>   SCHED_FEAT(SIS_AVG_CPU, false)
>>   SCHED_FEAT(SIS_PROP, true)
>> +SCHED_FEAT(SIS_CORE, true)
> Why do we need a sched_feature? If you think there are systems in
> which the usage of latency-nice does not make sense for in "Select Idle
> Sibling", then we should probably better add a new Kconfig option.
This is not for latency-nice but to be able to disable a different aspect
of the scheduler, i.e searching for idle cores. This can be made part of
latency-nice (i.e not do idle core search if latency-nice is below a
certain value) but even then having a feature to disable it doesn't hurt.
>
> If that's the case, you can probably use the init/Kconfig's
> "Scheduler features" section, recently added by:
>
>    commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
>
>>   /*
>>    * Issue a WARN when we do multiple update_rq_clock() calls
> Best,
> Patrick
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 10:11       ` Patrick Bellasi
@ 2019-09-06 12:22         ` Parth Shah
  0 siblings, 0 replies; 55+ messages in thread
From: Parth Shah @ 2019-09-06 12:22 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tim Chen, subhra mazumdar, linux-kernel, peterz, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, mgorman



On 9/5/19 3:41 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...
> 
>> On 9/4/19 11:02 PM, Tim Chen wrote:
>>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>>
>>>
>>> Subhra,
>>>
>>> Thanks for posting the patchset.  Having a latency nice hint
>>> is useful beyond idle load balancing.  I can think of other
>>> application scenarios, like scheduling batch machine learning AVX 512
>>> processes with latency sensitive processes.  AVX512 limits the frequency
>>> of the CPU and it is best to avoid latency sensitive task on the
>>> same core with AVX512.  So latency nice hint allows the scheduler
>>> to have a criteria to determine the latency sensitivity of a task
>>> and arrange latency sensitive tasks away from AVX512 tasks.
>>>
>>
>>
>> Hi Tim and Subhra,
>>
>> This patchset seems to be interesting for my TurboSched patches as well
>> where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
>> Well, the problem I face is that we sometime end up putting multiple jitter tasks on a core
>> running some latency sensitive application which may see performance degradation.
>> So my plan was to classify such tasks to be latency sensitive thereby hinting the load
>> balancer to not put tasks on such cores.
>>
>> TurboSched: https://lkml.org/lkml/2019/7/25/296
>>
>>> You configure the latency hint on a cgroup basis.
>>> But I think not all tasks in a cgroup necessarily have the same
>>> latency sensitivity.
>>>
>>> For example, I can see that cgroup can be applied on a per user basis,
>>> and the user could run different tasks that have different latency sensitivity.
>>> We may also need a way to configure latency sensitivity on a per task basis instead on
>>> a per cgroup basis.
>>>
>>
>> AFAIU, the problem defined above intersects with my patches as well where the interface
>> is required to classify the jitter tasks. I have already tried few methods like
>> syscall and cgroup to classify such tasks and maybe something like that can be adopted
>> with these patchset as well.
> 
> Agree, these two patchest are definitively overlapping in terms of
> mechanisms and APIs to expose to userspace. You to guys seems to target
> different goals but the general approach should be:
> 
>  - expose a single and abstract concept to user-space
>    latency-nice or latency-tolerant as PaulT proposed at OSPM
> 

I agree. Both the patchset tries to classify a tasks for some purpose for better latency.
TurboSched requires the classification of whether the task is jitter and should not be given
enough resources/frequency. This is a boolean value.
Whereas, latency-nice is a range. So does that mean that a max-latency-nice task is a jitter?

I was thinking of not doing jitter packing on a core occupying
min-latency-nice (i.e, latency sensitive) task (until there are other busier cores).

Given this, we can expose a single per-task attribute to the user by a syscall, right?

>  - map this concept in kernel-space to different kind of bias, both at
>    wakeup time and load-balance time, and use both for RT and CFS tasks.
> 
> That's my understanding at least ;)
> 
> I guess we will have interesting discussions at the upcoming LPC to
> figure out a solution fitting all needs.

Definitely.

> 
>> Thanks,
>> Parth
> 
> Best,
> Patrick
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05  9:45     ` Patrick Bellasi
  2019-09-05 10:46       ` Peter Zijlstra
@ 2019-09-06 12:31       ` Parth Shah
  1 sibling, 0 replies; 55+ messages in thread
From: Parth Shah @ 2019-09-06 12:31 UTC (permalink / raw)
  To: Patrick Bellasi, Subhra Mazumdar, Peter Zijlstra
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman



On 9/5/19 3:15 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...
> 
>> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>
>> *sigh*, no. We start with a normal per task attribute, and then later,
>> if it is needed and makes sense, we add it to cgroups.
> 
> FWIW, to add on top of what Peter says, we used this same approach for
> uclamp and it proved to be a very effective way to come up with a good
> design. General principles have been:
> 
>  - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
>    default values for all tasks affected by that feature.
>    This interface has to define also upper bounds for task specific
>    values. Thus, in the case of latency-nice, it should be set by
>    default to the MIN value, since that's the current mainline
>    behaviour: all tasks are latency sensitive.
> 
>  - a per-task API [2] (via the sched_setattr() syscall) can be used to
>    relax the system wide setting thus implementing a "nice" policy.
> 
>  - a per-taskgroup API [3] (via cpu controller's attributes) can be used
>    to relax the system-wide settings and restrict the per-task API.
> 
> The above features are worth to be added in that exact order.
> 
>> Also, your Changelog fails on pretty much every point. It doesn't
>> explain why, it doesn't describe anything and so on.
> 
> On the description side, I guess it's worth to mention somewhere to
> which scheduling classes this feature can be useful for. It's worth to
> mention that it can apply only to:
> 
>  - CFS tasks: for example, at wakeup time a task with an high
>    latency-nice should avoid to preempt a low latency-nice task.
>    Maybe by mapping the latency nice value into proper vruntime
>    normalization value?
> 

If I got this correct, does this also mean that a task's latency-nice
will be mapped to prio/nice.
i.e, task with min-latency-nice will have highest priority?

>  - RT tasks: for example, at wakeup time a task with an high
>    latency-nice value could avoid to preempt a CFS task.
> 

So, will this make CFS task to precede RT task?
and cause priority inversion?

> I'm sure there will be discussion about some of these features, that's
> why it's important in the proposal presentation to keep a well defined
> distinction among the "mechanisms and API" and how we use the new
> concept to "bias" some scheduler policies.
> 
>> From just reading the above, I would expect it to have the range
>> [-20,19] just like normal nice. Apparently this is not so.
> 
> Regarding the range for the latency-nice values, I guess we have two
> options:
> 
>   - [-20..19], which makes it similar to priorities
>   downside: we quite likely end up with a kernel space representation
>   which does not match the user-space one, e.g. look at
>   task_struct::prio.
> 
>   - [0..1024], which makes it more similar to a "percentage"
> 
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
> 
> That will translate into:
> 
>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>   policies are disabled and we wakeup up as fast as possible
> 
>   latency-nice=1024 : maximum niceness, where for example we can imaging
>   to turn switch a CFS task to be SCHED_IDLE?
> 
> Best,
> Patrick
> 
> [1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
> [2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
> [3] 5 patches in today's tip/sched/core up to:
>     commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 13:07             ` Patrick Bellasi
  2019-09-05 14:48               ` Valentin Schneider
@ 2019-09-06 12:45               ` Parth Shah
  2019-09-06 14:13                 ` Valentin Schneider
  1 sibling, 1 reply; 55+ messages in thread
From: Parth Shah @ 2019-09-06 12:45 UTC (permalink / raw)
  To: Patrick Bellasi, Valentin Schneider
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman



On 9/5/19 6:37 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...
> 
>> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>>> There's a few things wrong there; I really feel that if we call it nice,
>>>> it should be like nice. Otherwise we should call it latency-bias and not
>>>> have the association with nice to confuse people.
>>>>
>>>> Secondly; the default should be in the middle of the range. Naturally
>>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>>> want [0,1024], then the default really should be 512, but personally I
>>>> like 0 better as a default, in which case we need negative numbers.
>>>>
>>>> This is important because we want to be able to bias towards less
>>>> importance to (tail) latency as well as more importantance to (tail)
>>>> latency.
>>>>
>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>
>>> Right, we have this dualism to deal with and current mainline behaviour
>>> is somehow in the middle.
>>>
>>> BTW, the FB requirement is the same we have in Android.
>>> We want some CFS tasks to have very small latency and a low chance
>>> to be preempted by the wake-up of less-important "background" tasks.
>>>
>>> I'm not totally against the usage of a signed range, but I'm thinking
>>> that since we are introducing a new (non POSIX) concept we can get the
>>> chance to make it more human friendly.
>>>
>>> Give the two extremes above, would not be much simpler and intuitive to
>>> have 0 implementing the FB/Android (no latency) case and 1024 the
>>> (max latency) Oracle case?
>>>
>>
>> For something like latency-<whatever>, I don't see the point of having
>> such a wide range. The nice range is probably more than enough - and before
>> even bothering about the range, we should probably agree on what the range
>> should represent.
>>
>> If it's niceness, I read it as: positive latency-nice value means we're
>> nice to latency, means we reduce it. So the further up you go, the more you
>> restrict your wakeup scan. I think it's quite easy to map that into the
>> code: current behaviour at 0, with a decreasing scan mask size as we go
>> towards +19. I don't think anyone needs 512 steps to tune this.
>>
>> I don't know what logic we'd follow for negative values though. Maybe
>> latency-nice -20 means always going through the slowpath, but what of the
>> intermediate values?
> 
> Yep, I think so fare we are all converging towards the idea to use the
> a signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leave room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
> 
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
> 
> For the wakeup path there is also the TurboSched proposal by Parth:
> 
>    Message-ID: <20190725070857.6639-1-parth@linux.ibm.com> 
>    https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/
> 
> we should keep in mind.
> 
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
> 
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
> 

I guess there is some usecase in case of thermal throttling.
If a task is heating up the core then in ideal scenarios POWER systems throttle
down to rated frequency.
In such case, if the task is latency sensitive (min latency nice), we can move the
task around the chip to heat up the chip uniformly allowing me to gain more performance
with sustained higher frequency.
With this, we will require the help from active load balancer and latency-nice
classification on per task and/or group basis.

Hopefully, this might be useful for other arch as well, right?

> Best,
> Patrick
> 

Thanks,
Parth


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-06 12:45               ` Parth Shah
@ 2019-09-06 14:13                 ` Valentin Schneider
  2019-09-06 14:32                   ` Vincent Guittot
  2019-09-06 17:10                   ` Parth Shah
  0 siblings, 2 replies; 55+ messages in thread
From: Valentin Schneider @ 2019-09-06 14:13 UTC (permalink / raw)
  To: Parth Shah, Patrick Bellasi
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman

On 06/09/2019 13:45, Parth Shah wrote:> 
> I guess there is some usecase in case of thermal throttling.
> If a task is heating up the core then in ideal scenarios POWER systems throttle
> down to rated frequency.
> In such case, if the task is latency sensitive (min latency nice), we can move the
> task around the chip to heat up the chip uniformly allowing me to gain more performance
> with sustained higher frequency.
> With this, we will require the help from active load balancer and latency-nice
> classification on per task and/or group basis.
> 
> Hopefully, this might be useful for other arch as well, right?
> 

Most of the functionality is already there, we're only really missing thermal
pressure awareness. There was [1] but it seems to have died.


At least with CFS load balancing, if thermal throttling is correctly
reflected as a CPU capacity reduction you will tend to move things away from
that CPU, since load is balanced over capacities.


For active balance, we actually already have a condition that moves a task
to a less capacity-pressured CPU (although it is somewhat specific). So if
thermal pressure follows that task (e.g. it's doing tons of vector/float),
it will be rotated around.

However there should be a point made on latency vs throughput. If you
care about latency you probably do not want to active balance your task. If
you care about throughput, it should be specified in some way (util-clamp
says hello!).

It sort of feels like you'd want an extension of misfit migration (salesman
hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
enabled for systems with asymmetric capacities, but could be enabled globally
for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
on SMP systems).

On top of that, if we make misfit consider e.g. uclamp.min (I don't think
that's already the case), then you have your throughput knob to have *some* 
designated tasks move away from (thermal & else) pressure. 


[1]: https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopinath@linaro.org/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-06 14:13                 ` Valentin Schneider
@ 2019-09-06 14:32                   ` Vincent Guittot
  2019-09-06 17:10                   ` Parth Shah
  1 sibling, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2019-09-06 14:32 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Parth Shah, Patrick Bellasi, Peter Zijlstra, Subhra Mazumdar,
	linux-kernel, Ingo Molnar, Thomas Gleixner, steven.sistare,
	Dhaval Giani, Daniel Lezcano, viresh kumar, Tim Chen, Mel Gorman

On Fri, 6 Sep 2019 at 16:13, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 06/09/2019 13:45, Parth Shah wrote:>
> > I guess there is some usecase in case of thermal throttling.
> > If a task is heating up the core then in ideal scenarios POWER systems throttle
> > down to rated frequency.
> > In such case, if the task is latency sensitive (min latency nice), we can move the
> > task around the chip to heat up the chip uniformly allowing me to gain more performance
> > with sustained higher frequency.
> > With this, we will require the help from active load balancer and latency-nice
> > classification on per task and/or group basis.
> >
> > Hopefully, this might be useful for other arch as well, right?
> >
>
> Most of the functionality is already there, we're only really missing thermal
> pressure awareness. There was [1] but it seems to have died.

Thara still works on it but she has been sidetracked on other
activities and  It takes more time than expected to run all tests with
different averaging window and process the results
>
>
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction you will tend to move things away from
> that CPU, since load is balanced over capacities.
>
>
> For active balance, we actually already have a condition that moves a task
> to a less capacity-pressured CPU (although it is somewhat specific). So if
> thermal pressure follows that task (e.g. it's doing tons of vector/float),
> it will be rotated around.
>
> However there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your task. If
> you care about throughput, it should be specified in some way (util-clamp
> says hello!).
>
> It sort of feels like you'd want an extension of misfit migration (salesman
> hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
> util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
> enabled for systems with asymmetric capacities, but could be enabled globally
> for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
> on SMP systems).
>
> On top of that, if we make misfit consider e.g. uclamp.min (I don't think
> that's already the case), then you have your throughput knob to have *some*
> designated tasks move away from (thermal & else) pressure.
>
>
> [1]: https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopinath@linaro.org/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-06 14:13                 ` Valentin Schneider
  2019-09-06 14:32                   ` Vincent Guittot
@ 2019-09-06 17:10                   ` Parth Shah
  2019-09-06 22:50                     ` Valentin Schneider
  1 sibling, 1 reply; 55+ messages in thread
From: Parth Shah @ 2019-09-06 17:10 UTC (permalink / raw)
  To: Valentin Schneider, Patrick Bellasi
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman



On 9/6/19 7:43 PM, Valentin Schneider wrote:
> On 06/09/2019 13:45, Parth Shah wrote:> 
>> I guess there is some usecase in case of thermal throttling.
>> If a task is heating up the core then in ideal scenarios POWER systems throttle
>> down to rated frequency.
>> In such case, if the task is latency sensitive (min latency nice), we can move the
>> task around the chip to heat up the chip uniformly allowing me to gain more performance
>> with sustained higher frequency.
>> With this, we will require the help from active load balancer and latency-nice
>> classification on per task and/or group basis.
>>
>> Hopefully, this might be useful for other arch as well, right?
>>
> 
> Most of the functionality is already there, we're only really missing thermal
> pressure awareness. There was [1] but it seems to have died.
> 
> 
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction you will tend to move things away from
> that CPU, since load is balanced over capacities.
> 

Right, CPU capacity can solve the problem of indicating the thermal throttle to the scheduler.
AFAIU, the patchset from Thara changes CPU capacity to reflect Thermal headroom of the CPU.
This is a nice mitigation but,
1. Sometimes a single task is responsible for the Thermal heatup of the core, reducing the
   CPU capacity of all the CPUs in the core is not optimal when just moving such single
   task to other core can allow us to remain within thermal headroom. This is important
   for the servers especially where there are upto 8 threads.
2. Given the implementation in the patches and its integration with EAS, it seems difficult
   to adapt to servers, where CPU capacity itself is in doubt.
   https://lkml.org/lkml/2019/5/15/1402

> 
> For active balance, we actually already have a condition that moves a task
> to a less capacity-pressured CPU (although it is somewhat specific). So if
> thermal pressure follows that task (e.g. it's doing tons of vector/float),
> it will be rotated around.

Agree. But this should break in certain conditions like when we have multiple tasks
in a core with almost equal utilization among which one is just doing vector operations.
LB can pick and move any task with equal probability if the capacity is reduced here.

> 
> However there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your task. If

Can you please elaborate on why not to consider active balance for latency sensitive tasks?
Because, sometimes finding a thermally cool core is beneficial when Turbo frequency
range is around 20% above rated ones.

> you care about throughput, it should be specified in some way (util-clamp
> says hello!).
> 

yes I do care for latency and throughput both. :-)
but I'm wondering how uclamp can solve the problem for throughput.
If I make the thermally hot tasks to appear bigger than other tasks then reducing
CPU capacity can allow such tasks to move around the chip.
But this will require the utilization value to be relatively large compared to the other
tasks in the core. Or other task's uclamp.max can be lowered to make such task rotate.
If I got it right, then this will be a difficult UCLAMP usecase from user perspective, right?
I feel like I'm missing something here.

> It sort of feels like you'd want an extension of misfit migration (salesman
> hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
> util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
> enabled for systems with asymmetric capacities, but could be enabled globally
> for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
> on SMP systems).> On top of that, if we make misfit consider e.g. uclamp.min (I don't think
> that's already the case), then you have your throughput knob to have *some* 
> designated tasks move away from (thermal & else) pressure. 
> 
> 
> [1]: https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopinath@linaro.org/
> 

Thanks,
Parth


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-06 17:10                   ` Parth Shah
@ 2019-09-06 22:50                     ` Valentin Schneider
  0 siblings, 0 replies; 55+ messages in thread
From: Valentin Schneider @ 2019-09-06 22:50 UTC (permalink / raw)
  To: Parth Shah, Patrick Bellasi
  Cc: Peter Zijlstra, Subhra Mazumdar, linux-kernel, mingo, tglx,
	steven.sistare, dhaval.giani, daniel.lezcano, vincent.guittot,
	viresh.kumar, tim.c.chen, mgorman

On 06/09/2019 18:10, Parth Shah wrote:
> Right, CPU capacity can solve the problem of indicating the thermal throttle to the scheduler.
> AFAIU, the patchset from Thara changes CPU capacity to reflect Thermal headroom of the CPU.
> This is a nice mitigation but,
> 1. Sometimes a single task is responsible for the Thermal heatup of the core, reducing the
>    CPU capacity of all the CPUs in the core is not optimal when just moving such single
>    task to other core can allow us to remain within thermal headroom. This is important
>    for the servers especially where there are upto 8 threads.> 2. Given the implementation in the patches and its integration with EAS, it seems difficult
>    to adapt to servers, where CPU capacity itself is in doubt.
>    https://lkml.org/lkml/2019/5/15/1402
> 

I'd nuance this to *SMT* capacity (which isn't just servers). The thing is
that it's difficult to come up with a sensible scheme to describe the base
capacity of a single logical CPU. But yeah, valid point.

>>
>> For active balance, we actually already have a condition that moves a task
>> to a less capacity-pressured CPU (although it is somewhat specific). So if
>> thermal pressure follows that task (e.g. it's doing tons of vector/float),
>> it will be rotated around.
> 
> Agree. But this should break in certain conditions like when we have multiple tasks
> in a core with almost equal utilization among which one is just doing vector operations.
> LB can pick and move any task with equal probability if the capacity is reduced here.
> 

Right, if/when we get things like per-unit signals (wasn't there something
about tracking AVX a few months back?) then we'll be able to make
more informed decisions, for now we'll need some handholding (read: task
classification).

>>
>> However there should be a point made on latency vs throughput. If you
>> care about latency you probably do not want to active balance your task. If
> 
> Can you please elaborate on why not to consider active balance for latency sensitive tasks?
> Because, sometimes finding a thermally cool core is beneficial when Turbo frequency
> range is around 20% above rated ones.
> 

This goes back to my reply to Patrick further up the thread.

Right now active balance can happen just because we've been imbalanced for
some time and repeatedly failed to migrate anything. After 3 (IIRC) successive
failed attempts, we'll active balance the running task of the remote rq we
decided was busiest.

If that happens to be a latency sensitive task, that's not great - active
balancing means stopping that task's execution, so we're going to add some
latency to this latency-sensitive task. My proposal was to further ratelimit
active balance (e.g. require more failed attempts) when the task that would be
preempted is latency-sensitive.

My point is: if that task is doing fine where it is, why preempt it? That's
just introducing latency IMO (keeping in mind that those balance attempts
could happen despite not having any thermal pressure).

If you care about performance (e.g. a minimum level of throughput), to me
that is a separate (though perhaps not entirely distinct) property.

>> you care about throughput, it should be specified in some way (util-clamp
>> says hello!).
>>
> 
> yes I do care for latency and throughput both. :-)

Don't we all!

> but I'm wondering how uclamp can solve the problem for throughput.
> If I make the thermally hot tasks to appear bigger than other tasks then reducing
> CPU capacity can allow such tasks to move around the chip.
> But this will require the utilization value to be relatively large compared to the other
> tasks in the core. Or other task's uclamp.max can be lowered to make such task rotate.
> If I got it right, then this will be a difficult UCLAMP usecase from user perspective, right?
> I feel like I'm missing something here.
> 

Hmm perhaps I was jumping the gun here. What I was getting to is if you have
something like misfit that migrates tasks to CPUs of higher capacity than the
one they are on, you could use uclamp to flag them.

You could translate your throughput requirement as a uclamp.min of e.g. 80%,
and if the CPU capacity goes below that (or close within a margin) then you'd
try to migrate the task to a CPU of higher capacity (i.e. not or less 
thermally pressured).

This doesn't have to involve your less throughput-sensitive tasks, since you
would only tag and take action for your throughput-sensitive tasks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2019-09-05 11:47             ` Qais Yousef
@ 2020-04-16  0:02               ` Joel Fernandes
  2020-04-16 17:23                 ` Dietmar Eggemann
  0 siblings, 1 reply; 55+ messages in thread
From: Joel Fernandes @ 2020-04-16  0:02 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Patrick Bellasi, Subhra Mazumdar, linux-kernel,
	mingo, tglx, steven.sistare, dhaval.giani, daniel.lezcano,
	vincent.guittot, viresh.kumar, tim.c.chen, mgorman, parth

On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
> On 09/05/19 13:30, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> > > On 09/05/19 12:46, Peter Zijlstra wrote:
> > 
> > > > This is important because we want to be able to bias towards less
> > > > importance to (tail) latency as well as more importantance to (tail)
> > > > latency.
> > > > 
> > > > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > > > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> > > 
> > > Another use case I'm considering is using latency-nice to prefer an idle CPU if
> > > latency-nice is set otherwise go for the most energy efficient CPU.
> > > 
> > > Ie: sacrifice (some) energy for latency.
> > > 
> > > The way I see interpreting latency-nice here as a binary switch. But
> > > maybe we can use the range to select what (some) energy to sacrifice
> > > mean here. Hmmm.
> > 
> > It cannot be binary, per definition is must be ternary, that is, <0, ==0
> > and >0 (or middle value if you're of that persuasion).
> 
> I meant I want to use it as a binary.
> 
> > 
> > In your case, I'm thinking you mean >0, we want to lower the latency.
> 
> Yes. As long as there's an easy way to say: does this task care about latency
> or not I'm good.

Qais, Peter, all,

For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
instead of the out-of-tree schedtune mechanism to provide EAS with the
latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
upstream focussed :)

However, uclamp is missing an attribute to provide this biasing to EAS as we
know.

What was the consensus on adding a per-task attribute to uclamp for providing
this? Happy to collaborate on this front.

thanks,

 - Joel


> > Anyway; there were a number of things mentioned at OSPM that we could
> > tie into this thing and finding sensible mappings is going to be a bit
> > of trial and error I suppose.
> > 
> > But as patrick said; we're very much exporting a BIAS knob, not a set of
> > behaviours.
> 
> Agreed. I just wanted to say that the way this range is going to be
> interpreted will differ from path to path and we need to consider that in the
> final mapping. Especially from the final user's perspective of what setting
> this value ultimately means to them.
> 
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2020-04-16  0:02               ` Joel Fernandes
@ 2020-04-16 17:23                 ` Dietmar Eggemann
  2020-04-18 16:01                   ` Joel Fernandes
  0 siblings, 1 reply; 55+ messages in thread
From: Dietmar Eggemann @ 2020-04-16 17:23 UTC (permalink / raw)
  To: Joel Fernandes, Qais Yousef
  Cc: Peter Zijlstra, Patrick Bellasi, Subhra Mazumdar, linux-kernel,
	mingo, tglx, steven.sistare, dhaval.giani, daniel.lezcano,
	vincent.guittot, viresh.kumar, tim.c.chen, mgorman, parth

Hi Joel,

On 16.04.20 02:02, Joel Fernandes wrote:
> On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
>> On 09/05/19 13:30, Peter Zijlstra wrote:
>>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>>>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>>
>>>>> This is important because we want to be able to bias towards less
>>>>> importance to (tail) latency as well as more importantance to (tail)
>>>>> latency.
>>>>>
>>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>>
>>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>>>> latency-nice is set otherwise go for the most energy efficient CPU.
>>>>
>>>> Ie: sacrifice (some) energy for latency.
>>>>
>>>> The way I see interpreting latency-nice here as a binary switch. But
>>>> maybe we can use the range to select what (some) energy to sacrifice
>>>> mean here. Hmmm.
>>>
>>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
>>> and >0 (or middle value if you're of that persuasion).
>>
>> I meant I want to use it as a binary.
>>
>>>
>>> In your case, I'm thinking you mean >0, we want to lower the latency.
>>
>> Yes. As long as there's an easy way to say: does this task care about latency
>> or not I'm good.
> 
> Qais, Peter, all,
> 
> For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
> instead of the out-of-tree schedtune mechanism to provide EAS with the
> latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
> upstream focussed :)
> 
> However, uclamp is missing an attribute to provide this biasing to EAS as we
> know.
> 
> What was the consensus on adding a per-task attribute to uclamp for providing
> this? Happy to collaborate on this front.

We're planning to have a session about this topic (latency-nice
attribute per task group) during the virtual Pisa OSPM summit
retis.sssup.it/ospm-summit in May this year.

There are two presentations/discussions planned:

"Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
(EAS) in Android Common Kernel (ACK)"

We'll probably merge those two into one presentation/discussion.

So far we have Parth's per-task implementation

https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com

What's missing is the per-taskgroup implementation, at least from the
standpoint of ACK.

The (mainline) EAS use-case for latency nice is already in ACK
(android-5.4):

https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2020-04-16 17:23                 ` Dietmar Eggemann
@ 2020-04-18 16:01                   ` Joel Fernandes
  2020-04-20 11:26                     ` Parth Shah
  2020-04-20 11:47                     ` Qais Yousef
  0 siblings, 2 replies; 55+ messages in thread
From: Joel Fernandes @ 2020-04-18 16:01 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Qais Yousef, Peter Zijlstra, Patrick Bellasi, Subhra Mazumdar,
	LKML, Ingo Molnar, Thomas Glexiner, steven.sistare, Dhaval Giani,
	Daniel Lezcano, Vincent Guittot, Viresh Kumar, Tim Chen,
	Mel Gorman, parth

Hi Dietmar,

On Thu, Apr 16, 2020 at 1:23 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> Hi Joel,
>
> On 16.04.20 02:02, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
> >> On 09/05/19 13:30, Peter Zijlstra wrote:
> >>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> >>>> On 09/05/19 12:46, Peter Zijlstra wrote:
> >>>
> >>>>> This is important because we want to be able to bias towards less
> >>>>> importance to (tail) latency as well as more importantance to (tail)
> >>>>> latency.
> >>>>>
> >>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> >>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> >>>>
> >>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> >>>> latency-nice is set otherwise go for the most energy efficient CPU.
> >>>>
> >>>> Ie: sacrifice (some) energy for latency.
> >>>>
> >>>> The way I see interpreting latency-nice here as a binary switch. But
> >>>> maybe we can use the range to select what (some) energy to sacrifice
> >>>> mean here. Hmmm.
> >>>
> >>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> >>> and >0 (or middle value if you're of that persuasion).
> >>
> >> I meant I want to use it as a binary.
> >>
> >>>
> >>> In your case, I'm thinking you mean >0, we want to lower the latency.
> >>
> >> Yes. As long as there's an easy way to say: does this task care about latency
> >> or not I'm good.
> >
> > Qais, Peter, all,
> >
> > For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
> > instead of the out-of-tree schedtune mechanism to provide EAS with the
> > latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
> > upstream focussed :)
> >
> > However, uclamp is missing an attribute to provide this biasing to EAS as we
> > know.
> >
> > What was the consensus on adding a per-task attribute to uclamp for providing
> > this? Happy to collaborate on this front.
>
> We're planning to have a session about this topic (latency-nice
> attribute per task group) during the virtual Pisa OSPM summit
> retis.sssup.it/ospm-summit in May this year.

Cool, I registered as well.

>
> There are two presentations/discussions planned:
>
> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
> (EAS) in Android Common Kernel (ACK)"
>
> We'll probably merge those two into one presentation/discussion.
>
> So far we have Parth's per-task implementation
>
> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com

Cool, I see it has some Reviewed-by tags so that's a good sign. Will
look more into that.

> What's missing is the per-taskgroup implementation, at least from the
> standpoint of ACK.
>
> The (mainline) EAS use-case for latency nice is already in ACK
> (android-5.4):
>
> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54

Yes, I was aware of this. But if we use task groups, then the
transition from schedtune -> uclamp means now the tasks that use
uclamp would also be subjected to cpu.shares. That's why we were
looking into the per-task interface and glad there's some work on this
already done.

Thanks!

 - Joel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2020-04-18 16:01                   ` Joel Fernandes
@ 2020-04-20 11:26                     ` Parth Shah
  2020-04-20 19:14                       ` Joel Fernandes
  2020-04-20 11:47                     ` Qais Yousef
  1 sibling, 1 reply; 55+ messages in thread
From: Parth Shah @ 2020-04-20 11:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Dietmar Eggemann, Qais Yousef, Peter Zijlstra, Patrick Bellasi,
	Subhra Mazumdar, LKML, Ingo Molnar, Thomas Glexiner,
	steven.sistare, Dhaval Giani, Daniel Lezcano, Vincent Guittot,
	Viresh Kumar, Tim Chen, Mel Gorman

Hi Joel,

On 4/18/20 9:31 PM, Joel Fernandes wrote:
> Hi Dietmar,
> 
> On Thu, Apr 16, 2020 at 1:23 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> Hi Joel,
>>
>> On 16.04.20 02:02, Joel Fernandes wrote:
>>> On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
>>>> On 09/05/19 13:30, Peter Zijlstra wrote:
>>>>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>>>>>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>>>>
>>>>>>> This is important because we want to be able to bias towards less
>>>>>>> importance to (tail) latency as well as more importantance to (tail)
>>>>>>> latency.
>>>>>>>
>>>>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>>>>
>>>>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>>>>>> latency-nice is set otherwise go for the most energy efficient CPU.
>>>>>>
>>>>>> Ie: sacrifice (some) energy for latency.
>>>>>>
>>>>>> The way I see interpreting latency-nice here as a binary switch. But
>>>>>> maybe we can use the range to select what (some) energy to sacrifice
>>>>>> mean here. Hmmm.
>>>>>
>>>>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
>>>>> and >0 (or middle value if you're of that persuasion).
>>>>
>>>> I meant I want to use it as a binary.
>>>>
>>>>>
>>>>> In your case, I'm thinking you mean >0, we want to lower the latency.
>>>>
>>>> Yes. As long as there's an easy way to say: does this task care about latency
>>>> or not I'm good.
>>>
>>> Qais, Peter, all,
>>>
>>> For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
>>> instead of the out-of-tree schedtune mechanism to provide EAS with the
>>> latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
>>> upstream focussed :)
>>>
>>> However, uclamp is missing an attribute to provide this biasing to EAS as we
>>> know.
>>>
>>> What was the consensus on adding a per-task attribute to uclamp for providing
>>> this? Happy to collaborate on this front.
>>
>> We're planning to have a session about this topic (latency-nice
>> attribute per task group) during the virtual Pisa OSPM summit
>> retis.sssup.it/ospm-summit in May this year.
> 
> Cool, I registered as well.
> 
>>
>> There are two presentations/discussions planned:
>>
>> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
>> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
>> (EAS) in Android Common Kernel (ACK)"
>>
>> We'll probably merge those two into one presentation/discussion.
>>
>> So far we have Parth's per-task implementation
>>
>> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
> 
> Cool, I see it has some Reviewed-by tags so that's a good sign. Will
> look more into that.
> 
>> What's missing is the per-taskgroup implementation, at least from the
>> standpoint of ACK.
>>
>> The (mainline) EAS use-case for latency nice is already in ACK
>> (android-5.4):
>>
>> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> 
> Yes, I was aware of this. But if we use task groups, then the
> transition from schedtune -> uclamp means now the tasks that use
> uclamp would also be subjected to cpu.shares. That's why we were
> looking into the per-task interface and glad there's some work on this
> already done.
> 

Yes, that series of latency_nice seems to be in good shape to be used for
any usecases. Hopefully, OSPM will lead to its upstreaming sooner :-)
But at the end, we aim to have both the per-task and cgroup based interface
to mark the latency_nice value of a task.
Till, then I'm finding some generic use-cases to show benefits of such task
attribute to increase community interest.


Thanks,
Parth


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2020-04-18 16:01                   ` Joel Fernandes
  2020-04-20 11:26                     ` Parth Shah
@ 2020-04-20 11:47                     ` Qais Yousef
  2020-04-20 19:10                       ` Joel Fernandes
  1 sibling, 1 reply; 55+ messages in thread
From: Qais Yousef @ 2020-04-20 11:47 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Dietmar Eggemann, Peter Zijlstra, Patrick Bellasi,
	Subhra Mazumdar, LKML, Ingo Molnar, Thomas Glexiner,
	steven.sistare, Dhaval Giani, Daniel Lezcano, Vincent Guittot,
	Viresh Kumar, Tim Chen, Mel Gorman, parth

On 04/18/20 12:01, Joel Fernandes wrote:
> > What's missing is the per-taskgroup implementation, at least from the
> > standpoint of ACK.
> >
> > The (mainline) EAS use-case for latency nice is already in ACK
> > (android-5.4):
> >
> > https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> 
> Yes, I was aware of this. But if we use task groups, then the
> transition from schedtune -> uclamp means now the tasks that use
> uclamp would also be subjected to cpu.shares. That's why we were
> looking into the per-task interface and glad there's some work on this
> already done.

Hmm uclamp doesn't do anything with cpu.shares. I assume this is some
implementation detail at your end? IOW, you don't have to use cpu.shares to use
uclamp.

Although there should be few tasks in the system that need the latency-nice, so
I prefer the per-task interface rather than lump everything in a cgroup. Though
there could be valid use cases for the latter.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2020-04-20 11:47                     ` Qais Yousef
@ 2020-04-20 19:10                       ` Joel Fernandes
  0 siblings, 0 replies; 55+ messages in thread
From: Joel Fernandes @ 2020-04-20 19:10 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Peter Zijlstra, Patrick Bellasi,
	Subhra Mazumdar, LKML, Ingo Molnar, Thomas Glexiner,
	steven.sistare, Dhaval Giani, Daniel Lezcano, Vincent Guittot,
	Viresh Kumar, Tim Chen, Mel Gorman, parth

Hi Qais!

On Mon, Apr 20, 2020 at 12:47:29PM +0100, Qais Yousef wrote:
> On 04/18/20 12:01, Joel Fernandes wrote:
> > > What's missing is the per-taskgroup implementation, at least from the
> > > standpoint of ACK.
> > >
> > > The (mainline) EAS use-case for latency nice is already in ACK
> > > (android-5.4):
> > >
> > > https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> > 
> > Yes, I was aware of this. But if we use task groups, then the
> > transition from schedtune -> uclamp means now the tasks that use
> > uclamp would also be subjected to cpu.shares. That's why we were
> > looking into the per-task interface and glad there's some work on this
> > already done.
> 
> Hmm uclamp doesn't do anything with cpu.shares. I assume this is some
> implementation detail at your end? IOW, you don't have to use cpu.shares to use
> uclamp.

Right, it is a ChromeOS-specific issue. We have CONFIG_FAIR_GROUP_SCHED
enabled in the kernel for container workloads. However there are CGroups of
tasks that used "schedtune" CGroup interface before to provide util clamping
like behavior. We are now migrating these to the upstream util-clamp.

We can't disable CONFIG_FAIR_GROUP_SCHED because that would break the
container workloads.

So we have to use the per-process interface of util clamp.

If we used the CGroups interface of util clamping, we would get the
cpu.shares as well since the CGroup interface comes with shares. There's no
way to avoid being subject to cpu.shares (that I'm aware off anyway).

> Although there should be few tasks in the system that need the latency-nice, so
> I prefer the per-task interface rather than lump everything in a cgroup. Though
> there could be valid use cases for the latter.

Yes, with either interface, we need something like latency_nice to indicate
that the task is low-latency (something we used for a number of years with
the out-of-tree schedtune).

thanks!

 - Joel


> 
> Thanks
> 
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
  2020-04-20 11:26                     ` Parth Shah
@ 2020-04-20 19:14                       ` Joel Fernandes
  0 siblings, 0 replies; 55+ messages in thread
From: Joel Fernandes @ 2020-04-20 19:14 UTC (permalink / raw)
  To: Parth Shah
  Cc: Dietmar Eggemann, Qais Yousef, Peter Zijlstra, Patrick Bellasi,
	Subhra Mazumdar, LKML, Ingo Molnar, Thomas Glexiner,
	steven.sistare, Dhaval Giani, Daniel Lezcano, Vincent Guittot,
	Viresh Kumar, Tim Chen, Mel Gorman

On Mon, Apr 20, 2020 at 04:56:55PM +0530, Parth Shah wrote:

> >>
> >> There are two presentations/discussions planned:
> >>
> >> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
> >> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
> >> (EAS) in Android Common Kernel (ACK)"
> >>
> >> We'll probably merge those two into one presentation/discussion.
> >>
> >> So far we have Parth's per-task implementation
> >>
> >> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
> > 
> > Cool, I see it has some Reviewed-by tags so that's a good sign. Will
> > look more into that.
> > 
> >> What's missing is the per-taskgroup implementation, at least from the
> >> standpoint of ACK.
> >>
> >> The (mainline) EAS use-case for latency nice is already in ACK
> >> (android-5.4):
> >>
> >> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> > 
> > Yes, I was aware of this. But if we use task groups, then the
> > transition from schedtune -> uclamp means now the tasks that use
> > uclamp would also be subjected to cpu.shares. That's why we were
> > looking into the per-task interface and glad there's some work on this
> > already done.
> > 
> 
> Yes, that series of latency_nice seems to be in good shape to be used for
> any usecases. Hopefully, OSPM will lead to its upstreaming sooner :-)

Cool :)

> But at the end, we aim to have both the per-task and cgroup based interface
> to mark the latency_nice value of a task.

Ok. We'd likely use the per-task interface unless we decide to assign
cpu.shares for the groups as well.

> Till, then I'm finding some generic use-cases to show benefits of such task
> attribute to increase community interest.

Ok. Feel free to add ChromeOS as a usecase as well.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2020-04-20 19:14 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-30 17:49 [RFC PATCH 0/9] Task latency-nice subhra mazumdar
2019-08-30 17:49 ` [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice subhra mazumdar
2019-09-04 17:32   ` Tim Chen
2019-09-05  6:15     ` Parth Shah
2019-09-05 10:11       ` Patrick Bellasi
2019-09-06 12:22         ` Parth Shah
2019-09-05  8:31   ` Peter Zijlstra
2019-09-05  9:45     ` Patrick Bellasi
2019-09-05 10:46       ` Peter Zijlstra
2019-09-05 11:13         ` Qais Yousef
2019-09-05 11:30           ` Peter Zijlstra
2019-09-05 11:40             ` Patrick Bellasi
2019-09-05 11:48               ` Peter Zijlstra
2019-09-05 13:32                 ` Qais Yousef
2019-09-05 11:47             ` Qais Yousef
2020-04-16  0:02               ` Joel Fernandes
2020-04-16 17:23                 ` Dietmar Eggemann
2020-04-18 16:01                   ` Joel Fernandes
2020-04-20 11:26                     ` Parth Shah
2020-04-20 19:14                       ` Joel Fernandes
2020-04-20 11:47                     ` Qais Yousef
2020-04-20 19:10                       ` Joel Fernandes
2019-09-05 11:30           ` Patrick Bellasi
2019-09-05 11:47             ` Peter Zijlstra
2019-09-05 11:18         ` Patrick Bellasi
2019-09-05 11:40           ` Peter Zijlstra
2019-09-05 11:46             ` Patrick Bellasi
2019-09-05 11:46           ` Valentin Schneider
2019-09-05 13:07             ` Patrick Bellasi
2019-09-05 14:48               ` Valentin Schneider
2019-09-06 12:45               ` Parth Shah
2019-09-06 14:13                 ` Valentin Schneider
2019-09-06 14:32                   ` Vincent Guittot
2019-09-06 17:10                   ` Parth Shah
2019-09-06 22:50                     ` Valentin Schneider
2019-09-06 12:31       ` Parth Shah
2019-09-05 10:05   ` Patrick Bellasi
2019-09-05 10:48     ` Peter Zijlstra
2019-08-30 17:49 ` [RFC PATCH 2/9] sched: add search limit as per latency-nice subhra mazumdar
2019-09-05  6:22   ` Parth Shah
2019-08-30 17:49 ` [RFC PATCH 3/9] sched: add sched feature to disable idle core search subhra mazumdar
2019-09-05 10:17   ` Patrick Bellasi
2019-09-05 22:02     ` Subhra Mazumdar
2019-08-30 17:49 ` [RFC PATCH 4/9] sched: SIS_CORE " subhra mazumdar
2019-09-05 10:19   ` Patrick Bellasi
2019-08-30 17:49 ` [RFC PATCH 5/9] sched: Define macro for number of CPUs in core subhra mazumdar
2019-08-30 17:49 ` [RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86 subhra mazumdar
2019-08-30 17:49 ` [RFC PATCH 7/9] sched: search SMT before LLC domain subhra mazumdar
2019-09-05  9:31   ` Peter Zijlstra
2019-09-05 20:40     ` Subhra Mazumdar
2019-08-30 17:49 ` [RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
2019-08-30 17:49 ` [RFC PATCH 9/9] sched: rotate the cpu search window for better spread subhra mazumdar
2019-09-05  6:37   ` Parth Shah
2019-09-05  5:55 ` [RFC PATCH 0/9] Task latency-nice Parth Shah
2019-09-05 10:31 ` Patrick Bellasi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.