Introduce new per task property latency-nice for controlling scalability
in scheduler idle CPU search path. Valid latency-nice values are from 1 to
100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
CPU cgroup file cpu.latency-nice is added as an interface to set and get.
All tasks in the same cgroup share the same latency-nice value. Using a
lower latency-nice value can help latency intolerant tasks, e.g. very short
running OLTP threads where full LLC search cost can be significant compared
to run time of the threads. The default latency-nice value is 5.

In addition to latency-nice, it also adds a new sched feature SIS_CORE to
be able to disable idle core search altogether, which is costly and hurts
more than it helps in short running workloads.

Finally it also introduces a new per-cpu variable next_cpu to track the
limit of search so that every time search starts from where it ended. This
rotating search window over cpus in LLC domain ensures that idle cpus are
eventually found in case of high load.
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):

threads  baseline  latency-nice=5,SIS_CORE  latency-nice=5,NO_SIS_CORE
8        64.66     64.38  (-0.43%)          64.79  (0.2%)
16       123.34    122.88 (-0.37%)          125.87 (2.05%)
32       215.18    215.55 (0.17%)           247.77 (15.15%)
48       278.56    321.6  (15.45%)          321.2  (15.3%)
64       259.99    319.45 (22.87%)          333.95 (28.44%)
128      431.1     437.69 (1.53%)           431.09 (0%)

subhra mazumdar (9):
  sched,cgroup: Add interface for latency-nice
  sched: add search limit as per latency-nice
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  sched: Define macro for number of CPUs in core
  x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  sched: search SMT before LLC domain
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 include/linux/sched.h           |  1 +
 include/linux/topology.h        |  4 ++++
 kernel/sched/core.c             | 42 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c             | 34 +++++++++++++++++++++------------
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  9 +++++++++
 9 files changed, 97 insertions(+), 13 deletions(-)

--
2.9.3
Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
"latency-nice" which is shared by all the threads in that Cgroup.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  8 ++++++++
 4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
 	int static_prio;
 	int normal_prio;
 	unsigned int rt_priority;
+	u64 latency_nice;

 	const struct sched_class *sched_class;
 	struct sched_entity se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
 		init_dl_rq(&rq->dl);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+		root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
 		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
 		/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	 */
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
			  struct task_group, css);
+	tsk->latency_nice = tg->latency_nice;
 	tg = autogroup_task_group(tsk, tg);
 	tsk->sched_task_group = tg;

@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cft, u64 latency_nice)
+{
+	struct task_group *tg = css_tg(css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+		return -ERANGE;
+
+	tg->latency_nice = latency_nice;
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it)))
+		p->latency_nice = latency_nice;
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	{
+		.name = "latency-nice",
+		.read_u64 = cpu_latency_nice_read_u64,
+		.write_u64 = cpu_latency_nice_write_u64,
+	},
 	{ }	/* Terminate */
 };

@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
 		.write = cpu_max_write,
 	},
 #endif
+	{
+		.name = "latency-nice",
+		.read_u64 = cpu_latency_nice_read_u64,
+		.write_u64 = cpu_latency_nice_write_u64,
+	},
 	{ }	/* terminate */
 };

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;

 	tg->shares = NICE_0_LOAD;
+	tg->latency_nice = LATENCY_NICE_DEFAULT;

 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)

 /*
+ * Latency-nice default value
+ */
+#define LATENCY_NICE_DEFAULT	5
+#define LATENCY_NICE_MIN	1
+#define LATENCY_NICE_MAX	100
+
+/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
+	u64 latency_nice;

 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
--
2.9.3
Put upper and lower limit on CPU search in select_idle_cpu. The lower limit
is set to amount of CPUs in a core while upper limit is derived from the
latency-nice of the thread. This ensures for any architecture we will
usually search beyond a core. Changing the latency-nice value by user will
change the search cost making it appropriate for given workload.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, floor, nr = INT_MAX;

 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;

 	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+		if (nr < floor)
+			nr = floor;
 	}

 	time = local_clock();
--
2.9.3
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)

 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
2.9.3
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c31082d..23ec9c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;

-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}

 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
--
2.9.3
Introduce macro topology_sibling_weight for number of sibling CPUs in a
core and use in select_idle_cpu.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 include/linux/topology.h | 4 ++++
 kernel/sched/fair.c      | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)	cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)	\
+		cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ec9c6..8856503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;

 	if (sched_feat(SIS_PROP)) {
-		floor = cpumask_weight(topology_sibling_cpumask(target));
+		floor = topology_sibling_weight(target);
 		if (floor < 2)
 			floor = 2;
 		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
--
2.9.3
Use per-CPU variable for cpumask_weight_sibling macro in case of x86 for
fast lookup in select_idle_cpu. This avoids reading multiple cache lines
in case of systems with large numbers of CPUs where bitmask can span
multiple cache lines. Even if bitmask spans only one cache line this
avoids looping through it to find the number of bits and gets it in O(1).

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;

 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)	(per_cpu(cpumask_weight_sibling, cpu))

 extern unsigned int __max_logical_packages;
 #define topology_max_packages()	(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..57ad88d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,9 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);

+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu)

 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+		per_cpu(cpumask_weight_sibling, cpu) =
+				cpumask_weight(topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		c->booted_cores = 1;
@@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu)
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);

-		if ((i == cpu) || (has_smt && match_smt(c, o)))
+		if ((i == cpu) || (has_smt && match_smt(c, o))) {
 			link_mask(topology_sibling_cpumask, cpu, i);
+			per_cpu(cpumask_weight_sibling, cpu) =
+				cpumask_weight(topology_sibling_cpumask(cpu));
+			per_cpu(cpumask_weight_sibling, i) =
+				cpumask_weight(topology_sibling_cpumask(i));
+		}

 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
 	else
 		physid_set_mask_of_physid(0, &phys_cpu_present_map);
 	cpumask_set_cpu(0, topology_sibling_cpumask(0));
+	per_cpu(cpumask_weight_sibling, 0) =
+				cpumask_weight(topology_sibling_cpumask(0));
 	cpumask_set_cpu(0, topology_core_cpumask(0));
 }

@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)

 	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+		per_cpu(cpumask_weight_sibling, sibling) =
+				cpumask_weight(topology_sibling_cpumask(sibling));
 		/*
 		 * last thread sibling in this cpu core going down
 		 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
+	per_cpu(cpumask_weight_sibling, cpu) = 0;
 	cpumask_clear(topology_core_cpumask(cpu));
 	c->cpu_core_id = 0;
 	c->booted_cores = 0;
--
2.9.3
Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return i;
 	}

-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_smt(p, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

-	i = select_idle_smt(p, target);
+	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
--
2.9.3
Introduce a per-cpu variable to track the limit upto which idle cpu search
was done in select_idle_cpu(). This will help to start the search next time
from there. This is necessary for rotating the search window over entire
LLC domain.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47969bc..5862d54 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include <trace/events/sched.h>

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);

 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		struct rq *rq;

+		per_cpu(next_cpu, i) = -1;
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 365c928..cca2b09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1002,6 +1002,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif

 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);

 #define cpu_rq(cpu)	(&per_cpu(runqueues, (cpu)))
 #define this_rq()	this_cpu_ptr(&runqueues)
--
2.9.3
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94dd4a32..7419b47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, floor, nr = INT_MAX;
+	int cpu, floor, target_tmp, nr = INT_MAX;

 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = floor;
 	}

+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();

-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
--
2.9.3
On 8/30/19 10:49 AM, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.

Subhra,

Thanks for posting the patchset. Having a latency nice hint is useful
beyond idle load balancing. I can think of other application scenarios,
like scheduling batch machine learning AVX 512 processes with latency
sensitive processes. AVX512 limits the frequency of the CPU and it is
best to avoid latency sensitive task on the same core with AVX512. So
latency nice hint allows the scheduler to have a criteria to determine
the latency sensitivity of a task and arrange latency sensitive tasks
away from AVX512 tasks.

You configure the latency hint on a cgroup basis. But I think not all
tasks in a cgroup necessarily have the same latency sensitivity.

For example, I can see that cgroup can be applied on a per user basis,
and the user could run different tasks that have different latency
sensitivity. We may also need a way to configure latency sensitivity on
a per task basis instead on a per cgroup basis.

Tim

> @@ -631,6 +631,7 @@ struct task_struct {
>  	int static_prio;
>  	int normal_prio;
>  	unsigned int rt_priority;
> +	u64 latency_nice;

Does it need to be 64 bit? Max latency nice is only 100.

>
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>  #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
>
>  /*
> + * Latency-nice default value
> + */

Will be useful to add comments to let reader know that higher latency
nice number means a task is more latency tolerant.

Is there a reason for setting the default to be a low value of 5?

Seems like we will default to only search the same core for an idle cpu
on a smaller system, as we only search 5% of the cpu span of the target
sched domain.

> +#define LATENCY_NICE_DEFAULT	5
> +#define LATENCY_NICE_MIN	1
> +#define LATENCY_NICE_MAX	100
> +
Hi Subhra,
On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path. Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.
>
> Finally it also introduces a new per-cpu variable next_cpu to track
> the limit of search so that every time search starts from where it ended.
> This rotating search window over cpus in LLC domain ensures that idle
> cpus are eventually found in case of high load.
>
> Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
> message size = 8k (higher is better):
> threads baseline latency-nice=5,SIS_CORE latency-nice=5,NO_SIS_CORE
> 8 64.66 64.38 (-0.43%) 64.79 (0.2%)
> 16 123.34 122.88 (-0.37%) 125.87 (2.05%)
> 32 215.18 215.55 (0.17%) 247.77 (15.15%)
> 48 278.56 321.6 (15.45%) 321.2 (15.3%)
> 64 259.99 319.45 (22.87%) 333.95 (28.44%)
> 128 431.1 437.69 (1.53%) 431.09 (0%)
>
The results look appealing with your experimental setup.
BTW, do you have any plans for load balancing based on the latency
niceness of the tasks as well? It seems to be a more interesting case
when we can pack the lower latency sensitive tasks on fewer CPUs.
Also, do you see any workload results showing performance regression with NO_SIS_CORE?
Thanks,
Parth
On 9/4/19 11:02 PM, Tim Chen wrote:
> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> Subhra,
>
> Thanks for posting the patchset. Having a latency nice hint
> is useful beyond idle load balancing. I can think of other
> application scenarios, like scheduling batch machine learning AVX 512
> processes with latency sensitive processes. AVX512 limits the frequency
> of the CPU and it is best to avoid latency sensitive task on the
> same core with AVX512. So latency nice hint allows the scheduler
> to have a criteria to determine the latency sensitivity of a task
> and arrange latency sensitive tasks away from AVX512 tasks.
>

Hi Tim and Subhra,

This patchset seems to be interesting for my TurboSched patches as well
where I try to pack jitter tasks on fewer cores to get higher Turbo
Frequencies. Well, the problem I face is that we sometime end up putting
multiple jitter tasks on a core running some latency sensitive
application which may see performance degradation. So my plan was to
classify such tasks to be latency sensitive thereby hinting the load
balancer to not put tasks on such cores.

TurboSched: https://lkml.org/lkml/2019/7/25/296

> You configure the latency hint on a cgroup basis.
> But I think not all tasks in a cgroup necessarily have the same
> latency sensitivity.
>
> For example, I can see that cgroup can be applied on a per user basis,
> and the user could run different tasks that have different latency
> sensitivity. We may also need a way to configure latency sensitivity
> on a per task basis instead on a per cgroup basis.
>

AFAIU, the problem defined above intersects with my patches as well
where the interface is required to classify the jitter tasks. I have
already tried few methods like syscall and cgroup to classify such
tasks and maybe something like that can be adopted with this patchset
as well.

Thanks,
Parth
On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Put upper and lower limit on CPU search in select_idle_cpu. The lower limit
> is set to amount of CPUs in a core while upper limit is derived from the
> latency-nice of the thread. This ensures for any architecture we will
> usually search beyond a core. Changing the latency-nice value by user will
> change the search cost making it appropriate for given workload.
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b08d00c..c31082d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, nr = INT_MAX;
> +	int cpu, floor, nr = INT_MAX;
>
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  		return -1;
>
>  	if (sched_feat(SIS_PROP)) {
> -		u64 span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> -			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> +		floor = cpumask_weight(topology_sibling_cpumask(target));
> +		if (floor < 2)
> +			floor = 2;
> +		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;

I see you defined LATENCY_NICE_MAX = 100, so is the value 100 an
experimental value? I was hoping for something that is a power of 2,
resulting in just a ">>" rather than the heavy division operation.

> +		if (nr < floor)
> +			nr = floor;
>  	}
>
>  	time = local_clock();
>
On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Rotate the cpu search window for better spread of threads. This will ensure
> an idle cpu will quickly be found if one exists.
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 94dd4a32..7419b47 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, floor, nr = INT_MAX;
> +	int cpu, floor, target_tmp, nr = INT_MAX;
>
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  			nr = floor;
>  	}
>
> +	if (per_cpu(next_cpu, target) != -1)
> +		target_tmp = per_cpu(next_cpu, target);
> +	else
> +		target_tmp = target;
> +
>  	time = local_clock();
>
> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> +		per_cpu(next_cpu, target) = cpu;

Is it possible that two simultaneous select_idle_cpu calls have the same
target value?

>  		if (!--nr)
>  			return -1;
>  		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>
On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.
*sigh*, no. We start with a normal per task attribute, and then later,
if it is needed and makes sense, we add it to cgroups.
Also, your Changelog fails on pretty much every point. It doesn't
explain why, it doesn't describe anything and so on.
From just reading the above, I would expect it to have the range
[-20,19] just like normal nice. Apparently this is not so.
On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
> Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
> in L1 cache locality.
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8856503..94dd4a32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> return i;
> }
>
> - i = select_idle_cpu(p, sd, target);
> + i = select_idle_smt(p, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> - i = select_idle_smt(p, target);
> + i = select_idle_cpu(p, sd, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
But it is absolutely conceptually wrong. An idle core is a much better
target than an idle sibling.
On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...
> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> *sigh*, no. We start with a normal per task attribute, and then later,
> if it is needed and makes sense, we add it to cgroups.

FWIW, to add on top of what Peter says, we used this same approach for
uclamp and it proved to be a very effective way to come up with a good
design. General principles have been:

 - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
   default values for all tasks affected by that feature. This
   interface has to define also upper bounds for task specific values.
   Thus, in the case of latency-nice, it should be set by default to
   the MIN value, since that's the current mainline behaviour: all
   tasks are latency sensitive.

 - a per-task API [2] (via the sched_setattr() syscall) can be used to
   relax the system wide setting thus implementing a "nice" policy.

 - a per-taskgroup API [3] (via cpu controller's attributes) can be
   used to relax the system-wide settings and restrict the per-task
   API.

The above features are worth to be added in that exact order.

> Also, your Changelog fails on pretty much every point. It doesn't
> explain why, it doesn't describe anything and so on.

On the description side, I guess it's worth to mention somewhere to
which scheduling classes this feature can be useful for. It's worth to
mention that it can apply only to:

 - CFS tasks: for example, at wakeup time a task with an high
   latency-nice should avoid to preempt a low latency-nice task. Maybe
   by mapping the latency nice value into proper vruntime
   normalization value?

 - RT tasks: for example, at wakeup time a task with an high
   latency-nice value could avoid to preempt a CFS task.

I'm sure there will be discussion about some of these features, that's
why it's important in the proposal presentation to keep a well defined
distinction among the "mechanisms and API" and how we use the new
concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

 - [-20..19], which makes it similar to priorities
   downside: we quite likely end up with a kernel space representation
   which does not match the user-space one, e.g. look at
   task_struct::prio.

 - [0..1024], which makes it more similar to a "percentage"

Being latency-nice a new concept, we are not constrained by POSIX and
IMHO the [0..1024] scale is a better fit.

That will translate into:

 - latency-nice=0: default (current mainline) behaviour, all "biasing"
   policies are disabled and we wakeup up as fast as possible

 - latency-nice=1024: maximum niceness, where for example we can
   imagine to switch a CFS task to be SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to
    support utilization clamping")
[3] 5 patches in today's tip/sched/core up to:
    commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's
    clamp changes")

--
#include <best/regards.h>

Patrick Bellasi
We already commented on adding the cgroup API after the per-task API. However, for the cgroup bits will be super important to have [ +tejun ] in CC since here we are at discussing the idea to add a new cpu controller's attribute. There are opinions about which kind of attributes can be added to cgroups and I'm sure a "latency-nice" attribute will generate an interesting discussion. :) LPC is coming up, perhaps we can get the chance to have a chat with Tejun about the manoeuvring space in this area. On Fri, Aug 30, 2019 at 18:49:36 +0100, subhra mazumdar wrote... > Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file > "latency-nice" which is shared by all the threads in that Cgroup. > > Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com> > --- > include/linux/sched.h | 1 + > kernel/sched/core.c | 40 ++++++++++++++++++++++++++++++++++++++++ > kernel/sched/fair.c | 1 + > kernel/sched/sched.h | 8 ++++++++ > 4 files changed, 50 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 1183741..b4a79c3 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -631,6 +631,7 @@ struct task_struct { > int static_prio; > int normal_prio; > unsigned int rt_priority; > + u64 latency_nice; I guess we can save some bit here... or, if we are very brave, maybe we can explore the possibility to pack all prios into a single u64? 
( ( (tomatoes target here) ) )

>  	const struct sched_class	*sched_class;
>  	struct sched_entity		se;

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5976,6 +5976,7 @@ void __init sched_init(void)
>  		init_dl_rq(&rq->dl);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> +		root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
>  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
>  		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
>  		/*
> @@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
>  	 */
>  	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>  			  struct task_group, css);
> +	tsk->latency_nice = tg->latency_nice;
>  	tg = autogroup_task_group(tsk, tg);
>  	tsk->sched_task_group = tg;
>
> @@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
> +static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
> +				     struct cftype *cft)
> +{
> +	struct task_group *tg = css_tg(css);
> +
> +	return tg->latency_nice;
> +}
> +
> +static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
> +				      struct cftype *cft, u64 latency_nice)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct css_task_iter it;
> +	struct task_struct *p;
> +
> +	if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
> +		return -ERANGE;
> +
> +	tg->latency_nice = latency_nice;
> +
> +	css_task_iter_start(css, 0, &it);
> +	while ((p = css_task_iter_next(&it)))
> +		p->latency_nice = latency_nice;

Once (and if) the cgroup API is added, we can avoid this (potentially
massive) "update on write" in favour of an "on demand composition at
wakeup-time".

We don't care about updating the latency-nice of NON RUNNABLE tasks,
do we? AFAIK, we need that value only (or mostly) at wakeup time.
Thus, when a task wakes up, we can easily compose (and eventually
cache) its current latency-nice value by considering, in priority
order:

 - the system wide upper-bound
 - the task group restriction
 - the task specific relaxation

Something similar to what we already do for uclamp composition with
this patch currently in tip/sched/core:

   commit 3eac870a3247 ("sched/uclamp: Use TG's clamps to restrict TASK's clamps")

> +	css_task_iter_end(&it);
> +
> +	return 0;
> +}
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> @@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
>  		.write_u64 = cpu_rt_period_write_uint,
>  	},
>  #endif
> +	{
> +		.name = "latency-nice",
> +		.read_u64 = cpu_latency_nice_read_u64,
> +		.write_u64 = cpu_latency_nice_write_u64,
> +	},
>  	{ }	/* Terminate */
>  };
>
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
>  		.write = cpu_max_write,
>  	},
>  #endif
> +	{
> +		.name = "latency-nice",
> +		.read_u64 = cpu_latency_nice_read_u64,
> +		.write_u64 = cpu_latency_nice_write_u64,
> +	},
>  	{ }	/* terminate */
>  };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b08d00c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>  		goto err;
>
>  	tg->shares = NICE_0_LOAD;
> +	tg->latency_nice = LATENCY_NICE_DEFAULT;
                           ^^^^^^^^^^^^^^^^^^^^

Maybe better NICE_0_LATENCY, to be more consistent?
>  	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>  #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
>
>  /*
> + * Latency-nice default value
> + */
> +#define LATENCY_NICE_DEFAULT	5
> +#define LATENCY_NICE_MIN	1
> +#define LATENCY_NICE_MAX	100

Values 1 and 5 look kind of arbitrary. For the range specifically, I
already commented in this other message:

   Message-ID: <87r24v2i14.fsf@arm.com>
   https://lore.kernel.org/lkml/87r24v2i14.fsf@arm.com/

> +
> +/*
>   * Single value that decides SCHED_DEADLINE internal math precision.
>   * 10 -> just above 1us
>   * 9  -> just above 0.5us
> @@ -362,6 +369,7 @@ struct cfs_bandwidth {
>  /* Task group related information */
>  struct task_group {
>  	struct cgroup_subsys_state css;
> +	u64 latency_nice;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	/* schedulable entities of this group on each CPU */

Best,
Patrick

--
#include <best/regards.h>
Patrick Bellasi
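The wakeup-time composition suggested above could be sketched roughly
as follows. This is a minimal user-space style illustration, not kernel
code: the helper name, the min-style composition, and the bounds are
all assumptions of this sketch.

```c
#include <assert.h>

#define LATENCY_NICE_MIN	1
#define LATENCY_NICE_MAX	100

/*
 * Compose the effective latency-nice "on demand": start from the
 * task's own value, restrict it by the task group's value, then by a
 * system-wide upper bound, and finally clamp to the valid range.
 */
static int effective_latency_nice(int task_ln, int tg_ln, int sys_ln)
{
	int ln = task_ln;

	if (ln > tg_ln)		/* task group restriction */
		ln = tg_ln;
	if (ln > sys_ln)	/* system-wide upper bound */
		ln = sys_ln;

	if (ln < LATENCY_NICE_MIN)
		ln = LATENCY_NICE_MIN;
	if (ln > LATENCY_NICE_MAX)
		ln = LATENCY_NICE_MAX;
	return ln;
}
```

Doing this composition at wakeup time (and caching it) avoids the
"update on write" walk over all tasks in the cgroup.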
On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...

> On 9/4/19 11:02 PM, Tim Chen wrote:
>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>
>> Subhra,
>>
>> Thanks for posting the patchset. Having a latency nice hint
>> is useful beyond idle load balancing. I can think of other
>> application scenarios, like scheduling batch machine learning AVX 512
>> processes with latency sensitive processes. AVX512 limits the frequency
>> of the CPU and it is best to avoid latency sensitive tasks on the
>> same core with AVX512. So a latency nice hint allows the scheduler
>> to have a criterion to determine the latency sensitivity of a task
>> and arrange latency sensitive tasks away from AVX512 tasks.
>
> Hi Tim and Subhra,
>
> This patchset seems to be interesting for my TurboSched patches as
> well, where I try to pack jitter tasks on fewer cores to get higher
> Turbo Frequencies. Well, the problem I face is that we sometimes end
> up putting multiple jitter tasks on a core running some latency
> sensitive application, which may see performance degradation. So my
> plan was to classify such tasks as latency sensitive, thereby hinting
> the load balancer not to put tasks on such cores.
>
> TurboSched: https://lkml.org/lkml/2019/7/25/296
>
>> You configure the latency hint on a cgroup basis.
>> But I think not all tasks in a cgroup necessarily have the same
>> latency sensitivity.
>>
>> For example, I can see that cgroup can be applied on a per user basis,
>> and the user could run different tasks that have different latency
>> sensitivity. We may also need a way to configure latency sensitivity
>> on a per task basis instead of on a per cgroup basis.
>
> AFAIU, the problem defined above intersects with my patches as well,
> where the interface is required to classify the jitter tasks.
> I have already tried a few methods, like syscall and cgroup, to
> classify such tasks, and maybe something like that can be adopted
> with this patchset as well.

Agreed, these two patchsets definitely overlap in terms of mechanisms
and APIs to expose to userspace. You two guys seem to target different
goals, but the general approach should be:

 - expose a single and abstract concept to user-space,
   latency-nice or latency-tolerant as PaulT proposed at OSPM

 - map this concept in kernel-space to different kinds of bias, both
   at wakeup time and load-balance time, and use both for RT and CFS
   tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

--
#include <best/regards.h>
Patrick Bellasi
On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...

> Add a new sched feature SIS_CORE to have an option to disable idle core
> search (select_idle_core).
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/features.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 858589b..de4d506 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_AVG_CPU, false)
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in
which the usage of latency-nice does not make sense in "Select Idle
Sibling", then we should probably better add a new Kconfig option.

If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:

   commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")

>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls

Best,
Patrick

--
#include <best/regards.h>
Patrick Bellasi
On Fri, Aug 30, 2019 at 18:49:39 +0100, subhra mazumdar wrote...
> Use SIS_CORE to disable idle core search. For some workloads
> select_idle_core becomes a scalability bottleneck, removing it improves
> throughput. Also there are workloads where disabling it can hurt latency,
> so need to have an option.
>
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
> kernel/sched/fair.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c31082d..23ec9c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if (!sd)
> return target;
>
> - i = select_idle_core(p, sd, target);
> - if ((unsigned)i < nr_cpumask_bits)
> - return i;
> + if (sched_feat(SIS_CORE)) {
> + i = select_idle_core(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
>
> i = select_idle_cpu(p, sd, target);
> if ((unsigned)i < nr_cpumask_bits)
This looks like it should be squashed with the previous one, or with
whatever code you'll add to define when this "biasing" is to be used
or not.
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Aug 30, 2019 at 18:49:35 +0100, subhra mazumdar wrote...

> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path.

As per my comments in the other message, we should try to better split
the "latency nice" concept introduction (API and mechanisms) from its
usage in different scheduler code paths.

This distinction should be evident from the cover letter too. What you
use "latency nice" for is just one possible use-case, thus it's
important to make sure it's generic enough to fit other usages too.

> Valid latency-nice values are from 1 to 100 indicating 1% to 100%
> search of the LLC domain in select_idle_cpu. New CPU cgroup file
> cpu.latency-nice is added as an interface to set and get. All tasks
> in the same cgroup share the same latency-nice value. Using a lower
> latency-nice value can help latency intolerant tasks, e.g. very short
> running OLTP threads, where the full LLC search cost can be
> significant compared to the run time of the threads. The default
> latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE
> to be able to disable idle core search altogether, which is costly and
> hurts more than it helps in short running workloads.

I don't see why your latency-nice cannot be used to implement what you
get with NO_SIS_CORE.

IMHO, "latency nice" should be exposed as a pretty generic concept
which progressively enables different "levels of biasing", both at
wake-up time and load-balance time.

Why is it not possible to have the SIS_CORE/NO_SIS_CORE switch
implemented just as different threshold values for the latency-nice
value of a task?

Best,
Patrick

--
#include <best/regards.h>
Patrick Bellasi
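The mechanism described in the cover letter - reading latency-nice in
[1..100] as the percentage of the LLC domain to scan in
select_idle_cpu() - can be sketched as below. This is an illustrative
user-space sketch, not the actual patch: the helper name is made up,
and llc_weight stands in for the number of CPUs in the LLC sched
domain.

```c
#include <assert.h>

/*
 * Bound the idle-CPU scan: latency-nice is the percentage of LLC
 * CPUs to examine, e.g. the default of 5 scans 5% of the domain.
 */
static int sis_scan_limit(int llc_weight, int latency_nice)
{
	int nr = (llc_weight * latency_nice) / 100;

	/* always look at least at one CPU */
	return nr < 1 ? 1 : nr;
}
```

On the 88-thread machine from the benchmark, latency-nice=5 would scan
about 4 CPUs instead of the whole LLC, which is where the gain for very
short running threads comes from.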
On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
> > From just reading the above, I would expect it to have the range
> > [-20,19] just like normal nice. Apparently this is not so.
>
> Regarding the range for the latency-nice values, I guess we have two
> options:
>
> - [-20..19], which makes it similar to priorities
> downside: we quite likely end up with a kernel space representation
> which does not match the user-space one, e.g. look at
> task_struct::prio.
>
> - [0..1024], which makes it more similar to a "percentage"
>
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
>
> That will translate into:
>
> latency-nice=0 : default (current mainline) behaviour, all "biasing"
> policies are disabled and we wake up as fast as possible
>
> latency-nice=1024 : maximum niceness, where for example we can imagine
> switching a CFS task to be SCHED_IDLE?
There's a few things wrong there; I really feel that if we call it nice,
it should be like nice. Otherwise we should call it latency-bias and not
have the association with nice to confuse people.
Secondly; the default should be in the middle of the range. Naturally
this would be a signed range like nice [-(x+1),x] for some x. But if you
want [0,1024], then the default really should be 512, but personally I
like 0 better as a default, in which case we need negative numbers.
This is important because we want to be able to bias towards less
importance to (tail) latency as well as more importance to (tail)
latency.
Specifically, Oracle wants to sacrifice (some) latency for throughput.
Facebook OTOH seems to want to sacrifice (some) throughput for latency.
On Thu, Sep 05, 2019 at 11:05:18AM +0100, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..365c928 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
> > #define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)
> >
> > /*
> > + * Latency-nice default value
> > + */
> > +#define LATENCY_NICE_DEFAULT 5
> > +#define LATENCY_NICE_MIN 1
> > +#define LATENCY_NICE_MAX 100
>
> Values 1 and 5 look kind of arbitrary.
Yes, and like I just wrote, completely and utterly wrong.
On 09/05/19 12:46, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
> > > From just reading the above, I would expect it to have the range
> > > [-20,19] just like normal nice. Apparently this is not so.
> >
> > Regarding the range for the latency-nice values, I guess we have two
> > options:
> >
> > - [-20..19], which makes it similar to priorities
> > downside: we quite likely end up with a kernel space representation
> > which does not match the user-space one, e.g. look at
> > task_struct::prio.
> >
> > - [0..1024], which makes it more similar to a "percentage"
> >
> > Being latency-nice a new concept, we are not constrained by POSIX and
> > IMHO the [0..1024] scale is a better fit.
> >
> > That will translate into:
> >
> > latency-nice=0 : default (current mainline) behaviour, all "biasing"
> > policies are disabled and we wake up as fast as possible
> >
> > latency-nice=1024 : maximum niceness, where for example we can imagine
> > switching a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. But if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
Another use case I'm considering is using latency-nice to prefer an idle CPU if
latency-nice is set, otherwise go for the most energy efficient CPU.
Ie: sacrifice (some) energy for latency.
The way I see it, latency-nice here would be interpreted as a binary
switch. But maybe we can use the range to select what sacrificing
(some) energy means here. Hmmm.
--
Qais Yousef
On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
>> > From just reading the above, I would expect it to have the range
>> > [-20,19] just like normal nice. Apparently this is not so.
>>
>> Regarding the range for the latency-nice values, I guess we have two
>> options:
>>
>> - [-20..19], which makes it similar to priorities
>>   downside: we quite likely end up with a kernel space representation
>>   which does not match the user-space one, e.g. look at
>>   task_struct::prio.
>>
>> - [0..1024], which makes it more similar to a "percentage"
>>
>> Being latency-nice a new concept, we are not constrained by POSIX and
>> IMHO the [0..1024] scale is a better fit.
>>
>> That will translate into:
>>
>> latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> policies are disabled and we wake up as fast as possible
>>
>> latency-nice=1024 : maximum niceness, where for example we can imagine
>> switching a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. But if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Right, we have this dualism to deal with, and current mainline
behaviour is somehow in the middle.

BTW, the FB requirement is the same we have in Android.
We want some CFS tasks to have very small latency and a low chance to
be preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that, since we are introducing a new (non POSIX) concept, we can get
the chance to make it more human friendly.

Given the two extremes above, would it not be much simpler and more
intuitive to have 0 implementing the FB/Android (no latency) case and
1024 the (max latency) Oracle case?

Moreover, we will never match the nice semantic completely, given that
a 1 nice unit has a proper math meaning; isn't it something like a 10%
CPU usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by
marking its difference wrt prio/nice levels? AFAIR, that was also the
original proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

--
#include <best/regards.h>
Patrick Bellasi
On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> On 09/05/19 12:46, Peter Zijlstra wrote:
> > This is important because we want to be able to bias towards less
> > importance to (tail) latency as well as more importance to (tail)
> > latency.
> >
> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an
> idle CPU if latency-nice is set, otherwise go for the most energy
> efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see it, latency-nice here would be interpreted as a binary
> switch. But maybe we can use the range to select what sacrificing
> (some) energy means here. Hmmm.

It cannot be binary; by definition it must be ternary, that is, <0,
==0 and >0 (or middle value if you're of that persuasion).

In your case, I'm thinking you mean >0, we want to lower the latency.

Anyway; there were a number of things mentioned at OSPM that we could
tie into this thing, and finding sensible mappings is going to be a
bit of trial and error I suppose.

But as Patrick said; we're very much exporting a BIAS knob, not a set
of behaviours.
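The ternary reading of the knob can be captured in a couple of lines.
A minimal sketch, assuming a signed range with 0 as the default (the
helper name is made up):

```c
#include <assert.h>

/*
 * The knob is a bias, ternary by definition: below the default it
 * biases away from latency sensitivity, at the default it keeps the
 * current behaviour, above it it biases towards latency sensitivity.
 */
static int latency_bias(int latency_nice)
{
	if (latency_nice < 0)
		return -1;	/* tolerate more latency (throughput) */
	if (latency_nice > 0)
		return 1;	/* demand less latency */
	return 0;		/* current mainline behaviour */
}
```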
On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote...
> On 09/05/19 12:46, Peter Zijlstra wrote:
>> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>>
>> > > From just reading the above, I would expect it to have the range
>> > > [-20,19] just like normal nice. Apparently this is not so.
>> >
>> > Regarding the range for the latency-nice values, I guess we have two
>> > options:
>> >
>> > - [-20..19], which makes it similar to priorities
>> > downside: we quite likely end up with a kernel space representation
>> > which does not match the user-space one, e.g. look at
>> > task_struct::prio.
>> >
>> > - [0..1024], which makes it more similar to a "percentage"
>> >
>> > Being latency-nice a new concept, we are not constrained by POSIX and
>> > IMHO the [0..1024] scale is a better fit.
>> >
>> > That will translate into:
>> >
>> > latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> > policies are disabled and we wake up as fast as possible
>> >
>> > latency-nice=1024 : maximum niceness, where for example we can imagine
>> > switching a CFS task to be SCHED_IDLE?
>>
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>>
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. But if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>>
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importance to (tail)
>> latency.
>>
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> latency-nice is set, otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
>> The way I see it, latency-nice here would be interpreted as a binary
>> switch. But maybe we can use the range to select what sacrificing
>> (some) energy means here. Hmmm.
I see this concept possibly evolving into something more than just a
binary switch. I'm not yet convinced it makes sense and/or is possible
but, in principle, I was thinking about these possible usages for CFS
tasks:
- dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
depending on crossing certain pre-configured thresholds of latency
niceness.
- dynamically bias the vruntime updates we do in place_entity()
depending on the actual latency niceness of a task.
- bias the decisions we take in check_preempt_tick() still depending
on a relative comparison of the current and wakeup task latency
niceness values.
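The first of the ideas above can be sketched as a simple threshold
mapping. This is a purely illustrative user-space sketch on the
proposed [0..1024] scale; the threshold values and the enum standing
in for SCHED_{OTHER,BATCH,IDLE} are arbitrary placeholders.

```c
#include <assert.h>

/* Hypothetical stand-ins for SCHED_OTHER, SCHED_BATCH, SCHED_IDLE. */
enum fake_policy { POL_OTHER, POL_BATCH, POL_IDLE };

/*
 * Pre-configured thresholds on the latency-niceness scale select the
 * effective scheduling policy of a CFS task.
 */
static enum fake_policy policy_from_latency_nice(int ln)
{
	if (ln >= 1000)
		return POL_IDLE;	/* maximum niceness: background */
	if (ln >= 512)
		return POL_BATCH;	/* tolerant: no wakeup preemption */
	return POL_OTHER;		/* default behaviour */
}
```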
--
#include <best/regards.h>
Patrick Bellasi
On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote...
> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>
>> > This is important because we want to be able to bias towards less
>> > importance to (tail) latency as well as more importantance to (tail)
>> > latency.
>> >
>> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>
>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>> latency-nice is set, otherwise go for the most energy efficient CPU.
>>
>> Ie: sacrifice (some) energy for latency.
>>
>> The way I see it, latency-nice here would be interpreted as a binary
>> switch. But maybe we can use the range to select what sacrificing
>> (some) energy means here. Hmmm.
>
> It cannot be binary; by definition it must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).
>
> In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.
Right, although I think behaviours could still be exported but via a
different and configurable interface, using thresholds.
Either at compile time or via procfs, maybe we can expose and properly
document what happens in the scheduler if/when a task has a "latency
niceness" crossing a given threshold.
For example, by setting something like:
/proc/sys/kernel/sched_cfs_latency_idle = 1000
we state that the task is going to be scheduled according to the
SCHED_IDLE policy.
( ( (tomatoes target here) ) )
I'm also not sure we want to commit to user-space APIs how we
internally map/translate a "latency niceness" value into a scheduler
behaviour bias. Maybe better not, at least at the very beginning.
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:

> Right, we have this dualism to deal with, and current mainline
> behaviour is somehow in the middle.
>
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
>
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.

I'm arguing that signed _is_ more human friendly ;-)

> Given the two extremes above, would it not be much simpler and more
> intuitive to have 0 implementing the FB/Android (no latency) case and
> 1024 the (max latency) Oracle case?

See, I find the signed thing more natural: negative is a bias away
from latency sensitive, positive is a bias towards latency sensitive.

Also; 0 is a good default value ;-)

> Moreover, we will never match the nice semantic completely, given that
> a 1 nice unit has a proper math meaning; isn't it something like a 10%
> CPU usage change for each step?

Only because we were nice when implementing it. POSIX leaves it
unspecified and we could change it at any time. The only real
semantics is a relative 'weight' (opengroup uses the term
'favourable').

> Could changing the name to "latency-tolerance" break the tie by
> marking its difference wrt prio/nice levels? AFAIR, that was also the
> original proposal [1] by PaulT during the OSPM discussion.

Latency tolerance could still be a signed entity: positive would
signify we're more tolerant of latency (ie. less sensitive), while
negative would be less tolerant (ie. more sensitive).

> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.

I'm not quite sure; yes, for some of these things, like the idle search
on wakeup, certainly.
But say for wakeup-preemption, we could definitely make it a task relative attribute.
On 05/09/2019 12:18, Patrick Bellasi wrote:
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>>
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. But if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>>
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importance to (tail)
>> latency.
>>
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Right, we have this dualism to deal with, and current mainline
> behaviour is somehow in the middle.
>
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
>
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.
>
> Given the two extremes above, would it not be much simpler and more
> intuitive to have 0 implementing the FB/Android (no latency) case and
> 1024 the (max latency) Oracle case?

For something like latency-<whatever>, I don't see the point of having
such a wide range. The nice range is probably more than enough - and
before even bothering about the range, we should probably agree on
what the range should represent.

If it's niceness, I read it as: a positive latency-nice value means
we're nice to latency, meaning we reduce it. So the further up you go,
the more you restrict your wakeup scan.
I think it's quite easy to map that into the code: current behaviour
at 0, with a decreasing scan mask size as we go towards +19. I don't
think anyone needs 512 steps to tune this.

I don't know what logic we'd follow for negative values though. Maybe
latency-nice -20 means always going through the slowpath, but what of
the intermediate values?

AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
applied elsewhere (e.g. load-balance, something like task_hot() and
its use of sysctl_sched_migration_cost).

> Moreover, we will never match the nice semantic completely, given that
> a 1 nice unit has a proper math meaning; isn't it something like a 10%
> CPU usage change for each step?
>
> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.
>
> Could changing the name to "latency-tolerance" break the tie by
> marking its difference wrt prio/nice levels? AFAIR, that was also the
> original proposal [1] by PaulT during the OSPM discussion.
>
> Best,
> Patrick
>
> [1] https://youtu.be/oz43thSFqmk?t=1302
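The mapping suggested above - current behaviour at 0, shrinking scan
towards +19 - could be sketched like this. An illustrative user-space
sketch only: negative values fall back to a full scan here because no
semantics were proposed for them, and the linear shrink down to one
CPU at +19 is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Map a nice-like latency value in [-20..19] to the number of LLC
 * CPUs to scan at wakeup: full scan at 0 (and below), shrinking
 * linearly down to a single CPU at +19.
 */
static int scan_cpus(int llc_weight, int latency_nice)
{
	if (latency_nice <= 0)
		return llc_weight;	/* current behaviour */

	return 1 + (llc_weight - 1) * (19 - latency_nice) / 19;
}
```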
On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with, and current mainline
>> behaviour is somehow in the middle.
>>
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>>
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Given the two extremes above, would it not be much simpler and more
>> intuitive to have 0 implementing the FB/Android (no latency) case and
>> 1024 the (max latency) Oracle case?
>
> See, I find the signed thing more natural: negative is a bias away
> from latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match the nice semantic completely, given that
>> a 1 nice unit has a proper math meaning; isn't it something like a 10%
>> CPU usage change for each step?
>
> Only because we were nice when implementing it. POSIX leaves it
> unspecified and we could change it at any time. The only real
> semantics is a relative 'weight' (opengroup uses the term
> 'favourable').

Good to know, I was considering it a POSIX requirement.

>> Could changing the name to "latency-tolerance" break the tie by
>> marking its difference wrt prio/nice levels? AFAIR, that was also the
>> original proposal [1] by PaulT during the OSPM discussion.
>
> Latency tolerance could still be a signed entity: positive would
> signify we're more tolerant of latency (ie. less sensitive), while
> negative would be less tolerant (ie. more sensitive).

Right.
>> For latency-nice instead we will likely base our biasing strategies on
>> some predefined (maybe system-wide configurable) const thresholds.
>
> I'm not quite sure; yes, for some of these things, like the idle search
> on wakeup, certainly. But say for wakeup-preemption, we could definitely
> make it a task relative attribute.

--
#include <best/regards.h>
Patrick Bellasi
On Thu, Sep 05, 2019 at 12:30:52PM +0100, Patrick Bellasi wrote:

> I see this concept possibly evolving into something more than just a
> binary switch. Not yet convinced if it makes sense and/or it's possible
> but, in principle, I was thinking about these possible usages for CFS
> tasks:
>
> - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
>   depending on crossing certain pre-configured thresholds of latency
>   niceness.

A big part of BATCH is wakeup preemption (batch doesn't preempt itself),
and wakeup preemption is a task-task property and can thus be completely
relative.

> - dynamically bias the vruntime updates we do in place_entity()
>   depending on the actual latency niceness of a task.

That is dangerous; theory says we should keep track of the 0-lag point
and place it back where we found it. BFQ does this correctly IIRC, but
for CFS I've never done that because it is 'expensive'. But yes, we
could (carefully) fumble a bit there.

> - bias the decisions we take in check_preempt_tick() still depending
>   on a relative comparison of the current and wakeup task latency
>   niceness values.

Ack. Placing relative and absolute behaviour on the same scale is going
to be 'fun' :-)
On 09/05/19 13:30, Peter Zijlstra wrote:

> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>
>> Another use case I'm considering is using latency-nice to prefer an
>> idle CPU if latency-nice is set, otherwise go for the most energy
>> efficient CPU.
>>
>> Ie: sacrifice (some) energy for latency.
>>
>> The way I see it is interpreting latency-nice here as a binary switch.
>> But maybe we can use the range to select what (some) energy to
>> sacrifice means here. Hmmm.
>
> It cannot be binary, by definition it must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).

I meant I want to use it as a binary.

> In your case, I'm thinking you mean >0, we want to lower the latency.

Yes. As long as there's an easy way to say "does this task care about
latency or not" I'm good.

> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as Patrick said; we're very much exporting a BIAS knob, not a set
> of behaviours.

Agreed. I just wanted to say that the way this range is going to be
interpreted will differ from path to path and we need to consider that
in the final mapping. Especially from the final user's perspective of
what setting this value ultimately means to them.

--
Qais Yousef
On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> Right, although I think behaviours could still be exported but via a
> different and configurable interface, using thresholds.
I would try _really_ hard to avoid pinning down behaviour. The more you
do that, the less you can change.
On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...

> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>> There's a few things wrong there; I really feel that if we call it
>>> nice, it should be like nice. Otherwise we should call it latency-bias
>>> and not have the association with nice to confuse people.
>>>
>>> Secondly; the default should be in the middle of the range. Naturally
>>> this would be a signed range like nice [-(x+1),x] for some x. But if
>>> you want [0,1024], then the default really should be 512, but
>>> personally I like 0 better as a default, in which case we need
>>> negative numbers.
>>>
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>>
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>>
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>>
>> Given the two extremes above, would it not be much simpler and more
>> intuitive to have 0 implementing the FB/Android (no latency) case and
>> 1024 the (max latency) Oracle case?
>
> For something like latency-<whatever>, I don't see the point of having
> such a wide range. The nice range is probably more than enough - and
> before even bothering about the range, we should probably agree on what
> the range should represent.
>
> If it's niceness, I read it as: positive latency-nice value means we're
> nice to latency, means we reduce it. So the further up you go, the more
> you restrict your wakeup scan. I think it's quite easy to map that into
> the code: current behaviour at 0, with a decreasing scan mask size as
> we go towards +19. I don't think anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of
> the intermediate values?

Yep, I think so far we are all converging towards the idea to use a
signed range. Regarding the range itself, yes: 1024 looks very
oversized, but +-20 is still something which leaves room for a bit of
flexibility, and it also better matches the idea that we don't want to
"enumerate behaviours" but just expose a knob. To map certain "bias" we
could benefit from a slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be

For the wakeup path there is also the TurboSched proposal by Parth:

  Message-ID: <20190725070857.6639-1-parth@linux.ibm.com>
  https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/

which we should keep in mind.

> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For LB, can you come up with some better description of what usages you
see could benefit from a "per task" or "per task-group" latency niceness?

Best,
Patrick

--
#include <best/regards.h>
Patrick Bellasi
On 09/05/19 13:48, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> > Right, although I think behaviours could still be exported but via a
> > different and configurable interface, using thresholds.
>
> I would try _really_ hard to avoid pinning down behaviour. The more you
> do that, the less you can change.
While I agree with that, I find there's a contradiction between not
'pinning down behaviour' and an 'easy and clear way to bias latency
sensitivity from the end user's perspective'.
Maybe we should protect this with a kconfig + experimental tag until trial
and error show the best way forward?
--
Qais Yousef
On 05/09/2019 14:07, Patrick Bellasi wrote:
> Yep, I think so far we are all converging towards the idea to use a
> signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leaves room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
>
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
>
> For the wakeup path there is also the TurboSched proposal by Parth:
>
>   Message-ID: <20190725070857.6639-1-parth@linux.ibm.com>
>   https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/
>
> we should keep in mind.
>
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
>
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?

task_hot() "ratelimits" migrations of tasks that were running up until
very recently (hence "cache hot"), but the knob is system wide. It might
make sense to exploit a per-task attribute to tune this (e.g. go wild if
not latency sensitive, otherwise stay away for longer).

We could perhaps even apply it to active load balance to similarly stay
away from latency sensitive tasks. Right now this is gated by a
sched_domain-wide attribute (nr_balance_failed). We could tweak this to
requiring more (fewer) failed attempts before interrupting latency
(in)sensitive tasks.

I'm sure we can come up with even more creative ways to pour even more
heuristics in there ;)

> Best,
> Patrick
On 9/5/19 2:31 AM, Peter Zijlstra wrote:
> On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
>> Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
>> in L1 cache locality.
>> ---
>> kernel/sched/fair.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8856503..94dd4a32 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> return i;
>> }
>>
>> - i = select_idle_cpu(p, sd, target);
>> + i = select_idle_smt(p, target);
>> if ((unsigned)i < nr_cpumask_bits)
>> return i;
>>
>> - i = select_idle_smt(p, target);
>> + i = select_idle_cpu(p, sd, target);
>> if ((unsigned)i < nr_cpumask_bits)
>> return i;
>>
> But it is absolutely conceptually wrong. An idle core is a much better
> target than an idle sibling.
This is select_idle_smt, not select_idle_core.
On 9/5/19 3:17 AM, Patrick Bellasi wrote:
> On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...
>
>> Add a new sched feature SIS_CORE to have an option to disable idle core
>> search (select_idle_core).
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>  kernel/sched/features.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> index 858589b..de4d506 100644
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>   */
>>  SCHED_FEAT(SIS_AVG_CPU, false)
>>  SCHED_FEAT(SIS_PROP, true)
>> +SCHED_FEAT(SIS_CORE, true)
>
> Why do we need a sched_feature? If you think there are systems in
> which the usage of latency-nice does not make sense for "Select Idle
> Sibling", then we should probably better add a new Kconfig option.

This is not for latency-nice but to be able to disable a different
aspect of the scheduler, i.e. searching for idle cores. This can be made
part of latency-nice (i.e. not do idle core search if latency-nice is
below a certain value), but even then having a feature to disable it
doesn't hurt.

> If that's the case, you can probably use the init/Kconfig's
> "Scheduler features" section, recently added by:
>
>   commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
>
>>  /*
>>   * Issue a WARN when we do multiple update_rq_clock() calls
>
> Best,
> Patrick
On 9/5/19 3:41 PM, Patrick Bellasi wrote:
> On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...
>
>> On 9/4/19 11:02 PM, Tim Chen wrote:
>>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new
>>>> file "latency-nice" which is shared by all the threads in that
>>>> Cgroup.
>>>
>>> Subhra,
>>>
>>> Thanks for posting the patchset. Having a latency nice hint
>>> is useful beyond idle load balancing. I can think of other
>>> application scenarios, like scheduling batch machine learning AVX 512
>>> processes with latency sensitive processes. AVX512 limits the
>>> frequency of the CPU and it is best to avoid latency sensitive tasks
>>> on the same core with AVX512. So a latency nice hint allows the
>>> scheduler to have a criterion to determine the latency sensitivity of
>>> a task and arrange latency sensitive tasks away from AVX512 tasks.
>>
>> Hi Tim and Subhra,
>>
>> This patchset seems to be interesting for my TurboSched patches as
>> well, where I try to pack jitter tasks on fewer cores to get higher
>> Turbo frequencies. The problem I face is that we sometimes end up
>> putting multiple jitter tasks on a core running some latency sensitive
>> application, which may see performance degradation. So my plan was to
>> classify such tasks as latency sensitive, thereby hinting the load
>> balancer to not put tasks on such cores.
>>
>> TurboSched: https://lkml.org/lkml/2019/7/25/296
>>
>>> You configure the latency hint on a cgroup basis.
>>> But I think not all tasks in a cgroup necessarily have the same
>>> latency sensitivity.
>>>
>>> For example, I can see that a cgroup can be applied on a per user
>>> basis, and the user could run different tasks that have different
>>> latency sensitivity. We may also need a way to configure latency
>>> sensitivity on a per task basis instead of on a per cgroup basis.
>>
>> AFAIU, the problem defined above intersects with my patches as well,
>> where the interface is required to classify the jitter tasks. I have
>> already tried a few methods like syscall and cgroup to classify such
>> tasks, and maybe something like that can be adopted with this patchset
>> as well.
>
> Agree, these two patchsets are definitely overlapping in terms of
> mechanisms and APIs to expose to userspace. You two seem to target
> different goals, but the general approach should be:
>
> - expose a single and abstract concept to user-space,
>   latency-nice or latency-tolerant as PaulT proposed at OSPM

I agree. Both patchsets try to classify tasks for some purpose related
to latency. TurboSched requires the classification of whether the task
is jitter and should not be given enough resources/frequency; this is a
boolean value. Whereas latency-nice is a range. So does that mean that a
max-latency-nice task is a jitter? I was thinking of not doing jitter
packing on a core occupied by a min-latency-nice (i.e., latency
sensitive) task (until there are other busier cores). Given this, we
can expose a single per-task attribute to the user by a syscall, right?

> - map this concept in kernel-space to different kinds of bias, both at
>   wakeup time and load-balance time, and use both for RT and CFS tasks.
>
> That's my understanding at least ;)
>
> I guess we will have interesting discussions at the upcoming LPC to
> figure out a solution fitting all needs.

Definitely.

>> Thanks,
>> Parth
>
> Best,
> Patrick
On 9/5/19 3:15 PM, Patrick Bellasi wrote:
> On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...
>
>> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>
>> *sigh*, no. We start with a normal per task attribute, and then later,
>> if it is needed and makes sense, we add it to cgroups.
>
> FWIW, to add on top of what Peter says, we used this same approach for
> uclamp and it proved to be a very effective way to come up with a good
> design. General principles have been:
>
> - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
>   default values for all tasks affected by that feature.
>   This interface also has to define upper bounds for task specific
>   values. Thus, in the case of latency-nice, it should be set by
>   default to the MIN value, since that's the current mainline
>   behaviour: all tasks are latency sensitive.
>
> - a per-task API [2] (via the sched_setattr() syscall) can be used to
>   relax the system wide setting, thus implementing a "nice" policy.
>
> - a per-taskgroup API [3] (via the cpu controller's attributes) can be
>   used to relax the system-wide settings and restrict the per-task API.
>
> The above features are worth adding in that exact order.
>
>> Also, your Changelog fails on pretty much every point. It doesn't
>> explain why, it doesn't describe anything and so on.
>
> On the description side, I guess it's worth mentioning somewhere which
> scheduling classes this feature can be useful for. It's worth
> mentioning that it can apply only to:
>
> - CFS tasks: for example, at wakeup time a task with a high
>   latency-nice should avoid preempting a low latency-nice task.
>   Maybe by mapping the latency nice value into a proper vruntime
>   normalization value?

If I got this correctly, does this also mean that a task's latency-nice
will be mapped to prio/nice, i.e., a task with min latency-nice will
have the highest priority?

> - RT tasks: for example, at wakeup time a task with a high
>   latency-nice value could avoid preempting a CFS task.

So, will this make a CFS task precede an RT task, and cause priority
inversion?

> I'm sure there will be discussion about some of these features, that's
> why it's important in the proposal presentation to keep a well defined
> distinction between the "mechanisms and API" and how we use the new
> concept to "bias" some scheduler policies.
>
>> From just reading the above, I would expect it to have the range
>> [-20,19] just like normal nice. Apparently this is not so.
>
> Regarding the range for the latency-nice values, I guess we have two
> options:
>
> - [-20..19], which makes it similar to priorities
>   downside: we quite likely end up with a kernel space representation
>   which does not match the user-space one, e.g. look at
>   task_struct::prio.
>
> - [0..1024], which makes it more similar to a "percentage"
>
> Latency-nice being a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
>
> That will translate into:
>
>   latency-nice=0    : default (current mainline) behaviour, all
>                       "biasing" policies are disabled and we wake up
>                       as fast as possible
>
>   latency-nice=1024 : maximum niceness, where for example we can
>                       imagine switching a CFS task to be SCHED_IDLE?
>
> Best,
> Patrick
>
> [1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
> [2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to
>     support utilization clamping")
> [3] 5 patches in today's tip/sched/core up to:
>     commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's
>     clamp changes")
On 9/5/19 6:37 PM, Patrick Bellasi wrote:
> On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...
>
>> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>>> There's a few things wrong there; I really feel that if we call it
>>>> nice, it should be like nice. Otherwise we should call it
>>>> latency-bias and not have the association with nice to confuse
>>>> people.
>>>>
>>>> Secondly; the default should be in the middle of the range. Naturally
>>>> this would be a signed range like nice [-(x+1),x] for some x. But if
>>>> you want [0,1024], then the default really should be 512, but
>>>> personally I like 0 better as a default, in which case we need
>>>> negative numbers.
>>>>
>>>> This is important because we want to be able to bias towards less
>>>> importance to (tail) latency as well as more importance to (tail)
>>>> latency.
>>>>
>>>> Specifically, Oracle wants to sacrifice (some) latency for
>>>> throughput. Facebook OTOH seems to want to sacrifice (some)
>>>> throughput for latency.
>>>
>>> Right, we have this dualism to deal with and current mainline
>>> behaviour is somehow in the middle.
>>>
>>> BTW, the FB requirement is the same we have in Android.
>>> We want some CFS tasks to have very small latency and a low chance
>>> to be preempted by the wake-up of less-important "background" tasks.
>>>
>>> I'm not totally against the usage of a signed range, but I'm thinking
>>> that since we are introducing a new (non POSIX) concept we can get
>>> the chance to make it more human friendly.
>>>
>>> Given the two extremes above, would it not be much simpler and more
>>> intuitive to have 0 implementing the FB/Android (no latency) case and
>>> 1024 the (max latency) Oracle case?
>>
>> For something like latency-<whatever>, I don't see the point of having
>> such a wide range. The nice range is probably more than enough - and
>> before even bothering about the range, we should probably agree on
>> what the range should represent.
>>
>> If it's niceness, I read it as: positive latency-nice value means
>> we're nice to latency, means we reduce it. So the further up you go,
>> the more you restrict your wakeup scan. I think it's quite easy to map
>> that into the code: current behaviour at 0, with a decreasing scan
>> mask size as we go towards +19. I don't think anyone needs 512 steps
>> to tune this.
>>
>> I don't know what logic we'd follow for negative values though. Maybe
>> latency-nice -20 means always going through the slowpath, but what of
>> the intermediate values?
>
> Yep, I think so far we are all converging towards the idea to use a
> signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leaves room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
>
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
>
> For the wakeup path there is also the TurboSched proposal by Parth:
>
>   Message-ID: <20190725070857.6639-1-parth@linux.ibm.com>
>   https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/
>
> we should keep in mind.
>
>> applied elsewhere (e.g. load-balance, something like task_hot() and
>> its use of sysctl_sched_migration_cost).
>
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency
> niceness?

I guess there is a use case in thermal throttling. If a task is heating
up the core, then in the ideal scenario POWER systems throttle down to
the rated frequency. In such a case, if the task is latency sensitive
(min latency nice), we can move the task around the chip to heat up the
chip uniformly, allowing us to gain more performance with a sustained
higher frequency. With this, we will require the help of the active
load balancer and latency-nice classification on a per task and/or
group basis.

Hopefully, this might be useful for other archs as well, right?

> Best,
> Patrick

Thanks,
Parth
On 06/09/2019 13:45, Parth Shah wrote:
> I guess there is a use case in thermal throttling. If a task is heating
> up the core, then in the ideal scenario POWER systems throttle down to
> the rated frequency. In such a case, if the task is latency sensitive
> (min latency nice), we can move the task around the chip to heat up the
> chip uniformly, allowing us to gain more performance with a sustained
> higher frequency. With this, we will require the help of the active
> load balancer and latency-nice classification on a per task and/or
> group basis.
>
> Hopefully, this might be useful for other archs as well, right?

Most of the functionality is already there, we're only really missing
thermal pressure awareness. There was [1] but it seems to have died.

At least with CFS load balancing, if thermal throttling is correctly
reflected as a CPU capacity reduction, you will tend to move things away
from that CPU, since load is balanced over capacities.

For active balance, we actually already have a condition that moves a
task to a less capacity-pressured CPU (although it is somewhat
specific). So if thermal pressure follows that task (e.g. it's doing
tons of vector/float), it will be rotated around.

However, there should be a point made on latency vs throughput. If you
care about latency you probably do not want to active balance your
task. If you care about throughput, it should be specified in some way
(util-clamp says hello!).

It sort of feels like you'd want an extension of misfit migration
(salesman hat goes on from here) - misfit moves tasks that are CPU
bound (IOW their util is >= 80% of the CPU capacity) to CPUs of higher
capacity. It's only enabled for systems with asymmetric capacities, but
could be enabled globally for "dynamically-created asymmetric
capacities" (IOW RT/IRQ/thermal pressure on SMP systems).

On top of that, if we make misfit consider e.g. uclamp.min (I don't
think that's already the case), then you have your throughput knob to
have *some* designated tasks move away from (thermal & else) pressure.

[1]: https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopinath@linaro.org/
On Fri, 6 Sep 2019 at 16:13, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 06/09/2019 13:45, Parth Shah wrote:
> > I guess there is a use case in thermal throttling. If a task is
> > heating up the core, then in the ideal scenario POWER systems
> > throttle down to the rated frequency. In such a case, if the task is
> > latency sensitive (min latency nice), we can move the task around the
> > chip to heat up the chip uniformly, allowing us to gain more
> > performance with a sustained higher frequency. With this, we will
> > require the help of the active load balancer and latency-nice
> > classification on a per task and/or group basis.
> >
> > Hopefully, this might be useful for other archs as well, right?
>
> Most of the functionality is already there, we're only really missing
> thermal pressure awareness. There was [1] but it seems to have died.

Thara still works on it, but she has been sidetracked by other
activities and it takes more time than expected to run all the tests
with different averaging windows and process the results.

> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction, you will tend to move things
> away from that CPU, since load is balanced over capacities.
>
> For active balance, we actually already have a condition that moves a
> task to a less capacity-pressured CPU (although it is somewhat
> specific). So if thermal pressure follows that task (e.g. it's doing
> tons of vector/float), it will be rotated around.
>
> However, there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your
> task. If you care about throughput, it should be specified in some way
> (util-clamp says hello!).
>
> It sort of feels like you'd want an extension of misfit migration
> (salesman hat goes on from here) - misfit moves tasks that are CPU
> bound (IOW their util is >= 80% of the CPU capacity) to CPUs of higher
> capacity. It's only enabled for systems with asymmetric capacities,
> but could be enabled globally for "dynamically-created asymmetric
> capacities" (IOW RT/IRQ/thermal pressure on SMP systems).
>
> On top of that, if we make misfit consider e.g. uclamp.min (I don't
> think that's already the case), then you have your throughput knob to
> have *some* designated tasks move away from (thermal & else) pressure.
>
> [1]: https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopinath@linaro.org/
On 9/6/19 7:43 PM, Valentin Schneider wrote:
> On 06/09/2019 13:45, Parth Shah wrote:
>> I guess there is a use case in thermal throttling. If a task is
>> heating up the core, then in the ideal scenario POWER systems throttle
>> down to the rated frequency. In such a case, if the task is latency
>> sensitive (min latency nice), we can move the task around the chip to
>> heat up the chip uniformly, allowing us to gain more performance with
>> a sustained higher frequency. With this, we will require the help of
>> the active load balancer and latency-nice classification on a per task
>> and/or group basis.
>>
>> Hopefully, this might be useful for other archs as well, right?
>
> Most of the functionality is already there, we're only really missing
> thermal pressure awareness. There was [1] but it seems to have died.
>
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction, you will tend to move things
> away from that CPU, since load is balanced over capacities.

Right, CPU capacity can solve the problem of indicating the thermal
throttle to the scheduler. AFAIU, the patchset from Thara changes CPU
capacity to reflect the thermal headroom of the CPU. This is a nice
mitigation, but:

1. Sometimes a single task is responsible for the thermal heat-up of
   the core; reducing the CPU capacity of all the CPUs in the core is
   not optimal when just moving such a single task to another core can
   allow us to remain within the thermal headroom. This is important
   for servers especially, where there are up to 8 threads per core.

2. Given the implementation in the patches and its integration with
   EAS, it seems difficult to adapt to servers, where CPU capacity
   itself is in doubt. https://lkml.org/lkml/2019/5/15/1402

> For active balance, we actually already have a condition that moves a
> task to a less capacity-pressured CPU (although it is somewhat
> specific). So if thermal pressure follows that task (e.g. it's doing
> tons of vector/float), it will be rotated around.

Agree. But this should break in certain conditions, like when we have
multiple tasks in a core with almost equal utilization, among which one
is just doing vector operations. LB can pick and move any task with
equal probability if the capacity is reduced here.

> However, there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your
> task.

Can you please elaborate on why not to consider active balance for
latency sensitive tasks? Because sometimes finding a thermally cool
core is beneficial when the Turbo frequency range is around 20% above
the rated one.

> If you care about throughput, it should be specified in some way
> (util-clamp says hello!).

Yes, I do care about latency and throughput both. :-)
But I'm wondering how uclamp can solve the problem for throughput. If I
make the thermally hot tasks appear bigger than other tasks, then
reducing CPU capacity can allow such tasks to move around the chip. But
this will require the utilization value to be relatively large compared
to the other tasks in the core. Or other tasks' uclamp.max can be
lowered to make such a task rotate. If I got it right, then this will
be a difficult uclamp use case from the user's perspective, right? I
feel like I'm missing something here.

> It sort of feels like you'd want an extension of misfit migration
> (salesman hat goes on from here) - misfit moves tasks that are CPU
> bound (IOW their util is >= 80% of the CPU capacity) to CPUs of higher
> capacity. It's only enabled for systems with asymmetric capacities,
> but could be enabled globally for "dynamically-created asymmetric
> capacities" (IOW RT/IRQ/thermal pressure on SMP systems).
>
> On top of that, if we make misfit consider e.g. uclamp.min (I don't
> think that's already the case), then you have your throughput knob to
> have *some* designated tasks move away from (thermal & else) pressure.
>
> [1]: https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopinath@linaro.org/

Thanks,
Parth
On 06/09/2019 18:10, Parth Shah wrote:
> Right, CPU capacity can solve the problem of indicating the thermal throttle to the scheduler.
> AFAIU, the patchset from Thara changes CPU capacity to reflect Thermal headroom of the CPU.
> This is a nice mitigation but,
> 1. Sometimes a single task is responsible for the Thermal heatup of the core, reducing the
> CPU capacity of all the CPUs in the core is not optimal when just moving such single
> task to other core can allow us to remain within thermal headroom. This is important
> for the servers especially where there are upto 8 threads.
> 2. Given the implementation in the patches and its integration with EAS, it seems difficult
> to adapt to servers, where CPU capacity itself is in doubt.
> https://lkml.org/lkml/2019/5/15/1402
>

I'd nuance this to *SMT* capacity (which isn't just servers). The thing is
that it's difficult to come up with a sensible scheme to describe the base
capacity of a single logical CPU. But yeah, valid point.

>>
>> For active balance, we actually already have a condition that moves a task
>> to a less capacity-pressured CPU (although it is somewhat specific). So if
>> thermal pressure follows that task (e.g. it's doing tons of vector/float),
>> it will be rotated around.
>
> Agree. But this should break in certain conditions like when we have multiple tasks
> in a core with almost equal utilization among which one is just doing vector operations.
> LB can pick and move any task with equal probability if the capacity is reduced here.
>

Right, if/when we get things like per-unit signals (wasn't there something
about tracking AVX a few months back?) then we'll be able to make more
informed decisions, for now we'll need some handholding (read: task
classification).

>>
>> However there should be a point made on latency vs throughput. If you
>> care about latency you probably do not want to active balance your task. If
>
> Can you please elaborate on why not to consider active balance for latency sensitive tasks?
> Because, sometimes finding a thermally cool core is beneficial when Turbo frequency
> range is around 20% above rated ones.
>

This goes back to my reply to Patrick further up the thread. Right now
active balance can happen just because we've been imbalanced for some time
and repeatedly failed to migrate anything. After 3 (IIRC) successive failed
attempts, we'll active balance the running task of the remote rq we decided
was busiest.

If that happens to be a latency sensitive task, that's not great - active
balancing means stopping that task's execution, so we're going to add some
latency to this latency-sensitive task. My proposal was to further
ratelimit active balance (e.g. require more failed attempts) when the task
that would be preempted is latency-sensitive.

My point is: if that task is doing fine where it is, why preempt it? That's
just introducing latency IMO (keeping in mind that those balance attempts
could happen despite not having any thermal pressure). If you care about
performance (e.g. a minimum level of throughput), to me that is a separate
(though perhaps not entirely distinct) property.

>> you care about throughput, it should be specified in some way (util-clamp
>> says hello!).
>>
>
> yes I do care for latency and throughput both. :-)

Don't we all!

> but I'm wondering how uclamp can solve the problem for throughput.
> If I make the thermally hot tasks to appear bigger than other tasks then reducing
> CPU capacity can allow such tasks to move around the chip.
> But this will require the utilization value to be relatively large compared to the other
> tasks in the core. Or other task's uclamp.max can be lowered to make such task rotate.
> If I got it right, then this will be a difficult UCLAMP usecase from user perspective, right?
> I feel like I'm missing something here.
>

Hmm perhaps I was jumping the gun here.

What I was getting to is if you have something like misfit that migrates
tasks to CPUs of higher capacity than the one they are on, you could use
uclamp to flag them.

You could translate your throughput requirement as a uclamp.min of e.g.
80%, and if the CPU capacity goes below that (or close within a margin)
then you'd try to migrate the task to a CPU of higher capacity (i.e. not
or less thermally pressured).

This doesn't have to involve your less throughput-sensitive tasks, since
you would only tag and take action for your throughput-sensitive tasks.
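The uclamp.min-as-throughput-floor idea above can be sketched as a fitness
check. This is illustrative userspace C, not kernel code; the helper names
and the ~20% headroom margin (borrowed from the kernel's fits_capacity()
convention) are assumptions:

```c
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL  /* the kernel's fixed-point "100%" */

/*
 * A task whose throughput floor is expressed as uclamp.min "fits" a CPU
 * only if the CPU's current (thermally pressured) capacity covers that
 * floor with some headroom. The 20% margin (cap * 1280 < max * 1024)
 * mirrors the kernel's fits_capacity() convention but is an assumption
 * here.
 */
bool fits_thermal_capacity(unsigned long uclamp_min, unsigned long cpu_cap)
{
        return uclamp_min * 1280 < cpu_cap * 1024;
}

/*
 * Migrate only when the task stopped fitting its current CPU and a less
 * thermally pressured CPU would fit it. Tasks with no throughput floor
 * (uclamp.min == 0) always fit, so this rule never disturbs them.
 */
bool should_migrate_for_thermal(unsigned long uclamp_min,
                                unsigned long cur_cap,
                                unsigned long dst_cap)
{
        return !fits_thermal_capacity(uclamp_min, cur_cap) &&
                fits_thermal_capacity(uclamp_min, dst_cap);
}
```

With a uclamp.min of 819 (~80% of 1024), a CPU thermally capped to 1000
no longer fits, so the task becomes a migration candidate toward an
uncapped CPU, while throughput-insensitive tasks stay put.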
On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
> On 09/05/19 13:30, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> > > On 09/05/19 12:46, Peter Zijlstra wrote:
> >
> > > > This is important because we want to be able to bias towards less
> > > > importance to (tail) latency as well as more importantance to (tail)
> > > > latency.
> > > >
> > > > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > > > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> > >
> > > Another use case I'm considering is using latency-nice to prefer an idle CPU if
> > > latency-nice is set otherwise go for the most energy efficient CPU.
> > >
> > > Ie: sacrifice (some) energy for latency.
> > >
> > > The way I see interpreting latency-nice here as a binary switch. But
> > > maybe we can use the range to select what (some) energy to sacrifice
> > > mean here. Hmmm.
> >
> > It cannot be binary, per definition is must be ternary, that is, <0, ==0
> > and >0 (or middle value if you're of that persuasion).
>
> I meant I want to use it as a binary.
>
> >
> > In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Yes. As long as there's an easy way to say: does this task care about latency
> or not I'm good.

Qais, Peter, all,

For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
instead of the out-of-tree schedtune mechanism to provide EAS with the
latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
upstream focussed :)

However, uclamp is missing an attribute to provide this biasing to EAS as we
know.

What was the consensus on adding a per-task attribute to uclamp for providing
this? Happy to collaborate on this front.

thanks,

- Joel

> > Anyway; there were a number of things mentioned at OSPM that we could
> > tie into this thing and finding sensible mappings is going to be a bit
> > of trial and error I suppose.
> >
> > But as patrick said; we're very much exporting a BIAS knob, not a set of
> > behaviours.
>
> Agreed. I just wanted to say that the way this range is going to be
> interpreted will differ from path to path and we need to consider that in the
> final mapping. Especially from the final user's perspective of what setting
> this value ultimately means to them.
>
> --
> Qais Yousef
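As a concrete example of one such per-path interpretation: the series at
the head of this thread defines latency-nice 1..100 as the percentage of
the LLC domain scanned in select_idle_cpu(), with a default of 5. A sketch
of that mapping (the function name and the at-least-one-CPU clamp are
illustrative assumptions, not the patch's exact code):

```c
/*
 * Sketch of the select_idle_cpu() search-limit idea from this thread's
 * patch series: latency-nice N in [1, 100] means "scan N% of the LLC
 * domain's CPUs". The clamping details here are assumptions for
 * illustration.
 */
unsigned int sis_search_limit(unsigned int llc_weight, int latency_nice)
{
        unsigned int nr;

        if (latency_nice < 1)
                latency_nice = 1;
        if (latency_nice > 100)
                latency_nice = 100;

        nr = llc_weight * (unsigned int)latency_nice / 100;
        return nr ? nr : 1;     /* always scan at least one CPU */
}
```

On the 2-socket, 88-thread machine from the cover letter (44 logical CPUs
per LLC), the default latency-nice of 5 limits the idle search to 2 CPUs
instead of a full 44-CPU sweep.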
Hi Joel,

On 16.04.20 02:02, Joel Fernandes wrote:
> On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
>> On 09/05/19 13:30, Peter Zijlstra wrote:
>>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>>>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>>
>>>>> This is important because we want to be able to bias towards less
>>>>> importance to (tail) latency as well as more importantance to (tail)
>>>>> latency.
>>>>>
>>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>>
>>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>>>> latency-nice is set otherwise go for the most energy efficient CPU.
>>>>
>>>> Ie: sacrifice (some) energy for latency.
>>>>
>>>> The way I see interpreting latency-nice here as a binary switch. But
>>>> maybe we can use the range to select what (some) energy to sacrifice
>>>> mean here. Hmmm.
>>>
>>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
>>> and >0 (or middle value if you're of that persuasion).
>>
>> I meant I want to use it as a binary.
>>
>>>
>>> In your case, I'm thinking you mean >0, we want to lower the latency.
>>
>> Yes. As long as there's an easy way to say: does this task care about latency
>> or not I'm good.
>
> Qais, Peter, all,
>
> For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
> instead of the out-of-tree schedtune mechanism to provide EAS with the
> latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
> upstream focussed :)
>
> However, uclamp is missing an attribute to provide this biasing to EAS as we
> know.
>
> What was the consensus on adding a per-task attribute to uclamp for providing
> this? Happy to collaborate on this front.

We're planning to have a session about this topic (latency-nice
attribute per task group) during the virtual Pisa OSPM summit
retis.sssup.it/ospm-summit in May this year.

There are two presentations/discussions planned:

"Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
(EAS) in Android Common Kernel (ACK)"

We'll probably merge those two into one presentation/discussion.

So far we have Parth's per-task implementation

https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com

What's missing is the per-taskgroup implementation, at least from the
standpoint of ACK.

The (mainline) EAS use-case for latency nice is already in ACK
(android-5.4):

https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
Hi Dietmar,

On Thu, Apr 16, 2020 at 1:23 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> Hi Joel,
>
> On 16.04.20 02:02, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
> >> On 09/05/19 13:30, Peter Zijlstra wrote:
> >>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> >>>> On 09/05/19 12:46, Peter Zijlstra wrote:
> >>>
> >>>>> This is important because we want to be able to bias towards less
> >>>>> importance to (tail) latency as well as more importantance to (tail)
> >>>>> latency.
> >>>>>
> >>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> >>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> >>>>
> >>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> >>>> latency-nice is set otherwise go for the most energy efficient CPU.
> >>>>
> >>>> Ie: sacrifice (some) energy for latency.
> >>>>
> >>>> The way I see interpreting latency-nice here as a binary switch. But
> >>>> maybe we can use the range to select what (some) energy to sacrifice
> >>>> mean here. Hmmm.
> >>>
> >>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> >>> and >0 (or middle value if you're of that persuasion).
> >>
> >> I meant I want to use it as a binary.
> >>
> >>>
> >>> In your case, I'm thinking you mean >0, we want to lower the latency.
> >>
> >> Yes. As long as there's an easy way to say: does this task care about latency
> >> or not I'm good.
> >
> > Qais, Peter, all,
> >
> > For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
> > instead of the out-of-tree schedtune mechanism to provide EAS with the
> > latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
> > upstream focussed :)
> >
> > However, uclamp is missing an attribute to provide this biasing to EAS as we
> > know.
> >
> > What was the consensus on adding a per-task attribute to uclamp for providing
> > this? Happy to collaborate on this front.
>
> We're planning to have a session about this topic (latency-nice
> attribute per task group) during the virtual Pisa OSPM summit
> retis.sssup.it/ospm-summit in May this year.

Cool, I registered as well.

>
> There are two presentations/discussions planned:
>
> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
> (EAS) in Android Common Kernel (ACK)"
>
> We'll probably merge those two into one presentation/discussion.
>
> So far we have Parth's per-task implementation
>
> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com

Cool, I see it has some Reviewed-by tags so that's a good sign. Will
look more into that.

> What's missing is the per-taskgroup implementation, at least from the
> standpoint of ACK.
>
> The (mainline) EAS use-case for latency nice is already in ACK
> (android-5.4):
>
> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54

Yes, I was aware of this. But if we use task groups, then the
transition from schedtune -> uclamp means now the tasks that use
uclamp would also be subjected to cpu.shares. That's why we were
looking into the per-task interface and glad there's some work on this
already done.

Thanks!

- Joel
Hi Joel,
On 4/18/20 9:31 PM, Joel Fernandes wrote:
> Hi Dietmar,
>
> On Thu, Apr 16, 2020 at 1:23 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> Hi Joel,
>>
>> On 16.04.20 02:02, Joel Fernandes wrote:
>>> On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
>>>> On 09/05/19 13:30, Peter Zijlstra wrote:
>>>>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>>>>>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>>>>
>>>>>>> This is important because we want to be able to bias towards less
>>>>>>> importance to (tail) latency as well as more importantance to (tail)
>>>>>>> latency.
>>>>>>>
>>>>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>>>>
>>>>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>>>>>> latency-nice is set otherwise go for the most energy efficient CPU.
>>>>>>
>>>>>> Ie: sacrifice (some) energy for latency.
>>>>>>
>>>>>> The way I see interpreting latency-nice here as a binary switch. But
>>>>>> maybe we can use the range to select what (some) energy to sacrifice
>>>>>> mean here. Hmmm.
>>>>>
>>>>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
>>>>> and >0 (or middle value if you're of that persuasion).
>>>>
>>>> I meant I want to use it as a binary.
>>>>
>>>>>
>>>>> In your case, I'm thinking you mean >0, we want to lower the latency.
>>>>
>>>> Yes. As long as there's an easy way to say: does this task care about latency
>>>> or not I'm good.
>>>
>>> Qais, Peter, all,
>>>
>>> For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
>>> instead of the out-of-tree schedtune mechanism to provide EAS with the
>>> latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
>>> upstream focussed :)
>>>
>>> However, uclamp is missing an attribute to provide this biasing to EAS as we
>>> know.
>>>
>>> What was the consensus on adding a per-task attribute to uclamp for providing
>>> this? Happy to collaborate on this front.
>>
>> We're planning to have a session about this topic (latency-nice
>> attribute per task group) during the virtual Pisa OSPM summit
>> retis.sssup.it/ospm-summit in May this year.
>
> Cool, I registered as well.
>
>>
>> There are two presentations/discussions planned:
>>
>> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
>> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
>> (EAS) in Android Common Kernel (ACK)"
>>
>> We'll probably merge those two into one presentation/discussion.
>>
>> So far we have Parth's per-task implementation
>>
>> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
>
> Cool, I see it has some Reviewed-by tags so that's a good sign. Will
> look more into that.
>
>> What's missing is the per-taskgroup implementation, at least from the
>> standpoint of ACK.
>>
>> The (mainline) EAS use-case for latency nice is already in ACK
>> (android-5.4):
>>
>> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
>
> Yes, I was aware of this. But if we use task groups, then the
> transition from schedtune -> uclamp means now the tasks that use
> uclamp would also be subjected to cpu.shares. That's why we were
> looking into the per-task interface and glad there's some work on this
> already done.
>
Yes, that series of latency_nice seems to be in good shape to be used for
any usecases. Hopefully, OSPM will lead to its upstreaming sooner :-)
But at the end, we aim to have both the per-task and cgroup based interface
to mark the latency_nice value of a task.
Till, then I'm finding some generic use-cases to show benefits of such task
attribute to increase community interest.
Thanks,
Parth
On 04/18/20 12:01, Joel Fernandes wrote:
> > What's missing is the per-taskgroup implementation, at least from the
> > standpoint of ACK.
> >
> > The (mainline) EAS use-case for latency nice is already in ACK
> > (android-5.4):
> >
> > https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
>
> Yes, I was aware of this. But if we use task groups, then the
> transition from schedtune -> uclamp means now the tasks that use
> uclamp would also be subjected to cpu.shares. That's why we were
> looking into the per-task interface and glad there's some work on this
> already done.
Hmm uclamp doesn't do anything with cpu.shares. I assume this is some
implementation detail at your end? IOW, you don't have to use cpu.shares to use
uclamp.
Although there should be few tasks in the system that need the latency-nice, so
I prefer the per-task interface rather than lump everything in a cgroup. Though
there could be valid use cases for the latter.
Thanks
--
Qais Yousef
Hi Qais!

On Mon, Apr 20, 2020 at 12:47:29PM +0100, Qais Yousef wrote:
> On 04/18/20 12:01, Joel Fernandes wrote:
> > > What's missing is the per-taskgroup implementation, at least from the
> > > standpoint of ACK.
> > >
> > > The (mainline) EAS use-case for latency nice is already in ACK
> > > (android-5.4):
> > >
> > > https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> >
> > Yes, I was aware of this. But if we use task groups, then the
> > transition from schedtune -> uclamp means now the tasks that use
> > uclamp would also be subjected to cpu.shares. That's why we were
> > looking into the per-task interface and glad there's some work on this
> > already done.
>
> Hmm uclamp doesn't do anything with cpu.shares. I assume this is some
> implementation detail at your end? IOW, you don't have to use cpu.shares to use
> uclamp.

Right, it is a ChromeOS-specific issue. We have CONFIG_FAIR_GROUP_SCHED
enabled in the kernel for container workloads. However there are CGroups
of tasks that used the "schedtune" CGroup interface before to provide
util-clamping-like behavior. We are now migrating these to the upstream
util clamp.

We can't disable CONFIG_FAIR_GROUP_SCHED because that would break the
container workloads. So we have to use the per-process interface of util
clamp. If we used the CGroups interface of util clamping, we would get
cpu.shares as well, since the CGroup interface comes with shares. There's
no way to avoid being subject to cpu.shares (that I'm aware of, anyway).

> Although there should be few tasks in the system that need the latency-nice, so
> I prefer the per-task interface rather than lump everything in a cgroup. Though
> there could be valid use cases for the latter.

Yes, with either interface, we need something like latency_nice to
indicate that the task is low-latency (something we used for a number of
years with the out-of-tree schedtune).

thanks!

- Joel

>
> Thanks
>
> --
> Qais Yousef
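For readers following along, the per-task interface being discussed here is
sched_setattr() with the util-clamp fields added in Linux 5.3; a per-task
latency_nice would ride alongside them in struct sched_attr. A minimal
sketch (glibc ships no wrapper, so the struct and flag values are declared
from the UAPI; make_uclamp_attr() and set_self_uclamp() are hypothetical
helper names, and the call fails with EOPNOTSUPP on kernels built without
CONFIG_UCLAMP_TASK):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Per the Linux 5.3+ UAPI; glibc provides no wrapper, so declare it here. */
struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;
        uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_ALL       0x18ULL  /* keep policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20ULL
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40ULL

/* Build an attr that only touches the clamps (0..1024 capacity scale). */
struct sched_attr make_uclamp_attr(uint32_t util_min, uint32_t util_max)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_flags = SCHED_FLAG_KEEP_ALL |
                           SCHED_FLAG_UTIL_CLAMP_MIN |
                           SCHED_FLAG_UTIL_CLAMP_MAX;
        attr.sched_util_min = util_min;
        attr.sched_util_max = util_max;
        return attr;
}

/* Clamp the calling task; returns 0 or a negative errno. */
int set_self_uclamp(uint32_t util_min, uint32_t util_max)
{
        struct sched_attr attr = make_uclamp_attr(util_min, util_max);

        return syscall(SYS_sched_setattr, 0, &attr, 0) ? -errno : 0;
}
```

For instance, set_self_uclamp(819, 1024) requests a utilization floor of
~80% of capacity for the calling task without touching its policy or
priority, and without pulling in any cgroup-level cpu.shares, which is
exactly the property the per-task interface gives ChromeOS here.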
On Mon, Apr 20, 2020 at 04:56:55PM +0530, Parth Shah wrote:
> >>
> >> There are two presentations/discussions planned:
> >>
> >> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
> >> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
> >> (EAS) in Android Common Kernel (ACK)"
> >>
> >> We'll probably merge those two into one presentation/discussion.
> >>
> >> So far we have Parth's per-task implementation
> >>
> >> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
> >
> > Cool, I see it has some Reviewed-by tags so that's a good sign. Will
> > look more into that.
> >
> >> What's missing is the per-taskgroup implementation, at least from the
> >> standpoint of ACK.
> >>
> >> The (mainline) EAS use-case for latency nice is already in ACK
> >> (android-5.4):
> >>
> >> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> >
> > Yes, I was aware of this. But if we use task groups, then the
> > transition from schedtune -> uclamp means now the tasks that use
> > uclamp would also be subjected to cpu.shares. That's why we were
> > looking into the per-task interface and glad there's some work on this
> > already done.
> >
>
> Yes, that series of latency_nice seems to be in good shape to be used for
> any usecases. Hopefully, OSPM will lead to its upstreaming sooner :-)

Cool :)

> But at the end, we aim to have both the per-task and cgroup based interface
> to mark the latency_nice value of a task.

Ok. We'd likely use the per-task interface unless we decide to assign
cpu.shares for the groups as well.

> Till, then I'm finding some generic use-cases to show benefits of such task
> attribute to increase community interest.

Ok. Feel free to add ChromeOS as a usecase as well.

thanks,

- Joel