Fifth iteration of the Core-Scheduling feature.

Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks without
requiring SMT to be disabled (which has a significant performance impact
in some situations). So far, the feature mitigates user-space-to-user-space
attacks, but not user-space-to-kernel attacks that arise when one of the
hardware threads enters the kernel (syscall, interrupt, etc).

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the same
core (for now, by having them in the same tagged cgroup). When a tag is
enabled in a cgroup and a task from that cgroup is running on a hardware
thread, the scheduler ensures that only idle or trusted tasks run on the
other sibling(s). Besides the security concerns, this feature can also be
beneficial for RT and performance applications where we want to control
dynamically how tasks make use of SMT.

This version focuses on performance and stability. A couple of crashes
related to task tagging and the cpu hotplug path were fixed. This version
also improves performance considerably by making task migration and load
balancing coresched-aware. In terms of performance, the major difference
since the last iteration is that even IO-heavy and mixed-resources
workloads are now less impacted by core scheduling than by disabling SMT.
Both host-level and VM-level benchmarks were performed. Details in:
https://lkml.org/lkml/2020/2/12/1194
https://lkml.org/lkml/2019/11/1/269

v5 is rebased on top of 5.5.5 (449718782a46):
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

Changes in v5
-------------
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups, etc
  - Tim Chen
- Coresched-aware task migrations
  - Aubrey Li
- Other minor stability fixes.
Changes in v4
-------------
- Implement a core-wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
-------------
- Fixes the issue of a sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO-heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32-bit build
  - Aubrey Li

ISSUES
------
- Aaron (Intel) found an issue with load balancing when the tasks have
  different weights (nice or cgroup shares). Task weight is not considered
  in coresched-aware load balancing, which causes the higher-weight tasks
  to starve.
- Joel (ChromeOS) found an issue where an RT task may be preempted by a
  lower-class task.
- Joel (ChromeOS) found a deadlock and crash on a PREEMPT kernel in the
  coresched idle balance logic

TODO
----
- Work on merging patches that are ready to be merged
- Decide on the API for exposing the feature to userland
- Experiment with adding synchronization points in VMEXIT to mitigate
  VM-to-host-kernel leaks
- Investigate the source of the overhead even when no tasks are tagged:
  https://lkml.org/lkml/2019/10/29/242

---

Aaron Lu (2):
  sched/fair: wrapper for cfs_rq->min_vruntime
  sched/fair: core wide vruntime comparison

Aubrey Li (1):
  sched: migration changes for core scheduling

Peter Zijlstra (9):
  sched: Wrap rq::lock access
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched/fair: Add a few assertions
  sched: Basic tracking of matching tasks
  sched: Add core wide task selection and scheduling.
  sched: Trivial forced-newidle balancer
  sched: cgroup tagging interface for core scheduling
  sched: Debug bits...

Tim Chen (1):
  sched: Update core scheduler queue when taking cpu online/offline

 include/linux/sched.h    |    9 +-
 kernel/Kconfig.preempt   |    6 +
 kernel/sched/core.c      | 1037 +++++++++++++++++++++++++++++++++++++-
 kernel/sched/cpuacct.c   |   12 +-
 kernel/sched/deadline.c  |   69 ++-
 kernel/sched/debug.c     |    4 +-
 kernel/sched/fair.c      |  387 +++++++++++---
 kernel/sched/idle.c      |   11 +-
 kernel/sched/pelt.h      |    2 +-
 kernel/sched/rt.c        |   65 ++-
 kernel/sched/sched.h     |  248 +++++++--
 kernel/sched/stop_task.c |   13 +-
 kernel/sched/topology.c  |    4 +-
 13 files changed, 1672 insertions(+), 195 deletions(-)

-- 
2.17.1
From: Peter Zijlstra <peterz@infradead.org> In preparation of playing games with rq->lock, abstract the thing using an accessor. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> --- kernel/sched/core.c | 46 +++++++++--------- kernel/sched/cpuacct.c | 12 ++--- kernel/sched/deadline.c | 18 +++---- kernel/sched/debug.c | 4 +- kernel/sched/fair.c | 38 +++++++-------- kernel/sched/idle.c | 4 +- kernel/sched/pelt.h | 2 +- kernel/sched/rt.c | 8 +-- kernel/sched/sched.h | 105 +++++++++++++++++++++------------------- kernel/sched/topology.c | 4 +- 10 files changed, 122 insertions(+), 119 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b2564d62a0f7..28ba9b56dd8a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -85,12 +85,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) for (;;) { rq = task_rq(p); - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { rq_pin_lock(rq, rf); return rq; } - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); while (unlikely(task_on_rq_migrating(p))) cpu_relax(); @@ -109,7 +109,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) for (;;) { raw_spin_lock_irqsave(&p->pi_lock, rf->flags); rq = task_rq(p); - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); /* * move_queued_task() task_rq_lock() * @@ -131,7 +131,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) rq_pin_lock(rq, rf); return rq; } - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); while (unlikely(task_on_rq_migrating(p))) @@ -201,7 +201,7 @@ void update_rq_clock(struct rq *rq) { s64 delta; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); if (rq->clock_update_flags & RQCF_ACT_SKIP) 
return; @@ -510,7 +510,7 @@ void resched_curr(struct rq *rq) struct task_struct *curr = rq->curr; int cpu; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); if (test_tsk_need_resched(curr)) return; @@ -534,10 +534,10 @@ void resched_cpu(int cpu) struct rq *rq = cpu_rq(cpu); unsigned long flags; - raw_spin_lock_irqsave(&rq->lock, flags); + raw_spin_lock_irqsave(rq_lockp(rq), flags); if (cpu_online(cpu) || cpu == smp_processor_id()) resched_curr(rq); - raw_spin_unlock_irqrestore(&rq->lock, flags); + raw_spin_unlock_irqrestore(rq_lockp(rq), flags); } #ifdef CONFIG_SMP @@ -949,7 +949,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, struct uclamp_se *uc_se = &p->uclamp[clamp_id]; struct uclamp_bucket *bucket; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); /* Update task effective clamp */ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id); @@ -989,7 +989,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p, unsigned int bkt_clamp; unsigned int rq_clamp; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); bucket = &uc_rq->bucket[uc_se->bucket_id]; SCHED_WARN_ON(!bucket->tasks); @@ -1490,7 +1490,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu) static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, struct task_struct *p, int new_cpu) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); dequeue_task(rq, p, DEQUEUE_NOCLOCK); @@ -1604,7 +1604,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) * Because __kthread_bind() calls this on blocked tasks without * holding rq->lock. */ - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); } if (running) @@ -1736,7 +1736,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) * task_rq_lock(). 
*/ WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) || - lockdep_is_held(&task_rq(p)->lock))); + lockdep_is_held(rq_lockp(task_rq(p))))); #endif /* * Clearly, migrating tasks to offline CPUs is a fairly daft thing. @@ -2253,7 +2253,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, { int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); #ifdef CONFIG_SMP if (p->sched_contributes_to_load) @@ -3107,10 +3107,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf * do an early lockdep release here: */ rq_unpin_lock(rq, rf); - spin_release(&rq->lock.dep_map, _THIS_IP_); + spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_); #ifdef CONFIG_DEBUG_SPINLOCK /* this is a valid case when another task releases the spinlock */ - rq->lock.owner = next; + rq_lockp(rq)->owner = next; #endif } @@ -3121,8 +3121,8 @@ static inline void finish_lock_switch(struct rq *rq) * fix up the runqueue lock - which gets 'carried over' from * prev into current: */ - spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_); - raw_spin_unlock_irq(&rq->lock); + spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_); + raw_spin_unlock_irq(rq_lockp(rq)); } /* @@ -3272,7 +3272,7 @@ static void __balance_callback(struct rq *rq) void (*func)(struct rq *rq); unsigned long flags; - raw_spin_lock_irqsave(&rq->lock, flags); + raw_spin_lock_irqsave(rq_lockp(rq), flags); head = rq->balance_callback; rq->balance_callback = NULL; while (head) { @@ -3283,7 +3283,7 @@ static void __balance_callback(struct rq *rq) func(rq); } - raw_spin_unlock_irqrestore(&rq->lock, flags); + raw_spin_unlock_irqrestore(rq_lockp(rq), flags); } static inline void balance_callback(struct rq *rq) @@ -6033,7 +6033,7 @@ void init_idle(struct task_struct *idle, int cpu) __sched_fork(0, idle); raw_spin_lock_irqsave(&idle->pi_lock, flags); - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); idle->state = TASK_RUNNING; 
idle->se.exec_start = sched_clock(); @@ -6070,7 +6070,7 @@ void init_idle(struct task_struct *idle, int cpu) #ifdef CONFIG_SMP idle->on_cpu = 1; #endif - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); raw_spin_unlock_irqrestore(&idle->pi_lock, flags); /* Set the preempt count _outside_ the spinlocks! */ @@ -6632,7 +6632,7 @@ void __init sched_init(void) struct rq *rq; rq = cpu_rq(i); - raw_spin_lock_init(&rq->lock); + raw_spin_lock_init(&rq->__lock); rq->nr_running = 0; rq->calc_load_active = 0; rq->calc_load_update = jiffies + LOAD_FREQ; diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 9fbb10383434..78de28ebc45d 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, /* * Take rq->lock to make 64-bit read safe on 32-bit platforms. */ - raw_spin_lock_irq(&cpu_rq(cpu)->lock); + raw_spin_lock_irq(rq_lockp(cpu_rq(cpu))); #endif if (index == CPUACCT_STAT_NSTATS) { @@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, } #ifndef CONFIG_64BIT - raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu))); #endif return data; @@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val) /* * Take rq->lock to make 64-bit write safe on 32-bit platforms. */ - raw_spin_lock_irq(&cpu_rq(cpu)->lock); + raw_spin_lock_irq(rq_lockp(cpu_rq(cpu))); #endif for (i = 0; i < CPUACCT_STAT_NSTATS; i++) cpuusage->usages[i] = val; #ifndef CONFIG_64BIT - raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu))); #endif } @@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V) * Take rq->lock to make 64-bit read safe on 32-bit * platforms. 
*/ - raw_spin_lock_irq(&cpu_rq(cpu)->lock); + raw_spin_lock_irq(rq_lockp(cpu_rq(cpu))); #endif seq_printf(m, " %llu", cpuusage->usages[index]); #ifndef CONFIG_64BIT - raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu))); #endif } seq_puts(m, "\n"); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 43323f875cb9..ded147f84382 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq) { u64 old = dl_rq->running_bw; - lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); + lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq))); dl_rq->running_bw += dl_bw; SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */ SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw); @@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq) { u64 old = dl_rq->running_bw; - lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); + lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq))); dl_rq->running_bw -= dl_bw; SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */ if (dl_rq->running_bw > old) @@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq) { u64 old = dl_rq->this_bw; - lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); + lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq))); dl_rq->this_bw += dl_bw; SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */ } @@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq) { u64 old = dl_rq->this_bw; - lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); + lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq))); dl_rq->this_bw -= dl_bw; SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */ if (dl_rq->this_bw > old) @@ -925,7 +925,7 @@ static int start_dl_timer(struct task_struct *p) ktime_t now, act; s64 delta; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); /* * We want the timer to fire at the deadline, but considering @@ -1035,9 +1035,9 @@ static enum hrtimer_restart dl_task_timer(struct 
hrtimer *timer) * If the runqueue is no longer available, migrate the * task elsewhere. This necessarily changes rq. */ - lockdep_unpin_lock(&rq->lock, rf.cookie); + lockdep_unpin_lock(rq_lockp(rq), rf.cookie); rq = dl_task_offline_migration(rq, p); - rf.cookie = lockdep_pin_lock(&rq->lock); + rf.cookie = lockdep_pin_lock(rq_lockp(rq)); update_rq_clock(rq); /* @@ -1652,7 +1652,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused * from try_to_wake_up(). Hence, p->pi_lock is locked, but * rq->lock is not... So, lock it */ - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); if (p->dl.dl_non_contending) { sub_running_bw(&p->dl, &rq->dl); p->dl.dl_non_contending = 0; @@ -1667,7 +1667,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused put_task_struct(p); } sub_rq_bw(&p->dl, &rq->dl); - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); } static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index f7e4579e746c..7e5f2237c7e4 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -498,7 +498,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "exec_clock", SPLIT_NS(cfs_rq->exec_clock)); - raw_spin_lock_irqsave(&rq->lock, flags); + raw_spin_lock_irqsave(rq_lockp(rq), flags); if (rb_first_cached(&cfs_rq->tasks_timeline)) MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime; last = __pick_last_entity(cfs_rq); @@ -506,7 +506,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) max_vruntime = last->vruntime; min_vruntime = cfs_rq->min_vruntime; rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime; - raw_spin_unlock_irqrestore(&rq->lock, flags); + raw_spin_unlock_irqrestore(rq_lockp(rq), flags); SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime", SPLIT_NS(MIN_vruntime)); SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime", diff 
--git a/kernel/sched/fair.c b/kernel/sched/fair.c index ba749f579714..3b218753bf7a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1091,7 +1091,7 @@ struct numa_group { static struct numa_group *deref_task_numa_group(struct task_struct *p) { return rcu_dereference_check(p->numa_group, p == current || - (lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu))); + (lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu))); } static struct numa_group *deref_curr_numa_group(struct task_struct *p) @@ -5035,7 +5035,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq) { struct task_group *tg; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); rcu_read_lock(); list_for_each_entry_rcu(tg, &task_groups, list) { @@ -5054,7 +5054,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq) { struct task_group *tg; - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); rcu_read_lock(); list_for_each_entry_rcu(tg, &task_groups, list) { @@ -6438,7 +6438,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old' * rq->lock and can modify state directly. 
*/ - lockdep_assert_held(&task_rq(p)->lock); + lockdep_assert_held(rq_lockp(task_rq(p))); detach_entity_cfs_rq(&p->se); } else { @@ -7066,7 +7066,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env) { s64 delta; - lockdep_assert_held(&env->src_rq->lock); + lockdep_assert_held(rq_lockp(env->src_rq)); if (p->sched_class != &fair_sched_class) return 0; @@ -7160,7 +7160,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) { int tsk_cache_hot; - lockdep_assert_held(&env->src_rq->lock); + lockdep_assert_held(rq_lockp(env->src_rq)); /* * We do not migrate tasks that are: @@ -7238,7 +7238,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) */ static void detach_task(struct task_struct *p, struct lb_env *env) { - lockdep_assert_held(&env->src_rq->lock); + lockdep_assert_held(rq_lockp(env->src_rq)); deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, env->dst_cpu); @@ -7254,7 +7254,7 @@ static struct task_struct *detach_one_task(struct lb_env *env) { struct task_struct *p; - lockdep_assert_held(&env->src_rq->lock); + lockdep_assert_held(rq_lockp(env->src_rq)); list_for_each_entry_reverse(p, &env->src_rq->cfs_tasks, se.group_node) { @@ -7290,7 +7290,7 @@ static int detach_tasks(struct lb_env *env) struct task_struct *p; int detached = 0; - lockdep_assert_held(&env->src_rq->lock); + lockdep_assert_held(rq_lockp(env->src_rq)); if (env->imbalance <= 0) return 0; @@ -7405,7 +7405,7 @@ static int detach_tasks(struct lb_env *env) */ static void attach_task(struct rq *rq, struct task_struct *p) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); BUG_ON(task_rq(p) != rq); activate_task(rq, p, ENQUEUE_NOCLOCK); @@ -9291,7 +9291,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (need_active_balance(&env)) { unsigned long flags; - raw_spin_lock_irqsave(&busiest->lock, flags); + raw_spin_lock_irqsave(rq_lockp(busiest), flags); /* * Don't kick the active_load_balance_cpu_stop, @@ -9299,7 
+9299,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, * moved to this_cpu: */ if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) { - raw_spin_unlock_irqrestore(&busiest->lock, + raw_spin_unlock_irqrestore(rq_lockp(busiest), flags); env.flags |= LBF_ALL_PINNED; goto out_one_pinned; @@ -9315,7 +9315,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, busiest->push_cpu = this_cpu; active_balance = 1; } - raw_spin_unlock_irqrestore(&busiest->lock, flags); + raw_spin_unlock_irqrestore(rq_lockp(busiest), flags); if (active_balance) { stop_one_cpu_nowait(cpu_of(busiest), @@ -10058,7 +10058,7 @@ static void nohz_newidle_balance(struct rq *this_rq) time_before(jiffies, READ_ONCE(nohz.next_blocked))) return; - raw_spin_unlock(&this_rq->lock); + raw_spin_unlock(rq_lockp(this_rq)); /* * This CPU is going to be idle and blocked load of idle CPUs * need to be updated. Run the ilb locally as it is a good @@ -10067,7 +10067,7 @@ static void nohz_newidle_balance(struct rq *this_rq) */ if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE)) kick_ilb(NOHZ_STATS_KICK); - raw_spin_lock(&this_rq->lock); + raw_spin_lock(rq_lockp(this_rq)); } #else /* !CONFIG_NO_HZ_COMMON */ @@ -10133,7 +10133,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf) goto out; } - raw_spin_unlock(&this_rq->lock); + raw_spin_unlock(rq_lockp(this_rq)); update_blocked_averages(this_cpu); rcu_read_lock(); @@ -10174,7 +10174,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf) } rcu_read_unlock(); - raw_spin_lock(&this_rq->lock); + raw_spin_lock(rq_lockp(this_rq)); if (curr_cost > this_rq->max_idle_balance_cost) this_rq->max_idle_balance_cost = curr_cost; @@ -10647,9 +10647,9 @@ void unregister_fair_sched_group(struct task_group *tg) rq = cpu_rq(cpu); - raw_spin_lock_irqsave(&rq->lock, flags); + raw_spin_lock_irqsave(rq_lockp(rq), flags); list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); - raw_spin_unlock_irqrestore(&rq->lock, flags); + 
raw_spin_unlock_irqrestore(rq_lockp(rq), flags); } } diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index ffa959e91227..f8653290de95 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -413,10 +413,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq) static void dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags) { - raw_spin_unlock_irq(&rq->lock); + raw_spin_unlock_irq(rq_lockp(rq)); printk(KERN_ERR "bad: scheduling from the idle thread!\n"); dump_stack(); - raw_spin_lock_irq(&rq->lock); + raw_spin_lock_irq(rq_lockp(rq)); } /* diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index afff644da065..6649cb63e32a 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -116,7 +116,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq) static inline u64 rq_clock_pelt(struct rq *rq) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); assert_clock_updated(rq); return rq->clock_pelt - rq->lost_idle_time; diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index e591d40fd645..fc7d6706b209 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -846,7 +846,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun) if (skip) continue; - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); update_rq_clock(rq); if (rt_rq->rt_time) { @@ -884,7 +884,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun) if (enqueue) sched_rt_rq_enqueue(rt_rq); - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); } if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)) @@ -2021,9 +2021,9 @@ void rto_push_irq_work_func(struct irq_work *work) * When it gets updated, a check is made if a push is possible. 
*/ if (has_pushable_tasks(rq)) { - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); push_rt_tasks(rq); - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); } raw_spin_lock(&rd->rto_lock); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 280a3c735935..a306008a12f7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -846,7 +846,7 @@ struct uclamp_rq { */ struct rq { /* runqueue lock: */ - raw_spinlock_t lock; + raw_spinlock_t __lock; /* * nr_running and cpu_load should be in the same cacheline because @@ -1026,6 +1026,10 @@ static inline int cpu_of(struct rq *rq) #endif } +static inline raw_spinlock_t *rq_lockp(struct rq *rq) +{ + return &rq->__lock; +} #ifdef CONFIG_SCHED_SMT extern void __update_idle_core(struct rq *rq); @@ -1093,7 +1097,7 @@ static inline void assert_clock_updated(struct rq *rq) static inline u64 rq_clock(struct rq *rq) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); assert_clock_updated(rq); return rq->clock; @@ -1101,7 +1105,7 @@ static inline u64 rq_clock(struct rq *rq) static inline u64 rq_clock_task(struct rq *rq) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); assert_clock_updated(rq); return rq->clock_task; @@ -1109,7 +1113,7 @@ static inline u64 rq_clock_task(struct rq *rq) static inline void rq_clock_skip_update(struct rq *rq) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); rq->clock_update_flags |= RQCF_REQ_SKIP; } @@ -1119,7 +1123,7 @@ static inline void rq_clock_skip_update(struct rq *rq) */ static inline void rq_clock_cancel_skipupdate(struct rq *rq) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); rq->clock_update_flags &= ~RQCF_REQ_SKIP; } @@ -1138,7 +1142,7 @@ struct rq_flags { static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf) { - rf->cookie = lockdep_pin_lock(&rq->lock); + rf->cookie = lockdep_pin_lock(rq_lockp(rq)); #ifdef CONFIG_SCHED_DEBUG rq->clock_update_flags &= 
(RQCF_REQ_SKIP|RQCF_ACT_SKIP); @@ -1153,12 +1157,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf) rf->clock_update_flags = RQCF_UPDATED; #endif - lockdep_unpin_lock(&rq->lock, rf->cookie); + lockdep_unpin_lock(rq_lockp(rq), rf->cookie); } static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf) { - lockdep_repin_lock(&rq->lock, rf->cookie); + lockdep_repin_lock(rq_lockp(rq), rf->cookie); #ifdef CONFIG_SCHED_DEBUG /* @@ -1179,7 +1183,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf) __releases(rq->lock) { rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); } static inline void @@ -1188,7 +1192,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf) __releases(p->pi_lock) { rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); } @@ -1196,7 +1200,7 @@ static inline void rq_lock_irqsave(struct rq *rq, struct rq_flags *rf) __acquires(rq->lock) { - raw_spin_lock_irqsave(&rq->lock, rf->flags); + raw_spin_lock_irqsave(rq_lockp(rq), rf->flags); rq_pin_lock(rq, rf); } @@ -1204,7 +1208,7 @@ static inline void rq_lock_irq(struct rq *rq, struct rq_flags *rf) __acquires(rq->lock) { - raw_spin_lock_irq(&rq->lock); + raw_spin_lock_irq(rq_lockp(rq)); rq_pin_lock(rq, rf); } @@ -1212,7 +1216,7 @@ static inline void rq_lock(struct rq *rq, struct rq_flags *rf) __acquires(rq->lock) { - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); rq_pin_lock(rq, rf); } @@ -1220,7 +1224,7 @@ static inline void rq_relock(struct rq *rq, struct rq_flags *rf) __acquires(rq->lock) { - raw_spin_lock(&rq->lock); + raw_spin_lock(rq_lockp(rq)); rq_repin_lock(rq, rf); } @@ -1229,7 +1233,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf) __releases(rq->lock) { rq_unpin_lock(rq, rf); - raw_spin_unlock_irqrestore(&rq->lock, rf->flags); + raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags); } 
static inline void @@ -1237,7 +1241,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf) __releases(rq->lock) { rq_unpin_lock(rq, rf); - raw_spin_unlock_irq(&rq->lock); + raw_spin_unlock_irq(rq_lockp(rq)); } static inline void @@ -1245,7 +1249,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf) __releases(rq->lock) { rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); + raw_spin_unlock(rq_lockp(rq)); } static inline struct rq * @@ -1310,7 +1314,7 @@ queue_balance_callback(struct rq *rq, struct callback_head *head, void (*func)(struct rq *rq)) { - lockdep_assert_held(&rq->lock); + lockdep_assert_held(rq_lockp(rq)); if (unlikely(head->next)) return; @@ -1994,7 +1998,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest) __acquires(busiest->lock) __acquires(this_rq->lock) { - raw_spin_unlock(&this_rq->lock); + raw_spin_unlock(rq_lockp(this_rq)); double_rq_lock(this_rq, busiest); return 1; @@ -2013,20 +2017,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest) __acquires(busiest->lock) __acquires(this_rq->lock) { - int ret = 0; - - if (unlikely(!raw_spin_trylock(&busiest->lock))) { - if (busiest < this_rq) { - raw_spin_unlock(&this_rq->lock); - raw_spin_lock(&busiest->lock); - raw_spin_lock_nested(&this_rq->lock, - SINGLE_DEPTH_NESTING); - ret = 1; - } else - raw_spin_lock_nested(&busiest->lock, - SINGLE_DEPTH_NESTING); + if (rq_lockp(this_rq) == rq_lockp(busiest)) + return 0; + + if (likely(raw_spin_trylock(rq_lockp(busiest)))) + return 0; + + if (rq_lockp(busiest) >= rq_lockp(this_rq)) { + raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING); + return 0; } - return ret; + + raw_spin_unlock(rq_lockp(this_rq)); + raw_spin_lock(rq_lockp(busiest)); + raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING); + + return 1; } #endif /* CONFIG_PREEMPTION */ @@ -2036,11 +2042,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest) */ static inline int 
double_lock_balance(struct rq *this_rq, struct rq *busiest) { - if (unlikely(!irqs_disabled())) { - /* printk() doesn't work well under rq->lock */ - raw_spin_unlock(&this_rq->lock); - BUG_ON(1); - } + lockdep_assert_irqs_disabled(); return _double_lock_balance(this_rq, busiest); } @@ -2048,8 +2050,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest) static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest) __releases(busiest->lock) { - raw_spin_unlock(&busiest->lock); - lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_); + if (rq_lockp(this_rq) != rq_lockp(busiest)) + raw_spin_unlock(rq_lockp(busiest)); + lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_); } static inline void double_lock(spinlock_t *l1, spinlock_t *l2) @@ -2090,16 +2093,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2) __acquires(rq2->lock) { BUG_ON(!irqs_disabled()); - if (rq1 == rq2) { - raw_spin_lock(&rq1->lock); + if (rq_lockp(rq1) == rq_lockp(rq2)) { + raw_spin_lock(rq_lockp(rq1)); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1 < rq2) { - raw_spin_lock(&rq1->lock); - raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING); + if (rq_lockp(rq1) < rq_lockp(rq2)) { + raw_spin_lock(rq_lockp(rq1)); + raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING); } else { - raw_spin_lock(&rq2->lock); - raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING); + raw_spin_lock(rq_lockp(rq2)); + raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING); } } } @@ -2114,9 +2117,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2) __releases(rq1->lock) __releases(rq2->lock) { - raw_spin_unlock(&rq1->lock); - if (rq1 != rq2) - raw_spin_unlock(&rq2->lock); + raw_spin_unlock(rq_lockp(rq1)); + if (rq_lockp(rq1) != rq_lockp(rq2)) + raw_spin_unlock(rq_lockp(rq2)); else __release(rq2->lock); } @@ -2139,7 +2142,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2) { 
BUG_ON(!irqs_disabled()); BUG_ON(rq1 != rq2); - raw_spin_lock(&rq1->lock); + raw_spin_lock(rq_lockp(rq1)); __acquire(rq2->lock); /* Fake it out ;) */ } @@ -2154,7 +2157,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2) __releases(rq2->lock) { BUG_ON(rq1 != rq2); - raw_spin_unlock(&rq1->lock); + raw_spin_unlock(rq_lockp(rq1)); __release(rq2->lock); } diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index dfb64c08a407..991accc492d8 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd) struct root_domain *old_rd = NULL; unsigned long flags; - raw_spin_lock_irqsave(&rq->lock, flags); + raw_spin_lock_irqsave(rq_lockp(rq), flags); if (rq->rd) { old_rd = rq->rd; @@ -468,7 +468,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd) if (cpumask_test_cpu(rq->cpu, cpu_active_mask)) set_rq_online(rq); - raw_spin_unlock_irqrestore(&rq->lock, flags); + raw_spin_unlock_irqrestore(rq_lockp(rq), flags); if (old_rd) call_rcu(&old_rd->rcu, free_rootdomain); -- 2.17.1
From: Peter Zijlstra <peterz@infradead.org> Because sched_class::pick_next_task() also implies sched_class::set_next_task() (and possibly put_prev_task() and newidle_balance) it is not state invariant. This makes it unsuitable for remote task selection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> --- kernel/sched/deadline.c | 16 ++++++++++++++-- kernel/sched/fair.c | 34 +++++++++++++++++++++++++++++++--- kernel/sched/idle.c | 6 ++++++ kernel/sched/rt.c | 14 ++++++++++++-- kernel/sched/sched.h | 3 +++ kernel/sched/stop_task.c | 13 +++++++++++-- 6 files changed, 77 insertions(+), 9 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index ded147f84382..ee7fd8611ee4 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1773,7 +1773,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq, return rb_entry(left, struct sched_dl_entity, rb_node); } -static struct task_struct *pick_next_task_dl(struct rq *rq) +static struct task_struct *pick_task_dl(struct rq *rq) { struct sched_dl_entity *dl_se; struct dl_rq *dl_rq = &rq->dl; @@ -1785,7 +1785,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq) dl_se = pick_next_dl_entity(rq, dl_rq); BUG_ON(!dl_se); p = dl_task_of(dl_se); - set_next_task_dl(rq, p, true); + + return p; +} + +static struct task_struct *pick_next_task_dl(struct rq *rq) +{ + struct task_struct *p; + + p = pick_task_dl(rq); + if (p) + set_next_task_dl(rq, p, true); + return p; } @@ -2442,6 +2453,7 @@ const struct sched_class dl_sched_class = { #ifdef CONFIG_SMP .balance = balance_dl, + .pick_task = pick_task_dl, .select_task_rq = select_task_rq_dl, .migrate_task_rq = migrate_task_rq_dl, .set_cpus_allowed = set_cpus_allowed_dl, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3b218753bf7a..5eaaf0c4d9ad 100644 --- a/kernel/sched/fair.c +++ 
b/kernel/sched/fair.c @@ -4228,7 +4228,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) * Avoid running the skip buddy, if running something else can * be done without getting too unfair. */ - if (cfs_rq->skip == se) { + if (cfs_rq->skip && cfs_rq->skip == se) { struct sched_entity *second; if (se == curr) { @@ -4246,13 +4246,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) /* * Prefer last buddy, try to return the CPU to a preempted task. */ - if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) + if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) se = cfs_rq->last; /* * Someone really wants this to run. If it's not unfair, run it. */ - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) + if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) se = cfs_rq->next; clear_buddies(cfs_rq, se); @@ -6642,6 +6642,33 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ set_last_buddy(se); } +static struct task_struct *pick_task_fair(struct rq *rq) +{ + struct cfs_rq *cfs_rq = &rq->cfs; + struct sched_entity *se; + + if (!cfs_rq->nr_running) + return NULL; + + do { + struct sched_entity *curr = cfs_rq->curr; + + se = pick_next_entity(cfs_rq, NULL); + + if (curr) { + if (se && curr->on_rq) + update_curr(cfs_rq); + + if (!se || entity_before(curr, se)) + se = curr; + } + + cfs_rq = group_cfs_rq(se); + } while (cfs_rq); + + return task_of(se); +} + struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { @@ -10771,6 +10798,7 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SMP .balance = balance_fair, + .pick_task = pick_task_fair, .select_task_rq = select_task_rq_fair, .migrate_task_rq = migrate_task_rq_fair, diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index f8653290de95..46c18e3dab13 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -397,6 
+397,11 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir schedstat_inc(rq->sched_goidle); } +static struct task_struct *pick_task_idle(struct rq *rq) +{ + return rq->idle; +} + struct task_struct *pick_next_task_idle(struct rq *rq) { struct task_struct *next = rq->idle; @@ -469,6 +474,7 @@ const struct sched_class idle_sched_class = { #ifdef CONFIG_SMP .balance = balance_idle, + .pick_task = pick_task_idle, .select_task_rq = select_task_rq_idle, .set_cpus_allowed = set_cpus_allowed_common, #endif diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index fc7d6706b209..d044baedc617 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1567,7 +1567,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) return rt_task_of(rt_se); } -static struct task_struct *pick_next_task_rt(struct rq *rq) +static struct task_struct *pick_task_rt(struct rq *rq) { struct task_struct *p; @@ -1575,7 +1575,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq) return NULL; p = _pick_next_task_rt(rq); - set_next_task_rt(rq, p, true); + + return p; +} + +static struct task_struct *pick_next_task_rt(struct rq *rq) +{ + struct task_struct *p = pick_task_rt(rq); + if (p) + set_next_task_rt(rq, p, true); + return p; } @@ -2368,6 +2377,7 @@ const struct sched_class rt_sched_class = { #ifdef CONFIG_SMP .balance = balance_rt, + .pick_task = pick_task_rt, .select_task_rq = select_task_rq_rt, .set_cpus_allowed = set_cpus_allowed_common, .rq_online = rq_online_rt, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a306008a12f7..a8335e3078ab 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1724,6 +1724,9 @@ struct sched_class { #ifdef CONFIG_SMP int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); + + struct task_struct * (*pick_task)(struct rq *rq); + int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags); void (*migrate_task_rq)(struct task_struct *p, int 
new_cpu); diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 4c9e9975684f..0611348edb28 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir stop->se.exec_start = rq_clock_task(rq); } -static struct task_struct *pick_next_task_stop(struct rq *rq) +static struct task_struct *pick_task_stop(struct rq *rq) { if (!sched_stop_runnable(rq)) return NULL; - set_next_task_stop(rq, rq->stop, true); return rq->stop; } +static struct task_struct *pick_next_task_stop(struct rq *rq) +{ + struct task_struct *p = pick_task_stop(rq); + if (p) + set_next_task_stop(rq, p, true); + + return p; +} + static void enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags) { @@ -130,6 +138,7 @@ const struct sched_class stop_sched_class = { #ifdef CONFIG_SMP .balance = balance_stop, + .pick_task = pick_task_stop, .select_task_rq = select_task_rq_stop, .set_cpus_allowed = set_cpus_allowed_common, #endif -- 2.17.1
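The refactor this patch applies to every class follows one pattern: a side-effect-free `pick_task()` that only inspects the runqueue, plus a `pick_next_task()` wrapper that commits the choice with `set_next_task()`. A minimal stand-alone sketch of that split (hypothetical toy types, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct task { int id; int set_count; };
struct rq { struct task *queued; struct task *next; };

/* Stateless: only inspects the rq, so it is safe to call on a remote
 * CPU's rq when selecting tasks core-wide. */
static struct task *pick_task(struct rq *rq)
{
    return rq->queued;          /* may be NULL */
}

/* Mutates rq state; only valid on the CPU's own rq. */
static void set_next_task(struct rq *rq, struct task *p)
{
    rq->next = p;
    p->set_count++;
}

/* Local selection keeps the old behaviour: pick, then commit. */
static struct task *pick_next_task(struct rq *rq)
{
    struct task *p = pick_task(rq);
    if (p)
        set_next_task(rq, p);
    return p;
}
```

Keeping `pick_task()` state-invariant is what makes it suitable for remote task selection, which the later core-wide picking patch relies on.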
From: Peter Zijlstra <peterz@infradead.org> Introduce the basic infrastructure to have a core wide rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> --- kernel/Kconfig.preempt | 6 +++ kernel/sched/core.c | 113 ++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 31 +++++++++++ 3 files changed, 148 insertions(+), 2 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index bf82259cff96..577c288e81e5 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -80,3 +80,9 @@ config PREEMPT_COUNT config PREEMPTION bool select PREEMPT_COUNT + +config SCHED_CORE + bool + default y + depends on SCHED_SMT + diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 28ba9b56dd8a..ba17ff8a8663 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -73,6 +73,70 @@ __read_mostly int scheduler_running; */ int sysctl_sched_rt_runtime = 950000; +#ifdef CONFIG_SCHED_CORE + +DEFINE_STATIC_KEY_FALSE(__sched_core_enabled); + +/* + * The static-key + stop-machine variable are needed such that: + * + * spin_lock(rq_lockp(rq)); + * ... + * spin_unlock(rq_lockp(rq)); + * + * ends up locking and unlocking the _same_ lock, and all CPUs + * always agree on what rq has what lock. + * + * XXX entirely possible to selectively enable cores, don't bother for now. 
+ */ +static int __sched_core_stopper(void *data) +{ + bool enabled = !!(unsigned long)data; + int cpu; + + for_each_online_cpu(cpu) + cpu_rq(cpu)->core_enabled = enabled; + + return 0; +} + +static DEFINE_MUTEX(sched_core_mutex); +static int sched_core_count; + +static void __sched_core_enable(void) +{ + // XXX verify there are no cookie tasks (yet) + + static_branch_enable(&__sched_core_enabled); + stop_machine(__sched_core_stopper, (void *)true, NULL); +} + +static void __sched_core_disable(void) +{ + // XXX verify there are no cookie tasks (left) + + stop_machine(__sched_core_stopper, (void *)false, NULL); + static_branch_disable(&__sched_core_enabled); +} + +void sched_core_get(void) +{ + mutex_lock(&sched_core_mutex); + if (!sched_core_count++) + __sched_core_enable(); + mutex_unlock(&sched_core_mutex); +} + +void sched_core_put(void) +{ + mutex_lock(&sched_core_mutex); + if (!--sched_core_count) + __sched_core_disable(); + mutex_unlock(&sched_core_mutex); +} + +#endif /* CONFIG_SCHED_CORE */ + /* * __task_rq_lock - lock the rq @p resides on. */ @@ -6400,8 +6464,15 @@ int sched_cpu_activate(unsigned int cpu) /* * When going up, increment the number of cores with SMT present. */ - if (cpumask_weight(cpu_smt_mask(cpu)) == 2) + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) { static_branch_inc_cpuslocked(&sched_smt_present); +#ifdef CONFIG_SCHED_CORE + if (static_branch_unlikely(&__sched_core_enabled)) { + rq->core_enabled = true; + } +#endif + } + #endif set_cpu_active(cpu, true); @@ -6447,8 +6518,16 @@ int sched_cpu_deactivate(unsigned int cpu) /* * When going down, decrement the number of cores with SMT present. 
*/ - if (cpumask_weight(cpu_smt_mask(cpu)) == 2) + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) { +#ifdef CONFIG_SCHED_CORE + struct rq *rq = cpu_rq(cpu); + if (static_branch_unlikely(&__sched_core_enabled)) { + rq->core_enabled = false; + } +#endif static_branch_dec_cpuslocked(&sched_smt_present); + + } #endif if (!sched_smp_initialized) @@ -6473,6 +6552,28 @@ static void sched_rq_cpu_starting(unsigned int cpu) int sched_cpu_starting(unsigned int cpu) { +#ifdef CONFIG_SCHED_CORE + const struct cpumask *smt_mask = cpu_smt_mask(cpu); + struct rq *rq, *core_rq = NULL; + int i; + + for_each_cpu(i, smt_mask) { + rq = cpu_rq(i); + if (rq->core && rq->core == rq) + core_rq = rq; + } + + if (!core_rq) + core_rq = cpu_rq(cpu); + + for_each_cpu(i, smt_mask) { + rq = cpu_rq(i); + + WARN_ON_ONCE(rq->core && rq->core != core_rq); + rq->core = core_rq; + } +#endif /* CONFIG_SCHED_CORE */ + sched_rq_cpu_starting(cpu); sched_tick_start(cpu); return 0; @@ -6501,6 +6602,9 @@ int sched_cpu_dying(unsigned int cpu) update_max_interval(); nohz_balance_exit_idle(rq); hrtick_clear(rq); +#ifdef CONFIG_SCHED_CORE + rq->core = NULL; +#endif return 0; } #endif @@ -6695,6 +6799,11 @@ void __init sched_init(void) #endif /* CONFIG_SMP */ hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); + +#ifdef CONFIG_SCHED_CORE + rq->core = NULL; + rq->core_enabled = 0; +#endif } set_load_weight(&init_task, false); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a8335e3078ab..a3941b2ee29e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -999,6 +999,12 @@ struct rq { /* Must be inspected within a rcu lock section */ struct cpuidle_state *idle_state; #endif + +#ifdef CONFIG_SCHED_CORE + /* per rq */ + struct rq *core; + unsigned int core_enabled; +#endif }; #ifdef CONFIG_FAIR_GROUP_SCHED @@ -1026,11 +1032,36 @@ static inline int cpu_of(struct rq *rq) #endif } +#ifdef CONFIG_SCHED_CORE +DECLARE_STATIC_KEY_FALSE(__sched_core_enabled); + +static inline bool sched_core_enabled(struct 
rq *rq) +{ + return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled; +} + +static inline raw_spinlock_t *rq_lockp(struct rq *rq) +{ + if (sched_core_enabled(rq)) + return &rq->core->__lock; + + return &rq->__lock; +} + +#else /* !CONFIG_SCHED_CORE */ + +static inline bool sched_core_enabled(struct rq *rq) +{ + return false; +} + static inline raw_spinlock_t *rq_lockp(struct rq *rq) { return &rq->__lock; } +#endif /* CONFIG_SCHED_CORE */ + #ifdef CONFIG_SCHED_SMT extern void __update_idle_core(struct rq *rq); -- 2.17.1
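The `sched_core_get()`/`sched_core_put()` pair is a classic first-user-enables, last-user-disables refcount. Stripped of the `sched_core_mutex` serialization and the `stop_machine()` handshake (which the real code needs so all CPUs agree on `rq_lockp()`), the pattern reduces to:

```c
#include <assert.h>
#include <stdbool.h>

static int core_count;
static bool core_enabled;

static void __sched_core_enable(void)  { core_enabled = true;  }
static void __sched_core_disable(void) { core_enabled = false; }

/* First reference flips the feature on ... */
static void sched_core_get(void)
{
    if (!core_count++)
        __sched_core_enable();
}

/* ... and the last reference flips it back off. */
static void sched_core_put(void)
{
    if (!--core_count)
        __sched_core_disable();
}
```

In the kernel, each tagged cgroup holds one such reference, so core scheduling stays enabled exactly as long as at least one tag exists.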
From: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/fair.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5eaaf0c4d9ad..cffc59a8b481 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5886,6 +5886,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) struct sched_domain *sd; int i, recent_used_cpu; + /* + * per-cpu select_idle_mask usage + */ + lockdep_assert_irqs_disabled(); + if (available_idle_cpu(target) || sched_idle_cpu(target)) return target; @@ -6332,8 +6337,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set. * * Returns the target CPU number. - * - * preempt must be disabled. */ static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) @@ -6344,6 +6347,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f int want_affine = 0; int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); + /* + * required for stable ->cpus_allowed + */ + lockdep_assert_held(&p->pi_lock); + if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); -- 2.17.1
From: Peter Zijlstra <peterz@infradead.org> Introduce task_struct::core_cookie as an opaque identifier for core scheduling. When enabled, core scheduling will only allow matching tasks to be on the core, where idle matches everything. When task_struct::core_cookie is set (and core scheduling is enabled), these tasks are indexed in a second RB-tree, first on cookie value and then on scheduling function, such that matching task selection always finds the most eligible match. NOTE: *shudder* at the overhead... NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative is per class tracking of cookies and that just duplicates a lot of stuff for no raisin (the 2nd copy lives in the rt-mutex PI code). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> --- include/linux/sched.h | 8 ++- kernel/sched/core.c | 146 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 46 ------------- kernel/sched/sched.h | 55 ++++++++++++++++ 4 files changed, 208 insertions(+), 47 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 716ad1d8d95e..80ec54706282 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -680,10 +680,16 @@ struct task_struct { const struct sched_class *sched_class; struct sched_entity se; struct sched_rt_entity rt; + struct sched_dl_entity dl; + +#ifdef CONFIG_SCHED_CORE + struct rb_node core_node; + unsigned long core_cookie; +#endif + #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; #endif - struct sched_dl_entity dl; #ifdef CONFIG_UCLAMP_TASK /* Clamp values requested for a scheduling entity */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ba17ff8a8663..452ce5bb9321 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -77,6 +77,141 @@ int sysctl_sched_rt_runtime = 950000; DEFINE_STATIC_KEY_FALSE(__sched_core_enabled); +/* kernel prio, less
is more */ +static inline int __task_prio(struct task_struct *p) +{ + if (p->sched_class == &stop_sched_class) /* trumps deadline */ + return -2; + + if (rt_prio(p->prio)) /* includes deadline */ + return p->prio; /* [-1, 99] */ + + if (p->sched_class == &idle_sched_class) + return MAX_RT_PRIO + NICE_WIDTH; /* 140 */ + + return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */ +} + +/* + * l(a,b) + * le(a,b) := !l(b,a) + * g(a,b) := l(b,a) + * ge(a,b) := !l(a,b) + */ + +/* real prio, less is less */ +static inline bool prio_less(struct task_struct *a, struct task_struct *b) +{ + + int pa = __task_prio(a), pb = __task_prio(b); + + if (-pa < -pb) + return true; + + if (-pb < -pa) + return false; + + if (pa == -1) /* dl_prio() doesn't work because of stop_class above */ + return !dl_time_before(a->dl.deadline, b->dl.deadline); + + if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */ + u64 vruntime = b->se.vruntime; + + /* + * Normalize the vruntime if tasks are in different cpus. + */ + if (task_cpu(a) != task_cpu(b)) { + vruntime -= task_cfs_rq(b)->min_vruntime; + vruntime += task_cfs_rq(a)->min_vruntime; + } + + return !((s64)(a->se.vruntime - vruntime) <= 0); + } + + return false; +} + +static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b) +{ + if (a->core_cookie < b->core_cookie) + return true; + + if (a->core_cookie > b->core_cookie) + return false; + + /* flip prio, so high prio is leftmost */ + if (prio_less(b, a)) + return true; + + return false; +} + +static void sched_core_enqueue(struct rq *rq, struct task_struct *p) +{ + struct rb_node *parent, **node; + struct task_struct *node_task; + + rq->core->core_task_seq++; + + if (!p->core_cookie) + return; + + node = &rq->core_tree.rb_node; + parent = *node; + + while (*node) { + node_task = container_of(*node, struct task_struct, core_node); + parent = *node; + + if (__sched_core_less(p, node_task)) + node = &parent->rb_left; + else + node = &parent->rb_right; + } + + 
rb_link_node(&p->core_node, parent, node); + rb_insert_color(&p->core_node, &rq->core_tree); +} + +static void sched_core_dequeue(struct rq *rq, struct task_struct *p) +{ + rq->core->core_task_seq++; + + if (!p->core_cookie) + return; + + rb_erase(&p->core_node, &rq->core_tree); +} + +/* + * Find left-most (aka, highest priority) task matching @cookie. + */ +static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie) +{ + struct rb_node *node = rq->core_tree.rb_node; + struct task_struct *node_task, *match; + + /* + * The idle task always matches any cookie! + */ + match = idle_sched_class.pick_task(rq); + + while (node) { + node_task = container_of(node, struct task_struct, core_node); + + if (cookie < node_task->core_cookie) { + node = node->rb_left; + } else if (cookie > node_task->core_cookie) { + node = node->rb_right; + } else { + match = node_task; + node = node->rb_left; + } + } + + return match; +} + /* * The static-key + stop-machine variable are needed such that: * @@ -135,6 +270,11 @@ void sched_core_put(void) mutex_unlock(&sched_core_mutex); } +#else /* !CONFIG_SCHED_CORE */ + +static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { } +static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { } + #endif /* CONFIG_SCHED_CORE */ /* @@ -1354,6 +1494,9 @@ static inline void init_uclamp(void) { } static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) { + if (sched_core_enabled(rq)) + sched_core_enqueue(rq, p); + if (!(flags & ENQUEUE_NOCLOCK)) update_rq_clock(rq); @@ -1368,6 +1511,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) { + if (sched_core_enabled(rq)) + sched_core_dequeue(rq, p); + if (!(flags & DEQUEUE_NOCLOCK)) update_rq_clock(rq); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cffc59a8b481..d6c932e8d554 100644 --- 
a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -247,33 +247,11 @@ const struct sched_class fair_sched_class; */ #ifdef CONFIG_FAIR_GROUP_SCHED -static inline struct task_struct *task_of(struct sched_entity *se) -{ - SCHED_WARN_ON(!entity_is_task(se)); - return container_of(se, struct task_struct, se); -} /* Walk up scheduling entities hierarchy */ #define for_each_sched_entity(se) \ for (; se; se = se->parent) -static inline struct cfs_rq *task_cfs_rq(struct task_struct *p) -{ - return p->se.cfs_rq; -} - -/* runqueue on which this entity is (to be) queued */ -static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se) -{ - return se->cfs_rq; -} - -/* runqueue "owned" by this group */ -static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) -{ - return grp->my_q; -} - static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len) { if (!path) @@ -434,33 +412,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse) #else /* !CONFIG_FAIR_GROUP_SCHED */ -static inline struct task_struct *task_of(struct sched_entity *se) -{ - return container_of(se, struct task_struct, se); -} - #define for_each_sched_entity(se) \ for (; se; se = NULL) -static inline struct cfs_rq *task_cfs_rq(struct task_struct *p) -{ - return &task_rq(p)->cfs; -} - -static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se) -{ - struct task_struct *p = task_of(se); - struct rq *rq = task_rq(p); - - return &rq->cfs; -} - -/* runqueue "owned" by this group */ -static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) -{ - return NULL; -} - static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len) { if (path) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a3941b2ee29e..a38ae770dfd6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1004,6 +1004,10 @@ struct rq { /* per rq */ struct rq *core; unsigned int core_enabled; + struct rb_root core_tree; + + /* shared state */ + unsigned int 
core_task_seq; #endif }; @@ -1083,6 +1087,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); #define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define raw_rq() raw_cpu_ptr(&runqueues) +#ifdef CONFIG_FAIR_GROUP_SCHED +static inline struct task_struct *task_of(struct sched_entity *se) +{ + SCHED_WARN_ON(!entity_is_task(se)); + return container_of(se, struct task_struct, se); +} + +static inline struct cfs_rq *task_cfs_rq(struct task_struct *p) +{ + return p->se.cfs_rq; +} + +/* runqueue on which this entity is (to be) queued */ +static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se) +{ + return se->cfs_rq; +} + +/* runqueue "owned" by this group */ +static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) +{ + return grp->my_q; +} + +#else + +static inline struct task_struct *task_of(struct sched_entity *se) +{ + return container_of(se, struct task_struct, se); +} + +static inline struct cfs_rq *task_cfs_rq(struct task_struct *p) +{ + return &task_rq(p)->cfs; +} + +static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se) +{ + struct task_struct *p = task_of(se); + struct rq *rq = task_rq(p); + + return &rq->cfs; +} + +/* runqueue "owned" by this group */ +static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) +{ + return NULL; +} +#endif + extern void update_rq_clock(struct rq *rq); static inline u64 __rq_clock_broken(struct rq *rq) -- 2.17.1
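The ordering invariant of the core_tree is what makes `sched_core_find()` cheap: tasks sort first on cookie, then on priority with the higher-priority task leftmost, so the leftmost node within a cookie's range is the best match. A hypothetical user-space model (a sorted array stands in for the RB-tree, and a simple numeric `prio_less()` stands in for the real cross-class comparison):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
    unsigned long cookie;
    int prio;               /* smaller value = higher priority */
};

/* Placeholder for the kernel's prio_less(): true if a is lower priority. */
static bool prio_less(const struct task *a, const struct task *b)
{
    return a->prio > b->prio;
}

/* Mirrors __sched_core_less(): cookie first, then flipped priority so the
 * higher-priority task sorts leftmost within a cookie. */
static bool core_less(const struct task *a, const struct task *b)
{
    if (a->cookie != b->cookie)
        return a->cookie < b->cookie;
    return prio_less(b, a);
}

/* Linear stand-in for sched_core_find(): leftmost task with @cookie.
 * Returns NULL where the kernel would fall back to the idle task. */
static const struct task *core_find(const struct task *sorted, size_t n,
                                    unsigned long cookie)
{
    for (size_t i = 0; i < n; i++)
        if (sorted[i].cookie == cookie)
            return &sorted[i];
    return NULL;
}
```

The NULL fallback here corresponds to the kernel's rule that idle matches any cookie, so a sibling with no compatible task is forced idle rather than running an untrusted one.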
From: Tim Chen <tim.c.chen@linux.intel.com> When we bring a CPU online and enable the core scheduler, tasks that need core scheduling need to be placed in the core's core scheduling queue. Likewise, when we take a CPU offline or disable core scheduling on a core, tasks in the core's core scheduling queue need to be removed. Without such mechanisms, the core scheduler causes oopses due to inconsistent core scheduling state of a task. Implement such enqueue and dequeue mechanisms according to a CPU's change in core scheduling status. The switch of a core's core scheduling mode, and the enqueue/dequeue of tasks on a core's queue due to that mode change, have to run in a separate context, as they cannot be done in the context taking the CPU online/offline. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> --- kernel/sched/core.c | 156 ++++++++++++++++++++++++++++++++++++---- kernel/sched/deadline.c | 35 +++++++++ kernel/sched/fair.c | 38 ++++++++++ kernel/sched/rt.c | 43 +++++++++++ kernel/sched/sched.h | 7 ++ 5 files changed, 264 insertions(+), 15 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 452ce5bb9321..445f0d519336 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -75,6 +75,11 @@ int sysctl_sched_rt_runtime = 950000; #ifdef CONFIG_SCHED_CORE +struct core_sched_cpu_work { + struct work_struct work; + cpumask_t smt_mask; +}; + DEFINE_STATIC_KEY_FALSE(__sched_core_enabled); /* kernel prio, less is more */ @@ -183,6 +188,18 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p) rb_erase(&p->core_node, &rq->core_tree); } +void sched_core_add(struct rq *rq, struct task_struct *p) +{ + if (p->core_cookie && task_on_rq_queued(p)) + sched_core_enqueue(rq, p); +} + +void sched_core_remove(struct rq *rq, struct task_struct *p) +{ + if (sched_core_enqueued(p)) + sched_core_dequeue(rq, p); +} + /* * Find left-most (aka, highest priority) task matching @cookie.
*/ @@ -270,10 +287,132 @@ void sched_core_put(void) mutex_unlock(&sched_core_mutex); } +enum cpu_action { + CPU_ACTIVATE = 1, + CPU_DEACTIVATE = 2 +}; + +static int __activate_cpu_core_sched(void *data); +static int __deactivate_cpu_core_sched(void *data); +static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action); + +static int activate_cpu_core_sched(struct core_sched_cpu_work *work) +{ + if (static_branch_unlikely(&__sched_core_enabled)) + stop_machine(__activate_cpu_core_sched, (void *) work, NULL); + + return 0; +} + +static int deactivate_cpu_core_sched(struct core_sched_cpu_work *work) +{ + if (static_branch_unlikely(&__sched_core_enabled)) + stop_machine(__deactivate_cpu_core_sched, (void *) work, NULL); + + return 0; +} + +static void core_sched_cpu_activate_fn(struct work_struct *work) +{ + struct core_sched_cpu_work *cpu_work; + + cpu_work = container_of(work, struct core_sched_cpu_work, work); + activate_cpu_core_sched(cpu_work); + kfree(cpu_work); +} + +static void core_sched_cpu_deactivate_fn(struct work_struct *work) +{ + struct core_sched_cpu_work *cpu_work; + + cpu_work = container_of(work, struct core_sched_cpu_work, work); + deactivate_cpu_core_sched(cpu_work); + kfree(cpu_work); +} + +static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action) +{ + struct core_sched_cpu_work *work; + + work = kmalloc(sizeof(struct core_sched_cpu_work), GFP_ATOMIC); + if (!work) + return; + + if (action == CPU_ACTIVATE) + INIT_WORK(&work->work, core_sched_cpu_activate_fn); + else + INIT_WORK(&work->work, core_sched_cpu_deactivate_fn); + + cpumask_copy(&work->smt_mask, cpu_smt_mask(cpu)); + + queue_work(system_highpri_wq, &work->work); +} + +static int __activate_cpu_core_sched(void *data) +{ + struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data; + struct rq *rq; + int i; + + if (cpumask_weight(&work->smt_mask) < 2) + return 0; + + for_each_cpu(i, &work->smt_mask) { + const struct sched_class *class; + + rq = 
cpu_rq(i); + + if (rq->core_enabled) + continue; + + for_each_class(class) { + if (!class->core_sched_activate) + continue; + + if (cpu_online(i)) + class->core_sched_activate(rq); + } + + rq->core_enabled = true; + } + return 0; +} + +static int __deactivate_cpu_core_sched(void *data) +{ + struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data; + struct rq *rq; + int i; + + if (cpumask_weight(&work->smt_mask) > 2) + return 0; + + for_each_cpu(i, &work->smt_mask) { + const struct sched_class *class; + + rq = cpu_rq(i); + + if (!rq->core_enabled) + continue; + + for_each_class(class) { + if (!class->core_sched_deactivate) + continue; + + if (cpu_online(i)) + class->core_sched_deactivate(cpu_rq(i)); + } + + rq->core_enabled = false; + } + return 0; +} + #else /* !CONFIG_SCHED_CORE */ static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { } static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { } +static inline void core_sched_cpu_update(unsigned int cpu, int action) { } #endif /* CONFIG_SCHED_CORE */ @@ -6612,13 +6751,8 @@ int sched_cpu_activate(unsigned int cpu) */ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) { static_branch_inc_cpuslocked(&sched_smt_present); -#ifdef CONFIG_SCHED_CORE - if (static_branch_unlikely(&__sched_core_enabled)) { - rq->core_enabled = true; - } -#endif } - + core_sched_cpu_update(cpu, CPU_ACTIVATE); #endif set_cpu_active(cpu, true); @@ -6665,15 +6799,10 @@ int sched_cpu_deactivate(unsigned int cpu) * When going down, decrement the number of cores with SMT present. 
*/ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) { -#ifdef CONFIG_SCHED_CORE - struct rq *rq = cpu_rq(cpu); - if (static_branch_unlikely(&__sched_core_enabled)) { - rq->core_enabled = false; - } -#endif static_branch_dec_cpuslocked(&sched_smt_present); } + core_sched_cpu_update(cpu, CPU_DEACTIVATE); #endif if (!sched_smp_initialized) @@ -6748,9 +6877,6 @@ int sched_cpu_dying(unsigned int cpu) update_max_interval(); nohz_balance_exit_idle(rq); hrtick_clear(rq); -#ifdef CONFIG_SCHED_CORE - rq->core = NULL; -#endif return 0; } #endif diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index ee7fd8611ee4..e916bba0159c 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1773,6 +1773,37 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq, return rb_entry(left, struct sched_dl_entity, rb_node); } +static void for_each_dl_task(struct rq *rq, + void (*fn)(struct rq *rq, struct task_struct *p)) +{ + struct dl_rq *dl_rq = &rq->dl; + struct sched_dl_entity *dl_ent; + struct task_struct *task; + struct rb_node *rb_node; + + rb_node = rb_first_cached(&dl_rq->root); + while (rb_node) { + dl_ent = rb_entry(rb_node, struct sched_dl_entity, rb_node); + task = dl_task_of(dl_ent); + fn(rq, task); + rb_node = rb_next(rb_node); + } +} + +#ifdef CONFIG_SCHED_CORE + +static void core_sched_activate_dl(struct rq *rq) +{ + for_each_dl_task(rq, sched_core_add); +} + +static void core_sched_deactivate_dl(struct rq *rq) +{ + for_each_dl_task(rq, sched_core_remove); +} + +#endif + static struct task_struct *pick_task_dl(struct rq *rq) { struct sched_dl_entity *dl_se; @@ -2460,6 +2491,10 @@ const struct sched_class dl_sched_class = { .rq_online = rq_online_dl, .rq_offline = rq_offline_dl, .task_woken = task_woken_dl, +#ifdef CONFIG_SCHED_CORE + .core_sched_activate = core_sched_activate_dl, + .core_sched_deactivate = core_sched_deactivate_dl, +#endif #endif .task_tick = task_tick_dl, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 
d6c932e8d554..a9eeef896c78 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10249,6 +10249,40 @@ static void rq_offline_fair(struct rq *rq) unthrottle_offline_cfs_rqs(rq); } +static void for_each_fair_task(struct rq *rq, + void (*fn)(struct rq *rq, struct task_struct *p)) +{ + struct cfs_rq *cfs_rq, *pos; + struct sched_entity *se; + struct task_struct *task; + + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { + for (se = __pick_first_entity(cfs_rq); + se != NULL; + se = __pick_next_entity(se)) { + + if (!entity_is_task(se)) + continue; + + task = task_of(se); + fn(rq, task); + } + } +} + +#ifdef CONFIG_SCHED_CORE + +static void core_sched_activate_fair(struct rq *rq) +{ + for_each_fair_task(rq, sched_core_add); +} + +static void core_sched_deactivate_fair(struct rq *rq) +{ + for_each_fair_task(rq, sched_core_remove); +} + +#endif #endif /* CONFIG_SMP */ /* @@ -10769,6 +10803,10 @@ const struct sched_class fair_sched_class = { .task_dead = task_dead_fair, .set_cpus_allowed = set_cpus_allowed_common, +#ifdef CONFIG_SCHED_CORE + .core_sched_activate = core_sched_activate_fair, + .core_sched_deactivate = core_sched_deactivate_fair, +#endif #endif .task_tick = task_tick_fair, diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index d044baedc617..ccb585223fad 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1567,6 +1567,45 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) return rt_task_of(rt_se); } +static void for_each_rt_task(struct rq *rq, + void (*fn)(struct rq *rq, struct task_struct *p)) +{ + rt_rq_iter_t iter; + struct rt_prio_array *array; + struct list_head *queue; + int i; + struct rt_rq *rt_rq = &rq->rt; + struct sched_rt_entity *rt_se = NULL; + struct task_struct *task; + + for_each_rt_rq(rt_rq, iter, rq) { + array = &rt_rq->active; + for (i = 0; i < MAX_RT_PRIO; i++) { + queue = array->queue + i; + list_for_each_entry(rt_se, queue, run_list) { + if (rt_entity_is_task(rt_se)) { + task = rt_task_of(rt_se); + fn(rq, task); + 
} + } + } + } +} + +#ifdef CONFIG_SCHED_CORE + +static void core_sched_activate_rt(struct rq *rq) +{ + for_each_rt_task(rq, sched_core_add); +} + +static void core_sched_deactivate_rt(struct rq *rq) +{ + for_each_rt_task(rq, sched_core_remove); +} + +#endif + static struct task_struct *pick_task_rt(struct rq *rq) { struct task_struct *p; @@ -2384,6 +2423,10 @@ const struct sched_class rt_sched_class = { .rq_offline = rq_offline_rt, .task_woken = task_woken_rt, .switched_from = switched_from_rt, +#ifdef CONFIG_SCHED_CORE + .core_sched_activate = core_sched_activate_rt, + .core_sched_deactivate = core_sched_deactivate_rt, +#endif #endif .task_tick = task_tick_rt, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a38ae770dfd6..03d502357599 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1052,6 +1052,9 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq) return &rq->__lock; } +void sched_core_add(struct rq *rq, struct task_struct *p); +void sched_core_remove(struct rq *rq, struct task_struct *p); + #else /* !CONFIG_SCHED_CORE */ static inline bool sched_core_enabled(struct rq *rq) @@ -1823,6 +1826,10 @@ struct sched_class { void (*rq_online)(struct rq *rq); void (*rq_offline)(struct rq *rq); +#ifdef CONFIG_SCHED_CORE + void (*core_sched_activate)(struct rq *rq); + void (*core_sched_deactivate)(struct rq *rq); +#endif #endif void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); -- 2.17.1
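The patch above registers per-class activate/deactivate hooks that walk every queued task and hand it to a per-task callback (sched_core_add/sched_core_remove). The iteration pattern can be sketched in plain userspace C; the struct layout and names below are invented for illustration and are not the kernel's:

```c
/* Toy model of the for_each_fair_task()/for_each_rt_task() pattern:
 * when core scheduling is toggled on a runqueue, apply a callback to
 * every queued task. Only tagged (non-zero cookie) tasks are assumed
 * to enter the core tree here.
 */
struct toy_task {
	unsigned long core_cookie;	/* 0 = untagged */
	int in_core_tree;
};

struct toy_rq {
	struct toy_task *tasks;
	int nr;
};

static void toy_core_add(struct toy_rq *rq, struct toy_task *p)
{
	(void)rq;
	if (p->core_cookie)		/* assumption: only tagged tasks */
		p->in_core_tree = 1;
}

static void toy_core_remove(struct toy_rq *rq, struct toy_task *p)
{
	(void)rq;
	p->in_core_tree = 0;
}

/* Mirrors the per-class iterators: apply fn to every queued task. */
static void toy_for_each_task(struct toy_rq *rq,
			      void (*fn)(struct toy_rq *, struct toy_task *))
{
	int i;

	for (i = 0; i < rq->nr; i++)
		fn(rq, &rq->tasks[i]);
}

static void toy_core_sched_activate(struct toy_rq *rq)
{
	toy_for_each_task(rq, toy_core_add);
}

static void toy_core_sched_deactivate(struct toy_rq *rq)
{
	toy_for_each_task(rq, toy_core_remove);
}
```

The indirection through a function pointer is what lets one iterator serve both the activate and deactivate paths, exactly as the fair and RT variants in the patch share sched_core_add/sched_core_remove.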
From: Peter Zijlstra <peterz@infradead.org> Instead of only selecting a local task, select a task for all SMT siblings for every reschedule on the core (irrespective of which logical CPU does the reschedule). There could be races in the core scheduler where a CPU is trying to pick a task for its sibling just as that CPU is being offlined. We should not schedule any tasks on the CPU in this case. Return an idle task in pick_next_task for this situation. NOTE: there is still potential for sibling rivalry. NOTE: this is far too complicated; but thus far I've failed to simplify it further. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> --- kernel/sched/core.c | 274 ++++++++++++++++++++++++++++++++++++++++++- kernel/sched/fair.c | 40 +++++++ kernel/sched/sched.h | 6 +- 3 files changed, 318 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 445f0d519336..9a1bd236044e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4253,7 +4253,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt) * Pick up the highest-prio task: */ static inline struct task_struct * -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { const struct sched_class *class; struct task_struct *p; @@ -4309,6 +4309,273 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) BUG(); } +#ifdef CONFIG_SCHED_CORE + +static inline bool cookie_equals(struct task_struct *a, unsigned long cookie) +{ + return is_idle_task(a) || (a->core_cookie == cookie); +} + +static inline bool cookie_match(struct task_struct *a, struct task_struct *b) +{ + if 
(is_idle_task(a) || is_idle_task(b)) + return true; + + return a->core_cookie == b->core_cookie; +} + +// XXX fairness/fwd progress conditions +/* + * Returns + * - NULL if there is no runnable task for this class. + * - the highest priority task for this runqueue if it matches + * rq->core->core_cookie or its priority is greater than max. + * - Else returns idle_task. + */ +static struct task_struct * +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max) +{ + struct task_struct *class_pick, *cookie_pick; + unsigned long cookie = rq->core->core_cookie; + + class_pick = class->pick_task(rq); + if (!class_pick) + return NULL; + + if (!cookie) { + /* + * If class_pick is tagged, return it only if it has + * higher priority than max. + */ + if (max && class_pick->core_cookie && + prio_less(class_pick, max)) + return idle_sched_class.pick_task(rq); + + return class_pick; + } + + /* + * If class_pick is idle or matches cookie, return early. + */ + if (cookie_equals(class_pick, cookie)) + return class_pick; + + cookie_pick = sched_core_find(rq, cookie); + + /* + * If class > max && class > cookie, it is the highest priority task on + * the core (so far) and it must be selected, otherwise we must go with + * the cookie pick in order to satisfy the constraint. 
+ */ + if (prio_less(cookie_pick, class_pick) && + (!max || prio_less(max, class_pick))) + return class_pick; + + return cookie_pick; +} + +static struct task_struct * +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) +{ + struct task_struct *next, *max = NULL; + const struct sched_class *class; + const struct cpumask *smt_mask; + int i, j, cpu; + bool need_sync = false; + + cpu = cpu_of(rq); + if (cpu_is_offline(cpu)) + return idle_sched_class.pick_next_task(rq); + + if (!sched_core_enabled(rq)) + return __pick_next_task(rq, prev, rf); + + /* + * If there were no {en,de}queues since we picked (IOW, the task + * pointers are all still valid), and we haven't scheduled the last + * pick yet, do so now. + */ + if (rq->core->core_pick_seq == rq->core->core_task_seq && + rq->core->core_pick_seq != rq->core_sched_seq) { + WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq); + + next = rq->core_pick; + if (next != prev) { + put_prev_task(rq, prev); + set_next_task(rq, next); + } + return next; + } + + prev->sched_class->put_prev_task(rq, prev); + if (!rq->nr_running) + newidle_balance(rq, rf); + + smt_mask = cpu_smt_mask(cpu); + + /* + * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq + * + * @task_seq guards the task state ({en,de}queues) + * @pick_seq is the @task_seq we did a selection on + * @sched_seq is the @pick_seq we scheduled + * + * However, preemptions can cause multiple picks on the same task set. + * 'Fix' this by also increasing @task_seq for every pick. + */ + rq->core->core_task_seq++; + need_sync = !!rq->core->core_cookie; + + /* reset state */ + rq->core->core_cookie = 0UL; + for_each_cpu(i, smt_mask) { + struct rq *rq_i = cpu_rq(i); + + rq_i->core_pick = NULL; + + if (rq_i->core_forceidle) { + need_sync = true; + rq_i->core_forceidle = false; + } + + if (i != cpu) + update_rq_clock(rq_i); + } + + /* + * Try and select tasks for each sibling in decending sched_class + * order. 
+ */ + for_each_class(class) { +again: + for_each_cpu_wrap(i, smt_mask, cpu) { + struct rq *rq_i = cpu_rq(i); + struct task_struct *p; + + if (cpu_is_offline(i)) { + rq_i->core_pick = rq_i->idle; + continue; + } + + if (rq_i->core_pick) + continue; + + /* + * If this sibling doesn't yet have a suitable task to + * run; ask for the most elegible task, given the + * highest priority task already selected for this + * core. + */ + p = pick_task(rq_i, class, max); + if (!p) { + /* + * If there weren't no cookies; we don't need + * to bother with the other siblings. + */ + if (i == cpu && !need_sync) + goto next_class; + + continue; + } + + /* + * Optimize the 'normal' case where there aren't any + * cookies and we don't need to sync up. + */ + if (i == cpu && !need_sync && !p->core_cookie) { + next = p; + goto done; + } + + rq_i->core_pick = p; + + /* + * If this new candidate is of higher priority than the + * previous; and they're incompatible; we need to wipe + * the slate and start over. pick_task makes sure that + * p's priority is more than max if it doesn't match + * max's cookie. + * + * NOTE: this is a linear max-filter and is thus bounded + * in execution time. + */ + if (!max || !cookie_match(max, p)) { + struct task_struct *old_max = max; + + rq->core->core_cookie = p->core_cookie; + max = p; + + if (old_max) { + for_each_cpu(j, smt_mask) { + if (j == i) + continue; + + cpu_rq(j)->core_pick = NULL; + } + goto again; + } else { + /* + * Once we select a task for a cpu, we + * should not be doing an unconstrained + * pick because it might starve a task + * on a forced idle cpu. + */ + need_sync = true; + } + + } + } +next_class:; + } + + rq->core->core_pick_seq = rq->core->core_task_seq; + next = rq->core_pick; + rq->core_sched_seq = rq->core->core_pick_seq; + + /* + * Reschedule siblings + * + * NOTE: L1TF -- at this point we're no longer running the old task and + * sending an IPI (below) ensures the sibling will no longer be running + * their task. 
This ensures there is no inter-sibling overlap between + * non-matching user state. + */ + for_each_cpu(i, smt_mask) { + struct rq *rq_i = cpu_rq(i); + + if (cpu_is_offline(i)) + continue; + + WARN_ON_ONCE(!rq_i->core_pick); + + if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) + rq_i->core_forceidle = true; + + if (i == cpu) + continue; + + if (rq_i->curr != rq_i->core_pick) + resched_curr(rq_i); + + /* Did we break L1TF mitigation requirements? */ + WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); + } + +done: + set_next_task(rq, next); + return next; +} + +#else /* !CONFIG_SCHED_CORE */ + +static struct task_struct * +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) +{ + return __pick_next_task(rq, prev, rf); +} + +#endif /* CONFIG_SCHED_CORE */ + /* * __schedule() is the main scheduler function. * @@ -7074,7 +7341,12 @@ void __init sched_init(void) #ifdef CONFIG_SCHED_CORE rq->core = NULL; + rq->core_pick = NULL; rq->core_enabled = 0; + rq->core_tree = RB_ROOT; + rq->core_forceidle = false; + + rq->core_cookie = 0UL; #endif } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a9eeef896c78..8432de767730 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4080,6 +4080,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) update_min_vruntime(cfs_rq); } +static inline bool +__entity_slice_used(struct sched_entity *se) +{ + return (se->sum_exec_runtime - se->prev_sum_exec_runtime) > + sched_slice(cfs_rq_of(se), se); +} + /* * Preempt the current task with a newly woken task if needed: */ @@ -10285,6 +10292,34 @@ static void core_sched_deactivate_fair(struct rq *rq) #endif #endif /* CONFIG_SMP */ +#ifdef CONFIG_SCHED_CORE +/* + * If runqueue has only one task which used up its slice and + * if the sibling is forced idle, then trigger schedule + * to give forced idle task a chance. 
+ */ +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se) +{ + int cpu = cpu_of(rq), sibling_cpu; + if (rq->cfs.nr_running > 1 || !__entity_slice_used(se)) + return; + + for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) { + struct rq *sibling_rq; + if (sibling_cpu == cpu) + continue; + if (cpu_is_offline(sibling_cpu)) + continue; + + sibling_rq = cpu_rq(sibling_cpu); + if (sibling_rq->core_forceidle) { + resched_curr(sibling_rq); + } + } +} +#endif + + /* * scheduler tick hitting a task of our scheduling class. * @@ -10308,6 +10343,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) update_misfit_status(curr, rq); update_overutilized_status(task_rq(curr)); + +#ifdef CONFIG_SCHED_CORE + if (sched_core_enabled(rq)) + resched_forceidle_sibling(rq, &curr->se); +#endif } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 03d502357599..a829e26fa43a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1003,11 +1003,16 @@ struct rq { #ifdef CONFIG_SCHED_CORE /* per rq */ struct rq *core; + struct task_struct *core_pick; unsigned int core_enabled; + unsigned int core_sched_seq; struct rb_root core_tree; + bool core_forceidle; /* shared state */ unsigned int core_task_seq; + unsigned int core_pick_seq; + unsigned long core_cookie; #endif }; @@ -1867,7 +1872,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev) static inline void set_next_task(struct rq *rq, struct task_struct *next) { - WARN_ON_ONCE(rq->curr != next); next->sched_class->set_next_task(rq, next, false); } -- 2.17.1
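The core of the patch above is the constraint that a sibling may only run a task whose cookie matches the core-wide cookie (the idle task always matches); an incompatible pick forces the sibling idle. A minimal userspace sketch of that rule, with invented types (it deliberately leaves out the priority comparison against the running max and the cookie_pick fallback):

```c
#include <stddef.h>

/* Toy model of the cookie-matching rule in pick_task(): a sibling may
 * run its class pick only when the pick's cookie matches the core-wide
 * cookie; idle always matches; otherwise the sibling is forced idle.
 */
struct toy_task {
	unsigned long core_cookie;	/* 0 = untagged */
	int idle;			/* 1 for the per-cpu idle task */
};

static struct toy_task toy_idle_task = { 0, 1 };

static int toy_cookie_equals(const struct toy_task *a, unsigned long cookie)
{
	return a->idle || a->core_cookie == cookie;
}

/* What the sibling gets to run, given the core-wide cookie. */
static struct toy_task *toy_sibling_pick(struct toy_task *class_pick,
					 unsigned long core_cookie)
{
	if (!class_pick)
		return &toy_idle_task;	/* nothing runnable in this class */
	if (toy_cookie_equals(class_pick, core_cookie))
		return class_pick;
	return &toy_idle_task;		/* incompatible: force idle */
}
```

Forcing idle here is exactly what produces the core_forceidle accounting and the later resched_forceidle_sibling() tick hook, which gives the starved sibling a chance once the running task has used up its slice.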
From: Aaron Lu <aaron.lu@linux.alibaba.com> Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to return cfs_rq->min_vruntime. It will be used in the following patch, no functionality change. Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com> --- kernel/sched/fair.c | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8432de767730..d99ea6ee7af2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -449,6 +449,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse) #endif /* CONFIG_FAIR_GROUP_SCHED */ +static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq) +{ + return cfs_rq->min_vruntime; +} + static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec); @@ -485,7 +490,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq) struct sched_entity *curr = cfs_rq->curr; struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline); - u64 vruntime = cfs_rq->min_vruntime; + u64 vruntime = cfs_rq_min_vruntime(cfs_rq); if (curr) { if (curr->on_rq) @@ -505,7 +510,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq) } /* ensure we never gain time by being placed backwards. 
*/ - cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime); + cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime); #ifndef CONFIG_64BIT smp_wmb(); cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime; @@ -3833,7 +3838,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {} static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se) { #ifdef CONFIG_SCHED_DEBUG - s64 d = se->vruntime - cfs_rq->min_vruntime; + s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq); if (d < 0) d = -d; @@ -3846,7 +3851,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se) static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) { - u64 vruntime = cfs_rq->min_vruntime; + u64 vruntime = cfs_rq_min_vruntime(cfs_rq); /* * The 'current' period is already promised to the current tasks, @@ -3939,7 +3944,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * update_curr(). */ if (renorm && curr) - se->vruntime += cfs_rq->min_vruntime; + se->vruntime += cfs_rq_min_vruntime(cfs_rq); update_curr(cfs_rq); @@ -3950,7 +3955,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * fairness detriment of existing tasks. */ if (renorm && !curr) - se->vruntime += cfs_rq->min_vruntime; + se->vruntime += cfs_rq_min_vruntime(cfs_rq); /* * When enqueuing a sched_entity, we must: @@ -4063,7 +4068,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * can move min_vruntime forward still more. 
*/ if (!(flags & DEQUEUE_SLEEP)) - se->vruntime -= cfs_rq->min_vruntime; + se->vruntime -= cfs_rq_min_vruntime(cfs_rq); /* return excess runtime on last dequeue */ return_cfs_rq_runtime(cfs_rq); @@ -6396,7 +6401,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) min_vruntime = cfs_rq->min_vruntime; } while (min_vruntime != min_vruntime_copy); #else - min_vruntime = cfs_rq->min_vruntime; + min_vruntime = cfs_rq_min_vruntime(cfs_rq); #endif se->vruntime -= min_vruntime; @@ -10382,7 +10387,7 @@ static void task_fork_fair(struct task_struct *p) resched_curr(rq); } - se->vruntime -= cfs_rq->min_vruntime; + se->vruntime -= cfs_rq_min_vruntime(cfs_rq); rq_unlock(rq, &rf); } @@ -10502,7 +10507,7 @@ static void detach_task_cfs_rq(struct task_struct *p) * cause 'unlimited' sleep bonus. */ place_entity(cfs_rq, se, 0); - se->vruntime -= cfs_rq->min_vruntime; + se->vruntime -= cfs_rq_min_vruntime(cfs_rq); } detach_entity_cfs_rq(se); @@ -10516,7 +10521,7 @@ static void attach_task_cfs_rq(struct task_struct *p) attach_entity_cfs_rq(se); if (!vruntime_normalized(p)) - se->vruntime += cfs_rq->min_vruntime; + se->vruntime += cfs_rq_min_vruntime(cfs_rq); } static void switched_from_fair(struct rq *rq, struct task_struct *p) -- 2.17.1
From: Aaron Lu <aaron.lu@linux.alibaba.com> This patch provides a vruntime-based way to compare two cfs tasks' priority, be it on the same cpu or on different threads of the same core. When the two tasks are on the same CPU, we just need to find a common cfs_rq that both sched_entities are on and then do the comparison. When the two tasks are on different threads of the same core, the root level sched_entities to which the two tasks belong will be used to do the comparison. An ugly illustration for the cross CPU case:

        cpu0                cpu1
      /  |  \             /  |  \
    se1 se2  se3        se4 se5  se6
        / \                  / \
     se21   se22          se61   se62

Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while task B's se is se61. To compare the priority of task A and task B, we compare the priority of se2 and se6. Whichever has the smaller vruntime wins. To make this work, the root level se should have a common cfs_rq min_vruntime, which I call the core cfs_rq min_vruntime. When we adjust the min_vruntime of rq->core, we need to propagate that down the tree so as to not cause starvation of existing tasks based on previous vruntime. Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com> --- kernel/sched/core.c | 15 +------ kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 2 + 3 files changed, 102 insertions(+), 14 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9a1bd236044e..556bf054b896 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b) if (pa == -1) /* dl_prio() doesn't work because of stop_class above */ return !dl_time_before(a->dl.deadline, b->dl.deadline); - if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */ - u64 vruntime = b->se.vruntime; - - /* - * Normalize the vruntime if tasks are in different cpus. 
- */ - if (task_cpu(a) != task_cpu(b)) { - vruntime -= task_cfs_rq(b)->min_vruntime; - vruntime += task_cfs_rq(a)->min_vruntime; - } - - return !((s64)(a->se.vruntime - vruntime) <= 0); - } + if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */ + return cfs_prio_less(a, b); return false; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d99ea6ee7af2..1c9a80d8dbb8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -449,9 +449,105 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse) #endif /* CONFIG_FAIR_GROUP_SCHED */ +static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq) +{ + return &rq_of(cfs_rq)->cfs; +} + +static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq) +{ + return cfs_rq == root_cfs_rq(cfs_rq); +} + +static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq) +{ + return &rq_of(cfs_rq)->core->cfs; +} + static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq) { - return cfs_rq->min_vruntime; + if (!sched_core_enabled(rq_of(cfs_rq))) + return cfs_rq->min_vruntime; + + if (is_root_cfs_rq(cfs_rq)) + return core_cfs_rq(cfs_rq)->min_vruntime; + else + return cfs_rq->min_vruntime; +} + +static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta) +{ + struct sched_entity *se, *next; + + if (!cfs_rq) + return; + + cfs_rq->min_vruntime -= delta; + rbtree_postorder_for_each_entry_safe(se, next, + &cfs_rq->tasks_timeline.rb_root, run_node) { + if (se->vruntime > delta) + se->vruntime -= delta; + if (se->my_q) + coresched_adjust_vruntime(se->my_q, delta); + } +} + +static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq) +{ + struct cfs_rq *cfs_rq_core; + + if (!sched_core_enabled(rq_of(cfs_rq))) + return; + + if (!is_root_cfs_rq(cfs_rq)) + return; + + cfs_rq_core = core_cfs_rq(cfs_rq); + if (cfs_rq_core != cfs_rq && + cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) { + u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime; + coresched_adjust_vruntime(cfs_rq_core, delta); + } +} + 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b) +{ + struct sched_entity *sea = &a->se; + struct sched_entity *seb = &b->se; + bool samecpu = task_cpu(a) == task_cpu(b); + struct task_struct *p; + s64 delta; + + if (samecpu) { + /* vruntime is per cfs_rq */ + while (!is_same_group(sea, seb)) { + int sea_depth = sea->depth; + int seb_depth = seb->depth; + + if (sea_depth >= seb_depth) + sea = parent_entity(sea); + if (sea_depth <= seb_depth) + seb = parent_entity(seb); + } + + delta = (s64)(sea->vruntime - seb->vruntime); + goto out; + } + + /* crosscpu: compare root level se's vruntime to decide priority */ + while (sea->parent) + sea = sea->parent; + while (seb->parent) + seb = seb->parent; + delta = (s64)(sea->vruntime - seb->vruntime); + +out: + p = delta > 0 ? b : a; + trace_printk("picked %s/%d %s: %Ld %Ld %Ld\n", p->comm, p->pid, + samecpu ? "samecpu" : "crosscpu", + sea->vruntime, seb->vruntime, delta); + + return delta > 0; } static __always_inline @@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq) /* ensure we never gain time by being placed backwards. */ cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime); + update_core_cfs_rq_min_vruntime(cfs_rq); #ifndef CONFIG_64BIT smp_wmb(); cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a829e26fa43a..ef9e08e5da6a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2561,6 +2561,8 @@ static inline bool sched_energy_enabled(void) { return false; } #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ +bool cfs_prio_less(struct task_struct *a, struct task_struct *b); + #ifdef CONFIG_MEMBARRIER /* * The scheduler provides memory barriers required by membarrier between: -- 2.17.1
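The cross-CPU branch of cfs_prio_less() reduces to: walk each sched_entity up to its root-level entity, then compare root vruntimes with a signed difference (robust to counter wraparound). A userspace sketch with an invented entity struct, keeping the kernel's convention that the function returns true when `a` has lower priority than `b` (i.e. `b` wins):

```c
#include <stddef.h>

/* Toy model of the cross-CPU comparison in cfs_prio_less(): root-level
 * vruntimes decide; larger vruntime means lower priority. Assumes the
 * roots already share a common (core-wide) min_vruntime baseline, which
 * is what the core cfs_rq min_vruntime machinery in the patch provides.
 */
struct toy_se {
	unsigned long long vruntime;
	struct toy_se *parent;		/* NULL at the root */
};

static int toy_cfs_prio_less(struct toy_se *a, struct toy_se *b)
{
	while (a->parent)
		a = a->parent;
	while (b->parent)
		b = b->parent;
	/* signed compare so a wrapped counter still orders correctly */
	return (long long)(a->vruntime - b->vruntime) > 0;
}
```

Note that without the shared baseline, comparing raw vruntimes from different runqueues would be meaningless, since each cfs_rq's vruntime clock advances independently; that is the motivation for propagating the core-wide min_vruntime down the tree.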
From: Peter Zijlstra <peterz@infradead.org> When a sibling is forced-idle to match the core-cookie; search for matching tasks to fill the core. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/sched.h | 1 + kernel/sched/core.c | 131 +++++++++++++++++++++++++++++++++++++++++- kernel/sched/idle.c | 1 + kernel/sched/sched.h | 6 ++ 4 files changed, 138 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 80ec54706282..c9406a5b678f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -685,6 +685,7 @@ struct task_struct { #ifdef CONFIG_SCHED_CORE struct rb_node core_node; unsigned long core_cookie; + unsigned int core_occupation; #endif #ifdef CONFIG_CGROUP_SCHED diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 556bf054b896..18ee8e10a171 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -218,6 +218,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie) return match; } +static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie) +{ + struct rb_node *node = &p->core_node; + + node = rb_next(node); + if (!node) + return NULL; + + p = container_of(node, struct task_struct, core_node); + if (p->core_cookie != cookie) + return NULL; + + return p; +} + /* * The static-key + stop-machine variable are needed such that: * @@ -4369,7 +4384,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) struct task_struct *next, *max = NULL; const struct sched_class *class; const struct cpumask *smt_mask; - int i, j, cpu; + int i, j, cpu, occ = 0; bool need_sync = false; cpu = cpu_of(rq); @@ -4476,6 +4491,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) goto done; } + if (!is_idle_task(p)) + occ++; + rq_i->core_pick = p; /* @@ -4501,6 +4519,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) cpu_rq(j)->core_pick = NULL; } + occ = 1; goto 
again; } else { /* @@ -4540,6 +4559,8 @@ next_class:; if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) rq_i->core_forceidle = true; + rq_i->core_pick->core_occupation = occ; + if (i == cpu) continue; @@ -4555,6 +4576,114 @@ next_class:; return next; } +static bool try_steal_cookie(int this, int that) +{ + struct rq *dst = cpu_rq(this), *src = cpu_rq(that); + struct task_struct *p; + unsigned long cookie; + bool success = false; + + local_irq_disable(); + double_rq_lock(dst, src); + + cookie = dst->core->core_cookie; + if (!cookie) + goto unlock; + + if (dst->curr != dst->idle) + goto unlock; + + p = sched_core_find(src, cookie); + if (p == src->idle) + goto unlock; + + do { + if (p == src->core_pick || p == src->curr) + goto next; + + if (!cpumask_test_cpu(this, &p->cpus_mask)) + goto next; + + if (p->core_occupation > dst->idle->core_occupation) + goto next; + + p->on_rq = TASK_ON_RQ_MIGRATING; + deactivate_task(src, p, 0); + set_task_cpu(p, this); + activate_task(dst, p, 0); + p->on_rq = TASK_ON_RQ_QUEUED; + + resched_curr(dst); + + success = true; + break; + +next: + p = sched_core_next(p, cookie); + } while (p); + +unlock: + double_rq_unlock(dst, src); + local_irq_enable(); + + return success; +} + +static bool steal_cookie_task(int cpu, struct sched_domain *sd) +{ + int i; + + for_each_cpu_wrap(i, sched_domain_span(sd), cpu) { + if (i == cpu) + continue; + + if (need_resched()) + break; + + if (try_steal_cookie(cpu, i)) + return true; + } + + return false; +} + +static void sched_core_balance(struct rq *rq) +{ + struct sched_domain *sd; + int cpu = cpu_of(rq); + + rcu_read_lock(); + raw_spin_unlock_irq(rq_lockp(rq)); + for_each_domain(cpu, sd) { + if (!(sd->flags & SD_LOAD_BALANCE)) + break; + + if (need_resched()) + break; + + if (steal_cookie_task(cpu, sd)) + break; + } + raw_spin_lock_irq(rq_lockp(rq)); + rcu_read_unlock(); +} + +static DEFINE_PER_CPU(struct callback_head, core_balance_head); + +void queue_core_balance(struct rq *rq) +{ + if 
(!sched_core_enabled(rq)) + return; + + if (!rq->core->core_cookie) + return; + + if (!rq->nr_running) /* not forced idle */ + return; + + queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance); +} + #else /* !CONFIG_SCHED_CORE */ static struct task_struct * diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 46c18e3dab13..b2f08431f0f1 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -395,6 +395,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir { update_idle_core(rq); schedstat_inc(rq->sched_goidle); + queue_core_balance(rq); } static struct task_struct *pick_task_idle(struct rq *rq) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ef9e08e5da6a..552c80b70757 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1057,6 +1057,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq) return &rq->__lock; } +extern void queue_core_balance(struct rq *rq); + void sched_core_add(struct rq *rq, struct task_struct *p); void sched_core_remove(struct rq *rq, struct task_struct *p); @@ -1072,6 +1074,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq) return &rq->__lock; } +static inline void queue_core_balance(struct rq *rq) +{ +} + #endif /* CONFIG_SCHED_CORE */ #ifdef CONFIG_SCHED_SMT -- 2.17.1
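The eligibility checks that try_steal_cookie() applies before pulling a task onto a forced-idle sibling can be collected into one predicate. A hedged userspace sketch with invented types (the real code also takes both runqueue locks and performs the actual migration):

```c
/* Toy model of the filter in try_steal_cookie(): a forced-idle
 * destination may pull a task from another runqueue's core tree only if
 * the cookie matches, the task is not currently picked or running, the
 * task's affinity allows the destination CPU, and the steal would not
 * place it worse than where it is (core_occupation check).
 */
struct toy_task {
	unsigned long cookie;
	int running;			/* src->curr or src->core_pick */
	unsigned long cpus_mask;	/* bit i set: may run on cpu i */
	unsigned int occupation;	/* siblings busy when it was picked */
};

static int toy_can_steal(const struct toy_task *p, unsigned long dst_cookie,
			 int dst_cpu, unsigned int dst_idle_occupation)
{
	if (!dst_cookie || p->cookie != dst_cookie)
		return 0;	/* destination core runs a different tag */
	if (p->running)
		return 0;	/* can't migrate a running/picked task */
	if (!(p->cpus_mask & (1UL << dst_cpu)))
		return 0;	/* affinity forbids the destination */
	if (p->occupation > dst_idle_occupation)
		return 0;	/* would not improve its SMT occupancy */
	return 1;
}
```

Only when all four conditions hold does the real code deactivate the task on the source, set its CPU, activate it on the destination, and resched the destination.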
From: Aubrey Li <aubrey.li@intel.com>

- Don't migrate if there is a cookie mismatch
  Load balancing tries to move a task from the busiest CPU to the destination CPU. When core scheduling is enabled, if the task's cookie does not match the destination CPU's core cookie, this task will be skipped by this CPU. This mitigates the forced idle time on the destination CPU.

- Select a cookie-matched idle CPU
  In the fast path of task wakeup, select the first cookie-matched idle CPU instead of the first idle CPU.

- Find the cookie-matched idlest CPU
  In the slow path of task wakeup, find the idlest CPU whose core cookie matches the task's cookie.

- Don't migrate a task if its cookie does not match
  For NUMA load balancing, don't migrate a task to a CPU whose core cookie does not match the task's cookie.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> --- kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 29 +++++++++++++++++++++++ 2 files changed, 81 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1c9a80d8dbb8..f42ceecb749f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1789,6 +1789,15 @@ static void task_numa_find_cpu(struct task_numa_env *env, if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) continue; +#ifdef CONFIG_SCHED_CORE + /* + * Skip this cpu if source task's cookie does not match + * with CPU's core cookie. 
+ */ + if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) + continue; +#endif + env->dst_cpu = cpu; task_numa_compare(env, taskimp, groupimp, maymove); } @@ -5660,8 +5669,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this /* Traverse only the allowed CPUs */ for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) { + struct rq *rq = cpu_rq(i); + +#ifdef CONFIG_SCHED_CORE + if (!sched_core_cookie_match(rq, p)) + continue; +#endif if (available_idle_cpu(i)) { - struct rq *rq = cpu_rq(i); struct cpuidle_state *idle = idle_get_state(rq); if (idle && idle->exit_latency < min_exit_latency) { /* @@ -5927,8 +5941,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t return si_cpu; if (!cpumask_test_cpu(cpu, p->cpus_ptr)) continue; +#ifdef CONFIG_SCHED_CORE + if (available_idle_cpu(cpu) && + sched_core_cookie_match(cpu_rq(cpu), p)) + break; +#else if (available_idle_cpu(cpu)) break; +#endif if (si_cpu == -1 && sched_idle_cpu(cpu)) si_cpu = cpu; } @@ -7264,8 +7284,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) * We do not migrate tasks that are: * 1) throttled_lb_pair, or * 2) cannot be migrated to this CPU due to cpus_ptr, or - * 3) running (obviously), or - * 4) are cache-hot on their current CPU. + * 3) task's cookie does not match with this CPU's core cookie + * 4) running (obviously), or + * 5) are cache-hot on their current CPU. */ if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) return 0; @@ -7300,6 +7321,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) return 0; } +#ifdef CONFIG_SCHED_CORE + /* + * Don't migrate task if the task's cookie does not match + * with the destination CPU's core cookie. 
+ */ + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p)) + return 0; +#endif + /* Record that we found atleast one task that could run on dst_cpu */ env->flags &= ~LBF_ALL_PINNED; @@ -8498,6 +8528,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, p->cpus_ptr)) continue; +#ifdef CONFIG_SCHED_CORE + if (sched_core_enabled(cpu_rq(this_cpu))) { + int i = 0; + bool cookie_match = false; + + for_each_cpu(i, sched_group_span(group)) { + struct rq *rq = cpu_rq(i); + + if (sched_core_cookie_match(rq, p)) { + cookie_match = true; + break; + } + } + /* Skip over this group if no cookie matched */ + if (!cookie_match) + continue; + } +#endif + local_group = cpumask_test_cpu(this_cpu, sched_group_span(group)); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 552c80b70757..e4019a482f0e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1057,6 +1057,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq) return &rq->__lock; } +/* + * Helper to check if the CPU's core cookie matches with the task's cookie + * when core scheduling is enabled. + * A special case is that the task's cookie always matches with CPU's core + * cookie if the CPU is in an idle core. + */ +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p) +{ + bool idle_core = true; + int cpu; + + /* Ignore cookie match if core scheduler is not enabled on the CPU. */ + if (!sched_core_enabled(rq)) + return true; + + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) { + if (!available_idle_cpu(cpu)) { + idle_core = false; + break; + } + } + + /* + * A CPU in an idle core is always the best choice for tasks with + * cookies. + */ + return idle_core || rq->core->core_cookie == p->core_cookie; +} + extern void queue_core_balance(struct rq *rq); void sched_core_add(struct rq *rq, struct task_struct *p); -- 2.17.1
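All four call sites above funnel into sched_core_cookie_match(), whose rule is: a fully idle core accepts any task, otherwise the cookies must be equal. A userspace sketch of that predicate, with invented parameters standing in for the SMT-mask walk:

```c
/* Toy model of sched_core_cookie_match(): scan the siblings of the
 * destination core; if every one is idle the core accepts any task,
 * otherwise the task's cookie must equal the core's cookie.
 */
static int toy_cookie_match(const int *sibling_busy, int nr_siblings,
			    unsigned long core_cookie,
			    unsigned long task_cookie)
{
	int idle_core = 1;
	int i;

	for (i = 0; i < nr_siblings; i++) {
		if (sibling_busy[i]) {
			idle_core = 0;
			break;
		}
	}

	/* An idle core is always an acceptable destination. */
	return idle_core || core_cookie == task_cookie;
}
```

The idle-core special case matters for load balancing: without it, a tagged task could never be placed on an empty core whose stale core_cookie differs from its own, wasting capacity.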
From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling. A task will need to be moved into the core scheduler queue when the cgroup it belongs to is tagged to run with core scheduling. Similarly, the task will need to be moved out of the core scheduler queue when the cgroup is untagged. Also, after we fork a task, its presence in the core scheduler queue will need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a new task from sneaking into the cgroup and being missed by the update while we iterate through all the tasks in the cgroup. A more complicated scheme could probably avoid the stop machine. Such a scheme would also need to resolve inconsistencies between a task's cgroup core scheduling tag and its residency in the core scheduler queue. We are opting for the simple stop machine mechanism for now, which avoids such complications.

Core scheduling has extra overhead. Enable it only for cores with more than one SMT hardware thread.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> --- kernel/sched/core.c | 186 +++++++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 4 + 2 files changed, 184 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 18ee8e10a171..11e5a2a494ac 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -140,6 +140,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct * return false; } +static bool sched_core_empty(struct rq *rq) +{ + return RB_EMPTY_ROOT(&rq->core_tree); +} + +static bool sched_core_enqueued(struct task_struct *task) +{ + return !RB_EMPTY_NODE(&task->core_node); +} + +static struct task_struct *sched_core_first(struct rq *rq) +{ + struct task_struct *task; + + task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node); + return task; +} + +static void sched_core_flush(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + struct task_struct *task; + + while (!sched_core_empty(rq)) { + task = sched_core_first(rq); + rb_erase(&task->core_node, &rq->core_tree); + RB_CLEAR_NODE(&task->core_node); + } + rq->core->core_task_seq++; +} + static void sched_core_enqueue(struct rq *rq, struct task_struct *p) { struct rb_node *parent, **node; @@ -171,10 +202,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p) { rq->core->core_task_seq++; - if (!p->core_cookie) + if (!sched_core_enqueued(p)) return; rb_erase(&p->core_node, &rq->core_tree); + RB_CLEAR_NODE(&p->core_node); } void sched_core_add(struct rq *rq, struct task_struct *p) @@ -250,8 +282,22 @@ static int __sched_core_stopper(void *data) bool enabled = !!(unsigned long)data; int cpu; - for_each_online_cpu(cpu) - cpu_rq(cpu)->core_enabled = enabled; + if (!enabled) { + for_each_online_cpu(cpu) { + /* + * All 
active and migrating tasks will have already been removed + * from core queue when we clear the cgroup tags. + * However, dying tasks could still be left in core queue. + * Flush them here. + */ + sched_core_flush(cpu); + } + } + + for_each_online_cpu(cpu) { + if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) + cpu_rq(cpu)->core_enabled = enabled; + } return 0; } @@ -261,7 +307,11 @@ static int sched_core_count; static void __sched_core_enable(void) { - // XXX verify there are no cookie tasks (yet) + int cpu; + + /* verify there are no cookie tasks (yet) */ + for_each_online_cpu(cpu) + BUG_ON(!sched_core_empty(cpu_rq(cpu))); static_branch_enable(&__sched_core_enabled); stop_machine(__sched_core_stopper, (void *)true, NULL); @@ -269,8 +319,6 @@ static void __sched_core_enable(void) static void __sched_core_disable(void) { - // XXX verify there are no cookie tasks (left) - stop_machine(__sched_core_stopper, (void *)false, NULL); static_branch_disable(&__sched_core_enabled); } @@ -416,6 +464,7 @@ static int __deactivate_cpu_core_sched(void *data) static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { } static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { } +static bool sched_core_enqueued(struct task_struct *task) { return false; } static inline void core_sched_cpu_update(unsigned int cpu, int action) { } #endif /* CONFIG_SCHED_CORE */ @@ -3268,6 +3317,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_SMP plist_node_init(&p->pushable_tasks, MAX_PRIO); RB_CLEAR_NODE(&p->pushable_dl_tasks); +#endif +#ifdef CONFIG_SCHED_CORE + RB_CLEAR_NODE(&p->core_node); #endif return 0; } @@ -6819,6 +6871,9 @@ void init_idle(struct task_struct *idle, int cpu) #ifdef CONFIG_SMP sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu); #endif +#ifdef CONFIG_SCHED_CORE + RB_CLEAR_NODE(&idle->core_node); +#endif } #ifdef CONFIG_SMP @@ -7796,6 +7851,15 @@ static void sched_change_group(struct 
task_struct *tsk, int type) tg = container_of(task_css_check(tsk, cpu_cgrp_id, true), struct task_group, css); tg = autogroup_task_group(tsk, tg); + +#ifdef CONFIG_SCHED_CORE + if ((unsigned long)tsk->sched_task_group == tsk->core_cookie) + tsk->core_cookie = 0UL; + + if (tg->tagged /* && !tsk->core_cookie ? */) + tsk->core_cookie = (unsigned long)tg; +#endif + tsk->sched_task_group = tg; #ifdef CONFIG_FAIR_GROUP_SCHED @@ -7881,6 +7945,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css) return 0; } +static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css) +{ +#ifdef CONFIG_SCHED_CORE + struct task_group *tg = css_tg(css); + + if (tg->tagged) { + sched_core_put(); + tg->tagged = 0; + } +#endif +} + static void cpu_cgroup_css_released(struct cgroup_subsys_state *css) { struct task_group *tg = css_tg(css); @@ -7910,7 +7986,12 @@ static void cpu_cgroup_fork(struct task_struct *task) rq = task_rq_lock(task, &rf); update_rq_clock(rq); + if (sched_core_enqueued(task)) + sched_core_dequeue(rq, task); sched_change_group(task, TASK_SET_GROUP); + if (sched_core_enabled(rq) && task_on_rq_queued(task) && + task->core_cookie) + sched_core_enqueue(rq, task); task_rq_unlock(rq, task, &rf); } @@ -8436,6 +8517,82 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, } #endif /* CONFIG_RT_GROUP_SCHED */ +#ifdef CONFIG_SCHED_CORE +static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + + return !!tg->tagged; +} + +struct write_core_tag { + struct cgroup_subsys_state *css; + int val; +}; + +static int __sched_write_tag(void *data) +{ + struct write_core_tag *tag = (struct write_core_tag *) data; + struct cgroup_subsys_state *css = tag->css; + int val = tag->val; + struct task_group *tg = css_tg(tag->css); + struct css_task_iter it; + struct task_struct *p; + + tg->tagged = !!val; + + css_task_iter_start(css, 0, &it); + /* + * Note: css_task_iter_next will skip dying 
tasks. + * There could still be dying tasks left in the core queue + * when we set cgroup tag to 0 when the loop is done below. + */ + while ((p = css_task_iter_next(&it))) { + p->core_cookie = !!val ? (unsigned long)tg : 0UL; + + if (sched_core_enqueued(p)) { + sched_core_dequeue(task_rq(p), p); + if (!p->core_cookie) + continue; + } + + if (sched_core_enabled(task_rq(p)) && + p->core_cookie && task_on_rq_queued(p)) + sched_core_enqueue(task_rq(p), p); + + } + css_task_iter_end(&it); + + return 0; +} + +static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val) +{ + struct task_group *tg = css_tg(css); + struct write_core_tag wtag; + + if (val > 1) + return -ERANGE; + + if (!static_branch_likely(&sched_smt_present)) + return -EINVAL; + + if (tg->tagged == !!val) + return 0; + + if (!!val) + sched_core_get(); + + wtag.css = css; + wtag.val = val; + stop_machine(__sched_write_tag, (void *) &wtag, NULL); + if (!val) + sched_core_put(); + + return 0; +} +#endif + static struct cftype cpu_legacy_files[] = { #ifdef CONFIG_FAIR_GROUP_SCHED { @@ -8472,6 +8629,14 @@ static struct cftype cpu_legacy_files[] = { .write_u64 = cpu_rt_period_write_uint, }, #endif +#ifdef CONFIG_SCHED_CORE + { + .name = "tag", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_core_tag_read_u64, + .write_u64 = cpu_core_tag_write_u64, + }, +#endif #ifdef CONFIG_UCLAMP_TASK_GROUP { .name = "uclamp.min", @@ -8645,6 +8810,14 @@ static struct cftype cpu_files[] = { .write_s64 = cpu_weight_nice_write_s64, }, #endif +#ifdef CONFIG_SCHED_CORE + { + .name = "tag", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_core_tag_read_u64, + .write_u64 = cpu_core_tag_write_u64, + }, +#endif #ifdef CONFIG_CFS_BANDWIDTH { .name = "max", @@ -8673,6 +8846,7 @@ static struct cftype cpu_files[] = { struct cgroup_subsys cpu_cgrp_subsys = { .css_alloc = cpu_cgroup_css_alloc, .css_online = cpu_cgroup_css_online, + .css_offline = cpu_cgroup_css_offline, .css_released = 
cpu_cgroup_css_released, .css_free = cpu_cgroup_css_free, .css_extra_stat_show = cpu_extra_stat_show, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e4019a482f0e..2079654b5c87 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -355,6 +355,10 @@ struct cfs_bandwidth { struct task_group { struct cgroup_subsys_state css; +#ifdef CONFIG_SCHED_CORE + int tagged; +#endif + #ifdef CONFIG_FAIR_GROUP_SCHED /* schedulable entities of this group on each CPU */ struct sched_entity **se; -- 2.17.1
From: Peter Zijlstra <peterz@infradead.org> Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 42 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 11e5a2a494ac..a01df3e0b11e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -110,6 +110,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b) int pa = __task_prio(a), pb = __task_prio(b); + trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n", + a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline, + b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline); + if (-pa < -pb) return true; @@ -315,12 +319,16 @@ static void __sched_core_enable(void) static_branch_enable(&__sched_core_enabled); stop_machine(__sched_core_stopper, (void *)true, NULL); + + printk("core sched enabled\n"); } static void __sched_core_disable(void) { stop_machine(__sched_core_stopper, (void *)false, NULL); static_branch_disable(&__sched_core_enabled); + + printk("core sched disabled\n"); } void sched_core_get(void) @@ -4460,6 +4468,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) put_prev_task(rq, prev); set_next_task(rq, next); } + + trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n", + rq->core->core_task_seq, + rq->core->core_pick_seq, + rq->core_sched_seq, + next->comm, next->pid, + next->core_cookie); + return next; } @@ -4540,6 +4556,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) */ if (i == cpu && !need_sync && !p->core_cookie) { next = p; + trace_printk("unconstrained pick: %s/%d %lx\n", + next->comm, next->pid, next->core_cookie); + goto done; } @@ -4548,6 +4567,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) rq_i->core_pick = p; + trace_printk("cpu(%d): selected: %s/%d %lx\n", + i, p->comm, p->pid, p->core_cookie); + /* * If this new candidate is of 
higher priority than the * previous; and they're incompatible; we need to wipe @@ -4564,6 +4586,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) rq->core->core_cookie = p->core_cookie; max = p; + trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie); + if (old_max) { for_each_cpu(j, smt_mask) { if (j == i) @@ -4591,6 +4615,7 @@ next_class:; rq->core->core_pick_seq = rq->core->core_task_seq; next = rq->core_pick; rq->core_sched_seq = rq->core->core_pick_seq; + trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie); /* * Reschedule siblings @@ -4616,11 +4641,20 @@ next_class:; if (i == cpu) continue; - if (rq_i->curr != rq_i->core_pick) + if (rq_i->curr != rq_i->core_pick) { + trace_printk("IPI(%d)\n", i); resched_curr(rq_i); + } /* Did we break L1TF mitigation requirements? */ - WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); + if (unlikely(!cookie_match(next, rq_i->core_pick))) { + trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n", + rq_i->cpu, rq_i->core_pick->comm, + rq_i->core_pick->pid, + rq_i->core_pick->core_cookie, + rq_i->core->core_cookie); + WARN_ON_ONCE(1); + } } done: @@ -4659,6 +4693,10 @@ static bool try_steal_cookie(int this, int that) if (p->core_occupation > dst->idle->core_occupation) goto next; + trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n", + p->comm, p->pid, that, this, + p->core_occupation, dst->idle->core_occupation, cookie); + p->on_rq = TASK_ON_RQ_MIGRATING; deactivate_task(src, p, 0); set_task_cpu(p, this); @@ -7287,6 +7325,8 @@ int sched_cpu_starting(unsigned int cpu) WARN_ON_ONCE(rq->core && rq->core != core_rq); rq->core = core_rq; } + + printk("core: %d -> %d\n", cpu, cpu_of(core_rq)); #endif /* CONFIG_SCHED_CORE */ sched_rq_cpu_starting(cpu); -- 2.17.1
On 3/4/20 8:59 AM, vpillai wrote:
>
> ISSUES
> ------
> - Aaron(Intel) found an issue with load balancing when the tasks have

Just to set the record straight, Aaron works at Alibaba.

> different weights(nice or cgroup shares). Task weight is not considered
> in coresched aware load balancing and causes those higher weights task
> to starve.

Thanks.

Tim
>
> Just to set the record straight, Aaron works at Alibaba.
>
Sorry about this. Thanks for the correction.
~Vineeth
When SCHED_CORE is enabled, sched_cpu_starting() uses thread_sibling as
SMT_MASK to initialize rq->core, but thread_sibling is only ready for
use after store_cpu_topology().

notify_cpu_starting()
-> sched_cpu_starting()    # uses thread_sibling

store_cpu_topology(cpu)
-> update_siblings_masks   # sets thread_sibling

Fix this by calling notify_cpu_starting() later, just like x86 does.

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 arch/arm64/kernel/smp.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 5407bf5d98ac..a427c14e82af 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -236,13 +236,18 @@ asmlinkage notrace void secondary_start_kernel(void)
 	cpuinfo_store_cpu();
 
 	/*
-	 * Enable GIC and timers.
+	 * Store cpu topology before notify_cpu_starting,
+	 * CPUHP_AP_SCHED_STARTING requires SMT topology
+	 * been initialized for SCHED_CORE.
 	 */
-	notify_cpu_starting(cpu);
-
 	store_cpu_topology(cpu);
 	numa_add_cpu(cpu);
 
+	/*
+	 * Enable GIC and timers.
+	 */
+	notify_cpu_starting(cpu);
+
 	/*
 	 * OK, now it's safe to let the boot CPU continue. Wait for
 	 * the CPU migration code to notice that the CPU is online
-- 
2.17.1
(+LAKML, +Sudeep)

On Wed, Apr 01 2020, Cheng Jian wrote:
> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
> the thread_sibling is ready for use.
>
> notify_cpu_starting()
> -> sched_cpu_starting() # use thread_sibling
>
> store_cpu_topology(cpu)
> -> update_siblings_masks # set thread_sibling
>
> Fix this by doing notify_cpu_starting later, just like x86 do.
>

I haven't been following the sched core stuff closely; can't this
rq->core assignment be done in sched_cpu_activate() instead? We already
look at the cpu_smt_mask() in there, and it is valid (we go through the
entirety of secondary_start_kernel() before getting anywhere near
CPUHP_AP_ACTIVE).

I don't think this breaks anything, but without this dependency in
sched_cpu_starting() then there isn't really a reason for this move.

> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
> ---
>  arch/arm64/kernel/smp.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 5407bf5d98ac..a427c14e82af 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -236,13 +236,18 @@ asmlinkage notrace void secondary_start_kernel(void)
>  	cpuinfo_store_cpu();
>
>  	/*
> -	 * Enable GIC and timers.
> +	 * Store cpu topology before notify_cpu_starting,
> +	 * CPUHP_AP_SCHED_STARTING requires SMT topology
> +	 * been initialized for SCHED_CORE.
>  	 */
> -	notify_cpu_starting(cpu);
> -
>  	store_cpu_topology(cpu);
>  	numa_add_cpu(cpu);
>
> +	/*
> +	 * Enable GIC and timers.
> +	 */
> +	notify_cpu_starting(cpu);
> +
>  	/*
>  	 * OK, now it's safe to let the boot CPU continue. Wait for
>  	 * the CPU migration code to notice that the CPU is online
On 2020/4/1 21:23, Valentin Schneider wrote:
> (+LAKML, +Sudeep)
>
> On Wed, Apr 01 2020, Cheng Jian wrote:
>> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
>> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
>> the thread_sibling is ready for use.
>>
>> notify_cpu_starting()
>> -> sched_cpu_starting() # use thread_sibling
>>
>> store_cpu_topology(cpu)
>> -> update_siblings_masks # set thread_sibling
>>
>> Fix this by doing notify_cpu_starting later, just like x86 do.
>>
> I haven't been following the sched core stuff closely; can't this
> rq->core assignment be done in sched_cpu_activate() instead? We already
> look at the cpu_smt_mask() in there, and it is valid (we go through the
> entirety of secondary_start_kernel() before getting anywhere near
> CPUHP_AP_ACTIVE).
>
> I don't think this breaks anything, but without this dependency in
> sched_cpu_starting() then there isn't really a reason for this move.
Yes, it is correct to put the rq->core assignment in sched_cpu_activate().
The cpu_smt_mask is already valid here.
I have made such an attempt on my own branch and passed the test.
Thank you.
-- Cheng Jian
On Wed, Apr 01, 2020 at 02:23:33PM +0100, Valentin Schneider wrote:
>
> (+LAKML, +Sudeep)
>

Thanks Valentin.

> On Wed, Apr 01 2020, Cheng Jian wrote:
> > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
> > SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
> > the thread_sibling is ready for use.
> >
> > notify_cpu_starting()
> > -> sched_cpu_starting() # use thread_sibling
> >
> > store_cpu_topology(cpu)
> > -> update_siblings_masks # set thread_sibling
> >
> > Fix this by doing notify_cpu_starting later, just like x86 do.
> >
>
> I haven't been following the sched core stuff closely; can't this
> rq->core assignment be done in sched_cpu_activate() instead? We already
> look at the cpu_smt_mask() in there, and it is valid (we go through the
> entirety of secondary_start_kernel() before getting anywhere near
> CPUHP_AP_ACTIVE).
>

I too came to the same conclusion. Did you see any issues? Or is it
just code inspection in parity with x86?

> I don't think this breaks anything, but without this dependency in
> sched_cpu_starting() then there isn't really a reason for this move.
>

Based on the commit message, I had a quick look at the x86 code and I agree
this shouldn't break anything. However, the commit message doesn't make
complete sense to me, especially the reference to sched_cpu_starting while
the smt_masks are accessed in sched_cpu_activate. Or am I missing
something here?

--
Regards,
Sudeep
On 09/04/20 10:59, Sudeep Holla wrote: > On Wed, Apr 01, 2020 at 02:23:33PM +0100, Valentin Schneider wrote: >> >> (+LAKML, +Sudeep) >> > > Thanks Valentin. > >> On Wed, Apr 01 2020, Cheng Jian wrote: >> > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as >> > SMT_MASK to initialize rq->core, but only after store_cpu_topology(), >> > the thread_sibling is ready for use. >> > >> > notify_cpu_starting() >> > -> sched_cpu_starting() # use thread_sibling >> > >> > store_cpu_topology(cpu) >> > -> update_siblings_masks # set thread_sibling >> > >> > Fix this by doing notify_cpu_starting later, just like x86 do. >> > >> >> I haven't been following the sched core stuff closely; can't this >> rq->core assignment be done in sched_cpu_activate() instead? We already >> look at the cpu_smt_mask() in there, and it is valid (we go through the >> entirety of secondary_start_kernel() before getting anywhere near >> CPUHP_AP_ACTIVE). >> > > I too came to same conclusion. Did you see any issues ? Or is it > just code inspection in parity with x86 ? > With mainline this isn't a problem; with the core scheduling stuff there is an expectation that we can use the SMT masks in sched_cpu_starting(). >> I don't think this breaks anything, but without this dependency in >> sched_cpu_starting() then there isn't really a reason for this move. >> > > Based on the commit message, had a quick look at x86 code and I agree > this shouldn't break anything. However the commit message does make > complete sense to me, especially reference to sched_cpu_starting > while smt_masks are accessed in sched_cpu_activate. Or am I missing > to understand something here ? As stated above, it's not a problem for mainline, and AIUI we can change the core scheduling bits to only use the SMT mask in sched_cpu_activate() instead, therefore not requiring any change in the arch code. 
I'm not aware of any written rule that the topology masks should be usable
from a given hotplug state upwards, only that right now we need them in
sched_cpu_(de)activate() for SMT scheduling - and that is already working
fine.

So really this should be considered a simple, neutral cleanup; I don't
really have any opinion on picking it up or not.
On Thu, Apr 09, 2020 at 11:32:12AM +0100, Valentin Schneider wrote: > > On 09/04/20 10:59, Sudeep Holla wrote: > > On Wed, Apr 01, 2020 at 02:23:33PM +0100, Valentin Schneider wrote: > >> > >> (+LAKML, +Sudeep) > >> > > > > Thanks Valentin. > > > >> On Wed, Apr 01 2020, Cheng Jian wrote: > >> > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as > >> > SMT_MASK to initialize rq->core, but only after store_cpu_topology(), > >> > the thread_sibling is ready for use. > >> > > >> > notify_cpu_starting() > >> > -> sched_cpu_starting() # use thread_sibling > >> > > >> > store_cpu_topology(cpu) > >> > -> update_siblings_masks # set thread_sibling > >> > > >> > Fix this by doing notify_cpu_starting later, just like x86 do. > >> > > >> > >> I haven't been following the sched core stuff closely; can't this > >> rq->core assignment be done in sched_cpu_activate() instead? We already > >> look at the cpu_smt_mask() in there, and it is valid (we go through the > >> entirety of secondary_start_kernel() before getting anywhere near > >> CPUHP_AP_ACTIVE). > >> > > > > I too came to same conclusion. Did you see any issues ? Or is it > > just code inspection in parity with x86 ? > > > > With mainline this isn't a problem; with the core scheduling stuff there is > an expectation that we can use the SMT masks in sched_cpu_starting(). > Ah, OK. I prefer this to be specified in the commit message as it is not obvious. > >> I don't think this breaks anything, but without this dependency in > >> sched_cpu_starting() then there isn't really a reason for this move. > >> > > > > Based on the commit message, had a quick look at x86 code and I agree > > this shouldn't break anything. However the commit message does make > > complete sense to me, especially reference to sched_cpu_starting > > while smt_masks are accessed in sched_cpu_activate. Or am I missing > > to understand something here ? 
> > As stated above, it's not a problem for mainline, and AIUI we can change > the core scheduling bits to only use the SMT mask in sched_cpu_activate() > instead, therefore not requiring any change in the arch code. > Either way is fine. If it is already set expectation that SMT masks needs to be set before sched_cpu_starting, then let us just stick with that. > I'm not aware of any written rule that the topology masks should be usable > from a given hotplug state upwards, only that right now we need them in > sched_cpu_(de)activate() for SMT scheduling - and that is already working > fine. > Sure, we can at-least document as part of this change even if it is just in ARM64 so that someone need not wonder the same in future. > So really this should be considering as a simple neutral cleanup; I don't > really have any opinion on picking it up or not. I am fine with the change too, just need some tweaking in the commit message. -- Regards, Sudeep
On Wed, Apr 1, 2020 at 7:27 AM Cheng Jian <cj.chengjian@huawei.com> wrote: > > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as > SMT_MASK to initialize rq->core, but only after store_cpu_topology(), > the thread_sibling is ready for use. > > notify_cpu_starting() > -> sched_cpu_starting() # use thread_sibling > > store_cpu_topology(cpu) > -> update_siblings_masks # set thread_sibling > > Fix this by doing notify_cpu_starting later, just like x86 do. > > Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Just a high-level question, why does core-scheduling matter on ARM64? Is it for HPC workloads? Thanks, - Joel > --- > arch/arm64/kernel/smp.c | 11 ++++++++--- > 1 file changed, 8 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 5407bf5d98ac..a427c14e82af 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -236,13 +236,18 @@ asmlinkage notrace void secondary_start_kernel(void) > cpuinfo_store_cpu(); > > /* > - * Enable GIC and timers. > + * Store cpu topology before notify_cpu_starting, > + * CPUHP_AP_SCHED_STARTING requires SMT topology > + * been initialized for SCHED_CORE. > */ > - notify_cpu_starting(cpu); > - > store_cpu_topology(cpu); > numa_add_cpu(cpu); > > + /* > + * Enable GIC and timers. > + */ > + notify_cpu_starting(cpu); > + > /* > * OK, now it's safe to let the boot CPU continue. Wait for > * the CPU migration code to notice that the CPU is online > -- > 2.17.1 >
On 2020/4/10 1:54, Joel Fernandes wrote:
> On Wed, Apr 1, 2020 at 7:27 AM Cheng Jian <cj.chengjian@huawei.com> wrote:
>> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
>> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
>> the thread_sibling is ready for use.
>>
>> notify_cpu_starting()
>> -> sched_cpu_starting() # use thread_sibling
>>
>> store_cpu_topology(cpu)
>> -> update_siblings_masks # set thread_sibling
>>
>> Fix this by doing notify_cpu_starting later, just like x86 do.
>>
>> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
> Just a high-level question, why does core-scheduling matter on ARM64?
> Is it for HPC workloads?
>
> Thanks,
>
> - Joel
Hi, Joel
I am analyzing the mainline scheduling patches and found this problem.
ARM has some platforms that support SMT, and SMT emulation can also
be used.
Thanks.
--Cheng Jian
On Wed, Mar 04, 2020 at 04:59:53PM +0000, vpillai wrote:
> @@ -6400,8 +6464,15 @@ int sched_cpu_activate(unsigned int cpu)
> /*
> * When going up, increment the number of cores with SMT present.
> */
> - if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
> static_branch_inc_cpuslocked(&sched_smt_present);
> +#ifdef CONFIG_SCHED_CORE
> + if (static_branch_unlikely(&__sched_core_enabled)) {
> + rq->core_enabled = true;
> + }
> +#endif
> + }
> +
> #endif
> set_cpu_active(cpu, true);
>
> @@ -6447,8 +6518,16 @@ int sched_cpu_deactivate(unsigned int cpu)
> /*
> * When going down, decrement the number of cores with SMT present.
> */
> - if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
> +#ifdef CONFIG_SCHED_CORE
> + struct rq *rq = cpu_rq(cpu);
> + if (static_branch_unlikely(&__sched_core_enabled)) {
> + rq->core_enabled = false;
> + }
> +#endif
> static_branch_dec_cpuslocked(&sched_smt_present);
> +
> + }
> #endif
>
> if (!sched_smp_initialized)
Aside from the fact that it's probably much saner to write this as:
rq->core_enabled = static_key_enabled(&__sched_core_enabled);
I'm fairly sure I didn't write this part. And while I do somewhat see
the point of disabling core scheduling for a core that has only a single
thread on, I wonder why we care.
The thing is, this directly leads to the utter horror-show that is patch
6.
It should be perfectly possible to core schedule a core with only a
single thread on. It might be a tad silly to do, but it beats the heck
out of the trainwreck created here.
So how did this happen?
On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +        struct task_struct *next, *max = NULL;
> +        const struct sched_class *class;
> +        const struct cpumask *smt_mask;
> +        int i, j, cpu;
> +        bool need_sync = false;

AFAICT that assignment is superfluous. Also, you violated the inverse
x-mas tree.

> +
> +        cpu = cpu_of(rq);
> +        if (cpu_is_offline(cpu))
> +                return idle_sched_class.pick_next_task(rq);

Are we actually hitting this one?

> +        if (!sched_core_enabled(rq))
> +                return __pick_next_task(rq, prev, rf);
> +
> +        /*
> +         * If there were no {en,de}queues since we picked (IOW, the task
> +         * pointers are all still valid), and we haven't scheduled the last
> +         * pick yet, do so now.
> +         */
> +        if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> +            rq->core->core_pick_seq != rq->core_sched_seq) {
> +                WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> +
> +                next = rq->core_pick;
> +                if (next != prev) {
> +                        put_prev_task(rq, prev);
> +                        set_next_task(rq, next);
> +                }
> +                return next;
> +        }
> +
> +        prev->sched_class->put_prev_task(rq, prev);
> +        if (!rq->nr_running)
> +                newidle_balance(rq, rf);

This is wrong per commit:

  6e2df0581f56 ("sched: Fix pick_next_task() vs 'change' pattern race")

> +        smt_mask = cpu_smt_mask(cpu);
> +
> +        /*
> +         * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
> +         *
> +         * @task_seq guards the task state ({en,de}queues)
> +         * @pick_seq is the @task_seq we did a selection on
> +         * @sched_seq is the @pick_seq we scheduled
> +         *
> +         * However, preemptions can cause multiple picks on the same task set.
> +         * 'Fix' this by also increasing @task_seq for every pick.
> +         */
> +        rq->core->core_task_seq++;
> +        need_sync = !!rq->core->core_cookie;
> +
> +        /* reset state */
> +        rq->core->core_cookie = 0UL;
> +        for_each_cpu(i, smt_mask) {
> +                struct rq *rq_i = cpu_rq(i);
> +
> +                rq_i->core_pick = NULL;
> +
> +                if (rq_i->core_forceidle) {
> +                        need_sync = true;
> +                        rq_i->core_forceidle = false;
> +                }
> +
> +                if (i != cpu)
> +                        update_rq_clock(rq_i);
> +        }
> +
> +        /*
> +         * Try and select tasks for each sibling in descending sched_class
> +         * order.
> +         */
> +        for_each_class(class) {
> +again:
> +                for_each_cpu_wrap(i, smt_mask, cpu) {
> +                        struct rq *rq_i = cpu_rq(i);
> +                        struct task_struct *p;
> +
> +                        if (cpu_is_offline(i)) {
> +                                rq_i->core_pick = rq_i->idle;
> +                                continue;
> +                        }

Why are we polluting the 'fast' path with offline crud? Why isn't this
the natural result of running pick_task() on an empty runqueue?

> +
> +                        if (rq_i->core_pick)
> +                                continue;
> +
> +                        /*
> +                         * If this sibling doesn't yet have a suitable task to
> +                         * run; ask for the most eligible task, given the
> +                         * highest priority task already selected for this
> +                         * core.
> +                         */
> +                        p = pick_task(rq_i, class, max);
> +                        if (!p) {
> +                                /*
> +                                 * If there were no cookies; we don't need
> +                                 * to bother with the other siblings.
> +                                 */
> +                                if (i == cpu && !need_sync)
> +                                        goto next_class;
> +
> +                                continue;
> +                        }
> +
> +                        /*
> +                         * Optimize the 'normal' case where there aren't any
> +                         * cookies and we don't need to sync up.
> +                         */
> +                        if (i == cpu && !need_sync && !p->core_cookie) {
> +                                next = p;
> +                                goto done;
> +                        }
> +
> +                        rq_i->core_pick = p;
> +
> +                        /*
> +                         * If this new candidate is of higher priority than the
> +                         * previous; and they're incompatible; we need to wipe
> +                         * the slate and start over. pick_task makes sure that
> +                         * p's priority is more than max if it doesn't match
> +                         * max's cookie.
> +                         *
> +                         * NOTE: this is a linear max-filter and is thus bounded
> +                         * in execution time.
> +                         */
> +                        if (!max || !cookie_match(max, p)) {
> +                                struct task_struct *old_max = max;
> +
> +                                rq->core->core_cookie = p->core_cookie;
> +                                max = p;
> +
> +                                if (old_max) {
> +                                        for_each_cpu(j, smt_mask) {
> +                                                if (j == i)
> +                                                        continue;
> +
> +                                                cpu_rq(j)->core_pick = NULL;
> +                                        }
> +                                        goto again;
> +                                } else {
> +                                        /*
> +                                         * Once we select a task for a cpu, we
> +                                         * should not be doing an unconstrained
> +                                         * pick because it might starve a task
> +                                         * on a forced idle cpu.
> +                                         */
> +                                        need_sync = true;
> +                                }
> +
> +                        }
> +                }
> +next_class:;
> +        }
> +
> +        rq->core->core_pick_seq = rq->core->core_task_seq;
> +        next = rq->core_pick;
> +        rq->core_sched_seq = rq->core->core_pick_seq;
> +
> +        /*
> +         * Reschedule siblings
> +         *
> +         * NOTE: L1TF -- at this point we're no longer running the old task and
> +         * sending an IPI (below) ensures the sibling will no longer be running
> +         * their task. This ensures there is no inter-sibling overlap between
> +         * non-matching user state.
> +         */
> +        for_each_cpu(i, smt_mask) {
> +                struct rq *rq_i = cpu_rq(i);
> +
> +                if (cpu_is_offline(i))
> +                        continue;

Another one; please explain how an offline cpu can be part of the
smt_mask. Last time I checked it got cleared in stop-machine.

> +
> +                WARN_ON_ONCE(!rq_i->core_pick);
> +
> +                if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> +                        rq_i->core_forceidle = true;
> +
> +                if (i == cpu)
> +                        continue;
> +
> +                if (rq_i->curr != rq_i->core_pick)
> +                        resched_curr(rq_i);
> +
> +                /* Did we break L1TF mitigation requirements? */
> +                WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));

That comment is misleading...

> +        }
> +
> +done:
> +        set_next_task(rq, next);
> +        return next;
> +}

----8<----

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a9eeef896c78..8432de767730 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4080,6 +4080,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>          update_min_vruntime(cfs_rq);
>  }
>
> +static inline bool
> +__entity_slice_used(struct sched_entity *se)
> +{
> +        return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
> +                sched_slice(cfs_rq_of(se), se);
> +}
> +
>  /*
>   * Preempt the current task with a newly woken task if needed:
>   */
> @@ -10285,6 +10292,34 @@ static void core_sched_deactivate_fair(struct rq *rq)
>  #endif
>  #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_SCHED_CORE
> +/*
> + * If runqueue has only one task which used up its slice and
> + * if the sibling is forced idle, then trigger schedule
> + * to give forced idle task a chance.
> + */
> +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
> +{
> +        int cpu = cpu_of(rq), sibling_cpu;
> +
> +        if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
> +                return;
> +
> +        for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
> +                struct rq *sibling_rq;
> +
> +                if (sibling_cpu == cpu)
> +                        continue;
> +                if (cpu_is_offline(sibling_cpu))
> +                        continue;
> +
> +                sibling_rq = cpu_rq(sibling_cpu);
> +                if (sibling_rq->core_forceidle) {
> +                        resched_curr(sibling_rq);
> +                }
> +        }
> +}
> +#endif
> +
> +
>  /*
>   * scheduler tick hitting a task of our scheduling class.
>   *
> @@ -10308,6 +10343,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>          update_misfit_status(curr, rq);
>          update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> +        if (sched_core_enabled(rq))
> +                resched_forceidle_sibling(rq, &curr->se);
> +#endif
>  }
>
>  /*

This ^ seems like it should be in its own patch.

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 03d502357599..a829e26fa43a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1003,11 +1003,16 @@ struct rq {
>  #ifdef CONFIG_SCHED_CORE
>          /* per rq */
>          struct rq               *core;
> +        struct task_struct      *core_pick;
>          unsigned int            core_enabled;
> +        unsigned int            core_sched_seq;
>          struct rb_root          core_tree;
> +        bool                    core_forceidle;

Someone forgot that _Bool shouldn't be part of composite types?

>          /* shared state */
>          unsigned int            core_task_seq;
> +        unsigned int            core_pick_seq;
> +        unsigned long           core_cookie;
>  #endif
>  };
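For readers following the seq-counter comment quoted above, the interplay of @task_seq, @pick_seq and @sched_seq can be modeled as a tiny self-contained userspace sketch. The struct layout and names here are illustrative, not the kernel's; the predicate corresponds to the reuse check at the top of pick_next_task():

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the three sequence counters: task_seq advances on every
 * enqueue/dequeue, pick_seq records the task_seq a core-wide selection
 * was made against, and sched_seq records the pick_seq this CPU last
 * acted on. */
struct core { unsigned int task_seq, pick_seq; };
struct cpu  { unsigned int sched_seq; };

/* A cached core-wide pick may be reused iff no enqueue/dequeue happened
 * since the pick was made, and this CPU hasn't scheduled that pick yet. */
static bool can_reuse_pick(const struct core *c, const struct cpu *r)
{
        return c->pick_seq == c->task_seq && c->pick_seq != r->sched_seq;
}
```

The point of the model: a pick is consumed at most once per CPU, and any enqueue/dequeue (or, per the quoted comment, any preemption-forced re-pick) bumps task_seq and invalidates the cached pick.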
On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> From: Aaron Lu <aaron.lu@linux.alibaba.com>
>
> This patch provides a vruntime based way to compare two cfs task's
> priority, be it on the same cpu or different threads of the same core.
>
> When the two tasks are on the same CPU, we just need to find a common
> cfs_rq both sched_entities are on and then do the comparison.
>
> When the two tasks are on different threads of the same core, the root
> level sched_entities to which the two tasks belong will be used to do
> the comparison.
>
> An ugly illustration for the cross CPU case:
>
>        cpu0         cpu1
>      /   |  \     /   |  \
>    se1  se2 se3 se4  se5 se6
>        /   \         /   \
>     se21  se22    se61  se62
>
> Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> task B's se is se61. To compare priority of task A and B, we compare
> priority of se2 and se6. Whose vruntime is smaller, who wins.
>
> To make this work, the root level se should have a common cfs_rq min
> vruntime, which I call the core cfs_rq min vruntime.
>
> When we adjust the min_vruntime of rq->core, we need to propagate
> that down the tree so as to not cause starvation of existing tasks
> based on previous vruntime.

You forgot the time complexity analysis.

> +static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
> +{
> +        struct sched_entity *se, *next;
> +
> +        if (!cfs_rq)
> +                return;
> +
> +        cfs_rq->min_vruntime -= delta;
> +        rbtree_postorder_for_each_entry_safe(se, next,
> +                        &cfs_rq->tasks_timeline.rb_root, run_node) {

Which per this ^

> +                if (se->vruntime > delta)
> +                        se->vruntime -= delta;
> +                if (se->my_q)
> +                        coresched_adjust_vruntime(se->my_q, delta);
> +        }
> +}

> @@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
>
>          /* ensure we never gain time by being placed backwards. */
>          cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
> +        update_core_cfs_rq_min_vruntime(cfs_rq);
>  #ifndef CONFIG_64BIT
>          smp_wmb();
>          cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;

as called from here, is exceedingly important.

Worse, I don't think our post-order iteration is even O(n).

All of this is exceedingly yuck.
On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> TODO
> ----
> - Work on merging patches that are ready to be merged
> - Decide on the API for exposing the feature to userland
> - Experiment with adding synchronization points in VMEXIT to mitigate
>   the VM-to-host-kernel leaking

VMEXIT is too late, you need to hook irq_enter(), which is what makes
the whole thing so horrible.

> - Investigate the source of the overhead even when no tasks are tagged:
>   https://lkml.org/lkml/2019/10/29/242

 - explain why we're all still doing this ....

Seriously, what actual problems does it solve? The patch-set still isn't
L1TF complete and afaict it does exactly nothing for MDS.

Like I've written many times now, back when the world was simpler and
all we had to worry about was L1TF, core-scheduling made some sense, but
how does it make sense today?

It's cute that this series sucks less than it did before, but what are
we trading that performance for?
On Wed, Mar 04, 2020 at 04:59:53PM +0000, vpillai wrote:
> +DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> +
> +/*
> + * The static-key + stop-machine variable are needed such that:
> + *
> + *        spin_lock(rq_lockp(rq));
> + *        ...
> + *        spin_unlock(rq_lockp(rq));
> + *
> + * ends up locking and unlocking the _same_ lock, and all CPUs
> + * always agree on what rq has what lock.
> + *
> + * XXX entirely possible to selectively enable cores, don't bother for now.
> + */
> +static int __sched_core_stopper(void *data)
> +{
> +        bool enabled = !!(unsigned long)data;
> +        int cpu;
> +
> +        for_each_online_cpu(cpu)
> +                cpu_rq(cpu)->core_enabled = enabled;
> +
> +        return 0;
> +}
> +
> +static DEFINE_MUTEX(sched_core_mutex);
> +static int sched_core_count;
> +
> +static void __sched_core_enable(void)
> +{
> +        // XXX verify there are no cookie tasks (yet)
> +
> +        static_branch_enable(&__sched_core_enabled);
> +        stop_machine(__sched_core_stopper, (void *)true, NULL);
> +}
> +
> +static void __sched_core_disable(void)
> +{
> +        // XXX verify there are no cookie tasks (left)
> +
> +        stop_machine(__sched_core_stopper, (void *)false, NULL);
> +        static_branch_disable(&__sched_core_enabled);
> +}

> +static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> +{
> +        if (sched_core_enabled(rq))
> +                return &rq->core->__lock;
> +
> +        return &rq->__lock;
> +}

While reading all this again, I realized it's not too hard to get rid of
stop-machine here.

void __raw_rq_lock(struct rq *rq)
{
        raw_spinlock_t *lock;

        for (;;) {
                lock = rq_lockp(rq);
                raw_spin_lock(lock);
                if (lock == rq_lockp(rq))
                        return;
                raw_spin_unlock(lock);
        }
}

void __sched_core_enable(int core, bool enable)
{
        const struct cpumask *smt_mask;
        int cpu, i = 0;

        smt_mask = cpu_smt_mask(core);

        for_each_cpu(cpu, smt_mask)
                raw_spin_lock_nested(&cpu_rq(cpu)->__lock, i++);

        for_each_cpu(cpu, smt_mask)
                cpu_rq(cpu)->core_enabled = enable;

        for_each_cpu(cpu, smt_mask)
                raw_spin_unlock(&cpu_rq(cpu)->__lock);
}
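The lock-retry idea sketched above (acquire whatever lock the rq currently designates, then re-check the designation) can be exercised in a minimal single-threaded userspace model. A plain flag stands in for a raw spinlock and repointing ->lockp models what the enable path does when it redirects siblings to the shared per-core lock; all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lock { bool held; };
struct rq   { struct lock own; struct lock *lockp; };

static struct lock *rq_lockp(struct rq *rq)
{
        return rq->lockp;   /* in the kernel, this load needs care vs. concurrent repointing */
}

/* Acquire whatever lock the rq currently designates; if the designation
 * changed while we were acquiring it, drop it and retry. */
static struct lock *rq_lock(struct rq *rq)
{
        for (;;) {
                struct lock *lock = rq_lockp(rq);

                lock->held = true;            /* raw_spin_lock()        */
                if (lock == rq_lockp(rq))
                        return lock;          /* still the right lock   */
                lock->held = false;           /* raced with a repoint   */
        }
}
```

The invariant the pattern buys: whoever returns from rq_lock() holds the lock that rq_lockp() designates at that moment, which is what lets the enable/disable path flip ->core_enabled without stop-machine.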
> Aside from the fact that it's probably much saner to write this as:
>
>        rq->core_enabled = static_key_enabled(&__sched_core_enabled);
>
> I'm fairly sure I didn't write this part. And while I do somewhat see
> the point of disabling core scheduling for a core that has only a single
> thread on, I wonder why we care.
>
I think this change was to fix some crashes which happened due to
uninitialized rq->core if a sibling was offline during boot and is
onlined after coresched was enabled.

https://lwn.net/ml/linux-kernel/20190424111913.1386-1-vpillai@digitalocean.com/

I tried to fix it by initializing coresched members during a cpu online
and tearing it down on a cpu offline. This was back in v3 and do not
remember the exact details. I shall revisit this and see if there is a
better way to fix the race condition above.

Thanks,
Vineeth
On Tue, Apr 14, 2020 at 03:56:24PM +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> > From: Aaron Lu <aaron.lu@linux.alibaba.com>
> >
> > This patch provides a vruntime based way to compare two cfs task's
> > priority, be it on the same cpu or different threads of the same core.
> >
> > When the two tasks are on the same CPU, we just need to find a common
> > cfs_rq both sched_entities are on and then do the comparison.
> >
> > When the two tasks are on different threads of the same core, the root
> > level sched_entities to which the two tasks belong will be used to do
> > the comparison.
> >
> > An ugly illustration for the cross CPU case:
> >
> >        cpu0         cpu1
> >      /   |  \     /   |  \
> >    se1  se2 se3 se4  se5 se6
> >        /   \         /   \
> >     se21  se22    se61  se62
> >
> > Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> > task B's se is se61. To compare priority of task A and B, we compare
> > priority of se2 and se6. Whose vruntime is smaller, who wins.
> >
> > To make this work, the root level se should have a common cfs_rq min
> > vruntime, which I call the core cfs_rq min vruntime.
> >
> > When we adjust the min_vruntime of rq->core, we need to propagate
> > that down the tree so as to not cause starvation of existing tasks
> > based on previous vruntime.
>
> You forgot the time complexity analysis.

This is a mistake and the adjust should be needed only once when core
scheduling is initially enabled. It is an initialization thing and there
is no reason to do it in every invocation of coresched_adjust_vruntime().

Vineeth,

I think we have talked about this before and you agreed that it is
needed only once:
https://lore.kernel.org/lkml/20191012035503.GA113034@aaronlu/
https://lore.kernel.org/lkml/CANaguZBevMsQ_Zy1ozKn2Z5Uj6WBviC6UU+zpTQOVdDDLK6r2w@mail.gmail.com/

I'll see how to fix it, but feel free to beat me to it.

> > +static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
> > +{
> > +        struct sched_entity *se, *next;
> > +
> > +        if (!cfs_rq)
> > +                return;
> > +
> > +        cfs_rq->min_vruntime -= delta;
> > +        rbtree_postorder_for_each_entry_safe(se, next,
> > +                        &cfs_rq->tasks_timeline.rb_root, run_node) {
>
> Which per this ^
>
> > +                if (se->vruntime > delta)
> > +                        se->vruntime -= delta;
> > +                if (se->my_q)
> > +                        coresched_adjust_vruntime(se->my_q, delta);
> > +        }
> > +}
>
> > @@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
> >
> >          /* ensure we never gain time by being placed backwards. */
> >          cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
> > +        update_core_cfs_rq_min_vruntime(cfs_rq);
> >  #ifndef CONFIG_64BIT
> >          smp_wmb();
> >          cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>
> as called from here, is exceedingly important.
>
> Worse, I don't think our post-order iteration is even O(n).
>
> All of this is exceedingly yuck.
On Wed, Apr 15, 2020 at 11:34:08AM +0800, Aaron Lu wrote:
> On Tue, Apr 14, 2020 at 03:56:24PM +0200, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> > > From: Aaron Lu <aaron.lu@linux.alibaba.com>
> > >
> > > This patch provides a vruntime based way to compare two cfs task's
> > > priority, be it on the same cpu or different threads of the same core.
> > >
> > > When the two tasks are on the same CPU, we just need to find a common
> > > cfs_rq both sched_entities are on and then do the comparison.
> > >
> > > When the two tasks are on different threads of the same core, the root
> > > level sched_entities to which the two tasks belong will be used to do
> > > the comparison.
> > >
> > > An ugly illustration for the cross CPU case:
> > >
> > > cpu0 cpu1
> > > / | \ / | \
> > > se1 se2 se3 se4 se5 se6
> > > / \ / \
> > > se21 se22 se61 se62
> > >
> > > Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> > > task B's se is se61. To compare priority of task A and B, we compare
> > > priority of se2 and se6. Whose vruntime is smaller, who wins.
> > >
> > > To make this work, the root level se should have a common cfs_rq min
> > > vruntime, which I call the core cfs_rq min vruntime.
> > >
> > > When we adjust the min_vruntime of rq->core, we need to propagate
> > > that down the tree so as to not cause starvation of existing tasks
> > > based on previous vruntime.
> >
> > You forgot the time complexity analysis.
>
> This is a mistake and the adjust should be needed only once when core
> scheduling is initially enabled. It is an initialization thing and there
> is no reason to do it in every invocation of coresched_adjust_vruntime().
Correction...
I meant there is no need to call coresched_adjust_vruntime() in every
invocation of update_core_cfs_rq_min_vruntime().
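The cross-CPU comparison rule from the quoted changelog — walk each task's sched_entity up to the root and compare the root vruntimes with wrap-safe signed arithmetic, as CFS does elsewhere — can be sketched with simplified stand-in types (not the kernel's):

```c
#include <stdbool.h>
#include <stddef.h>

struct se {
        unsigned long long vruntime;
        struct se *parent;          /* NULL for root-level entities */
};

static const struct se *root_of(const struct se *se)
{
        while (se->parent)
                se = se->parent;
        return se;
}

/* true when a should run before b: the smaller root vruntime wins.
 * The cast keeps the comparison correct across u64 wraparound. */
static bool core_wide_prio_before(const struct se *a, const struct se *b)
{
        return (long long)(root_of(a)->vruntime - root_of(b)->vruntime) < 0;
}
```

Note that the leaf entities' own vruntimes never enter the cross-CPU decision — only the root-level entities are compared, which is exactly why the root-level vruntimes of the two siblings must be normalized to a common base.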
On Tue, Apr 14, 2020 at 05:35:07PM -0400, Vineeth Remanan Pillai wrote:
> > Aside from the fact that it's probably much saner to write this as:
> >
> > rq->core_enabled = static_key_enabled(&__sched_core_enabled);
> >
> > I'm fairly sure I didn't write this part. And while I do somewhat see
> > the point of disabling core scheduling for a core that has only a single
> > thread on, I wonder why we care.
> >
> I think this change was to fix some crashes which happened due to
> uninitialized rq->core if a sibling was offline during boot and is
> onlined after coresched was enabled.
>
> https://lwn.net/ml/linux-kernel/20190424111913.1386-1-vpillai@digitalocean.com/
>
> I tried to fix it by initializing coresched members during a cpu online
> and tearing it down on a cpu offline. This was back in v3 and do not
> remember the exact details. I shall revisit this and see if there is a
> better way to fix the race condition above.
Argh, that problem again. So AFAIK booting with maxcpus= is broken in a
whole number of 'interesting' ways. I'm not sure what to do about that,
perhaps we should add a config around that option and make it depend on
CONFIG_BROKEN.
That said; I'm thinking it shouldn't be too hard to fix up the core
state before we add the CPU to the masks, but it will be arch specific.
See speculative_store_bypass_ht_init() for inspiration, but you'll need
to be even earlier, before set_cpu_sibling_map() in smp_callin() on x86
(no clue about other archs).
Even without maxcpus= this can happen when you do physical hotplug and
add a part (or replace one where the new part has more cores than the
old).
The moment core-scheduling is enabled and you're adding unknown
topology, we need to set up state before we publish the mask,... or I
suppose endlessly do: 'smt_mask & active_mask' all over the place :/ In
which case you can indeed do it purely in sched/core.
Hurmph...
On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > TODO
> > ----
> > - Work on merging patches that are ready to be merged
> > - Decide on the API for exposing the feature to userland
> > - Experiment with adding synchronization points in VMEXIT to mitigate
> >   the VM-to-host-kernel leaking
>
> VMEXIT is too late, you need to hook irq_enter(), which is what makes
> the whole thing so horrible.

We came up with a patch to do this as well. Currently testing it more and
it looks clean, will share it soon.

> > - Investigate the source of the overhead even when no tasks are tagged:
> >   https://lkml.org/lkml/2019/10/29/242
>
> - explain why we're all still doing this ....
>
> Seriously, what actual problems does it solve? The patch-set still isn't
> L1TF complete and afaict it does exactly nothing for MDS.

The L1TF incompleteness is because of cross-HT attack from a guest vCPU
attacker to an interrupt/softirq executing on the other sibling, correct?
The IRQ enter pausing the other sibling should fix that (which we will
share in a future series revision after adequate testing).

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense, but
> how does it make sense today?

For ChromeOS we're planning to tag each and every task separately except
for trusted processes, so we are isolating untrusted tasks even from each
other. Sorry if this sounds like pushing my usecase, but we do get a
parallelism advantage for the trusted tasks while still solving all
security issues (for ChromeOS).

I agree that cross-HT user <-> kernel MDS is still an issue if untrusted
(tagged) tasks execute together on the same core, but we are not planning
to do that on our setup at least.

> It's cute that this series sucks less than it did before, but what are
> we trading that performance for?

AIUI, the performance improves vs noht in the recent series. I am told
that is the case in recent postings of the series.

thanks,

- Joel
> > > You forgot the time complexity analysis.
> >
> > This is a mistake and the adjust should be needed only once when core
> > scheduling is initially enabled. It is an initialization thing and there
> > is no reason to do it in every invocation of coresched_adjust_vruntime().
>
> Correction...
> I meant there is no need to call coresched_adjust_vruntime() in every
> invocation of update_core_cfs_rq_min_vruntime().
Due to the checks in place, update_core_cfs_rq_min_vruntime should
not be calling coresched_adjust_vruntime more than once between a
coresched enable/disable. Once the min_vruntime is adjusted, we depend
only on rq->core and the other sibling's min_vruntime will not grow
until coresched disable.
I did some micro benchmark tests today to verify this and observed
that coresched_adjust_vruntime called at most once between a coresched
enable/disable.
Thanks,
Vineeth
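The one-shot adjustment being discussed amounts to shifting every root-level entity on the sibling by one constant delta, computed from the two siblings' min_vruntime values; since every entity moves by the same amount, their relative order is preserved. A minimal sketch, with an array standing in for the rbtree and all names illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Normalize the sibling's root-level entities onto this rq's vruntime
 * base.  Returns the delta applied, so callers can also fold it into
 * the sibling's min_vruntime if they track it. */
static long long normalize_sibling(unsigned long long this_min,
                                   unsigned long long sibling_min,
                                   unsigned long long *root_se_vruntime,
                                   size_t n)
{
        long long delta = (long long)(this_min - sibling_min);
        size_t i;

        /* one pass over the sibling's root entities; done once at
         * coresched enable, not on every min_vruntime update */
        for (i = 0; i < n; i++)
                root_se_vruntime[i] += (unsigned long long)delta;

        return delta;
}
```

The "at most once between enable/disable" property observed above corresponds to calling this only from the enable path, after which both siblings share a single core-wide min_vruntime.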
On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> There could be races in core scheduler where a CPU is trying to pick
> a task for its sibling in core scheduler, when that CPU has just been
> offlined. We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
>
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
[cut]

Hi Vineeth,

A NULL pointer exception was found when testing v5 on top of stable
v5.6.2. And we tried the patch Peter suggested; the NULL pointer was
not found so far. We don't know if this change would help mitigate the
symptom, but it should do no harm to test with this fix applied.

Thanks,
Chenyu

From 6828eaf4611eeb3e1bad3b9a0d4ec53c6fa01fe3 Mon Sep 17 00:00:00 2001
From: Chen Yu <yu.c.chen@intel.com>
Date: Thu, 16 Apr 2020 10:51:07 +0800
Subject: [PATCH] sched: Fix pick_next_task() race condition in core scheduling

As Peter mentioned, commit 6e2df0581f56 ("sched: Fix pick_next_task()
vs 'change' pattern race") fixed a race condition due to rq->lock being
improperly released after put_prev_task(); backport this fix to core
scheduling's pick_next_task() as well.

Without this fix, Aubrey, Long and I found a NULL pointer exception
triggered within one hour when running RDT MBA (Intel Resource Director
Technology, Memory Bandwidth Allocation) benchmarks on a 36-core (72-HT)
platform, which tries to dereference a NULL sched_entity:

[ 3618.429053] BUG: kernel NULL pointer dereference, address: 0000000000000160
[ 3618.429039] RIP: 0010:pick_task_fair+0x2e/0xa0
[ 3618.429042] RSP: 0018:ffffc90000317da8 EFLAGS: 00010046
[ 3618.429044] RAX: 0000000000000000 RBX: ffff88afdf4ad100 RCX: 0000000000000001
[ 3618.429045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88afdf4ad100
[ 3618.429045] RBP: ffffc90000317dc0 R08: 0000000000000048 R09: 0100000000100000
[ 3618.429046] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 3618.429047] R13: 000000000002d080 R14: ffff88afdf4ad080 R15: 0000000000000014
[ 3618.429048]  ? pick_task_fair+0x48/0xa0
[ 3618.429048]  pick_next_task+0x34c/0x7e0
[ 3618.429049]  ? tick_program_event+0x44/0x70
[ 3618.429049]  __schedule+0xee/0x5d0
[ 3618.429050]  schedule_idle+0x2c/0x40
[ 3618.429051]  do_idle+0x175/0x280
[ 3618.429051]  cpu_startup_entry+0x1d/0x30
[ 3618.429052]  start_secondary+0x169/0x1c0
[ 3618.429052]  secondary_startup_64+0xa4/0xb0

While with this patch applied, no NULL pointer exception was found
within 14 hours for now. Although there's no direct evidence this fix
would solve the issue, it does fix a potential race condition.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/core.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02495d44870f..ef101a3ef583 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
                 return next;
         }
 
-        prev->sched_class->put_prev_task(rq, prev);
-        if (!rq->nr_running)
-                newidle_balance(rq, rf);
+
+#ifdef CONFIG_SMP
+        for_class_range(class, prev->sched_class, &idle_sched_class) {
+                if (class->balance(rq, prev, rf))
+                        break;
+        }
+#endif
+        put_prev_task(rq, prev);
 
         smt_mask = cpu_smt_mask(cpu);
 
-- 
2.20.1
> Hi Vineeth,
> A NULL pointer exception was found when testing v5 on top of stable
> v5.6.2. And we tried the patch Peter suggested, the NULL pointer
> was not found so far. We don't know if this change would help mitigate
> the symptom, but it should do no harm to test with this fix applied.
>
Thanks for the patch Chenyu!
I was also looking into this as part of addressing Peter's comments. I
shall include this patch in v6 along with addressing all the issues
that Peter pointed out in this thread.
Thanks,
Vineeth
On 4/14/20 6:35 AM, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
>> +static struct task_struct *
>> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> +{
>> + struct task_struct *next, *max = NULL;
>> + const struct sched_class *class;
>> + const struct cpumask *smt_mask;
>> + int i, j, cpu;
>> + bool need_sync = false;
>
> AFAICT that assignment is superfluous. Also, you violated the inverse
> x-mas tree.
>
>> +
>> + cpu = cpu_of(rq);
>> + if (cpu_is_offline(cpu))
>> + return idle_sched_class.pick_next_task(rq);
>
> Are we actually hitting this one?
>
I did hit this race when I was testing taking cpu offline and online,
which prompted the check of cpu being offline.
Tim
On Wed, Apr 15, 2020 at 05:24:30PM -0400, Vineeth Remanan Pillai wrote:
> > > > You forgot the time complexity analysis.
> > >
> > > This is a mistake and the adjust should be needed only once when core
> > > scheduling is initially enabled. It is an initialization thing and there
> > > is no reason to do it in every invocation of coresched_adjust_vruntime().
> >
> > Correction...
> > I meant there is no need to call coresched_adjust_vruntime() in every
> > invocation of update_core_cfs_rq_min_vruntime().
>
> Due to the checks in place, update_core_cfs_rq_min_vruntime should
> not be calling coresched_adjust_vruntime more than once between a
> coresched enable/disable. Once the min_vruntime is adjusted, we depend
> only on rq->core and the other sibling's min_vruntime will not grow
> until coresched disable.

OK, but I prefer to make it clear that this is initialization only
stuff. Below is what I cooked, I also enhanced the changelog while at
it.

From e80121e61953da717da074ea2a097194f6d29ef4 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Thu, 25 Jul 2019 22:32:48 +0800
Subject: [PATCH] sched/fair: core vruntime comparison

This patch provides a vruntime based way to compare two cfs task's
priority, be it on the same cpu or different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each
thread will choose the next task to run the usual way and then the
root level sched entities which the two tasks belong to will be used
to decide which task runs next core wide.

An illustration for the cross CPU case:

        cpu0         cpu1
      /   |  \     /   |  \
    se1  se2 se3 se4  se5 se6
        /   \         /   \
     se21  se22    se61  se62
     (A)               /
                    se621
                     (B)

Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
run next and cpu1 has decided task B to run next. To compare priority
of task A and B, we compare priority of se2 and se6. Whose vruntime is
smaller, who wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So a new core wide cfs_rq
min_vruntime is introduced to serve the purpose of normalizing these
root level sched entities' vruntime.

All sub cfs_rqs and sched entities are not interesting in cross cpu
priority comparison as they will only participate in the usual cpu
local schedule decision so no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  |  24 ++++------
 kernel/sched/fair.c  | 101 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   3 ++
 3 files changed, 111 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..d6c8c76cb07a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
         if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
                 return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-        if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
-                u64 vruntime = b->se.vruntime;
-
-                /*
-                 * Normalize the vruntime if tasks are in different cpus.
-                 */
-                if (task_cpu(a) != task_cpu(b)) {
-                        vruntime -= task_cfs_rq(b)->min_vruntime;
-                        vruntime += task_cfs_rq(a)->min_vruntime;
-                }
-
-                return !((s64)(a->se.vruntime - vruntime) <= 0);
-        }
+        if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+                return cfs_prio_less(a, b);
 
         return false;
 }
@@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
         }
 
         for_each_online_cpu(cpu) {
-                if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-                        cpu_rq(cpu)->core_enabled = enabled;
+                if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+                        struct rq *rq = cpu_rq(cpu);
+
+                        rq->core_enabled = enabled;
+                        if (rq->core == rq)
+                                sched_core_adjust_se_vruntime(cpu);
+                }
         }
 
         return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..7eecf590d6c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,103 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+        return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+        return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+        return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-        return cfs_rq->min_vruntime;
+        if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+                return cfs_rq->min_vruntime;
+
+        return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+        struct sched_entity *sea = &a->se;
+        struct sched_entity *seb = &b->se;
+        bool samecpu = task_cpu(a) == task_cpu(b);
+        s64 delta;
+
+        if (samecpu) {
+                /* vruntime is per cfs_rq */
+                while (!is_same_group(sea, seb)) {
+                        int sea_depth = sea->depth;
+                        int seb_depth = seb->depth;
+
+                        if (sea_depth >= seb_depth)
+                                sea = parent_entity(sea);
+                        if (sea_depth <= seb_depth)
+                                seb = parent_entity(seb);
+                }
+
+                delta = (s64)(sea->vruntime - seb->vruntime);
+                goto out;
+        }
+
+        /* crosscpu: compare root level se's vruntime to decide priority */
+        while (sea->parent)
+                sea = sea->parent;
+        while (seb->parent)
+                seb = seb->parent;
+        delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+        return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_adjust_se_vruntime(int cpu)
+{
+        int i;
+
+        for_each_cpu(i, cpu_smt_mask(cpu)) {
+                struct cfs_rq *cfs_rq, *sibling_cfs_rq;
+                struct sched_entity *se, *next;
+                s64 delta;
+
+                if (i == cpu)
+                        continue;
+
+                sibling_cfs_rq = &cpu_rq(i)->cfs;
+                if (!sibling_cfs_rq->nr_running)
+                        continue;
+
+                cfs_rq = &cpu_rq(cpu)->cfs;
+                delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+                /*
+                 * XXX Malicious user can create a ton of runnable tasks in root
+                 * sibling_cfs_rq and cause the below vruntime normalization
+                 * potentially taking a long time.
+                 */
+                rbtree_postorder_for_each_entry_safe(se, next,
+                                &sibling_cfs_rq->tasks_timeline.rb_root,
+                                run_node) {
+                        se->vruntime += delta;
+                }
+        }
 }
 
 static __always_inline
@@ -509,8 +603,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
                         vruntime = min_vruntime(vruntime, se->vruntime);
         }
 
+        if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+                cfs_rq = core_cfs_rq(cfs_rq);
+
         /* ensure we never gain time by being placed backwards. */
-        cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
         smp_wmb();
         cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..24bae760f764 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,6 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_adjust_se_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7
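The same-CPU branch of cfs_prio_less() in the patch above finds a common cfs_rq by stepping the deeper entity up first (both when depths are equal), then comparing vruntimes there. A userspace sketch of that depth walk, with an opaque 'group' tag standing in for the cfs_rq and all names illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

struct se {
        int depth;             /* root-level entities have depth 0    */
        const void *group;     /* tag for the cfs_rq this se is on    */
        struct se *parent;     /* NULL at the root level              */
        long long vruntime;
};

/* true when a should run before b, for entities on the same CPU */
static bool same_cpu_prio_before(const struct se *a, const struct se *b)
{
        while (a->group != b->group) {
                /* snapshot depths first, as the kernel code does, so
                 * both entities step up in the same iteration when
                 * their depths are equal */
                int ad = a->depth, bd = b->depth;

                if (ad >= bd)
                        a = a->parent;
                if (ad <= bd)
                        b = b->parent;
        }
        return a->vruntime < b->vruntime;
}
```

Once both entities sit on the same cfs_rq their vruntimes are directly comparable; the walk is bounded by the cgroup hierarchy depth.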
On Thu, Apr 16, 2020 at 04:32:28PM -0700, Tim Chen wrote:
>
>
> On 4/14/20 6:35 AM, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> >> +static struct task_struct *
> >> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >> +{
> >> + struct task_struct *next, *max = NULL;
> >> + const struct sched_class *class;
> >> + const struct cpumask *smt_mask;
> >> + int i, j, cpu;
> >> + bool need_sync = false;
> >
> > AFAICT that assignment is superfluous. Also, you violated the inverse
> > x-mas tree.
> >
> >> +
> >> + cpu = cpu_of(rq);
> >> + if (cpu_is_offline(cpu))
> >> + return idle_sched_class.pick_next_task(rq);
> >
> > Are we actually hitting this one?
> >
>
> I did hit this race when I was testing taking cpu offline and online,
> which prompted the check of cpu being offline.
This is the schedule from the stop task to the idle task I presume,
there should really not be any other. And at that point the rq had
better be empty, so why didn't the normal task selection work?
On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > > TODO
> > > ----
> > > - Work on merging patches that are ready to be merged
> > > - Decide on the API for exposing the feature to userland
> > > - Experiment with adding synchronization points in VMEXIT to mitigate
> > >   the VM-to-host-kernel leaking
> >
> > VMEXIT is too late, you need to hook irq_enter(), which is what makes
> > the whole thing so horrible.
>
> We came up with a patch to do this as well. Currently testing it more and it looks clean, will share it soon.

Thomas said we actually first do VMEXIT, and then enable interrupts. So the VMEXIT thing should actually work, and that is indeed much saner than sticking it in irq_enter().

It does however put yet more nails in the out-of-tree hypervisors.

> > > - Investigate the source of the overhead even when no tasks are tagged:
> > >   https://lkml.org/lkml/2019/10/29/242
> >
> > - explain why we're all still doing this ....
> >
> > Seriously, what actual problems does it solve? The patch-set still isn't
> > L1TF complete and afaict it does exactly nothing for MDS.
>
> The L1TF incompleteness is because of cross-HT attack from Guest vCPU attacker to an interrupt/softirq executing on the other sibling, correct? The IRQ enter pausing the other sibling should fix that (which we will share in a future series revision after adequate testing).

Correct, the vCPU still running can glean host (kernel) state from the sibling handling the interrupt in the host kernel.

> > Like I've written many times now, back when the world was simpler and
> > all we had to worry about was L1TF, core-scheduling made some sense, but
> > how does it make sense today?
>
> For ChromeOS we're planning to tag each and every task separately except for trusted processes, so we are isolating untrusted tasks even from each other.
>
> Sorry if this sounds like pushing my usecase, but we do get a parallelism advantage for the trusted tasks while still solving all security issues (for ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if untrusted (tagged) tasks execute together on the same core, but we are not planning to do that on our setup at least.

That doesn't completely solve things I think. Even if you run all untrusted tasks as core exclusive, you still have a problem of them vs interrupts on the other sibling.

You need to somehow arrange all interrupts to the core happen on the same sibling that runs your untrusted task, such that the VERW on return-to-userspace works as intended.

I suppose you can try and play funny games with interrupt routing tied to the force-idle state, but I'm dreading what that'll look like. Or were you going to handle this from your irq_enter() thing too?

Can someone go write up a document that very clearly states all the problems and clearly explains how to use this feature?
On Thu, Apr 16, 2020 at 11:39:05AM +0800, Chen Yu wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02495d44870f..ef101a3ef583 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> return next;
> }
>
> - prev->sched_class->put_prev_task(rq, prev);
> - if (!rq->nr_running)
> - newidle_balance(rq, rf);
> +
> +#ifdef CONFIG_SMP
> + for_class_range(class, prev->sched_class, &idle_sched_class) {
> + if (class->balance(rq, prev, rf))
> + break;
> + }
> +#endif
> + put_prev_task(rq, prev);
>
> smt_mask = cpu_smt_mask(cpu);
Instead of duplicating that, how about you put the existing copy in a
function to share? finish_prev_task() perhaps?
Also, can you please make newidle_balance() static again; I forgot doing
that in 6e2df0581f56, which would've made you notice this sooner I
suppose.
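The refactoring suggested here is to hoist the shared balance-then-put sequence into one helper both pick_next_task() paths call. A user-space toy of that control flow (the class table and the `finish_prev_task` shape are a sketch of the suggestion, not the kernel's actual structures):

```c
#include <stddef.h>

struct rq {
	int nr_running;
};

/* Mock of the kernel's sched_class chain, highest priority first. */
struct sched_class {
	const char *name;
	/* Returns nonzero when this class has (or managed to pull) work. */
	int (*balance)(struct rq *rq);
};

static int rt_balance(struct rq *rq)   { return rq->nr_running > 0; }
static int fair_balance(struct rq *rq) { return rq->nr_running > 0; }
static int idle_balance(struct rq *rq) { (void)rq; return 1; }

static const struct sched_class classes[] = {
	{ "rt",   rt_balance },
	{ "fair", fair_balance },
	{ "idle", idle_balance },
};

/*
 * In the spirit of the suggested finish_prev_task(): walk from prev's class
 * down to idle, stopping at the first class whose balance callback reports
 * runnable work; the caller then does put_prev_task() and picks from there.
 */
static const struct sched_class *finish_prev_task(struct rq *rq,
						  size_t prev_class)
{
	size_t i;

	for (i = prev_class; i < sizeof(classes) / sizeof(classes[0]); i++) {
		if (classes[i].balance(rq))
			return &classes[i];
	}
	return &classes[2]; /* idle always accepts */
}
```

The point of sharing the helper is exactly the bug under discussion: if one copy of the loop drifts from the other (e.g. one calls newidle_balance() directly), the two pick paths stop agreeing on when balancing ran.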
On 17.04.20 13:12, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
>> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
>>> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
>>>> TODO
>>>> ----
>>>> - Work on merging patches that are ready to be merged
>>>> - Decide on the API for exposing the feature to userland
>>>> - Experiment with adding synchronization points in VMEXIT to mitigate
>>>>   the VM-to-host-kernel leaking
>>>
>>> VMEXIT is too late, you need to hook irq_enter(), which is what makes
>>> the whole thing so horrible.
>>
>> We came up with a patch to do this as well. Currently testing it more and it looks clean, will share it soon.
>
> Thomas said we actually first do VMEXIT, and then enable interrupts. So the VMEXIT thing should actually work, and that is indeed much saner than sticking it in irq_enter().

If we first kick out the sibling HT for every #VMEXIT, performance will be abysmal, no?

I know of a few options to make this work without the big hammer:

  1) Leave interrupts disabled on "fast-path" exits. This can become very hard to grasp very quickly.

  2) Patch the IRQ handlers (or build something more generic that installs a trampoline on all IRQ handler installations)

  3) Ignore IRQ data exposure (what could possibly go wrong, it's not like your IRQ handler reads secret data from the network, right)

  4) Create a "safe" page table which runs with HT enabled. Any access outside of the "safe" zone disables the sibling and switches to the "full" kernel page table. This should prevent any secret data from being fetched into caches/core buffers.

  5) Create a KVM specific "safe zone": Keep improving the ASI patches and make only the ASI environment safe for HT, everything else not.

Has there been any progress on 4? It sounded like the most generic option ...

> It does however put yet more nails in the out-of-tree hypervisors.

>>>> - Investigate the source of the overhead even when no tasks are tagged:
>>>>   https://lkml.org/lkml/2019/10/29/242
>>>
>>> - explain why we're all still doing this ....
>>>
>>> Seriously, what actual problems does it solve? The patch-set still isn't
>>> L1TF complete and afaict it does exactly nothing for MDS.
>>
>> The L1TF incompleteness is because of cross-HT attack from Guest vCPU attacker to an interrupt/softirq executing on the other sibling, correct? The IRQ enter pausing the other sibling should fix that (which we will share in a future series revision after adequate testing).
>
> Correct, the vCPU still running can glean host (kernel) state from the sibling handling the interrupt in the host kernel.
>
>>> Like I've written many times now, back when the world was simpler and
>>> all we had to worry about was L1TF, core-scheduling made some sense, but
>>> how does it make sense today?
>>
>> For ChromeOS we're planning to tag each and every task separately except for trusted processes, so we are isolating untrusted tasks even from each other.
>>
>> Sorry if this sounds like pushing my usecase, but we do get a parallelism advantage for the trusted tasks while still solving all security issues (for ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if untrusted (tagged) tasks execute together on the same core, but we are not planning to do that on our setup at least.
>
> That doesn't completely solve things I think. Even if you run all untrusted tasks as core exclusive, you still have a problem of them vs interrupts on the other sibling.
>
> You need to somehow arrange all interrupts to the core happen on the same sibling that runs your untrusted task, such that the VERW on return-to-userspace works as intended.
>
> I suppose you can try and play funny games with interrupt routing tied to the force-idle state, but I'm dreading what that'll look like. Or were you going to handle this from your irq_enter() thing too?

I'm not sure I follow. We have thread local interrupts (timers, IPIs) and device interrupts (network, block, etc).

Thread local ones shouldn't transfer too much knowledge, so I'd be inclined to say we can just ignore that attack vector.

Device interrupts we can easily route to HT0. If we now make "core exclusive" a synonym for "always run on HT0", we can guarantee that they always land on the same CPU, no?

Then you don't need to hook into any idle state tracking, because you always know which CPU is the "safe" one to both schedule tasks and route interrupts to.

Alex


Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Fri, Apr 17, 2020 at 02:35:38PM +0200, Alexander Graf wrote:
> On 17.04.20 13:12, Peter Zijlstra wrote:
>
> If we first kick out the sibling HT for every #VMEXIT, performance will be abysmal, no?

I've been given to understand that people serious about virt try really hard to avoid VMEXIT.

> > That doesn't completely solve things I think. Even if you run all untrusted tasks as core exclusive, you still have a problem of them vs interrupts on the other sibling.
> >
> > You need to somehow arrange all interrupts to the core happen on the same sibling that runs your untrusted task, such that the VERW on return-to-userspace works as intended.
> >
> > I suppose you can try and play funny games with interrupt routing tied to the force-idle state, but I'm dreading what that'll look like. Or were you going to handle this from your irq_enter() thing too?
>
> I'm not sure I follow. We have thread local interrupts (timers, IPIs) and device interrupts (network, block, etc).
>
> Thread local ones shouldn't transfer too much knowledge, so I'd be inclined to say we can just ignore that attack vector.
>
> Device interrupts we can easily route to HT0. If we now make "core exclusive" a synonym for "always run on HT0", we can guarantee that they always land on the same CPU, no?
>
> Then you don't need to hook into any idle state tracking, because you always know which CPU the "safe" one to both schedule tasks and route interrupts to is.

That would come apart most mighty when someone does an explicit sched_setaffinity() for !HT0.

While that might work for some relatively contained systems like chromeos, it will not work in general I think.
Hi Peter,

On Fri, Apr 17, 2020 at 01:12:55PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
> > On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> > > On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > > > TODO
> > > > ----
> > > > - Work on merging patches that are ready to be merged
> > > > - Decide on the API for exposing the feature to userland
> > > > - Experiment with adding synchronization points in VMEXIT to mitigate
> > > >   the VM-to-host-kernel leaking
> > >
> > > VMEXIT is too late, you need to hook irq_enter(), which is what makes
> > > the whole thing so horrible.
> >
> > We came up with a patch to do this as well. Currently testing it more and it looks clean, will share it soon.
>
> Thomas said we actually first do VMEXIT, and then enable interrupts. So the VMEXIT thing should actually work, and that is indeed much saner than sticking it in irq_enter().
>
> It does however put yet more nails in the out-of-tree hypervisors.

Just to clarify what we're talking about here. The condition we are trying to protect against:

1. VM is malicious.
2. Sibling of VM is entering an interrupt on the host.
3. When we enter the interrupt, we send an IPI to force the VM into waiting.
4. The VM on the sibling enters VMEXIT.

In step 4, we have to synchronize. Is this the scenario we are discussing?

> > > > - Investigate the source of the overhead even when no tasks are tagged:
> > > >   https://lkml.org/lkml/2019/10/29/242
> > >
> > > - explain why we're all still doing this ....
> > >
> > > Seriously, what actual problems does it solve? The patch-set still isn't
> > > L1TF complete and afaict it does exactly nothing for MDS.
> >
> > The L1TF incompleteness is because of cross-HT attack from Guest vCPU attacker to an interrupt/softirq executing on the other sibling, correct? The IRQ enter pausing the other sibling should fix that (which we will share in a future series revision after adequate testing).
>
> Correct, the vCPU still running can glean host (kernel) state from the sibling handling the interrupt in the host kernel.

Right. This is what we're handling.

> > > Like I've written many times now, back when the world was simpler and
> > > all we had to worry about was L1TF, core-scheduling made some sense, but
> > > how does it make sense today?
> >
> > For ChromeOS we're planning to tag each and every task separately except for trusted processes, so we are isolating untrusted tasks even from each other.
> >
> > Sorry if this sounds like pushing my usecase, but we do get a parallelism advantage for the trusted tasks while still solving all security issues (for ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if untrusted (tagged) tasks execute together on the same core, but we are not planning to do that on our setup at least.
>
> That doesn't completely solve things I think. Even if you run all untrusted tasks as core exclusive, you still have a problem of them vs interrupts on the other sibling.
>
> You need to somehow arrange all interrupts to the core happen on the same sibling that runs your untrusted task, such that the VERW on return-to-userspace works as intended.
>
> I suppose you can try and play funny games with interrupt routing tied to the force-idle state, but I'm dreading what that'll look like. Or were you going to handle this from your irq_enter() thing too?

Yes, even when a host interrupt is on one sibling and the untrusted host process is on the other sibling, we would be handling it the same way we handle host interrupts vs untrusted guests. Perhaps we could optimize pausing of the guest. But Vineeth tested and found that the same code that pauses hosts also works for guests.

> Can someone go write up a document that very clearly states all the problems and clearly explains how to use this feature?

A document would make sense on how to use the feature. Maybe we can add it as a documentation patch to the series?

Basically, from my notes the following are the problems:

Core-scheduling will help with cross-HT MDS and L1TF attacks. The following are the scenarios (borrowed from an email from Thomas -- thanks!):

   HT1 (attack)           HT2 (victim)

A  idle -> user space     user space -> idle

B  idle -> user space     guest -> idle

C  idle -> guest          user space -> idle

D  idle -> guest          guest -> idle

All of them suffer from MDS. #C and #D suffer from L1TF.

All of these scenarios result in the victim getting idled to prevent any leakage. However, this does not address the case where the victim is an interrupt handler or softirq. So we need to either route interrupts to run on the attacker CPU, or have the irq_enter() send IPIs to pause the sibling, which is what we prototyped. Another approach to solve this is to force interrupts into threaded mode, as Thomas suggested.

The problematic usecase then left (if we ignore IRQ troubles) is MDS issues between user <-> kernel simultaneous execution on both siblings. This does not become an issue on ChromeOS where everything untrusted has its own tag. For Vineeth's VM workloads this also isn't a problem from my discussions with him, as he mentioned hyperthreads are both running guests and 2 vCPU threads that belong to different VMs will not execute on the same core (though I'm not sure whether hypercalls from a vCPU when the sibling is running another vCPU of the same VM is a concern here).

Any other case that needs to be considered?

thanks,

- Joel
On Fri, Apr 17, 2020 at 7:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Apr 16, 2020 at 11:39:05AM +0800, Chen Yu wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 02495d44870f..ef101a3ef583 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > return next;
> > }
> >
> > - prev->sched_class->put_prev_task(rq, prev);
> > - if (!rq->nr_running)
> > - newidle_balance(rq, rf);
> > +
> > +#ifdef CONFIG_SMP
> > + for_class_range(class, prev->sched_class, &idle_sched_class) {
> > + if (class->balance(rq, prev, rf))
> > + break;
> > + }
> > +#endif
> > + put_prev_task(rq, prev);
> >
> > smt_mask = cpu_smt_mask(cpu);
>
> Instead of duplicating that, how about you put the existing copy in a
> function to share? finish_prev_task() perhaps?
>
> Also, can you please make newidle_balance() static again; I forgot doing
> that in 6e2df0581f56, which would've made you notice this sooner I
> suppose.
Okay, I'll do that,
Thanks,
Chenyu
On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
>  	}
>  
>  	for_each_online_cpu(cpu) {
> -		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
> -			cpu_rq(cpu)->core_enabled = enabled;
> +		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> +			struct rq *rq = cpu_rq(cpu);
> +
> +			rq->core_enabled = enabled;
> +			if (rq->core == rq)
> +				sched_core_adjust_se_vruntime(cpu);

The adjust is only needed when core scheduling is enabled while I mistakenly called it on both enable and disable. And I come to think normalize is a better name than adjust.

> +		}
>  	}
>  
>  	return 0;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d99ea6ee7af2..7eecf590d6c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> +void sched_core_adjust_se_vruntime(int cpu)
> +{
> +	int i;
> +
> +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> +		struct cfs_rq *cfs_rq, *sibling_cfs_rq;
> +		struct sched_entity *se, *next;
> +		s64 delta;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> +		if (!sibling_cfs_rq->nr_running)
> +			continue;
> +
> +		cfs_rq = &cpu_rq(cpu)->cfs;
> +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> +		/*
> +		 * XXX Malicious user can create a ton of runnable tasks in root
> +		 * sibling_cfs_rq and cause the below vruntime normalization
> +		 * potentially taking a long time.
> +		 */

Testing on a qemu/kvm VM shows that normalizing 32268 sched entities takes about 6ms time so I think the risk is low, therefore, I'm going to remove the XXX comment.

(I disabled CONFIG_SCHED_AUTOGROUP and started 32268 cpuhog tasks on one cpu using taskset, adding trace_printk() before and after the below loop gives me:

 migration/0-11    [000] d..1   674.546882: sched_core_normalize_se_vruntime: cpu5: normalize nr_running=32268
 migration/0-11    [000] d..1   674.552364: sched_core_normalize_se_vruntime: cpu5: normalize done
)

> +		rbtree_postorder_for_each_entry_safe(se, next,
> +				&sibling_cfs_rq->tasks_timeline.rb_root,
> +				run_node) {
> +			se->vruntime += delta;
> +		}
> +	}
>  }
>  
>  static __always_inline

I also think the patch is not to make every sched entity's vruntime core wide but to make it possible to do core wide priority comparison for cfs tasks so I changed the subject. Here is the updated patch:

From d045030074247faf3b515fab21ac06236ce4bd74 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Mon, 20 Apr 2020 10:27:17 +0800
Subject: [PATCH] sched/fair: core wide cfs task priority comparison

This patch provides a vruntime based way to compare two cfs task's priority, be it on the same cpu or different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each thread will choose the next task to run the usual way and then the root level sched entities which the two tasks belong to will be used to decide which task runs next core wide.

An illustration for the cross CPU case:

            cpu0          cpu1
          /   |  \      /   |  \
        se1  se2  se3  se4  se5  se6
       /  \            /  \
     se21  se22      se61  se62
     (A)             /
                   se621
                    (B)

Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to run next and cpu1 has decided task B to run next. To compare priority of task A and B, we compare priority of se2 and se6. Whose vruntime is smaller, who wins.

To make this work, the root level sched entities' vruntime of the two threads must be directly comparable. So one of the hyperthread's root cfs_rq's min_vruntime is chosen as the core wide one and all root level sched entities' vruntime is normalized against it.

All sub cfs_rqs and sched entities are not interesting in cross cpu priority comparison as they will only participate in the usual cpu local schedule decision so no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  | 24 +++++------
 kernel/sched/fair.c  | 96 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  3 ++
 3 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..059add9a89ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (enabled && rq->core == rq)
+				sched_core_normalize_se_vruntime(cpu);
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..1b87d0c8b9ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,98 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_normalize_se_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct sched_entity *se, *next;
+		struct cfs_rq *sibling_cfs_rq;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
 }
 
 static __always_inline
@@ -509,8 +598,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 		vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..d8f0eb7f6e42 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,6 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_normalize_se_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7
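The same-CPU branch of cfs_prio_less() above walks the deeper sched entity up the cgroup hierarchy until both entities sit in the same group, then compares vruntimes there. The walk can be modelled in user space; this is a toy stand-in (parent-pointer equality replaces the kernel's is_same_group(), and the struct is a mock, not the kernel's sched_entity):

```c
#include <stddef.h>

/* Mock sched_entity: parent pointer, nesting depth, accumulated vruntime. */
struct entity {
	struct entity *parent;
	int depth;
	long vruntime;
};

/*
 * Walk both entities up until they share a parent (crude is_same_group()
 * stand-in; root entities share the NULL parent). When depths are equal,
 * both step up at once, mirroring the >= / <= pair in cfs_prio_less().
 */
static struct entity *common_depth_walk(struct entity *sea, struct entity *seb,
					struct entity **seb_out)
{
	while (sea->parent != seb->parent) {
		int da = sea->depth, db = seb->depth;

		if (da >= db)
			sea = sea->parent;
		if (da <= db)
			seb = seb->parent;
	}
	*seb_out = seb;
	return sea;	/* now at a level where vruntimes are comparable */
}
```

Once both entities are at the common level, the smaller vruntime wins the pick, which is exactly the `delta > 0` test in the patch.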
On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
>
> On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
>
> The adjust is only needed when core scheduling is enabled while I mistakenly called it on both enable and disable. And I come to think normalize is a better name than adjust.

I guess we would also need to update the min_vruntime of the sibling to match the rq->core->min_vruntime on coresched disable. Otherwise a new enqueue on the root cfs of the sibling would inherit the very old min_vruntime from before coresched enable and thus would starve all the already queued tasks until the newly enqueued se's vruntime catches up.

Other than that, I think the patch looks good. We haven't tested it yet. Will do a round of testing and let you know soon.

Thanks,
Vineeth
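The starvation Vineeth describes is easy to quantify: CFS places a fresh enqueue near the cfs_rq's min_vruntime and always picks the task with the smallest vruntime, so a task enqueued against a stale (too small) min_vruntime keeps winning every pick until it has burned through the whole lag. A deliberately simplified user-space model (it ignores place_entity()'s sleeper credit and weight scaling):

```c
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/*
 * Model of pick-min-vruntime scheduling: a newly enqueued task starts at
 * new_vruntime (near the stale min_vruntime) and receives `slice` of runtime
 * per pick. Count how many consecutive picks it wins before any task at
 * old_min_vruntime runs again.
 */
static unsigned long picks_before_old_task_runs(u64 new_vruntime,
						u64 old_min_vruntime,
						u64 slice)
{
	unsigned long picks = 0;

	while ((s64)(new_vruntime - old_min_vruntime) < 0) {
		new_vruntime += slice;	/* the new task keeps winning */
		picks++;
	}
	return picks;
}
```

With a sibling min_vruntime frozen at enable time while the core-wide clock advanced by, say, half a millisecond of vruntime, the new task monopolizes the CPU for hundreds of slices, which is why syncing the sibling's min_vruntime on disable is the safe choice.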
On Mon, Apr 20, 2020 at 06:26:34PM -0400, Vineeth Remanan Pillai wrote:
> On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
> >
> > On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> >
> > The adjust is only needed when core scheduling is enabled while I mistakenly called it on both enable and disable. And I come to think normalize is a better name than adjust.
>
> I guess we would also need to update the min_vruntime of the sibling to match the rq->core->min_vruntime on coresched disable. Otherwise a new enqueue on root cfs of the sibling would inherit the very old min_vruntime before coresched enable and thus would starve all the already queued tasks until the newly enqueued se's vruntime catches up.

Yes this is a concern but AFAICS, there is no problem. Consider:
- when there is no queued task across the disable boundary, the stale min_vruntime doesn't matter as you said;
- when there are queued tasks across the disable boundary, the newly queued task will normalize its vruntime against the sibling_cfs_rq's min_vruntime; if the min_vruntime is stale, a problem would occur. But my reading of the code made me think this min_vruntime should have already been updated by update_curr() in enqueue_entity() before being used by this newly enqueued task, and update_curr() would bring the stale min_vruntime up to the smallest vruntime of the queued ones, so again, no problem should occur.

I have done a simple test locally before sending the patch out and didn't find any problem but maybe I failed to hit the race window. Let me know if I misunderstood something.

> Other than that, I think the patch looks good. We haven't tested it yet. Will do a round of testing and let you know soon.

Thanks.
On Tue, Apr 21, 2020 at 10:51:31AM +0800, Aaron Lu wrote: > On Mon, Apr 20, 2020 at 06:26:34PM -0400, Vineeth Remanan Pillai wrote: > > On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote: > > > > > > On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote: > > > > > The adjust is only needed when core scheduling is enabled while I > > > mistakenly called it on both enable and disable. And I come to think > > > normalize is a better name than adjust. > > > > > I guess we would also need to update the min_vruntime of the sibling > > to match the rq->core->min_vruntime on coresched disable. Otherwise > > a new enqueue on root cfs of the sibling would inherit the very old > > min_vruntime before coresched enable and thus would starve all the > > already queued tasks until the newly enqueued se's vruntime catches up. > > Yes this is a concern but AFAICS, there is no problem. Consider: > - when there is no queued task across the disable boundary, the stale > min_vruntime doesn't matter as you said; > - when there are queued tasks across the disable boundary, the newly > queued task will normalize its vruntime against the sibling_cfs_rq's > min_vruntime, if the min_vruntime is stale and problem would occur. > But my reading of the code made me think this min_vruntime should > have already been updated by update_curr() in enqueue_entity() before > being used by this newly enqueued task and update_curr() would bring > the stale min_vruntime to the smallest vruntime of the queued ones so > again, no problem should occur. After discussion with Vineeth, I now tend to add the syncing of sibling_cfs_rq min_vruntime on core disable because analysing all the code is time consuming and though I didn't find any problems now, I might miss something and future code change may also break the expectations so adding it seems a safe thing to do, also, it didn't bring any performance downgrade as it is a one time disable stuff. 
Vineeth also pointed out a problem of misusing cfs_rq->min_vruntime for !CONFIG_64BIT kernel in migrate_task_rq_fair(), this is also fixed. (only compile tested for !CONFIG_64BIT kernel) From cda051ed33e6b88f28b44147cc7c894994c9d991 Mon Sep 17 00:00:00 2001 From: Aaron Lu <ziqian.lzq@antfin.com> Date: Mon, 20 Apr 2020 10:27:17 +0800 Subject: [PATCH] sched/fair: core wide cfs task priority comparison This patch provides a vruntime based way to compare two cfs task's priority, be it on the same cpu or different threads of the same core. When the two tasks are on the same CPU, we just need to find a common cfs_rq both sched_entities are on and then do the comparison. When the two tasks are on differen threads of the same core, each thread will choose the next task to run the usual way and then the root level sched entities which the two tasks belong to will be used to decide which task runs next core wide. An illustration for the cross CPU case: cpu0 cpu1 / | \ / | \ se1 se2 se3 se4 se5 se6 / \ / \ se21 se22 se61 se62 (A) / se621 (B) Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to run next and cpu1 has decided task B to run next. To compare priority of task A and B, we compare priority of se2 and se6. Whose vruntime is smaller, who wins. To make this work, the root level sched entities' vruntime of the two threads must be directly comparable. So one of the hyperthread's root cfs_rq's min_vruntime is chosen as the core wide one and all root level sched entities' vruntime is normalized against it. All sub cfs_rqs and sched entities are not interesting in cross cpu priority comparison as they will only participate in the usual cpu local schedule decision so no need to normalize their vruntimes. 
Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  |  28 +++++----
 kernel/sched/fair.c  | 135 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   4 ++
 3 files changed, 148 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..d8bedddef6fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,17 @@ static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (rq->core == rq) {
+				if (enabled)
+					sched_core_normalize_se_vruntime(cpu);
+				else
+					sched_core_sync_cfs_vruntime(cpu);
+			}
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..a5774f495d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,133 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
+{
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime_copy;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime_copy;
+}
+#endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_normalize_se_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct sched_entity *se, *next;
+		struct cfs_rq *sibling_cfs_rq;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
+}
+
+/*
+ * During the entire run of core scheduling, sibling_cfs_rq's min_vruntime
+ * is left unused and could lag far behind its still queued sched entities.
+ * Sync it to the up2date core wide one to avoid problems.
+ */
+void sched_core_sync_cfs_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct cfs_rq *sibling_cfs_rq;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		sibling_cfs_rq->min_vruntime = cfs_rq->min_vruntime;
+#ifndef CONFIG_64BIT
+		smp_wmb();
+		sibling_cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
+	}
 }
 
 static __always_inline
@@ -509,8 +633,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -6396,9 +6523,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		u64 min_vruntime_copy;
 
 		do {
-			min_vruntime_copy = cfs_rq->min_vruntime_copy;
+			min_vruntime_copy = cfs_rq_min_vruntime_copy(cfs_rq);
 			smp_rmb();
-			min_vruntime = cfs_rq->min_vruntime;
+			min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 		} while (min_vruntime != min_vruntime_copy);
 #else
 		min_vruntime = cfs_rq_min_vruntime(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..5517ca92b5bd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,7 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_normalize_se_vruntime(int cpu);
+void sched_core_sync_cfs_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7
Sorry for being verbose; I've been procrastinating replying, and in doing so the things I wanted to say kept growing.

On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:

> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one of the hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.

> +/*
> + * This is called in stop machine context so no need to take the rq lock.
> + *
> + * Core scheduling is going to be enabled and the root level sched entities
> + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> + * min_vruntime, so it's necessary to normalize vruntime of existing root
> + * level sched entities in sibling_cfs_rq.
> + *
> + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> + */
> +void sched_core_normalize_se_vruntime(int cpu)
> +{
> +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> +	int i;
> +
> +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> +		struct sched_entity *se, *next;
> +		struct cfs_rq *sibling_cfs_rq;
> +		s64 delta;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> +		if (!sibling_cfs_rq->nr_running)
> +			continue;
> +
> +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> +		rbtree_postorder_for_each_entry_safe(se, next,
> +				&sibling_cfs_rq->tasks_timeline.rb_root,
> +				run_node) {
> +			se->vruntime += delta;
> +		}
> +	}
> +}

Aside from this being way too complicated for what it does -- you could've saved the min_vruntime for each rq and compared them with subtraction -- it is also terminally broken afaict.

Consider any infeasible weight scenario. Take for instance two tasks, each bound to their respective sibling, one with weight 1 and one with weight 2. Then the lower weight task will run ahead of the higher weight task without bound.
This utterly destroys the concept of a shared time base.

Remember; all this is about proportionally fair scheduling, where each task receives:

             w_i
    dt_i = ---------- dt                                  (1)
           \Sum_j w_j

which we do by tracking a virtual time, s_i:

           1
    s_i = --- d[t]_i                                      (2)
          w_i

Where d[t] is a delta of discrete time, while dt is an infinitesimal. The immediate corollary is that the ideal schedule S, were (2) to use an infinitesimal delta, is:

            1
    S = ---------- dt                                     (3)
        \Sum_i w_i

From which we can define the lag, or deviation from the ideal, as:

    lag(i) = S - s_i                                      (4)

And since the one and only purpose is to approximate S, we get that:

    \Sum_i w_i lag(i) := 0                                (5)

If this were not so, we no longer converge to S, and we can no longer claim our scheduler has any of the properties we derive from S. This is exactly what you did above, you broke it!

Let's continue for a while though; to see if there is anything useful to be learned.

We can combine (1)-(3) or (4)-(5) and express S in s_i:

        \Sum_i w_i s_i
    S = --------------                                    (6)
          \Sum_i w_i

Which gives us a way to compute S, given our s_i. Now, if you've read our code, you know that we do not in fact do this; the reason for this is two-fold. Firstly, computing S in that way requires a 64bit division every time we'd use it (see 12), and secondly, this only describes the steady-state, it doesn't handle dynamics.

Anyway, in (6): s_i -> x + (s_i - x), to get:

            \Sum_i w_i (s_i - x)
    S - x = --------------------                          (7)
                 \Sum_i w_i

Which shows that S and s_i transform alike (which makes perfect sense given that S is basically the (weighted) average of s_i).

Then:

    x -> s_min := min{s_i}                                (8)

to obtain:

                \Sum_i w_i (s_i - s_min)
    S = s_min + ------------------------                  (9)
                       \Sum_i w_i

Which already looks familiar, and is the basis for our current approximation:

    S ~= s_min                                            (10)

Now, obviously, (10) is absolute crap :-), but it sorta works.

So the thing to remember is that the above is strictly UP.
It is possible to generalize to multiple runqueues -- however it gets really yuck when you have to add affinity support, as illustrated by our very first counter-example.

  XXX segue into the load-balance issues related to this:

  - how a negative lag task on a 'heavy' runqueue should not remain a
    negative lag task when migrated to a 'light' runqueue;

  - how we can compute and use the combined S in load-balancing to better
    handle infeasible weight scenarios.

Luckily I think we can avoid needing a full multi-queue variant for core-scheduling (or load-balancing). The crucial observation is that we only actually need this comparison in the presence of forced-idle; only then do we need to tell if the stalled rq has higher priority over the other.

  [XXX assumes SMT2; better consider the more general case, I suspect
   it'll work out because our comparison is always between 2 rqs and
   the answer is only interesting if one of them is forced-idle]

And (under the assumption of SMT2) when there is forced-idle, there is only a single queue, so everything works like normal.

Let, for our runqueue 'k':

    T_k = \Sum_i w_i s_i
    W_k = \Sum_i w_i     ; for all i of k                 (11)

Then we can write (6) like:

          T_k
    S_k = ---                                             (12)
          W_k

From which immediately follows that:

            T_k + T_l
    S_k+l = ---------                                     (13)
            W_k + W_l

On which we can define a combined lag:

    lag_k+l(i) := S_k+l - s_i                             (14)

And that gives us the tools to compare tasks across a combined runqueue.

Combined, this gives the following:

 a) when a runqueue enters force-idle, sync it against its sibling rq(s)
    using (7); this only requires storing single 'time'-stamps;

 b) when comparing tasks between 2 runqueues of which one is forced-idle,
    compare the combined lag, per (14).

Now, of course, cgroups (I so hate them) make this more interesting in that a) seems to suggest we need to iterate all cgroups on a CPU at such boundaries, but I think we can avoid that. The force-idle is for the whole CPU, all its rqs.
So we can mark it in the root and lazily propagate downward on demand.
On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
>
> Sorry for being verbose; I've been procrastinating replying, and in
> doing so the things I wanted to say kept growing.
>
> On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:
>
> > To make this work, the root level sched entities' vruntime of the two
> > threads must be directly comparable. So one of the hyperthread's root
> > cfs_rq's min_vruntime is chosen as the core wide one and all root level
> > sched entities' vruntime is normalized against it.
>
> > +/*
> > + * This is called in stop machine context so no need to take the rq lock.
> > + *
> > + * Core scheduling is going to be enabled and the root level sched entities
> > + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> > + * min_vruntime, so it's necessary to normalize vruntime of existing root
> > + * level sched entities in sibling_cfs_rq.
> > + *
> > + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> > + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> > + */
> > +void sched_core_normalize_se_vruntime(int cpu)
> > +{
> > +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> > +	int i;
> > +
> > +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> > +		struct sched_entity *se, *next;
> > +		struct cfs_rq *sibling_cfs_rq;
> > +		s64 delta;
> > +
> > +		if (i == cpu)
> > +			continue;
> > +
> > +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> > +		if (!sibling_cfs_rq->nr_running)
> > +			continue;
> > +
> > +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> > +		rbtree_postorder_for_each_entry_safe(se, next,
> > +				&sibling_cfs_rq->tasks_timeline.rb_root,
> > +				run_node) {
> > +			se->vruntime += delta;
> > +		}
> > +	}
> > +}
>
> Aside from this being way too complicated for what it does -- you
> could've saved the min_vruntime for each rq and compared them with
> subtraction -- it is also terminally broken afaict.
>
> Consider any infeasible weight scenario. Take for instance two tasks,
> each bound to their respective sibling, one with weight 1 and one with
> weight 2. Then the lower weight task will run ahead of the higher
> weight task without bound.

I don't follow how this could happen. Even if the lower weight task runs first, after some time the higher weight task will get its turn and, from then on, the higher weight task will get more chance to run (due to its higher weight and thus slower accumulation of vruntime).

We used to have the following patch as a standalone one in v4:
    sched/fair : Wake up forced idle siblings if needed
    https://lore.kernel.org/lkml/cover.1572437285.git.vpillai@digitalocean.com/T/#md22d25d0e2932d059013e9b56600d8a847b02a13
Which originates from:
    https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/
And in this series, it seems to be merged in:
    [RFC PATCH 07/13] sched: Add core wide task selection and scheduling
    https://lore.kernel.org/lkml/e942da7fd881977923463f19648085c1bfaa37f8.1583332765.git.vpillai@digitalocean.com/

My local test shows that when two cgroups' shares are both set to 1024 and each is bound to one sibling of a core, starting a cpu intensive task in each cgroup, the two tasks each consume 50% cpu. When one cgroup's share is set to 512, it consumes about 33% while the other consumes 67%, as expected.

I think the current patch works fine when 2 differently tagged tasks are competing for CPU, but when there are 3 tasks or more, things can get less fair.
On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> > Aside from this being way too complicated for what it does -- you
> > could've saved the min_vruntime for each rq and compared them with
> > subtraction -- it is also terminally broken afaict.
> >
> > Consider any infeasible weight scenario. Take for instance two tasks,
> > each bound to their respective sibling, one with weight 1 and one with
> > weight 2. Then the lower weight task will run ahead of the higher
> > weight task without bound.
>
> I don't follow how this could happen. Even if the lower weight task runs
> first, after some time the higher weight task will get its turn and,
> from then on, the higher weight task will get more chance to run (due to
> its higher weight and thus slower accumulation of vruntime).

That seems to assume they're mutually exclusive. In that case, as I argued, we only have a single runqueue and then yes, it works. But if they're not exclusive, and can run concurrently, it comes apart.
On Fri, May 08, 2020 at 11:09:25AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> > On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
>
> > > Aside from this being way to complicated for what it does -- you
> > > could've saved the min_vruntime for each rq and compared them with
> > > subtraction -- it is also terminally broken afaict.
> > >
> > > Consider any infeasible weight scenario. Take for instance two tasks,
> > > each bound to their respective sibling, one with weight 1 and one with
> > > weight 2. Then the lower weight task will run ahead of the higher weight
> > > task without bound.
> >
> > I don't follow how this could happen. Even the lower weight task runs
> > first, after some time, the higher weight task will get its turn and
> > from then on, the higher weight task will get more chance to run(due to
> > its higher weight and thus, slower accumulation of vruntime).
>
> That seems to assume they're mutually exclusive. In that case, as I
> argued, we only have a single runqueue and then yes it works. But if
> they're not exclusive, and can run concurrently, it comes apart.
Ah right, now I see what you mean. Sorry for misunderstanding.
And yes, that 'utterly destroys the concept of a shared time base' and
then bad things can happen:
1) two same-tagged tasks (t1 and t2) are running on two siblings, with
   t1's weight lower than t2's;
2) both tasks are cpu intensive;
3) over time, the lower weight task (t1)'s vruntime becomes bigger and
   bigger than t2's vruntime, and the core wide min_vruntime is the
   same as t1's vruntime per this patch;
4) a new task is enqueued on the same sibling as t1; if the new task has
   an incompatible tag, it will be starved by t2 because t2's vruntime
   is way smaller than the core wide min_vruntime.
With this said, I realized a workaround for the issue described above:
when the core goes from 'compatible mode' (steps 1-3) to 'incompatible
mode' (step 4), reset all root level sched entities' vruntime to be the
same as the core wide min_vruntime. After all, the core is transforming
from two-runqueue mode to single-runqueue mode... I think this can solve
the issue to some extent, but I may have missed other scenarios.
I'll also re-read your last email about the 'lag' idea.
- Test environment:
  Intel Xeon Server platform
  CPU(s):              192
  On-line CPU(s) list: 0-191
  Thread(s) per core:  2
  Core(s) per socket:  48
  Socket(s):           2
  NUMA node(s):        4

- Kernel under test:
  Core scheduling v5 base
  https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
  A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1
     (192 workload tasks for each cgroup)
  B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 +
     sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2
     (192 workload tasks for each cgroup)

- Test results briefing:
  1 Good results:
    1.1 For test set A, coresched could achieve the same or better
        performance compared to smt_off, for both the cpu workload and
        the mysql workload.
    1.2 For test set B, cpu workload, coresched could achieve better
        performance compared to smt_off.
  2 Bad results:
    2.1 For test set B, mysql workload, coresched performance is lower
        than smt_off; potential fairness issue between cpu workloads and
        mysql workloads.
    2.2 For test set B, cpu workload, potential fairness issue between
        the 2 cgroups' cpu workloads.

- Test results:
  Note: test results in the following tables are Tput normalized to the
  default baseline.

-- Test set A Tput normalized results:

+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
|                    | ****   | default   | coresched   | smt_off   | ***   | default     | coresched     | smt_off     |
+====================+========+===========+=============+===========+=======+=============+===============+=============+
| cgroups            | ****   | cg cpu 1  | cg cpu 1    | cg cpu 1  | ***   | cg mysql 1  | cg mysql 1    | cg mysql 1  |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| sysbench workload  | ****   | cpu       | cpu         | cpu       | ***   | mysql       | mysql         | mysql       |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| 192 tasks / cgroup | ****   | 1         | 0.95        | 0.54      | ***   | 1           | 0.92          | 0.97        |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+

-- Test set B Tput normalized results:

+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
|                    | ****   | default   | coresched   | smt_off   | ***   | default     | coresched     | smt_off     | **   | default     | coresched     | smt_off     | *   | default     | coresched     | smt_off     |
+====================+========+===========+=============+===========+=======+=============+===============+=============+======+=============+===============+=============+=====+=============+===============+=============+
| cgroups            | ****   | cg cpu 1  | cg cpu 1    | cg cpu 1  | ***   | cg cpu 2    | cg cpu 2      | cg cpu 2    | **   | cg mysql 1  | cg mysql 1    | cg mysql 1  | *   | cg mysql 2  | cg mysql 2    | cg mysql 2  |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| sysbench workload  | ****   | cpu       | cpu         | cpu       | ***   | cpu         | cpu           | cpu         | **   | mysql       | mysql         | mysql       | *   | mysql       | mysql         | mysql       |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| 192 tasks / cgroup | ****   | 1         | 0.9         | 0.47      | ***   | 1           | 1.32          | 0.66        | **   | 1           | 0.42          | 0.89        | *   | 1           | 0.42          | 0.89        |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+

> On Date: Wed, 4 Mar 2020 16:59:50 +0000, vpillai <vpillai@digitalocean.com> wrote:
> To: Nishanth Aravamudan <naravamudan@digitalocean.com>, Julien Desfossez <jdesfossez@digitalocean.com>, Peter Zijlstra <peterz@infradead.org>, Tim Chen <tim.c.chen@linux.intel.com>, mingo@kernel.org, tglx@linutronix.de, pjt@google.com, torvalds@linux-foundation.org
> CC: vpillai <vpillai@digitalocean.com>, linux-kernel@vger.kernel.org, fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com, Phil Auld <pauld@redhat.com>, Aaron Lu <aaron.lwe@gmail.com>, Aubrey Li <aubrey.intel@gmail.com>, aubrey.li@linux.intel.com, Valentin Schneider <valentin.schneider@arm.com>, Mel Gorman <mgorman@techsingularity.net>, Pawan Gupta <pawan.kumar.gupta@linux.intel.com>, Paolo Bonzini <pbonzini@redhat.com>, Joel Fernandes <joelaf@google.com>, joel@joelfernandes.org
>
[-- Attachment #1: Type: text/plain, Size: 5824 bytes --]

On Tue, 2020-04-14 at 16:21 +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> >
> > - Investigate the source of the overhead even when no tasks are tagged:
> >   https://lkml.org/lkml/2019/10/29/242
>
> - explain why we're all still doing this ....
>
> Seriously, what actual problems does it solve? The patch-set still isn't
> L1TF complete and afaict it does exactly nothing for MDS.

Hey Peter! Late to the party, I know... But I'm replying anyway. At least, you'll have the chance to yell at me for this during OSPM. ;-P

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense, but
> how does it make sense today?

Indeed core-scheduling alone doesn't even completely solve L1TF. There are the interrupts and the VMEXITs issues. Both are being discussed in this thread and, FWIW, my personal opinion is that the way to go is what Alex says here: <79529592-5d60-2a41-fbb6-4a5f8279f998@amazon.com> (e.g., when he mentions solution 4, "Create a "safe" page table which runs with HT enabled", etc).

But let's stick to your point: if it were only for L1TF, then fine, but it's all pointless because of MDS. My answer to this is very much focused on my usecase, which is virtualization. I know you hate us, and you surely have your good reasons, but you know... :-)

Correct me if I'm wrong, but I think that the "nice" thing about L1TF is that it allows a VM to spy on another VM or on the host, but it does not allow a regular task to spy on another task or on the kernel (well, it would, but it's easily mitigated). The bad thing about MDS is that it instead allows *all* of that.

Now, one thing that we absolutely want to avoid in virt is that a VM is able to spy on other VMs or on the host.
Sure, we also care about tasks running in our VMs being safe, but, really, inter-VM and VM-to-host isolation is the primary concern of a hypervisor.

And how can a VM (or stuff running inside a VM) spy on another VM or on the host, via L1TF or MDS? Well, if the attacker VM and the victim VM --or the attacker VM and the host-- are running on the same core. If they're not, it can't... which is basically an L1TF-only looking scenario.

So, in virt, core-scheduling:
1) is the *only* way (aside from no-EPT) to prevent an attacker VM from spying on a victim VM, if they're running concurrently, both in guest mode, on the same core (and that's, of course, because with core-scheduling they just won't be doing that :-) )
2) interrupts and VMEXITs need to be taken care of --which was the case already when, as you said, "we had only L1TF". Once that is done we will effectively prevent all VM-to-VM and VM-to-host attack scenarios.

Sure, it will still be possible, for instance, for task_A in VM1 to spy on task_B, also in VM1. This seems to be, AFAIUI, Joel's usecase, so I'm happy to leave it to him to defend that, as he's doing already (but indeed I'm very happy to see that it is also getting attention).

Now, of course saying anything like "works for my own usecase so let's go for it" does not fly. But since you were asking whether and how this feature could make sense today, suppose that:
1) we get core-scheduling,
2) we find a solution for irqs and VMEXITs, as we would have to if there was only L1TF,
3) we manage to make the overhead of core-scheduling close to zero when it's there (I mean, enabled at compile time) but not used (I mean, no tagging of tasks, or whatever).

That would mean that virt people can enable core-scheduling, and achieve good inter-VM and VM-to-host isolation, without imposing overhead on other use cases, which would leave core-scheduling disabled. And this is something that, I would think, makes sense.

Of course, we're not there... because even when this series gives us point 1, we will also need 2, and we need to make sure we also satisfy 3 (and we weren't, last time I checked ;-P). But I think it's worth continuing to try.

I'd also add a couple more ideas, still about core-scheduling in virt, but from a different standpoint than security:
- if I tag vcpu0 and vcpu1 together[*], then vcpu2 and vcpu3 together, then vcpu4 and vcpu5 together, then I'm sure that each pair will always be scheduled on the same core. At which point I can define an SMT virtual topology, for the VM, that will make sense, even without pinning the vcpus;
- if I run VMs from different customers, when vcpu2 of VM1 and vcpu1 of VM2 run on the same core, they influence each other's performance. If, e.g., I bill based on time spent on CPUs, it means customer A's workload, running in VM1, may influence the billing of customer B, who owns VM2. With core scheduling, if I tag all the vcpus of each VM together, I won't have this any longer.

[*] with "tag together" I mean let them have the same tag which, ATM, would be "put them in the same cgroup and enable cpu.tag".

Whether or not these make sense, e.g., performance wise, is a bit hard to tell, with the feature not yet finalized... But I've started doing some preliminary measurements already. Hopefully, they'll be ready by Monday.

So that's it. I hope this gives you enough material to complain about during OSPM. At least, given the event is virtual, I won't get any microphone box (or, worse, frozen sharks!) thrown at me in anger! :-D

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
With the current core scheduling patchset, non-threaded IRQ and softirq victims can leak data from their hyperthread to a sibling hyperthread running an attacker. For MDS, it is possible for the IRQ and softirq handlers to leak data to either host or guest attackers. For L1TF, it is possible to leak to guest attackers. There is no possible mitigation involving flushing of buffers to avoid this, since the execution of the attacker and the victims happens concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide irq_enter() and irq_exit() executed by any sibling. In between these two, we mark the core to be in a special core-wide IRQ state. In the IRQ entry, if we detect that the sibling is running untrusted code, we send a reschedule IPI so that the sibling transitions through the sibling's irq_exit() to do any waiting there, till the IRQ being protected finishes.

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu outer-most irq_exit() the core is still in the special core-wide IRQ state, we perform a busy-wait till the core exits this state. This combination of per-cpu and core-wide IRQ states helps to handle any combination of irq_enter()s and irq_exit()s happening on all of the siblings of the core, in any order.

Lastly, we also check in the schedule loop if we are about to schedule an untrusted process while the core is in such a state. This is possible if a trusted thread enters the scheduler by way of yielding CPU. This would involve no transitions through the irq_exit() point to do any waiting, so we have to explicitly do the waiting there.

Every attempt is made to avoid an unnecessary busy-wait, and in testing on real-world ChromeOS usecases, it has not shown a performance drop.
In ChromeOS, with this and the rest of the core scheduling patchset, we
see around a 300% improvement in key press latencies into Google docs
when camera streaming is running simultaneously (the 90th percentile
latency of ~150ms drops to ~50ms).

Cc: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Vineeth Pillai <vpillai@digitalocean.com>
Signed-off-by: Vineeth Pillai <vpillai@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
If you would like some pictures of the cases handled by this patch,
please see the OSPM slide deck (the below link jumps straight to the
relevant slides - about 6-7 of them in total): https://bit.ly/2zvzxWk

TODO:
1. Any optimizations for VM usecases (can we do something better than
   the scheduler IPI?)
2. Waiting in schedule() can likely be optimized; for example, there is
   no need to wait if the previous task was idle, as there would have
   been an IRQ involved with the wake up of the next task.

 include/linux/sched.h |   8 +++
 kernel/sched/core.c   | 159 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |   3 +
 kernel/softirq.c      |   2 +
 4 files changed, 172 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 710e9a8956007..fe6ae59fcadbe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#else
+static inline void sched_core_irq_enter(void) { }
+static inline void sched_core_irq_exit(void) { }
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21c640170323b..e06195dcca7a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state, otherwise it
+ * will deadlock. This function should be called from irq_exit() and from
+ * schedule(). It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+	/*
+	 * Wait till the core of this HT is not in a core-wide IRQ state.
+	 *
+	 * Pair with smp_store_release() in sched_core_irq_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+		cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+	int i, cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	const struct cpumask *smt_mask;
+
+	if (!sched_core_enabled(rq))
+		return;
+
+	/* Count irq_enter() calls received without irq_exit() on this CPU. */
+	rq->core_this_irq_nest++;
+
+	/* If not outermost irq_enter(), do nothing. */
+	if (rq->core_this_irq_nest != 1 ||
+	    WARN_ON_ONCE(rq->core->core_this_irq_nest == UINT_MAX))
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+	WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+	if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+		goto unlock;
+
+	if (rq->core_pause_pending) {
+		/*
+		 * Do nothing more since we are in a 'reschedule IPI' sent from
+		 * another sibling. That sibling would have sent IPIs to all of
+		 * the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide IRQ
+	 * state, do nothing.
+	 */
+	if (rq->core->core_irq_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_idle_task(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/* IPI only if previous IPI was not pending. */
+		if (!srq->core_pause_pending) {
+			srq->core_pause_pending = 1;
+			smp_send_reschedule(i);
+		}
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	bool wait_here = false;
+	unsigned int nest;
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		return;
+
+	rq->core_this_irq_nest--;
+
+	/* If not outermost on this CPU, do nothing. */
+	if (rq->core_this_irq_nest > 0 ||
+	    WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX))
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_irq_nest;
+	WARN_ON_ONCE(!nest);
+
+	/*
+	 * If we still have other CPUs in IRQs, we have to wait for them.
+	 * Either here, or in the scheduler.
+	 */
+	if (rq->core->core_cookie && nest > 1) {
+		/*
+		 * If we are entering the scheduler anyway, we can just wait
+		 * there for ->core_irq_nest to reach 0. If not, just wait here.
+		 */
+		if (!tif_need_resched())
+			wait_here = true;
+	}
+
+	if (rq->core_pause_pending)
+		rq->core_pause_pending = 0;
+
+	/* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+	smp_store_release(&rq->core->core_irq_nest, nest - 1);
+	raw_spin_unlock(rq_lockp(rq));
+
+	if (wait_here)
+		sched_core_sibling_irq_pause(rq);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt)
 		rq_unlock_irq(rq, &rf);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * If a CPU that was running a trusted task entered the scheduler, and
+	 * the next task is untrusted, then check if waiting for core-wide IRQ
+	 * state to cease is needed since we would not have been able to get
+	 * the services of irq_exit() to do that waiting.
+	 */
+	if (sched_core_enabled(rq) &&
+	    !is_idle_task(next) && next->mm && next->core_cookie)
+		sched_core_sibling_irq_pause(rq);
+#endif
+
 	balance_callback(rq);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7d9f156242e2..3a065d133ef51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1018,11 +1018,14 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 	unsigned char		core_forceidle;
+	unsigned char		core_pause_pending;
+	unsigned int		core_this_irq_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_irq_nest;
 #endif
 };
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a46..b953386c8f62f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -345,6 +345,7 @@ asmlinkage __visible void do_softirq(void)
 void irq_enter(void)
 {
 	rcu_irq_enter();
+	sched_core_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +414,7 @@ void irq_exit(void)
 
 	invoke_softirq();
 	tick_irq_exit();
+	sched_core_irq_exit();
 	rcu_irq_exit();
 	trace_hardirq_exit(); /* must be last! */
 }
-- 
2.26.2.645.ge9eca65c58-goog
On Sun, May 10, 2020 at 07:46:52PM -0400, Joel Fernandes (Google) wrote:
> With current core scheduling patchset, non-threaded IRQ and softirq
> victims can leak data from its hyperthread to a sibling hyperthread
> running an attacker.
>
> For MDS, it is possible for the IRQ and softirq handlers to leak data to
> either host or guest attackers. For L1TF, it is possible to leak to
> guest attackers. There is no possible mitigation involving flushing of
> buffers to avoid this since the execution of attacker and victims happen
> concurrently on 2 or more HTs.
>
> The solution in this patch is to monitor the outer-most core-wide
> irq_enter() and irq_exit() executed by any sibling. In between these
> two, we mark the core to be in a special core-wide IRQ state.
Another possible option is force_irqthreads :-) That would cure it
nicely.
Anyway, I'll go read this.
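[Editor's note] force_irqthreads is the mechanism behind the `threadirqs` kernel boot parameter, which runs hardirq handlers in kernel threads — schedulable entities that a core-scheduling-aware scheduler can then keep off an untrusted sibling. A sketch of how it would be enabled (the GRUB file path and variable name follow common distro convention and are not from this thread):

```sh
# /etc/default/grub (assumed path): add "threadirqs" to the kernel
# command line to force threaded IRQ handlers, then regenerate the
# grub config (e.g. update-grub) and reboot.
GRUB_CMDLINE_LINUX_DEFAULT="quiet threadirqs"
```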
On Mon, May 11, 2020 at 9:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, May 10, 2020 at 07:46:52PM -0400, Joel Fernandes (Google) wrote:
> > With current core scheduling patchset, non-threaded IRQ and softirq
> > victims can leak data from its hyperthread to a sibling hyperthread
> > running an attacker.
> >
> > For MDS, it is possible for the IRQ and softirq handlers to leak data to
> > either host or guest attackers. For L1TF, it is possible to leak to
> > guest attackers. There is no possible mitigation involving flushing of
> > buffers to avoid this since the execution of attacker and victims happen
> > concurrently on 2 or more HTs.
> >
> > The solution in this patch is to monitor the outer-most core-wide
> > irq_enter() and irq_exit() executed by any sibling. In between these
> > two, we mark the core to be in a special core-wide IRQ state.
>
> Another possible option is force_irqthreads :-) That would cure it
> nicely.
Yes, true — it was definitely my "plan B" at one point if this patch
showed any regression. That said, people not using force_irqthreads
would still leave a hole open, and it would be nice to solve it by
default rather than depending on user/sysadmin configuration (the same
argument applies to interrupt affinities: it is another knob for the
sysadmin/designer to configure correctly; also, not all interrupts can
be threaded / affinitized).
Thanks in advance for reviewing the patch,
- Joel
On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> With this said, I realized a workaround for the issue described above:
> when the core went from 'compatible mode'(step 1-3) to 'incompatible
> mode'(step 4), reset all root level sched entities' vruntime to be the
> same as the core wide min_vruntime. After all, the core is transforming
> from two runqueue mode to single runqueue mode... I think this can solve
> the issue to some extent but I may miss other scenarios.
A little something like so, this syncs min_vruntime when we switch to
single queue mode. This is very much SMT2 only, I got my head in twist
when thinking about more siblings, I'll have to try again later.
This very much retains the horrible approximation of S we always do.
Also, it is _completely_ untested...
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
/* real prio, less is less */
static inline bool prio_less(struct task_struct *a, struct task_struct *b)
{
-
int pa = __task_prio(a), pb = __task_prio(b);
if (-pa < -pb)
@@ -114,19 +113,8 @@ static inline bool prio_less(struct task
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- vruntime -= task_cfs_rq(b)->min_vruntime;
- vruntime += task_cfs_rq(a)->min_vruntime;
- }
-
- return !((s64)(a->se.vruntime - vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE)
+ return cfs_prio_less(a, b);
return false;
}
@@ -4293,10 +4281,11 @@ static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct task_struct *next, *max = NULL;
+ int old_active = 0, new_active = 0;
const struct sched_class *class;
const struct cpumask *smt_mask;
- int i, j, cpu;
bool need_sync = false;
+ int i, j, cpu;
cpu = cpu_of(rq);
if (cpu_is_offline(cpu))
@@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
rq_i->core_pick = NULL;
if (rq_i->core_forceidle) {
+ // XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
need_sync = true;
rq_i->core_forceidle = false;
}
+ if (!is_idle_task(rq_i->curr))
+ old_active++;
+
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -4463,8 +4456,12 @@ next_class:;
WARN_ON_ONCE(!rq_i->core_pick);
- if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
- rq_i->core_forceidle = true;
+ if (is_idle_task(rq_i->core_pick)) {
+ if (rq_i->nr_running)
+ rq_i->core_forceidle = true;
+ } else {
+ new_active++;
+ }
if (i == cpu)
continue;
@@ -4476,6 +4473,16 @@ next_class:;
WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
}
+ /* XXX SMT2 only */
+ if (new_active == 1 && old_active > 1) {
+ /*
+ * We just dropped into single-rq mode, increment the sequence
+ * count to trigger the vruntime sync.
+ */
+ rq->core->core_sync_seq++;
+ }
+ rq->core->core_active = new_active;
+
done:
set_next_task(rq, next);
return next;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
return NULL;
}
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+ return se->cfs_rq->tg == pse->cfs_rq->tg;
+}
+
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
return se->parent;
@@ -394,8 +400,6 @@ static inline struct sched_entity *paren
static void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
- int se_depth, pse_depth;
-
/*
* preemption test can be made between sibling entities who are in the
* same cfs_rq i.e who have a common parent. Walk up the hierarchy of
@@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
* parent.
*/
- /* First walk up until both entities are at same depth */
- se_depth = (*se)->depth;
- pse_depth = (*pse)->depth;
-
- while (se_depth > pse_depth) {
- se_depth--;
- *se = parent_entity(*se);
- }
-
- while (pse_depth > se_depth) {
- pse_depth--;
- *pse = parent_entity(*pse);
- }
+ /* XXX we now have 3 of these loops, C stinks */
while (!is_same_group(*se, *pse)) {
- *se = parent_entity(*se);
- *pse = parent_entity(*pse);
+ int se_depth = (*se)->depth;
+ int pse_depth = (*pse)->depth;
+
+ if (se_depth <= pse_depth)
+ *pse = parent_entity(*pse);
+ if (se_depth >= pse_depth)
+ *se = parent_entity(*se);
}
}
@@ -455,6 +452,12 @@ static inline struct sched_entity *paren
return NULL;
}
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+ return true;
+}
+
static inline void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
@@ -462,6 +465,31 @@ find_matching_se(struct sched_entity **s
#endif /* CONFIG_FAIR_GROUP_SCHED */
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+ struct sched_entity *se_a = &a->se, *se_b = &b->se;
+ struct cfs_rq *cfs_rq_a, *cfs_rq_b;
+ u64 vruntime_a, vruntime_b;
+
+ while (!is_same_tg(se_a, se_b)) {
+ int se_a_depth = se_a->depth;
+ int se_b_depth = se_b->depth;
+
+ if (se_a_depth <= se_b_depth)
+ se_b = parent_entity(se_b);
+ if (se_a_depth >= se_b_depth)
+ se_a = parent_entity(se_a);
+ }
+
+ cfs_rq_a = cfs_rq_of(se_a);
+ cfs_rq_b = cfs_rq_of(se_b);
+
+ vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
+ vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
+
+ return !((s64)(vruntime_a - vruntime_b) <= 0);
+}
+
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
@@ -6891,6 +6919,18 @@ static void check_preempt_wakeup(struct
set_last_buddy(se);
}
+static void core_sync_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+ if (!sched_core_enabled(rq))
+ return;
+
+ if (rq->core->core_sync_seq == cfs_rq->core_sync_seq)
+ return;
+
+ cfs_rq->core_sync_seq = rq->core->core_sync_seq;
+ cfs_rq->core_vruntime = cfs_rq->min_vruntime;
+}
+
static struct task_struct *pick_task_fair(struct rq *rq)
{
struct cfs_rq *cfs_rq = &rq->cfs;
@@ -6902,6 +6942,14 @@ static struct task_struct *pick_task_fai
do {
struct sched_entity *curr = cfs_rq->curr;
+ /*
+ * Propagate the sync state down to whatever cfs_rq we need,
+ * the active cfs_rq's will have been done by
+ * set_next_task_fair(), the rest is inactive and will not have
+ * changed due to the current running task.
+ */
+ core_sync_entity(rq, cfs_rq);
+
se = pick_next_entity(cfs_rq, NULL);
if (curr) {
@@ -10825,7 +10873,8 @@ static void switched_to_fair(struct rq *
}
}
-/* Account for a task changing its policy or group.
+/*
+ * Account for a task changing its policy or group.
*
* This routine is mostly called to set cfs_rq->curr field when a task
* migrates between groups/classes.
@@ -10847,6 +10896,9 @@ static void set_next_task_fair(struct rq
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ /* snapshot vruntime before using it */
+ core_sync_entity(rq, cfs_rq);
+
set_next_entity(cfs_rq, se);
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,10 @@ struct cfs_rq {
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
+#ifdef CONFIG_SCHED_CORE
+ unsigned int core_sync_seq;
+ u64 core_vruntime;
+#endif
u64 exec_clock;
u64 min_vruntime;
#ifndef CONFIG_64BIT
@@ -1035,12 +1039,15 @@ struct rq {
unsigned int core_enabled;
unsigned int core_sched_seq;
struct rb_root core_tree;
- bool core_forceidle;
+ unsigned int core_forceidle;
/* shared state */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+ unsigned int core_sync_seq;
+ unsigned int core_active;
+
#endif
};
@@ -2592,6 +2599,8 @@ static inline bool sched_energy_enabled(
#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+extern bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
#ifdef CONFIG_MEMBARRIER
/*
* The scheduler provides memory barriers required by membarrier between:
-----Original Message-----
From: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> On Behalf Of Ning, Hongyu
Sent: Friday, May 8, 2020 8:40 PM
To: vpillai@digitalocean.com; naravamudan@digitalocean.com; jdesfossez@digitalocean.com; peterz@infradead.org; Tim Chen <tim.c.chen@linux.intel.com>; mingo@kernel.org; tglx@linutronix.de; pjt@google.com; torvalds@linux-foundation.org
Cc: fweisbec@gmail.com; keescook@chromium.org; kerrnel@google.com; pauld@redhat.com; aaron.lwe@gmail.com; aubrey.intel@gmail.com; Li, Aubrey <aubrey.li@linux.intel.com>; valentin.schneider@arm.com; mgorman@techsingularity.net; pawan.kumar.gupta@linux.intel.com; pbonzini@redhat.com; joelaf@google.com; joel@joelfernandes.org; linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

- Test environment:
  Intel Xeon Server platform
  CPU(s):              192
  On-line CPU(s) list: 0-191
  Thread(s) per core:  2
  Core(s) per socket:  48
  Socket(s):           2
  NUMA node(s):        4

- Kernel under test: Core scheduling v5 base
  https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
  A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1
     (192 workload tasks for each cgroup)
  B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 +
     sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2
     (192 workload tasks for each cgroup)

- Test results briefing:
  1 Good results:
    1.1 For test set A, coresched could achieve the same or better
        performance compared to smt_off, for both the cpu workload and
        the sysbench mysql workload
    1.2 For test set B, cpu workload, coresched could achieve better
        performance compared to smt_off
  2 Bad results:
    2.1 For test set B, mysql workload, coresched performance is lower
        than smt_off; potential fairness issue between cpu workloads
        and mysql workloads
    2.2 For test set B, cpu workload, potential fairness issue between
        the cpu workloads of the 2 cgroups

- Test results:
  Note: test results in the following tables are Tput normalized to the
  default baseline; 192 tasks / cgroup in all cases.

-- Test set A Tput normalized results:

+--------------------+-----------+-------------+-----------+
| cgroup / workload  | default   | coresched   | smt_off   |
+--------------------+-----------+-------------+-----------+
| cg cpu 1 (cpu)     | 1         | 0.95        | 0.54      |
| cg mysql 1 (mysql) | 1         | 0.92        | 0.97      |
+--------------------+-----------+-------------+-----------+

-- Test set B Tput normalized results:

+--------------------+-----------+-------------+-----------+
| cgroup / workload  | default   | coresched   | smt_off   |
+--------------------+-----------+-------------+-----------+
| cg cpu 1 (cpu)     | 1         | 0.9         | 0.47      |
| cg cpu 2 (cpu)     | 1         | 1.32        | 0.66      |
| cg mysql 1 (mysql) | 1         | 0.42        | 0.89      |
| cg mysql 2 (mysql) | 1         | 0.42        | 0.89      |
+--------------------+-----------+-------------+-----------+

> On Wed, 4 Mar 2020 16:59:50 +0000, vpillai <vpillai@digitalocean.com> wrote:
>
> > Fifth iteration of the Core-Scheduling feature.
> >
> > v5 is rebased on top of 5.5.5 (449718782a46)
> > https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

----------------------------------------------------------------------
ABOUT:
----------------------------------------------------------------------
Hello,

Core scheduling is required to protect against leakage of sensitive
data allocated on a sibling thread. Our goal is to measure the
performance impact of core scheduling across different workloads and
show how it evolved over time. Below you will find data based on
core-sched (v5). The attached PDF covers the system configuration setup
as well as further explanation of the findings.

----------------------------------------------------------------------
BENCHMARKS:
----------------------------------------------------------------------
- hammerdb      : database benchmarking application
- sysbench-cpu  : multi-threaded cpu benchmark
- sysbench-mysql: multi-threaded benchmark that tests an open source DBMS
- build-kernel  : benchmark that is used to build the Linux kernel

----------------------------------------------------------------------
PERFORMANCE IMPACT:
----------------------------------------------------------------------
+----------------+--------------+------------+-------------------+--------------------+--------------------+
| benchmark      | # of cgroups | overcommit | baseline + smt_on | coresched + smt_on | baseline + smt_off |
+----------------+--------------+------------+-------------------+--------------------+--------------------+
| hammerdb       | 2 cgroups    | 2x         | 1                 | 0.96               | 0.87               |
+----------------+--------------+------------+-------------------+--------------------+--------------------+
| sysbench-cpu   | 2 cgroups    | 2x         | 1                 | 0.95               | 0.54               |
| sysbench-mysql |              |            | 1                 | 0.90               | 0.47               |
+----------------+--------------+------------+-------------------+--------------------+--------------------+
| sysbench-cpu   | 4 cgroups    | 4x         | 1                 | 0.90               | 0.47               |
| sysbench-cpu   |              |            | 1                 | 1.32               | 0.66               |
| sysbench-mysql |              |            | 1                 | 0.42               | 0.89               |
| sysbench-mysql |              |            | 1                 | 0.42               | 0.89               |
+----------------+--------------+------------+-------------------+--------------------+--------------------+
| kernel-build   | 2 cgroups    | 0.5x       | 1                 | 1                  | 0.93               |
|                |              | 1x         | 1                 | 0.99               | 0.92               |
|                |              | 2x         | 1                 | 0.98               | 0.91               |
+----------------+--------------+------------+-------------------+--------------------+--------------------+

----------------------------------------------------------------------
TAKE AWAYS:
----------------------------------------------------------------------
1. Core scheduling performs better than turning off HT.
2. The impact of core scheduling depends on the workload and thread
   scheduling intensity.
3. Core scheduling requires cgroups. Tasks from the same cgroup are
   scheduled on the same core.
4. With core scheduling, certain situations will introduce an uneven
   load distribution between multiple workload types. In such a case a
   bias towards the cpu intensive workload is expected.
5. Load balancing is not perfect. It needs more work.

Many thanks,
--Agata

[-- Attachment #2: LKML_core_sched_v5.5.y.pdf --]
[-- Type: application/pdf, Size: 360252 bytes --]
Hi Peter,

On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thinking about more siblings, I'll have to try again later.
>
Thanks for the quick patch! :-)

For SMT-n, would it work to sync the vruntime if at least one sibling is
forced idle? Since force_idle is for all the rqs, I think it would be
correct to sync the vruntime if at least one cpu is forced idle.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> -       if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -               rq_i->core_forceidle = true;
> +       if (is_idle_task(rq_i->core_pick)) {
> +               if (rq_i->nr_running)
> +                       rq_i->core_forceidle = true;
> +       } else {
> +               new_active++;
I think we need to reset new_active on restarting the selection.
> +       }
>
>         if (i == cpu)
>                 continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>                 WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>         }
>
> +       /* XXX SMT2 only */
> +       if (new_active == 1 && old_active > 1) {
As I mentioned above, would it be correct to check if at least one
sibling is forced_idle? Something like:

  if (cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

> +               /*
> +                * We just dropped into single-rq mode, increment the sequence
> +                * count to trigger the vruntime sync.
> +                */
> +               rq->core->core_sync_seq++;
> +       }
> +       rq->core->core_active = new_active;
core_active seems to be unused.

> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +       struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> +       u64 vruntime_a, vruntime_b;
> +
> +       while (!is_same_tg(se_a, se_b)) {
> +               int se_a_depth = se_a->depth;
> +               int se_b_depth = se_b->depth;
> +
> +               if (se_a_depth <= se_b_depth)
> +                       se_b = parent_entity(se_b);
> +               if (se_a_depth >= se_b_depth)
> +                       se_a = parent_entity(se_a);
> +       }
> +
> +       cfs_rq_a = cfs_rq_of(se_a);
> +       cfs_rq_b = cfs_rq_of(se_b);
> +
> +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
Should we be using core_vruntime conditionally? Should it be
min_vruntime for default comparisons and core_vruntime during
force_idle?

Thanks,
Vineeth
On Thu, May 14, 2020 at 06:51:27PM -0400, Vineeth Remanan Pillai wrote: > On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > A little something like so, this syncs min_vruntime when we switch to > > single queue mode. This is very much SMT2 only, I got my head in twist > > when thikning about more siblings, I'll have to try again later. > > > Thanks for the quick patch! :-) > > For SMT-n, would it work if sync vruntime if atleast one sibling is > forced idle? Since force_idle is for all the rqs, I think it would > be correct to sync the vruntime if atleast one cpu is forced idle. It's complicated ;-) So this sync is basically a relative reset of S to 0. So with 2 queues, when one goes idle, we drop them both to 0 and one then increases due to not being idle, and the idle one builds up lag to get re-elected. So far so simple, right? When there's 3, we can have the situation where 2 run and one is idle, we sync to 0 and let the idle one build up lag to get re-election. Now suppose another one also drops idle. At this point dropping all to 0 again would destroy the built-up lag from the queue that was already idle, not good. So instead of syncing everything, we can: less := !((s64)(s_a - s_b) <= 0) (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b == v_a - (v_b - S_a + S_b) IOW, we can recast the (lag) comparison to a one-sided difference. So if then, instead of syncing the whole queue, sync the idle queue against the active queue with S_a + S_b at the point where we sync. (XXX consider the implication of living in a cyclic group: N / 2^n N) This gives us means of syncing single queues against the active queue, and for already idle queues to preseve their build-up lag. Of course, then we get the situation where there's 2 active and one going idle, who do we pick to sync against? Theory would have us sync against the combined S, but as we've already demonstated, there is no such thing in infeasible weight scenarios. 
One thing I've considered, and this is where that core_active rudiment came from, is having active queues sync up between themselves after every tick. This limits the observed divergence due to the work conservation. On top of that, we can improve upon things by moving away from our horrible (10) hack and moving to (9) and employing (13) here. Anyway, I got partway through that in the past days, but then my head hurt. I'll consider it some more :-) > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > - if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) > > - rq_i->core_forceidle = true; > > + if (is_idle_task(rq_i->core_pick)) { > > + if (rq_i->nr_running) > > + rq_i->core_forceidle = true; > > + } else { > > + new_active++; > I think we need to reset new_active on restarting the selection. But this loop is after selection has been done; we don't modify new_active during selection. > > + /* > > + * We just dropped into single-rq mode, increment the sequence > > + * count to trigger the vruntime sync. > > + */ > > + rq->core->core_sync_seq++; > > + } > > + rq->core->core_active = new_active; > core_active seems to be unused. Correct; that's a rudiment from an SMT-n attempt. > > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b) > > +{ > > + struct sched_entity *se_a = &a->se, *se_b = &b->se; > > + struct cfs_rq *cfs_rq_a, *cfa_rq_b; > > + u64 vruntime_a, vruntime_b; > > + > > + while (!is_same_tg(se_a, se_b)) { > > + int se_a_depth = se_a->depth; > > + int se_b_depth = se_b->depth; > > + > > + if (se_a_depth <= se_b_depth) > > + se_b = parent_entity(se_b); > > + if (se_a_depth >= se_b_depth) > > + se_a = parent_entity(se_a); > > + } > > + > > + cfs_rq_a = cfs_rq_of(se_a); > > + cfs_rq_b = cfs_rq_of(se_b); > > + > > + vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime; > > + vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime; > Should we be using core_vruntime conditionally?
should it be min_vruntime for > default comparisons and core_vruntime during force_idle? At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b, ie. when we're on the same CPU. For the other case I was considering that tick based active sync, but never got that finished and admittedly it all looks a bit weird. But I figured I'd send it out so we can at least advance the discussion. --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a { struct sched_entity *se_a = &a->se, *se_b = &b->se; struct cfs_rq *cfs_rq_a, *cfa_rq_b; - u64 vruntime_a, vruntime_b; + u64 s_a, s_b, S_a, S_b; while (!is_same_tg(se_a, se_b)) { int se_a_depth = se_a->depth; @@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a cfs_rq_a = cfs_rq_of(se_a); cfs_rq_b = cfs_rq_of(se_b); - vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime; - vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime; + S_a = cfs_rq_a->core_vruntime; + S_b = cfs_rq_b->core_vruntime; - return !((s64)(vruntime_a - vruntime_b) <= 0); + if (cfs_rq_a == cfs_rq_b) + S_a = S_b = cfs_rq_a->min_vruntime; + + s_a = se_a->vruntime - S_a; + s_b = se_b->vruntime - S_b; + + return !((s64)(s_a - s_b) <= 0); } static __always_inline
On Fri, May 15, 2020 at 12:38:44PM +0200, Peter Zijlstra wrote: > less := !((s64)(s_a - s_b) <= 0) > > (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b > == v_a - (v_b - S_a + S_b) > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a > { > struct sched_entity *se_a = &a->se, *se_b = &b->se; > struct cfs_rq *cfs_rq_a, *cfa_rq_b; > - u64 vruntime_a, vruntime_b; > + u64 s_a, s_b, S_a, S_b; > > while (!is_same_tg(se_a, se_b)) { > int se_a_depth = se_a->depth; > @@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a > cfs_rq_a = cfs_rq_of(se_a); > cfs_rq_b = cfs_rq_of(se_b); > > - vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime; > - vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime; > + S_a = cfs_rq_a->core_vruntime; > + S_b = cfs_rq_b->core_vruntime; > > - return !((s64)(vruntime_a - vruntime_b) <= 0); > + if (cfs_rq_a == cfs_rq_b) > + S_a = S_b = cfs_rq_a->min_vruntime; > + > + s_a = se_a->vruntime - S_a; > + s_b = se_b->vruntime - S_b; > + > + return !((s64)(s_a - s_b) <= 0); > } Clearly I'm not awake yet; 's/s_/l_/g', 's/v_/s_/g', IOW: l_a = s_a - S_a
On Fri, May 15, 2020 at 6:39 AM Peter Zijlstra <peterz@infradead.org> wrote: > > It's complicated ;-) > > So this sync is basically a relative reset of S to 0. > > So with 2 queues, when one goes idle, we drop them both to 0 and one > then increases due to not being idle, and the idle one builds up lag to > get re-elected. So far so simple, right? > > When there's 3, we can have the situation where 2 run and one is idle, > we sync to 0 and let the idle one build up lag to get re-election. Now > suppose another one also drops idle. At this point dropping all to 0 > again would destroy the built-up lag from the queue that was already > idle, not good. > Thanks for the clarification :-). I was suggesting an idea of a core-wide force_idle. We sync the core_vruntime on the first force_idle of a sibling in the core and start using core_vruntime for priority comparison from then on. That way, we don't reset the lag on every force_idle, and the lag builds up from the first sibling that was forced idle. I think this would work with infeasible weights as well, but I need to think more to see if it would break. A sample check to enter this core-wide force_idle state is: (cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active) And we exit the core-wide force_idle state when the last sibling goes out of force_idle, and can start using min_vruntime for priority comparison from then on. When there is a cookie match on all siblings, we don't do priority comparison now. But I think we need to do priority comparison for cookie matches also, so that we update 'max' in the loop. And for this comparison during a no-forced-idle scenario, I hope it should be fine to use the min_vruntime. Updating 'max' in the loop when cookies match is not really needed for SMT2, but would be needed for SMTn. This is just a wild idea on top of your patches. It might not be accurate in all cases, and I need to think more about the corner cases.
I thought I would think out loud here :-) > So instead of syncing everything, we can: > > less := !((s64)(s_a - s_b) <= 0) > > (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b > == v_a - (v_b - S_a + S_b) > > IOW, we can recast the (lag) comparison to a one-sided difference. > So if then, instead of syncing the whole queue, sync the idle queue > against the active queue with S_a + S_b at the point where we sync. > > (XXX consider the implication of living in a cyclic group: N / 2^n N) > > This gives us means of syncing single queues against the active queue, > and for already idle queues to preserve their built-up lag. > > Of course, then we get the situation where there's 2 active and one > going idle, who do we pick to sync against? Theory would have us sync > against the combined S, but as we've already demonstrated, there is no > such thing in infeasible weight scenarios. > > One thing I've considered; and this is where that core_active rudiment > came from, is having active queues sync up between themselves after > every tick. This limits the observed divergence due to the work > conservation. > > On top of that, we can improve upon things by moving away from our > horrible (10) hack and moving to (9) and employing (13) here. > > Anyway, I got partway through that in the past days, but then my head > hurt. I'll consider it some more :-) This sounds much better and a more accurate approach than the one I mentioned above. Please share the code when you have it in some form :-) > > > > + new_active++; > > I think we need to reset new_active on restarting the selection. > > But this loop is after selection has been done; we don't modify > new_active during selection. My bad, sorry about this false alarm! > > > + > > > + vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime; > > > + vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime; > > Should we be using core_vruntime conditionally?
should it be min_vruntime for > > default comparisons and core_vruntime during force_idle? > > At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b, > ie. when we're on the same CPU. > yes, this makes sense. The issue that I was thinking about is: when there is no force_idle and all siblings run compatible tasks for a while, min_vruntime progresses, but core_vruntime lags behind. And when a new task gets enqueued, it gets the min_vruntime. But now during comparison it might be treated unfairly. Consider a small example of two rqs, rq1 and rq2. rq1->cfs->min_vruntime = 1000 rq2->cfs->min_vruntime = 2000 During a force_idle, core_vruntime gets synced and rq1->cfs->core_vruntime = 1000 rq2->cfs->core_vruntime = 2000 Now, suppose the core is out of force_idle and runs two compatible tasks for a while, where the task on rq1 has more weight. min_vruntime progresses on both, but slowly on rq1. Say the progress looks like: rq1->cfs->min_vruntime = 1200, se1->vruntime = 1200 rq2->cfs->min_vruntime = 2500, se2->vruntime = 2500 If a new incompatible task (se3) gets enqueued to rq2, its vruntime would be based on rq2's min_vruntime, say: se3->vruntime = 2500 During our priority comparison, the lags would be: l_se1 = 200 l_se3 = 500 So se1 will get selected and run with se2 until its lag catches up with se3's lag (even if se3 has more weight than se1). This is a hypothetical situation, but it can happen, I think. And if we use min_vruntime for comparison during the no-force_idle scenario, we could avoid this. What do you think? I didn't clearly understand the tick-based active sync, but it would probably fix this problem better, I guess. Thanks, Vineeth
On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote: > On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote: > > With this said, I realized a workaround for the issue described above: > > when the core went from 'compatible mode'(step 1-3) to 'incompatible > > mode'(step 4), reset all root level sched entities' vruntime to be the > > same as the core wide min_vruntime. After all, the core is transforming > > from two runqueue mode to single runqueue mode... I think this can solve > > the issue to some extent but I may miss other scenarios. > > A little something like so, this syncs min_vruntime when we switch to > single queue mode. This is very much SMT2 only, I got my head in twist > when thinking about more siblings, I'll have to try again later. Thanks a lot for the patch, I now see that "there is no need to adjust every se's vruntime". :-) > This very much retains the horrible approximation of S we always do. > > Also, it is _completely_ untested... I've been testing it. One problem below. > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4293,10 +4281,11 @@ static struct task_struct * > pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > { > struct task_struct *next, *max = NULL; > + int old_active = 0, new_active = 0; > const struct sched_class *class; > const struct cpumask *smt_mask; > - int i, j, cpu; > bool need_sync = false; > + int i, j, cpu; > > cpu = cpu_of(rq); > if (cpu_is_offline(cpu)) > @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas > rq_i->core_pick = NULL; > > if (rq_i->core_forceidle) { > + // XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
> need_sync = true; > rq_i->core_forceidle = false; > } > > + if (!is_idle_task(rq_i->curr)) > + old_active++; > + > if (i != cpu) > update_rq_clock(rq_i); > } > @@ -4463,8 +4456,12 @@ next_class:; > > WARN_ON_ONCE(!rq_i->core_pick); > > - if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) > - rq_i->core_forceidle = true; > + if (is_idle_task(rq_i->core_pick)) { > + if (rq_i->nr_running) > + rq_i->core_forceidle = true; > + } else { > + new_active++; > + } > > if (i == cpu) > continue; > @@ -4476,6 +4473,16 @@ next_class:; > WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); > } > > + /* XXX SMT2 only */ > + if (new_active == 1 && old_active > 1) { There is a case when incompatible task appears but we failed to 'drop into single-rq mode' per the above condition check. The TLDR is: when there is a task that sits on the sibling rq with the same cookie as 'max', new_active will be 2 instead of 1 and that would cause us missing the chance to do a sync of core min_vruntime. This is how it happens: 1) 2 tasks of the same cgroup with different weight running on 2 siblings, say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound at cpu1(assume cpu0 and cpu1 are siblings); 2) Since new_active == 2, we didn't trigger min_vruntime sync. For simplicity, let's assume both siblings' root cfs_rq's min_vruntime and core_vruntime are all at 0 now; 3) let the two tasks run a while; 4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's existing task has a very small weight, its cfs_rq's min_vruntime can be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is much larger than cg0_A's and the 'max' of the core wide task selection goes to cg0_A; 5) Now I suppose we should drop into single-rq mode and by doing a sync of core min_vruntime, cg1_C's turn shall come. 
But the problem is, our current selection logic prefers not to waste CPU time, so after deciding cg0_A as the 'max', the sibling will also do a cookie_pick() and get cg0_B to run. This is where the problem arises: new_active is 2 instead of the expected 1. 6) Because we didn't do the sync of core min_vruntime, the newly queued cg1_C shall wait a long time before cg0_A's vruntime catches up. One naive way to precisely determine when to drop into single-rq mode is to track how many tasks of a particular tag exist and use that to decide if the core is in compatible mode (all tasks belong to the same cgroup, IOW, have the same core_cookie) or not and act accordingly, except that: does this sound too complex and inefficient?... > + /* > + * We just dropped into single-rq mode, increment the sequence > + * count to trigger the vruntime sync. > + */ > + rq->core->core_sync_seq++; > + } > + rq->core->core_active = new_active; > + > done: > set_next_task(rq, next); > return next;
Add a per-thread core scheduling interface which allows a thread to tag itself and enable core scheduling. Based on discussion at OSPM with maintainers, we propose a prctl(2) interface accepting values of 0 or 1. 1 - enable core scheduling for the task. 0 - disable core scheduling for the task. Special cases: (1) The core-scheduling patchset contains a CGroup interface as well. In order for us to respect users of that interface, we avoid overriding the tag if a task was CGroup-tagged because the task becomes inconsistent with the CGroup tag. Instead return -EBUSY. (2) If a task is prctl-tagged, allow the CGroup interface to override the task's tag. ChromeOS will use core-scheduling to securely enable hyperthreading. This cuts down the keypress latency in Google docs from 150ms to 50ms while improving the camera streaming frame rate by ~3%. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- include/linux/sched.h | 6 ++++ include/uapi/linux/prctl.h | 3 ++ kernel/sched/core.c | 57 ++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 3 ++ 4 files changed, 69 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index fe6ae59fcadbe..8a40a093aa2ca 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1986,6 +1986,12 @@ static inline void rseq_execve(struct task_struct *t) #endif +#ifdef CONFIG_SCHED_CORE +int task_set_core_sched(int set, struct task_struct *tsk); +#else +int task_set_core_sched(int set, struct task_struct *tsk) { return -ENOTSUPP; } +#endif + void __exit_umh(struct task_struct *tsk); static inline void exit_umh(struct task_struct *tsk) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 07b4f8131e362..dba0c70f9cce6 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -238,4 +238,7 @@ struct prctl_mm_map { #define PR_SET_IO_FLUSHER 57 #define PR_GET_IO_FLUSHER 58 +/* Core scheduling per-task interface */ +#define PR_SET_CORE_SCHED 59 + #endif /* _LINUX_PRCTL_H 
*/ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 684359ff357e7..780514d03da47 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3320,6 +3320,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) #endif #ifdef CONFIG_SCHED_CORE RB_CLEAR_NODE(&p->core_node); + + /* + * If task is using prctl(2) for tagging, do the prctl(2)-style tagging + * for the child as well. + */ + if (current->core_cookie && ((unsigned long)current == current->core_cookie)) + task_set_core_sched(1, p); #endif return 0; } @@ -7857,6 +7864,56 @@ void __cant_sleep(const char *file, int line, int preempt_offset) EXPORT_SYMBOL_GPL(__cant_sleep); #endif +#ifdef CONFIG_SCHED_CORE + +/* Ensure that all siblings have rescheduled once */ +static int task_set_core_sched_stopper(void *data) +{ + return 0; +} + +int task_set_core_sched(int set, struct task_struct *tsk) +{ + if (!tsk) + tsk = current; + + if (set > 1) + return -ERANGE; + + if (!static_branch_likely(&sched_smt_present)) + return -EINVAL; + + /* + * If cookie was set previously, return -EBUSY if either of the + * following are true: + * 1. Task was previously tagged by CGroup method. + * 2. Task or its parent were tagged by prctl(). + * + * Note that, if CGroup tagging is done after prctl(), then that would + * override the cookie. However, if prctl() is done after task was + * added to tagged CGroup, then the prctl() returns -EBUSY. + */ + if (!!tsk->core_cookie == set) { + if ((tsk->core_cookie == (unsigned long)tsk) || + (tsk->core_cookie == (unsigned long)tsk->sched_task_group)) { + return -EBUSY; + } + } + + if (set) + sched_core_get(); + + tsk->core_cookie = set ? 
(unsigned long)tsk : 0; + + stop_machine(task_set_core_sched_stopper, NULL, NULL); + + if (!set) + sched_core_put(); + + return 0; +} +#endif + #ifdef CONFIG_MAGIC_SYSRQ void normalize_rt_tasks(void) { diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a9..5c3bcf40dcb34 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2514,6 +2514,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER; break; + case PR_SET_CORE_SCHED: + error = task_set_core_sched(arg2, NULL); + break; default: error = -EINVAL; break; -- 2.26.2.761.g0e0b3e54be-goog
With the current core scheduling patchset, non-threaded IRQ and softirq victims can leak data from their hyperthread to a sibling hyperthread running an attacker. For MDS, it is possible for the IRQ and softirq handlers to leak data to either host or guest attackers. For L1TF, it is possible to leak to guest attackers. There is no possible mitigation involving flushing of buffers to avoid this since the execution of attacker and victims happen concurrently on 2 or more HTs. The solution in this patch is to monitor the outer-most core-wide irq_enter() and irq_exit() executed by any sibling. In between these two, we mark the core to be in a special core-wide IRQ state. In the IRQ entry, if we detect that the sibling is running untrusted code, we send a reschedule IPI so that the sibling transitions through the sibling's irq_exit() to do any waiting there, till the IRQ being protected finishes. We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu outer-most irq_exit(), the core is still in the special core-wide IRQ state, we perform a busy-wait till the core exits this state. This combination of per-cpu and core-wide IRQ states helps to handle any combination of irq_entry()s and irq_exit()s happening on all of the siblings of the core in any order. Lastly, we also check in the schedule loop if we are about to schedule an untrusted process while the core is in such a state. This is possible if a trusted thread enters the scheduler by way of yielding CPU. This would involve no transitions through the irq_exit() point to do any waiting, so we have to explicitly do the waiting there. Every attempt is made to avoid busy-waiting unnecessarily, and in testing on real-world ChromeOS use cases, it has not shown a performance drop.
In ChromeOS, with this and the rest of the core scheduling patchset, we see around a 300% improvement in key press latencies into Google docs when Camera streaming is running simultaneously (90th percentile latency of ~150ms drops to ~50ms). Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Aubrey Li <aubrey.li@linux.intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Paul E. McKenney <paulmck@kernel.org> Co-developed-by: Vineeth Pillai <vpillai@digitalocean.com> Signed-off-by: Vineeth Pillai <vpillai@digitalocean.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- If you would like some pictures of the cases handled by this patch, please see the OSPM slide deck (the below link jumps straight to the relevant slides - about 6-7 of them in total): https://bit.ly/2zvzxWk v1->v2: Fixed a bug where softirq was causing deadlock (thanks Vineeth/Julien) The issue was because of the following flow: On CPU0: local_bh_enable() -> Enter softirq -> Softirq takes a lock. -> <new Interrupt received during softirq> -> New interrupt's irq_exit() : Wait since it is not the outermost core-wide irq_exit(). On CPU1: <interrupt received> irq_enter() -> Enter the core-wide IRQ state. <ISR raises a softirq which will run from irq_exit(). irq_exit() -> -> enters softirq -> softirq tries to take a lock and blocks. So it is an A->B and B->A deadlock. A = Enter the core-wide IRQ state or wait for it to end. B = Acquire a lock during softirq or wait for it to be released. The fix is to enter the core-wide IRQ state even when entering through the local_bh_enable -> softirq path (when there is no hardirq context), which basically becomes: On CPU0: local_bh_enable() (Fix: Call sched_core_irq_enter() --> similar to irq_enter()). -> Enter softirq -> Softirq takes a lock. -> <new Interrupt received during softirq> -> irq_enter() -> New interrupt's irq_exit() (Will not wait since we are the inner irq_exit()).
include/linux/sched.h | 8 +++ kernel/sched/core.c | 159 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 3 + kernel/softirq.c | 12 ++++ 4 files changed, 182 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 710e9a8956007..fe6ae59fcadbe 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq); const struct cpumask *sched_trace_rd_span(struct root_domain *rd); +#ifdef CONFIG_SCHED_CORE +void sched_core_irq_enter(void); +void sched_core_irq_exit(void); +#else +static void sched_core_irq_enter(void) { } +static void sched_core_irq_exit(void) { } +#endif + #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 21c640170323b..684359ff357e7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b) return a->core_cookie == b->core_cookie; } +/* + * Helper function to pause the caller's hyperthread until the core exits the + * core-wide IRQ state. Obviously the CPU calling this function should not be + * responsible for the core being in the core-wide IRQ state otherwise it will + * deadlock. This function should be called from irq_exit() and from schedule(). + * It is upto the callers to decide if calling here is necessary. + */ +static inline void sched_core_sibling_irq_pause(struct rq *rq) +{ + /* + * Wait till the core of this HT is not in a core-wide IRQ state. + * + * Pair with smp_store_release() in sched_core_irq_exit(). + */ + while (smp_load_acquire(&rq->core->core_irq_nest) > 0) + cpu_relax(); +} + +/* + * Enter the core-wide IRQ state. Sibling will be paused if it is running + * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to + * avoid sending useless IPIs is made. Must be called only from hard IRQ + * context. 
+ */ +void sched_core_irq_enter(void) +{ + int i, cpu = smp_processor_id(); + struct rq *rq = cpu_rq(cpu); + const struct cpumask *smt_mask; + + if (!sched_core_enabled(rq)) + return; + + /* Count irq_enter() calls received without irq_exit() on this CPU. */ + rq->core_this_irq_nest++; + + /* If not outermost irq_enter(), do nothing. */ + if (WARN_ON_ONCE(rq->core->core_this_irq_nest == UINT_MAX) || + rq->core_this_irq_nest != 1) + return; + + raw_spin_lock(rq_lockp(rq)); + smt_mask = cpu_smt_mask(cpu); + + /* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */ + WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1); + if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX)) + goto unlock; + + if (rq->core_pause_pending) { + /* + * Do nothing more since we are in a 'reschedule IPI' sent from + * another sibling. That sibling would have sent IPIs to all of + * the HTs. + */ + goto unlock; + } + + /* + * If we are not the first ones on the core to enter core-wide IRQ + * state, do nothing. + */ + if (rq->core->core_irq_nest > 1) + goto unlock; + + /* Do nothing more if the core is not tagged. */ + if (!rq->core->core_cookie) + goto unlock; + + for_each_cpu(i, smt_mask) { + struct rq *srq = cpu_rq(i); + + if (i == cpu || cpu_is_offline(i)) + continue; + + if (!srq->curr->mm || is_idle_task(srq->curr)) + continue; + + /* Skip if HT is not running a tagged task. */ + if (!srq->curr->core_cookie && !srq->core_pick) + continue; + + /* IPI only if previous IPI was not pending. */ + if (!srq->core_pause_pending) { + srq->core_pause_pending = 1; + smp_send_reschedule(i); + } + } +unlock: + raw_spin_unlock(rq_lockp(rq)); +} + +/* + * Process any work need for either exiting the core-wide IRQ state, or for + * waiting on this hyperthread if the core is still in this state. 
+ */ +void sched_core_irq_exit(void) +{ + int cpu = smp_processor_id(); + struct rq *rq = cpu_rq(cpu); + bool wait_here = false; + unsigned int nest; + + /* Do nothing if core-sched disabled. */ + if (!sched_core_enabled(rq)) + return; + + rq->core_this_irq_nest--; + + /* If not outermost on this CPU, do nothing. */ + if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) || + rq->core_this_irq_nest > 0) + return; + + raw_spin_lock(rq_lockp(rq)); + /* + * Core-wide nesting counter can never be 0 because we are + * still in it on this CPU. + */ + nest = rq->core->core_irq_nest; + WARN_ON_ONCE(!nest); + + /* + * If we still have other CPUs in IRQs, we have to wait for them. + * Either here, or in the scheduler. + */ + if (rq->core->core_cookie && nest > 1) { + /* + * If we are entering the scheduler anyway, we can just wait + * there for ->core_irq_nest to reach 0. If not, just wait here. + */ + if (!tif_need_resched()) { + wait_here = true; + } + } + + if (rq->core_pause_pending) + rq->core_pause_pending = 0; + + /* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */ + smp_store_release(&rq->core->core_irq_nest, nest - 1); + raw_spin_unlock(rq_lockp(rq)); + + if (wait_here) + sched_core_sibling_irq_pause(rq); +} + // XXX fairness/fwd progress conditions /* * Returns @@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt) rq_unlock_irq(rq, &rf); } +#ifdef CONFIG_SCHED_CORE + /* + * If a CPU that was running a trusted task entered the scheduler, and + * the next task is untrusted, then check if waiting for core-wide IRQ + * state to cease is needed since we would not have been able to get + * the services of irq_exit() to do that waiting. 
+ */ + if (sched_core_enabled(rq) && + !is_idle_task(next) && next->mm && next->core_cookie) + sched_core_sibling_irq_pause(rq); +#endif + balance_callback(rq); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a7d9f156242e2..3a065d133ef51 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1018,11 +1018,14 @@ struct rq { unsigned int core_sched_seq; struct rb_root core_tree; unsigned char core_forceidle; + unsigned char core_pause_pending; + unsigned int core_this_irq_nest; /* shared state */ unsigned int core_task_seq; unsigned int core_pick_seq; unsigned long core_cookie; + unsigned int core_irq_nest; #endif }; diff --git a/kernel/softirq.c b/kernel/softirq.c index 0427a86743a46..147abd6d82599 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -273,6 +273,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); + /* + * Core scheduling mitigations require entry into softirq to send stall + * IPIs to sibling hyperthreads if needed (ex, sibling is running + * untrusted task). If we are here from irq_exit(), no IPIs are sent. + */ + sched_core_irq_enter(); + local_irq_enable(); h = softirq_vec; @@ -305,6 +312,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) rcu_softirq_qs(); local_irq_disable(); + /* Inform the scheduler about exit from softirq. */ + sched_core_irq_exit(); + pending = local_softirq_pending(); if (pending) { if (time_before(jiffies, end) && !need_resched() && @@ -345,6 +355,7 @@ asmlinkage __visible void do_softirq(void) void irq_enter(void) { rcu_irq_enter(); + sched_core_irq_enter(); if (is_idle_task(current) && !in_interrupt()) { /* * Prevent raise_softirq from needlessly waking up ksoftirqd @@ -413,6 +424,7 @@ void irq_exit(void) invoke_softirq(); tick_irq_exit(); + sched_core_irq_exit(); rcu_irq_exit(); trace_hardirq_exit(); /* must be last! */ } -- 2.26.2.761.g0e0b3e54be-goog
rcu_read_unlock() can incur an infrequent deadlock in sched_core_balance(). Fix this by using sched-RCU instead. This fixes the following spinlock recursion observed when testing the core scheduling patches on PREEMPT=y kernel on ChromeOS: [ 3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0 [ 3.240900] lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2 [ 3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4 [ 3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018 [ 3.240910] Call Trace: [ 3.240919] dump_stack+0x97/0xdb [ 3.240924] ? spin_bug+0xa4/0xb1 [ 3.240927] do_raw_spin_lock+0x79/0x98 [ 3.240931] try_to_wake_up+0x367/0x61b [ 3.240935] rcu_read_unlock_special+0xde/0x169 [ 3.240938] ? sched_core_balance+0xd9/0x11e [ 3.240941] __rcu_read_unlock+0x48/0x4a [ 3.240945] __balance_callback+0x50/0xa1 [ 3.240949] __schedule+0x55a/0x61e [ 3.240952] schedule_idle+0x21/0x2d [ 3.240956] do_idle+0x1d5/0x1f8 [ 3.240960] cpu_startup_entry+0x1d/0x1f [ 3.240964] start_secondary+0x159/0x174 [ 3.240967] secondary_startup_64+0xa4/0xb0 [ 14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! 
[kworker/0:10:965] Cc: vpillai <vpillai@digitalocean.com> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Aubrey Li <aubrey.intel@gmail.com> Cc: peterz@infradead.org Cc: paulmck@kernel.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e --- kernel/sched/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 780514d03da47..b8ca6fcaaaf06 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq) struct sched_domain *sd; int cpu = cpu_of(rq); - rcu_read_lock(); + rcu_read_lock_sched(); raw_spin_unlock_irq(rq_lockp(rq)); for_each_domain(cpu, sd) { if (!(sd->flags & SD_LOAD_BALANCE)) @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq) break; } raw_spin_lock_irq(rq_lockp(rq)); - rcu_read_unlock(); + rcu_read_unlock_sched(); } static DEFINE_PER_CPU(struct callback_head, core_balance_head); -- 2.26.2.761.g0e0b3e54be-goog
> On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
>
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> 1 - enable core scheduling for the task.
> 0 - disable core scheduling for the task.
>
> Special cases:
> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
>
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.
Hi,
Are the performance improvements compared to the hyperthreading disabled scenario or not?
Could you help to explain how the keypress latency improvement comes with core-scheduling?
Thanks a lot.
Regards,
Jiang
On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> 1 - enable core scheduling for the task.
> 0 - disable core scheduling for the task.

Yeah, so this is a terrible interface :-)

It doesn't allow tasks to form their own groups (by for example setting
the key to that of another task).

It is also horribly ill defined what it means to 'enable': with whom is
the task allowed to share a core?

> Special cases:
> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
>
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.

OK, so cgroup always wins; why is that a good thing?

> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

It doesn't consider permissions.

Basically, with the way you guys use it, it should be CAP_SYS_ADMIN-only
to enable core-sched. That also means we should very much default to
disable.
Hi Peter,

Thanks for the comments.

On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > 1 - enable core scheduling for the task.
> > 0 - disable core scheduling for the task.
>
> Yeah, so this is a terrible interface :-)

I tried to keep it simple. You are right, lets make it better.

> It doesn't allow tasks to form their own groups (by for example setting
> the key to that of another task).

So for this, I was thinking of making the prctl pass in an integer. And 0
would mean untagged. Does that sound good to you?

> It is also horribly ill defined what it means to 'enable': with whom is
> the task allowed to share a core?

I couldn't parse this. Do you mean "enabling coresched does not make sense if
we don't specify whom to share the core with?"

> > Special cases:
> >
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> >
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
>
> OK, so cgroup always wins; why is that a good thing?

I was just trying to respect the functionality of the CGroup patch in the
coresched series, after all a gentleman named Peter Zijlstra wrote that
patch ;-) ;-).

More seriously, the reason I did it this way is that prctl-tagging is a bit
incompatible with CGroup tagging:

1. What happens if 2 tasks are in a tagged CGroup and one of them changes
their cookie through prctl? Do they still remain in the tagged CGroup but are
now going to not trust each other? Do they get removed from the CGroup? This
is why I made the prctl fail with -EBUSY in such cases.

2. What happens if 2 tagged tasks with different cookies are added to a
tagged CGroup? Do we fail the addition of the tasks to the group, or do we
override their cookie (like I'm doing)?

> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
>
> It doesn't consider permissions.
>
> Basically, with the way you guys use it, it should be CAP_SYS_ADMIN-only
> to enable core-sched.

True, we were relying on the seccomp sandboxing in ChromeOS to protect the
prctl but you're right and I fixed it for next revision.

> That also means we should very much default to disable.

This is how it is already.

thanks,

- Joel
On Thu, May 21, 2020 at 04:09:50AM +0000, benbjiang(蒋彪) wrote:
>
>
> > On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> >
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > 1 - enable core scheduling for the task.
> > 0 - disable core scheduling for the task.
> >
> > Special cases:
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> >
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
> >
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> Hi,
> Are the performance improvements compared to the hyperthreading disabled scenario or not?
> Could you help to explain how the keypress latency improvement comes with core-scheduling?
Hi Jiang,
The keypress end-to-end latency metric we have is calculated from when the
keypress is registered in hardware to when a character is drawn on the
screen. This involves several parties including the GPU and browser
processes which are all running in the same trust domain and benefit from
parallelism through hyperthreading.
thanks,
- Joel
On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.
I'm assuming this is "compared to SMT disabled"?
What is the cost compared to "SMT enabled but no core scheduling"?
But the real reason I'm piping up is that your latency benchmark
sounds very cool.
Generally throughput benchmarks are much easier to do, how do you do
this latency benchmark, and is it perhaps something that could be run
more widely (ie I'm thinking that if it's generic enough and stable
enough to be run by some of the performance regression checking
robots, it would be a much more interesting test-case than some of the
ones they run right now...)
I'm looking at that "threaded phoronix gzip performance regression"
thread due to a totally unrelated scheduling change ("sched/fair:
Rework load_balance()"), and then I see this thread and my reaction is
"the keypress latency thing sounds like a much more interesting
performance test than threaded gzip from clear linux".
But the threaded gzip test is presumably trivial to script, while your
latency test is perhaps very specific to one particular platform and
setup?
Linus
On Thu, May 21, 2020 at 9:47 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > It doens't allow tasks for form their own groups (by for example setting
> > the key to that of another task).
>
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?
>
On a similar note, Joel and I were discussing prctl and it came up
that, there is no mechanism to set cookie from outside a process using
prctl(2). So, another option we could consider is to use sched_setattr(2)
and expand sched_attr to accommodate a u64 cookie. User could pass in a
cookie to explicitly set it and also use the same cookie for grouping.
Haven't prototyped it yet. Will need to dig deeper and see what it would
really look like.
Thanks,
Vineeth
Hi Linus,

On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
>
> I'm assuming this is "compared to SMT disabled"?

Yes, this is compared to SMT disabled. I'll improve the commit message.

> What is the cost compared to "SMT enabled but no core scheduling"?

With SMT enabled and no core scheduling, it is around 40ms in the higher
percentiles. Also, one more thing I wanted to mention: this is the 90th
percentile.

> But the real reason I'm piping up is that your latency benchmark
> sounds very cool.
>
> Generally throughput benchmarks are much easier to do, how do you do
> this latency benchmark, and is it perhaps something that could be run
> more widely (ie I'm thinking that if it's generic enough and stable
> enough to be run by some of the performance regression checking
> robots, it would be a much more interesting test-case than some of the
> ones they run right now...)

Glad you like it! The metric is calculated with a timestamp of when the
driver says the key was pressed, up until when the GPU says we've drawn
pixels in response.

The test mostly only requires the Chrome browser. It opens some pre-existing
test URLs (a google doc, a window that opens a camera stream and another
window that decodes video). This metric is already calculated in Chrome, we
just scrape it from chrome://histograms/Event.Latency.EndToEnd.KeyPress. If
you install Chrome, you can go to this link and see the histogram.

We open a Google docs window and synthetically input keys into it, with a
camera stream and video decoding running in other windows, which gives the
CPUs a good beating. Then we collect roughly the 90th percentile keypress
latency from the above histogram, and the camera and decoded video's FPS,
among other things. There is a test in the works that my colleagues are
writing to run the full Google hangout video chatting stack to stress the
system more (versus just the camera stream). I guess if the robots can
somehow input keys into the Google docs and open the right windows, then it
is just a matter of scraping the histogram.

> I'm looking at that "threaded phoronix gzip performance regression"
> thread due to a totally unrelated scheduling change ("sched/fair:
> Rework load_balance()"), and then I see this thread and my reaction is
> "the keypress latency thing sounds like a much more interesting
> performance test than threaded gzip from clear linux".
>
> But the threaded gzip test is presumably trivial to script, while your
> latency test is perhaps very specific to one particular platform and
> setup?

Yes, it is specifically ChromeOS running on a Pixelbook with a 7th Gen Intel
Core i7 (4 hardware threads):
https://store.google.com/us/product/google_pixelbook

I could try to make it a synthetic test, but it might be difficult for a
robot to run it if it does not have graphics support and a camera connected
to it. It would then need a fake/emulated camera connected to it. These
robots run Linux in a non-GUI environment in qemu instances, right?

thanks,

- Joel
On Thu, May 21, 2020 at 1:45 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Linus,
>
> On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> > On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > Generally throughput benchmarks are much easier to do, how do you do
> > this latency benchmark, and is it perhaps something that could be run
> > more widely (ie I'm thinking that if it's generic enough and stable
> > enough to be run by some of the performance regression checking
> > robots, it would be a much more interesting test-case than some of the
> > ones they run right now...)
>
> Glad you like it! The metric is calculated with a timestamp of when the
> driver says the key was pressed, up until when the GPU says we've drawn
> pixels in response.
>
> The test mostly only requires the Chrome browser. It opens some
> pre-existing test URLs (a google doc, a window that opens a camera stream and
> another window that decodes video). This metric is already calculated in
> Chrome, we just scrape it from
> chrome://histograms/Event.Latency.EndToEnd.KeyPress. If you install Chrome,
> you can go to this link and see the histogram. We open a Google docs window
> and synthetically input keys into it with a camera stream and video decoding
> running in other windows which gives the CPUs a good beating. Then we collect
> roughly the 90th percentile keypress latency from the above histogram and the
> camera and decoded video's FPS, among other things. There is a test in the
> works that my colleagues are writing to run the full Google hangout video
> chatting stack to stress the system more (versus just the camera stream). I
> guess if the robots can somehow input keys into the Google docs and open the
> right windows, then it is just a matter of scraping the histogram.
Expanding on this a little, we're working on a couple of projects that
should provide results like these for upstream. One is continuously
rebasing our upstream backlog onto new kernels for testing purposes
(the idea here is to make it easier for us to update kernels on
Chromebooks), and the second is to drive more stuff into the
kernelci.org infrastructure. Given the test environments we have in
place now, we can probably get results from our continuous rebase
project first and provide those against -rc releases if that's
something you'd be interested in. Going forward, I hope we can
extract several of our tests and put them into kernelci as well, so we
get more general coverage without the potential impact of our (still
somewhat large) upstream backlog of patches.
To Joel's point, there are a few changes we'll have to make to get
similar results outside of our environment, but I think that's doable
without a ton of work. And if anyone is curious, I think most of this
stuff is already public in the tast and autotest repos of the
chromiumos tree. Just let us know if you want to make changes or port
to another environment so we can try to stay in sync wrt new features,
etc.
Thanks,
Jesse
On Wed, May 20, 2020 at 06:48:18PM -0400, Joel Fernandes (Google) wrote:
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using sched-RCU instead.
>
> This fixes the following spinlock recursion observed when testing the
> core scheduling patches on PREEMPT=y kernel on ChromeOS:
>
> [    3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
> [    3.240900]  lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
> [    3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
> [    3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
> [    3.240910] Call Trace:
> [    3.240919]  dump_stack+0x97/0xdb
> [    3.240924]  ? spin_bug+0xa4/0xb1
> [    3.240927]  do_raw_spin_lock+0x79/0x98
> [    3.240931]  try_to_wake_up+0x367/0x61b
> [    3.240935]  rcu_read_unlock_special+0xde/0x169
> [    3.240938]  ? sched_core_balance+0xd9/0x11e
> [    3.240941]  __rcu_read_unlock+0x48/0x4a
> [    3.240945]  __balance_callback+0x50/0xa1
> [    3.240949]  __schedule+0x55a/0x61e
> [    3.240952]  schedule_idle+0x21/0x2d
> [    3.240956]  do_idle+0x1d5/0x1f8
> [    3.240960]  cpu_startup_entry+0x1d/0x1f
> [    3.240964]  start_secondary+0x159/0x174
> [    3.240967]  secondary_startup_64+0xa4/0xb0
> [   14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
>
> Cc: vpillai <vpillai@digitalocean.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.intel@gmail.com>
> Cc: peterz@infradead.org
> Cc: paulmck@kernel.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e

With some luck, the commit removing the need for this will hit
mainline during the next merge window. Fingers firmly crossed...

							Thanx, Paul

> ---
>  kernel/sched/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 780514d03da47..b8ca6fcaaaf06 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
>  	struct sched_domain *sd;
>  	int cpu = cpu_of(rq);
>
> -	rcu_read_lock();
> +	rcu_read_lock_sched();
>  	raw_spin_unlock_irq(rq_lockp(rq));
>  	for_each_domain(cpu, sd) {
>  		if (!(sd->flags & SD_LOAD_BALANCE))
> @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
>  		break;
>  	}
>  	raw_spin_lock_irq(rq_lockp(rq));
> -	rcu_read_unlock();
> +	rcu_read_unlock_sched();
>  }
>
>  static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> --
> 2.26.2.761.g0e0b3e54be-goog
>
On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> There could be races in core scheduler where a CPU is trying to pick
> a task for its sibling in core scheduler, when that CPU has just been
> offlined. We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
>
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/core.c  | 274 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/fair.c  |  40 +++++++
>  kernel/sched/sched.h |   6 +-
>  3 files changed, 318 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 445f0d519336..9a1bd236044e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4253,7 +4253,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
>   * Pick up the highest-prio task:
>   */
>  static inline struct task_struct *
> -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	const struct sched_class *class;
>  	struct task_struct *p;
> @@ -4309,6 +4309,273 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	BUG();
>  }
>
> +#ifdef CONFIG_SCHED_CORE
> +
> +static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
> +{
> +	return is_idle_task(a) || (a->core_cookie == cookie);
> +}
> +
> +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> +{
> +	if (is_idle_task(a) || is_idle_task(b))
> +		return true;
> +
> +	return a->core_cookie == b->core_cookie;
> +}
> +
> +// XXX fairness/fwd progress conditions
> +/*
> + * Returns
> + * - NULL if there is no runnable task for this class.
> + * - the highest priority task for this runqueue if it matches
> + *   rq->core->core_cookie or its priority is greater than max.
> + * - Else returns idle_task.
> + */
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> +	struct task_struct *class_pick, *cookie_pick;
> +	unsigned long cookie = rq->core->core_cookie;
> +
> +	class_pick = class->pick_task(rq);
> +	if (!class_pick)
> +		return NULL;
> +
> +	if (!cookie) {
> +		/*
> +		 * If class_pick is tagged, return it only if it has
> +		 * higher priority than max.
> +		 */
> +		if (max && class_pick->core_cookie &&
> +		    prio_less(class_pick, max))
> +			return idle_sched_class.pick_task(rq);
> +
> +		return class_pick;
> +	}
> +
> +	/*
> +	 * If class_pick is idle or matches cookie, return early.
> +	 */
> +	if (cookie_equals(class_pick, cookie))
> +		return class_pick;
> +
> +	cookie_pick = sched_core_find(rq, cookie);
> +
> +	/*
> +	 * If class > max && class > cookie, it is the highest priority task on
> +	 * the core (so far) and it must be selected, otherwise we must go with
> +	 * the cookie pick in order to satisfy the constraint.
> +	 */
> +	if (prio_less(cookie_pick, class_pick) &&
> +	    (!max || prio_less(max, class_pick)))
> +		return class_pick;
> +
> +	return cookie_pick;
> +}

I've been hating on this pick_task() routine for a while now :-). If we add
the task to the tag tree as Peter suggested at OSPM for that other issue
Vineeth found, it seems it could be simpler.

This has just been near a compiler so far but how about:

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 005d7f7323e2d..81e23252b6c99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)

 	rq->core->core_task_seq++;

-	if (!p->core_cookie)
-		return;
-
 	node = &rq->core_tree.rb_node;
 	parent = *node;

@@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)

 void sched_core_add(struct rq *rq, struct task_struct *p)
 {
-	if (p->core_cookie && task_on_rq_queued(p))
+	if (task_on_rq_queued(p))
 		sched_core_enqueue(rq, p);
 }

@@ -4563,36 +4560,32 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	if (!class_pick)
 		return NULL;

-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
-			return idle_sched_class.pick_task(rq);
-
+	if (!max)
 		return class_pick;
-	}

-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
+	/* Make sure the current max's cookie is core->core_cookie */
+	WARN_ON_ONCE(max->core_cookie != cookie);
+
+	/* If class_pick is idle or matches cookie, play nice. */
 	if (cookie_equals(class_pick, cookie))
 		return class_pick;

-	cookie_pick = sched_core_find(rq, cookie);
+	/* If class_pick is highest prio, trump max. */
+	if (prio_less(max, class_pick)) {
+
+		/* .. but not before checking if cookie trumps class. */
+		cookie_pick = sched_core_find(rq, cookie);
+		if (prio_less(class_pick, cookie_pick))
+			return cookie_pick;

-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
 		return class_pick;
+	}

-	return cookie_pick;
+	/*
+	 * We get here if class_pick was incompatible with max
+	 * and lower prio than max. So we have nothing.
+	 */
+	return idle_sched_class.pick_task(rq);
 }

 static struct task_struct *
On Thu, May 21, 2020 at 07:14:26PM -0400, Joel Fernandes wrote:
> On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
[snip]
> > +	/*
> > +	 * If class_pick is idle or matches cookie, return early.
> > +	 */
> > +	if (cookie_equals(class_pick, cookie))
> > +		return class_pick;
> > +
> > +	cookie_pick = sched_core_find(rq, cookie);
> > +
> > +	/*
> > +	 * If class > max && class > cookie, it is the highest priority task on
> > +	 * the core (so far) and it must be selected, otherwise we must go with
> > +	 * the cookie pick in order to satisfy the constraint.
> > +	 */
> > +	if (prio_less(cookie_pick, class_pick) &&
> > +	    (!max || prio_less(max, class_pick)))
> > +		return class_pick;
> > +
> > +	return cookie_pick;
> > +}
>
> I've been hating on this pick_task() routine for a while now :-). If we add
> the task to the tag tree as Peter suggested at OSPM for that other issue
> Vineeth found, it seems it could be simpler.

Sorry, I meant adding of a 0-tagged (no cookie) task to the tag tree.

thanks,

- Joel
On Thu, May 21, 2020 at 6:52 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 20, 2020 at 06:48:18PM -0400, Joel Fernandes (Google) wrote:
> > rcu_read_unlock() can incur an infrequent deadlock in
> > sched_core_balance(). Fix this by using sched-RCU instead.
> >
> > This fixes the following spinlock recursion observed when testing the
> > core scheduling patches on PREEMPT=y kernel on ChromeOS:
> >
> > [    3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
> > [    3.240900]  lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
> > [    3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
> > [    3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
> > [    3.240910] Call Trace:
> > [    3.240919]  dump_stack+0x97/0xdb
> > [    3.240924]  ? spin_bug+0xa4/0xb1
> > [    3.240927]  do_raw_spin_lock+0x79/0x98
> > [    3.240931]  try_to_wake_up+0x367/0x61b
> > [    3.240935]  rcu_read_unlock_special+0xde/0x169
> > [    3.240938]  ? sched_core_balance+0xd9/0x11e
> > [    3.240941]  __rcu_read_unlock+0x48/0x4a
> > [    3.240945]  __balance_callback+0x50/0xa1
> > [    3.240949]  __schedule+0x55a/0x61e
> > [    3.240952]  schedule_idle+0x21/0x2d
> > [    3.240956]  do_idle+0x1d5/0x1f8
> > [    3.240960]  cpu_startup_entry+0x1d/0x1f
> > [    3.240964]  start_secondary+0x159/0x174
> > [    3.240967]  secondary_startup_64+0xa4/0xb0
> > [   14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
> >
> > Cc: vpillai <vpillai@digitalocean.com>
> > Cc: Aaron Lu <aaron.lwe@gmail.com>
> > Cc: Aubrey Li <aubrey.intel@gmail.com>
> > Cc: peterz@infradead.org
> > Cc: paulmck@kernel.org
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e
>
> With some luck, the commit removing the need for this will hit
> mainline during the next merge window. Fingers firmly crossed...

Sounds good, thank you Paul :-)

- Joel

>
>							Thanx, Paul
>
> > ---
> >  kernel/sched/core.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 780514d03da47..b8ca6fcaaaf06 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
> >  	struct sched_domain *sd;
> >  	int cpu = cpu_of(rq);
> >
> > -	rcu_read_lock();
> > +	rcu_read_lock_sched();
> >  	raw_spin_unlock_irq(rq_lockp(rq));
> >  	for_each_domain(cpu, sd) {
> >  		if (!(sd->flags & SD_LOAD_BALANCE))
> > @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
> >  		break;
> >  	}
> >  	raw_spin_lock_irq(rq_lockp(rq));
> > -	rcu_read_unlock();
> > +	rcu_read_unlock_sched();
> >  }
> >
> >  static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> > --
> > 2.26.2.761.g0e0b3e54be-goog
> >
On Thu, May 21, 2020 at 07:14:26PM -0400, Joel Fernandes wrote: > On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote: > > From: Peter Zijlstra <peterz@infradead.org> > > > > Instead of only selecting a local task, select a task for all SMT > > siblings for every reschedule on the core (irrespective which logical > > CPU does the reschedule). > > > > There could be races in core scheduler where a CPU is trying to pick > > a task for its sibling in core scheduler, when that CPU has just been > > offlined. We should not schedule any tasks on the CPU in this case. > > Return an idle task in pick_next_task for this situation. > > > > NOTE: there is still potential for siblings rivalry. > > NOTE: this is far too complicated; but thus far I've failed to > > simplify it further. > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com> > > Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> > > Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> > > --- > > kernel/sched/core.c | 274 ++++++++++++++++++++++++++++++++++++++++++- > > kernel/sched/fair.c | 40 +++++++ > > kernel/sched/sched.h | 6 +- > > 3 files changed, 318 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 445f0d519336..9a1bd236044e 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4253,7 +4253,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt) > > * Pick up the highest-prio task: > > */ > > static inline struct task_struct * > > -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > { > > const struct sched_class *class; > > struct task_struct *p; > > @@ -4309,6 +4309,273 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags 
*rf) > > BUG(); > > } > > > > +#ifdef CONFIG_SCHED_CORE > > + > > +static inline bool cookie_equals(struct task_struct *a, unsigned long cookie) > > +{ > > + return is_idle_task(a) || (a->core_cookie == cookie); > > +} > > + > > +static inline bool cookie_match(struct task_struct *a, struct task_struct *b) > > +{ > > + if (is_idle_task(a) || is_idle_task(b)) > > + return true; > > + > > + return a->core_cookie == b->core_cookie; > > +} > > + > > +// XXX fairness/fwd progress conditions > > +/* > > + * Returns > > + * - NULL if there is no runnable task for this class. > > + * - the highest priority task for this runqueue if it matches > > + * rq->core->core_cookie or its priority is greater than max. > > + * - Else returns idle_task. > > + */ > > +static struct task_struct * > > +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max) > > +{ > > + struct task_struct *class_pick, *cookie_pick; > > + unsigned long cookie = rq->core->core_cookie; > > + > > + class_pick = class->pick_task(rq); > > + if (!class_pick) > > + return NULL; > > + > > + if (!cookie) { > > + /* > > + * If class_pick is tagged, return it only if it has > > + * higher priority than max. > > + */ > > + if (max && class_pick->core_cookie && > > + prio_less(class_pick, max)) > > + return idle_sched_class.pick_task(rq); > > + > > + return class_pick; > > + } > > + > > + /* > > + * If class_pick is idle or matches cookie, return early. > > + */ > > + if (cookie_equals(class_pick, cookie)) > > + return class_pick; > > + > > + cookie_pick = sched_core_find(rq, cookie); > > + > > + /* > > + * If class > max && class > cookie, it is the highest priority task on > > + * the core (so far) and it must be selected, otherwise we must go with > > + * the cookie pick in order to satisfy the constraint. 
> > + */ > > + if (prio_less(cookie_pick, class_pick) && > > + (!max || prio_less(max, class_pick))) > > + return class_pick; > > + > > + return cookie_pick; > > +} > > I've been hating on this pick_task() routine for a while now :-). If we add > the task to the tag tree as Peter suggested at OSPM for that other issue > Vineeth found, it seems it could be simpler. > > This has just been near a compiler so far but how about: Discussed a lot with Vineeth. Below is an improved version of the pick_task() similification. It also handles the following "bug" in the existing code as well that Vineeth brought up in OSPM: Suppose 2 siblings of a core: rq 1 and rq 2. In priority order (high to low), say we have the tasks: A - untagged (rq 1) B - tagged (rq 2) C - untagged (rq 2) Say, B and C are in the same scheduling class. When the pick_next_task() loop runs, it looks at rq 1 and max is A, A is tenantively selected for rq 1. Then it looks at rq 2 and the class_pick is B. But that's not compatible with A. So rq 2 gets forced idle. In reality, rq 2 could have run C instead of idle. The fix is to add C to the tag tree as Peter suggested in OSPM. 
Updated diff below: ---8<----------------------- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 005d7f7323e2d..625377f393ed3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p) rq->core->core_task_seq++; - if (!p->core_cookie) - return; - node = &rq->core_tree.rb_node; parent = *node; @@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p) void sched_core_add(struct rq *rq, struct task_struct *p) { - if (p->core_cookie && task_on_rq_queued(p)) + if (task_on_rq_queued(p)) sched_core_enqueue(rq, p); } @@ -4556,43 +4553,57 @@ void sched_core_irq_exit(void) static struct task_struct * pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max) { - struct task_struct *class_pick, *cookie_pick; + struct task_struct *class_pick, *cookie_pick, *rq_pick; unsigned long cookie = rq->core->core_cookie; class_pick = class->pick_task(rq); if (!class_pick) return NULL; - if (!cookie) { - /* - * If class_pick is tagged, return it only if it has - * higher priority than max. - */ - if (max && class_pick->core_cookie && - prio_less(class_pick, max)) - return idle_sched_class.pick_task(rq); + if (!max) + return class_pick; + + /* Make sure the current max's cookie is core->core_cookie */ + WARN_ON_ONCE(max->core_cookie != cookie); + /* Try to play really nice: see if the class's cookie works. */ + if (cookie_equals(class_pick, cookie)) return class_pick; - } /* - * If class_pick is idle or matches cookie, return early. + * From here on, we must return class_pick, cookie_pick or idle. + * Following are the cases: + * 1 - lowest prio. + * 3 - highest prio. + * + * max class cookie outcome + * 1 2 3 cookie + * 1 3 2 class + * 2 1 3 cookie + * 2 3 1 class + * 3 1 2 cookie + * 3 2 1 cookie + * 3 2 - return idle (when no cookie task). 
*/ - if (cookie_equals(class_pick, cookie)) - return class_pick; + /* First try to find the highest prio of (cookie, class and max). */ cookie_pick = sched_core_find(rq, cookie); + if (cookie_pick && prio_less(class_pick, cookie_pick)) + rq_pick = cookie_pick; + else + rq_pick = class_pick; + if (prio_less(max, rq_pick)) + return rq_pick; + + /* If max was the greatest, then see if there was a cookie. */ + if (cookie_pick) + return cookie_pick; /* - * If class > max && class > cookie, it is the highest priority task on - * the core (so far) and it must be selected, otherwise we must go with - * the cookie pick in order to satisfy the constraint. + * We get here if class_pick was incompatible with max + * and has lower prio than max. So we have nothing. */ - if (prio_less(cookie_pick, class_pick) && - (!max || prio_less(max, class_pick))) - return class_pick; - - return cookie_pick; + return idle_sched_class.pick_task(rq); } static struct task_struct *
On Thu, May 21, 2020 at 10:35:56PM -0400, Joel Fernandes wrote: > Discussed a lot with Vineeth. Below is an improved version of the pick_task() > simplification. > > It also handles the following "bug" in the existing code that Vineeth > brought up in OSPM: Suppose 2 siblings of a core: rq 1 and rq 2. > > In priority order (high to low), say we have the tasks: > A - untagged (rq 1) > B - tagged (rq 2) > C - untagged (rq 2) > > Say B and C are in the same scheduling class. > > When the pick_next_task() loop runs, it looks at rq 1 and max is A; A is > tentatively selected for rq 1. Then it looks at rq 2 and the class_pick is B. > But that's not compatible with A, so rq 2 gets forced idle. > > In reality, rq 2 could have run C instead of idling. The fix is to add C to the > tag tree as Peter suggested in OSPM. I like the idea of adding untagged tasks to the core tree. > Updated diff below: > > ---8<----------------------- > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 005d7f7323e2d..625377f393ed3 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p) > > rq->core->core_task_seq++; > > - if (!p->core_cookie) > - return; > - > node = &rq->core_tree.rb_node; > parent = *node; > > @@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p) > > void sched_core_add(struct rq *rq, struct task_struct *p) > { > - if (p->core_cookie && task_on_rq_queued(p)) > + if (task_on_rq_queued(p)) > sched_core_enqueue(rq, p); > } It appears there are other call sites of sched_core_enqueue() where core_cookie is checked: cpu_cgroup_fork() and __sched_write_tag().
On Sat, May 16, 2020 at 11:42:30AM +0800, Aaron Lu wrote: > On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote: > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4476,6 +4473,16 @@ next_class:; > > WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); > > } > > > > + /* XXX SMT2 only */ > > + if (new_active == 1 && old_active > 1) { > > There is a case when an incompatible task appears but we failed to 'drop > into single-rq mode' per the above condition check. The TLDR is: when > there is a task that sits on the sibling rq with the same cookie as > 'max', new_active will be 2 instead of 1 and that would cause us to miss > the chance to do a sync of core min_vruntime. FWIW: when I disable the feature of running the cookie_pick task on the sibling and thus enforce a strict single-rq mode, Peter's patch works well for the scenario described below. > This is how it happens: > 1) 2 tasks of the same cgroup with different weights running on 2 siblings, > say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound > at cpu1 (assume cpu0 and cpu1 are siblings); > 2) Since new_active == 2, we didn't trigger min_vruntime sync. For > simplicity, let's assume both siblings' root cfs_rq's min_vruntime and > core_vruntime are all at 0 now; > 3) let the two tasks run a while; > 4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's > existing task has a very small weight, its cfs_rq's min_vruntime can > be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is > much larger than cg0_A's and the 'max' of the core wide task > selection goes to cg0_A; > 5) Now I suppose we should drop into single-rq mode and by doing a sync > of core min_vruntime, cg1_C's turn shall come. But the problem is, our > current selection logic prefers not to waste CPU time, so after deciding > cg0_A as the 'max', the sibling will also do a cookie_pick() and > get cg0_B to run. This is where the problem arises: new_active is 2 > instead of the expected 1. 
> 6) Because we didn't do the sync of core min_vruntime, the newly queued > cg1_C shall wait a long time before cg0_A's vruntime catches up. P.S. this is what I did to enforce a strict single-rq mode: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1fa5b48b742a..0f5580bc7e96 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4411,7 +4411,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma (!max || prio_less(max, class_pick))) return class_pick; - return cookie_pick; + return NULL; } static struct task_struct *
On Thu, May 21, 2020 at 09:47:05AM -0400, Joel Fernandes wrote: > Hi Peter, > Thanks for the comments. > > On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote: > > On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote: > > > Add a per-thread core scheduling interface which allows a thread to tag > > > itself and enable core scheduling. Based on discussion at OSPM with > > > maintainers, we propose a prctl(2) interface accepting values of 0 or 1. > > > 1 - enable core scheduling for the task. > > > 0 - disable core scheduling for the task. > > > > Yeah, so this is a terrible interface :-) > > I tried to keep it simple. You are right, let's make it better. > > > It doesn't allow tasks to form their own groups (by for example setting > > the key to that of another task). > > So for this, I was thinking of making the prctl pass in an integer. And 0 > would mean untagged. Does that sound good to you? A TID, I think. If you pass your own TID, you tag yourself as not-sharing. If you tag yourself with another task's TID, you can do ptrace tests to see if you're allowed to observe their junk. > > It is also horribly ill defined what it means to 'enable', with whom > > it is allowed to share a core. > > I couldn't parse this. Do you mean "enabling coresched does not make sense if > we don't specify whom to share the core with?" As a corollary, yes. I mostly meant that a blanket 'enable' doesn't specify a 'who' you're sharing your bits with. > > OK, so cgroup always wins; why is that a good thing? > > I was just trying to respect the functionality of the CGroup patch in the > coresched series, after all a gentleman named Peter Zijlstra wrote that > patch ;-) ;-). Yeah, but I think that same guy said that that was a shit interface and only hacked up because it was easy :-) > More seriously, the reason I did it this way is that the prctl-tagging is a bit > incompatible with CGroup tagging: > > 1. 
What happens if 2 tasks are in a tagged CGroup and one of them changes > their cookie through prctl? Do they still remain in the tagged CGroup but are > now going to not trust each other? Do they get removed from the CGroup? This > is why I made the prctl fail with -EBUSY in such cases. > > 2. What happens if 2 tagged tasks with different cookies are added to a > tagged CGroup? Do we fail the addition of the tasks to the group, or do we > override their cookie (like I'm doing)? For #2 I think I prefer failure. But having the rationale spelled out in documentation (man-pages for example) is important. > > > ChromeOS will use core-scheduling to securely enable hyperthreading. > > > This cuts down the keypress latency in Google docs from 150ms to 50ms > > > while improving the camera streaming frame rate by ~3%. > > > > It doesn't consider permissions. > > > > Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN > > only to enable core-sched. > > True, we were relying on the seccomp sandboxing in ChromeOS to protect the > prctl but you're right and I fixed it for next revision. With the TID idea above you get the ptrace tests.
On Thu, May 21, 2020 at 2:58 PM Jesse Barnes <jsbarnes@google.com> wrote: > > Expanding on this a little, we're working on a couple of projects that > should provide results like these for upstream. One is continuously > rebasing our upstream backlog onto new kernels for testing purposes > (the idea here is to make it easier for us to update kernels on > Chromebooks), Lovely. Not just for any performance work that comes out of this, but hopefully this means that we'll also have quick problem reports if something happens that affects chrome. There's certainly been issues on the server side of google where we made changes (*cough*cgroup*cough*) which didn't make anybody really blink until years after the fact.. Which ends up being very inconvenient when other parts of the community have been using the new features for years. > and the second is to drive more stuff into the > kernelci.org infrastructure. Given the test environments we have in > place now, we can probably get results from our continuous rebase > project first and provide those against -rc releases if that's > something you'd be interested in. I think the more automated (or regular, or close-to-upstream) real-world testing that we get, the better off we are. We have a number of regular distributions that track the upstream kernel fairly closely, so we get a fair amount of coverage for the normal desktop loads. And the bots are doing great, but they tend to test very specific things (in the case of "syzbot" the "specific" thing is obviously pretty far-ranging, but it's still very small details). And latency has always been harder to really test (outside of the truly trivial microbenchmarks), so the fact that it sounds like you're going to test not only a different environment than the usual distros but have a few macro-level latency tests just sounds lovely in general. Let's see how lovely I think it is once you start sending regression reports.. Linus
On Fri, May 22, 2020 at 11:44:06AM +0800, Aaron Lu wrote:
[...]
> > Updated diff below:
> >
> > ---8<-----------------------
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 005d7f7323e2d..625377f393ed3 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> >
> > rq->core->core_task_seq++;
> >
> > - if (!p->core_cookie)
> > - return;
> > -
> > node = &rq->core_tree.rb_node;
> > parent = *node;
> >
> > @@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> >
> > void sched_core_add(struct rq *rq, struct task_struct *p)
> > {
> > - if (p->core_cookie && task_on_rq_queued(p))
> > + if (task_on_rq_queued(p))
> > sched_core_enqueue(rq, p);
> > }
>
> It appears there are other call sites of sched_core_enqueue() where
> core_cookie is checked: cpu_cgroup_fork() and __sched_write_tag().
Thanks, but it looks like pick_task()'s caller also makes various assumptions
about cookie == 0, so all of that needs to be vetted again, I think.
- Joel
On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote: [..] > > > It doesn't allow tasks to form their own groups (by for example setting > > > the key to that of another task). > > > > So for this, I was thinking of making the prctl pass in an integer. And 0 > > would mean untagged. Does that sound good to you? > > A TID, I think. If you pass your own TID, you tag yourself as > not-sharing. If you tag yourself with another task's TID, you can do > ptrace tests to see if you're allowed to observe their junk. But that would require a bunch of tasks agreeing on which TID to tag with. For example, if 2 tasks tag with each other's TID, then they would have different tags and not share. What's wrong with passing in an integer instead? In any case, we would do the CAP_SYS_ADMIN check to limit who can do it. Also, one thing the CGroup interface allows is an external process to set the cookie, so I am wondering if we should use sched_setattr(2) instead of, or in addition to, the prctl(2). That way, we can drop the CGroup interface completely. How do you feel about that? > > > It is also horribly ill defined what it means to 'enable', with whom > > > it is allowed to share a core. > > > > I couldn't parse this. Do you mean "enabling coresched does not make sense if > > we don't specify whom to share the core with?" > > As a corollary, yes. I mostly meant that a blanket 'enable' doesn't > specify a 'who' you're sharing your bits with. Yes, ok. I can reword the commit log a bit to make it more clear that we are specifying who we can share a core with. > > I was just trying to respect the functionality of the CGroup patch in the > > coresched series, after all a gentleman named Peter Zijlstra wrote that > > patch ;-) ;-). 
> > Yeah, but I think that same guy said that that was a shit interface and > only hacked up because it was easy :-) Fair enough :-) > > More seriously, the reason I did it this way is the prctl-tagging is a bit > > incompatible with CGroup tagging: > > > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes > > their cookie through prctl? Do they still remain in the tagged CGroup but are > > now going to not trust each other? Do they get removed from the CGroup? This > > is why I made the prctl fail with -EBUSY in such cases. > > > > 2. What happens if 2 tagged tasks with different cookies are added to a > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we > > override their cookie (like I'm doing)? > > For #2 I think I prefer failure. > > But having the rationale spelled out in documentation (man-pages for > example) is important. If we drop the CGroup interface, this would avoid both #1 and #2. thanks, - Joel
On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote: > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote: > [..] > > > > It doens't allow tasks for form their own groups (by for example setting > > > > the key to that of another task). > > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0 > > > would mean untagged. Does that sound good to you? > > > > A TID, I think. If you pass your own TID, you tag yourself as > > not-sharing. If you tag yourself with another tasks's TID, you can do > > ptrace tests to see if you're allowed to observe their junk. > > But that would require a bunch of tasks agreeing on which TID to tag with. > For example, if 2 tasks tag with each other's TID, then they would have > different tags and not share. > > What's wrong with passing in an integer instead? In any case, we would do the > CAP_SYS_ADMIN check to limit who can do it. > > Also, one thing CGroup interface allows is an external process to set the > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in > addition to, the prctl(2). That way, we can drop the CGroup interface > completely. How do you feel about that? > I think it should be an arbitrary 64bit value, in both interfaces to avoid any potential reuse security issues. I think the cgroup interface could be extended not to be a boolean but take the value. With 0 being untagged as now. And sched_setattr could be used to set it on a per task basis. > > > > It is also horribly ill defined what it means to 'enable', with whoem > > > > is it allows to share a core. > > > > > > I couldn't parse this. Do you mean "enabling coresched does not make sense if > > > we don't specify whom to share the core with?" > > > > As a corrolary yes. I mostly meant that a blanket 'enable' doesn't > > specify a 'who' you're sharing your bits with. > > Yes, ok. I can reword the commit log a bit to make it more clear that we are > specifying who we can share a core with. 
> > > > I was just trying to respect the functionality of the CGroup patch in the > > > coresched series, after all a gentleman named Peter Zijlstra wrote that > > > patch ;-) ;-). > > > > Yeah, but I think that same guy said that that was a shit interface and > > only hacked up because it was easy :-) > > Fair enough :-) > > > > More seriously, the reason I did it this way is the prctl-tagging is a bit > > > incompatible with CGroup tagging: > > > > > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes > > > their cookie through prctl? Do they still remain in the tagged CGroup but are > > > now going to not trust each other? Do they get removed from the CGroup? This > > > is why I made the prctl fail with -EBUSY in such cases. > > > > > > 2. What happens if 2 tagged tasks with different cookies are added to a > > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we > > > override their cookie (like I'm doing)? > > > > For #2 I think I prefer failure. > > > > But having the rationale spelled out in documentation (man-pages for > > example) is important. > > If we drop the CGroup interface, this would avoid both #1 and #2. > I believe both are useful. Personally, I think the per-task setting should win over the cgroup tagging. In that case #1 just falls out. And #2 pretty much as well. Nothing would happen to the tagged task as they were added to the cgroup. They'd keep their explicitly assigned tags and everything should "just work". There are other reasons to be in a cpu cgroup together than just the core scheduling tag. There are a few other edge cases, like if you are in a cgroup, but have been tagged explicitly with sched_setattr and then get untagged (presumably by setting 0) do you get the cgroup tag or just stay untagged? I think based on per-task winning you'd stay untagged. 
I suppose you could move out of and back into the cgroup to get the tag reapplied (or maybe the cgroup interface could just be reused with the same value to re-tag everyone who's untagged). Cheers, Phil > thanks, > > - Joel > --
On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote: > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote: > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote: > > [..] > > > > > It doens't allow tasks for form their own groups (by for example setting > > > > > the key to that of another task). > > > > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0 > > > > would mean untagged. Does that sound good to you? > > > > > > A TID, I think. If you pass your own TID, you tag yourself as > > > not-sharing. If you tag yourself with another tasks's TID, you can do > > > ptrace tests to see if you're allowed to observe their junk. > > > > But that would require a bunch of tasks agreeing on which TID to tag with. > > For example, if 2 tasks tag with each other's TID, then they would have > > different tags and not share. > > > > What's wrong with passing in an integer instead? In any case, we would do the > > CAP_SYS_ADMIN check to limit who can do it. > > > > Also, one thing CGroup interface allows is an external process to set the > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in > > addition to, the prctl(2). That way, we can drop the CGroup interface > > completely. How do you feel about that? > > > > I think it should be an arbitrary 64bit value, in both interfaces to avoid > any potential reuse security issues. > > I think the cgroup interface could be extended not to be a boolean but take > the value. With 0 being untagged as now. > > And sched_setattr could be used to set it on a per task basis. Yeah, something like this will be needed. > > > > More seriously, the reason I did it this way is the prctl-tagging is a bit > > > > incompatible with CGroup tagging: > > > > > > > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes > > > > their cookie through prctl? Do they still remain in the tagged CGroup but are > > > > now going to not trust each other? 
Do they get removed from the CGroup? This > > > > is why I made the prctl fail with -EBUSY in such cases. In util-clamp's design (which has both a task-specific attribute and a task-group attribute), it seems that the priority is the task-specific value first, then the group one, then the system-wide one. Perhaps a similar design can be adopted for this interface. So probably we should let the per-task interface not fail if the task was already in a CGroup and rather prioritize its value first before looking at the group one? Uclamp's comments: * The effective clamp bucket index of a task depends on, by increasing * priority: * - the task specific clamp value, when explicitly requested from userspace * - the task group effective clamp value, for tasks not either in the root * group or in an autogroup * - the system default clamp value, defined by the sysadmin > > > > > > > > 2. What happens if 2 tagged tasks with different cookies are added to a > > > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we > > > > override their cookie (like I'm doing)? > > > > > > For #2 I think I prefer failure. > > > > > > But having the rationale spelled out in documentation (man-pages for > > > example) is important. > > > > If we drop the CGroup interface, this would avoid both #1 and #2. > > > > I believe both are useful. Personally, I think the per-task setting should > win over the cgroup tagging. In that case #1 just falls out. Cool, this is similar to what I mentioned above. > And #2 pretty > much as well. Nothing would happen to the tagged task as they were added > to the cgroup. They'd keep their explicitly assigned tags and everything > should "just work". There are other reasons to be in a cpu cgroup together > than just the core scheduling tag. Well, ok, so there's no reason to fail the addition of a prctl-tagged task to a CGroup then; we can let it succeed but prioritize the task-specific attribute over the group-specific one. 
> There are a few other edge cases, like if you are in a cgroup, but have > been tagged explicitly with sched_setattr and then get untagged (presumably > by setting 0) do you get the cgroup tag or just stay untagged? I think based > on per-task winning you'd stay untagged. I suppose you could move out and > back in the cgroup to get the tag reapplied (Or maybe the cgroup interface > could just be reused with the same value to re-tag everyone who's untagged). If we maintain a task-specific tag and a group-specific tag, then I think both tags can coexist and the final tag is decided on the priority basis mentioned above. So before getting into CGroup, I think we should first develop the task-specific tagging mechanism like Peter was suggesting. So let us talk about that. I will reply to the other thread Vineeth started while CC'ing you. In particular, I like Peter's idea about userland passing a TID to share a core with. thanks, - Joel > > > > Cheers, > Phil > > > > thanks, > > > > - Joel > > > > -- >
On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote: > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote: > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote: > > [..] > > > > > It doens't allow tasks for form their own groups (by for example setting > > > > > the key to that of another task). > > > > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0 > > > > would mean untagged. Does that sound good to you? > > > > > > A TID, I think. If you pass your own TID, you tag yourself as > > > not-sharing. If you tag yourself with another tasks's TID, you can do > > > ptrace tests to see if you're allowed to observe their junk. > > > > But that would require a bunch of tasks agreeing on which TID to tag with. > > For example, if 2 tasks tag with each other's TID, then they would have > > different tags and not share. Well, don't do that then ;-) > > What's wrong with passing in an integer instead? In any case, we would do the > > CAP_SYS_ADMIN check to limit who can do it. So the actual permission model can be different depending on how broken the hardware is. > > Also, one thing CGroup interface allows is an external process to set the > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in > > addition to, the prctl(2). That way, we can drop the CGroup interface > > completely. How do you feel about that? > > > > I think it should be an arbitrary 64bit value, in both interfaces to avoid > any potential reuse security issues. > > I think the cgroup interface could be extended not to be a boolean but take > the value. With 0 being untagged as now. How do you avoid reuse in such a huge space? That just creates yet another problem for the kernel to keep track of who is who. With random u64 numbers, it even becomes hard to determine if you're sharing at all or not. Now, with the current SMT+MDS trainwreck, any sharing is bad because it allows leaking kernel privates. 
But under a less severe threat scenario, say where only user data would be at risk, the ptrace() tests make sense, but those become really hard with random u64 numbers too. What would the purpose of random u64 values be for cgroups? That only replicates the problem of determining uniqueness there. Then you can get two cgroups unintentionally sharing because you got lucky. Also, fundamentally, we cannot have more threads than TID space; it's a natural identifier.
On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote: > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote: > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote: > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote: > > > [..] > > > > > > It doesn't allow tasks to form their own groups (by for example setting > > > > > > the key to that of another task). > > > > > > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0 > > > > > would mean untagged. Does that sound good to you? > > > > > > > > A TID, I think. If you pass your own TID, you tag yourself as > > > > not-sharing. If you tag yourself with another task's TID, you can do > > > > ptrace tests to see if you're allowed to observe their junk. > > > > > > But that would require a bunch of tasks agreeing on which TID to tag with. > > > For example, if 2 tasks tag with each other's TID, then they would have > > > different tags and not share. > > Well, don't do that then ;-) > That was a poorly worded example :) The point I was trying to make was more that one TID of a group (not cgroup!) of tasks is just an arbitrary value. At a single process (or pair rather) level, sure, you can see it as an identifier of whom you want to share with, but even then you have to tag both processes with this. And it has less meaning when the "whom you want to share with" is multiple tasks. > > > What's wrong with passing in an integer instead? In any case, we would do the > > > CAP_SYS_ADMIN check to limit who can do it. > > So the actual permission model can be different depending on how broken > the hardware is. > > > > Also, one thing the CGroup interface allows is an external process to set the > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in > > > addition to, the prctl(2). That way, we can drop the CGroup interface > > > completely. How do you feel about that? 
> > > > > > > I think it should be an arbitrary 64bit value, in both interfaces to avoid > > any potential reuse security issues. > > > > I think the cgroup interface could be extended not to be a boolean but take > > the value. With 0 being untagged as now. > > How do you avoid reuse in such a huge space? That just creates yet > another problem for the kernel to keep track of who is who. > The kernel doesn't care or have to track anything. The admin does. At the kernel level it's just matching cookies. Tasks A, B and C can all share a core, so you give them each A's TID as a cookie. Task A then exits. Now B and C are using essentially a random value. Task D comes along and wants to share with B and C. You have to tag it with A's old TID, which has no meaning at this point. And if A's TID ever gets reused, the new A' gets to share too. At some level aren't those still 32bits? > With random u64 numbers, it even becomes hard to determine if you're > sharing at all or not. > > Now, with the current SMT+MDS trainwreck, any sharing is bad because it > allows leaking kernel privates. But under a less severe threat scenario, > say where only user data would be at risk, the ptrace() tests make > sense, but those become really hard with random u64 numbers too. > > What would the purpose of random u64 values be for cgroups? That only > replicates the problem of determining uniqueness there. Then you can get > two cgroups unintentionally sharing because you got lucky. > It seems that would be more flexible for the admin. What if you had two cgroups you wanted to allow to run together? Or a cgroup and a few processes from a different one (say with different quotas or something). I don't have such use cases so I don't feel that strongly but it seemed more flexible and followed the mechanism-in-kernel/policy-in-userspace dictum rather than basing the functionality on the implementation details. 
Cheers, Phil > Also, fundamentally, we cannot have more threads than TID space, it's a > natural identifier. > --
Hi Peter, On Thu, May 28, 2020 at 07:01:28PM +0200, Peter Zijlstra wrote: > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote: > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote: > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote: > > > [..] > > > > > > It doesn't allow tasks to form their own groups (by for example setting > > > > > > the key to that of another task). > > > > > > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0 > > > > > would mean untagged. Does that sound good to you? > > > > > > > > A TID, I think. If you pass your own TID, you tag yourself as > > > > not-sharing. If you tag yourself with another task's TID, you can do > > > > ptrace tests to see if you're allowed to observe their junk. > > > > > > But that would require a bunch of tasks agreeing on which TID to tag with. > > > For example, if 2 tasks tag with each other's TID, then they would have > > > different tags and not share. > > Well, don't do that then ;-) We could also guard it with a mutex. First task to set the TID wins; the other thread just reuses the cookie of the TID that won. But I think we cannot just use the TID value as the cookie, due to TID wrap-around and reuse. Otherwise we could accidentally group 2 tasks. Instead, I suggest we keep the TID as the interface, per your suggestion, and do the needed ptrace checks, but convert the TID to the task_struct pointer value and use that as the cookie for the group of tasks sharing a core. Thoughts? thanks, - Joel > > > What's wrong with passing in an integer instead? In any case, we would do the > > > CAP_SYS_ADMIN check to limit who can do it. > > So the actual permission model can be different depending on how broken > the hardware is. > > > > Also, one thing CGroup interface allows is an external process to set the > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in > > > addition to, the prctl(2). 
> > > That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> >
> > I think it should be an arbitrary 64bit value, in both interfaces, to avoid
> > any potential reuse security issues.
> >
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
>
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
>
> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
>
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
>
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
>
> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.
On Thu, May 28, 2020 at 02:17:19PM -0400 Phil Auld wrote:
> On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > > [..]
> > > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > > the key to that of another task).
> > > > > >
> > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > > would mean untagged. Does that sound good to you?
> > > > >
> > > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > > ptrace tests to see if you're allowed to observe their junk.
> > > >
> > > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > > For example, if 2 tasks tag with each other's TID, then they would have
> > > > different tags and not share.
> >
> > Well, don't do that then ;-)
>
> That was a poorly worded example :)

Heh, sorry, I thought that was my statement. I do not mean to belittle Joel's example... That's a fine example of a totally different problem than I was thinking of :)

Cheers,
Phil

> The point I was trying to make was more that one TID of a group (not cgroup!)
> of tasks is just an arbitrary value.
>
> At a single process (or pair rather) level, sure, you can see it as an
> identifier of whom you want to share with, but even then you have to tag
> both processes with this. And it has less meaning when the whom you want to
> share with is multiple tasks.
>
> > > > What's wrong with passing in an integer instead? In any case, we would do the
> > > > CAP_SYS_ADMIN check to limit who can do it.
> >
> > So the actual permission model can be different depending on how broken
> > the hardware is.
> > > > Also, one thing the CGroup interface allows is an external process to set the
> > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > > completely. How do you feel about that?
> > >
> > > I think it should be an arbitrary 64bit value, in both interfaces, to avoid
> > > any potential reuse security issues.
> > >
> > > I think the cgroup interface could be extended not to be a boolean but take
> > > the value. With 0 being untagged as now.
> >
> > How do you avoid reuse in such a huge space? That just creates yet
> > another problem for the kernel to keep track of who is who.
>
> The kernel doesn't care or have to track anything. The admin does.
> At the kernel level it's just matching cookies.
>
> Tasks A, B, C all can share a core, so you give them each A's TID as a cookie.
> Task A then exits. Now B and C are using an essentially random value.
> Task D comes along and wants to share with B and C. You have to tag it
> with A's old TID, which has no meaning at this point.
>
> And if A's TID ever gets reused, the new A' gets to share too. At some
> level, aren't those still 32 bits?
>
> > With random u64 numbers, it even becomes hard to determine if you're
> > sharing at all or not.
> >
> > Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> > allows leaking kernel privates. But under a less severe threat scenario,
> > say where only user data would be at risk, the ptrace() tests make
> > sense, but those become really hard with random u64 numbers too.
> >
> > What would the purpose of random u64 values be for cgroups? That only
> > replicates the problem of determining uniqueness there. Then you can get
> > two cgroups unintentionally sharing because you got lucky.
>
> Seems that would be more flexible for the admin.
>
> What if you had two cgroups you wanted to allow to run together?
> Or a cgroup and a few processes from a different one (say with different
> quotas or something)?
>
> I don't have such use cases, so I don't feel that strongly, but it seemed
> more flexible and followed the mechanism-in-kernel/policy-in-userspace
> dictum rather than basing the functionality on the implementation details.
>
> Cheers,
> Phil
>
> > Also, fundamentally, we cannot have more threads than TID space, it's a
> > natural identifier.

--
On 2020/5/14 21:02, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
>> With this said, I realized a workaround for the issue described above:
>> when the core went from 'compatible mode'(step 1-3) to 'incompatible
>> mode'(step 4), reset all root level sched entities' vruntime to be the
>> same as the core wide min_vruntime. After all, the core is transforming
>> from two runqueue mode to single runqueue mode... I think this can solve
>> the issue to some extent but I may miss other scenarios.
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thinking about more siblings, I'll have to try again later.
>
> This very much retains the horrible approximation of S we always do.
>
> Also, it is _completely_ untested...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
>  /* real prio, less is less */
>  static inline bool prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -
>  	int pa = __task_prio(a), pb = __task_prio(b);
>
>  	if (-pa < -pb)
> @@ -114,19 +113,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>
> -	if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)
> +		return cfs_prio_less(a, b);
>
>  	return false;
>  }
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
> +	int old_active = 0, new_active = 0;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> -	int i, j, cpu;
>  	bool need_sync = false;
> +	int i, j, cpu;
>
>  	cpu = cpu_of(rq);
>  	if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>  		rq_i->core_pick = NULL;
>
>  		if (rq_i->core_forceidle) {
> +			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>  			need_sync = true;
>  			rq_i->core_forceidle = false;
>  		}
>
> +		if (!is_idle_task(rq_i->curr))
> +			old_active++;
> +
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
>  	}
> @@ -4463,8 +4456,12 @@ next_class:;
>
>  		WARN_ON_ONCE(!rq_i->core_pick);
>
> -		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -			rq_i->core_forceidle = true;
> +		if (is_idle_task(rq_i->core_pick)) {
> +			if (rq_i->nr_running)
> +				rq_i->core_forceidle = true;
> +		} else {
> +			new_active++;
> +		}
>
>  		if (i == cpu)
>  			continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>  	}
>
> +	/* XXX SMT2 only */
> +	if (new_active == 1 && old_active > 1) {
> +		/*
> +		 * We just dropped into single-rq mode, increment the sequence
> +		 * count to trigger the vruntime sync.
> +		 */
> +		rq->core->core_sync_seq++;
> +	}
> +	rq->core->core_active = new_active;
> +
>  done:
>  	set_next_task(rq, next);
>  	return next;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
>  	return NULL;
>  }
>
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> +	return se->cfs_rq->tg == pse->cfs_rq->tg;
> +}
> +
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
>  {
>  	return se->parent;
> @@ -394,8 +400,6 @@ static inline struct sched_entity *paren
>  static void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> -	int se_depth, pse_depth;
> -
>  	/*
>  	 * preemption test can be made between sibling entities who are in the
>  	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> @@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
>  	 * parent.
>  	 */
>
> -	/* First walk up until both entities are at same depth */
> -	se_depth = (*se)->depth;
> -	pse_depth = (*pse)->depth;
> -
> -	while (se_depth > pse_depth) {
> -		se_depth--;
> -		*se = parent_entity(*se);
> -	}
> -
> -	while (pse_depth > se_depth) {
> -		pse_depth--;
> -		*pse = parent_entity(*pse);
> -	}
> +	/* XXX we now have 3 of these loops, C stinks */
>
>  	while (!is_same_group(*se, *pse)) {
> -		*se = parent_entity(*se);
> -		*pse = parent_entity(*pse);
> +		int se_depth = (*se)->depth;
> +		int pse_depth = (*pse)->depth;
> +
> +		if (se_depth <= pse_depth)
> +			*pse = parent_entity(*pse);
> +		if (se_depth >= pse_depth)
> +			*se = parent_entity(*se);
>  	}
>  }
>
> @@ -455,6 +452,12 @@ static inline struct sched_entity *paren
>  	return NULL;
>  }
>
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> +	return true;
> +}
> +
>  static inline void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> @@ -462,6 +465,31 @@ find_matching_se(struct sched_entity **s
>
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> +	u64 vruntime_a, vruntime_b;
> +
> +	while (!is_same_tg(se_a, se_b)) {
> +		int se_a_depth = se_a->depth;
> +		int se_b_depth = se_b->depth;
> +
> +		if (se_a_depth <= se_b_depth)
> +			se_b = parent_entity(se_b);
> +		if (se_a_depth >= se_b_depth)
> +			se_a = parent_entity(se_a);
> +	}
> +
> +	cfs_rq_a = cfs_rq_of(se_a);
> +	cfs_rq_b = cfs_rq_of(se_b);
> +
> +	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> +
> +	return !((s64)(vruntime_a - vruntime_b) <= 0);
> +}
> +
>  static __always_inline
>  void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
>
> @@ -6891,6 +6919,18 @@ static void check_preempt_wakeup(struct
>  		set_last_buddy(se);
>  }
>
> +static void core_sync_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> +{
> +	if (!sched_core_enabled())
> +		return;
> +
> +	if (rq->core->core_sync_seq == cfs_rq->core_sync_seq)
> +		return;
> +
> +	cfs_rq->core_sync_seq = rq->core->core_sync_seq;
> +	cfs_rq->core_vruntime = cfs_rq->min_vruntime;
> +}
> +
>  static struct task_struct *pick_task_fair(struct rq *rq)
>  {
>  	struct cfs_rq *cfs_rq = &rq->cfs;
> @@ -6902,6 +6942,14 @@ static struct task_struct *pick_task_fai
>  	do {
>  		struct sched_entity *curr = cfs_rq->curr;
>
> +		/*
> +		 * Propagate the sync state down to whatever cfs_rq we need,
> +		 * the active cfs_rq's will have been done by
> +		 * set_next_task_fair(), the rest is inactive and will not have
> +		 * changed due to the current running task.
> +		 */
> +		core_sync_entity(rq, cfs_rq);
> +
>  		se = pick_next_entity(cfs_rq, NULL);
>
>  		if (curr) {
> @@ -10825,7 +10873,8 @@ static void switched_to_fair(struct rq *
>  	}
>  }
>
> -/* Account for a task changing its policy or group.
> +/*
> + * Account for a task changing its policy or group.
>   *
>   * This routine is mostly called to set cfs_rq->curr field when a task
>   * migrates between groups/classes.
> @@ -10847,6 +10896,9 @@ static void set_next_task_fair(struct rq
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> +		/* snapshot vruntime before using it */
> +		core_sync_entity(rq, cfs_rq);
> +
>  		set_next_entity(cfs_rq, se);
>  		/* ensure bandwidth has been allocated on our new cfs_rq */
>  		account_cfs_rq_runtime(cfs_rq, 0);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -503,6 +503,10 @@ struct cfs_rq {
>  	unsigned int h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
>  	unsigned int idle_h_nr_running; /* SCHED_IDLE */
>
> +#ifdef CONFIG_SCHED_CORE
> +	unsigned int core_sync_seq;
> +	u64 core_vruntime;
> +#endif
>  	u64 exec_clock;
>  	u64 min_vruntime;
>  #ifndef CONFIG_64BIT
> @@ -1035,12 +1039,15 @@ struct rq {
>  	unsigned int core_enabled;
>  	unsigned int core_sched_seq;
>  	struct rb_root core_tree;
> -	bool core_forceidle;
> +	unsigned int core_forceidle;
>
>  	/* shared state */
>  	unsigned int core_task_seq;
>  	unsigned int core_pick_seq;
>  	unsigned long core_cookie;
> +	unsigned int core_sync_seq;
> +	unsigned int core_active;
> +
>  #endif
>  };
>
> @@ -2592,6 +2599,8 @@ static inline bool sched_energy_enabled(
>
>  #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
>
> +extern bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> +
>  #ifdef CONFIG_MEMBARRIER
>  /*
>   * The scheduler provides memory barriers required by membarrier between:

here is a quick test update based on Peter's fairness patch above:

- Kernel under test:
  A: Core scheduling v5 community base + Peter's fairness patch (by reverting Aaron's fairness patch)
     https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y + Peter's patch above
  B: Core scheduling v5 community base (with Aaron's fairness patchset)
     https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y (with Aaron's fairness patch)

- Test
results briefing: overall performance is the same for the following 3 test sets, between the 2 kernel test builds.

- Test set based on sysbench 1.1.0-bd4b418:
  1: sysbench cpu in cgroup cpu 0 + sysbench cpu in cgroup cpu 1 (192 workload tasks for each cgroup)
  2: sysbench mysql in cgroup mysql 0 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
  3: sysbench cpu in cgroup cpu 0 + sysbench mysql in cgroup mysql 0 (192 workload tasks for each cgroup)

- Test environment:
  Intel Xeon Server platform
  CPU(s):               192
  On-line CPU(s) list:  0-191
  Thread(s) per core:   2
  Core(s) per socket:   48
  Socket(s):            2
  NUMA node(s):         4

- Test results:
  Note:
  1: results in the following tables are Tput normalized to the default baseline
  2: settings: "default" = core scheduling disabled, "coresched" = core scheduling enabled
  3: default results are reused between the 2 kernel test builds

Test set 1 (sysbench cpu in both cgroups):

                                          cg cpu 0                  cg cpu 1
                                          Tput_avg   Tput_stdev%    Tput_avg   Tput_stdev%
  default (both kernel builds)            1.00       1.14%          1.00       1.20%
  coresched, Kernel_A (Peter's patch)     0.96       3.45%          1.03       3.60%
  coresched, Kernel_B (Aaron's patch)     0.98       1.75%          1.01       1.83%

Test set 2 (sysbench mysql in both cgroups):

                                          cg mysql 0                cg mysql 1
                                          Tput_avg   Tput_stdev%    Tput_avg   Tput_stdev%
  default (both kernel builds)            1.00       1.85%          1.00       1.84%
  coresched, Kernel_A (Peter's patch)     0.98       2.00%          0.98       1.98%
  coresched, Kernel_B (Aaron's patch)     1.01       1.61%          1.01       1.59%

Test set 3 (sysbench cpu + sysbench mysql):

                                          cg cpu                    cg mysql
                                          Tput_avg   Tput_stdev%    Tput_avg   Tput_stdev%
  default (both kernel builds)            1.00       1.56%          1.00       3.17%
  coresched, Kernel_A (Peter's patch)     1.01       4.67%          0.84       25.89%
  coresched, Kernel_B (Aaron's patch)     0.99       4.17%          0.89       16.44%
On Wed, Mar 04, 2020 at 05:00:01PM +0000, vpillai wrote:
> From: Aubrey Li <aubrey.li@intel.com>
>
> - Don't migrate if there is a cookie mismatch
> Load balance tries to move task from busiest CPU to the
> destination CPU. When core scheduling is enabled, if the
> task's cookie does not match with the destination CPU's
> core cookie, this task will be skipped by this CPU. This
> mitigates the forced idle time on the destination CPU.
>
> - Select cookie matched idle CPU
> In the fast path of task wakeup, select the first cookie matched
> idle CPU instead of the first idle CPU.
>
> - Find cookie matched idlest CPU
> In the slow path of task wakeup, find the idlest CPU whose core
> cookie matches with task's cookie
>
> - Don't migrate task if cookie not match
> For the NUMA load balance, don't migrate task to the CPU whose
> core cookie does not match with task's cookie
>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> ---
> kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 29 +++++++++++++++++++++++
> 2 files changed, 81 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c9a80d8dbb8..f42ceecb749f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1789,6 +1789,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
> continue;
>
> +#ifdef CONFIG_SCHED_CORE
> + /*
> + * Skip this cpu if source task's cookie does not match
> + * with CPU's core cookie.
> + */
> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> + continue;
> +#endif
> +
> env->dst_cpu = cpu;
> task_numa_compare(env, taskimp, groupimp, maymove);
> }
> @@ -5660,8 +5669,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>
> /* Traverse only the allowed CPUs */
> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> + struct rq *rq = cpu_rq(i);
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (!sched_core_cookie_match(rq, p))
> + continue;
> +#endif
> if (available_idle_cpu(i)) {
> - struct rq *rq = cpu_rq(i);
> struct cpuidle_state *idle = idle_get_state(rq);
> if (idle && idle->exit_latency < min_exit_latency) {
> /*
> @@ -5927,8 +5941,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> return si_cpu;
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
> +#ifdef CONFIG_SCHED_CORE
> + if (available_idle_cpu(cpu) &&
> + sched_core_cookie_match(cpu_rq(cpu), p))
> + break;
> +#else
select_idle_cpu() is called only if no idle core could be found in the LLC by
select_idle_core().
So, would it be better here to just do the cookie equality check directly
instead of calling the sched_core_cookie_match() helper? More so, because
select_idle_sibling() is a fastpath.
AFAIR, that's what v4 did:
if (available_idle_cpu(cpu))
#ifdef CONFIG_SCHED_CORE
if (sched_core_enabled(cpu_rq(cpu)) &&
(p->core_cookie == cpu_rq(cpu)->core->core_cookie))
break;
#else
break;
#endif
Thoughts? thanks,
- Joel
On Fri, Jun 12, 2020 at 9:21 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > +#ifdef CONFIG_SCHED_CORE
> > + if (available_idle_cpu(cpu) &&
> > + sched_core_cookie_match(cpu_rq(cpu), p))
> > + break;
> > +#else
>
> select_idle_cpu() is called only if no idle core could be found in the LLC by
> select_idle_core().
>
> So, would it be better here to just do the cookie equality check directly
> instead of calling the sched_core_cookie_match() helper? More so, because
> select_idle_sibling() is a fastpath.

Agree, this makes sense to me.

> AFAIR, that's what v4 did:
>
> if (available_idle_cpu(cpu))
> #ifdef CONFIG_SCHED_CORE
> if (sched_core_enabled(cpu_rq(cpu)) &&
> (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
> break;
> #else
> break;
> #endif

This patch was initially not in v4; it is a merge of 4 patches suggested post-v4. During the initial round, the code was like the above, but since there appeared to be code duplication across the different migration paths, it was consolidated into sched_core_cookie_match(), which introduced this extra logic into this specific code path. As you mentioned, I also feel we do not need to check for core idleness in this path.

Thanks,
Vineeth
On Fri, Jun 12, 2020 at 05:32:01PM -0400, Vineeth Remanan Pillai wrote:
> > AFAIR, that's what v4 did:
> >
> > if (available_idle_cpu(cpu))
> > #ifdef CONFIG_SCHED_CORE
> > if (sched_core_enabled(cpu_rq(cpu)) &&
> > (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
> > break;
> > #else
> > break;
> > #endif
> >
> This patch was initially not in v4 and this is a merging of 4 patches
> suggested post-v4. During the initial round, code was like above. But since
> there looked like a code duplication in the different migration paths,
> it was consolidated into sched_core_cookie_match() and it caused this
> extra logic to this specific code path. As you mentioned, I also feel
> we do not need to check for core idleness in this path.
Ok, so I take it that you will make it so in v6 then, unless of course
someone else objects.
thanks!
- Joel
On Fri, Jun 12, 2020 at 10:25 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Ok, so I take it that you will make it so in v6 then, unless of course
> someone else objects.
>
Yes, just wanted to hear from Aubrey, Tim and others as well to see
if we have not missed anything obvious. Will have this in v6 if
there are no objections.
Thanks for bringing this up!
~Vineeth
On 2020/6/14 2:59, Vineeth Remanan Pillai wrote:
> On Fri, Jun 12, 2020 at 10:25 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>> Ok, so I take it that you will make it so in v6 then, unless of course
>> someone else objects.
>>
> Yes, just wanted to hear from Aubrey, Tim and others as well to see
> if we have not missed anything obvious. Will have this in v6 if
> there are no objections.
>
> Thanks for bringing this up!
>
> ~Vineeth
>
Yes, this makes sense to me, no need to find idle core in select_idle_cpu().
Thanks to catch this!
Thanks,
-Aubrey
On Wed, Mar 4, 2020 at 12:00 PM vpillai <vpillai@digitalocean.com> wrote:
>
> Fifth iteration of the Core-Scheduling feature.

It's probably time for another iteration, and we are planning to post v6 based on this branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y

Just wanted to share the details about v6 here before posting the patch series. If there is no objection to the following, we shall be posting the v6 early next week.

The main changes from v5 are the following:
1. Address Peter's comments in v5
   - Code cleanup
   - Remove fixes related to hotplugging.
   - Split the patch out for force idling starvation
2. Fix for RCU deadlock
3. Core wide priority comparison minor re-work
4. IRQ Pause patch
5. Documentation
   - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst

This version is much leaner compared to v5 due to the removal of hotplug support. As a result, dynamic coresched enable/disable on cpus, driven by smt on/off on the core, does not function anymore. I tried to reproduce the crashes during hotplug, but could not reproduce them reliably. The plan is to try to reproduce the crashes with v6 and document each corner case as we fix it. Previously, we fixed the issues ad hoc without clear documentation, and the fixes became complex over time.

TODO lists:

- Interface discussions could not come to a conclusion in v5, and hence we would like to restart the discussion and reach a consensus on it.
  - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
- Core wide vruntime calculation needs rework:
  - https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net
- Load balancing/migration changes ignore group weights:
  - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain

Please have a look and let me know comments/suggestions or anything missed.

Thanks,
Vineeth
On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai <vpillai@digitalocean.com> wrote:
[...]
> TODO lists:
>
> - Interface discussions could not come to a conclusion in v5 and hence would
>   like to restart the discussion and reach a consensus on it.
>   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org

Thanks Vineeth, just want to add: I have a revised implementation of prctl(2) where you only pass a TID of a task you'd like to share a core with (credit to Peter for the idea [1]), so we can make use of ptrace_may_access() checks. I am currently finishing writing kselftests for this and will post it all once it is ready.

However, a question: if using the prctl(2) on a CGroup-tagged task, we discussed in previous threads [2] overriding the CGroup cookie such that the task may no longer share a core with any of the tasks in its CGroup, and I think Peter and Phil are Ok with that. My question though is: would that not be confusing for anyone looking at the CGroup filesystem's "tag" and "tasks" files?

To resolve this, I am proposing to add a new CGroup file 'tasks.coresched' to the CGroup, which will only contain tasks that were assigned cookies due to their CGroup residency. As soon as one prctl(2)'s the task, it will stop showing up in the CGroup's "tasks.coresched" file (unless of course it was requesting to prctl-share a core with someone in its CGroup itself). Are folks Ok with this solution?

[1] https://lore.kernel.org/lkml/20200528170128.GN2483@worktop.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20200524140046.GA5598@lorien.usersys.redhat.com/
On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> <vpillai@digitalocean.com> wrote:
> [...]
> > TODO lists:
> >
> > - Interface discussions could not come to a conclusion in v5 and hence would
> >   like to restart the discussion and reach a consensus on it.
> >   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
>
> Thanks Vineeth, just want to add: I have a revised implementation of
> prctl(2) where you only pass a TID of a task you'd like to share a
> core with (credit to Peter for the idea [1]) so we can make use of
> ptrace_may_access() checks. I am currently finishing writing of
> kselftests for this and post it all once it is ready.

Thinking more about it, using TID/PID for prctl(2) and internally using a task identifier to identify a coresched group may have limitations. A coresched group can exist longer than the lifetime of a task, and then there is a chance for that identifier to be reused by a newer task which may or may not be a part of the same coresched group.

A way to overcome this is to have a coresched group with a separate identifier implemented internally and a mapping from task to group. And the cgroup framework provides exactly that.

I feel we could use prctl for isolating individual tasks/processes and use grouping frameworks like cgroup for core scheduling groups. The cpu cgroup might not be a good fit, as it has its own purpose: users might not always want a group of trusted tasks in the same cpu cgroup, and all the processes in an existing cpu cgroup might not be mutually trusted either.

What do you think about having a separate cgroup for coresched? Both the coresched cgroup and prctl() could co-exist, where prctl could be used to isolate an individual process or task and the coresched cgroup to group trusted processes.
> However a question: If using the prctl(2) on a CGroup tagged task, we > discussed in previous threads [2] to override the CGroup cookie such > that the task may not share a core with any of the tasks in its CGroup > anymore and I think Peter and Phil are Ok with. My question though is > - would that not be confusing for anyone looking at the CGroup > filesystem's "tag" and "tasks" files? > Having a dedicated cgroup for coresched could solve this problem as well. "coresched.tasks" inside the cgroup hierarchy would list all the taskx in the group and prctl can override this and take it out of the group. > To resolve this, I am proposing to add a new CGroup file > 'tasks.coresched' to the CGroup, and this will only contain tasks that > were assigned cookies due to their CGroup residency. As soon as one > prctl(2)'s the task, it will stop showing up in the CGroup's > "tasks.coresched" file (unless of course it was requesting to > prctl-share a core with someone in its CGroup itself). Are folks Ok > with this solution? > As I mentioned above, IMHO cpu cgroups should not be used to account for core scheduling as well. Cpu cgroups serve a different purpose and overloading it with core scheduling would not be flexible and scalable. But if there is a consensus to move forward with cpu cgroups, adding this new file seems to be okay with me. Thoughts/suggestions/concerns? Thanks, Vineeth
On Wed, Mar 4, 2020 at 12:00 PM vpillai <vpillai@digitalocean.com> wrote:
>
> Marks all tasks in a cgroup as matching for core-scheduling.
>
> A task will need to be moved into the core scheduler queue when the cgroup
> it belongs to is tagged to run with core scheduling. Similarly the task
> will need to be moved out of the core scheduler queue when the cgroup
> is untagged.
>
> Also after we forked a task, its core scheduler queue's presence will
> need to be updated according to its new cgroup's status.
>
This came up during a private discussion with Joel, and thanks to him for
bringing it up! Details below.

> @@ -7910,7 +7986,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
>         rq = task_rq_lock(task, &rf);
>
>         update_rq_clock(rq);
> +       if (sched_core_enqueued(task))
> +               sched_core_dequeue(rq, task);

A newly created task will not be enqueued, so do we need this here?

>         sched_change_group(task, TASK_SET_GROUP);
> +       if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
> +           task->core_cookie)
> +               sched_core_enqueue(rq, task);
>
Do we need this here? Soon after this, wake_up_new_task() is called, which
will ultimately call enqueue_task() and add the task to the coresched
rbtree. So we would be trying to enqueue twice. Also, this code will not
really enqueue, because task_on_rq_queued() would return false at this
point (activate_task() has not yet been called for this new task).

I am not sure if I missed another code path reaching here that does not
proceed with wake_up_new_task(). Please let me know if I missed anything.

Thanks,
Vineeth
On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > <vpillai@digitalocean.com> wrote:
> > [...]
> > > TODO lists:
> > >
> > >  - Interface discussions could not come to a conclusion in v5 and hence
> > >    would like to restart the discussion and reach a consensus on it.
> > >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> >
> > Thanks Vineeth, just want to add: I have a revised implementation of
> > prctl(2) where you only pass a TID of a task you'd like to share a
> > core with (credit to Peter for the idea [1]) so we can make use of
> > ptrace_may_access() checks. I am currently finishing writing
> > kselftests for this and will post it all once it is ready.
> >
> Thinking more about it, using TID/PID for prctl(2) and internally
> using a task identifier to identify a coresched group may have
> limitations. A coresched group can exist longer than the lifetime
> of a task and then there is a chance for that identifier to be
> reused by a newer task which may or may not be a part of the same
> coresched group.

True; for the prctl(2) tagging (a task wanting to share a core with
another) we will need some way of internally identifying groups which
does not depend on any value that can be reused for another purpose.

[..]
> What do you think about having a separate cgroup for coresched?
> Both the coresched cgroup and prctl() could co-exist, where prctl could
> be used to isolate an individual process or task and the coresched
> cgroup to group trusted processes.

This sounds like a fine idea to me. I wonder how Tejun and Peter feel
about having a new attribute-less CGroup controller for core-scheduling
and just using that for tagging. (No need to even have a tag file; just
adding/removing to/from the CGroup will tag.)

> However a question: If using the prctl(2) on a CGroup-tagged task, we
> discussed in previous threads [2] overriding the CGroup cookie such
> that the task may not share a core with any of the tasks in its CGroup
> anymore, and I think Peter and Phil are Ok with that. My question
> though is: would that not be confusing for anyone looking at the
> CGroup filesystem's "tag" and "tasks" files?
>
> Having a dedicated cgroup for coresched could solve this problem
> as well. "coresched.tasks" inside the cgroup hierarchy would list all
> the tasks in the group and prctl can override this and take it out
> of the group.

We don't even need coresched.tasks; just the existing 'tasks' file of
CGroups can be used.

> To resolve this, I am proposing to add a new CGroup file
> 'tasks.coresched' to the CGroup, and this will only contain tasks that
> were assigned cookies due to their CGroup residency. As soon as one
> prctl(2)'s the task, it will stop showing up in the CGroup's
> "tasks.coresched" file (unless of course it was requesting to
> prctl-share a core with someone in its CGroup itself). Are folks Ok
> with this solution?
>
> As I mentioned above, IMHO cpu cgroups should not be used to account
> for core scheduling as well. Cpu cgroups serve a different purpose
> and overloading them with core scheduling would not be flexible or
> scalable. But if there is a consensus to move forward with cpu cgroups,
> adding this new file seems okay to me.

Yes, this is the problem. Many people already use CPU controller CGroups
for other purposes. In that case, tagging a CGroup would make all the
entities in the group able to share a core, which may not always make
sense. Maybe a new CGroup controller is the answer (?).

thanks,

- Joel
On Fri, Jun 26, 2020 at 11:10:28AM -0400, Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > > <vpillai@digitalocean.com> wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence
> > > >    would like to restart the discussion and reach a consensus on it.
> > > >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing
> > > kselftests for this and will post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify a coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be a part of the same
> > coresched group.
>
> True; for the prctl(2) tagging (a task wanting to share a core with
> another) we will need some way of internally identifying groups which
> does not depend on any value that can be reused for another purpose.
>
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both the coresched cgroup and prctl() could co-exist, where prctl could
> > be used to isolate an individual process or task and the coresched
> > cgroup to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel
> about having a new attribute-less CGroup controller for core-scheduling
> and just using that for tagging. (No need to even have a tag file; just
> adding/removing to/from the CGroup will tag.)

+Tejun

thanks,

- Joel

> > > However a question: If using the prctl(2) on a CGroup-tagged task, we
> > > discussed in previous threads [2] overriding the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore, and I think Peter and Phil are Ok with that. My question
> > > though is: would that not be confusing for anyone looking at the
> > > CGroup filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group and prctl can override this and take it out
> > of the group.
>
> We don't even need coresched.tasks; just the existing 'tasks' file of
> CGroups can be used.
>
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading them with core scheduling would not be flexible or
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems okay to me.
>
> Yes, this is the problem. Many people already use CPU controller CGroups
> for other purposes. In that case, tagging a CGroup would make all the
> entities in the group able to share a core, which may not always make
> sense. Maybe a new CGroup controller is the answer (?).
>
> thanks,
>
> - Joel
>
On Fri, Jun 26, 2020 at 11:10 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).
Unless there are any major objections to this idea, or better ideas
for CGroup users, we will consider proposing a new CGroup controller
for this. The issue with CPU controller CGroups being they may be
configured in a way that is incompatible with tagging.
And I was also thinking of a new clone flag CLONE_CORE (which allows a
child to share a parent's core). This is because the fork semantics
are not clear, and it may be better IMHO to leave the fork behavior to
userspace rather than hard-coding policy in the kernel.
Perhaps we can also discuss this at the scheduler MC at Plumbers.
Any other thoughts?
- Joel
Hi Vineeth,

On 2020/6/26 4:12, Vineeth Remanan Pillai wrote:
> On Wed, Mar 4, 2020 at 12:00 PM vpillai <vpillai@digitalocean.com> wrote:
>>
>> Fifth iteration of the Core-Scheduling feature.
>>
> It's probably time for an iteration, and we are planning to post v6 based
> on this branch:
> https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y
>
> Just wanted to share the details about v6 here before posting the patch
> series. If there is no objection to the following, we shall be posting
> the v6 early next week.
>
> The main changes from v5 are the following:
> 1. Address Peter's comments on v5
>    - Code cleanup
>    - Remove fixes related to hotplugging.
>    - Split out the patch for force-idle starvation.
> 2. Fix for RCU deadlock
> 3. Minor rework of the core-wide priority comparison
> 4. IRQ pause patch
> 5. Documentation
>    - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>
> This version is much leaner compared to v5 due to the removal of hotplug
> support. As a result, dynamic coresched enable/disable on cpus due to
> smt on/off on the core does not function anymore. I tried to reproduce
> the crashes during hotplug, but could not reproduce them reliably. The
> plan is to try to reproduce the crashes with v6 and document each corner
> case as we fix it. Previously, we randomly fixed the issues without
> clear documentation, and the fixes became complex over time.
>
> TODO lists:
>
>  - Interface discussions could not come to a conclusion in v5 and hence
>    would like to restart the discussion and reach a consensus on it.
>    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
>
>  - Core-wide vruntime calculation needs rework:
>    - https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net
>
>  - Load balancing/migration changes ignore group weights:
>    - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain

According to Aaron's response below:
https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/

the following logic seems to be helpful for Aaron's case:

+	/*
+	 * Ignore cookie match if there is a big imbalance between the src rq
+	 * and dst rq.
+	 */
+	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+		return true;

I didn't see any other comments on the patch here:
https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b68c4@linux.intel.com/

Do we have another way to address this issue?

Thanks,
-Aubrey
Hi Aubrey,
On Mon, Jun 29, 2020 at 8:34 AM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> > - Load balancing/migration changes ignores group weights:
> > - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain
>
> According to Aaron's response below:
> https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/
>
> The following logic seems to be helpful for Aaron's case.
>
> + /*
> + * Ignore cookie match if there is a big imbalance between the src rq
> + * and dst rq.
> + */
> + if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> + return true;
>
> I didn't see any other comments on the patch at here:
> https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b68c4@linux.intel.com/
>
> Do we have another way to address this issue?
>
We do not have a clear fix for this yet, and did not get much time to
work on this.
I feel that the above change would not fix the real issue.
The issue is that the weight of the group is not considered when we
try to load balance; the above change checks only nr_running, which
might not always work. I feel that we should fix the real issue in
v6 and probably hold off on adding the workaround
fix in the interim. I have added a TODO specifically for this bug
in v6.
What do you think?
Thanks,
Vineeth
On Fri, Jun 26, 2020 at 11:10:28AM -0400 Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > > <vpillai@digitalocean.com> wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence
> > > >    would like to restart the discussion and reach a consensus on it.
> > > >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing
> > > kselftests for this and will post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify a coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be a part of the same
> > coresched group.
>
> True; for the prctl(2) tagging (a task wanting to share a core with
> another) we will need some way of internally identifying groups which
> does not depend on any value that can be reused for another purpose.
>
That was my concern as well. That's why I was thinking it should be an
arbitrary, user/admin/orchestrator-defined value and not be the
responsibility of the kernel at all. However...

> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both the coresched cgroup and prctl() could co-exist, where prctl could
> > be used to isolate an individual process or task and the coresched
> > cgroup to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel
> about having a new attribute-less CGroup controller for core-scheduling
> and just using that for tagging. (No need to even have a tag file; just
> adding/removing to/from the CGroup will tag.)
>
... this could be an interesting approach. Then the cookie could still be
the cgroup address, as is, and there would be no need for the prctl. At
least so it seems.

Cheers,
Phil

> > > However a question: If using the prctl(2) on a CGroup-tagged task, we
> > > discussed in previous threads [2] overriding the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore, and I think Peter and Phil are Ok with that. My question
> > > though is: would that not be confusing for anyone looking at the
> > > CGroup filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group and prctl can override this and take it out
> > of the group.
>
> We don't even need coresched.tasks; just the existing 'tasks' file of
> CGroups can be used.
>
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading them with core scheduling would not be flexible or
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems okay to me.
>
> Yes, this is the problem. Many people already use CPU controller CGroups
> for other purposes. In that case, tagging a CGroup would make all the
> entities in the group able to share a core, which may not always make
> sense. Maybe a new CGroup controller is the answer (?).
>
> thanks,
>
> - Joel
>

--