Message-Id: <20190218173514.667598558@infradead.org>
User-Agent: quilt/0.65
Date: Mon, 18 Feb 2019 17:56:33 +0100
From: Peter Zijlstra
To: mingo@kernel.org, tglx@linutronix.de, pjt@google.com, tim.c.chen@linux.intel.com, torvalds@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, subhra.mazumdar@oracle.com, fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com, "Peter Zijlstra (Intel)"
Subject: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
References: <20190218165620.383905466@infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

Instead of only selecting a local task, select a task for all SMT
siblings on every reschedule on the core (irrespective of which logical
CPU does the reschedule).

NOTE: there is still potential for sibling rivalry.

NOTE: this is far too complicated; but thus far I've failed to simplify
it further.
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c  | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   5 -
 2 files changed, 224 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3552,7 +3552,7 @@ static inline void schedule_debug(struct
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -3597,6 +3597,220 @@ pick_next_task(struct rq *rq, struct tas
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = 0UL;
+
+	/*
+	 * We must not rely on rq->core->core_cookie here, because we fail to reset
+	 * rq->core->core_cookie on new picks, such that we can detect if we need
+	 * to do single vs multi rq task selection.
+	 */
+
+	if (max && max->core_cookie) {
+		WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
+		cookie = max->core_cookie;
+	}
+
+	class_pick = class->pick_task(rq);
+	if (!cookie)
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+	if (!class_pick)
+		return cookie_pick;
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (cpu_prio_less(cookie_pick, class_pick) && cpu_prio_less(max, class_pick))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	cpu = cpu_of(rq);
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+
+	/* reset state */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't any cookies; we don't need
+				 * to bother with the other siblings.
+				 */
+				if (i == cpu && !rq->core->core_cookie)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !rq->core->core_cookie && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || core_prio_less(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max && !cookie_match(old_max, p)) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				}
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		WARN_ON_ONCE(!rq_i->core_pick);
+
+		if (i == cpu)
+			continue;
+
+		if (rq_i->curr != rq_i->core_pick)
+			resched_curr(rq_i);
+	}
+
+	rq->core_sched_seq = rq->core->core_pick_seq;
+	next = rq->core_pick;
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -5866,7 +6080,7 @@ static void migrate_tasks(struct rq *dea
 
 	/*
 	 * pick_next_task() assumes pinned rq->lock:
 	 */
-	next = pick_next_task(rq, &fake_task, rf);
+	next = __pick_next_task(rq, &fake_task, rf);
 	BUG_ON(!next);
 	put_prev_task(rq, next);
 
@@ -6322,7 +6536,11 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -960,11 +960,15 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1770,7 +1774,6 @@ static inline void put_prev_task(struct
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next);
 }