From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5F9D4C4332F for ; Mon, 12 Dec 2022 21:33:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233228AbiLLVdU (ORCPT ); Mon, 12 Dec 2022 16:33:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34344 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229872AbiLLVdP (ORCPT ); Mon, 12 Dec 2022 16:33:15 -0500 Received: from mail-pj1-x1029.google.com (mail-pj1-x1029.google.com [IPv6:2607:f8b0:4864:20::1029]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6C534DDC; Mon, 12 Dec 2022 13:33:14 -0800 (PST) Received: by mail-pj1-x1029.google.com with SMTP id u15-20020a17090a3fcf00b002191825cf02so1412821pjm.2; Mon, 12 Dec 2022 13:33:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:sender:from:to:cc:subject:date:message-id :reply-to; bh=DAwiiJ3gzBmaogwEMCs5wwB0PWSWlCpNEYVO2I6Gx2A=; b=WGRXmtvVuu6kPUpvgb4SRtKcVotSdyJUJzDOzCG42ZzUpb6qhS0dY9lSlBLeQTKu/i 48c3/Z172q3e+bemEtGPMeA2eaPJgq6fcLQ3/r6LHGjdWUYxZhKxmTPVecDpc+WMO+Mr Yd2nGNvHCAe0GIMEgTJ9NiYVV0hgy5LTgYHBIcJfxmXVm1vr/u7P1RN3Md8jMbagciUL 6NWjn5vf9ZDuMCK/vfOCoKN4+Nw8vSXvm25/LfsIbzVyzQjUt0B8adSEnbxbTAM50sw/ E/umkAe/0Ueild7o7vIUzVVEqRm0Gsofaou7cY8doooLeNdBYX3pcAQKm6WxDj3BxcMi Ja4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:sender:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=DAwiiJ3gzBmaogwEMCs5wwB0PWSWlCpNEYVO2I6Gx2A=; b=qXmK2Oz677zeNPaTcytDhpae/bK7zFsMyuVOA+LtYRFB+5diz9XDhv5twh+lRl3UaT Uof+fFbJzLXzxRUzUw70AMdFFXt9iH1OMALB3juQcEJVwASAatGhjgONsP1/RBXigeAB uGC6btp16/nLpDixeZCXBpnAoQwrdHiW+8THCcd7YtV8yAo1HUkRfsaZYkpgdAdtB8ZX aos/sXJGFnWq92jpwyjcvfEjD6lHxVudTzIdhkmXm3Ag+HBzUhGOxXDEG/muvasyNzfw 8H74fe0y2hhoGDDFOY5O07Y+YcynQy010YbpkMNxBAldpDrk7LciZ1mH0yikHqjcVv/F wPPA== X-Gm-Message-State: ANoB5pmyIgDnHTSeyyBdpUXzrjKmE5SsPAw+8eJGREIUKO3N/y1RvFwW JhxzdXmYCJU6FI7dPTVuFl4= X-Google-Smtp-Source: AA0mqf6qqt7P5sZgPWb2d/3ZdsVhc85L0yyAivW6G3O+FWF9vxSUkK+zgywGm9geQ8J+YBKza9RQvA== X-Received: by 2002:a17:902:eb82:b0:189:da3b:17ae with SMTP id q2-20020a170902eb8200b00189da3b17aemr19154032plg.14.1670880793761; Mon, 12 Dec 2022 13:33:13 -0800 (PST) Received: from localhost (2603-800c-1a02-1bae-a7fa-157f-969a-4cde.res6.spectrum.com. [2603:800c:1a02:1bae:a7fa:157f:969a:4cde]) by smtp.gmail.com with ESMTPSA id b18-20020a170903229200b00182a9c27acfsm6842253plh.227.2022.12.12.13.33.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 12 Dec 2022 13:33:13 -0800 (PST) Sender: Tejun Heo Date: Mon, 12 Dec 2022 11:33:12 -1000 From: Tejun Heo To: Peter Zijlstra Cc: torvalds@linux-foundation.org, mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org, joshdon@google.com, brho@google.com, pjt@google.com, derkling@google.com, haoluo@google.com, dvernet@meta.com, dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com, linux-kernel@vger.kernel.org, bpf@vger.kernel.org, kernel-team@meta.com Subject: Re: [PATCH 14/31] sched_ext: Implement BPF extensible scheduler class Message-ID: References: <20221130082313.3241517-1-tj@kernel.org> <20221130082313.3241517-15-tj@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Mon, Dec 12, 2022 at 01:53:55PM +0100, Peter Zijlstra wrote: > Perhaps the saner way to write this is: > > #ifdef CONFIG_SMP > BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class)); > #endif > BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class)); > BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class)); > BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class)); > #ifdef CONFIG_... > BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class)); > BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class)); > #endif Yeah, this looks way better. Will update. > > +static inline const struct sched_class *next_active_class(const struct sched_class *class) > > +{ > > + class++; > > + if (!scx_enabled() && class == &ext_sched_class) > > + class++; > > + return class; > > +} > > + > > +#define for_active_class_range(class, _from, _to) \ > > + for (class = (_from); class != (_to); class = next_active_class(class)) > > + > > +#define for_each_active_class(class) \ > > + for_active_class_range(class, __sched_class_highest, __sched_class_lowest) > > + > > +/* > > + * SCX requires a balance() call before every pick_next_task() call including > > + * when waking up from idle. > > + */ > > +#define for_balance_class_range(class, prev_class, end_class) \ > > + for_active_class_range(class, (prev_class) > &ext_sched_class ? \ > > + &ext_sched_class : (prev_class), (end_class)) > > + > > This seems quite insane; why not simply make the ext methods effective > no-ops? Both balance and pick explicitly support that already, no? Yeah, we can definitely do that. It's just nice to guarantee from the core code that we aren't calling into a sched class which isn't enabled at the moment. If you take a look at "[PATCH 20/31] sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext", it's used the other way around too to elide calling into CFS when it knows that there are no tasks there. If the core code doesn't elide the calls, we might need some gating in CFS too. > > @@ -5800,10 +5812,13 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, > > * We can terminate the balance pass as soon as we know there is > > * a runnable task of @class priority or higher. > > */ > > - for_class_range(class, prev->sched_class, &idle_sched_class) { > > + for_balance_class_range(class, prev->sched_class, &idle_sched_class) { > > if (class->balance(rq, prev, rf)) > > break; > > } > > +#else > > + /* SCX needs the balance call even in UP, call it explicitly */ > > This, *WHY* !?! This comes from the fact that there are no strict rq boundaries and the BPF scheduler can share scheduling queues across any subset of CPUs however they see fit. So, task <-> rq association is flexible until the very last moment the task gets picked for execution. For the dispatch path to support this, it needs to be able to migrate tasks across rq's which requires unlocking the current rq which can only be done in ->balance(). So, sched_ext uses ->balance() to run the dispatch path and thus needs it called even on UP. Given that UP doesn't need to transfer tasks across, it might be possible to move the whole dispatch operation into ->pick_next_task() but the current state would be different, so it's more complicated and will likely be more brittle. > > @@ -5876,7 +5894,7 @@ static inline struct task_struct *pick_task(struct rq *rq) > > const struct sched_class *class; > > struct task_struct *p; > > > > - for_each_class(class) { > > + for_each_active_class(class) { > > p = class->pick_task(rq); > > if (p) > > return p; > > > But this.. afaict that means that: > > - the whole EXT thing is incompatible with SCHED_CORE Can you expand on why this would be? I didn't test against SCHED_CORE, so am sure things might be broken but can't think of a reason why it'd be fundamentally incompatible. > - the whole EXT thing can be trivially starved by the presence of a > single CFS/BATCH/IDLE task. It's a simliar situation w/ RT vs. CFS, which is resolved via RT having starvation avoidance. Here, the way it's handled is a bit different, SCX has a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement runnable task stall watchdog", so if SCX tasks hang for whatever reason including being starved by CFS, it will get aborted and all tasks will be handed back to CFS. IOW, it's treated like any other BPF scheduler errors that can lead to stalls and recovered the same way. Thanks. -- tejun