From: Andrea Righi <andrea.righi@canonical.com>
To: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, ast@kernel.org,
	daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org,
	joshdon@google.com, brho@google.com, pjt@google.com,
	derkling@google.com, haoluo@google.com, dvernet@meta.com,
	dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com,
	changwoo@igalia.com, himadrics@inria.fr, memxor@gmail.com,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class
Date: Mon, 13 Nov 2023 15:04:24 -0500	[thread overview]
Message-ID: <ZVKBSIPqJnAvrE3g@gpd> (raw)
In-Reply-To: <20231111024835.2164816-13-tj@kernel.org>

On Fri, Nov 10, 2023 at 04:47:38PM -1000, Tejun Heo wrote:
> Implement a new scheduler class sched_ext (SCX), which allows scheduling
> policies to be implemented as BPF programs to achieve the following:
> 
> 1. Ease of experimentation and exploration: Enabling rapid iteration of new
>    scheduling policies.
> 
> 2. Customization: Building application-specific schedulers which implement
>    policies that are not applicable to general-purpose schedulers.
> 
> 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
>    policies in production environments.
> 
> sched_ext leverages BPF’s struct_ops feature to define a structure which
> exports function callbacks and flags to BPF programs that wish to implement
> scheduling policies. The struct_ops structure exported by sched_ext is
> struct sched_ext_ops, and is conceptually similar to struct sched_class. The
> role of sched_ext is to map the complex sched_class callbacks to the simpler
> and more ergonomic struct sched_ext_ops callbacks.
> 
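
To give a feel for what this looks like on the BPF side, here is a rough
sketch of about the smallest possible scheduler. The header name, the
BPF_STRUCT_OPS() helper and SCX_SLICE_DFL are borrowed from the example
schedulers added later in the series, so treat the exact spellings and
signatures as illustrative rather than authoritative:

  /* Minimal sketch, not the actual example scheduler from the series. */
  #include "scx_common.bpf.h"

  char _license[] SEC("license") = "GPL";

  void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
  {
  	/* Put every runnable task on the shared global FIFO with the
  	 * default time slice; each CPU pulls from it in turn. */
  	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  SEC(".struct_ops")
  struct sched_ext_ops minimal_ops = {
  	.enqueue	= (void *)minimal_enqueue,
  	.name		= "minimal",
  };

Only .name is strictly required; as noted below, any omitted operation
falls back to a simple but functional default.
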
> For more detailed discussion on the motivations and overview, please refer
> to the cover letter.
> 
> Later patches will also add several example schedulers and documentation.
> 
> This patch implements the minimum core framework to enable implementation of
> BPF schedulers. Subsequent patches will gradually add functionalities
> including safety guarantee mechanisms, nohz and cgroup support.
> 
> include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
> top, each operation should be self-explanatory. The following are worth
> noting:
> 
> * Both "sched_ext" and its shorthand "scx" are used. If the identifier
>   already has "sched" in it, "ext" is used; otherwise, "scx".
> 
> * In sched_ext_ops, only .name is mandatory. Every operation is optional and
>   if omitted a simple but functional default behavior is provided.
> 
> * A new policy constant SCHED_EXT is added and a task can select sched_ext
>   by invoking sched_setscheduler(2) with the new policy constant. However,
>   if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
>   and the task is scheduled by CFS. When the BPF scheduler is loaded, all
>   tasks which have the SCHED_EXT policy are switched to sched_ext.
> 
> * To bridge the workflow imbalance between the scheduler core and
>   sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
>   queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
>   one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
>   convenience and need not be used by a scheduler that doesn't require it.
>   SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
>   the next task on the CPU. The BPF scheduler can manage an arbitrary number
>   of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq() (see the
>   sketch after this list).
> 
> * sched_ext guarantees system integrity no matter what the BPF scheduler
>   does. To enable this, each task's ownership is tracked through
>   p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
>   can always recover and revert all tasks back to CFS. See p->scx.ops_state
>   and scx_tasks.
> 
> * A task is not tied to its rq while enqueued. This decouples CPU selection
>   from queueing and allows sharing a scheduling queue across an arbitrary
>   subset of CPUs. This adds some complexities as a task may need to be
>   bounced between rq's right before it starts executing. See
>   dispatch_to_local_dsq() and move_task_to_local_dsq().
> 
> * One complication that arises from the above weak association between task
>   and rq is that synchronizing with dequeue() gets complicated as dequeue()
>   may happen anytime while the task is enqueued and the dispatch path might
>   need to release the rq lock to transfer the task. Solving this requires a
>   bit of complexity. See the logic around p->scx.sticky_cpu and
>   p->scx.ops_qseq.
> 
> * Both enable and disable paths are a bit complicated. The enable path
>   switches all tasks without blocking to avoid issues which can arise from
>   partially switched states (e.g. the switching task itself being starved).
>   The disable path can't trust the BPF scheduler at all, so it also has to
>   guarantee forward progress without blocking. See scx_ops_enable() and
>   scx_ops_disable_workfn().
> 
> * When sched_ext is disabled, static_branches are used to shut down the
>   entry points from hot paths.
> 
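
To make the dsq flow above a bit more concrete, here is a rough sketch of a
scheduler that feeds a single custom dsq. scx_bpf_create_dsq() is named
above and scx_bpf_dispatch() appears later in the changelog;
scx_bpf_consume() and the BPF_STRUCT_OPS*() wrappers follow the example
schedulers, and MY_DSQ_ID plus the op names are arbitrary, so treat the
details as illustrative only:

  #define MY_DSQ_ID	0	/* arbitrary ID chosen by the BPF scheduler */

  s32 BPF_STRUCT_OPS_SLEEPABLE(myops_init)
  {
  	/* Create one shared FIFO at load time; -1 = no NUMA preference. */
  	return scx_bpf_create_dsq(MY_DSQ_ID, -1);
  }

  void BPF_STRUCT_OPS(myops_enqueue, struct task_struct *p, u64 enq_flags)
  {
  	/* Queue every runnable task on the shared FIFO. */
  	scx_bpf_dispatch(p, MY_DSQ_ID, SCX_SLICE_DFL, enq_flags);
  }

  void BPF_STRUCT_OPS(myops_dispatch, s32 cpu, struct task_struct *prev)
  {
  	/* This CPU is out of work: move the head of the shared FIFO to
  	 * the CPU's SCX_DSQ_LOCAL so it runs next. */
  	scx_bpf_consume(MY_DSQ_ID);
  }
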
> v5: * To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
>       instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
>       load_acquire/store_release is now unsigned long instead of u64.
> 
>     * Fix the bug where bpf_scx_btf_struct_access() was allowing write
>       access to arbitrary fields.
> 
>     * Distinguish kfuncs which can be called from any sched_ext ops and from
>       anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
>       sched_ext ops.
> 
>     * Rename "type" to "kind" in scx_exit_info to make it easier to use in
>       languages in which "type" is a reserved keyword.
> 
>     * Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
>       setup"), PF_IDLE is not set on idle tasks which haven't been online
>       yet, which made scx_task_iter_next_filtered() include those idle tasks
>       in iterations, leading to oopses. Update scx_task_iter_next_filtered()
>       to directly test p->sched_class against idle_sched_class instead of
>       using is_idle_task() which tests PF_IDLE.
> 
>     * Other updates to match upstream changes such as adding const to
>       set_cpumask() param and renaming check_preempt_curr() to
>       wakeup_preempt().
> 
> v4: * SCHED_CHANGE_BLOCK replaced with the previous
>       sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
>       because upstream is adopting a different generic cleanup mechanism.
>       Once that lands, the code will be adapted accordingly.
> 
>     * task_on_scx() used to test whether a task should be switched into SCX,
>       which is confusing. Renamed to task_should_scx(). task_on_scx() now
>       tests whether a task is currently on SCX.
> 
>     * scx_has_idle_cpus is barely used anymore and has been replaced with a
>       direct check on the idle cpumask.
> 
>     * SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
>       fully idle cores.
> 
>     * ops.enable() now sees up-to-date p->scx.weight value.
> 
>     * ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
>       schedulers expecting a ->select_cpu() call.
> 
>     * Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
>       of the scheduler.
> 
> v3: * ops.set_weight() added to allow BPF schedulers to track weight changes
>       without polling p->scx.weight.
> 
>     * move_task_to_local_dsq() was losing SCX-specific enq_flags when
>       enqueueing the task on the target dsq because it goes through
>       activate_task() which loses the upper 32bit of the flags. Carry the
>       flags through rq->scx.extra_enq_flags.
> 
>     * scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
>       and scx_bpf_task_cpu() now use the new KF_RCU instead of
>       KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
> 
>     * The kfunc helper access control mechanism implemented through
>       sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
>       used when invoking scx_ops operations.
> 
> v2: * balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
>       called from put_prev_task_scx() and pick_next_task_scx() as necessary.
>       To determine whether balance_scx() should be called from
>       put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
>       comment in put_prev_task_scx() for details.
> 
>     * sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
>       with SCHED_CHANGE_BLOCK().
> 
>     * Unused all_dsqs list removed. This was a left-over from previous
>       iterations.
> 
>     * p->scx.kf_mask is added to track and enforce which kfunc helpers are
>       allowed. Also, init/exit sequences are updated to make some kfuncs
>       always safe to call regardless of the current BPF scheduler state.
>       Combined, this should make all the kfuncs safe.
> 
>     * BPF now supports sleepable struct_ops operations. Hacky workaround
>       removed and operations and kfunc helpers are tagged appropriately.
> 
>     * BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
>       and friends are added so that BPF schedulers can use the idle masks
>       with the generic helpers. This replaces the hacky kfunc helpers added
>       by a separate patch in V1.
> 
>     * CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
>       enabled. This restriction will be removed by a later patch which adds
>       core-sched support.
> 
>     * Add MAINTAINERS entries and other misc changes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Co-authored-by: David Vernet <dvernet@meta.com>
> Acked-by: Josh Don <joshdon@google.com>
> Acked-by: Hao Luo <haoluo@google.com>
> Acked-by: Barret Rhoden <brho@google.com>
> Cc: Andrea Righi <andrea.righi@canonical.com>

...

> +#ifdef CONFIG_SCHED_DEBUG
> +static const char *scx_ops_enable_state_str[] = {
> +	[SCX_OPS_PREPPING]	= "prepping",
> +	[SCX_OPS_ENABLING]	= "enabling",
> +	[SCX_OPS_ENABLED]	= "enabled",
> +	[SCX_OPS_DISABLING]	= "disabling",
> +	[SCX_OPS_DISABLED]	= "disabled",
> +};

We may want to move scx_ops_enable_state_str[] outside of
CONFIG_SCHED_DEBUG, because it is used later by print_scx_info()
("sched_ext: Print sched_ext info when dumping stack"); alternatively,
we could make print_scx_info() dependent on CONFIG_SCHED_DEBUG.
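
i.e. something along these lines (untested sketch), which keeps the string
table available to print_scx_info() while any debug-only users stay under
the #ifdef:

  /* Used by print_scx_info() regardless of CONFIG_SCHED_DEBUG. */
  static const char *scx_ops_enable_state_str[] = {
  	[SCX_OPS_PREPPING]	= "prepping",
  	[SCX_OPS_ENABLING]	= "enabling",
  	[SCX_OPS_ENABLED]	= "enabled",
  	[SCX_OPS_DISABLING]	= "disabling",
  	[SCX_OPS_DISABLED]	= "disabled",
  };

  #ifdef CONFIG_SCHED_DEBUG
  /* ... debug-only users of scx_ops_enable_state_str[], if any ... */
  #endif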

-Andrea

Thread overview: 54+ messages
2023-11-11  2:47 [PATCHSET v5] sched: Implement BPF extensible scheduler class Tejun Heo
2023-11-11  2:47 ` [PATCH 01/36] cgroup: Implement cgroup_show_cftypes() Tejun Heo
2023-11-11  2:47 ` [PATCH 02/36] sched: Restructure sched_class order sanity checks in sched_init() Tejun Heo
2023-11-11  2:47 ` [PATCH 03/36] sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() Tejun Heo
2023-11-11  2:47 ` [PATCH 04/36] sched: Add sched_class->reweight_task() Tejun Heo
2023-11-11  2:47 ` [PATCH 05/36] sched: Add sched_class->switching_to() and expose check_class_changing/changed() Tejun Heo
2023-11-11  2:47 ` [PATCH 06/36] sched: Factor out cgroup weight conversion functions Tejun Heo
2023-11-11  2:47 ` [PATCH 07/36] sched: Expose css_tg() and __setscheduler_prio() Tejun Heo
2023-11-11  2:47 ` [PATCH 08/36] sched: Enumerate CPU cgroup file types Tejun Heo
2023-11-11  2:47 ` [PATCH 09/36] sched: Add @reason to sched_class->rq_{on|off}line() Tejun Heo
2023-11-11  2:47 ` [PATCH 10/36] sched: Add normal_policy() Tejun Heo
2023-11-11  2:47 ` [PATCH 11/36] sched_ext: Add boilerplate for extensible scheduler class Tejun Heo
2023-11-11  2:47 ` [PATCH 12/36] sched_ext: Implement BPF " Tejun Heo
2023-11-13 13:34   ` Changwoo Min
2023-11-14 19:07     ` Tejun Heo
2023-11-13 20:04   ` Andrea Righi [this message]
2023-11-14 19:07     ` Tejun Heo
2023-11-23  8:07   ` Andrea Righi
2023-11-25 19:59     ` Tejun Heo
2023-11-26  9:05       ` Andrea Righi
2023-12-07  2:04   ` [PATCH] scx: set p->scx.ops_state using atomic_long_set_release Changwoo Min
2023-12-08  0:16     ` Tejun Heo
2024-03-23  2:37   ` [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class Joel Fernandes
2024-03-23 22:12     ` Tejun Heo
2024-04-25 21:28       ` Joel Fernandes
2024-04-26 16:57         ` Barret Rhoden
2024-04-26 21:58         ` Tejun Heo
2023-11-11  2:47 ` [PATCH 13/36] sched_ext: Add scx_simple and scx_example_qmap example schedulers Tejun Heo
2023-11-12  4:17   ` kernel test robot
2023-11-12 18:06     ` Tejun Heo
2023-11-11  2:47 ` [PATCH 14/36] sched_ext: Add sysrq-S which disables the BPF scheduler Tejun Heo
2023-11-11  2:47 ` [PATCH 15/36] sched_ext: Implement runnable task stall watchdog Tejun Heo
2023-11-11  2:47 ` [PATCH 16/36] sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT Tejun Heo
2023-11-11  2:47 ` [PATCH 17/36] sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext Tejun Heo
2023-11-11  2:47 ` [PATCH 18/36] sched_ext: Print sched_ext info when dumping stack Tejun Heo
2023-11-14 19:23   ` [PATCH v2 " Tejun Heo
2023-11-11  2:47 ` [PATCH 19/36] sched_ext: Implement scx_bpf_kick_cpu() and task preemption support Tejun Heo
2023-11-11  2:47 ` [PATCH 20/36] sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU Tejun Heo
2023-11-11  2:47 ` [PATCH 21/36] sched_ext: Make watchdog handle ops.dispatch() looping stall Tejun Heo
2023-11-11  2:47 ` [PATCH 22/36] sched_ext: Add task state tracking operations Tejun Heo
2023-11-11  2:47 ` [PATCH 23/36] sched_ext: Implement tickless support Tejun Heo
2023-11-11  2:47 ` [PATCH 24/36] sched_ext: Track tasks that are subjects of the in-flight SCX operation Tejun Heo
2023-11-11  2:47 ` [PATCH 25/36] sched_ext: Add cgroup support Tejun Heo
2023-11-11  2:47 ` [PATCH 26/36] sched_ext: Add a cgroup-based core-scheduling scheduler Tejun Heo
2023-11-11  2:47 ` [PATCH 27/36] sched_ext: Add a cgroup scheduler which uses flattened hierarchy Tejun Heo
2023-11-11  2:47 ` [PATCH 28/36] sched_ext: Implement SCX_KICK_WAIT Tejun Heo
2023-11-11  2:47 ` [PATCH 29/36] sched_ext: Implement sched_ext_ops.cpu_acquire/release() Tejun Heo
2023-11-11  2:47 ` [PATCH 30/36] sched_ext: Implement sched_ext_ops.cpu_online/offline() Tejun Heo
2023-11-11  2:47 ` [PATCH 31/36] sched_ext: Implement core-sched support Tejun Heo
2023-11-11  2:47 ` [PATCH 32/36] sched_ext: Add vtime-ordered priority queue to dispatch_q's Tejun Heo
2023-11-11  2:47 ` [PATCH 33/36] sched_ext: Documentation: scheduler: Document extensible scheduler class Tejun Heo
2023-11-11  2:48 ` [PATCH 34/36] sched_ext: Add a basic, userland vruntime scheduler Tejun Heo
2023-11-11  2:48 ` [PATCH 35/36] sched_ext: Add scx_rusty, a rust userspace hybrid scheduler Tejun Heo
2023-11-11  2:48 ` [PATCH 36/36] sched_ext: Add scx_layered, a highly configurable multi-layer scheduler Tejun Heo
