From: Qais Yousef <qyousef@layalina.io>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
	torvalds@linux-foundation.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, ast@kernel.org,
	daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org,
	joshdon@google.com, brho@google.com, pjt@google.com,
	derkling@google.com, haoluo@google.com, dvernet@meta.com,
	dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com,
	changwoo@igalia.com, himadrics@inria.fr, memxor@gmail.com,
	andrea.righi@canonical.com, joel@joelfernandes.org,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class
Date: Tue, 14 May 2024 01:07:15 +0100
Message-ID: <20240514000715.4765jfpwi5ovlizj@airbuntu>
In-Reply-To: <20240513142646.4dc5484d@rorschach.local.home>

On 05/13/24 14:26, Steven Rostedt wrote:
> On Mon, 13 May 2024 10:03:59 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > I believe we agree that we want more people contributing to the scheduling
> > > area.   
> > 
> > I think therein lies the rub -- contribution. If we were to do this
> > thing, random loadable BPF schedulers, then how do we ensure people will
> > contribute back?
> 
> Hi Peter,
> 
> I'm somewhat agnostic to sched_ext itself, but I have been an advocate
> for a plugable scheduler infrastructure. And we are seriously looking
> at adding it to ChromeOS.
> 
> > 
> > That is, from where I am sitting I see $vendor mandate their $enterprise
> > product needs their $BPF scheduler. At which point $vendor will have no
> > incentive to ever contribute back.
> 
> Believe me they already have their own scheduler, and because its so
> different, it's very hard to contribute back.
> 
> > 
> > And customers of $vendor that want to run additional workloads on
> > their machine are then stuck with that scheduler, irrespective of it
> > being suitable for them or not. This is not a good experience.
> 
> And $vendor usually has a unique workload that their changes will
> likely cause regressions in other workloads, making it even harder to
> contribute back.
> 
> > 
> > So I don't at all mind people playing around with schedulers -- they can
> > do so today, there are a ton of out of tree patches to start or learn
> > from, or like I said, it really isn't all that hard to just rip out fair
> > and write something new.
> 
> For cloud servers, I bet a lot of schedulers are not public. Although,
> my company tries to publish the schedulers they use.
> 
> > 
> > Open source, you get to do your own thing. Have at.
> > 
> > But part of what made Linux work so well, is in my opinion the GPL. GPL
> > forces people to contribute back -- to work on the shared project. And I
> > see the whole BPF thing as a run-around on that.
> > 
> > Even the large cloud vendors and service providers (Amazon, Google,
> > Facebook etc.) contribute back because of rebase pain -- as you well
> > know. The rebase pain offsets the 'TIVO hole'.
> 
> From what I understand (I don't work on production, but Chromebooks), a
> lot of changes cannot be contributed back because their updates are far
> from what is upstream. Having a plugable scheduler would actually allow
> them to contribute *more*.
> 
> > 
> > But with the BPF muck; where is the motivation to help improve things?
> 
> For the same reasons you mention about GPL and why it works.
> Collaboration. Sharing ideas helps everyone. If there's some secret
> sauce scheduler then they would likely just replace the scheduler, as
> its more performant. I don't believe it would be worth while to use BPF
> for that purpose.
> 
> > 
> > Keeping a rando github repo with BPF schedulers is not contributing.
> 
> Agreed, and I would guess having them in the Linux kernel tree would be
> more beneficial.
> 
> > That's just a repo with multiple out of tree schedulers to be ignored.
> > Who will put in the effort of upsteaming things if they can hack up a
> > BPF and throw it over the wall?
> 
> If there's a place in the Linux kernel tree, I'm sure there would be
> motivation to place it there. Having it in the kernel proper does give
> more visibility of code, and therefore enhancements to that code. This
> was the same rationale for putting perf into the kernel proper.
> 
> > 
> > So yeah, I'm very much NOT supportive of this effort. From where I'm
> > sitting there is simply not a single benefit. You're not making my life
> > better, so why would I care?
> > 
> > How does this BPF muck translate into better quality patches for me?
> 
> Here's how we will be using it (we will likely be porting sched_ext to
> ChromeOS regardless of its acceptance).
> 
> Doing testing of scheduler changes in the field is extremely time
> consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> 5.15 (as that is the kernel version we are using on the chromebooks we
> were testing on), and then we need to add a user space "switch" to
> change the scheduler. Note, this also risks causing a bug in adding
> these changes. Then we push the kernel out, and then start our
> experiment that enables our feature to a small percentage, and slowly
> increases the number of users until we have a enough for a statistical
> result.
> 
> What sched_ext would give us is a easy way to try different scheduling
> algorithms and get feedback much quicker. Once we determine a solution
> that improves things, we would then spend the time to implement it in
> the scheduler, and yes, send it upstream.
> 
> To me, sched_ext should never be the final solution, but it can be
> extremely useful in testing various changes quickly in the field. Which
> to me would encourage more contributions.

I really don't think the problems we have are because of EEVDF vs CFS vs
anything else. Other major OSes have one scheduler, but what they excel at is
providing better QoS interfaces and mechanisms to handle specific scenarios,
which Linux lacks.

The confusion I see again and again over the years comes from the fragmentation
of the Linux ecosystem: app writers don't know how to do things properly on
Linux vs other OSes. Note our CONFIG system is part of this fragmentation.

The addition of more flavours, which will inevitably lead to custom QoS specific
to each scheduler and to libraries built on top of it that require that
particular extension to be available, is a recipe for more confusion and
fragmentation. Not to mention big players are likely to take over, and I
wouldn't be surprised if new business models start to spring up on top of that.
Add to the lot the potential security issues: it would be easy to lure people
into downloading a sneaky sched extension that makes great promises but is full
of malware (more dangerous given the greater power of BPF/sudo when misused).

I really don't buy the rapid development aspect either. The scheduler was
heavily influenced by the early contributors, who came from the server market,
which had (a few) very specific workloads to optimize for and where throughput
carried more weight than latency. Fast forward to now, things are different.
Even in the server market latency/responsiveness has become more important.
Power and thermal are important on a larger class of systems now too, I'd dare
say even in the server market. How do you know when it's okay for an app/task
to consume too much power and when it is not? Hint hint, you can't unless
someone in userspace tells you. Similarly for latency vs throughput. What is
the correct way to write an application to provide this info? Then we can ask
what is missing in the scheduler to enable this.

Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
default for throughput by the way (server market bias). You can manipulate
those and get better latencies.
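
To make the above concrete, here is a minimal sketch of reading one of these
knobs from userspace. read_sched_knob() is a hypothetical helper name, and where
the knobs live is an assumption that depends on the kernel version: to my
knowledge they were sysctls under /proc/sys/kernel/ (sched_latency_ns etc.) on
older kernels and moved to debugfs under /sys/kernel/debug/sched/ later, and
EEVDF kernels replace most of them with base_slice_ns. Check your own kernel
before relying on any of these paths.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Read a single unsigned integer from a scheduler tunable file.
 *
 * The caller supplies the path; which file exists is kernel-version
 * dependent (assumption, see the text above), e.g.
 *   /sys/kernel/debug/sched/latency_ns
 *   /proc/sys/kernel/sched_latency_ns
 *
 * Returns 0 on success, or a negative errno-style value on failure.
 */
int read_sched_knob(const char *path, unsigned long long *val)
{
	FILE *f = fopen(path, "r");
	int ret = 0;

	if (!f)
		return -errno;

	/* The knob files hold one decimal number followed by a newline. */
	if (fscanf(f, "%llu", val) != 1)
		ret = -EINVAL;

	fclose(f);
	return ret;
}
```

Calling read_sched_knob("/sys/kernel/debug/sched/latency_ns", &v) would
typically need root and a mounted debugfs. Writing a new value back is the same
idea with fopen(path, "w") and fprintf().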

And this brings me to the major point: we really need to stop thinking that we
must improve everything at system level. Workloads need to evolve to get the
best out of systems, and we need new libraries for performance and power
management. And this means they need new APIs and libraries to be able to do
a better job and scale well.

I agree with Peter it is not hard to write something to make a specific workload
better. But what we really need is to enable workloads to be written better and
be more portable, to get the best out of the hardware they run on, AND coexist
with other workloads. For example, how do you write a good multi-threaded
application that can scale well across systems (including big.LITTLE) and not
trip over other workloads stealing resources sometimes? You need something like
this

	https://developer.apple.com/documentation/DISPATCH

which has a Linux port

	https://github.com/apple/swift-corelibs-libdispatch

not a new scheduler.

How do you write an app that can manage bad thermal situations?

	https://developer.android.com/games/optimize/adpf/thermal

POSIX is dormant, and every OS has to wing new interfaces to deal with the new
realities. And I don't see a lot of these discussions. Linux is lagging behind
in general in this aspect. The trend I see is "how do I make existing stuff
better", and believe me I've seen strcmp(task->comm, ...) used to hand-pick
things. I am sure we'll end up down this path if we let things loose.

So I am against any custom extension. I think it all has to be part of the
kernel tree and adhere to all of its supported interfaces, which I think are
what we really ought to focus on evolving and improving. This is the biggest
friction point IMO, not the scheduler algorithm. If the latter needs to change,
it needs to be as the result of this friction, which is where EEVDF came from
to my understanding: to make implementing a latency interface easier. But
Vincent had a working implementation with CFS too, which I think would have
worked fine by the way.

I do hope we can reconsider some of our default behaviors though (that bias to
perf and throughput specifically).

FWIW IMO the biggest issue I see in the scheduler is that it is hard to test and
debug. I think BPF can be a good fit for that. For the latter I started this
project, yet I am still trying to figure out how to add tracers for the
difficult paths, so that people can more easily report when a bad decision has
happened and provide more info about the internal state of the scheduler, in
the hope of accelerating the process of finding solutions. I think people are
getting stuck explaining why things are failing, which makes finding a common
solution hard if not impossible. We need a better way to understand the
problems people are seeing

	https://github.com/qais-yousef/sched-analyzer

A similar methodology can be used to create a BPF based sched test framework.
I don't have cycles to start this, but hope to if no one beats me to it.

I think it would be great to have a clear list of the current limitations
people see in the scheduler. It could be a failure on my end, but I haven't
seen specifics of problems and what was tried and failed to the point it is
impossible to move forward. From what I see, I am hitting bugs here and there
all the time. But they are hard to debug to truly understand where things went
wrong. Like this one for example, where PTHREAD_PRIO_INHERIT is a NOP for fair
tasks. Many thought using this flag doesn't help (rather than it being buggy).

	https://lore.kernel.org/lkml/20240403005930.1587032-1-qyousef@layalina.io/
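
For anyone not familiar with the flag in question, this is roughly how an
application asks for priority inheritance on a mutex. It is only a sketch:
init_pi_mutex() is a hypothetical helper name, while the pthread calls and the
PTHREAD_PRIO_INHERIT protocol constant are standard POSIX. It shows the
userspace request only and says nothing about what the kernel then does for
fair tasks, which is what the patch above is about.

```c
#define _GNU_SOURCE	/* make sure the protocol constants are exposed */
#include <pthread.h>

/*
 * Initialise a mutex that uses the priority-inheritance protocol.
 * Returns 0 on success, or the error number reported by the pthread call
 * that failed.
 */
int init_pi_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;

	/* Ask for the owner to inherit the priority of blocked waiters. */
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);

	/* The attr object is no longer needed once the mutex exists. */
	pthread_mutexattr_destroy(&attr);
	return ret;
}
```

After this the mutex is used with the usual pthread_mutex_lock() and
pthread_mutex_unlock(); the application code does not change otherwise, which
is part of why a silent NOP is easy to mistake for "the flag doesn't help".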
