Re: [PATCH rfc 0/6] Scheduler BPF

From: Roman Gushchin <guro@fb.com>
To: Qais Yousef <qais.yousef@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>, <bpf@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH rfc 0/6] Scheduler BPF
Date: Mon, 11 Oct 2021 11:09:35 -0700	[thread overview]
Message-ID: <YWR9339EvxX6Ld1U@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <20211011163852.s4pq45rs2j3qhdwl@e107158-lin.cambridge.arm.com>

On Mon, Oct 11, 2021 at 05:38:52PM +0100, Qais Yousef wrote:
> Hi Roman
> 
> On 10/06/21 11:50, Roman Gushchin wrote:
> > On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > > Hi Roman
> > > 
> > > On 09/16/21 09:24, Roman Gushchin wrote:
> > > > There is a long history of distro people, system administrators, and
> > > > application owners tuning the CFS settings in /proc/sys, which are now
> > > > in debugfs. Looking at what these settings actually did, it ended up
> > > > boiling down to changing the likelihood of task preemption, or
> > > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > > the latency_ns. The other settings didn't really do much for
> > > > performance.
> > > > 
> > > > In other words, some our workloads benefit by having long running tasks
> > > > preempted by tasks handling short running requests, and some workloads
> > > > that run only short term requests which benefit from never being preempted.
> > > 
> > > We had discussion about introducing latency-nice hint; but that discussion
> > > didn't end up producing any new API. Your use case seem similar to Android's;
> > > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > > which seem okay for its purpose. Having a more generic solution in mainline
> > > would be nice.
> > > 
> > > https://lwn.net/Articles/820659/ 
> > 
> > Hello Qais!
> > 
> > Thank you for the link, I like it!
> > 
> > > 
> > > > 
> > > > This leads to a few observations and ideas:
> > > > - Different workloads want different policies. Being able to configure
> > > >   the policy per workload could be useful.
> > > > - A workload that benefits from not being preempted itself could still
> > > >   benefit from preempting (low priority) background system tasks.
> > > 
> > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > > these tasks; but assuming they're background and there's idle time in the
> > > system that should be fine.
> > > 
> > > https://lwn.net/Articles/805317/ 
> > > 
> > > That of course assuming you can classify these background tasks..
> > > 
> > > If you can do the classification, you can also use cpu.shares to reduce how
> > > much cpu time they get. Or CFS bandwidth controller
> > > 
> > > https://lwn.net/Articles/844976/ 
> > 
> > The cfs cgroup controller is that it's getting quite expensive quickly with the
> > increasing depth of the cgroup tree. This is why we had to disable it for some
> > of our primary workloads.
> 
> I can understand that..
> 
> > 
> > Still being able to control latencies on per-cgroup level is one of the goals
> > of this patchset.
> > 
> > > 
> > > I like Androd's model of classifying tasks. I think we need this classification
> > > done by other non-android systems too.
> > > 
> > > > - It would be useful to quickly (and safely) experiment with different
> > > >   policies in production, without having to shut down applications or reboot
> > > >   systems, to determine what the policies for different workloads should be.
> > > 
> > > Userspace should have the knobs that allows them to tune that without reboot.
> > > If you're doing kernel development; then it's part of the job spec I'd say :-)
> > 
> > The problem here occurs because there is no comprehensive way to test any
> > scheduler change rather than run it on many machines (sometimes 1000's) running
> > different production-alike workloads.
> > 
> > If I'm able to test an idea by loading a bpf program (and btw have some sort of
> > safety guarantees: maybe the performance will be hurt, but at least no panics),
> > it can speed up the development process significantly. The alternative is way
> > more complex from the infrastructure's point of view: releasing a custom kernel,
> > test it for safety, reboot certain machines to it, pin the kernel from being
> > automatically updated etc.
> 
> This process is unavoidable IMO. Assuming you have these hooks in; as soon as
> you require a new hook you'll be forced to have a custom kernel with that new
> hook introduced. Which, in my view, no different than pushing a custom kernel
> that forces the function of interest to be noinline. Right?

I think a relatively small and stable set of hooks can cover a large percent
of potential customization ideas.

> 
> > 
> > > 
> > > I think one can still go with the workflow you suggest for development without
> > > the hooks. You'd need to un-inline the function you're interested in; then you
> > > can use kprobes to hook into it and force an early return. That should produce
> > > the same effect, no?
> > 
> > Basically it's exactly what I'm suggesting. My patchset just provides a
> > convenient way to define these hooks and some basic useful helper functions.
> 
> Convenient will be only true assuming you have a full comprehensive list of
> hooks to never require adding a new one. As I highlighted above, this
> convenience is limited to hooks that you added now.
> 
> Do people always want more hooks? Rhetorical question ;-)

Why do you think that the list of the hooks will be so large/dynamic?

I'm not saying we can figure it out from a first attempt, but I'm pretty sure
that after some initial phase it can be relatively stable, e.g. changing only
with some _major_ changes in the scheduler code.

> 
> > 
> > > 
> > > > - Only a few workloads are large and sensitive enough to merit their own
> > > >   policy tweaks. CFS by itself should be good enough for everything else,
> > > >   and we probably do not want policy tweaks to be a replacement for anything
> > > >   CFS does.
> > > > 
> > > > This leads to BPF hooks, which have been successfully used in various
> > > > kernel subsystems to provide a way for external code to (safely)
> > > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > > and the people deploying BPF scripts are already quite used to updating them
> > > > for new kernel versions.
> > > 
> > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > > problems they face and get the right set of solutions in mainline. Which would
> > > ultimately help towards the goal of having a single Generic kernel Image [1]
> > > that gives you what you'd expect out of the platform without any need for
> > > additional cherries on top.
> > 
> > Wouldn't it make your life easier had they provide a set of bpf programs instead
> > of custom patches?
> 
> Not really.
> 
> Having consistent mainline behavior is important, and these customization
> contribute to fragmentation and can throw off userspace developers who find
> they have to do extra work on some platforms to get the desired outcome. They
> will be easy to misuse. We want to see the patches and find ways to improve
> mainline kernel instead.
> 
> That said, I can see the use case of being able to micro-optimize part of the
> scheduler in a workload specific way. But then the way I see this support
> happening (DISCLAIMER, personal opinion :-))
> 
> 	1. The hooks have to be about replacing specific snippet, like Barry's
> 	   example where it's an area that is hard to find a generic solution
> 	   that doesn't have a drawback over a class of workloads.

This makes sense to me, and this is a good topic to discuss: which hooks do we
really need. I don't think it necessarily has to replace something, but I
totally agree on the "hard to find a generic solution" part.

> 
> 	2. The set of bpf programs that modify it live in the kernel tree for
> 	   each hook added. Then we can reason about why the hook is there and
> 	   allow others to reap the benefit. Beside being able to re-evaluate
> 	   easily if the users still need that hook after a potential
> 	   improvement that could render it unnecessary.
> 
> 	3. Out of tree bpf programs can only be loaded if special CONFIG option
> 	   is set so that production kernel can only load known ones that the
> 	   community knows and have reasoned about.
> 
> 	4. Out of tree bpf programs will taint the kernel. A regression
> 	   reported with something funny loaded should be flagged as
> 	   potentially bogus.

2-4 look as generic bpf questions to me, I don't think there is anything
scheduler-specific. So I'd suggest to bring bpf maintainers into the discussion,
their input can be very valuable.

> 
> IMHO this should tame the beast to something useful to address these situations
> where the change required to improve one workload will harm others and it's
> hard to come up with a good compromise. Then the hook as you suggest could help
> implement that policy specifically for that platform/workload.
> 
> One can note that the behavior I suggest is similar to how modules work :)

The important benefit of bpf is safety guarantees.

> 
> > 
> > > 
> > > So my worry is that this will open the gate for these hooks to get more than
> > > just micro-optimization done in a platform specific way. And that it will
> > > discourage having the right discussion to fix real problems in the scheduler
> > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > can control how these hooks are used.
> > 
> > I totally understand your worry. I think we need to find a right balance between
> > allowing to implement custom policies and keeping the core functionality
> > working well enough for everybody without a need to tweak anything.
> > 
> > It seems like an alternative to this "let's allow cfs customization via bpf"
> > approach is to completely move the scheduler code into userspace/bpf, something
> > that Google's ghOSt is aiming to do.
> 
> Why not ship a custom kernel instead then?

Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
Just for example, imagine a process of rebooting of a 1000000 machines running
1000's different workloads, each with their own redundancy and capacity requirements.

This what makes an ability to push scheduler changes without a reboot/kernel upgrade
so attractive.

Obviously, it's not a case when we talk about a single kernel engineer and their
laptop/dev server/vm.

> 
> > 
> > > 
> > > The question is: why can't we fix any issues in the scheduler/make it better
> > > and must have these hooks instead?
> > 
> > Of course, if it's possible to implement an idea in a form which is suitable
> > for everybody and upstream it, this is the best outcome. The problem is that
> > not every idea is like that. A bpf program can leverage a priori knowledge
> > of a workload and its needs, something the generic scheduler code lacks
> > by the definition.
> 
> Yep I see your point for certain aspects of the scheduler that are hard to tune
> universally. We just need to be careful not to end up in a wild west or Anything
> Can Happen Thursday situation :-)

Totally agree!

Thanks!