linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH rfc 0/6] Scheduler BPF
@ 2021-11-20 16:41 Hui-Chun Feng
  0 siblings, 0 replies; 17+ messages in thread
From: Hui-Chun Feng @ 2021-11-20 16:41 UTC (permalink / raw)
  To: guro; +Cc: bpf, linux-kernel, mgorman, mingo, peterz

Hi list,

Is there any reason why this patch only got a few comments so far?

IMHO, by using static keys, the patch should add barely observable overhead
to sched when no BPF progs loaded. Or is it because of the point that Yousef
mentioned earlier? I.e. sched BPF may encourage people to fix sched problems
with BPF progs instead of fixing sched directly.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2022-07-19 13:17   ` Ren Zhijie
@ 2022-07-19 23:21     ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2022-07-19 23:21 UTC (permalink / raw)
  To: Ren Zhijie; +Cc: bpf, linux-kernel, mgorman, mingo, peterz

On Tue, Jul 19, 2022 at 09:17:24PM +0800, Ren Zhijie wrote:
> Hi Roman and list,
> 
> We want to implement a programmable scheduler to meet the schedule
> requirements of different workloads.
> 
> Using BPF, we can easily deploy schedule policies for specific workloads,
> quickly verifying without modifying the kernel code. This greatly reduces
> the cost of deploying new schedule policies in the production environment.
> 
> Therefore, we want to continue to develop based on your patch. We plan to
> merge it into the openeuler open-source community and use the community to
> continuously evolve and maintain it.
> (link: https://www.openeuler.org/en/)
> 
> We made some changes to your patch:
> 1. Adapt to the openeuler-OLK-5.10 branch, which mostly base on linux
> longterm branch 5.10.
> 2. Introduce the Kconfig CONFIG_BPF_SCHED to isolate related code at compile
> time.
> 3. helpers bpf_sched_entity_to_cgrpid() and
> bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group to
> which the sched entity belongs through se->my_q->tg->css.cgroup.
> 
> We have some ideas for the next iteration of Scheduler BPF that we would
> like to share with you:
> 1.The tag field is added to struct task_struct and struct task_group. Users
> can use the file system interface to mark different tags for specific
> workloads. The bpf prog obtains the tags to detect different workloads.
> 2.Add BPF hook and helper to scheduling processes such as select_task_rq and
> pick_next_task to enable scalability.
> 
> It's a new attempt, and there's bound to be a lot of problems later, but
> it's exciting that it makes the schduler programmable.

Hi Ren!

Great to hear my work is useful and thank you for describing your plans!
I'm not actively working on it right now, but I might start again in the future.
Let me know if I can help you with this effort.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-09-16 16:24 ` Roman Gushchin
                     ` (4 preceding siblings ...)
  2022-07-19 13:05   ` Ren Zhijie
@ 2022-07-19 13:17   ` Ren Zhijie
  2022-07-19 23:21     ` Roman Gushchin
  5 siblings, 1 reply; 17+ messages in thread
From: Ren Zhijie @ 2022-07-19 13:17 UTC (permalink / raw)
  To: roman.gushchin; +Cc: bpf, linux-kernel, mgorman, mingo, peterz

Hi Roman and list,

We want to implement a programmable scheduler to meet the schedule 
requirements of different workloads.

Using BPF, we can easily deploy schedule policies for specific 
workloads, quickly verifying without modifying the kernel code. This 
greatly reduces the cost of deploying new schedule policies in the 
production environment.

Therefore, we want to continue to develop based on your patch. We plan 
to merge it into the openeuler open-source community and use the 
community to continuously evolve and maintain it.
(link: https://www.openeuler.org/en/)

We made some changes to your patch:
1. Adapt to the openeuler-OLK-5.10 branch, which mostly base on linux 
longterm branch 5.10.
2. Introduce the Kconfig CONFIG_BPF_SCHED to isolate related code at 
compile time.
3. helpers bpf_sched_entity_to_cgrpid() and 
bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group 
to which the sched entity belongs through se->my_q->tg->css.cgroup.

We have some ideas for the next iteration of Scheduler BPF that we would 
like to share with you:
1.The tag field is added to struct task_struct and struct task_group. 
Users can use the file system interface to mark different tags for 
specific workloads. The bpf prog obtains the tags to detect different 
workloads.
2.Add BPF hook and helper to scheduling processes such as select_task_rq 
and pick_next_task to enable scalability.

It's a new attempt, and there's bound to be a lot of problems later, but 
it's exciting that it makes the schduler programmable.

cheers,
Ren Zhijie



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-09-16 16:24 ` Roman Gushchin
                     ` (3 preceding siblings ...)
  2022-01-15  8:29   ` Huichun Feng
@ 2022-07-19 13:05   ` Ren Zhijie
  2022-07-19 13:17   ` Ren Zhijie
  5 siblings, 0 replies; 17+ messages in thread
From: Ren Zhijie @ 2022-07-19 13:05 UTC (permalink / raw)
  To: guro; +Cc: bpf, linux-kernel, mgorman, mingo, peterz

Hi Roman and list,

We want to implement a programmable scheduler to meet the schedule 
requirements of different workloads.

Using BPF, we can easily deploy schedule policies for specific 
workloads, quickly verifying without modifying the kernel code. This 
greatly reduces the cost of deploying new schedule policies in the 
production environment.

Therefore, we want to continue to develop based on your patch. We plan 
to merge it into the openeuler open-source community and use the 
community to continuously evolve and maintain it.
(link: https://www.openeuler.org/en/)

We made some changes to your patch:
1. Adapt to the openeuler-OLK-5.10 branch, which mostly base on linux 
longterm branch 5.10.
2. Introduce the Kconfig CONFIG_BPF_SCHED to isolate related code at 
compile time.
3. helpers bpf_sched_entity_to_cgrpid() and 
bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group 
to which the sched entity belongs through se->my_q->tg->css.cgroup.

We have some ideas for the next iteration of Scheduler BPF that we would 
like to share with you:
1.The tag field is added to struct task_struct and struct task_group. 
Users can use the file system interface to mark different tags for 
specific workloads. The bpf prog obtains the tags to detect different 
workloads.
2.Add BPF hook and helper to scheduling processes such as select_task_rq 
and pick_next_task to enable scalability.

It's a new attempt, and there's bound to be a lot of problems later, but 
it's exciting that it makes the schduler programmable.

cheers,
Ren Zhijie



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2022-01-15  8:29   ` Huichun Feng
@ 2022-01-18 22:54     ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2022-01-18 22:54 UTC (permalink / raw)
  To: Huichun Feng; +Cc: bpf, linux-kernel, mgorman, mingo, peterz

On Sat, Jan 15, 2022 at 04:29:24PM +0800, Huichun Feng wrote:
> Hi Roman and the list,

Hello Huichun!

> 
> I have a naive question regarding BPF hook for sched.
> 
> Given that BPF can also be attached to tracepoint, why do we add a BPF prog
> type specific to sched?

Tracing programs can have return values as well, see kretprobes.

> 
> The reason I can come up with is that sched BPF can have retval to drive the
> scheduling decision in static branch, whereas tracepoint is not able to do this.
> Is it mainly because of this or anything else?

Well, you are right that right now there is no strict necessity to
introduce a new prog type (aside from static branch mechanism you
mentioned), however I believe it's useful in a long run. Sched
programs might be able to use a different set of helpers, maybe there
will be some additional restrictions, etc. It's an RFC version of the
patchset and any ideas, suggestions and critic are highly welcome!

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-09-16 16:24 ` Roman Gushchin
                     ` (2 preceding siblings ...)
       [not found]   ` <52EC1E80-4C89-43AD-8A59-8ACA184EAE53@gmail.com>
@ 2022-01-15  8:29   ` Huichun Feng
  2022-01-18 22:54     ` Roman Gushchin
  2022-07-19 13:05   ` Ren Zhijie
  2022-07-19 13:17   ` Ren Zhijie
  5 siblings, 1 reply; 17+ messages in thread
From: Huichun Feng @ 2022-01-15  8:29 UTC (permalink / raw)
  To: guro; +Cc: bpf, linux-kernel, mgorman, mingo, peterz

Hi Roman and the list,

I have a naive question regarding BPF hook for sched.

Given that BPF can also be attached to tracepoint, why do we add a BPF prog
type specific to sched?

The reason I can come up with is that sched BPF can have retval to drive the
scheduling decision in static branch, whereas tracepoint is not able to do this.
Is it mainly because of this or anything else?


Thanks
Huichun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-11-25  6:00     ` Yafang Shao
@ 2021-11-26 19:46       ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-11-26 19:46 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, LKML

On Thu, Nov 25, 2021 at 02:00:04PM +0800, Yafang Shao wrote:
> Hi Roman,

Hi Yafang!

> 
> Scheduler BPF is a great idea.
> Thanks for the work.

Thanks!

> 
> Scheduler BPF won’t be a small feature,  I think we’d better give a
> summary of possible hooks it may add first.
> We must have a *basic rule* to control what it will tend to be to
> avoid adding BPF hooks here and there.
> I haven’t found a clear rule yet, but maybe we can learn it from
> netfilter, which has 5 basic hooks.
> Regarding the scheduler BPF hooks, some possible basic hooks may be:
>   - Hook for Enqueue
>   - Hook for Dequeue
>   - Hook for Put Prev Task
>    - Hook for Set Next Task

I think it depends on what we want to achieve. There are several options:
we might aim to implement the whole scheduler logic in bpf, we might aim
to do some adjustments to the existing scheduler behavior or a mix of those
approaches.

Bpf as now is now is not capable enough to implement a new scheduler class
without a substantial amount of new c code (in form of helpers, maybe custom
maps, some verifier changes etc). In particular, it's a challenging to
provide strong safety guarantees: any scheduler bpf program loaded shouldn't
crash or deadlock the system (otherwise bpf isn't any better than a kernel
module). Also performance margins are quite tight.

I'm not saying that providing such generic hooks is impossible or useless,
but it requires a lot of changes and support code and I'm not sure that we have
a good justification for them right now.

I think instead we might want to see bpf hooks as a better form of (sysctl)
tunables, which are more flexible (e.g. can be used for specific processes,
cgroups, cpus, being enabled depending on load, weather, etc) and do not create
an ABI (so are easier to maintain).

> 
> 
> > An example of an userspace part, which loads some simple hooks is available
> > here [3]. It's very simple, provided only to simplify playing with the provided
> > kernel patches.
> >
> 
> You’d better add this userspace code into samples/bpf/.

I thought samples/bpf was considered deprecated (in favor to selftests/bpf/),
but I'm gonna check with bpf maintainers. Thanks for the idea!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
       [not found]   ` <52EC1E80-4C89-43AD-8A59-8ACA184EAE53@gmail.com>
@ 2021-11-25  6:00     ` Yafang Shao
  2021-11-26 19:46       ` Roman Gushchin
  0 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2021-11-25  6:00 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, LKML

Hi Roman,

Scheduler BPF is a great idea.
Thanks for the work.

Scheduler BPF won’t be a small feature,  I think we’d better give a
summary of possible hooks it may add first.
We must have a *basic rule* to control what it will tend to be to
avoid adding BPF hooks here and there.
I haven’t found a clear rule yet, but maybe we can learn it from
netfilter, which has 5 basic hooks.
Regarding the scheduler BPF hooks, some possible basic hooks may be:
  - Hook for Enqueue
  - Hook for Dequeue
  - Hook for Put Prev Task
   - Hook for Set Next Task


> An example of an userspace part, which loads some simple hooks is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
>

You’d better add this userspace code into samples/bpf/.


[Some error occurs in my mail client, so I resend it]


--
Thanks
Yafang

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-10-11 18:09         ` Roman Gushchin
@ 2021-10-12 10:16           ` Qais Yousef
  0 siblings, 0 replies; 17+ messages in thread
From: Qais Yousef @ 2021-10-12 10:16 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, linux-kernel

On 10/11/21 11:09, Roman Gushchin wrote:
> > Convenient will be only true assuming you have a full comprehensive list of
> > hooks to never require adding a new one. As I highlighted above, this
> > convenience is limited to hooks that you added now.
> > 
> > Do people always want more hooks? Rhetorical question ;-)
> 
> Why do you think that the list of the hooks will be so large/dynamic?

It's not a fact. Just my thoughts/guess based on how things usually end up.
It's very likely this will grow. I could be wrong of course :)

> I'm not saying we can figure it out from a first attempt, but I'm pretty sure
> that after some initial phase it can be relatively stable, e.g. changing only
> with some _major_ changes in the scheduler code.

My point was that the speed up in workflow will be limited by the what's
available. It might be enough for a large use cases as you say, but at some
point there will be a new bottleneck that you might think worth experimenting
with and the chances a suitable hook is available are 50:50 in theory. So it's
not a magical fix where one would *never* have to push a custom kernel on all
these systems to experiment with some scheduler changes.

> > > > So my worry is that this will open the gate for these hooks to get more than
> > > > just micro-optimization done in a platform specific way. And that it will
> > > > discourage having the right discussion to fix real problems in the scheduler
> > > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > > can control how these hooks are used.
> > > 
> > > I totally understand your worry. I think we need to find a right balance between
> > > allowing to implement custom policies and keeping the core functionality
> > > working well enough for everybody without a need to tweak anything.
> > > 
> > > It seems like an alternative to this "let's allow cfs customization via bpf"
> > > approach is to completely move the scheduler code into userspace/bpf, something
> > > that Google's ghOSt is aiming to do.
> > 
> > Why not ship a custom kernel instead then?
> 
> Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
> Just for example, imagine a process of rebooting of a 1000000 machines running
> 1000's different workloads, each with their own redundancy and capacity requirements.
> 
> This what makes an ability to push scheduler changes without a reboot/kernel upgrade
> so attractive.
> 
> Obviously, it's not a case when we talk about a single kernel engineer and their
> laptop/dev server/vm.

I think you're still referring to ghOSt here. I thought your 2 use cases are
different as you mentioned they "completely move the scheduler code into
userspace/bpf"; but it could be just me mis-interpreting what this means. That
didn't read to me they want to micro-optimize (few) certain decisions in the
scheduler, rather replace it altogether, hence my question.

Anyway. My 2cents here is that we should be careful not to introduce something
that encourages out-of-tree workarounds for real scheduler problems nor have it
done in a way where we lose visibility over how these hooks are used and being
able to share it with others who could benefit from the same mico-optimization
too.

Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-10-11 16:38       ` Qais Yousef
@ 2021-10-11 18:09         ` Roman Gushchin
  2021-10-12 10:16           ` Qais Yousef
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-10-11 18:09 UTC (permalink / raw)
  To: Qais Yousef; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, linux-kernel

On Mon, Oct 11, 2021 at 05:38:52PM +0100, Qais Yousef wrote:
> Hi Roman
> 
> On 10/06/21 11:50, Roman Gushchin wrote:
> > On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > > Hi Roman
> > > 
> > > On 09/16/21 09:24, Roman Gushchin wrote:
> > > > There is a long history of distro people, system administrators, and
> > > > application owners tuning the CFS settings in /proc/sys, which are now
> > > > in debugfs. Looking at what these settings actually did, it ended up
> > > > boiling down to changing the likelihood of task preemption, or
> > > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > > the latency_ns. The other settings didn't really do much for
> > > > performance.
> > > > 
> > > > In other words, some our workloads benefit by having long running tasks
> > > > preempted by tasks handling short running requests, and some workloads
> > > > that run only short term requests which benefit from never being preempted.
> > > 
> > > We had discussion about introducing latency-nice hint; but that discussion
> > > didn't end up producing any new API. Your use case seem similar to Android's;
> > > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > > which seem okay for its purpose. Having a more generic solution in mainline
> > > would be nice.
> > > 
> > > https://lwn.net/Articles/820659/ 
> > 
> > Hello Qais!
> > 
> > Thank you for the link, I like it!
> > 
> > > 
> > > > 
> > > > This leads to a few observations and ideas:
> > > > - Different workloads want different policies. Being able to configure
> > > >   the policy per workload could be useful.
> > > > - A workload that benefits from not being preempted itself could still
> > > >   benefit from preempting (low priority) background system tasks.
> > > 
> > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > > these tasks; but assuming they're background and there's idle time in the
> > > system that should be fine.
> > > 
> > > https://lwn.net/Articles/805317/ 
> > > 
> > > That of course assuming you can classify these background tasks..
> > > 
> > > If you can do the classification, you can also use cpu.shares to reduce how
> > > much cpu time they get. Or CFS bandwidth controller
> > > 
> > > https://lwn.net/Articles/844976/ 
> > 
> > The cfs cgroup controller is that it's getting quite expensive quickly with the
> > increasing depth of the cgroup tree. This is why we had to disable it for some
> > of our primary workloads.
> 
> I can understand that..
> 
> > 
> > Still being able to control latencies on per-cgroup level is one of the goals
> > of this patchset.
> > 
> > > 
> > > I like Androd's model of classifying tasks. I think we need this classification
> > > done by other non-android systems too.
> > > 
> > > > - It would be useful to quickly (and safely) experiment with different
> > > >   policies in production, without having to shut down applications or reboot
> > > >   systems, to determine what the policies for different workloads should be.
> > > 
> > > Userspace should have the knobs that allows them to tune that without reboot.
> > > If you're doing kernel development; then it's part of the job spec I'd say :-)
> > 
> > The problem here occurs because there is no comprehensive way to test any
> > scheduler change rather than run it on many machines (sometimes 1000's) running
> > different production-alike workloads.
> > 
> > If I'm able to test an idea by loading a bpf program (and btw have some sort of
> > safety guarantees: maybe the performance will be hurt, but at least no panics),
> > it can speed up the development process significantly. The alternative is way
> > more complex from the infrastructure's point of view: releasing a custom kernel,
> > test it for safety, reboot certain machines to it, pin the kernel from being
> > automatically updated etc.
> 
> This process is unavoidable IMO. Assuming you have these hooks in; as soon as
> you require a new hook you'll be forced to have a custom kernel with that new
> hook introduced. Which, in my view, no different than pushing a custom kernel
> that forces the function of interest to be noinline. Right?

I think a relatively small and stable set of hooks can cover a large percent
of potential customization ideas.

> 
> > 
> > > 
> > > I think one can still go with the workflow you suggest for development without
> > > the hooks. You'd need to un-inline the function you're interested in; then you
> > > can use kprobes to hook into it and force an early return. That should produce
> > > the same effect, no?
> > 
> > Basically it's exactly what I'm suggesting. My patchset just provides a
> > convenient way to define these hooks and some basic useful helper functions.
> 
> Convenient will be only true assuming you have a full comprehensive list of
> hooks to never require adding a new one. As I highlighted above, this
> convenience is limited to hooks that you added now.
> 
> Do people always want more hooks? Rhetorical question ;-)

Why do you think that the list of the hooks will be so large/dynamic?

I'm not saying we can figure it out from a first attempt, but I'm pretty sure
that after some initial phase it can be relatively stable, e.g. changing only
with some _major_ changes in the scheduler code.

> 
> > 
> > > 
> > > > - Only a few workloads are large and sensitive enough to merit their own
> > > >   policy tweaks. CFS by itself should be good enough for everything else,
> > > >   and we probably do not want policy tweaks to be a replacement for anything
> > > >   CFS does.
> > > > 
> > > > This leads to BPF hooks, which have been successfully used in various
> > > > kernel subsystems to provide a way for external code to (safely)
> > > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > > and the people deploying BPF scripts are already quite used to updating them
> > > > for new kernel versions.
> > > 
> > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > > problems they face and get the right set of solutions in mainline. Which would
> > > ultimately help towards the goal of having a single Generic kernel Image [1]
> > > that gives you what you'd expect out of the platform without any need for
> > > additional cherries on top.
> > 
> > Wouldn't it make your life easier had they provide a set of bpf programs instead
> > of custom patches?
> 
> Not really.
> 
> Having consistent mainline behavior is important, and these customization
> contribute to fragmentation and can throw off userspace developers who find
> they have to do extra work on some platforms to get the desired outcome. They
> will be easy to misuse. We want to see the patches and find ways to improve
> mainline kernel instead.
> 
> That said, I can see the use case of being able to micro-optimize part of the
> scheduler in a workload specific way. But then the way I see this support
> happening (DISCLAIMER, personal opinion :-))
> 
> 	1. The hooks have to be about replacing specific snippet, like Barry's
> 	   example where it's an area that is hard to find a generic solution
> 	   that doesn't have a drawback over a class of workloads.

This makes sense to me, and this is a good topic to discuss: which hooks do we
really need. I don't think it necessarily has to replace something, but I
totally agree on the "hard to find a generic solution" part.

> 
> 	2. The set of bpf programs that modify it live in the kernel tree for
> 	   each hook added. Then we can reason about why the hook is there and
> 	   allow others to reap the benefit. Beside being able to re-evaluate
> 	   easily if the users still need that hook after a potential
> 	   improvement that could render it unnecessary.
> 
> 	3. Out of tree bpf programs can only be loaded if special CONFIG option
> 	   is set so that production kernel can only load known ones that the
> 	   community knows and have reasoned about.
> 
> 	4. Out of tree bpf programs will taint the kernel. A regression
> 	   reported with something funny loaded should be flagged as
> 	   potentially bogus.

2-4 look as generic bpf questions to me, I don't think there is anything
scheduler-specific. So I'd suggest to bring bpf maintainers into the discussion,
their input can be very valuable.

> 
> IMHO this should tame the beast to something useful to address these situations
> where the change required to improve one workload will harm others and it's
> hard to come up with a good compromise. Then the hook as you suggest could help
> implement that policy specifically for that platform/workload.
> 
> One can note that the behavior I suggest is similar to how modules work :)

The important benefit of bpf is safety guarantees.

> 
> > 
> > > 
> > > So my worry is that this will open the gate for these hooks to get more than
> > > just micro-optimization done in a platform specific way. And that it will
> > > discourage having the right discussion to fix real problems in the scheduler
> > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > can control how these hooks are used.
> > 
> > I totally understand your worry. I think we need to find a right balance between
> > allowing to implement custom policies and keeping the core functionality
> > working well enough for everybody without a need to tweak anything.
> > 
> > It seems like an alternative to this "let's allow cfs customization via bpf"
> > approach is to completely move the scheduler code into userspace/bpf, something
> > that Google's ghOSt is aiming to do.
> 
> Why not ship a custom kernel instead then?

Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
Just for example, imagine a process of rebooting of a 1000000 machines running
1000's different workloads, each with their own redundancy and capacity requirements.

This what makes an ability to push scheduler changes without a reboot/kernel upgrade
so attractive.

Obviously, it's not a case when we talk about a single kernel engineer and their
laptop/dev server/vm.

> 
> > 
> > > 
> > > The question is: why can't we fix any issues in the scheduler/make it better
> > > and must have these hooks instead?
> > 
> > Of course, if it's possible to implement an idea in a form which is suitable
> > for everybody and upstream it, this is the best outcome. The problem is that
> > not every idea is like that. A bpf program can leverage a priori knowledge
> > of a workload and its needs, something the generic scheduler code lacks
> > by the definition.
> 
> Yep I see your point for certain aspects of the scheduler that are hard to tune
> universally. We just need to be careful not to end up in a wild west or Anything
> Can Happen Thursday situation :-)

Totally agree!

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-10-06 18:50     ` Roman Gushchin
@ 2021-10-11 16:38       ` Qais Yousef
  2021-10-11 18:09         ` Roman Gushchin
  0 siblings, 1 reply; 17+ messages in thread
From: Qais Yousef @ 2021-10-11 16:38 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, linux-kernel

Hi Roman

On 10/06/21 11:50, Roman Gushchin wrote:
> On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > Hi Roman
> > 
> > On 09/16/21 09:24, Roman Gushchin wrote:
> > > There is a long history of distro people, system administrators, and
> > > application owners tuning the CFS settings in /proc/sys, which are now
> > > in debugfs. Looking at what these settings actually did, it ended up
> > > boiling down to changing the likelihood of task preemption, or
> > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > the latency_ns. The other settings didn't really do much for
> > > performance.
> > > 
> > > In other words, some our workloads benefit by having long running tasks
> > > preempted by tasks handling short running requests, and some workloads
> > > that run only short term requests which benefit from never being preempted.
> > 
> > We had discussion about introducing latency-nice hint; but that discussion
> > didn't end up producing any new API. Your use case seem similar to Android's;
> > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > which seem okay for its purpose. Having a more generic solution in mainline
> > would be nice.
> > 
> > https://lwn.net/Articles/820659/
> 
> Hello Qais!
> 
> Thank you for the link, I like it!
> 
> > 
> > > 
> > > This leads to a few observations and ideas:
> > > - Different workloads want different policies. Being able to configure
> > >   the policy per workload could be useful.
> > > - A workload that benefits from not being preempted itself could still
> > >   benefit from preempting (low priority) background system tasks.
> > 
> > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > these tasks; but assuming they're background and there's idle time in the
> > system that should be fine.
> > 
> > https://lwn.net/Articles/805317/
> > 
> > That of course assuming you can classify these background tasks..
> > 
> > If you can do the classification, you can also use cpu.shares to reduce how
> > much cpu time they get. Or CFS bandwidth controller
> > 
> > https://lwn.net/Articles/844976/
> 
> The cfs cgroup controller is that it's getting quite expensive quickly with the
> increasing depth of the cgroup tree. This is why we had to disable it for some
> of our primary workloads.

I can understand that..

> 
> Still being able to control latencies on per-cgroup level is one of the goals
> of this patchset.
> 
> > 
> > I like Androd's model of classifying tasks. I think we need this classification
> > done by other non-android systems too.
> > 
> > > - It would be useful to quickly (and safely) experiment with different
> > >   policies in production, without having to shut down applications or reboot
> > >   systems, to determine what the policies for different workloads should be.
> > 
> > Userspace should have the knobs that allows them to tune that without reboot.
> > If you're doing kernel development; then it's part of the job spec I'd say :-)
> 
> The problem here occurs because there is no comprehensive way to test any
> scheduler change rather than run it on many machines (sometimes 1000's) running
> different production-alike workloads.
> 
> If I'm able to test an idea by loading a bpf program (and btw have some sort of
> safety guarantees: maybe the performance will be hurt, but at least no panics),
> it can speed up the development process significantly. The alternative is way
> more complex from the infrastructure's point of view: releasing a custom kernel,
> test it for safety, reboot certain machines to it, pin the kernel from being
> automatically updated etc.

This process is unavoidable IMO. Assuming you have these hooks in; as soon as
you require a new hook you'll be forced to have a custom kernel with that new
hook introduced. Which, in my view, no different than pushing a custom kernel
that forces the function of interest to be noinline. Right?

> 
> > 
> > I think one can still go with the workflow you suggest for development without
> > the hooks. You'd need to un-inline the function you're interested in; then you
> > can use kprobes to hook into it and force an early return. That should produce
> > the same effect, no?
> 
> Basically it's exactly what I'm suggesting. My patchset just provides a
> convenient way to define these hooks and some basic useful helper functions.

Convenient will be only true assuming you have a full comprehensive list of
hooks to never require adding a new one. As I highlighted above, this
convenience is limited to hooks that you added now.

Do people always want more hooks? Rhetorical question ;-)

> 
> > 
> > > - Only a few workloads are large and sensitive enough to merit their own
> > >   policy tweaks. CFS by itself should be good enough for everything else,
> > >   and we probably do not want policy tweaks to be a replacement for anything
> > >   CFS does.
> > > 
> > > This leads to BPF hooks, which have been successfully used in various
> > > kernel subsystems to provide a way for external code to (safely)
> > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > and the people deploying BPF scripts are already quite used to updating them
> > > for new kernel versions.
> > 
> > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > problems they face and get the right set of solutions in mainline. Which would
> > ultimately help towards the goal of having a single Generic kernel Image [1]
> > that gives you what you'd expect out of the platform without any need for
> > additional cherries on top.
> 
> Wouldn't it make your life easier had they provide a set of bpf programs instead
> of custom patches?

Not really.

Having consistent mainline behavior is important, and these customization
contribute to fragmentation and can throw off userspace developers who find
they have to do extra work on some platforms to get the desired outcome. They
will be easy to misuse. We want to see the patches and find ways to improve
mainline kernel instead.

That said, I can see the use case of being able to micro-optimize part of the
scheduler in a workload specific way. But then the way I see this support
happening (DISCLAIMER, personal opinion :-))

	1. The hooks have to be about replacing specific snippet, like Barry's
	   example where it's an area that is hard to find a generic solution
	   that doesn't have a drawback over a class of workloads.

	2. The set of bpf programs that modify it live in the kernel tree for
	   each hook added. Then we can reason about why the hook is there and
	   allow others to reap the benefit. Beside being able to re-evaluate
	   easily if the users still need that hook after a potential
	   improvement that could render it unnecessary.

	3. Out of tree bpf programs can only be loaded if special CONFIG option
	   is set so that production kernel can only load known ones that the
	   community knows and have reasoned about.

	4. Out of tree bpf programs will taint the kernel. A regression
	   reported with something funny loaded should be flagged as
	   potentially bogus.

IMHO this should tame the beast to something useful to address these situations
where the change required to improve one workload will harm others and it's
hard to come up with a good compromise. Then the hook as you suggest could help
implement that policy specifically for that platform/workload.

One can note that the behavior I suggest is similar to how modules work :)

> 
> > 
> > So my worry is that this will open the gate for these hooks to get more than
> > just micro-optimization done in a platform specific way. And that it will
> > discourage having the right discussion to fix real problems in the scheduler
> > because the easy path is to do whatever you want in userspace. I am not sure we
> > can control how these hooks are used.
> 
> I totally understand your worry. I think we need to find a right balance between
> allowing to implement custom policies and keeping the core functionality
> working well enough for everybody without a need to tweak anything.
> 
> It seems like an alternative to this "let's allow cfs customization via bpf"
> approach is to completely move the scheduler code into userspace/bpf, something
> that Google's ghOSt is aiming to do.

Why not ship a custom kernel instead then?

> 
> > 
> > The question is: why can't we fix any issues in the scheduler/make it better
> > and must have these hooks instead?
> 
> Of course, if it's possible to implement an idea in a form which is suitable
> for everybody and upstream it, this is the best outcome. The problem is that
> not every idea is like that. A bpf program can leverage a priori knowledge
> of a workload and its needs, something the generic scheduler code lacks
> by the definition.

Yep I see your point for certain aspects of the scheduler that are hard to tune
universally. We just need to be careful not to end up in a wild west or Anything
Can Happen Thursday situation :-)

Maybe the maintainers have a different opinion though.

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-10-06 16:39   ` Qais Yousef
@ 2021-10-06 18:50     ` Roman Gushchin
  2021-10-11 16:38       ` Qais Yousef
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-10-06 18:50 UTC (permalink / raw)
  To: Qais Yousef; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, linux-kernel

On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> Hi Roman
> 
> On 09/16/21 09:24, Roman Gushchin wrote:
> > There is a long history of distro people, system administrators, and
> > application owners tuning the CFS settings in /proc/sys, which are now
> > in debugfs. Looking at what these settings actually did, it ended up
> > boiling down to changing the likelihood of task preemption, or
> > disabling it by setting the wakeup_granularity_ns to more than half of
> > the latency_ns. The other settings didn't really do much for
> > performance.
> > 
> > In other words, some our workloads benefit by having long running tasks
> > preempted by tasks handling short running requests, and some workloads
> > that run only short term requests which benefit from never being preempted.
> 
> We had discussion about introducing latency-nice hint; but that discussion
> didn't end up producing any new API. Your use case seem similar to Android's;
> we want some tasks to run ASAP. There's an out of tree patch that puts these
> tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> which seem okay for its purpose. Having a more generic solution in mainline
> would be nice.
> 
> https://lwn.net/Articles/820659/

Hello Qais!

Thank you for the link, I like it!

> 
> > 
> > This leads to a few observations and ideas:
> > - Different workloads want different policies. Being able to configure
> >   the policy per workload could be useful.
> > - A workload that benefits from not being preempted itself could still
> >   benefit from preempting (low priority) background system tasks.
> 
> You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> these tasks; but assuming they're background and there's idle time in the
> system that should be fine.
> 
> https://lwn.net/Articles/805317/
> 
> That of course assuming you can classify these background tasks..
> 
> If you can do the classification, you can also use cpu.shares to reduce how
> much cpu time they get. Or CFS bandwidth controller
> 
> https://lwn.net/Articles/844976/

The cfs cgroup controller is that it's getting quite expensive quickly with the
increasing depth of the cgroup tree. This is why we had to disable it for some
of our primary workloads.

Still being able to control latencies on per-cgroup level is one of the goals
of this patchset.

> 
> I like Androd's model of classifying tasks. I think we need this classification
> done by other non-android systems too.
> 
> > - It would be useful to quickly (and safely) experiment with different
> >   policies in production, without having to shut down applications or reboot
> >   systems, to determine what the policies for different workloads should be.
> 
> Userspace should have the knobs that allows them to tune that without reboot.
> If you're doing kernel development; then it's part of the job spec I'd say :-)

The problem here occurs because there is no comprehensive way to test any
scheduler change rather than run it on many machines (sometimes 1000's) running
different production-alike workloads.

If I'm able to test an idea by loading a bpf program (and btw have some sort of
safety guarantees: maybe the performance will be hurt, but at least no panics),
it can speed up the development process significantly. The alternative is way
more complex from the infrastructure's point of view: releasing a custom kernel,
test it for safety, reboot certain machines to it, pin the kernel from being
automatically updated etc.

> 
> I think one can still go with the workflow you suggest for development without
> the hooks. You'd need to un-inline the function you're interested in; then you
> can use kprobes to hook into it and force an early return. That should produce
> the same effect, no?

Basically it's exactly what I'm suggesting. My patchset just provides a
convenient way to define these hooks and some basic useful helper functions.

> 
> > - Only a few workloads are large and sensitive enough to merit their own
> >   policy tweaks. CFS by itself should be good enough for everything else,
> >   and we probably do not want policy tweaks to be a replacement for anything
> >   CFS does.
> > 
> > This leads to BPF hooks, which have been successfully used in various
> > kernel subsystems to provide a way for external code to (safely)
> > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > and the people deploying BPF scripts are already quite used to updating them
> > for new kernel versions.
> 
> I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> gets heavily modified by vendors and OEMs. We try very hard to understand the
> problems they face and get the right set of solutions in mainline. Which would
> ultimately help towards the goal of having a single Generic kernel Image [1]
> that gives you what you'd expect out of the platform without any need for
> additional cherries on top.

Wouldn't it make your life easier had they provide a set of bpf programs instead
of custom patches?

> 
> So my worry is that this will open the gate for these hooks to get more than
> just micro-optimization done in a platform specific way. And that it will
> discourage having the right discussion to fix real problems in the scheduler
> because the easy path is to do whatever you want in userspace. I am not sure we
> can control how these hooks are used.

I totally understand your worry. I think we need to find a right balance between
allowing to implement custom policies and keeping the core functionality
working well enough for everybody without a need to tweak anything.

It seems like an alternative to this "let's allow cfs customization via bpf"
approach is to completely move the scheduler code into userspace/bpf, something
that Google's ghOSt is aiming to do.

> 
> The question is: why can't we fix any issues in the scheduler/make it better
> and must have these hooks instead?

Of course, if it's possible to implement an idea in a form which is suitable
for everybody and upstream it, this is the best outcome. The problem is that
not every idea is like that. A bpf program can leverage a priori knowledge
of a workload and its needs, something the generic scheduler code lacks
by the definition.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-09-16 16:24 ` Roman Gushchin
  2021-09-16 16:36   ` Roman Gushchin
@ 2021-10-06 16:39   ` Qais Yousef
  2021-10-06 18:50     ` Roman Gushchin
       [not found]   ` <52EC1E80-4C89-43AD-8A59-8ACA184EAE53@gmail.com>
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Qais Yousef @ 2021-10-06 16:39 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, bpf, linux-kernel

Hi Roman

On 09/16/21 09:24, Roman Gushchin wrote:
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
> 
> In other words, some our workloads benefit by having long running tasks
> preempted by tasks handling short running requests, and some workloads
> that run only short term requests which benefit from never being preempted.

We had discussion about introducing latency-nice hint; but that discussion
didn't end up producing any new API. Your use case seem similar to Android's;
we want some tasks to run ASAP. There's an out of tree patch that puts these
tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
which seem okay for its purpose. Having a more generic solution in mainline
would be nice.

https://lwn.net/Articles/820659/

> 
> This leads to a few observations and ideas:
> - Different workloads want different policies. Being able to configure
>   the policy per workload could be useful.
> - A workload that benefits from not being preempted itself could still
>   benefit from preempting (low priority) background system tasks.

You can put these tasks as SCHED_IDLE. There's a potential danger of starving
these tasks; but assuming they're background and there's idle time in the
system that should be fine.

https://lwn.net/Articles/805317/

That of course assuming you can classify these background tasks..

If you can do the classification, you can also use cpu.shares to reduce how
much cpu time they get. Or CFS bandwidth controller

https://lwn.net/Articles/844976/

I like Androd's model of classifying tasks. I think we need this classification
done by other non-android systems too.

> - It would be useful to quickly (and safely) experiment with different
>   policies in production, without having to shut down applications or reboot
>   systems, to determine what the policies for different workloads should be.

Userspace should have the knobs that allows them to tune that without reboot.
If you're doing kernel development; then it's part of the job spec I'd say :-)

I think one can still go with the workflow you suggest for development without
the hooks. You'd need to un-inline the function you're interested in; then you
can use kprobes to hook into it and force an early return. That should produce
the same effect, no?

> - Only a few workloads are large and sensitive enough to merit their own
>   policy tweaks. CFS by itself should be good enough for everything else,
>   and we probably do not want policy tweaks to be a replacement for anything
>   CFS does.
> 
> This leads to BPF hooks, which have been successfully used in various
> kernel subsystems to provide a way for external code to (safely)
> change a few kernel decisions. BPF tooling makes this pretty easy to do,
> and the people deploying BPF scripts are already quite used to updating them
> for new kernel versions.

I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
gets heavily modified by vendors and OEMs. We try very hard to understand the
problems they face and get the right set of solutions in mainline. Which would
ultimately help towards the goal of having a single Generic kernel Image [1]
that gives you what you'd expect out of the platform without any need for
additional cherries on top.

So my worry is that this will open the gate for these hooks to get more than
just micro-optimization done in a platform specific way. And that it will
discourage having the right discussion to fix real problems in the scheduler
because the easy path is to do whatever you want in userspace. I am not sure we
can control how these hooks are used.

The question is: why can't we fix any issues in the scheduler/make it better
and must have these hooks instead?

[1] https://arstechnica.com/gadgets/2021/09/android-to-take-an-upstream-first-development-model-for-the-linux-kernel/

Thanks

--
Qais Yousef

> 
> This patchset aims to start a discussion about potential applications of BPF
> to the scheduler. It also aims to land some very basic BPF infrastructure
> necessary to add new BPF hooks to the scheduler, a minimal set of useful
> helpers, corresponding libbpf changes, etc.
> 
> Our very first experiments with using BPF in CFS look very promising. We're
> at a very early stage, however already have seen a nice latency and ~1% RPS
> wins for our (Facebook's) main web workload.
> 
> As I know, Google is working on a more radical approach [2]: they aim to move
> the scheduling code into userspace. It seems that their core motivation is
> somewhat similar: to make the scheduler changes easier to develop, validate
> and deploy. Even though their approach is different, they also use BPF for
> speeding up some hot paths. I think the suggested infrastructure can serve
> their purpose too.
> 
> An example of an userspace part, which loads some simple hooks is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
> 
> 
> [1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
> [2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
> [3] https://github.com/rgushchin/atc
> 
> 
> Roman Gushchin (6):
>   bpf: sched: basic infrastructure for scheduler bpf
>   bpf: sched: add convenient helpers to identify sched entities
>   bpf: sched: introduce bpf_sched_enable()
>   sched: cfs: add bpf hooks to control wakeup and tick preemption
>   libbpf: add support for scheduler bpf programs
>   bpftool: recognize scheduler programs
> 
>  include/linux/bpf_sched.h       |  53 ++++++++++++
>  include/linux/bpf_types.h       |   3 +
>  include/linux/sched_hook_defs.h |   4 +
>  include/uapi/linux/bpf.h        |  25 ++++++
>  kernel/bpf/btf.c                |   1 +
>  kernel/bpf/syscall.c            |  21 ++++-
>  kernel/bpf/trampoline.c         |   1 +
>  kernel/bpf/verifier.c           |   9 ++-
>  kernel/sched/Makefile           |   1 +
>  kernel/sched/bpf_sched.c        | 138 ++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c             |  27 +++++++
>  scripts/bpf_doc.py              |   2 +
>  tools/bpf/bpftool/common.c      |   1 +
>  tools/bpf/bpftool/prog.c        |   1 +
>  tools/include/uapi/linux/bpf.h  |  25 ++++++
>  tools/lib/bpf/libbpf.c          |  27 ++++++-
>  tools/lib/bpf/libbpf.h          |   4 +
>  tools/lib/bpf/libbpf.map        |   3 +
>  18 files changed, 341 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/bpf_sched.h
>  create mode 100644 include/linux/sched_hook_defs.h
>  create mode 100644 kernel/sched/bpf_sched.c
> 
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-09-16 16:24 ` Roman Gushchin
@ 2021-09-16 16:36   ` Roman Gushchin
  2021-10-06 16:39   ` Qais Yousef
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-09-16 16:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Rik van Riel, Mel Gorman, Vincent Guittot, Valentin Schneider,
	Will Deacon, Dietmar Eggemann, Hao Luo, Barret Rhoden, bpf,
	linux-kernel

Hello!

I'm sorry, somehow the patchset didn't reach mailing lists and at least
some recepients yesterday (I'm digging into why).

Resending with a fewer people in the cc list, which probably was the reason.

Thanks!

On Thu, Sep 16, 2021 at 09:24:45AM -0700, Roman Gushchin wrote:
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH rfc 0/6] Scheduler BPF
       [not found] <20210915213550.3696532-1-guro@fb.com>
  2021-09-16  0:19 ` Hao Luo
@ 2021-09-16 16:24 ` Roman Gushchin
  2021-09-16 16:36   ` Roman Gushchin
                     ` (5 more replies)
  1 sibling, 6 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-09-16 16:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: Mel Gorman, bpf, linux-kernel, Roman Gushchin

There is a long history of distro people, system administrators, and
application owners tuning the CFS settings in /proc/sys, which are now
in debugfs. Looking at what these settings actually did, it ended up
boiling down to changing the likelihood of task preemption, or
disabling it by setting the wakeup_granularity_ns to more than half of
the latency_ns. The other settings didn't really do much for
performance.

In other words, some our workloads benefit by having long running tasks
preempted by tasks handling short running requests, and some workloads
that run only short term requests which benefit from never being preempted.

This leads to a few observations and ideas:
- Different workloads want different policies. Being able to configure
  the policy per workload could be useful.
- A workload that benefits from not being preempted itself could still
  benefit from preempting (low priority) background system tasks.
- It would be useful to quickly (and safely) experiment with different
  policies in production, without having to shut down applications or reboot
  systems, to determine what the policies for different workloads should be.
- Only a few workloads are large and sensitive enough to merit their own
  policy tweaks. CFS by itself should be good enough for everything else,
  and we probably do not want policy tweaks to be a replacement for anything
  CFS does.

This leads to BPF hooks, which have been successfully used in various
kernel subsystems to provide a way for external code to (safely)
change a few kernel decisions. BPF tooling makes this pretty easy to do,
and the people deploying BPF scripts are already quite used to updating them
for new kernel versions.

This patchset aims to start a discussion about potential applications of BPF
to the scheduler. It also aims to land some very basic BPF infrastructure
necessary to add new BPF hooks to the scheduler, a minimal set of useful
helpers, corresponding libbpf changes, etc.

Our very first experiments with using BPF in CFS look very promising. We're
at a very early stage, however already have seen a nice latency and ~1% RPS
wins for our (Facebook's) main web workload.

As I know, Google is working on a more radical approach [2]: they aim to move
the scheduling code into userspace. It seems that their core motivation is
somewhat similar: to make the scheduler changes easier to develop, validate
and deploy. Even though their approach is different, they also use BPF for
speeding up some hot paths. I think the suggested infrastructure can serve
their purpose too.

An example of an userspace part, which loads some simple hooks is available
here [3]. It's very simple, provided only to simplify playing with the provided
kernel patches.


[1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
[2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
[3] https://github.com/rgushchin/atc


Roman Gushchin (6):
  bpf: sched: basic infrastructure for scheduler bpf
  bpf: sched: add convenient helpers to identify sched entities
  bpf: sched: introduce bpf_sched_enable()
  sched: cfs: add bpf hooks to control wakeup and tick preemption
  libbpf: add support for scheduler bpf programs
  bpftool: recognize scheduler programs

 include/linux/bpf_sched.h       |  53 ++++++++++++
 include/linux/bpf_types.h       |   3 +
 include/linux/sched_hook_defs.h |   4 +
 include/uapi/linux/bpf.h        |  25 ++++++
 kernel/bpf/btf.c                |   1 +
 kernel/bpf/syscall.c            |  21 ++++-
 kernel/bpf/trampoline.c         |   1 +
 kernel/bpf/verifier.c           |   9 ++-
 kernel/sched/Makefile           |   1 +
 kernel/sched/bpf_sched.c        | 138 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c             |  27 +++++++
 scripts/bpf_doc.py              |   2 +
 tools/bpf/bpftool/common.c      |   1 +
 tools/bpf/bpftool/prog.c        |   1 +
 tools/include/uapi/linux/bpf.h  |  25 ++++++
 tools/lib/bpf/libbpf.c          |  27 ++++++-
 tools/lib/bpf/libbpf.h          |   4 +
 tools/lib/bpf/libbpf.map        |   3 +
 18 files changed, 341 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/bpf_sched.h
 create mode 100644 include/linux/sched_hook_defs.h
 create mode 100644 kernel/sched/bpf_sched.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
  2021-09-16  0:19 ` Hao Luo
@ 2021-09-16  1:42   ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-09-16  1:42 UTC (permalink / raw)
  To: Hao Luo
  Cc: Peter Zijlstra, Ingo Molnar, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Rik van Riel, Mel Gorman, Vincent Guittot,
	Valentin Schneider, Will Deacon, Dietmar Eggemann, Barret Rhoden,
	bpf, linux-kernel

On Wed, Sep 15, 2021 at 05:19:03PM -0700, Hao Luo wrote:
> Hi Roman,
> 
> On Wed, Sep 15, 2021 at 3:04 PM Roman Gushchin <guro@fb.com> wrote:
> >

Hi Hao!

> 
> Thanks for initiating the effort of bringing BPF to sched. I've been
> looking at the potential applications of BPF in sched for some time
> and I'm very excited about this work!
> 
> My current focus has been using BPF for profiling performance and
> exporting sched related stats. I think BPF can provide a great help
> there. We have many users in Google that want the kernel to export
> various scheduling metrics to userspace. I think BPF is a good fit for
> such a task. So one of my recent attempts is to use BPF to account for
> the forced idle time caused by core scheduling [1]. This is one of the
> topics I want to discuss in my upcoming LPC BPF talk [2].

I guess for profiling we don't necessarily need a dedicated program type
etc, but it might be convenient, and some helpers can be useful too.

Unfortunately I won't be able to attend your talk, but hopefully I can
see it in a record later. I'm very interested.

> 
> Looking forward, I agree that BPF has a great potential in customizing
> policies in the scheduler. It has the advantage of quick
> experimentation and deployment. One of the use cases I'm thinking of
> is to customize load balancing policies. For example, allow using BPF
> to influence whether a task can migrate (can_migrate_task). This is
> currently only an idea.
> 
> > Our very first experiments with using BPF in CFS look very promising. We're
> > at a very early stage, however already have seen a nice latency and ~1% RPS
> > wins for our (Facebook's) main web workload.
> >
> > As I know, Google is working on a more radical approach [2]: they aim to move
> > the scheduling code into userspace. It seems that their core motivation is
> > somewhat similar: to make the scheduler changes easier to develop, validate
> > and deploy. Even though their approach is different, they also use BPF for
> > speeding up some hot paths. I think the suggested infrastructure can serve
> > their purpose too.
> 
> Yes. Barret can talk more about this, but I think it summarized the
> work of ghOSt [3] and the use of BPF in ghOSt well.

I took a brief look over how you use BPF in ghOSt and I think what I suggest
will work for you as well. I'd appreciate any comments/feedback whether it's
definitely true.

Thank you!

Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc 0/6] Scheduler BPF
       [not found] <20210915213550.3696532-1-guro@fb.com>
@ 2021-09-16  0:19 ` Hao Luo
  2021-09-16  1:42   ` Roman Gushchin
  2021-09-16 16:24 ` Roman Gushchin
  1 sibling, 1 reply; 17+ messages in thread
From: Hao Luo @ 2021-09-16  0:19 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Peter Zijlstra, Ingo Molnar, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Rik van Riel, Mel Gorman, Vincent Guittot,
	Valentin Schneider, Will Deacon, Dietmar Eggemann, Barret Rhoden,
	bpf, linux-kernel

Hi Roman,

On Wed, Sep 15, 2021 at 3:04 PM Roman Gushchin <guro@fb.com> wrote:
>
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
>
> In other words, some our workloads benefit by having long running tasks
> preempted by tasks handling short running requests, and some workloads
> that run only short term requests which benefit from never being preempted.
>
> This leads to a few observations and ideas:
> - Different workloads want different policies. Being able to configure
>   the policy per workload could be useful.
> - A workload that benefits from not being preempted itself could still
>   benefit from preempting (low priority) background system tasks.
> - It would be useful to quickly (and safely) experiment with different
>   policies in production, without having to shut down applications or reboot
>   systems, to determine what the policies for different workloads should be.
> - Only a few workloads are large and sensitive enough to merit their own
>   policy tweaks. CFS by itself should be good enough for everything else,
>   and we probably do not want policy tweaks to be a replacement for anything
>   CFS does.
>
> This leads to BPF hooks, which have been successfully used in various
> kernel subsystems to provide a way for external code to (safely)
> change a few kernel decisions. BPF tooling makes this pretty easy to do,
> and the people deploying BPF scripts are already quite used to updating them
> for new kernel versions.
>
> This patchset aims to start a discussion about potential applications of BPF
> to the scheduler. It also aims to land some very basic BPF infrastructure
> necessary to add new BPF hooks to the scheduler, a minimal set of useful
> helpers, corresponding libbpf changes, etc.
>

Thanks for initiating the effort of bringing BPF to sched. I've been
looking at the potential applications of BPF in sched for some time
and I'm very excited about this work!

My current focus has been using BPF for profiling performance and
exporting sched related stats. I think BPF can provide a great help
there. We have many users in Google that want the kernel to export
various scheduling metrics to userspace. I think BPF is a good fit for
such a task. So one of my recent attempts is to use BPF to account for
the forced idle time caused by core scheduling [1]. This is one of the
topics I want to discuss in my upcoming LPC BPF talk [2].

Looking forward, I agree that BPF has a great potential in customizing
policies in the scheduler. It has the advantage of quick
experimentation and deployment. One of the use cases I'm thinking of
is to customize load balancing policies. For example, allow using BPF
to influence whether a task can migrate (can_migrate_task). This is
currently only an idea.

> Our very first experiments with using BPF in CFS look very promising. We're
> at a very early stage, however already have seen a nice latency and ~1% RPS
> wins for our (Facebook's) main web workload.
>
> As I know, Google is working on a more radical approach [2]: they aim to move
> the scheduling code into userspace. It seems that their core motivation is
> somewhat similar: to make the scheduler changes easier to develop, validate
> and deploy. Even though their approach is different, they also use BPF for
> speeding up some hot paths. I think the suggested infrastructure can serve
> their purpose too.

Yes. Barret can talk more about this, but I think it summarized the
work of ghOSt [3] and the use of BPF in ghOSt well.

Hao

>
> An example of an userspace part, which loads some simple hooks is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
>
>
> [1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
> [2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
> [3] https://github.com/rgushchin/atc
>
>
> Roman Gushchin (6):
>   bpf: sched: basic infrastructure for scheduler bpf
>   bpf: sched: add convenient helpers to identify sched entities
>   bpf: sched: introduce bpf_sched_enable()
>   sched: cfs: add bpf hooks to control wakeup and tick preemption
>   libbpf: add support for scheduler bpf programs
>   bpftool: recognize scheduler programs
>
>  include/linux/bpf_sched.h       |  53 ++++++++++++
>  include/linux/bpf_types.h       |   3 +
>  include/linux/sched_hook_defs.h |   4 +
>  include/uapi/linux/bpf.h        |  25 ++++++
>  kernel/bpf/btf.c                |   1 +
>  kernel/bpf/syscall.c            |  21 ++++-
>  kernel/bpf/trampoline.c         |   1 +
>  kernel/bpf/verifier.c           |   9 ++-
>  kernel/sched/Makefile           |   1 +
>  kernel/sched/bpf_sched.c        | 138 ++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c             |  27 +++++++
>  scripts/bpf_doc.py              |   2 +
>  tools/bpf/bpftool/common.c      |   1 +
>  tools/bpf/bpftool/prog.c        |   1 +
>  tools/include/uapi/linux/bpf.h  |  25 ++++++
>  tools/lib/bpf/libbpf.c          |  27 ++++++-
>  tools/lib/bpf/libbpf.h          |   4 +
>  tools/lib/bpf/libbpf.map        |   3 +
>  18 files changed, 341 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/bpf_sched.h
>  create mode 100644 include/linux/sched_hook_defs.h
>  create mode 100644 kernel/sched/bpf_sched.c
>
> --
> 2.31.1
>

[1] core scheduling and forced idle: https://lwn.net/Articles/799454/
[2] BPF talk: https://linuxplumbersconf.org/event/11/contributions/954/
[3] ghOSt: https://github.com/google/ghost-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-07-19 23:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-20 16:41 [PATCH rfc 0/6] Scheduler BPF Hui-Chun Feng
     [not found] <20210915213550.3696532-1-guro@fb.com>
2021-09-16  0:19 ` Hao Luo
2021-09-16  1:42   ` Roman Gushchin
2021-09-16 16:24 ` Roman Gushchin
2021-09-16 16:36   ` Roman Gushchin
2021-10-06 16:39   ` Qais Yousef
2021-10-06 18:50     ` Roman Gushchin
2021-10-11 16:38       ` Qais Yousef
2021-10-11 18:09         ` Roman Gushchin
2021-10-12 10:16           ` Qais Yousef
     [not found]   ` <52EC1E80-4C89-43AD-8A59-8ACA184EAE53@gmail.com>
2021-11-25  6:00     ` Yafang Shao
2021-11-26 19:46       ` Roman Gushchin
2022-01-15  8:29   ` Huichun Feng
2022-01-18 22:54     ` Roman Gushchin
2022-07-19 13:05   ` Ren Zhijie
2022-07-19 13:17   ` Ren Zhijie
2022-07-19 23:21     ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).