From: Roman Gushchin <guro@fb.com>
To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>, <bpf@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, Roman Gushchin <guro@fb.com>
Subject: [PATCH rfc 0/6] Scheduler BPF
Date: Thu, 16 Sep 2021 09:24:45 -0700
Message-ID: <20210916162451.709260-1-guro@fb.com>
In-Reply-To: <20210915213550.3696532-1-guro@fb.com>

There is a long history of distro people, system administrators, and
application owners tuning the CFS settings in /proc/sys, which are now
in debugfs. When we looked at what these settings actually did, it
boiled down to changing the likelihood of task preemption, or
disabling it by setting wakeup_granularity_ns to more than half of
latency_ns. The other settings didn't really do much for
performance.
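
To make the "more than half of latency_ns" remark concrete, here is a
simplified userspace model of the wakeup preemption check. It is a sketch
and not the kernel code: the function names are made up for illustration,
and it assumes (a) that a task waking from a long sleep is placed at most
latency_ns / 2 behind the run queue's minimum vruntime, and (b) that the
running task sits at that minimum. Under those assumptions a granularity
above latency_ns / 2 can never be exceeded, so the wakee never preempts.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model, NOT the kernel implementation. All names here are
 * hypothetical and exist only for this illustration.
 */

/*
 * Assumption: a sleeper gets at most latency_ns / 2 of vruntime credit
 * relative to the queue's minimum vruntime when it wakes up.
 */
int64_t model_place_wakee(int64_t min_vruntime, int64_t sleeper_vruntime,
			  int64_t latency_ns)
{
	int64_t floor = min_vruntime - latency_ns / 2;

	return sleeper_vruntime > floor ? sleeper_vruntime : floor;
}

/*
 * The wakee preempts only if it is behind the current task in virtual
 * runtime by more than the wakeup granularity.
 */
bool model_wakeup_preempt(int64_t curr_vruntime, int64_t wakee_vruntime,
			  int64_t gran_ns)
{
	int64_t vdiff = curr_vruntime - wakee_vruntime;

	if (vdiff <= 0)
		return false;

	return vdiff > gran_ns;
}
```

With latency_ns set to 6 ms in this model, a long sleeper ends up 3 ms
behind the current task. A 1 ms granularity lets it preempt, while a 4 ms
granularity does not.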

In other words, some of our workloads benefit from having long-running tasks
preempted by tasks handling short-running requests, while other workloads
that run only short-term requests benefit from never being preempted.

This leads to a few observations and ideas:
- Different workloads want different policies. Being able to configure
  the policy per workload could be useful.
- A workload that benefits from not being preempted itself could still
  benefit from preempting (low priority) background system tasks.
- It would be useful to quickly (and safely) experiment with different
  policies in production, without having to shut down applications or reboot
  systems, to determine what the policies for different workloads should be.
- Only a few workloads are large and sensitive enough to merit their own
  policy tweaks. CFS by itself should be good enough for everything else,
  and we probably do not want policy tweaks to be a replacement for anything
  CFS does.

This leads to BPF hooks, which have been successfully used in various
kernel subsystems to provide a way for external code to (safely)
change a few kernel decisions. BPF tooling makes this pretty easy to do,
and the people deploying BPF scripts are already quite used to updating them
for new kernel versions.
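
As a rough illustration of "changing a few kernel decisions", the sketch
below models a preemption hook in plain userspace C. It is not the
interface added by this patchset: the types, the tri-state return
convention (negative = do not preempt, positive = preempt, zero = let CFS
decide) and the example policy are all assumptions made up for this
example. The point it tries to show is that the hook overrides only the
cases it cares about and the default scheduler decision stands otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical verdicts a hook may return; 0 defers to the default. */
enum hook_verdict {
	HOOK_NO_PREEMPT = -1,
	HOOK_DEFAULT    = 0,
	HOOK_PREEMPT    = 1,
};

/* Hypothetical, minimal description of a task for this model. */
struct task_model {
	int  pid;
	bool background;     /* low priority system task */
	bool never_preempt;  /* belongs to a workload that dislikes preemption */
};

typedef int (*preempt_hook_fn)(const struct task_model *curr,
			       const struct task_model *wakee);

/* Ask the hook first; fall back to the default decision on verdict 0. */
bool decide_preempt(preempt_hook_fn hook, const struct task_model *curr,
		    const struct task_model *wakee, bool default_decision)
{
	if (hook) {
		int verdict = hook(curr, wakee);

		if (verdict < 0)
			return false;
		if (verdict > 0)
			return true;
	}
	return default_decision;
}

/*
 * Example policy: protect the sensitive workload from preemption, but
 * let it preempt background tasks; everything else is left to CFS.
 */
int example_policy(const struct task_model *curr,
		   const struct task_model *wakee)
{
	if (curr->never_preempt)
		return HOOK_NO_PREEMPT;
	if (curr->background && wakee->never_preempt)
		return HOOK_PREEMPT;
	return HOOK_DEFAULT;
}
```

Passing NULL as the hook in this model gives the unmodified default
decision, which mirrors the idea that the hooks are an optional tweak
rather than a replacement for CFS.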

This patchset aims to start a discussion about potential applications of BPF
to the scheduler. It also aims to land some very basic BPF infrastructure
necessary to add new BPF hooks to the scheduler, a minimal set of useful
helpers, corresponding libbpf changes, etc.

Our very first experiments with using BPF in CFS look very promising. We're
at a very early stage, but we have already seen a nice latency improvement
and a ~1% RPS win for our (Facebook's) main web workload.

As far as I know, Google is working on a more radical approach [2]: they aim
to move the scheduling code into userspace. It seems that their core
motivation is somewhat similar: to make scheduler changes easier to develop,
validate and deploy. Even though their approach is different, they also use
BPF for speeding up some hot paths. I think the suggested infrastructure can
serve their purpose too.

An example of a userspace part, which loads some simple hooks, is available
here [3]. It's very simple and is provided only to make it easier to play
with the kernel patches.


[1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
[2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
[3] https://github.com/rgushchin/atc


Roman Gushchin (6):
  bpf: sched: basic infrastructure for scheduler bpf
  bpf: sched: add convenient helpers to identify sched entities
  bpf: sched: introduce bpf_sched_enable()
  sched: cfs: add bpf hooks to control wakeup and tick preemption
  libbpf: add support for scheduler bpf programs
  bpftool: recognize scheduler programs

 include/linux/bpf_sched.h       |  53 ++++++++++++
 include/linux/bpf_types.h       |   3 +
 include/linux/sched_hook_defs.h |   4 +
 include/uapi/linux/bpf.h        |  25 ++++++
 kernel/bpf/btf.c                |   1 +
 kernel/bpf/syscall.c            |  21 ++++-
 kernel/bpf/trampoline.c         |   1 +
 kernel/bpf/verifier.c           |   9 ++-
 kernel/sched/Makefile           |   1 +
 kernel/sched/bpf_sched.c        | 138 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c             |  27 +++++++
 scripts/bpf_doc.py              |   2 +
 tools/bpf/bpftool/common.c      |   1 +
 tools/bpf/bpftool/prog.c        |   1 +
 tools/include/uapi/linux/bpf.h  |  25 ++++++
 tools/lib/bpf/libbpf.c          |  27 ++++++-
 tools/lib/bpf/libbpf.h          |   4 +
 tools/lib/bpf/libbpf.map        |   3 +
 18 files changed, 341 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/bpf_sched.h
 create mode 100644 include/linux/sched_hook_defs.h
 create mode 100644 kernel/sched/bpf_sched.c

-- 
2.31.1

