linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Rik van Riel <riel@surriel.com>
To: peterz@infradead.org
Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, morten.rasmussen@arm.com, tglx@linutronix.de,
	dietmar.eggeman@arm.com, mgorman@techsingularity.com,
	vincent.guittot@linaro.org
Subject: [RFC] sched,cfs: flatten CPU controller runqueues
Date: Wed, 12 Jun 2019 15:32:19 -0400	[thread overview]
Message-ID: <20190612193227.993-1-riel@surriel.com> (raw)

The current implementation of the CPU controller uses hierarchical
runqueues, where on wakeup a task is enqueued on its group's runqueue,
the group is enqueued on the runqueue of the group above it, etc.

This increases a fairly large amount of overhead for workloads that
do a lot of wakeups a second, especially given that the default systemd
hierarchy is 2 or 3 levels deep.

This patch series is an attempt at reducing that overhead, by placing
all the tasks on the same runqueue, and scaling the task priority by
the priority of the group, which is calculated periodically.

This patch series still has a number of TODO items:
- Clean up the code, and fix compilation without CONFIG_FAIR_GROUP_SCHED.
- Remove some more now unused code.
- Figure out a regression with schbench, where the p99 latency goes up
  before the system is fully overloaded. I suspect wakeup_preempt_entity()
  and wakeup_gran() because they now use the task_h_load instead of the
  unscaled load to figure out whether a task should be preempted.
- Reimplement CONFIG_CFS_BANDWIDTH.

Plan for the CONFIG_CFS_BANDWIDTH reimplementation:
- When a cgroup gets throttled, mark the cgroup and its children
  as throttled.
- When pick_next_entity finds a task that is on a throttled cgroup,
  stash it on the cgroup runqueue (which is not used for runnable
  tasks any more). Leave the vruntime unchanged, and adjust that
  runqueue's vruntime to be that of the left-most task.
- When a cgroup gets unthrottled, and has tasks on it, place it on
  a vruntime ordered heap separate from the main runqueue.
- Have pick_next_task_fair grab one task off that heap every time it
  is called, and the min vruntime of that heap is lower than the
  vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
- Place that selected task on the CPU's cfs_rq, renormalizing its
  vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
  interleave the already runnable tasks with the recently unthrottled
  group, and prevent thundering herd issues.
- If the group gets throttled again before all of its task had a chance
  to run, vruntime sorting ensures all the tasks in the throttled cgroup
  get a chance to run over time.

This patch applies on top of what was Linus's current tree when I last
rebased it:
2c1212de6f97 ("Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core")

 include/linux/sched.h |    5 
 kernel/sched/core.c   |    2 
 kernel/sched/debug.c  |   12 
 kernel/sched/fair.c   |  744 +++++++++++++++++++++-----------------------------
 kernel/sched/pelt.c   |   55 +--
 kernel/sched/pelt.h   |    2 
 kernel/sched/sched.h  |    9 
 7 files changed, 346 insertions(+), 483 deletions(-)



             reply	other threads:[~2019-06-12 19:32 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-12 19:32 Rik van Riel [this message]
2019-06-12 19:32 ` [PATCH 1/8] sched: introduce task_se_h_load helper Rik van Riel
2019-06-19 12:52   ` Dietmar Eggemann
2019-06-19 13:57     ` Rik van Riel
2019-06-19 15:18       ` Dietmar Eggemann
2019-06-19 15:55         ` Rik van Riel
2019-06-12 19:32 ` [PATCH 2/8] sched: change /proc/sched_debug fields Rik van Riel
2019-06-12 19:32 ` [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
2019-06-18  9:08   ` Dietmar Eggemann
2019-06-26 14:34   ` Dietmar Eggemann
2019-06-12 19:32 ` [PATCH 4/8] sched,fair: remove cfs rqs from leaf_cfs_rq_list bottom up Rik van Riel
2019-06-12 19:32 ` [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
2019-06-20 16:23   ` Dietmar Eggemann
2019-06-20 16:29     ` Rik van Riel
2019-06-24 11:24       ` Dietmar Eggemann
2019-06-26 15:58   ` Dietmar Eggemann
2019-06-26 16:15     ` Rik van Riel
2019-06-12 19:32 ` [PATCH 6/8] sched,cfs: fix zero length timeslice calculation Rik van Riel
2019-06-12 19:32 ` [PATCH 7/8] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
2019-06-12 19:32 ` [PATCH 8/8] sched,fair: flatten hierarchical runqueues Rik van Riel
2019-06-25  9:50   ` Dietmar Eggemann
2019-06-25 13:51     ` Rik van Riel
2019-06-28 10:26   ` Dietmar Eggemann
2019-06-28 19:36     ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190612193227.993-1-riel@surriel.com \
    --to=riel@surriel.com \
    --cc=dietmar.eggeman@arm.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.com \
    --cc=mingo@redhat.com \
    --cc=morten.rasmussen@arm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).