linux-kernel.vger.kernel.org archive mirror
From: "Chris Mason" <clm@fb.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rik van Riel <riel@surriel.com>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
Date: Fri, 23 Oct 2020 19:49:28 -0400
Message-ID: <DB4481A8-FD4E-4879-9CD2-275ABAFC09CF@fb.com>

Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2, 
performance is ~2-3% lower for some of our workloads.  After some 
digging, Johannes found that our involuntary context switch rate was ~2x 
higher, and we were leaving a CPU idle a higher percentage of the time, 
even though the workload was trying to saturate the system.
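
For anyone who wants to check the same counter on their own workload, the
kernel exposes per-task context switch counts in /proc/<pid>/status as
voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The snippet below is
only a minimal sketch of parsing those two fields, not the tooling used for
the numbers above; the function name and the sample text are hypothetical.

```python
def parse_ctxt_switches(status_text):
    """Return (voluntary, involuntary) context switch counts parsed
    from the text of a /proc/<pid>/status file."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            counts[key.strip()] = int(value.strip())
    # Missing fields are reported as 0 rather than raising.
    return (counts.get(wanted[0], 0), counts.get(wanted[1], 0))


# Hypothetical sample text; on Linux, read /proc/<pid>/status instead.
SAMPLE_STATUS = (
    "Name:\tschbench\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t300\n"
)

voluntary, involuntary = parse_ctxt_switches(SAMPLE_STATUS)
print(voluntary, involuntary)
```

Sampling the involuntary count twice and dividing the difference by the
elapsed time gives a rate that can be compared between kernels.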

We were able to reproduce the problem with schbench, and Johannes 
bisected down to:

commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Oct 18 15:26:31 2019 +0200

     sched/fair: Rework load_balance()

Our working theory is that the load balancing changes leave tasks queued 
behind busy CPUs instead of moving them onto idle ones.  I made a few 
schbench modifications to make this easier to demonstrate:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/

My VM has 40 CPUs (20 cores, 2 threads per core), and my schbench 
command line is:

schbench -t 20 -r 0 -c 1000000 -s 1000 -i 30 -z 120

This has two message threads, and 20 workers per message thread.  Once 
woken up, the workers think for a full second, which means you’ll have 
some long latencies if you’re stuck behind one of these workers in the 
runqueue.  The message thread does a little bit of work and then sleeps, 
so we end up with 40 threads hammering full blast on the CPU and 2 
threads popping in and out of idle.
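
The thread accounting above is simple arithmetic, sketched here using the
numbers from this message (2 message threads, -t 20, 40 CPUs). The helper
name is hypothetical and is not part of schbench.

```python
def schbench_thread_counts(message_threads, workers_per_message_thread):
    """Return (worker_threads, total_threads) for a schbench run."""
    workers = message_threads * workers_per_message_thread
    return workers, workers + message_threads


MESSAGE_THREADS = 2              # as described in the message
WORKERS_PER_MESSAGE_THREAD = 20  # schbench -t 20
CPUS = 40                        # 20 cores, 2 threads per core

workers, total = schbench_thread_counts(MESSAGE_THREADS,
                                        WORKERS_PER_MESSAGE_THREAD)
print(f"{workers} workers + {MESSAGE_THREADS} message threads "
      f"= {total} threads on {CPUS} CPUs")
```

The workers alone are enough to keep every CPU busy, so whenever a message
thread wakes up it has to share a CPU with a worker for a moment.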

schbench times the delay from when a message thread wakes a worker to 
when the worker runs.  On a good kernel, the output looks like this:

Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
         50.0th: 155 (1653 samples)
         75.0th: 189 (808 samples)
         90.0th: 216 (501 samples)
         95.0th: 227 (163 samples)
         *99.0th: 256 (123 samples)
         99.5th: 1510 (16 samples)
         99.9th: 3132 (13 samples)
         min=21, max=3286

With 0b0695f2b34a, we get this:

Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
         50.0th: 147 (2261 samples)
         75.0th: 182 (1116 samples)
         90.0th: 205 (671 samples)
         95.0th: 224 (215 samples)
         *99.0th: 12240 (173 samples) <-- much higher p99 and up
         99.5th: 12752 (22 samples)
         99.9th: 13104 (18 samples)
         min=21, max=13172
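
To make the difference between the two runs easier to see, here is a small
sketch that computes the ratio of each reported percentile. The latency
values are copied from the two outputs above; the function and variable
names are hypothetical.

```python
# Latency percentiles in usec, copied from the two schbench outputs above.
GOOD = {50.0: 155, 75.0: 189, 90.0: 216, 95.0: 227,
        99.0: 256, 99.5: 1510, 99.9: 3132}
BAD = {50.0: 147, 75.0: 182, 90.0: 205, 95.0: 224,
       99.0: 12240, 99.5: 12752, 99.9: 13104}


def latency_ratios(good, bad):
    """Return {percentile: bad / good} for every percentile in good."""
    return {p: bad[p] / good[p] for p in sorted(good)}


ratios = latency_ratios(GOOD, BAD)
for p, ratio in ratios.items():
    print(f"{p:5.1f}th: {GOOD[p]:>5} -> {BAD[p]:>5} usec ({ratio:.2f}x)")
```

The percentiles up to p95 are slightly lower with 0b0695f2b34a, so the
regression shows up only in the tail, from p99 upward.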

Since the idea is to fully load the machine with schbench, use schbench 
-t <your_num_cpus/2>, and make sure the box doesn’t have other stuff 
running in the background.  I used a VM because it ended up giving more 
consistent results on our kernel test machines, which have some periodic 
noise running in the background.
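
As a rough sketch of scaling the command line to another machine, the helper
below sets -t to half the CPU count and copies every other flag verbatim
from the command above. The helper name is hypothetical, and the sketch
assumes the schbench build being used accepts all of these flags.

```python
def schbench_command(num_cpus):
    """Build the schbench command line from this message for a
    machine with num_cpus CPUs, using -t <num_cpus / 2>."""
    workers_per_message_thread = num_cpus // 2
    return ["schbench",
            "-t", str(workers_per_message_thread),
            # Remaining flags are copied unchanged from the message.
            "-r", "0", "-c", "1000000", "-s", "1000",
            "-i", "30", "-z", "120"]


print(" ".join(schbench_command(40)))
```

On the test machine itself, os.cpu_count() can supply the CPU count.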

We’ve tried a few different approaches, but don’t quite have a solid 
fix yet.  I thought I’d kick off the discussion with my most useful 
hunks so far:

diff a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

-chris
