From: Julien Desfossez <jdesfossez@digitalocean.com>
To: Peter Zijlstra <peterz@infradead.org>,
	mingo@kernel.org, tglx@linutronix.de, pjt@google.com,
	tim.c.chen@linux.intel.com, torvalds@linux-foundation.org
Cc: Julien Desfossez <jdesfossez@digitalocean.com>,
	linux-kernel@vger.kernel.org, subhra.mazumdar@oracle.com,
	fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com,
	Vineeth Pillai <vpillai@digitalocean.com>,
	Nishanth Aravamudan <naravamudan@digitalocean.com>
Subject: Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
Date: Thu, 21 Mar 2019 17:20:17 -0400
Message-ID: <1553203217-11444-1-git-send-email-jdesfossez@digitalocean.com>
In-Reply-To: <15f3f7e6-5dce-6bbf-30af-7cffbd7bb0c3@oracle.com>

On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar <subhra.mazumdar@oracle.com>
wrote:
> On 3/18/19 8:41 AM, Julien Desfossez wrote:
> > The case where we try to acquire the locks of 2 runqueues belonging to 2
> > different cores also requires the rq_lockp wrapper, otherwise we
> > frequently deadlock there.
> >
> > This fixes the crash reported in
> > 1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 76fee56..71bb71f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
> >               raw_spin_lock(rq_lockp(rq1));
> >               __acquire(rq2->lock);   /* Fake it out ;) */
> >       } else {
> > -             if (rq1 < rq2) {
> > +             if (rq_lockp(rq1) < rq_lockp(rq2)) {
> >                       raw_spin_lock(rq_lockp(rq1));
> >                       raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
> >               } else {
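
For clarity, here is roughly what the resulting function looks like with the
above one-liner applied (a sketch reconstructed from the quoted diff; the
exact code in the patchset may differ slightly):

static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
        if (rq_lockp(rq1) == rq_lockp(rq2)) {
                /* Both rqs share one (core-wide) lock: take it once. */
                raw_spin_lock(rq_lockp(rq1));
                __acquire(rq2->lock);   /* Fake it out ;) */
        } else {
                /*
                 * Order the two acquisitions by lock address, not by rq
                 * address.  With core scheduling, several rqs can map to
                 * one core-wide lock, so ordering by rq pointer can make
                 * two CPUs take the same pair of locks in opposite orders
                 * (ABBA deadlock).
                 */
                if (rq_lockp(rq1) < rq_lockp(rq2)) {
                        raw_spin_lock(rq_lockp(rq1));
                        raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
                } else {
                        raw_spin_lock(rq_lockp(rq2));
                        raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
                }
        }
}
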
> With this fix and my previous NULL-pointer fix, my stress tests are
> surviving. I re-ran my setup with 2 DB instances on a 44-core, 2-socket
> system, putting each DB instance in a separate core scheduling group. The
> numbers look much worse now.
>
> users  baseline  %stdev  %idle  core_sched  %stdev  %idle
> 16     1         0.3     66     -73.4%      136.8   82
> 24     1         1.6     54     -95.8%      133.2   81
> 32     1         1.5     42     -97.5%      124.3   89

We are also seeing a performance degradation of about 83% in the throughput
of 2 MySQL VMs under a stress test (12 vcpus and 32GB of RAM each). The
server has 2 NUMA nodes, each with 18 cores (72 hardware threads in total
with SMT). Each MySQL VM is pinned to a different NUMA node. The clients for
the stress tests run on a separate physical machine, and each client runs 48
query threads. Only the MySQL VMs use core scheduling (all vcpus and
emulator threads are tagged). Overall, the server is 90% idle when the 2 VMs
use core scheduling, and 75% idle when they don’t.
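
For reference, "use core scheduling" here means the VM's vcpu and emulator
threads live in a cgroup whose cpu.tag file is set. Conceptually the tagging
is just a write to that file; the sketch below is a hypothetical helper, and
the cgroup mount point and group name are examples from our setup, not
something defined by the series:

#include <stdio.h>

/*
 * Tag a cpu cgroup for core scheduling via the cpu.tag file added by
 * this patchset.  The path is an example; adjust to the actual cgroup
 * hierarchy.  Usage: tag_cgroup("vm-mysql-1");
 */
static int tag_cgroup(const char *group)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/cpu.tag", group);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs("1\n", f);        /* "1" tags the group */
        return fclose(f);
}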

The rate of preemption vs normal “switch out” is about 1% both with and
without core scheduling enabled, but the overall rate of sched_switch is 5
times higher with core scheduling, which suggests some heavy contention in
the scheduling path.
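
The preemption vs “switch out” split comes from classifying sched_switch
events by the state of the outgoing task. The heuristic is the usual one,
sketched below; TASK_RUNNING is 0 as encoded by the tracepoint, and the
helper name is ours, not part of any existing tool:

#include <stdbool.h>

#define TASK_RUNNING    0       /* as reported in sched_switch prev_state */

/*
 * A task that is still runnable when it is switched out was preempted;
 * any other prev_state (sleeping, blocked, exiting) counts as a normal
 * "switch out".
 */
static bool is_preemption(long prev_state)
{
        return prev_state == TASK_RUNNING;
}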

On further investigation, we could see that the contention is mostly in the
way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Because of this, __schedule() is
more or less serialized across the core, and that accounts for the
performance loss we are seeing. We also saw that newidle_balance() spends a
considerable amount of time in load_balance() due to rq spinlock contention.
Do you think it would help if the core-wide locking were only performed when
absolutely needed?
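
To make that question concrete, we are thinking of something along these
lines (untested pseudo-code against the series' rq_lockp() wrapper;
core_tagged_count is a hypothetical per-core count of runnable tagged tasks
that does not exist in the patchset as posted):

static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
        /*
         * Only pay for core-wide serialization when a tagged task can
         * actually run on this core.  Switching safely between the two
         * locks at runtime is the hard part, and is glossed over here.
         */
        if (sched_core_enabled(rq) &&
            atomic_read(&rq->core->core_tagged_count))
                return &rq->core->__lock;

        return &rq->__lock;
}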

In terms of isolation, we measured the time a thread spends co-scheduled
with either a thread from the same group, the idle thread, or a thread from
another group (a sketch of the classification follows the numbers below).
This is what we see over 60 seconds of a specific busy VM pinned to a whole
NUMA node (all its threads):

no core scheduling:
- local neighbors (19.989 % of process runtime)
- idle neighbors (47.197 % of process runtime)
- foreign neighbors (22.811 % of process runtime)

core scheduling enabled:
- local neighbors (6.763 % of process runtime)
- idle neighbors (93.064 % of process runtime)
- foreign neighbors (0.236 % of process runtime)

As a separate test, we tried to pin all the vcpu threads to a set of cores
(6 cores for 12 vcpus):

no core scheduling:
- local neighbors (88.299 % of process runtime)
- idle neighbors (9.334 % of process runtime)
- foreign neighbors (0.197 % of process runtime)

core scheduling enabled:
- local neighbors (84.570 % of process runtime)
- idle neighbors (15.195 % of process runtime)
- foreign neighbors (0.257 % of process runtime)
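
For completeness, the buckets above are computed by looking at what runs on
the sibling hardware thread whenever one of the VM's threads is on a CPU.
The classification itself is trivial; in the sketch below the struct and
field names are illustrative (we computed this offline from a sched_switch
trace, not in the kernel):

#include <stdbool.h>
#include <stdint.h>

struct task_info {
        bool            is_idle;
        uint64_t        group_id;       /* core scheduling tag / cgroup id */
};

enum neighbor_class { LOCAL, IDLE, FOREIGN };

static enum neighbor_class classify_neighbor(const struct task_info *self,
                                             const struct task_info *sibling)
{
        if (sibling->is_idle)
                return IDLE;
        if (sibling->group_id == self->group_id)
                return LOCAL;
        return FOREIGN;
}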

Thanks,

Julien

Thread overview: 99+ messages
2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 01/16] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 03/16] sched: Wrap rq::lock access Peter Zijlstra
2019-02-19 16:13   ` Phil Auld
2019-02-19 16:22     ` Peter Zijlstra
2019-02-19 16:37       ` Phil Auld
2019-03-18 15:41   ` Julien Desfossez
2019-03-20  2:29     ` Subhra Mazumdar
2019-03-21 21:20       ` Julien Desfossez [this message]
2019-03-22 13:34         ` Peter Zijlstra
2019-03-22 20:59           ` Julien Desfossez
2019-03-23  0:06         ` Subhra Mazumdar
2019-03-27  1:02           ` Subhra Mazumdar
2019-03-29 13:35           ` Julien Desfossez
2019-03-29 22:23             ` Subhra Mazumdar
2019-04-01 21:35               ` Subhra Mazumdar
2019-04-03 20:16                 ` Julien Desfossez
2019-04-05  1:30                   ` Subhra Mazumdar
2019-04-02  7:42               ` Peter Zijlstra
2019-03-22 23:28       ` Tim Chen
2019-03-22 23:44         ` Tim Chen
2019-02-18 16:56 ` [RFC][PATCH 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 06/16] sched/fair: Export newidle_balance() Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 07/16] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 08/16] sched: Rework pick_next_task() slow-path Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 09/16] sched: Introduce sched_class::pick_task() Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 10/16] sched: Core-wide rq->lock Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 11/16] sched: Basic tracking of matching tasks Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 12/16] sched: A quick and dirty cgroup tagging interface Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling Peter Zijlstra
     [not found]   ` <20190402064612.GA46500@aaronlu>
2019-04-02  8:28     ` Peter Zijlstra
2019-04-02 13:20       ` Aaron Lu
2019-04-05 14:55       ` Aaron Lu
2019-04-09 18:09         ` Tim Chen
2019-04-10  4:36           ` Aaron Lu
2019-04-10 14:18             ` Aubrey Li
2019-04-11  2:11               ` Aaron Lu
2019-04-10 14:44             ` Peter Zijlstra
2019-04-11  3:05               ` Aaron Lu
2019-04-11  9:19                 ` Peter Zijlstra
2019-04-10  8:06           ` Peter Zijlstra
2019-04-10 19:58             ` Vineeth Remanan Pillai
2019-04-15 16:59             ` Julien Desfossez
2019-04-16 13:43       ` Aaron Lu
2019-04-09 18:38   ` Julien Desfossez
2019-04-10 15:01     ` Peter Zijlstra
2019-04-11  0:11     ` Subhra Mazumdar
2019-04-19  8:40       ` Ingo Molnar
2019-04-19 23:16         ` Subhra Mazumdar
2019-02-18 16:56 ` [RFC][PATCH 14/16] sched/fair: Add a few assertions Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer Peter Zijlstra
2019-02-21 16:19   ` Valentin Schneider
2019-02-21 16:41     ` Peter Zijlstra
2019-02-21 16:47       ` Peter Zijlstra
2019-02-21 18:28         ` Valentin Schneider
2019-04-04  8:31       ` Aubrey Li
2019-04-06  1:36         ` Aubrey Li
2019-02-18 16:56 ` [RFC][PATCH 16/16] sched: Debug bits Peter Zijlstra
2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
2019-02-18 20:40   ` Peter Zijlstra
2019-02-19  0:29     ` Linus Torvalds
2019-02-19 15:15       ` Ingo Molnar
2019-02-22 12:17     ` Paolo Bonzini
2019-02-22 14:20       ` Peter Zijlstra
2019-02-22 19:26         ` Tim Chen
2019-02-26  8:26           ` Aubrey Li
2019-02-27  7:54             ` Aubrey Li
2019-02-21  2:53   ` Subhra Mazumdar
2019-02-21 14:03     ` Peter Zijlstra
2019-02-21 18:44       ` Subhra Mazumdar
2019-02-22  0:34       ` Subhra Mazumdar
2019-02-22 12:45   ` Mel Gorman
2019-02-22 16:10     ` Mel Gorman
2019-03-08 19:44     ` Subhra Mazumdar
2019-03-11  4:23       ` Aubrey Li
2019-03-11 18:34         ` Subhra Mazumdar
2019-03-11 23:33           ` Subhra Mazumdar
2019-03-12  0:20             ` Greg Kerr
2019-03-12  0:47               ` Subhra Mazumdar
2019-03-12  7:33               ` Aaron Lu
2019-03-12  7:45             ` Aubrey Li
2019-03-13  5:55               ` Aubrey Li
2019-03-14  0:35                 ` Tim Chen
2019-03-14  5:30                   ` Aubrey Li
2019-03-14  6:07                     ` Li, Aubrey
2019-03-18  6:56             ` Aubrey Li
2019-03-12 19:07           ` Pawan Gupta
2019-03-26  7:32       ` Aaron Lu
2019-03-26  7:56         ` Aaron Lu
2019-02-19 22:07 ` Greg Kerr
2019-02-20  9:42   ` Peter Zijlstra
2019-02-20 18:33     ` Greg Kerr
2019-02-22 14:10       ` Peter Zijlstra
2019-03-07 22:06         ` Paolo Bonzini
2019-02-20 18:43     ` Subhra Mazumdar
2019-03-01  2:54 ` Subhra Mazumdar
2019-03-14 15:28 ` Julien Desfossez
