From: Vincent Guittot <vincent.guittot@linaro.org>
To: steven.sistare@oracle.com
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	subhra.mazumdar@oracle.com,
	Dhaval Giani <dhaval.giani@oracle.com>,
	daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com,
	Matt Fleming <matt@codeblueprint.co.uk>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Rik van Riel <riel@redhat.com>, Josef Bacik <jbacik@fb.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v4 00/10] steal tasks to improve CPU utilization
Date: Mon, 10 Dec 2018 18:08:22 +0100
Message-ID: <CAKfTPtCePEwUDyH1on+fJbuUEBYR8v7U4doL-N-ymdT3woD7OA@mail.gmail.com>
In-Reply-To: <CAKfTPtBh59Wt3+JiZ168WRAZ+AEsS+-GTqXzeUZTKQPcDAJ73g@mail.gmail.com>

On Mon, 10 Dec 2018 at 17:33, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 10 Dec 2018 at 17:29, Steven Sistare <steven.sistare@oracle.com> wrote:
> >
> > On 12/10/2018 11:10 AM, Vincent Guittot wrote:
> > > Hi Steven,
> > >
> > > On Thu, 6 Dec 2018 at 22:38, Steve Sistare <steven.sistare@oracle.com> wrote:
> > >>
> > >> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> > >> find a task, then attempt to steal a task from an overloaded CPU in the
> > >> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> > >> identify candidates.  To minimize search time, steal the first migratable
> > >> task that is found when the bitmap is traversed.  For fairness, search
> > >> for migratable tasks on an overloaded CPU in order of next to run.
> > >>
> > >> This simple stealing yields a higher CPU utilization than idle_balance()
> > >> alone, because the search is cheap, so it may be called every time the CPU
> > >> is about to go idle.  idle_balance() does more work because it searches
> > >> widely for the busiest queue, so to limit its CPU consumption, it declines
> > >> to search if the system is too busy.  Simple stealing does not offload the
> > >> globally busiest queue, but it is much better than running nothing at all.
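
The "declines to search" behavior mentioned above is the existing gate
near the top of idle_balance(), simplified here from
kernel/sched/fair.c of this vintage:

  /* Bail out when the expected idle period is shorter than the cost
   * of a migration, or when nothing is overloaded -- precisely the
   * window in which a cheap steal can still find work. */
  if (this_rq->avg_idle < sysctl_sched_migration_cost ||
      !READ_ONCE(this_rq->rd->overload))
      goto out;    /* skip the wide search for the busiest queue */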
> > >>
> > >> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> > >> reduce cache contention vs the usual bitmap when many threads concurrently
> > >> set, clear, and visit elements.
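
The real data structure is in patch 1; purely to illustrate the idea,
one way to trade memory for less cache contention is to place each
word of the mask on its own cache line (the layout and names below are
illustrative, not the series' implementation):

  #define SMASK_LINE  64    /* assumed cache line size, in bytes */

  /* The word holding 'bit' starts a fresh cache line, so concurrent
   * set/clear of bits for different CPU ranges do not contend. */
  static inline unsigned long *smask_word(void *mask, int bit)
  {
      return (unsigned long *)((char *)mask + (bit / 64) * SMASK_LINE);
  }

  static inline void smask_set(void *mask, int bit)
  {
      set_bit(bit % 64, smask_word(mask, bit));    /* atomic RMW */
  }

  static inline void smask_clear(void *mask, int bit)
  {
      clear_bit(bit % 64, smask_word(mask, bit));
  }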
> > >>
> > >> Patch 1 defines the sparsemask type and its operations.
> > >>
> > >> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
> > >>
> > >> Patches 5 and 6 refactor existing code for a cleaner merge of later
> > >>   patches.
> > >>
> > >> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
> > >>
> > >> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> > >> time being because of performance regressions that are not due to stealing
> > >> per se.  See the patch description for details.
> > >>
> > >> Patch 10 adds schedstats for comparing the new behavior to the old; it
> > >>   is provided as a convenience for developers only, not for integration.
> > >>
> > >> The patch series is based on kernel 4.20.0-rc1.  It compiles, boots, and
> > >> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> > >> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
> > >> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> > >> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
> > >> bandwidth control were tested.
> > >>
> > >> Stealing improves utilization with only a modest CPU overhead in scheduler
> > >> code.  In the following experiment, hackbench is run with varying numbers
> > >> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> > >> for each run, averaged per CPU, augmented with these non-standard stats:
> > >>
> > >>   %find - percent of time spent in old and new functions that search for
> > >>     idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
> > >>
> > >>   steal - number of times a task is stolen from another CPU.
> > >>
> > >> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> > >> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> > >> hackbench <grps> process 100000
> > >> sched_wakeup_granularity_ns=15000000
> > >>
> > >>   baseline
> > >>   grps  time  %busy  slice   sched   idle     wake %find  steal
> > >>   1    8.084  75.02   0.10  105476  46291    59183  0.31      0
> > >>   2   13.892  85.33   0.10  190225  70958   119264  0.45      0
> > >>   3   19.668  89.04   0.10  263896  87047   176850  0.49      0
> > >>   4   25.279  91.28   0.10  322171  94691   227474  0.51      0
> > >>   8   47.832  94.86   0.09  630636 144141   486322  0.56      0
> > >>
> > >>   new
> > >>   grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
> > >>   1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
> > >>   2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
> > >>   3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
> > >>   4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
> > >>   8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
> > >>
> > >> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> > >> by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
> > >> The cost is at most 0.4% more find time.
> > >
> > > I have run some hackbench tests on my HiKey arm64 octa-core board with
> > > your patchset. My original intent was to send a Tested-by, but I see
> > > some performance regressions.
> > > This HiKey is the SMP one, not the asymmetric HiKey960 that Valentin
> > > used for his tests.
> > > The sched domain topology is
> > > domain-0: span=0-3 level=MC  and domain-0: span=4-7 level=MC
> > > domain-1: span=0-7 level=DIE
> > >
> > > I have run hackbench -g $j -P -l 2000 12 times, with j equal to
> > > 1, 2, 3, 4, and 8:
> > >
> > > grps  time
> > > 1      1.396
> > > 2      2.699
> > > 3      3.617
> > > 4      4.498
> > > 8      7.721
> > >
> > > Then, after disabling STEAL in sched_features with echo NO_STEAL >
> > > /sys/kernel/debug/sched_features (a sketch of how this feature bit is
> > > wired up follows the table), the results become:
> > > grps  time
> > > 1      1.217
> > > 2      1.973
> > > 3      2.855
> > > 4      3.932
> > > 8      7.674
> > >
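For anyone unfamiliar with the toggle used above: STEAL is an ordinary
scheduler feature bit, so the wiring is presumably the usual pattern
(sketch only; the exact hook point and entry-point name are per
patches 7-8):

  /* kernel/sched/features.h -- this declaration is what makes
   * "echo NO_STEAL > /sys/kernel/debug/sched_features" work: */
  SCHED_FEAT(STEAL, true)

  /* ...checked on the go-idle path before any search is attempted: */
  if (sched_feat(STEAL))
      stolen = try_steal(rq, rf);
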
> > > I haven't yet looked in detail at the possible reasons for this
> > > difference, and I haven't collected the stats that you added with
> > > patch 10.
> > > Do you have a script to collect and post-process them?
> > >
> > > Regards,
> > > Vincent
> >
> > Thanks Vincent.  What is the value of /proc/sys/kernel/sched_wakeup_granularity_ns?
>
> it's 4000000
>
> > Try 15000000.  Your 8-core system is heavily overloaded with 40 * groups tasks,
> > and I suspect preemptions are killing performance.
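
For context, the sysctl matters because CFS wakeup preemption compares
the waking task's vruntime lead against a granularity derived from it;
simplified from wakeup_preempt_entity() in kernel/sched/fair.c of this
era:

  static int wakeup_preempt_entity(struct sched_entity *curr,
                                   struct sched_entity *se)
  {
      s64 gran, vdiff = curr->vruntime - se->vruntime;

      if (vdiff <= 0)
          return -1;    /* waker is not ahead: no preemption */

      /* gran scales sysctl_sched_wakeup_granularity by the waking
       * task's weight; raising the sysctl to 15ms makes wakeup
       * preemption much rarer under heavy overload. */
      gran = wakeup_gran(se);
      if (vdiff > gran)
          return 1;     /* preempt the running task */

      return 0;
  }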
>
> OK, I'm going to run the tests with the proposed value.

Results look better after raising /proc/sys/kernel/sched_wakeup_granularity_ns to 15000000.

With STEAL:
grps  time
1      0.869
2      1.646
3      2.395
4      3.163
8      6.199

After echo NO_STEAL > /sys/kernel/debug/sched_features:
grps  time
1      0.928
2      1.770
3      2.597
4      3.407
8      6.431

There is a 7% improvement with STEAL and the larger
/proc/sys/kernel/sched_wakeup_granularity_ns value across all group
counts.
Should the STEAL feature be disabled by default, since it only provides
a benefit when sched_wakeup_granularity_ns is changed from its default
value?

>
> >
> > I have a python script to post-process schedstat files, but it does many
> > things, is large, and I am not ready to share it.  I can write a short
> > bash script if that would help.
>
> I mainly asked in case you wanted the figures from these statistics.
>
> Vincent
>
> >
> > - Steve
