linux-kernel.vger.kernel.org archive mirror
From: Shijith Thotton <sthotton@marvell.com>
To: Steve Sistare <steven.sistare@oracle.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"peterz@infradead.org" <peterz@infradead.org>
Cc: "subhra.mazumdar@oracle.com" <subhra.mazumdar@oracle.com>,
	"dhaval.giani@oracle.com" <dhaval.giani@oracle.com>,
	"rohit.k.jain@oracle.com" <rohit.k.jain@oracle.com>,
	"daniel.m.jordan@oracle.com" <daniel.m.jordan@oracle.com>,
	"pavel.tatashin@microsoft.com" <pavel.tatashin@microsoft.com>,
	"matt@codeblueprint.co.uk" <matt@codeblueprint.co.uk>,
	"umgwanakikbuti@gmail.com" <umgwanakikbuti@gmail.com>,
	"riel@redhat.com" <riel@redhat.com>,
	"jbacik@fb.com" <jbacik@fb.com>,
	"juri.lelli@redhat.com" <juri.lelli@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ganapatrao Kulkarni <gkulkarni@marvell.com>,
	Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization
Date: Fri, 4 Jan 2019 13:37:17 +0000
Message-ID: <DM6PR18MB246096BE8C516A5A5D125932D98E0@DM6PR18MB2460.namprd18.prod.outlook.com>
In-Reply-To: 1540220381-424433-1-git-send-email-steven.sistare@oracle.com

On 22-Oct-18 8:40 PM, Steve Sistare wrote:
> 
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates.  To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed.  For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
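
The stealing policy quoted above can be modeled in user space roughly as follows. This is only a toy sketch, not code from the patch series: the names (struct llc, update_overload, enqueue, try_steal), the fixed sizes, the plain 64-bit word standing in for the sparse bitmap, and the rule that a CPU counts as overloaded when it holds more than one runnable task are all assumptions made for illustration, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS     8   /* hypothetical number of CPUs sharing one LLC */
#define MAX_TASKS 4   /* hypothetical per-CPU queue capacity */

struct task {
    int  id;
    bool migratable;  /* stand-in for affinity and other migration checks */
};

struct runqueue {
    struct task tasks[MAX_TASKS];  /* index 0 is next to run */
    int nr;
};

struct llc {
    struct runqueue rq[NCPUS];
    uint64_t overloaded;  /* bit n set: CPU n has more than one runnable task */
};

/* Recompute the overloaded bit for one CPU after its queue changes. */
void update_overload(struct llc *l, int cpu)
{
    if (l->rq[cpu].nr > 1)
        l->overloaded |= UINT64_C(1) << cpu;
    else
        l->overloaded &= ~(UINT64_C(1) << cpu);
}

/* Append a task to a CPU's queue. Returns false if the queue is full. */
bool enqueue(struct llc *l, int cpu, struct task t)
{
    struct runqueue *rq = &l->rq[cpu];

    if (rq->nr == MAX_TASKS)
        return false;
    rq->tasks[rq->nr++] = t;
    update_overload(l, cpu);
    return true;
}

/*
 * Called when dst_cpu has nothing to run. Walk the overloaded bitmap and
 * take the first migratable task found, scanning each queue in
 * next-to-run order. Returns the stolen task id, or -1 if none was found.
 */
int try_steal(struct llc *l, int dst_cpu)
{
    for (int src = 0; src < NCPUS; src++) {
        if (src == dst_cpu || !(l->overloaded & (UINT64_C(1) << src)))
            continue;

        struct runqueue *rq = &l->rq[src];
        for (int i = 0; i < rq->nr; i++) {
            if (!rq->tasks[i].migratable)
                continue;

            struct task t = rq->tasks[i];
            /* Close the gap left by the stolen task. */
            for (int j = i; j < rq->nr - 1; j++)
                rq->tasks[j] = rq->tasks[j + 1];
            rq->nr--;
            update_overload(l, src);
            enqueue(l, dst_cpu, t);
            return t.id;
        }
    }
    return -1;
}
```

In this model the search stops at the first hit rather than looking for the busiest queue, which is the trade-off the cover letter describes: a cheap search that can run every time a CPU is about to go idle.
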
> 
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle.  idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy.  Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
> 
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
> 
> Patch 1 defines the sparsemask type and its operations.
> 
> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
> 
> Patches 5 and 6 refactor existing code for a cleaner merge of later
>    patches.
> 
> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
> 
> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> time being because of performance regressions that are not due to stealing
> per-se.  See the patch description for details.
> 
> Patch 10 adds schedstats for comparing the new behavior to the old, and is
>    provided as a convenience for developers only, not for integration.
> 
> The patch series is based on kernel 4.19.0-rc7.  It compiles, boots, and
> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
> bandwidth control were tested.
> 
> Stealing improves utilization with only a modest CPU overhead in scheduler
> code.  In the following experiment, hackbench is run with varying numbers
> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> for each run, averaged per CPU, augmented with these non-standard stats:
> 
>    %find - percent of time spent in old and new functions that search for
>      idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
> 
>    steal - number of times a task is stolen from another CPU.
> 
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> hackbench <grps> process 100000
> sched_wakeup_granularity_ns=15000000
> 
>    baseline
>    grps  time  %busy  slice   sched   idle     wake %find  steal
>    1    8.084  75.02   0.10  105476  46291    59183  0.31      0
>    2   13.892  85.33   0.10  190225  70958   119264  0.45      0
>    3   19.668  89.04   0.10  263896  87047   176850  0.49      0
>    4   25.279  91.28   0.10  322171  94691   227474  0.51      0
>    8   47.832  94.86   0.09  630636 144141   486322  0.56      0
> 
>    new
>    grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
>    1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
>    2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
>    3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
>    4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
>    8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
> 
> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
> The cost is at most 0.4% more find time.
> 
> Additional performance results follow.  A negative "speedup" is a
> regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
> is set to 15 msec.  Otherwise, preemptions increase at higher loads and
> distort the comparison between baseline and new.
> 
> ------------------ 1 Socket Results ------------------
> 
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   8.008    0.1   5.905    0.2      35.6
>         2  13.814    0.2  11.438    0.1      20.7
>         3  19.488    0.2  16.919    0.1      15.1
>         4  25.059    0.1  22.409    0.1      11.8
>         8  47.478    0.1  44.221    0.1       7.3
> 
> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   4.586    0.8   4.596    0.6      -0.3
>         2   7.693    0.2   5.775    1.3      33.2
>         3  10.442    0.3   8.288    0.3      25.9
>         4  13.087    0.2  11.057    0.1      18.3
>         8  24.145    0.2  22.076    0.3       9.3
>        16  43.779    0.1  41.741    0.2       4.8
> 
> KVM 4-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> tbench, average of 11 runs.
> 
>    clients    %speedup
>          1        16.2
>          2        11.7
>          4         9.9
>          8        12.8
>         16        13.7
> 
> KVM 2-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> 
>    Benchmark                     %speedup
>    specjbb2015_critical_jops          5.7
>    mysql_sysb1.0.14_mutex_2          40.6
>    mysql_sysb1.0.14_oltp_2            3.9
> 
> ------------------ 2 Socket Results ------------------
> 
> X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   7.945    0.2   7.219    8.7      10.0
>         2   8.444    0.4   6.689    1.5      26.2
>         3  12.100    1.1   9.962    2.0      21.4
>         4  15.001    0.4  13.109    1.1      14.4
>         8  27.960    0.2  26.127    0.3       7.0
> 
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   5.826    5.4   5.840    5.0      -0.3
>         2   5.041    5.3   6.171   23.4     -18.4
>         3   6.839    2.1   6.324    3.8       8.1
>         4   8.177    0.6   7.318    3.6      11.7
>         8  14.429    0.7  13.966    1.3       3.3
>        16  26.401    0.3  25.149    1.5       4.9
> 
> 
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Oracle database OLTP, logging disabled, NVRAM storage
> 
>    Customers   Users   %speedup
>      1200000      40       -1.2
>      2400000      80        2.7
>      3600000     120        8.9
>      4800000     160        4.4
>      6000000     200        3.0
> 
> X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
> Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
> Results from the Oracle "Performance PIT".
> 
>    Benchmark                                           %speedup
> 
>    mysql_sysb1.0.14_fileio_56_rndrd                        19.6
>    mysql_sysb1.0.14_fileio_56_seqrd                        12.1
>    mysql_sysb1.0.14_fileio_56_rndwr                         0.4
>    mysql_sysb1.0.14_fileio_56_seqrewr                      -0.3
> 
>    pgsql_sysb1.0.14_fileio_56_rndrd                        19.5
>    pgsql_sysb1.0.14_fileio_56_seqrd                         8.6
>    pgsql_sysb1.0.14_fileio_56_rndwr                         1.0
>    pgsql_sysb1.0.14_fileio_56_seqrewr                       0.5
> 
>    opatch_time_ASM_12.2.0.1.0_HP2M                          7.5
>    select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M             5.1
>    select-1_users_asmm_ASM_12.2.0.1.0_HP2M                  4.4
>    swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M           5.8
> 
>    lm3_memlat_L2                                            4.8
>    lm3_memlat_L1                                            0.0
> 
>    ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching     60.1
>    ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent        5.2
>    ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent       -3.0
>    ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
> 
> X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> 
>    NAS_OMP
>    bench class   ncpu    %improved(Mops)
>    dc    B       72      1.3
>    is    C       72      0.9
>    is    D       72      0.7
> 
>    sysbench mysql, average of 24 runs
>            --- base ---     --- new ---
>    nthr   events  %stdev   events  %stdev %speedup
>       1    331.0    0.25    331.0    0.24     -0.1
>       2    661.3    0.22    661.8    0.22      0.0
>       4   1297.0    0.88   1300.5    0.82      0.2
>       8   2420.8    0.04   2420.5    0.04     -0.1
>      16   4826.3    0.07   4825.4    0.05     -0.1
>      32   8815.3    0.27   8830.2    0.18      0.1
>      64  12823.0    0.24  12823.6    0.26      0.0
> 
> --------------------------------------------------------------
> 
> Steve Sistare (10):
>    sched: Provide sparsemask, a reduced contention bitmap
>    sched/topology: Provide hooks to allocate data shared per LLC
>    sched/topology: Provide cfs_overload_cpus bitmap
>    sched/fair: Dynamically update cfs_overload_cpus
>    sched/fair: Hoist idle_stamp up from idle_balance
>    sched/fair: Generalize the detach_task interface
>    sched/fair: Provide can_migrate_task_llc
>    sched/fair: Steal work from an overloaded CPU when CPU goes idle
>    sched/fair: disable stealing if too many NUMA nodes
>    sched/fair: Provide idle search schedstats
> 
>   include/linux/sched/topology.h |   1 +
>   include/linux/sparsemask.h     | 260 +++++++++++++++++++++++++++++++
>   kernel/sched/core.c            |  30 +++-
>   kernel/sched/fair.c            | 338 +++++++++++++++++++++++++++++++++++++----
>   kernel/sched/features.h        |   6 +
>   kernel/sched/sched.h           |  13 +-
>   kernel/sched/stats.c           |  11 +-
>   kernel/sched/stats.h           |  13 ++
>   kernel/sched/topology.c        | 117 +++++++++++++-
>   lib/Makefile                   |   2 +-
>   lib/sparsemask.c               | 142 +++++++++++++++++
>   11 files changed, 898 insertions(+), 35 deletions(-)
>   create mode 100644 include/linux/sparsemask.h
>   create mode 100644 lib/sparsemask.c
> 
> --
> 1.8.3.1
> 
> 

Hi Steve,

I tried your patchset on a ThunderX2 system with 2 nodes. My observations are below.

Hackbench was run on a single node because of variance on 2 nodes. It showed
improvement under load (4 or more groups), but a 6 to 10% regression at 1 to 3
groups.

Single node hackbench numbers:
groups  old time (s)  new time (s)   steals  %change
     1         6.717         7.275       21    -8.31
     2         8.449         9.268      106    -9.69
     3        12.035        12.761   173071    -6.03
     4        14.648         9.787   595889    33.19
     8        22.513        18.329  2397394    18.58
    16        39.861        36.263  3949903     9.06

Times are hackbench runtimes in seconds; "new time" is with the patchset
applied. A positive %change is an improvement.
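
Note that the %change column here and the %speedup column in the quoted cover letter appear to use different formulas, so the two are not directly comparable. The sketch below shows both, inferred from the published numbers rather than from any script; the function names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/*
 * Percent reduction in runtime relative to the old runtime.
 * Consistent with the %change column, e.g. 14.648 -> 9.787 gives about 33.19.
 */
double pct_time_reduction(double old_time, double new_time)
{
    assert(old_time > 0.0);
    return (old_time - new_time) / old_time * 100.0;
}

/*
 * Percent speedup, old/new - 1.
 * Consistent with the cover letter's %speedup, e.g. 8.084 -> 5.938 gives
 * about 36.1.
 */
double pct_speedup(double old_time, double new_time)
{
    assert(new_time > 0.0);
    return (old_time / new_time - 1.0) * 100.0;
}
```

For the same pair of runtimes the speedup form reports a larger magnitude: the 4-group result of 14.648 s -> 9.787 s is about 33.2% as a time reduction and about 49.7% as a speedup.
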

I also tried the benchmarks below with 2 nodes, but observed no performance
benefit or degradation across multiple runs.
  - MySQL (read/write/PS etc with sysbench)
  - HHVM running oss-performance benchmarks

Shijith