From: Shijith Thotton <sthotton@marvell.com>
To: Steve Sistare <steven.sistare@oracle.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"peterz@infradead.org" <peterz@infradead.org>
Cc: "subhra.mazumdar@oracle.com" <subhra.mazumdar@oracle.com>,
	"dhaval.giani@oracle.com" <dhaval.giani@oracle.com>,
	"daniel.m.jordan@oracle.com" <daniel.m.jordan@oracle.com>,
	"pavel.tatashin@microsoft.com" <pavel.tatashin@microsoft.com>,
	"matt@codeblueprint.co.uk" <matt@codeblueprint.co.uk>,
	"umgwanakikbuti@gmail.com" <umgwanakikbuti@gmail.com>,
	"riel@redhat.com" <riel@redhat.com>,
	"jbacik@fb.com" <jbacik@fb.com>,
	"juri.lelli@redhat.com" <juri.lelli@redhat.com>,
	"valentin.schneider@arm.com" <valentin.schneider@arm.com>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"quentin.perret@arm.com" <quentin.perret@arm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
	Ganapatrao Kulkarni <gkulkarni@marvell.com>
Subject: Re: [PATCH v4 00/10] steal tasks to improve CPU utilization
Date: Fri, 4 Jan 2019 13:44:11 +0000	[thread overview]
Message-ID: <DM6PR18MB246030C5A36973675B7DF38ED98E0@DM6PR18MB2460.namprd18.prod.outlook.com> (raw)
In-Reply-To: 1544131696-2888-1-git-send-email-steven.sistare@oracle.com

On 07-Dec-18 3:09 AM, Steve Sistare wrote:
> 
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates.  To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed.  For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
> 
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle.  idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy.  Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
> 
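(For readers skimming the cover letter, the steal path described above boils
down to roughly the sketch below.  This is an illustrative C sketch only, not
code from the series; try_steal(), the bitmap iterator and steal_from() are
placeholder names.)

  /* Sketch: invoked when a CPU is about to go idle and the pick-next path
   * found no CFS task to run.  Kernel context assumed (struct rq, cpu_rq()). */
  static int try_steal(struct rq *dst_rq)
  {
          struct sparsemask *overload_cpus = dst_rq->cfs_overload_cpus;
          int cpu;

          /* Visit overloaded CPUs in this LLC; take the first migratable
           * task found, searching each CPU in order of next to run. */
          for_each_sparse_cpu(cpu, overload_cpus) {      /* placeholder iterator */
                  if (cpu == dst_rq->cpu)
                          continue;
                  if (steal_from(dst_rq, cpu_rq(cpu)))   /* placeholder helper */
                          return 1;
          }
          return 0;
  }

(Because the search stops at the first migratable task, the cost is low enough
to pay on every entry to idle, which is the point of the comparison with
idle_balance() above.)
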
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
> 
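(One way to picture the reduced-contention layout: give each word of mask bits
its own cache line, so that CPUs setting and clearing different elements rarely
write the same line.  The sketch below is illustrative only; the real structure
is kernel/sched/sparsemask.h from patch 1, whose layout differs in detail.)

  /* Sketch of a cacheline-spread bitmap; kernel context assumed
   * (____cacheline_aligned from <linux/cache.h>, set_bit() and
   * BITS_PER_LONG from <linux/bitops.h>). */
  struct sparsemask_chunk {
          unsigned long word;     /* bits live here; rest of the line is padding */
  } ____cacheline_aligned;

  struct sparsemask_sketch {
          int nelems;
          struct sparsemask_chunk chunk[];   /* one line per BITS_PER_LONG elements */
  };

  static inline void sparsemask_set_elem(struct sparsemask_sketch *m, int elem)
  {
          set_bit(elem % BITS_PER_LONG, &m->chunk[elem / BITS_PER_LONG].word);
  }
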
> Patch 1 defines the sparsemask type and its operations.
> 
> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
> 
> Patches 5 and 6 refactor existing code for a cleaner merge of later
>    patches.
> 
> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
> 
> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> time being because of performance regressions that are not due to stealing
> per se.  See the patch description for details.
> 
> Patch 10 adds schedstats for comparing the new behavior to the old, and is
>    provided as a convenience for developers only, not for integration.
> 
> The patch series is based on kernel 4.20.0-rc1.  It compiles, boots, and
> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
> bandwidth control were tested.
> 
> Stealing improves utilization with only a modest CPU overhead in scheduler
> code.  In the following experiment, hackbench is run with varying numbers
> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> for each run, averaged per CPU, augmented with these non-standard stats:
> 
>    %find - percent of time spent in old and new functions that search for
>      idle CPUs and tasks to steal, and that set the overloaded CPUs bitmap.
> 
>    steal - number of times a task is stolen from another CPU.
> 
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> hackbench <grps> process 100000
> sched_wakeup_granularity_ns=15000000
> 
>    baseline
>    grps  time  %busy  slice   sched   idle     wake %find  steal
>    1    8.084  75.02   0.10  105476  46291    59183  0.31      0
>    2   13.892  85.33   0.10  190225  70958   119264  0.45      0
>    3   19.668  89.04   0.10  263896  87047   176850  0.49      0
>    4   25.279  91.28   0.10  322171  94691   227474  0.51      0
>    8   47.832  94.86   0.09  630636 144141   486322  0.56      0
> 
>    new
>    grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
>    1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
>    2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
>    3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
>    4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
>    8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
> 
> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
> The cost is at most 0.4% more find time.
> 
> Additional performance results follow.  A negative "speedup" is a
> regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
> is set to 15 msec.  Otherwise, preemptions increase at higher loads and
> distort the comparison between baseline and new.
> 
> ------------------ 1 Socket Results ------------------
> 
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   8.008    0.1   5.905    0.2      35.6
>         2  13.814    0.2  11.438    0.1      20.7
>         3  19.488    0.2  16.919    0.1      15.1
>         4  25.059    0.1  22.409    0.1      11.8
>         8  47.478    0.1  44.221    0.1       7.3
> 
> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   4.586    0.8   4.596    0.6      -0.3
>         2   7.693    0.2   5.775    1.3      33.2
>         3  10.442    0.3   8.288    0.3      25.9
>         4  13.087    0.2  11.057    0.1      18.3
>         8  24.145    0.2  22.076    0.3       9.3
>        16  43.779    0.1  41.741    0.2       4.8
> 
> KVM 4-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> tbench, average of 11 runs.
> 
>    clients    %speedup
>          1        16.2
>          2        11.7
>          4         9.9
>          8        12.8
>         16        13.7
> 
> KVM 2-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> 
>    Benchmark                     %speedup
>    specjbb2015_critical_jops          5.7
>    mysql_sysb1.0.14_mutex_2          40.6
>    mysql_sysb1.0.14_oltp_2            3.9
> 
> ------------------ 2 Socket Results ------------------
> 
> X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   7.945    0.2   7.219    8.7      10.0
>         2   8.444    0.4   6.689    1.5      26.2
>         3  12.100    1.1   9.962    2.0      21.4
>         4  15.001    0.4  13.109    1.1      14.4
>         8  27.960    0.2  26.127    0.3       7.0
> 
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   5.826    5.4   5.840    5.0      -0.3
>         2   5.041    5.3   6.171   23.4     -18.4
>         3   6.839    2.1   6.324    3.8       8.1
>         4   8.177    0.6   7.318    3.6      11.7
>         8  14.429    0.7  13.966    1.3       3.3
>        16  26.401    0.3  25.149    1.5       4.9
> 
> 
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Oracle database OLTP, logging disabled, NVRAM storage
> 
>    Customers   Users   %speedup
>      1200000      40       -1.2
>      2400000      80        2.7
>      3600000     120        8.9
>      4800000     160        4.4
>      6000000     200        3.0
> 
> X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
> Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
> Results from the Oracle "Performance PIT".
> 
>    Benchmark                                           %speedup
> 
>    mysql_sysb1.0.14_fileio_56_rndrd                        19.6
>    mysql_sysb1.0.14_fileio_56_seqrd                        12.1
>    mysql_sysb1.0.14_fileio_56_rndwr                         0.4
>    mysql_sysb1.0.14_fileio_56_seqrewr                      -0.3
> 
>    pgsql_sysb1.0.14_fileio_56_rndrd                        19.5
>    pgsql_sysb1.0.14_fileio_56_seqrd                         8.6
>    pgsql_sysb1.0.14_fileio_56_rndwr                         1.0
>    pgsql_sysb1.0.14_fileio_56_seqrewr                       0.5
> 
>    opatch_time_ASM_12.2.0.1.0_HP2M                          7.5
>    select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M             5.1
>    select-1_users_asmm_ASM_12.2.0.1.0_HP2M                  4.4
>    swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M           5.8
> 
>    lm3_memlat_L2                                            4.8
>    lm3_memlat_L1                                            0.0
> 
>    ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching     60.1
>    ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent        5.2
>    ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent       -3.0
>    ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
> 
> X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> 
>    NAS_OMP
>    bench class   ncpu    %improved(Mops)
>    dc    B       72      1.3
>    is    C       72      0.9
>    is    D       72      0.7
> 
>    sysbench mysql, average of 24 runs
>            --- base ---     --- new ---
>    nthr   events  %stdev   events  %stdev %speedup
>       1    331.0    0.25    331.0    0.24     -0.1
>       2    661.3    0.22    661.8    0.22      0.0
>       4   1297.0    0.88   1300.5    0.82      0.2
>       8   2420.8    0.04   2420.5    0.04     -0.1
>      16   4826.3    0.07   4825.4    0.05     -0.1
>      32   8815.3    0.27   8830.2    0.18      0.1
>      64  12823.0    0.24  12823.6    0.26      0.0
> 
> --------------------------------------------------------------
> 
> Changes from v1 to v2:
>    - Remove stray find_time hunk from patch 5
>    - Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
>    - Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2
>    - Steal iff avg_idle exceeds the cost of stealing
> 
> Changes from v2 to v3:
>    - Update series for kernel 4.20.  Context changes only.
> 
> Changes from v3 to v4:
>    - Avoid 64-bit division on 32-bit processors in compute_skid()
>    - Replace IF_SMP with inline functions to set idle_stamp
>    - Push ZALLOC_MASK body into calling function
>    - Set rq->cfs_overload_cpus in update_top_cache_domain instead of
>      cpu_attach_domain
>    - Rewrite sparsemask iterator for complete inlining
>    - Cull and clean up sparsemask functions and move them all into
>      sched/sparsemask.h
> 
> Steve Sistare (10):
>    sched: Provide sparsemask, a reduced contention bitmap
>    sched/topology: Provide hooks to allocate data shared per LLC
>    sched/topology: Provide cfs_overload_cpus bitmap
>    sched/fair: Dynamically update cfs_overload_cpus
>    sched/fair: Hoist idle_stamp up from idle_balance
>    sched/fair: Generalize the detach_task interface
>    sched/fair: Provide can_migrate_task_llc
>    sched/fair: Steal work from an overloaded CPU when CPU goes idle
>    sched/fair: disable stealing if too many NUMA nodes
>    sched/fair: Provide idle search schedstats
> 
>   include/linux/sched/topology.h |   1 +
>   kernel/sched/core.c            |  31 +++-
>   kernel/sched/fair.c            | 354 +++++++++++++++++++++++++++++++++++++----
>   kernel/sched/features.h        |   6 +
>   kernel/sched/sched.h           |  13 +-
>   kernel/sched/sparsemask.h      | 210 ++++++++++++++++++++++++
>   kernel/sched/stats.c           |  11 +-
>   kernel/sched/stats.h           |  13 ++
>   kernel/sched/topology.c        | 121 +++++++++++++-
>   9 files changed, 726 insertions(+), 34 deletions(-)
>   create mode 100644 kernel/sched/sparsemask.h
> 
> --
> 1.8.3.1
> 
> 

Hi Steve,

I tried your patchset on ThunderX2 with 2 NUMA nodes. Please find my observations below.

Hackbench was run on a single node due to run-to-run variance with 2 nodes, and it
showed improvement at higher load (4 or more groups).

Single node hackbench numbers:
   group    old time    new time     steals    %change
       1       6.717       7.275         21      -8.31
       2       8.449       9.268        106      -9.69
       3      12.035      12.761     173071      -6.03
       4      14.648       9.787     595889      33.19
       8      22.513      18.329    2397394      18.58
      16      39.861      36.263    3949903       9.06

The "new time" column shows hackbench runtime in seconds with the patchset; %change
appears to be computed relative to the old time.

I tried the benchmarks below with 2 nodes, but no performance benefit or degradation
was observed across multiple runs.
   - MySQL (read/write/PS etc with sysbench)
   - HHVM running oss-performance benchmarks

Shijith

Thread overview: 39+ messages
2018-12-06 21:28 [PATCH v4 00/10] steal tasks to improve CPU utilization Steve Sistare
2018-12-06 21:28 ` [PATCH v4 01/10] sched: Provide sparsemask, a reduced contention bitmap Steve Sistare
2019-01-31 19:18   ` Tim Chen
2018-12-06 21:28 ` [PATCH v4 02/10] sched/topology: Provide hooks to allocate data shared per LLC Steve Sistare
2018-12-06 21:28 ` [PATCH v4 03/10] sched/topology: Provide cfs_overload_cpus bitmap Steve Sistare
2018-12-07 20:20   ` Valentin Schneider
2018-12-07 22:35     ` Steven Sistare
2018-12-08 18:33       ` Valentin Schneider
2018-12-06 21:28 ` [PATCH v4 04/10] sched/fair: Dynamically update cfs_overload_cpus Steve Sistare
2018-12-07 20:20   ` Valentin Schneider
2018-12-07 22:35     ` Steven Sistare
2018-12-08 18:47       ` Valentin Schneider
2018-12-06 21:28 ` [PATCH v4 05/10] sched/fair: Hoist idle_stamp up from idle_balance Steve Sistare
2018-12-06 21:28 ` [PATCH v4 06/10] sched/fair: Generalize the detach_task interface Steve Sistare
2018-12-06 21:28 ` [PATCH v4 07/10] sched/fair: Provide can_migrate_task_llc Steve Sistare
2018-12-06 21:28 ` [PATCH v4 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle Steve Sistare
2018-12-07 20:21   ` Valentin Schneider
2018-12-07 22:36     ` Steven Sistare
2018-12-08 18:39       ` Valentin Schneider
2018-12-06 21:28 ` [PATCH v4 09/10] sched/fair: disable stealing if too many NUMA nodes Steve Sistare
2018-12-07 11:43   ` Valentin Schneider
2018-12-07 13:37     ` Steven Sistare
2018-12-06 21:28 ` [PATCH v4 10/10] sched/fair: Provide idle search schedstats Steve Sistare
2018-12-07 11:56   ` Valentin Schneider
2018-12-07 13:45     ` Steven Sistare
2018-12-24 12:25   ` Rick Lindsley
2019-01-14 17:04     ` Steven Sistare
2018-12-07 20:30 ` [PATCH v4 00/10] steal tasks to improve CPU utilization Valentin Schneider
2018-12-07 22:36   ` Steven Sistare
2019-02-01 15:07   ` Valentin Schneider
2018-12-10 16:10 ` Vincent Guittot
2018-12-10 16:29   ` Steven Sistare
2018-12-10 16:33     ` Vincent Guittot
2018-12-10 17:08       ` Vincent Guittot
2018-12-10 17:20         ` Steven Sistare
2018-12-10 17:06     ` Valentin Schneider
2019-01-04 13:44 ` Shijith Thotton [this message]
2019-01-14 16:55 ` Steven Sistare
2019-01-31 17:16   ` Dhaval Giani
