Group Imbalance Bug - performance drop by factor 10x on NUMA boxes with cgroups

From: Jirka Hladky @ 2018-10-27 23:25 UTC
  To: Mel Gorman, Srikar Dronamraju; +Cc: linux-kernel

Hi Mel and Srikar,

I would like to ask you if you could look into the Group Imbalance Bug
described in this paper

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

in chapter 3.1. See also comment [1]. The paper describes the bug on a
workload that involves different ssh sessions, and it assumes
kernel.sched_autogroup_enabled=1. We have found that it can be
reproduced more easily with cgroups.

The reproducer consists of this workload:
* 2 separate "stress --cpu 1" processes. Each stress process needs 1 CPU.
* The NAS benchmark (https://www.nas.nasa.gov/publications/npb.html), from
which I use the lu.C.x binary (Lower-Upper Gauss-Seidel solver) in
Open Multi-Processing (OMP) mode.

We run the workload in two modes:

NORMAL - both stress and lu.C.x are run in the same control group
GROUP  - each binary is run in a separate control group:
  stress, first instance:  cpu:test_group_1
  stress, second instance: cpu:test_group_2
  lu.C.x:                  cpu:test_group_main

I run lu.C.x with different numbers of threads. For example, on a
4-node NUMA server with 4x Xeon Gold 6126 CPUs (96 CPUs in total), I run
lu.C.x with 72, 80, 88, and 92 threads. Since the server has 96 CPUs in
total, even with 92 threads for lu.C.x and two stress processes the
server is still not fully loaded.
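
A minimal sketch of how the two modes might be launched. It is an
illustration only: it assumes the libcgroup tool cgexec (whose
"-g controller:path" syntax matches the cpu:test_group_1 notation above),
that the cgroups already exist, and that the binary sits at ./lu.C.x. The
helper name build_commands is hypothetical. The sketch only builds the
command lines and does not execute anything.

```python
import os


def build_commands(mode, threads):
    """Return (argv, env) pairs for one run of the workload."""
    # OMP_NUM_THREADS is the standard OpenMP variable for the thread count.
    omp_env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    jobs = [
        ("test_group_1", ["stress", "--cpu", "1"], dict(os.environ)),
        ("test_group_2", ["stress", "--cpu", "1"], dict(os.environ)),
        ("test_group_main", ["./lu.C.x"], omp_env),  # path is an assumption
    ]
    commands = []
    for group, argv, env in jobs:
        if mode == "GROUP":
            # Assumption: cgexec places the process into cgroup cpu:<group>.
            argv = ["cgexec", "-g", f"cpu:{group}"] + argv
        elif mode != "NORMAL":
            raise ValueError(f"unknown mode: {mode}")
        commands.append((argv, env))
    return commands


for argv, _env in build_commands("GROUP", 80):
    print(" ".join(argv))
```

In NORMAL mode the same three commands are returned without the cgexec
prefix, so all processes stay in whatever control group the caller is in.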

Here are the runtimes of lu.C.x in seconds for different numbers of threads:

#Threads  NORMAL [s]  GROUP [s]
72        21.27       30.01
80        15.32       164
88        17.91       367
92        19.22       432

As you can see, lu.C.x is already significantly slower with 72 threads
when executed in a dedicated cgroup, and it gets much worse as the
number of threads increases (a slowdown by a factor of 10 and more).
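
The slowdown factors can be recomputed from the runtimes above with a
short script. This is only a sketch; the dictionary layout and the
function name slowdown are chosen for illustration.

```python
# Runtimes in seconds, copied from the table: threads -> (NORMAL, GROUP).
runtimes = {
    72: (21.27, 30.01),
    80: (15.32, 164.0),
    88: (17.91, 367.0),
    92: (19.22, 432.0),
}


def slowdown(normal, group):
    """How many times slower the GROUP run is than the NORMAL run."""
    return group / normal


slowdowns = {t: slowdown(n, g) for t, (n, g) in runtimes.items()}
for threads, factor in slowdowns.items():
    print(f"{threads} threads: {factor:.1f}x slower in GROUP mode")
```

This gives roughly 1.4x for 72 threads, 10.7x for 80, 20.5x for 88 and
22.5x for 92 threads.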

Some more details are below.

Please let me know if it sounds interesting and if you would like to
look into it. I can provide you with the reproducer plus some
supplementary python scripts to further analyze the results.

Thanks a lot!
Jirka

Some more details on the case with 80 threads for lu.C.x and 2 stress
processes, run on a 96-CPU server with 4 NUMA nodes.

Analyzing ps output is very interesting (here for 5 subsequent runs of
the workload):
========================================================
Average number of threads scheduled for NUMA NODE  0      1      2      3
========================================================
lu.C.x_80_NORMAL_1.ps.numa.hist         Average    21.25  21.00  19.75  18.00
lu.C.x_80_NORMAL_1.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_NORMAL_2.ps.numa.hist         Average    20.50  20.75  18.00  20.75
lu.C.x_80_NORMAL_2.stress.ps.numa.hist  Average    1.00  0.75  0.25
lu.C.x_80_NORMAL_3.ps.numa.hist         Average    21.75  22.00  18.75  17.50
lu.C.x_80_NORMAL_3.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_NORMAL_4.ps.numa.hist         Average    21.50  21.00  18.75  18.75
lu.C.x_80_NORMAL_4.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_NORMAL_5.ps.numa.hist         Average    18.00  23.33  19.33  19.33
lu.C.x_80_NORMAL_5.stress.ps.numa.hist  Average    1.00  1.00

As you can see, in NORMAL mode lu.C.x is uniformly scheduled over NUMA nodes.

Compare it with cgroups mode:
========================================================
Average number of threads scheduled for NUMA NODE  0      1      2      3
========================================================
lu.C.x_80_GROUP_1.ps.numa.hist         Average    13.05  13.54  27.65  25.76
lu.C.x_80_GROUP_1.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_GROUP_2.ps.numa.hist         Average    12.18  14.85  27.56  25.41
lu.C.x_80_GROUP_2.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_GROUP_3.ps.numa.hist         Average    15.32  13.23  26.52  24.94
lu.C.x_80_GROUP_3.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_GROUP_4.ps.numa.hist         Average    13.82  14.86  25.64  25.68
lu.C.x_80_GROUP_4.stress.ps.numa.hist  Average    1.00  1.00
lu.C.x_80_GROUP_5.ps.numa.hist         Average    15.12  13.03  25.12  26.73
lu.C.x_80_GROUP_5.stress.ps.numa.hist  Average    1.00  1.00

In cgroup mode, the scheduler moves lu.C.x away from nodes #0 and #1,
where the stress processes are running. It does so to such an extent
that NUMA nodes #2 and #3 are overcommitted: these NUMA nodes have
more NAS threads scheduled than CPUs available (there are 24 CPUs in
each NUMA node).
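
To make the overcommit concrete, the per-run averages from the GROUP
table can be compared with the 24 CPUs per node. This is a sketch under
one assumption taken from the text above: one stress process runs on
each of nodes #0 and #1. The name node_balance is illustrative.

```python
CPUS_PER_NODE = 24  # 24 CPUs in each NUMA node

# Average lu.C.x threads per NUMA node for the five GROUP runs (table above).
group_runs = [
    [13.05, 13.54, 27.65, 25.76],
    [12.18, 14.85, 27.56, 25.41],
    [15.32, 13.23, 26.52, 24.94],
    [13.82, 14.86, 25.64, 25.68],
    [15.12, 13.03, 25.12, 26.73],
]
# Assumption: one stress process on node 0 and one on node 1.
STRESS_PER_NODE = [1, 1, 0, 0]


def node_balance(averages):
    """Per node: positive = threads beyond the CPUs, negative = idle CPUs."""
    return [
        round(avg + stress - CPUS_PER_NODE, 2)
        for avg, stress in zip(averages, STRESS_PER_NODE)
    ]


for i, run in enumerate(group_runs, start=1):
    print(f"GROUP run {i}: {node_balance(run)}")
```

In every run nodes #0 and #1 come out with about 8 to 11 idle CPUs on
average, while nodes #2 and #3 carry up to about 3.7 threads more than
they have CPUs.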

Here is the detailed report:
$ more lu.C.x_80_GROUP_1.ps.numa.hist
#Date                   NUMA 0  NUMA 1  NUMA 2  NUMA 3
2018-Oct-27_04h39m57s    6       7       37      30
2018-Oct-27_04h40m02s    16      15      23      26
2018-Oct-27_04h40m08s    13      12      27      28
2018-Oct-27_04h40m13s    9       15      29      27
2018-Oct-27_04h40m18s    16      13      27      24
2018-Oct-27_04h40m23s    16      14      25      25
2018-Oct-27_04h40m28s    16      15      24      25
2018-Oct-27_04h40m33s    10      11      34      25
2018-Oct-27_04h40m38s    16      13      25      26
2018-Oct-27_04h40m43s    10      10      32      28
2018-Oct-27_04h40m48s    12      16      26      26
2018-Oct-27_04h40m53s    13      11      30      26
2018-Oct-27_04h40m58s    13      14      28      25
2018-Oct-27_04h41m03s    11      15      28      26
2018-Oct-27_04h41m08s    13      15      28      24
2018-Oct-27_04h41m13s    14      17      25      24
2018-Oct-27_04h41m18s    14      17      24      25
2018-Oct-27_04h41m24s    13      12      28      27
2018-Oct-27_04h41m29s    11      12      30      27
2018-Oct-27_04h41m34s    13      15      26      26
2018-Oct-27_04h41m39s    13      15      27      25
2018-Oct-27_04h41m44s    13      15      26      26
2018-Oct-27_04h41m49s    12      7       36      25
2018-Oct-27_04h41m54s    14      13      27      26
2018-Oct-27_04h41m59s    16      13      25      26
2018-Oct-27_04h42m04s    15      14      26      25
2018-Oct-27_04h42m09s    16      12      26      26
2018-Oct-27_04h42m14s    12      15      27      26
2018-Oct-27_04h42m19s    13      15      26      26
2018-Oct-27_04h42m24s    14      15      26      25
2018-Oct-27_04h42m29s    14      15      26      25
2018-Oct-27_04h42m34s    8       11      36      25
2018-Oct-27_04h42m39s    13      14      28      25
2018-Oct-27_04h42m45s    13      16      26      25
2018-Oct-27_04h42m50s    13      16      27      24
2018-Oct-27_04h42m55s    13      16      26      25
2018-Oct-27_04h43m00s    16      10      26      28
Average                  13.05   13.54   27.65   25.76

Please notice that nodes #2 and #3 have SIGNIFICANTLY more threads
scheduled (up to 37!) than the number of available CPUs (24), while
nodes #0 and #1 have plenty of idle cores. I think that this is the best
illustration of the Group Imbalance bug: some cores are overcommitted
while other cores are idle, and this imbalance does not improve over
time.
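
A report in this format can be scanned for overcommitted samples with a
few lines of Python. The sketch below is not one of the supplementary
scripts mentioned above; the function names are illustrative, and only
the first five samples of the report are embedded to keep it short.

```python
CPUS_PER_NODE = 24

# First five samples of lu.C.x_80_GROUP_1.ps.numa.hist, copied from the report.
SAMPLE = """\
#Date                   NUMA 0  NUMA 1  NUMA 2  NUMA 3
2018-Oct-27_04h39m57s    6       7       37      30
2018-Oct-27_04h40m02s    16      15      23      26
2018-Oct-27_04h40m08s    13      12      27      28
2018-Oct-27_04h40m13s    9       15      29      27
2018-Oct-27_04h40m18s    16      13      27      24
"""


def parse_hist(text):
    """Return (timestamp, [threads per node]) for every sample line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the '#Date' header and the trailing 'Average' row.
        if not fields or line.startswith("#") or fields[0] == "Average":
            continue
        rows.append((fields[0], [int(v) for v in fields[1:]]))
    return rows


def overcommitted(rows, capacity=CPUS_PER_NODE):
    """Yield (timestamp, node, threads) for each node above its CPU count."""
    for stamp, counts in rows:
        for node, threads in enumerate(counts):
            if threads > capacity:
                yield stamp, node, threads


rows = parse_hist(SAMPLE)
hits = list(overcommitted(rows))
for stamp, node, threads in hits:
    print(f"{stamp}: node {node} has {threads} threads for {CPUS_PER_NODE} CPUs")
```

Feeding the whole file to parse_hist instead of SAMPLE applies the same
check to every sample of a run.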

[1]
There are four bugs described in the paper. I have actively worked
on all of them over the past two years; this is the current status: the
Group Construction bug and the Missing Scheduling Domains bug are fixed.
I was not able to reproduce the Overload on Wakeup bug (despite great
effort). I have created a reproducer for the Group Imbalance Bug using
cgroups, but it seems there is no easy fix.
