* [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
@ 2015-08-18 15:55 Dario Faggioli
  2015-08-18 16:53 ` Konrad Rzeszutek Wilk
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dario Faggioli @ 2015-08-18 15:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, linux-kernel,
	Stefano Stabellini, George Dunlap


Hey everyone,

So, as a followup of what we were discussing in this thread:

 [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
 http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html

I started looking in more detail at scheduling domains in the Linux
kernel. Now, that thread was about CPUID and vNUMA, and their weird way
of interacting, while what I'm proposing here is completely
independent of them both.

In fact, no matter whether vNUMA is supported and enabled, and no matter
whether CPUID is reporting accurate, random, meaningful or completely
misleading information, I think that we should do something about how
scheduling domains are built.

Fact is, unless we use 1:1 and immutable (across the guest's entire
lifetime) pinning, scheduling domains should not be constructed, in
Linux, by looking at *any* topology information, because that just does
not make any sense when vCPUs move around.

Let me state this again (hoping to make myself as clear as possible): no
matter how good a shape we put CPUID support in, and no matter how
beautifully and consistently that will interact with vNUMA, licensing
requirements and whatever else, it will always be possible for
vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
on two different NUMA nodes at time t2. Hence, the Linux scheduler
really should not skew its load balancing logic toward either of those
situations, as neither of them can be considered correct (since
nothing is!).

For now, this only covers the PV case. The HVM case shouldn't be any
different, but I haven't yet looked at how to make the same thing happen
there as well.

OVERALL DESCRIPTION
===================
What this RFC patch does is, in the Xen PV case, configure scheduling
domains in such a way that there is only one of them, spanning all the
pCPUs of the guest.
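
Concretely, this is done through the kernel's set_sched_topology() hook:
instead of the default, multi-level x86 table (SMT, MC, DIE, plus NUMA
levels appended at boot where applicable), the patch installs a table
with a single level, whose mask callback simply returns all online CPUs.
For reference, the default table looks roughly like this (a sketch of
default_topology[] from kernel/sched/core.c of this era; the exact flags
callbacks and config options may differ):

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

The patch below swaps this out for a single 'VCPU' level built on
cpu_online_mask, so the scheduler sees one flat group of equivalent CPUs
instead of an SMT/MC/DIE (and possibly NUMA) hierarchy.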

Note that the patch deals directly with scheduling domains, and there is
no need to alter the masks that will then be used for building and
reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
the main difference between it and the patch proposed by Juergen here:
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html

This means that when, in the future, we fix CPUID handling and make it
comply with whatever logic or requirements we want, that won't have any
unexpected side effects on scheduling domains.

Information about how the scheduling domains are being constructed
during boot is available in `dmesg', if the kernel is booted with the
'sched_debug' parameter. It is also possible to look
at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.

With the patch applied, only one scheduling domain is created, called
the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
tell that from the fact that every cpu* folder
in /proc/sys/kernel/sched_domain/ has only one subdirectory
('domain0'), with all the tweaks and the tunables for our scheduling
domain.
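
For instance, a quick way to double check this from userspace is to read
the 'name' file of each CPU's domain0 directory. Here is a minimal
sketch (it assumes CONFIG_SCHED_DEBUG, so that the sched_domain sysctl
tree is actually there, and it just stops at the first missing cpu*
entry):

/* Print the name of each CPU's only remaining scheduling domain. */
#include <stdio.h>

int main(void)
{
	char path[128], name[64];
	int cpu;

	for (cpu = 0; ; cpu++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/proc/sys/kernel/sched_domain/cpu%d/domain0/name", cpu);
		f = fopen(path, "r");
		if (!f)
			break;		/* no more CPUs (or no sched_domain tree) */
		if (fgets(name, sizeof(name), f))
			printf("cpu%d: %s", cpu, name);
		fclose(f);
	}
	return 0;
}

With the patch applied this prints 'VCPU' for every vCPU (and there is
no domain1/ subdirectory at all), while a vanilla kernel would report
the SMT/MC/DIE/NUMA levels derived from the, possibly meaningless,
virtual topology.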

EVALUATION
==========
I've tested this with UnixBench, and by looking at Xen build times, on
16-, 24- and 48-pCPU hosts. I've run the benchmarks in Dom0 only, for
now, but I plan to re-run them in DomUs soon (Juergen may be doing
something similar to this in a DomU already, AFAUI).

I've run the benchmarks with and without the patch applied ('patched'
and 'vanilla', respectively, in the tables below), and with different
numbers of build jobs (in the case of the Xen build) or of parallel
copies of the benchmark (in the case of UnixBench).

What I get from the numbers is that the patch almost always brings
benefits, in some cases even huge ones. There are a couple of cases
where we regress, but always only slightly, especially when compared
to the magnitude of some of the improvements that we get.

Bear also in mind that these results are gathered from Dom0, and without
any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
we move things into DomUs and overcommit at the Xen scheduler level, I
am expecting even better results.

RESULTS
=======
To have a quick idea of how a benchmark went, look at the '%
improvement' row of each table.

I'll put these results online, in a googledoc spreadsheet or something
like that, to make them easier to read, as soon as possible.

*** Intel(R) Xeon(R) E5620 @ 2.40GHz                                                                                                                    
*** pCPUs      16        DOM0 vCPUS  16
*** RAM        12285 MB  DOM0 Memory 9955 MB
*** NUMA nodes 2         
=======================================================================================================================================
MAKE XEN (lower == better)                                                                                                                            
=======================================================================================================================================
# of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24                
vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
---------------------------------------------------------------------------------------------------------------------------------------
                              153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
                              153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
                              153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
                              153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
                              153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
                              154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
                              154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
                              154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
                              154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
                              154.9      154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
---------------------------------------------------------------------------------------------------------------------------------------
 Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
---------------------------------------------------------------------------------------------------------------------------------------
 Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
---------------------------------------------------------------------------------------------------------------------------------------
 % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies                            1 parallel            6 parallel            8 parallel            16 parallel**         24 parallel
vanilla/patched                          vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
====================================================================================================================================================

*** Intel(R) Xeon(R) X5650 @ 2.67GHz
*** pCPUs      24        DOM0 vCPUS  16
*** RAM        36851 MB  DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
---------------------------------------------------------------------------------------------------------------------------------------
                              119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
                              119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
                              119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
                              119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
                              119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
                              119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
                              119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
                              120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
                              120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
                              120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
---------------------------------------------------------------------------------------------------------------------------------------
 Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
---------------------------------------------------------------------------------------------------------------------------------------
 Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
---------------------------------------------------------------------------------------------------------------------------------------
 % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies                            1 parallel            8 parallel             12 parallel           24 parallel**         32 parallel
vanilla/patched                          vanilla     patched   vanilla     patched    vanilla    patched    vanilla    patched    vanilla    patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
====================================================================================================================================================

*** Intel(R) Xeon(R) X5650 @ 2.67GHz
*** pCPUs      48        DOM0 vCPUS  16
*** RAM        393138 MB DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
---------------------------------------------------------------------------------------------------------------------------------------
                              267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
                              268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
                              268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
                              268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
                              269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
                              269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
                              269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
                              270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
                              270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
                              271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
---------------------------------------------------------------------------------------------------------------------------------------
 Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
---------------------------------------------------------------------------------------------------------------------------------------
 Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
---------------------------------------------------------------------------------------------------------------------------------------
 % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies                            1 parallel            20 parallel            24 parallel           48 parallel**         62 parallel
vanilla/patched                          vanilla     patched   vanilla     patched    vanilla    patched    vanilla    patched    vanilla    patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables       2037.6     2037.5    39615.4    38990.5    43976.8    44660.8      51238    51117.4    51672.5    52332.5
Double-Precision Whetstone                  525.1      521.6    10389.7    10429.3    12236.5    12188.8    20897.1    20921.9    26957.5    27035.7
Execl Throughput                            112.1      113.6        799      786.5      715.1      702.3      758.2        744      756.3      765.6
File Copy 1024 bufsize 2000 maxblocks       605.5        622      671.6      630.4      624.3      605.8        599      581.2      447.4      433.7
File Copy 256 bufsize 500 maxblocks           384      382.7      447.2      429.1      464.5      404.3      416.1      428.5      313.8      305.6
File Copy 4096 bufsize 8000 maxblocks       883.7     1100.5       1326       1307     1343.2     1305.9     1260.4     1245.3     1001.4      920.1
Pipe Throughput                             283.7      282.8     5636.6     5634.2       6551       6571      10390    10437.4      10459    10498.9
Pipe-based Context Switching                 41.5      143.7      518.5     1899.1      737.5     2068.8     2877.1     3093.2     2949.3     3184.1
Process Creation                             58.5       78.4      370.7      389.4        338      355.8      380.1      375.5      383.8      369.6
Shell Scripts (1 concurrent)                443.7      475.5     1901.9       1945     1765.1     1789.6       2417     2354.4     2395.3     2362.2
Shell Scripts (8 concurrent)               1283.1     1319.1     2265.4     2209.8     2263.3       2209     2202.7     2216.1     2190.4     2206.5
System Call Overhead                        254.1      254.3      891.6      881.6      971.1      958.3     1446.8     1409.5     1461.7     1429.2
System Benchmarks Index Score               340.8      398.6     1690.6     1866.3     1770.6       1902     2303.5     2300.8     2208.3     2189.8
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score)                       16.960                10.393                 7.421                -0.117                -0.838
====================================================================================================================================================

OVERHEAD EVALUATION
===================

For the Xen build case only, I quickly checked some scheduling related
metrics with `perf stat'. I only did this on the biggest box, for now,
as that is where we see both the largest improvement (in the "-j1" case)
and a couple of slight regressions (although those happen in UnixBench).

We see that using only one, "flat", scheduling domain always means fewer
migrations, while it seems to increase the number of context
switches.
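
(For reference, counters like these can be collected with something along
the lines of `perf stat -e cpu-migrations,context-switches -a -- make -jN';
take that command line just as an illustration of the kind of invocation,
not necessarily the exact one used for the numbers below.)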

===============================================================================================================================================================
                        “-j1”                                  “-j24”                               “-j48”                                “-j62”
---------------------------------------------------------------------------------------------------------------------------------------------------------------
            cpu-migrations  context-switches      cpu-migrations   context-switches      cpu-migrations  context-switches      cpu-migrations  context-switches
---------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla  21,242(0.074 K/s) 46,196(0.160 K/s)   22,992(0.066 K/s)  48,684(0.140 K/s)   24,516(0.064 K/s) 63,391(0.166 K/s)   23,164(0.062 K/s) 68,239(0.182 K/s)
patched  19,522(0.077 K/s) 50,871(0.201 K/s)   20,593(0.059 K/s)  57,688(0.167 K/s)   21,137(0.056 K/s) 63,822(0.169 K/s)   20,830(0.055 K/s) 69,783(0.185 K/s)
===============================================================================================================================================================

REQUEST FOR COMMENTS
====================
Basically, the kind of feedback I'd be really glad to hear is:
 - what you guys think of the approach,
 - whether you think, looking at this preliminary set of numbers, that
   this is something worth continuing to investigate,
 - if yes, what other workloads and benchmarks it would make sense to
   throw at it.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
---
commit 3240f68a08511c3db616cfc2a653e6761e23ff7f
Author: Dario Faggioli <dario.faggioli@citrix.com>
Date:   Tue Aug 18 08:41:38 2015 -0700

    xen: if on Xen, "flatten" the scheduling domain hierarchy
    
    With this patch applied, only one scheduling domain is
    created (called the 'VCPU' domain) spanning all the
    guest's vCPUs.
    
    This is because, since vCPUs are moving around on pCPUs,
    there is no point in building a full hierarchy based on
    *any* topology information, which will just never be
    accurate. Having only one "flat" domain is really the
    only thing that looks sensible.
    
    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 8648438..34f39f1 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -55,6 +55,21 @@ static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id);
 
+const struct cpumask *xen_pcpu_sched_domain_mask(int cpu)
+{
+	return cpu_online_mask;
+}
+
+static struct sched_domain_topology_level xen_sched_domain_topology[] = {
+        { xen_pcpu_sched_domain_mask, SD_INIT_NAME(VCPU) },
+        { NULL, },
+};
+
+static void xen_set_sched_topology(void)
+{
+        set_sched_topology(xen_sched_domain_topology);
+}
+
 /*
  * Reschedule call back.
  */
@@ -335,6 +350,8 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 	}
 	set_cpu_sibling_map(0);
 
+	xen_set_sched_topology();
+
 	if (xen_smp_intr_init(0))
 		BUG();
 


* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
@ 2015-08-18 16:53 ` Konrad Rzeszutek Wilk
  2015-08-20 18:16 ` Juergen Groß
  2015-08-27 10:24 ` George Dunlap
  2 siblings, 0 replies; 22+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-08-18 16:53 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel, herbert.van.den.bergh
  Cc: Juergen Gross, Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Boris Ostrovsky, linux-kernel, Stefano Stabellini, George Dunlap

On August 18, 2015 8:55:32 AM PDT, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>Hey everyone,
>
>So, as a followup of what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
>I started looking in more details at scheduling domains in the Linux
>kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>of interacting, while this thing I'm proposing here is completely
>independent from them both.
>
>In fact, no matter whether vNUMA is supported and enabled, and no
>matter
>whether CPUID is reporting accurate, random, meaningful or completely
>misleading information, I think that we should do something about how
>scheduling domains are build.
>
>Fact is, unless we use 1:1, and immutable (across all the guest
>lifetime) pinning, scheduling domains should not be constructed, in
>Linux, by looking at *any* topology information, because that just does
>not make any sense, when vcpus move around.
>
>Let me state this again (hoping to make myself as clear as possible):
>no
>matter in  how much good shape we put CPUID support, no matter how
>beautifully and consistently that will interact with both vNUMA,
>licensing requirements and whatever else. It will be always possible
>for
>vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>on two different NUMA nodes at time t2. Hence, the Linux scheduler
>should really not skew his load balancing logic toward any of those two
>situations, as neither of them could be considered correct (since
>nothing is!).

What about Windows guests?

>
>For now, this only covers the PV case. HVM case shouldn't be any
>different, but I haven't looked at how to make the same thing happen in
>there as well.
>
>OVERALL DESCRIPTION
>===================
>What this RFC patch does is, in the Xen PV case, configure scheduling
>domains in such a way that there is only one of them, spanning all the
>pCPUs of the guest.

Wow. That is a pretty simple patch!!

>
>Note that the patch deals directly with scheduling domains, and there
>is
>no need to alter the masks that will then be used for building and
>reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That
>is
>the main difference between it and the patch proposed by Juergen here:
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
>This means that when, in future, we will fix CPUID handling and make it
>comply with whatever logic or requirements we want, that won't have 
>any
>unexpected side effects on scheduling domains.
>
>Information about how the scheduling domains are being constructed
>during boot are available in `dmesg', if the kernel is booted with the
>'sched_debug' parameter. It is also possible to look
>at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
>With the patch applied, only one scheduling domain is created, called
>the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>tell that from the fact that every cpu* folder
>in /proc/sys/kernel/sched_domain/ only have one subdirectory
>('domain0'), with all the tweaks and the tunables for our scheduling
>domain.
>
...
>
>REQUEST FOR COMMENTS
>====================
>Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys thing of the approach,
> - whether you think, looking at this preliminary set of numbers, that
>   this is something worth continuing investigating,
> - if yes, what other workloads and benchmark it would make sense to
>   throw at it.
>

The thing I was worried about was that we would be modifying the generic code, but your changes are all in Xen code!

Woot!

In terms of workloads, I am CCing Herbert, who I hope can provide advice on this.

Herbert, the full email is here: 
http://lists.xen.org/archives/html/xen-devel/2015-08/msg01691.html


>Thanks and Regards,
>Dario




* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
  2015-08-18 16:53 ` Konrad Rzeszutek Wilk
@ 2015-08-20 18:16 ` Juergen Groß
  2015-08-31 16:12   ` Boris Ostrovsky
  2015-09-15 16:50   ` Dario Faggioli
  2015-08-27 10:24 ` George Dunlap
  2 siblings, 2 replies; 22+ messages in thread
From: Juergen Groß @ 2015-08-20 18:16 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> Hey everyone,
>
> So, as a followup of what we were discussing in this thread:
>
>   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>   http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
> I started looking in more details at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
>
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are build.
>
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
>
> Let me state this again (hoping to make myself as clear as possible): no
> matter in  how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew his load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
>
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
>
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> pCPUs of the guest.
>
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have  any
> unexpected side effects on scheduling domains.
>
> Information about how the scheduling domains are being constructed
> during boot are available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only have one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
>
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on a
> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
>
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> number of build jobs (in case of the Xen build) or of parallel copy of
> the benchmarks (in the case of UnixBench).
>
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
>
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
>
...
> REQUEST FOR COMMENTS
> ====================
> Basically, the kind of feedback I'd be really glad to hear is:
>   - what you guys thing of the approach,

Yesterday, at the end of the developer meeting, we (Andrew, Elena and I)
discussed this topic again.

Regarding a possible future scenario with credit2 eventually supporting
gang scheduling on hyperthreads (which is desirable for security reasons
[side channel attacks] and for fairness), my patch seems better suited
to that direction than yours. Correct me if I'm wrong, but I think
scheduling domains won't enable the guest kernel's scheduler to migrate
threads more easily between hyperthreads as opposed to other vCPUs,
while my approach can easily be extended to do so.

>   - whether you think, looking at this preliminary set of numbers, that
>     this is something worth continuing investigating,

I believe that, as both approaches lead to the same topology information
being used by the scheduler (all vCPUs are regarded as equal), your
numbers should apply to my patch as well. Would you mind verifying this?

I still believe making the guest scheduler's decisions independent of
cpuid values is the way to go, as this will enable us to support more
scenarios (e.g. cpuid-based licensing). For HVM guests and old PV guests
mangling the cpuid should still be done, though.

>   - if yes, what other workloads and benchmark it would make sense to
>     throw at it.

As you already mentioned, an overcommitted host should be looked at as
well.


Thanks for doing the measurements,


Juergen


* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
  2015-08-18 16:53 ` Konrad Rzeszutek Wilk
  2015-08-20 18:16 ` Juergen Groß
@ 2015-08-27 10:24 ` George Dunlap
  2015-08-27 17:05   ` [Xen-devel] " George Dunlap
  2015-09-15 14:32   ` Dario Faggioli
  2 siblings, 2 replies; 22+ messages in thread
From: George Dunlap @ 2015-08-27 10:24 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Juergen Gross, Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, linux-kernel,
	Stefano Stabellini

On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
> 
> So, as a followup of what we were discussing in this thread:
> 
>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> 
> I started looking in more details at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
> 
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are build.
> 
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
> 
> Let me state this again (hoping to make myself as clear as possible): no
> matter in  how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew his load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
> 
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
> 
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> pCPUs of the guest.
> 
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> 
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have  any
> unexpected side effects on scheduling domains.
> 
> Information about how the scheduling domains are being constructed
> during boot are available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> 
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only have one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
> 
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on a
> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
> 
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> number of build jobs (in case of the Xen build) or of parallel copy of
> the benchmarks (in the case of UnixBench).
> 
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
> 
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
> 
> RESULTS
> =======
> To have a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
> 
> I'll put these results online, in a googledoc spreadsheet or something
> like that, to make them easier to read, as soon as possible.
> 
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz                                                                                                                    
> *** pCPUs      16        DOM0 vCPUS  16
> *** RAM        12285 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2         
> =======================================================================================================================================
> MAKE XEN (lower == better)                                                                                                                            
> =======================================================================================================================================
> # of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24                
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
>                               153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
>                               153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
>                               153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
>                               153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
>                               154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
>                               154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
>                               154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
>                               154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
>                               154.9      154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            6 parrallel           8 parallel            16 parallel**         24 parallel
> vanilla/patched                          vanilla    patched    vanilla    pached     vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
> Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
> Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
> File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
> File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
> File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
> Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
> Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
> Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
> Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
> Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
> System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
> System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      24        DOM0 vCPUS  16
> *** RAM        36851 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
>                               119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
>                               119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
>                               119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
>                               119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
>                               119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
>                               119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
>                               120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
>                               120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
>                               120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            8 parrallel            12 parallel           24 parallel**         32 parallel
> vanilla/patched                          vanilla     patched   vanilla     pached     vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
> Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
> Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
> File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
> File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
> File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
> Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
> Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
> Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
> Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
> Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
> System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
> System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      48        DOM0 vCPUS  16
> *** RAM        393138 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
>                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
>                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
>                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
>                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
>                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
>                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
>                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
>                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
>                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
> ========================================================================================================================================

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen.  Wouldn't it make more sense to stick with
1/6/8/16/24?  That would allow us to have actually comparable numbers.

But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

 -George



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-27 10:24 ` George Dunlap
@ 2015-08-27 17:05   ` George Dunlap
  2015-09-15 14:32   ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2015-08-27 17:05 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dario Faggioli, xen-devel, Juergen Gross, Andrew Cooper,
	Luis R. Rodriguez, linux-kernel, David Vrabel, Boris Ostrovsky,
	Stefano Stabellini

On Thu, Aug 27, 2015 at 11:24 AM, George Dunlap
<george.dunlap@citrix.com> wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:
>> ...
>
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen.  Wouldn't it make more sense to stick with
> 1/6/8/16/24?  That would allow us to have actually comparable numbers.
>
> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.

That said, it's probably a good idea to make this optional somehow, so
that if people do decide to do a pinning / partitioning approach, the
guest scheduler actually can take advantage of topological
information.

 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-20 18:16 ` Juergen Groß
@ 2015-08-31 16:12   ` Boris Ostrovsky
  2015-09-02 11:58     ` Juergen Gross
  2015-09-15 16:50   ` Dario Faggioli
  1 sibling, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-08-31 16:12 UTC (permalink / raw)
  To: Juergen Groß, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap



On 08/20/2015 02:16 PM, Juergen Groß wrote:
> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>> ...
> ...
>> REQUEST FOR COMMENTS
>> ====================
>> Basically, the kind of feedback I'd be really glad to hear is:
>>   - what you guys thing of the approach,
>
> Yesterday at the end of the developer meeting we (Andrew, Elena and
> myself) discussed this topic again.
>
> Regarding a possible future scenario with credit2 eventually supporting
> gang scheduling on hyperthreads (which is desirable due to security
> reasons [side channel attack] and fairness) my patch seems to be more
> suited for that direction than yours. Correct me if I'm wrong, but I
> think scheduling domains won't enable the guest kernel's scheduler to
> migrate threads more easily between hyperthreads opposed to other vcpus,
> while my approach can easily be extended to do so.
>
>>   - whether you think, looking at this preliminary set of numbers, that
>>     this is something worth continuing investigating,
>
> I believe as both approaches lead to the same topology information used
> by the scheduler (all vcpus are regarded as being equal) your numbers
> should apply to my patch as well. Would you mind verifying this?

If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have 
both of your patches?
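
For reference, the early exit Boris is referring to looks roughly like
this in arch/x86/kernel/smpboot.c (a simplified sketch from memory, so
treat names and details as approximate for the kernel version at hand):

void set_cpu_sibling_map(int cpu)
{
        bool has_smt = smp_num_siblings > 1;
        bool has_mp  = has_smt || boot_cpu_data.x86_max_cores > 1;
        struct cpuinfo_x86 *c = &cpu_data(cpu);

        cpumask_set_cpu(cpu, cpu_sibling_setup_mask);

        if (!has_mp) {
                /* no SMT, no multi-core: every mask collapses to this cpu */
                cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
                cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
                cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
                c->booted_cores = 1;
                return;
        }

        /* ... otherwise the sibling/LLC/core masks are filled in from the
         * detected topology and the scheduler builds SMT/MC domains on top */
}

With that branch taken, the sibling, LLC and core masks each contain
only the cpu itself, so the scheduler sees no SMT or multi-core
structure at all.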

Also, it seems to me that Xen guests would not be the only ones having 
to deal with topology inconsistencies due to migrating VCPUs. Don't KVM 
guests, for example, have the same problem? And if yes, perhaps we 
should try solving it in a non-Xen-specific way (especially given that 
both of those patches look pretty simple and thus are presumably easy to 
integrate into common code).

And, as George already pointed out, this should be an optional feature 
--- if a guest spans physical nodes and VCPUs are pinned then we don't 
always want flat topology/domains.

-boris


>
> I still believe making the guest scheduler's decisions independant from
> cpuid values is the way to go, as this will enable us to support more
> scenarios (e.g. cpuid based licensing). For HVM guests and old PV guests
> mangling the cpuid should still be done, though.
>
>>   - if yes, what other workloads and benchmark it would make sense to
>>     throw at it.
>
> As you already mentioned an overcommitted host should be looked at as
> well.
>
>
> Thanks for doing the measurements,
>
>
> Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-31 16:12   ` Boris Ostrovsky
@ 2015-09-02 11:58     ` Juergen Gross
  2015-09-02 14:08       ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-02 11:58 UTC (permalink / raw)
  To: Boris Ostrovsky, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>
>
> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>> ...
>> ...
>>> REQUEST FOR COMMENTS
>>> ====================
>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>   - what you guys thing of the approach,
>>
>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>> myself) discussed this topic again.
>>
>> Regarding a possible future scenario with credit2 eventually supporting
>> gang scheduling on hyperthreads (which is desirable due to security
>> reasons [side channel attack] and fairness) my patch seems to be more
>> suited for that direction than yours. Correct me if I'm wrong, but I
>> think scheduling domains won't enable the guest kernel's scheduler to
>> migrate threads more easily between hyperthreads opposed to other vcpus,
>> while my approach can easily be extended to do so.
>>
>>>   - whether you think, looking at this preliminary set of numbers, that
>>>     this is something worth continuing investigating,
>>
>> I believe as both approaches lead to the same topology information used
>> by the scheduler (all vcpus are regarded as being equal) your numbers
>> should apply to my patch as well. Would you mind verifying this?
>
> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
> both of your patches?

Hmm, sort of.

OTOH this would make it hard to make use of some of the topology
information in case of e.g. pinned vcpus (as George pointed out).

> Also, it seems to me that Xen guests would not be the only ones having
> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
> guests, for example, have the same problem? And if yes, perhaps we
> should try solving it in non-Xen-specific way (especially given that
> both of those patches look pretty simple and thus are presumably easy to
> integrate into common code).

Indeed. I'll have a try.

> And, as George already pointed out, this should be an optional feature
> --- if a guest spans physical nodes and VCPUs are pinned then we don't
> always want flat topology/domains.

Yes, it might be a good idea to be able to keep some of the topology
levels. I'll modify my patch to make this command line selectable.


Juergen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-02 11:58     ` Juergen Gross
@ 2015-09-02 14:08       ` Boris Ostrovsky
  2015-09-02 14:30         ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-09-02 14:08 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 09/02/2015 07:58 AM, Juergen Gross wrote:
> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>
>>
>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> ...
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>   - what you guys thing of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable due to security
>>> reasons [side channel attack] and fairness) my patch seems to be more
>>> suited for that direction than yours. Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads opposed to other 
>>> vcpus,
>>> while my approach can easily be extended to do so.
>>>
>>>>   - whether you think, looking at this preliminary set of numbers, 
>>>> that
>>>>     this is something worth continuing investigating,
>>>
>>> I believe as both approaches lead to the same topology information used
>>> by the scheduler (all vcpus are regarded as being equal) your numbers
>>> should apply to my patch as well. Would you mind verifying this?
>>
>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>> both of your patches?
>
> Hmm, sort of.
>
> OTOH this would it make hard to make use of some of the topology
> information in case of e.g. pinned vcpus (as George pointed out).


I didn't mean to just set has_mp to zero unconditionally (for Xen, or 
any other, guest). We'd need to have some logic as to when to set it to 
false.
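
One possible shape for that logic, as a purely hypothetical sketch
(none of these names exist today; since the guest cannot easily observe
how its vcpus are pinned, it would have to be an explicit opt-in on the
guest kernel command line):

#include <linux/init.h>

/* hypothetical opt-in flag, e.g. booting with "flat_cpu_topology" */
static bool flat_cpu_topology;

static int __init parse_flat_cpu_topology(char *arg)
{
        flat_cpu_topology = true;
        return 0;
}
early_param("flat_cpu_topology", parse_flat_cpu_topology);

/* ... and then, in set_cpu_sibling_map(): */
        bool has_mp = !flat_cpu_topology &&
                      (smp_num_siblings > 1 ||
                       boot_cpu_data.x86_max_cores > 1);

Whoever sets up the pinning (the admin or the toolstack) is the one who
knows whether the topology will stay stable, so it would also be the
one to pass or omit the parameter.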

-boris


>
>> Also, it seems to me that Xen guests would not be the only ones having
>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>> guests, for example, have the same problem? And if yes, perhaps we
>> should try solving it in non-Xen-specific way (especially given that
>> both of those patches look pretty simple and thus are presumably easy to
>> integrate into common code).
>
> Indeed. I'll have a try.
>
>> And, as George already pointed out, this should be an optional feature
>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>> always want flat topology/domains.
>
> Yes, it might be a good idea to be able to keep some of the topology
> levels. I'll modify my patch to make this command line selectable.
>
>
> Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-02 14:08       ` Boris Ostrovsky
@ 2015-09-02 14:30         ` Juergen Gross
  2015-09-15 17:16           ` [Xen-devel] " Dario Faggioli
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-02 14:30 UTC (permalink / raw)
  To: Boris Ostrovsky, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 09/02/2015 04:08 PM, Boris Ostrovsky wrote:
> On 09/02/2015 07:58 AM, Juergen Gross wrote:
>> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>>
>>>
>>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>>> ...
>>>> ...
>>>>> REQUEST FOR COMMENTS
>>>>> ====================
>>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>>   - what you guys thing of the approach,
>>>>
>>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>>> myself) discussed this topic again.
>>>>
>>>> Regarding a possible future scenario with credit2 eventually supporting
>>>> gang scheduling on hyperthreads (which is desirable due to security
>>>> reasons [side channel attack] and fairness) my patch seems to be more
>>>> suited for that direction than yours. Correct me if I'm wrong, but I
>>>> think scheduling domains won't enable the guest kernel's scheduler to
>>>> migrate threads more easily between hyperthreads opposed to other
>>>> vcpus,
>>>> while my approach can easily be extended to do so.
>>>>
>>>>>   - whether you think, looking at this preliminary set of numbers,
>>>>> that
>>>>>     this is something worth continuing investigating,
>>>>
>>>> I believe as both approaches lead to the same topology information used
>>>> by the scheduler (all vcpus are regarded as being equal) your numbers
>>>> should apply to my patch as well. Would you mind verifying this?
>>>
>>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>>> both of your patches?
>>
>> Hmm, sort of.
>>
>> OTOH this would it make hard to make use of some of the topology
>> information in case of e.g. pinned vcpus (as George pointed out).
>
>
> I didn't mean to just set has_mp to zero unconditionally (for Xen, or
> any other, guest). We'd need to have some logic as to when to set it to
> false.

In case we want to be able to use some of the topology information, this
would mean having two different mechanisms: one to disable all topology
usage and another to disable only parts of it. I'd rather have a way to
specify which levels of the topology information (NUMA nodes, cache
siblings, core siblings) are to be used. Using none of them is then just
one possibility, with all levels disabled.
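
To make that a bit more concrete, I'm thinking of something along the
lines of the following (a rough sketch only, with completely made-up
names; nothing like this exists in the kernel today):

  /* Hypothetical boot parameter selecting which topology levels the
   * kernel may use, e.g. "xen_topology=mc,numa".  All names invented
   * for illustration; needs <linux/init.h> and <linux/string.h>. */
  static unsigned int xen_topo_levels;

  #define XEN_TOPO_SMT   (1 << 0)        /* thread siblings */
  #define XEN_TOPO_MC    (1 << 1)        /* core/cache siblings */
  #define XEN_TOPO_NUMA  (1 << 2)        /* NUMA nodes */

  static int __init parse_xen_topology(char *s)
  {
          if (!s)
                  return -EINVAL;
          if (strstr(s, "smt"))
                  xen_topo_levels |= XEN_TOPO_SMT;
          if (strstr(s, "mc"))
                  xen_topo_levels |= XEN_TOPO_MC;
          if (strstr(s, "numa"))
                  xen_topo_levels |= XEN_TOPO_NUMA;
          return 0;
  }
  early_param("xen_topology", parse_xen_topology);

The topology setup code would then consult xen_topo_levels before
using (or zapping) the respective sibling masks.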


Juergen

>>
>>> Also, it seems to me that Xen guests would not be the only ones having
>>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>>> guests, for example, have the same problem? And if yes, perhaps we
>>> should try solving it in non-Xen-specific way (especially given that
>>> both of those patches look pretty simple and thus are presumably easy to
>>> integrate into common code).
>>
>> Indeed. I'll have a try.
>>
>>> And, as George already pointed out, this should be an optional feature
>>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>>> always want flat topology/domains.
>>
>> Yes, it might be a good idea to be able to keep some of the topology
>> levels. I'll modify my patch to make this command line selectable.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-27 10:24 ` George Dunlap
  2015-08-27 17:05   ` [Xen-devel] " George Dunlap
@ 2015-09-15 14:32   ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2015-09-15 14:32 UTC (permalink / raw)
  To: George Dunlap
  Cc: xen-devel, Juergen Gross, Andrew Cooper, Luis R. Rodriguez,
	linux-kernel, David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 4546 bytes --]

On Thu, 2015-08-27 at 11:24 +0100, George Dunlap wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:

> > *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> > *** pCPUs      48        DOM0 vCPUS  16
> > *** RAM        393138 MB DOM0 Memory 9955 MB
> > *** NUMA nodes 2
> > =======================================================================================================================================
> > MAKE XEN (lower == better)
> > =======================================================================================================================================
> > # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
> > vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
> >                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
> >                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
> >                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
> >                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
> >                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
> >                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
> >                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
> >                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
> >                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
> > ========================================================================================================================================
> 
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen.  Wouldn't it make more sense to stick with
> 1/6/8/16/24?  That would allow us to have actually comparable numbers.
> 
Bah, no, sorry, that was a mistake I made when I cut-&-pasted the
tables into the email... Dom0 always has as many vCPUs as the host has
pCPUs. I know this is a rather critical piece of information, so sorry
for messing it up! :-/

> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.
> 
Yeah, that seems to be the case... But I really would like to try more
configurations and more workloads. I'll do that ASAP.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-20 18:16 ` Juergen Groß
  2015-08-31 16:12   ` Boris Ostrovsky
@ 2015-09-15 16:50   ` Dario Faggioli
  2015-09-21  5:49     ` Juergen Gross
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2015-09-15 16:50 UTC (permalink / raw)
  To: Juergen Groß
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 9839 bytes --]

On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> > Hey everyone,
> >
> > So, as a followup of what we were discussing in this thread:
> >
> >   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
> >   http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> >
> > I started looking in more details at scheduling domains in the Linux
> > kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> > of interacting, while this thing I'm proposing here is completely
> > independent from them both.
> >
> > In fact, no matter whether vNUMA is supported and enabled, and no matter
> > whether CPUID is reporting accurate, random, meaningful or completely
> > misleading information, I think that we should do something about how
> > scheduling domains are build.
> >
> > Fact is, unless we use 1:1, and immutable (across all the guest
> > lifetime) pinning, scheduling domains should not be constructed, in
> > Linux, by looking at *any* topology information, because that just does
> > not make any sense, when vcpus move around.
> >
> > Let me state this again (hoping to make myself as clear as possible): no
> > matter in  how much good shape we put CPUID support, no matter how
> > beautifully and consistently that will interact with both vNUMA,
> > licensing requirements and whatever else. It will be always possible for
> > vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> > on two different NUMA nodes at time t2. Hence, the Linux scheduler
> > should really not skew his load balancing logic toward any of those two
> > situations, as neither of them could be considered correct (since
> > nothing is!).
> >
> > For now, this only covers the PV case. HVM case shouldn't be any
> > different, but I haven't looked at how to make the same thing happen in
> > there as well.
> >
> > OVERALL DESCRIPTION
> > ===================
> > What this RFC patch does is, in the Xen PV case, configure scheduling
> > domains in such a way that there is only one of them, spanning all the
> > pCPUs of the guest.
> >
> > Note that the patch deals directly with scheduling domains, and there is
> > no need to alter the masks that will then be used for building and
> > reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> > the main difference between it and the patch proposed by Juergen here:
> > http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> >
> > This means that when, in future, we will fix CPUID handling and make it
> > comply with whatever logic or requirements we want, that won't have  any
> > unexpected side effects on scheduling domains.
> >
> > Information about how the scheduling domains are being constructed
> > during boot are available in `dmesg', if the kernel is booted with the
> > 'sched_debug' parameter. It is also possible to look
> > at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> >
> > With the patch applied, only one scheduling domain is created, called
> > the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> > tell that from the fact that every cpu* folder
> > in /proc/sys/kernel/sched_domain/ only have one subdirectory
> > ('domain0'), with all the tweaks and the tunables for our scheduling
> > domain.
> >
> > EVALUATION
> > ==========
> > I've tested this with UnixBench, and by looking at Xen build time, on a
> > 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> > now, but I plan to re-run them in DomUs soon (Juergen may be doing
> > something similar to this in DomU already, AFAUI).
> >
> > I've run the benchmarks with and without the patch applied ('patched'
> > and 'vanilla', respectively, in the tables below), and with different
> > number of build jobs (in case of the Xen build) or of parallel copy of
> > the benchmarks (in the case of UnixBench).
> >
> > What I get from the numbers is that the patch almost always brings
> > benefits, in some cases even huge ones. There are a couple of cases
> > where we regress, but always only slightly so, especially if comparing
> > that to the magnitude of some of the improvement that we get.
> >
> > Bear also in mind that these results are gathered from Dom0, and without
> > any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> > we move things in DomU and do overcommit at the Xen scheduler level, I
> > am expecting even better results.
> >
> ...
> > REQUEST FOR COMMENTS
> > ====================
> > Basically, the kind of feedback I'd be really glad to hear is:
> >   - what you guys thing of the approach,
> 
> Yesterday at the end of the developer meeting we (Andrew, Elena and
> myself) discussed this topic again.
> 
Hey,

Sorry for replying so late, I've been on vacation from right after
XenSummit up until yesterday. :-)

> Regarding a possible future scenario with credit2 eventually supporting
> gang scheduling on hyperthreads (which is desirable due to security
> reasons [side channel attack] and fairness) my patch seems to be more
> suited for that direction than yours. 
>
Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
exactly around the corner (although, we can prioritize working on it if
we want).

In principle, I think it's a really nice idea. I still don't have a
clear picture of how we would handle a couple of situations, but let's
leave this aside for now and stay on-topic.

> Correct me if I'm wrong, but I
> think scheduling domains won't enable the guest kernel's scheduler to
> migrate threads more easily between hyperthreads opposed to other vcpus,
> while my approach can easily be extended to do so.
> 
I'm not sure I understand what you mean here. As far as the (Linux)
scheduler is concerned, your patch and mine do the exact same thing:
they arrange for the scheduling domains, when they're built, during
boot, not to consider hyperthreads or multi-cores.

Mine does it by removing the SMT (and the MC) level from the data
structure in the scheduler that is used as a base for configuring the
scheduling domains. Yours does it by making the topology bitmaps that
are used at each one of those levels all look the same. In fact, with
your patch applied, I get the exact same situation as with mine, as far
as scheduling domains are concerned: there is only one scheduling
domain, with a different scheduling group for each vCPU inside it.

In my case, that one scheduling domain is the special one that I define
in xen_sched_domain_topology (in arch/x86/xen/smp.c), in my patch (it's
called PCPU). In your case, it's the DIE scheduling domain, i.e., the
one coming from the last level defined in default_topology (in
kernel/sched/core.c). I'd have to recheck, but ISTR that, since you're
setting all the bitmaps for all the levels to the same value, previous
levels are created, recognised to be all equal, and merged/discarded.

IOW, mine is using a scheduler-provided interface explicitly, via
set_sched_topology(), i.e., the way an architecture (and in this case
the architecture would be 'xen') lets the scheduler know about its
topology quirks:
 http://lxr.free-electrons.com/ident?i=set_sched_topology
Basically, I'm telling the scheduler <<Hey, you're on Xen, don't bother
looking for hyperthreads, as they don't make any sense!>>.
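
To give an idea, the core of my patch is not much more than something
like this (a minimal sketch of the approach, not the literal code; the
level name and the mask used here are just illustrative):

  /* one single scheduling domain level; cpu_cpu_mask is the same mask
   * the default DIE level uses, the actual patch spans all the guest's
   * vCPUs.  set_sched_topology() is declared in <linux/sched.h>. */
  static struct sched_domain_topology_level xen_sched_domain_topology[] = {
          { cpu_cpu_mask, SD_INIT_NAME(VCPU) },
          { NULL, },
  };

  static void __init xen_set_sched_topology(void)
  {
          /* replace default_topology (SMT/MC/DIE) with the flat one */
          set_sched_topology(xen_sched_domain_topology);
  }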

Yours is changing the topology bitmaps directly. Basically, you're not
telling the scheduler anything; it then goes looking for SMT and MC
siblings, but finds none.
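
IIUC, the effect of your patch could be summarised with something like
this (again, just a sketch to illustrate what I mean, not your actual
code; the mask accessor names are the current mainline ones):

  /* every vCPU ends up alone in its thread and core sibling masks,
   * so the SMT and MC levels degenerate and get merged away */
  static void xen_flatten_topology_masks(int cpu)
  {
          cpumask_copy(topology_sibling_cpumask(cpu), cpumask_of(cpu));
          cpumask_copy(topology_core_cpumask(cpu), cpumask_of(cpu));
  }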

All this being said, the effect is the same, and the reason why the
scheduling inside the guest changes --between mainline and either mine
or your patch-- is the scheduling domains, or at least that is how I
understood it.

Therefore, I don't really understand why you're saying one approach is
more easily extensible toward anything... What am I missing?

> >   - whether you think, looking at this preliminary set of numbers, that
> >     this is something worth continuing investigating,
> 
> I believe as both approaches lead to the same topology information used
> by the scheduler (all vcpus are regarded as being equal) your numbers
> should apply to my patch as well. Would you mind verifying this?
> 
I'll run some tests, but yes, I 100% expect the numbers to look the
same. Actually, I already did a very quick check for a few cases, and
that is indeed the case, but I'll report back when I have the full
data set.

> >   - if yes, what other workloads and benchmark it would make sense to
> >     throw at it.
> 
> As you already mentioned an overcommitted host should be looked at as
> well.
> 
Sure.

> Thanks for doing the measurements,
> 
And more of them will be coming. ISTR you telling me in Seattle that you
(or some teammates of yours) were running some benches too... Any output
from that yet? :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-02 14:30         ` Juergen Gross
@ 2015-09-15 17:16           ` Dario Faggioli
  0 siblings, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2015-09-15 17:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Boris Ostrovsky, xen-devel, Andrew Cooper, Stefano Stabellini,
	linux-kernel, George Dunlap, David Vrabel, Luis R. Rodriguez

[-- Attachment #1: Type: text/plain, Size: 2117 bytes --]

On Wed, 2015-09-02 at 16:30 +0200, Juergen Gross wrote:
> On 09/02/2015 04:08 PM, Boris Ostrovsky wrote:
> > On 09/02/2015 07:58 AM, Juergen Gross wrote:
> >> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:

> >>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
> >>> both of your patches?
> >>
> >> Hmm, sort of.
> >>
> >> OTOH this would it make hard to make use of some of the topology
> >> information in case of e.g. pinned vcpus (as George pointed out).
> >
> >
> > I didn't mean to just set has_mp to zero unconditionally (for Xen, or
> > any other, guest). We'd need to have some logic as to when to set it to
> > false.
> 
> In case we want to be able to use some of the topology information this
> would mean we'd have two different mechanisms to either disable all
> topology usage or only parts of it. I'd rather have a way to specify
> which levels of the topology information (numa nodes, cache siblings,
> core siblings) are to be used. Using none is just one possibility with
> all levels disabled.
> 
I agree, indeed, acting on has_mp seems overkill/not ideal to me too
(I'm not even sure I fully understand how it's used in
set_cpu_sibling_map()... I'll dig more).
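
From a first quick look at arch/x86/kernel/smpboot.c, what that code
does seems to be roughly the following (heavily trimmed, and written
down from a quick read, so take it with a grain of salt):

          bool has_smt = smp_num_siblings > 1;
          bool has_mp = has_smt || c->x86_max_cores > 1;

          if (!has_mp) {
                  /* UP-like case: each cpu is alone in its sibling,
                   * LLC and core masks */
                  cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
                  cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
                  cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
                  c->booted_cores = 1;
                  return;
          }

IOW, forcing has_mp to false would indeed flatten the sibling, LLC and
core maps, but only in a rather indirect way.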

However...

> >>
> >>> Also, it seems to me that Xen guests would not be the only ones having
> >>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
> >>> guests, for example, have the same problem? And if yes, perhaps we
> >>> should try solving it in non-Xen-specific way (especially given that
> >>> both of those patches look pretty simple and thus are presumably easy to
> >>> integrate into common code).
> >>
> >> Indeed. I'll have a try.
> >>
...yes, this is an interesting point, and it's worth trying to look at
how to implement things that way.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-15 16:50   ` Dario Faggioli
@ 2015-09-21  5:49     ` Juergen Gross
  2015-09-22  4:42       ` Juergen Gross
  2015-09-23  7:24       ` Dario Faggioli
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2015-09-21  5:49 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/15/2015 06:50 PM, Dario Faggioli wrote:
> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>> Hey everyone,
>>>
>>> So, as a followup of what we were discussing in this thread:
>>>
>>>    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>    http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>
>>> I started looking in more details at scheduling domains in the Linux
>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>> of interacting, while this thing I'm proposing here is completely
>>> independent from them both.
>>>
>>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>>> whether CPUID is reporting accurate, random, meaningful or completely
>>> misleading information, I think that we should do something about how
>>> scheduling domains are build.
>>>
>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>> lifetime) pinning, scheduling domains should not be constructed, in
>>> Linux, by looking at *any* topology information, because that just does
>>> not make any sense, when vcpus move around.
>>>
>>> Let me state this again (hoping to make myself as clear as possible): no
>>> matter in  how much good shape we put CPUID support, no matter how
>>> beautifully and consistently that will interact with both vNUMA,
>>> licensing requirements and whatever else. It will be always possible for
>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>> should really not skew his load balancing logic toward any of those two
>>> situations, as neither of them could be considered correct (since
>>> nothing is!).
>>>
>>> For now, this only covers the PV case. HVM case shouldn't be any
>>> different, but I haven't looked at how to make the same thing happen in
>>> there as well.
>>>
>>> OVERALL DESCRIPTION
>>> ===================
>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>> domains in such a way that there is only one of them, spanning all the
>>> pCPUs of the guest.
>>>
>>> Note that the patch deals directly with scheduling domains, and there is
>>> no need to alter the masks that will then be used for building and
>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
>>> the main difference between it and the patch proposed by Juergen here:
>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>
>>> This means that when, in future, we will fix CPUID handling and make it
>>> comply with whatever logic or requirements we want, that won't have  any
>>> unexpected side effects on scheduling domains.
>>>
>>> Information about how the scheduling domains are being constructed
>>> during boot are available in `dmesg', if the kernel is booted with the
>>> 'sched_debug' parameter. It is also possible to look
>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>
>>> With the patch applied, only one scheduling domain is created, called
>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>> tell that from the fact that every cpu* folder
>>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>> domain.
>>>
>>> EVALUATION
>>> ==========
>>> I've tested this with UnixBench, and by looking at Xen build time, on a
>>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>> something similar to this in DomU already, AFAUI).
>>>
>>> I've run the benchmarks with and without the patch applied ('patched'
>>> and 'vanilla', respectively, in the tables below), and with different
>>> number of build jobs (in case of the Xen build) or of parallel copy of
>>> the benchmarks (in the case of UnixBench).
>>>
>>> What I get from the numbers is that the patch almost always brings
>>> benefits, in some cases even huge ones. There are a couple of cases
>>> where we regress, but always only slightly so, especially if comparing
>>> that to the magnitude of some of the improvement that we get.
>>>
>>> Bear also in mind that these results are gathered from Dom0, and without
>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>>> we move things in DomU and do overcommit at the Xen scheduler level, I
>>> am expecting even better results.
>>>
>> ...
>>> REQUEST FOR COMMENTS
>>> ====================
>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>    - what you guys thing of the approach,
>>
>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>> myself) discussed this topic again.
>>
> Hey,
>
> Sorry for replying so late, I've been on vacation from right after
> XenSummit up until yesterday. :-)
>
>> Regarding a possible future scenario with credit2 eventually supporting
>> gang scheduling on hyperthreads (which is desirable due to security
>> reasons [side channel attack] and fairness) my patch seems to be more
>> suited for that direction than yours.
>>
> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
> exactly around the corner (although, we can prioritize working on it if
> we want).
>
> In principle, I think it's a really nice idea. I still don't have clear
> in mind how we would handle a couple of situations, but let's leave this
> aside for now, and stay on-topic.
>
>> Correct me if I'm wrong, but I
>> think scheduling domains won't enable the guest kernel's scheduler to
>> migrate threads more easily between hyperthreads opposed to other vcpus,
>> while my approach can easily be extended to do so.
>>
> I'm not sure I understand what you mean here. As far as the (Linux)
> scheduler is concerned, your patch and mine do the exact same thing:
> they arrange for the scheduling domains, when they're built, during
> boot, not to consider hyperthreads or multi-cores.
>
> Mine does it by removing the SMT (and the MC) level from the data
> structure in the scheduler that is used as a base for configuring the
> scheduling domains. Yours does it by making the topology bitmaps that
> are used at each one of those level all look the same. In fact, with
> your patch applied, I get the exact same situation as with mine, as far
> as scheduling domains are concerned: there is only one scheduling
> domain, with a different scheduling group for each vCPU inside it.

Uuh, nearly.

Your case won't deal correctly with NUMA, as the generic NUMA code is
using set_sched_topology() as well. One of NUMA and Xen will win and
overwrite the other's settings.

To do things correctly you will have to handle NUMA as well.


Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-21  5:49     ` Juergen Gross
@ 2015-09-22  4:42       ` Juergen Gross
  2015-09-22 16:22         ` George Dunlap
  2015-09-23  7:24       ` Dario Faggioli
  1 sibling, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-22  4:42 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/21/2015 07:49 AM, Juergen Gross wrote:
> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
>> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> Hey everyone,
>>>>
>>>> So, as a followup of what we were discussing in this thread:
>>>>
>>>>    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>>
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>
>>>>
>>>> I started looking in more details at scheduling domains in the Linux
>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>>> of interacting, while this thing I'm proposing here is completely
>>>> independent from them both.
>>>>
>>>> In fact, no matter whether vNUMA is supported and enabled, and no
>>>> matter
>>>> whether CPUID is reporting accurate, random, meaningful or completely
>>>> misleading information, I think that we should do something about how
>>>> scheduling domains are build.
>>>>
>>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>>> lifetime) pinning, scheduling domains should not be constructed, in
>>>> Linux, by looking at *any* topology information, because that just does
>>>> not make any sense, when vcpus move around.
>>>>
>>>> Let me state this again (hoping to make myself as clear as
>>>> possible): no
>>>> matter in  how much good shape we put CPUID support, no matter how
>>>> beautifully and consistently that will interact with both vNUMA,
>>>> licensing requirements and whatever else. It will be always possible
>>>> for
>>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>>> should really not skew his load balancing logic toward any of those two
>>>> situations, as neither of them could be considered correct (since
>>>> nothing is!).
>>>>
>>>> For now, this only covers the PV case. HVM case shouldn't be any
>>>> different, but I haven't looked at how to make the same thing happen in
>>>> there as well.
>>>>
>>>> OVERALL DESCRIPTION
>>>> ===================
>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>> domains in such a way that there is only one of them, spanning all the
>>>> pCPUs of the guest.
>>>>
>>>> Note that the patch deals directly with scheduling domains, and
>>>> there is
>>>> no need to alter the masks that will then be used for building and
>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.).
>>>> That is
>>>> the main difference between it and the patch proposed by Juergen here:
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>
>>>>
>>>> This means that when, in future, we will fix CPUID handling and make it
>>>> comply with whatever logic or requirements we want, that won't have
>>>> any
>>>> unexpected side effects on scheduling domains.
>>>>
>>>> Information about how the scheduling domains are being constructed
>>>> during boot are available in `dmesg', if the kernel is booted with the
>>>> 'sched_debug' parameter. It is also possible to look
>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>
>>>> With the patch applied, only one scheduling domain is created, called
>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>> tell that from the fact that every cpu* folder
>>>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>>> domain.
>>>>
>>>> EVALUATION
>>>> ==========
>>>> I've tested this with UnixBench, and by looking at Xen build time, on a
>>>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>> something similar to this in DomU already, AFAUI).
>>>>
>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>> and 'vanilla', respectively, in the tables below), and with different
>>>> number of build jobs (in case of the Xen build) or of parallel copy of
>>>> the benchmarks (in the case of UnixBench).
>>>>
>>>> What I get from the numbers is that the patch almost always brings
>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>> where we regress, but always only slightly so, especially if comparing
>>>> that to the magnitude of some of the improvement that we get.
>>>>
>>>> Bear also in mind that these results are gathered from Dom0, and
>>>> without
>>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>>>> we move things in DomU and do overcommit at the Xen scheduler level, I
>>>> am expecting even better results.
>>>>
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>    - what you guys thing of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>> Hey,
>>
>> Sorry for replying so late, I've been on vacation from right after
>> XenSummit up until yesterday. :-)
>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable due to security
>>> reasons [side channel attack] and fairness) my patch seems to be more
>>> suited for that direction than yours.
>>>
>> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
>> exactly around the corner (although, we can prioritize working on it if
>> we want).
>>
>> In principle, I think it's a really nice idea. I still don't have clear
>> in mind how we would handle a couple of situations, but let's leave this
>> aside for now, and stay on-topic.
>>
>>> Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads opposed to other vcpus,
>>> while my approach can easily be extended to do so.
>>>
>> I'm not sure I understand what you mean here. As far as the (Linux)
>> scheduler is concerned, your patch and mine do the exact same thing:
>> they arrange for the scheduling domains, when they're built, during
>> boot, not to consider hyperthreads or multi-cores.
>>
>> Mine does it by removing the SMT (and the MC) level from the data
>> structure in the scheduler that is used as a base for configuring the
>> scheduling domains. Yours does it by making the topology bitmaps that
>> are used at each one of those level all look the same. In fact, with
>> your patch applied, I get the exact same situation as with mine, as far
>> as scheduling domains are concerned: there is only one scheduling
>> domain, with a different scheduling group for each vCPU inside it.
>
> Uuh, nearly.
>
> Your case won't deal correctly with NUMA, as the generic NUMA code is
> using set_sched_topology() as well. One of NUMA and Xen will win and
> overwrite the other's settings.
>
> To do things correctly you will have to handle NUMA as well.

One other thing I just discovered: there are other consumers of the
topology sibling masks (e.g. topology_sibling_cpumask()) as well.

I think we would want to avoid any optimizations based on those in
drivers as well, not only in the scheduler.
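
Just to illustrate the kind of optimization I mean, a driver could try
to spread its work based on those masks roughly like this (purely
hypothetical example, all names made up):

  /* pick a cpu for a second queue that is not a hyperthread sibling
   * of the one already in use */
  static int pick_non_sibling_cpu(int used_cpu)
  {
          int cpu;

          for_each_online_cpu(cpu) {
                  if (!cpumask_test_cpu(cpu, topology_sibling_cpumask(used_cpu)))
                          return cpu;
          }
          return used_cpu;
  }

With vcpus moving around beneath the guest such an "optimization" is
based on stale data and might even be counterproductive.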


Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-22  4:42       ` Juergen Gross
@ 2015-09-22 16:22         ` George Dunlap
  2015-09-23  4:36           ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2015-09-22 16:22 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/22/2015 05:42 AM, Juergen Gross wrote:
> One other thing I just discovered: there are other consumers of the
> topology sibling masks (e.g. topology_sibling_cpumask()) as well.
> 
> I think we would want to avoid any optimizations based on those in
> drivers as well, not only in the scheduler.

I'm beginning to lose the thread of the discussion here a bit.

Juergen / Dario, could one of you summarize your two approaches, and the
(alleged) advantages and disadvantages of each one?

Thanks,
 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-22 16:22         ` George Dunlap
@ 2015-09-23  4:36           ` Juergen Gross
  2015-09-23  8:30             ` Dario Faggioli
  2015-09-23 10:23             ` George Dunlap
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2015-09-23  4:36 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/22/2015 06:22 PM, George Dunlap wrote:
> On 09/22/2015 05:42 AM, Juergen Gross wrote:
>> One other thing I just discovered: there are other consumers of the
>> topology sibling masks (e.g. topology_sibling_cpumask()) as well.
>>
>> I think we would want to avoid any optimizations based on those in
>> drivers as well, not only in the scheduler.
>
> I'm beginning to lose the thread of the discussion here a bit.
>
> Juergen / Dario, could one of you summarize your two approaches, and the
> (alleged) advantages and disadvantages of each one?

Okay, I'll have a try:

The problem we want to solve:
-----------------------------

The Linux kernel is gathering cpu topology data during boot via the
CPUID instruction on each processor coming online. This data is
primarily used in the scheduler to decide to which cpu a thread should
be migrated when this seems to be necessary. There are other users of
the topology information in the kernel (e.g. some drivers try to do
optimizations like core-specific queues/lists).

When started in a virtualized environment the obtained data is next to
useless or even wrong, as it reflects only the situation at the time
the system was booted. The hypervisor's scheduling of the (v)cpus keeps
changing the topology under the feet of the Linux kernel, without this
being reflected in the gathered topology information. So any decision
taken based on that data will be clueless and possibly just wrong.

The minimal solution is to change the topology data in the kernel in
such a way that all cpus are regarded as equal with respect to their
relation to each other (e.g. when migrating a thread to another cpu, no
cpu is preferred as a target).

The topology information of the CPUID instruction is, however, also
accessible from user mode and might be used for licensing purposes by
user programs (e.g. by limiting the software to run on a specific
number of cores or sockets). So just mangling the data returned by
CPUID in the hypervisor seems not to be a general solution, although we
might want to do it at least optionally in the future.
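
(Just to illustrate why this matters: any unprivileged program can
derive the thread/core layout directly, e.g. from CPUID leaf 0xb --
a sketch only, with no error handling:

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx, lvl;

          for (lvl = 0; lvl < 2; lvl++) {
                  /* leaf 0xb: ECX[15:8] is the level type (1 = SMT,
                   * 2 = core), EBX[15:0] the logical cpus at that level */
                  __cpuid_count(0x0b, lvl, eax, ebx, ecx, edx);
                  printf("level %u: type %u, logical cpus %u\n",
                         lvl, (ecx >> 8) & 0xff, ebx & 0xffff);
          }
          return 0;
  }

so whatever the hypervisor reports there is visible to such programs.)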

In the future we might want to support either dynamic topology updates
or be able to tell the kernel to use some of the topology data, e.g.
when pinning vcpus.


Solution 1 (Dario):
-------------------

Don't use the CPUID-derived topology information in the Linux scheduler,
but let it use a simple "flat" topology by setting Xen-specific
scheduler domain data.

Advantages:
+ very clean solution regarding the scheduler interface
+ scheduler decisions are based on a minimal data set
+ small patch

Disadvantages:
- covers the scheduler only, drivers still use the "wrong" data
- a little bit hacky regarding some NUMA architectures (needs either a
   hook in the code dealing with that architecture or multiple scheduler
   domain data overwrites)
- future enhancements will make the solution less clean (they need
   either duplicated scheduler domain data or new hooks in the scheduler
   domain interface)


Solution 2 (Juergen):
---------------------

When booted as a Xen guest, modify the topology data built during boot,
resulting in the same simple "flat" topology as in Dario's solution.

Advantages:
+ the simple topology is seen by all consumers of topology data as the
   data itself is modified accordingly
+ small patch
+ future enhancements rather easy by selecting which data to modify

Disadvantages:
- interface to scheduler not as clean as in Dario's approach
- scheduler decisions are based on multiple layers of topology data
   where one layer would be enough to describe the topology


Dario, are you okay with this summary?

Juergen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-21  5:49     ` Juergen Gross
  2015-09-22  4:42       ` Juergen Gross
@ 2015-09-23  7:24       ` Dario Faggioli
  2015-09-23  7:35         ` Juergen Gross
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2015-09-23  7:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 8912 bytes --]

On Mon, 2015-09-21 at 07:49 +0200, Juergen Gross wrote:
> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
> > On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
> > > On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> > > > Hey everyone,
> > > > 
> > > > So, as a followup of what we were discussing in this thread:
> > > > 
> > > >    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by
> > > > the guest
> > > >    http://lists.xenproject.org/archives/html/xen-devel/2015-07/
> > > > msg03241.html
> > > > 
> > > > I started looking in more details at scheduling domains in the
> > > > Linux
> > > > kernel. Now, that thread was about CPUID and vNUMA, and their
> > > > weird way
> > > > of interacting, while this thing I'm proposing here is
> > > > completely
> > > > independent from them both.
> > > > 
> > > > In fact, no matter whether vNUMA is supported and enabled, and
> > > > no matter
> > > > whether CPUID is reporting accurate, random, meaningful or
> > > > completely
> > > > misleading information, I think that we should do something
> > > > about how
> > > > scheduling domains are build.
> > > > 
> > > > Fact is, unless we use 1:1, and immutable (across all the guest
> > > > lifetime) pinning, scheduling domains should not be
> > > > constructed, in
> > > > Linux, by looking at *any* topology information, because that
> > > > just does
> > > > not make any sense, when vcpus move around.
> > > > 
> > > > Let me state this again (hoping to make myself as clear as
> > > > possible): no
> > > > matter in  how much good shape we put CPUID support, no matter
> > > > how
> > > > beautifully and consistently that will interact with both
> > > > vNUMA,
> > > > licensing requirements and whatever else. It will be always
> > > > possible for
> > > > vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time
> > > > t1, and
> > > > on two different NUMA nodes at time t2. Hence, the Linux
> > > > scheduler
> > > > should really not skew his load balancing logic toward any of
> > > > those two
> > > > situations, as neither of them could be considered correct
> > > > (since
> > > > nothing is!).
> > > > 
> > > > For now, this only covers the PV case. HVM case shouldn't be
> > > > any
> > > > different, but I haven't looked at how to make the same thing
> > > > happen in
> > > > there as well.
> > > > 
> > > > OVERALL DESCRIPTION
> > > > ===================
> > > > What this RFC patch does is, in the Xen PV case, configure
> > > > scheduling
> > > > domains in such a way that there is only one of them, spanning
> > > > all the
> > > > pCPUs of the guest.
> > > > 
> > > > Note that the patch deals directly with scheduling domains, and
> > > > there is
> > > > no need to alter the masks that will then be used for building
> > > > and
> > > > reporting the topology (via CPUID, /proc/cpuinfo, /sysfs,
> > > > etc.). That is
> > > > the main difference between it and the patch proposed by
> > > > Juergen here:
> > > > http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg
> > > > 05088.html
> > > > 
> > > > This means that when, in future, we will fix CPUID handling and
> > > > make it
> > > > comply with whatever logic or requirements we want, that won't
> > > > have  any
> > > > unexpected side effects on scheduling domains.
> > > > 
> > > > Information about how the scheduling domains are being
> > > > constructed
> > > > during boot are available in `dmesg', if the kernel is booted
> > > > with the
> > > > 'sched_debug' parameter. It is also possible to look
> > > > at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> > > > 
> > > > With the patch applied, only one scheduling domain is created,
> > > > called
> > > > the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs.
> > > > You can
> > > > tell that from the fact that every cpu* folder
> > > > in /proc/sys/kernel/sched_domain/ only have one subdirectory
> > > > ('domain0'), with all the tweaks and the tunables for our
> > > > scheduling
> > > > domain.
> > > > 
> > > > EVALUATION
> > > > ==========
> > > > I've tested this with UnixBench, and by looking at Xen build
> > > > time, on a
> > > > 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0
> > > > only, for
> > > > now, but I plan to re-run them in DomUs soon (Juergen may be
> > > > doing
> > > > something similar to this in DomU already, AFAUI).
> > > > 
> > > > I've run the benchmarks with and without the patch applied
> > > > ('patched'
> > > > and 'vanilla', respectively, in the tables below), and with
> > > > different
> > > > number of build jobs (in case of the Xen build) or of parallel
> > > > copy of
> > > > the benchmarks (in the case of UnixBench).
> > > > 
> > > > What I get from the numbers is that the patch almost always
> > > > brings
> > > > benefits, in some cases even huge ones. There are a couple of
> > > > cases
> > > > where we regress, but always only slightly so, especially if
> > > > comparing
> > > > that to the magnitude of some of the improvement that we get.
> > > > 
> > > > Bear also in mind that these results are gathered from Dom0,
> > > > and without
> > > > any overcommitment at the vCPU level (i.e., nr. vCPUs == nr
> > > > pCPUs). If
> > > > we move things in DomU and do overcommit at the Xen scheduler
> > > > level, I
> > > > am expecting even better results.
> > > > 
> > > ...
> > > > REQUEST FOR COMMENTS
> > > > ====================
> > > > Basically, the kind of feedback I'd be really glad to hear is:
> > > >    - what you guys thing of the approach,
> > > 
> > > Yesterday at the end of the developer meeting we (Andrew, Elena
> > > and
> > > myself) discussed this topic again.
> > > 
> > Hey,
> > 
> > Sorry for replying so late, I've been on vacation from right after
> > XenSummit up until yesterday. :-)
> > 
> > > Regarding a possible future scenario with credit2 eventually
> > > supporting
> > > gang scheduling on hyperthreads (which is desirable due to
> > > security
> > > reasons [side channel attack] and fairness) my patch seems to be
> > > more
> > > suited for that direction than yours.
> > > 
> > Ok. Just let me mention that 'Credit2 + gang scheduling' might not
> > be
> > exactly around the corner (although, we can prioritize working on
> > it if
> > we want).
> > 
> > In principle, I think it's a really nice idea. I still don't have
> > clear
> > in mind how we would handle a couple of situations, but let's leave
> > this
> > aside for now, and stay on-topic.
> > 
> > > Correct me if I'm wrong, but I
> > > think scheduling domains won't enable the guest kernel's
> > > scheduler to
> > > migrate threads more easily between hyperthreads opposed to other
> > > vcpus,
> > > while my approach can easily be extended to do so.
> > > 
> > I'm not sure I understand what you mean here. As far as the (Linux)
> > scheduler is concerned, your patch and mine do the exact same
> > thing:
> > they arrange for the scheduling domains, when they're built, during
> > boot, not to consider hyperthreads or multi-cores.
> > 
> > Mine does it by removing the SMT (and the MC) level from the data
> > structure in the scheduler that is used as a base for configuring
> > the
> > scheduling domains. Yours does it by making the topology bitmaps
> > that
> > are used at each one of those level all look the same. In fact,
> > with
> > your patch applied, I get the exact same situation as with mine, as
> > far
> > as scheduling domains are concerned: there is only one scheduling
> > domain, with a different scheduling group for each vCPU inside it.
> 
> Uuh, nearly.
> 
> Your case won't deal correctly with NUMA, as the generic NUMA code is
> using set_sched_topology() as well. 
>
Mmm... have you tried it and actually seen something like this? AFAICT,
the NUMA-related setup steps of scheduling domains happen after the
basic (as in "without taking NUMAness into account") topology has
already been set, and build on top of it.

It uses set_sched_topology() only in a special case which I'm not sure
we'd be hitting.
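
FWIW, sched_init_numa(), in kernel/sched/core.c, looks roughly like
this (heavily simplified, just to show what I mean by "builds on top
of it"):

          /* copy whatever the current topology table is ... */
          for (i = 0; sched_domain_topology[i].mask; i++)
                  tl[i] = sched_domain_topology[i];

          /* ... and append the NUMA levels on top of it */
          for (j = 0; j < level; i++, j++) {
                  tl[i] = (struct sched_domain_topology_level){
                          .mask = sd_numa_mask,
                          .sd_flags = cpu_numa_flags,
                          .flags = SDTL_OVERLAP,
                          .numa_level = j,
                          SD_INIT_NAME(NUMA)
                  };
          }

          sched_domain_topology = tl;

so, as long as our set_sched_topology() call happens before that runs,
the NUMA levels should still end up on top of whatever we install.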

I'm asking because trying this out right now is not straightforward,
as PV vNUMA, even with Wei's Linux patches and with either your patch
or mine, still runs into the CPUID issue... I'll try that ASAP, but
there are a couple of things I've got to finish in the next few days.

> One of NUMA and Xen will win and
> overwrite the other's settings.
> 
Not sure what this means, but as I said, I'll try.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  7:24       ` Dario Faggioli
@ 2015-09-23  7:35         ` Juergen Gross
  2015-09-23 12:25           ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-23  7:35 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/23/2015 09:24 AM, Dario Faggioli wrote:
> On Mon, 2015-09-21 at 07:49 +0200, Juergen Gross wrote:
>> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
>>> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>>> Hey everyone,
>>>>>
>>>>> So, as a followup of what we were discussing in this thread:
>>>>>
>>>>>     [Xen-devel] PV-vNUMA issue: topology is misinterpreted by
>>>>> the guest
>>>>>     http://lists.xenproject.org/archives/html/xen-devel/2015-07/
>>>>> msg03241.html
>>>>>
>>>>> I started looking in more details at scheduling domains in the
>>>>> Linux
>>>>> kernel. Now, that thread was about CPUID and vNUMA, and their
>>>>> weird way
>>>>> of interacting, while this thing I'm proposing here is
>>>>> completely
>>>>> independent from them both.
>>>>>
>>>>> In fact, no matter whether vNUMA is supported and enabled, and
>>>>> no matter
>>>>> whether CPUID is reporting accurate, random, meaningful or
>>>>> completely
>>>>> misleading information, I think that we should do something
>>>>> about how
>>>>> scheduling domains are build.
>>>>>
>>>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>>>> lifetime) pinning, scheduling domains should not be
>>>>> constructed, in
>>>>> Linux, by looking at *any* topology information, because that
>>>>> just does
>>>>> not make any sense, when vcpus move around.
>>>>>
>>>>> Let me state this again (hoping to make myself as clear as
>>>>> possible): no
>>>>> matter in  how much good shape we put CPUID support, no matter
>>>>> how
>>>>> beautifully and consistently that will interact with both
>>>>> vNUMA,
>>>>> licensing requirements and whatever else. It will be always
>>>>> possible for
>>>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time
>>>>> t1, and
>>>>> on two different NUMA nodes at time t2. Hence, the Linux
>>>>> scheduler
>>>>> should really not skew his load balancing logic toward any of
>>>>> those two
>>>>> situations, as neither of them could be considered correct
>>>>> (since
>>>>> nothing is!).
>>>>>
>>>>> For now, this only covers the PV case. HVM case shouldn't be
>>>>> any
>>>>> different, but I haven't looked at how to make the same thing
>>>>> happen in
>>>>> there as well.
>>>>>
>>>>> OVERALL DESCRIPTION
>>>>> ===================
>>>>> What this RFC patch does is, in the Xen PV case, configure
>>>>> scheduling
>>>>> domains in such a way that there is only one of them, spanning
>>>>> all the
>>>>> pCPUs of the guest.
>>>>>
>>>>> Note that the patch deals directly with scheduling domains, and
>>>>> there is
>>>>> no need to alter the masks that will then be used for building
>>>>> and
>>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs,
>>>>> etc.). That is
>>>>> the main difference between it and the patch proposed by
>>>>> Juergen here:
>>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg
>>>>> 05088.html
>>>>>
>>>>> This means that when, in future, we will fix CPUID handling and
>>>>> make it
>>>>> comply with whatever logic or requirements we want, that won't
>>>>> have  any
>>>>> unexpected side effects on scheduling domains.
>>>>>
>>>>> Information about how the scheduling domains are being
>>>>> constructed
>>>>> during boot are available in `dmesg', if the kernel is booted
>>>>> with the
>>>>> 'sched_debug' parameter. It is also possible to look
>>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>>
>>>>> With the patch applied, only one scheduling domain is created,
>>>>> called
>>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs.
>>>>> You can
>>>>> tell that from the fact that every cpu* folder
>>>>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>>>>> ('domain0'), with all the tweaks and the tunables for our
>>>>> scheduling
>>>>> domain.
>>>>>
>>>>> EVALUATION
>>>>> ==========
>>>>> I've tested this with UnixBench, and by looking at Xen build
>>>>> time, on a
>>>>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0
>>>>> only, for
>>>>> now, but I plan to re-run them in DomUs soon (Juergen may be
>>>>> doing
>>>>> something similar to this in DomU already, AFAUI).
>>>>>
>>>>> I've run the benchmarks with and without the patch applied
>>>>> ('patched'
>>>>> and 'vanilla', respectively, in the tables below), and with
>>>>> different
>>>>> number of build jobs (in case of the Xen build) or of parallel
>>>>> copy of
>>>>> the benchmarks (in the case of UnixBench).
>>>>>
>>>>> What I get from the numbers is that the patch almost always
>>>>> brings
>>>>> benefits, in some cases even huge ones. There are a couple of
>>>>> cases
>>>>> where we regress, but always only slightly so, especially if
>>>>> comparing
>>>>> that to the magnitude of some of the improvement that we get.
>>>>>
>>>>> Bear also in mind that these results are gathered from Dom0,
>>>>> and without
>>>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr
>>>>> pCPUs). If
>>>>> we move things in DomU and do overcommit at the Xen scheduler
>>>>> level, I
>>>>> am expecting even better results.
>>>>>
>>>> ...
>>>>> REQUEST FOR COMMENTS
>>>>> ====================
>>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>>     - what you guys thing of the approach,
>>>>
>>>> Yesterday at the end of the developer meeting we (Andrew, Elena
>>>> and
>>>> myself) discussed this topic again.
>>>>
>>> Hey,
>>>
>>> Sorry for replying so late, I've been on vacation from right after
>>> XenSummit up until yesterday. :-)
>>>
>>>> Regarding a possible future scenario with credit2 eventually
>>>> supporting
>>>> gang scheduling on hyperthreads (which is desirable due to
>>>> security
>>>> reasons [side channel attack] and fairness) my patch seems to be
>>>> more
>>>> suited for that direction than yours.
>>>>
>>> Ok. Just let me mention that 'Credit2 + gang scheduling' might not
>>> be
>>> exactly around the corner (although, we can prioritize working on
>>> it if
>>> we want).
>>>
>>> In principle, I think it's a really nice idea. I still don't have
>>> clear
>>> in mind how we would handle a couple of situations, but let's leave
>>> this
>>> aside for now, and stay on-topic.
>>>
>>>> Correct me if I'm wrong, but I
>>>> think scheduling domains won't enable the guest kernel's
>>>> scheduler to
>>>> migrate threads more easily between hyperthreads opposed to other
>>>> vcpus,
>>>> while my approach can easily be extended to do so.
>>>>
>>> I'm not sure I understand what you mean here. As far as the (Linux)
>>> scheduler is concerned, your patch and mine do the exact same
>>> thing:
>>> they arrange for the scheduling domains, when they're built, during
>>> boot, not to consider hyperthreads or multi-cores.
>>>
>>> Mine does it by removing the SMT (and the MC) level from the data
>>> structure in the scheduler that is used as a base for configuring
>>> the
>>> scheduling domains. Yours does it by making the topology bitmaps
>>> that
>>> are used at each one of those levels all look the same. In fact,
>>> with
>>> your patch applied, I get the exact same situation as with mine, as
>>> far
>>> as scheduling domains are concerned: there is only one scheduling
>>> domain, with a different scheduling group for each vCPU inside it.
>>
>> Uuh, nearly.
>>
>> Your case won't deal correctly with NUMA, as the generic NUMA code is
>> using set_sched_topology() as well.
>>
> Mmm... have you actually tried this and seen it happen? AFAICT, the
> NUMA-related setup of scheduling domains happens after the basic (as
> in "without taking NUMAness into account") topology has been set
> already, and builds on top of it.
>
> It uses set_sched_topology() only in a special case which I'm not sure
> we'd be hitting.

Depends on the hardware. On some AMD processors one socket covers
multiple NUMA nodes. This is the critical case. set_sched_topology()
will be called on those machines possibly multiple times when bringing
up additional cpus.
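
If I remember the x86 code correctly, the hook is in
arch/x86/kernel/smpboot.c: when it finds cores of one package sitting
in different NUMA nodes it installs its own topology table (function
and array names quoted from memory, so they may be slightly off):

static void primarily_use_numa_for_topology(void)
{
	/* drop the DIE level; the NUMA levels describe the package instead */
	set_sched_topology(numa_inside_package_topology);
}

So that call and a Xen-specific set_sched_topology() would simply
overwrite each other's table, and whichever runs last wins.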

> I'm asking because trying this out, right now, is not straightforward,
> as PV vNUMA, even with Wei's Linux patches and with either your patch
> or mine, still runs into the CPUID issue... I'll try that ASAP, but
> there are a couple of things I've got to finish over the next few days.
>
>> One of NUMA and Xen will win and
>> overwrite the other's settings.
>>
> Not sure what this means, but as I said, I'll try.

Make sure to use the correct hardware (I'm pretty sure this should be
the AMD "Magny-Cours" [1]).


Juergen

[1]: 
http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  4:36           ` Juergen Gross
@ 2015-09-23  8:30             ` Dario Faggioli
  2015-09-23  9:44               ` Juergen Gross
  2015-09-23 10:23             ` George Dunlap
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2015-09-23  8:30 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 6107 bytes --]

On Wed, 2015-09-23 at 06:36 +0200, Juergen Gross wrote:

> On 09/22/2015 06:22 PM, George Dunlap wrote:
> > Juergen / Dario, could one of you summarize your two approaches, 
> > and the
> > (alleged) advantages and disadvantages of each one?
> 
> Okay, I'll have a try:
> 
Thanks for this! ;-)

> The problem we want to solve:
> -----------------------------
> 
> The Linux kernel is gathering cpu topology data during boot via the
> CPUID instruction on each processor coming online. This data is
> primarily used in the scheduler to decide to which cpu a thread
> should
> be migrated when this seems to be necessary. There are other users of
> the topology information in the kernel (e.g. some drivers try to do
> optimizations like core-specific queues/lists).
> 
> When started in a virtualized environment the obtained data is next
> to
> useless or even wrong, as it is reflecting only the status of the
> time
> of booting the system. Scheduling of the (v)cpus done by the
> hypervisor
> is changing the topology beneath the feet of the Linux kernel without
> reflecting this in the gathered topology information. So any
> decisions
> taken based on that data will be clueless and possibly just wrong.
> 
Exactly.

> The minimal solution is to change the topology data in the kernel in
> a
> way that all cpus are regarded as equal regarding their relation to
> each
> other (e.g. when migrating a thread to another cpu no cpu is
> preferred
> as a target).
> 
> The topology information of the CPUID instruction is, however, even
> accessible from user mode and might be used for licensing purposes of
> any user program (e.g. by limiting the software to run on a specific
> number of cores or sockets). So just mangling the data returned by
> CPUID in the hypervisor seems not to be a general solution, while we
> might want to do it at least optionally in the future.
> 
Yep. It turned out that, although it's what started all this, CPUID
handling is a somewhat related but mostly independent problem. :-)

> In the future we might want to support either dynamic topology
> updates
> or be able to tell the kernel to use some of the topology data, e.g.
> when pinning vcpus.
> 
Indeed, at least for the latter. Dynamic updates look really difficult
to me, but they would be ideal. Let's see.

> Solution 1 (Dario):
> -------------------
> 
> Don't use the CPUID derived topology information in the Linux
> scheduler,
> but let it use a simple "flat" topology by setting own scheduler
> domain
> data under Xen.
> 
> Advantages:
> + very clean solution regarding the scheduler interface
>
Yes, this is, I think, one of the main advantages of the patch. The
scheduler offers architectures an interface for defining their topology
requirements, and I'm simply using it to specify ours: the right tool
for the job. :-D
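
To make this concrete, the core of the approach boils down to something
like the following (a minimal sketch from memory, not the literal hunk
from the RFC patch, and xen_setup_sched_topology() is just an
illustrative name):

#include <linux/sched.h>
#include <linux/topology.h>

static struct sched_domain_topology_level xen_sched_domain_topology[] = {
	{ cpu_cpu_mask, SD_INIT_NAME(VCPU) },	/* one flat level, no SMT/MC */
	{ NULL, },
};

static void xen_setup_sched_topology(void)
{
	/* replace the default SMT/MC/DIE hierarchy with the flat one */
	set_sched_topology(xen_sched_domain_topology);
}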

> + scheduler decisions are based on a minimal data set
> + small patch
> 
> Disadvantages:
> - covers the scheduler only, drivers still use the "wrong" data
>
This is a good point. Covering only the scheduler was the patch's
purpose, TBH, but it's certainly true that, if we need something
similar elsewhere, we need to do more.

> - a little bit hacky regarding some NUMA architectures (needs either
> a
>    hook in the code dealing with that architecture or multiple
> scheduler
>    domain data overwrites)
>
As I said in my other email, I'll double check (yes, I also think this
is about AMD boxes with intra-socket NUMA nodes).

> - future enhancements will make the solution less clean (either need
>    duplicating scheduler domain data or some new hooks in scheduler
>    domain interface)
> 
This one, I'm not sure I understand.

> Solution 2 (Juergen):
> ---------------------
> 
> When booted as a Xen guest modify the topology data built during boot
> resulting in the same simple "flat" topology as in Dario's solution.
> 
> Advantages:
> + the simple topology is seen by all consumers of topology data as
> the
>    data itself is modified accordingly
>
Yep, that's a good point.

> + small patch

> + future enhancements rather easy by selecting which data to modify
>
As with the corresponding '-' above, I'm not really sure what this means.
> 
> Disadvantages:
> - interface to scheduler not as clean as in Dario's approach
> - scheduler decisions are based on multiple layers of topology data
>    where one layer would be enough to describe the topology
> 
This is not too big of a deal, IMO. Not at runtime, at least, as far as
my investigation has gone so far. Initialization (of scheduling
domains) is a bit clumsy in this case, as scheduling domains are
created and then destroyed/collapsed, but once they are set up, the net
effect is that there's only one scheduling domain with Juergen's patch
too, exactly as with mine.

> Dario, are you okay with this summary?
>
To most of it, yes, and thanks again for it.

Allow me to add a few points, off the top of my head:

 * we need to check whether the two approaches have the same
   performance. In principle, they really should, and early results
   seem to confirm that, but I'd like to run the full set of benches
   (and I'll do that ASAP);
 * I think we want to run even more benchmarks, and run them in
   different (over)load conditions to better assess the effect of the
   change;
 * both our patches provide a solution for Xen (for Xen PV guests, at
   least for now, to be more precise). It is very likely that, e.g.,
   KVM is in a similar situation, hence it may be worth looking for a
   more general solution, especially if that buys us something (e.g.,
   HVM support made easy?)

Thanks and Regards,
Dario

PS. BTW, Juergen, you're not on IRC, on #xendevel, are you?

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  8:30             ` Dario Faggioli
@ 2015-09-23  9:44               ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2015-09-23  9:44 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/23/2015 10:30 AM, Dario Faggioli wrote:
> On Wed, 2015-09-23 at 06:36 +0200, Juergen Gross wrote:
>
>> On 09/22/2015 06:22 PM, George Dunlap wrote:
>>> Juergen / Dario, could one of you summarize your two approaches,
>>> and the
>>> (alleged) advantages and disadvantages of each one?
>>
>> Okay, I'll have a try:
>>
> Thanks for this! ;-)
>
>> The problem we want to solve:
>> -----------------------------
>>
>> The Linux kernel is gathering cpu topology data during boot via the
>> CPUID instruction on each processor coming online. This data is
>> primarily used in the scheduler to decide to which cpu a thread
>> should
>> be migrated when this seems to be necessary. There are other users of
>> the topology information in the kernel (e.g. some drivers try to do
>> optimizations like core-specific queues/lists).
>>
>> When started in a virtualized environment the obtained data is next
>> to
>> useless or even wrong, as it is reflecting only the status of the
>> time
>> of booting the system. Scheduling of the (v)cpus done by the
>> hypervisor
>> is changing the topology beneath the feet of the Linux kernel without
>> reflecting this in the gathered topology information. So any
>> decisions
>> taken based on that data will be clueless and possibly just wrong.
>>
> Exactly.
>
>> The minimal solution is to change the topology data in the kernel in
>> a
>> way that all cpus are regarded as equal regarding their relation to
>> each
>> other (e.g. when migrating a thread to another cpu no cpu is
>> preferred
>> as a target).
>>
>> The topology information of the CPUID instruction is, however, even
>> accessible from user mode and might be used for licensing purposes of
>> any user program (e.g. by limiting the software to run on a specific
>> number of cores or sockets). So just mangling the data returned by
>> CPUID in the hypervisor seems not to be a general solution, while we
>> might want to do it at least optionally in the future.
>>
> Yep. It turned out that, although it's what started all this, CPUID
> handling is a somewhat related but mostly independent problem. :-)
>
>> In the future we might want to support either dynamic topology
>> updates
>> or be able to tell the kernel to use some of the topology data, e.g.
>> when pinning vcpus.
>>
> Indeed, at least for the latter. Dynamic updates look really difficult
> to me, but they would be ideal. Let's see.
>
>> Solution 1 (Dario):
>> -------------------
>>
>> Don't use the CPUID derived topology information in the Linux
>> scheduler,
>> but let it use a simple "flat" topology by setting own scheduler
>> domain
>> data under Xen.
>>
>> Advantages:
>> + very clean solution regarding the scheduler interface
>>
> Yes, this is, I think, one of the main advantages of the patch. The
> scheduler offers architectures an interface for defining their topology
> requirements, and I'm simply using it to specify ours: the right tool
> for the job. :-D
>
>> + scheduler decisions are based on a minimal data set
>> + small patch
>>
>> Disadvantages:
>> - covers the scheduler only, drivers still use the "wrong" data
>>
> This is a good point. Covering only the scheduler was the patch's
> purpose, TBH, but it's certainly true that, if we need something
> similar elsewhere, we need to do more.
>
>> - a little bit hacky regarding some NUMA architectures (needs either
>> a
>>     hook in the code dealing with that architecture or multiple
>> scheduler
>>     domain data overwrites)
>>
> As I said in my other email, I'll double check (yes, I also think this
> is about AMD boxes with intra-socket NUMA nodes).
>
>> - future enhancements will make the solution less clean (either need
>>     duplicating scheduler domain data or some new hooks in scheduler
>>     domain interface)
>>
> This one, I'm not sure I understand.

What would you do to keep the topology information of one level, e.g.
hyperthreads, in case we had a gang scheduler in Xen? Either you would
copy the line:

{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },

from kernel/sched/core.c into your topology array, or you would add a
way in kernel/sched/core.c to remove all but this entry and add your
entry on top of it.
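
Just to illustrate the first option (not meant as actual patch code,
and assuming the flat Xen level keeps the VCPU name from the RFC):

static struct sched_domain_topology_level xen_sched_domain_topology[] = {
#ifdef CONFIG_SCHED_SMT
	/* duplicated from the default table in kernel/sched/core.c */
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(VCPU) },
	{ NULL, },
};

That duplication of the SMT line is exactly what I'd prefer to avoid.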

>
>> Solution 2 (Juergen):
>> ---------------------
>>
>> When booted as a Xen guest modify the topology data built during boot
>> resulting in the same simple "flat" topology as in Dario's solution.
>>
>> Advantages:
>> + the simple topology is seen by all consumers of topology data as
>> the
>>     data itself is modified accordingly
>>
> Yep, that's a good point.
>
>> + small patch
>
>> + future enhancements rather easy by selecting which data to modify
>>
> As with the corresponding '-' above, I'm not really sure what this means.

In the case mentioned above I just wouldn't zap the
topology_sibling_cpumask in my patch.
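
I.e., the mangling in my patch amounts to (roughly; a simplified
sketch, not the actual hunk) a per-cpu loop like:

	int cpu;

	for_each_possible_cpu(cpu) {
		/* make every cpu look unrelated to all the others */
		cpumask_copy(topology_core_cpumask(cpu), cpumask_of(cpu));
		cpumask_copy(topology_sibling_cpumask(cpu), cpumask_of(cpu));
	}

and keeping the hyperthread information would just mean leaving the
second cpumask_copy() out.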

>>
>> Disadvantages:
>> - interface to scheduler not as clean as in Dario's approach
>> - scheduler decisions are based on multiple layers of topology data
>>     where one layer would be enough to describe the topology
>>
> This is not too big of a deal, IMO. Not at runtime, at least, as far as
> my investigation has gone so far. Initialization (of scheduling
> domains) is a bit clumsy in this case, as scheduling domains are
> created and then destroyed/collapsed, but once they are set up, the net
> effect is that there's only one scheduling domain with Juergen's patch
> too, exactly as with mine.
>
>> Dario, are you okay with this summary?
>>
> To most of it, yes, and thanks again for it.
>
> Allow me to add a few points, off the top of my head:
>
>   * we need to check whether the two approaches have the same
>     performance. In principle, they really should, and early results
>     seem to confirm that, but I'd like to run the full set of benches
>     (and I'll do that ASAP);

Thanks.

>   * I think we want to run even more benchmarks, and run them in
>     different (over)load conditions to better assess the effect of the
>     change;
>   * both our patches provide a solution for Xen (for Xen PV guests, at
>     least for now, to be more precise). It is very likely that, e.g.,
>     KVM is in a similar situation, hence it may be worth looking for a
>     more general solution, especially if that buys us something (e.g.,
>     HVM support made easy?)

I wanted to look at this as soon as we've decided which way to go.

I had some discussion with a kvm guy last week and he didn't seem
convinced that they need anything other than mangling CPUID (which they
already do).

>
> Thanks and Regards,
> Dario
>
> PS. BTW, Juergen, you're not on IRC, on #xendevel, are you?

I'd like to, but I'd need an invitation. My user name is juergen_gross.


Juergen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  4:36           ` Juergen Gross
  2015-09-23  8:30             ` Dario Faggioli
@ 2015-09-23 10:23             ` George Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2015-09-23 10:23 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/23/2015 05:36 AM, Juergen Gross wrote:
> On 09/22/2015 06:22 PM, George Dunlap wrote:
>> On 09/22/2015 05:42 AM, Juergen Gross wrote:
>>> One other thing I just discovered: there are other consumers of the
>>> topology sibling masks (e.g. topology_sibling_cpumask()) as well.
>>>
>>> I think we would want to avoid any optimizations based on those in
>>> drivers as well, not only in the scheduler.
>>
>> I'm beginning to lose the thread of the discussion here a bit.
>>
>> Juergen / Dario, could one of you summarize your two approaches, and the
>> (alleged) advantages and disadvantages of each one?
> 
> Okay, I'll have a try:
> 
> The problem we want to solve:
> -----------------------------
> 
> The Linux kernel is gathering cpu topology data during boot via the
> CPUID instruction on each processor coming online. This data is
> primarily used in the scheduler to decide to which cpu a thread should
> be migrated when this seems to be necessary. There are other users of
> the topology information in the kernel (e.g. some drivers try to do
> optimizations like core-specific queues/lists).
> 
> When started in a virtualized environment the obtained data is next to
> useless or even wrong, as it is reflecting only the status of the time
> of booting the system. Scheduling of the (v)cpus done by the hypervisor
> is changing the topology beneath the feet of the Linux kernel without
> reflecting this in the gathered topology information. So any decisions
> taken based on that data will be clueless and possibly just wrong.
> 
> The minimal solution is to change the topology data in the kernel in a
> way that all cpus are regarded as equal regarding their relation to each
> other (e.g. when migrating a thread to another cpu no cpu is preferred
> as a target).
> 
> The topology information of the CPUID instruction is, however, even
> accessible from user mode and might be used for licensing purposes of
> any user program (e.g. by limiting the software to run on a specific
> number of cores or sockets). So just mangling the data returned by
> CPUID in the hypervisor seems not to be a general solution, while we
> might want to do it at least optionally in the future.
> 
> In the future we might want to support either dynamic topology updates
> or be able to tell the kernel to use some of the topology data, e.g.
> when pinning vcpus.
> 
> 
> Solution 1 (Dario):
> -------------------
> 
> Don't use the CPUID derived topology information in the Linux scheduler,
> but let it use a simple "flat" topology by setting own scheduler domain
> data under Xen.
> 
> Advantages:
> + very clean solution regarding the scheduler interface
> + scheduler decisions are based on a minimal data set
> + small patch
> 
> Disadvantages:
> - covers the scheduler only, drivers still use the "wrong" data
> - a little bit hacky regarding some NUMA architectures (needs either a
>   hook in the code dealing with that architecture or multiple scheduler
>   domain data overwrites)
> - future enhancements will make the solution less clean (either need
>   duplicating scheduler domain data or some new hooks in scheduler
>   domain interface)
> 
> 
> Solution 2 (Juergen):
> ---------------------
> 
> When booted as a Xen guest modify the topology data built during boot
> resulting in the same simple "flat" topology as in Dario's solution.
> 
> Advantages:
> + the simple topology is seen by all consumers of topology data as the
>   data itself is modified accordingly
> + small patch
> + future enhancements rather easy by selecting which data to modify
> 
> Disadvantages:
> - interface to scheduler not as clean as in Dario's approach
> - scheduler decisions are based on multiple layers of topology data
>   where one layer would be enough to describe the topology
> 
> 
> Dario, are you okay with this summary?

Thanks -- that's very helpful.

 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  7:35         ` Juergen Gross
@ 2015-09-23 12:25           ` Boris Ostrovsky
  0 siblings, 0 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-09-23 12:25 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Stefano Stabellini

On 09/23/2015 03:35 AM, Juergen Gross wrote:
>
> Depends on the hardware. On some AMD processors one socket covers
> multiple NUMA nodes. This is the critical case. set_sched_topology()
> will be called on those machines possibly multiple times when bringing
> up additional cpus.
>
>> I'm asking because trying this out, right now, is not straightforward,
>> as PV vNUMA, even with Wei's Linux patches and with either your patch
>> or mine, still runs into the CPUID issue... I'll try that ASAP, but
>> there are a couple of things I've got to finish over the next few days.
>>
>>> One of NUMA and Xen will win and
>>> overwrite the other's settings.
>>>
>> Not sure what this means, but as I said, I'll try.
>
> Make sure to use the correct hardware (I'm pretty sure this should be
> the AMD "Magny-Cours" [1]).
>
>
> Juergen
>
> [1]: 
> http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/
>


There are a few family 0x10 and 0x15 processors that are like that. You
can see whether you have such a system by comparing the number of NUMA
nodes with the number of physical IDs, e.g.:

[root@ovs106 ~]# numactl --hardware |grep available
available: 4 nodes (0-3)
[root@ovs106 ~]# grep "physical id" /proc/cpuinfo | uniq
physical id    : 0
physical id    : 1
[root@ovs106 ~]#


-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-09-23 12:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
2015-08-18 16:53 ` Konrad Rzeszutek Wilk
2015-08-20 18:16 ` Juergen Groß
2015-08-31 16:12   ` Boris Ostrovsky
2015-09-02 11:58     ` Juergen Gross
2015-09-02 14:08       ` Boris Ostrovsky
2015-09-02 14:30         ` Juergen Gross
2015-09-15 17:16           ` [Xen-devel] " Dario Faggioli
2015-09-15 16:50   ` Dario Faggioli
2015-09-21  5:49     ` Juergen Gross
2015-09-22  4:42       ` Juergen Gross
2015-09-22 16:22         ` George Dunlap
2015-09-23  4:36           ` Juergen Gross
2015-09-23  8:30             ` Dario Faggioli
2015-09-23  9:44               ` Juergen Gross
2015-09-23 10:23             ` George Dunlap
2015-09-23  7:24       ` Dario Faggioli
2015-09-23  7:35         ` Juergen Gross
2015-09-23 12:25           ` Boris Ostrovsky
2015-08-27 10:24 ` George Dunlap
2015-08-27 17:05   ` [Xen-devel] " George Dunlap
2015-09-15 14:32   ` Dario Faggioli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).