* [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
@ 2015-08-18 15:55 Dario Faggioli
  2015-08-18 16:53 ` Konrad Rzeszutek Wilk
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dario Faggioli @ 2015-08-18 15:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, linux-kernel,
	Stefano Stabellini, George Dunlap


Hey everyone,

So, as a followup of what we were discussing in this thread:

 [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
 http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html

I started looking in more detail at scheduling domains in the Linux
kernel. Now, that thread was about CPUID and vNUMA, and their weird way
of interacting, while what I'm proposing here is completely
independent of them both.

In fact, no matter whether vNUMA is supported and enabled, and no matter
whether CPUID is reporting accurate, random, meaningful or completely
misleading information, I think that we should do something about how
scheduling domains are built.

Fact is, unless we use 1:1 and immutable (across the guest's entire
lifetime) pinning, scheduling domains should not be constructed, in
Linux, by looking at *any* topology information, because that just does
not make any sense when vCPUs move around.

Let me state this again (hoping to make myself as clear as possible): no
matter how good a shape we put CPUID support in, and no matter how
beautifully and consistently that will interact with vNUMA, licensing
requirements and whatever else, it will always be possible for
vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
on two different NUMA nodes at time t2. Hence, the Linux scheduler
really should not skew its load balancing logic toward either of those
situations, as neither of them can be considered correct (since
nothing is!).

For now, this only covers the PV case. The HVM case shouldn't be any
different, but I haven't yet looked at how to make the same thing happen
there as well.

OVERALL DESCRIPTION
===================
What this RFC patch does is, in the Xen PV case, configure scheduling
domains in such a way that there is only one of them, spanning all the
pCPUs of the guest.
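
Concretely, this is done through the kernel's set_sched_topology() hook:
instead of the default, multi-level x86 table (SMT, MC, DIE, plus NUMA
levels appended at boot where applicable), the patch installs a table
with a single level, whose mask callback simply returns all online CPUs.
For reference, the default table looks roughly like this (a sketch of
default_topology[] from kernel/sched/core.c of this era; the exact flags
callbacks and config options may differ):

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

The patch below swaps this out for a single 'VCPU' level built on
cpu_online_mask, so the scheduler sees one flat group of equivalent CPUs
instead of an SMT/MC/DIE (and possibly NUMA) hierarchy.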

Note that the patch deals directly with scheduling domains, and there is
no need to alter the masks that will then be used for building and
reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
the main difference between it and the patch proposed by Juergen here:
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html

This means that when, in the future, we fix CPUID handling and make it
comply with whatever logic or requirements we want, that won't have any
unexpected side effects on scheduling domains.

Information about how the scheduling domains are being constructed
during boot is available in `dmesg', if the kernel is booted with the
'sched_debug' parameter. It is also possible to look
at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.

With the patch applied, only one scheduling domain is created, called
the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
tell that from the fact that every cpu* folder
in /proc/sys/kernel/sched_domain/ has only one subdirectory
('domain0'), with all the tweaks and the tunables for our scheduling
domain.
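
For instance, a quick way to double check this from userspace is to read
the 'name' file of each CPU's domain0 directory. Here is a minimal
sketch (it assumes CONFIG_SCHED_DEBUG, so that the sched_domain sysctl
tree is actually there, and it just stops at the first missing cpu*
entry):

/* Print the name of each CPU's only remaining scheduling domain. */
#include <stdio.h>

int main(void)
{
	char path[128], name[64];
	int cpu;

	for (cpu = 0; ; cpu++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/proc/sys/kernel/sched_domain/cpu%d/domain0/name", cpu);
		f = fopen(path, "r");
		if (!f)
			break;		/* no more CPUs (or no sched_domain tree) */
		if (fgets(name, sizeof(name), f))
			printf("cpu%d: %s", cpu, name);
		fclose(f);
	}
	return 0;
}

With the patch applied this prints 'VCPU' for every vCPU (and there is
no domain1/ subdirectory at all), while a vanilla kernel would report
the SMT/MC/DIE/NUMA levels derived from the, possibly meaningless,
virtual topology.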

EVALUATION
==========
I've tested this with UnixBench, and by looking at Xen build times, on
16-, 24- and 48-pCPU hosts. I've run the benchmarks in Dom0 only, for
now, but I plan to re-run them in DomUs soon (Juergen may be doing
something similar to this in a DomU already, AFAUI).

I've run the benchmarks with and without the patch applied ('patched'
and 'vanilla', respectively, in the tables below), and with different
numbers of build jobs (in the case of the Xen build) or of parallel
copies of the benchmark (in the case of UnixBench).

What I get from the numbers is that the patch almost always brings
benefits, in some cases even huge ones. There are a couple of cases
where we regress, but always only slightly, especially when compared
to the magnitude of some of the improvements that we get.

Bear also in mind that these results are gathered from Dom0, and without
any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
we move things into DomUs and overcommit at the Xen scheduler level, I
am expecting even better results.

RESULTS
=======
To have a quick idea of how a benchmark went, look at the '%
improvement' row of each table.

I'll put these results online, in a googledoc spreadsheet or something
like that, to make them easier to read, as soon as possible.

*** Intel(R) Xeon(R) E5620 @ 2.40GHz                                                                                                                    
*** pCPUs      16        DOM0 vCPUS  16
*** RAM        12285 MB  DOM0 Memory 9955 MB
*** NUMA nodes 2         
=======================================================================================================================================
MAKE XEN (lower == better)                                                                                                                            
=======================================================================================================================================
# of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24                
vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
---------------------------------------------------------------------------------------------------------------------------------------
                              153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
                              153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
                              153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
                              153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
                              153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
                              154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
                              154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
                              154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
                              154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
                              154.9      154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
---------------------------------------------------------------------------------------------------------------------------------------
 Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
---------------------------------------------------------------------------------------------------------------------------------------
 Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
---------------------------------------------------------------------------------------------------------------------------------------
 % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies                            1 parallel            6 parallel            8 parallel            16 parallel**         24 parallel
vanilla/patched                          vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
====================================================================================================================================================

*** Intel(R) Xeon(R) X5650 @ 2.67GHz
*** pCPUs      24        DOM0 vCPUS  16
*** RAM        36851 MB  DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
---------------------------------------------------------------------------------------------------------------------------------------
                              119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
                              119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
                              119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
                              119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
                              119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
                              119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
                              119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
                              120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
                              120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
                              120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
---------------------------------------------------------------------------------------------------------------------------------------
 Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
---------------------------------------------------------------------------------------------------------------------------------------
 Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
---------------------------------------------------------------------------------------------------------------------------------------
 % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies                            1 parallel            8 parallel             12 parallel           24 parallel**         32 parallel
vanilla/patched                          vanilla     patched   vanilla     patched    vanilla    patched    vanilla    patched    vanilla    patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
====================================================================================================================================================

*** Intel(R) Xeon(R) X5650 @ 2.67GHz
*** pCPUs      48        DOM0 vCPUS  16
*** RAM        393138 MB DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
---------------------------------------------------------------------------------------------------------------------------------------
                              267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
                              268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
                              268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
                              268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
                              269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
                              269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
                              269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
                              270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
                              270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
                              271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
---------------------------------------------------------------------------------------------------------------------------------------
 Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
---------------------------------------------------------------------------------------------------------------------------------------
 Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
---------------------------------------------------------------------------------------------------------------------------------------
 % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies                            1 parallel            20 parallel            24 parallel           48 parallel**         62 parallel
vanilla/patched                          vanilla     patched   vanilla     patched    vanilla    patched    vanilla    patched    vanilla    patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables       2037.6     2037.5    39615.4    38990.5    43976.8    44660.8      51238    51117.4    51672.5    52332.5
Double-Precision Whetstone                  525.1      521.6    10389.7    10429.3    12236.5    12188.8    20897.1    20921.9    26957.5    27035.7
Execl Throughput                            112.1      113.6        799      786.5      715.1      702.3      758.2        744      756.3      765.6
File Copy 1024 bufsize 2000 maxblocks       605.5        622      671.6      630.4      624.3      605.8        599      581.2      447.4      433.7
File Copy 256 bufsize 500 maxblocks           384      382.7      447.2      429.1      464.5      404.3      416.1      428.5      313.8      305.6
File Copy 4096 bufsize 8000 maxblocks       883.7     1100.5       1326       1307     1343.2     1305.9     1260.4     1245.3     1001.4      920.1
Pipe Throughput                             283.7      282.8     5636.6     5634.2       6551       6571      10390    10437.4      10459    10498.9
Pipe-based Context Switching                 41.5      143.7      518.5     1899.1      737.5     2068.8     2877.1     3093.2     2949.3     3184.1
Process Creation                             58.5       78.4      370.7      389.4        338      355.8      380.1      375.5      383.8      369.6
Shell Scripts (1 concurrent)                443.7      475.5     1901.9       1945     1765.1     1789.6       2417     2354.4     2395.3     2362.2
Shell Scripts (8 concurrent)               1283.1     1319.1     2265.4     2209.8     2263.3       2209     2202.7     2216.1     2190.4     2206.5
System Call Overhead                        254.1      254.3      891.6      881.6      971.1      958.3     1446.8     1409.5     1461.7     1429.2
System Benchmarks Index Score               340.8      398.6     1690.6     1866.3     1770.6       1902     2303.5     2300.8     2208.3     2189.8
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score)                       16.960                10.393                 7.421                -0.117                -0.838
====================================================================================================================================================

OVERHEAD EVALUATION
===================

For the Xen build case only, I quickly checked some scheduling related
metrics with `perf stat'. I only did this on the biggest box, for now,
as that is where we see both the largest improvement (in the "-j1" case)
and a couple of slight regressions (although those happen in UnixBench).

We see that using only one, "flat", scheduling domain always means fewer
migrations, while it seems to increase the number of context
switches.
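
(For reference, counters like these can be collected with something along
the lines of `perf stat -e cpu-migrations,context-switches -a -- make -jN';
take that command line just as an illustration of the kind of invocation,
not necessarily the exact one used for the numbers below.)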

===============================================================================================================================================================
                        “-j1”                                  “-j24”                               “-j48”                                “-j62”
---------------------------------------------------------------------------------------------------------------------------------------------------------------
            cpu-migrations  context-switches      cpu-migrations   context-switches      cpu-migrations  context-switches      cpu-migrations  context-switches
---------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla  21,242(0.074 K/s) 46,196(0.160 K/s)   22,992(0.066 K/s)  48,684(0.140 K/s)   24,516(0.064 K/s) 63,391(0.166 K/s)   23,164(0.062 K/s) 68,239(0.182 K/s)
patched  19,522(0.077 K/s) 50,871(0.201 K/s)   20,593(0.059 K/s)  57,688(0.167 K/s)   21,137(0.056 K/s) 63,822(0.169 K/s)   20,830(0.055 K/s) 69,783(0.185 K/s)
===============================================================================================================================================================

REQUEST FOR COMMENTS
====================
Basically, the kind of feedback I'd be really glad to hear is:
 - what you guys think of the approach,
 - whether you think, looking at this preliminary set of numbers, that
   this is something worth continuing to investigate,
 - if yes, what other workloads and benchmarks it would make sense to
   throw at it.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
---
commit 3240f68a08511c3db616cfc2a653e6761e23ff7f
Author: Dario Faggioli <dario.faggioli@citrix.com>
Date:   Tue Aug 18 08:41:38 2015 -0700

    xen: if on Xen, "flatten" the scheduling domain hierarchy
    
    With this patch applied, only one scheduling domain is
    created (called the 'VCPU' domain) spanning all the
    guest's vCPUs.
    
    This is because, since vCPUs are moving around on pCPUs,
    there is no point in building a full hierarchy based on
    *any* topology information, which will just never be
    accurate. Having only one "flat" domain is really the
    only thing that looks sensible.
    
    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 8648438..34f39f1 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -55,6 +55,21 @@ static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id);
 
+const struct cpumask *xen_pcpu_sched_domain_mask(int cpu)
+{
+	return cpu_online_mask;
+}
+
+static struct sched_domain_topology_level xen_sched_domain_topology[] = {
+        { xen_pcpu_sched_domain_mask, SD_INIT_NAME(VCPU) },
+        { NULL, },
+};
+
+static void xen_set_sched_topology(void)
+{
+        set_sched_topology(xen_sched_domain_topology);
+}
+
 /*
  * Reschedule call back.
  */
@@ -335,6 +350,8 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 	}
 	set_cpu_sibling_map(0);
 
+	xen_set_sched_topology();
+
 	if (xen_smp_intr_init(0))
 		BUG();
 


* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
@ 2015-08-18 16:53 ` Konrad Rzeszutek Wilk
  2015-08-20 18:16 ` Juergen Groß
  2015-08-27 10:24 ` George Dunlap
  2 siblings, 0 replies; 22+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-08-18 16:53 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel, herbert.van.den.bergh
  Cc: Juergen Gross, Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Boris Ostrovsky, linux-kernel, Stefano Stabellini, George Dunlap

On August 18, 2015 8:55:32 AM PDT, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>Hey everyone,
>
>So, as a followup of what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
>I started looking in more details at scheduling domains in the Linux
>kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>of interacting, while this thing I'm proposing here is completely
>independent from them both.
>
>In fact, no matter whether vNUMA is supported and enabled, and no
>matter
>whether CPUID is reporting accurate, random, meaningful or completely
>misleading information, I think that we should do something about how
>scheduling domains are build.
>
>Fact is, unless we use 1:1, and immutable (across all the guest
>lifetime) pinning, scheduling domains should not be constructed, in
>Linux, by looking at *any* topology information, because that just does
>not make any sense, when vcpus move around.
>
>Let me state this again (hoping to make myself as clear as possible):
>no
>matter in  how much good shape we put CPUID support, no matter how
>beautifully and consistently that will interact with both vNUMA,
>licensing requirements and whatever else. It will be always possible
>for
>vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>on two different NUMA nodes at time t2. Hence, the Linux scheduler
>should really not skew his load balancing logic toward any of those two
>situations, as neither of them could be considered correct (since
>nothing is!).

What about Windows guests?

>
>For now, this only covers the PV case. HVM case shouldn't be any
>different, but I haven't looked at how to make the same thing happen in
>there as well.
>
>OVERALL DESCRIPTION
>===================
>What this RFC patch does is, in the Xen PV case, configure scheduling
>domains in such a way that there is only one of them, spanning all the
>pCPUs of the guest.

Wow. That is a pretty simple patch!!

>
>Note that the patch deals directly with scheduling domains, and there
>is
>no need to alter the masks that will then be used for building and
>reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That
>is
>the main difference between it and the patch proposed by Juergen here:
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
>This means that when, in future, we will fix CPUID handling and make it
>comply with whatever logic or requirements we want, that won't have 
>any
>unexpected side effects on scheduling domains.
>
>Information about how the scheduling domains are being constructed
>during boot are available in `dmesg', if the kernel is booted with the
>'sched_debug' parameter. It is also possible to look
>at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
>With the patch applied, only one scheduling domain is created, called
>the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>tell that from the fact that every cpu* folder
>in /proc/sys/kernel/sched_domain/ only have one subdirectory
>('domain0'), with all the tweaks and the tunables for our scheduling
>domain.
>
...
>
>REQUEST FOR COMMENTS
>====================
>Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys thing of the approach,
> - whether you think, looking at this preliminary set of numbers, that
>   this is something worth continuing investigating,
> - if yes, what other workloads and benchmark it would make sense to
>   throw at it.
>

The thing I was worried about was that we would be modifying the generic code, but your changes are all in Xen code!

Woot!

In terms of workloads, I am CCing Herbert, who I hope can provide advice on this.

Herbert, the full email is here: 
http://lists.xen.org/archives/html/xen-devel/2015-08/msg01691.html


>Thanks and Regards,
>Dario




* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
  2015-08-18 16:53 ` Konrad Rzeszutek Wilk
@ 2015-08-20 18:16 ` Juergen Groß
  2015-08-31 16:12   ` Boris Ostrovsky
  2015-09-15 16:50   ` Dario Faggioli
  2015-08-27 10:24 ` George Dunlap
  2 siblings, 2 replies; 22+ messages in thread
From: Juergen Groß @ 2015-08-20 18:16 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> Hey everyone,
>
> So, as a followup of what we were discussing in this thread:
>
>   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>   http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
> I started looking in more details at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
>
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are build.
>
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
>
> Let me state this again (hoping to make myself as clear as possible): no
> matter in  how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew his load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
>
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
>
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> pCPUs of the guest.
>
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have  any
> unexpected side effects on scheduling domains.
>
> Information about how the scheduling domains are being constructed
> during boot are available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only have one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
>
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on a
> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
>
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> number of build jobs (in case of the Xen build) or of parallel copy of
> the benchmarks (in the case of UnixBench).
>
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
>
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
>
...
> REQUEST FOR COMMENTS
> ====================
> Basically, the kind of feedback I'd be really glad to hear is:
>   - what you guys thing of the approach,

Yesterday, at the end of the developer meeting, we (Andrew, Elena and I)
discussed this topic again.

Regarding a possible future scenario with credit2 eventually supporting
gang scheduling on hyperthreads (which is desirable for security reasons
[side channel attacks] and for fairness), my patch seems better suited
to that direction than yours. Correct me if I'm wrong, but I think
scheduling domains won't enable the guest kernel's scheduler to migrate
threads more easily between hyperthreads as opposed to other vCPUs,
while my approach can easily be extended to do so.

>   - whether you think, looking at this preliminary set of numbers, that
>     this is something worth continuing investigating,

I believe that, as both approaches lead to the same topology information
being used by the scheduler (all vCPUs are regarded as equal), your
numbers should apply to my patch as well. Would you mind verifying this?

I still believe making the guest scheduler's decisions independent of
cpuid values is the way to go, as this will enable us to support more
scenarios (e.g. cpuid-based licensing). For HVM guests and old PV guests
mangling the cpuid should still be done, though.

>   - if yes, what other workloads and benchmark it would make sense to
>     throw at it.

As you already mentioned, an overcommitted host should be looked at as
well.


Thanks for doing the measurements,


Juergen


* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
  2015-08-18 16:53 ` Konrad Rzeszutek Wilk
  2015-08-20 18:16 ` Juergen Groß
@ 2015-08-27 10:24 ` George Dunlap
  2015-08-27 17:05   ` [Xen-devel] " George Dunlap
  2015-09-15 14:32   ` Dario Faggioli
  2 siblings, 2 replies; 22+ messages in thread
From: George Dunlap @ 2015-08-27 10:24 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Juergen Gross, Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, linux-kernel,
	Stefano Stabellini

On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
> 
> So, as a followup of what we were discussing in this thread:
> 
>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> 
> I started looking in more details at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
> 
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are build.
> 
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
> 
> Let me state this again (hoping to make myself as clear as possible): no
> matter in  how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew his load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
> 
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
> 
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> pCPUs of the guest.
> 
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> 
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have  any
> unexpected side effects on scheduling domains.
> 
> Information about how the scheduling domains are being constructed
> during boot are available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> 
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only have one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
> 
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on a
> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
> 
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> number of build jobs (in case of the Xen build) or of parallel copy of
> the benchmarks (in the case of UnixBench).
> 
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
> 
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
> 
> RESULTS
> =======
> To have a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
> 
> I'll put these results online, in a googledoc spreadsheet or something
> like that, to make them easier to read, as soon as possible.
> 
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz                                                                                                                    
> *** pCPUs      16        DOM0 vCPUS  16
> *** RAM        12285 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2         
> =======================================================================================================================================
> MAKE XEN (lower == better)                                                                                                                            
> =======================================================================================================================================
> # of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24                
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
>                               153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
>                               153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
>                               153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
>                               153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
>                               154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
>                               154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
>                               154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
>                               154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
>                               154.9      154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            6 parrallel           8 parallel            16 parallel**         24 parallel
> vanilla/patched                          vanilla    patched    vanilla    pached     vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
> Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
> Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
> File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
> File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
> File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
> Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
> Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
> Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
> Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
> Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
> System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
> System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      24        DOM0 vCPUS  16
> *** RAM        36851 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
>                               119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
>                               119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
>                               119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
>                               119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
>                               119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
>                               119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
>                               120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
>                               120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
>                               120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            8 parrallel            12 parallel           24 parallel**         32 parallel
> vanilla/patched                          vanilla     patched   vanilla     pached     vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
> Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
> Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
> File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
> File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
> File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
> Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
> Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
> Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
> Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
> Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
> System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
> System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      48        DOM0 vCPUS  16
> *** RAM        393138 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
>                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
>                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
>                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
>                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
>                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
>                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
>                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
>                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
>                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
> ========================================================================================================================================

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen.  Wouldn't it make more sense to stick with
1/6/8/16/24?  That would allow us to have actually comparable numbers.

But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

 -George



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-27 10:24 ` George Dunlap
@ 2015-08-27 17:05   ` George Dunlap
  2015-09-15 14:32   ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2015-08-27 17:05 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dario Faggioli, xen-devel, Juergen Gross, Andrew Cooper,
	Luis R. Rodriguez, linux-kernel, David Vrabel, Boris Ostrovsky,
	Stefano Stabellini

On Thu, Aug 27, 2015 at 11:24 AM, George Dunlap
<george.dunlap@citrix.com> wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:
>> ...
>
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen.  Wouldn't it make more sense to stick with
> 1/6/8/16/24?  That would allow us to have actually comparable numbers.
>
> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.

That said, it's probably a good idea to make this optional somehow, so
that if people do decide to do a pinning / partitioning approach, the
guest scheduler actually can take advantage of topological
information.

 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-20 18:16 ` Juergen Groß
@ 2015-08-31 16:12   ` Boris Ostrovsky
  2015-09-02 11:58     ` Juergen Gross
  2015-09-15 16:50   ` Dario Faggioli
  1 sibling, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-08-31 16:12 UTC (permalink / raw)
  To: Juergen Groß, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap



On 08/20/2015 02:16 PM, Juergen Groß wrote:
> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>> ...
> ...
>> REQUEST FOR COMMENTS
>> ====================
>> Basically, the kind of feedback I'd be really glad to hear is:
>>   - what you guys thing of the approach,
>
> Yesterday at the end of the developer meeting we (Andrew, Elena and
> myself) discussed this topic again.
>
> Regarding a possible future scenario with credit2 eventually supporting
> gang scheduling on hyperthreads (which is desirable due to security
> reasons [side channel attack] and fairness) my patch seems to be more
> suited for that direction than yours. Correct me if I'm wrong, but I
> think scheduling domains won't enable the guest kernel's scheduler to
> migrate threads more easily between hyperthreads opposed to other vcpus,
> while my approach can easily be extended to do so.
>
>>   - whether you think, looking at this preliminary set of numbers, that
>>     this is something worth continuing investigating,
>
> I believe as both approaches lead to the same topology information used
> by the scheduler (all vcpus are regarded as being equal) your numbers
> should apply to my patch as well. Would you mind verifying this?

If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have 
both of your patches?
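
For reference, the early exit Boris is referring to looks roughly like
this in arch/x86/kernel/smpboot.c (a simplified sketch from memory, so
treat names and details as approximate for the kernel version at hand):

void set_cpu_sibling_map(int cpu)
{
        bool has_smt = smp_num_siblings > 1;
        bool has_mp  = has_smt || boot_cpu_data.x86_max_cores > 1;
        struct cpuinfo_x86 *c = &cpu_data(cpu);

        cpumask_set_cpu(cpu, cpu_sibling_setup_mask);

        if (!has_mp) {
                /* no SMT, no multi-core: every mask collapses to this cpu */
                cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
                cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
                cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
                c->booted_cores = 1;
                return;
        }

        /* ... otherwise the sibling/LLC/core masks are filled in from the
         * detected topology and the scheduler builds SMT/MC domains on top */
}

With that branch taken, the sibling, LLC and core masks each contain
only the cpu itself, so the scheduler sees no SMT or multi-core
structure at all.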

Also, it seems to me that Xen guests would not be the only ones having 
to deal with topology inconsistencies due to migrating VCPUs. Don't KVM 
guests, for example, have the same problem? And if yes, perhaps we 
should try solving it in a non-Xen-specific way (especially given that 
both of those patches look pretty simple and thus are presumably easy to 
integrate into common code).

And, as George already pointed out, this should be an optional feature 
--- if a guest spans physical nodes and VCPUs are pinned then we don't 
always want flat topology/domains.

-boris


>
> I still believe making the guest scheduler's decisions independant from
> cpuid values is the way to go, as this will enable us to support more
> scenarios (e.g. cpuid based licensing). For HVM guests and old PV guests
> mangling the cpuid should still be done, though.
>
>>   - if yes, what other workloads and benchmark it would make sense to
>>     throw at it.
>
> As you already mentioned an overcommitted host should be looked at as
> well.
>
>
> Thanks for doing the measurements,
>
>
> Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-31 16:12   ` Boris Ostrovsky
@ 2015-09-02 11:58     ` Juergen Gross
  2015-09-02 14:08       ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-02 11:58 UTC (permalink / raw)
  To: Boris Ostrovsky, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>
>
> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>> ...
>> ...
>>> REQUEST FOR COMMENTS
>>> ====================
>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>   - what you guys thing of the approach,
>>
>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>> myself) discussed this topic again.
>>
>> Regarding a possible future scenario with credit2 eventually supporting
>> gang scheduling on hyperthreads (which is desirable due to security
>> reasons [side channel attack] and fairness) my patch seems to be more
>> suited for that direction than yours. Correct me if I'm wrong, but I
>> think scheduling domains won't enable the guest kernel's scheduler to
>> migrate threads more easily between hyperthreads opposed to other vcpus,
>> while my approach can easily be extended to do so.
>>
>>>   - whether you think, looking at this preliminary set of numbers, that
>>>     this is something worth continuing investigating,
>>
>> I believe as both approaches lead to the same topology information used
>> by the scheduler (all vcpus are regarded as being equal) your numbers
>> should apply to my patch as well. Would you mind verifying this?
>
> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
> both of your patches?

Hmm, sort of.

OTOH this would make it hard to make use of some of the topology
information in case of e.g. pinned vcpus (as George pointed out).

> Also, it seems to me that Xen guests would not be the only ones having
> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
> guests, for example, have the same problem? And if yes, perhaps we
> should try solving it in non-Xen-specific way (especially given that
> both of those patches look pretty simple and thus are presumably easy to
> integrate into common code).

Indeed. I'll have a try.

> And, as George already pointed out, this should be an optional feature
> --- if a guest spans physical nodes and VCPUs are pinned then we don't
> always want flat topology/domains.

Yes, it might be a good idea to be able to keep some of the topology
levels. I'll modify my patch to make this command line selectable.


Juergen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-02 11:58     ` Juergen Gross
@ 2015-09-02 14:08       ` Boris Ostrovsky
  2015-09-02 14:30         ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-09-02 14:08 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 09/02/2015 07:58 AM, Juergen Gross wrote:
> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>
>>
>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> ...
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>   - what you guys thing of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable due to security
>>> reasons [side channel attack] and fairness) my patch seems to be more
>>> suited for that direction than yours. Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads opposed to other 
>>> vcpus,
>>> while my approach can easily be extended to do so.
>>>
>>>>   - whether you think, looking at this preliminary set of numbers, 
>>>> that
>>>>     this is something worth continuing investigating,
>>>
>>> I believe as both approaches lead to the same topology information used
>>> by the scheduler (all vcpus are regarded as being equal) your numbers
>>> should apply to my patch as well. Would you mind verifying this?
>>
>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>> both of your patches?
>
> Hmm, sort of.
>
> OTOH this would it make hard to make use of some of the topology
> information in case of e.g. pinned vcpus (as George pointed out).


I didn't mean to just set has_mp to zero unconditionally (for Xen, or 
any other, guest). We'd need to have some logic as to when to set it to 
false.
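
One possible shape for that logic, as a purely hypothetical sketch
(none of these names exist today; since the guest cannot easily observe
how its vcpus are pinned, it would have to be an explicit opt-in on the
guest kernel command line):

#include <linux/init.h>

/* hypothetical opt-in flag, e.g. booting with "flat_cpu_topology" */
static bool flat_cpu_topology;

static int __init parse_flat_cpu_topology(char *arg)
{
        flat_cpu_topology = true;
        return 0;
}
early_param("flat_cpu_topology", parse_flat_cpu_topology);

/* ... and then, in set_cpu_sibling_map(): */
        bool has_mp = !flat_cpu_topology &&
                      (smp_num_siblings > 1 ||
                       boot_cpu_data.x86_max_cores > 1);

Whoever sets up the pinning (the admin or the toolstack) is the one who
knows whether the topology will stay stable, so it would also be the
one to pass or omit the parameter.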

-boris


>
>> Also, it seems to me that Xen guests would not be the only ones having
>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>> guests, for example, have the same problem? And if yes, perhaps we
>> should try solving it in non-Xen-specific way (especially given that
>> both of those patches look pretty simple and thus are presumably easy to
>> integrate into common code).
>
> Indeed. I'll have a try.
>
>> And, as George already pointed out, this should be an optional feature
>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>> always want flat topology/domains.
>
> Yes, it might be a good idea to be able to keep some of the topology
> levels. I'll modify my patch to make this command line selectable.
>
>
> Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-02 14:08       ` Boris Ostrovsky
@ 2015-09-02 14:30         ` Juergen Gross
  2015-09-15 17:16           ` [Xen-devel] " Dario Faggioli
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-02 14:30 UTC (permalink / raw)
  To: Boris Ostrovsky, Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Luis R. Rodriguez, David Vrabel,
	Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
	George Dunlap

On 09/02/2015 04:08 PM, Boris Ostrovsky wrote:
> On 09/02/2015 07:58 AM, Juergen Gross wrote:
>> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>>
>>>
>>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>>> ...
>>>> ...
>>>>> REQUEST FOR COMMENTS
>>>>> ====================
>>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>>   - what you guys thing of the approach,
>>>>
>>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>>> myself) discussed this topic again.
>>>>
>>>> Regarding a possible future scenario with credit2 eventually supporting
>>>> gang scheduling on hyperthreads (which is desirable due to security
>>>> reasons [side channel attack] and fairness) my patch seems to be more
>>>> suited for that direction than yours. Correct me if I'm wrong, but I
>>>> think scheduling domains won't enable the guest kernel's scheduler to
>>>> migrate threads more easily between hyperthreads opposed to other
>>>> vcpus,
>>>> while my approach can easily be extended to do so.
>>>>
>>>>>   - whether you think, looking at this preliminary set of numbers,
>>>>> that
>>>>>     this is something worth continuing investigating,
>>>>
>>>> I believe as both approaches lead to the same topology information used
>>>> by the scheduler (all vcpus are regarded as being equal) your numbers
>>>> should apply to my patch as well. Would you mind verifying this?
>>>
>>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>>> both of your patches?
>>
>> Hmm, sort of.
>>
>> OTOH this would it make hard to make use of some of the topology
>> information in case of e.g. pinned vcpus (as George pointed out).
>
>
> I didn't mean to just set has_mp to zero unconditionally (for Xen, or
> any other, guest). We'd need to have some logic as to when to set it to
> false.

In case we want to be able to use some of the topology information, this
would mean having two different mechanisms: one to disable all topology
usage and another to disable only parts of it. I'd rather have a way to
specify which levels of the topology information (NUMA nodes, cache
siblings, core siblings) are to be used. Using none of them is then just
one possibility, with all levels disabled.
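
To make that a bit more concrete, I'm thinking of something along the
lines of the following (a rough sketch only, with completely made-up
names; nothing like this exists in the kernel today):

  /* Hypothetical boot parameter selecting which topology levels the
   * kernel may use, e.g. "xen_topology=mc,numa".  All names invented
   * for illustration; needs <linux/init.h> and <linux/string.h>. */
  static unsigned int xen_topo_levels;

  #define XEN_TOPO_SMT   (1 << 0)        /* thread siblings */
  #define XEN_TOPO_MC    (1 << 1)        /* core/cache siblings */
  #define XEN_TOPO_NUMA  (1 << 2)        /* NUMA nodes */

  static int __init parse_xen_topology(char *s)
  {
          if (!s)
                  return -EINVAL;
          if (strstr(s, "smt"))
                  xen_topo_levels |= XEN_TOPO_SMT;
          if (strstr(s, "mc"))
                  xen_topo_levels |= XEN_TOPO_MC;
          if (strstr(s, "numa"))
                  xen_topo_levels |= XEN_TOPO_NUMA;
          return 0;
  }
  early_param("xen_topology", parse_xen_topology);

The topology setup code would then consult xen_topo_levels before
using (or zapping) the respective sibling masks.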


Juergen

>>
>>> Also, it seems to me that Xen guests would not be the only ones having
>>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>>> guests, for example, have the same problem? And if yes, perhaps we
>>> should try solving it in non-Xen-specific way (especially given that
>>> both of those patches look pretty simple and thus are presumably easy to
>>> integrate into common code).
>>
>> Indeed. I'll have a try.
>>
>>> And, as George already pointed out, this should be an optional feature
>>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>>> always want flat topology/domains.
>>
>> Yes, it might be a good idea to be able to keep some of the topology
>> levels. I'll modify my patch to make this command line selectable.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-27 10:24 ` George Dunlap
  2015-08-27 17:05   ` [Xen-devel] " George Dunlap
@ 2015-09-15 14:32   ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2015-09-15 14:32 UTC (permalink / raw)
  To: George Dunlap
  Cc: xen-devel, Juergen Gross, Andrew Cooper, Luis R. Rodriguez,
	linux-kernel, David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 4546 bytes --]

On Thu, 2015-08-27 at 11:24 +0100, George Dunlap wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:

> > *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> > *** pCPUs      48        DOM0 vCPUS  16
> > *** RAM        393138 MB DOM0 Memory 9955 MB
> > *** NUMA nodes 2
> > =======================================================================================================================================
> > MAKE XEN (lower == better)
> > =======================================================================================================================================
> > # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
> > vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
> >                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
> >                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
> >                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
> >                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
> >                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
> >                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
> >                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
> >                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
> >                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
> > ---------------------------------------------------------------------------------------------------------------------------------------
> >  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
> > ========================================================================================================================================
> 
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen.  Wouldn't it make more sense to stick with
> 1/6/8/16/24?  That would allow us to have actually comparable numbers.
> 
Bah, no, sorry, that was a mistake I made when I cut-&-pasted the
tables into the email... Dom0 always has as many vCPUs as the host has
pCPUs. I know this is a rather critical piece of information, so sorry
for messing it up! :-/

> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.
> 
Yeah, that seems to be the case... But I really would like to try more
configurations and more workloads. I'll do that ASAP.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-08-20 18:16 ` Juergen Groß
  2015-08-31 16:12   ` Boris Ostrovsky
@ 2015-09-15 16:50   ` Dario Faggioli
  2015-09-21  5:49     ` Juergen Gross
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2015-09-15 16:50 UTC (permalink / raw)
  To: Juergen Groß
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 9839 bytes --]

On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> > Hey everyone,
> >
> > So, as a followup of what we were discussing in this thread:
> >
> >   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
> >   http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> >
> > I started looking in more details at scheduling domains in the Linux
> > kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> > of interacting, while this thing I'm proposing here is completely
> > independent from them both.
> >
> > In fact, no matter whether vNUMA is supported and enabled, and no matter
> > whether CPUID is reporting accurate, random, meaningful or completely
> > misleading information, I think that we should do something about how
> > scheduling domains are build.
> >
> > Fact is, unless we use 1:1, and immutable (across all the guest
> > lifetime) pinning, scheduling domains should not be constructed, in
> > Linux, by looking at *any* topology information, because that just does
> > not make any sense, when vcpus move around.
> >
> > Let me state this again (hoping to make myself as clear as possible): no
> > matter in  how much good shape we put CPUID support, no matter how
> > beautifully and consistently that will interact with both vNUMA,
> > licensing requirements and whatever else. It will be always possible for
> > vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> > on two different NUMA nodes at time t2. Hence, the Linux scheduler
> > should really not skew his load balancing logic toward any of those two
> > situations, as neither of them could be considered correct (since
> > nothing is!).
> >
> > For now, this only covers the PV case. HVM case shouldn't be any
> > different, but I haven't looked at how to make the same thing happen in
> > there as well.
> >
> > OVERALL DESCRIPTION
> > ===================
> > What this RFC patch does is, in the Xen PV case, configure scheduling
> > domains in such a way that there is only one of them, spanning all the
> > pCPUs of the guest.
> >
> > Note that the patch deals directly with scheduling domains, and there is
> > no need to alter the masks that will then be used for building and
> > reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> > the main difference between it and the patch proposed by Juergen here:
> > http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> >
> > This means that when, in future, we will fix CPUID handling and make it
> > comply with whatever logic or requirements we want, that won't have  any
> > unexpected side effects on scheduling domains.
> >
> > Information about how the scheduling domains are being constructed
> > during boot are available in `dmesg', if the kernel is booted with the
> > 'sched_debug' parameter. It is also possible to look
> > at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> >
> > With the patch applied, only one scheduling domain is created, called
> > the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> > tell that from the fact that every cpu* folder
> > in /proc/sys/kernel/sched_domain/ only have one subdirectory
> > ('domain0'), with all the tweaks and the tunables for our scheduling
> > domain.
> >
> > EVALUATION
> > ==========
> > I've tested this with UnixBench, and by looking at Xen build time, on a
> > 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> > now, but I plan to re-run them in DomUs soon (Juergen may be doing
> > something similar to this in DomU already, AFAUI).
> >
> > I've run the benchmarks with and without the patch applied ('patched'
> > and 'vanilla', respectively, in the tables below), and with different
> > number of build jobs (in case of the Xen build) or of parallel copy of
> > the benchmarks (in the case of UnixBench).
> >
> > What I get from the numbers is that the patch almost always brings
> > benefits, in some cases even huge ones. There are a couple of cases
> > where we regress, but always only slightly so, especially if comparing
> > that to the magnitude of some of the improvement that we get.
> >
> > Bear also in mind that these results are gathered from Dom0, and without
> > any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> > we move things in DomU and do overcommit at the Xen scheduler level, I
> > am expecting even better results.
> >
> ...
> > REQUEST FOR COMMENTS
> > ====================
> > Basically, the kind of feedback I'd be really glad to hear is:
> >   - what you guys thing of the approach,
> 
> Yesterday at the end of the developer meeting we (Andrew, Elena and
> myself) discussed this topic again.
> 
Hey,

Sorry for replying so late, I've been on vacation from right after
XenSummit up until yesterday. :-)

> Regarding a possible future scenario with credit2 eventually supporting
> gang scheduling on hyperthreads (which is desirable due to security
> reasons [side channel attack] and fairness) my patch seems to be more
> suited for that direction than yours. 
>
Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
exactly around the corner (although, we can prioritize working on it if
we want).

In principle, I think it's a really nice idea. I still don't have a
clear picture of how we would handle a couple of situations, but let's
leave this aside for now and stay on-topic.

> Correct me if I'm wrong, but I
> think scheduling domains won't enable the guest kernel's scheduler to
> migrate threads more easily between hyperthreads opposed to other vcpus,
> while my approach can easily be extended to do so.
> 
I'm not sure I understand what you mean here. As far as the (Linux)
scheduler is concerned, your patch and mine do the exact same thing:
they arrange for the scheduling domains, when they're built, during
boot, not to consider hyperthreads or multi-cores.

Mine does it by removing the SMT (and the MC) level from the data
structure in the scheduler that is used as a base for configuring the
scheduling domains. Yours does it by making the topology bitmaps that
are used at each one of those levels all look the same. In fact, with
your patch applied, I get the exact same situation as with mine, as far
as scheduling domains are concerned: there is only one scheduling
domain, with a different scheduling group for each vCPU inside it.

In my case, that one scheduling domain is the special one that I define
in xen_sched_domain_topology (in arch/x86/xen/smp.c), in my patch (it's
called PCPU). In your case, it's the DIE scheduling domain, i.e., the
one coming from the last level defined in default_topology (in
kernel/sched/core.c). I'd have to recheck, but ISTR that, since you're
setting all the bitmaps for all the levels to the same value, previous
levels are created, recognised to be all equal, and merged/discarded.

IOW, mine is using a scheduler-provided interface explicitly, via
set_sched_topology(), i.e., the way an architecture (and in this case
the architecture would be 'xen') lets the scheduler know about its
topology quirks:
 http://lxr.free-electrons.com/ident?i=set_sched_topology
Basically, I'm telling the scheduler <<Hey, you're on Xen, don't bother
looking for hyperthreads, as they don't make any sense!>>.
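
To give an idea, the core of my patch is not much more than something
like this (a minimal sketch of the approach, not the literal code; the
level name and the mask used here are just illustrative):

  /* one single scheduling domain level; cpu_cpu_mask is the same mask
   * the default DIE level uses, the actual patch spans all the guest's
   * vCPUs.  set_sched_topology() is declared in <linux/sched.h>. */
  static struct sched_domain_topology_level xen_sched_domain_topology[] = {
          { cpu_cpu_mask, SD_INIT_NAME(VCPU) },
          { NULL, },
  };

  static void __init xen_set_sched_topology(void)
  {
          /* replace default_topology (SMT/MC/DIE) with the flat one */
          set_sched_topology(xen_sched_domain_topology);
  }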

Yours is changing the topology bitmaps directly. Basically, you're not
telling the scheduler anything; it then goes looking for SMT and MC
siblings, but finds none.
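
IIUC, the effect of your patch could be summarised with something like
this (again, just a sketch to illustrate what I mean, not your actual
code; the mask accessor names are the current mainline ones):

  /* every vCPU ends up alone in its thread and core sibling masks,
   * so the SMT and MC levels degenerate and get merged away */
  static void xen_flatten_topology_masks(int cpu)
  {
          cpumask_copy(topology_sibling_cpumask(cpu), cpumask_of(cpu));
          cpumask_copy(topology_core_cpumask(cpu), cpumask_of(cpu));
  }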

All this being said, the effect is the same, and the reason why the
scheduling inside the guest changes --between mainline and either mine
or your patch-- is the scheduling domains, or at least that is how I
understood it.

Therefore, I don't really understand why you're saying one approach is
more easily extensible toward anything... What am I missing?

> >   - whether you think, looking at this preliminary set of numbers, that
> >     this is something worth continuing investigating,
> 
> I believe as both approaches lead to the same topology information used
> by the scheduler (all vcpus are regarded as being equal) your numbers
> should apply to my patch as well. Would you mind verifying this?
> 
I'll run some tests, but yes, I 100% expect the numbers to look the
same. Actually, I already did a very quick check for a few cases, and
that is indeed the case, but I'll report back when I have the full
data set.

> >   - if yes, what other workloads and benchmark it would make sense to
> >     throw at it.
> 
> As you already mentioned an overcommitted host should be looked at as
> well.
> 
Sure.

> Thanks for doing the measurements,
> 
And more of them will be coming. ISTR you telling me in Seattle that you
(or some teammates of yours) were running some benches too... Any output
from that yet? :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-02 14:30         ` Juergen Gross
@ 2015-09-15 17:16           ` Dario Faggioli
  0 siblings, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2015-09-15 17:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Boris Ostrovsky, xen-devel, Andrew Cooper, Stefano Stabellini,
	linux-kernel, George Dunlap, David Vrabel, Luis R. Rodriguez

[-- Attachment #1: Type: text/plain, Size: 2117 bytes --]

On Wed, 2015-09-02 at 16:30 +0200, Juergen Gross wrote:
> On 09/02/2015 04:08 PM, Boris Ostrovsky wrote:
> > On 09/02/2015 07:58 AM, Juergen Gross wrote:
> >> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:

> >>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
> >>> both of your patches?
> >>
> >> Hmm, sort of.
> >>
> >> OTOH this would it make hard to make use of some of the topology
> >> information in case of e.g. pinned vcpus (as George pointed out).
> >
> >
> > I didn't mean to just set has_mp to zero unconditionally (for Xen, or
> > any other, guest). We'd need to have some logic as to when to set it to
> > false.
> 
> In case we want to be able to use some of the topology information this
> would mean we'd have two different mechanisms to either disable all
> topology usage or only parts of it. I'd rather have a way to specify
> which levels of the topology information (numa nodes, cache siblings,
> core siblings) are to be used. Using none is just one possibility with
> all levels disabled.
> 
I agree, indeed, acting on has_mp seems overkill/not ideal to me too
(I'm not even sure I fully understand how it's used in
set_cpu_sibling_map()... I'll dig more).
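
From a first quick look at arch/x86/kernel/smpboot.c, what that code
does seems to be roughly the following (heavily trimmed, and written
down from a quick read, so take it with a grain of salt):

          bool has_smt = smp_num_siblings > 1;
          bool has_mp = has_smt || c->x86_max_cores > 1;

          if (!has_mp) {
                  /* UP-like case: each cpu is alone in its sibling,
                   * LLC and core masks */
                  cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
                  cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
                  cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
                  c->booted_cores = 1;
                  return;
          }

IOW, forcing has_mp to false would indeed flatten the sibling, LLC and
core maps, but only in a rather indirect way.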

However...

> >>
> >>> Also, it seems to me that Xen guests would not be the only ones having
> >>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
> >>> guests, for example, have the same problem? And if yes, perhaps we
> >>> should try solving it in non-Xen-specific way (especially given that
> >>> both of those patches look pretty simple and thus are presumably easy to
> >>> integrate into common code).
> >>
> >> Indeed. I'll have a try.
> >>
...yes, this is an interesting point, and it's worth trying to look at
how to implement things that way.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-15 16:50   ` Dario Faggioli
@ 2015-09-21  5:49     ` Juergen Gross
  2015-09-22  4:42       ` Juergen Gross
  2015-09-23  7:24       ` Dario Faggioli
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2015-09-21  5:49 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/15/2015 06:50 PM, Dario Faggioli wrote:
> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>> Hey everyone,
>>>
>>> So, as a followup of what we were discussing in this thread:
>>>
>>>    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>    http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>
>>> I started looking in more details at scheduling domains in the Linux
>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>> of interacting, while this thing I'm proposing here is completely
>>> independent from them both.
>>>
>>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>>> whether CPUID is reporting accurate, random, meaningful or completely
>>> misleading information, I think that we should do something about how
>>> scheduling domains are build.
>>>
>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>> lifetime) pinning, scheduling domains should not be constructed, in
>>> Linux, by looking at *any* topology information, because that just does
>>> not make any sense, when vcpus move around.
>>>
>>> Let me state this again (hoping to make myself as clear as possible): no
>>> matter in  how much good shape we put CPUID support, no matter how
>>> beautifully and consistently that will interact with both vNUMA,
>>> licensing requirements and whatever else. It will be always possible for
>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>> should really not skew his load balancing logic toward any of those two
>>> situations, as neither of them could be considered correct (since
>>> nothing is!).
>>>
>>> For now, this only covers the PV case. HVM case shouldn't be any
>>> different, but I haven't looked at how to make the same thing happen in
>>> there as well.
>>>
>>> OVERALL DESCRIPTION
>>> ===================
>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>> domains in such a way that there is only one of them, spanning all the
>>> pCPUs of the guest.
>>>
>>> Note that the patch deals directly with scheduling domains, and there is
>>> no need to alter the masks that will then be used for building and
>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
>>> the main difference between it and the patch proposed by Juergen here:
>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>
>>> This means that when, in future, we will fix CPUID handling and make it
>>> comply with whatever logic or requirements we want, that won't have  any
>>> unexpected side effects on scheduling domains.
>>>
>>> Information about how the scheduling domains are being constructed
>>> during boot are available in `dmesg', if the kernel is booted with the
>>> 'sched_debug' parameter. It is also possible to look
>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>
>>> With the patch applied, only one scheduling domain is created, called
>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>> tell that from the fact that every cpu* folder
>>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>> domain.
>>>
>>> EVALUATION
>>> ==========
>>> I've tested this with UnixBench, and by looking at Xen build time, on a
>>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>> something similar to this in DomU already, AFAUI).
>>>
>>> I've run the benchmarks with and without the patch applied ('patched'
>>> and 'vanilla', respectively, in the tables below), and with different
>>> number of build jobs (in case of the Xen build) or of parallel copy of
>>> the benchmarks (in the case of UnixBench).
>>>
>>> What I get from the numbers is that the patch almost always brings
>>> benefits, in some cases even huge ones. There are a couple of cases
>>> where we regress, but always only slightly so, especially if comparing
>>> that to the magnitude of some of the improvement that we get.
>>>
>>> Bear also in mind that these results are gathered from Dom0, and without
>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>>> we move things in DomU and do overcommit at the Xen scheduler level, I
>>> am expecting even better results.
>>>
>> ...
>>> REQUEST FOR COMMENTS
>>> ====================
>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>    - what you guys thing of the approach,
>>
>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>> myself) discussed this topic again.
>>
> Hey,
>
> Sorry for replying so late, I've been on vacation from right after
> XenSummit up until yesterday. :-)
>
>> Regarding a possible future scenario with credit2 eventually supporting
>> gang scheduling on hyperthreads (which is desirable due to security
>> reasons [side channel attack] and fairness) my patch seems to be more
>> suited for that direction than yours.
>>
> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
> exactly around the corner (although, we can prioritize working on it if
> we want).
>
> In principle, I think it's a really nice idea. I still don't have clear
> in mind how we would handle a couple of situations, but let's leave this
> aside for now, and stay on-topic.
>
>> Correct me if I'm wrong, but I
>> think scheduling domains won't enable the guest kernel's scheduler to
>> migrate threads more easily between hyperthreads opposed to other vcpus,
>> while my approach can easily be extended to do so.
>>
> I'm not sure I understand what you mean here. As far as the (Linux)
> scheduler is concerned, your patch and mine do the exact same thing:
> they arrange for the scheduling domains, when they're built, during
> boot, not to consider hyperthreads or multi-cores.
>
> Mine does it by removing the SMT (and the MC) level from the data
> structure in the scheduler that is used as a base for configuring the
> scheduling domains. Yours does it by making the topology bitmaps that
> are used at each one of those level all look the same. In fact, with
> your patch applied, I get the exact same situation as with mine, as far
> as scheduling domains are concerned: there is only one scheduling
> domain, with a different scheduling group for each vCPU inside it.

Uuh, nearly.

Your case won't deal correctly with NUMA, as the generic NUMA code is
using set_sched_topology() as well. One of NUMA and Xen will win and
overwrite the other's settings.

To do things correctly you will have to handle NUMA as well.


Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-21  5:49     ` Juergen Gross
@ 2015-09-22  4:42       ` Juergen Gross
  2015-09-22 16:22         ` George Dunlap
  2015-09-23  7:24       ` Dario Faggioli
  1 sibling, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-22  4:42 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/21/2015 07:49 AM, Juergen Gross wrote:
> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
>> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> Hey everyone,
>>>>
>>>> So, as a followup of what we were discussing in this thread:
>>>>
>>>>    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>>
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>
>>>>
>>>> I started looking in more details at scheduling domains in the Linux
>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>>>> of interacting, while this thing I'm proposing here is completely
>>>> independent from them both.
>>>>
>>>> In fact, no matter whether vNUMA is supported and enabled, and no
>>>> matter
>>>> whether CPUID is reporting accurate, random, meaningful or completely
>>>> misleading information, I think that we should do something about how
>>>> scheduling domains are build.
>>>>
>>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>>> lifetime) pinning, scheduling domains should not be constructed, in
>>>> Linux, by looking at *any* topology information, because that just does
>>>> not make any sense, when vcpus move around.
>>>>
>>>> Let me state this again (hoping to make myself as clear as
>>>> possible): no
>>>> matter in  how much good shape we put CPUID support, no matter how
>>>> beautifully and consistently that will interact with both vNUMA,
>>>> licensing requirements and whatever else. It will be always possible
>>>> for
>>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>>> should really not skew his load balancing logic toward any of those two
>>>> situations, as neither of them could be considered correct (since
>>>> nothing is!).
>>>>
>>>> For now, this only covers the PV case. HVM case shouldn't be any
>>>> different, but I haven't looked at how to make the same thing happen in
>>>> there as well.
>>>>
>>>> OVERALL DESCRIPTION
>>>> ===================
>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>> domains in such a way that there is only one of them, spanning all the
>>>> pCPUs of the guest.
>>>>
>>>> Note that the patch deals directly with scheduling domains, and
>>>> there is
>>>> no need to alter the masks that will then be used for building and
>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.).
>>>> That is
>>>> the main difference between it and the patch proposed by Juergen here:
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>
>>>>
>>>> This means that when, in future, we will fix CPUID handling and make it
>>>> comply with whatever logic or requirements we want, that won't have
>>>> any
>>>> unexpected side effects on scheduling domains.
>>>>
>>>> Information about how the scheduling domains are being constructed
>>>> during boot are available in `dmesg', if the kernel is booted with the
>>>> 'sched_debug' parameter. It is also possible to look
>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>
>>>> With the patch applied, only one scheduling domain is created, called
>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>> tell that from the fact that every cpu* folder
>>>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>>> domain.
>>>>
>>>> EVALUATION
>>>> ==========
>>>> I've tested this with UnixBench, and by looking at Xen build time, on a
>>>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>> something similar to this in DomU already, AFAUI).
>>>>
>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>> and 'vanilla', respectively, in the tables below), and with different
>>>> number of build jobs (in case of the Xen build) or of parallel copy of
>>>> the benchmarks (in the case of UnixBench).
>>>>
>>>> What I get from the numbers is that the patch almost always brings
>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>> where we regress, but always only slightly so, especially if comparing
>>>> that to the magnitude of some of the improvement that we get.
>>>>
>>>> Bear also in mind that these results are gathered from Dom0, and
>>>> without
>>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>>>> we move things in DomU and do overcommit at the Xen scheduler level, I
>>>> am expecting even better results.
>>>>
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>    - what you guys thing of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>> Hey,
>>
>> Sorry for replying so late, I've been on vacation from right after
>> XenSummit up until yesterday. :-)
>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable due to security
>>> reasons [side channel attack] and fairness) my patch seems to be more
>>> suited for that direction than yours.
>>>
>> Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
>> exactly around the corner (although, we can prioritize working on it if
>> we want).
>>
>> In principle, I think it's a really nice idea. I still don't have clear
>> in mind how we would handle a couple of situations, but let's leave this
>> aside for now, and stay on-topic.
>>
>>> Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads opposed to other vcpus,
>>> while my approach can easily be extended to do so.
>>>
>> I'm not sure I understand what you mean here. As far as the (Linux)
>> scheduler is concerned, your patch and mine do the exact same thing:
>> they arrange for the scheduling domains, when they're built, during
>> boot, not to consider hyperthreads or multi-cores.
>>
>> Mine does it by removing the SMT (and the MC) level from the data
>> structure in the scheduler that is used as a base for configuring the
>> scheduling domains. Yours does it by making the topology bitmaps that
>> are used at each one of those level all look the same. In fact, with
>> your patch applied, I get the exact same situation as with mine, as far
>> as scheduling domains are concerned: there is only one scheduling
>> domain, with a different scheduling group for each vCPU inside it.
>
> Uuh, nearly.
>
> Your case won't deal correctly with NUMA, as the generic NUMA code is
> using set_sched_topology() as well. One of NUMA and Xen will win and
> overwrite the other's settings.
>
> To do things correctly you will have to handle NUMA as well.

One other thing I just discovered: there are other consumers of the
topology sibling masks (e.g. topology_sibling_cpumask()) as well.

I think we would want to avoid any optimizations based on those in
drivers as well, not only in the scheduler.
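
Just to illustrate the kind of optimization I mean, a driver could try
to spread its work based on those masks roughly like this (purely
hypothetical example, all names made up):

  /* pick a cpu for a second queue that is not a hyperthread sibling
   * of the one already in use */
  static int pick_non_sibling_cpu(int used_cpu)
  {
          int cpu;

          for_each_online_cpu(cpu) {
                  if (!cpumask_test_cpu(cpu, topology_sibling_cpumask(used_cpu)))
                          return cpu;
          }
          return used_cpu;
  }

With vcpus moving around beneath the guest such an "optimization" is
based on stale data and might even be counterproductive.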


Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-22  4:42       ` Juergen Gross
@ 2015-09-22 16:22         ` George Dunlap
  2015-09-23  4:36           ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2015-09-22 16:22 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/22/2015 05:42 AM, Juergen Gross wrote:
> One other thing I just discovered: there are other consumers of the
> topology sibling masks (e.g. topology_sibling_cpumask()) as well.
> 
> I think we would want to avoid any optimizations based on those in
> drivers as well, not only in the scheduler.

I'm beginning to lose the thread of the discussion here a bit.

Juergen / Dario, could one of you summarize your two approaches, and the
(alleged) advantages and disadvantages of each one?

Thanks,
 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-22 16:22         ` George Dunlap
@ 2015-09-23  4:36           ` Juergen Gross
  2015-09-23  8:30             ` Dario Faggioli
  2015-09-23 10:23             ` George Dunlap
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2015-09-23  4:36 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/22/2015 06:22 PM, George Dunlap wrote:
> On 09/22/2015 05:42 AM, Juergen Gross wrote:
>> One other thing I just discovered: there are other consumers of the
>> topology sibling masks (e.g. topology_sibling_cpumask()) as well.
>>
>> I think we would want to avoid any optimizations based on those in
>> drivers as well, not only in the scheduler.
>
> I'm beginning to lose the thread of the discussion here a bit.
>
> Juergen / Dario, could one of you summarize your two approaches, and the
> (alleged) advantages and disadvantages of each one?

Okay, I'll have a try:

The problem we want to solve:
-----------------------------

The Linux kernel is gathering cpu topology data during boot via the
CPUID instruction on each processor coming online. This data is
primarily used in the scheduler to decide to which cpu a thread should
be migrated when this seems to be necessary. There are other users of
the topology information in the kernel (e.g. some drivers try to do
optimizations like core-specific queues/lists).

When started in a virtualized environment the obtained data is next to
useless or even wrong, as it reflects only the situation at the time
the system was booted. The hypervisor's scheduling of the (v)cpus keeps
changing the topology under the feet of the Linux kernel, without this
being reflected in the gathered topology information. So any decision
taken based on that data will be clueless and possibly just wrong.

The minimal solution is to change the topology data in the kernel in
such a way that all cpus are regarded as equal with respect to their
relation to each other (e.g. when migrating a thread to another cpu, no
cpu is preferred as a target).

The topology information of the CPUID instruction is, however, also
accessible from user mode and might be used for licensing purposes by
user programs (e.g. by limiting the software to run on a specific
number of cores or sockets). So just mangling the data returned by
CPUID in the hypervisor seems not to be a general solution, although we
might want to do it at least optionally in the future.
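
(Just to illustrate why this matters: any unprivileged program can
derive the thread/core layout directly, e.g. from CPUID leaf 0xb --
a sketch only, with no error handling:

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx, lvl;

          for (lvl = 0; lvl < 2; lvl++) {
                  /* leaf 0xb: ECX[15:8] is the level type (1 = SMT,
                   * 2 = core), EBX[15:0] the logical cpus at that level */
                  __cpuid_count(0x0b, lvl, eax, ebx, ecx, edx);
                  printf("level %u: type %u, logical cpus %u\n",
                         lvl, (ecx >> 8) & 0xff, ebx & 0xffff);
          }
          return 0;
  }

so whatever the hypervisor reports there is visible to such programs.)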

In the future we might want to support either dynamic topology updates
or be able to tell the kernel to use some of the topology data, e.g.
when pinning vcpus.


Solution 1 (Dario):
-------------------

Don't use the CPUID-derived topology information in the Linux scheduler,
but let it use a simple "flat" topology by setting Xen-specific
scheduler domain data.

Advantages:
+ very clean solution regarding the scheduler interface
+ scheduler decisions are based on a minimal data set
+ small patch

Disadvantages:
- covers the scheduler only, drivers still use the "wrong" data
- a little bit hacky regarding some NUMA architectures (needs either a
   hook in the code dealing with that architecture or multiple scheduler
   domain data overwrites)
- future enhancements will make the solution less clean (they need
   either duplicated scheduler domain data or new hooks in the scheduler
   domain interface)


Solution 2 (Juergen):
---------------------

When booted as a Xen guest, modify the topology data built during boot,
resulting in the same simple "flat" topology as in Dario's solution.

Advantages:
+ the simple topology is seen by all consumers of topology data as the
   data itself is modified accordingly
+ small patch
+ future enhancements rather easy by selecting which data to modify

Disadvantages:
- interface to scheduler not as clean as in Dario's approach
- scheduler decisions are based on multiple layers of topology data
   where one layer would be enough to describe the topology


Dario, are you okay with this summary?

Juergen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-21  5:49     ` Juergen Gross
  2015-09-22  4:42       ` Juergen Gross
@ 2015-09-23  7:24       ` Dario Faggioli
  2015-09-23  7:35         ` Juergen Gross
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2015-09-23  7:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 8912 bytes --]

On Mon, 2015-09-21 at 07:49 +0200, Juergen Gross wrote:
> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
> > On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
> > > On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> > > > Hey everyone,
> > > > 
> > > > So, as a followup of what we were discussing in this thread:
> > > > 
> > > >    [Xen-devel] PV-vNUMA issue: topology is misinterpreted by
> > > > the guest
> > > >    http://lists.xenproject.org/archives/html/xen-devel/2015-07/
> > > > msg03241.html
> > > > 
> > > > I started looking in more details at scheduling domains in the
> > > > Linux
> > > > kernel. Now, that thread was about CPUID and vNUMA, and their
> > > > weird way
> > > > of interacting, while this thing I'm proposing here is
> > > > completely
> > > > independent from them both.
> > > > 
> > > > In fact, no matter whether vNUMA is supported and enabled, and
> > > > no matter
> > > > whether CPUID is reporting accurate, random, meaningful or
> > > > completely
> > > > misleading information, I think that we should do something
> > > > about how
> > > > scheduling domains are build.
> > > > 
> > > > Fact is, unless we use 1:1, and immutable (across all the guest
> > > > lifetime) pinning, scheduling domains should not be
> > > > constructed, in
> > > > Linux, by looking at *any* topology information, because that
> > > > just does
> > > > not make any sense, when vcpus move around.
> > > > 
> > > > Let me state this again (hoping to make myself as clear as
> > > > possible): no
> > > > matter in  how much good shape we put CPUID support, no matter
> > > > how
> > > > beautifully and consistently that will interact with both
> > > > vNUMA,
> > > > licensing requirements and whatever else. It will be always
> > > > possible for
> > > > vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time
> > > > t1, and
> > > > on two different NUMA nodes at time t2. Hence, the Linux
> > > > scheduler
> > > > should really not skew his load balancing logic toward any of
> > > > those two
> > > > situations, as neither of them could be considered correct
> > > > (since
> > > > nothing is!).
> > > > 
> > > > For now, this only covers the PV case. HVM case shouldn't be
> > > > any
> > > > different, but I haven't looked at how to make the same thing
> > > > happen in
> > > > there as well.
> > > > 
> > > > OVERALL DESCRIPTION
> > > > ===================
> > > > What this RFC patch does is, in the Xen PV case, configure
> > > > scheduling
> > > > domains in such a way that there is only one of them, spanning
> > > > all the
> > > > pCPUs of the guest.
> > > > 
> > > > Note that the patch deals directly with scheduling domains, and
> > > > there is
> > > > no need to alter the masks that will then be used for building
> > > > and
> > > > reporting the topology (via CPUID, /proc/cpuinfo, /sysfs,
> > > > etc.). That is
> > > > the main difference between it and the patch proposed by
> > > > Juergen here:
> > > > http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg
> > > > 05088.html
> > > > 
> > > > This means that when, in future, we will fix CPUID handling and
> > > > make it
> > > > comply with whatever logic or requirements we want, that won't
> > > > have  any
> > > > unexpected side effects on scheduling domains.
> > > > 
> > > > Information about how the scheduling domains are being
> > > > constructed
> > > > during boot are available in `dmesg', if the kernel is booted
> > > > with the
> > > > 'sched_debug' parameter. It is also possible to look
> > > > at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> > > > 
> > > > With the patch applied, only one scheduling domain is created,
> > > > called
> > > > the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs.
> > > > You can
> > > > tell that from the fact that every cpu* folder
> > > > in /proc/sys/kernel/sched_domain/ only have one subdirectory
> > > > ('domain0'), with all the tweaks and the tunables for our
> > > > scheduling
> > > > domain.
> > > > 
> > > > EVALUATION
> > > > ==========
> > > > I've tested this with UnixBench, and by looking at Xen build
> > > > time, on a
> > > > 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0
> > > > only, for
> > > > now, but I plan to re-run them in DomUs soon (Juergen may be
> > > > doing
> > > > something similar to this in DomU already, AFAUI).
> > > > 
> > > > I've run the benchmarks with and without the patch applied
> > > > ('patched'
> > > > and 'vanilla', respectively, in the tables below), and with
> > > > different
> > > > number of build jobs (in case of the Xen build) or of parallel
> > > > copy of
> > > > the benchmarks (in the case of UnixBench).
> > > > 
> > > > What I get from the numbers is that the patch almost always
> > > > brings
> > > > benefits, in some cases even huge ones. There are a couple of
> > > > cases
> > > > where we regress, but always only slightly so, especially if
> > > > comparing
> > > > that to the magnitude of some of the improvement that we get.
> > > > 
> > > > Bear also in mind that these results are gathered from Dom0,
> > > > and without
> > > > any overcommitment at the vCPU level (i.e., nr. vCPUs == nr
> > > > pCPUs). If
> > > > we move things in DomU and do overcommit at the Xen scheduler
> > > > level, I
> > > > am expecting even better results.
> > > > 
> > > ...
> > > > REQUEST FOR COMMENTS
> > > > ====================
> > > > Basically, the kind of feedback I'd be really glad to hear is:
> > > >    - what you guys thing of the approach,
> > > 
> > > Yesterday at the end of the developer meeting we (Andrew, Elena
> > > and
> > > myself) discussed this topic again.
> > > 
> > Hey,
> > 
> > Sorry for replying so late, I've been on vacation from right after
> > XenSummit up until yesterday. :-)
> > 
> > > Regarding a possible future scenario with credit2 eventually
> > > supporting
> > > gang scheduling on hyperthreads (which is desirable due to
> > > security
> > > reasons [side channel attack] and fairness) my patch seems to be
> > > more
> > > suited for that direction than yours.
> > > 
> > Ok. Just let me mention that 'Credit2 + gang scheduling' might not
> > be
> > exactly around the corner (although, we can prioritize working on
> > it if
> > we want).
> > 
> > In principle, I think it's a really nice idea. I still don't have
> > clear
> > in mind how we would handle a couple of situations, but let's leave
> > this
> > aside for now, and stay on-topic.
> > 
> > > Correct me if I'm wrong, but I
> > > think scheduling domains won't enable the guest kernel's
> > > scheduler to
> > > migrate threads more easily between hyperthreads opposed to other
> > > vcpus,
> > > while my approach can easily be extended to do so.
> > > 
> > I'm not sure I understand what you mean here. As far as the (Linux)
> > scheduler is concerned, your patch and mine do the exact same
> > thing:
> > they arrange for the scheduling domains, when they're built, during
> > boot, not to consider hyperthreads or multi-cores.
> > 
> > Mine does it by removing the SMT (and the MC) level from the data
> > structure in the scheduler that is used as a base for configuring
> > the
> > scheduling domains. Yours does it by making the topology bitmaps
> > that
> > are used at each one of those level all look the same. In fact,
> > with
> > your patch applied, I get the exact same situation as with mine, as
> > far
> > as scheduling domains are concerned: there is only one scheduling
> > domain, with a different scheduling group for each vCPU inside it.
> 
> Uuh, nearly.
> 
> Your case won't deal correctly with NUMA, as the generic NUMA code is
> using set_sched_topology() as well. 
>
Mmm... have you tried it and actually seen something like this? AFAICT,
the NUMA-related setup steps of scheduling domains happen after the
basic (as in "without taking NUMAness into account") topology has
already been set, and build on top of it.

It uses set_sched_topology() only in a special case which I'm not sure
we'd be hitting.
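
FWIW, sched_init_numa(), in kernel/sched/core.c, looks roughly like
this (heavily simplified, just to show what I mean by "builds on top
of it"):

          /* copy whatever the current topology table is ... */
          for (i = 0; sched_domain_topology[i].mask; i++)
                  tl[i] = sched_domain_topology[i];

          /* ... and append the NUMA levels on top of it */
          for (j = 0; j < level; i++, j++) {
                  tl[i] = (struct sched_domain_topology_level){
                          .mask = sd_numa_mask,
                          .sd_flags = cpu_numa_flags,
                          .flags = SDTL_OVERLAP,
                          .numa_level = j,
                          SD_INIT_NAME(NUMA)
                  };
          }

          sched_domain_topology = tl;

so, as long as our set_sched_topology() call happens before that runs,
the NUMA levels should still end up on top of whatever we install.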

I'm asking because trying this out right now is not straightforward,
as PV vNUMA, even with Wei's Linux patches and with either your patch
or mine, still runs into the CPUID issue... I'll try that ASAP, but
there are a couple of things I've got to finish in the next few days.

> One of NUMA and Xen will win and
> overwrite the other's settings.
> 
Not sure what this means, but as I said, I'll try.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  7:24       ` Dario Faggioli
@ 2015-09-23  7:35         ` Juergen Gross
  2015-09-23 12:25           ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2015-09-23  7:35 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/23/2015 09:24 AM, Dario Faggioli wrote:
> On Mon, 2015-09-21 at 07:49 +0200, Juergen Gross wrote:
>> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
>>> On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
>>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>>> Hey everyone,
>>>>>
>>>>> So, as a followup of what we were discussing in this thread:
>>>>>
>>>>>     [Xen-devel] PV-vNUMA issue: topology is misinterpreted by
>>>>> the guest
>>>>>     http://lists.xenproject.org/archives/html/xen-devel/2015-07/
>>>>> msg03241.html
>>>>>
>>>>> I started looking in more details at scheduling domains in the
>>>>> Linux
>>>>> kernel. Now, that thread was about CPUID and vNUMA, and their
>>>>> weird way
>>>>> of interacting, while this thing I'm proposing here is
>>>>> completely
>>>>> independent from them both.
>>>>>
>>>>> In fact, no matter whether vNUMA is supported and enabled, and
>>>>> no matter
>>>>> whether CPUID is reporting accurate, random, meaningful or
>>>>> completely
>>>>> misleading information, I think that we should do something
>>>>> about how
>>>>> scheduling domains are build.
>>>>>
>>>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>>>> lifetime) pinning, scheduling domains should not be
>>>>> constructed, in
>>>>> Linux, by looking at *any* topology information, because that
>>>>> just does
>>>>> not make any sense, when vcpus move around.
>>>>>
>>>>> Let me state this again (hoping to make myself as clear as
>>>>> possible): no
>>>>> matter in  how much good shape we put CPUID support, no matter
>>>>> how
>>>>> beautifully and consistently that will interact with both
>>>>> vNUMA,
>>>>> licensing requirements and whatever else. It will be always
>>>>> possible for
>>>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time
>>>>> t1, and
>>>>> on two different NUMA nodes at time t2. Hence, the Linux
>>>>> scheduler
>>>>> should really not skew his load balancing logic toward any of
>>>>> those two
>>>>> situations, as neither of them could be considered correct
>>>>> (since
>>>>> nothing is!).
>>>>>
>>>>> For now, this only covers the PV case. HVM case shouldn't be
>>>>> any
>>>>> different, but I haven't looked at how to make the same thing
>>>>> happen in
>>>>> there as well.
>>>>>
>>>>> OVERALL DESCRIPTION
>>>>> ===================
>>>>> What this RFC patch does is, in the Xen PV case, configure
>>>>> scheduling
>>>>> domains in such a way that there is only one of them, spanning
>>>>> all the
>>>>> pCPUs of the guest.
>>>>>
>>>>> Note that the patch deals directly with scheduling domains, and
>>>>> there is
>>>>> no need to alter the masks that will then be used for building
>>>>> and
>>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs,
>>>>> etc.). That is
>>>>> the main difference between it and the patch proposed by
>>>>> Juergen here:
>>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg
>>>>> 05088.html
>>>>>
>>>>> This means that when, in future, we will fix CPUID handling and
>>>>> make it
>>>>> comply with whatever logic or requirements we want, that won't
>>>>> have  any
>>>>> unexpected side effects on scheduling domains.
>>>>>
>>>>> Information about how the scheduling domains are being
>>>>> constructed
>>>>> during boot are available in `dmesg', if the kernel is booted
>>>>> with the
>>>>> 'sched_debug' parameter. It is also possible to look
>>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>>
>>>>> With the patch applied, only one scheduling domain is created,
>>>>> called
>>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs.
>>>>> You can
>>>>> tell that from the fact that every cpu* folder
>>>>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>>>>> ('domain0'), with all the tweaks and the tunables for our
>>>>> scheduling
>>>>> domain.
>>>>>
>>>>> EVALUATION
>>>>> ==========
>>>>> I've tested this with UnixBench, and by looking at Xen build
>>>>> time, on a
>>>>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0
>>>>> only, for
>>>>> now, but I plan to re-run them in DomUs soon (Juergen may be
>>>>> doing
>>>>> something similar to this in DomU already, AFAUI).
>>>>>
>>>>> I've run the benchmarks with and without the patch applied
>>>>> ('patched'
>>>>> and 'vanilla', respectively, in the tables below), and with
>>>>> different
>>>>> number of build jobs (in case of the Xen build) or of parallel
>>>>> copy of
>>>>> the benchmarks (in the case of UnixBench).
>>>>>
>>>>> What I get from the numbers is that the patch almost always
>>>>> brings
>>>>> benefits, in some cases even huge ones. There are a couple of
>>>>> cases
>>>>> where we regress, but always only slightly so, especially if
>>>>> comparing
>>>>> that to the magnitude of some of the improvement that we get.
>>>>>
>>>>> Bear also in mind that these results are gathered from Dom0,
>>>>> and without
>>>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr
>>>>> pCPUs). If
>>>>> we move things in DomU and do overcommit at the Xen scheduler
>>>>> level, I
>>>>> am expecting even better results.
>>>>>
>>>> ...
>>>>> REQUEST FOR COMMENTS
>>>>> ====================
>>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>>     - what you guys thing of the approach,
>>>>
>>>> Yesterday at the end of the developer meeting we (Andrew, Elena
>>>> and
>>>> myself) discussed this topic again.
>>>>
>>> Hey,
>>>
>>> Sorry for replying so late, I've been on vacation from right after
>>> XenSummit up until yesterday. :-)
>>>
>>>> Regarding a possible future scenario with credit2 eventually
>>>> supporting
>>>> gang scheduling on hyperthreads (which is desirable due to
>>>> security
>>>> reasons [side channel attack] and fairness) my patch seems to be
>>>> more
>>>> suited for that direction than yours.
>>>>
>>> Ok. Just let me mention that 'Credit2 + gang scheduling' might not
>>> be
>>> exactly around the corner (although, we can prioritize working on
>>> it if
>>> we want).
>>>
>>> In principle, I think it's a really nice idea. I still don't have
>>> clear
>>> in mind how we would handle a couple of situations, but let's leave
>>> this
>>> aside for now, and stay on-topic.
>>>
>>>> Correct me if I'm wrong, but I
>>>> think scheduling domains won't enable the guest kernel's
>>>> scheduler to
>>>> migrate threads more easily between hyperthreads opposed to other
>>>> vcpus,
>>>> while my approach can easily be extended to do so.
>>>>
>>> I'm not sure I understand what you mean here. As far as the (Linux)
>>> scheduler is concerned, your patch and mine do the exact same
>>> thing:
>>> they arrange for the scheduling domains, when they're built, during
>>> boot, not to consider hyperthreads or multi-cores.
>>>
>>> Mine does it by removing the SMT (and the MC) level from the data
>>> structure in the scheduler that is used as a base for configuring
>>> the
>>> scheduling domains. Yours does it by making the topology bitmaps
>>> that
>>> are used at each one of those levels all look the same. In fact,
>>> with
>>> your patch applied, I get the exact same situation as with mine, as
>>> far
>>> as scheduling domains are concerned: there is only one scheduling
>>> domain, with a different scheduling group for each vCPU inside it.
>>
>> Uuh, nearly.
>>
>> Your case won't deal correctly with NUMA, as the generic NUMA code is
>> using set_sched_topology() as well.
>>
> Mmm... have you actually tried this and seen it happen? AFAICT, the
> NUMA-related setup of scheduling domains happens after the basic (as
> in "without taking NUMAness into account") topology has been set
> already, and builds on top of it.
>
> It uses set_sched_topology() only in a special case which I'm not sure
> we'd be hitting.

Depends on the hardware. On some AMD processors one socket covers
multiple NUMA nodes. This is the critical case. set_sched_topology()
will be called on those machines possibly multiple times when bringing
up additional cpus.
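
If I remember the x86 code correctly, the hook is in
arch/x86/kernel/smpboot.c: when it finds cores of one package sitting
in different NUMA nodes it installs its own topology table (function
and array names quoted from memory, so they may be slightly off):

static void primarily_use_numa_for_topology(void)
{
	/* drop the DIE level; the NUMA levels describe the package instead */
	set_sched_topology(numa_inside_package_topology);
}

So that call and a Xen-specific set_sched_topology() would simply
overwrite each other's table, and whichever runs last wins.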

> I'm asking because trying this out, right now, is not straightforward,
> as PV vNUMA, even with Wei's Linux patches and with either your patch
> or mine, still runs into the CPUID issue... I'll try that ASAP, but
> there are a couple of things I've got to finish over the next few days.
>
>> One of NUMA and Xen will win and
>> overwrite the other's settings.
>>
> Not sure what this means, but as I said, I'll try.

Make sure to use the correct hardware (I'm pretty sure this should be
the AMD "Magny-Cours" [1]).


Juergen

[1]: 
http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  4:36           ` Juergen Gross
@ 2015-09-23  8:30             ` Dario Faggioli
  2015-09-23  9:44               ` Juergen Gross
  2015-09-23 10:23             ` George Dunlap
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2015-09-23  8:30 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 6107 bytes --]

On Wed, 2015-09-23 at 06:36 +0200, Juergen Gross wrote:

> On 09/22/2015 06:22 PM, George Dunlap wrote:
> > Juergen / Dario, could one of you summarize your two approaches, 
> > and the
> > (alleged) advantages and disadvantages of each one?
> 
> Okay, I'll have a try:
> 
Thanks for this! ;-)

> The problem we want to solve:
> -----------------------------
> 
> The Linux kernel is gathering cpu topology data during boot via the
> CPUID instruction on each processor coming online. This data is
> primarily used in the scheduler to decide to which cpu a thread
> should
> be migrated when this seems to be necessary. There are other users of
> the topology information in the kernel (e.g. some drivers try to do
> optimizations like core-specific queues/lists).
> 
> When started in a virtualized environment the obtained data is next
> to
> useless or even wrong, as it is reflecting only the status of the
> time
> of booting the system. Scheduling of the (v)cpus done by the
> hypervisor
> is changing the topology beneath the feet of the Linux kernel without
> reflecting this in the gathered topology information. So any
> decisions
> taken based on that data will be clueless and possibly just wrong.
> 
Exactly.

> The minimal solution is to change the topology data in the kernel in
> a
> way that all cpus are regarded as equal regarding their relation to
> each
> other (e.g. when migrating a thread to another cpu no cpu is
> preferred
> as a target).
> 
> The topology information of the CPUID instruction is, however, even
> accessible from user mode and might be used for licensing purposes of
> any user program (e.g. by limiting the software to run on a specific
> number of cores or sockets). So just mangling the data returned by
> CPUID in the hypervisor seems not to be a general solution, while we
> might want to do it at least optionally in the future.
> 
Yep. It turned out that, although it's what started all this, CPUID
handling is a somewhat related but mostly independent problem. :-)

> In the future we might want to support either dynamic topology
> updates
> or be able to tell the kernel to use some of the topology data, e.g.
> when pinning vcpus.
> 
Indeed, at least for the latter. Dynamic updates look really difficult
to me, but they would be ideal. Let's see.

> Solution 1 (Dario):
> -------------------
> 
> Don't use the CPUID derived topology information in the Linux
> scheduler,
> but let it use a simple "flat" topology by setting own scheduler
> domain
> data under Xen.
> 
> Advantages:
> + very clean solution regarding the scheduler interface
>
Yes, this is, I think, one of the main advantages of the patch. The
scheduler offers architectures an interface for defining their topology
requirements, and I'm simply using it to specify ours: the right tool
for the job. :-D
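
To make this concrete, the core of the approach boils down to something
like the following (a minimal sketch from memory, not the literal hunk
from the RFC patch, and xen_setup_sched_topology() is just an
illustrative name):

#include <linux/sched.h>
#include <linux/topology.h>

static struct sched_domain_topology_level xen_sched_domain_topology[] = {
	{ cpu_cpu_mask, SD_INIT_NAME(VCPU) },	/* one flat level, no SMT/MC */
	{ NULL, },
};

static void xen_setup_sched_topology(void)
{
	/* replace the default SMT/MC/DIE hierarchy with the flat one */
	set_sched_topology(xen_sched_domain_topology);
}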

> + scheduler decisions are based on a minimal data set
> + small patch
> 
> Disadvantages:
> - covers the scheduler only, drivers still use the "wrong" data
>
This is a good point. Covering only the scheduler was the patch's
purpose, TBH, but it's certainly true that, if we need something
similar elsewhere, we need to do more.

> - a little bit hacky regarding some NUMA architectures (needs either
> a
>    hook in the code dealing with that architecture or multiple
> scheduler
>    domain data overwrites)
>
As I said in my other email, I'll double check (yes, I also think this
is about AMD boxes with intra-socket NUMA nodes).

> - future enhancements will make the solution less clean (either need
>    duplicating scheduler domain data or some new hooks in scheduler
>    domain interface)
> 
This one, I'm not sure I understand.

> Solution 2 (Juergen):
> ---------------------
> 
> When booted as a Xen guest modify the topology data built during boot
> resulting in the same simple "flat" topology as in Dario's solution.
> 
> Advantages:
> + the simple topology is seen by all consumers of topology data as
> the
>    data itself is modified accordingly
>
Yep, that's a good point.

> + small patch

> + future enhancements rather easy by selecting which data to modify
>
As with the corresponding '-' above, I'm not really sure what this means.
> 
> Disadvantages:
> - interface to scheduler not as clean as in Dario's approach
> - scheduler decisions are based on multiple layers of topology data
>    where one layer would be enough to describe the topology
> 
This is not too big of a deal, IMO. Not at runtime, at least, as far as
my investigation has gone so far. Initialization (of scheduling
domains) is a bit clumsy in this case, as scheduling domains are
created and then destroyed/collapsed, but once they are set up, the net
effect is that there's only one scheduling domain with Juergen's patch
too, exactly as with mine.

> Dario, are you okay with this summary?
>
To most of it, yes, and thanks again for it.

Allow me to add a few points, off the top of my head:

 * we need to check whether the two approaches have the same
   performance. In principle, they really should, and early results
   seem to confirm that, but I'd like to run the full set of benches
   (and I'll do that ASAP);
 * I think we want to run even more benchmarks, and run them in
   different (over)load conditions to better assess the effect of the
   change;
 * both our patches provide a solution for Xen (for Xen PV guests, at
   least for now, to be more precise). It is very likely that, e.g.,
   KVM is in a similar situation, hence it may be worth looking for a
   more general solution, especially if that buys us something (e.g.,
   HVM support made easy?)

Thanks and Regards,
Dario

PS. BTW, Juergen, you're not on IRC, on #xendevel, are you?

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  8:30             ` Dario Faggioli
@ 2015-09-23  9:44               ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2015-09-23  9:44 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/23/2015 10:30 AM, Dario Faggioli wrote:
> On Wed, 2015-09-23 at 06:36 +0200, Juergen Gross wrote:
>
>> On 09/22/2015 06:22 PM, George Dunlap wrote:
>>> Juergen / Dario, could one of you summarize your two approaches,
>>> and the
>>> (alleged) advantages and disadvantages of each one?
>>
>> Okay, I'll have a try:
>>
> Thanks for this! ;-)
>
>> The problem we want to solve:
>> -----------------------------
>>
>> The Linux kernel is gathering cpu topology data during boot via the
>> CPUID instruction on each processor coming online. This data is
>> primarily used in the scheduler to decide to which cpu a thread
>> should
>> be migrated when this seems to be necessary. There are other users of
>> the topology information in the kernel (e.g. some drivers try to do
>> optimizations like core-specific queues/lists).
>>
>> When started in a virtualized environment the obtained data is next
>> to
>> useless or even wrong, as it is reflecting only the status of the
>> time
>> of booting the system. Scheduling of the (v)cpus done by the
>> hypervisor
>> is changing the topology beneath the feet of the Linux kernel without
>> reflecting this in the gathered topology information. So any
>> decisions
>> taken based on that data will be clueless and possibly just wrong.
>>
> Exactly.
>
>> The minimal solution is to change the topology data in the kernel in
>> a
>> way that all cpus are regarded as equal regarding their relation to
>> each
>> other (e.g. when migrating a thread to another cpu no cpu is
>> preferred
>> as a target).
>>
>> The topology information of the CPUID instruction is, however, even
>> accessible from user mode and might be used for licensing purposes of
>> any user program (e.g. by limiting the software to run on a specific
>> number of cores or sockets). So just mangling the data returned by
>> CPUID in the hypervisor seems not to be a general solution, while we
>> might want to do it at least optionally in the future.
>>
> Yep. It turned out that, although it's what started all this, CPUID
> handling is a somewhat related but mostly independent problem. :-)
>
>> In the future we might want to support either dynamic topology
>> updates
>> or be able to tell the kernel to use some of the topology data, e.g.
>> when pinning vcpus.
>>
> Indeed, at least for the latter. Dynamic updates look really difficult
> to me, but they would be ideal. Let's see.
>
>> Solution 1 (Dario):
>> -------------------
>>
>> Don't use the CPUID derived topology information in the Linux
>> scheduler,
>> but let it use a simple "flat" topology by setting own scheduler
>> domain
>> data under Xen.
>>
>> Advantages:
>> + very clean solution regarding the scheduler interface
>>
> Yes, this is, I think, one of the main advantages of the patch. The
> scheduler offers architectures an interface for defining their topology
> requirements, and I'm simply using it to specify ours: the right tool
> for the job. :-D
>
>> + scheduler decisions are based on a minimal data set
>> + small patch
>>
>> Disadvantages:
>> - covers the scheduler only, drivers still use the "wrong" data
>>
> This is a good point. Covering only the scheduler was the patch's
> purpose, TBH, but it's certainly true that, if we need something
> similar elsewhere, we need to do more.
>
>> - a little bit hacky regarding some NUMA architectures (needs either
>> a
>>     hook in the code dealing with that architecture or multiple
>> scheduler
>>     domain data overwrites)
>>
> As I said in my other email, I'll double check (yes, I also think this
> is about AMD boxes with intra-socket NUMA nodes).
>
>> - future enhancements will make the solution less clean (either need
>>     duplicating scheduler domain data or some new hooks in scheduler
>>     domain interface)
>>
> This one, I'm not sure I understand.

What would you do to keep the topology information of one level, e.g.
hyperthreads, in case we had a gang scheduler in Xen? Either you would
copy the line:

{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },

from kernel/sched/core.c into your topology array, or you would add a
way in kernel/sched/core.c to remove all but this entry and add your
entry on top of it.
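
Just to illustrate the first option (not meant as actual patch code,
and assuming the flat Xen level keeps the VCPU name from the RFC):

static struct sched_domain_topology_level xen_sched_domain_topology[] = {
#ifdef CONFIG_SCHED_SMT
	/* duplicated from the default table in kernel/sched/core.c */
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(VCPU) },
	{ NULL, },
};

That duplication of the SMT line is exactly what I'd prefer to avoid.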

>
>> Solution 2 (Juergen):
>> ---------------------
>>
>> When booted as a Xen guest modify the topology data built during boot
>> resulting in the same simple "flat" topology as in Dario's solution.
>>
>> Advantages:
>> + the simple topology is seen by all consumers of topology data as
>> the
>>     data itself is modified accordingly
>>
> Yep, that's a good point.
>
>> + small patch
>
>> + future enhancements rather easy by selecting which data to modify
>>
> As with the corresponding '-' above, I'm not really sure what this means.

In the case mentioned above I just wouldn't zap the
topology_sibling_cpumask in my patch.
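
I.e., the mangling in my patch amounts to (roughly; a simplified
sketch, not the actual hunk) a per-cpu loop like:

	int cpu;

	for_each_possible_cpu(cpu) {
		/* make every cpu look unrelated to all the others */
		cpumask_copy(topology_core_cpumask(cpu), cpumask_of(cpu));
		cpumask_copy(topology_sibling_cpumask(cpu), cpumask_of(cpu));
	}

and keeping the hyperthread information would just mean leaving the
second cpumask_copy() out.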

>>
>> Disadvantages:
>> - interface to scheduler not as clean as in Dario's approach
>> - scheduler decisions are based on multiple layers of topology data
>>     where one layer would be enough to describe the topology
>>
> This is not too big of a deal, IMO. Not at runtime, at least, as far as
> my investigation has gone so far. Initialization (of scheduling
> domains) is a bit clumsy in this case, as scheduling domains are
> created and then destroyed/collapsed, but once they are set up, the net
> effect is that there's only one scheduling domain with Juergen's patch
> too, exactly as with mine.
>
>> Dario, are you okay with this summary?
>>
> To most of it, yes, and thanks again for it.
>
> Allow me to add a few points, off the top of my head:
>
>   * we need to check whether the two approaches have the same
>     performance. In principle, they really should, and early results
>     seem to confirm that, but I'd like to run the full set of benches
>     (and I'll do that ASAP);

Thanks.

>   * I think we want to run even more benchmarks, and run them in
>     different (over)load conditions to better assess the effect of the
>     change;
>   * both our patches provide a solution for Xen (for Xen PV guests, at
>     least for now, to be more precise). It is very likely that, e.g.,
>     KVM is in a similar situation, hence it may be worth looking for a
>     more general solution, especially if that buys us something (e.g.,
>     HVM support made easy?)

I wanted to look at this as soon as we've decided which way to go.

I had some discussion with a kvm guy last week and he didn't seem
convinced that they need anything other than mangling CPUID (which they
already do).

>
> Thanks and Regards,
> Dario
>
> PS. BTW, Juergen, you're not on IRC, on #xendevel, are you?

I'd like to, but I'd need an invitation. My user name is juergen_gross.


Juergen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  4:36           ` Juergen Gross
  2015-09-23  8:30             ` Dario Faggioli
@ 2015-09-23 10:23             ` George Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2015-09-23 10:23 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	David Vrabel, Boris Ostrovsky, Stefano Stabellini

On 09/23/2015 05:36 AM, Juergen Gross wrote:
> On 09/22/2015 06:22 PM, George Dunlap wrote:
>> On 09/22/2015 05:42 AM, Juergen Gross wrote:
>>> One other thing I just discovered: there are other consumers of the
>>> topology sibling masks (e.g. topology_sibling_cpumask()) as well.
>>>
>>> I think we would want to avoid any optimizations based on those in
>>> drivers as well, not only in the scheduler.
>>
>> I'm beginning to lose the thread of the discussion here a bit.
>>
>> Juergen / Dario, could one of you summarize your two approaches, and the
>> (alleged) advantages and disadvantages of each one?
> 
> Okay, I'll have a try:
> 
> The problem we want to solve:
> -----------------------------
> 
> The Linux kernel is gathering cpu topology data during boot via the
> CPUID instruction on each processor coming online. This data is
> primarily used in the scheduler to decide to which cpu a thread should
> be migrated when this seems to be necessary. There are other users of
> the topology information in the kernel (e.g. some drivers try to do
> optimizations like core-specific queues/lists).
> 
> When started in a virtualized environment the obtained data is next to
> useless or even wrong, as it is reflecting only the status of the time
> of booting the system. Scheduling of the (v)cpus done by the hypervisor
> is changing the topology beneath the feet of the Linux kernel without
> reflecting this in the gathered topology information. So any decisions
> taken based on that data will be clueless and possibly just wrong.
> 
> The minimal solution is to change the topology data in the kernel in a
> way that all cpus are regarded as equal regarding their relation to each
> other (e.g. when migrating a thread to another cpu no cpu is preferred
> as a target).
> 
> The topology information of the CPUID instruction is, however, even
> accessible from user mode and might be used for licensing purposes of
> any user program (e.g. by limiting the software to run on a specific
> number of cores or sockets). So just mangling the data returned by
> CPUID in the hypervisor seems not to be a general solution, while we
> might want to do it at least optionally in the future.
> 
> In the future we might want to support either dynamic topology updates
> or be able to tell the kernel to use some of the topology data, e.g.
> when pinning vcpus.
> 
> 
> Solution 1 (Dario):
> -------------------
> 
> Don't use the CPUID derived topology information in the Linux scheduler,
> but let it use a simple "flat" topology by setting own scheduler domain
> data under Xen.
> 
> Advantages:
> + very clean solution regarding the scheduler interface
> + scheduler decisions are based on a minimal data set
> + small patch
> 
> Disadvantages:
> - covers the scheduler only, drivers still use the "wrong" data
> - a little bit hacky regarding some NUMA architectures (needs either a
>   hook in the code dealing with that architecture or multiple scheduler
>   domain data overwrites)
> - future enhancements will make the solution less clean (either need
>   duplicating scheduler domain data or some new hooks in scheduler
>   domain interface)
> 
> 
> Solution 2 (Juergen):
> ---------------------
> 
> When booted as a Xen guest modify the topology data built during boot
> resulting in the same simple "flat" topology as in Dario's solution.
> 
> Advantages:
> + the simple topology is seen by all consumers of topology data as the
>   data itself is modified accordingly
> + small patch
> + future enhancements rather easy by selecting which data to modify
> 
> Disadvantages:
> - interface to scheduler not as clean as in Dario's approach
> - scheduler decisions are based on multiple layers of topology data
>   where one layer would be enough to describe the topology
> 
> 
> Dario, are you okay with this summary?

Thanks -- that's very helpful.

 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
  2015-09-23  7:35         ` Juergen Gross
@ 2015-09-23 12:25           ` Boris Ostrovsky
  0 siblings, 0 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-09-23 12:25 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli
  Cc: xen-devel, Andrew Cooper, Luis R. Rodriguez, linux-kernel,
	George Dunlap, David Vrabel, Stefano Stabellini

On 09/23/2015 03:35 AM, Juergen Gross wrote:
>
> Depends on the hardware. On some AMD processors one socket covers
> multiple NUMA nodes. This is the critical case. set_sched_topology()
> will be called on those machines possibly multiple times when bringing
> up additional cpus.
>
>> I'm asking because trying this out, right now, is not straightforward,
>> as PV vNUMA, even with Wei's Linux patches and with either your patch
>> or mine, still runs into the CPUID issue... I'll try that ASAP, but
>> there are a couple of things I've got to finish over the next few days.
>>
>>> One of NUMA and Xen will win and
>>> overwrite the other's settings.
>>>
>> Not sure what this means, but as I said, I'll try.
>
> Make sure to use the correct hardware (I'm pretty sure this should be
> the AMD "Magny-Cours" [1]).
>
>
> Juergen
>
> [1]: 
> http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/
>


There are a few family 0x10 and 0x15 processors that are like that. You
can see whether you have such a system by comparing the number of NUMA
nodes with the number of physical IDs, e.g.:

[root@ovs106 ~]# numactl --hardware |grep available
available: 4 nodes (0-3)
[root@ovs106 ~]# grep "physical id" /proc/cpuinfo | uniq
physical id    : 0
physical id    : 1
[root@ovs106 ~]#


-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-09-23 12:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-18 15:55 [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy Dario Faggioli
2015-08-18 16:53 ` Konrad Rzeszutek Wilk
2015-08-20 18:16 ` Juergen Groß
2015-08-31 16:12   ` Boris Ostrovsky
2015-09-02 11:58     ` Juergen Gross
2015-09-02 14:08       ` Boris Ostrovsky
2015-09-02 14:30         ` Juergen Gross
2015-09-15 17:16           ` [Xen-devel] " Dario Faggioli
2015-09-15 16:50   ` Dario Faggioli
2015-09-21  5:49     ` Juergen Gross
2015-09-22  4:42       ` Juergen Gross
2015-09-22 16:22         ` George Dunlap
2015-09-23  4:36           ` Juergen Gross
2015-09-23  8:30             ` Dario Faggioli
2015-09-23  9:44               ` Juergen Gross
2015-09-23 10:23             ` George Dunlap
2015-09-23  7:24       ` Dario Faggioli
2015-09-23  7:35         ` Juergen Gross
2015-09-23 12:25           ` Boris Ostrovsky
2015-08-27 10:24 ` George Dunlap
2015-08-27 17:05   ` [Xen-devel] " George Dunlap
2015-09-15 14:32   ` Dario Faggioli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).