* [PATCH 00 of 11 v3] NUMA aware credit scheduling
@ 2013-02-01 11:01 Dario Faggioli
  2013-02-01 11:01 ` [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
                   ` (11 more replies)
  0 siblings, 12 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Hello Everyone,

V3 of the NUMA aware scheduling series. It is nothing more than v2 with all
the review comments I received addressed... or so I think. :-)

I added a new patch to the series (#3), which deals with the suboptimal SMT
load balancing in credit there, instead of doing that within what is now
patch #4.

I ran the following benchmarks (again):

 * SpecJBB is all about throughput, so pinning is likely the ideal solution.

 * Sysbench-memory measures the time it takes to write a fixed amount of
   memory (and it is then the throughput that is reported). We expect
   locality to be important, but at the same time the potential imbalances
   due to pinning could have a say in it.

 * LMBench-proc measures the time it takes for a process to fork a fixed
   number of children. This is much more about latency than throughput, with
   locality of memory accesses playing a smaller role and, again, imbalances
   due to pinning being a potential issue.

Summarizing, we expect pinning to win on throughput biased benchmarks (SpecJBB
and Sysbench), while not having any affinity should be better when latency is
important, especially under (over)load (i.e., on LMBench). NUMA aware
scheduling tries to get the best out of the two approaches: take advantage of
locality, but in a flexible way. Therefore, it would be nice for it to sit in
the middle:
 - not as bad as no-affinity (or, if you prefer, almost as good as pinning)
   when looking at SpecJBB and Sysbench results;
 - not as bad as pinning (or, if you prefer, almost as good as no-affinity)
   when looking at LMBench results.

On a 2-node, 16-core system, where I can have from 2 to 10 VMs (2 vCPUs,
960MB RAM each) executing the benchmarks concurrently, here's what I get:

 ----------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  43318.613  | 49715.158 |    49822.545    |
 |    6 |  29587.838  | 33560.944 |    33739.412    |
 |   10 |  19223.962  | 21860.794 |    20089.602    |
 ----------------------------------------------------
 | Sysbench memory, throughput (the higher the better)
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  469.37667  | 534.03167 |    555.09500    |
 |    6 |  411.45056  | 437.02333 |    463.53389    |
 |   10 |  292.79400  | 309.63800 |    305.55167    |
 ----------------------------------------------------
 | LMBench proc, latency (the lower the better)     |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  788.06613  | 753.78508 |    750.07010    |
 |    6 |  986.44955  | 1076.7447 |    900.21504    |
 |   10 |  1211.2434  | 1371.6014 |    1285.5947    |
 ----------------------------------------------------

Reasoning in terms of % performance increase/decrease, this means NUMA aware
scheduling does as follows, compared to having no affinity at all and to
pinning:

     ----------------------------------
     | SpecJBB2005 (throughput)       |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +13.05%   |  +0.21%   |
     |    6 |   +12.30%   |  +0.53%   |
     |   10 |    +4.31%   |  -8.82%   |
     ----------------------------------
     | Sysbench memory (throughput)   |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +15.44%   |  +3.79%   |
     |    6 |   +11.24%   |  +5.72%   |
     |   10 |    +4.18%   |  -1.34%   |
     ----------------------------------
     | LMBench proc (latency)         | NOTICE: -x.xx% = GOOD here
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |    -5.66%   |  -0.50%   |
     |    6 |    -9.58%   | -19.61%   |
     |   10 |    +5.78%   |  -6.69%   |
     ----------------------------------
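
(To read the tables above: each delta is the difference between the NUMA
scheduling figure and the other one, relative to the former. E.g., for
SpecJBB with 2 VMs vs. no affinity: (49822.545 - 43318.613) / 49822.545,
i.e., about +13.05%.)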

So, not bad at all. :-) In particular, when not in overload, NUMA scheduling
is the absolute best of the three, even where one of the other approaches
could have been expected to win. In fact, looking at the 2 and 6 VMs cases,
it beats pinning in both the throughput biased benchmarks (although by a very
small amount in the SpecJBB case), and it beats no-affinity on LMBench. Of
course it does a lot better than no-affinity on throughput and than pinning
on latency (especially with 6 VMs), but that was expected.

Regarding the overloaded case, NUMA scheduling scores "in the middle", i.e.,
better than no-affinity but worse than pinning on the throughput benchmarks,
and vice versa on the latency benchmark, which is exactly what we expected
and intended from it. It must be noticed, however, that the benefits are not
as big as in the non-overloaded case. I chased the reason and found out that
our load-balancing approach --in particular the fact that we rely on tickling
idle pCPUs to come pick up new work by themselves-- couples particularly
badly with the new concept of node affinity. I spent some time looking for a
simple "fix" for this, but the problem does not seem amenable to one, so I'll
prepare a patch using a completely different approach and send it separately
from this series (hopefully on top of it, in case this will have hit the repo
by then :-D). For now, I really think we can be happy with the performance
figures this series enables... After all, I'm overloading the box by 20%
(without counting Dom0 vCPUs!) and still seeing improvements, although
perhaps not as big as they could have been. Thoughts?

Here are the patches included in the series. I '*'-ed the ones that already
received one or more Acks during previous rounds. Of course, I retained these
Acks only for the patches that have not been touched, or that only underwent
minor cleanups. In any case, feel free to re-review everything, whether your
Ack is there or not!

 * [ 1/11] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 * [ 2/11] xen, libxc: introduce node maps and masks
   [ 3/11] xen: sched_credit: when picking, make sure we get an idle one, if any
   [ 4/11] xen: sched_credit: let the scheduler know about node-affinity
 * [ 5/11] xen: allow for explicitly specifying node-affinity
 * [ 6/11] libxc: allow for explicitly specifying node-affinity
 * [ 7/11] libxl: allow for explicitly specifying node-affinity
   [ 8/11] libxl: optimize the calculation of how many VCPUs can run on a candidate
 * [ 9/11] libxl: automatic placement deals with node-affinity
 * [10/11] xl: add node-affinity to the output of `xl list`
   [11/11] docs: rearrange and update NUMA placement documentation

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-03-12 15:46   ` Ian Campbell
  2013-02-01 11:01 ` [PATCH 02 of 11 v3] xen, libxc: introduce xc_nodemap_t Dario Faggioli
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

More specifically, this patch:
 1. replaces xenctl_cpumap with xenctl_bitmap;
 2. provides bitmap_to_xenctl_bitmap and the reverse;
 3. re-implements cpumask_to_xenctl_bitmap (and the reverse) on top of
    bitmap_to_xenctl_bitmap (and the reverse).

Other than #3, no functional changes. The interface is only slightly
affected.

This is in preparation of introducing NUMA node-affinity maps.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
---
Changes since v2:
 * Took care of tools/tests/mce-test/tools/xen-mceinj.c, which had
   been forgotten there, as requested during review.
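
Just to illustrate where the renaming is heading (purely an example, not
part of the patch; 'local' and 'nr_nodes' are placeholders), the structure
can now describe a bitmap of anything, e.g. NUMA nodes, by expressing its
width in generic elements:

    struct xenctl_bitmap nodemap;

    /* 'local' would be a guest-accessible buffer of (nr_nodes + 7) / 8 bytes */
    set_xen_guest_handle(nodemap.bitmap, local);
    nodemap.nr_elems = nr_nodes;  /* number of bits, no longer "nr_cpus" */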

diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c
--- a/tools/libxc/xc_cpupool.c
+++ b/tools/libxc/xc_cpupool.c
@@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO;
     sysctl.u.cpupool_op.cpupool_id = poolid;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
@@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter
     sysctl.cmd = XEN_SYSCTL_cpupool_op;
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
@@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc
     domctl.u.vcpuaffinity.vcpu = vcpu;
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c
--- a/tools/libxc/xc_tbuf.c
+++ b/tools/libxc/xc_tbuf.c
@@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x
     bitmap_64_to_byte(bytemap, &mask64, sizeof (mask64) * 8);
 
     set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap);
-    sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8;
+    sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8;
 
     ret = do_sysctl(xch, &sysctl);
 
diff --git a/tools/tests/mce-test/tools/xen-mceinj.c b/tools/tests/mce-test/tools/xen-mceinj.c
--- a/tools/tests/mce-test/tools/xen-mceinj.c
+++ b/tools/tests/mce-test/tools/xen-mceinj.c
@@ -161,7 +161,7 @@ static int inject_cmci(xc_interface *xc_
 
     mc.u.mc_inject_v2.flags |= XEN_MC_INJECT_CPU_BROADCAST;
     mc.u.mc_inject_v2.flags |= XEN_MC_INJECT_TYPE_CMCI;
-    mc.u.mc_inject_v2.cpumap.nr_cpus = nr_cpus;
+    mc.u.mc_inject_v2.cpumap.nr_elems = nr_cpus;
 
     return xc_mca_op(xc_handle, &mc);
 }
diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -1454,8 +1454,7 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_m
             cpumap = &cpu_online_map;
         else
         {
-            ret = xenctl_cpumap_to_cpumask(&cmv,
-                                           &op->u.mc_inject_v2.cpumap);
+            ret = xenctl_bitmap_to_cpumask(&cmv, &op->u.mc_inject_v2.cpumap);
             if ( ret )
                 break;
             cpumap = cmv;
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -336,7 +336,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
     {
         uint32_t cpu;
         uint64_t idletime, now = NOW();
-        struct xenctl_cpumap ctlmap;
+        struct xenctl_bitmap ctlmap;
         cpumask_var_t cpumap;
         XEN_GUEST_HANDLE(uint8) cpumap_bitmap;
         XEN_GUEST_HANDLE(uint64) idletimes;
@@ -345,11 +345,11 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
         if ( cpufreq_controller != FREQCTL_dom0_kernel )
             break;
 
-        ctlmap.nr_cpus  = op->u.getidletime.cpumap_nr_cpus;
+        ctlmap.nr_elems  = op->u.getidletime.cpumap_nr_cpus;
         guest_from_compat_handle(cpumap_bitmap,
                                  op->u.getidletime.cpumap_bitmap);
         ctlmap.bitmap.p = cpumap_bitmap.p; /* handle -> handle_64 conversion */
-        if ( (ret = xenctl_cpumap_to_cpumask(&cpumap, &ctlmap)) != 0 )
+        if ( (ret = xenctl_bitmap_to_cpumask(&cpumap, &ctlmap)) != 0 )
             goto out;
         guest_from_compat_handle(idletimes, op->u.getidletime.idletime);
 
@@ -368,7 +368,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
 
         op->u.getidletime.now = now;
         if ( ret == 0 )
-            ret = cpumask_to_xenctl_cpumap(&ctlmap, cpumap);
+            ret = cpumask_to_xenctl_bitmap(&ctlmap, cpumap);
         free_cpumask_var(cpumap);
 
         if ( ret == 0 && __copy_field_to_guest(u_xenpf_op, op, u.getidletime) )
diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -493,7 +493,7 @@ int cpupool_do_sysctl(struct xen_sysctl_
         op->cpupool_id = c->cpupool_id;
         op->sched_id = c->sched->sched_id;
         op->n_dom = c->n_dom;
-        ret = cpumask_to_xenctl_cpumap(&op->cpumap, c->cpu_valid);
+        ret = cpumask_to_xenctl_bitmap(&op->cpumap, c->cpu_valid);
         cpupool_put(c);
     }
     break;
@@ -588,7 +588,7 @@ int cpupool_do_sysctl(struct xen_sysctl_
 
     case XEN_SYSCTL_CPUPOOL_OP_FREEINFO:
     {
-        ret = cpumask_to_xenctl_cpumap(
+        ret = cpumask_to_xenctl_bitmap(
             &op->cpumap, &cpupool_free_cpus);
     }
     break;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -32,28 +32,29 @@
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);
 
-int cpumask_to_xenctl_cpumap(
-    struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask)
+int bitmap_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_bitmap,
+                            const unsigned long *bitmap,
+                            unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes, i;
     uint8_t zero = 0;
     int err = 0;
-    uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
-    bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids);
+    bitmap_long_to_byte(bytemap, bitmap, nbits);
 
     if ( copy_bytes != 0 )
-        if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes) )
+        if ( copy_to_guest(xenctl_bitmap->bitmap, bytemap, copy_bytes) )
             err = -EFAULT;
 
     for ( i = copy_bytes; !err && i < guest_bytes; i++ )
-        if ( copy_to_guest_offset(xenctl_cpumap->bitmap, i, &zero, 1) )
+        if ( copy_to_guest_offset(xenctl_bitmap->bitmap, i, &zero, 1) )
             err = -EFAULT;
 
     xfree(bytemap);
@@ -61,36 +62,58 @@ int cpumask_to_xenctl_cpumap(
     return err;
 }
 
-int xenctl_cpumap_to_cpumask(
-    cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap)
+int xenctl_bitmap_to_bitmap(unsigned long *bitmap,
+                            const struct xenctl_bitmap *xenctl_bitmap,
+                            unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes;
     int err = 0;
-    uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
     if ( copy_bytes != 0 )
     {
-        if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap, copy_bytes) )
+        if ( copy_from_guest(bytemap, xenctl_bitmap->bitmap, copy_bytes) )
             err = -EFAULT;
-        if ( (xenctl_cpumap->nr_cpus & 7) && (guest_bytes == copy_bytes) )
-            bytemap[guest_bytes-1] &= ~(0xff << (xenctl_cpumap->nr_cpus & 7));
+        if ( (xenctl_bitmap->nr_elems & 7) && (guest_bytes == copy_bytes) )
+            bytemap[guest_bytes-1] &= ~(0xff << (xenctl_bitmap->nr_elems & 7));
     }
 
-    if ( err )
-        /* nothing */;
-    else if ( alloc_cpumask_var(cpumask) )
-        bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids);
+    if ( !err )
+        bitmap_byte_to_long(bitmap, bytemap, nbits);
+
+    xfree(bytemap);
+
+    return err;
+}
+
+int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_cpumap,
+                             const cpumask_t *cpumask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_cpumap, cpumask_bits(cpumask),
+                                   nr_cpu_ids);
+}
+
+int xenctl_bitmap_to_cpumask(cpumask_var_t *cpumask,
+                             const struct xenctl_bitmap *xenctl_cpumap)
+{
+    int err = 0;
+
+    if ( alloc_cpumask_var(cpumask) ) {
+        err = xenctl_bitmap_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap,
+                                      nr_cpu_ids);
+        /* In case of error, cleanup is up to us, as the caller won't care! */
+        if ( err )
+            free_cpumask_var(*cpumask);
+    }
     else
         err = -ENOMEM;
 
-    xfree(bytemap);
-
     return err;
 }
 
@@ -539,7 +562,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
         {
             cpumask_var_t new_affinity;
 
-            ret = xenctl_cpumap_to_cpumask(
+            ret = xenctl_bitmap_to_cpumask(
                 &new_affinity, &op->u.vcpuaffinity.cpumap);
             if ( !ret )
             {
@@ -549,7 +572,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
         }
         else
         {
-            ret = cpumask_to_xenctl_cpumap(
+            ret = cpumask_to_xenctl_bitmap(
                 &op->u.vcpuaffinity.cpumap, v->cpu_affinity);
         }
     }
diff --git a/xen/common/trace.c b/xen/common/trace.c
--- a/xen/common/trace.c
+++ b/xen/common/trace.c
@@ -384,7 +384,7 @@ int tb_control(xen_sysctl_tbuf_op_t *tbc
     {
         cpumask_var_t mask;
 
-        rc = xenctl_cpumap_to_cpumask(&mask, &tbc->cpu_mask);
+        rc = xenctl_bitmap_to_cpumask(&mask, &tbc->cpu_mask);
         if ( !rc )
         {
             cpumask_copy(&tb_cpu_mask, mask);
diff --git a/xen/include/public/arch-x86/xen-mca.h b/xen/include/public/arch-x86/xen-mca.h
--- a/xen/include/public/arch-x86/xen-mca.h
+++ b/xen/include/public/arch-x86/xen-mca.h
@@ -414,7 +414,7 @@ struct xen_mc_mceinject {
 
 struct xen_mc_inject_v2 {
 	uint32_t flags;
-	struct xenctl_cpumap cpumap;
+	struct xenctl_bitmap cpumap;
 };
 #endif
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -284,7 +284,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc
 /* XEN_DOMCTL_getvcpuaffinity */
 struct xen_domctl_vcpuaffinity {
     uint32_t  vcpu;              /* IN */
-    struct xenctl_cpumap cpumap; /* IN/OUT */
+    struct xenctl_bitmap cpumap; /* IN/OUT */
 };
 typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -71,7 +71,7 @@ struct xen_sysctl_tbuf_op {
 #define XEN_SYSCTL_TBUFOP_disable      5
     uint32_t cmd;
     /* IN/OUT variables */
-    struct xenctl_cpumap cpu_mask;
+    struct xenctl_bitmap cpu_mask;
     uint32_t             evt_mask;
     /* OUT variables */
     uint64_aligned_t buffer_mfn;
@@ -532,7 +532,7 @@ struct xen_sysctl_cpupool_op {
     uint32_t domid;       /* IN: M              */
     uint32_t cpu;         /* IN: AR             */
     uint32_t n_dom;       /*            OUT: I  */
-    struct xenctl_cpumap cpumap; /*     OUT: IF */
+    struct xenctl_bitmap cpumap; /*     OUT: IF */
 };
 typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t);
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16];
 #endif
 
 #ifndef __ASSEMBLY__
-struct xenctl_cpumap {
+struct xenctl_bitmap {
     XEN_GUEST_HANDLE_64(uint8) bitmap;
-    uint32_t nr_cpus;
+    uint32_t nr_elems;
 };
 #endif
 
diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
--- a/xen/include/xen/cpumask.h
+++ b/xen/include/xen/cpumask.h
@@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
 #define for_each_present_cpu(cpu)  for_each_cpu(cpu, &cpu_present_map)
 
 /* Copy to/from cpumap provided by control tools. */
-struct xenctl_cpumap;
-int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *);
-int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *);
+struct xenctl_bitmap;
+int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *, const cpumask_t *);
+int xenctl_bitmap_to_cpumask(cpumask_var_t *, const struct xenctl_bitmap *);
 
 #endif /* __XEN_CPUMASK_H */
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -2,7 +2,7 @@
 # ! - needs translation
 # ? - needs checking
 ?	dom0_vga_console_info		xen.h
-?	xenctl_cpumap			xen.h
+?	xenctl_bitmap			xen.h
 ?	mmu_update			xen.h
 !	mmuext_op			xen.h
 !	start_info			xen.h


* [PATCH 02 of 11 v3] xen, libxc: introduce xc_nodemap_t
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
  2013-02-01 11:01 ` [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 11:01 ` [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any Dario Faggioli
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

And its handling functions, following suit from xc_cpumap_t.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
---
Changes from v2:
 * Discriminating between statically allocated nodemask_t and
   dynamically allocated nodemask_var_t is not really necessary,
   given the maximum number of nodes won't ever grow that much,
   so killed that hunk, as suggested during review.
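
Just to give an idea of the intended usage pattern (illustrative only, not
part of the patch, and with error handling kept to a minimum):

    xc_nodemap_t nodemap = xc_nodemap_alloc(xch);

    if ( nodemap == NULL )
        return -1;
    /* ... set the bits of the wanted NUMA nodes in the byte array ... */
    free(nodemap);   /* xc_nodemap_alloc() uses calloc(), so plain free() */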

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -54,6 +54,11 @@ int xc_get_cpumap_size(xc_interface *xch
     return (xc_get_max_cpus(xch) + 7) / 8;
 }
 
+int xc_get_nodemap_size(xc_interface *xch)
+{
+    return (xc_get_max_nodes(xch) + 7) / 8;
+}
+
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
 {
     int sz;
@@ -64,6 +69,16 @@ xc_cpumap_t xc_cpumap_alloc(xc_interface
     return calloc(1, sz);
 }
 
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
+{
+    int sz;
+
+    sz = xc_get_nodemap_size(xch);
+    if (sz == 0)
+        return NULL;
+    return calloc(1, sz);
+}
+
 int xc_readconsolering(xc_interface *xch,
                        char *buffer,
                        unsigned int *pnr_chars,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -330,12 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch
 /* allocate a cpumap */
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
 
- /*
+/*
  * NODEMAP handling
  */
+typedef uint8_t *xc_nodemap_t;
+
 /* return maximum number of NUMA nodes the hypervisor supports */
 int xc_get_max_nodes(xc_interface *xch);
 
+/* return array size for nodemap */
+int xc_get_nodemap_size(xc_interface *xch);
+
+/* allocate a nodemap */
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch);
+
 /*
  * DOMAIN DEBUGGING FUNCTIONS
  */
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -117,6 +117,20 @@ int xenctl_bitmap_to_cpumask(cpumask_var
     return err;
 }
 
+int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap,
+                              const nodemask_t *nodemask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_nodemap, cpumask_bits(nodemask),
+                                   MAX_NUMNODES);
+}
+
+int xenctl_bitmap_to_nodemask(nodemask_t *nodemask,
+                              const struct xenctl_bitmap *xenctl_nodemap)
+{
+    return xenctl_bitmap_to_bitmap(nodes_addr(*nodemask), xenctl_nodemap,
+                                   MAX_NUMNODES);
+}
+
 static inline int is_free_domid(domid_t dom)
 {
     struct domain *d;


* [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
  2013-02-01 11:01 ` [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
  2013-02-01 11:01 ` [PATCH 02 of 11 v3] xen, libxc: introduce xc_nodemap_t Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 13:57   ` Juergen Gross
  2013-02-07 17:50   ` George Dunlap
  2013-02-01 11:01 ` [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

The pcpu picking algorithm treats the two threads of an SMT core the same.
More specifically, if one is idle and the other one is busy, they both
will be assigned a weight of 1. Therefore, when picking begins, if the
first target pcpu is the busy thread (and if there are no idle pcpus
other than its sibling), that will never change.

This change fixes that by ensuring that, before entering the core of
the picking algorithm, the target pcpu is an idle one (if there is an
idle pcpu at all, of course).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -526,6 +526,18 @@ static int
     if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
         cpumask_set_cpu(cpu, &idlers);
     cpumask_and(&cpus, &cpus, &idlers);
+
+    /*
+     * It is important that cpu points to an idle processor, if a suitable
+     * one exists (and we can use cpus to check and, possibly, choose a new
+     * CPU, as we just &&-ed it with idlers). In fact, if we are on SMT, and
+     * cpu points to a busy thread with an idle sibling, both the threads
+     * will be considered the same, from the "idleness" calculation point
+     * of view", preventing vcpu from being moved to the thread that is
+     * actually idle.
+     */
+    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
+        cpu = cpumask_cycle(cpu, &cpus);
     cpumask_clear_cpu(cpu, &cpus);
 
     while ( !cpumask_empty(&cpus) )


* [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (2 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 14:30   ` Juergen Gross
                     ` (2 more replies)
  2013-02-01 11:01 ` [PATCH 05 of 11 v3] xen: allow for explicitly specifying node-affinity Dario Faggioli
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Whereas vcpu-affinity tells where VCPUs must run, node-affinity tells
where they should, or better, prefer to run. While respecting vcpu-affinity
remains mandatory, node-affinity is not that strict: it only expresses
a preference, although honouring it almost always brings a significant
performance benefit (especially as compared to not having any affinity
at all).

This change modifies the VCPU load balancing algorithm (for the
credit scheduler only), introducing a two-step logic.
During the first step, we use the node-affinity mask. The aim is
to give precedence to the CPUs where it is known to be preferable
for the domain to run. If that fails to find a valid PCPU, the
node-affinity is just ignored and, in the second step, we fall
back to using cpu-affinity only.
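
In pseudo-C, the new logic looks more or less like this (illustrative
only; it uses the helpers introduced below, while 'mask' and 'online'
stand for local variables):

    for_each_csched_balance_step( step )
    {
        /* Skip the node-affinity step if it would not restrict anything */
        if ( csched_balance_step_skippable(step, vc) )
            continue;

        /* Node-affinity & vcpu-affinity on the first step, plain
         * vcpu-affinity on the fallback step */
        csched_balance_cpumask(vc, step, &mask);
        cpumask_and(&mask, &mask, online);

        /* If this step yields suitable pcpus, balancing ends here */
        if ( !cpumask_empty(&mask) )
            break;
    }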

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v2:
 * for_each_csched_balance_step() now is defined as a regular, and
   easier to understand, 0...n for() loop, instead of a going backwards
   one, as that wasn't really needed;
 * checking whether or not a CSCHED_BALANCE_NODE_AFFINITY balancing
   set is useful or not now happens outside of csched_balance_cpumask(),
   i.e., closer to the actual loop, making the logic a lot more evident
   and easy to understand, as requested during review;
 * while reworking __runq_tickle(), handling of idle pcpu has been brought
   outside from the balancing loop, as requested during review;
 * __csched_vcpu_is_migrateable() was just wrong, so it has been removed;
 * the suboptimal handling of SMT in _csched_cpu_pick() has been moved
   to a separate patch (i.e., the previous patch in the series);
 * moved the CPU mask needed for balancing within `csched_pcpu', as
   suggested during review. This way it is not only more close to
   other per-PCPU data (potential cache related benefits), but it is
   also only allocated for the PCPUs credit is in charge of.

Changes from v1:
 * CPU masks variables moved off from the stack, as requested during
   review. As per the comments in the code, having them in the private
   (per-scheduler instance) struct could have been enough, but it would be
   racy (again, see comments). For that reason, a global bunch of
   them is used (via per_cpu());
 * George suggested a different load balancing logic during v1's review. I
   think he was right, so I changed the old implementation in a way that
   resembles exactly that. I rewrote most of this patch to introduce a
   more sensible and effective node-affinity handling logic.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -111,6 +111,12 @@
 
 
 /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_NODE_AFFINITY    0
+#define CSCHED_BALANCE_CPU_AFFINITY     1
+
+/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -125,9 +131,20 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    /* Store this here to avoid having too many cpumask_var_t-s on stack */
+    cpumask_var_t balance_mask;
 };
 
 /*
+ * Convenience macro for accessing the per-PCPU cpumask we need for
+ * implementing the two steps (vcpu and node affinity) balancing logic.
+ * It is stored in csched_pcpu so that serialization is not an issue,
+ * as there is a csched_pcpu for each PCPU and we always hold the
+ * runqueue spin-lock when using this.
+ */
+#define csched_balance_mask (CSCHED_PCPU(smp_processor_id())->balance_mask)
+
+/*
  * Virtual CPU
  */
 struct csched_vcpu {
@@ -159,6 +176,9 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
+    /* cpumask translated from the domain's node-affinity.
+     * Basically, the CPUs we prefer to be scheduled on. */
+    cpumask_var_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -239,6 +259,43 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+#define for_each_csched_balance_step(step) \
+    for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
+
+
+/*
+ * vcpu-affinity balancing is always necessary and must never be skipped.
+ * OTOH, if a domain has affinity with all the nodes, we can tell the caller
+ * that he can safely skip the node-affinity balancing step.
+ */
+#define __vcpu_has_valuable_node_affinity(vc) \
+    ( !cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
+
+static inline int csched_balance_step_skippable(int step, struct vcpu *vc)
+{
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY
+         && !__vcpu_has_valuable_node_affinity(vc) )
+        return 1;
+    return 0;
+}
+
+/*
+ * Each csched-balance step uses its own cpumask. This function determines
+ * which one (given the step) and copies it in mask. For the node-affinity
+ * balancing step, the pcpus that are not part of vc's vcpu-affinity are
+ * filtered out from the result, to avoid running a vcpu where it would
+ * like, but is not allowed to!
+ */
+static void
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+        cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask,
+                    vc->cpu_affinity);
+    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
+        cpumask_copy(mask, vc->cpu_affinity);
+}
+
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
@@ -266,12 +323,12 @@ static inline void
     struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
     struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
     cpumask_t mask, idle_mask;
-    int idlers_empty;
+    int balance_step, idlers_empty;
 
     ASSERT(cur);
     cpumask_clear(&mask);
+    idlers_empty = cpumask_empty(prv->idlers);
 
-    idlers_empty = cpumask_empty(prv->idlers);
 
     /*
      * If the pcpu is idle, or there are no idlers and the new
@@ -291,41 +348,67 @@ static inline void
     }
     else if ( !idlers_empty )
     {
-        /* Check whether or not there are idlers that can run new */
-        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+        /* Node and vcpu-affinity balancing loop */
+        for_each_csched_balance_step( balance_step )
+        {
+            int new_idlers_empty;
 
-        /*
-         * If there are no suitable idlers for new, and it's higher
-         * priority than cur, ask the scheduler to migrate cur away.
-         * We have to act like this (instead of just waking some of
-         * the idlers suitable for cur) because cur is running.
-         *
-         * If there are suitable idlers for new, no matter priorities,
-         * leave cur alone (as it is running and is, likely, cache-hot)
-         * and wake some of them (which is waking up and so is, likely,
-         * cache cold anyway).
-         */
-        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
-        {
-            SCHED_STAT_CRANK(tickle_idlers_none);
-            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
-            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
-            SCHED_STAT_CRANK(migrate_kicked_away);
-            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
-            cpumask_set_cpu(cpu, &mask);
-        }
-        else if ( !cpumask_empty(&idle_mask) )
-        {
-            /* Which of the idlers suitable for new shall we wake up? */
-            SCHED_STAT_CRANK(tickle_idlers_some);
-            if ( opt_tickle_one_idle )
+            /* For vcpus with no node-affinity, consider vcpu-affinity only */
+            if ( csched_balance_step_skippable( balance_step, new->vcpu) )
+                continue;
+
+            /* Are there idlers suitable for new (for this balance step)? */
+            csched_balance_cpumask(new->vcpu, balance_step,
+                                   csched_balance_mask);
+            cpumask_and(&idle_mask, prv->idlers, csched_balance_mask);
+            new_idlers_empty = cpumask_empty(&idle_mask);
+
+            /*
+             * Let's not be too harsh! If there aren't idlers suitable
+             * for new in its node-affinity mask, make sure we check its
+             * vcpu-affinity as well, before taking final decisions.
+             */
+            if ( new_idlers_empty
+                 && balance_step == CSCHED_BALANCE_NODE_AFFINITY )
+                continue;
+
+            /*
+             * If there are no suitable idlers for new, and it's higher
+             * priority than cur, ask the scheduler to migrate cur away.
+             * We have to act like this (instead of just waking some of
+             * the idlers suitable for cur) because cur is running.
+             *
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway).
+             */
+            if ( new_idlers_empty && new->pri > cur->pri )
             {
-                this_cpu(last_tickle_cpu) =
-                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                SCHED_STAT_CRANK(tickle_idlers_none);
+                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+                SCHED_STAT_CRANK(migrate_kicked_away);
+                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+                cpumask_set_cpu(cpu, &mask);
             }
-            else
-                cpumask_or(&mask, &mask, &idle_mask);
+            else if ( !new_idlers_empty )
+            {
+                /* Which of the idlers suitable for new shall we wake up? */
+                SCHED_STAT_CRANK(tickle_idlers_some);
+                if ( opt_tickle_one_idle )
+                {
+                    this_cpu(last_tickle_cpu) =
+                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                }
+                else
+                    cpumask_or(&mask, &mask, &idle_mask);
+            }
+
+            /* Did we find anyone? */
+            if ( !cpumask_empty(&mask) )
+                break;
         }
     }
 
@@ -370,6 +453,7 @@ csched_free_pdata(const struct scheduler
 
     spin_unlock_irqrestore(&prv->lock, flags);
 
+    free_cpumask_var(spc->balance_mask);
     xfree(spc);
 }
 
@@ -385,6 +469,12 @@ csched_alloc_pdata(const struct schedule
     if ( spc == NULL )
         return NULL;
 
+    if ( !alloc_cpumask_var(&spc->balance_mask) )
+    {
+        xfree(spc);
+        return NULL;
+    }
+
     spin_lock_irqsave(&prv->lock, flags);
 
     /* Initialize/update system-wide config */
@@ -475,15 +565,16 @@ static inline int
 }
 
 static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
      * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that's allowed to run on our CPU.
+     * peer PCPU. Only pick up work that prefers and/or is allowed to run
+     * on our CPU.
      */
     return !vc->is_running &&
            !__csched_vcpu_is_cache_hot(vc) &&
-           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+           cpumask_test_cpu(dest_cpu, mask);
 }
 
 static int
@@ -493,97 +584,110 @@ static int
     cpumask_t idlers;
     cpumask_t *online;
     struct csched_pcpu *spc = NULL;
-    int cpu;
+    int cpu = vc->processor;
+    int balance_step;
 
-    /*
-     * Pick from online CPUs in VCPU's affinity mask, giving a
-     * preference to its current processor if it's in there.
-     */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
-    cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
-            ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
-    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+    for_each_csched_balance_step( balance_step )
+    {
+        if ( csched_balance_step_skippable( balance_step, vc) )
+            continue;
 
-    /*
-     * Try to find an idle processor within the above constraints.
-     *
-     * In multi-core and multi-threaded CPUs, not all idle execution
-     * vehicles are equal!
-     *
-     * We give preference to the idle execution vehicle with the most
-     * idling neighbours in its grouping. This distributes work across
-     * distinct cores first and guarantees we don't do something stupid
-     * like run two VCPUs on co-hyperthreads while there are idle cores
-     * or sockets.
-     *
-     * Notice that, when computing the "idleness" of cpu, we may want to
-     * discount vc. That is, iff vc is the currently running and the only
-     * runnable vcpu on cpu, we add cpu to the idlers.
-     */
-    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
-        cpumask_set_cpu(cpu, &idlers);
-    cpumask_and(&cpus, &cpus, &idlers);
+        /* Pick an online CPU from the proper affinity mask */
+        csched_balance_cpumask(vc, balance_step, &cpus);
+        cpumask_and(&cpus, &cpus, online);
 
-    /*
-     * It is important that cpu points to an idle processor, if a suitable
-     * one exists (and we can use cpus to check and, possibly, choose a new
-     * CPU, as we just &&-ed it with idlers). In fact, if we are on SMT, and
-     * cpu points to a busy thread with an idle sibling, both the threads
-     * will be considered the same, from the "idleness" calculation point
-     * of view", preventing vcpu from being moved to the thread that is
-     * actually idle.
-     */
-    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
-        cpu = cpumask_cycle(cpu, &cpus);
-    cpumask_clear_cpu(cpu, &cpus);
+        /* If present, prefer vc's current processor */
+        cpu = cpumask_test_cpu(vc->processor, &cpus)
+                ? vc->processor
+                : cpumask_cycle(vc->processor, &cpus);
+        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
 
-    while ( !cpumask_empty(&cpus) )
-    {
-        cpumask_t cpu_idlers;
-        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
-        int migrate_factor;
+        /*
+         * Try to find an idle processor within the above constraints.
+         *
+         * In multi-core and multi-threaded CPUs, not all idle execution
+         * vehicles are equal!
+         *
+         * We give preference to the idle execution vehicle with the most
+         * idling neighbours in its grouping. This distributes work across
+         * distinct cores first and guarantees we don't do something stupid
+         * like run two VCPUs on co-hyperthreads while there are idle cores
+         * or sockets.
+         *
+         * Notice that, when computing the "idleness" of cpu, we may want to
+         * discount vc. That is, iff vc is the currently running and the only
+         * runnable vcpu on cpu, we add cpu to the idlers.
+         */
+        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
+        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
+            cpumask_set_cpu(cpu, &idlers);
+        cpumask_and(&cpus, &cpus, &idlers);
 
-        nxt = cpumask_cycle(cpu, &cpus);
+        /*
+         * It is important that cpu points to an idle processor, if a suitable
+         * one exists (and we can use cpus to check and, possibly, choose a new
+         * CPU, as we just &&-ed it with idlers). In fact, if we are on SMT, and
+         * cpu points to a busy thread with an idle sibling, both the threads
+         * will be considered the same, from the "idleness" calculation point
+         * of view", preventing vcpu from being moved to the thread that is
+         * actually idle.
+         */
+        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
+            cpu = cpumask_cycle(cpu, &cpus);
+        cpumask_clear_cpu(cpu, &cpus);
 
-        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
+        while ( !cpumask_empty(&cpus) )
         {
-            /* We're on the same socket, so check the busy-ness of threads.
-             * Migrate if # of idlers is less at all */
-            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 1;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
-        }
-        else
-        {
-            /* We're on different sockets, so check the busy-ness of cores.
-             * Migrate only if the other core is twice as idle */
-            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 2;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            cpumask_t cpu_idlers;
+            cpumask_t nxt_idlers;
+            int nxt, weight_cpu, weight_nxt;
+            int migrate_factor;
+
+            nxt = cpumask_cycle(cpu, &cpus);
+
+            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
+            {
+                /* We're on the same socket, so check the busy-ness of threads.
+                 * Migrate if # of idlers is less at all */
+                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 1;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            nxt));
+            }
+            else
+            {
+                /* We're on different sockets, so check the busy-ness of cores.
+                 * Migrate only if the other core is twice as idle */
+                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 2;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            }
+
+            weight_cpu = cpumask_weight(&cpu_idlers);
+            weight_nxt = cpumask_weight(&nxt_idlers);
+            /* smt_power_savings: consolidate work rather than spreading it */
+            if ( sched_smt_power_savings ?
+                 weight_cpu > weight_nxt :
+                 weight_cpu * migrate_factor < weight_nxt )
+            {
+                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
+                spc = CSCHED_PCPU(nxt);
+                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
+                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
+            }
+            else
+            {
+                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
+            }
         }
 
-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
-        /* smt_power_savings: consolidate work rather than spreading it */
-        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
-        {
-            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
-            spc = CSCHED_PCPU(nxt);
-            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
-            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
-        }
-        else
-        {
-            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
-        }
+        /* Stop if cpu is idle */
+        if ( cpumask_test_cpu(cpu, &idlers) )
+            break;
     }
 
     if ( commit && spc )
@@ -925,6 +1029,13 @@ csched_alloc_domdata(const struct schedu
     if ( sdom == NULL )
         return NULL;
 
+    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
+    {
+        xfree(sdom);
+        return NULL;
+    }
+    cpumask_setall(sdom->node_affinity_cpumask);
+
     /* Initialize credit and weight */
     INIT_LIST_HEAD(&sdom->active_vcpu);
     sdom->active_vcpu_count = 0;
@@ -956,6 +1067,9 @@ csched_dom_init(const struct scheduler *
 static void
 csched_free_domdata(const struct scheduler *ops, void *data)
 {
+    struct csched_dom *sdom = data;
+
+    free_cpumask_var(sdom->node_affinity_cpumask);
     xfree(data);
 }
 
@@ -1252,7 +1366,7 @@ csched_tick(void *_cpu)
 }
 
 static struct csched_vcpu *
-csched_runq_steal(int peer_cpu, int cpu, int pri)
+csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
     const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
@@ -1277,11 +1391,21 @@ csched_runq_steal(int peer_cpu, int cpu,
             if ( speer->pri <= pri )
                 break;
 
-            /* Is this VCPU is runnable on our PCPU? */
+            /* Is this VCPU runnable on our PCPU? */
             vc = speer->vcpu;
             BUG_ON( is_idle_vcpu(vc) );
 
-            if (__csched_vcpu_is_migrateable(vc, cpu))
+            /*
+             * If the vcpu has no valuable node-affinity, skip this vcpu.
+             * In fact, what we want is to check if we have any node-affine
+             * work to steal, before starting to look at vcpu-affine work.
+             */
+            if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
+                 && !__vcpu_has_valuable_node_affinity(vc) )
+                continue;
+
+            csched_balance_cpumask(vc, balance_step, csched_balance_mask);
+            if ( __csched_vcpu_is_migrateable(vc, cpu, csched_balance_mask) )
             {
                 /* We got a candidate. Grab it! */
                 TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
@@ -1307,7 +1431,8 @@ csched_load_balance(struct csched_privat
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu;
+    int peer_cpu, peer_node, bstep;
+    int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
     online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
@@ -1324,42 +1449,68 @@ csched_load_balance(struct csched_privat
         SCHED_STAT_CRANK(load_balance_other);
 
     /*
-     * Peek at non-idling CPUs in the system, starting with our
-     * immediate neighbour.
+     * Let's look around for work to steal, taking both vcpu-affinity
+     * and node-affinity into account. More specifically, we check all
+     * the non-idle CPUs' runq, looking for:
+     *  1. any node-affine work to steal first,
+     *  2. if not finding anything, any vcpu-affine work to steal.
      */
-    cpumask_andnot(&workers, online, prv->idlers);
-    cpumask_clear_cpu(cpu, &workers);
-    peer_cpu = cpu;
+    for_each_csched_balance_step( bstep )
+    {
+        /*
+         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+         * it is more likely that we find some node-affine work on our same
+         * node, not to mention that migrating vcpus within the same node
+         * could well expected to be cheaper than across-nodes (memory
+         * stays local, there might be some node-wide cache[s], etc.).
+         */
+        peer_node = node;
+        do
+        {
+            /* Find out what the !idle are in this node */
+            cpumask_andnot(&workers, online, prv->idlers);
+            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+            cpumask_clear_cpu(cpu, &workers);
 
-    while ( !cpumask_empty(&workers) )
-    {
-        peer_cpu = cpumask_cycle(peer_cpu, &workers);
-        cpumask_clear_cpu(peer_cpu, &workers);
+            if ( cpumask_empty(&workers) )
+                goto next_node;
 
-        /*
-         * Get ahold of the scheduler lock for this peer CPU.
-         *
-         * Note: We don't spin on this lock but simply try it. Spinning could
-         * cause a deadlock if the peer CPU is also load balancing and trying
-         * to lock this CPU.
-         */
-        if ( !pcpu_schedule_trylock(peer_cpu) )
-        {
-            SCHED_STAT_CRANK(steal_trylock_failed);
-            continue;
-        }
+            peer_cpu = cpumask_first(&workers);
+            do
+            {
+                /*
+                 * Get ahold of the scheduler lock for this peer CPU.
+                 *
+                 * Note: We don't spin on this lock but simply try it. Spinning
+                 * could cause a deadlock if the peer CPU is also load
+                 * balancing and trying to lock this CPU.
+                 */
+                if ( !pcpu_schedule_trylock(peer_cpu) )
+                {
+                    SCHED_STAT_CRANK(steal_trylock_failed);
+                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
+                    continue;
+                }
 
-        /*
-         * Any work over there to steal?
-         */
-        speer = cpumask_test_cpu(peer_cpu, online) ?
-            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
-        pcpu_schedule_unlock(peer_cpu);
-        if ( speer != NULL )
-        {
-            *stolen = 1;
-            return speer;
-        }
+                /* Any work over there to steal? */
+                speer = cpumask_test_cpu(peer_cpu, online) ?
+                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
+                pcpu_schedule_unlock(peer_cpu);
+
+                /* As soon as one vcpu is found, balancing ends */
+                if ( speer != NULL )
+                {
+                    *stolen = 1;
+                    return speer;
+                }
+
+                peer_cpu = cpumask_cycle(peer_cpu, &workers);
+
+            } while( peer_cpu != cpumask_first(&workers) );
+
+ next_node:
+            peer_node = cycle_node(peer_node, node_online_map);
+        } while( peer_node != node );
     }
 
  out:
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -41,6 +41,8 @@
  * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
  * int first_unset_node(mask)		First node not set in mask, or 
  *					MAX_NUMNODES.
+ * int cycle_node(node, mask)		Next node cycling from 'node', or
+ *					MAX_NUMNODES
  *
  * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
  * NODE_MASK_ALL			Initializer - all bits set
@@ -254,6 +256,16 @@ static inline int __first_unset_node(con
 			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
 }
 
+#define cycle_node(n, src) __cycle_node((n), &(src), MAX_NUMNODES)
+static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
+{
+    int nxt = __next_node(n, maskp, nbits);
+
+    if (nxt == nbits)
+        nxt = __first_node(maskp, nbits);
+    return nxt;
+}
+
 #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
 
 #if MAX_NUMNODES <= BITS_PER_LONG

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 05 of 11 v3] xen: allow for explicitly specifying node-affinity
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (3 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 14:20   ` Juergen Gross
  2013-02-01 11:01 ` [PATCH 06 of 11 v3] libxc: " Dario Faggioli
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Make it possible to pass the node-affinity of a domain to the hypervisor
from the upper layers, instead of always being computed automatically.

Note that this also required generalizing the Flask hooks for setting
and getting the affinity, so that they now deal with both vcpu and
node affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v2:
 * reworked as needed after the merge of Daniel's IS_PRIV work;
 * xen.te taken care of, as requested during review.

Changes from v1:
 * added the missing dummy hook for nodeaffinity;
 * let the permission renaming affect flask policies too.
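
To illustrate the semantics of the new interface, here is a quick sketch
(not part of the patch, and the calling context, including the domain
pointer `d', is hypothetical) of how code within the hypervisor could now
restrict a domain's node-affinity, and hence its memory allocations, to
node 0:

    nodemask_t affinity;

    nodes_clear(affinity);
    node_set(0, affinity);
    if ( domain_set_node_affinity(d, &affinity) )
        printk("dom%d: failed to set node-affinity\n", d->domain_id);

Conversely, passing a mask with all nodes set is taken as a request to go
back to the automatically computed node-affinity (see the nodes_full()
check in domain_set_node_affinity() below).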

diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls
--- a/tools/flask/policy/policy/mls
+++ b/tools/flask/policy/policy/mls
@@ -70,11 +70,11 @@ mlsconstrain domain transition
 	(( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv)));
 
 # all the domain "read" ops
-mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
+mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
 	((l1 dom l2) or (t1 == mls_priv));
 
 # all the domain "write" ops
-mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
+mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
 	((l1 eq l2) or (t1 == mls_priv));
 
 # This is incomplete - similar constraints must be written for all classes
diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if
--- a/tools/flask/policy/policy/modules/xen/xen.if
+++ b/tools/flask/policy/policy/modules/xen/xen.if
@@ -48,7 +48,7 @@ define(`create_domain_common', `
 	allow $1 $2:domain { create max_vcpus setdomainmaxmem setaddrsize
 			getdomaininfo hypercall setvcpucontext setextvcpucontext
 			getscheduler getvcpuinfo getvcpuextstate getaddrsize
-			getvcpuaffinity setvcpuaffinity };
+			getaffinity setaffinity };
 	allow $1 $2:domain2 { set_cpuid settsc setscheduler };
 	allow $1 $2:security check_context;
 	allow $1 $2:shadow enable;
@@ -77,9 +77,9 @@ define(`create_domain_build_label', `
 # manage_domain(priv, target)
 #   Allow managing a running domain
 define(`manage_domain', `
-	allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity
+	allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
 			getaddrsize pause unpause trigger shutdown destroy
-			setvcpuaffinity setdomainmaxmem getscheduler };
+			setaffinity setdomainmaxmem getscheduler };
 ')
 
 # migrate_domain_out(priv, target)
diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
--- a/tools/flask/policy/policy/modules/xen/xen.te
+++ b/tools/flask/policy/policy/modules/xen/xen.te
@@ -62,7 +62,7 @@ allow dom0_t security_t:security { check
 	compute_member load_policy compute_relabel compute_user setenforce
 	setbool setsecparam add_ocontext del_ocontext };
 
-allow dom0_t dom0_t:domain { getdomaininfo getvcpuinfo getvcpuaffinity };
+allow dom0_t dom0_t:domain { getdomaininfo getvcpuinfo getaffinity };
 allow dom0_t dom0_t:resource { add remove };
 
 admin_device(dom0_t, device_t)
diff --git a/xen/common/domain.c b/xen/common/domain.c
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -222,6 +222,7 @@ struct domain *domain_create(
 
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
+    d->auto_node_affinity = 1;
 
     spin_lock_init(&d->shutdown_lock);
     d->shutdown_code = -1;
@@ -362,11 +363,26 @@ void domain_update_node_affinity(struct 
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
-    for_each_online_node ( node )
-        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
-            node_set(node, nodemask);
+    if ( d->auto_node_affinity )
+    {
+        /* Node-affinity is automatically computed from all vcpu-affinities */
+        for_each_online_node ( node )
+            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_set(node, nodemask);
 
-    d->node_affinity = nodemask;
+        d->node_affinity = nodemask;
+    }
+    else
+    {
+        /* Node-affinity is provided by someone else, just filter out cpus
+         * that are either offline or not in the affinity of any vcpus. */
+        for_each_node_mask ( node, d->node_affinity )
+            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_clear(node, d->node_affinity);
+    }
+
+    sched_set_node_affinity(d, &d->node_affinity);
+
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -374,6 +390,36 @@ void domain_update_node_affinity(struct 
 }
 
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
+{
+    /* Being affine with no nodes is just wrong */
+    if ( nodes_empty(*affinity) )
+        return -EINVAL;
+
+    spin_lock(&d->node_affinity_lock);
+
+    /*
+     * Being/becoming explicitly affine to all nodes is not particularly
+     * useful. Let's take it as the `reset node affinity` command.
+     */
+    if ( nodes_full(*affinity) )
+    {
+        d->auto_node_affinity = 1;
+        goto out;
+    }
+
+    d->auto_node_affinity = 0;
+    d->node_affinity = *affinity;
+
+out:
+    spin_unlock(&d->node_affinity_lock);
+
+    domain_update_node_affinity(d);
+
+    return 0;
+}
+
+
 struct domain *get_domain_by_id(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -559,6 +559,26 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
     }
     break;
 
+    case XEN_DOMCTL_setnodeaffinity:
+    case XEN_DOMCTL_getnodeaffinity:
+    {
+        if ( op->cmd == XEN_DOMCTL_setnodeaffinity )
+        {
+            nodemask_t new_affinity;
+
+            ret = xenctl_bitmap_to_nodemask(&new_affinity,
+                                            &op->u.nodeaffinity.nodemap);
+            if ( !ret )
+                ret = domain_set_node_affinity(d, &new_affinity);
+        }
+        else
+        {
+            ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
+                                            &d->node_affinity);
+        }
+    }
+    break;
+
     case XEN_DOMCTL_setvcpuaffinity:
     case XEN_DOMCTL_getvcpuaffinity:
     {
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -217,6 +217,14 @@ static void cpuset_print(char *set, int 
     *set++ = '\0';
 }
 
+static void nodeset_print(char *set, int size, const nodemask_t *mask)
+{
+    *set++ = '[';
+    set += nodelist_scnprintf(set, size-2, mask);
+    *set++ = ']';
+    *set++ = '\0';
+}
+
 static void periodic_timer_print(char *str, int size, uint64_t period)
 {
     if ( period == 0 )
@@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
 
         dump_pageframe_info(d);
                
+        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
+        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
+
         printk("VCPU information and callbacks for domain %u:\n",
                d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -259,17 +259,46 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+/*
+ * Translates node-affinity mask into a cpumask, so that we can use it during
+ * actual scheduling. That of course will contain all the cpus from all the
+ * set nodes in the original node-affinity mask.
+ *
+ * Note that any serialization needed to access the mask safely is entirely
+ * the responsibility of the caller of this function/hook.
+ */
+static void csched_set_node_affinity(
+    const struct scheduler *ops,
+    struct domain *d,
+    nodemask_t *mask)
+{
+    struct csched_dom *sdom;
+    int node;
+
+    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
+    if ( unlikely(is_idle_domain(d)) )
+        return;
+
+    sdom = CSCHED_DOM(d);
+    cpumask_clear(sdom->node_affinity_cpumask);
+    for_each_node_mask( node, *mask )
+        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
+                   &node_to_cpumask(node));
+}
+
 #define for_each_csched_balance_step(step) \
     for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
 
 
 /*
  * vcpu-affinity balancing is always necessary and must never be skipped.
- * OTOH, if a domain has affinity with all the nodes, we can tell the caller
- * that he can safely skip the node-affinity balancing step.
+ * OTOH, if a domain's node-affinity is automatically computed (or if the
+ * domain has affinity with all the nodes), we can tell the caller that it
+ * can safely skip the node-affinity balancing step.
  */
 #define __vcpu_has_valuable_node_affinity(vc) \
-    ( !cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
+    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
+        || vc->domain->auto_node_affinity == 1) )
 
 static inline int csched_balance_step_skippable(int step, struct vcpu *vc)
 {
@@ -1889,6 +1918,8 @@ const struct scheduler sched_credit_def 
     .adjust         = csched_dom_cntl,
     .adjust_global  = csched_sys_cntl,
 
+    .set_node_affinity  = csched_set_node_affinity,
+
     .pick_cpu       = csched_cpu_pick,
     .do_schedule    = csched_schedule,
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c
     return ret;
 }
 
+void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
+{
+    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
+}
+
 int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
 {
     cpumask_t online_affinity;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
 
 
+/* Get/set the NUMA node(s) with which the guest has affinity. */
+/* XEN_DOMCTL_setnodeaffinity */
+/* XEN_DOMCTL_getnodeaffinity */
+struct xen_domctl_nodeaffinity {
+    struct xenctl_bitmap nodemap;/* IN */
+};
+typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t);
+
+
 /* Get/set which physical cpus a vcpu can execute on. */
 /* XEN_DOMCTL_setvcpuaffinity */
 /* XEN_DOMCTL_getvcpuaffinity */
@@ -907,6 +917,8 @@ struct xen_domctl {
 #define XEN_DOMCTL_audit_p2m                     65
 #define XEN_DOMCTL_set_virq_handler              66
 #define XEN_DOMCTL_set_broken_page_p2m           67
+#define XEN_DOMCTL_setnodeaffinity               68
+#define XEN_DOMCTL_getnodeaffinity               69
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -920,6 +932,7 @@ struct xen_domctl {
         struct xen_domctl_getpageframeinfo  getpageframeinfo;
         struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
         struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
+        struct xen_domctl_nodeaffinity      nodeaffinity;
         struct xen_domctl_vcpuaffinity      vcpuaffinity;
         struct xen_domctl_shadow_op         shadow_op;
         struct xen_domctl_max_mem           max_mem;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -8,8 +8,9 @@
  * See detailed comments in the file linux/bitmap.h describing the
  * data type on which these nodemasks are based.
  *
- * For details of nodemask_scnprintf() and nodemask_parse(),
- * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of nodemask_scnprintf(), nodelist_scnprintf() and
+ * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
+ * in lib/bitmap.c.
  *
  * The available nodemask operations are:
  *
@@ -50,6 +51,7 @@
  * unsigned long *nodes_addr(mask)	Array of unsigned long's in mask
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
+ * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
  * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
  *
  * for_each_node_mask(node, mask)	for-loop node over mask
@@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co
 
 #define nodes_addr(src) ((src).bits)
 
+#define nodelist_scnprintf(buf, len, src) \
+			__nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
+static inline int __nodelist_scnprintf(char *buf, int len,
+					const nodemask_t *srcp, int nbits)
+{
+	return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
 #if 0
 #define nodemask_scnprintf(buf, len, src) \
 			__nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -184,6 +184,8 @@ struct scheduler {
                                     struct xen_domctl_scheduler_op *);
     int          (*adjust_global)  (const struct scheduler *,
                                     struct xen_sysctl_scheduler_op *);
+    void         (*set_node_affinity) (const struct scheduler *,
+                                       struct domain *, nodemask_t *);
     void         (*dump_settings)  (const struct scheduler *);
     void         (*dump_cpu_state) (const struct scheduler *, int);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -359,8 +359,12 @@ struct domain
     /* Various mem_events */
     struct mem_event_per_domain *mem_event;
 
-    /* Currently computed from union of all vcpu cpu-affinity masks. */
+    /*
+     * Can be specified by the user. If that is not the case, it is
+     * computed from the union of all the vcpu cpu-affinity masks.
+     */
     nodemask_t node_affinity;
+    int auto_node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
 };
@@ -429,6 +433,7 @@ static inline void get_knownalive_domain
     ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
 void domain_update_node_affinity(struct domain *d);
 
 struct domain *domain_create(
@@ -549,6 +554,7 @@ void sched_destroy_domain(struct domain 
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
 long sched_adjust_global(struct xen_sysctl_scheduler_op *);
+void sched_set_node_affinity(struct domain *, nodemask_t *);
 int  sched_id(void);
 void sched_tick_suspend(void);
 void sched_tick_resume(void);
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -611,10 +611,10 @@ static int flask_domctl(struct domain *d
         return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__UNPAUSE);
 
     case XEN_DOMCTL_setvcpuaffinity:
-        return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY);
+        return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__SETAFFINITY);
 
     case XEN_DOMCTL_getvcpuaffinity:
-        return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY);
+        return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__GETAFFINITY);
 
     case XEN_DOMCTL_resumedomain:
         return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__RESUME);
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -103,10 +103,10 @@ class domain
     max_vcpus
 # XEN_DOMCTL_destroydomain
     destroy
-# XEN_DOMCTL_setvcpuaffinity
-    setvcpuaffinity
-# XEN_DOMCTL_getvcpuaffinity
-    getvcpuaffinity
+# XEN_DOMCTL_setaffinity
+    setaffinity
+# XEN_DOMCTL_getaffinity
+    getaffinity
 # XEN_DOMCTL_scheduler_op with XEN_DOMCTL_SCHEDOP_getinfo
     getscheduler
 # XEN_DOMCTL_getdomaininfo, XEN_SYSCTL_getdomaininfolist

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 06 of 11 v3] libxc: allow for explicitly specifying node-affinity
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (4 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 05 of 11 v3] xen: allow for explicitly specifying node-affinity Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 11:01 ` [PATCH 07 of 11 v3] libxl: " Dario Faggioli
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

By providing the proper get/set interface and wiring them
to the new domctl-s from the previous commit.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
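
As a usage sketch (illustrative only, not part of the patch; the function
name and calling context are made up, and error handling is minimal), a
toolstack component could set a domain's node-affinity to nodes 0 and 1
like this:

    #include <stdlib.h>
    #include <xenctrl.h>

    int set_affinity_nodes_0_1(xc_interface *xch, uint32_t domid)
    {
        /* Size, in bytes, of a node bitmap for this host */
        int sz = xc_get_nodemap_size(xch);
        xc_nodemap_t nodemap;
        int rc;

        if ( sz <= 0 )
            return -1;
        nodemap = calloc(1, sz);
        if ( nodemap == NULL )
            return -1;

        /* Assuming the usual libxc byte-array bitmap layout */
        nodemap[0] |= (1 << 0) | (1 << 1);  /* nodes 0 and 1 */
        rc = xc_domain_node_setaffinity(xch, domid, nodemap);

        free(nodemap);
        return rc;
    }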

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
 }
 
 
+int xc_domain_node_setaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
+    domctl.domain = (domid_t)domid;
+
+    memcpy(local, nodemap, nodesize);
+    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
+    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
+int xc_domain_node_getaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
+    domctl.domain = (domid_t)domid;
+
+    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
+    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    memcpy(nodemap, local, nodesize);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
 		uint32_t id,
 		uint32_t timeout);
 
+/**
+ * This function explicitly sets the host NUMA nodes the domain will
+ * have affinity with.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id one wants to set the affinity of.
+ * @parm nodemap the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_setaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap);
+
+/**
+ * This function retrieves the host NUMA nodes the domain has
+ * affinity with.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id one wants to get the node affinity of.
+ * @parm nodemap the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_getaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap);
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 07 of 11 v3] libxl: allow for explicitly specifying node-affinity
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (5 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 06 of 11 v3] libxc: " Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-28 12:16   ` George Dunlap
  2013-02-01 11:01 ` [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

By introducing a nodemap in libxl_domain_build_info and
providing the get/set methods to deal with it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
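
A quick sketch of how the new API could be used (illustrative only, not
part of the patch; the function name is made up, and ctx/domid are assumed
to be a valid libxl context and an existing domain):

    #include <libxl.h>
    #include <libxl_utils.h>

    int restrict_to_node0(libxl_ctx *ctx, uint32_t domid)
    {
        libxl_bitmap nodemap;
        int rc;

        libxl_bitmap_init(&nodemap);
        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0))
            return ERROR_FAIL;

        libxl_bitmap_set_none(&nodemap);
        libxl_bitmap_set(&nodemap, 0);   /* node 0 only */

        rc = libxl_domain_set_nodeaffinity(ctx, domid, &nodemap);

        libxl_bitmap_dispose(&nodemap);
        return rc;
    }

For a domain that is yet to be created, the same effect can be obtained by
filling the new nodemap field of libxl_domain_build_info, which
libxl__build_pre() then pushes down via libxl_domain_set_nodeaffinity()
(see below).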

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4161,6 +4161,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
     return rc;
 }
 
+int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap)
+{
+    if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
+int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap)
+{
+    if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -266,6 +266,10 @@
 #endif
 #endif
 
+/* For the 'nodemap' field (of libxl_bitmap type) in
+ * libxl_domain_build_info, containing the node-affinity of the domain. */
+#define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1
+
 /* Functions annotated with LIBXL_EXTERNAL_CALLERS_ONLY may not be
  * called from within libxl itself. Callers outside libxl, who
  * do not #include libxl_internal.h, are fine. */
@@ -861,6 +865,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
                            libxl_bitmap *cpumap);
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
                                unsigned int max_vcpus, libxl_bitmap *cpumap);
+int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap);
+int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap);
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap);
 
 libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault(
 
     libxl_defbool_setdefault(&b_info->numa_placement, true);
 
+    if (!b_info->nodemap.size) {
+        if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0))
+            return ERROR_FAIL;
+        libxl_bitmap_set_any(&b_info->nodemap);
+    }
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint
         if (rc)
             return rc;
     }
+    libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
 
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
     ("cpumap",          libxl_bitmap),
+    ("nodemap",         libxl_bitmap),
     ("numa_placement",  libxl_defbool),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (6 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 07 of 11 v3] libxl: " Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 14:28   ` Juergen Gross
  2013-02-28 14:05   ` George Dunlap
  2013-02-01 11:01 ` [PATCH 09 of 11 v3] libxl: automatic placement deals with node-affinity Dario Faggioli
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

For choosing the best NUMA placement candidate, we need to figure out
how many VCPUs are runnable on each of them. That requires going through
all the VCPUs of all the domains and checking their affinities.

With this change, instead of doing the above for each candidate, we
do it once and for all, populating an array while counting. This way, when
we later evaluate the candidates, all we need to do is sum up the right
elements of the array itself.

This reduces the complexity of the overall algorithm, as it moves a
potentially expensive operation (for_each_vcpu_of_each_domain {})
outside from the core placement loop, so that it is performed only
once instead of (potentially) tens or hundreds of times. More
specifically, we go from a worst case computational time complexity of:

 O(2^n_nodes) * O(n_domains*n_domain_vcpus)

To, with this change:

 O(n_domains*n_domain_vcpus) + O(2^n_nodes) = O(2^n_nodes)

(with n_nodes<=16, otherwise the algorithm suggests partitioning with
cpupools and does not even start.)
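
Just to give a rough, purely illustrative idea of the numbers (not taken
from an actual run): on a 16 node host with 100 domains of 10 vCPUs each,
the old approach could mean up to 2^16 * 1000 ~= 65 million vcpu-affinity
checks in the worst case, while with this change the ~1000 checks are done
only once, after which each of the (at most) 2^16 candidates costs no more
than 16 additions, i.e., about 1 million trivial operations overall.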

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v2:
 * in nr_vcpus_on_nodes(), vcpu_nodemap renamed to something more
   sensible, as suggested during review.

diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -165,22 +165,34 @@ static uint32_t nodemap_to_free_memkb(li
     return free_memkb;
 }
 
-/* Retrieve the number of vcpus able to run on the cpus of the nodes
- * that are part of the nodemap. */
-static int nodemap_to_nr_vcpus(libxl__gc *gc, libxl_cputopology *tinfo,
+/* Retrieve the number of vcpus able to run on the nodes in nodemap */
+static int nodemap_to_nr_vcpus(libxl__gc *gc, int vcpus_on_node[],
                                const libxl_bitmap *nodemap)
 {
+    int i, nr_vcpus = 0;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        nr_vcpus += vcpus_on_node[i];
+
+    return nr_vcpus;
+}
+
+/* Number of vcpus able to run on the cpus of the various nodes
+ * (reported by filling the array vcpus_on_node[]). */
+static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
+                             const libxl_bitmap *suitable_cpumap,
+                             int vcpus_on_node[])
+{
     libxl_dominfo *dinfo = NULL;
-    libxl_bitmap vcpu_nodemap;
+    libxl_bitmap nodes_counted;
     int nr_doms, nr_cpus;
-    int nr_vcpus = 0;
     int i, j, k;
 
     dinfo = libxl_list_domain(CTX, &nr_doms);
     if (dinfo == NULL)
         return ERROR_FAIL;
 
-    if (libxl_node_bitmap_alloc(CTX, &vcpu_nodemap, 0) < 0) {
+    if (libxl_node_bitmap_alloc(CTX, &nodes_counted, 0) < 0) {
         libxl_dominfo_list_free(dinfo, nr_doms);
         return ERROR_FAIL;
     }
@@ -193,19 +205,17 @@ static int nodemap_to_nr_vcpus(libxl__gc
         if (vinfo == NULL)
             continue;
 
-        /* For each vcpu of each domain ... */
         for (j = 0; j < nr_dom_vcpus; j++) {
+            /* For each vcpu of each domain, increment the elements of
+             * the array corresponding to the nodes where the vcpu runs */
+            libxl_bitmap_set_none(&nodes_counted);
+            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
+                int node = tinfo[k].node;
 
-            /* Build up a map telling on which nodes the vcpu is runnable on */
-            libxl_bitmap_set_none(&vcpu_nodemap);
-            libxl_for_each_set_bit(k, vinfo[j].cpumap)
-                libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node);
-
-            /* And check if that map has any intersection with our nodemap */
-            libxl_for_each_set_bit(k, vcpu_nodemap) {
-                if (libxl_bitmap_test(nodemap, k)) {
-                    nr_vcpus++;
-                    break;
+                if (libxl_bitmap_test(suitable_cpumap, k) &&
+                    !libxl_bitmap_test(&nodes_counted, node)) {
+                    libxl_bitmap_set(&nodes_counted, node);
+                    vcpus_on_node[node]++;
                 }
             }
         }
@@ -213,9 +223,9 @@ static int nodemap_to_nr_vcpus(libxl__gc
         libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
     }
 
-    libxl_bitmap_dispose(&vcpu_nodemap);
+    libxl_bitmap_dispose(&nodes_counted);
     libxl_dominfo_list_free(dinfo, nr_doms);
-    return nr_vcpus;
+    return 0;
 }
 
 /*
@@ -270,7 +280,7 @@ int libxl__get_numa_candidate(libxl__gc 
     libxl_numainfo *ninfo = NULL;
     int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0;
     libxl_bitmap suitable_nodemap, nodemap;
-    int rc = 0;
+    int *vcpus_on_node, rc = 0;
 
     libxl_bitmap_init(&nodemap);
     libxl_bitmap_init(&suitable_nodemap);
@@ -281,6 +291,8 @@ int libxl__get_numa_candidate(libxl__gc 
     if (ninfo == NULL)
         return ERROR_FAIL;
 
+    GCNEW_ARRAY(vcpus_on_node, nr_nodes);
+
     /*
      * The good thing about this solution is that it is based on heuristics
      * (implemented in numa_cmpf() ), but we at least can evaluate it on
@@ -330,6 +342,19 @@ int libxl__get_numa_candidate(libxl__gc 
         goto out;
 
     /*
+     * Later on, we will try to figure out how many vcpus are runnable on
+     * each candidate (as a part of choosing the best one of them). That
+     * requires going through all the vcpus of all the domains and checking
+     * their affinities. So, instead of doing that for each candidate,
+     * let's count here the number of vcpus runnable on each node, so that
+     * all we have to do later is summing up the right elements of the
+     * vcpus_on_node array.
+     */
+    rc = nr_vcpus_on_nodes(gc, tinfo, suitable_cpumap, vcpus_on_node);
+    if (rc)
+        goto out;
+
+    /*
      * If the minimum number of NUMA nodes is not explicitly specified
      * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
      * from where to start generating candidates, if possible (or just start
@@ -414,7 +439,8 @@ int libxl__get_numa_candidate(libxl__gc 
              * current best one (if any).
              */
             libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
-            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, vcpus_on_node,
+                                                    &nodemap);
             new_cndt.free_memkb = nodes_free_memkb;
             new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
             new_cndt.nr_cpus = nodes_cpus;

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 09 of 11 v3] libxl: automatic placement deals with node-affinity
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (7 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 11:01 ` [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list` Dario Faggioli
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Which basically means the following two things:
 1) during domain creation, it is the node-affinity of
    the domain --rather than the vcpu-affinities of its
    VCPUs-- that is affected by automatic placement;
 2) during automatic placement, when counting how many
    VCPUs are already "bound" to a placement candidate
    (as part of the process of choosing the best
    candidate), both vcpu-affinity and node-affinity
    are considered (an example follows the list).
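
For instance, a vCPU that is pinned to pCPUs of node 0 only, but belongs
to a domain whose node-affinity is just node 1, used to be counted as
"bound" to node 0; with this change it is not counted for any node, which
better matches where the scheduler will actually try to run it.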

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc *
 {
     int found;
     libxl__numa_candidate candidate;
-    libxl_bitmap candidate_nodemap;
+    libxl_bitmap cpupool_nodemap;
     libxl_cpupoolinfo cpupool_info;
     int i, cpupool, rc = 0;
     uint32_t memkb;
 
     libxl__numa_candidate_init(&candidate);
-    libxl_bitmap_init(&candidate_nodemap);
+    libxl_bitmap_init(&cpupool_nodemap);
 
     /*
      * Extract the cpumap from the cpupool the domain belong to. In fact,
@@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc *
     rc = libxl_domain_need_memory(CTX, info, &memkb);
     if (rc)
         goto out;
-    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
+    if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) {
         rc = ERROR_FAIL;
         goto out;
     }
@@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc *
     if (found == 0)
         goto out;
 
-    /* Map the candidate's node map to the domain's info->cpumap */
-    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
-    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
+    /* Map the candidate's node map to the domain's info->nodemap */
+    libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap);
+
+    /* Avoid trying to set the affinity to nodes that might be in the
+     * candidate's nodemap but out of our cpupool. */
+    rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap,
+                                 &cpupool_nodemap);
     if (rc)
         goto out;
 
-    /* Avoid trying to set the affinity to cpus that might be in the
-     * nodemap but not in our cpupool. */
-    libxl_for_each_set_bit(i, info->cpumap) {
-        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
-            libxl_bitmap_reset(&info->cpumap, i);
+    libxl_for_each_set_bit(i, info->nodemap) {
+        if (!libxl_bitmap_test(&cpupool_nodemap, i))
+            libxl_bitmap_reset(&info->nodemap, i);
     }
 
     LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
@@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc *
 
  out:
     libxl__numa_candidate_dispose(&candidate);
-    libxl_bitmap_dispose(&candidate_nodemap);
+    libxl_bitmap_dispose(&cpupool_nodemap);
     libxl_cpupoolinfo_dispose(&cpupool_info);
     return rc;
 }
@@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint
     /*
      * Check if the domain has any CPU affinity. If not, try to build
      * up one. In case numa_place_domain() find at least a suitable
-     * candidate, it will affect info->cpumap accordingly; if it
+     * candidate, it will affect info->nodemap accordingly; if it
      * does not, it just leaves it as it is. This means (unless
      * some weird error manifests) the subsequent call to
-     * libxl_set_vcpuaffinity_all() will do the actual placement,
+     * libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
      */
     if (libxl_defbool_val(info->numa_placement)) {
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
                              int vcpus_on_node[])
 {
     libxl_dominfo *dinfo = NULL;
-    libxl_bitmap nodes_counted;
+    libxl_bitmap dom_nodemap, nodes_counted;
     int nr_doms, nr_cpus;
     int i, j, k;
 
@@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc *
         return ERROR_FAIL;
     }
 
+    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
+        libxl_bitmap_dispose(&nodes_counted);
+        libxl_dominfo_list_free(dinfo, nr_doms);
+        return ERROR_FAIL;
+    }
+
     for (i = 0; i < nr_doms; i++) {
         libxl_vcpuinfo *vinfo;
         int nr_dom_vcpus;
@@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc *
         if (vinfo == NULL)
             continue;
 
+        /* Retrieve the domain's node-affinity map */
+        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
+
         for (j = 0; j < nr_dom_vcpus; j++) {
-            /* For each vcpu of each domain, increment the elements of
-             * the array corresponding to the nodes where the vcpu runs */
+            /*
+             * For each vcpu of each domain, it must have both vcpu-affinity
+             * and node-affinity to (a pcpu belonging to) a certain node to
+             * cause an increment in the corresponding element of the array.
+             */
             libxl_bitmap_set_none(&nodes_counted);
             libxl_for_each_set_bit(k, vinfo[j].cpumap) {
                 int node = tinfo[k].node;
 
                 if (libxl_bitmap_test(suitable_cpumap, k) &&
+                    libxl_bitmap_test(&dom_nodemap, node) &&
                     !libxl_bitmap_test(&nodes_counted, node)) {
                     libxl_bitmap_set(&nodes_counted, node);
                     vcpus_on_node[node]++;
@@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
         libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
     }
 
+    libxl_bitmap_dispose(&dom_nodemap);
     libxl_bitmap_dispose(&nodes_counted);
     libxl_dominfo_list_free(dinfo, nr_doms);
     return 0;

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list`
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (8 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 09 of 11 v3] libxl: automatic placement deals with node-affinity Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-03-12 16:04   ` Ian Campbell
  2013-02-01 11:01 ` [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation Dario Faggioli
  2013-02-18 17:13 ` [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
  11 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Node-affinity is now something that is under (some) control of the
user, so show it upon request as part of the output of `xl list'
by the `-n' option.

Re the patch, the print_bitmap() related hunk is _mostly_ code motion,
although there is a very minor change in the code, basically to allow
using the function for printing both cpu and node bitmaps (as, in case
all bits are set, it used to print "any cpu", which doesn't fit the
nodemap case).

This is how the output of `xl list', with various combination of
options, will look like after this change:

 http://pastebin.com/68kUuwyE

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
---
Changes from v1:
 * print_{cpu,node}map() functions added instead of 'state variable'-izing
   print_bitmap().

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -3014,14 +3014,95 @@ out:
     }
 }
 
-static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain)
+/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
+static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
+{
+    int i;
+    uint8_t pmap = 0, bitmask = 0;
+    int firstset = 0, state = 0;
+
+    for (i = 0; i < maplen; i++) {
+        if (i % 8 == 0) {
+            pmap = *map++;
+            bitmask = 1;
+        } else bitmask <<= 1;
+
+        switch (state) {
+        case 0:
+        case 2:
+            if ((pmap & bitmask) != 0) {
+                firstset = i;
+                state++;
+            }
+            continue;
+        case 1:
+        case 3:
+            if ((pmap & bitmask) == 0) {
+                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
+                if (i - 1 > firstset)
+                    fprintf(stream, "-%d", i - 1);
+                state = 2;
+            }
+            continue;
+        }
+    }
+    switch (state) {
+        case 0:
+            fprintf(stream, "none");
+            break;
+        case 2:
+            break;
+        case 1:
+            if (firstset == 0)
+                return 1;
+        case 3:
+            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
+            if (i - 1 > firstset)
+                fprintf(stream, "-%d", i - 1);
+            break;
+    }
+
+    return 0;
+}
+
+static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
+{
+    if (print_bitmap(map, maplen, stream))
+        fprintf(stream, "any cpu");
+}
+
+static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
+{
+    if (print_bitmap(map, maplen, stream))
+        fprintf(stream, "any node");
+}
+
+static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain)
 {
     int i;
     static const char shutdown_reason_letters[]= "-rscw";
+    libxl_bitmap nodemap;
+    libxl_physinfo physinfo;
+
+    libxl_bitmap_init(&nodemap);
+    libxl_physinfo_init(&physinfo);
 
     printf("Name                                        ID   Mem VCPUs\tState\tTime(s)");
     if (verbose) printf("   UUID                            Reason-Code\tSecurity Label");
     if (context && !verbose) printf("   Security Label");
+    if (numa) {
+        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) {
+            fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
+            exit(1);
+        }
+        if (libxl_get_physinfo(ctx, &physinfo) != 0) {
+            fprintf(stderr, "libxl_get_physinfo failed.\n");
+            libxl_bitmap_dispose(&nodemap);
+            exit(1);
+        }
+
+        printf(" NODE Affinity");
+    }
     printf("\n");
     for (i = 0; i < nb_domain; i++) {
         char *domname;
@@ -3055,14 +3136,23 @@ static void list_domains(int verbose, in
             rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf,
                                             &size);
             if (rc < 0)
-                printf("  -");
+                printf("                -");
             else {
-                printf("  %s", buf);
+                printf(" %16s", buf);
                 free(buf);
             }
         }
+        if (numa) {
+            libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
+
+            putchar(' ');
+            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
+        }
         putchar('\n');
     }
+
+    libxl_bitmap_dispose(&nodemap);
+    libxl_physinfo_dispose(&physinfo);
 }
 
 static void list_vm(void)
@@ -3922,10 +4012,12 @@ int main_list(int argc, char **argv)
     int opt, verbose = 0;
     int context = 0;
     int details = 0;
+    int numa = 0;
     static struct option opts[] = {
         {"long", 0, 0, 'l'},
         {"verbose", 0, 0, 'v'},
         {"context", 0, 0, 'Z'},
+        {"numa", 0, 0, 'n'},
         COMMON_LONG_OPTS,
         {0, 0, 0, 0}
     };
@@ -3934,7 +4026,7 @@ int main_list(int argc, char **argv)
     libxl_dominfo *info, *info_free=0;
     int nb_domain, rc;
 
-    SWITCH_FOREACH_OPT(opt, "lvhZ", opts, "list", 0) {
+    SWITCH_FOREACH_OPT(opt, "lvhZn", opts, "list", 0) {
     case 'l':
         details = 1;
         break;
@@ -3944,6 +4036,9 @@ int main_list(int argc, char **argv)
     case 'Z':
         context = 1;
         break;
+    case 'n':
+        numa = 1;
+        break;
     }
 
     if (optind >= argc) {
@@ -3975,7 +4070,7 @@ int main_list(int argc, char **argv)
     if (details)
         list_domains_details(info, nb_domain);
     else
-        list_domains(verbose, context, info, nb_domain);
+        list_domains(verbose, context, numa, info, nb_domain);
 
     if (info_free)
         libxl_dominfo_list_free(info, nb_domain);
@@ -4224,56 +4319,6 @@ int main_button_press(int argc, char **a
     return 0;
 }
 
-static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
-{
-    int i;
-    uint8_t pmap = 0, bitmask = 0;
-    int firstset = 0, state = 0;
-
-    for (i = 0; i < maplen; i++) {
-        if (i % 8 == 0) {
-            pmap = *map++;
-            bitmask = 1;
-        } else bitmask <<= 1;
-
-        switch (state) {
-        case 0:
-        case 2:
-            if ((pmap & bitmask) != 0) {
-                firstset = i;
-                state++;
-            }
-            continue;
-        case 1:
-        case 3:
-            if ((pmap & bitmask) == 0) {
-                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
-                if (i - 1 > firstset)
-                    fprintf(stream, "-%d", i - 1);
-                state = 2;
-            }
-            continue;
-        }
-    }
-    switch (state) {
-        case 0:
-            fprintf(stream, "none");
-            break;
-        case 2:
-            break;
-        case 1:
-            if (firstset == 0) {
-                fprintf(stream, "any cpu");
-                break;
-            }
-        case 3:
-            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
-            if (i - 1 > firstset)
-                fprintf(stream, "-%d", i - 1);
-            break;
-    }
-}
-
 static void print_vcpuinfo(uint32_t tdomid,
                            const libxl_vcpuinfo *vcpuinfo,
                            uint32_t nr_cpus)
@@ -4297,7 +4342,7 @@ static void print_vcpuinfo(uint32_t tdom
     /*      TIM */
     printf("%9.1f  ", ((float)vcpuinfo->vcpu_time / 1e9));
     /* CPU AFFINITY */
-    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
+    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
     printf("\n");
 }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = {
       "[options] [Domain]\n",
       "-l, --long              Output all VM details\n"
       "-v, --verbose           Prints out UUIDs and security context\n"
-      "-Z, --context           Prints out security context"
+      "-Z, --context           Prints out security context\n"
+      "-n, --numa              Prints out NUMA node affinity"
     },
     { "destroy",
       &main_destroy, 0, 1,

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (9 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list` Dario Faggioli
@ 2013-02-01 11:01 ` Dario Faggioli
  2013-02-01 13:41   ` Juergen Gross
  2013-02-28 14:11   ` George Dunlap
  2013-02-18 17:13 ` [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
  11 siblings, 2 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-01 11:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

To include the new concept of NUMA aware scheduling and
describe its impact.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -14,22 +14,67 @@ the memory directly attached to the set 
 
 The Xen hypervisor deals with NUMA machines by assigning to each domain
 a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated.
+get their memory allocated. Also, even if the node affinity of a domain
+is allowed to change on-line, it is very important to "place" the domain
+correctly when it is first created, as most of its memory is allocated
+at that time and cannot (for now) be moved easily.
 
 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
 performance degradation is likely to be noticeable.
 
-## Guest Placement in xl ##
+For more information, have a look at the [Xen NUMA Introduction][numa_intro]
+page on the Wiki.
+
+### Placing via pinning and cpupools ###
+
+The simplest way of placing a domain on a NUMA node is statically pinning
+the domain's vCPUs to the pCPUs of the node. This goes under the name of
+CPU affinity and can be set through the "cpus=" option in the config file
+(more about this below). Another option is to pool together the pCPUs
+spanning the node and put the domain in such a cpupool with the "pool="
+config option (as documented in our [Wiki][cpupools_howto]).
+
+In both the above cases, the domain will not be able to execute outside
+the specified set of pCPUs for any reason, even if all those pCPUs are
+busy doing something else while other pCPUs are idle.
+
+So, when doing this, local memory accesses are 100% guaranteed, but that
+may come at the cost of some load imbalances.
+
+### NUMA aware scheduling ###
+
+If the credit scheduler is in use, the concept of node affinity defined
+above does not only apply to memory. In fact, starting from Xen 4.3, the
+scheduler always tries to run the domain's vCPUs on one of the nodes in
+its node affinity. Only if that turns out to be impossible will it just
+pick any free pCPU.
+
+This is, therefore, something more flexible than CPU affinity, as a domain
+can still run everywhere; it just prefers some nodes over others.
+Locality of access is less guaranteed than in the pinning case, but that
+comes along with better chances to exploit all the host resources (e.g.,
+the pCPUs).
+
+In fact, if all the pCPUs in a domain's node affinity are busy, it is
+possible for the domain to run outside of it, but it is very likely that
+slower execution (due to remote memory accesses) is still better than no
+execution at all, as would happen with pinning. For this reason, NUMA
+aware scheduling has the potential of bringing substantial performance
+benefits, although this will depend on the workload.
+
+## Guest placement in xl ##
 
 If using xl for creating and managing guests, it is very easy to ask for
 both manual or automatic placement of them across the host's NUMA nodes.
 
-Note that xm/xend does the very same thing, the only differences residing
-in the details of the heuristics adopted for the placement (see below).
+Note that xm/xend does a very similar thing, the only differences being
+the details of the heuristics adopted for automatic placement (see below),
+and the lack of support (in both xm/xend and the Xen versions where that
+was the default toolstack) for NUMA aware scheduling.
 
-### Manual Guest Placement with xl ###
+### Placing the guest manually ###
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
@@ -41,14 +86,19 @@ This is very simple and effective, but r
 administrator to explicitly specify affinities for each and every domain,
 or Xen won't be able to guarantee the locality for their memory accesses.
 
-It is also possible to deal with NUMA by partitioning the system using
-cpupools. Again, this could be "The Right Answer" for many needs and
-occasions, but has to be carefully considered and setup by hand.
+Notice that this also pins the domain's vCPUs to the specified set of
+pCPUs, so it not only sets the domain's node affinity (its memory will
+come from the nodes to which the pCPUs belong), but at the same time
+forces the vCPUs of the domain to be scheduled on those same pCPUs.
 
-### Automatic Guest Placement with xl ###
+### Placing the guest automatically ###
 
 If no "cpus=" option is specified in the config file, libxl tries
 to figure out on its own on which node(s) the domain could fit best.
+If it finds one (or more), the domain's node affinity gets set accordingly,
+and both memory allocations and NUMA aware scheduling (for the credit
+scheduler and starting from Xen 4.3) will comply with it.
+
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
 the various VMs with different memory sizes are the items to be packed,
@@ -81,7 +131,7 @@ largest amounts of free memory helps kee
 small, and maximizes the probability of being able to put more domains
 there.
 
-## Guest Placement within libxl ##
+## Guest placement in libxl ##
 
 xl achieves automatic NUMA placement because that is what libxl does
 by default. No API is provided (yet) for modifying the behaviour of
@@ -93,15 +143,34 @@ any placement from happening:
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
 Also, if `numa_placement` is set to `true`, the domain must not
-have any cpu affinity (i.e., `domain_build_info->cpumap` must
+have any CPU affinity (i.e., `domain_build_info->cpumap` must
 have all its bits set, as it is by default), or domain creation
 will fail returning `ERROR_INVAL`.
 
+Starting from Xen 4.3, in case automatic placement happens (and is
+successful), it will affect the domain's node affinity and _not_ its
+CPU affinity. Namely, the domain's vCPUs will not be pinned to any
+pCPU on the host, but the domain's memory will come from the selected
+node(s) and NUMA aware scheduling (if the credit scheduler is in use)
+will try to keep the domain there as much as possible.
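+
+As a minimal sketch (reusing the `domain_build_info` pointer from the
+snippet above, and assuming `libxl_bitmap_set_any()` as the helper for
+filling the CPU map), the default behaviour described here corresponds to:
+
+    /* Sketch: automatic placement on, no vCPU pinning (node affinity only).
+     * libxl_bitmap_set_any() is assumed here as the bitmap-filling helper. */
+    libxl_defbool_set(&domain_build_info->numa_placement, true);
+    libxl_bitmap_set_any(&domain_build_info->cpumap);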
+
 Besides that, for looking at and/or tweaking the placement algorithm,
 search "Automatic NUMA placement" in libxl\_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
+## Xen < 4.3 ##
+
+As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
+bit different for earlier versions of Xen. If no "cpus=" option is specified
+and Xen 4.2 is in use, the automatic placement algorithm still runs, but
+the result is used to _pin_ the vCPUs of the domain to the output node(s).
+This is consistent with what happens with xm/xend, which also affects the
+domain's CPU affinity.
+
+On versions of Xen earlier than 4.2, there is no automatic placement at
+all in xl or libxl, and hence neither node nor CPU affinity is affected.
+
 ## Limitations ##
 
 Analyzing various possible placement solutions is what makes the
@@ -109,3 +178,6 @@ algorithm flexible and quite effective. 
 it won't scale well to systems with an arbitrary number of nodes.
 For this reason, automatic placement is disabled (with a warning)
 if it is requested on a host with more than 16 NUMA nodes.
+
+[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
+[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation
  2013-02-01 11:01 ` [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation Dario Faggioli
@ 2013-02-01 13:41   ` Juergen Gross
  2013-02-28 14:11   ` George Dunlap
  1 sibling, 0 replies; 39+ messages in thread
From: Juergen Gross @ 2013-02-01 13:41 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 01.02.2013 12:01, Dario Faggioli wrote:
> To include the new concept of NUMA aware scheduling and
> describe its impact.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -14,22 +14,67 @@ the memory directly attached to the set
>
>   The Xen hypervisor deals with NUMA machines by assigning to each domain
>   a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated.
> +get their memory allocated. Also, even if the node affinity of a domain
> +is allowed to change on-line, it is very important to "place" the domain
> +correctly when it is fist created, as the most of its memory is allocated
> +at that time and can not (for now) be moved easily.
>
>   NUMA awareness becomes very important as soon as many domains start
>   running memory-intensive workloads on a shared host. In fact, the cost
>   of accessing non node-local memory locations is very high, and the
>   performance degradation is likely to be noticeable.
>
> -## Guest Placement in xl ##
> +For more information, have a look at the [Xen NUMA Introduction][numa_intro]
> +page on the Wiki.
> +
> +### Placing via pinning and cpupools ###
> +
> +The simplest way of placing a domain on a NUMA node is statically pinning
> +the domain's vCPUs to the pCPUs of the node. This goes under the name of
> +CPU affinity and can be set through the "cpus=" option in the config file
> +(more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a cpupool with the "pool="
> +config option (as documented in our [Wiki][cpupools_howto]).
> +
> +In both the above cases, the domain will not be able to execute outside
> +the specified set of pCPUs for any reasons, even if all those pCPUs are
> +busy doing something else while there are others, idle, pCPUs.
> +
> +So, when doing this, local memory accesses are 100% guaranteed, but that
> +may come at he cost of some load imbalances.
> +
> +### NUMA aware scheduling ###
> +
> +If the credit scheduler is in use, the concept of node affinity defined
> +above does not only apply to memory. In fact, starting from Xen 4.3, the
> +scheduler always tries to run the domain's vCPUs on one of the nodes in
> +its node affinity. Only if that turns out to be impossible, it will just
> +pick any free pCPU.
> +
> +This is, therefore, something more flexible than CPU affinity, as a domain
> +can still run everywhere, it just prefers some nodes rather than others.
> +Locality of access is less guaranteed than in the pinning case, but that
> +comes along with better chances to exploit all the host resources (e.g.,
> +the pCPUs).
> +
> +In fact, if all the pCPUs in a domain's node affinity are busy, it is
> +possible for the domain to run outside of there, but it is very likely that
> +slower execution (due to remote memory accesses) is still better than no
> +execution at all, as it would happen with pinning. For this reason, NUMA
> +aware scheduling has the potential of bringing substantial performances
> +benefits, although this will depend on the workload.
> +
> +## Guest placement in xl ##
>
>   If using xl for creating and managing guests, it is very easy to ask for
>   both manual or automatic placement of them across the host's NUMA nodes.
>
> -Note that xm/xend does the very same thing, the only differences residing
> -in the details of the heuristics adopted for the placement (see below).
> +Note that xm/xend does a very similar thing, the only differences being
> +the details of the heuristics adopted for automatic placement (see below),
> +and the lack of support (in both xm/xend and the Xen versions where that\
> +was the default toolstack) for NUMA aware scheduling.
>
> -### Manual Guest Placement with xl ###
> +### Placing the guest manually ###
>
>   Thanks to the "cpus=" option, it is possible to specify where a domain
>   should be created and scheduled on, directly in its config file. This
> @@ -41,14 +86,19 @@ This is very simple and effective, but r
>   administrator to explicitly specify affinities for each and every domain,
>   or Xen won't be able to guarantee the locality for their memory accesses.
>
> -It is also possible to deal with NUMA by partitioning the system using
> -cpupools. Again, this could be "The Right Answer" for many needs and
> -occasions, but has to be carefully considered and setup by hand.
> +Notice that this also pins the domain's vCPUs to the specified set of
> +pCPUs, so it not only sets the domain's node affinity (its memory will
> +come from the nodes to which the pCPUs belong), but at the same time
> +forces the vCPUs of the domain to be scheduled on those same pCPUs.
>
> -### Automatic Guest Placement with xl ###
> +### Placing the guest automatically ###
>
>   If no "cpus=" option is specified in the config file, libxl tries
>   to figure out on its own on which node(s) the domain could fit best.
> +If it finds one (some), the domain's node affinity get set to there,
> +and both memory allocations and NUMA aware scheduling (for the credit
> +scheduler and starting from Xen 4.3) will comply with it.
> +
>   It is worthwhile noting that optimally fitting a set of VMs on the NUMA
>   nodes of an host is an incarnation of the Bin Packing Problem. In fact,
>   the various VMs with different memory sizes are the items to be packed,
> @@ -81,7 +131,7 @@ largest amounts of free memory helps kee
>   small, and maximizes the probability of being able to put more domains
>   there.
>
> -## Guest Placement within libxl ##
> +## Guest placement in libxl ##
>
>   xl achieves automatic NUMA placement because that is what libxl does
>   by default. No API is provided (yet) for modifying the behaviour of
> @@ -93,15 +143,34 @@ any placement from happening:
>       libxl_defbool_set(&domain_build_info->numa_placement, false);
>
>   Also, if `numa_placement` is set to `true`, the domain must not
> -have any cpu affinity (i.e., `domain_build_info->cpumap` must
> +have any CPU affinity (i.e., `domain_build_info->cpumap` must
>   have all its bits set, as it is by default), or domain creation
>   will fail returning `ERROR_INVAL`.
>
> +Starting from Xen 4.3, in case automatic placement happens (and is
> +successful), it will affect the domain's node affinity and _not_ its
> +CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +pCPU on the host, but the memory from the domain will come from the
> +selected node(s) and the NUMA aware scheduling (if the credit scheduler
> +is in use) will try to keep the domain there as much as possible.
> +
>   Besides than that, looking and/or tweaking the placement algorithm
>   search "Automatic NUMA placement" in libxl\_internal.h.
>
>   Note this may change in future versions of Xen/libxl.
>
> +## Xen<  4.3 ##
> +
> +As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> +bit different for earlier version of Xen. If no "cpus=" option is specified
> +and Xen 4.2 is in use, the automatic placement algorithm still runs, but
> +the results is used to _pin_ the vCPUs of the domain to the output node(s).
> +This is consistent with what was happening with xm/xend, which were also
> +affecting the domain's CPU affinity.
> +
> +On a version of Xen earlier than 4.2, there is not automatic placement at
> +all in xl or libxl, and hence no node or CPU affinity being affected.
> +
>   ## Limitations ##
>
>   Analyzing various possible placement solutions is what makes the
> @@ -109,3 +178,6 @@ algorithm flexible and quite effective.
>   it won't scale well to systems with arbitrary number of nodes.
>   For this reason, automatic placement is disabled (with a warning)
>   if it is requested on a host with more than 16 NUMA nodes.
> +
> +[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
> +[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any
  2013-02-01 11:01 ` [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any Dario Faggioli
@ 2013-02-01 13:57   ` Juergen Gross
  2013-02-07 17:50   ` George Dunlap
  1 sibling, 0 replies; 39+ messages in thread
From: Juergen Gross @ 2013-02-01 13:57 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 01.02.2013 12:01, Dario Faggioli wrote:
> The pcpu picking algorithm treats two threads of a SMT core the same.
> More specifically, if one is idle and the other one is busy, they both
> will be assigned a weight of 1. Therefore, when picking begins, if the
> first target pcpu is the busy thread (and if there are no other idle
> pcpu than its sibling), that will never change.
>
> This change fixes this by ensuring that, before entering the core of
> the picking algorithm, the target pcpu is an idle one (if there is an
> idle pcpu at all, of course).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -526,6 +526,18 @@ static int
>       if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
>           cpumask_set_cpu(cpu,&idlers);
>       cpumask_and(&cpus,&cpus,&idlers);
> +
> +    /*
> +     * It is important that cpu points to an idle processor, if a suitable
> +     * one exists (and we can use cpus to check and, possibly, choose a new
> +     * CPU, as we just&&-ed it with idlers). In fact, if we are on SMT, and
> +     * cpu points to a busy thread with an idle sibling, both the threads
> +     * will be considered the same, from the "idleness" calculation point
> +     * of view", preventing vcpu from being moved to the thread that is
> +     * actually idle.
> +     */
> +    if ( !cpumask_empty(&cpus)&&  !cpumask_test_cpu(cpu,&cpus) )
> +        cpu = cpumask_cycle(cpu,&cpus);
>       cpumask_clear_cpu(cpu,&cpus);
>
>       while ( !cpumask_empty(&cpus) )
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 05 of 11 v3] xen: allow for explicitly specifying node-affinity
  2013-02-01 11:01 ` [PATCH 05 of 11 v3] xen: allow for explicitly specifying node-affinity Dario Faggioli
@ 2013-02-01 14:20   ` Juergen Gross
  0 siblings, 0 replies; 39+ messages in thread
From: Juergen Gross @ 2013-02-01 14:20 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 01.02.2013 12:01, Dario Faggioli wrote:
> Make it possible to pass the node-affinity of a domain to the hypervisor
> from the upper layers, instead of always being computed automatically.
>
> Note that this also required generalizing the Flask hooks for setting
> and getting the affinity, so that they now deal with both vcpu and
> node affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Except one minor note (see below):

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

> ---
> Changes from v2:
>   * reworked as needed after the merge of Daniel's IS_PRIV work;
>   * xen.te taken care of, as requested during review.
>
> Changes from v1:
>   * added the missing dummy hook for nodeaffinity;
>   * let the permission renaming affect flask policies too.
>
> diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls
> --- a/tools/flask/policy/policy/mls
> +++ b/tools/flask/policy/policy/mls
> @@ -70,11 +70,11 @@ mlsconstrain domain transition
>   	(( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv)));
>
>   # all the domain "read" ops
> -mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
> +mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
>   	((l1 dom l2) or (t1 == mls_priv));
>
>   # all the domain "write" ops
> -mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
> +mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
>   	((l1 eq l2) or (t1 == mls_priv));
>
>   # This is incomplete - similar constraints must be written for all classes
> diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if
> --- a/tools/flask/policy/policy/modules/xen/xen.if
> +++ b/tools/flask/policy/policy/modules/xen/xen.if
> @@ -48,7 +48,7 @@ define(`create_domain_common', `
>   	allow $1 $2:domain { create max_vcpus setdomainmaxmem setaddrsize
>   			getdomaininfo hypercall setvcpucontext setextvcpucontext
>   			getscheduler getvcpuinfo getvcpuextstate getaddrsize
> -			getvcpuaffinity setvcpuaffinity };
> +			getaffinity setaffinity };
>   	allow $1 $2:domain2 { set_cpuid settsc setscheduler };
>   	allow $1 $2:security check_context;
>   	allow $1 $2:shadow enable;
> @@ -77,9 +77,9 @@ define(`create_domain_build_label', `
>   # manage_domain(priv, target)
>   #   Allow managing a running domain
>   define(`manage_domain', `
> -	allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity
> +	allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
>   			getaddrsize pause unpause trigger shutdown destroy
> -			setvcpuaffinity setdomainmaxmem getscheduler };
> +			setaffinity setdomainmaxmem getscheduler };
>   ')
>
>   # migrate_domain_out(priv, target)
> diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
> --- a/tools/flask/policy/policy/modules/xen/xen.te
> +++ b/tools/flask/policy/policy/modules/xen/xen.te
> @@ -62,7 +62,7 @@ allow dom0_t security_t:security { check
>   	compute_member load_policy compute_relabel compute_user setenforce
>   	setbool setsecparam add_ocontext del_ocontext };
>
> -allow dom0_t dom0_t:domain { getdomaininfo getvcpuinfo getvcpuaffinity };
> +allow dom0_t dom0_t:domain { getdomaininfo getvcpuinfo getaffinity };
>   allow dom0_t dom0_t:resource { add remove };
>
>   admin_device(dom0_t, device_t)
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -222,6 +222,7 @@ struct domain *domain_create(
>
>       spin_lock_init(&d->node_affinity_lock);
>       d->node_affinity = NODE_MASK_ALL;
> +    d->auto_node_affinity = 1;
>
>       spin_lock_init(&d->shutdown_lock);
>       d->shutdown_code = -1;
> @@ -362,11 +363,26 @@ void domain_update_node_affinity(struct
>           cpumask_or(cpumask, cpumask, online_affinity);
>       }
>
> -    for_each_online_node ( node )
> -        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -            node_set(node, nodemask);
> +    if ( d->auto_node_affinity )
> +    {
> +        /* Node-affinity is automaically computed from all vcpu-affinities */
> +        for_each_online_node ( node )
> +            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> +                node_set(node, nodemask);
>
> -    d->node_affinity = nodemask;
> +        d->node_affinity = nodemask;
> +    }
> +    else
> +    {
> +        /* Node-affinity is provided by someone else, just filter out cpus
> +         * that are either offline or not in the affinity of any vcpus. */
> +        for_each_node_mask ( node, d->node_affinity )
> +            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
> +                node_clear(node, d->node_affinity);
> +    }
> +
> +    sched_set_node_affinity(d,&d->node_affinity);
> +
>       spin_unlock(&d->node_affinity_lock);
>
>       free_cpumask_var(online_affinity);
> @@ -374,6 +390,36 @@ void domain_update_node_affinity(struct
>   }
>
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
> +{
> +    /* Being affine with no nodes is just wrong */
> +    if ( nodes_empty(*affinity) )
> +        return -EINVAL;
> +
> +    spin_lock(&d->node_affinity_lock);
> +
> +    /*
> +     * Being/becoming explicitly affine to all nodes is not particularly
> +     * useful. Let's take it as the `reset node affinity` command.
> +     */
> +    if ( nodes_full(*affinity) )
> +    {
> +        d->auto_node_affinity = 1;
> +        goto out;
> +    }
> +
> +    d->auto_node_affinity = 0;
> +    d->node_affinity = *affinity;
> +
> +out:
> +    spin_unlock(&d->node_affinity_lock);
> +
> +    domain_update_node_affinity(d);
> +
> +    return 0;
> +}
> +
> +
>   struct domain *get_domain_by_id(domid_t dom)
>   {
>       struct domain *d;
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -559,6 +559,26 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
>       }
>       break;
>
> +    case XEN_DOMCTL_setnodeaffinity:
> +    case XEN_DOMCTL_getnodeaffinity:
> +    {
> +        if ( op->cmd == XEN_DOMCTL_setnodeaffinity )
> +        {
> +            nodemask_t new_affinity;
> +
> +            ret = xenctl_bitmap_to_nodemask(&new_affinity,
> +&op->u.nodeaffinity.nodemap);
> +            if ( !ret )
> +                ret = domain_set_node_affinity(d,&new_affinity);
> +        }
> +        else
> +        {
> +            ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
> +&d->node_affinity);
> +        }
> +    }
> +    break;
> +

The two cases shouldn't be handled in one block, I think. Only the break
statement seems to be common here.
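
For illustration only (this is an editorial sketch of the split being
suggested, not code from the thread), giving each sub-command its own
case block could look something like:

    /* (editorial sketch of the suggested split, using only the helpers
     *  already present in the quoted patch) */
    case XEN_DOMCTL_setnodeaffinity:
    {
        nodemask_t new_affinity;

        ret = xenctl_bitmap_to_nodemask(&new_affinity,
                                        &op->u.nodeaffinity.nodemap);
        if ( !ret )
            ret = domain_set_node_affinity(d, &new_affinity);
    }
    break;

    case XEN_DOMCTL_getnodeaffinity:
        ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
                                        &d->node_affinity);
        break;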


Jürgen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2013-02-01 11:01 ` [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
@ 2013-02-01 14:28   ` Juergen Gross
  2013-02-28 14:05   ` George Dunlap
  1 sibling, 0 replies; 39+ messages in thread
From: Juergen Gross @ 2013-02-01 14:28 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 01.02.2013 12:01, Dario Faggioli wrote:
> For choosing the best NUMA placement candidate, we need to figure out
> how many VCPUs are runnable on each of them. That requires going through
> all the VCPUs of all the domains and check their affinities.
>
> With this change, instead of doing the above for each candidate, we
> do it once for all, populating an array while counting. This way, when
> we later are evaluating candidates, all we need is summing up the right
> elements of the array itself.
>
> This reduces the complexity of the overall algorithm, as it moves a
> potentially expensive operation (for_each_vcpu_of_each_domain {})
> outside from the core placement loop, so that it is performed only
> once instead of (potentially) tens or hundreds of times. More
> specifically, we go from a worst case computation time complexity of:
>
>   O(2^n_nodes) * O(n_domains*n_domain_vcpus)
>
> To, with this change:
>
>   O(n_domains*n_domains_vcpus) + O(2^n_nodes) = O(2^n_nodes)
>
> (with n_nodes<=16, otherwise the algorithm suggests partitioning with
> cpupools and does not even start.)
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

> ---
> Changes from v2:
>   * in nr_vcpus_on_nodes(), vcpu_nodemap renamed to something more
>     sensible, as suggested during review.
>
> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
> --- a/tools/libxl/libxl_numa.c
> +++ b/tools/libxl/libxl_numa.c
> @@ -165,22 +165,34 @@ static uint32_t nodemap_to_free_memkb(li
>       return free_memkb;
>   }
>
> -/* Retrieve the number of vcpus able to run on the cpus of the nodes
> - * that are part of the nodemap. */
> -static int nodemap_to_nr_vcpus(libxl__gc *gc, libxl_cputopology *tinfo,
> +/* Retrieve the number of vcpus able to run on the nodes in nodemap */
> +static int nodemap_to_nr_vcpus(libxl__gc *gc, int vcpus_on_node[],
>                                  const libxl_bitmap *nodemap)
>   {
> +    int i, nr_vcpus = 0;
> +
> +    libxl_for_each_set_bit(i, *nodemap)
> +        nr_vcpus += vcpus_on_node[i];
> +
> +    return nr_vcpus;
> +}
> +
> +/* Number of vcpus able to run on the cpus of the various nodes
> + * (reported by filling the array vcpus_on_node[]). */
> +static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
> +                             const libxl_bitmap *suitable_cpumap,
> +                             int vcpus_on_node[])
> +{
>       libxl_dominfo *dinfo = NULL;
> -    libxl_bitmap vcpu_nodemap;
> +    libxl_bitmap nodes_counted;
>       int nr_doms, nr_cpus;
> -    int nr_vcpus = 0;
>       int i, j, k;
>
>       dinfo = libxl_list_domain(CTX,&nr_doms);
>       if (dinfo == NULL)
>           return ERROR_FAIL;
>
> -    if (libxl_node_bitmap_alloc(CTX,&vcpu_nodemap, 0)<  0) {
> +    if (libxl_node_bitmap_alloc(CTX,&nodes_counted, 0)<  0) {
>           libxl_dominfo_list_free(dinfo, nr_doms);
>           return ERROR_FAIL;
>       }
> @@ -193,19 +205,17 @@ static int nodemap_to_nr_vcpus(libxl__gc
>           if (vinfo == NULL)
>               continue;
>
> -        /* For each vcpu of each domain ... */
>           for (j = 0; j<  nr_dom_vcpus; j++) {
> +            /* For each vcpu of each domain, increment the elements of
> +             * the array corresponding to the nodes where the vcpu runs */
> +            libxl_bitmap_set_none(&nodes_counted);
> +            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
> +                int node = tinfo[k].node;
>
> -            /* Build up a map telling on which nodes the vcpu is runnable on */
> -            libxl_bitmap_set_none(&vcpu_nodemap);
> -            libxl_for_each_set_bit(k, vinfo[j].cpumap)
> -                libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node);
> -
> -            /* And check if that map has any intersection with our nodemap */
> -            libxl_for_each_set_bit(k, vcpu_nodemap) {
> -                if (libxl_bitmap_test(nodemap, k)) {
> -                    nr_vcpus++;
> -                    break;
> +                if (libxl_bitmap_test(suitable_cpumap, k)&&
> +                    !libxl_bitmap_test(&nodes_counted, node)) {
> +                    libxl_bitmap_set(&nodes_counted, node);
> +                    vcpus_on_node[node]++;
>                   }
>               }
>           }
> @@ -213,9 +223,9 @@ static int nodemap_to_nr_vcpus(libxl__gc
>           libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
>       }
>
> -    libxl_bitmap_dispose(&vcpu_nodemap);
> +    libxl_bitmap_dispose(&nodes_counted);
>       libxl_dominfo_list_free(dinfo, nr_doms);
> -    return nr_vcpus;
> +    return 0;
>   }
>
>   /*
> @@ -270,7 +280,7 @@ int libxl__get_numa_candidate(libxl__gc
>       libxl_numainfo *ninfo = NULL;
>       int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0;
>       libxl_bitmap suitable_nodemap, nodemap;
> -    int rc = 0;
> +    int *vcpus_on_node, rc = 0;
>
>       libxl_bitmap_init(&nodemap);
>       libxl_bitmap_init(&suitable_nodemap);
> @@ -281,6 +291,8 @@ int libxl__get_numa_candidate(libxl__gc
>       if (ninfo == NULL)
>           return ERROR_FAIL;
>
> +    GCNEW_ARRAY(vcpus_on_node, nr_nodes);
> +
>       /*
>        * The good thing about this solution is that it is based on heuristics
>        * (implemented in numa_cmpf() ), but we at least can evaluate it on
> @@ -330,6 +342,19 @@ int libxl__get_numa_candidate(libxl__gc
>           goto out;
>
>       /*
> +     * Later on, we will try to figure out how many vcpus are runnable on
> +     * each candidate (as a part of choosing the best one of them). That
> +     * requires going through all the vcpus of all the domains and check
> +     * their affinities. So, instead of doing that for each candidate,
> +     * let's count here the number of vcpus runnable on each node, so that
> +     * all we have to do later is summing up the right elements of the
> +     * vcpus_on_node array.
> +     */
> +    rc = nr_vcpus_on_nodes(gc, tinfo, suitable_cpumap, vcpus_on_node);
> +    if (rc)
> +        goto out;
> +
> +    /*
>        * If the minimum number of NUMA nodes is not explicitly specified
>        * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
>        * from where to start generating candidates, if possible (or just start
> @@ -414,7 +439,8 @@ int libxl__get_numa_candidate(libxl__gc
>                * current best one (if any).
>                */
>               libxl__numa_candidate_put_nodemap(gc,&new_cndt,&nodemap);
> -            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo,&nodemap);
> +            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, vcpus_on_node,
> +&nodemap);
>               new_cndt.free_memkb = nodes_free_memkb;
>               new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
>               new_cndt.nr_cpus = nodes_cpus;
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-02-01 11:01 ` [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
@ 2013-02-01 14:30   ` Juergen Gross
  2013-02-27 19:00   ` George Dunlap
  2013-03-12 15:57   ` Ian Campbell
  2 siblings, 0 replies; 39+ messages in thread
From: Juergen Gross @ 2013-02-01 14:30 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 01.02.2013 12:01, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer. While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict: it only expresses
> a preference, although honouring it will almost always bring a
> significant performance benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two steps logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

> ---
> Changes from v2:
>   * for_each_csched_balance_step() now is defined as a regular, and
>     easier to understand, 0...n for() loop, instead of a going backwards
>     one, as that wasn't really needed;
>   * checking whether or not a CSCHED_BALANCE_NODE_AFFINITY balancing
>     set is useful or not now happens outside of csched_balance_cpumask(),
>     i.e., closer to the actual loop, making the logic a lot more evident
>     and easy to understand, as requested during review;
>   * while reworking __runq_tickle(), handling of idle pcpu has been brought
>     outside from the balancing loop, as requested during review;
>   * __csched_vcpu_is_migrateable() was just wrong, so it has been removed;
>   * the suboptimal handling of SMT in _csched_cpu_pick() has been moved
>     to a separate patch (i.e., the previous patch in the series);
>   * moved the CPU mask needed for balancing within `csched_pcpu', as
>     suggested during review. This way it is not only more close to
>     other per-PCPU data (potential cache related benefits), but it is
>     also only allocated for the PCPUs credit is in charge of.
>
> Changes from v1:
>   * CPU masks variables moved off from the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, use a global bunch of
>     them of (via per_cpu());
>   * George suggested a different load balancing logic during v1's review. I
>     think he was right and then I changed the old implementation in a way
>     that resembles exactly that. I rewrote most of this patch to introduce
>     a more sensible and effective noda-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,12 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY    0
> +#define CSCHED_BALANCE_CPU_AFFINITY     1
> +
> +/*
>    * Boot parameters
>    */
>   static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
> @@ -125,9 +131,20 @@ struct csched_pcpu {
>       struct timer ticker;
>       unsigned int tick;
>       unsigned int idle_bias;
> +    /* Store this here to avoid having too many cpumask_var_t-s on stack */
> +    cpumask_var_t balance_mask;
>   };
>
>   /*
> + * Convenience macro for accessing the per-PCPU cpumask we need for
> + * implementing the two steps (vcpu and node affinity) balancing logic.
> + * It is stored in csched_pcpu so that serialization is not an issue,
> + * as there is a csched_pcpu for each PCPU and we always hold the
> + * runqueue spin-lock when using this.
> + */
> +#define csched_balance_mask (CSCHED_PCPU(smp_processor_id())->balance_mask)
> +
> +/*
>    * Virtual CPU
>    */
>   struct csched_vcpu {
> @@ -159,6 +176,9 @@ struct csched_dom {
>       struct list_head active_vcpu;
>       struct list_head active_sdom_elem;
>       struct domain *dom;
> +    /* cpumask translated from the domain's node-affinity.
> +     * Basically, the CPUs we prefer to be scheduled on. */
> +    cpumask_var_t node_affinity_cpumask;
>       uint16_t active_vcpu_count;
>       uint16_t weight;
>       uint16_t cap;
> @@ -239,6 +259,43 @@ static inline void
>       list_del_init(&svc->runq_elem);
>   }
>
> +#define for_each_csched_balance_step(step) \
> +    for ( (step) = 0; (step)<= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
> +
> +
> +/*
> + * vcpu-affinity balancing is always necessary and must never be skipped.
> + * OTOH, if a domain has affinity with all the nodes, we can tell the caller
> + * that he can safely skip the node-affinity balancing step.
> + */
> +#define __vcpu_has_valuable_node_affinity(vc) \
> +    ( !cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
> +
> +static inline int csched_balance_step_skippable(int step, struct vcpu *vc)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY
> +&&  !__vcpu_has_valuable_node_affinity(vc) )
> +        return 1;
> +    return 0;
> +}
> +
> +/*
> + * Each csched-balance step uses its own cpumask. This function determines
> + * which one (given the step) and copies it in mask. For the node-affinity
> + * balancing step, the pcpus that are not part of vc's vcpu-affinity are
> + * filtered out from the result, to avoid running a vcpu where it would
> + * like, but is not allowed to!
> + */
> +static void
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +        cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask,
> +                    vc->cpu_affinity);
> +    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_affinity);
> +}
> +
>   static void burn_credits(struct csched_vcpu *svc, s_time_t now)
>   {
>       s_time_t delta;
> @@ -266,12 +323,12 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
>       cpumask_clear(&mask);
> +    idlers_empty = cpumask_empty(prv->idlers);
>
> -    idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
>        * If the pcpu is idle, or there are no idlers and the new
> @@ -291,41 +348,67 @@ static inline void
>       }
>       else if ( !idlers_empty )
>       {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        /* Node and vcpu-affinity balancing loop */
> +        for_each_csched_balance_step( balance_step )
> +        {
> +            int new_idlers_empty;
>
> -        /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> -         *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> -         */
> -        if ( cpumask_empty(&idle_mask)&&  new->pri>  cur->pri )
> -        {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> -            cpumask_set_cpu(cpu,&mask);
> -        }
> -        else if ( !cpumask_empty(&idle_mask) )
> -        {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* For vcpus with no node-affinity, consider vcpu-affinity only */
> +            if ( csched_balance_step_skippable( balance_step, new->vcpu) )
> +                continue;
> +
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            csched_balance_cpumask(new->vcpu, balance_step,
> +                                   csched_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers, csched_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before taking final decisions.
> +             */
> +            if ( new_idlers_empty
> +&&  balance_step == CSCHED_BALANCE_NODE_AFFINITY )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty&&  new->pri>  cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu,&mask);
>               }
> -            else
> -                cpumask_or(&mask,&mask,&idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                }
> +                else
> +                    cpumask_or(&mask,&mask,&idle_mask);
> +            }
> +
> +            /* Did we find anyone? */
> +            if ( !cpumask_empty(&mask) )
> +                break;
>           }
>       }
>
> @@ -370,6 +453,7 @@ csched_free_pdata(const struct scheduler
>
>       spin_unlock_irqrestore(&prv->lock, flags);
>
> +    free_cpumask_var(spc->balance_mask);
>       xfree(spc);
>   }
>
> @@ -385,6 +469,12 @@ csched_alloc_pdata(const struct schedule
>       if ( spc == NULL )
>           return NULL;
>
> +    if ( !alloc_cpumask_var(&spc->balance_mask) )
> +    {
> +        xfree(spc);
> +        return NULL;
> +    }
> +
>       spin_lock_irqsave(&prv->lock, flags);
>
>       /* Initialize/update system-wide config */
> @@ -475,15 +565,16 @@ static inline int
>   }
>
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running&&
>              !__csched_vcpu_is_cache_hot(vc)&&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
>   }
>
>   static int
> @@ -493,97 +584,110 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> -    int cpu;
> +    int cpu = vc->processor;
> +    int balance_step;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor,&cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor,&cpus);
> -    ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        if ( csched_balance_step_skippable( balance_step, vc) )
> +            continue;
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu,&idlers);
> -    cpumask_and(&cpus,&cpus,&idlers);
> +        /* Pick an online CPU from the proper affinity mask */
> +        csched_balance_cpumask(vc, balance_step,&cpus);
> +        cpumask_and(&cpus,&cpus, online);
>
> -    /*
> -     * It is important that cpu points to an idle processor, if a suitable
> -     * one exists (and we can use cpus to check and, possibly, choose a new
> -     * CPU, as we just&&-ed it with idlers). In fact, if we are on SMT, and
> -     * cpu points to a busy thread with an idle sibling, both the threads
> -     * will be considered the same, from the "idleness" calculation point
> -     * of view", preventing vcpu from being moved to the thread that is
> -     * actually idle.
> -     */
> -    if ( !cpumask_empty(&cpus)&&  !cpumask_test_cpu(cpu,&cpus) )
> -        cpu = cpumask_cycle(cpu,&cpus);
> -    cpumask_clear_cpu(cpu,&cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor,&cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor,&cpus);
> +        ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu,&idlers);
> +        cpumask_and(&cpus,&cpus,&idlers);
>
> -        nxt = cpumask_cycle(cpu,&cpus);
> +        /*
> +         * It is important that cpu points to an idle processor, if a suitable
> +         * one exists (and we can use cpus to check and, possibly, choose a new
> +         * CPU, as we just&&-ed it with idlers). In fact, if we are on SMT, and
> +         * cpu points to a busy thread with an idle sibling, both the threads
> +         * will be considered the same, from the "idleness" calculation point
> +         * of view", preventing vcpu from being moved to the thread that is
> +         * actually idle.
> +         */
> +        if ( !cpumask_empty(&cpus)&&  !cpumask_test_cpu(cpu,&cpus) )
> +            cpu = cpumask_cycle(cpu,&cpus);
> +        cpumask_clear_cpu(cpu,&cpus);
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +        while ( !cpumask_empty(&cpus) )
>           {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
> +
> +            nxt = cpumask_cycle(cpu,&cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu>  weight_nxt :
> +                 weight_cpu * migrate_factor<  weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> +                cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu>  weight_nxt :
> -             weight_cpu * migrate_factor<  weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> -            cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> -        }
> +        /* Stop if cpu is idle */
> +        if ( cpumask_test_cpu(cpu,&idlers) )
> +            break;
>       }
>
>       if ( commit&&  spc )
> @@ -925,6 +1029,13 @@ csched_alloc_domdata(const struct schedu
>       if ( sdom == NULL )
>           return NULL;
>
> +    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
> +    {
> +        xfree(sdom);
> +        return NULL;
> +    }
> +    cpumask_setall(sdom->node_affinity_cpumask);
> +
>       /* Initialize credit and weight */
>       INIT_LIST_HEAD(&sdom->active_vcpu);
>       sdom->active_vcpu_count = 0;
> @@ -956,6 +1067,9 @@ csched_dom_init(const struct scheduler *
>   static void
>   csched_free_domdata(const struct scheduler *ops, void *data)
>   {
> +    struct csched_dom *sdom = data;
> +
> +    free_cpumask_var(sdom->node_affinity_cpumask);
>       xfree(data);
>   }
>
> @@ -1252,7 +1366,7 @@ csched_tick(void *_cpu)
>   }
>
>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
> @@ -1277,11 +1391,21 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri<= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * If the vcpu has no valuable node-affinity, skip this vcpu.
> +             * In fact, what we want is to check if we have any node-affine
> +             * work to steal, before starting to look at vcpu-affine work.
> +             */
> +            if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
> +&&  !__vcpu_has_valuable_node_affinity(vc) )
> +                continue;
> +
> +            csched_balance_cpumask(vc, balance_step, csched_balance_mask);
> +            if ( __csched_vcpu_is_migrateable(vc, cpu, csched_balance_mask) )
>               {
>                   /* We got a candidate. Grab it! */
>                   TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> @@ -1307,7 +1431,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1324,42 +1449,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu,&workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers,&workers,&node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu,&workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu,&workers);
> -        cpumask_clear_cpu(peer_cpu,&workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }
>
>    out:
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -41,6 +41,8 @@
>    * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
>    * int first_unset_node(mask)		First node not set in mask, or
>    *					MAX_NUMNODES.
> + * int cycle_node(node, mask)		Next node cycling from 'node', or
> + *					MAX_NUMNODES
>    *
>    * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
>    * NODE_MASK_ALL			Initializer - all bits set
> @@ -254,6 +256,16 @@ static inline int __first_unset_node(con
>   			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
>   }
>
> +#define cycle_node(n, src) __cycle_node((n), &(src), MAX_NUMNODES)
> +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
> +{
> +    int nxt = __next_node(n, maskp, nbits);
> +
> +    if (nxt == nbits)
> +        nxt = __first_node(maskp, nbits);
> +    return nxt;
> +}
> +
>   #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
>
>   #if MAX_NUMNODES <= BITS_PER_LONG
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>
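
A standalone toy model of the node-wise walk the csched_load_balance() hunk
above introduces may help when reading it; cycle_bit() below is just a
stand-in for the new cycle_node() helper, working on a plain word rather than
a nodemask_t (illustrative C, not Xen code):

#include <stdio.h>

/* Return the next set bit after 'n', wrapping around to the lowest set
 * bit; 'nbits' plays the role of MAX_NUMNODES and is also the value
 * returned for an empty mask. */
static int cycle_bit(int n, unsigned long mask, int nbits)
{
    for ( int i = 1; i <= nbits; i++ )
    {
        int next = (n + i) % nbits;
        if ( mask & (1UL << next) )
            return next;
    }
    return nbits;
}

int main(void)
{
    unsigned long online = 0x5;   /* nodes 0 and 2 "online" */
    int start = 0, node = start;

    /* Visit every online node exactly once, starting from the local
     * one, the same do/while pattern the patch uses with cycle_node(). */
    do {
        printf("looking for work on node %d\n", node);
        node = cycle_bit(node, online, 8);
    } while ( node != start );

    return 0;
}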


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any
  2013-02-01 11:01 ` [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any Dario Faggioli
  2013-02-01 13:57   ` Juergen Gross
@ 2013-02-07 17:50   ` George Dunlap
  1 sibling, 0 replies; 39+ messages in thread
From: George Dunlap @ 2013-02-07 17:50 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 01/02/13 11:01, Dario Faggioli wrote:
> The pcpu picking algorithm treats two threads of an SMT core the same.
> More specifically, if one is idle and the other one is busy, they both
> will be assigned a weight of 1. Therefore, when picking begins, if the
> first target pcpu is the busy thread (and if there are no other idle
> pcpu than its sibling), that will never change.
>
> This change fixes this by ensuring that, before entering the core of
> the picking algorithm, the target pcpu is an idle one (if there is an
> idle pcpu at all, of course).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00 of 11 v3] NUMA aware credit scheduling
  2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
                   ` (10 preceding siblings ...)
  2013-02-01 11:01 ` [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation Dario Faggioli
@ 2013-02-18 17:13 ` Dario Faggioli
  2013-02-19  8:11   ` Jan Beulich
  2013-02-21 13:54   ` George Dunlap
  11 siblings, 2 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-18 17:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 2298 bytes --]

On Fri, 2013-02-01 at 12:01 +0100, Dario Faggioli wrote:
> Hello Everyone,
> 
> V3 of the NUMA aware scheduling series. It is nothing more than v2 with all
> the review comments I got, addressed... Or so I think. :-)
> 
Ping?

There is a minor thing from Juergen which I really plan to address and
repost, but I was hoping to be able to incorporate other people's
feedback as well while doing it.

In some more detail...

> Here are the patches included in the series. I '*'-ed the ones that already
> received one or more acks during previous rounds. Of course, I retained those
> Ack-s only for the patches that have not been touched, or that only underwent
> minor cleanups. Feel free to re-review everything, whether your Ack is there
> or not!
> 
>  * [ 1/11] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
>  * [ 2/11] xen, libxc: introduce node maps and masks
>
Jan, are you ok with the changes I made, as a consequence of your
comments to v2 ?

>    [ 3/11] xen: sched_credit: when picking, make sure we get an idle one, if any
>
George ack-ed this.

>    [ 4/11] xen: sched_credit: let the scheduler know about node-affinity
>
George, whenever you have a moment, any thoughts on how I addressed the
points you raised at v2 time?

>  * [ 5/11] xen: allow for explicitly specifying node-affinity
>  * [ 6/11] libxc: allow for explicitly specifying node-affinity
>  * [ 7/11] libxl: allow for explicitly specifying node-affinity
>    [ 8/11] libxl: optimize the calculation of how many VCPUs can run on a candidate
>  * [ 9/11] libxl: automatic placement deals with node-affinity
>  * [10/11] xl: add node-affinity to the output of `xl list`
>
These all (apart from 8/11) already received one or more ack(-s), but,
for instance, neither of the Ians has commented, and since I'm touching
[lib]xl, that would be much appreciated. :-)

>    [11/11] docs: rearrange and update NUMA placement documentation
> 
What about this? How bad was the English this time? :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00 of 11 v3] NUMA aware credit scheduling
  2013-02-18 17:13 ` [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
@ 2013-02-19  8:11   ` Jan Beulich
  2013-02-19  8:51     ` Ian Campbell
  2013-02-21 13:54   ` George Dunlap
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2013-02-19  8:11 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel DeGraaf

>>> On 18.02.13 at 18:13, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>  * [ 2/11] xen, libxc: introduce node maps and masks
>>
> Jan, are you ok with the changes I made, as a consequence of your
> comments to v2 ?

Yes, I am. But I won't ack a mixed tools/hypervisor patch. And
the hypervisor part that's left is so obvious that I don't see it
needing an ack on its own.

Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00 of 11 v3] NUMA aware credit scheduling
  2013-02-19  8:11   ` Jan Beulich
@ 2013-02-19  8:51     ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2013-02-19  8:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Dario Faggioli, Ian Jackson,
	xen-devel, Matt Wilson, Daniel DeGraaf, Juergen Gross

On Tue, 2013-02-19 at 08:11 +0000, Jan Beulich wrote:
> >>> On 18.02.13 at 18:13, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> >>  * [ 2/11] xen, libxc: introduce node maps and masks
> >>
> > Jan, are you ok with the changes I made, as a consequence of your
> > comments to v2 ?
> 
> Yes, I am. But I won't ack a mixed tools/hypervisor patch. And
> the hypervisor part that's left is so obvious that I don't see it
> needing an ack on its own.

I don't see any problem with acking a patch which is obvious, in general
I think it is better to be explicit, especially where multiple
maintainers are involved, lest we all end up waiting for each other.

You could say "I would ack this, but it is obvious" but it seems to me
it would be quicker to just say "ack" ;-)

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00 of 11 v3] NUMA aware credit scheduling
  2013-02-18 17:13 ` [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
  2013-02-19  8:11   ` Jan Beulich
@ 2013-02-21 13:54   ` George Dunlap
  2013-02-21 14:32     ` Dario Faggioli
  1 sibling, 1 reply; 39+ messages in thread
From: George Dunlap @ 2013-02-21 13:54 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 02/18/2013 05:13 PM, Dario Faggioli wrote:
> On Fri, 2013-02-01 at 12:01 +0100, Dario Faggioli wrote:
>> Hello Everyone,
>>
>> V3 of the NUMA aware scheduling series. It is nothing more than v2 with all
>> the review comments I got, addressed... Or so I think. :-)
>>
> Ping?

Sorry Dario -- this is my first priority as soon as I clear the stuff 
off my desk that has accumulated while I was away in Oman. :-)

  -George

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00 of 11 v3] NUMA aware credit scheduling
  2013-02-21 13:54   ` George Dunlap
@ 2013-02-21 14:32     ` Dario Faggioli
  0 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-02-21 14:32 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 997 bytes --]

On Thu, 2013-02-21 at 13:54 +0000, George Dunlap wrote:
> On 02/18/2013 05:13 PM, Dario Faggioli wrote:
> > On Fri, 2013-02-01 at 12:01 +0100, Dario Faggioli wrote:
> >> Hello Everyone,
> >>
> >> V3 of the NUMA aware scheduling series. It is nothing more than v2 with all
> >> the review comments I got, addressed... Or so I think. :-)
> >>
> > Ping?
> 
> Sorry Dario -- this is my first priority as soon as I clear the stuff 
> off my desk that has accumulated while I was away in Oman. :-)
> 
No need to be sorry, I perfectly understand, and am not in a hurry. The 
'ping' wasn't meant as "having the review right away", it was more about
"having it at some point".

Take the time it takes. :-)

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-02-01 11:01 ` [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
  2013-02-01 14:30   ` Juergen Gross
@ 2013-02-27 19:00   ` George Dunlap
  2013-03-13 16:09     ` Dario Faggioli
  2013-03-12 15:57   ` Ian Campbell
  2 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2013-02-27 19:00 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 4031 bytes --]

On Fri, Feb 1, 2013 at 11:01 AM, Dario Faggioli
<dario.faggioli@citrix.com>wrote:

> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer. While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict, it only expresses
> a preference, although honouring it is almost always true that will
> bring significant performances benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two steps logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>

OK, sorry it took so long.  This looks a lot better; just a couple of
comments below.


> +
> +/*
> + * vcpu-affinity balancing is always necessary and must never be skipped.
> + * OTOH, if a domain has affinity with all the nodes, we can tell the
> caller
> + * that he can safely skip the node-affinity balancing step.
> + */
> +#define __vcpu_has_valuable_node_affinity(vc) \
> +    ( !cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
>

Something about the name of this one just doesn't strike me right.  I might
be tempted just to go with "__vcpu_has_node_affinity", and let the comment
above it explain what it means for the curious.


> +
> +static inline int csched_balance_step_skippable(int step, struct vcpu *vc)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY
> +         && !__vcpu_has_valuable_node_affinity(vc) )
> +        return 1;
> +    return 0;
> +}
>

I'm not a fan of this kind of encapsulation, but I'll let it be I guess.
You missed a place to use it however -- in csched_runq_steal() you have the
if() statement.

(I prefer just having the duplicated if statements, as I think it's easier
to read; but you should go all one way or the other.)

>
>
> @@ -1277,11 +1391,21 @@ csched_runq_steal(int peer_cpu, int cpu,
>              if ( speer->pri <= pri )
>                  break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>              vc = speer->vcpu;
>              BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * If the vcpu has no valuable node-affinity, skip this vcpu.
> +             * In fact, what we want is to check if we have any
> node-affine
> +             * work to steal, before starting to look at vcpu-affine work.
> +             */
> +            if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
> +                 && !__vcpu_has_valuable_node_affinity(vc) )
> +                continue;
>

This is where you missed the "csched_balance_step_skippable" replacement.

Just an observation -- I think this will penalize systems that do not have
node affinity enabled, in that if no one is using node affinities, then
what will happen is the load balancing code will go through and check each
vcpu on each processor, determine that there are no node affinities, and
then go through again looking at vcpu affinities.  Going through twice for
a single vcpu when doing placement is probably OK; but going all the way
through all vcpus once could be pretty expensive.

I think we should take this patch now (with the other minor changes
mentioned above) so we can get it tested properly.  But we should probably
try to do something to address this issue before the release -- maybe
something like keeping a bitmask for each runqueue, saying whether any
vcpus running on them have node affinity?  That way the first round we'll
only check runqueues where we might conceivably steal something.
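
One possible shape of that idea, sketched with made-up stand-in types rather
than the real csched structures (nothing like this is in the series yet):

#include <stdbool.h>
#include <stdio.h>

/* Each runqueue remembers how many of its queued vCPUs have a restricted
 * node-affinity, so the node-affinity balance step can skip runqueues
 * where there is nothing node-affine to steal. */
struct runq_stub {
    unsigned int nr_node_affine;
};

static void runq_account_insert(struct runq_stub *rq, bool has_node_affinity)
{
    if ( has_node_affinity )
        rq->nr_node_affine++;
}

static void runq_account_remove(struct runq_stub *rq, bool has_node_affinity)
{
    if ( has_node_affinity )
        rq->nr_node_affine--;
}

/* During the node-affinity step, a runqueue with a zero count can be
 * skipped without scanning any of its vCPUs. */
static bool worth_stealing_node_affine(const struct runq_stub *rq)
{
    return rq->nr_node_affine != 0;
}

int main(void)
{
    struct runq_stub rq = { 0 };

    runq_account_insert(&rq, true);
    runq_account_insert(&rq, false);
    printf("worth a node-affine steal: %d\n", worth_stealing_node_affine(&rq));

    runq_account_remove(&rq, true);
    printf("worth a node-affine steal: %d\n", worth_stealing_node_affine(&rq));
    return 0;
}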

Other than that, looks good -- thanks for the good work.

 -George

[-- Attachment #1.2: Type: text/html, Size: 5214 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07 of 11 v3] libxl: allow for explicitly specifying node-affinity
  2013-02-01 11:01 ` [PATCH 07 of 11 v3] libxl: " Dario Faggioli
@ 2013-02-28 12:16   ` George Dunlap
  0 siblings, 0 replies; 39+ messages in thread
From: George Dunlap @ 2013-02-28 12:16 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 4319 bytes --]

On Fri, Feb 1, 2013 at 11:01 AM, Dario Faggioli
<dario.faggioli@citrix.com>wrote:

> By introducing a nodemap in libxl_domain_build_info and
> providing the get/set methods to deal with it.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>


>
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -4161,6 +4161,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
>      return rc;
>  }
>
> +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap)
> +{
> +    if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
> +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap)
> +{
> +    if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
>  int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap
> *cpumap)
>  {
>      GC_INIT(ctx);
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -266,6 +266,10 @@
>  #endif
>  #endif
>
> +/* For the 'nodemap' field (of libxl_bitmap type) in
> + * libxl_domain_build_info, containing the node-affinity of the domain. */
> +#define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1
> +
>  /* Functions annotated with LIBXL_EXTERNAL_CALLERS_ONLY may not be
>   * called from within libxl itself. Callers outside libxl, who
>   * do not #include libxl_internal.h, are fine. */
> @@ -861,6 +865,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
>                             libxl_bitmap *cpumap);
>  int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
>                                 unsigned int max_vcpus, libxl_bitmap
> *cpumap);
> +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap);
> +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap);
>  int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap
> *cpumap);
>
>  libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault(
>
>      libxl_defbool_setdefault(&b_info->numa_placement, true);
>
> +    if (!b_info->nodemap.size) {
> +        if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0))
> +            return ERROR_FAIL;
> +        libxl_bitmap_set_any(&b_info->nodemap);
> +    }
> +
>      if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
>          b_info->max_memkb = 32 * 1024;
>      if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint
>          if (rc)
>              return rc;
>      }
> +    libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus,
> &info->cpumap);
>
>      xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
> LIBXL_MAXMEM_CONSTANT);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain
>      ("max_vcpus",       integer),
>      ("avail_vcpus",     libxl_bitmap),
>      ("cpumap",          libxl_bitmap),
> +    ("nodemap",         libxl_bitmap),
>      ("numa_placement",  libxl_defbool),
>      ("tsc_mode",        libxl_tsc_mode),
>      ("max_memkb",       MemKB),
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

[-- Attachment #1.2: Type: text/html, Size: 5523 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2013-02-01 11:01 ` [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
  2013-02-01 14:28   ` Juergen Gross
@ 2013-02-28 14:05   ` George Dunlap
  1 sibling, 0 replies; 39+ messages in thread
From: George Dunlap @ 2013-02-28 14:05 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1257 bytes --]

On Fri, Feb 1, 2013 at 11:01 AM, Dario Faggioli
<dario.faggioli@citrix.com>wrote:

> For choosing the best NUMA placement candidate, we need to figure out
> how many VCPUs are runnable on each of them. That requires going through
> all the VCPUs of all the domains and check their affinities.
>
> With this change, instead of doing the above for each candidate, we
> do it once for all, populating an array while counting. This way, when
> we later are evaluating candidates, all we need is summing up the right
> elements of the array itself.
>
> This reduces the complexity of the overall algorithm, as it moves a
> potentially expensive operation (for_each_vcpu_of_each_domain {})
> outside of the core placement loop, so that it is performed only
> once instead of (potentially) tens or hundreds of times. More
> specifically, we go from a worst-case computation time complexity of:
>
>  O(2^n_nodes) * O(n_domains*n_domain_vcpus)
>
> To, with this change:
>
>  O(n_domains*n_domains_vcpus) + O(2^n_nodes) = O(2^n_nodes)
>
> (with n_nodes<=16, otherwise the algorithm suggests partitioning with
> cpupools and does not even start.)
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
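
As a toy illustration of the precomputation the changelog describes (not the
actual libxl code, numbers made up): the counting is done once, and each
candidate evaluation then reduces to summing a few array elements.

#include <stdio.h>

#define NR_NODES 4

int main(void)
{
    /* One pass over all vcpus (done once): how many can run on each node. */
    int vcpus_on_node[NR_NODES] = { 12, 7, 3, 9 };

    /* Evaluating a placement candidate (a set of nodes) is now only a sum
     * over the precomputed array, instead of a fresh walk over every vcpu
     * of every domain. */
    int candidate[] = { 0, 2 };   /* candidate made of nodes 0 and 2 */
    int i, nr_vcpus = 0;

    for ( i = 0; i < 2; i++ )
        nr_vcpus += vcpus_on_node[candidate[i]];

    printf("candidate {0,2} can run %d vcpus\n", nr_vcpus);
    return 0;
}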

[-- Attachment #1.2: Type: text/html, Size: 1793 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation
  2013-02-01 11:01 ` [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation Dario Faggioli
  2013-02-01 13:41   ` Juergen Gross
@ 2013-02-28 14:11   ` George Dunlap
  1 sibling, 0 replies; 39+ messages in thread
From: George Dunlap @ 2013-02-28 14:11 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 283 bytes --]

On Fri, Feb 1, 2013 at 11:01 AM, Dario Faggioli
<dario.faggioli@citrix.com>wrote:

> To include the new concept of NUMA aware scheduling and
> describe its impact.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

[-- Attachment #1.2: Type: text/html, Size: 716 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-02-01 11:01 ` [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
@ 2013-03-12 15:46   ` Ian Campbell
  2013-03-12 17:06     ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2013-03-12 15:46 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> This is in preparation for introducing NUMA node-affinity maps.

The second patch doesn't use this new infrastructure though, and has
"typedef uint8_t *xc_nodemap_t" instead. Which makes sense since a
node-affinity map is not a bitmap, is it?

> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16];
>  #endif
> 
>  #ifndef __ASSEMBLY__
> -struct xenctl_cpumap {
> +struct xenctl_bitmap {
>      XEN_GUEST_HANDLE_64(uint8) bitmap;
> -    uint32_t nr_cpus;
> +    uint32_t nr_elems;

Is this really "nr_bits" or can an entry in the bitmap be > 1 bit?

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-02-01 11:01 ` [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
  2013-02-01 14:30   ` Juergen Gross
  2013-02-27 19:00   ` George Dunlap
@ 2013-03-12 15:57   ` Ian Campbell
  2013-03-12 16:20     ` Dario Faggioli
  2 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2013-03-12 15:57 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer.

I cannot parse this sentence. Is the "or, better," redundant?

>  While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict, it only expresses
> a preference, although honouring it is almost always true that will
> bring significant performances benefit

I can't parse this last bit either ("although...benefit"). If I drop the
"is almost always true that" it makes sense and I think expresses what
you meant. And just a nit, it is "performance benefits".

>  (especially as compared to
> not having any affinity at all).
> 
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two steps logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.

I think from previous paragraphs that you mean to say that the first
path takes into account node and vcpu affinity while the second only
vcpu affinity? The above reads as if it only uses the node affinity on
the first pass.

I'm not really well placed to review the actual meat, but I see George
has some comments.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list`
  2013-02-01 11:01 ` [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list` Dario Faggioli
@ 2013-03-12 16:04   ` Ian Campbell
  2013-03-12 16:10     ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2013-03-12 16:04 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> Node-affinity is now something that is under (some) control of the
> user, so show it upon request as part of the output of `xl list'
> by the `-n' option.
> 
> Re the patch, the print_bitmap() related hunk is _mostly_ code motion,
> although there is a very minor change in the code, basically to allow
> using the function for printing both cpu and node bitmaps (as, in case
> all bits are set, it used to print "any cpu", which doesn't fit the
> nodemap case).
> 
> This is how the output of `xl list', with various combination of
> options, will look like after this change:
> 
>  http://pastebin.com/68kUuwyE

This link won't go away at some point? If the output is very interesting
perhaps it should be inlined into the commit message? If not, then
perhaps just post it once for the archives?

> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list`
  2013-03-12 16:04   ` Ian Campbell
@ 2013-03-12 16:10     ` Dario Faggioli
  0 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-03-12 16:10 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 1702 bytes --]

On mar, 2013-03-12 at 16:04 +0000, Ian Campbell wrote:
> On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> > Node-affinity is now something that is under (some) control of the
> > user, so show it upon request as part of the output of `xl list'
> > by the `-n' option.
> > 
> > Re the patch, the print_bitmap() related hunk is _mostly_ code motion,
> > although there is a very minor change in the code, basically to allow
> > using the function for printing both cpu and node bitmaps (as, in case
> > all bits are set, it used to print "any cpu", which doesn't fit the
> > nodemap case).
> > 
> > This is how the output of `xl list', with various combination of
> > options, will look like after this change:
> > 
> >  http://pastebin.com/68kUuwyE
> 
> This link won't go away at some point? If the output is very interesting
> perhaps it should be inlined into the commit message? If not, then
> perhaps just post it once for the archives?
> 
Not sure. When I created the paste, I made it 'permanent', so it should
never expire or stuff like that... Of course that doesn't cover for
pastebin.com shutting down or changing URL...

Embedding in the changelog will make it really long and really ugly to
look at (due to line wrap, etc.).

I guess I can, when reposting, just reply to myself with the output, so
that it can be easily seen here, and survives in the list archives.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-03-12 15:57   ` Ian Campbell
@ 2013-03-12 16:20     ` Dario Faggioli
  2013-03-12 16:30       ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-03-12 16:20 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 2237 bytes --]

On mar, 2013-03-12 at 15:57 +0000, Ian Campbell wrote:
> On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> > As vcpu-affinity tells where VCPUs must run, node-affinity tells
> > where they should or, better, prefer.
> 
> I cannot parse this sentence. Is the "or, better," redundant?
> 
I see... It's indeed redundant and it's a wording we sometimes use in
Italian, and translating it literally is not such a good idea, I
guess. :-)

How about "vcpu-affinity tells where VCPUs must run, node-affinity tells
where they prefer to."

> >  While respecting vcpu-affinity
> > remains mandatory, node-affinity is not that strict, it only expresses
> > a preference, although honouring it is almost always true that will
> > bring significant performances benefit
> 
> I can't parse this last bit either ("although...benefit"). If I drop the
> "is almost always true that" it makes sense and I think expresses what
> you meant. 
>
Right, my original form was missing something, but I like your version
(i.e., removing the "is almost...that" part) better. I'll go for it.

> And just a nit, it is "performance benefits".
> 
EhEh... George explained this "performance" VS "performances" thing to me
twice already, but so far I've always managed to pick the
_wrong_ one quite effectively! :-/

> >  (especially as compared to
> > not having any affinity at all).
> > 
> > This change modifies the VCPU load balancing algorithm (for the
> > credit scheduler only), introducing a two steps logic.
> > During the first step, we use the node-affinity mask. The aim is
> > giving precedence to the CPUs where it is known to be preferable
> > for the domain to run. If that fails in finding a valid PCPU, the
> > node-affinity is just ignored and, in the second step, we fall
> > back to using cpu-affinity only.
> 
> I think from previous paragraphs that you mean to say that the first
> path takes into account node and vcpu affinity while the second only
> vcpu affinity? The above reads as if it only uses the node affinity on
> the first pass.
>
Yes, first pass uses the intersection of the two. I'll put the sentence
in a way that makes this more evident.
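
In other words, with plain words standing in for cpumask_t (toy model, not
the patch code):

#include <stdio.h>

int main(void)
{
    unsigned long cpu_affinity  = 0xff;  /* vcpu may run on any of 8 pcpus */
    unsigned long node_affinity = 0x0f;  /* but prefers the pcpus of node 0 */

    /* First (node-affinity) balance step: intersection of the two masks. */
    unsigned long step1 = cpu_affinity & node_affinity;
    /* Second (fallback) step: plain vcpu-affinity only. */
    unsigned long step2 = cpu_affinity;

    printf("step 1 mask %#lx, step 2 mask %#lx\n", step1, step2);
    return 0;
}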

Thanks,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-03-12 16:20     ` Dario Faggioli
@ 2013-03-12 16:30       ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2013-03-12 16:30 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Tue, 2013-03-12 at 16:20 +0000, Dario Faggioli wrote:
> On mar, 2013-03-12 at 15:57 +0000, Ian Campbell wrote:
> > On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> > > As vcpu-affinity tells where VCPUs must run, node-affinity tells
> > > where they should or, better, prefer.
> > 
> > I cannot parse this sentence. Is the "or, better," redundant?
> > 
> I see... It's indeed redundant and it's a wording we sometimes use in
> Italian, and translating it literally is not such a good idea, I
> guess. :-)

Not on this occasion ;-)

> How about "vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they prefer to."

That's better, thanks.

> > And just a nit, it is "performance benefits".
> > 
> EhEh... George explained this "performance" VS "performances" thing to me
> twice already, but so far I've always managed to pick the
> _wrong_ one quite effectively! :-/

Choose one and then go for the opposite seems like a good rule of thumb
then ;-)

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-03-12 15:46   ` Ian Campbell
@ 2013-03-12 17:06     ` Dario Faggioli
  2013-03-12 17:26       ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-03-12 17:06 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 2242 bytes --]

On mar, 2013-03-12 at 15:46 +0000, Ian Campbell wrote:
> On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> > This is in preparation of introducing NUMA node-affinity maps.
> 
> The second patch doesn't use this new infrastrucuture though, and has
> "typedef uint8_t *xc_nodemap_t" instead. 
>
Indeed, exactly as it happens for xc_cpumap_t, independently from both
the first and the second patch of this series. However, the second patch
defines xenctl_bitmap_to_nodemask() and nodemask_to_xenctl_bitmap(),
which is what uses xenctl_bitmap.

Fact is xenctl_bitmap is used within DOMCTLs, while xc_{cpu,node}map_t
is used by xc_*_{set,get}_affinity(), which is what, for instance, libxl
calls.

The split and the ordering of the series are meant to allow patch 2 to
define everything that is nodemap related inside libxc, including the
xenctl_bitmap_to_nodemask() (and reverse) functions above. Actual usage
of both the types and the related interfaces only happens in subsequent
patches.

> Which makes sense since a
> node-affinity map is not a bitmap, is it?
> 
They all are the same thing, but represented differently depending on
where we are.

All that being said... Did I answer? :-)

> > diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> > --- a/xen/include/public/xen.h
> > +++ b/xen/include/public/xen.h
> > @@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16];
> >  #endif
> > 
> >  #ifndef __ASSEMBLY__
> > -struct xenctl_cpumap {
> > +struct xenctl_bitmap {
> >      XEN_GUEST_HANDLE_64(uint8) bitmap;
> > -    uint32_t nr_cpus;
> > +    uint32_t nr_elems;
> 
> Is this really "nr_bits" or can an entry in the bitmap be > 1 bit?
> 
I'm not sure I understand what you meant to say here, I'm afraid. The
field encodes the number of elements the bitmap has to accommodate and
deal with. In the (original) xenctl_cpumap case, that was the number of
CPUs... All I'm doing is generalizing that from CPUs to some unspecified
"element". It never was the size of the bitmap in bits, and is not
becoming anything like that after the change.

But again, I did not get the question, so I'm probably not answering
either... :-(

Thanks and Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-03-12 17:06     ` Dario Faggioli
@ 2013-03-12 17:26       ` Ian Campbell
  2013-03-12 18:08         ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2013-03-12 17:26 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Tue, 2013-03-12 at 17:06 +0000, Dario Faggioli wrote:
> On mar, 2013-03-12 at 15:46 +0000, Ian Campbell wrote:
> > On Fri, 2013-02-01 at 11:01 +0000, Dario Faggioli wrote:
> > > This is in preparation of introducing NUMA node-affinity maps.
> > 
> > The second patch doesn't use this new infrastrucuture though, and has
> > "typedef uint8_t *xc_nodemap_t" instead. 
> >
> Indeed, exactly as it happens for xc_cpumap_t, independently from both
> the first and the second patch of this series. However, the second patch
> defines xenctl_bitmap_to_nodemask() and nodemask_to_xenctl_bitmap(),
> which is what uses xenctl_bitmap.
> 
> Fact is xenctl_bitmap is used within DOMCTLs, while xc_{cpu,node}map_t
> is used by xc_*_{set,get}_affinity(), which is what, for instance, libxl
> calls.

so xenctl_bitmap is completely internal to the library/hypervisor and
the user of libxc doesn't see it?

> The split and the ordering of the series is meant at allowing patch 2 to
> define everything that is nodemap related, inside libxc, including the
> xenctl_bitmap_to_nodemask() (and reverse) functions above. Actual usage
> of both the types and related interfaces, only happen in subsequent
> patches.
> 
> > Which makes sense since a
> > node-affinity map is not a bitmap, is it?
> > 
> They all are the same thing, but represented differently depending on
> where we are.
> 
> All that being said... Did I answer? :-)
> 
> > > diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> > > --- a/xen/include/public/xen.h
> > > +++ b/xen/include/public/xen.h
> > > @@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16];
> > >  #endif
> > > 
> > >  #ifndef __ASSEMBLY__
> > > -struct xenctl_cpumap {
> > > +struct xenctl_bitmap {
> > >      XEN_GUEST_HANDLE_64(uint8) bitmap;
> > > -    uint32_t nr_cpus;
> > > +    uint32_t nr_elems;
> > 
> > Is this really "nr_bits" or can an entry in the bitmap be > 1 bit?
> > 
> I'm not sure I understand what you meant to say here, I'm afraid. The
> field encodes the number of elements the bitmap has to accommodate and
> deal with. In the (original) xenctl_cpumap case, that was the number of
> CPUs... All I'm doing is generalizing that from CPUs to some unspecified
> "element". It never was the size of the bitmap in bits, and is not
> becoming anything like that after the change.

Ah, so it is the number of bytes allocated in the bitmap field?

nr_elems is confusing because the elements of this particular array are
(logically at least) bits, yet nr_elems contains bytes. Why not call it
nr_bytes if that is what it contains?

> 
> But again, I did not get the question, so I'm probably not answering
> either... :-(
> 
> Thanks and Regards,
> Dario

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-03-12 17:26       ` Ian Campbell
@ 2013-03-12 18:08         ` Dario Faggioli
  2013-03-13  9:53           ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2013-03-12 18:08 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Jan Beulich, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 3599 bytes --]

On mar, 2013-03-12 at 17:26 +0000, Ian Campbell wrote:
> On Tue, 2013-03-12 at 17:06 +0000, Dario Faggioli wrote:
> > Fact is xenctl_bitmap is used within DOMCTLs, while xc_{cpu,node}map_t
> > is used by xc_*_{set,get}_affinity(), which is what, for instance, libxl
> > calls.
> 
> so xenctl_bitmap is completely internal to the library/hypervisor and
> the user of libxc doesn't see it?
> 
It's always tricky to track (at least for me), due to some "Chinese
Boxing" going on here, and that's why I had to recheck before replying.
Having done that, yes, AFAIUI, xenctl_bitmap is how Xen and libxc
exchanges the information about the cpu/node-map. What is actually used
is then cpumask_t (nodemask_t), in Xen, and xc_cpumap_t (xc_nodemap_t)
in libxc (and his callers).

> > > >  #ifndef __ASSEMBLY__
> > > > -struct xenctl_cpumap {
> > > > +struct xenctl_bitmap {
> > > >      XEN_GUEST_HANDLE_64(uint8) bitmap;
> > > > -    uint32_t nr_cpus;
> > > > +    uint32_t nr_elems;
> > > 
> > > Is this really "nr_bits" or can an entry in the bitmap be > 1 bit?
> > > 
> > I'm not sure I understand what you meant to say here, I'm afraid. The
> > field encodes the number of elements the bitmap has to accommodate and
> > deal with. In the (original) xenctl_cpumap case, that was the number of
> > CPUs... All I'm doing is generalizing that from CPUs to some unspecified
> > "element". It never was the size of the bitmap in bits, and is not
> > becoming anything like that after the change.
> 
> Ah, so it is the number of bytes allocated in the bitmap field?
> 
> nr_elems is confusing because the elements of this particular array are
> (logically at least) bits, yet nr_elems contains bytes. Why not call it
> nr_bytes if that is what it contains?
> 
Mmm... I think now I see what you mean (or so I hope)! Sorry if I did
not get your first comment right. So, you are (and you were, in your
previous email) right when saying that what I call nr_elems is indeed,
at least logically, nr_bits.

In fact, it is not at all the real size of the array in bytes, as can be
seen from here:

int xc_vcpu_getaffinity(xc_interface *xch, uint32_t domid, int vcpu, xc_cpumap_t cpumap)
{
    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
    int cpusize;

    cpusize = xc_get_cpumap_size(xch);
    ...
    local = xc_hypercall_buffer_alloc(xch, local, cpusize);
    ...
    domctl.cmd = XEN_DOMCTL_getvcpuaffinity;
    domctl.domain = (domid_t)domid;
    domctl.u.vcpuaffinity.vcpu = vcpu;

    set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
    ...
}

The whole point is I'm turning a collection of CPUs into something more
general. That is why I renamed the field that encodes, currently, the
"number of CPUs", into a more general "number of elements".

Now, were you saying that, since this collection of elements is,
actually, a collection of bits && that, given the fact we represent it
as an array, the elements of which are bytes, using "number of bits"
would be more appropriate and clear?

If yes, sorry again for not getting it in the first place, and, yes, I
do agree with that, and I can sed/nr_elems/nr_bits/.
If no... Well... Let's hope it's yes. :-D
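
For the archives, the arithmetic being discussed, modelled standalone
(illustrative numbers only, not libxc code):

#include <stdio.h>

int main(void)
{
    unsigned int nr_cpus  = 21;                  /* logical elements        */
    unsigned int nr_bytes = (nr_cpus + 7) / 8;   /* buffer actually carried */
    unsigned int nr_bits  = nr_bytes * 8;        /* what the field reports  */

    printf("%u cpus -> %u bytes -> nr_elems/nr_bits = %u\n",
           nr_cpus, nr_bytes, nr_bits);
    return 0;
}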

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-03-12 18:08         ` Dario Faggioli
@ 2013-03-13  9:53           ` Ian Campbell
  2013-03-13 10:13             ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2013-03-13  9:53 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Jan Beulich, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

On Tue, 2013-03-12 at 18:08 +0000, Dario Faggioli wrote:
> Now, were you saying that, since this collection of elements is,
> actually, a collection of bits && that, given the fact we represent it
> as an array, the elements of which are bytes, using "number of bits"
> would be more appropriate and clear?

Yes.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2013-03-13  9:53           ` Ian Campbell
@ 2013-03-13 10:13             ` Dario Faggioli
  0 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-03-13 10:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Jan Beulich, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 1251 bytes --]

On mer, 2013-03-13 at 09:53 +0000, Ian Campbell wrote:
> On Tue, 2013-03-12 at 18:08 +0000, Dario Faggioli wrote:
> > Now, were you saying that, since this collection of elements is,
> > actually, a collection of bits && that, given the fact we represent it
> > as an array, the elements of which are bytes, using "number of bits"
> > would be more appropriate and clear?
> 
> Yes.
> 
Ok then, I'll go for it, since I'm respinning anyway.

I'm preparing a new version incorporating yours, Juergen's and George's
comments right at this moment.

That being said, and as I think I already mentioned, having your and/or
IanJ's view (and, eventually, Ack-s) on patches 7, 8, 9 and 10 is the only
big missing piece for the series to be committed (the v4 I'm preparing,
of course, not this one, but still...). Do you think it'll ever
happen? :-D

Zero intention to push or to affect your workflow... Just bringing it up
from time to time. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity
  2013-02-27 19:00   ` George Dunlap
@ 2013-03-13 16:09     ` Dario Faggioli
  0 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2013-03-13 16:09 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 3897 bytes --]

Ok, here we are again! :-)

On mer, 2013-02-27 at 19:00 +0000, George Dunlap wrote:
> On Fri, Feb 1, 2013 at 11:01 AM, Dario Faggioli 

>         +
>         +/*
>         + * vcpu-affinity balancing is always necessary and must never
>         be skipped.
>         + * OTOH, if a domain has affinity with all the nodes, we can
>         tell the caller
>         + * that he can safely skip the node-affinity balancing step.
>         + */
>         +#define __vcpu_has_valuable_node_affinity(vc) \
>         +    ( !
>         cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
> 
> 
> Something about the name of this one just doesn't strike me right.  I
> might be tempted just to go with "__vcpu_has_node_affinity", and let
> the comment above it explain what it means for the curious.
> 
Yes, I deeply hate it too, but wasn't sure whether or not leaving the
"_valuable_" part out would have made things clear enough. Glad you
think it does; I'll kill it, while at the same time improving the comment,
as you suggest.

>         +
>         +static inline int csched_balance_step_skippable(int step,
>         struct vcpu *vc)
>         +{
>         +    if ( step == CSCHED_BALANCE_NODE_AFFINITY
>         +         && !__vcpu_has_valuable_node_affinity(vc) )
>         +        return 1;
>         +    return 0;
>         +}
> 
> 
> I'm not a fan of this kind of encapsulation, but I'll let it be I
> guess.  You missed a place to use it however -- in csched_runq_steal()
> you have the if() statement.
> 
Well, that was indeed on purpose, as in one of the places, I'm actually
skipping the balancing step as a whole, while in csched_runq_steal(),
I'm actually skipping only the various vCPUs, one after the other.

Anyway, the fact that you've seen it like this clearly demonstrates that
this attempt of mine at being, let's say, precise did not work... I'm happy
with killing the function above and putting the if() there.


> Just an observation -- I think this will penalize systems that do not
> have node affinity enabled, in that if no one is using node
> affinities, then what will happen is the load balancing code will go
> through and check each vcpu on each processor, determine that there
> are no node affinities, and then go through again looking at vcpu
> affinities.  Going through twice for a single vcpu when doing
> placement is probably OK; but going all the way through all vcpus once
> could be pretty expensive.
> 
Well, yes, that is one of the downsides of scanning the runqueues in the
node-wise fashion we agreed upon during the review of one of the
previous rounds. Anyway...

> I think we should take this patch now (with the other minor changes
> mentioned above) so we can get it tested properly.  But we should
> probably try to do something to address this issue before the release
> -- maybe something like keeping a bitmask for each runqueue, saying
> whether any vcpus running on them have node affinity?  That way the
> first round we'll only check runqueues where we might conceivably
> steal something.
> 
...This sounds reasonable to me. I'm releasing v4 _without_ it, but with
a comment in the code. I'm already working on a solution along the lines
of what you suggest, and I can release the outcome later as a separate
patch, once I have rerun all my usual benchmarks (which is something I
can't do right now), if the series is in by then (as I hope! :-D).
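
Just to give an idea of the direction I'm exploring (all names are
tentative and this is completely untested, so please take it as nothing
more than a sketch): a per-runqueue counter of the vCPUs that have a
"valuable" node-affinity, playing the role of the bitmask you mention.
It would be updated on runq insertion/removal, and the node-affinity
balancing pass would use it to skip runqueues it has no chance of
stealing anything from:

    /* In struct csched_pcpu (tentative new field): */
    unsigned int nr_node_aff_vcpus;  /* vCPUs in this runq with node-affinity */

    /* When inserting svc in (removing it from) the runq of cpu: */
    if ( __vcpu_has_node_affinity(svc->vcpu) )
        CSCHED_PCPU(cpu)->nr_node_aff_vcpus++;   /* -- on removal */

    /* In csched_load_balance(), during the node-affinity step only: */
    if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
         && CSCHED_PCPU(peer_cpu)->nr_node_aff_vcpus == 0 )
        continue;  /* nothing we could possibly steal from this runqueue */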

> Other than that, looks good -- thanks for the good work.
> 
Thanks to you for bearing with it (and with me :-)).
Dario


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2013-03-13 16:09 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-01 11:01 [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
2013-02-01 11:01 ` [PATCH 01 of 11 v3] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2013-03-12 15:46   ` Ian Campbell
2013-03-12 17:06     ` Dario Faggioli
2013-03-12 17:26       ` Ian Campbell
2013-03-12 18:08         ` Dario Faggioli
2013-03-13  9:53           ` Ian Campbell
2013-03-13 10:13             ` Dario Faggioli
2013-02-01 11:01 ` [PATCH 02 of 11 v3] xen, libxc: introduce xc_nodemap_t Dario Faggioli
2013-02-01 11:01 ` [PATCH 03 of 11 v3] xen: sched_credit: when picking, make sure we get an idle one, if any Dario Faggioli
2013-02-01 13:57   ` Juergen Gross
2013-02-07 17:50   ` George Dunlap
2013-02-01 11:01 ` [PATCH 04 of 11 v3] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2013-02-01 14:30   ` Juergen Gross
2013-02-27 19:00   ` George Dunlap
2013-03-13 16:09     ` Dario Faggioli
2013-03-12 15:57   ` Ian Campbell
2013-03-12 16:20     ` Dario Faggioli
2013-03-12 16:30       ` Ian Campbell
2013-02-01 11:01 ` [PATCH 05 of 11 v3] xen: allow for explicitly specifying node-affinity Dario Faggioli
2013-02-01 14:20   ` Juergen Gross
2013-02-01 11:01 ` [PATCH 06 of 11 v3] libxc: " Dario Faggioli
2013-02-01 11:01 ` [PATCH 07 of 11 v3] libxl: " Dario Faggioli
2013-02-28 12:16   ` George Dunlap
2013-02-01 11:01 ` [PATCH 08 of 11 v3] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2013-02-01 14:28   ` Juergen Gross
2013-02-28 14:05   ` George Dunlap
2013-02-01 11:01 ` [PATCH 09 of 11 v3] libxl: automatic placement deals with node-affinity Dario Faggioli
2013-02-01 11:01 ` [PATCH 10 of 11 v3] xl: add node-affinity to the output of `xl list` Dario Faggioli
2013-03-12 16:04   ` Ian Campbell
2013-03-12 16:10     ` Dario Faggioli
2013-02-01 11:01 ` [PATCH 11 of 11 v3] docs: rearrange and update NUMA placement documentation Dario Faggioli
2013-02-01 13:41   ` Juergen Gross
2013-02-28 14:11   ` George Dunlap
2013-02-18 17:13 ` [PATCH 00 of 11 v3] NUMA aware credit scheduling Dario Faggioli
2013-02-19  8:11   ` Jan Beulich
2013-02-19  8:51     ` Ian Campbell
2013-02-21 13:54   ` George Dunlap
2013-02-21 14:32     ` Dario Faggioli
