* [PATCH 00 of 10 v2] NUMA aware credit scheduling
@ 2012-12-19 19:07 Dario Faggioli
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
                   ` (11 more replies)
  0 siblings, 12 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Hello Everyone,

Here is take 2 of the NUMA aware credit scheduling series. Sorry it took a
while, but I had to take care of those nasty bugs causing scheduling anomalies,
as they were getting in the way and messing up the numbers when trying to
evaluate the performance of all this! :-)

I also rewrote most of the core of the two-step vcpu and node affinity
balancing algorithm, as per George's suggestion during the last round, to try
to squeeze out a little more performance improvement.

As already and repeatedly said, what the series does is provide the (credit)
scheduler with knowledge of a domain's node-affinity. The scheduler will then
always try to run the domain's vCPUs on one of those nodes first; only if that
turns out to be impossible does it fall back to the old behaviour. (BTW, for
any update on the status of my "quest" to improve NUMA support in Xen, see
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap.)
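
Very roughly speaking, and leaving all the details to patch 3, the balancing
logic this introduces looks like the following sketch (names are simplified
here, they are not the ones used in the actual code):

    for_each_balance_step( step )   /* node-affinity first, then vcpu-affinity */
    {
        mask = (step == node_affinity_step) ?
                   node_affinity & vcpu_affinity : vcpu_affinity;

        /* ... look for a suitable (ideally idle) pCPU within mask ... */

        if ( found_one || no_actual_node_affinity )
            break;
    }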

I re-ran my usual benchmark, SpecJBB2005, plus some others, i.e., some
configurations of sysbench and lmbench. A little bit more about them follows:

 * SpecJBB is all about throughput, so pinning is likely the ideal solution.

 * Sysbench-memory measures the time it takes to write a fixed amount of
   memory (what gets reported is the resulting throughput). We expect
   locality to be important, but at the same time the potential imbalances
   due to pinning could have a say in it.

 * Lmbench-proc measures the time it takes for a process to fork a fixed
   number of children. This is much more about latency than throughput, with
   locality of memory accesses playing a smaller role and, again, imbalances
   due to pinning being a potential issue.

On a 2-node, 16-core system, where I can have 2 to 10 VMs (2 vCPUs each)
executing the benchmarks concurrently, here's what I get:

 ----------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |   43451.853 | 49876.750 |    49693.653    |
 |    6 |   29368.589 | 33782.132 |    33692.936    |
 |   10 |   19138.934 | 21950.696 |    21413.311    |
 ----------------------------------------------------
 | Sysbench memory, throughput (the higher the better)
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  484.42167  | 552.32667 |    552.86167    |
 |    6 |  404.43667  | 440.00056 |    449.42611    |
 |   10 |  296.45600  | 315.51733 |    331.49067    |
 ----------------------------------------------------
 | LMBench proc, latency (the lower the better)     |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  824.00437  | 749.51892 |    741.42952    |
 |    6 |  942.39442  | 985.02761 |    974.94700    |
 |   10 |  1254.3121  | 1363.0792 |    1301.2917    |
 ----------------------------------------------------

Reasoning in terms of % performance increase/decrease, this means NUMA aware
scheduling does as follows, compared to no affinity at all and to pinning
(how the percentages are computed is shown right after the tables):

     ----------------------------------
     | SpecJBB2005 (throughput)       |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +14.36%   |   -0.36%  |
     |    6 |   +14.72%   |   -0.26%  |
     |   10 |   +11.88%   |   -2.44%  |
     ----------------------------------
     | Sysbench memory (throughput)   |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +14.12%   |   +0.09%  |
     |    6 |   +11.12%   |   +2.14%  |
     |   10 |   +11.81%   |   +5.06%  |
     ----------------------------------
     | LMBench proc (latency)         |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |   +10.02%   |   +1.07%  |
     |    6 |    +3.45%   |   +1.02%  |
     |   10 |    +2.94%   |   +4.53%  |
     ----------------------------------
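
The percentages are just the relative differences between the NUMA scheduling
column and the other two, always taken in the direction of improvement (so a
positive value means NUMA aware scheduling doing better). For instance, for
the 2 VMs rows:

  SpecJBB, vs. no affinity: (49693.653 - 43451.853) / 43451.853 = +14.36%
  LMBench, vs. no affinity: (824.00437 - 741.42952) / 824.00437 = +10.02%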

The numbers seem to show that we are successful in taking advantage of both
the improved locality (when compared to no affinity) and the greater
flexibility the NUMA aware scheduling approach gives us (when compared to
pinning). In fact, when only throughput is concerned (the SpecJBB case), it
behaves almost on par with pinning, and a lot better than no affinity at all.
Moreover, we are even able to do better than both, when latency comes a little
more into the game and the imbalances caused by pinning make things worse than
not having any affinity at all, as in the sysbench and, especially, the
LMBench case.

Here are the patches included in the series. I marked with '*' the ones that
already received one or more acks during v1. However, some patches have been
significantly reworked since then; in those cases, I ignored the previous acks
and left the patches with my SOB only, as I think they definitely need to be
re-reviewed. :-)

 * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 * [ 2/10] xen, libxc: introduce node maps and masks
   [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
   [ 4/10] xen: allow for explicitly specifying node-affinity
 * [ 5/10] libxc: allow for explicitly specifying node-affinity
 * [ 6/10] libxl: allow for explicitly specifying node-affinity
   [ 7/10] libxl: optimize the calculation of how many VCPUs can run on a candidate
 * [ 8/10] libxl: automatic placement deals with node-affinity
 * [ 9/10] xl: add node-affinity to the output of `xl list`
   [10/10] docs: rearrange and update NUMA placement documentation

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  9:17   ` Jan Beulich
  2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

More specifically, this patch:
 1. replaces xenctl_cpumap with xenctl_bitmap;
 2. provides bitmap_to_xenctl_bitmap and the reverse;
 3. re-implements cpumask_to_xenctl_bitmap and its reverse on top of
    the new generic helpers from #2.

Other than #3, no functional changes. The interface is only slightly
affected.

This is in preparation for introducing NUMA node-affinity maps.
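
For instance, with the generic helpers in place, a nodemask variant can be
layered on top just like the cpumask one is (only a sketch here, the actual
code comes with the next patch):

    int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap,
                                  const nodemask_t *nodemask)
    {
        return bitmap_to_xenctl_bitmap(xenctl_nodemap, nodes_addr(*nodemask),
                                       MAX_NUMNODES);
    }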

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c
--- a/tools/libxc/xc_cpupool.c
+++ b/tools/libxc/xc_cpupool.c
@@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO;
     sysctl.u.cpupool_op.cpupool_id = poolid;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
@@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter
     sysctl.cmd = XEN_SYSCTL_cpupool_op;
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
@@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc
     domctl.u.vcpuaffinity.vcpu = vcpu;
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c
--- a/tools/libxc/xc_tbuf.c
+++ b/tools/libxc/xc_tbuf.c
@@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x
     bitmap_64_to_byte(bytemap, &mask64, sizeof (mask64) * 8);
 
     set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap);
-    sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8;
+    sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8;
 
     ret = do_sysctl(xch, &sysctl);
 
diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -1457,8 +1457,7 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_m
             cpumap = &cpu_online_map;
         else
         {
-            ret = xenctl_cpumap_to_cpumask(&cmv,
-                                           &op->u.mc_inject_v2.cpumap);
+            ret = xenctl_bitmap_to_cpumask(&cmv, &op->u.mc_inject_v2.cpumap);
             if ( ret )
                 break;
             cpumap = cmv;
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -375,7 +375,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
     {
         uint32_t cpu;
         uint64_t idletime, now = NOW();
-        struct xenctl_cpumap ctlmap;
+        struct xenctl_bitmap ctlmap;
         cpumask_var_t cpumap;
         XEN_GUEST_HANDLE(uint8) cpumap_bitmap;
         XEN_GUEST_HANDLE(uint64) idletimes;
@@ -388,11 +388,11 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
         if ( cpufreq_controller != FREQCTL_dom0_kernel )
             break;
 
-        ctlmap.nr_cpus  = op->u.getidletime.cpumap_nr_cpus;
+        ctlmap.nr_elems  = op->u.getidletime.cpumap_nr_cpus;
         guest_from_compat_handle(cpumap_bitmap,
                                  op->u.getidletime.cpumap_bitmap);
         ctlmap.bitmap.p = cpumap_bitmap.p; /* handle -> handle_64 conversion */
-        if ( (ret = xenctl_cpumap_to_cpumask(&cpumap, &ctlmap)) != 0 )
+        if ( (ret = xenctl_bitmap_to_cpumask(&cpumap, &ctlmap)) != 0 )
             goto out;
         guest_from_compat_handle(idletimes, op->u.getidletime.idletime);
 
@@ -411,7 +411,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
 
         op->u.getidletime.now = now;
         if ( ret == 0 )
-            ret = cpumask_to_xenctl_cpumap(&ctlmap, cpumap);
+            ret = cpumask_to_xenctl_bitmap(&ctlmap, cpumap);
         free_cpumask_var(cpumap);
 
         if ( ret == 0 && __copy_field_to_guest(u_xenpf_op, op, u.getidletime) )
diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -493,7 +493,7 @@ int cpupool_do_sysctl(struct xen_sysctl_
         op->cpupool_id = c->cpupool_id;
         op->sched_id = c->sched->sched_id;
         op->n_dom = c->n_dom;
-        ret = cpumask_to_xenctl_cpumap(&op->cpumap, c->cpu_valid);
+        ret = cpumask_to_xenctl_bitmap(&op->cpumap, c->cpu_valid);
         cpupool_put(c);
     }
     break;
@@ -588,7 +588,7 @@ int cpupool_do_sysctl(struct xen_sysctl_
 
     case XEN_SYSCTL_CPUPOOL_OP_FREEINFO:
     {
-        ret = cpumask_to_xenctl_cpumap(
+        ret = cpumask_to_xenctl_bitmap(
             &op->cpumap, &cpupool_free_cpus);
     }
     break;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -32,28 +32,29 @@
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);
 
-int cpumask_to_xenctl_cpumap(
-    struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask)
+int bitmap_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_bitmap,
+                            const unsigned long *bitmap,
+                            unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes, i;
     uint8_t zero = 0;
     int err = 0;
-    uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
-    bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids);
+    bitmap_long_to_byte(bytemap, bitmap, nbits);
 
     if ( copy_bytes != 0 )
-        if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes) )
+        if ( copy_to_guest(xenctl_bitmap->bitmap, bytemap, copy_bytes) )
             err = -EFAULT;
 
     for ( i = copy_bytes; !err && i < guest_bytes; i++ )
-        if ( copy_to_guest_offset(xenctl_cpumap->bitmap, i, &zero, 1) )
+        if ( copy_to_guest_offset(xenctl_bitmap->bitmap, i, &zero, 1) )
             err = -EFAULT;
 
     xfree(bytemap);
@@ -61,36 +62,58 @@ int cpumask_to_xenctl_cpumap(
     return err;
 }
 
-int xenctl_cpumap_to_cpumask(
-    cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap)
+int xenctl_bitmap_to_bitmap(unsigned long *bitmap,
+                            const struct xenctl_bitmap *xenctl_bitmap,
+                            unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes;
     int err = 0;
-    uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
     if ( copy_bytes != 0 )
     {
-        if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap, copy_bytes) )
+        if ( copy_from_guest(bytemap, xenctl_bitmap->bitmap, copy_bytes) )
             err = -EFAULT;
-        if ( (xenctl_cpumap->nr_cpus & 7) && (guest_bytes == copy_bytes) )
-            bytemap[guest_bytes-1] &= ~(0xff << (xenctl_cpumap->nr_cpus & 7));
+        if ( (xenctl_bitmap->nr_elems & 7) && (guest_bytes == copy_bytes) )
+            bytemap[guest_bytes-1] &= ~(0xff << (xenctl_bitmap->nr_elems & 7));
     }
 
-    if ( err )
-        /* nothing */;
-    else if ( alloc_cpumask_var(cpumask) )
-        bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids);
+    if ( !err )
+        bitmap_byte_to_long(bitmap, bytemap, nbits);
+
+    xfree(bytemap);
+
+    return err;
+}
+
+int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_cpumap,
+                             const cpumask_t *cpumask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_cpumap, cpumask_bits(cpumask),
+                                   nr_cpu_ids);
+}
+
+int xenctl_bitmap_to_cpumask(cpumask_var_t *cpumask,
+                             const struct xenctl_bitmap *xenctl_cpumap)
+{
+    int err = 0;
+
+    if ( alloc_cpumask_var(cpumask) ) {
+        err = xenctl_bitmap_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap,
+                                      nr_cpu_ids);
+        /* In case of error, cleanup is up to us, as the caller won't care! */
+        if ( err )
+            free_cpumask_var(*cpumask);
+    }
     else
         err = -ENOMEM;
 
-    xfree(bytemap);
-
     return err;
 }
 
@@ -583,7 +606,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
         {
             cpumask_var_t new_affinity;
 
-            ret = xenctl_cpumap_to_cpumask(
+            ret = xenctl_bitmap_to_cpumask(
                 &new_affinity, &op->u.vcpuaffinity.cpumap);
             if ( !ret )
             {
@@ -593,7 +616,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
         }
         else
         {
-            ret = cpumask_to_xenctl_cpumap(
+            ret = cpumask_to_xenctl_bitmap(
                 &op->u.vcpuaffinity.cpumap, v->cpu_affinity);
         }
     }
diff --git a/xen/common/trace.c b/xen/common/trace.c
--- a/xen/common/trace.c
+++ b/xen/common/trace.c
@@ -384,7 +384,7 @@ int tb_control(xen_sysctl_tbuf_op_t *tbc
     {
         cpumask_var_t mask;
 
-        rc = xenctl_cpumap_to_cpumask(&mask, &tbc->cpu_mask);
+        rc = xenctl_bitmap_to_cpumask(&mask, &tbc->cpu_mask);
         if ( !rc )
         {
             cpumask_copy(&tb_cpu_mask, mask);
diff --git a/xen/include/public/arch-x86/xen-mca.h b/xen/include/public/arch-x86/xen-mca.h
--- a/xen/include/public/arch-x86/xen-mca.h
+++ b/xen/include/public/arch-x86/xen-mca.h
@@ -414,7 +414,7 @@ struct xen_mc_mceinject {
 
 struct xen_mc_inject_v2 {
 	uint32_t flags;
-	struct xenctl_cpumap cpumap;
+	struct xenctl_bitmap cpumap;
 };
 #endif
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -284,7 +284,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc
 /* XEN_DOMCTL_getvcpuaffinity */
 struct xen_domctl_vcpuaffinity {
     uint32_t  vcpu;              /* IN */
-    struct xenctl_cpumap cpumap; /* IN/OUT */
+    struct xenctl_bitmap cpumap; /* IN/OUT */
 };
 typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -71,7 +71,7 @@ struct xen_sysctl_tbuf_op {
 #define XEN_SYSCTL_TBUFOP_disable      5
     uint32_t cmd;
     /* IN/OUT variables */
-    struct xenctl_cpumap cpu_mask;
+    struct xenctl_bitmap cpu_mask;
     uint32_t             evt_mask;
     /* OUT variables */
     uint64_aligned_t buffer_mfn;
@@ -532,7 +532,7 @@ struct xen_sysctl_cpupool_op {
     uint32_t domid;       /* IN: M              */
     uint32_t cpu;         /* IN: AR             */
     uint32_t n_dom;       /*            OUT: I  */
-    struct xenctl_cpumap cpumap; /*     OUT: IF */
+    struct xenctl_bitmap cpumap; /*     OUT: IF */
 };
 typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t);
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16];
 #endif
 
 #ifndef __ASSEMBLY__
-struct xenctl_cpumap {
+struct xenctl_bitmap {
     XEN_GUEST_HANDLE_64(uint8) bitmap;
-    uint32_t nr_cpus;
+    uint32_t nr_elems;
 };
 #endif
 
diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
--- a/xen/include/xen/cpumask.h
+++ b/xen/include/xen/cpumask.h
@@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
 #define for_each_present_cpu(cpu)  for_each_cpu(cpu, &cpu_present_map)
 
 /* Copy to/from cpumap provided by control tools. */
-struct xenctl_cpumap;
-int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *);
-int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *);
+struct xenctl_bitmap;
+int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *, const cpumask_t *);
+int xenctl_bitmap_to_cpumask(cpumask_var_t *, const struct xenctl_bitmap *);
 
 #endif /* __XEN_CPUMASK_H */
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -2,7 +2,7 @@
 # ! - needs translation
 # ? - needs checking
 ?	dom0_vga_console_info		xen.h
-?	xenctl_cpumap			xen.h
+?	xenctl_bitmap			xen.h
 ?	mmu_update			xen.h
 !	mmuext_op			xen.h
 !	start_info			xen.h


* [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  9:18   ` Jan Beulich
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

This follows suit from the existing cpumap and cpumask implementations.
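
Just to give an idea of the intended usage of the new libxc bits (a sketch,
nothing more):

    xc_nodemap_t nodemap = xc_nodemap_alloc(xch); /* zeroed, 1 bit per node */

    if ( nodemap )
    {
        /* ... set some bits and hand the map over to a domctl/sysctl ... */
        free(nodemap);                  /* it is just calloc()-ed memory */
    }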

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -54,6 +54,11 @@ int xc_get_cpumap_size(xc_interface *xch
     return (xc_get_max_cpus(xch) + 7) / 8;
 }
 
+int xc_get_nodemap_size(xc_interface *xch)
+{
+    return (xc_get_max_nodes(xch) + 7) / 8;
+}
+
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
 {
     int sz;
@@ -64,6 +69,16 @@ xc_cpumap_t xc_cpumap_alloc(xc_interface
     return calloc(1, sz);
 }
 
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
+{
+    int sz;
+
+    sz = xc_get_nodemap_size(xch);
+    if (sz == 0)
+        return NULL;
+    return calloc(1, sz);
+}
+
 int xc_readconsolering(xc_interface *xch,
                        char *buffer,
                        unsigned int *pnr_chars,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -330,12 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch
 /* allocate a cpumap */
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
 
- /*
+/*
  * NODEMAP handling
  */
+typedef uint8_t *xc_nodemap_t;
+
 /* return maximum number of NUMA nodes the hypervisor supports */
 int xc_get_max_nodes(xc_interface *xch);
 
+/* return array size for nodemap */
+int xc_get_nodemap_size(xc_interface *xch);
+
+/* allocate a nodemap */
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch);
+
 /*
  * DOMAIN DEBUGGING FUNCTIONS
  */
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -117,6 +117,30 @@ int xenctl_bitmap_to_cpumask(cpumask_var
     return err;
 }
 
+int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap,
+                              const nodemask_t *nodemask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_nodemap, cpumask_bits(nodemask),
+                                   MAX_NUMNODES);
+}
+
+int xenctl_bitmap_to_nodemask(nodemask_t *nodemask,
+                              const struct xenctl_bitmap *xenctl_nodemap)
+{
+    int err = 0;
+
+    if ( alloc_nodemask_var(nodemask) ) {
+        err = xenctl_bitmap_to_bitmap(nodes_addr(*nodemask), xenctl_nodemap,
+                                      MAX_NUMNODES);
+        if ( err )
+            free_nodemask_var(*nodemask);
+    }
+    else
+        err = -ENOMEM;
+
+    return err;
+}
+
 static inline int is_free_domid(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
 }
 #endif
 
+/*
+ * nodemask_var_t: struct nodemask for stack usage.
+ *
+ * See definition of cpumask_var_t in include/xen//cpumask.h.
+ */
+#if MAX_NUMNODES > 2 * BITS_PER_LONG
+#include <xen/xmalloc.h>
+
+typedef nodemask_t *nodemask_var_t;
+
+#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG)
+
+static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
+{
+	*(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long));
+	return *mask != NULL;
+}
+
+static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
+{
+	*(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long));
+	return *mask != NULL;
+}
+
+static inline void free_nodemask_var(nodemask_var_t mask)
+{
+	xfree(mask);
+}
+#else
+typedef nodemask_t nodemask_var_t;
+
+static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
+{
+	return 1;
+}
+
+static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
+{
+	nodes_clear(*mask);
+	return 1;
+}
+
+static inline void free_nodemask_var(nodemask_var_t mask)
+{
+}
+#endif
+
 #if MAX_NUMNODES > 1
 #define for_each_node_mask(node, mask)			\
 	for ((node) = first_node(mask);			\


* [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
  2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  6:44   ` Juergen Gross
                     ` (3 more replies)
  2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
                   ` (8 subsequent siblings)
  11 siblings, 4 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

As vcpu-affinity tells where VCPUs must run, node-affinity tells
where they should, or rather prefer to, run. While respecting
vcpu-affinity remains mandatory, node-affinity is not that strict: it
only expresses a preference, although honouring it will almost always
bring a significant performance benefit (especially when compared to
not having any affinity at all).

This change modifies the VCPU load balancing algorithm (for the
credit scheduler only), introducing a two-step logic.
During the first step, we use the node-affinity mask. The aim is
to give precedence to the CPUs where it is known to be preferable
for the domain to run. If that fails to find a valid PCPU, the
node-affinity is just ignored and, in the second step, we fall
back to using cpu-affinity only.
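
In (simplified) code, the core of it is something like the following; the
actual implementation, with all the details, is in the hunks below:

    for_each_csched_balance_step( step )  /* node-affinity, then vcpu-affinity */
    {
        /*
         * For the node-affinity step this computes node-affinity &
         * vcpu-affinity; for the vcpu-affinity step, just the vcpu-affinity.
         * It returns -1 if the domain does not have any node-affinity at all
         * (namely, its node-affinity is automatically computed), in which
         * case a single step is enough.
         */
        ret = csched_balance_cpumask(vc, step, &mask);

        /* ... look for a suitable (e.g., idle) pCPU within mask ... */

        if ( found_one || ret )
            break;
    }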

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * CPU mask variables moved off the stack, as requested during
   review. As per the comments in the code, having them in the private
   (per-scheduler instance) struct could have been enough, but it would be
   racy (again, see comments). For that reason, we use a global bunch of
   them (via per_cpu());
 * George suggested a different load balancing logic during v1's review. I
   think he was right, so I changed the old implementation into something
   that resembles exactly that. I rewrote most of this patch to introduce
   a more sensible and effective node-affinity handling logic.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -111,6 +111,33 @@
 
 
 /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
+
+/*
+ * When building for high number of CPUs, cpumask_var_t
+ * variables on stack are better avoided. However, we need them,
+ * in order to be able to consider both vcpu and node affinity.
+ * We also don't want to xmalloc()/xfree() them, as that would
+ * happen in critical code paths. Therefore, let's (pre)allocate
+ * some scratch space for them.
+ *
+ * Having one mask for each instance of the scheduler seems
+ * enough, and that would suggest putting it wihin `struct
+ * csched_private' below. However, we don't always hold the
+ * private scheduler lock when the mask itself would need to
+ * be used, leaving room for races. For that reason, we define
+ * and use a cpumask_t for each CPU. As preemption is not an
+ * issue here (we're holding the runqueue spin-lock!), that is
+ * both enough and safe.
+ */
+DEFINE_PER_CPU(cpumask_t, csched_balance_mask);
+#define scratch_balance_mask (this_cpu(csched_balance_mask))
+
+/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -159,6 +186,9 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
+    /* cpumask translated from the domain's node-affinity.
+     * Basically, the CPUs we prefer to be scheduled on. */
+    cpumask_var_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -239,6 +269,42 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+#define for_each_csched_balance_step(__step) \
+    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
+
+/*
+ * Each csched-balance step has to use its own cpumask. This function
+ * determines which one, given the step, and copies it in mask. Notice
+ * that, in case of node-affinity balancing step, it also filters out from
+ * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
+ * as we do not want to end up running a vcpu where it would like, but
+ * is not allowed to!
+ *
+ * As an optimization, if a domain does not have any node-affinity at all
+ * (namely, its node affinity is automatically computed), not only the
+ * computed mask will reflect its vcpu-affinity, but we also return -1 to
+ * let the caller know that he can skip the step or quit the loop (if he
+ * wants).
+ */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+    {
+        struct domain *d = vc->domain;
+        struct csched_dom *sdom = CSCHED_DOM(d);
+
+        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
+
+        if ( cpumask_full(sdom->node_affinity_cpumask) )
+            return -1;
+    }
+    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
+        cpumask_copy(mask, vc->cpu_affinity);
+
+    return 0;
+}
+
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
@@ -266,67 +332,94 @@ static inline void
     struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
     struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
     cpumask_t mask, idle_mask;
-    int idlers_empty;
+    int balance_step, idlers_empty;
 
     ASSERT(cur);
-    cpumask_clear(&mask);
-
     idlers_empty = cpumask_empty(prv->idlers);
 
     /*
-     * If the pcpu is idle, or there are no idlers and the new
-     * vcpu is a higher priority than the old vcpu, run it here.
-     *
-     * If there are idle cpus, first try to find one suitable to run
-     * new, so we can avoid preempting cur.  If we cannot find a
-     * suitable idler on which to run new, run it here, but try to
-     * find a suitable idler on which to run cur instead.
+     * Node and vcpu-affinity balancing loop. To speed things up, in case
+     * no node-affinity at all is present, scratch_balance_mask reflects
+     * the vcpu-affinity, and ret is -1, so that we then can quit the
+     * loop after only one step.
      */
-    if ( cur->pri == CSCHED_PRI_IDLE
-         || (idlers_empty && new->pri > cur->pri) )
+    for_each_csched_balance_step( balance_step )
     {
-        if ( cur->pri != CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_idlers_none);
-        cpumask_set_cpu(cpu, &mask);
-    }
-    else if ( !idlers_empty )
-    {
-        /* Check whether or not there are idlers that can run new */
-        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+        int ret, new_idlers_empty;
+
+        cpumask_clear(&mask);
 
         /*
-         * If there are no suitable idlers for new, and it's higher
-         * priority than cur, ask the scheduler to migrate cur away.
-         * We have to act like this (instead of just waking some of
-         * the idlers suitable for cur) because cur is running.
+         * If the pcpu is idle, or there are no idlers and the new
+         * vcpu is a higher priority than the old vcpu, run it here.
          *
-         * If there are suitable idlers for new, no matter priorities,
-         * leave cur alone (as it is running and is, likely, cache-hot)
-         * and wake some of them (which is waking up and so is, likely,
-         * cache cold anyway).
+         * If there are idle cpus, first try to find one suitable to run
+         * new, so we can avoid preempting cur.  If we cannot find a
+         * suitable idler on which to run new, run it here, but try to
+         * find a suitable idler on which to run cur instead.
          */
-        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
+        if ( cur->pri == CSCHED_PRI_IDLE
+             || (idlers_empty && new->pri > cur->pri) )
         {
-            SCHED_STAT_CRANK(tickle_idlers_none);
-            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
-            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
-            SCHED_STAT_CRANK(migrate_kicked_away);
-            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+            if ( cur->pri != CSCHED_PRI_IDLE )
+                SCHED_STAT_CRANK(tickle_idlers_none);
             cpumask_set_cpu(cpu, &mask);
         }
-        else if ( !cpumask_empty(&idle_mask) )
+        else if ( !idlers_empty )
         {
-            /* Which of the idlers suitable for new shall we wake up? */
-            SCHED_STAT_CRANK(tickle_idlers_some);
-            if ( opt_tickle_one_idle )
+            /* Are there idlers suitable for new (for this balance step)? */
+            ret = csched_balance_cpumask(new->vcpu, balance_step,
+                                         &scratch_balance_mask);
+            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
+            new_idlers_empty = cpumask_empty(&idle_mask);
+
+            /*
+             * Let's not be too harsh! If there aren't idlers suitable
+             * for new in its node-affinity mask, make sure we check its
+             * vcpu-affinity as well, before tacking final decisions.
+             */
+            if ( new_idlers_empty
+                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
+                continue;
+
+            /*
+             * If there are no suitable idlers for new, and it's higher
+             * priority than cur, ask the scheduler to migrate cur away.
+             * We have to act like this (instead of just waking some of
+             * the idlers suitable for cur) because cur is running.
+             *
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway).
+             */
+            if ( new_idlers_empty && new->pri > cur->pri )
             {
-                this_cpu(last_tickle_cpu) =
-                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                SCHED_STAT_CRANK(tickle_idlers_none);
+                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+                SCHED_STAT_CRANK(migrate_kicked_away);
+                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+                cpumask_set_cpu(cpu, &mask);
             }
-            else
-                cpumask_or(&mask, &mask, &idle_mask);
+            else if ( !new_idlers_empty )
+            {
+                /* Which of the idlers suitable for new shall we wake up? */
+                SCHED_STAT_CRANK(tickle_idlers_some);
+                if ( opt_tickle_one_idle )
+                {
+                    this_cpu(last_tickle_cpu) =
+                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                }
+                else
+                    cpumask_or(&mask, &mask, &idle_mask);
+            }
         }
+
+        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
+        if ( !cpumask_empty(&mask) || ret )
+            break;
     }
 
     if ( !cpumask_empty(&mask) )
@@ -475,15 +568,28 @@ static inline int
 }
 
 static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
      * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that's allowed to run on our CPU.
+     * peer PCPU. Only pick up work that prefers and/or is allowed to run
+     * on our CPU.
      */
     return !vc->is_running &&
            !__csched_vcpu_is_cache_hot(vc) &&
-           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+           cpumask_test_cpu(dest_cpu, mask);
+}
+
+static inline int
+__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
+{
+    /*
+     * Consent to migration if cpu is one of the idlers in the VCPU's
+     * affinity mask. In fact, if that is not the case, it just means it
+     * was some other CPU that was tickled and should hence come and pick
+     * VCPU up. Migrating it to cpu would only make things worse.
+     */
+    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
 }
 
 static int
@@ -493,85 +599,98 @@ static int
     cpumask_t idlers;
     cpumask_t *online;
     struct csched_pcpu *spc = NULL;
+    int ret, balance_step;
     int cpu;
 
-    /*
-     * Pick from online CPUs in VCPU's affinity mask, giving a
-     * preference to its current processor if it's in there.
-     */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
-    cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
-            ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
-    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+    for_each_csched_balance_step( balance_step )
+    {
+        /* Pick an online CPU from the proper affinity mask */
+        ret = csched_balance_cpumask(vc, balance_step, &cpus);
+        cpumask_and(&cpus, &cpus, online);
 
-    /*
-     * Try to find an idle processor within the above constraints.
-     *
-     * In multi-core and multi-threaded CPUs, not all idle execution
-     * vehicles are equal!
-     *
-     * We give preference to the idle execution vehicle with the most
-     * idling neighbours in its grouping. This distributes work across
-     * distinct cores first and guarantees we don't do something stupid
-     * like run two VCPUs on co-hyperthreads while there are idle cores
-     * or sockets.
-     *
-     * Notice that, when computing the "idleness" of cpu, we may want to
-     * discount vc. That is, iff vc is the currently running and the only
-     * runnable vcpu on cpu, we add cpu to the idlers.
-     */
-    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
-        cpumask_set_cpu(cpu, &idlers);
-    cpumask_and(&cpus, &cpus, &idlers);
-    cpumask_clear_cpu(cpu, &cpus);
+        /* If present, prefer vc's current processor */
+        cpu = cpumask_test_cpu(vc->processor, &cpus)
+                ? vc->processor
+                : cpumask_cycle(vc->processor, &cpus);
+        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
 
-    while ( !cpumask_empty(&cpus) )
-    {
-        cpumask_t cpu_idlers;
-        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
-        int migrate_factor;
+        /*
+         * Try to find an idle processor within the above constraints.
+         *
+         * In multi-core and multi-threaded CPUs, not all idle execution
+         * vehicles are equal!
+         *
+         * We give preference to the idle execution vehicle with the most
+         * idling neighbours in its grouping. This distributes work across
+         * distinct cores first and guarantees we don't do something stupid
+         * like run two VCPUs on co-hyperthreads while there are idle cores
+         * or sockets.
+         *
+         * Notice that, when computing the "idleness" of cpu, we may want to
+         * discount vc. That is, iff vc is the currently running and the only
+         * runnable vcpu on cpu, we add cpu to the idlers.
+         */
+        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
+        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
+            cpumask_set_cpu(cpu, &idlers);
+        cpumask_and(&cpus, &cpus, &idlers);
+        /* If there are idlers and cpu is still not among them, pick one */
+        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
+            cpu = cpumask_cycle(cpu, &cpus);
+        cpumask_clear_cpu(cpu, &cpus);
 
-        nxt = cpumask_cycle(cpu, &cpus);
+        while ( !cpumask_empty(&cpus) )
+        {
+            cpumask_t cpu_idlers;
+            cpumask_t nxt_idlers;
+            int nxt, weight_cpu, weight_nxt;
+            int migrate_factor;
 
-        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
-        {
-            /* We're on the same socket, so check the busy-ness of threads.
-             * Migrate if # of idlers is less at all */
-            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 1;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
-        }
-        else
-        {
-            /* We're on different sockets, so check the busy-ness of cores.
-             * Migrate only if the other core is twice as idle */
-            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 2;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            nxt = cpumask_cycle(cpu, &cpus);
+
+            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
+            {
+                /* We're on the same socket, so check the busy-ness of threads.
+                 * Migrate if # of idlers is less at all */
+                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 1;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            nxt));
+            }
+            else
+            {
+                /* We're on different sockets, so check the busy-ness of cores.
+                 * Migrate only if the other core is twice as idle */
+                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 2;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            }
+
+            weight_cpu = cpumask_weight(&cpu_idlers);
+            weight_nxt = cpumask_weight(&nxt_idlers);
+            /* smt_power_savings: consolidate work rather than spreading it */
+            if ( sched_smt_power_savings ?
+                 weight_cpu > weight_nxt :
+                 weight_cpu * migrate_factor < weight_nxt )
+            {
+                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
+                spc = CSCHED_PCPU(nxt);
+                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
+                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
+            }
+            else
+            {
+                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
+            }
         }
 
-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
-        /* smt_power_savings: consolidate work rather than spreading it */
-        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
-        {
-            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
-            spc = CSCHED_PCPU(nxt);
-            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
-            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
-        }
-        else
-        {
-            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
-        }
+        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
+        if ( cpumask_test_cpu(cpu, &idlers) || ret )
+            break;
     }
 
     if ( commit && spc )
@@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu
     if ( sdom == NULL )
         return NULL;
 
+    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
+    {
+        xfree(sdom);
+        return NULL;
+    }
+    cpumask_setall(sdom->node_affinity_cpumask);
+
     /* Initialize credit and weight */
     INIT_LIST_HEAD(&sdom->active_vcpu);
     sdom->active_vcpu_count = 0;
@@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler *
 static void
 csched_free_domdata(const struct scheduler *ops, void *data)
 {
+    struct csched_dom *sdom = data;
+
+    free_cpumask_var(sdom->node_affinity_cpumask);
     xfree(data);
 }
 
@@ -1240,9 +1369,10 @@ csched_tick(void *_cpu)
 }
 
 static struct csched_vcpu *
-csched_runq_steal(int peer_cpu, int cpu, int pri)
+csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
+    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
     const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
@@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
             if ( speer->pri <= pri )
                 break;
 
-            /* Is this VCPU is runnable on our PCPU? */
+            /* Is this VCPU runnable on our PCPU? */
             vc = speer->vcpu;
             BUG_ON( is_idle_vcpu(vc) );
 
-            if (__csched_vcpu_is_migrateable(vc, cpu))
+            /*
+             * Retrieve the correct mask for this balance_step or, if we're
+             * dealing with node-affinity and the vcpu has no node affinity
+             * at all, just skip this vcpu. That is needed if we want to
+             * check if we have any node-affine work to steal first (wrt
+             * any vcpu-affine work).
+             */
+            if ( csched_balance_cpumask(vc, balance_step,
+                                        &scratch_balance_mask) )
+                continue;
+
+            if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask)
+                 && __csched_vcpu_should_migrate(cpu, &scratch_balance_mask,
+                                                 prv->idlers) )
             {
                 /* We got a candidate. Grab it! */
                 TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
@@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu;
+    int peer_cpu, peer_node, bstep;
+    int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
     online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
@@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
         SCHED_STAT_CRANK(load_balance_other);
 
     /*
-     * Peek at non-idling CPUs in the system, starting with our
-     * immediate neighbour.
+     * Let's look around for work to steal, taking both vcpu-affinity
+     * and node-affinity into account. More specifically, we check all
+     * the non-idle CPUs' runq, looking for:
+     *  1. any node-affine work to steal first,
+     *  2. if not finding anything, any vcpu-affine work to steal.
      */
-    cpumask_andnot(&workers, online, prv->idlers);
-    cpumask_clear_cpu(cpu, &workers);
-    peer_cpu = cpu;
+    for_each_csched_balance_step( bstep )
+    {
+        /*
+         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+         * it is more likely that we find some node-affine work on our same
+         * node, not to mention that migrating vcpus within the same node
+         * could well expected to be cheaper than across-nodes (memory
+         * stays local, there might be some node-wide cache[s], etc.).
+         */
+        peer_node = node;
+        do
+        {
+            /* Find out what the !idle are in this node */
+            cpumask_andnot(&workers, online, prv->idlers);
+            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+            cpumask_clear_cpu(cpu, &workers);
 
-    while ( !cpumask_empty(&workers) )
-    {
-        peer_cpu = cpumask_cycle(peer_cpu, &workers);
-        cpumask_clear_cpu(peer_cpu, &workers);
+            if ( cpumask_empty(&workers) )
+                goto next_node;
 
-        /*
-         * Get ahold of the scheduler lock for this peer CPU.
-         *
-         * Note: We don't spin on this lock but simply try it. Spinning could
-         * cause a deadlock if the peer CPU is also load balancing and trying
-         * to lock this CPU.
-         */
-        if ( !pcpu_schedule_trylock(peer_cpu) )
-        {
-            SCHED_STAT_CRANK(steal_trylock_failed);
-            continue;
-        }
+            peer_cpu = cpumask_first(&workers);
+            do
+            {
+                /*
+                 * Get ahold of the scheduler lock for this peer CPU.
+                 *
+                 * Note: We don't spin on this lock but simply try it. Spinning
+                 * could cause a deadlock if the peer CPU is also load
+                 * balancing and trying to lock this CPU.
+                 */
+                if ( !pcpu_schedule_trylock(peer_cpu) )
+                {
+                    SCHED_STAT_CRANK(steal_trylock_failed);
+                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
+                    continue;
+                }
 
-        /*
-         * Any work over there to steal?
-         */
-        speer = cpumask_test_cpu(peer_cpu, online) ?
-            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
-        pcpu_schedule_unlock(peer_cpu);
-        if ( speer != NULL )
-        {
-            *stolen = 1;
-            return speer;
-        }
+                /* Any work over there to steal? */
+                speer = cpumask_test_cpu(peer_cpu, online) ?
+                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
+                pcpu_schedule_unlock(peer_cpu);
+
+                /* As soon as one vcpu is found, balancing ends */
+                if ( speer != NULL )
+                {
+                    *stolen = 1;
+                    return speer;
+                }
+
+                peer_cpu = cpumask_cycle(peer_cpu, &workers);
+
+            } while( peer_cpu != cpumask_first(&workers) );
+
+ next_node:
+            peer_node = cycle_node(peer_node, node_online_map);
+        } while( peer_node != node );
     }
 
  out:
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -41,6 +41,8 @@
  * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
  * int first_unset_node(mask)		First node not set in mask, or 
  *					MAX_NUMNODES.
+ * int cycle_node(node, mask)		Next node cycling from 'node', or
+ *					MAX_NUMNODES
  *
  * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
  * NODE_MASK_ALL			Initializer - all bits set
@@ -254,6 +256,16 @@ static inline int __first_unset_node(con
 			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
 }
 
+#define cycle_node(n, src) __cycle_node((n), &(src), MAX_NUMNODES)
+static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
+{
+    int nxt = __next_node(n, maskp, nbits);
+
+    if (nxt == nbits)
+        nxt = __first_node(maskp, nbits);
+    return nxt;
+}
+
 #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
 
 #if MAX_NUMNODES <= BITS_PER_LONG


* [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (2 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 15:17   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Make it possible to pass the node-affinity of a domain to the hypervisor
from the upper layers, instead of always having it computed automatically.

Note that this also required generalizing the Flask hooks for setting
and getting the affinity, so that they now deal with both vcpu and
node affinity.
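
Interface-wise, the upper layers get a new pair of domctls
(XEN_DOMCTL_setnodeaffinity / XEN_DOMCTL_getnodeaffinity) carrying a
xenctl_bitmap of nodes, roughly like this (just a sketch of the public
interface bits, the exact struct name/layout is in the domctl.h hunk):

    /* XEN_DOMCTL_setnodeaffinity / XEN_DOMCTL_getnodeaffinity */
    struct xen_domctl_nodeaffinity {
        struct xenctl_bitmap nodemap;  /* IN for set, OUT for get */
    };

Setting a full nodemap means going back to the automatically computed
node-affinity (see domain_set_node_affinity() below).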

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * added the missing dummy hook for nodeaffinity;
 * let the permission renaming affect flask policies too.

diff --git a/tools/flask/policy/policy/flask/access_vectors b/tools/flask/policy/policy/flask/access_vectors
--- a/tools/flask/policy/policy/flask/access_vectors
+++ b/tools/flask/policy/policy/flask/access_vectors
@@ -47,8 +47,8 @@ class domain
     transition
     max_vcpus
     destroy
-    setvcpuaffinity
-	getvcpuaffinity
+    setaffinity
+	getaffinity
 	scheduler
 	getdomaininfo
 	getvcpuinfo
diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls
--- a/tools/flask/policy/policy/mls
+++ b/tools/flask/policy/policy/mls
@@ -70,11 +70,11 @@ mlsconstrain domain transition
 	(( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv)));
 
 # all the domain "read" ops
-mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
+mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
 	((l1 dom l2) or (t1 == mls_priv));
 
 # all the domain "write" ops
-mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
+mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
 	((l1 eq l2) or (t1 == mls_priv));
 
 # This is incomplete - similar constraints must be written for all classes
diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if
--- a/tools/flask/policy/policy/modules/xen/xen.if
+++ b/tools/flask/policy/policy/modules/xen/xen.if
@@ -55,9 +55,9 @@ define(`create_domain_build_label', `
 # manage_domain(priv, target)
 #   Allow managing a running domain
 define(`manage_domain', `
-	allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity
+	allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
 			getaddrsize pause unpause trigger shutdown destroy
-			setvcpuaffinity setdomainmaxmem };
+			setaffinity setdomainmaxmem };
 ')
 
 # migrate_domain_out(priv, target)
diff --git a/xen/common/domain.c b/xen/common/domain.c
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -222,6 +222,7 @@ struct domain *domain_create(
 
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
+    d->auto_node_affinity = 1;
 
     spin_lock_init(&d->shutdown_lock);
     d->shutdown_code = -1;
@@ -362,11 +363,26 @@ void domain_update_node_affinity(struct 
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
-    for_each_online_node ( node )
-        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
-            node_set(node, nodemask);
+    if ( d->auto_node_affinity )
+    {
+        /* Node-affinity is automatically computed from all vcpu-affinities */
+        for_each_online_node ( node )
+            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_set(node, nodemask);
 
-    d->node_affinity = nodemask;
+        d->node_affinity = nodemask;
+    }
+    else
+    {
+        /* Node-affinity is provided by someone else; just filter out the
+         * nodes with no online cpu in the affinity of any of the vcpus. */
+        for_each_node_mask ( node, d->node_affinity )
+            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_clear(node, d->node_affinity);
+    }
+
+    sched_set_node_affinity(d, &d->node_affinity);
+
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -374,6 +390,36 @@ void domain_update_node_affinity(struct 
 }
 
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
+{
+    /* Being affine with no nodes is just wrong */
+    if ( nodes_empty(*affinity) )
+        return -EINVAL;
+
+    spin_lock(&d->node_affinity_lock);
+
+    /*
+     * Being/becoming explicitly affine to all nodes is not particularly
+     * useful. Let's take it as the `reset node affinity` command.
+     */
+    if ( nodes_full(*affinity) )
+    {
+        d->auto_node_affinity = 1;
+        goto out;
+    }
+
+    d->auto_node_affinity = 0;
+    d->node_affinity = *affinity;
+
+out:
+    spin_unlock(&d->node_affinity_lock);
+
+    domain_update_node_affinity(d);
+
+    return 0;
+}
+
+
 struct domain *get_domain_by_id(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -609,6 +609,40 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
     }
     break;
 
+    case XEN_DOMCTL_setnodeaffinity:
+    case XEN_DOMCTL_getnodeaffinity:
+    {
+        domid_t dom = op->domain;
+        struct domain *d = rcu_lock_domain_by_id(dom);
+
+        ret = -ESRCH;
+        if ( d == NULL )
+            break;
+
+        ret = xsm_nodeaffinity(op->cmd, d);
+        if ( ret )
+            goto nodeaffinity_out;
+
+        if ( op->cmd == XEN_DOMCTL_setnodeaffinity )
+        {
+            nodemask_t new_affinity;
+
+            ret = xenctl_bitmap_to_nodemask(&new_affinity,
+                                            &op->u.nodeaffinity.nodemap);
+            if ( !ret )
+                ret = domain_set_node_affinity(d, &new_affinity);
+        }
+        else
+        {
+            ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
+                                            &d->node_affinity);
+        }
+
+    nodeaffinity_out:
+        rcu_unlock_domain(d);
+    }
+    break;
+
     case XEN_DOMCTL_setvcpuaffinity:
     case XEN_DOMCTL_getvcpuaffinity:
     {
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -217,6 +217,14 @@ static void cpuset_print(char *set, int 
     *set++ = '\0';
 }
 
+static void nodeset_print(char *set, int size, const nodemask_t *mask)
+{
+    *set++ = '[';
+    set += nodelist_scnprintf(set, size-2, mask);
+    *set++ = ']';
+    *set++ = '\0';
+}
+
 static void periodic_timer_print(char *str, int size, uint64_t period)
 {
     if ( period == 0 )
@@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
 
         dump_pageframe_info(d);
                
+        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
+        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
+
         printk("VCPU information and callbacks for domain %u:\n",
                d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -269,6 +269,33 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+/*
+ * Translates a node-affinity mask into a cpumask, so that it can be used
+ * during actual scheduling. The result contains all the cpus of all the
+ * nodes set in the original node-affinity mask.
+ *
+ * Note that any serialization needed to access the mask safely is entirely
+ * the responsibility of the caller of this function/hook.
+ */
+static void csched_set_node_affinity(
+    const struct scheduler *ops,
+    struct domain *d,
+    nodemask_t *mask)
+{
+    struct csched_dom *sdom;
+    int node;
+
+    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
+    if ( unlikely(is_idle_domain(d)) )
+        return;
+
+    sdom = CSCHED_DOM(d);
+    cpumask_clear(sdom->node_affinity_cpumask);
+    for_each_node_mask( node, *mask )
+        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
+                   &node_to_cpumask(node));
+}
+
 #define for_each_csched_balance_step(__step) \
     for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
 
@@ -296,7 +323,8 @@ csched_balance_cpumask(const struct vcpu
 
         cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
 
-        if ( cpumask_full(sdom->node_affinity_cpumask) )
+        if ( cpumask_full(sdom->node_affinity_cpumask) ||
+             d->auto_node_affinity == 1 )
             return -1;
     }
     else /* step == CSCHED_BALANCE_CPU_AFFINITY */
@@ -1896,6 +1924,8 @@ const struct scheduler sched_credit_def 
     .adjust         = csched_dom_cntl,
     .adjust_global  = csched_sys_cntl,
 
+    .set_node_affinity  = csched_set_node_affinity,
+
     .pick_cpu       = csched_cpu_pick,
     .do_schedule    = csched_schedule,
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c
     return ret;
 }
 
+void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
+{
+    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
+}
+
 int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
 {
     cpumask_t online_affinity;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
 
 
+/* Get/set the NUMA node(s) with which the guest has affinity. */
+/* XEN_DOMCTL_setnodeaffinity */
+/* XEN_DOMCTL_getnodeaffinity */
+struct xen_domctl_nodeaffinity {
+    struct xenctl_bitmap nodemap;/* IN */
+};
+typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t);
+
+
 /* Get/set which physical cpus a vcpu can execute on. */
 /* XEN_DOMCTL_setvcpuaffinity */
 /* XEN_DOMCTL_getvcpuaffinity */
@@ -907,6 +917,8 @@ struct xen_domctl {
 #define XEN_DOMCTL_audit_p2m                     65
 #define XEN_DOMCTL_set_virq_handler              66
 #define XEN_DOMCTL_set_broken_page_p2m           67
+#define XEN_DOMCTL_setnodeaffinity               68
+#define XEN_DOMCTL_getnodeaffinity               69
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -920,6 +932,7 @@ struct xen_domctl {
         struct xen_domctl_getpageframeinfo  getpageframeinfo;
         struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
         struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
+        struct xen_domctl_nodeaffinity      nodeaffinity;
         struct xen_domctl_vcpuaffinity      vcpuaffinity;
         struct xen_domctl_shadow_op         shadow_op;
         struct xen_domctl_max_mem           max_mem;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -8,8 +8,9 @@
  * See detailed comments in the file linux/bitmap.h describing the
  * data type on which these nodemasks are based.
  *
- * For details of nodemask_scnprintf() and nodemask_parse(),
- * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of nodemask_scnprintf(), nodelist_scnprintf() and
+ * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
+ * in lib/bitmap.c.
  *
  * The available nodemask operations are:
  *
@@ -50,6 +51,7 @@
  * unsigned long *nodes_addr(mask)	Array of unsigned long's in mask
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
+ * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
  * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
  *
  * for_each_node_mask(node, mask)	for-loop node over mask
@@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co
 
 #define nodes_addr(src) ((src).bits)
 
+#define nodelist_scnprintf(buf, len, src) \
+			__nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
+static inline int __nodelist_scnprintf(char *buf, int len,
+					const nodemask_t *srcp, int nbits)
+{
+	return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
 #if 0
 #define nodemask_scnprintf(buf, len, src) \
 			__nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -184,6 +184,8 @@ struct scheduler {
                                     struct xen_domctl_scheduler_op *);
     int          (*adjust_global)  (const struct scheduler *,
                                     struct xen_sysctl_scheduler_op *);
+    void         (*set_node_affinity) (const struct scheduler *,
+                                       struct domain *, nodemask_t *);
     void         (*dump_settings)  (const struct scheduler *);
     void         (*dump_cpu_state) (const struct scheduler *, int);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -359,8 +359,12 @@ struct domain
     /* Various mem_events */
     struct mem_event_per_domain *mem_event;
 
-    /* Currently computed from union of all vcpu cpu-affinity masks. */
+    /*
+     * Can be specified by the user. If that is not the case, it is
+     * computed from the union of all the vcpu cpu-affinity masks.
+     */
     nodemask_t node_affinity;
+    int auto_node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
 };
@@ -429,6 +433,7 @@ static inline void get_knownalive_domain
     ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
 void domain_update_node_affinity(struct domain *d);
 
 struct domain *domain_create(
@@ -543,6 +548,7 @@ void sched_destroy_domain(struct domain 
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
 long sched_adjust_global(struct xen_sysctl_scheduler_op *);
+void sched_set_node_affinity(struct domain *, nodemask_t *);
 int  sched_id(void);
 void sched_tick_suspend(void);
 void sched_tick_resume(void);
diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -56,6 +56,7 @@ struct xsm_operations {
     int (*domain_create) (struct domain *d, u32 ssidref);
     int (*max_vcpus) (struct domain *d);
     int (*destroydomain) (struct domain *d);
+    int (*nodeaffinity) (int cmd, struct domain *d);
     int (*vcpuaffinity) (int cmd, struct domain *d);
     int (*scheduler) (struct domain *d);
     int (*getdomaininfo) (struct domain *d);
@@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str
     return xsm_call(destroydomain(d));
 }
 
+static inline int xsm_nodeaffinity (int cmd, struct domain *d)
+{
+    return xsm_call(nodeaffinity(cmd, d));
+}
+
 static inline int xsm_vcpuaffinity (int cmd, struct domain *d)
 {
     return xsm_call(vcpuaffinity(cmd, d));
diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -54,6 +54,11 @@ static int dummy_destroydomain (struct d
     return 0;
 }
 
+static int dummy_nodeaffinity (int cmd, struct domain *d)
+{
+    return 0;
+}
+
 static int dummy_vcpuaffinity (int cmd, struct domain *d)
 {
     return 0;
@@ -634,6 +639,7 @@ void xsm_fixup_ops (struct xsm_operation
     set_to_dummy_if_null(ops, domain_create);
     set_to_dummy_if_null(ops, max_vcpus);
     set_to_dummy_if_null(ops, destroydomain);
+    set_to_dummy_if_null(ops, nodeaffinity);
     set_to_dummy_if_null(ops, vcpuaffinity);
     set_to_dummy_if_null(ops, scheduler);
     set_to_dummy_if_null(ops, getdomaininfo);
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -521,17 +521,19 @@ static int flask_destroydomain(struct do
                            DOMAIN__DESTROY);
 }
 
-static int flask_vcpuaffinity(int cmd, struct domain *d)
+static int flask_affinity(int cmd, struct domain *d)
 {
     u32 perm;
 
     switch ( cmd )
     {
     case XEN_DOMCTL_setvcpuaffinity:
-        perm = DOMAIN__SETVCPUAFFINITY;
+    case XEN_DOMCTL_setnodeaffinity:
+        perm = DOMAIN__SETAFFINITY;
         break;
     case XEN_DOMCTL_getvcpuaffinity:
-        perm = DOMAIN__GETVCPUAFFINITY;
+    case XEN_DOMCTL_getnodeaffinity:
+        perm = DOMAIN__GETAFFINITY;
         break;
     default:
         return -EPERM;
@@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops =
     .domain_create = flask_domain_create,
     .max_vcpus = flask_max_vcpus,
     .destroydomain = flask_destroydomain,
-    .vcpuaffinity = flask_vcpuaffinity,
+    .nodeaffinity = flask_affinity,
+    .vcpuaffinity = flask_affinity,
     .scheduler = flask_scheduler,
     .getdomaininfo = flask_getdomaininfo,
     .getvcpucontext = flask_getvcpucontext,
diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h
--- a/xen/xsm/flask/include/av_perm_to_string.h
+++ b/xen/xsm/flask/include/av_perm_to_string.h
@@ -37,8 +37,8 @@
    S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition")
    S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus")
    S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy")
-   S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity")
-   S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity")
+   S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity")
+   S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity")
    S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler")
    S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo")
    S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo")
diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h
--- a/xen/xsm/flask/include/av_permissions.h
+++ b/xen/xsm/flask/include/av_permissions.h
@@ -38,8 +38,8 @@
 #define DOMAIN__TRANSITION                        0x00000020UL
 #define DOMAIN__MAX_VCPUS                         0x00000040UL
 #define DOMAIN__DESTROY                           0x00000080UL
-#define DOMAIN__SETVCPUAFFINITY                   0x00000100UL
-#define DOMAIN__GETVCPUAFFINITY                   0x00000200UL
+#define DOMAIN__SETAFFINITY                       0x00000100UL
+#define DOMAIN__GETAFFINITY                       0x00000200UL
 #define DOMAIN__SCHEDULER                         0x00000400UL
 #define DOMAIN__GETDOMAININFO                     0x00000800UL
 #define DOMAIN__GETVCPUINFO                       0x00001000UL

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (3 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 15:19   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

By providing the proper get/set interfaces and wiring them up
to the new domctls introduced in the previous commit.
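
A hedged usage sketch (not part of the patch): set a domain's node-affinity
to node 0 and read it back through the new calls. The nodemap is allocated
by hand via xc_get_nodemap_size(); error handling and #includes are kept
minimal.

    int set_node0_affinity(xc_interface *xch, uint32_t domid)
    {
        int nodesize = xc_get_nodemap_size(xch);
        xc_nodemap_t map = calloc(nodesize, 1);
        int rc = -1;

        if (!nodesize || !map)
            goto out;

        map[0] |= 1;                              /* bit 0 <-> node 0 */
        rc = xc_domain_node_setaffinity(xch, domid, map);
        if (!rc)
            rc = xc_domain_node_getaffinity(xch, domid, map);

     out:
        free(map);
        return rc;
    }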

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
 }
 
 
+int xc_domain_node_setaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
+    domctl.domain = (domid_t)domid;
+
+    memcpy(local, nodemap, nodesize);
+    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
+    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
+int xc_domain_node_getaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
+    domctl.domain = (domid_t)domid;
+
+    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
+    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    memcpy(nodemap, local, nodesize);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
 		uint32_t id,
 		uint32_t timeout);
 
+/**
+ * This function explicitly sets the host NUMA nodes the domain will
+ * have affinity with.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id one wants to set the affinity of.
+ * @parm nodemap the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_setaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap);
+
+/**
+ * This function retrieves the host NUMA nodes the domain has
+ * affinity with.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id one wants to get the node affinity of.
+ * @parm nodemap the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_getaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap);
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (4 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 15:30   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

By introducing a nodemap in libxl_domain_build_info and
providing the get/set methods to deal with it.
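
A hedged sketch (not part of the patch) of how a libxl client could use the
new calls; it assumes an already initialised libxl_ctx and restricts the
domain's node-affinity to node 0:

    int set_node0_affinity(libxl_ctx *ctx, uint32_t domid)
    {
        libxl_bitmap nodemap;
        int rc;

        libxl_bitmap_init(&nodemap);
        rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0);
        if (rc)
            return rc;

        libxl_bitmap_set(&nodemap, 0);            /* node 0 only */
        rc = libxl_domain_set_nodeaffinity(ctx, domid, &nodemap);

        libxl_bitmap_dispose(&nodemap);
        return rc;
    }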

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4142,6 +4142,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
     return rc;
 }
 
+int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap)
+{
+    if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
+int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap)
+{
+    if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -861,6 +861,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
                            libxl_bitmap *cpumap);
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
                                unsigned int max_vcpus, libxl_bitmap *cpumap);
+int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap);
+int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap);
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap);
 
 libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault(
 
     libxl_defbool_setdefault(&b_info->numa_placement, true);
 
+    if (!b_info->nodemap.size) {
+        if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0))
+            return ERROR_FAIL;
+        libxl_bitmap_set_any(&b_info->nodemap);
+    }
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint
         if (rc)
             return rc;
     }
+    libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
 
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
     ("cpumap",          libxl_bitmap),
+    ("nodemap",         libxl_bitmap),
     ("numa_placement",  libxl_defbool),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (5 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  8:41   ` Ian Campbell
  2012-12-21 16:00   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

To choose the best NUMA placement candidate, we need to figure out
how many VCPUs are runnable on each candidate. That requires going
through all the VCPUs of all the domains and checking their affinities.

With this change, instead of doing the above for each candidate, we
do it just once, populating an array while counting. This way, when
we later evaluate the candidates, all we need to do is sum up the right
elements of the array itself.

This reduces the complexity of the overall algorithm, as it moves a
potentially expensive operation (for_each_vcpu_of_each_domain {})
out of the core placement loop, so that it is performed only

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -165,15 +165,27 @@ static uint32_t nodemap_to_free_memkb(li
     return free_memkb;
 }
 
-/* Retrieve the number of vcpus able to run on the cpus of the nodes
- * that are part of the nodemap. */
-static int nodemap_to_nr_vcpus(libxl__gc *gc, libxl_cputopology *tinfo,
+/* Retrieve the number of vcpus able to run on the nodes in nodemap */
+static int nodemap_to_nr_vcpus(libxl__gc *gc, int vcpus_on_node[],
                                const libxl_bitmap *nodemap)
 {
+    int i, nr_vcpus = 0;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        nr_vcpus += vcpus_on_node[i];
+
+    return nr_vcpus;
+}
+
+/* Number of vcpus able to run on the cpus of the various nodes
+ * (reported by filling the array vcpus_on_node[]). */
+static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
+                             const libxl_bitmap *suitable_cpumap,
+                             int vcpus_on_node[])
+{
     libxl_dominfo *dinfo = NULL;
     libxl_bitmap vcpu_nodemap;
     int nr_doms, nr_cpus;
-    int nr_vcpus = 0;
     int i, j, k;
 
     dinfo = libxl_list_domain(CTX, &nr_doms);
@@ -193,19 +205,17 @@ static int nodemap_to_nr_vcpus(libxl__gc
         if (vinfo == NULL)
             continue;
 
-        /* For each vcpu of each domain ... */
         for (j = 0; j < nr_dom_vcpus; j++) {
+            /* For each vcpu of each domain, increment the elements of
+             * the array corresponding to the nodes where the vcpu runs */
+            libxl_bitmap_set_none(&vcpu_nodemap);
+            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
+                int node = tinfo[k].node;
 
-            /* Build up a map telling on which nodes the vcpu is runnable on */
-            libxl_bitmap_set_none(&vcpu_nodemap);
-            libxl_for_each_set_bit(k, vinfo[j].cpumap)
-                libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node);
-
-            /* And check if that map has any intersection with our nodemap */
-            libxl_for_each_set_bit(k, vcpu_nodemap) {
-                if (libxl_bitmap_test(nodemap, k)) {
-                    nr_vcpus++;
-                    break;
+                if (libxl_bitmap_test(suitable_cpumap, k) &&
+                    !libxl_bitmap_test(&vcpu_nodemap, node)) {
+                    libxl_bitmap_set(&vcpu_nodemap, node);
+                    vcpus_on_node[node]++;
                 }
             }
         }
@@ -215,7 +225,7 @@ static int nodemap_to_nr_vcpus(libxl__gc
 
     libxl_bitmap_dispose(&vcpu_nodemap);
     libxl_dominfo_list_free(dinfo, nr_doms);
-    return nr_vcpus;
+    return 0;
 }
 
 /*
@@ -270,7 +280,7 @@ int libxl__get_numa_candidate(libxl__gc 
     libxl_numainfo *ninfo = NULL;
     int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0;
     libxl_bitmap suitable_nodemap, nodemap;
-    int rc = 0;
+    int *vcpus_on_node, rc = 0;
 
     libxl_bitmap_init(&nodemap);
     libxl_bitmap_init(&suitable_nodemap);
@@ -281,6 +291,8 @@ int libxl__get_numa_candidate(libxl__gc 
     if (ninfo == NULL)
         return ERROR_FAIL;
 
+    GCNEW_ARRAY(vcpus_on_node, nr_nodes);
+
     /*
      * The good thing about this solution is that it is based on heuristics
      * (implemented in numa_cmpf() ), but we at least can evaluate it on
@@ -330,6 +342,19 @@ int libxl__get_numa_candidate(libxl__gc 
         goto out;
 
     /*
+     * Later on, we will try to figure out how many vcpus are runnable on
+     * each candidate (as a part of choosing the best one of them). That
+     * requires going through all the vcpus of all the domains and check
+     * their affinities. So, instead of doing that for each candidate,
+     * let's count here the number of vcpus runnable on each node, so that
+     * all we have to do later is summing up the right elements of the
+     * vcpus_on_node array.
+     */
+    rc = nr_vcpus_on_nodes(gc, tinfo, suitable_cpumap, vcpus_on_node);
+    if (rc)
+        goto out;
+
+    /*
      * If the minimum number of NUMA nodes is not explicitly specified
      * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
      * from where to start generating candidates, if possible (or just start
@@ -414,7 +439,8 @@ int libxl__get_numa_candidate(libxl__gc 
              * current best one (if any).
              */
             libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
-            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, vcpus_on_node,
+                                                    &nodemap);
             new_cndt.free_memkb = nodes_free_memkb;
             new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
             new_cndt.nr_cpus = nodes_cpus;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (6 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 16:22   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Which basically means the following two things:
 1) during domain creation, it is the node-affinity of
    the domain --rather than the vcpu-affinities of its
    VCPUs-- that is affected by automatic placement;
 2) during automatic placement, when counting how many
    VCPUs are already "bound" to a placement candidate
    (as part of the process of choosing the best
    candidate), both vcpu-affinity and node-affinity
    are considered (see the sketch below).
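
A hedged sketch of point 2), mirroring the counting condition that
nr_vcpus_on_nodes() below ends up with: a vCPU contributes to node 'n'
only if it can run there according to its vcpu-affinity AND 'n' is in its
domain's node-affinity (and 'n' is part of the suitable set):

    libxl_for_each_set_bit(k, vinfo[j].cpumap) {
        int n = tinfo[k].node;

        if (libxl_bitmap_test(suitable_cpumap, k) &&   /* pCPU usable      */
            libxl_bitmap_test(&dom_nodemap, n) &&      /* node-affinity ok */
            !libxl_bitmap_test(&vcpu_nodemap, n)) {    /* count node once  */
            libxl_bitmap_set(&vcpu_nodemap, n);
            vcpus_on_node[n]++;
        }
    }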

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc *
 {
     int found;
     libxl__numa_candidate candidate;
-    libxl_bitmap candidate_nodemap;
+    libxl_bitmap cpupool_nodemap;
     libxl_cpupoolinfo cpupool_info;
     int i, cpupool, rc = 0;
     uint32_t memkb;
 
     libxl__numa_candidate_init(&candidate);
-    libxl_bitmap_init(&candidate_nodemap);
+    libxl_bitmap_init(&cpupool_nodemap);
 
     /*
      * Extract the cpumap from the cpupool the domain belong to. In fact,
@@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc *
     rc = libxl_domain_need_memory(CTX, info, &memkb);
     if (rc)
         goto out;
-    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
+    if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) {
         rc = ERROR_FAIL;
         goto out;
     }
@@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc *
     if (found == 0)
         goto out;
 
-    /* Map the candidate's node map to the domain's info->cpumap */
-    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
-    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
+    /* Map the candidate's node map to the domain's info->nodemap */
+    libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap);
+
+    /* Avoid trying to set the affinity to nodes that might be in the
+     * candidate's nodemap but out of our cpupool. */
+    rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap,
+                                 &cpupool_nodemap);
     if (rc)
         goto out;
 
-    /* Avoid trying to set the affinity to cpus that might be in the
-     * nodemap but not in our cpupool. */
-    libxl_for_each_set_bit(i, info->cpumap) {
-        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
-            libxl_bitmap_reset(&info->cpumap, i);
+    libxl_for_each_set_bit(i, info->nodemap) {
+        if (!libxl_bitmap_test(&cpupool_nodemap, i))
+            libxl_bitmap_reset(&info->nodemap, i);
     }
 
     LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
@@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc *
 
  out:
     libxl__numa_candidate_dispose(&candidate);
-    libxl_bitmap_dispose(&candidate_nodemap);
+    libxl_bitmap_dispose(&cpupool_nodemap);
     libxl_cpupoolinfo_dispose(&cpupool_info);
     return rc;
 }
@@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint
     /*
      * Check if the domain has any CPU affinity. If not, try to build
      * up one. In case numa_place_domain() find at least a suitable
-     * candidate, it will affect info->cpumap accordingly; if it
+     * candidate, it will affect info->nodemap accordingly; if it
      * does not, it just leaves it as it is. This means (unless
      * some weird error manifests) the subsequent call to
-     * libxl_set_vcpuaffinity_all() will do the actual placement,
+     * libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
      */
     if (libxl_defbool_val(info->numa_placement)) {
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
                              int vcpus_on_node[])
 {
     libxl_dominfo *dinfo = NULL;
-    libxl_bitmap vcpu_nodemap;
+    libxl_bitmap dom_nodemap, vcpu_nodemap;
     int nr_doms, nr_cpus;
     int i, j, k;
 
@@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc *
         return ERROR_FAIL;
     }
 
+    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
+        libxl_bitmap_dispose(&vcpu_nodemap);
+        libxl_dominfo_list_free(dinfo, nr_doms);
+        return ERROR_FAIL;
+    }
+
     for (i = 0; i < nr_doms; i++) {
         libxl_vcpuinfo *vinfo;
         int nr_dom_vcpus;
@@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc *
         if (vinfo == NULL)
             continue;
 
+        /* Retrieve the domain's node-affinity map */
+        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
+
         for (j = 0; j < nr_dom_vcpus; j++) {
-            /* For each vcpu of each domain, increment the elements of
-             * the array corresponding to the nodes where the vcpu runs */
+            /*
+             * For each vcpu of each domain, it must have both vcpu-affinity
+             * and node-affinity to (a pcpu belonging to) a certain node to
+             * cause an increment in the corresponding element of the array.
+             */
             libxl_bitmap_set_none(&vcpu_nodemap);
             libxl_for_each_set_bit(k, vinfo[j].cpumap) {
                 int node = tinfo[k].node;
 
                 if (libxl_bitmap_test(suitable_cpumap, k) &&
+                    libxl_bitmap_test(&dom_nodemap, node) &&
                     !libxl_bitmap_test(&vcpu_nodemap, node)) {
                     libxl_bitmap_set(&vcpu_nodemap, node);
                     vcpus_on_node[node]++;
@@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
         libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
     }
 
+    libxl_bitmap_dispose(&dom_nodemap);
     libxl_bitmap_dispose(&vcpu_nodemap);
     libxl_dominfo_list_free(dinfo, nr_doms);
     return 0;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (7 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 16:34   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Node-affinity is now something that is under (some) control of the
user, so show it on request, via the new `-n' option, as part of
the output of `xl list'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
---
Changes from v1:
 * print_{cpu,node}map() functions added instead of 'state variable'-izing
   print_bitmap().

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -2961,14 +2961,95 @@ out:
     }
 }
 
-static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain)
+/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
+static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
+{
+    int i;
+    uint8_t pmap = 0, bitmask = 0;
+    int firstset = 0, state = 0;
+
+    for (i = 0; i < maplen; i++) {
+        if (i % 8 == 0) {
+            pmap = *map++;
+            bitmask = 1;
+        } else bitmask <<= 1;
+
+        switch (state) {
+        case 0:
+        case 2:
+            if ((pmap & bitmask) != 0) {
+                firstset = i;
+                state++;
+            }
+            continue;
+        case 1:
+        case 3:
+            if ((pmap & bitmask) == 0) {
+                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
+                if (i - 1 > firstset)
+                    fprintf(stream, "-%d", i - 1);
+                state = 2;
+            }
+            continue;
+        }
+    }
+    switch (state) {
+        case 0:
+            fprintf(stream, "none");
+            break;
+        case 2:
+            break;
+        case 1:
+            if (firstset == 0)
+                return 1;
+        case 3:
+            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
+            if (i - 1 > firstset)
+                fprintf(stream, "-%d", i - 1);
+            break;
+    }
+
+    return 0;
+}
+
+static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
+{
+    if (print_bitmap(map, maplen, stream))
+        fprintf(stream, "any cpu");
+}
+
+static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
+{
+    if (print_bitmap(map, maplen, stream))
+        fprintf(stream, "any node");
+}
+
+static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain)
 {
     int i;
     static const char shutdown_reason_letters[]= "-rscw";
+    libxl_bitmap nodemap;
+    libxl_physinfo physinfo;
+
+    libxl_bitmap_init(&nodemap);
+    libxl_physinfo_init(&physinfo);
 
     printf("Name                                        ID   Mem VCPUs\tState\tTime(s)");
     if (verbose) printf("   UUID                            Reason-Code\tSecurity Label");
     if (context && !verbose) printf("   Security Label");
+    if (numa) {
+        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) {
+            fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
+            exit(1);
+        }
+        if (libxl_get_physinfo(ctx, &physinfo) != 0) {
+            fprintf(stderr, "libxl_physinfo failed.\n");
+            libxl_bitmap_dispose(&nodemap);
+            exit(1);
+        }
+
+        printf(" NODE Affinity");
+    }
     printf("\n");
     for (i = 0; i < nb_domain; i++) {
         char *domname;
@@ -3002,14 +3083,23 @@ static void list_domains(int verbose, in
             rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf,
                                             &size);
             if (rc < 0)
-                printf("  -");
+                printf("                -");
             else {
-                printf("  %s", buf);
+                printf(" %16s", buf);
                 free(buf);
             }
         }
+        if (numa) {
+            libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
+
+            putchar(' ');
+            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
+        }
         putchar('\n');
     }
+
+    libxl_bitmap_dispose(&nodemap);
+    libxl_physinfo_dispose(&physinfo);
 }
 
 static void list_vm(void)
@@ -3890,12 +3980,14 @@ int main_list(int argc, char **argv)
     int opt, verbose = 0;
     int context = 0;
     int details = 0;
+    int numa = 0;
     int option_index = 0;
     static struct option long_options[] = {
         {"long", 0, 0, 'l'},
         {"help", 0, 0, 'h'},
         {"verbose", 0, 0, 'v'},
         {"context", 0, 0, 'Z'},
+        {"numa", 0, 0, 'n'},
         {0, 0, 0, 0}
     };
 
@@ -3904,7 +3996,7 @@ int main_list(int argc, char **argv)
     int nb_domain, rc;
 
     while (1) {
-        opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index);
+        opt = getopt_long(argc, argv, "lvhZn", long_options, &option_index);
         if (opt == -1)
             break;
 
@@ -3921,6 +4013,9 @@ int main_list(int argc, char **argv)
         case 'Z':
             context = 1;
             break;
+        case 'n':
+            numa = 1;
+            break;
         default:
             fprintf(stderr, "option `%c' not supported.\n", optopt);
             break;
@@ -3956,7 +4051,7 @@ int main_list(int argc, char **argv)
     if (details)
         list_domains_details(info, nb_domain);
     else
-        list_domains(verbose, context, info, nb_domain);
+        list_domains(verbose, context, numa, info, nb_domain);
 
     if (info_free)
         libxl_dominfo_list_free(info, nb_domain);
@@ -4228,56 +4323,6 @@ int main_button_press(int argc, char **a
     return 0;
 }
 
-static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
-{
-    int i;
-    uint8_t pmap = 0, bitmask = 0;
-    int firstset = 0, state = 0;
-
-    for (i = 0; i < maplen; i++) {
-        if (i % 8 == 0) {
-            pmap = *map++;
-            bitmask = 1;
-        } else bitmask <<= 1;
-
-        switch (state) {
-        case 0:
-        case 2:
-            if ((pmap & bitmask) != 0) {
-                firstset = i;
-                state++;
-            }
-            continue;
-        case 1:
-        case 3:
-            if ((pmap & bitmask) == 0) {
-                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
-                if (i - 1 > firstset)
-                    fprintf(stream, "-%d", i - 1);
-                state = 2;
-            }
-            continue;
-        }
-    }
-    switch (state) {
-        case 0:
-            fprintf(stream, "none");
-            break;
-        case 2:
-            break;
-        case 1:
-            if (firstset == 0) {
-                fprintf(stream, "any cpu");
-                break;
-            }
-        case 3:
-            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
-            if (i - 1 > firstset)
-                fprintf(stream, "-%d", i - 1);
-            break;
-    }
-}
-
 static void print_vcpuinfo(uint32_t tdomid,
                            const libxl_vcpuinfo *vcpuinfo,
                            uint32_t nr_cpus)
@@ -4301,7 +4346,7 @@ static void print_vcpuinfo(uint32_t tdom
     /*      TIM */
     printf("%9.1f  ", ((float)vcpuinfo->vcpu_time / 1e9));
     /* CPU AFFINITY */
-    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
+    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
     printf("\n");
 }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = {
       "[options] [Domain]\n",
       "-l, --long              Output all VM details\n"
       "-v, --verbose           Prints out UUIDs and security context\n"
-      "-Z, --context           Prints out security context"
+      "-Z, --context           Prints out security context\n"
+      "-n, --numa              Prints out NUMA node affinity"
     },
     { "destroy",
       &main_destroy, 0, 1,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (8 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
  2013-01-11 12:19 ` Ian Campbell
  11 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

To include the new concept of NUMA aware scheduling and its
impact.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -14,22 +14,67 @@ the memory directly attached to the set 
 
 The Xen hypervisor deals with NUMA machines by assigning to each domain
 a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated.
+get their memory allocated. Also, even if the node affinity of a domain
+is allowed to change on-line, it is very important to "place" the domain
+correctly when it is first created, as most of its memory is allocated
+at that time and cannot (for now) be moved easily.
 
 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
 performance degradation is likely to be noticeable.
 
-## Guest Placement in xl ##
+For more information, have a look at the [Xen NUMA Introduction][numa_intro]
+page on the Wiki.
+
+### Placing via pinning and cpupools ###
+
+The simplest way of placing a domain on a NUMA node is statically pinning
+the domain's vCPUs to the pCPUs of the node. This goes under the name of
+CPU affinity and can be set through the "cpus=" option in the config file
+(more about this below). Another option is to pool together the pCPUs
+spanning the node and put the domain in such a cpupool with the "pool="
+config option (as documented in our [Wiki][cpupools_howto]).
+
+In both the above cases, the domain will not be able to execute outside
+the specified set of pCPUs for any reason, even if all those pCPUs are
+busy doing something else while other pCPUs are idle.
+
+So, when doing this, local memory accesses are 100% guaranteed, but that
+may come at the cost of some load imbalance.
+
+### NUMA aware scheduling ###
+
+If the credit scheduler is in use, the concept of node affinity defined
+above does not only apply to memory. In fact, starting from Xen 4.3, the
+scheduler always tries to run the domain's vCPUs on one of the nodes in
+its node affinity. Only if that turns out to be impossible will it just
+pick any free pCPU.
+
+This is, therefore, something more flexible than CPU affinity, as a domain
+can still run anywhere; it just prefers some nodes over others.
+Locality of access is less guaranteed than in the pinning case, but that
+comes along with better chances to exploit all the host resources (e.g.,
+the pCPUs).
+
+In fact, if all the pCPUs in a domain's node affinity are busy, it is
+possible for the domain to run outside of it, but it is very likely that
+slower execution (due to remote memory accesses) is still better than no
+execution at all, as would happen with pinning. For this reason, NUMA
+aware scheduling has the potential of bringing substantial performance
+benefits, although this will depend on the workload.
+
+## Guest placement in xl ##
 
 If using xl for creating and managing guests, it is very easy to ask for
 both manual or automatic placement of them across the host's NUMA nodes.
 
-Note that xm/xend does the very same thing, the only differences residing
-in the details of the heuristics adopted for the placement (see below).
+Note that xm/xend does a very similar thing, the only differences being
+the details of the heuristics adopted for automatic placement (see below),
+and the lack of support (in both xm/xend and the Xen versions where that
+was the default toolstack) for NUMA aware scheduling.
 
-### Manual Guest Placement with xl ###
+### Placing the guest manually ###
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
@@ -41,14 +86,19 @@ This is very simple and effective, but r
 administrator to explicitly specify affinities for each and every domain,
 or Xen won't be able to guarantee the locality for their memory accesses.
 
-It is also possible to deal with NUMA by partitioning the system using
-cpupools. Again, this could be "The Right Answer" for many needs and
-occasions, but has to be carefully considered and setup by hand.
+Notice that this also pins the domain's vCPUs to the specified set of
+pCPUs, so it not only sets the domain's node affinity (its memory will
+come from the nodes to which the pCPUs belong), but at the same time
+forces the vCPUs of the domain to be scheduled on those same pCPUs.
 
-### Automatic Guest Placement with xl ###
+### Placing the guest automatically ###
 
 If no "cpus=" option is specified in the config file, libxl tries
 to figure out on its own on which node(s) the domain could fit best.
+If it finds one (or more), the domain's node affinity is set accordingly,
+and both memory allocations and NUMA aware scheduling (for the credit
+scheduler and starting from Xen 4.3) will comply with it.
+
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
 the various VMs with different memory sizes are the items to be packed,
@@ -81,7 +131,7 @@ largest amounts of free memory helps kee
 small, and maximizes the probability of being able to put more domains
 there.
 
-## Guest Placement within libxl ##
+## Guest placement in libxl ##
 
 xl achieves automatic NUMA placement because that is what libxl does
 by default. No API is provided (yet) for modifying the behaviour of
@@ -93,15 +143,34 @@ any placement from happening:
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
 Also, if `numa_placement` is set to `true`, the domain must not
-have any cpu affinity (i.e., `domain_build_info->cpumap` must
+have any CPU affinity (i.e., `domain_build_info->cpumap` must
 have all its bits set, as it is by default), or domain creation
 will fail returning `ERROR_INVAL`.
 
+Starting from Xen 4.3, if automatic placement happens (and is
+successful), it will affect the domain's node affinity and _not_ its
+CPU affinity. Namely, the domain's vCPUs will not be pinned to any
+pCPU on the host, but the domain's memory will come from the
+selected node(s), and NUMA aware scheduling (if the credit scheduler
+is in use) will try to keep the domain there as much as possible.
+
 Besides than that, looking and/or tweaking the placement algorithm
 search "Automatic NUMA placement" in libxl\_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
+## Xen < 4.3 ##
+
+As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
+bit different for earlier versions of Xen. If no "cpus=" option is specified
+and Xen 4.2 is in use, the automatic placement algorithm still runs, but
+the result is used to _pin_ the vCPUs of the domain to the output node(s).
+This is consistent with what happens with xm/xend, which also affect
+the domain's CPU affinity.
+
+On versions of Xen earlier than 4.2, there is no automatic placement at
+all in xl or libxl, and hence neither node nor CPU affinity is affected.
+
 ## Limitations ##
 
 Analyzing various possible placement solutions is what makes the
@@ -109,3 +178,6 @@ algorithm flexible and quite effective. 
 it won't scale well to systems with arbitrary number of nodes.
 For this reason, automatic placement is disabled (with a warning)
 if it is requested on a host with more than 16 NUMA nodes.
+
+[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
+[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (9 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
@ 2012-12-19 23:16 ` Dario Faggioli
  2013-01-11 12:19 ` Ian Campbell
  11 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 23:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Wed, 2012-12-19 at 20:07 +0100, Dario Faggioli wrote: 
> Which, reasoning in terms of %-performances increase/decrease, means NUMA aware
> scheduling does as follows, as compared to no affinity at all and to pinning:
> 
>      ----------------------------------
>      | SpecJBB2005 (throughput)       |
>      ----------------------------------
>      | #VMs | No affinity |  Pinning  |
>      |    2 |   +14.36%   |   -0.36%  |
>      |    6 |   +14.72%   |   -0.26%  |
>      |   10 |   +11.88%   |   -2.44%  |
>      ----------------------------------
>      | Sysbench memory (throughput)   |
>      ----------------------------------
>      | #VMs | No affinity |  Pinning  |
>      |    2 |   +14.12%   |   +0.09%  |
>      |    6 |   +11.12%   |   +2.14%  |
>      |   10 |   +11.81%   |   +5.06%  |
>      ----------------------------------
>      | LMBench proc (latency)         |
>      ----------------------------------
>      | #VMs | No affinity |  Pinning  |
>      ----------------------------------
>      |    2 |   +10.02%   |   +1.07%  |
>      |    6 |    +3.45%   |   +1.02%  |
>      |   10 |    +2.94%   |   +4.53%  |
>      ----------------------------------
> 
Just to be sure, as I may not have picked the perfect wording: in the
tables above, a +xx.yy% means NUMA aware scheduling (i.e., with this patch
series fully applied) performs xx.yy% _better_ than the configuration the
column refers to ('No affinity' or 'Pinning'). Conversely, a -zz.ww% means
it performs zz.ww% worse. For instance, the +14.36% in the first SpecJBB
row means that, with 2 VMs, NUMA aware scheduling achieves 1.1436 times
the throughput of the 'No affinity' case.

Sorry, but the different combinations, and the presence of both throughput
values (where higher is better) and latency values (where lower is better),
made things a little bit tricky to present effectively. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
@ 2012-12-20  6:44   ` Juergen Gross
  2012-12-20  8:16     ` Dario Faggioli
  2012-12-20 15:56   ` George Dunlap
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Juergen Gross @ 2012-12-20  6:44 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Am 19.12.2012 20:07, schrieb Dario Faggioli:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should, or rather prefer to, run. While respecting
> vcpu-affinity remains mandatory, node-affinity is not that strict: it
> only expresses a preference, although honouring it will almost always
> bring a significant performance benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two steps logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
> ---
> Changes from v1:
>   * CPU mask variables moved off the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, use a global bunch of
>     them (via per_cpu());

Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
This could be applied to several other instances of cpu masks on the stack,
too.

>   * George suggested a different load balancing logic during v1's review. I
>     think he was right, so I changed the old implementation in a way
>     that resembles exactly that. I rewrote most of this patch to introduce
>     a more sensible and effective node-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
> +
> +/*
> + * When building for high number of CPUs, cpumask_var_t
> + * variables on stack are better avoided. However, we need them,
> + * in order to be able to consider both vcpu and node affinity.
> + * We also don't want to xmalloc()/xfree() them, as that would
> + * happen in critical code paths. Therefore, let's (pre)allocate
> + * some scratch space for them.
> + *
> + * Having one mask for each instance of the scheduler seems
> + * enough, and that would suggest putting it within `struct
> + * csched_private' below. However, we don't always hold the
> + * private scheduler lock when the mask itself would need to
> + * be used, leaving room for races. For that reason, we define
> + * and use a cpumask_t for each CPU. As preemption is not an
> + * issue here (we're holding the runqueue spin-lock!), that is
> + * both enough and safe.
> + */
> +DEFINE_PER_CPU(cpumask_t, csched_balance_mask);
> +#define scratch_balance_mask (this_cpu(csched_balance_mask))
> +
> +/*
>    * Boot parameters
>    */
>   static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
> @@ -159,6 +186,9 @@ struct csched_dom {
>       struct list_head active_vcpu;
>       struct list_head active_sdom_elem;
>       struct domain *dom;
> +    /* cpumask translated from the domain's node-affinity.
> +     * Basically, the CPUs we prefer to be scheduled on. */
> +    cpumask_var_t node_affinity_cpumask;
>       uint16_t active_vcpu_count;
>       uint16_t weight;
>       uint16_t cap;
> @@ -239,6 +269,42 @@ static inline void
>       list_del_init(&svc->runq_elem);
>   }
>
> +#define for_each_csched_balance_step(__step) \
> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step)>= 0; (__step)-- )
> +
> +/*
> + * Each csched-balance step has to use its own cpumask. This function
> + * determines which one, given the step, and copies it in mask. Notice
> + * that, in case of node-affinity balancing step, it also filters out from
> + * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
> + * as we do not want to end up running a vcpu where it would like, but
> + * is not allowed to!
> + *
> + * As an optimization, if a domain does not have any node-affinity at all
> + * (namely, its node affinity is automatically computed), not only the
> + * computed mask will reflect its vcpu-affinity, but we also return -1 to
> + * let the caller know that he can skip the step or quit the loop (if he
> + * wants).
> + */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +    {
> +        struct domain *d = vc->domain;
> +        struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
> +
> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +            return -1;
> +    }
> +    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_affinity);
> +
> +    return 0;
> +}
> +
>   static void burn_credits(struct csched_vcpu *svc, s_time_t now)
>   {
>       s_time_t delta;
> @@ -266,67 +332,94 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
> -    cpumask_clear(&mask);
> -
>       idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
> -     * If the pcpu is idle, or there are no idlers and the new
> -     * vcpu is a higher priority than the old vcpu, run it here.
> -     *
> -     * If there are idle cpus, first try to find one suitable to run
> -     * new, so we can avoid preempting cur.  If we cannot find a
> -     * suitable idler on which to run new, run it here, but try to
> -     * find a suitable idler on which to run cur instead.
> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
> +     * no node-affinity at all is present, scratch_balance_mask reflects
> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
> +     * loop after only one step.
>        */
> -    if ( cur->pri == CSCHED_PRI_IDLE
> -         || (idlers_empty&&  new->pri>  cur->pri) )
> +    for_each_csched_balance_step( balance_step )
>       {
> -        if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -        cpumask_set_cpu(cpu,&mask);
> -    }
> -    else if ( !idlers_empty )
> -    {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        int ret, new_idlers_empty;
> +
> +        cpumask_clear(&mask);
>
>           /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> +         * If the pcpu is idle, or there are no idlers and the new
> +         * vcpu is a higher priority than the old vcpu, run it here.
>            *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> +         * If there are idle cpus, first try to find one suitable to run
> +         * new, so we can avoid preempting cur.  If we cannot find a
> +         * suitable idler on which to run new, run it here, but try to
> +         * find a suitable idler on which to run cur instead.
>            */
> -        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
> +        if ( cur->pri == CSCHED_PRI_IDLE
> +             || (idlers_empty && new->pri > cur->pri) )
>           {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> +            if ( cur->pri != CSCHED_PRI_IDLE )
> +                SCHED_STAT_CRANK(tickle_idlers_none);
>               cpumask_set_cpu(cpu, &mask);
>           }
> -        else if ( !cpumask_empty(&idle_mask) )
> +        else if ( !idlers_empty )
>           {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            ret = csched_balance_cpumask(new->vcpu, balance_step,
> +                                         &scratch_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before taking final decisions.
> +             */
> +            if ( new_idlers_empty
> +                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty && new->pri > cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu, &mask);
>               }
> -            else
> -                cpumask_or(&mask,&mask,&idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                }
> +                else
> +                    cpumask_or(&mask, &mask, &idle_mask);
> +            }
>           }
> +
> +        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
> +        if ( !cpumask_empty(&mask) || ret )
> +            break;
>       }
>
>       if ( !cpumask_empty(&mask) )
> @@ -475,15 +568,28 @@ static inline int
>   }
>
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running &&
>              !__csched_vcpu_is_cache_hot(vc) &&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }
>
>   static int
> @@ -493,85 +599,98 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>       int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor,&cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor,&cpus);
> -    ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> +        cpumask_and(&cpus, &cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu,&idlers);
> -    cpumask_and(&cpus,&cpus,&idlers);
> -    cpumask_clear_cpu(cpu,&cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor, &cpus);
> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu, &idlers);
> +        cpumask_and(&cpus, &cpus, &idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> +            cpu = cpumask_cycle(cpu, &cpus);
> +        cpumask_clear_cpu(cpu, &cpus);
>
> -        nxt = cpumask_cycle(cpu,&cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu,&cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu > weight_nxt :
> +                 weight_cpu * migrate_factor < weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> +                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu>  weight_nxt :
> -             weight_cpu * migrate_factor<  weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> -            cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> +            break;
>       }
>
>       if ( commit && spc )
> @@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu
>       if ( sdom == NULL )
>           return NULL;
>
> +    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
> +    {
> +        xfree(sdom);
> +        return NULL;
> +    }
> +    cpumask_setall(sdom->node_affinity_cpumask);
> +
>       /* Initialize credit and weight */
>       INIT_LIST_HEAD(&sdom->active_vcpu);
>       sdom->active_vcpu_count = 0;
> @@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler *
>   static void
>   csched_free_domdata(const struct scheduler *ops, void *data)
>   {
> +    struct csched_dom *sdom = data;
> +
> +    free_cpumask_var(sdom->node_affinity_cpumask);
>       xfree(data);
>   }
>
> @@ -1240,9 +1369,10 @@ csched_tick(void *_cpu)
>   }
>
>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri<= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * Retrieve the correct mask for this balance_step or, if we're
> +             * dealing with node-affinity and the vcpu has no node affinity
> +             * at all, just skip this vcpu. That is needed if we want to
> +             * check if we have any node-affine work to steal first (wrt
> +             * any vcpu-affine work).
> +             */
> +            if ( csched_balance_cpumask(vc, balance_step,
> +                                        &scratch_balance_mask) )
> +                continue;
> +
> +            if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask)
> +                 && __csched_vcpu_should_migrate(cpu, &scratch_balance_mask,
> +                                                 prv->idlers) )
>               {
>                   /* We got a candidate. Grab it! */
>                   TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu,&workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well be expected to be cheaper than across nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu, &workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu,&workers);
> -        cpumask_clear_cpu(peer_cpu,&workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }
>
>    out:
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -41,6 +41,8 @@
>    * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
>    * int first_unset_node(mask)		First node not set in mask, or
>    *					MAX_NUMNODES.
> + * int cycle_node(node, mask)		Next node cycling from 'node', or
> + *					MAX_NUMNODES
>    *
>    * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
>    * NODE_MASK_ALL			Initializer - all bits set
> @@ -254,6 +256,16 @@ static inline int __first_unset_node(con
>   			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
>   }
>
> +#define cycle_node(n, src) __cycle_node((n),&(src), MAX_NUMNODES)
> +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
> +{
> +    int nxt = __next_node(n, maskp, nbits);
> +
> +    if (nxt == nbits)
> +        nxt = __first_node(maskp, nbits);
> +    return nxt;
> +}
> +
>   #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
>
>   #if MAX_NUMNODES <= BITS_PER_LONG
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  6:44   ` Juergen Gross
@ 2012-12-20  8:16     ` Dario Faggioli
  2012-12-20  8:25       ` Juergen Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  8:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Thu, 2012-12-20 at 07:44 +0100, Juergen Gross wrote: 
> Am 19.12.2012 20:07, schrieb Dario Faggioli:
> > [...] 
> >
> > This change modifies the VCPU load balancing algorithm (for the
> > credit scheduler only), introducing a two steps logic.
> > During the first step, we use the node-affinity mask. The aim is
> > giving precedence to the CPUs where it is known to be preferable
> > for the domain to run. If that fails in finding a valid PCPU, the
> > node-affinity is just ignored and, in the second step, we fall
> > back to using cpu-affinity only.
> >
> > Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
> > ---
> > Changes from v1:
> >   * CPU masks variables moved off from the stack, as requested during
> >     review. As per the comments in the code, having them in the private
> >     (per-scheduler instance) struct could have been enough, but it would be
> >     racy (again, see comments). For that reason, use a global bunch of
> >     them of (via per_cpu());
> 
> Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
> This could be applied to several other instances of cpu masks on the stack,
> too.
> 
Yes, as I tried to explain, if it's per-CPU it should be fine, since
credit has one runq per CPU and hence the runq lock is enough for
serialization.

BTW, can you be a little bit more specific about where you're suggesting
to put it? I'm sorry, but I'm not sure I figured out what you mean by "the
scheduler private per-pcpu area"... Do you perhaps mean making it a
member of `struct csched_pcpu'?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:16     ` Dario Faggioli
@ 2012-12-20  8:25       ` Juergen Gross
  2012-12-20  8:33         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Juergen Gross @ 2012-12-20  8:25 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Am 20.12.2012 09:16, schrieb Dario Faggioli:
> On Thu, 2012-12-20 at 07:44 +0100, Juergen Gross wrote:
>> Am 19.12.2012 20:07, schrieb Dario Faggioli:
>>> [...]
>>>
>>> This change modifies the VCPU load balancing algorithm (for the
>>> credit scheduler only), introducing a two steps logic.
>>> During the first step, we use the node-affinity mask. The aim is
>>> giving precedence to the CPUs where it is known to be preferable
>>> for the domain to run. If that fails in finding a valid PCPU, the
>>> node-affinity is just ignored and, in the second step, we fall
>>> back to using cpu-affinity only.
>>>
>>> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
>>> ---
>>> Changes from v1:
>>>    * CPU masks variables moved off from the stack, as requested during
>>>      review. As per the comments in the code, having them in the private
>>>      (per-scheduler instance) struct could have been enough, but it would be
>>>      racy (again, see comments). For that reason, use a global bunch of
>>>      them of (via per_cpu());
>>
>> Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
>> This could be applied to several other instances of cpu masks on the stack,
>> too.
>>
> Yes, as I tired to explain, if it's per-cpu it should be fine, since
> credit has one runq per each CPU and hence runq lock is enough for
> serialization.
>
> BTW, can you be a little bit more specific about where you're suggesting
> to put it? I'm sorry but I'm not sure I figured what you mean by "the
> scheduler private per-pcpu area"... Do you perhaps mean making it a
> member of `struct csched_pcpu' ?

Yes, that's what I would suggest.

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:25       ` Juergen Gross
@ 2012-12-20  8:33         ` Dario Faggioli
  2012-12-20  8:39           ` Juergen Gross
  2012-12-20  9:22           ` Jan Beulich
  0 siblings, 2 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  8:33 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote: 
> > BTW, can you be a little bit more specific about where you're suggesting
> > to put it? I'm sorry but I'm not sure I figured what you mean by "the
> > scheduler private per-pcpu area"... Do you perhaps mean making it a
> > member of `struct csched_pcpu' ?
> 
> Yes, that's what I would suggest.
> 
Ok then, functionally, that is going to be exactly the same thing as
where it is right now, i.e., a set of global per_cpu() variables. It is
possible for your solution to bring some cache/locality benefits,
although that will very much depend on the specific case, architecture,
workload, etc.
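
IOW, I guess what you're suggesting would look something like the
following (just a sketch, the field name is made up and this is not
tested):

    /* in struct csched_pcpu (all the existing fields stay as they are): */
    cpumask_t balance_mask;    /* scratch space for the balancing steps */

    /* ...and then the accessor would become something like: */
    #define scratch_balance_mask \
        (CSCHED_PCPU(smp_processor_id())->balance_mask)

so that all the current users of scratch_balance_mask stay untouched.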

That being said, I'm definitely fine with it and can go for it.

It was Jan that suggested/asked to pull them out of the stack, and I
guess it's George's taste that we value most when hacking sched_*.c, so
let's see if they want to comment on this and then I'll decide. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:33         ` Dario Faggioli
@ 2012-12-20  8:39           ` Juergen Gross
  2012-12-20  8:58             ` Dario Faggioli
  2012-12-20 15:28             ` George Dunlap
  2012-12-20  9:22           ` Jan Beulich
  1 sibling, 2 replies; 57+ messages in thread
From: Juergen Gross @ 2012-12-20  8:39 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Am 20.12.2012 09:33, schrieb Dario Faggioli:
> On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote:
>>> BTW, can you be a little bit more specific about where you're suggesting
>>> to put it? I'm sorry but I'm not sure I figured what you mean by "the
>>> scheduler private per-pcpu area"... Do you perhaps mean making it a
>>> member of `struct csched_pcpu' ?
>>
>> Yes, that's what I would suggest.
>>
> Ok then, functionally, that is going to be exactly the same thing as
> where it is right now, i.e., a set of global per_cpu() variables. It is
> possible for your solution to bring some cache/locality benefits,
> although that will very much depends on single cases, architecture,
> workload, etc.

The space is only allocated if sched_credit is responsible for the pcpu.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
@ 2012-12-20  8:41   ` Ian Campbell
  2012-12-20  9:24     ` Dario Faggioli
  2012-12-21 16:00   ` George Dunlap
  1 sibling, 1 reply; 57+ messages in thread
From: Ian Campbell @ 2012-12-20  8:41 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> This reduces the complexity of the overall algorithm, as it moves a 

What was/is the complexity before/after this change? ISTR it was O(n^2)
or something along those lines before.

Ian.


* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:39           ` Juergen Gross
@ 2012-12-20  8:58             ` Dario Faggioli
  2012-12-20 15:28             ` George Dunlap
  1 sibling, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  8:58 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Thu, 2012-12-20 at 08:39 +0000, Juergen Gross wrote: 
> > Ok then, functionally, that is going to be exactly the same thing as
> > where it is right now, i.e., a set of global per_cpu() variables. It is
> > possible for your solution to bring some cache/locality benefits,
> > although that will very much depends on single cases, architecture,
> > workload, etc.
> 
> The space is only allocated if sched_credit is responsible for the pcpu.
> 
Right, that's true too.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
@ 2012-12-20  9:17   ` Jan Beulich
  2012-12-20  9:35     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2012-12-20  9:17 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> More specifically:
>  1. replaces xenctl_cpumap with xenctl_bitmap
>  2. provides bitmap_to_xenctl_bitmap and the reverse;
>  3. re-implement cpumask_to_xenctl_bitmap with
>     bitmap_to_xenctl_bitmap and the reverse;
> 
> Other than #3, no functional changes. Interface only slightly
> afected.
> 
> This is in preparation of introducing NUMA node-affinity maps.

This (at least) lacks an adjustment to
tools/tests/mce-test/tools/xen-mceinj.c afaict.

Jan.


* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
@ 2012-12-20  9:18   ` Jan Beulich
  2012-12-20  9:55     ` Dario Faggioli
  2012-12-20 14:33     ` George Dunlap
  0 siblings, 2 replies; 57+ messages in thread
From: Jan Beulich @ 2012-12-20  9:18 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
>  }
>  #endif
>  
> +/*
> + * nodemask_var_t: struct nodemask for stack usage.
> + *
> + * See definition of cpumask_var_t in include/xen//cpumask.h.
> + */
> +#if MAX_NUMNODES > 2 * BITS_PER_LONG

Is that case reasonable to expect?

Jan

> +#include <xen/xmalloc.h>
> +
> +typedef nodemask_t *nodemask_var_t;
> +
> +#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG)
> +
> +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	*(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long));
> +	return *mask != NULL;
> +}
> +
> +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	*(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long));
> +	return *mask != NULL;
> +}
> +
> +static inline void free_nodemask_var(nodemask_var_t mask)
> +{
> +	xfree(mask);
> +}
> +#else
> +typedef nodemask_t nodemask_var_t;
> +
> +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	return 1;
> +}
> +
> +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	nodes_clear(*mask);
> +	return 1;
> +}
> +
> +static inline void free_nodemask_var(nodemask_var_t mask)
> +{
> +}
> +#endif
> +
>  #if MAX_NUMNODES > 1
>  #define for_each_node_mask(node, mask)			\
>  	for ((node) = first_node(mask);			\


* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:33         ` Dario Faggioli
  2012-12-20  8:39           ` Juergen Gross
@ 2012-12-20  9:22           ` Jan Beulich
  1 sibling, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2012-12-20  9:22 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, AnilMadhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

>>> On 20.12.12 at 09:33, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> It was Jan that suggested/asked to pull them out of the stack, and I
> guess it's George's taste that we value most when hacking sched_*.c, so
> let's see if they want to comment on this and then I'll decide. :-)

Yes, I'm with Juergen here.

Jan


* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-20  8:41   ` Ian Campbell
@ 2012-12-20  9:24     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  9:24 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf



On Thu, 2012-12-20 at 08:41 +0000, Ian Campbell wrote: 
> On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> > This reduces the complexity of the overall algorithm, as it moves a 
> 
> What was/is the complexity before/after this change? ISTR it was O(n^2)
> or something along those lines before.
> 
Yes and no. Let me try to explain. Counting the number of vCPUs that can
run on a (set of) node(s) was and remains O(n_domains*n_domain_vcpus),
so, yes, sort of quadratic.

The upper bound of the number of candidates evaluated by the placement
algorithm is exponential in the number of NUMA nodes: O(2^n_nodes).

Before this change, we counted the number of vCPUs runnable on each
candidate during each step, so the overall complexity was:

O(2^n_nodes) * O(n_domains*n_domain_vcpus)

In this change I count the number of vCPUs runnable on each candidate
only once, and that happens outside the candidate generation loop, so
the overall complexity is:

O(n_domains*n_domain_vcpus) + O(2^n_nodes) = O(2^n_nodes)
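
In (very rough, totally made up) pseudo-C, the difference is basically
this:

    /* Before: the runnable vCPUs were counted from within the candidate
     * generation loop, i.e., O(2^n_nodes) iterations, each one scanning
     * all the domains and all their vCPUs. */
    for_each_candidate ( c )
        c->nr_vcpus = count_vcpus_on(c);

    /* After: domains and vCPUs are scanned only once, filling an array
     * of per-node counters, and each candidate just sums the counters
     * of the (few) nodes it spans. */
    count_vcpus_per_node(vcpus_on_node);
    for_each_candidate ( c )
        for_each_node_in_candidate ( c, node )
            c->nr_vcpus += vcpus_on_node[node];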

Did I answer your question?

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2012-12-20  9:17   ` Jan Beulich
@ 2012-12-20  9:35     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  9:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf



On Thu, 2012-12-20 at 09:17 +0000, Jan Beulich wrote: 
> >>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > More specifically:
> >  1. replaces xenctl_cpumap with xenctl_bitmap
> >  2. provides bitmap_to_xenctl_bitmap and the reverse;
> >  3. re-implement cpumask_to_xenctl_bitmap with
> >     bitmap_to_xenctl_bitmap and the reverse;
> > 
> > Other than #3, no functional changes. Interface only slightly
> > afected.
> > 
> > This is in preparation of introducing NUMA node-affinity maps.
> 
> This (at least) lacks an adjustment to
> tools/tests/mce-test/tools/xen-mceinj.c afaict.
> 
Indeed. I really think I was building those, but it looks like I wasn't.
Sorry for that and thanks for pointing out, will fix.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20  9:18   ` Jan Beulich
@ 2012-12-20  9:55     ` Dario Faggioli
  2012-12-20 14:33     ` George Dunlap
  1 sibling, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  9:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf



On Thu, 2012-12-20 at 09:18 +0000, Jan Beulich wrote: 
> > +/*
> > + * nodemask_var_t: struct nodemask for stack usage.
> > + *
> > + * See definition of cpumask_var_t in include/xen//cpumask.h.
> > + */
> > +#if MAX_NUMNODES > 2 * BITS_PER_LONG
> 
> Is that case reasonable to expect?
> 
I really don't think so. Here, I really was just aligning cpumask and
nodemask types. But you're right, thinking more about it, this is
completely unnecessary. I'll kill this hunk.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20  9:18   ` Jan Beulich
  2012-12-20  9:55     ` Dario Faggioli
@ 2012-12-20 14:33     ` George Dunlap
  2012-12-20 14:52       ` Jan Beulich
  1 sibling, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 14:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson, xen-devel,
	Matt Wilson, Daniel De Graaf, Juergen Gross

On 20/12/12 09:18, Jan Beulich wrote:
>>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>> --- a/xen/include/xen/nodemask.h
>> +++ b/xen/include/xen/nodemask.h
>> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
>>   }
>>   #endif
>>   
>> +/*
>> + * nodemask_var_t: struct nodemask for stack usage.
>> + *
>> + * See definition of cpumask_var_t in include/xen//cpumask.h.
>> + */
>> +#if MAX_NUMNODES > 2 * BITS_PER_LONG
> Is that case reasonable to expect?

2 * BITS_PER_LONG is just going to be 128, right?  It wasn't too long 
ago that I would have considered 4096 cores a pretty unreasonable 
expectation.  Is there a particular reason you think this is going to be 
more than a few years away, and a particular harm in having the code 
here to begin with?

At very least it should be replaced with something like this:

#if MAX_NUMNODES > 2 * BITS_PER_LONG
# error "MAX_NUMNODES exceeds fixed size nodemask; need to implement variable-length nodemasks"
#endif

  -George


* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20 14:33     ` George Dunlap
@ 2012-12-20 14:52       ` Jan Beulich
  2012-12-20 15:13         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2012-12-20 14:52 UTC (permalink / raw)
  To: George Dunlap
  Cc: MarcusGranado, Dan Magenheimer, Ian Campbell, AnilMadhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson, xen-devel,
	Matt Wilson, Daniel De Graaf, Juergen Gross

>>> On 20.12.12 at 15:33, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 20/12/12 09:18, Jan Beulich wrote:
>>>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> --- a/xen/include/xen/nodemask.h
>>> +++ b/xen/include/xen/nodemask.h
>>> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
>>>   }
>>>   #endif
>>>   
>>> +/*
>>> + * nodemask_var_t: struct nodemask for stack usage.
>>> + *
>>> + * See definition of cpumask_var_t in include/xen//cpumask.h.
>>> + */
>>> +#if MAX_NUMNODES > 2 * BITS_PER_LONG
>> Is that case reasonable to expect?
> 
> 2 * BITS_PER_LONG is just going to be 128, right?  It wasn't too long 
> ago that I would have considered 4096 cores a pretty unreasonable 
> expectation.  Is there a particular reason you think this is going to be 
> more than a few years away, and a particular harm in having the code 
> here to begin with?

I just don't see node counts growing anywhere near as quickly as
core/thread counts.

> At very least it should be replaced with something like this:
> 
> #if MAX_NUMNODES > 2 * BITS_PER_LONG
> # error "MAX_NUMNODES exceeds fixed size nodemask; need to implement 
> variable-length nodemasks"
> #endif

Yes, if there is a limitation in the code, violating it should be
detected at build time. But I'd suppose one can construct the
statically sized mask definition such that it copes with larger counts
(just at the expense of having larger data objects, perhaps
on-stack). Making sure to always pass pointers rather than objects
to functions will already eliminate a good part of the problem.
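
Something like this, perhaps (just a sketch, and the helper name below is
made up; the point is that the mask stays statically sized -- it only gets
wider along with MAX_NUMNODES -- and that functions take pointers, so no
large objects ever get copied around):

    typedef struct {
        unsigned long bits[(MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG];
    } nodemask_t;

    /* Illustrative helper: takes a pointer, so only the caller that
     * declares the mask pays for the (possibly large) on-stack object. */
    static int first_set_node(const nodemask_t *mask)
    {
        unsigned int i;

        for ( i = 0; i < MAX_NUMNODES; i++ )
            if ( mask->bits[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)) )
                return i;
        return MAX_NUMNODES;
    }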

Jan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20 14:52       ` Jan Beulich
@ 2012-12-20 15:13         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 15:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

On Thu, Dec 20, 2012 at 3:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> 2 * BITS_PER_LONG is just going to be 128, right?  It wasn't too long
>> ago that I would have considered 4096 cores a pretty unreasonable
>> expectation.  Is there a particular reason you think this is going to be
>> more than a few years away, and a particular harm in having the code
>> here to begin with?
>
> I just don't see node counts grow even near as quickly as core/
> thread ones.
>
Yep, same here. That's why, despite having put that code there myself, I
agree with Jan on removing it. :-)

That feeling matches what we've been repeatedly told by hardware vendors,
namely that node counts are already saturating at around 8. Beyond that, a
different architectural approach will be needed to tackle scalability issues
(clusters? NoCs?).

Just for the record, Linux still deals with this in the same way it did when
we took these files from there (i.e., without this hunk); roughly, as in the
sketch below:
 - cpumasks exist in both static and dynamically allocated (cpumask_var_t) forms
 - nodemasks are only and always statically allocated
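
In (very) simplified code -- not the literal definitions, just the gist of
the difference, with the allocation helpers shown without GFP flags, as in
the Xen copy of these files:

    nodemask_t nodes;            /* fixed size, fine to keep on the stack */

    cpumask_var_t cpus;          /* for large NR_CPUS this is a pointer   */
    if ( !alloc_cpumask_var(&cpus) )   /* ... so it needs heap allocation */
        return -ENOMEM;
    /* ... use 'cpus' ... */
    free_cpumask_var(cpus);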

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:39           ` Juergen Gross
  2012-12-20  8:58             ` Dario Faggioli
@ 2012-12-20 15:28             ` George Dunlap
  2012-12-20 16:00               ` Dario Faggioli
  1 sibling, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 15:28 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 20/12/12 08:39, Juergen Gross wrote:
> Am 20.12.2012 09:33, schrieb Dario Faggioli:
>> On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote:
>>>> BTW, can you be a little bit more specific about where you're suggesting
>>>> to put it? I'm sorry but I'm not sure I figured what you mean by "the
>>>> scheduler private per-pcpu area"... Do you perhaps mean making it a
>>>> member of `struct csched_pcpu' ?
>>> Yes, that's what I would suggest.
>>>
>> Ok then, functionally, that is going to be exactly the same thing as
>> where it is right now, i.e., a set of global per_cpu() variables. It is
>> possible for your solution to bring some cache/locality benefits,
>> although that will very much depends on single cases, architecture,
>> workload, etc.
> The space is only allocated if sched_credit is responsible for the pcpu.

That makes sense.
  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
  2012-12-20  6:44   ` Juergen Gross
@ 2012-12-20 15:56   ` George Dunlap
  2012-12-20 17:12     ` Dario Faggioli
  2012-12-20 16:48   ` George Dunlap
  2012-12-20 20:21   ` George Dunlap
  3 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 15:56 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should run or, better, where they would prefer to run. While
> respecting vcpu-affinity remains mandatory, node-affinity is not that
> strict: it only expresses a preference, although honouring it will almost
> always bring a significant performance benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two-step logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

I have a lot of structural comments on this one, so I'm going to send a
couple of different mails as I'm going through it, so that we can
parallelize the discussion better. :-)

> ---
> Changes from v1:
>   * CPU mask variables moved off the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, we use a global bunch of
>     them (via per_cpu());
>   * George suggested a different load balancing logic during v1's review. I
>     think he was right, so I changed the old implementation in a way that
>     resembles exactly that. I rewrote most of this patch to introduce a
>     more sensible and effective node-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
[snip]
> +#define for_each_csched_balance_step(__step) \
> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )

Why are we starting at the top and going down?  Is there any good reason 
for it?

Every time you do anything unexpected, you add to the cognitive load of 
the person reading your code, leaving less spare processing power or 
memory for other bits of the code, and increasing (slightly) the chance 
of making a mistake.  The most natural thing would be for someone to 
expect that the steps start at 0 and go up; just reversing it means it's 
that little bit harder to understand.  When you name it "LAST", it's 
even worse, because that would definitely imply that this step is going 
to be executed last.

So why not just have this be as follows?

for(step=0; step<CSCHED_BALANCE_MAX; step++)
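
For instance, just a sketch (assuming the step constants keep roughly their
current names, with the node-affinity step simply becoming step 0 so that
it still runs first):

    #define CSCHED_BALANCE_NODE_AFFINITY    0
    #define CSCHED_BALANCE_CPU_AFFINITY     1
    #define CSCHED_BALANCE_MAX              2

    #define for_each_csched_balance_step(step) \
        for ( (step) = 0; (step) < CSCHED_BALANCE_MAX; (step)++ )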

> +
> +/*
> + * Each csched-balance step has to use its own cpumask. This function
> + * determines which one, given the step, and copies it in mask. Notice
> + * that, in case of node-affinity balancing step, it also filters out from
> + * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
> + * as we do not want to end up running a vcpu where it would like, but
> + * is not allowed to!
> + *
> + * As an optimization, if a domain does not have any node-affinity at all
> + * (namely, its node-affinity is automatically computed), not only will the
> + * computed mask reflect its vcpu-affinity, but we also return -1 to
> + * let the caller know that it can skip the step or quit the loop (if it
> + * wants).
> + */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +    {
> +        struct domain *d = vc->domain;
> +        struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
> +
> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +            return -1;

There's no optimization in having this comparison done here.  You're not 
reading something from a local variable that you've just calculated.  
But hiding this comparison inside this function, and disguising it as 
"returns -1", does increase the cognitive load on anybody trying to read 
and understand the code -- particularly as it is not really clear how the 
return value is used.

Also, when you use this value, effectively what you're doing is saying, 
"Actually, we just said we were doing the NODE_BALANCE step, but it 
turns out that the results of NODE_BALANCE and CPU_BALANCE will be the 
same, so we're just going to pretend that we've been doing the 
CPU_BALANCE step instead."  (See for example, "balance_step == 
CSCHED_BALANCE_NODE_AFFINITY && !ret" -- why the !ret in this clause?  
Because if !ret then we're not actually doing NODE_AFFINITY now, but 
CPU_AFFINITY.)  Another non-negligible chunk of cognitive load for 
someone reading the code to 1) figure out, and 2) keep in mind as she 
tries to analyze it.

I took a look at all the places which use this return value, and it 
seems like the best thing in each case would just be to have the 
*caller*, before getting into the loop, call 
cpumask_full(sdom->node_affinity_cpumask) and just skip the 
CSCHED_NODE_BALANCE step altogether if it's true.  (Example below.)

> @@ -266,67 +332,94 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
> -    cpumask_clear(&mask);
> -
>       idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
> -     * If the pcpu is idle, or there are no idlers and the new
> -     * vcpu is a higher priority than the old vcpu, run it here.
> -     *
> -     * If there are idle cpus, first try to find one suitable to run
> -     * new, so we can avoid preempting cur.  If we cannot find a
> -     * suitable idler on which to run new, run it here, but try to
> -     * find a suitable idler on which to run cur instead.
> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
> +     * no node-affinity at all is present, scratch_balance_mask reflects
> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
> +     * loop after only one step.
>        */
> -    if ( cur->pri == CSCHED_PRI_IDLE
> -         || (idlers_empty && new->pri > cur->pri) )
> +    for_each_csched_balance_step( balance_step )
>       {
> -        if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -        cpumask_set_cpu(cpu, &mask);
> -    }
> -    else if ( !idlers_empty )
> -    {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        int ret, new_idlers_empty;
> +
> +        cpumask_clear(&mask);
>
>           /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> +         * If the pcpu is idle, or there are no idlers and the new
> +         * vcpu is a higher priority than the old vcpu, run it here.
>            *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> +         * If there are idle cpus, first try to find one suitable to run
> +         * new, so we can avoid preempting cur.  If we cannot find a
> +         * suitable idler on which to run new, run it here, but try to
> +         * find a suitable idler on which to run cur instead.
>            */
> -        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
> +        if ( cur->pri == CSCHED_PRI_IDLE
> +             || (idlers_empty && new->pri > cur->pri) )
>           {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +            if ( cur->pri != CSCHED_PRI_IDLE )
> +                SCHED_STAT_CRANK(tickle_idlers_none);
>               cpumask_set_cpu(cpu, &mask);
>           }
> -        else if ( !cpumask_empty(&idle_mask) )
> +        else if ( !idlers_empty )
>           {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            ret = csched_balance_cpumask(new->vcpu, balance_step,
> +                                         &scratch_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before taking final decisions.
> +             */
> +            if ( new_idlers_empty
> +                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty && new->pri > cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu, &mask);
>               }
> -            else
> -                cpumask_or(&mask, &mask, &idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                }
> +                else
> +                    cpumask_or(&mask, &mask, &idle_mask);
> +            }
>           }
> +
> +        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
> +        if ( !cpumask_empty(&mask) || ret )
> +            break;
>       }

The whole logic here is really convoluted and hard to read.  For 
example, if cur->pri==IDLE, then you will always just break out of the loop 
after the first iteration.  In that case, why have the if() inside the 
loop to begin with?  And if idlers_empty is true but cur->pri >= 
new->pri, then you'll go through the loop two times, even though both 
times it will come up empty.  And, of course, the whole thing about the 
node affinity mask being checked inside csched_balance_cpumask(), but 
not used until the very end.

A much more straightforward way to arrange it would be:

if(cur->pri==IDLE &c &c)
{
   foo;
}
else if(!idlers_empty)
{
   if(cpumask_full(sdom->node_affinity_cpumask))
     balance_step=CSCHED_BALANCE_CPU_AFFINITY;
   else
     balance_step=CSCHED_BALANCE_NODE_AFFINITY;

   for(; balance_step <= CSCHED_BALANCE_MAX; balance_step++)
  {
  ...
  }
}

That seems a lot clearer to me -- does that make sense?

[To be continued...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 15:28             ` George Dunlap
@ 2012-12-20 16:00               ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 16:00 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On Thu, Dec 20, 2012 at 4:28 PM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
>> The space is only allocated if sched_credit is responsible for the pcpu.
>
>
> That makes sense.
>  -George
>
Ok, already done here in my queue. When I repost, you'll find it there. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
  2012-12-20  6:44   ` Juergen Gross
  2012-12-20 15:56   ` George Dunlap
@ 2012-12-20 16:48   ` George Dunlap
  2012-12-20 18:18     ` Dario Faggioli
  2012-12-20 20:21   ` George Dunlap
  3 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 16:48 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running &&
>              !__csched_vcpu_is_cache_hot(vc) &&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }

I don't get what this function is for.  The only time you call it is in 
csched_runq_steal(), immediately after calling 
__csched_vcpu_is_migrateable().  But is_migrateable() has already 
checked cpumask_test_cpu(cpu, mask).  So why do we need to check it again?

We could just replace this with cpumask_test_cpu(cpu, prv->idlers).  But 
that clause is going to be either true or false for every single 
iteration of all the loops, including the loops in 
csched_load_balance().  Wouldn't it make more sense to check it once in 
csched_load_balance(), rather than doing all those nested loops?

And in any case, looking at the caller of csched_load_balance(), it 
explicitly says to steal work if the next thing on the runqueue of cpu 
has a priority of TS_OVER.  That was chosen for a reason -- if you want 
to change that, you should change it there at the top (and make a 
justification for doing so), not deeply nested in a function like this.

Or am I completely missing something?

>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri <= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * Retrieve the correct mask for this balance_step or, if we're
> +             * dealing with node-affinity and the vcpu has no node affinity
> +             * at all, just skip this vcpu. That is needed if we want to
> +             * check if we have any node-affine work to steal first (wrt
> +             * any vcpu-affine work).
> +             */
> +            if ( csched_balance_cpumask(vc, balance_step,
> +                                        &scratch_balance_mask) )
> +                continue;

Again, I think for clarity the best thing to do here is:

if ( balance_step == NODE
     && cpumask_full(speer->sdom->node_affinity_cpumask) )
    continue;

csched_balance_cpumask();

/* Etc. */

> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu, &workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well be expected to be cheaper than across nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu, &workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -        cpumask_clear_cpu(peer_cpu, &workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }

These changes all look right.  But then, I'm a bit tired, so I'll give 
it another once-over tomorrow. :-)

[To be continued]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 15:56   ` George Dunlap
@ 2012-12-20 17:12     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 17:12 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On Thu, Dec 20, 2012 at 4:56 PM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
>> This change modifies the VCPU load balancing algorithm (for the
>> credit scheduler only), introducing a two steps logic.
>> During the first step, we use the node-affinity mask. The aim is
>> giving precedence to the CPUs where it is known to be preferable
>> for the domain to run. If that fails in finding a valid PCPU, the
>> node-affinity is just ignored and, in the second step, we fall
>> back to using cpu-affinity only.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
>
> This one has a lot of structural comments; so I'm going to send a couple of
> different mails as I'm going through it, so we can parallize the discussion
> better. :-)
>
Ok.

>> +#define for_each_csched_balance_step(__step) \
>> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
>
>
> Why are we starting at the top and going down?  Is there any good reason for
> it?
>
You're totally right: it looked like this in the very first RFC, when the
whole set of macros was different. I changed that but never reconsidered
this part; I agree that going up would be more natural.

> So why not just have this be as follows?
>
> for(step=0; step<CSCHED_BALANCE_MAX; step++)
>
Will do.

>> +static int
>> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
>> +{
>> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
>> +    {
>> +        struct domain *d = vc->domain;
>> +        struct csched_dom *sdom = CSCHED_DOM(d);
>> +
>> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
>> +
>> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
>> +            return -1;
>
>
> There's no optimization in having this comparison done here.  You're not
> reading something from a local variable that you've just calculated.  But
> hiding this comparison inside this function, and disguising it as "returns
> -1", does increase the cognitive load on anybody trying to read and
> understand the code -- particularly, as how the return value is used is not
> really clear.
>
Yes, again, I agree this is ugly. The previous version was better from the
caller's point of view, but had other downsides (IIRC it was messing with
the loop control variable directly), so we agreed on this '-1' thing.

> Also, when you use this value, effectively what you're doing is saying,
> "Actually, we just said we were doing the NODE_BALANCE step, but it turns
> out that the results of NODE_BALANCE and CPU_BALANCE will be the same, so
> we're just going to pretend that we've been doing the CPU_BALANCE step
> instead."  (See for example, "balance_step == CSCHED_BALANCE_NODE_AFFINITY
> && !ret" -- why the !ret in this clause?  Because if !ret then we're not
> actually doing NODE_AFFINITY now, but CPU_AFFINITY.)  Another non-negligible
> chunk of cognitive load for someone reading the code to 1) figure out, and
> 2) keep in mind as she tries to analyze it.
>
Totally agreed. :-)

> I took a look at all the places which use this return value, and it seems
> like the best thing in each case would just be to have the *caller*, before
> getting into the loop, call cpumask_full(sdom->node_affinity_cpumask) and
> just skip the CSCHED_NODE_BALANCE step altogether if it's true.  (Example
> below.)
>
I will think about it and see if I can find a nice solution for making that
happen (the point being that I'm not sure I like exposing and disseminating
that cpumask_full... thing that much, but I guess I can hide it behind some
more macro-ing and stuff, along the lines of the sketch below).
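
Something like this, maybe (the helper name is hypothetical, it is just to
show what I mean by hiding the check):

    /*
     * True if the domain has an actual node-affinity, i.e., one that is
     * not just the automatically computed "all nodes" mask.
     */
    #define __vcpu_has_node_affinity(vc) \
        ( !cpumask_full(CSCHED_DOM((vc)->domain)->node_affinity_cpumask) )

so that callers would do something like
'if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY &&
!__vcpu_has_node_affinity(vc) ) continue;' without ever seeing the
cpumask_full() directly.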

>> @@ -266,67 +332,94 @@ static inline void
>>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>>       cpumask_t mask, idle_mask;
>> -    int idlers_empty;
>> +    int balance_step, idlers_empty;
>>
>>       ASSERT(cur);
>> -    cpumask_clear(&mask);
>> -
>>       idlers_empty = cpumask_empty(prv->idlers);
>>
>>       /*
>> -     * If the pcpu is idle, or there are no idlers and the new
>> -     * vcpu is a higher priority than the old vcpu, run it here.
>> -     *
>> -     * If there are idle cpus, first try to find one suitable to run
>> -     * new, so we can avoid preempting cur.  If we cannot find a
>> -     * suitable idler on which to run new, run it here, but try to
>> -     * find a suitable idler on which to run cur instead.
>> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
>> +     * no node-affinity at all is present, scratch_balance_mask reflects
>> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
>> +     * loop after only one step.
>>        */
>> [snip]
>
> The whole logic here is really convoluted and hard to read.  For example, if
> cur->pri==IDLE, then you will always just break of the loop after the first
> iteration.  In that case, why have the if() inside the loop to begin with?
> And if idlers_empty is true but cur->pri >= new->pri, then you'll go through
> the loop two times, even though both times it will come up empty.  And, of
> course, the whole thing about the node affinity mask being checked inside
> csched_balance_cpumask(), but not used until the very end.
>
I fear it looks convoluted and complex because, well, it _is_ quite complex!
However, I see your point, and there definitely is a chance that, complex as
it is, it ends up looking even more complex than it needs to. :-)

> A much more straighforward way to arrange it would be:
>
> if(cur->pri=IDLE &c &c)
> {
>   foo;
> }
> else if(!idlers_empty)
> {
>   if(cpumask_full(sdom->node_affinity_cpumask)
>     balance_step=CSCHED_BALANCE_CPU_AFFINITY;
>   else
>     balance_step=CSCHED_BALANCE_NODE_AFFINITY;
>
Yes, I think this can be taken out. I'll give this some more thought and
see if I can simplify the flow.

> [To be continued...]
>
Thanks for now :-)
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 16:48   ` George Dunlap
@ 2012-12-20 18:18     ` Dario Faggioli
  2012-12-21 14:29       ` George Dunlap
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 18:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On Thu, Dec 20, 2012 at 5:48 PM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
> On 19/12/12 19:07, Dario Faggioli wrote:
>> +static inline int
>> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
>> +{
>> +    /*
>> +     * Consent to migration if cpu is one of the idlers in the VCPU's
>> +     * affinity mask. In fact, if that is not the case, it just means it
>> +     * was some other CPU that was tickled and should hence come and pick
>> +     * VCPU up. Migrating it to cpu would only make things worse.
>> +     */
>> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>>   }
>
> And in any case, looking at the caller of csched_load_balance(), it
> explicitly says to steal work if the next thing on the runqueue of cpu has a
> priority of TS_OVER.  That was chosen for a reason -- if you want to change
> that, you should change it there at the top (and make a justification for
> doing so), not deeply nested in a function like this.
>
> Or am I completely missing something?
>
No, you're right. While trying to solve a nasty issue I was seeing, I
overlooked that I was changing the underlying logic at that point... Thanks!

What I want to avoid is the following: a vcpu wakes up on the busy pcpu Y.
As a consequence, the idle pcpu X is tickled. Then, for some unrelated
reason, pcpu Z reschedules and, as it would go idle too, it looks around
for any vcpu to steal, finds one in Y's runqueue and grabs it. Afterward,
when X gets the IPI and schedules, it just does not find anyone to run and
goes back to idling.

Now, suppose the vcpu has X, but *not* Z, in its node-affinity (while it
has a full vcpu-affinity, i.e., it can run everywhere). In this case, a
vcpu that could have run on a pcpu in its node-affinity ends up executing
outside of it. That happens because the NODE_BALANCE_STEP in
csched_load_balance(), when called by Z, won't find anything suitable to
steal (provided there actually isn't any vcpu waiting in any runqueue with
node-affinity with Z), while the CPU_BALANCE_STEP will find our vcpu. :-(

So, what I wanted is something that could tell me whether the pcpu which is
stealing work is the one that has actually been tickled to do so. I was
using the pcpu's idleness as a (cheap and easy to check) indication of
that, but I now see this has side effects that I did not want to cause in
the first place.

Sorry for that, I probably spent so much time buried, as you were saying,
in the various nested loops and calls, that I lost the context a little
bit! :-P

Ok, I think the problem I was describing is real, and I've seen it
happening and causing performance degradation. However, as a good solution
is going to be more complex than I thought, I'd better repost without this
function and deal with it in a future, separate patch (after having figured
out the best way of doing so). Is that fine with you?

> These changes all look right.
>
At least. :-)

> But then, I'm a bit tired, so I'll give it
> another once-over tomorrow. :-)
>
I can imagine, looking forward to your next comments.

Thanks a lot and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
                     ` (2 preceding siblings ...)
  2012-12-20 16:48   ` George Dunlap
@ 2012-12-20 20:21   ` George Dunlap
  2012-12-21  0:18     ` Dario Faggioli
  3 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 20:21 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }
>
>   static int
> @@ -493,85 +599,98 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>       int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor, &cpus);
> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> +        cpumask_and(&cpus, &cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu, &idlers);
> -    cpumask_and(&cpus, &cpus, &idlers);
> -    cpumask_clear_cpu(cpu, &cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor, &cpus);
> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu, &idlers);
> +        cpumask_and(&cpus, &cpus, &idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> +            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm -- and it is particularly
well hidden in this kind of "indent a big section that's almost exactly
the same" change. I think it at least needs to be called out in the
changelog message, and perhaps put in a separate patch.

Can you comment on to why you think it's necessary?  Was there a 
particular problem you were seeing?

> +        cpumask_clear_cpu(cpu, &cpus);
>
> -        nxt = cpumask_cycle(cpu, &cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu, &cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu > weight_nxt :
> +                 weight_cpu * migrate_factor < weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> +                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> +            break;

Right -- OK, I think everything looks good here, except the "return -1 
from csched_balance_cpumask" thing.  I think it would be better if we 
explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped 
the NODE step if that's the case.

Also -- and sorry to have to ask this kind of thing, but after sorting 
through the placement algorithm my head hurts -- under what 
circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this 
point?  It seems like the only possibility would be if:
( (vc->processor was not in the original &cpus [1])
   || !IS_RUNQ_IDLE(vc->processor) )
&& (there are no idlers in the original &cpus)

Which I suppose probably matches the time when we want to move on from 
looking at NODE affinity and look for CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or 
if we're currently running outside our node affinity and we're doing the 
NODE step.

OK -- I think I've convinced myself that this is OK as well (apart from 
the hidden check).  I'll come back to look at your response to the load 
balancing thing tomorrow.

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 20:21   ` George Dunlap
@ 2012-12-21  0:18     ` Dario Faggioli
  2012-12-21 14:56       ` George Dunlap
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21  0:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote: 
> > -    /*
> > -     * Pick from online CPUs in VCPU's affinity mask, giving a
> > -     * preference to its current processor if it's in there.
> > -     */
> >       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> > -    cpumask_and(&cpus, online, vc->cpu_affinity);
> > -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> > -            ? vc->processor
> > -            : cpumask_cycle(vc->processor, &cpus);
> > -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > +    for_each_csched_balance_step( balance_step )
> > +    {
> > +        /* Pick an online CPU from the proper affinity mask */
> > +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> > +        cpumask_and(&cpus, &cpus, online);
> >
> > -    /*
> > -     * Try to find an idle processor within the above constraints.
> > -     * 
> > -     * In multi-core and multi-threaded CPUs, not all idle execution
> > -     * vehicles are equal!
> > -     * 
> > -     * We give preference to the idle execution vehicle with the most
> > -     * idling neighbours in its grouping. This distributes work across
> > -     * distinct cores first and guarantees we don't do something stupid
> > -     * like run two VCPUs on co-hyperthreads while there are idle cores
> > -     * or sockets.
> > -     * 
> > -     * Notice that, when computing the "idleness" of cpu, we may want to
> > -     * discount vc. That is, iff vc is the currently running and the only
> > -     * runnable vcpu on cpu, we add cpu to the idlers.
> > -     */ 
> > -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> > -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> > -        cpumask_set_cpu(cpu, &idlers);
> > -    cpumask_and(&cpus, &cpus, &idlers);
> > -    cpumask_clear_cpu(cpu, &cpus);
> > +        /* If present, prefer vc's current processor */
> > +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> > +                ? vc->processor
> > +                : cpumask_cycle(vc->processor, &cpus); 
> > +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > 
> > -    while ( !cpumask_empty(&cpus) )
> > -    {
> > -        cpumask_t cpu_idlers;
> > -        cpumask_t nxt_idlers;
> > -        int nxt, weight_cpu, weight_nxt;
> > -        int migrate_factor;
> > +        /*
> > +         * Try to find an idle processor within the above constraints.
> > +         *
> > +         * In multi-core and multi-threaded CPUs, not all idle execution
> > +         * vehicles are equal!
> > +         *
> > +         * We give preference to the idle execution vehicle with the most
> > +         * idling neighbours in its grouping. This distributes work across
> > +         * distinct cores first and guarantees we don't do something stupid
> > +         * like run two VCPUs on co-hyperthreads while there are idle cores
> > +         * or sockets.
> > +         *
> > +         * Notice that, when computing the "idleness" of cpu, we may want to
> > +         * discount vc. That is, iff vc is the currently running and the only
> > +         * runnable vcpu on cpu, we add cpu to the idlers.
> > +         */
> > +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> > +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> > +            cpumask_set_cpu(cpu, &idlers);
> > +        cpumask_and(&cpus, &cpus, &idlers);
> > +        /* If there are idlers and cpu is still not among them, pick one */
> > +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> > +            cpu = cpumask_cycle(cpu, &cpus);
> 
> This seems to be an addition to the algorithm -- particularly hidden in 
> this kind of "indent a big section that's almost exactly the same", I 
> think this at least needs to be called out in the changelog message, 
> perhaps put in a separate patch.
> 
You're right, it is an addition, although a minor enough one (at least in
terms of the amount of code; the effect of not having it was pretty
bad! :-P) that I thought it could "hide" here. :-)

But I guess I can put it in a separate patch.

> Can you comment on to why you think it's necessary?  Was there a 
> particular problem you were seeing?
> 
Yep. Suppose vc is, for some reason, running on a pcpu which is outside
its node-affinity, but that now some pcpus within vc's node-affinity have
become idle. What we would like is for vc to start running there as soon
as possible, so we expect this call to _csched_pick_cpu() to determine
that.

What happens is that we do not use vc->processor (as it is outside of vc's
node-affinity), and 'cpu' gets set to cpumask_cycle(vc->processor, &cpus),
where &cpus is the result of cpumask_and(&cpus, balance_mask, online).
Let's also suppose that 'cpu' now points to a busy thread, but one with an
idle sibling, and that there aren't any other idle pcpus (either cores or
threads). Now, the algorithm evaluates the idleness of 'cpu', compares it
with the idleness of all the other pcpus, and won't find anything better
than 'cpu' itself, as all the other pcpus except its sibling thread are
busy, while its sibling thread has the very same idleness it has (2
threads, 1 idle, 1 busy).

The net effect is vc being moved to 'cpu', which is busy, while it could
have been moved to 'cpu''s sibling thread, which is indeed idle.

The if() I added fixes this by making sure that the reference cpu is an
idle one (if that is possible).

I hope I've explained it correctly, and sorry if it is a little bit tricky,
especially to explain like this (although, believe me, it was tricky to
hunt it down too! :-P). I've seen that happening, and I'm almost sure I
kept a trace somewhere, so let me know if you want to see the "smoking
gun". :-)

> > -        weight_cpu = cpumask_weight(&cpu_idlers);
> > -        weight_nxt = cpumask_weight(&nxt_idlers);
> > -        /* smt_power_savings: consolidate work rather than spreading it */
> > -        if ( sched_smt_power_savings ?
> > -             weight_cpu > weight_nxt :
> > -             weight_cpu * migrate_factor < weight_nxt )
> > -        {
> > -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> > -            spc = CSCHED_PCPU(nxt);
> > -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> > -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> > -        }
> > -        else
> > -        {
> > -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> > -        }
> > +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> > +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> > +            break;
> 
> Right -- OK, I think everything looks good here, except the "return -1 
> from csched_balance_cpumask" thing.  I think it would be better if we 
> explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped 
> the NODE step if that's the case.
> 
Yep. Will do this, or something along these lines, all over the place.
Thanks.

> Also -- and sorry to have to ask this kind of thing, but after sorting 
> through the placement algorithm my head hurts -- under what 
> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this 
> point?  It seems like the only possibility would be if:
> ( (vc->processor was not in the original &cpus [1])
>    || !IS_RUNQ_IDLE(vc->processor) )
> && (there are no idlers in the original &cpus)
> 
> Which I suppose probably matches the time when we want to move on from 
> looking at NODE affinity and look for CPU affinity.
> 
> [1] This could happen either if the vcpu/node affinity has changed, or 
> if we're currently running outside our node affinity and we're doing the 
> NODE step.
> 
> OK -- I think I've convinced myself that this is OK as well (apart from 
> the hidden check).  I'll come back to look at your response to the load 
> balancing thing tomorrow.
> 
Mmm... Sorry, not sure I follow: does this mean that you figured out
and understood why I need that 'if(){break;}'? It sounds like it, but I
can't be sure (my head hurts a bit too, after having written that
thing! :-D).

If not, consider that, for example, it can be false if all the pcpus in
the mask for this step are busy: if this step is the node-affinity step,
I do _not_ want to exit the balancing loop then, so that the other pcpus
in the vcpu-affinity also get checked. OTOH, if I don't put a break
somewhere, then even if an idle pcpu is found during the node-affinity
balancing step, the loop will just go on and check all the other pcpus in
the vcpu-affinity, which would defeat the purpose of doing the balancing
here in the first place. Does that make sense?
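
If it helps, the control flow I have in mind boils down to something like this
toy model (standalone C; the step names, the bitmask representation and the
example numbers are all invented, only the break/fall-through logic mirrors
the patch):

#include <stdint.h>
#include <stdio.h>

enum { STEP_NODE_AFFINITY, STEP_CPU_AFFINITY, STEP_LAST = STEP_CPU_AFFINITY };

int main(void)
{
    uint64_t node_aff = 0x0f;  /* pcpus covered by the domain's node-affinity   */
    uint64_t cpu_aff  = 0xff;  /* pcpus covered by the vcpu-affinity (all)      */
    uint64_t idlers   = 0x30;  /* idle pcpus 4 and 5: outside the node-affinity */
    int cpu = -1;

    for (int step = STEP_NODE_AFFINITY; step <= STEP_LAST; step++) {
        uint64_t cpus = (step == STEP_NODE_AFFINITY) ? (node_aff & cpu_aff)
                                                     : cpu_aff;
        uint64_t idle_cpus = cpus & idlers;

        if (idle_cpus != 0) {
            /* An idle pcpu exists within this step's mask: stop here, there
             * is no point in going on and widening the search.               */
            cpu = __builtin_ctzll(idle_cpus);
            break;
        }
        /* Nothing idle in the node-affinity mask: fall through and retry
         * with the plain vcpu-affinity.                                      */
    }

    printf("picked pcpu %d (node-affinity step had nothing idle)\n", cpu);
    return 0;
}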

Thanks again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 18:18     ` Dario Faggioli
@ 2012-12-21 14:29       ` George Dunlap
  2012-12-21 16:07         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 14:29 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 20/12/12 18:18, Dario Faggioli wrote:
> On Thu, Dec 20, 2012 at 5:48 PM, George Dunlap
> <george.dunlap@eu.citrix.com> wrote:
>> And in any case, looking at the caller of csched_load_balance(), it
>> explicitly says to steal work if the next thing on the runqueue of cpu has a
>> priority of TS_OVER.  That was chosen for a reason -- if you want to change
>> that, you should change it there at the top (and make a justification for
>> doing so), not deeply nested in a function like this.
>>
>> Or am I completely missing something?
>>
> No, you're right. While trying to solve a nasty issue I was seeing, I overlooked
> the fact that I was changing the underlying logic at that point... Thanks!
>
> What I want to avoid is the following: a vcpu wakes up on the busy pcpu Y. As
> a consequence, the idle pcpu X is tickled. Then, for some unrelated reason, pcpu
> Z reschedules and, as it would go idle too, it looks around for any vcpu to
> steal, finds one in Y's runqueue and grabs it. Afterwards, when X gets the IPI
> and schedules, it just does not find anyone to run and goes back to idling.
>
> Now, suppose the vcpu has X, but *not* Z, in its node-affinity (while it has
> a full vcpu-affinity, i.e., it can run everywhere). In this case, a vcpu that
> could have run on a pcpu in its node-affinity ends up executing outside of it.
> That happens because the NODE_BALANCE_STEP in csched_load_balance(), when
> called by Z, won't find anything suitable to steal (provided there actually
> isn't any vcpu waiting in any runqueue with node-affinity with Z), while the
> CPU_BALANCE_STEP will find our vcpu. :-(
>
> So, what I wanted is something that could tell me whether the pcpu which is
> stealing work is the one that has actually been tickled to do so. I was then
> using the pcpu's idleness as a (cheap and easy to check) indication of that,
> but I now see this is having side effects that I did not want to cause in the
> first place.
>
> Sorry for that, I probably spent so much time buried, as you were saying, in
> the various nested loops and calls, that I lost the context a little bit! :-P

OK, that makes sense -- I figured it was something like that.  Don't 
feel too bad about missing that connection -- we're all fairly blind to 
our own code, and I only caught it because I was trying to figure out 
what was going on.  That's why we do patch review. :-)

Honestly, the whole "steal work" idea seemed a bit backwards to begin 
with, but now that we're not just dealing with "possible" and "not 
possible", but with "better" and "worse", the work-stealing method of 
load balancing sort of falls down.

It does make sense to do the load-balancing work on idle cpus rather 
than already-busy cpus; but I wonder if what should happen instead is 
that before idling, a pcpu chooses a "busy" pcpu and does a global load 
balancing for it -- i.e., pcpu 1 will look at pcpu 5's runqueue, and 
consider moving away the vcpus on the runqueue not just to itself but to 
any available cpu.

That way, in your example, Z might wake up, look at Y's runqueue, and 
say, "This would probably run well on X -- I'll migrate it there."

But that's kind of a half-baked idea at this point.

> Ok, I think the problem I was describing is real, and I've seen it happening
> and causing performance degradation. However, as I think a good solution is
> going to be more complex than I thought, I'd better repost without this
> function and deal with it in a future separate patch (after having figured
> out the best way of doing so). Is that fine with you?

Yes, that's fine.  Thanks, Dario.

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-21  0:18     ` Dario Faggioli
@ 2012-12-21 14:56       ` George Dunlap
  2012-12-21 16:13         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 14:56 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 21/12/12 00:18, Dario Faggioli wrote:
> On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote:
>>> -    /*
>>> -     * Pick from online CPUs in VCPU's affinity mask, giving a
>>> -     * preference to its current processor if it's in there.
>>> -     */
>>>        online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>>> -    cpumask_and(&cpus, online, vc->cpu_affinity);
>>> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> -            ? vc->processor
>>> -            : cpumask_cycle(vc->processor, &cpus);
>>> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>> +    for_each_csched_balance_step( balance_step )
>>> +    {
>>> +        /* Pick an online CPU from the proper affinity mask */
>>> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
>>> +        cpumask_and(&cpus, &cpus, online);
>>>
>>> -    /*
>>> -     * Try to find an idle processor within the above constraints.
>>> -     *
>>> -     * In multi-core and multi-threaded CPUs, not all idle execution
>>> -     * vehicles are equal!
>>> -     *
>>> -     * We give preference to the idle execution vehicle with the most
>>> -     * idling neighbours in its grouping. This distributes work across
>>> -     * distinct cores first and guarantees we don't do something stupid
>>> -     * like run two VCPUs on co-hyperthreads while there are idle cores
>>> -     * or sockets.
>>> -     *
>>> -     * Notice that, when computing the "idleness" of cpu, we may want to
>>> -     * discount vc. That is, iff vc is the currently running and the only
>>> -     * runnable vcpu on cpu, we add cpu to the idlers.
>>> -     */
>>> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> -        cpumask_set_cpu(cpu, &idlers);
>>> -    cpumask_and(&cpus, &cpus, &idlers);
>>> -    cpumask_clear_cpu(cpu, &cpus);
>>> +        /* If present, prefer vc's current processor */
>>> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> +                ? vc->processor
>>> +                : cpumask_cycle(vc->processor, &cpus);
>>> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>>
>>> -    while ( !cpumask_empty(&cpus) )
>>> -    {
>>> -        cpumask_t cpu_idlers;
>>> -        cpumask_t nxt_idlers;
>>> -        int nxt, weight_cpu, weight_nxt;
>>> -        int migrate_factor;
>>> +        /*
>>> +         * Try to find an idle processor within the above constraints.
>>> +         *
>>> +         * In multi-core and multi-threaded CPUs, not all idle execution
>>> +         * vehicles are equal!
>>> +         *
>>> +         * We give preference to the idle execution vehicle with the most
>>> +         * idling neighbours in its grouping. This distributes work across
>>> +         * distinct cores first and guarantees we don't do something stupid
>>> +         * like run two VCPUs on co-hyperthreads while there are idle cores
>>> +         * or sockets.
>>> +         *
>>> +         * Notice that, when computing the "idleness" of cpu, we may want to
>>> +         * discount vc. That is, iff vc is the currently running and the only
>>> +         * runnable vcpu on cpu, we add cpu to the idlers.
>>> +         */
>>> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> +            cpumask_set_cpu(cpu, &idlers);
>>> +        cpumask_and(&cpus, &cpus, &idlers);
>>> +        /* If there are idlers and cpu is still not among them, pick one */
>>> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
>>> +            cpu = cpumask_cycle(cpu, &cpus);
>> This seems to be an addition to the algorithm -- particularly hidden in
>> this kind of "indent a big section that's almost exactly the same", I
>> think this at least needs to be called out in the changelog message,
>> perhaps put in a separate patch.
>>
> You're right, it is an addition, although a minor enough one (at least
> from the amount-of-code point of view; the effect of not having it was
> pretty bad! :-P) that I thought it could "hide" here. :-)
>
> But I guess I can put it in a separate patch.
>
>> Can you comment on to why you think it's necessary?  Was there a
>> particular problem you were seeing?
>>
> Yep. Suppose vc is for some reason running on a pcpu which is outside
> its node-affinity, but that now some pcpus within vc's node-affinity
> have become idle. What we would like is for vc to start running there as soon
> as possible, so we expect this call to _csched_pick_cpu() to determine
> that.
>
> What happens is that we do not use vc->processor (as it is outside of vc's
> node-affinity) and 'cpu' gets set to cpumask_cycle(vc->processor, &cpus),
> where &cpus is the result of cpumask_and(&cpus, balance_mask, online).
> This means 'cpu' ends up being some pcpu within vc's node-affinity.
> Let's also suppose that 'cpu' now points to a busy
> thread but with an idle sibling, and that there aren't any other idle
> pcpus (either core or threads). Now, the algorithm evaluates the
> idleness of 'cpu', and compares it with the idleness of all the other
> pcpus, and it won't find anything better than 'cpu' itself, as all the
> other pcpus except its sibling thread are busy, while its sibling thread
> has the very same idleness it has (2 threads, 1 idle 1 busy).
>
> The net effect is that vc gets moved to 'cpu', which is busy, while it
> could have been moved to 'cpu''s sibling thread, which is indeed idle.
>
> The if() I added fixes this by making sure that the reference cpu is an
> idle one (if that is possible).
>
> I hope I've explained it correctly, and sorry if it is a little bit
> tricky, especially to explain like this (although, believe me, it was
> tricky to hunt it out too! :-P). I've seen that happening and I'm almost
> sure I kept a trace somewhere, so let me know if you want to see the
> "smoking gun". :-)

No, the change looks quite plausible.  I guess it's not obvious that the 
balancing code will never migrate from one thread to another thread.  
(That whole algorithm could do with some commenting -- I may submit a 
patch once this series is in.)

I'm really glad you've had the opportunity to take a close look at these 
kinds of things.
>> Also -- and sorry to have to ask this kind of thing, but after sorting
>> through the placement algorithm my head hurts -- under what
>> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this
>> point?  It seems like the only possibility would be if:
>> ( (vc->processor was not in the original &cpus [1])
>>     || !IS_RUNQ_IDLE(vc->processor) )
>> && (there are no idlers in the original &cpus)
>>
>> Which I suppose probably matches the time when we want to move on from
>> looking at NODE affinity and look for CPU affinity.
>>
>> [1] This could happen either if the vcpu/node affinity has changed, or
>> if we're currently running outside our node affinity and we're doing the
>> NODE step.
>>
>> OK -- I think I've convinced myself that this is OK as well (apart from
>> the hidden check).  I'll come back to look at your response to the load
>> balancing thing tomorrow.
>>
> Mmm... Sorry, not sure I follow: does this mean that you figured out
> and understood why I need that 'if(){break;}'? It sounds like it, but I
> can't be sure (my head hurts a bit too, after having written that
> thing! :-D).

Well, I always understood why we needed the break -- the purpose is to 
avoid the second run through when it's not necessary.  What I was doing, 
in a sort of "thinking out loud" fashion, was seeing under what conditions 
that break might actually happen.
vcpu_should_migrate(), it might have turned out to be redundant, or to 
have missed some cases.  But I think it doesn't, so it's fine. :-)

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
@ 2012-12-21 15:17   ` George Dunlap
  2012-12-21 16:17     ` Dario Faggioli
  2013-01-03 16:05     ` Daniel De Graaf
  0 siblings, 2 replies; 57+ messages in thread
From: George Dunlap @ 2012-12-21 15:17 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> Make it possible to pass the node-affinity of a domain to the hypervisor
> from the upper layers, instead of always being computed automatically.
>
> Note that this also required generalizing the Flask hooks for setting
> and getting the affinity, so that they now deal with both vcpu and
> node affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

I can't comment on the XSM stuff -- is any part of the "getvcpuaffinity" 
stuff for XSM a public interface that needs to be backwards-compatible?  
I.e., is s/vcpu//; OK from an interface point of view?

WRT everything else:
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

> ---
> Changes from v1:
>   * added the missing dummy hook for nodeaffinity;
>   * let the permission renaming affect flask policies too.
>
> diff --git a/tools/flask/policy/policy/flask/access_vectors b/tools/flask/policy/policy/flask/access_vectors
> --- a/tools/flask/policy/policy/flask/access_vectors
> +++ b/tools/flask/policy/policy/flask/access_vectors
> @@ -47,8 +47,8 @@ class domain
>       transition
>       max_vcpus
>       destroy
> -    setvcpuaffinity
> -       getvcpuaffinity
> +    setaffinity
> +       getaffinity
>          scheduler
>          getdomaininfo
>          getvcpuinfo
> diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls
> --- a/tools/flask/policy/policy/mls
> +++ b/tools/flask/policy/policy/mls
> @@ -70,11 +70,11 @@ mlsconstrain domain transition
>          (( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv)));
>
>   # all the domain "read" ops
> -mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
> +mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
>          ((l1 dom l2) or (t1 == mls_priv));
>
>   # all the domain "write" ops
> -mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
> +mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
>          ((l1 eq l2) or (t1 == mls_priv));
>
>   # This is incomplete - similar constraints must be written for all classes
> diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if
> --- a/tools/flask/policy/policy/modules/xen/xen.if
> +++ b/tools/flask/policy/policy/modules/xen/xen.if
> @@ -55,9 +55,9 @@ define(`create_domain_build_label', `
>   # manage_domain(priv, target)
>   #   Allow managing a running domain
>   define(`manage_domain', `
> -       allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity
> +       allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
>                          getaddrsize pause unpause trigger shutdown destroy
> -                       setvcpuaffinity setdomainmaxmem };
> +                       setaffinity setdomainmaxmem };
>   ')
>
>   # migrate_domain_out(priv, target)
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -222,6 +222,7 @@ struct domain *domain_create(
>
>       spin_lock_init(&d->node_affinity_lock);
>       d->node_affinity = NODE_MASK_ALL;
> +    d->auto_node_affinity = 1;
>
>       spin_lock_init(&d->shutdown_lock);
>       d->shutdown_code = -1;
> @@ -362,11 +363,26 @@ void domain_update_node_affinity(struct
>           cpumask_or(cpumask, cpumask, online_affinity);
>       }
>
> -    for_each_online_node ( node )
> -        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -            node_set(node, nodemask);
> +    if ( d->auto_node_affinity )
> +    {
> +        /* Node-affinity is automatically computed from all vcpu-affinities */
> +        for_each_online_node ( node )
> +            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> +                node_set(node, nodemask);
>
> -    d->node_affinity = nodemask;
> +        d->node_affinity = nodemask;
> +    }
> +    else
> +    {
> +        /* Node-affinity is provided by someone else, just filter out cpus
> +         * that are either offline or not in the affinity of any vcpus. */
> +        for_each_node_mask ( node, d->node_affinity )
> +            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
> +                node_clear(node, d->node_affinity);
> +    }
> +
> +    sched_set_node_affinity(d, &d->node_affinity);
> +
>       spin_unlock(&d->node_affinity_lock);
>
>       free_cpumask_var(online_affinity);
> @@ -374,6 +390,36 @@ void domain_update_node_affinity(struct
>   }
>
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
> +{
> +    /* Being affine with no nodes is just wrong */
> +    if ( nodes_empty(*affinity) )
> +        return -EINVAL;
> +
> +    spin_lock(&d->node_affinity_lock);
> +
> +    /*
> +     * Being/becoming explicitly affine to all nodes is not particularly
> +     * useful. Let's take it as the `reset node affinity` command.
> +     */
> +    if ( nodes_full(*affinity) )
> +    {
> +        d->auto_node_affinity = 1;
> +        goto out;
> +    }
> +
> +    d->auto_node_affinity = 0;
> +    d->node_affinity = *affinity;
> +
> +out:
> +    spin_unlock(&d->node_affinity_lock);
> +
> +    domain_update_node_affinity(d);
> +
> +    return 0;
> +}
> +
> +
>   struct domain *get_domain_by_id(domid_t dom)
>   {
>       struct domain *d;
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -609,6 +609,40 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
>       }
>       break;
>
> +    case XEN_DOMCTL_setnodeaffinity:
> +    case XEN_DOMCTL_getnodeaffinity:
> +    {
> +        domid_t dom = op->domain;
> +        struct domain *d = rcu_lock_domain_by_id(dom);
> +
> +        ret = -ESRCH;
> +        if ( d == NULL )
> +            break;
> +
> +        ret = xsm_nodeaffinity(op->cmd, d);
> +        if ( ret )
> +            goto nodeaffinity_out;
> +
> +        if ( op->cmd == XEN_DOMCTL_setnodeaffinity )
> +        {
> +            nodemask_t new_affinity;
> +
> +            ret = xenctl_bitmap_to_nodemask(&new_affinity,
> +                                            &op->u.nodeaffinity.nodemap);
> +            if ( !ret )
> +                ret = domain_set_node_affinity(d, &new_affinity);
> +        }
> +        else
> +        {
> +            ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
> +                                            &d->node_affinity);
> +        }
> +
> +    nodeaffinity_out:
> +        rcu_unlock_domain(d);
> +    }
> +    break;
> +
>       case XEN_DOMCTL_setvcpuaffinity:
>       case XEN_DOMCTL_getvcpuaffinity:
>       {
> diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
> --- a/xen/common/keyhandler.c
> +++ b/xen/common/keyhandler.c
> @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int
>       *set++ = '\0';
>   }
>
> +static void nodeset_print(char *set, int size, const nodemask_t *mask)
> +{
> +    *set++ = '[';
> +    set += nodelist_scnprintf(set, size-2, mask);
> +    *set++ = ']';
> +    *set++ = '\0';
> +}
> +
>   static void periodic_timer_print(char *str, int size, uint64_t period)
>   {
>       if ( period == 0 )
> @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
>
>           dump_pageframe_info(d);
>
> +        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
> +        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
> +
>           printk("VCPU information and callbacks for domain %u:\n",
>                  d->domain_id);
>           for_each_vcpu ( d, v )
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -269,6 +269,33 @@ static inline void
>       list_del_init(&svc->runq_elem);
>   }
>
> +/*
> + * Translates node-affinity mask into a cpumask, so that we can use it during
> + * actual scheduling. That of course will contain all the cpus from all the
> + * set nodes in the original node-affinity mask.
> + *
> + * Note that any serialization needed to access mask safely is complete
> + * responsibility of the caller of this function/hook.
> + */
> +static void csched_set_node_affinity(
> +    const struct scheduler *ops,
> +    struct domain *d,
> +    nodemask_t *mask)
> +{
> +    struct csched_dom *sdom;
> +    int node;
> +
> +    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> +    if ( unlikely(is_idle_domain(d)) )
> +        return;
> +
> +    sdom = CSCHED_DOM(d);
> +    cpumask_clear(sdom->node_affinity_cpumask);
> +    for_each_node_mask( node, *mask )
> +        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> +                   &node_to_cpumask(node));
> +}
> +
>   #define for_each_csched_balance_step(__step) \
>       for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
>
> @@ -296,7 +323,8 @@ csched_balance_cpumask(const struct vcpu
>
>           cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
>
> -        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +        if ( cpumask_full(sdom->node_affinity_cpumask) ||
> +             d->auto_node_affinity == 1 )
>               return -1;
>       }
>       else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> @@ -1896,6 +1924,8 @@ const struct scheduler sched_credit_def
>       .adjust         = csched_dom_cntl,
>       .adjust_global  = csched_sys_cntl,
>
> +    .set_node_affinity  = csched_set_node_affinity,
> +
>       .pick_cpu       = csched_cpu_pick,
>       .do_schedule    = csched_schedule,
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c
>       return ret;
>   }
>
> +void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
> +{
> +    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
> +}
> +
>   int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
>   {
>       cpumask_t online_affinity;
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
>
>
> +/* Get/set the NUMA node(s) with which the guest has affinity. */
> +/* XEN_DOMCTL_setnodeaffinity */
> +/* XEN_DOMCTL_getnodeaffinity */
> +struct xen_domctl_nodeaffinity {
> +    struct xenctl_bitmap nodemap;/* IN */
> +};
> +typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t);
> +
> +
>   /* Get/set which physical cpus a vcpu can execute on. */
>   /* XEN_DOMCTL_setvcpuaffinity */
>   /* XEN_DOMCTL_getvcpuaffinity */
> @@ -907,6 +917,8 @@ struct xen_domctl {
>   #define XEN_DOMCTL_audit_p2m                     65
>   #define XEN_DOMCTL_set_virq_handler              66
>   #define XEN_DOMCTL_set_broken_page_p2m           67
> +#define XEN_DOMCTL_setnodeaffinity               68
> +#define XEN_DOMCTL_getnodeaffinity               69
>   #define XEN_DOMCTL_gdbsx_guestmemio            1000
>   #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>   #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
> @@ -920,6 +932,7 @@ struct xen_domctl {
>           struct xen_domctl_getpageframeinfo  getpageframeinfo;
>           struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
>           struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
> +        struct xen_domctl_nodeaffinity      nodeaffinity;
>           struct xen_domctl_vcpuaffinity      vcpuaffinity;
>           struct xen_domctl_shadow_op         shadow_op;
>           struct xen_domctl_max_mem           max_mem;
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -8,8 +8,9 @@
>    * See detailed comments in the file linux/bitmap.h describing the
>    * data type on which these nodemasks are based.
>    *
> - * For details of nodemask_scnprintf() and nodemask_parse(),
> - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
> + * For details of nodemask_scnprintf(), nodelist_scnprintf() and
> + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
> + * in lib/bitmap.c.
>    *
>    * The available nodemask operations are:
>    *
> @@ -50,6 +51,7 @@
>    * unsigned long *nodes_addr(mask)     Array of unsigned long's in mask
>    *
>    * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
> + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
>    * int nodemask_parse(ubuf, ulen, mask)        Parse ascii string as nodemask
>    *
>    * for_each_node_mask(node, mask)      for-loop node over mask
> @@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co
>
>   #define nodes_addr(src) ((src).bits)
>
> +#define nodelist_scnprintf(buf, len, src) \
> +                       __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
> +static inline int __nodelist_scnprintf(char *buf, int len,
> +                                       const nodemask_t *srcp, int nbits)
> +{
> +       return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
> +}
> +
>   #if 0
>   #define nodemask_scnprintf(buf, len, src) \
>                          __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -184,6 +184,8 @@ struct scheduler {
>                                       struct xen_domctl_scheduler_op *);
>       int          (*adjust_global)  (const struct scheduler *,
>                                       struct xen_sysctl_scheduler_op *);
> +    void         (*set_node_affinity) (const struct scheduler *,
> +                                       struct domain *, nodemask_t *);
>       void         (*dump_settings)  (const struct scheduler *);
>       void         (*dump_cpu_state) (const struct scheduler *, int);
>
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -359,8 +359,12 @@ struct domain
>       /* Various mem_events */
>       struct mem_event_per_domain *mem_event;
>
> -    /* Currently computed from union of all vcpu cpu-affinity masks. */
> +    /*
> +     * Can be specified by the user. If that is not the case, it is
> +     * computed from the union of all the vcpu cpu-affinity masks.
> +     */
>       nodemask_t node_affinity;
> +    int auto_node_affinity;
>       unsigned int last_alloc_node;
>       spinlock_t node_affinity_lock;
>   };
> @@ -429,6 +433,7 @@ static inline void get_knownalive_domain
>       ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
>   }
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
>   void domain_update_node_affinity(struct domain *d);
>
>   struct domain *domain_create(
> @@ -543,6 +548,7 @@ void sched_destroy_domain(struct domain
>   int sched_move_domain(struct domain *d, struct cpupool *c);
>   long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
>   long sched_adjust_global(struct xen_sysctl_scheduler_op *);
> +void sched_set_node_affinity(struct domain *, nodemask_t *);
>   int  sched_id(void);
>   void sched_tick_suspend(void);
>   void sched_tick_resume(void);
> diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
> --- a/xen/include/xsm/xsm.h
> +++ b/xen/include/xsm/xsm.h
> @@ -56,6 +56,7 @@ struct xsm_operations {
>       int (*domain_create) (struct domain *d, u32 ssidref);
>       int (*max_vcpus) (struct domain *d);
>       int (*destroydomain) (struct domain *d);
> +    int (*nodeaffinity) (int cmd, struct domain *d);
>       int (*vcpuaffinity) (int cmd, struct domain *d);
>       int (*scheduler) (struct domain *d);
>       int (*getdomaininfo) (struct domain *d);
> @@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str
>       return xsm_call(destroydomain(d));
>   }
>
> +static inline int xsm_nodeaffinity (int cmd, struct domain *d)
> +{
> +    return xsm_call(nodeaffinity(cmd, d));
> +}
> +
>   static inline int xsm_vcpuaffinity (int cmd, struct domain *d)
>   {
>       return xsm_call(vcpuaffinity(cmd, d));
> diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
> --- a/xen/xsm/dummy.c
> +++ b/xen/xsm/dummy.c
> @@ -54,6 +54,11 @@ static int dummy_destroydomain (struct d
>       return 0;
>   }
>
> +static int dummy_nodeaffinity (int cmd, struct domain *d)
> +{
> +    return 0;
> +}
> +
>   static int dummy_vcpuaffinity (int cmd, struct domain *d)
>   {
>       return 0;
> @@ -634,6 +639,7 @@ void xsm_fixup_ops (struct xsm_operation
>       set_to_dummy_if_null(ops, domain_create);
>       set_to_dummy_if_null(ops, max_vcpus);
>       set_to_dummy_if_null(ops, destroydomain);
> +    set_to_dummy_if_null(ops, nodeaffinity);
>       set_to_dummy_if_null(ops, vcpuaffinity);
>       set_to_dummy_if_null(ops, scheduler);
>       set_to_dummy_if_null(ops, getdomaininfo);
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -521,17 +521,19 @@ static int flask_destroydomain(struct do
>                              DOMAIN__DESTROY);
>   }
>
> -static int flask_vcpuaffinity(int cmd, struct domain *d)
> +static int flask_affinity(int cmd, struct domain *d)
>   {
>       u32 perm;
>
>       switch ( cmd )
>       {
>       case XEN_DOMCTL_setvcpuaffinity:
> -        perm = DOMAIN__SETVCPUAFFINITY;
> +    case XEN_DOMCTL_setnodeaffinity:
> +        perm = DOMAIN__SETAFFINITY;
>           break;
>       case XEN_DOMCTL_getvcpuaffinity:
> -        perm = DOMAIN__GETVCPUAFFINITY;
> +    case XEN_DOMCTL_getnodeaffinity:
> +        perm = DOMAIN__GETAFFINITY;
>           break;
>       default:
>           return -EPERM;
> @@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops =
>       .domain_create = flask_domain_create,
>       .max_vcpus = flask_max_vcpus,
>       .destroydomain = flask_destroydomain,
> -    .vcpuaffinity = flask_vcpuaffinity,
> +    .nodeaffinity = flask_affinity,
> +    .vcpuaffinity = flask_affinity,
>       .scheduler = flask_scheduler,
>       .getdomaininfo = flask_getdomaininfo,
>       .getvcpucontext = flask_getvcpucontext,
> diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h
> --- a/xen/xsm/flask/include/av_perm_to_string.h
> +++ b/xen/xsm/flask/include/av_perm_to_string.h
> @@ -37,8 +37,8 @@
>      S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition")
>      S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus")
>      S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy")
> -   S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity")
> -   S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity")
> +   S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity")
> +   S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity")
>      S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler")
>      S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo")
>      S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo")
> diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h
> --- a/xen/xsm/flask/include/av_permissions.h
> +++ b/xen/xsm/flask/include/av_permissions.h
> @@ -38,8 +38,8 @@
>   #define DOMAIN__TRANSITION                        0x00000020UL
>   #define DOMAIN__MAX_VCPUS                         0x00000040UL
>   #define DOMAIN__DESTROY                           0x00000080UL
> -#define DOMAIN__SETVCPUAFFINITY                   0x00000100UL
> -#define DOMAIN__GETVCPUAFFINITY                   0x00000200UL
> +#define DOMAIN__SETAFFINITY                       0x00000100UL
> +#define DOMAIN__GETAFFINITY                       0x00000200UL
>   #define DOMAIN__SCHEDULER                         0x00000400UL
>   #define DOMAIN__GETDOMAININFO                     0x00000800UL
>   #define DOMAIN__GETVCPUINFO                       0x00001000UL

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
  2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
@ 2012-12-21 15:19   ` George Dunlap
  2012-12-21 16:27     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 15:19 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> By providing the proper get/set interface and wiring them
> to the new domctl-s from the previous commit.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

I haven't done a detailed review, but everything looks OK:

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
>   }
>   
>   
> +int xc_domain_node_setaffinity(xc_interface *xch,
> +                               uint32_t domid,
> +                               xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +
> +    memcpy(local, nodemap, nodesize);
> +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
> +int xc_domain_node_getaffinity(xc_interface *xch,
> +                               uint32_t domid,
> +                               xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +
> +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    memcpy(nodemap, local, nodesize);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
>   		uint32_t id,
>   		uint32_t timeout);
>   
> +/**
> + * This function explicitly sets the host NUMA nodes the domain will
> + * have affinity with.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id one wants to set the affinity of.
> + * @parm nodemap the map of the affine nodes.
> + * @return 0 on success, -1 on failure.
> + */
> +int xc_domain_node_setaffinity(xc_interface *xch,
> +                               uint32_t domid,
> +                               xc_nodemap_t nodemap);
> +
> +/**
> + * This function retrieves the host NUMA nodes the domain has
> + * affinity with.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id one wants to get the node affinity of.
> + * @parm nodemap the map of the affine nodes.
> + * @return 0 on success, -1 on failure.
> + */
> +int xc_domain_node_getaffinity(xc_interface *xch,
> +                               uint32_t domid,
> +                               xc_nodemap_t nodemap);
> +
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
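
For context (and not as part of the patch): I'd expect a toolstack caller to
end up using these new functions roughly as in the sketch below. The
xc_nodemap_alloc() helper and the error handling are my assumptions; only the
setaffinity call itself comes from this patch.

#include <stdlib.h>
#include <xenctrl.h>

/* Hypothetical caller: restrict domain 'domid' to NUMA node 0. */
int set_node0_affinity(uint32_t domid)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    xc_nodemap_t nodemap;
    int rc = -1;

    if ( !xch )
        return -1;

    nodemap = xc_nodemap_alloc(xch);   /* assumed helper for sizing/allocation */
    if ( nodemap )
    {
        nodemap[0] |= 1;               /* set the bit for node 0 */
        rc = xc_domain_node_setaffinity(xch, domid, nodemap);
        free(nodemap);
    }

    xc_interface_close(xch);
    return rc;
}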

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
@ 2012-12-21 15:30   ` George Dunlap
  2012-12-21 16:18     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 15:30 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> By introducing a nodemap in libxl_domain_build_info and
> providing the get/set methods to deal with it.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

I think you'll probably need to add a line like the following:

#define LIBXL_HAVE_NODEAFFINITY 1

So that people wanting to build against different versions of the 
library can behave appropriately.  But IanC or IanJ would be the final 
word on that, I think.
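
e.g., so that an application could guard its use of the new API with something
like this (hypothetical consumer code, not part of the patch; it only relies on
the libxl_domain_set_nodeaffinity() introduced below and the suggested define):

#include <libxl.h>

/* Hypothetical helper in an application linking against libxl. */
static int maybe_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
                                  libxl_bitmap *nodemap)
{
#ifdef LIBXL_HAVE_NODEAFFINITY
    return libxl_domain_set_nodeaffinity(ctx, domid, nodemap);
#else
    /* Older libxl: node-affinity cannot be set, silently skip. */
    return 0;
#endif
}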

  -George

>
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -4142,6 +4142,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
>       return rc;
>   }
>   
> +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap)
> +{
> +    if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
> +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap)
> +{
> +    if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
>   {
>       GC_INIT(ctx);
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -861,6 +861,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
>                              libxl_bitmap *cpumap);
>   int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
>                                  unsigned int max_vcpus, libxl_bitmap *cpumap);
> +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap);
> +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap);
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap);
>   
>   libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault(
>   
>       libxl_defbool_setdefault(&b_info->numa_placement, true);
>   
> +    if (!b_info->nodemap.size) {
> +        if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0))
> +            return ERROR_FAIL;
> +        libxl_bitmap_set_any(&b_info->nodemap);
> +    }
> +
>       if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
>           b_info->max_memkb = 32 * 1024;
>       if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint
>           if (rc)
>               return rc;
>       }
> +    libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
>       libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
>   
>       xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain
>       ("max_vcpus",       integer),
>       ("avail_vcpus",     libxl_bitmap),
>       ("cpumap",          libxl_bitmap),
> +    ("nodemap",         libxl_bitmap),
>       ("numa_placement",  libxl_defbool),
>       ("tsc_mode",        libxl_tsc_mode),
>       ("max_memkb",       MemKB),

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
  2012-12-20  8:41   ` Ian Campbell
@ 2012-12-21 16:00   ` George Dunlap
  2012-12-21 16:23     ` Dario Faggioli
  1 sibling, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 16:00 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> For choosing the best NUMA placement candidate, we need to figure out
> how many VCPUs are runnable on each of them. That requires going through
> all the VCPUs of all the domains and check their affinities.
>
> With this change, instead of doing the above for each candidate, we
> do it once for all, populating an array while counting. This way, when
> we later are evaluating candidates, all we need is summing up the right
> elements of the array itself.
>
> This reduces the complexity of the overall algorithm, as it moves a
> potentially expensive operation (for_each_vcpu_of_each_domain {})
> outside from the core placement loop, so that it is performed only
> once instead of (potentially) tens or hundreds of times.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

You know this code best. :-)  I've looked it over and just have one 
minor suggestion:

>           for (j = 0; j < nr_dom_vcpus; j++) {
> +            /* For each vcpu of each domain, increment the elements of
> +             * the array corresponding to the nodes where the vcpu runs */
> +            libxl_bitmap_set_none(&vcpu_nodemap);
> +            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
> +                int node = tinfo[k].node;

I think I might rename "vcpu_nodemap" to something that suggests better 
how it fits with the algorithm -- for instance, "counted_nodemap" or 
"nodes_counted" -- something to suggest that this is how we avoid 
counting the same vcpu on the same node multiple times.
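
For anyone else following along, the optimisation described in the changelog
boils down to something like the toy model below (standalone C, all names and
numbers made up): count the vcpus per node once, and then each placement
candidate only costs a sum over that array.

#include <stdint.h>
#include <stdio.h>

#define NR_NODES 4

int main(void)
{
    /* Done once: walk all vcpus of all domains and count, per node, how many
     * of them can run there according to their affinity. The per-vcpu walk
     * is replaced by a hard-coded result here.                               */
    int vcpus_on_node[NR_NODES] = { 6, 2, 3, 1 };

    /* Done per candidate: a candidate is just a set of nodes, so counting
     * its runnable vcpus is a sum over the array rather than another walk
     * of every vcpu in the system.                                           */
    uint64_t candidate_nodes = (1ULL << 0) | (1ULL << 2);   /* nodes {0, 2} */
    int cnt = 0;

    for (int node = 0; node < NR_NODES; node++)
        if (candidate_nodes & (1ULL << node))
            cnt += vcpus_on_node[node];

    printf("candidate {0,2} counts %d vcpus\n", cnt);   /* prints 9 */
    return 0;
}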

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-21 14:29       ` George Dunlap
@ 2012-12-21 16:07         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:07 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 2179 bytes --]

On Fri, 2012-12-21 at 14:29 +0000, George Dunlap wrote: 
> > Sorry for that, I probably spent so much time buried, as you where
> > saying, in the
> > various nested loops and calls, that I lost the context a little bit! :-P
> 
> OK, that makes sense -- I figured it was something like that.  Don't 
> feel too bad about missing that connection -- we're all fairly blind to 
> our own code, and I only caught it because I was trying to figure out 
> what was going on.
>
Yeah, thanks, and no, I won't let this get me down too much, even if this
was quite a big one. After all, that's what we have patch review for!

> That's why we do patch review. :-)
> 
Hehe, I see we agree. :-)

> Honestly, the whole "steal work" idea seemed a bit backwards to begin 
> with, but now that we're not just dealing with "possible" and "not 
> possible", but with "better" and "worse", the work-stealing method of 
> load balancing sort of falls down.
>
> [snip]
> 
> But that's kind of a half-baked idea at this point.
> 
Yes, this whole work-stealing machinery may need to be rethought a bit. However,
let's get something sane in for NUMA load balancing ASAP, as we planned,
and then we'll see whether/how to rework it with both simplicity and
effectiveness in mind.

> > Ok, I think the problem I was describing is real, and I've seen it happening
> > and causing performance degradation. However, as I think a good solution is
> > going to be more complex than I thought, I'd better repost without this
> > function and deal with it in a future separate patch (after having figured
> > out the best way of doing so). Is that fine with you?
> 
> Yes, that's fine.
>
Ok, I'll sort out all your comments and try to post v3 in early January,
so that you'll find it in your inbox as soon as you're back from
vacation! :-)

>   Thanks, Dario.
> 
Thanks to you,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-21 14:56       ` George Dunlap
@ 2012-12-21 16:13         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1621 bytes --]

On Fri, 2012-12-21 at 14:56 +0000, George Dunlap wrote: 
> > I hope I've explained it correctly, and sorry if it is a little bit
> > tricky, especially to explain like this (although, believe me, it was
> > tricky to hunt it out too! :-P). I've seen that happening and I'm almost
> > sure I kept a trace somewhere, so let me know if you want to see the
> > "smoking gun". :-)
> 
> No, the change looks quite plausible.  I guess it's not obvious that the 
> balancing code will never migrate from one thread to another thread.  
>
It was far from obvious to figure out that this was happening, yes. :-)

> (That whole algorithm could do with some commenting -- I may submit a 
> patch once this series is in.)
> 
Nice.

> I'm really glad you've had the opportunity to take a close look at these 
> kinds of things.
>
Yeah, well, I'm happy to; scheduling never stops entertaining me, even
(or especially) when it requires my brain cells to work out so hard! :-D

> What I was doing, 
> in a sort of "thinking out loud" fashion, seeing under what conditions 
> that break might actually happen.  Like the analysis with 
> vcpu_should_migrate(), it might have turned out to be redundant, or to 
> have missed some cases.
>
Yep, I agree, it's another aspect of the patch-review model which is
really helpful.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-21 15:17   ` George Dunlap
@ 2012-12-21 16:17     ` Dario Faggioli
  2013-01-03 16:05     ` Daniel De Graaf
  1 sibling, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:17 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1591 bytes --]

On Fri, 2012-12-21 at 15:17 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > Make it possible to pass the node-affinity of a domain to the hypervisor
> > from the upper layers, instead of always being computed automatically.
> >
> > Note that this also required generalizing the Flask hooks for setting
> > and getting the affinity, so that they now deal with both vcpu and
> > node affinity.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> I can't comment on the XSM stuff 
>
Right, it's the part I'm most weak on too... Daniel had a go at this
patch during v1's review, and he raised a couple of points that I think
I addressed; let's see if he thinks there's anything else.

> -- is any part of the "getvcpuaffinity" 
> stuff for XSM a public interface that needs to be backwards-compatible?  
> I.e., is s/vcpu//; OK from an interface point of view?
> 
Mmm... Good point, I hadn't thought about that. This was here in v1, and
Daniel explicitly said he was fine with the renaming instead of adding a
new hook, but mostly from the "semantic" point of view; I'm not sure whether
backward compatibility is an issue... Daniel, what do you think?

> WRT everything else:
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
Ok. Thanks,
Dario


<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-21 15:30   ` George Dunlap
@ 2012-12-21 16:18     ` Dario Faggioli
  2012-12-21 17:02       ` Ian Jackson
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1077 bytes --]

On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > By introducing a nodemap in libxl_domain_build_info and
> > providing the get/set methods to deal with it.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
> 
> I think you'll probably need to add a line like the following:
> 
> #define LIBXL_HAVE_NODEAFFINITY 1
> 
> So that people wanting to build against different versions of the 
> library can behave appropriately.  
>
I see what you mean.

> But IanC or IanJ would be the final 
> word on that, I think.
> 
Well, let's see if they want to comment. In any case, it shouldn't be a
big deal to add this. Thanks for pointing it out.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity
  2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
@ 2012-12-21 16:22   ` George Dunlap
  0 siblings, 0 replies; 57+ messages in thread
From: George Dunlap @ 2012-12-21 16:22 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> Which basically means the following two things:
>   1) during domain creation, it is the node-affinity of
>      the domain --rather than the vcpu-affinities of its
>      VCPUs-- that is affected by automatic placement;
>   2) during automatic placement, when counting how many
>      VCPUs are already "bound" to a placement candidate
>      (as part of the process of choosing the best
>      candidate), both vcpu-affinity and node-affinity
>      are considered.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

Re-confirming Ack.
  -George

>
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc *
>   {
>       int found;
>       libxl__numa_candidate candidate;
> -    libxl_bitmap candidate_nodemap;
> +    libxl_bitmap cpupool_nodemap;
>       libxl_cpupoolinfo cpupool_info;
>       int i, cpupool, rc = 0;
>       uint32_t memkb;
>   
>       libxl__numa_candidate_init(&candidate);
> -    libxl_bitmap_init(&candidate_nodemap);
> +    libxl_bitmap_init(&cpupool_nodemap);
>   
>       /*
>        * Extract the cpumap from the cpupool the domain belong to. In fact,
> @@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc *
>       rc = libxl_domain_need_memory(CTX, info, &memkb);
>       if (rc)
>           goto out;
> -    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
> +    if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) {
>           rc = ERROR_FAIL;
>           goto out;
>       }
> @@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc *
>       if (found == 0)
>           goto out;
>   
> -    /* Map the candidate's node map to the domain's info->cpumap */
> -    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
> -    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
> +    /* Map the candidate's node map to the domain's info->nodemap */
> +    libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap);
> +
> +    /* Avoid trying to set the affinity to nodes that might be in the
> +     * candidate's nodemap but out of our cpupool. */
> +    rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap,
> +                                 &cpupool_nodemap);
>       if (rc)
>           goto out;
>   
> -    /* Avoid trying to set the affinity to cpus that might be in the
> -     * nodemap but not in our cpupool. */
> -    libxl_for_each_set_bit(i, info->cpumap) {
> -        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
> -            libxl_bitmap_reset(&info->cpumap, i);
> +    libxl_for_each_set_bit(i, info->nodemap) {
> +        if (!libxl_bitmap_test(&cpupool_nodemap, i))
> +            libxl_bitmap_reset(&info->nodemap, i);
>       }
>   
>       LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
> @@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc *
>   
>    out:
>       libxl__numa_candidate_dispose(&candidate);
> -    libxl_bitmap_dispose(&candidate_nodemap);
> +    libxl_bitmap_dispose(&cpupool_nodemap);
>       libxl_cpupoolinfo_dispose(&cpupool_info);
>       return rc;
>   }
> @@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint
>       /*
>        * Check if the domain has any CPU affinity. If not, try to build
>        * up one. In case numa_place_domain() find at least a suitable
> -     * candidate, it will affect info->cpumap accordingly; if it
> +     * candidate, it will affect info->nodemap accordingly; if it
>        * does not, it just leaves it as it is. This means (unless
>        * some weird error manifests) the subsequent call to
> -     * libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * libxl_domain_set_nodeaffinity() will do the actual placement,
>        * whatever that turns out to be.
>        */
>       if (libxl_defbool_val(info->numa_placement)) {
> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
> --- a/tools/libxl/libxl_numa.c
> +++ b/tools/libxl/libxl_numa.c
> @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
>                                int vcpus_on_node[])
>   {
>       libxl_dominfo *dinfo = NULL;
> -    libxl_bitmap vcpu_nodemap;
> +    libxl_bitmap dom_nodemap, vcpu_nodemap;
>       int nr_doms, nr_cpus;
>       int i, j, k;
>   
> @@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc *
>           return ERROR_FAIL;
>       }
>   
> +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
> +        libxl_bitmap_dispose(&vcpu_nodemap);
> +        libxl_dominfo_list_free(dinfo, nr_doms);
> +        return ERROR_FAIL;
> +    }
> +
>       for (i = 0; i < nr_doms; i++) {
>           libxl_vcpuinfo *vinfo;
>           int nr_dom_vcpus;
> @@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc *
>           if (vinfo == NULL)
>               continue;
>   
> +        /* Retrieve the domain's node-affinity map */
> +        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
> +
>           for (j = 0; j < nr_dom_vcpus; j++) {
> -            /* For each vcpu of each domain, increment the elements of
> -             * the array corresponding to the nodes where the vcpu runs */
> +            /*
> +             * For each vcpu of each domain, it must have both vcpu-affinity
> +             * and node-affinity to (a pcpu belonging to) a certain node to
> +             * cause an increment in the corresponding element of the array.
> +             */
>               libxl_bitmap_set_none(&vcpu_nodemap);
>               libxl_for_each_set_bit(k, vinfo[j].cpumap) {
>                   int node = tinfo[k].node;
>   
>                   if (libxl_bitmap_test(suitable_cpumap, k) &&
> +                    libxl_bitmap_test(&dom_nodemap, node) &&
>                       !libxl_bitmap_test(&vcpu_nodemap, node)) {
>                       libxl_bitmap_set(&vcpu_nodemap, node);
>                       vcpus_on_node[node]++;
> @@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
>           libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
>       }
>   
> +    libxl_bitmap_dispose(&dom_nodemap);
>       libxl_bitmap_dispose(&vcpu_nodemap);
>       libxl_dominfo_list_free(dinfo, nr_doms);
>       return 0;


* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-21 16:00   ` George Dunlap
@ 2012-12-21 16:23     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Fri, 2012-12-21 at 16:00 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > For choosing the best NUMA placement candidate, we need to figure out
> > how many VCPUs are runnable on each of them. That requires going through
> > all the VCPUs of all the domains and check their affinities.
> >
> > With this change, instead of doing the above for each candidate, we
> > do it once for all, populating an array while counting. This way, when
> > we later are evaluating candidates, all we need is summing up the right
> > elements of the array itself.
> >
> > This reduces the complexity of the overall algorithm, as it moves a
> > potentially expensive operation (for_each_vcpu_of_each_domain {})
> > outside from the core placement loop, so that it is performed only
> > once instead of (potentially) tens or hundreds of times.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> You know this code best. :-)  I've looked it over and just have one 
> minor suggestion:
> 
Well, I certainly spent quite a bit of time on it, and it still is in
need of some more, but again, this change only speeds things up at no
"functional cost", so (despite this not being a critical path) I really
think it is something we want.
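
Just to make the shape of the change concrete, here is a toy,
self-contained sketch of the precompute-then-sum idea (illustrative only,
with made-up data and names; the real code is nr_vcpus_on_nodes() and the
candidate evaluation loop in libxl_numa.c):

    #include <stdio.h>

    #define NR_NODES 2
    #define NR_VCPUS 4

    int main(void)
    {
        /* vcpu_ok_on[v][n]: 1 if vcpu v is allowed to run on node n. */
        int vcpu_ok_on[NR_VCPUS][NR_NODES] = {
            { 1, 0 }, { 1, 0 }, { 0, 1 }, { 1, 1 },
        };
        int vcpus_on_node[NR_NODES] = { 0 };
        int v, n, score = 0;

        /* Step 1: a single pass over all vcpus, done once for all
         * placement candidates (the formerly repeated, expensive part). */
        for (v = 0; v < NR_VCPUS; v++)
            for (n = 0; n < NR_NODES; n++)
                if (vcpu_ok_on[v][n])
                    vcpus_on_node[n]++;

        /* Step 2: scoring a candidate is now just a sum over the nodes
         * it spans; e.g., for a candidate covering both nodes: */
        for (n = 0; n < NR_NODES; n++)
            score += vcpus_on_node[n];

        printf("vcpus bound to candidate {0,1}: %d\n", score);
        return 0;
    }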

BTW, thanks for taking a look.

> >           for (j = 0; j < nr_dom_vcpus; j++) {
> > +            /* For each vcpu of each domain, increment the elements of
> > +             * the array corresponding to the nodes where the vcpu runs */
> > +            libxl_bitmap_set_none(&vcpu_nodemap);
> > +            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
> > +                int node = tinfo[k].node;
> 
> I think I might rename "vcpu_nodemap" to something that suggests better 
> how it fits with the algorithm -- for instance, "counted_nodemap" or 
> "nodes_counted" -- something to suggest that this is how we avoid 
> counting the same vcpu on the same node multiple times.
> 
Good point, I'll go for something like that.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
  2012-12-21 15:19   ` George Dunlap
@ 2012-12-21 16:27     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:27 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Fri, 2012-12-21 at 15:19 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > By providing the proper get/set interface and wiring them
> > to the new domctl-s from the previous commit.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
> 
> I haven't done a detailed review, but everything looks OK:
> 
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
Ok. Also, let me use this e-mail to thank you properly for the thorough,
useful and especially quick review! All your comments were really helpful,
and I'll do my best to address the points you raised in a proper and
equally quick manner.

See you in 2013 (I'm going on vacation today) with v3!  :-D

Thanks again and Regards,
Dario

> >
> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > --- a/tools/libxc/xc_domain.c
> > +++ b/tools/libxc/xc_domain.c
> > @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
> >   }
> >   
> >   
> > +int xc_domain_node_setaffinity(xc_interface *xch,
> > +                               uint32_t domid,
> > +                               xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
> > +    int nodesize;
> > +
> > +    nodesize = xc_get_nodemap_size(xch);
> > +    if (!nodesize)
> > +    {
> > +        PERROR("Could not get number of nodes");
> > +        goto out;
> > +    }
> > +
> > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > +    if ( local == NULL )
> > +    {
> > +        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
> > +        goto out;
> > +    }
> > +
> > +    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
> > +    domctl.domain = (domid_t)domid;
> > +
> > +    memcpy(local, nodemap, nodesize);
> > +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> > +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> > +
> > +    ret = do_domctl(xch, &domctl);
> > +
> > +    xc_hypercall_buffer_free(xch, local);
> > +
> > + out:
> > +    return ret;
> > +}
> > +
> > +int xc_domain_node_getaffinity(xc_interface *xch,
> > +                               uint32_t domid,
> > +                               xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
> > +    int nodesize;
> > +
> > +    nodesize = xc_get_nodemap_size(xch);
> > +    if (!nodesize)
> > +    {
> > +        PERROR("Could not get number of nodes");
> > +        goto out;
> > +    }
> > +
> > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > +    if ( local == NULL )
> > +    {
> > +        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
> > +        goto out;
> > +    }
> > +
> > +    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
> > +    domctl.domain = (domid_t)domid;
> > +
> > +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> > +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> > +
> > +    ret = do_domctl(xch, &domctl);
> > +
> > +    memcpy(nodemap, local, nodesize);
> > +
> > +    xc_hypercall_buffer_free(xch, local);
> > +
> > + out:
> > +    return ret;
> > +}
> > +
> >   int xc_vcpu_setaffinity(xc_interface *xch,
> >                           uint32_t domid,
> >                           int vcpu,
> > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > --- a/tools/libxc/xenctrl.h
> > +++ b/tools/libxc/xenctrl.h
> > @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
> >   		uint32_t id,
> >   		uint32_t timeout);
> >   
> > +/**
> > + * This function explicitly sets the host NUMA nodes the domain will
> > + * have affinity with.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id one wants to set the affinity of.
> > + * @parm nodemap the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
> > + */
> > +int xc_domain_node_setaffinity(xc_interface *xch,
> > +                               uint32_t domind,
> > +                               xc_nodemap_t nodemap);
> > +
> > +/**
> > + * This function retrieves the host NUMA nodes the domain has
> > + * affinity with.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id one wants to get the node affinity of.
> > + * @parm nodemap the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
> > + */
> > +int xc_domain_node_getaffinity(xc_interface *xch,
> > +                               uint32_t domind,
> > +                               xc_nodemap_t nodemap);
> > +
> >   int xc_vcpu_setaffinity(xc_interface *xch,
> >                           uint32_t domid,
> >                           int vcpu,
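
For completeness, here is a minimal sketch of how a caller could drive the
new interface (hypothetical code, not from the series; it assumes
xc_nodemap_t is a plain byte map, one bit per node, analogous to
xc_cpumap_t):

    #include <stdlib.h>
    #include <xenctrl.h>

    /* Set domain 1's node affinity to NUMA node 0 only. */
    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        xc_nodemap_t nodemap;
        int nodesize, rc = -1;

        if (!xch)
            return 1;

        nodesize = xc_get_nodemap_size(xch);
        if (nodesize > 0 && (nodemap = calloc(nodesize, 1)) != NULL) {
            nodemap[0] |= 1;                       /* bit 0 <=> node 0 */
            rc = xc_domain_node_setaffinity(xch, 1 /* domid */, nodemap);
            free(nodemap);
        }

        xc_interface_close(xch);
        return rc ? 1 : 0;
    }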

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
  2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
@ 2012-12-21 16:34   ` George Dunlap
  2012-12-21 16:54     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 16:34 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> Node-affinity is now something that is under (some) control of the
> user, so show it upon request as part of the output of `xl list'
> by the `-n' option.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
> ---
> Changes from v1:
>   * print_{cpu,node}map() functions added instead of 'state variable'-izing
>     print_bitmap().
>
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -2961,14 +2961,95 @@ out:
>       }
>   }
>   
> -static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain)
> +/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
> +static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
> +{
> +    int i;
> +    uint8_t pmap = 0, bitmask = 0;
> +    int firstset = 0, state = 0;
> +
> +    for (i = 0; i < maplen; i++) {
> +        if (i % 8 == 0) {
> +            pmap = *map++;
> +            bitmask = 1;
> +        } else bitmask <<= 1;
> +
> +        switch (state) {
> +        case 0:
> +        case 2:
> +            if ((pmap & bitmask) != 0) {
> +                firstset = i;
> +                state++;
> +            }
> +            continue;
> +        case 1:
> +        case 3:
> +            if ((pmap & bitmask) == 0) {
> +                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> +                if (i - 1 > firstset)
> +                    fprintf(stream, "-%d", i - 1);
> +                state = 2;
> +            }
> +            continue;
> +        }
> +    }
> +    switch (state) {
> +        case 0:
> +            fprintf(stream, "none");
> +            break;
> +        case 2:
> +            break;
> +        case 1:
> +            if (firstset == 0)
> +                return 1;
> +        case 3:
> +            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> +            if (i - 1 > firstset)
> +                fprintf(stream, "-%d", i - 1);
> +            break;
> +    }
> +
> +    return 0;
> +}

Just checking -- is the print_bitmap() thing pure code motion? If so, 
would you mind saying that explicitly in the commit message, just to 
save people time when reading this patch?

Other than that, looks OK to me -- I haven't done a detailed review of 
the output layout however.

  -George

> +
> +static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
> +{
> +    if (print_bitmap(map, maplen, stream))
> +        fprintf(stream, "any cpu");
> +}
> +
> +static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
> +{
> +    if (print_bitmap(map, maplen, stream))
> +        fprintf(stream, "any node");
> +}
> +
> +static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain)
>   {
>       int i;
>       static const char shutdown_reason_letters[]= "-rscw";
> +    libxl_bitmap nodemap;
> +    libxl_physinfo physinfo;
> +
> +    libxl_bitmap_init(&nodemap);
> +    libxl_physinfo_init(&physinfo);
>   
>       printf("Name                                        ID   Mem VCPUs\tState\tTime(s)");
>       if (verbose) printf("   UUID                            Reason-Code\tSecurity Label");
>       if (context && !verbose) printf("   Security Label");
> +    if (numa) {
> +        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) {
> +            fprintf(stderr, "libxl_node_bitmap_alloc_failed.\n");
> +            exit(1);
> +        }
> +        if (libxl_get_physinfo(ctx, &physinfo) != 0) {
> +            fprintf(stderr, "libxl_physinfo failed.\n");
> +            libxl_bitmap_dispose(&nodemap);
> +            exit(1);
> +        }
> +
> +        printf(" NODE Affinity");
> +    }
>       printf("\n");
>       for (i = 0; i < nb_domain; i++) {
>           char *domname;
> @@ -3002,14 +3083,23 @@ static void list_domains(int verbose, in
>               rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf,
>                                               &size);
>               if (rc < 0)
> -                printf("  -");
> +                printf("                -");
>               else {
> -                printf("  %s", buf);
> +                printf(" %16s", buf);
>                   free(buf);
>               }
>           }
> +        if (numa) {
> +            libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
> +
> +            putchar(' ');
> +            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
> +        }
>           putchar('\n');
>       }
> +
> +    libxl_bitmap_dispose(&nodemap);
> +    libxl_physinfo_dispose(&physinfo);
>   }
>   
>   static void list_vm(void)
> @@ -3890,12 +3980,14 @@ int main_list(int argc, char **argv)
>       int opt, verbose = 0;
>       int context = 0;
>       int details = 0;
> +    int numa = 0;
>       int option_index = 0;
>       static struct option long_options[] = {
>           {"long", 0, 0, 'l'},
>           {"help", 0, 0, 'h'},
>           {"verbose", 0, 0, 'v'},
>           {"context", 0, 0, 'Z'},
> +        {"numa", 0, 0, 'n'},
>           {0, 0, 0, 0}
>       };
>   
> @@ -3904,7 +3996,7 @@ int main_list(int argc, char **argv)
>       int nb_domain, rc;
>   
>       while (1) {
> -        opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index);
> +        opt = getopt_long(argc, argv, "lvhZn", long_options, &option_index);
>           if (opt == -1)
>               break;
>   
> @@ -3921,6 +4013,9 @@ int main_list(int argc, char **argv)
>           case 'Z':
>               context = 1;
>               break;
> +        case 'n':
> +            numa = 1;
> +            break;
>           default:
>               fprintf(stderr, "option `%c' not supported.\n", optopt);
>               break;
> @@ -3956,7 +4051,7 @@ int main_list(int argc, char **argv)
>       if (details)
>           list_domains_details(info, nb_domain);
>       else
> -        list_domains(verbose, context, info, nb_domain);
> +        list_domains(verbose, context, numa, info, nb_domain);
>   
>       if (info_free)
>           libxl_dominfo_list_free(info, nb_domain);
> @@ -4228,56 +4323,6 @@ int main_button_press(int argc, char **a
>       return 0;
>   }
>   
> -static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
> -{
> -    int i;
> -    uint8_t pmap = 0, bitmask = 0;
> -    int firstset = 0, state = 0;
> -
> -    for (i = 0; i < maplen; i++) {
> -        if (i % 8 == 0) {
> -            pmap = *map++;
> -            bitmask = 1;
> -        } else bitmask <<= 1;
> -
> -        switch (state) {
> -        case 0:
> -        case 2:
> -            if ((pmap & bitmask) != 0) {
> -                firstset = i;
> -                state++;
> -            }
> -            continue;
> -        case 1:
> -        case 3:
> -            if ((pmap & bitmask) == 0) {
> -                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> -                if (i - 1 > firstset)
> -                    fprintf(stream, "-%d", i - 1);
> -                state = 2;
> -            }
> -            continue;
> -        }
> -    }
> -    switch (state) {
> -        case 0:
> -            fprintf(stream, "none");
> -            break;
> -        case 2:
> -            break;
> -        case 1:
> -            if (firstset == 0) {
> -                fprintf(stream, "any cpu");
> -                break;
> -            }
> -        case 3:
> -            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> -            if (i - 1 > firstset)
> -                fprintf(stream, "-%d", i - 1);
> -            break;
> -    }
> -}
> -
>   static void print_vcpuinfo(uint32_t tdomid,
>                              const libxl_vcpuinfo *vcpuinfo,
>                              uint32_t nr_cpus)
> @@ -4301,7 +4346,7 @@ static void print_vcpuinfo(uint32_t tdom
>       /*      TIM */
>       printf("%9.1f  ", ((float)vcpuinfo->vcpu_time / 1e9));
>       /* CPU AFFINITY */
> -    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
> +    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
>       printf("\n");
>   }
>   
> diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
> --- a/tools/libxl/xl_cmdtable.c
> +++ b/tools/libxl/xl_cmdtable.c
> @@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = {
>         "[options] [Domain]\n",
>         "-l, --long              Output all VM details\n"
>         "-v, --verbose           Prints out UUIDs and security context\n"
> -      "-Z, --context           Prints out security context"
> +      "-Z, --context           Prints out security context\n"
> +      "-n, --numa              Prints out NUMA node affinity"
>       },
>       { "destroy",
>         &main_destroy, 0, 1,


* Re: [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
  2012-12-21 16:34   ` George Dunlap
@ 2012-12-21 16:54     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:54 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Fri, 2012-12-21 at 16:34 +0000, George Dunlap wrote: 
> Just checking -- is the print_bitmap() thing pure code motion? If so, 
> would you mind saying that explicitly in the commit message, just to 
> save people time when reading this patch?
> 
It's _mostly_ code motion, but I had to hack it a tiny little bit to
make it possible to use the same function for printing both cpu and node
bitmaps (basically, when all bits are set, the function used to print
"any cpu" itself, which didn't fit well with node maps).

But yeah, I should have mentioned in the changelog what I just explained
here. Will do in v3.

> Other than that, looks OK to me -- I haven't done a detailed review of 
> the output layout however.
> 
That's the trickiest part of this patch! :-)

For your and others' convenience, here's how it looks on my testbox
(sorry for the extra-long lines, but that's not my fault!):

root@Zhaman:~# xl list
Name                                        ID   Mem VCPUs	State	Time(s)
Domain-0                                     0   375    16     r-----     225.1
vm1                                          1   960     2     -b----      34.0
root@Zhaman:~# xl list -Z
Name                                        ID   Mem VCPUs	State	Time(s)   Security Label
Domain-0                                     0   375    16     r-----     225.4                -
vm1                                          1   960     2     -b----      34.0                -
root@Zhaman:~# xl list -v
Name                                        ID   Mem VCPUs	State	Time(s)   UUID                            Reason-Code	Security Label
Domain-0                                     0   375    16     r-----     226.6 00000000-0000-0000-0000-000000000000        -                -
vm1                                          1   960     2     -b----      34.2 e36429cc-d2a2-4da7-b21d-b053f725e7a7        -                -

root@Zhaman:~# xl list -n
Name                                        ID   Mem VCPUs	State	Time(s) NODE Affinity
Domain-0                                     0   375    16     r-----     226.8 any node
vm1                                          1   960     2     -b----      34.2 0
root@Zhaman:~# xl list -nZ
Name                                        ID   Mem VCPUs	State	Time(s)   Security Label NODE Affinity
Domain-0                                     0   375    16     r-----     226.9                - any node
vm1                                          1   960     2     -b----      34.2                - 0
root@Zhaman:~# xl list -nv
Name                                        ID   Mem VCPUs	State	Time(s)   UUID                            Reason-Code	Security Label NODE Affinity
Domain-0                                     0   375    16     r-----     227.0 00000000-0000-0000-0000-000000000000        -                - any node
vm1                                          1   960     2     -b----      34.2 e36429cc-d2a2-4da7-b21d-b053f725e7a7        -                - 0

Reasonable enough?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-21 16:18     ` Dario Faggioli
@ 2012-12-21 17:02       ` Ian Jackson
  2012-12-21 17:09         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Ian Jackson @ 2012-12-21 17:02 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Dario Faggioli writes ("Re: [Xen-devel] [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity"):
> On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote: 
> > I think you'll probably need to add a line like the following:
> > 
> > #define LIBXL_HAVE_NODEAFFINITY 1
> > 
> > So that people wanting to build against different versions of the 
> > library can behave appropriately.  
> >
> I see what you mean.

I think this is a good idea.

I'm afraid I won't be able to do proper justice to your series until
the new year.

Ian.


* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-21 17:02       ` Ian Jackson
@ 2012-12-21 17:09         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 17:09 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, xen-devel,
	Jan Beulich, Daniel De Graaf, Matt Wilson



On Fri, 2012-12-21 at 17:02 +0000, Ian Jackson wrote: 
> > On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote: 
> > > I think you'll probably need to add a line like the following:
> > > 
> > > #define LIBXL_HAVE_NODEAFFINITY 1
> > > 
> > > So that people wanting to build against different versions of the 
> > > library can behave appropriately.  
> > >
> > I see what you mean.
> 
> I think this is a good idea.
> 
Ok then.

> I'm afraid I won't be able to do proper justice to your series until
> the new year.
> 
Yep, I know you're away, and that's fine; my bad for not posting it
earlier. :-)

I'll be happy to see your comments when you're back.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-21 15:17   ` George Dunlap
  2012-12-21 16:17     ` Dario Faggioli
@ 2013-01-03 16:05     ` Daniel De Graaf
  1 sibling, 0 replies; 57+ messages in thread
From: Daniel De Graaf @ 2013-01-03 16:05 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson, xen-devel,
	Jan Beulich, Matt Wilson, Juergen Gross

On 12/21/2012 10:17 AM, George Dunlap wrote:
> On 19/12/12 19:07, Dario Faggioli wrote:
>> Make it possible to pass the node-affinity of a domain to the hypervisor
>> from the upper layers, instead of always being computed automatically.
>>
>> Note that this also required generalizing the Flask hooks for setting
>> and getting the affinity, so that they now deal with both vcpu and
>> node affinity.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> I can't comment on the XSM stuff -- is any part of the "getvcpuaffinity" stuff for XSM a public interface that needs to be backwards-compatible?  I.e., is s/vcpu//; OK from an interface point of view?
> 
> WRT everything else:
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

It is an interface used only by the XSM policy itself, which is already
going to have non-backwards-compatible changes in 4.3 due to IS_PRIV
reworking and adding new hooks.  The FLASK policy in Xen has not supported
loading policies that do not exactly match the hypervisor's access vectors
because the hypervisor policy is still maintained in the same source code
tree as the hypervisor, so I would consider this similar to the compatibility
between libxc/libxl and the hypervisor rather than trying for the same level
of compatibility that Linux provides for SELinux policies.

A quick grep of xen-unstable finds one instance of getvcpuaffinity in xen.te
that needs to be changed to getaffinity; with that:
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

>> ---
>> Changes from v1:
>>   * added the missing dummy hook for nodeaffinity;
>>   * let the permission renaming affect flask policies too.
>>


* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (10 preceding siblings ...)
  2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
@ 2013-01-11 12:19 ` Ian Campbell
  2013-01-11 13:57   ` Dario Faggioli
  11 siblings, 1 reply; 57+ messages in thread
From: Ian Campbell @ 2013-01-11 12:19 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
>  * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
>  * [ 2/10] xen, libxc: introduce node maps and masks

I was just about to apply these two when I realised they had a
hypervisor component too. Needs an Ack from Jan and/or Keir then, I
think.

>    [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
>    [ 4/10] xen: allow for explicitly specifying node-affinity

These are for Keir and/or Jan, I think.

I wasn't sure if it was OK to apply the rest without the above. But
given that I'm now not immediately doing 1+2 either, that's somewhat
moot.

>  * [ 5/10] libxc: allow for explicitly specifying node-affinity
>  * [ 6/10] libxl: allow for explicitly specifying node-affinity
>    [ 7/10] libxl: optimize the calculation of how many VCPUs can run on a candidate
>  * [ 8/10] libxl: automatic placement deals with node-affinity
>  * [ 9/10] xl: add node-affinity to the output of `xl list`
>    [10/10] docs: rearrange and update NUMA placement documentation
> 
> Thanks and Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 


* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2013-01-11 12:19 ` Ian Campbell
@ 2013-01-11 13:57   ` Dario Faggioli
  2013-01-11 14:09     ` Ian Campbell
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2013-01-11 13:57 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf



On Fri, 2013-01-11 at 12:19 +0000, Ian Campbell wrote:
> On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> >  * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> >  * [ 2/10] xen, libxc: introduce node maps and masks
> 
> I was just about to apply these two when I realised they had a
> hypervisor component too. Needs an Ack from Jan and/or Keir then, I
> think.
> 
Mmm... But there have been comments, on at least 2, and I need to
address them in v3 (which I'll repost shortly... Sorry but I've been
sidetracked a bit).

> >    [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
> >    [ 4/10] xen: allow for explicitly specifying node-affinity
> 
> These are for Keir and or Jan I think.
> 
> I wasn't sure if it was ok to apply the rest without the above. 
>
Well, as above, 3 needs reworking so that I can address Juergen's, Jan's
and George's comments.

So, if it's fine for you, I'd say wait for a v3 before applying anything
from this series.

Sorry if I did not make it clear enough that there _will_ be a v3. :-P

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2013-01-11 13:57   ` Dario Faggioli
@ 2013-01-11 14:09     ` Ian Campbell
  0 siblings, 0 replies; 57+ messages in thread
From: Ian Campbell @ 2013-01-11 14:09 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Jan Beulich, Daniel De Graaf

On Fri, 2013-01-11 at 13:57 +0000, Dario Faggioli wrote:
> On Fri, 2013-01-11 at 12:19 +0000, Ian Campbell wrote:
> > On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> > >  * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> > >  * [ 2/10] xen, libxc: introduce node maps and masks
> > 
> > I was just about to apply these two when I realised they had a
> > hypervisor component too. Needs an Ack from Jan and/or Keir then, I
> > think.
> > 
> Mmm... But there have been comments, on at least 2, and I need to
> address them in v3 (which I'll repost shortly... Sorry but I've been
> sidetracked a bit).

Ah, I was just going by the two acks. OK, I'll wait.

Ian.


end of thread, other threads:[~2013-01-11 14:09 UTC | newest]

Thread overview: 57+ messages
2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2012-12-20  9:17   ` Jan Beulich
2012-12-20  9:35     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
2012-12-20  9:18   ` Jan Beulich
2012-12-20  9:55     ` Dario Faggioli
2012-12-20 14:33     ` George Dunlap
2012-12-20 14:52       ` Jan Beulich
2012-12-20 15:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2012-12-20  6:44   ` Juergen Gross
2012-12-20  8:16     ` Dario Faggioli
2012-12-20  8:25       ` Juergen Gross
2012-12-20  8:33         ` Dario Faggioli
2012-12-20  8:39           ` Juergen Gross
2012-12-20  8:58             ` Dario Faggioli
2012-12-20 15:28             ` George Dunlap
2012-12-20 16:00               ` Dario Faggioli
2012-12-20  9:22           ` Jan Beulich
2012-12-20 15:56   ` George Dunlap
2012-12-20 17:12     ` Dario Faggioli
2012-12-20 16:48   ` George Dunlap
2012-12-20 18:18     ` Dario Faggioli
2012-12-21 14:29       ` George Dunlap
2012-12-21 16:07         ` Dario Faggioli
2012-12-20 20:21   ` George Dunlap
2012-12-21  0:18     ` Dario Faggioli
2012-12-21 14:56       ` George Dunlap
2012-12-21 16:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
2012-12-21 15:17   ` George Dunlap
2012-12-21 16:17     ` Dario Faggioli
2013-01-03 16:05     ` Daniel De Graaf
2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
2012-12-21 15:19   ` George Dunlap
2012-12-21 16:27     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
2012-12-21 15:30   ` George Dunlap
2012-12-21 16:18     ` Dario Faggioli
2012-12-21 17:02       ` Ian Jackson
2012-12-21 17:09         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2012-12-20  8:41   ` Ian Campbell
2012-12-20  9:24     ` Dario Faggioli
2012-12-21 16:00   ` George Dunlap
2012-12-21 16:23     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
2012-12-21 16:22   ` George Dunlap
2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
2012-12-21 16:34   ` George Dunlap
2012-12-21 16:54     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2013-01-11 12:19 ` Ian Campbell
2013-01-11 13:57   ` Dario Faggioli
2013-01-11 14:09     ` Ian Campbell
