From: Dario Faggioli
Subject: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
Date: Fri, 05 Oct 2012 16:08:18 +0200
To: xen-devel@lists.xen.org
Cc: Andre Przywara, Ian Campbell, Anil Madhavapeddy, George Dunlap,
    Andrew Cooper, Juergen Gross, Ian Jackson, Jan Beulich,
    Marcus Granado, Daniel De Graaf, Matt Wilson

Hi Everyone,

Here comes a patch series instilling some NUMA awareness into the
Credit scheduler.

What the patches do is teach Xen's scheduler how to try to maximize
performance on a NUMA host, taking advantage of the information coming
from the automatic NUMA placement we have in libxl. Right now, the
placement algorithm runs and selects a node (or a set of nodes) where
it is best to put a new domain. Then, all the memory for the new
domain is allocated from those node(s) and all the vCPUs of the new
domain are pinned to the pCPUs of those node(s). What we do here is,
instead of statically pinning the domain's vCPUs to the nodes' pCPUs,
have the (Credit) scheduler _prefer_ running them there. That enables
most of the performance benefits of "real" pinning, but without its
intrinsic lack of flexibility.

The above happens by extending to the scheduler the knowledge of a
domain's node-affinity. We then ask it to first try to run the
domain's vCPUs on one of the nodes the domain has affinity with. Of
course, if that turns out to be impossible, it falls back to the old
behaviour (i.e., considering vcpu-affinity only). A small,
self-contained sketch of this "prefer, then fall back" idea follows
the benchmark numbers below.

Allow me to mention that NUMA-aware scheduling is not only one of the
items of the NUMA roadmap I'm trying to maintain here:
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap. It is also one of the
features we decided we want for Xen 4.3 (and thus it is part of the
list of such features that George is maintaining).

Up to now, I've been able to thoroughly test this only on my
2-NUMA-node test box, by running the SpecJBB2005 benchmark
concurrently on multiple VMs, and the results look really nice. A
full set of what I got can be found inside my presentation from last
XenSummit, which is available here:

 http://www.slideshare.net/xen_com_mgr/numa-and-virtualization-the-case-of-xen?ref=http://www.xen.org/xensummit/xs12na_talks/T9.html

However, I re-ran some of the tests over the last few days (since I
changed some bits of the implementation), and here's what I got:

 -------------------------------------------------------
          SpecJBB2005 Total Aggregate Throughput
 -------------------------------------------------------
 #VMs   No NUMA affinity   NUMA affinity &     +/- %
                             scheduling
 -------------------------------------------------------
   2       34653.273         40243.015        +16.13%
   4       29883.057         35526.807        +18.88%
   6       23512.926         27015.786        +14.89%
   8       19120.243         21825.818        +14.15%
  10       15676.675         17701.472        +12.91%

Basically, results are consistent with what is shown in the super-nice
graphs I have in the slides above! :-)

As said, this looks nice to me, especially considering that my test
machine is quite small, i.e., its 2 nodes are very close to each other
from a latency point of view. I really expect more improvement on
bigger hardware, where a much greater NUMA effect is to be expected.
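As promised above, here is a minimal, self-contained sketch of the
"prefer node-affinity, fall back to vcpu-affinity" idea. Please note
this is NOT the actual credit scheduler code from patch 3/8: the names
(pick_cpu, first_cpu) and the toy uint64_t cpumask are made up purely
for illustration.

    /* Toy illustration only, not Xen code: prefer a vCPU's
     * node-affinity when picking a pCPU, fall back to plain
     * vcpu-affinity if no node-affine pCPU is available. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t cpumask;   /* bit i set => pCPU i is in the set */

    static int first_cpu(cpumask m)
    {
        for (int i = 0; i < 64; i++)
            if (m & (1ULL << i))
                return i;
        return -1;              /* empty mask */
    }

    /* Try idle pCPUs within both the vcpu-affinity and the domain's
     * node-affinity first; if that set is empty, consider the
     * vcpu-affinity only (the "old behaviour"). */
    static int pick_cpu(cpumask idle, cpumask vcpu_aff, cpumask node_aff)
    {
        cpumask preferred = idle & vcpu_aff & node_aff;

        if (preferred)
            return first_cpu(preferred);
        return first_cpu(idle & vcpu_aff);
    }

    int main(void)
    {
        cpumask idle     = 0xF0;   /* pCPUs 4-7 are idle            */
        cpumask vcpu_aff = 0xFF;   /* vCPU may run on pCPUs 0-7     */
        cpumask node_aff = 0x0F;   /* domain's node spans pCPUs 0-3 */

        /* All node-affine pCPUs are busy, so we fall back to pCPU 4. */
        printf("picked pCPU %d\n", pick_cpu(idle, vcpu_aff, node_aff));
        return 0;
    }

Built with any C99 compiler, this prints "picked pCPU 4": the
node-affine pCPUs (0-3) are all busy, so the pick falls back to the
plain vcpu-affinity, which is exactly the soft preference described
above.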
Of course, I myself will continue benchmarking (hopefully, on systems
with more than 2 nodes too), but should anyone want to run their own
testing, that would be great, so feel free to do that and report
results to me and/or to the list!

A little bit more about the series:

 1/8 xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 2/8 xen, libxc: introduce node maps and masks

Is some preparation work.

 3/8 xen: let the (credit) scheduler know about `node affinity`

Is where the vcpu load balancing logic of the credit scheduler is
modified to support node-affinity.

 4/8 xen: allow for explicitly specifying node-affinity
 5/8 libxc: allow for explicitly specifying node-affinity
 6/8 libxl: allow for explicitly specifying node-affinity
 7/8 libxl: automatic placement deals with node-affinity

Is what wires the in-scheduler node-affinity support up to the
external world. Please note that patch 4 touches XSM and Flask, which
is the area I have the least experience with and the least chance to
test properly. So, if Daniel and/or anyone interested in that could
take a look and comment, that would be awesome.

 8/8 xl: report node-affinity for domains

Is just a small output enhancement.

Thanks and Regards,
Dario

-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)