From: Dario Faggioli
Subject: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
Date: Fri, 05 Oct 2012 16:08:18 +0200
To: xen-devel@lists.xen.org
Cc: Andre Przywara, Ian Campbell, Anil Madhavapeddy, George Dunlap,
    Andrew Cooper, Juergen Gross, Ian Jackson, Jan Beulich,
    Marcus Granado, Daniel De Graaf, Matt Wilson

Hi Everyone,

Here comes a patch series instilling some NUMA awareness into the
Credit scheduler.

What the patches do is teach Xen's scheduler how to try to maximize
performance on a NUMA host, taking advantage of the information coming
from the automatic NUMA placement we have in libxl. Right now, the
placement algorithm runs and selects a node (or a set of nodes) where
it is best to put a new domain. Then, all the memory for the new
domain is allocated from those node(s) and all the vCPUs of the new
domain are pinned to the pCPUs of those node(s). What we do here is,
instead of statically pinning the domain's vCPUs to the nodes' pCPUs,
have the (Credit) scheduler _prefer_ running them there. That enables
most of the performance benefits of "real" pinning, but without its
intrinsic lack of flexibility.

The above happens by extending to the scheduler the knowledge of a
domain's node-affinity. We then ask it to first try to run the
domain's vCPUs on one of the nodes the domain has affinity with. Of
course, if that turns out to be impossible, it falls back to the old
behaviour (i.e., considering vcpu-affinity only). A small,
self-contained sketch of this "prefer, then fall back" idea follows
the benchmark numbers below.

Allow me to mention that NUMA-aware scheduling is not only one of the
items of the NUMA roadmap I'm trying to maintain here:
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap. It is also one of the
features we decided we want for Xen 4.3 (and thus it is part of the
list of such features that George is maintaining).

Up to now, I've been able to thoroughly test this only on my
2-NUMA-node test box, by running the SpecJBB2005 benchmark
concurrently on multiple VMs, and the results look really nice. A
full set of what I got can be found inside my presentation from last
XenSummit, which is available here:

 http://www.slideshare.net/xen_com_mgr/numa-and-virtualization-the-case-of-xen?ref=http://www.xen.org/xensummit/xs12na_talks/T9.html

However, I re-ran some of the tests over the last few days (since I
changed some bits of the implementation), and here's what I got:

 -------------------------------------------------------
          SpecJBB2005 Total Aggregate Throughput
 -------------------------------------------------------
 #VMs   No NUMA affinity   NUMA affinity &     +/- %
                             scheduling
 -------------------------------------------------------
   2       34653.273         40243.015        +16.13%
   4       29883.057         35526.807        +18.88%
   6       23512.926         27015.786        +14.89%
   8       19120.243         21825.818        +14.15%
  10       15676.675         17701.472        +12.91%

Basically, results are consistent with what is shown in the super-nice
graphs I have in the slides above! :-)

As said, this looks nice to me, especially considering that my test
machine is quite small, i.e., its 2 nodes are very close to each other
from a latency point of view. I really expect more improvement on
bigger hardware, where a much greater NUMA effect is to be expected.
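As promised above, here is a minimal, self-contained sketch of the
"prefer node-affinity, fall back to vcpu-affinity" idea. Please note
this is NOT the actual credit scheduler code from patch 3/8: the names
(pick_cpu, first_cpu) and the toy uint64_t cpumask are made up purely
for illustration.

    /* Toy illustration only, not Xen code: prefer a vCPU's
     * node-affinity when picking a pCPU, fall back to plain
     * vcpu-affinity if no node-affine pCPU is available. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t cpumask;   /* bit i set => pCPU i is in the set */

    static int first_cpu(cpumask m)
    {
        for (int i = 0; i < 64; i++)
            if (m & (1ULL << i))
                return i;
        return -1;              /* empty mask */
    }

    /* Try idle pCPUs within both the vcpu-affinity and the domain's
     * node-affinity first; if that set is empty, consider the
     * vcpu-affinity only (the "old behaviour"). */
    static int pick_cpu(cpumask idle, cpumask vcpu_aff, cpumask node_aff)
    {
        cpumask preferred = idle & vcpu_aff & node_aff;

        if (preferred)
            return first_cpu(preferred);
        return first_cpu(idle & vcpu_aff);
    }

    int main(void)
    {
        cpumask idle     = 0xF0;   /* pCPUs 4-7 are idle            */
        cpumask vcpu_aff = 0xFF;   /* vCPU may run on pCPUs 0-7     */
        cpumask node_aff = 0x0F;   /* domain's node spans pCPUs 0-3 */

        /* All node-affine pCPUs are busy, so we fall back to pCPU 4. */
        printf("picked pCPU %d\n", pick_cpu(idle, vcpu_aff, node_aff));
        return 0;
    }

Built with any C99 compiler, this prints "picked pCPU 4": the
node-affine pCPUs (0-3) are all busy, so the pick falls back to the
plain vcpu-affinity, which is exactly the soft preference described
above.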
Of course, I myself will continue benchmarking (hopefully, on systems
with more than 2 nodes too), but should anyone want to run their own
testing, that would be great, so feel free to do that and report
results to me and/or to the list!

A little bit more about the series:

 1/8 xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 2/8 xen, libxc: introduce node maps and masks

Is some preparation work.

 3/8 xen: let the (credit) scheduler know about `node affinity`

Is where the vcpu load balancing logic of the credit scheduler is
modified to support node-affinity.

 4/8 xen: allow for explicitly specifying node-affinity
 5/8 libxc: allow for explicitly specifying node-affinity
 6/8 libxl: allow for explicitly specifying node-affinity
 7/8 libxl: automatic placement deals with node-affinity

Is what wires the in-scheduler node-affinity support up to the
external world. Please note that patch 4 touches XSM and Flask, which
is the area I have the least experience with and the least chance to
test properly. So, if Daniel and/or anyone interested in that could
take a look and comment, that would be awesome.

 8/8 xl: report node-affinity for domains

Is just a small output enhancement.

Thanks and Regards,
Dario

-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)