* [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
@ 2013-08-16  4:13 Yechen Li
  2013-08-16  9:09 ` Jan Beulich
  2013-08-19 12:58 ` David Vrabel
  0 siblings, 2 replies; 25+ messages in thread
From: Yechen Li @ 2013-08-16  4:13 UTC (permalink / raw)
  To: xen-devel
  Cc: ufimtseva, dario.faggioli, Yechen Li, Ian.Jackson, Ian.Campbell,
	JBeulich

This patch contains the details about the design, the benefits and
the intended usage of NUMA-awareness in the ballooning driver, coming
in the form of a markdown document in docs.

Given this is an RFC, it has bits explaining what is still missing,
which will of course need to disappear from the final version, but
which are useful for reviewing the whole series at this stage.

Signed-off-by: Yechen Li <lccycc123@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/misc/numa-aware-ballooning.markdown | 376 +++++++++++++++++++++++++++++++
 1 file changed, 376 insertions(+)
 create mode 100644 docs/misc/numa-aware-ballooning.markdown

diff --git a/docs/misc/numa-aware-ballooning.markdown b/docs/misc/numa-aware-ballooning.markdown
new file mode 100644
index 0000000..3a899f1
--- /dev/null
+++ b/docs/misc/numa-aware-ballooning.markdown
@@ -0,0 +1,376 @@
+# NUMA-Aware Ballooning Design and Implementation #
+
+## Rationale ##
+
+The purpose of this document is to describe how the ballooning driver for
+Linux is made NUMA-aware, why this is important, and its intended usage on
+a NUMA host.
+
+NUMA stands for Non-Uniform Memory Access, which means that a typical NUMA
+host has more than one memory controller and that accessing a specific
+memory location can have a varying cost, depending on which physical CPU
+the access comes from. On such an architecture, it is very important to
+keep all the memory of a domain on the same physical NUMA node (or on the
+smallest possible set of physical NUMA nodes), in order to get good and
+predictable performance.
+
+Ballooning is an effective means of making space when a new domain needs
+to be created. However, on a NUMA host, one may want not only to free some
+memory, but also to free it from a specific node. That could be the case,
+for instance, when making sure the new domain will fit in just one node.
+
+For more information on NUMA and on how Xen deals with it, see [Xen NUMA
+Introduction][numa_intro] on the Xen Wiki, and/or
+[xl-numa-placement.markdown][numa_placement].
+
+## Definitions ##
+
+Some definitions that may come in handy when reading the rest of this
+document:
+
+* physical topology -- the hardware NUMA characteristics of the host. It
+  typically comprises the number of NUMA nodes, the amount of memory each
+  node has, and the number (and identity) of the physical CPUs belonging
+  to each node.
+* virtual topology -- the same kind of information as the physical
+  topology, but applying to a guest rather than to the host (hence the
+  "virtual"). It is constructed at guest creation time and comprises the
+  number of virtual NUMA nodes, the amount of guest memory each virtual
+  node is entitled to, and the number (and identity) of the vCPUs
+  belonging to each virtual node.
+
+A physical node may be abbreviated as pnode or p-node, and a virtual node
+is often abbreviated as vnode or v-node (similarly to what happens with
+pCPU and vCPU).
+
+### Physical topology and virtual topology ###
+
+It is __not__ __at__ __all__ __required__ that there is any relationship
+between the physical and virtual topology, exactly as it is not at all
+required that there is any relationship between physical CPUs and virtual
+CPUs.
+
+That is to say, on a host with 4 NUMA nodes, each with 8 GiB RAM and
+4 pCPUs, there can be guests with 1, 2 or 4 vnodes (it is not clear yet
+whether it would make sense to allow overbooking; perhaps yes, but only
+after having warned the user about it). In all cases, vnodes can have
+arbitrary sizes, with respect to both the amount of memory and the number
+of vCPUs in each of them. Similarly, there is neither a requirement nor a
+guarantee that either:
+
+* the memory of a vnode is actually allocated on a particular pnode, nor
+  even that it is allocated on __only__ one pnode
+* the vCPUs belonging to a vnode are actually bound or pinned to the pCPUs
+  of a particular pnode.
+
+Of course, although it is not required, having some kind of relationship in
+place between the physical and virtual topology would be beneficial for both
+performance and resource usage optimization. This is really similar to what
+happens with vCPU to pCPU pinning: it is not required, but if it is done
+properly, performance will increase.
+
+In this specific case, the ideal situation would be as follows:
+
+* all the guest memory pages that constitute one vnode, say vnode #1, are
+  backed by frames that, on the host, belong to the same pnode, say pnode #3
+  (and that should be true for all the vnodes);
+* all the guest vCPUs that belong to vnode #1 have, as far as the Xen
+  scheduler is concerned, NUMA node affinity with (some of) the pCPUs
+  belonging to pnode #3 (and that again should hold for all the vnodes).
+
+For that reason, a mechanism for the user to specify the relationship between
+virtual and physical topologies at guest creation time will be implemented,
+as well as some automated logic for coming up with a sane default, in case
+the user does not say anything (as happens with automatic NUMA placement).
+
+## Current status ##
+
+At the time of writing this document, the work to allow a virtual NUMA topology
+to be specified and properly constructed within a domain is still in progress.
+Patches will be released soon, both for Xen and Linux, implementing exactly
+what has been described in the previous section.
+
+NUMA-aware ballooning needs this feature to be present, or it won't work
+properly, so the remainder of this document assumes that virtual NUMA is
+in place.
+
+We will also assume that the following "services" are in place, provided
+by the Xen hypervisor via a suitable hypercall interface, and available
+to both the toolstack and the ballooning driver.
+
+### nodemask VNODE\_TO\_PNODE(int vnode) ###
+
+This service is provided by the hypervisor (and wired, if necessary, all the
+way up to the proper toolstack layer or guest kernel), since it is only Xen
+that knows both the virtual and the physical topologies.
+
+It takes the ID of a vnode, as identified within the virtual topology of a
+guest, and returns a bitmask. The bits set to 1 in it correspond to the
+pnodes from which, on the host, the memory of the given vnode comes.
+
+The ideal situation is that such a bitmask has only 1 bit set, since having
+the memory of a vnode come from more than one pnode is clearly bad for
+access locality.
+
+### nodemask PNODE\_TO\_VNODE(int pnode) ###
+
+This service is provided by the hypervisor (and wired, if necessary, all the
+way up to the proper toolstack layer or guest kernel), since it is only Xen
+that knows both the virtual and the physical topologies.
+
+It takes the ID of a pnode, as identified within the physical topology of
+the host, and returns a bitmask. The bits set to 1 in it correspond to the
+vnodes, in the guest, that use memory from the given pnode.
+
+The ideal situation is that such a bitmask has only 1 bit set, but it is
+less of a problem (with respect to the case above) if there are more.
+
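+Just to illustrate how such a service is meant to be consumed, the sketch
+below checks whether a vnode is "well placed". Note that
+xen\_vnode\_to\_pnode() is only a placeholder wrapper for the
+VNODE\_TO\_PNODE service: as said above, the actual interface is still
+work in progress.
+
+    #include <linux/nodemask.h>
+
+    /* Placeholder wrapper around the VNODE_TO_PNODE service. */
+    nodemask_t xen_vnode_to_pnode(int vnid);
+
+    /* A vnode is "well placed" if its memory comes from exactly one
+     * pnode, i.e. the returned mask has a single bit set. */
+    static bool vnode_is_well_placed(int vnid)
+    {
+        nodemask_t pnodes = xen_vnode_to_pnode(vnid);
+
+        return nodes_weight(pnodes) == 1;
+    }
+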
+## Description of the problem ##
+
+Let us use an example. Assume that guest _G_ has 2 vnodes, and that the
+memory for vnode #0 and vnode #1 comes from pnode #0 and pnode #2,
+respectively.
+
+Now, the user wants to create a new guest, but the system is under high
+memory pressure, so he decides to try ballooning _G_ down. He sees that
+pnode #2 has the best chance of accommodating all the memory for the new
+guest, which would be really good for performance, if only he can make
+space there. _G_ is the only domain using memory from pnode #2 but, as
+said above, not all of its memory comes from there.
+
+So, right now, the user has no way to specify that he wants to balloon down
+_G_ in such a way that as many free pages as possible come from pnode #2,
+rather than from pnode #0. He can ask _G_ to balloon down, but there is no
+guarantee of which pnode the memory will be freed from.
+
+The same applies to the ballooning up case, when the user, for some specific
+reason, wants to be sure that it is the memory of some (other) specific pnode
+that will be used.
+
+What is necessary is something allowing the user to specify from what pnode
+he wants the free pages, and allowing the ballooning driver to make that
+happen.
+
+## The current situation ##
+
+This section describes the details of how ballooning works right now. It is
+possible to skip this if you already possess such knowledge.
+
+Ballooning happens inside the guest, with of course some "help" from Xen.
+The notification that ballooning up or down is necessary reaches the
+ballooning driver in the guest via xenstore. See [here][xenstore] and
+[here][xenstore_paths] for more details.
+
+Typically, the user --via xl or via any other high level toolstack-- calls
+libxl\_set\_memory\_target(), which writes the new desired memory target for
+the guest to a xenstore key: ~/memory/target.  
+The guest has a xenstore watch on this key and, every time it changes, work
+inside the guest itself is scheduled (the body of that work is the
+balloon\_process() function). It is the responsibility of this component of
+the ballooning driver to read the key and take proper action.
+
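+As a purely illustrative sketch, this is roughly what the guest side of
+this mechanism looks like (simplified from the flow in the Linux driver,
+drivers/xen/xen-balloon.c; error handling omitted):
+
+    /* Fires whenever ~/memory/target changes: re-read the key and let
+     * balloon_set_new_target() schedule balloon_process(). */
+    static void watch_target(struct xenbus_watch *watch,
+                             const char **vec, unsigned int len)
+    {
+        unsigned long long new_target;
+
+        if (xenbus_scanf(XBT_NIL, "memory", "target", "%llu", &new_target) != 1)
+            return;                 /* key not written yet */
+
+        /* xenstore holds the target in KiB, the driver works in pages */
+        balloon_set_new_target(new_target >> (PAGE_SHIFT - 10));
+    }
+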
+There are two possible situations:
+
+* new target < current usage -- the ballooning driver maintains a list of
+  free pages (similar to what an OS does for tracking free memory). Despite
+  actually being free, they are hidden from the OS, which can't use them
+  unless they are handed back to it by the ballooning driver. From the OS
+  perspective, the ballooning driver can be seen as some sort of monster,
+  eating some of the guest's free memory.  
+  To steal a free page from the guest OS, and add it to the list of ballooned
+  pages, the driver calls alloc\_page(). From the point of view of the OS,
+  the ballooning monster ate that page, decreasing the amount of free memory
+  (inside the guest).  
+  At this point, a XENMEM\_decrease\_reservation hypercall is issued, which
+  hands the newly allocated page to Xen so that it can unmap it, turning it,
+  for all intents and purposes, into free host memory.
+* new target > current usage -- the ballooning driver picks one page from
+  the list of ballooned pages it maintains and issues a
+  XENMEM\_populate\_physmap hypercall, with the PFN of the chosen page as an
+  argument.  
+  Xen goes looking for some free memory in the host and, when it finds some,
+  allocates a page and maps it into the guest's memory space, using the PFN
+  it has been provided with.  
+  At this point, the ballooning driver calls \_\_free\_reserved\_page() to
+  let the guest OS know that the PFN can now be considered a free page
+  (again), which also means removing the page from the ballooned pages list.
+
+Operations like the ones described above go on within the ballooning driver,
+one after the other, until the new target is reached (a simplified sketch of
+both paths follows).
+
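+The following minimal sketch summarizes the two paths just described. It is
+not the actual driver code: decrease\_reservation() and
+populate\_physmap() stand in for the real (batched) hypercall invocations,
+and all error handling and locking is omitted.
+
+    /* Ballooning down by one page: steal a page from the OS, remember it,
+     * and hand it to Xen via XENMEM_decrease_reservation. */
+    static void balloon_down_one_page(void)
+    {
+        struct page *page = alloc_page(GFP_HIGHUSER);
+
+        if (!page)
+            return;
+        list_add(&page->lru, &ballooned_pages);
+        decrease_reservation(page);
+    }
+
+    /* Ballooning up by one page: take a ballooned page, ask Xen to back it
+     * again via XENMEM_populate_physmap, and give it back to the OS. */
+    static void balloon_up_one_page(void)
+    {
+        struct page *page = list_first_entry(&ballooned_pages,
+                                             struct page, lru);
+
+        list_del(&page->lru);
+        populate_physmap(page);
+        __free_reserved_page(page);
+    }
+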
+## NUMA-aware ballooning ##
+
+The new NUMA-aware ballooning logic works as follows.
+
+There is room, in libxl\_set\_memory\_target() for two more parameters, in
+addition to the new memory target:
+
+* _pnid_ -- the ID of the pnode on which the user wants to try to get some
+  free memory
+* _nodeexact_ -- a boolean specifying whether, in case it is not possible to
+  reach the new ballooning target with memory from pnode _pnid_ alone, the
+  user is fine with using memory from other pnodes.  
+  If _nodeexact_ is true, it is possible that the new target is not reached;
+  if it is false, the new target will (probably) be reached, but it is
+  possible that some memory is freed on pnodes other than _pnid_.
+
+To let the ballooning driver know about these new parameters, a new xenstore
+key exists: ~/memory/target\_nid. So, for a proper NUMA-aware ballooning
+operation to occur, the user should write the proper values to both keys,
+~/memory/target\_nid and ~/memory/target (an illustrative sketch of the
+toolstack side follows).
+
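+Purely as an illustration of what the toolstack side boils down to, the
+sketch below uses the plain xenstore client library rather than the actual
+libxl plumbing (which is part of this series); the "pnid,nodeexact"
+encoding of the new key is an assumption, not a settled format.
+
+    #include <stdio.h>
+    #include <string.h>
+    #include <inttypes.h>
+    #include <xenstore.h>
+
+    static int set_numa_target(uint32_t domid, uint64_t target_kib,
+                               int pnid, int nodeexact)
+    {
+        struct xs_handle *xsh = xs_open(0);
+        char path[64], val[32];
+
+        if (!xsh)
+            return -1;
+
+        /* Write the NUMA hint first, then the target: the latter is what
+         * triggers the guest's xenstore watch. */
+        snprintf(path, sizeof(path),
+                 "/local/domain/%u/memory/target_nid", domid);
+        snprintf(val, sizeof(val), "%d,%d", pnid, nodeexact);
+        xs_write(xsh, XBT_NULL, path, val, strlen(val));
+
+        snprintf(path, sizeof(path),
+                 "/local/domain/%u/memory/target", domid);
+        snprintf(val, sizeof(val), "%"PRIu64, target_kib);
+        xs_write(xsh, XBT_NULL, path, val, strlen(val));
+
+        xs_close(xsh);
+        return 0;
+    }
+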
+So, with NUMA-aware ballooning, ballooning down and up work as follows (a
+simplified sketch of the ballooning down path is given after this list):
+
+* target < current usage -- first of all, the ballooning driver uses the
+  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
+  as explained above) to translate _pnid_ (which it reads from xenstore) into
+  the corresponding set of vnode IDs, say _{vnids}_ (which will be a one
+  element set in case only one vnode in the guest is allocated out of pnode
+  _pnid_). It then uses alloc\_pages\_node(), with one or more elements from
+  _{vnids}_ as nodes, in order to rip the proper amount of ballooned pages
+  out of the guest OS, and hand them to Xen (with the same hypercall as
+  above).  
+  If not enough pages can be found, and only if _nodeexact_ is false, it
+  starts considering other nodes (by just calling alloc\_page()). If
+  _nodeexact_ is true, it just returns.
+* target > current usage -- first of all, the ballooning driver uses the
+  PNODE\_TO\_VNODE() service to translate _pnid_ into the corresponding set
+  of vnode IDs, say _{vnids}_. It then looks for enough ballooned pages,
+  among the ones belonging to the vnodes in _{vnids}_, and asks Xen to map
+  them via XENMEM\_populate\_physmap. While doing the latter, it explicitly
+  tells the hypervisor to allocate the actual host pages on pnode _pnid_.  
+  If not enough ballooned pages can be found among the vnodes in _{vnids}_,
+  and only if _nodeexact_ is false, it falls back to looking for ballooned
+  pages in other vnodes. For each one it finds, it calls VNODE\_TO\_PNODE()
+  to see what pnode it belongs to, and then asks Xen to map it, exactly as
+  above.
+
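+A minimal sketch of the ballooning down path follows. It is illustrative
+only: pnode\_to\_vnode() stands for the PNODE\_TO\_VNODE() service,
+decrease\_reservation() for the XENMEM\_decrease\_reservation invocation,
+and ballooned\_pages[] for the per-vnode lists described right below; none
+of these names is an existing interface.
+
+    static long balloon_down_numa(int pnid, bool nodeexact, long nr_pages)
+    {
+        nodemask_t vnids = pnode_to_vnode(pnid);
+        long done = 0;
+        int vnid;
+
+        /* First, try to steal pages only from the vnodes backed by pnid. */
+        for_each_node_mask(vnid, vnids) {
+            while (done < nr_pages) {
+                struct page *page = alloc_pages_node(vnid, GFP_HIGHUSER, 0);
+
+                if (!page)
+                    break;                  /* this vnode is exhausted */
+                list_add(&page->lru, &ballooned_pages[vnid]);
+                decrease_reservation(page);
+                done++;
+            }
+        }
+
+        /* Fall back to any vnode only if the user allowed it. */
+        while (!nodeexact && done < nr_pages) {
+            struct page *page = alloc_page(GFP_HIGHUSER);
+
+            if (!page)
+                break;
+            list_add(&page->lru, &ballooned_pages[page_to_nid(page)]);
+            decrease_reservation(page);
+            done++;
+        }
+
+        return done;
+    }
+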
+The biggest difference between current and NUMA-aware ballooning is that the
+latter needs to keep multiple lists of the ballooned pages in an array, with
+one element for each virtual node. This way, it is always evident, at any
+given time, what ballooned pages belong to what vnode.
+
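+In terms of data structures, and assuming MAX\_NUMNODES (from the Linux
+headers) as an upper bound on the number of vnodes, this essentially boils
+down to something like the following (again, just a sketch):
+
+    /* One list of ballooned pages per virtual node (instead of the single
+     * LIST_HEAD(ballooned_pages) the driver has today), so that the vnode
+     * of every ballooned page is known at any time.  Each entry is
+     * initialized with INIT_LIST_HEAD() at driver init time. */
+    static struct list_head ballooned_pages[MAX_NUMNODES];
+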
+Regarding the "stealing a page from the OS" part, it is enough to use the
+Linux function alloc\_pages\_node() in place of alloc\_page().
+
+Finally, from the hypercall point of view, both XENMEM\_decrease\_reservation
+and XENMEM\_populate\_physmap are already NUMA-aware, so it is just a matter
+of passing them the proper arguments (see the illustrative snippet below).
+
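+For instance, on the populate side, the requested pnode can be conveyed
+through the mem\_flags field of struct xen\_memory\_reservation, as defined
+in Xen's public memory.h. The snippet below is illustrative only: it
+assumes frame\_list already holds the PFNs of the chosen ballooned pages,
+and that the guest's copy of the interface headers exposes mem\_flags and
+XENMEMF\_node().
+
+    struct xen_memory_reservation reservation = {
+        .extent_order = 0,
+        .domid        = DOMID_SELF,
+        .nr_extents   = nr_pages,
+        /* Ask Xen to allocate the backing memory on physical node pnid. */
+        .mem_flags    = XENMEMF_node(pnid),
+    };
+    int rc;
+
+    set_xen_guest_handle(reservation.extent_start, frame_list);
+    rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &reservation);
+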
+### Usage examples ###
+
+Let us assume we have a guest with 4 vnodes on a host with 8 pnodes. The
+virtual to physical topology mapping is as follows:
+
+* vnode #0 --> pnode #1
+* vnode #1 --> pnode #3
+* vnode #2 --> pnode #3
+* vnode #3 --> pnode #5
+
+Let us also assume that the user has just decided that he wants to change
+the current memory allocation scheme, by ballooning (up or down) $DOMID.
+
+Here are some usage examples for NUMA-aware ballooning, with a simplified
+explanation of what happens in the various situations.
+
+#### Ballooning down, example 1 ####
+
+The user wants N free pages from pnode #3, and only from there:
+
+1. user invokes `xl mem-set -n 3 -e $DOMID N`
+2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
+3. ballooning driver asks Xen which vnodes are backed by pnode #3, and gets
+   {1,2}
+4. ballooning driver tries to steal N pages from vnode #1 and/or vnode #2
+   (both are fine) and adds them to the ballooned out pages (i.e., allocates
+   them in the guest and unmaps them in Xen)
+5. whether phase 4 succeeds or not, return, having potentially freed fewer
+   than N pages on pnode #3
+
+#### Ballooning down, example 2 ####
+
+The user wants N free pages from pnode #1 but, if that is impossible, is
+fine with other pnodes contributing to them:
+
+1. user invokes `xl mem-set -n 1 $DOMID N`
+2. libxl writes N to ~/memory/target and {1,false} to ~/memory/target\_nid
+3. ballooning driver asks Xen which vnodes are backed by pnode #1, and gets
+   {0}
+4. ballooning driver tries to steal N pages from vnode #0 and adds them to
+   the ballooned out pages
+5. if fewer than N pages are found in phase 4, the ballooning driver tries
+   to steal from any vnode
+
+#### Ballooning up, example 1 ####
+
+The user wants to give the guest N more pages, taking them from pnode #3,
+and only from there:
+
+1. user invokes `xl mem-set -n 3 -e $DOMID N`
+2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
+3. ballooning driver asks Xen which vnodes are backed by pnode #3, and gets
+   {1,2}
+4. ballooning driver locates N ballooned pages belonging to vnode #1 or
+   vnode #2 (both would do)
+5. ballooning driver asks Xen to allocate N pages on pnode #3
+6. ballooning driver asks Xen to map the new pages onto the ballooned pages
+   and hands the result back to the guest OS
+
+NOTICE that, since _nodeexact_ was true, if phase 4 fails to find N pages
+(say it finds Y < N) then phases 5 and 6 will work with Y.
+
+## Limitations ##
+
+This feature represents a performance (and resource usage) optimization.
+As is common in these cases, it does its best under certain conditions;
+when put to work under different conditions, its beneficial potential may
+be diminished or, at worst, lost completely. Nevertheless, it is required
+that everything keeps working and that performance at least does not drop
+below the level it was at without the feature itself.
+
+In this case, the best or worst working conditions have to do with how well
+the guest domain is placed on the host NUMA nodes, i.e., on how many and what
+physical NUMA nodes its memory comes from. If the guest has a virtual NUMA
+topology, they also have to do with the relationship between the virtual and
+the physical topology.
+
+It is possible to identify three situations:
+
+1. multiple vnodes are backed by host memory from only one pnode (but not
+   vice versa, i.e., there is no single vnode backed by host memory from two
+   or more pnodes). We can call this scenario many-to-one;
+2. each vnode is backed by host memory coming from one specific pnode (all
+   different from each other). We can call this scenario one-to-one;
+3. there are vnodes that are backed by host memory coming from multiple
+   pnodes at the same time. We can call this scenario one-to-many.
+
+If a guest does not have a virtual NUMA topology, it can be seen as having
+only one virtual NUMA node, accommodating all the memory.
+
+Among the three scenarios above, 1 and 2 are fine, meaning that NUMA-aware
+ballooning will succeed in using host memory from the correct pnodes,
+guaranteeing improved performance and better resource utilization than the
+current situation (if there is enough free memory in the proper place, of
+course).
+
+The third situation (one-to-many) is the only problematic one. In fact, if
+we try to get several pages from pnode #1, with guest vnode #0 having pages
+on both pnode #1 and pnode #2, the ballooning driver will pick pages from
+vnode #0's pool of ballooned pages without any chance of knowing whether
+they are actually backed by host pages from pnode #1 or from pnode #2.
+
+However, this just means that, in this case, NUMA-aware ballooning behaves
+pretty much like the current (i.e., __non__ NUMA-aware) ballooning, which is
+certainly undesirable, but still acceptable. On a related note, having the
+memory of a vnode split among two (or more) pnodes is already a non-optimal
+situation, at least from a NUMA perspective, so it would be unrealistic to
+ask the ballooning driver to do anything better than the above.
+
+[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
+[numa_placement]: http://xenbits.xen.org/docs/unstable/misc/xl-numa-placement.markdown
+[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
+[xenstore_paths]: http://xenbits.xen.org/docs/unstable/misc/xenstore-paths.markdown
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16  4:13 [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning Yechen Li
@ 2013-08-16  9:09 ` Jan Beulich
  2013-08-16 10:18   ` Li Yechen
                     ` (3 more replies)
  2013-08-19 12:58 ` David Vrabel
  1 sibling, 4 replies; 25+ messages in thread
From: Jan Beulich @ 2013-08-16  9:09 UTC (permalink / raw)
  To: Yechen Li; +Cc: dario.faggioli, Ian.Jackson, Ian.Campbell, ufimtseva, xen-devel

>>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> +So, in NUMA aware ballooning, ballooning down and up works as follows:
> +
> +* target < current usage -- first of all, the ballooning driver uses the
> +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
> +  as explained above) to translate _pnid_ (that it reads from xenstore) to
> +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will

This looks conceptually wrong: The balloon driver should have no
need to know about pNID-s; it should be the tool stack doing the
translation prior to writing the xenstore node.

Further, the new xenstore node would presumably better be a mask
than a single vNID, since in order to e.g. balloon up another guest
already spanning multiple nodes, giving the tool stack a way to ask
for memory on any of the spanned nodes.

And finally, coming back to what Tim had already pointed out - doing
things the way you propose can cause an imbalance in the
ballooned down guest, penalizing it in favor of not penalizing the
intended consumer of the recovered memory. Therefore I wonder
whether, without any new xenstore node, it wouldn't be better to
simply require conforming balloon drivers to balloon out memory
evenly across the domain's virtual nodes.

> +The biggest difference between current and NUMA-aware ballooning is that the
> +latter needs to keep multiple lists of the ballooned pages in an array, with
> +one element for each virtual node. This way, it is always evident, at any
> +given time, what ballooned pages belong to what vnode.

That's wrong afaict: ballooned out pages aren't associated with any
memory, and hence can't be associated with any vNID. Once they
get re-populated, which vNID the memory belongs to is an attribute
of the memory coming in, not the control structure that it's to be
associated with.

I believe this thinking of yours stems from the fact that in Linux the
page control structures are associated with nodes by way of the
physical memory map being split into larger pieces, each coming from
a particular node. But other OSes don't need to follow this model,
and what you propose would also exclude extending the spanned
nodes set if memory gets ballooned in that's not associated with
any node the domain so far was "knowing" of.

> +Regarding the stealing a page from the OS part, it is enough to use the Linux
> +function alloc_page_node(), in place of alloc\_page().

Such statement seems to confirm that you're thinking Linux centric
instead of defining a generic model.

Jan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16  9:09 ` Jan Beulich
@ 2013-08-16 10:18   ` Li Yechen
       [not found]   ` <CAP5+zHQ128UVGsGjxsNdvSOupt42Gue2+1nLVg-KYrb=exqqCw@mail.gmail.com>
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Li Yechen @ 2013-08-16 10:18 UTC (permalink / raw)
  Cc: Dario Faggioli, Ian.Jackson, Ian.Campbell, Elena Ufimtseva, xen-devel



Hi Jan,

On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:

> This looks conceptually wrong: The balloon driver should have no
> need to know about pNID-s; it should be the tool stack doing the
> translation prior to writing the xenstore node.
>
> Further, the new xenstore node would presumably better be a mask
> than a single vNID, since in order to e.g. balloon up another guest
> already spanning multiple nodes, giving the tool stack a way to ask
> for memory on any of the spanned nodes.
>
Yeah, you are right. I'm also telling myself that it's not a good idea to
let the guest OS know the physical IDs.
These two translations between p-nid and v-nid could be put either
inside Xen, or inside the balloon driver, which is the current state.
Anyway, the interfaces for the guest NUMA topology have not been
implemented yet. I'll mark this as a to-do issue and move it into Xen
in the future.

> And finally, coming back what Tim had already pointed out - doing
> things the way you propose can cause an imbalance in the
> ballooned down guest, penalizing it in favor of not penalizing the
> intended consumer of the recovered memory. Therefore I wonder
> whether, without any new xenstore node, it wouldn't be better to
> simply require conforming balloon drivers to balloon out memory
> evenly across the domain's virtual nodes.

I should say sorry here, but I don't quite understand the "whether" part.
The "new xenstore node" just stores the requirement from the user, so that
the balloon driver can read it. It's similar to ~/memory/target. This new
node could store either a p-nodeid or a v-nodeid, depending on whether the
interfaces we talked about above are placed inside Xen or inside the guest
OS. Do you have a better way to pass this requirement to the balloon
driver, instead of creating a new xenstore node? I'd be very happy if you
have one, since I don't like the way I have done it (creating a new node)
either!

> > +The biggest difference between current and NUMA-aware ballooning is that the
> > +latter needs to keep multiple lists of the ballooned pages in an array, with
> > +one element for each virtual node. This way, it is always evident, at any
> > +given time, what ballooned pages belong to what vnode.
>
> That's wrong afaict: ballooned out pages aren't associated with any
> memory, and hence can't be associated with any vNID. Once they
> get re-populated, which vNID the memory belongs to is an attribute
> of the memory coming in, not the control structure that it's to be
> associated with.
>
> I believe this thinking of yours stems from the fact that in Linux the
> page control structures are associated with nodes by way of the
> physical memory map being split into larger pieces, each coming from
> a particular node. But other OSes don't need to follow this model,
> and what you propose would also exclude extending the spanned
> nodes set if memory gets ballooned in that's not associated with
> any node the domain so far was "knowing" of.

You are exactly right again, this design is only for the Linux balloon
driver. For Linux, the balloon driver can choose which pages to balloon
in/out, so we can associate the pages with a v-nodeid.
For other kinds of architecture, please forgive me, I haven't thought
that far ahead...

> > +Regarding the stealing a page from the OS part, it is enough to use the Linux
> > +function alloc_page_node(), in place of alloc\_page().
>
> Such statement seems to confirm that you're thinking Linux centric
> instead of defining a generic model.
>
> Jan
>
Yes.

And thank you again for spending your valuable time reviewing my patch!
I hope my answers address your questions. If not, please point it out
to me!


-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering  and Computer Science,
Peking University, China

Nothing is impossible because impossible itself  says: " I'm possible "
lccycc From PKU


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
       [not found]   ` <CAP5+zHQ128UVGsGjxsNdvSOupt42Gue2+1nLVg-KYrb=exqqCw@mail.gmail.com>
@ 2013-08-16 13:21     ` Jan Beulich
  2013-08-16 14:17       ` Li Yechen
  2013-08-16 22:53       ` Dario Faggioli
  0 siblings, 2 replies; 25+ messages in thread
From: Jan Beulich @ 2013-08-16 13:21 UTC (permalink / raw)
  To: Li Yechen; +Cc: xen-devel

(re-adding xen-devel to Cc)

>>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> And finally, coming back what Tim had already pointed out - doing
>> things the way you propose can cause an imbalance in the
>> ballooned down guest, penalizing it in favor of not penalizing the
>> intended consumer of the recovered memory. Therefore I wonder
>> whether, without any new xenstore node, it wouldn't be better to
>> simply require conforming balloon drivers to balloon out memory
>> evenly across the domain's virtual nodes.
> 
> I should say sorry here, but I'm not quite understand the "whether" part.
> the "new xenstore node" just store the requirement from user, so that
> balloon could read it. It's similar to ~/memory/target. This new node
> could store either p-nodeid, or v-nodeid, according to the interfaces we
> talked above is placed inside of xen, or inside of guest OS.
> Do you have a better way to pass this requirement to balloon, instead of
> create a new xenstore node? I'd be very happy if you have one, since
> nor do I like the way I have done(create a new node) already!

As said - I'd want you to evaluate a model without such a new node,
and with instead the requirement placed on the balloon driver to
balloon out pages evenly allocated across the guest's virtual nodes.

> You are exactly right again, this design is only for Linux balloon driver.
> For Linux, balloon can choose which page to balloon in/out. So we can
> assocate the pages with v-nodeid.
> For the other kinds of architechure, please forgive me that I haven't think
> of that far...

The abstract model shouldn't take OS implementation details or
policies into account; the implementation later of course can (and
frequently will need to).

Jan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16 13:21     ` Jan Beulich
@ 2013-08-16 14:17       ` Li Yechen
  2013-08-16 14:55         ` Jan Beulich
  2013-08-16 22:53       ` Dario Faggioli
  1 sibling, 1 reply; 25+ messages in thread
From: Li Yechen @ 2013-08-16 14:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel



On Fri, Aug 16, 2013 at 9:21 PM, Jan Beulich <JBeulich@suse.com> wrote:

>
> As said - I'd want you to evaluate a model without such a new node,
> and with instead the requirement placed on the balloon driver to
> balloon out pages evenly allocated across the guest's virtual nodes.

Oh, so you need the experiments' result without this patch?
I see. I'll do it and send the result.

> > You are exactly right again, this design is only for Linux balloon driver.
> > For Linux, balloon can choose which page to balloon in/out. So we can
> > assocate the pages with v-nodeid.
> > For the other kinds of architechure, please forgive me that I haven't think
> > of that far...
>
> The abstract model shouldn't take OS implementation details or
> policies into account; the implementation later of course can (and
> frequently will need to).
>
> Jan
>
So, you mean that the abstract model should consider that the OS might not
be able to allocate pages by virtual node IDs?
That's a question... Let me think about it :-)



-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering  and Computer Science,
Peking University, China

Nothing is impossible because impossible itself  says: " I'm possible "
lccycc From PKU


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16 14:17       ` Li Yechen
@ 2013-08-16 14:55         ` Jan Beulich
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Beulich @ 2013-08-16 14:55 UTC (permalink / raw)
  To: Li Yechen; +Cc: xen-devel

>>> On 16.08.13 at 16:17, Li Yechen <lccycc123@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 9:21 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> As said - I'd want you to evaluate a model without such a new node,
>> and with instead the requirement placed on the balloon driver to
>> balloon out pages evenly allocated across the guest's virtual nodes.
> 
> Oh, so you need the experiments' result without this patch?
> I see. I'll do it and send the result.

What experiment?

>> You are exactly right again, this design is only for Linux balloon driver.
>> > For Linux, balloon can choose which page to balloon in/out. So we can
>> > assocate the pages with v-nodeid.
>> > For the other kinds of architechure, please forgive me that I haven't think
>> > of that far...
>>
>> The abstract model shouldn't take OS implementation details or
>> policies into account; the implementation later of course can (and
>> frequently will need to).
>>
> So, you mean that the abstract model should consider that OS could not
> allocate pages by virtual node IDs?

No. What I said is that associating ballooned out pages with a
particular vNID seems wrong. If the balloon driver gets back a
fresh page during re-population, it shouldn't depend on having
a suitable vacated page control structure available, but instead
should be able to absorb the page in any case. But again, this
all is taking Linux concepts into consideration, which don't belong
in the architectural model (or at most as an example, but your
examples started _after_ you already started dealing with Linux
specifics).

Jan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16 13:21     ` Jan Beulich
  2013-08-16 14:17       ` Li Yechen
@ 2013-08-16 22:53       ` Dario Faggioli
  2013-08-19  9:22         ` Jan Beulich
  1 sibling, 1 reply; 25+ messages in thread
From: Dario Faggioli @ 2013-08-16 22:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel, Li Yechen



On ven, 2013-08-16 at 14:21 +0100, Jan Beulich wrote:
> >>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
> > On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> And finally, coming back what Tim had already pointed out - doing
> >> things the way you propose can cause an imbalance in the
> >> ballooned down guest, penalizing it in favor of not penalizing the
> >> intended consumer of the recovered memory. Therefore I wonder
> >> whether, without any new xenstore node, it wouldn't be better to
> >> simply require conforming balloon drivers to balloon out memory
> >> evenly across the domain's virtual nodes.
> > 
> > I should say sorry here, but I'm not quite understand the "whether" part.
> > the "new xenstore node" just store the requirement from user, so that
> > balloon could read it. It's similar to ~/memory/target. This new node
> > could store either p-nodeid, or v-nodeid, according to the interfaces we
> > talked above is placed inside of xen, or inside of guest OS.
> > Do you have a better way to pass this requirement to balloon, instead of
> > create a new xenstore node? I'd be very happy if you have one, since
> > nor do I like the way I have done(create a new node) already!
> 
> As said - I'd want you to evaluate a model without such a new node,
> and with instead the requirement placed on the balloon driver to
> balloon out pages evenly allocated across the guest's virtual nodes.
> 
Why not support both modes? I mean, Jan, I totally see what you mean,
and I agree, a very important use case is where the user just says
"balloon down/up", in which case reclaiming/populating evenly is the
most sane thing to do (as also said by David --I think it was him rather
than Tim).

However, what about the use case when the user actually wants to make
space on a specific p-node, no matter what it will cost to the existing
guests? I don't have that much "ballooning experience", so I am
genuinely asking here, is that use case completely irrelevant? I
personally think it would be something nice to have, although, again,
probably not as the default behaviour...

What about something like, the default is the even distribution, but if
the user makes it clear he wants a specific p-node (whatever v-node or
v-nodes that will mean for the guest), we also allow that?

For actually doing it, I'm not sure what the best interface would be...
The new xenstore key did not look perfect, but not that bad either. FWIW,
if we stick with it, I agree with you that it should host v-nodes (so
the hypercall doing the translation should happen in the toolstack), and
that it should host a mask.

> > You are exactly right again, this design is only for Linux balloon driver.
> > For Linux, balloon can choose which page to balloon in/out. So we can
> > assocate the pages with v-nodeid.
> > For the other kinds of architechure, please forgive me that I haven't think
> > of that far...
> 
> The abstract model shouldn't take OS implementation details or
> policies into account; the implementation later of course can (and
> frequently will need to).
> 
You are right. Although it is true that this series is specifically for
Linux, Linux specific concepts populate the design document too much,
or at least in the wrong places. :-)

That being said (and perhaps Yechen could make a note about this, so that
if he sends another version, he could reorganize this patch/document a
bit, to achieve a better separation between the generic model
description and the implementation details), if you consider all this
the description of the Linux specific implementation, it would be
interesting to know what you think of such specific implementation. :-)

Thanks a lot for taking a look and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16  9:09 ` Jan Beulich
  2013-08-16 10:18   ` Li Yechen
       [not found]   ` <CAP5+zHQ128UVGsGjxsNdvSOupt42Gue2+1nLVg-KYrb=exqqCw@mail.gmail.com>
@ 2013-08-16 23:30   ` Dario Faggioli
  2013-08-19  9:17     ` Jan Beulich
  2013-08-19 11:05   ` George Dunlap
  3 siblings, 1 reply; 25+ messages in thread
From: Dario Faggioli @ 2013-08-16 23:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian.Campbell, xen-devel, Ian.Jackson, ufimtseva, Yechen Li


[-- Attachment #1.1: Type: text/plain, Size: 3243 bytes --]

On ven, 2013-08-16 at 10:09 +0100, Jan Beulich wrote:
> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> > +The biggest difference between current and NUMA-aware ballooning is that the
> > +latter needs to keep multiple lists of the ballooned pages in an array, with
> > +one element for each virtual node. This way, it is always evident, at any
> > +given time, what ballooned pages belong to what vnode.
> 
> That's wrong afaict: ballooned out pages aren't associated with any
> memory, and hence can't be associated with any vNID. Once they
> get re-populated, which vNID the memory belongs to is an attribute
> of the memory coming in, not the control structure that it's to be
> associated with.
> 
I may be wrong (I'm sorry, I have had very little chance to look at the
ballooning code, and won't be able to do so for a while), but I think
what we want here is the other way around, i.e., having a way to make
sure that the memory that will come in will also end up --in the guest--
within a specific v-node.

I don't know if the only/best way to do this is the array of lists in
Yechen's patches, and I agree (as per the other e-mail) that this is more
an implementation detail than anything else, but I think the point here
is: do we want to support that operational mode (again, perhaps not as
the default mode, even in a virtual NUMA enabled guest)?

> I believe this thinking of yours stems from the fact that in Linux the
> page control structures are associated with nodes by way of the
> physical memory map being split into larger pieces, each coming from
> a particular node. But other OSes don't need to follow this model,
> and what you propose would also exclude extending the spanned
> nodes set if memory gets ballooned in that's not associated with
> any node the domain so far was "knowing" of.
> 
I agree on the first part of this comment... Too much Linux-ism in the
description of what should be a generic model.

The second part (the one about what happens if memory comes from an
"unknown" node), I'm not sure I get what you mean.

Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in v-node
1 (say, N,N+1,N+2..2N-1) are backed by frames on p-node 4, and that is
because, at creation time, either the user or the toolstack decided this
was the way to go.
So, if page 2 was ballooned down, when ballooning it up, we would like
to retain the fact that it is backed by a frame in p-node 2, and we
could ask Xen to try to make that happen. On failure (e.g., no free frames
on p-node 2), we could either fail or have Xen allocate the memory
somewhere else, i.e., not on p-node 2 or p-node 4, and live with it
(i.e., map G's page 2 there), which I think is what you mean by <<node
the domain so far was "knowing" of>>, isn't it?

Or was it something different that you were asking?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16 23:30   ` Dario Faggioli
@ 2013-08-19  9:17     ` Jan Beulich
  2013-08-20 14:05       ` Dario Faggioli
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Beulich @ 2013-08-19  9:17 UTC (permalink / raw)
  To: dario.faggioli; +Cc: Ian.Campbell, xen-devel, Ian.Jackson, ufimtseva, lccycc123

>>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 1:31 AM >>>
>On ven, 2013-08-16 at 10:09 +0100, Jan Beulich wrote:
>> I believe this thinking of yours stems from the fact that in Linux the
>> page control structures are associated with nodes by way of the
>> physical memory map being split into larger pieces, each coming from
>> a particular node. But other OSes don't need to follow this model,
>> and what you propose would also exclude extending the spanned
>> nodes set if memory gets ballooned in that's not associated with
>> any node the domain so far was "knowing" of.
>> 
>I agree on the first part of this comment... Too much Linux-ism in the
>description of what should be a generic model.
>
>The second part (the one about what happens if memory comes from an
>"unknown" node), I'm not sure I get what you mean.
>
>Suppose we have guest G with 2 v-nodes and with pages in v-node 0 (say,
>page 0,1,2..N-1) are backed by frames on p-node 2, while pages in v-node
>1 (say, N,N+1,N+2..2N-1) are backed by frames on p-node 4, and that is
>because, at creation time, either the user or the toolstack decided this
>was the way to go.
>So, if page 2 was ballooned down, when ballooning it up, we would like
>to retain the fact that it is backed by a frame in p-node 2, and we
>could ask Xen to try make that happen. On failure (e.g., no free frames
>on p-node 2), we could either fail or have Xen allocate the memory
>somewhere else, i.e., not on p-node 2 or p-node 4, and live with it
>(i.e., map G's page 2 there), which I think is what you mean with <<node
>the domain so far was "knowing" of>>, isn't it?

Right. Or the guest could choose to create a new node on the fly.

Jan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16 22:53       ` Dario Faggioli
@ 2013-08-19  9:22         ` Jan Beulich
  2013-08-20 14:18           ` Dario Faggioli
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Beulich @ 2013-08-19  9:22 UTC (permalink / raw)
  To: dario.faggioli; +Cc: xen-devel, david.vrabel, lccycc123

>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 12:53 AM >>>
>On ven, 2013-08-16 at 14:21 +0100, Jan Beulich wrote:
>> >>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
>> > On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> And finally, coming back what Tim had already pointed out - doing
>> >> things the way you propose can cause an imbalance in the
>> >> ballooned down guest, penalizing it in favor of not penalizing the
>> >> intended consumer of the recovered memory. Therefore I wonder
>> >> whether, without any new xenstore node, it wouldn't be better to
>> >> simply require conforming balloon drivers to balloon out memory
>> >> evenly across the domain's virtual nodes.
>> > 
>> > I should say sorry here, but I'm not quite understand the "whether" part.
>> > the "new xenstore node" just store the requirement from user, so that
>> > balloon could read it. It's similar to ~/memory/target. This new node
>> > could store either p-nodeid, or v-nodeid, according to the interfaces we
>> > talked above is placed inside of xen, or inside of guest OS.
>> > Do you have a better way to pass this requirement to balloon, instead of
>> > create a new xenstore node? I'd be very happy if you have one, since
>> > nor do I like the way I have done(create a new node) already!
>> 
>> As said - I'd want you to evaluate a model without such a new node,
>> and with instead the requirement placed on the balloon driver to
>> balloon out pages evenly allocated across the guest's virtual nodes.
>> 
>Why not supporting both modes? I mean, Jan, I totally see what you mean,
>and I agree, a very important use case is where the user just says
>"balloon down/up", in which case reclaiming/populating evenly is the
>most sane thing to do (as also said by David --I think he was him rather
>than Tim).
>
>However, what about the use case when the user actually want to make
>space on a specific p-node, no matter what it will cost to the existing
>guests? I don't have that much "ballooning experience", so I am
>genuinely asking here, is that use case completely irrelevant? I
>personally thing it would be something nice to have, although, again,
>probably not as the default behaviour...
>
>What about something like, the default is the even distribution, but if
>the user makes it clear he wants a specific p-node (whatever v-node or
>v-nodes that will mean for the guest), we also allow that?

Yes, supporting both modes is certainly desirable; I merely tried to
point out that the description went too far in the direction of advocating
only the enforce-a-node model (almost like this was the only sensible
one from the NUMA perspective). Penalizing another guest should
clearly not be a default action, but it may validly be a choice by the
administrator.

Jan

Jan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16  9:09 ` Jan Beulich
                     ` (2 preceding siblings ...)
  2013-08-16 23:30   ` Dario Faggioli
@ 2013-08-19 11:05   ` George Dunlap
  2013-08-20 14:31     ` Dario Faggioli
  3 siblings, 1 reply; 25+ messages in thread
From: George Dunlap @ 2013-08-19 11:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yechen Li, Dario Faggioli, Ian Jackson, Ian Campbell, ufimtseva,
	xen-devel

On Fri, Aug 16, 2013 at 10:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
>> +So, in NUMA aware ballooning, ballooning down and up works as follows:
>> +
>> +* target < current usage -- first of all, the ballooning driver uses the
>> +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
>> +  as explained above) to translate _pnid_ (that it reads from xenstore) to
>> +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will
>
> This looks conceptually wrong: The balloon driver should have no
> need to know about pNID-s; it should be the tool stack doing the
> translation prior to writing the xenstore node.

I agree with this -- I would like to point out that to make this work
for ballooning *up*, however, there will need to be a way for the
guest to specify, "please allocate from vnode X", and have Xen
translate the vnode into the appropriate pnode(s).

 -George

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-16  4:13 [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning Yechen Li
  2013-08-16  9:09 ` Jan Beulich
@ 2013-08-19 12:58 ` David Vrabel
  2013-08-19 13:26   ` George Dunlap
                     ` (3 more replies)
  1 sibling, 4 replies; 25+ messages in thread
From: David Vrabel @ 2013-08-19 12:58 UTC (permalink / raw)
  To: Yechen Li
  Cc: ufimtseva, dario.faggioli, Ian.Jackson, xen-devel, Ian.Campbell,
	JBeulich

On 16/08/13 05:13, Yechen Li wrote:
> 
> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> +
> +This service is provided by the hypervisor (and wired, if necessary, all the
> +way up to the proper toolstack layer or guest kernel), since it is only Xen
> +that knows both the virtual and the physical topologies.

The physical NUMA topology must not be exposed to guests that have a
virtual NUMA topology -- only the toolstack and Xen should know the
mapping between the two.

A guest cannot make sensible use of a machine topology as it may be
migrated to a host with a different topology.

> +## Description of the problem ##
> +
> +Let us use an example. Let's assume that guest _G_ has a virtual 2 vnodes,
> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
> +respectively.
> +
> +Now, the user wants to create a new guest, but the system is under high memory
> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
> +the best chances to accommodate all the memory for the new guest, which would
> +be really good for performance, if only he can make space there. _G_ is the
> +only domain eating some memory from pnode, #2 but, as said above, not all of
> +its memory comes from there.

It is not clear to me that this is the optimal decision.  What
tools/information will be available that the user can use to make
sensible decisions here?  e.g., is the current layout available to tools?

Remember that the "user" in this example is most often some automated
process and not a human.

> +So, right now, the user has no way to specify that he wants to balloon down
> +_G_ in such a way that he will get as much as possible free pages from pnode
> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
> +no guarantee on from what pnode the memory will be freed.
> +
> +The same applies to the ballooning up case, when the user, for some specific
> +reasons, wants to be sure that it is the memory of some (other) specific pnode
> +that will be used.

I would like to see some real world examples of cases where this is
sensible.

In general, I'm not keen on adding ABIs or interfaces that don't solve
real world problems, particularly if they're easy to misuse and end up
with something that is very suboptimal.

> +## NUMA-aware ballooning ##
> +
> +The new NUMA-aware ballooning logic works as follows.
> +
> +There is room, in libxl\_set\_memory\_target() for two more parameters, in
> +addition to the new memory target:

The Xenstore interface should be the primary interface being documented.
The libxl interface is secondary and (probably) a consequence of the
xenstore interface.

> +* _pnid_ -- which is the pnode id of which node the user wants to try get some
> +  free memory on
> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
> +  possible to reach the new ballooning target only with memory from pnode
> +  _pnid_, the user is fine with using memory from other pnodes.  
> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
> +  it is false, the new target will (probably) be reached, but it is possible
> +  that some memory is freed on pnodes other than _pnid_.
> +
> +To let the ballooning driver know about these new parameters, a new xenstore
> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
> +operation to occur, the user should write the proper values in both the keys:
> +~/memory/target\_nid and ~/memory/target.

If we decide we do need such control, I think the xenstore interface
should look more like:

memory/target

  as before

memory/target-by-nid/0

  target for virtual node 0

...

memory/target-by-nid/N

  target for virtual node N

I think this better reflects the goal which is an adjusted NUMA layout
for the guest rather than the steps required to reach it (release P
pages from node N).

The balloon driver attempts to reach target, whilst simultaneously trying
to reach the individual node targets.  It should prefer to balloon
up/down on the node that is furthest from its node target.

In cases where target and the sum of target-by-nid/N don't agree (or are
not present) the balloon driver should use target, and balloon up/down
evenly across all NUMA nodes.
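
A minimal sketch of that selection step (illustration only -- the per-node
bookkeeping and the helper name below are made up, not existing balloon
driver code):

/* Hypothetical per-vnode bookkeeping kept by the balloon driver. */
struct vnode_balloon {
	unsigned long current_pages;  /* pages currently held on this vnode */
	unsigned long target_pages;   /* memory/target-by-nid/<n>, in pages */
};

/*
 * Pick the vnode to balloon on next: the one whose current allocation is
 * furthest from its per-node target, in the direction we need to move
 * (shrink when above the global target, grow when below it).
 */
static int pick_vnode(const struct vnode_balloon *vb, int nr_vnodes,
		      int shrink)
{
	long best_dist = 0;
	int i, best = -1;

	for (i = 0; i < nr_vnodes; i++) {
		long dist = (long)vb[i].current_pages -
			    (long)vb[i].target_pages;

		if (!shrink)
			dist = -dist;	/* growing: look for the largest deficit */
		if (dist > best_dist) {
			best_dist = dist;
			best = i;
		}
	}
	return best;	/* -1: every vnode already meets its per-node target */
}

When memory/target and the per-node values disagree (or the latter are
missing), the caller would simply ignore this result and fall back to the
even split described above.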

The libxl interface does not necessarily have to match the xenstore
interface, if that's what the initial tools would prefer.

Finally, a style comment: please avoid the use of single gender-specific
pronouns in documentation/comments (i.e., don't always use
he/his etc.).  I prefer to use a singular "they" but you could consider
"he or she" or using "he" for some examples and "she" in others.

David

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19 12:58 ` David Vrabel
@ 2013-08-19 13:26   ` George Dunlap
  2013-08-20 14:20     ` David Vrabel
  2013-08-20 14:55   ` Dario Faggioli
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: George Dunlap @ 2013-08-19 13:26 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jan Beulich, Yechen Li, Dario Faggioli, Ian Jackson, xen-devel,
	Ian Campbell, ufimtseva

On Mon, Aug 19, 2013 at 1:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.
>
>> +## Description of the problem ##
>> +
>> +Let us use an example. Let's assume that guest _G_ has a virtual 2 vnodes,
>> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
>> +respectively.
>> +
>> +Now, the user wants to create a new guest, but the system is under high memory
>> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
>> +the best chances to accommodate all the memory for the new guest, which would
>> +be really good for performance, if only he can make space there. _G_ is the
>> +only domain eating some memory from pnode, #2 but, as said above, not all of
>> +its memory comes from there.
>
> It is not clear to me that this is the optimal decision.  What
> tools/information will be available that the user can use to make
> sensible decisions here?  e.g., is the current layout available to tools?
>
> Remember that the "user" in this example is most often some automated
> process and not a human.
>
>> +So, right now, the user has no way to specify that he wants to balloon down
>> +_G_ in such a way that he will get as much as possible free pages from pnode
>> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
>> +no guarantee on from what pnode the memory will be freed.
>> +
>> +The same applies to the ballooning up case, when the user, for some specific
>> +reasons, wants to be sure that it is the memory of some (other) specific pnode
>> +that will be used.
>
> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

I think at very least the guest needs to be able to say, "allocate me
a page from vnode X", and have Xen translate that into pnode
internally, so that ballooning down and back up again doesn't destroy
a guest's NUMA memory affinity (e.g., the vnode->pnode memory
mapping).
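
For what it's worth, the guest-side hook for this could be quite small.
Below is a rough pseudo-C sketch written against Xen's public interface
(struct xen_memory_reservation, XENMEM_populate_physmap and XENMEMF_node()
from the public memory.h); the new, hypothetical part is only the
semantics: for a vNUMA guest Xen would interpret the node id as a vnode and
translate it to a pnode internally, so the guest never learns the machine
layout:

/* Pseudo-C sketch, not tested: ask Xen to back one guest pfn with a fresh
 * frame, hinting which *virtual* node it belongs to.  Today the node id in
 * XENMEMF_node() means a physical node; under this idea Xen would treat it
 * as a vnode for guests that have a virtual topology. */
static int balloon_up_page_on_vnode(xen_pfn_t pfn, unsigned int vnode)
{
	struct xen_memory_reservation reservation = {
		.nr_extents   = 1,
		.extent_order = 0,
		.domid        = DOMID_SELF,
		.mem_flags    = XENMEMF_node(vnode),  /* vnode id, per this idea */
	};

	set_xen_guest_handle(reservation.extent_start, &pfn);

	/* Returns the number of extents populated: 1 on success. */
	return HYPERVISOR_memory_op(XENMEM_populate_physmap, &reservation);
}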

[snip]

>
>> +* _pnid_ -- which is the pnode id of which node the user wants to try get some
>> +  free memory on
>> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
>> +  possible to reach the new ballooning target only with memory from pnode
>> +  _pnid_, the user is fine with using memory from other pnodes.
>> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
>> +  it is false, the new target will (probably) be reached, but it is possible
>> +  that some memory is freed on pnodes other than _pnid_.
>> +
>> +To let the ballooning driver know about these new parameters, a new xenstore
>> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
>> +operation to occur, the user should write the proper values in both the keys:
>> +~/memory/target\_nid and ~/memory/target.
>
> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal which is an adjusted NUMA layout
> for the guest rather than the steps required to reach it (release P
> pages from node N).

This seems more sensible than a mask (as Jan suggested); but is it
open to race conditions?

>
> The balloon driver attempts to reach target, whist simultaneously trying
> to reach the individual node targets.  It should prefer to balloon
> up/down on the node that is furthest from its node target.
>
> In cases where target and the sum of target-by-nid/N don't agree (or are
> not present) the balloon driver should use target, and balloon up/down
> evenly across all NUMA nodes.
>
> Thew libxl interface does not necessarily have to match the xenstore
> interface if that's the initial tools would prefer.
>
> Finally a style comment, please avoid the use of a single gender
> specific pronouns in documentation/comments (i.e., don't always use
> he/his etc.).  I prefer to use a singular "they" but you could consider
> "he or she" or using "he" for some examples and "she" in others.

Doing half and half seems a bit strange to me; if we're trying for
gender equity, I'd just go for "she" all the way.  There are enough
"he"s in the wider literature to more than balance it out for many
years to come. :-)

 -George

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19  9:17     ` Jan Beulich
@ 2013-08-20 14:05       ` Dario Faggioli
  2013-08-20 14:24         ` Jan Beulich
  0 siblings, 1 reply; 25+ messages in thread
From: Dario Faggioli @ 2013-08-20 14:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Ian Jackson, Ian Campbell, ufimtseva, lccycc123

[For this message and for all the others I'm sending today, sorry for the webmail :-P]

From: Jan Beulich [jbeulich@suse.com]
>>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 1:31 AM >>>
>>Suppose we have guest G with 2 v-nodes and with pages in v-node 0 (say,
>>page 0,1,2..N-1) are backed by frames on p-node 2, while pages in v-node
>>1 (say, N,N+1,N+2..2N-1) are backed by frames on p-node 4, and that is
>>because, at creation time, either the user or the toolstack decided this
>>was the way to go.
>>So, if page 2 was ballooned down, when ballooning it up, we would like
>>to retain the fact that it is backed by a frame in p-node 2, and we
>>could ask Xen to try make that happen. On failure (e.g., no free frames
>>on p-node 2), we could either fail or have Xen allocate the memory
>>somewhere else, i.e., not on p-node 2 or p-node 4, and live with it
>>(i.e., map G's page 2 there), which I think is what you mean with <<node
>>the domain so far was "knowing" of>>, isn't it?
>
>Right. Or the guest could choose to create a new node on the fly.
>
>Jan
>
Are you talking of a guest creating a new virtual node, i.e., changing its
own (virtual) NUMA topology on the fly? If yes, that could be an option
too, I guess.

It is not something we plan to support in the first implementation of
virtual NUMA for Linux, but since we are talking about making this
spec document more general, yes I think we should not rule out
such a possibility.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19  9:22         ` Jan Beulich
@ 2013-08-20 14:18           ` Dario Faggioli
  0 siblings, 0 replies; 25+ messages in thread
From: Dario Faggioli @ 2013-08-20 14:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel, lccycc123

>From: Jan Beulich [jbeulich@suse.com]
>
>Yes, supporting both modes is certainly desirable; I merely tried to
>point out that the description went too far in the direction of advocating
>only the enforce-a-node model (almost like this was the only sensible
>one from the NUMA perspective). Penalizing another guest should
>clearly not be a default action, but it may validly be a choice by the
>administrator.
>
>Jan
>
Ok, cool to know, and the point you're trying to make is a really valuable
one, and I agree with it. :-)

So, Yechen, in case you're up for another version of this series, here's
what I would recommend you to think about:

* take this design document and reorganize it a bit, in order to
  separate the generic description of the NUMA-aware ballooning
  concept, functioning, interface, etc., and the detail of the Linux
  implementation. I think those details could still live here
  (especially at this stage), but they should be in their own
  separate section(s), as an "example implementation"

* Given the issues/doubts about the new interface, could we
  have a first version of the code _without_ the new xenstore
  key that just does something as follows:
  - the user asks to balloon up/down by N pages
  - if the guest is a virtual-NUMA-enabled guest with Y
    virtual nodes, the ballooning driver gives/takes to/from
    it N/Y pages per node (see the sketch right below).
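
A tiny sketch of that even split (plain arithmetic, illustrative names
only):

/* Split a request of nr_pages evenly across nr_vnodes virtual nodes,
 * handing the remainder out one page at a time to the first few nodes. */
static void split_evenly(unsigned long nr_pages, unsigned int nr_vnodes,
                         unsigned long *per_node)
{
	unsigned long base = nr_pages / nr_vnodes;
	unsigned long rem  = nr_pages % nr_vnodes;
	unsigned int i;

	for (i = 0; i < nr_vnodes; i++)
		per_node[i] = base + (i < rem ? 1 : 0);
}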

Of course, when looking at/implementing the latter point, do
not throw this code here away (I'm talking about the version
you submitted, with the new xenstore key)... Or at least,
I think it is a valuable addition, so if I were you, I wouldn't
throw it away.

I think it would be entirely possible to reach consensus on
the "evenly spreading without new xenstore key" version, as
a first step and upstream it. Afterwards, we will enhance the
interface (if we decide we like it) with the per-node targets.

What do you think?

I'm of course up for helping you with that, but not before a couple
of weeks. :-P

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19 13:26   ` George Dunlap
@ 2013-08-20 14:20     ` David Vrabel
  0 siblings, 0 replies; 25+ messages in thread
From: David Vrabel @ 2013-08-20 14:20 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jan Beulich, Yechen Li, Dario Faggioli, Ian Jackson, xen-devel,
	Ian Campbell, ufimtseva

On 19/08/13 14:26, George Dunlap wrote:
> On Mon, Aug 19, 2013 at 1:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
>> If we decide we do need such control, I think the xenstore interface
>> should look more like:
>>
>> memory/target
>>
>>   as before
>>
>> memory/target-by-nid/0
>>
>>   target for virtual node 0
>>
>> ...
>>
>> memory/target-by-nid/N
>>
>>   target for virtual node N
>>
>> I think this better reflects the goal which is an adjusted NUMA layout
>> for the guest rather than the steps required to reach it (release P
>> pages from node N).
> 
> This seems more sensible than a mask (as Jan suggested); but is it
> open to race conditions?

xenstore transactions could be used to make the reads/writes of the set
of values atomic.
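
For example (a sketch only, from the toolstack side, assuming the usual
libxenstore calls; the function and array names are made up):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

/* Write memory/target and every memory/target-by-nid/<n> for a domain in
 * one transaction, so the guest never observes a half-updated set. */
static bool set_targets(struct xs_handle *xs, const char *dompath,
                        unsigned long target_kb,
                        const unsigned long *node_kb, unsigned int nr_nodes)
{
	char path[256], val[32];
	xs_transaction_t t;
	unsigned int i;

again:
	t = xs_transaction_start(xs);
	if (t == XBT_NULL)
		return false;

	snprintf(path, sizeof(path), "%s/memory/target", dompath);
	snprintf(val, sizeof(val), "%lu", target_kb);
	if (!xs_write(xs, t, path, val, strlen(val)))
		goto abort;

	for (i = 0; i < nr_nodes; i++) {
		snprintf(path, sizeof(path), "%s/memory/target-by-nid/%u",
		         dompath, i);
		snprintf(val, sizeof(val), "%lu", node_kb[i]);
		if (!xs_write(xs, t, path, val, strlen(val)))
			goto abort;
	}

	if (!xs_transaction_end(xs, t, false)) {
		if (errno == EAGAIN)
			goto again;	/* someone else raced with us: retry */
		return false;
	}
	return true;

abort:
	xs_transaction_end(xs, t, true);
	return false;
}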

David

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-20 14:05       ` Dario Faggioli
@ 2013-08-20 14:24         ` Jan Beulich
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Beulich @ 2013-08-20 14:24 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Ian Jackson, Ian Campbell, ufimtseva, lccycc123

>>> On 20.08.13 at 16:05, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> Are you talking of a guest creating a new virtual node, i.e., changing it's
> own (virtual) NUMA topology on the fly? If yes, that could be an option
> too, I guess.

Yes, albeit not necessarily in a way visible to the hypervisor (i.e.
the guest may do this just for its own purposes).

Jan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19 11:05   ` George Dunlap
@ 2013-08-20 14:31     ` Dario Faggioli
  0 siblings, 0 replies; 25+ messages in thread
From: Dario Faggioli @ 2013-08-20 14:31 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: xen-devel, Ian Jackson, Ian Campbell, ufimtseva, Yechen Li


________________________________________
>From: dunlapg@gmail.com [dunlapg@gmail.com] on behalf of George Dunlap [George.Dunlap@eu.citrix.com]
>On Fri, Aug 16, 2013 at 10:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
>>> +So, in NUMA aware ballooning, ballooning down and up works as follows:
>>> +
>>> +* target < current usage -- first of all, the ballooning driver uses the
>>> +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
>>> +  as explained above) to translate _pnid_ (that it reads from xenstore) to
>>> +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will
>>
>> This looks conceptually wrong: The balloon driver should have no
>> need to know about pNID-s; it should be the tool stack doing the
>> translation prior to writing the xenstore node.
>
>I agree with this 
>
Well, if we're talking about the principle of keeping host and guest (real
and virtual) NUMA topologies separate, I not only agree, I am one of its
strongest advocates (as George and Elena can testify! :-P)

What we're talking about here is some sort of translation service to call
every time some kind of hint about the relationship between virtual and
real topologies is needed, in order to perform a specific operation better.

IOW, the guest would not be allowed to store the result of this call
and use it in the future (and expect it to be accurate). Perhaps that
was not stated clearly enough in the description.

I'd be very happy to get rid of this too, but George himself points
out here below why this (or something really similar to this)
is necessary.

>-- I would like to point out that to make this work
>for ballooning *up*, however, there will need to be a way for the
>guest to specify, "please allocate from vnode X", and have Xen
>translate the vnode into the appropriate pnode(s).
>
Exactly. And since we already have all that is needed in place to
tell Xen something like "allocate from this p-node", Yechen went
for this two step approach: 1) translate, 2) ask Xen to allocate.
This did not require any change at the hypervisor level, so we
thought it could be nice. :-)

Of course, if it's considered better to modify Xen to perform the
whole operation, fine, this is for sure something that can be done,
and that does not have much effect on the rest of the design and of
the implementation, right Yechen?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19 12:58 ` David Vrabel
  2013-08-19 13:26   ` George Dunlap
@ 2013-08-20 14:55   ` Dario Faggioli
  2013-08-20 15:15   ` Li Yechen
  2013-08-23 20:53   ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 25+ messages in thread
From: Dario Faggioli @ 2013-08-20 14:55 UTC (permalink / raw)
  To: David Vrabel, Yechen Li
  Cc: JBeulich, Ian Jackson, Ian Campbell, ufimtseva, xen-devel

From: David Vrabel
>On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
>The physical NUMA topology must not be exposed to guests that have a
>virtual NUMA topology -- only the toolstack and Xen should know the
>mapping between the two.
>
>A guest cannot make sensible use of a machine topology as it may be
>migrated to a host with a different topology.
>
See the other e-mail (me replying to George about something like
that being necessary for ballooning up, although, yes, it probably can
happen all in Xen, if that's considered better).

>> +## Description of the problem ##
>> +
>> +Let us use an example. Let's assume that guest _G_ has a virtual 2 vnodes,
>> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
>> +respectively.
>> +
>> +Now, the user wants to create a new guest, but the system is under high memory
>> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
>> +the best chances to accommodate all the memory for the new guest, which would
>> +be really good for performance, if only he can make space there. _G_ is the
>> +only domain eating some memory from pnode, #2 but, as said above, not all of
>> +its memory comes from there.
>
>It is not clear to me that this is the optimal decision.  What
>tools/information will be available that the user can use to make
>sensible decisions here?  e.g., is the current layout available to tools?
>
Well, the whole "free pages from pnode #2" thing is more a tool than a
decision. It's a tool that will become available for better enacting
decisions made at some upper level (i.e., admin, or toolstack). The current
layout of how much memory is occupied on what node by each guest is
definitely something we should have in place (even independently from this
feature/series, I think). It's already available via a Xen debug key, so it's
just a matter of wiring it up (I think). I'll give it a try as soon as I'm
back at work.

>Remember that the "user" in this example is most often some automated
>process and not a human.
>
Exactly. :-D

>I would like to see some real world examples of cases where this is
>sensible.
>
>In general, I'm not keen on adding ABIs or interfaces that don't solve
>real world problems, particularly if they're easy to misuse and end up
>with something that is very suboptimal.
>
I see what you mean, and certainly I don't disagree. It's a bit of a
chicken-&-egg, since I can't find real examples of something that
does not exist, but yes, I think we can investigate a bit more whether
or not something like this would be useful.

The reason I think it is is that we have an automatic initial placement
algorithm for VMs that tries to find the smallest set of nodes to place a
VM on, every time we create one, and I think it would be nice to give
the admin (or some advanced toolstack) all the tools to maximize the
probability of such an algorithm finding a suitable and performance-friendly
solution... Right now the only such tool is "kill or migrate some VM
somewhere else", which is not that much... :-P

>If we decide we do need such control, I think the xenstore interface
>should look more like:
>
>memory/target
>
>  as before
>
>memory/target-by-nid/0
>
>  target for virtual node 0
>
>...
>
>memory/target-by-nid/N
>
>  target for virtual node N
>
>I think this better reflects the goal which is an adjusted NUMA layout
>for the guest rather than the steps required to reach it (release P
>pages from node N).
>
Oh, cool, I really like this. Yechen, what do you think?

>The balloon driver attempts to reach target, whist simultaneously trying
>to reach the individual node targets.  It should prefer to balloon
>up/down on the node that is furthest from its node target.
>
And this is an interesting idea too.

>In cases where target and the sum of target-by-nid/N don't agree (or are
>not present) the balloon driver should use target, and balloon up/down
>evenly across all NUMA nodes.
>
And that would be fine too... As I said in another e-mail, what I propose to
Yechen is to start dealing with this latter case, i.e., get rid of the new controls
(or pretend they're not there) and implement the even distribution of
ballooned pages across virtual NUMA nodes.

After that, we will move to a more advanced interface, if we deem it
worthwhile.

>Finally a style comment, please avoid the use of a single gender
>specific pronouns in documentation/comments (i.e., don't always use
>he/his etc.).  I prefer to use a singular "they" but you could consider
>"he or she" or using "he" for some examples and "she" in others.
>
Good point. Personally, I think I prefer the "they" form, but I'm fine with
both.

Thanks and Regards,
Dario

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19 12:58 ` David Vrabel
  2013-08-19 13:26   ` George Dunlap
  2013-08-20 14:55   ` Dario Faggioli
@ 2013-08-20 15:15   ` Li Yechen
  2013-08-25 21:24     ` Dario Faggioli
  2013-08-23 20:53   ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 25+ messages in thread
From: Li Yechen @ 2013-08-20 15:15 UTC (permalink / raw)
  To: David Vrabel
  Cc: Elena Ufimtseva, Dario Faggioli, Ian.Jackson, xen-devel,
	Ian Campbell, Jan Beulich



Hi David,
On Mon, Aug 19, 2013 at 8:58 PM, David Vrabel <david.vrabel@citrix.com>wrote:

> The physical NUMA topology must not be exposed to guests that have a
>  virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.
>
Most of you have the same opinion that the interface should be in Xen, not
in the guest balloon driver. I agree with it. In the next version I'll think
of how to implement this interface between Xen and the balloon driver.

> It is not clear to me that this is the optimal decision.  What
> tools/information will be available that the user can use to make
> sensible decisions here?  e.g., is the current layout available to tools?
>
> Remember that the "user" in this example is most often some automated
> process and not a human.
>
We'd like to have a libxl tool for the user, or an automated process, to
change the node affinity of a guest. The decision is made by whoever calls
this libxl tool.


> > +So, right now, the user has no way to specify that he wants to balloon
> down
> > +_G_ in such a way that he will get as much as possible free pages from
> pnode
> > +#2, rather than from pnode #0. He can ask _G_ to balloon down, but
> there is
> > +no guarantee on from what pnode the memory will be freed.
> > +
> > +The same applies to the ballooning up case, when the user, for some
> specific
> > +reasons, wants to be sure that it is the memory of some (other)
> specific pnode
> > +that will be used.
>
> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.
>
Dario, could the test examples that you sent to me several months ago be
presented as a real-world example?
The example shows that, after several guests are created and shut down, the
node affinity is a mess.

>
>
> The Xenstore interface should be the primary interface being documented.
>  The libxl interface is secondary and (probably) a consequence of the
> xenstore interface.
>
> > +* _pnid_ -- which is the pnode id of which node the user wants to try
> get some
> > +  free memory on
> > +* _nodeexact_ -- which is a bool specifying whether or not, in case it
> is not
> > +  possible to reach the new ballooning target only with memory from
> pnode
> > +  _pnid_, the user is fine with using memory from other pnodes.
> > +  If _nodeexact_ is true, it is possible that the new target is not
> reached; if
> > +  it is false, the new target will (probably) be reached, but it is
> possible
> > +  that some memory is freed on pnodes other than _pnid_.
> > +
> > +To let the ballooning driver know about these new parameters, a new
> xenstore
> > +key exists in ~/memory/target\_nid. So, for a proper NUMA aware
> ballooning
> > +operation to occur, the user should write the proper values in both the
> keys:
> > +~/memory/target\_nid and ~/memory/target.
>
> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal which is an adjusted NUMA layout
> for the guest rather than the steps required to reach it (release P
> pages from node N).
>
> The balloon driver attempts to reach target, whist simultaneously trying
> to reach the individual node targets.  It should prefer to balloon
> up/down on the node that is furthest from its node target.
>
> In cases where target and the sum of target-by-nid/N don't agree (or are
> not present) the balloon driver should use target, and balloon up/down
> evenly across all NUMA nodes.
>
> Thew libxl interface does not necessarily have to match the xenstore
> interface if that's the initial tools would prefer.
>
> Finally a style comment, please avoid the use of a single gender
> specific pronouns in documentation/comments (i.e., don't always use
> he/his etc.).  I prefer to use a singular "they" but you could consider
> "he or she" or using "he" for some examples and "she" in others.
>
> David
>
Oh, I think this is a better interface!
I'd appreciate this much more than what I have now. However, the code I show
here _does_not_really_work_. It just passes some small tests. And I'm afraid
that there may be some bugs in my code.
Could I set David's interface up as a secondary goal, waiting until this code
is fully tested? I'm really not that confident :)


-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering  and Computer Science,
Peking University, China

Nothing is impossible because impossible itself  says: " I'm possible "
lccycc From PKU


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-19 12:58 ` David Vrabel
                     ` (2 preceding siblings ...)
  2013-08-20 15:15   ` Li Yechen
@ 2013-08-23 20:53   ` Konrad Rzeszutek Wilk
  2013-08-25 21:18     ` Dario Faggioli
  3 siblings, 1 reply; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-08-23 20:53 UTC (permalink / raw)
  To: David Vrabel
  Cc: JBeulich, Yechen Li, dario.faggioli, Ian.Jackson, xen-devel,
	Ian.Campbell, ufimtseva

On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> On 16/08/13 05:13, Yechen Li wrote:
> > 
> > +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> > +
> > +This service is provided by the hypervisor (and wired, if necessary, all the
> > +way up to the proper toolstack layer or guest kernel), since it is only Xen
> > +that knows both the virtual and the physical topologies.
> 
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.

I think exposing any NUMA topology to a guest - regardless of whether it is
based on real NUMA or not - is OK, and actually a pretty neat thing.

Meaning you could tell a PV guest that it is running on a 16 socket NUMA
box while in reality it is running on a single socket box. Or vice-versa.
It can serve as a way to increase performance (or decrease) - and also
do resource capping (This PV guest will only get 1G of real fast
memory and then 7GB of slow memory) and let the OS handle the details
of it (which it does nowadays).

The mapping though - of which PV pages should belong to which fake
PV NUMA node, and how they bind to the real NUMA topology - that part
I am not sure how to solve. More on this later.
> 
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Correct. And that is OK - it just means that the performance can suck
horribly while it is there. Or the guest can be migrated to an even better
NUMA machine where it will perform even better.

That is nothing new, and this is no different whether you have PV NUMA
in a guest or not.

> 
> > +## Description of the problem ##


I think you have to back up a bit with the problem description. That is,
you need to think of:
 - How a PV guest will allocate pages at bootup based on this
 - How it will balloon up/down within those "buckets".

If you are using the guest's NUMA hints, it usually is in the form of
'allocate pages on this node' and the node information is of the type
'pfn X to pfn Y are on this NUMA node'. That does not work very well with
ballooning as it can be scattered across various nodes. But that
is mostly b/c the balloon driver is not even trying to use NUMA
APIs. It could use them and then it would do the best it can and
perhaps balloon round-robin across the NUMA pools. Or 
perhaps a better option would be to use the hotplug memory mechanism
(which is implemented in the balloon driver) and do large swaths of
memory.
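
For instance, just as a sketch (assuming the guest's virtual topology is
already visible to the kernel's NUMA allocator, and ignoring locking), the
balloon-up allocation could round-robin over the online nodes instead of
calling plain alloc_page():

#include <linux/gfp.h>
#include <linux/nodemask.h>

/* Round-robin the balloon's page allocations over the guest's (virtual)
 * NUMA nodes, so ballooning up/down is spread across the pools instead of
 * landing wherever alloc_page() happens to pick. */
static struct page *balloon_alloc_page_round_robin(void)
{
	static int prev_node = -1;	/* no locking: sketch only */
	int nid = next_node(prev_node, node_online_map);

	if (nid == MAX_NUMNODES)
		nid = first_node(node_online_map);
	prev_node = nid;

	return alloc_pages_node(nid,
				GFP_HIGHUSER | __GFP_NOWARN | __GFP_NORETRY,
				0);
}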


But more problematic is the migration. If you migrate a guest
to a host that has a different NUMA topology, what you really really
want is:
	- unplug all of the memory in the guest
	- replug the memory with the new NUMA topology

Obviously this means you need some dynamic NUMA system - and I don't
know of such. The unplug/plug can be done via the balloon driver
and/or the hotplug memory system. But then - the boundaries of the NUMA
pools are set at boot time. And you would want to change them.
Is SRAT/SLIT dynamic? Could it change during runtime?

Then there is the concept of AutoNUMA where you would migrate
pages from one node to another. With a PV guest that would imply
that the hypervisor would poke the guest and say: "ok, time to
alter your P2M table". Which I guess right now is done best
via the balloon driver - so what you really want is a callback
to tell the balloon driver: Hey, balloon down and up this
PFN block on NUMA node X.

Perhaps what could be done is to set up, in the cluster of hosts,
the worst-case NUMA topology and force it on all the guests.
Then, when migrating, the "pools" can be filled/unfilled
depending on which host the guest is on - and whether it can
fill up the NUMA pools properly. For example, it migrates
from a 1-node box to a 16-node box and all the memory
is remote. It will empty out the PV NUMA pool of the "closest"
memory to zero and fill up the PV NUMA pool of the "farthest"
with all memory to balance it out and have some real
sense of the PV to machine host memory.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-23 20:53   ` Konrad Rzeszutek Wilk
@ 2013-08-25 21:18     ` Dario Faggioli
  0 siblings, 0 replies; 25+ messages in thread
From: Dario Faggioli @ 2013-08-25 21:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: JBeulich, Yechen Li, Ian.Jackson, xen-devel, Ian.Campbell,
	David Vrabel, ufimtseva



On ven, 2013-08-23 at 13:53 -0700, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> > On 16/08/13 05:13, Yechen Li wrote:
> > > 
> > > +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> > > +
> > > +This service is provided by the hypervisor (and wired, if necessary, all the
> > > +way up to the proper toolstack layer or guest kernel), since it is only Xen
> > > +that knows both the virtual and the physical topologies.
> > 
> > The physical NUMA topology must not be exposed to guests that have a
> > virtual NUMA topology -- only the toolstack and Xen should know the
> > mapping between the two.
> 
> I think exposing any NUMA topology to guest - irregardless whether it is based
> on real NUMA or not, is OK - and actually a pretty neat thing.
> 
Yes, that is exactly how Elena, who is doing such work for PV guests,
is doing it.

> Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> box while in reality it is running on a single socket box. Or vice-versa.
> It can serve as a way to increase performance (or decrease) - and also
> do resource capping (This PV guest will only get 1G of real fast
> memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).
> 
Yes, exactly... Again. :-)

> The mapping thought - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.
>
That is fine too. Again, Elena is working on both how to build up a
virtual topology and how to somehow map it to the real topology, for the
sake of performance.

However, this series is about NUMA-aware ballooning, which is something
that makes sense _ONLY_ after we have all that virtual NUMA machinery in
place. That being said, I told Yechen that submitting what he already
had as an RFC could have been helpful anyway, i.e., he could get some
comments on the design, the approach, the interface, etc., which is
actually what has happened. :-)

He should have been more clear about the fact that some preliminary work
was missing during the first submission. During the second submission, I
tried to help him make that more clear... If it still did not work, and
generated confusion instead, I am sorry about that.

About the technical part of this comment (guest knowledge about the real
NUMA topology), as I said already, I'm fine with leaving the guest
completely in the dark, if it's fine to provide a suitable interface
between Xen and the guest that will allow ballooning up to work (as
George pointed out in his e-mails).

> > 
> > > +## Description of the problem ##
> 
> 
> I think you have to backup with the problem description. That is you
> need to think of:
>  - How a PV guest will allocate pages at bootup based on this
>
That's not this series' job...

>  - How it will balloon up/down within those "buckets".
> 
That, I'm not sure I got (more below)...

> If you are using the guests NUMA hints it usually is in the form of
> 'allocate pages on this node' and the node information is of type
> 'pfn X to pfn Y are on this NUMA'. That does not work very well with
> ballooning as it can be scattered across various nodes. But that
> is mostly b/c the balloon driver is not even trying to use NUMA
> APIs. 
>
Indeed. The whole point is this. "If it has been somehow established, at
boot time, that pfn X is from virtual NUMA node 2, and that all the
pfn-s from virtual node 2 are allocated --on the host-- on hardware NUMA
node 0, let's, when ballooning pfn X down and then ballooning it back
up, make sure that: 1) in the guest it still belongs to virtual node 2,
and 2) on the host it is still backed by a page on hardware node 0"

Does that make sense?

> It could use it and then it would do the best it can and
> perhaps balloon round-robin across the NUMA pools.
>
Exactly, that is what David suggested and what I also think would be
a nice first step (without any need of adding xenstore keys).

>  Or 
> perhaps a better option would be to use the hotplug memory mechanism
> (which is implemented in the balloon driver) and do large swaths of
> memory.
> 
Mmm... I think it should all be possible without bothering with memory
hotplug, but I may be wrong (I don't really know much about memory
hotplug).

> But more problematic is the migration. If you migrate a guest
> to node that has different NUMA topologies what you really really
> want is:
> 	- unplug all of the memory in the guest
> 	- replug the memory with the new NUMA topology
> 
> Obviously this means you need some dynamic NUMA system - and I don't
> know of such. 
>
We don't plan to support dynamically varying virtual NUMA topologies in
the short term future. :-)

> The unplug/plug can be done via the balloon driver
> and or hotplug memory system. But then - the boundaries of the NUMA
> pools is set a bootup time. And you would want to change them.
> Is SRAT/SLIT dynamic? Could it change during runtime?
> 
I don't know if the real hw tables can actually change, but again,
support for varying the virtual topology is not a priority right now.

> Then there is the concept of AutoNUMA were you would migrate
> pages from one node to another. With a PV guest that would imply
> that the hypervisor would poke the guest and say: "ok, time
> alter your P2M table". 
>
Yes, and that's what I have been working on for a while. It's particularly
tricky for a PV guest and, although very similar in principle, it's
going to be different from AutoNUMA in Linux, since for us, a migration
is way more expensive than for them.

> Which I guess right now is done best
> via the balloon driver - so what you really want is a callback
> to tell the balloon driver: Hey, balloon down and up this
> PFN block with on NUMA node X.
> 
I'm currently doing it via some sort of "lightweight suspend/resume
cycle". I like the idea of trying to exploit the ballooning driver for
that, but that will probably happen in a subsequent step (I want it
working that way, before starting to think on how to improve it! :-P).

Anyway, the above just to say that this is also not this series' job,
and although there surely are contact points, I think things can be
considered (and hence worked on/implemented) pretty independently.

> Perhaps what could be done is to setup in the cluster of hosts
> the worst case NUMA topology and force it on all the guests.
> Then when migrating the "pools" can be filled/unfilled
> depending on which host the guest is - and whether it can
> fill up the NUMA pools properly. For example it migrates
> from a 1 node box to a 16 node box and all the memory
> is remote. It will empty out the PV NUMA box of the "closest"
> memory to zero and fill up the PV NUMA pool of the "farthest"
> with all memory to balance it out and have some real
> sense of the PV to machine host memory.
> 
That is also nice... Perhaps this is meat for some sort of high (for
sure higher than xl/libxl) level management/orchestration layer, isn't
it?

Anyway... I hope I helped clarifying things a bit.

Thanks for having a look and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-20 15:15   ` Li Yechen
@ 2013-08-25 21:24     ` Dario Faggioli
  2013-09-26 14:15       ` Li Yechen
  0 siblings, 1 reply; 25+ messages in thread
From: Dario Faggioli @ 2013-08-25 21:24 UTC (permalink / raw)
  To: Li Yechen
  Cc: Elena Ufimtseva, Ian.Jackson, xen-devel, Ian Campbell,
	David Vrabel, Jan Beulich



On mar, 2013-08-20 at 23:15 +0800, Li Yechen wrote:
> Hi David, 
> On Mon, Aug 19, 2013 at 8:58 PM, David Vrabel
> <david.vrabel@citrix.com> wrote:
>         The physical NUMA topology must not be exposed to guests that
>         have a
>
> Most of you have the same option that the interface should be in Xen,
> not in Guest
> balloon. I'm agree with it. In next version I'll think of how to
> implement this interface
> between xen and balloon.  
>
Perfect. That is not that much different from what you already have, the
only bit that will need some rework is the ballooning up path (see
George's e-mails).

>         In general, I'm not keen on adding ABIs or interfaces that
>         don't solve
>         real world problems, particularly if they're easy to misuse
>         and end up
>         with something that is very suboptimal.
> Dario, could the test examples that you sent to me several month be
> represented
> as a real-word example?
> The example shows that, after several guests create and shut-up, the
> node
> affinity is a mess
>         
They were not real-world examples. As I said before, that is a
chicken-&-egg problem: there are no real-world examples until we
implement the feature! :-P

What I think you're talking about is an old (2010 ?) Xen Summit
presentation from someone working on the same problem before, but then
not finishing it.

I don't have the link handy right now... I'll see if I can find it and
post it here.

> Oh, I think this is a better interface!
> I'd very appreciate this than what I have now. However, the code I
> show here
> _does_not_really_work_. It just pass some small tests. And I'm afraid
> that 
> there may be some bugs in my code.
> Could I set up David's interface as a secondary goal, waiting until
> this code
> is full tested? I'm really not that confident :)
> 
EhEh... All code has bugs. :-)

As I said, I also like this interface more. However, what I think you
should concentrate on (apart from, of course, debugging) is producing a
version of the series which does not use any new xenstore keys/interface
at all, and just balloons up and down taking pages evenly from all the
guest's virtual NUMA nodes.

After that, we can come back to implement a more fine grained control,
probably via this interface David is proposing here.

What do you think?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-08-25 21:24     ` Dario Faggioli
@ 2013-09-26 14:15       ` Li Yechen
  2013-09-26 14:15         ` Li Yechen
  0 siblings, 1 reply; 25+ messages in thread
From: Li Yechen @ 2013-09-26 14:15 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Elena Ufimtseva, Ian.Jackson, xen-devel, Ian Campbell,
	David Vrabel, Jan Beulich



Hi all,
Sorry for being absent for a month.... And congratulations to Elena on
your patches!  I'll read them when I'm free~~

In conclusion, there are two things for the NUMA support bubble:

First,
>If we decide we do need such control, I think the xenstore interface
>should look more like:
>
>memory/target
>
>  as before
>
>memory/target-by-nid/0
>
>  target for virtual node 0
>
>...
>
>memory/target-by-nid/N
>
>  target for virtual node N
>
>I think this better reflects the goal which is an adjusted NUMA layout
>for the guest rather than the steps required to reach it (release P
>pages from node N).

Yes, I think this is a very good idea. However, if target conflicts with
the sum of target-by-nid/xxx, the bubble may be confused..
My idea is something like:
1 The user can see the target tree, for example:
memory/target
memory/target-by-nid/0
memory/target-by-nid/1
memory/target-by-nid/2
2 The user can use the xl tool to set one of them to a specified value
   (see the small sketch after this list).
   For example: the user sets memory/target-by-nid/0 from 100M to 200M;
   then that means: increase both memory/target and memory/target-by-nid/0
   by 100M.
   Another example: the user sets memory/target from 800M to 900M;
   then that means: increase memory/target by 100M, but the balloon driver
   can make the decision on its own.

  Does that make sense?

3 In domU, the balloon driver is notified of which key changed; then it
balloons pages in/out from the corresponding node(s).
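
Here is a tiny sketch of the bookkeeping rule I mean (plain C, illustrative
only; MAX_VNODES and the struct are made up, not an existing interface):

#define MAX_VNODES 8	/* illustrative only */

/* Hypothetical mirror, kept by the tool, of the xenstore "target tree". */
struct target_tree {
	unsigned long target;                     /* memory/target            */
	unsigned long target_by_nid[MAX_VNODES];  /* memory/target-by-nid/<n> */
};

/* Rule 1: setting a per-node target moves the global target by the same
 * delta, so the sum of the per-node values and memory/target stay in sync. */
static void set_node_target(struct target_tree *t, unsigned int nid,
                            unsigned long new_val)
{
	long delta = (long)new_val - (long)t->target_by_nid[nid];

	t->target_by_nid[nid] = new_val;
	t->target = (unsigned long)((long)t->target + delta);
}

/* Rule 2: setting only the global target leaves the per-node values alone;
 * the balloon driver then spreads the difference as it sees fit. */
static void set_global_target(struct target_tree *t, unsigned long new_val)
{
	t->target = new_val;
}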



Second:
>I think exposing any NUMA topology to guest - irregardless whether it is
based
> on real NUMA or not, is OK - and actually a pretty neat thing.
>
>Meaning you could tell a PV guest that it is running on a 16 socket NUMA
>box while in reality it is running on a single socket box. Or vice-versa.
>It can serve as a way to increase performance (or decrease) - and also
>do resource capping (This PV guest will only get 1G of real fast
>memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).
>
> The mapping thought - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.

   Here is the question: should the interface
machine_node_id_to_virtual_node_id (and also the reverse) be implemented
inside the kernel, or in Xen as a hypercall?
   Elena, I haven't had time to look at your great patches, so I have
no idea whether you have implemented it or not....
   If you have, I'd say sorry that we still need to talk about it : - )
   I think implementing them as a hypercall in Xen is very nice, since domU
shouldn't know the hypervisor's NUMA architecture....
             But that also means: I have to change three hypercalls about
memory operations.....

  On the other hand, implementing them in the kernel is pretty neat, but it
goes against the rule of isolation....

  Anyway, I prefer the former. Could you guys give me some ideas?

Thank you again for your suggestions on this topic : )

-- 

Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering  and Computer Science,
Peking University, China

Nothing is impossible because impossible itself  says: " I'm possible "
lccycc From PKU


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
  2013-09-26 14:15       ` Li Yechen
@ 2013-09-26 14:15         ` Li Yechen
  0 siblings, 0 replies; 25+ messages in thread
From: Li Yechen @ 2013-09-26 14:15 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Elena Ufimtseva, Ian.Jackson, xen-devel, Ian Campbell,
	David Vrabel, Jan Beulich



Oh sorry, the "bubble" should be "balloon"


On Thu, Sep 26, 2013 at 10:15 PM, Li Yechen <lccycc123@gmail.com> wrote:

> Hi all,
> Sorry for my being absent for a month.... And congratulate to Elena for
> your patches!  I'll read them if I'm free~~
>
> In conclusion, there are two things for the NUMA support bubble:
>
> First,
> >If we decide we do need such control, I think the xenstore interface
> >should look more like:
> >
> >memory/target
> >
> >  as before
> >
> >memory/target-by-nid/0
> >
> >  target for virtual node 0
> >
> >...
> >
> >memory/target-by-nid/N
> >
> >  target for virtual node N
> >
> >I think this better reflects the goal which is an adjusted NUMA layout
> >for the guest rather than the steps required to reach it (release P
> >pages from node N).
>
> Yes I think this is a very good idea. However, if target is conflict with
> the sum of target-by-nid/xxx, bubble may be confused..
> My idea is something as:
> 1 User can know the target tree, for example:
> memory/target
> memory/target-by-nid/0
> memory/target-by-nid/1
> memory/target-by-nid/2
> 2 User can use xl tool to set one of them to an specified value.
>    For example: user set memory/target-by-nid/0 from 100M to 200M
>    then that means: increase both memory/target and memory/target-by-nid/0
> by 100M
>    Another example: user set memory/target from 800M to 900M
>    then that means: increase memory/target by 100M, but balloon could make
> decision by its own.
>
>   Does that make sencse?
>
> 3 In domU, balloon receive that which directory is changed. then it
> balloon in/out pages from the node(s).
>
>
>
> Second:
>
> >I think exposing any NUMA topology to guest - irregardless whether it is
> based
> > on real NUMA or not, is OK - and actually a pretty neat thing.
> >
> >Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> >box while in reality it is running on a single socket box. Or vice-versa.
> >It can serve as a way to increase performance (or decrease) - and also
> >do resource capping (This PV guest will only get 1G of real fast
> >memory and then 7GB of slow memory) and let the OS handle the details
> > of it (which it does nowadays).
> >
> > The mapping thought - of which PV pages should belong to which fake
> > PV NUMA node - and how they bind to the real NUMA topology - that part
> > I am not sure how to solve. More on this later.
>
>    Here is the question: should the interface:
> machine_node_id_to_virtual_node_id   (and also the reverse)  be inplememted
> inside kernel, or in Xen as a hypercall?
>    Elena, I haven't have time to look at your great patches so that I have
> no idea whether you had implememted it or no....
>    If you had, I'd say sorry that we still need to talk about it : - )
>    I think to implement them as a hypercall in xen is very nice, since
> domU shouldn't know hypervisior's NUMA architecture....
>              But that also means: I have to change three hypercalls about
> memory operation.....
>
>   On the other hand, implement them in Kernel is pretty neat, but it goes
> against the rule of isolation....
>
>   Anyway I prefer the previous one. Could you guys give me some ideas?
>
> Thank you again for your suggestions on this topic : )
>
> --
>
> Yechen Li
>
> Team of System Virtualization and Cloud Computing
> School of Electronic Engineering  and Computer Science,
> Peking University, China
>
> Nothing is impossible because impossible itself  says: " I'm possible "
> lccycc From PKU
>
>


-- 

Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering  and Computer Science,
Peking University, China

Nothing is impossible because impossible itself  says: " I'm possible "
lccycc From PKU


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2013-09-26 14:15 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-16  4:13 [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning Yechen Li
2013-08-16  9:09 ` Jan Beulich
2013-08-16 10:18   ` Li Yechen
     [not found]   ` <CAP5+zHQ128UVGsGjxsNdvSOupt42Gue2+1nLVg-KYrb=exqqCw@mail.gmail.com>
2013-08-16 13:21     ` Jan Beulich
2013-08-16 14:17       ` Li Yechen
2013-08-16 14:55         ` Jan Beulich
2013-08-16 22:53       ` Dario Faggioli
2013-08-19  9:22         ` Jan Beulich
2013-08-20 14:18           ` Dario Faggioli
2013-08-16 23:30   ` Dario Faggioli
2013-08-19  9:17     ` Jan Beulich
2013-08-20 14:05       ` Dario Faggioli
2013-08-20 14:24         ` Jan Beulich
2013-08-19 11:05   ` George Dunlap
2013-08-20 14:31     ` Dario Faggioli
2013-08-19 12:58 ` David Vrabel
2013-08-19 13:26   ` George Dunlap
2013-08-20 14:20     ` David Vrabel
2013-08-20 14:55   ` Dario Faggioli
2013-08-20 15:15   ` Li Yechen
2013-08-25 21:24     ` Dario Faggioli
2013-09-26 14:15       ` Li Yechen
2013-09-26 14:15         ` Li Yechen
2013-08-23 20:53   ` Konrad Rzeszutek Wilk
2013-08-25 21:18     ` Dario Faggioli
