* RFC: vNUMA project
@ 2014-11-11 17:36 Wei Liu
  2014-11-11 18:03 ` David Vrabel
  2014-11-19 11:18 ` George Dunlap
  0 siblings, 2 replies; 13+ messages in thread
From: Wei Liu @ 2014-11-11 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Dario Faggioli, wei.liu2, David Vrabel, Jan Beulich

# What's already implemented?

PV vNUMA support in libxl/xl and Linux kernel.

# What's planned but not yet implemented?

NUMA-aware ballooning, HVM vNUMA

# How is vNUMA used in toolstack and Xen?

At the libxl level, the user (xl or another higher-level toolstack)
can specify the number of vnodes, the size of each vnode, the vnode to
pnode mapping, the vcpu to vnode mapping, and the distances for local
and remote nodes.
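
For illustration, such a specification might look like this in an xl
guest config (the key names and syntax here are only a sketch, not a
committed interface):

```
vnuma = [
    [ "pnode=0", "size=2048", "vcpus=0-1", "vdistances=10,20" ],
    [ "pnode=1", "size=2048", "vcpus=2-3", "vdistances=20,10" ],
]
```

This would describe two 2048MiB vnodes, each mapped to a pnode, with
local distance 10 and remote distance 20.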

Libxl will then generate one or more vmemranges for each vnode. More
than one vmemrange may be needed to accommodate memory holes. One
example is a PV guest with e820_host=1 in its config file and more
than 4G of RAM allocated to it.

The generated information will also be stored in Xen, where it is
used in two scenarios: retrieval by the PV guest, and NUMA-aware
ballooning.

# How is vNUMA used in guest?

When a PV guest boots, it issues a hypercall to retrieve its vNUMA
information: the number of vnodes, the size of each vnode, the vcpu to
vnode mapping, and finally an array of vmemranges. The guest can then
massage these pieces of information for its own use.

An HVM guest will still use ACPI to initialise NUMA; the ACPI tables
are arranged by hvmloader.

# NUMA-aware ballooning

It's agreed that NUMA-aware ballooning should be achieved solely in
the hypervisor. Everything should happen under the hood, without the
guest knowing the vnode to pnode mapping.

As far as I can tell, existing guests (Linux and FreeBSD) use
XENMEM_populate_physmap to balloon up. There's also a hypercall
called XENMEM_increase_reservation, but it's not used by Linux or
FreeBSD.

I can think of two options to implement NUMA-aware ballooning:

1. Modify XENMEM_populate_physmap to take into account vNUMA hint
   when it tries to allocate a page for guest.
2. Introduce a new hypercall dedicated to vNUMA ballooning. Its
   functionality is similar to XENMEM_populate_physmap but it's only
   used in ballooning so that we don't break XENMEM_populate_physmap.

Option #1 requires less modification to the guest, because the guest
won't need to switch to a new hypercall. It's unclear at this point
what Xen should do if a guest asks to populate a gpfn that doesn't
belong to any vnode. Should it be permissive or strict?

If Xen is strict (say, it refuses to populate a gpfn that doesn't
belong to a vnode), it imposes difficulty in implementing HVM vNUMA:
hvmloader may try to populate firmware pages which are in a memory
hole, and a memory hole doesn't belong to any vnode.

With option #2, the question is whether Xen should be permissive or
strict towards a guest that uses vNUMA but doesn't use the new
hypercall to balloon up.

# HVM vNUMA

HVM vNUMA is implemented as follows:

1. Libxl generates vNUMA information and passes it to hvmloader.
2. Hvmloader builds the SRAT table.

Note that hvmloader is capable of relocating memory. This means
toolstack and guest can have different ideas of the memory layout.

This makes NUMA-aware ballooning for HVM guests tricky to implement,
because toolstack to hvmloader communication is one-way, and the
hypervisor shares the toolstack's view of the guest memory layout.
Hvmloader should not be allowed to adjust the memory layout; otherwise
Xen will use the wrong hinting information and the end result will
certainly be wrong.

For basic HVM vNUMA support, we should disallow memory relocation and
discourage ballooning when vNUMA is enabled in an HVM guest. We also
need to disable populate-on-demand, as the PoD pool in Xen is not
NUMA-aware.

We can then gradually lift these limits once we decide what to do
about them.

# Planning

There are many moving parts that don't fit well together. I think a
valid strategy is to impose some limitations on vNUMA and other
features, either through toolstack restrictions or documentation, and
then lift these limitations in stages.

First stage:

           Basic     PoD   Ballooning  Mem_relocation
PV/PVH       Y       na       X         na
HVM          Y       X        X         X

Implement basic functionality of vNUMA. That is, to boot a guest
(PV/HVM) with vNUMA support.

Second stage:

           Basic     PoD   Ballooning  Mem_relocation
PV/PVH       Y       na       Y         na
HVM          Y       X        Y         X

Implement NUMA-aware ballooning.

Third stage:

           Basic     PoD   Ballooning  Mem_relocation
PV/PVH       Y       na       Y         na
HVM          Y       Y        Y         X

NUMA-aware PoD?

Fourth stage:

           Basic     PoD   Ballooning  Mem_relocation
PV/PVH       Y       na       Y         na
HVM          Y       Y        Y         Y

Implement a bi-directional communication mechanism so that we can
allow memory relocation in hvmloader?

The third stage onward is less concrete at this point.

Thoughts?

Wei.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC: vNUMA project
  2014-11-11 17:36 RFC: vNUMA project Wei Liu
@ 2014-11-11 18:03 ` David Vrabel
  2014-11-12  9:35   ` Jan Beulich
  2014-11-12 12:14   ` Wei Liu
  2014-11-19 11:18 ` George Dunlap
  1 sibling, 2 replies; 13+ messages in thread
From: David Vrabel @ 2014-11-11 18:03 UTC (permalink / raw)
  To: Wei Liu, xen-devel; +Cc: Dario Faggioli, David Vrabel, Jan Beulich

On 11/11/14 17:36, Wei Liu wrote:
> # What's already implemented?
> 
> PV vNUMA support in libxl/xl and Linux kernel.

Linux doesn't have vnuma yet, although the last set of patches I saw
looked fine and were waiting for acks from x86 maintainers I think.

> # NUMA-aware ballooning
> 
> It's agreed that NUMA-aware ballooning should be achieved solely in
> hypervisor. Everything should happen under the hood without guest
> knowing vnode to pnode mapping.
> 
> As far as I can tell, existing guests (Linux and FreeBSD) use
> XENMEM_populate_physmap to balloon up. There's a hypercall
> called XENMEM_increase_reservation but it's not used
> by Linux and FreeBSD.
> 
> I can think of two options to implement NUMA-aware ballooning:
> 
> 1. Modify XENMEM_populate_physmap to take into account vNUMA hint
>    when it tries to allocate a page for guest.
[...]
> Option #1 requires less modification to guest, because guest won't
> need to switch to new hypercall. It's unclear at this point if a guest
> asks to populate a gpfn that doesn't belong to any vnode, what Xen
> should do about it. Should it be permissive or strict? 

There are XENMEMF flags to request exact node or not  -- leave it up to
the balloon driver.  The Linux balloon driver could try exact on all
nodes before falling back to permissive or just always try inexact.

Perhaps a XENMEMF_vnode bit to indicate the node is virtual?

> 
> # HVM vNUMA
> 
> HVM vNUMA is implemented as followed:
> 
> 1. Libxl generates vNUMA information and passes it to hvmloader.
> 2. Hvmloader build SRAT table.
> 
> Note that hvmloader is capable of relocating memory. This means
> toolstack and guest can have different ideas of the memory layout.

Why can't hvmloader update the vnuma tables after it has relocated memory?

David


* Re: RFC: vNUMA project
  2014-11-11 18:03 ` David Vrabel
@ 2014-11-12  9:35   ` Jan Beulich
  2014-11-12 13:45     ` Wei Liu
  2014-11-12 12:14   ` Wei Liu
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2014-11-12  9:35 UTC (permalink / raw)
  To: David Vrabel, Wei Liu, xen-devel; +Cc: Dario Faggioli

>>> On 11.11.14 at 19:03, <david.vrabel@citrix.com> wrote:
> On 11/11/14 17:36, Wei Liu wrote:
>> Option #1 requires less modification to guest, because guest won't
>> need to switch to new hypercall. It's unclear at this point if a guest
>> asks to populate a gpfn that doesn't belong to any vnode, what Xen
>> should do about it. Should it be permissive or strict? 
> 
> There are XENMEMF flags to request exact node or not  -- leave it up to
> the balloon driver.  The Linux balloon driver could try exact on all
> nodes before falling back to permissive or just always try inexact.
> 
> Perhaps a XENMEMF_vnode bit to indicate the node is virtual?

Yes. The only bad thing here is that we don't currently check in the
hypervisor that unknown bits are zero, i.e. code using the new flag
will need to have a separate means to find out whether the bit is
supported. Not a big deal of course.

Jan


* Re: RFC: vNUMA project
  2014-11-11 18:03 ` David Vrabel
  2014-11-12  9:35   ` Jan Beulich
@ 2014-11-12 12:14   ` Wei Liu
  2014-11-12 14:23     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 13+ messages in thread
From: Wei Liu @ 2014-11-12 12:14 UTC (permalink / raw)
  To: David Vrabel; +Cc: Dario Faggioli, Wei Liu, Jan Beulich, xen-devel

On Tue, Nov 11, 2014 at 06:03:22PM +0000, David Vrabel wrote:
> On 11/11/14 17:36, Wei Liu wrote:
> > # What's already implemented?
> > 
> > PV vNUMA support in libxl/xl and Linux kernel.
> 
> Linux doesn't have vnuma yet, although the last set of patches I saw
> looked fine and were waiting for acks from x86 maintainers I think.
> 

What I meant was I have those implemented but not yet posted. ;-)

> > # NUMA-aware ballooning
> > 
> > It's agreed that NUMA-aware ballooning should be achieved solely in
> > hypervisor. Everything should happen under the hood without guest
> > knowing vnode to pnode mapping.
> > 
> > As far as I can tell, existing guests (Linux and FreeBSD) use
> > XENMEM_populate_physmap to balloon up. There's a hypercall
> > called XENMEM_increase_reservation but it's not used
> > by Linux and FreeBSD.
> > 
> > I can think of two options to implement NUMA-aware ballooning:
> > 
> > 1. Modify XENMEM_populate_physmap to take into account vNUMA hint
> >    when it tries to allocate a page for guest.
> [...]
> > Option #1 requires less modification to guest, because guest won't
> > need to switch to new hypercall. It's unclear at this point if a guest
> > asks to populate a gpfn that doesn't belong to any vnode, what Xen
> > should do about it. Should it be permissive or strict? 
> 
> There are XENMEMF flags to request exact node or not  -- leave it up to
> the balloon driver.  The Linux balloon driver could try exact on all
> nodes before falling back to permissive or just always try inexact.
> 
> Perhaps a XENMEMF_vnode bit to indicate the node is virtual?
> 

Good idea. It should be easy to make it work.

> > 
> > # HVM vNUMA
> > 
> > HVM vNUMA is implemented as followed:
> > 
> > 1. Libxl generates vNUMA information and passes it to hvmloader.
> > 2. Hvmloader build SRAT table.
> > 
> > Note that hvmloader is capable of relocating memory. This means
> > toolstack and guest can have different ideas of the memory layout.
> 
> Why can't hvmloader update the vnuma tables after it has relocated memory?
> 

Because setvnuma is a domctl which cannot be issued by hvmloader.

Wei.

> David


* Re: RFC: vNUMA project
  2014-11-12  9:35   ` Jan Beulich
@ 2014-11-12 13:45     ` Wei Liu
  2014-11-12 14:13       ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Wei Liu @ 2014-11-12 13:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dario Faggioli, Wei Liu, David Vrabel, xen-devel

On Wed, Nov 12, 2014 at 09:35:01AM +0000, Jan Beulich wrote:
> >>> On 11.11.14 at 19:03, <david.vrabel@citrix.com> wrote:
> > On 11/11/14 17:36, Wei Liu wrote:
> >> Option #1 requires less modification to guest, because guest won't
> >> need to switch to new hypercall. It's unclear at this point if a guest
> >> asks to populate a gpfn that doesn't belong to any vnode, what Xen
> >> should do about it. Should it be permissive or strict? 
> > 
> > There are XENMEMF flags to request exact node or not  -- leave it up to
> > the balloon driver.  The Linux balloon driver could try exact on all
> > nodes before falling back to permissive or just always try inexact.
> > 
> > Perhaps a XENMEMF_vnode bit to indicate the node is virtual?
> 
> Yes. The only bad thing here is that we don't currently check in the
> hypervisor that unknown bits are zero, i.e. code using the new flag
> will need to have a separate means to find out whether the bit is
> supported. Not a big deal of course.
> 

If this new bit is set and the domain has vNUMA, then it's valid
(supported); otherwise it's not.

To avoid breaking existing guests, we can fall back to non-vNUMA
hinted allocation when the new bit is set and vNUMA is not available.

Wei.

> Jan


* Re: RFC: vNUMA project
  2014-11-12 13:45     ` Wei Liu
@ 2014-11-12 14:13       ` Jan Beulich
  2014-11-12 14:27         ` Wei Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2014-11-12 14:13 UTC (permalink / raw)
  To: Wei Liu; +Cc: Dario Faggioli, David Vrabel, xen-devel

>>> On 12.11.14 at 14:45, <wei.liu2@citrix.com> wrote:
> On Wed, Nov 12, 2014 at 09:35:01AM +0000, Jan Beulich wrote:
>> >>> On 11.11.14 at 19:03, <david.vrabel@citrix.com> wrote:
>> > On 11/11/14 17:36, Wei Liu wrote:
>> >> Option #1 requires less modification to guest, because guest won't
>> >> need to switch to new hypercall. It's unclear at this point if a guest
>> >> asks to populate a gpfn that doesn't belong to any vnode, what Xen
>> >> should do about it. Should it be permissive or strict? 
>> > 
>> > There are XENMEMF flags to request exact node or not  -- leave it up to
>> > the balloon driver.  The Linux balloon driver could try exact on all
>> > nodes before falling back to permissive or just always try inexact.
>> > 
>> > Perhaps a XENMEMF_vnode bit to indicate the node is virtual?
>> 
>> Yes. The only bad thing here is that we don't currently check in the
>> hypervisor that unknown bits are zero, i.e. code using the new flag
>> will need to have a separate means to find out whether the bit is
>> supported. Not a big deal of course.
>> 
> 
> If this new bit is set and domain has vnuma, then it's valid
> (supported); otherwise it's not.
> 
> To not break existing guests, we can fall back to non-vnuma hinted
> allocation when the new bit is set and vnuma is not available.

While this is valid, none of this was my point - I was talking about a
new guest running on an older hypervisor.

Jan


* Re: RFC: vNUMA project
  2014-11-12 12:14   ` Wei Liu
@ 2014-11-12 14:23     ` Konrad Rzeszutek Wilk
  2014-11-12 17:10       ` Wei Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-12 14:23 UTC (permalink / raw)
  To: Wei Liu; +Cc: Dario Faggioli, David Vrabel, Jan Beulich, xen-devel

On Wed, Nov 12, 2014 at 12:14:48PM +0000, Wei Liu wrote:
> On Tue, Nov 11, 2014 at 06:03:22PM +0000, David Vrabel wrote:
> > On 11/11/14 17:36, Wei Liu wrote:
> > > # What's already implemented?
> > > 
> > > PV vNUMA support in libxl/xl and Linux kernel.
> > 
> > Linux doesn't have vnuma yet, although the last set of patches I saw
> > looked fine and were waiting for acks from x86 maintainers I think.
> > 
> 
> What I meant was I have those implemented but not yet posted. ;-)

Are you refering to the patches that Elena posted and that
were Acked? Or is this another set that does things differently?


* Re: RFC: vNUMA project
  2014-11-12 14:13       ` Jan Beulich
@ 2014-11-12 14:27         ` Wei Liu
  2014-11-12 14:29           ` David Vrabel
  0 siblings, 1 reply; 13+ messages in thread
From: Wei Liu @ 2014-11-12 14:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dario Faggioli, Wei Liu, David Vrabel, xen-devel

On Wed, Nov 12, 2014 at 02:13:09PM +0000, Jan Beulich wrote:
> >>> On 12.11.14 at 14:45, <wei.liu2@citrix.com> wrote:
> > On Wed, Nov 12, 2014 at 09:35:01AM +0000, Jan Beulich wrote:
> >> >>> On 11.11.14 at 19:03, <david.vrabel@citrix.com> wrote:
> >> > On 11/11/14 17:36, Wei Liu wrote:
> >> >> Option #1 requires less modification to guest, because guest won't
> >> >> need to switch to new hypercall. It's unclear at this point if a guest
> >> >> asks to populate a gpfn that doesn't belong to any vnode, what Xen
> >> >> should do about it. Should it be permissive or strict? 
> >> > 
> >> > There are XENMEMF flags to request exact node or not  -- leave it up to
> >> > the balloon driver.  The Linux balloon driver could try exact on all
> >> > nodes before falling back to permissive or just always try inexact.
> >> > 
> >> > Perhaps a XENMEMF_vnode bit to indicate the node is virtual?
> >> 
> >> Yes. The only bad thing here is that we don't currently check in the
> >> hypervisor that unknown bits are zero, i.e. code using the new flag
> >> will need to have a separate means to find out whether the bit is
> >> supported. Not a big deal of course.
> >> 
> > 
> > If this new bit is set and domain has vnuma, then it's valid
> > (supported); otherwise it's not.
> > 
> > To not break existing guests, we can fall back to non-vnuma hinted
> > allocation when the new bit is set and vnuma is not available.
> 
> While this is valid, none of this was my point - I was talking about a
> new guest running on an older hypervisor.
> 

That would not cause breakage. Even if the guest sets this new bit,
it's ignored by Xen. The guest can still balloon up, though the end
result is sub-optimal.

The question is, do we want to avoid such a sub-optimal result?

Wei.

> Jan


* Re: RFC: vNUMA project
  2014-11-12 14:27         ` Wei Liu
@ 2014-11-12 14:29           ` David Vrabel
  2014-11-12 14:40             ` Wei Liu
  0 siblings, 1 reply; 13+ messages in thread
From: David Vrabel @ 2014-11-12 14:29 UTC (permalink / raw)
  To: Wei Liu, Jan Beulich; +Cc: Dario Faggioli, xen-devel

On 12/11/14 14:27, Wei Liu wrote:
> On Wed, Nov 12, 2014 at 02:13:09PM +0000, Jan Beulich wrote:
>>>>> On 12.11.14 at 14:45, <wei.liu2@citrix.com> wrote:
>>> On Wed, Nov 12, 2014 at 09:35:01AM +0000, Jan Beulich wrote:
>>>>>>> On 11.11.14 at 19:03, <david.vrabel@citrix.com> wrote:
>>>>> On 11/11/14 17:36, Wei Liu wrote:
>>>>>> Option #1 requires less modification to guest, because guest won't
>>>>>> need to switch to new hypercall. It's unclear at this point if a guest
>>>>>> asks to populate a gpfn that doesn't belong to any vnode, what Xen
>>>>>> should do about it. Should it be permissive or strict? 
>>>>>
>>>>> There are XENMEMF flags to request exact node or not  -- leave it up to
>>>>> the balloon driver.  The Linux balloon driver could try exact on all
>>>>> nodes before falling back to permissive or just always try inexact.
>>>>>
>>>>> Perhaps a XENMEMF_vnode bit to indicate the node is virtual?
>>>>
>>>> Yes. The only bad thing here is that we don't currently check in the
>>>> hypervisor that unknown bits are zero, i.e. code using the new flag
>>>> will need to have a separate means to find out whether the bit is
>>>> supported. Not a big deal of course.
>>>>
>>>
>>> If this new bit is set and domain has vnuma, then it's valid
>>> (supported); otherwise it's not.
>>>
>>> To not break existing guests, we can fall back to non-vnuma hinted
>>> allocation when the new bit is set and vnuma is not available.
>>
>> While this is valid, none of this was my point - I was talking about a
>> new guest running on an older hypervisor.
>>
> 
> That would not cause breakage. Even if the guest sets this new bit it's
> ignored by Xen. Guest can still balloon up, though the end result is
> sub-optimal.

No. Because it will get memory allocated from the specified /physical/
node which would be quite wrong.

David


* Re: RFC: vNUMA project
  2014-11-12 14:29           ` David Vrabel
@ 2014-11-12 14:40             ` Wei Liu
  2014-11-12 14:54               ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Wei Liu @ 2014-11-12 14:40 UTC (permalink / raw)
  To: David Vrabel; +Cc: Dario Faggioli, Wei Liu, Jan Beulich, xen-devel

On Wed, Nov 12, 2014 at 02:29:56PM +0000, David Vrabel wrote:
> On 12/11/14 14:27, Wei Liu wrote:
> > On Wed, Nov 12, 2014 at 02:13:09PM +0000, Jan Beulich wrote:
> >>>>> On 12.11.14 at 14:45, <wei.liu2@citrix.com> wrote:
> >>> On Wed, Nov 12, 2014 at 09:35:01AM +0000, Jan Beulich wrote:
> >>>>>>> On 11.11.14 at 19:03, <david.vrabel@citrix.com> wrote:
> >>>>> On 11/11/14 17:36, Wei Liu wrote:
> >>>>>> Option #1 requires less modification to guest, because guest won't
> >>>>>> need to switch to new hypercall. It's unclear at this point if a guest
> >>>>>> asks to populate a gpfn that doesn't belong to any vnode, what Xen
> >>>>>> should do about it. Should it be permissive or strict? 
> >>>>>
> >>>>> There are XENMEMF flags to request exact node or not  -- leave it up to
> >>>>> the balloon driver.  The Linux balloon driver could try exact on all
> >>>>> nodes before falling back to permissive or just always try inexact.
> >>>>>
> >>>>> Perhaps a XENMEMF_vnode bit to indicate the node is virtual?
> >>>>
> >>>> Yes. The only bad thing here is that we don't currently check in the
> >>>> hypervisor that unknown bits are zero, i.e. code using the new flag
> >>>> will need to have a separate means to find out whether the bit is
> >>>> supported. Not a big deal of course.
> >>>>
> >>>
> >>> If this new bit is set and domain has vnuma, then it's valid
> >>> (supported); otherwise it's not.
> >>>
> >>> To not break existing guests, we can fall back to non-vnuma hinted
> >>> allocation when the new bit is set and vnuma is not available.
> >>
> >> While this is valid, none of this was my point - I was talking about a
> >> new guest running on an older hypervisor.
> >>
> > 
> > That would not cause breakage. Even if the guest sets this new bit it's
> > ignored by Xen. Guest can still balloon up, though the end result is
> > sub-optimal.
> 
> No. Because it will get memory allocated from the specified /physical/
> node which would be quite wrong.
> 

Fair enough.

So what's the "usual technique" in Linux to check whether a specific
Xen feature is present?

Jan, is it suitable to use a XENFEAT_* bit for this?

Wei.

> David


* Re: RFC: vNUMA project
  2014-11-12 14:40             ` Wei Liu
@ 2014-11-12 14:54               ` Jan Beulich
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2014-11-12 14:54 UTC (permalink / raw)
  To: Wei Liu; +Cc: Dario Faggioli, David Vrabel, xen-devel

>>> On 12.11.14 at 15:40, <wei.liu2@citrix.com> wrote:
> So what's the "usual technique" in Linux to make sure if a specific
> Xen feature is present?
> 
> Jan, is it suitable to use a XENFEAT_* bit for this?

Yes, that would be the canonical way.

Jan


* Re: RFC: vNUMA project
  2014-11-12 14:23     ` Konrad Rzeszutek Wilk
@ 2014-11-12 17:10       ` Wei Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Wei Liu @ 2014-11-12 17:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dario Faggioli, Wei Liu, David Vrabel, Jan Beulich, xen-devel

On Wed, Nov 12, 2014 at 09:23:58AM -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Nov 12, 2014 at 12:14:48PM +0000, Wei Liu wrote:
> > On Tue, Nov 11, 2014 at 06:03:22PM +0000, David Vrabel wrote:
> > > On 11/11/14 17:36, Wei Liu wrote:
> > > > # What's already implemented?
> > > > 
> > > > PV vNUMA support in libxl/xl and Linux kernel.
> > > 
> > > Linux doesn't have vnuma yet, although the last set of patches I saw
> > > looked fine and were waiting for acks from x86 maintainers I think.
> > > 
> > 
> > What I meant was I have those implemented but not yet posted. ;-)
> 
> Are you refering to the patches that Elena posted and that
> were Acked? Or is this another set that does things differently?

The interface changed at the last minute before being applied to Xen,
so I rewrote both the toolstack and kernel patches. They are still
patches for PV vNUMA but do things differently. Most changes are in
the toolstack part; the kernel patch mainly adapts to the new
assumptions I make.

Wei.


* Re: RFC: vNUMA project
  2014-11-11 17:36 RFC: vNUMA project Wei Liu
  2014-11-11 18:03 ` David Vrabel
@ 2014-11-19 11:18 ` George Dunlap
  1 sibling, 0 replies; 13+ messages in thread
From: George Dunlap @ 2014-11-19 11:18 UTC (permalink / raw)
  To: Wei Liu; +Cc: Dario Faggioli, David Vrabel, Jan Beulich, xen-devel

On Tue, Nov 11, 2014 at 5:36 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> Third stage:
>
>            Basic     PoD   Ballooning  Mem_relocation
> PV/PVH       Y       na       Y         na
> HVM          Y       Y        Y         X
>
> NUMA-aware PoD?

Hmm, that will certainly be interesting. :-)

The point of PoD is to allocate a chunk of memory at guest creation
time and have the VM balloon down to fit that amount of memory.

If we assume that vnodes correspond to some set of pnodes, then the
initial allocation will (ideally) have to come from *some* subset of
those pnodes; but depending on the situation, it may be any
combination.  So for example, a guest with 2 vnodes of 2GiB each
might end up with 1G on each pnode, or 2G on one pnode and none on
the other.

In this case, the only way to get an ideal memory layout is to
communicate back to the balloon driver how much memory to free on each
virtual node.  If the split is 1G / 1G, then the balloon driver will
need to allocate 1G for each vnode.  If the split was 0.5G / 1.5G,
then it would have to allocate 1.5G / 0.5G, &c.

 -George

