* [PATCH 00/19] Virtual NUMA for PV and HVM
@ 2014-11-21 15:06 Wei Liu
  2014-11-21 15:06 ` [PATCH 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
                   ` (20 more replies)
  0 siblings, 21 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu

Hi all

This patch series implements virtual NUMA support for both PV and HVM guests.
That is, the administrator can configure via libxl what virtual NUMA topology
the guest sees.

This is stage 1 (basic vNUMA support) and part of stage 2 (vNUMA-aware
ballooning, hypervisor side) described in my previous email to xen-devel [0].

This series is broken into several parts:

1. xen patches: vNUMA debug output and vNUMA-aware memory hypercall support.
2. libxc/libxl support for PV vNUMA.
3. libxc/libxl support for HVM vNUMA.
4. xl vNUMA configuration documentation and parser.

I think one significant difference from Elena's work is that this patch series
makes use of multiple vmemranges should there be a memory hole, instead of
shrinking RAM. This matches the behaviour of real hardware.
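
To make this concrete, here is a purely illustrative layout (the struct name is
made up; the field layout follows the vmemrange interface used later in this
series, and the addresses mirror the HVM dmesg output below). A vnode that
straddles the PCI hole below 4GiB is simply described by two ranges sharing the
same nid:

/* Illustrative only: vnode 1 spans the hole below 4GiB, so it is
 * described by two vmemranges with the same nid instead of having
 * its RAM shrunk. */
struct vmemrange_example {
    uint64_t start, end;
    uint32_t flags, nid;
};

static const struct vmemrange_example example_layout[] = {
    { .start = 0x0,         .end = 0xbb800000,  .nid = 0 }, /* vnode 0 */
    { .start = 0xbb800000,  .end = 0xf0000000,  .nid = 1 }, /* vnode 1, below hole */
    { .start = 0x100000000, .end = 0x187000000, .nid = 1 }, /* vnode 1, above hole */
};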

The vNUMA auto placement algorithm is missing at the moment and Dario is
working on it.

This series can be found at:
 git://xenbits.xen.org/people/liuw/xen.git wip.vnuma-v1 

With this series, the following configuration can be used to enable virtual
NUMA support; it works for both PV and HVM guests.

memory = 6000                    # total guest memory in MiB
vnuma_memory = [3000, 3000]      # memory of each vnode in MiB
vnuma_vcpu_map = [0, 1]          # vcpu 0 -> vnode 0, vcpu 1 -> vnode 1
vnuma_pnode_map = [0, 0]         # both vnodes backed by pnode 0
vnuma_vdistances = [10, 30]      # optional: local and remote distances
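
For orientation, this is roughly how such a configuration is expected to end up
in the libxl_vnode_info array introduced in patch 05. It is only a sketch of the
intended mapping, not the actual xl parsing code, and it assumes the
vnuma_nodes, distances and vcpus arrays have already been allocated:

/* Sketch of the config -> libxl mapping; not the real xl parser.
 * Assumes b_info->vnuma_nodes[], v->distances[] and v->vcpus have
 * already been allocated by the caller. */
static void example_fill_vnuma(libxl_domain_build_info *b_info)
{
    /* Distance matrix implied by vnuma_vdistances = [10, 30]:
     * 10 on the diagonal (local), 30 elsewhere (remote). */
    static const uint32_t dist[2][2] = { { 10, 30 }, { 30, 10 } };
    int i;

    b_info->num_vnuma_nodes = 2;
    for (i = 0; i < 2; i++) {
        libxl_vnode_info *v = &b_info->vnuma_nodes[i];

        v->mem = 3000;                     /* vnuma_memory[i], in MiB */
        v->pnode = 0;                      /* vnuma_pnode_map[i] */
        v->num_distances = 2;
        memcpy(v->distances, dist[i], sizeof(dist[i]));
        libxl_bitmap_set(&v->vcpus, i);    /* vnuma_vcpu_map: vcpu i -> vnode i */
    }
}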

dmesg output for HVM guest:

[    0.000000] ACPI: SRAT 00000000fc009ff0 000C8 (v01    Xen      HVM 00000000 HVML 00000000)
[    0.000000] ACPI: SLIT 00000000fc00a0c0 00030 (v01    Xen      HVM 00000000 HVML 00000000)
<...snip...>
[    0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.000000] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[    0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0xbb7fffff]
[    0.000000] SRAT: Node 1 PXM 1 [mem 0xbb800000-0xefffffff]
[    0.000000] SRAT: Node 1 PXM 1 [mem 0x100000000-0x186ffffff]
[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] NUMA: Node 1 [mem 0xbb800000-0xefffffff] + [mem 0x100000000-0x1867fffff] -> [mem 0xbb800000-0x1867fffff]
[    0.000000] Initmem setup node 0 [mem 0x00000000-0xbb7fffff]
[    0.000000]   NODE_DATA [mem 0xbb7fc000-0xbb7fffff]
[    0.000000] Initmem setup node 1 [mem 0xbb800000-0x1867fffff]
[    0.000000]   NODE_DATA [mem 0x1867f7000-0x1867fafff]
[    0.000000]  [ffffea0000000000-ffffea00029fffff] PMD -> [ffff8800b8600000-ffff8800baffffff] on node 0
[    0.000000]  [ffffea0002a00000-ffffea00055fffff] PMD -> [ffff880183000000-ffff8801859fffff] on node 1
<...snip...>
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009efff]
[    0.000000]   node   0: [mem 0x00100000-0xbb7fffff]
[    0.000000]   node   1: [mem 0xbb800000-0xefffefff]
[    0.000000]   node   1: [mem 0x100000000-0x1867fffff]

numactl output for HVM guest:

available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 2999 MB
node 0 free: 2546 MB
node 1 cpus: 1
node 1 size: 2991 MB
node 1 free: 2144 MB
node distances:
node   0   1 
  0:  10  30 
  1:  30  10 

dmesg output for PV guest:

[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] NUMA: Node 1 [mem 0xbb800000-0xce68efff] + [mem 0x100000000-0x1a8970fff] -> [mem 0xbb800000-0x1a8970fff]
[    0.000000] NODE_DATA(0) allocated [mem 0xbb7fc000-0xbb7fffff]
[    0.000000] NODE_DATA(1) allocated [mem 0x1a8969000-0x1a896cfff]

numactl output for PV guest:

available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 2944 MB
node 0 free: 2917 MB
node 1 cpus: 1
node 1 size: 2934 MB
node 1 free: 2904 MB
node distances:
node   0   1 
  0:  10  30
  1:  30  10

And the debug key "u" output for an HVM guest on a real NUMA-capable machine:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 262144):
(XEN)     Node 0: 245758
(XEN)     Node 1: 16386
(XEN) Domain 2 (total: 2097226):
(XEN)     Node 0: 1046335
(XEN)     Node 1: 1050891
(XEN)      2 vnodes, 4 vcpus
(XEN)        vnode   0 - pnode 0
(XEN)         3840 MB:  0x0 - 0xf0000000
(XEN)         256 MB:  0x100000000 - 0x110000000
(XEN)         vcpus: 0 1 
(XEN)        vnode   1 - pnode 1
(XEN)         4096 MB:  0x110000000 - 0x210000000
(XEN)         vcpus: 2 3 

Wei.

[0] <20141111173606.GC21312@zion.uk.xensource.com>

Wei Liu (19):
  xen: dump vNUMA information with debug key "u"
  xen: make two memory hypercalls vNUMA-aware
  libxc: allocate memory with vNUMA information for PV guest
  libxl: add emacs local variables in libxl_{x86,arm}.c
  libxl: introduce vNUMA types
  libxl: add vmemrange to libxl__domain_build_state
  libxl: introduce libxl__vnuma_config_check
  libxl: x86: factor out e820_host_sanitize
  libxl: functions to build vmemranges for PV guest
  libxl: build, check and pass vNUMA info to Xen for PV guest
  hvmloader: add new fields for vNUMA information
  hvmloader: construct SRAT
  hvmloader: construct SLIT
  hvmloader: disallow memory relocation when vNUMA is enabled
  libxc: allocate memory with vNUMA information for HVM guest
  libxl: build, check and pass vNUMA info to Xen for HVM guest
  libxl: refactor hvm_build_set_params
  libxl: fill vNUMA information in hvm info
  xl: vNUMA support

 docs/man/xl.cfg.pod.5                   |   32 +++++
 tools/firmware/hvmloader/acpi/acpi2_0.h |   61 +++++++++
 tools/firmware/hvmloader/acpi/build.c   |  104 ++++++++++++++
 tools/firmware/hvmloader/pci.c          |   13 ++
 tools/libxc/include/xc_dom.h            |    5 +
 tools/libxc/include/xenguest.h          |    7 +
 tools/libxc/xc_dom_x86.c                |   72 ++++++++--
 tools/libxc/xc_hvm_build_x86.c          |  224 +++++++++++++++++++-----------
 tools/libxc/xc_private.h                |    2 +
 tools/libxl/Makefile                    |    2 +-
 tools/libxl/libxl_arch.h                |    6 +
 tools/libxl/libxl_arm.c                 |   17 +++
 tools/libxl/libxl_create.c              |    9 ++
 tools/libxl/libxl_dom.c                 |  172 ++++++++++++++++++++---
 tools/libxl/libxl_internal.h            |   18 +++
 tools/libxl/libxl_types.idl             |    9 ++
 tools/libxl/libxl_vnuma.c               |  228 +++++++++++++++++++++++++++++++
 tools/libxl/libxl_x86.c                 |  113 +++++++++++++--
 tools/libxl/xl_cmdimpl.c                |  151 ++++++++++++++++++++
 xen/arch/x86/numa.c                     |   46 ++++++-
 xen/common/memory.c                     |   58 +++++++-
 xen/include/public/features.h           |    3 +
 xen/include/public/hvm/hvm_info_table.h |   19 +++
 xen/include/public/memory.h             |    2 +
 24 files changed, 1247 insertions(+), 126 deletions(-)
 create mode 100644 tools/libxl/libxl_vnuma.c

-- 
1.7.10.4


* [PATCH 01/19] xen: dump vNUMA information with debug key "u"
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 16:39   ` Jan Beulich
  2014-11-21 15:06 ` [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Jan Beulich, Elena Ufimtseva

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
 xen/arch/x86/numa.c |   46 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 628a40a..d27c30f 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -363,10 +363,12 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
     s_time_t now = NOW();
-    int i;
+    int i, j, err, n;
     struct domain *d;
     struct page_info *page;
     unsigned int page_num_node[MAX_NUMNODES];
+    uint64_t mem;
+    struct vnuma_info *vnuma;
 
     printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
            (u32)(now>>32), (u32)now);
@@ -408,6 +410,48 @@ static void dump_numa(unsigned char key)
 
         for_each_online_node ( i )
             printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        if ( !d->vnuma )
+                continue;
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus\n", vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            err = snprintf(keyhandler_scratch, 12, "%u",
+                    vnuma->vnode_to_pnode[i]);
+            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                snprintf(keyhandler_scratch, 3, "???");
+
+            printk("       vnode %3u - pnode %s\n", i, keyhandler_scratch);
+            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
+            {
+                if ( vnuma->vmemrange[j].nid == i )
+                {
+                    mem = vnuma->vmemrange[j].end - vnuma->vmemrange[j].start;
+                    printk("        %"PRIu64" MB: ", mem >> 20);
+                    printk(" 0x%"PRIx64" - 0x%"PRIx64"\n",
+                           vnuma->vmemrange[j].start,
+                           vnuma->vmemrange[j].end);
+                }
+            }
+
+            printk("        vcpus: ");
+            for ( j = 0, n = 0; j < d->max_vcpus; j++ )
+            {
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( (n + 1) % 8 == 0 )
+                        printk("%d\n", j);
+                    else if ( !(n % 8) && n != 0 )
+                        printk("                %d ", j);
+                    else
+                        printk("%d ", j);
+                    n++;
+                }
+            }
+            printk("\n");
+        }
     }
 
     rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4


* [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
  2014-11-21 15:06 ` [PATCH 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 17:03   ` Jan Beulich
  2014-11-21 15:06 ` [PATCH 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Jan Beulich

Make XENMEM_increase_reservation and XENMEM_populate_physmap
vNUMA-aware.

That is, if a guest asks Xen to allocate memory on a specific vnode, Xen can
translate the vnode to a pnode using that guest's vNUMA information.

XENMEMF_vnode is introduced for the guest to mark that the node number is in
fact a virtual node number and should be translated by Xen.

XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
able to translate a virtual node to a physical node.
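
For illustration only (not part of this patch): a guest that has seen
XENFEAT_memory_op_vnode_supported advertised could populate a page on a given
virtual node along the lines below, assuming the usual guest-side
HYPERVISOR_memory_op() wrapper and set_xen_guest_handle() helper; error
handling is omitted and the function name is made up.

/* Guest-side sketch; not part of this patch. */
static int populate_one_page_on_vnode(xen_pfn_t gpfn, unsigned int vnode)
{
    struct xen_memory_reservation res = {
        .nr_extents   = 1,
        .extent_order = 0,
        /* The node field carries a *virtual* node number because of
         * XENMEMF_vnode; Xen translates it to a pnode internally. */
        .mem_flags    = XENMEMF_vnode | XENMEMF_node(vnode),
        .domid        = DOMID_SELF,
    };

    set_xen_guest_handle(res.extent_start, &gpfn);

    /* Returns the number of extents populated, i.e. 1 on success. */
    return HYPERVISOR_memory_op(XENMEM_populate_physmap, &res);
}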

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
 xen/common/memory.c           |   58 ++++++++++++++++++++++++++++++++++++++---
 xen/include/public/features.h |    3 +++
 xen/include/public/memory.h   |    2 ++
 3 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index cc36e39..afdfd04 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,49 @@ out:
     return rc;
 }
 
+static int translate_vnode_to_pnode(struct domain *d,
+                                    struct xen_memory_reservation *r,
+                                    struct memop_args *a)
+{
+    int rc = 0;
+    unsigned int vnode, pnode;
+
+    /* Note: we don't strictly require unsupported bits to be zero, so
+     * the XENMEMF_vnode bit may be set by old guests that don't support
+     * vNUMA.
+     *
+     * To distinguish a spurious vnode request from a real one, check
+     * whether d->vnuma exists.
+     */
+    if ( r->mem_flags & XENMEMF_vnode )
+    {
+        read_lock(&d->vnuma_rwlock);
+        if ( d->vnuma )
+        {
+            vnode = XENMEMF_get_node(r->mem_flags);
+
+            if ( vnode < d->vnuma->nr_vnodes )
+            {
+                pnode = d->vnuma->vnode_to_pnode[vnode];
+
+                a->memflags &=
+                    ~MEMF_node(XENMEMF_get_node(r->mem_flags));
+
+                if ( pnode != NUMA_NO_NODE )
+                {
+                    a->memflags |= MEMF_exact_node;
+                    a->memflags |= MEMF_node(pnode);
+                }
+            }
+            else
+                rc = -EINVAL;
+        }
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    return rc;
+}
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -734,10 +777,6 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             args.memflags = MEMF_bits(address_bits);
         }
 
-        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
-        if ( reservation.mem_flags & XENMEMF_exact_node_request )
-            args.memflags |= MEMF_exact_node;
-
         if ( op == XENMEM_populate_physmap
              && (reservation.mem_flags & XENMEMF_populate_on_demand) )
             args.memflags |= MEMF_populate_on_demand;
@@ -747,6 +786,17 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             return start_extent;
         args.domain = d;
 
+        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
+        if ( reservation.mem_flags & XENMEMF_exact_node_request )
+            args.memflags |= MEMF_exact_node;
+
+        rc = translate_vnode_to_pnode(d, &reservation, &args);
+        if ( rc )
+        {
+            rcu_unlock_domain(d);
+            return rc;
+        }
+
         rc = xsm_memory_adjust_reservation(XSM_TARGET, current->domain, d);
         if ( rc )
         {
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 16d92aa..9f33502 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -99,6 +99,9 @@
 #define XENFEAT_grant_map_identity        12
  */
 
+/* Guest can use XENMEMF_vnode to specify virtual node for memory op. */
+#define XENFEAT_memory_op_vnode_supported 13
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f021958..f2e5d14 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -55,6 +55,8 @@
 /* Flag to request allocation only from the node specified */
 #define XENMEMF_exact_node_request  (1<<17)
 #define XENMEMF_exact_node(n) (XENMEMF_node(n) | XENMEMF_exact_node_request)
+/* Flag to indicate the node specified is virtual node */
+#define XENMEMF_vnode  (1<<18)
 #endif
 
 struct xen_memory_reservation {
-- 
1.7.10.4


* [PATCH 03/19] libxc: allocate memory with vNUMA information for PV guest
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
  2014-11-21 15:06 ` [PATCH 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
  2014-11-21 15:06 ` [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 04/19] libxl: add emacs local variables in libxl_{x86, arm}.c Wei Liu
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/include/xc_dom.h |    5 +++
 tools/libxc/xc_dom_x86.c     |   72 +++++++++++++++++++++++++++++++++++-------
 tools/libxc/xc_private.h     |    2 ++
 3 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index 6ae6a9f..eb8e2ce 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -162,6 +162,11 @@ struct xc_dom_image {
     struct xc_dom_loader *kernel_loader;
     void *private_loader;
 
+    /* vNUMA information */
+    unsigned int *vnode_to_pnode;
+    uint64_t *vnode_size;
+    unsigned int nr_vnodes;
+
     /* kernel loader */
     struct xc_dom_arch *arch_hooks;
     /* allocate up to virt_alloc_end */
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index bf06fe4..3286232 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -759,7 +759,8 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
 int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
-    xen_pfn_t pfn, allocsz, i, j, mfn;
+    xen_pfn_t pfn, allocsz, mfn, total, pfn_base;
+    int i, j;
 
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -811,18 +812,67 @@ int arch_setup_meminit(struct xc_dom_image *dom)
         /* setup initial p2m */
         for ( pfn = 0; pfn < dom->total_pages; pfn++ )
             dom->p2m_host[pfn] = pfn;
-        
+
+        /* setup dummy vNUMA information if it's not provided */
+        if ( dom->nr_vnodes == 0 )
+        {
+            dom->nr_vnodes = 1;
+            dom->vnode_to_pnode = xc_dom_malloc(dom,
+                                                sizeof(*dom->vnode_to_pnode));
+            dom->vnode_to_pnode[0] = XC_VNUMA_NO_NODE;
+            dom->vnode_size = xc_dom_malloc(dom, sizeof(*dom->vnode_size));
+            dom->vnode_size[0] = ((dom->total_pages << PAGE_SHIFT) >> 20);
+        }
+
+        total = 0;
+        for ( i = 0; i < dom->nr_vnodes; i++ )
+            total += ((dom->vnode_size[i] << 20) >> PAGE_SHIFT);
+        if ( total != dom->total_pages )
+        {
+            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                         "%s: number of pages requested by vNUMA (0x%"PRIpfn") mismatches number of pages configured for domain (0x%"PRIpfn")\n",
+                         __FUNCTION__, total, dom->total_pages);
+            return -EINVAL;
+        }
+
         /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+        pfn_base = 0;
+        for ( i = 0; i < dom->nr_vnodes; i++ )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            unsigned int memflags;
+            uint64_t pages;
+
+            memflags = 0;
+            if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
+            {
+                memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[i]);
+                memflags |= XENMEMF_exact_node_request;
+            }
+
+            pages = (dom->vnode_size[i] << 20) >> PAGE_SHIFT;
+
+            for ( j = 0; j < pages; j += allocsz )
+            {
+                allocsz = pages - j;
+                if ( allocsz > 1024*1024 )
+                    allocsz = 1024*1024;
+
+                rc = xc_domain_populate_physmap_exact(dom->xch,
+                         dom->guest_domid, allocsz, 0, memflags,
+                         &dom->p2m_host[pfn_base+j]);
+
+                if ( rc )
+                {
+                    if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
+                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                                     "%s: fail to allocate 0x%"PRIx64" pages for vnode %d on pnode %d out of 0x%"PRIpfn"\n",
+                                     __FUNCTION__, pages, i,
+                                     dom->vnode_to_pnode[i], dom->total_pages);
+                    return rc;
+                }
+            }
+
+            pfn_base += pages;
         }
 
         /* Ensure no unclaimed pages are left unused.
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 45b8644..1809674 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -35,6 +35,8 @@
 
 #include <xen/sys/privcmd.h>
 
+#define XC_VNUMA_NO_NODE (~0U)
+
 #if defined(HAVE_VALGRIND_MEMCHECK_H) && !defined(NDEBUG) && !defined(__MINIOS__)
 /* Compile in Valgrind client requests? */
 #include <valgrind/memcheck.h>
-- 
1.7.10.4


* [PATCH 04/19] libxl: add emacs local variables in libxl_{x86, arm}.c
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (2 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 05/19] libxl: introduce vNUMA types Wei Liu
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_arm.c |    8 ++++++++
 tools/libxl/libxl_x86.c |    8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/tools/libxl/libxl_arm.c b/tools/libxl/libxl_arm.c
index 448ac07..34d21f5 100644
--- a/tools/libxl/libxl_arm.c
+++ b/tools/libxl/libxl_arm.c
@@ -706,3 +706,11 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
 
     return 0;
 }
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 7589060..9ceb373 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -324,3 +324,11 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
 {
     return 0;
 }
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH 05/19] libxl: introduce vNUMA types
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (3 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 04/19] libxl: add emacs local variables in libxl_{x86, arm}.c Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 06/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_types.idl |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index f7fc695..75855fb 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -353,6 +353,13 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
     ])
 
+libxl_vnode_info = Struct("vnode_info", [
+    ("mem", uint64), # memory size of this node, in MiB
+    ("distances", Array(uint32, "num_distances")), # distances from this node to other nodes
+    ("pnode", uint32), # physical node of this node
+    ("vcpus", libxl_bitmap), # vcpus in this node
+    ])
+
 libxl_domain_build_info = Struct("domain_build_info",[
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
@@ -373,6 +380,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("disable_migrate", libxl_defbool),
     ("cpuid",           libxl_cpuid_policy_list),
     ("blkdev_start",    string),
+
+    ("vnuma_nodes", Array(libxl_vnode_info, "num_vnuma_nodes")),
     
     ("device_model_version", libxl_device_model_version),
     ("device_model_stubdomain", libxl_defbool),
-- 
1.7.10.4


* [PATCH 06/19] libxl: add vmemrange to libxl__domain_build_state
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (4 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 05/19] libxl: introduce vNUMA types Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 07/19] libxl: introduce libxl__vnuma_config_check Wei Liu
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Currently we don't export the vmemrange interface to libxl users.
Vmemranges are generated during domain build.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_internal.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 4361421..7ee7482 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -971,6 +971,9 @@ typedef struct {
     libxl__file_reference pv_ramdisk;
     const char * pv_cmdline;
     bool pvh_enabled;
+
+    vmemrange_t *vmemranges;
+    uint32_t num_vmemranges;
 } libxl__domain_build_state;
 
 _hidden int libxl__build_pre(libxl__gc *gc, uint32_t domid,
-- 
1.7.10.4


* [PATCH 07/19] libxl: introduce libxl__vnuma_config_check
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (5 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 06/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 08/19] libxl: x86: factor out e820_host_sanitize Wei Liu
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

This vNUMA function (and future ones) is placed in a new file called
libxl_vnuma.c.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/Makefile         |    2 +-
 tools/libxl/libxl_internal.h |    5 ++
 tools/libxl/libxl_vnuma.c    |  138 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxl/libxl_vnuma.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index df08c8a..9fcdfb1 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -93,7 +93,7 @@ LIBXL_LIBS += -lyajl
 LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
 			libxl_internal.o libxl_utils.o libxl_uuid.o \
-			libxl_json.o libxl_aoutils.o libxl_numa.o \
+			libxl_json.o libxl_aoutils.o libxl_numa.o libxl_vnuma.o \
 			libxl_save_callout.o _libxl_save_msgs_callout.o \
 			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
 LIBXL_OBJS += libxl_genid.o
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 7ee7482..ee76df6 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3392,6 +3392,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
     libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
 }
 
+/* Check if vNUMA config is valid. Returns 0 if valid. */
+int libxl__vnuma_config_check(libxl__gc *gc,
+                              const libxl_domain_build_info *b_info,
+                              const libxl__domain_build_state *state);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
new file mode 100644
index 0000000..f5912e6
--- /dev/null
+++ b/tools/libxl/libxl_vnuma.c
@@ -0,0 +1,138 @@
+/*
+ * Copyright (C) 2014      Citrix Ltd.
+ * Author Wei Liu <wei.liu2@citrix.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+#include "libxl_osdeps.h" /* must come before any other headers */
+#include "libxl_internal.h"
+#include <stdlib.h>
+
+/* Sort vmemranges in ascending order by "start" */
+static int compare_vmemrange(const void *a, const void *b)
+{
+    const vmemrange_t *x = a, *y = b;
+    if (x->start < y->start)
+        return -1;
+    if (x->start > y->start)
+        return 1;
+    return 0;
+}
+
+/* Check if vNUMA configuration is valid:
+ *  1. all pnodes inside vnode_to_pnode array are valid
+ *  2. one vcpu belongs to and only belongs to one vnode
+ *  3. each vmemrange is valid and no two vmemranges overlap
+ */
+int libxl__vnuma_config_check(libxl__gc *gc,
+			      const libxl_domain_build_info *b_info,
+                              const libxl__domain_build_state *state)
+{
+    int i, j, rc = ERROR_INVAL, nr_nodes;
+    libxl_numainfo *ninfo = NULL;
+    uint64_t total_ram = 0;
+    libxl_bitmap cpumap;
+    libxl_vnode_info *p;
+
+    libxl_bitmap_init(&cpumap);
+
+    /* Check pnode specified is valid */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (!ninfo) {
+        LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "libxl_get_numainfo failed");
+        goto out;
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        uint32_t pnode;
+
+        p = &b_info->vnuma_nodes[i];
+        pnode = p->pnode;
+
+        /* The pnode specified is not valid? */
+        if (pnode >= nr_nodes) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Invalid pnode %d specified",
+                       pnode);
+            goto out;
+        }
+
+        total_ram += p->mem;
+    }
+
+    if (total_ram != (b_info->max_memkb >> 10)) {
+        LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                   "Total ram in vNUMA configuration 0x%"PRIx64" while maxmem specified 0x%"PRIx64,
+                   total_ram, (b_info->max_memkb >> 10));
+        goto out;
+    }
+
+    /* Check vcpu mapping */
+    libxl_cpu_bitmap_alloc(CTX, &cpumap, b_info->max_vcpus);
+    libxl_bitmap_set_none(&cpumap);
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        p = &b_info->vnuma_nodes[i];
+        libxl_for_each_set_bit(j, p->vcpus) {
+            if (!libxl_bitmap_test(&cpumap, j))
+                libxl_bitmap_set(&cpumap, j);
+            else {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                           "Try to assign vcpu %d to vnode %d while it's already assigned to other vnode",
+                           j, i);
+                goto out;
+            }
+        }
+    }
+
+    for (i = 0; i < b_info->max_vcpus; i++) {
+        if (!libxl_bitmap_test(&cpumap, i)) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Vcpu %d is not assigned to any vnode", i);
+            goto out;
+        }
+    }
+
+    /* Check vmemranges */
+    qsort(state->vmemranges, state->num_vmemranges, sizeof(vmemrange_t),
+          compare_vmemrange);
+
+    for (i = 0; i < state->num_vmemranges; i++) {
+        if (state->vmemranges[i].end < state->vmemranges[i].start) {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                           "Vmemrange end < start");
+                goto out;
+        }
+    }
+
+    for (i = 0; i < state->num_vmemranges - 1; i++) {
+        if (state->vmemranges[i].end > state->vmemranges[i+1].start) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Vmemranges overlapped, 0x%"PRIx64"-0x%"PRIx64", 0x%"PRIx64"-0x%"PRIx64,
+                       state->vmemranges[i].start, state->vmemranges[i].end,
+                       state->vmemranges[i+1].start, state->vmemranges[i+1].end);
+            goto out;
+        }
+    }
+
+    rc = 0;
+out:
+    if (ninfo) libxl_numainfo_dispose(ninfo);
+    libxl_bitmap_dispose(&cpumap);
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH 08/19] libxl: x86: factor out e820_host_sanitize
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (6 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 07/19] libxl: introduce libxl__vnuma_config_check Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 09/19] libxl: functions to build vmemranges for PV guest Wei Liu
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

This function gets the machine E820 map and sanitizes it according to the PV
guest configuration.

It will be used in a later patch.
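
For reference, the expected call pattern (mirroring how the vmemrange-building
patch later in this series uses it) is roughly:

    struct e820entry map[E820MAX];
    uint32_t nr;
    int rc = e820_host_sanitize(gc, b_info, map, &nr);
    if (rc) return rc;   /* on success map[0..nr-1] holds the sanitized host map */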

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_x86.c |   31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 9ceb373..e959e37 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
     return 0;
 }
 
+static int e820_host_sanitize(libxl__gc *gc,
+                              libxl_domain_build_info *b_info,
+                              struct e820entry map[],
+                              uint32_t *nr)
+{
+    int rc;
+
+    rc = xc_get_machine_memory_map(CTX->xch, map, E820MAX);
+    if (rc < 0) {
+        errno = rc;
+        return ERROR_FAIL;
+    }
+
+    *nr = rc;
+
+    rc = e820_sanitize(CTX, map, nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    return rc;
+}
+
 static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
         libxl_domain_config *d_config)
 {
@@ -223,15 +244,7 @@ static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
     if (!libxl_defbool_val(b_info->u.pv.e820_host))
         return ERROR_INVAL;
 
-    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
-    if (rc < 0) {
-        errno = rc;
-        return ERROR_FAIL;
-    }
-    nr = rc;
-    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
-                       (b_info->max_memkb - b_info->target_memkb) +
-                       b_info->u.pv.slack_memkb);
+    rc = e820_host_sanitize(gc, b_info, map, &nr);
     if (rc)
         return ERROR_FAIL;
 
-- 
1.7.10.4


* [PATCH 09/19] libxl: functions to build vmemranges for PV guest
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (7 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 08/19] libxl: x86: factor out e820_host_sanitize Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 10/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

One vmemrange is generated for each vnode. For those guests that care about
the machine E820 map (that is, PV guests with e820_host enabled), vmemranges
are further split to accommodate memory holes.
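
As a concrete illustration (made-up numbers, not taken from the patch): with
two 2048 MiB vnodes and a host E820 map whose RAM is [0, 3 GiB) plus
[4 GiB, 5 GiB), the e820_host case ends up with three vmemranges, the second
vnode being split around the hole:

  vnode 0:  0x00000000 - 0x80000000    (2048 MiB)
  vnode 1:  0x80000000 - 0xc0000000    (1024 MiB, up to the hole)
  vnode 1: 0x100000000 - 0x140000000   (1024 MiB, above the hole)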

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_arch.h     |    6 ++++
 tools/libxl/libxl_arm.c      |    9 +++++
 tools/libxl/libxl_internal.h |    5 +++
 tools/libxl/libxl_vnuma.c    |   34 +++++++++++++++++++
 tools/libxl/libxl_x86.c      |   74 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 128 insertions(+)

diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
index d3bc136..e249048 100644
--- a/tools/libxl/libxl_arch.h
+++ b/tools/libxl/libxl_arch.h
@@ -27,4 +27,10 @@ int libxl__arch_domain_init_hw_description(libxl__gc *gc,
 int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
                                       libxl_domain_build_info *info,
                                       struct xc_dom_image *dom);
+
+/* build vNUMA vmemrange with arch specific information */
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *b_info,
+                                      libxl__domain_build_state *state);
 #endif
diff --git a/tools/libxl/libxl_arm.c b/tools/libxl/libxl_arm.c
index 34d21f5..1f1bc24 100644
--- a/tools/libxl/libxl_arm.c
+++ b/tools/libxl/libxl_arm.c
@@ -707,6 +707,15 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
     return 0;
 }
 
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *info,
+                                      libxl__domain_build_state *state)
+{
+    /* Don't touch anything. */
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index ee76df6..b1b60cb 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3397,6 +3397,11 @@ int libxl__vnuma_config_check(libxl__gc *gc,
                               const libxl_domain_build_info *b_info,
                               const libxl__domain_build_state *state);
 
+int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
+                                    uint32_t domid,
+                                    libxl_domain_build_info *b_info,
+                                    libxl__domain_build_state *state);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
index f5912e6..1d50606 100644
--- a/tools/libxl/libxl_vnuma.c
+++ b/tools/libxl/libxl_vnuma.c
@@ -14,6 +14,7 @@
  */
 #include "libxl_osdeps.h" /* must come before any other headers */
 #include "libxl_internal.h"
+#include "libxl_arch.h"
 #include <stdlib.h>
 
 /* Sort vmemranges in ascending order with "start" */
@@ -129,6 +130,39 @@ out:
     return rc;
 }
 
+/* Build vmemranges for PV guest */
+int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
+                                    uint32_t domid,
+                                    libxl_domain_build_info *b_info,
+                                    libxl__domain_build_state *state)
+{
+    int i;
+    uint64_t next;
+    vmemrange_t *v = NULL;
+
+    assert(state->vmemranges == NULL);
+
+    /* Generate one vmemrange for each virtual node. */
+    next = 0;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+
+        v = libxl__realloc(gc, v, sizeof(*v) * (i+1));
+
+        v[i].start = next;
+        v[i].end = next + (p->mem << 20); /* mem is in MiB */
+        v[i].flags = 0;
+        v[i].nid = i;
+
+        next = v[i].end;
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = i;
+
+    return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index e959e37..8e7af6a 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -338,6 +338,80 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
     return 0;
 }
 
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *b_info,
+                                      libxl__domain_build_state *state)
+{
+    int i, x, n, rc;
+    uint32_t nr_e820;
+    struct e820entry map[E820MAX];
+    vmemrange_t *v;
+
+    /* Only touch vmemranges if it's PV guest and e820_host is true */
+    if (!(b_info->type == LIBXL_DOMAIN_TYPE_PV &&
+          libxl_defbool_val(b_info->u.pv.e820_host))) {
+        rc = 0;
+        goto out;
+    }
+
+    rc = e820_host_sanitize(gc, b_info, map, &nr_e820);
+    if (rc) goto out;
+
+    /* Ditch old vmemranges and start with host E820 map. Note, memory
+     * was gc allocated.
+     */
+    state->vmemranges = NULL;
+    state->num_vmemranges = 0;
+
+    n = 0; /* E820 counter */
+    x = 0;
+    v = NULL;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+        uint64_t remaining = (p->mem << 20);
+
+        while (remaining > 0) {
+            if (n >= nr_e820) {
+                rc = ERROR_FAIL;
+                goto out;
+            }
+
+            /* Skip non RAM region */
+            if (map[n].type != E820_RAM) {
+                n++;
+                continue;
+            }
+
+            v = libxl__realloc(gc, v, sizeof(vmemrange_t) * (x+1));
+
+            if (map[n].size >= remaining) {
+                v[x].start = map[n].addr;
+                v[x].end = map[n].addr + remaining;
+                map[n].addr += remaining;
+                map[n].size -= remaining;
+                remaining = 0;
+            } else {
+                v[x].start = map[n].addr;
+                v[x].end = map[n].addr + map[n].size;
+                remaining -= map[n].size;
+                n++;
+            }
+
+            v[x].flags = 0;
+            v[x].nid = i;
+            x++;
+        }
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = x;
+
+    rc = 0;
+out:
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
1.7.10.4


* [PATCH 10/19] libxl: build, check and pass vNUMA info to Xen for PV guest
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (8 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 09/19] libxl: functions to build vmemranges for PV guest Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 11/19] hvmloader: add new fields for vNUMA information Wei Liu
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_dom.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 74ea84b..7339bbc 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -512,6 +512,51 @@ retry_transaction:
     return 0;
 }
 
+static int set_vnuma_info(libxl__gc *gc, uint32_t domid,
+                          const libxl_domain_build_info *info,
+                          const libxl__domain_build_state *state)
+{
+    int rc = 0;
+    int i, nr_vdistance;
+    unsigned int *vcpu_to_vnode, *vnode_to_pnode, *vdistance = NULL;
+
+    vcpu_to_vnode = libxl__calloc(gc, info->max_vcpus,
+                                  sizeof(unsigned int));
+    vnode_to_pnode = libxl__calloc(gc, info->num_vnuma_nodes,
+                                   sizeof(unsigned int));
+
+    nr_vdistance = info->num_vnuma_nodes * info->num_vnuma_nodes;
+    vdistance = libxl__calloc(gc, nr_vdistance, sizeof(unsigned int));
+
+    for (i = 0; i < info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *v = &info->vnuma_nodes[i];
+        int bit;
+
+        /* vnode to pnode mapping */
+        vnode_to_pnode[i] = v->pnode;
+
+        /* vcpu to vnode mapping */
+        libxl_for_each_set_bit(bit, v->vcpus)
+            vcpu_to_vnode[bit] = i;
+
+        /* node distances */
+        assert(info->num_vnuma_nodes == v->num_distances);
+        memcpy(vdistance + (i * info->num_vnuma_nodes),
+               v->distances,
+               v->num_distances * sizeof(unsigned int));
+    }
+
+    if ( xc_domain_setvnuma(CTX->xch, domid, info->num_vnuma_nodes,
+                            state->num_vmemranges, info->max_vcpus,
+                            state->vmemranges, vdistance,
+                            vcpu_to_vnode, vnode_to_pnode) < 0 ) {
+        LOGE(ERROR, "xc_domain_setvnuma failed");
+        rc = ERROR_FAIL;
+    }
+
+    return rc;
+}
+
 int libxl__build_pv(libxl__gc *gc, uint32_t domid,
              libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -569,6 +614,32 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     dom->xenstore_domid = state->store_domid;
     dom->claim_enabled = libxl_defbool_val(info->claim_mode);
 
+    if (info->num_vnuma_nodes != 0) {
+        int i;
+
+        ret = libxl__vnuma_build_vmemrange_pv(gc, domid, info, state);
+        if (ret) {
+            LOGE(ERROR, "cannot build vmemranges");
+            goto out;
+        }
+        ret = libxl__vnuma_config_check(gc, info, state);
+        if (ret) goto out;
+
+        ret = set_vnuma_info(gc, domid, info, state);
+        if (ret) goto out;
+
+        dom->nr_vnodes = info->num_vnuma_nodes;
+        dom->vnode_to_pnode = xc_dom_malloc(dom, sizeof(*dom->vnode_to_pnode) *
+                                            dom->nr_vnodes);
+        dom->vnode_size = xc_dom_malloc(dom, sizeof(*dom->vnode_size) *
+                                        dom->nr_vnodes);
+
+        for (i = 0; i < dom->nr_vnodes; i++) {
+            dom->vnode_to_pnode[i] = info->vnuma_nodes[i].pnode;
+            dom->vnode_size[i] = info->vnuma_nodes[i].mem;
+        }
+    }
+
     if ( (ret = xc_dom_boot_xen_init(dom, ctx->xch, domid)) != 0 ) {
         LOGE(ERROR, "xc_dom_boot_xen_init failed");
         goto out;
-- 
1.7.10.4


* [PATCH 11/19] hvmloader: add new fields for vNUMA information
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (9 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 10/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-24  9:58   ` Jan Beulich
  2014-11-21 15:06 ` [PATCH 12/19] hvmloader: construct SRAT Wei Liu
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Jan Beulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
 xen/include/public/hvm/hvm_info_table.h |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/xen/include/public/hvm/hvm_info_table.h b/xen/include/public/hvm/hvm_info_table.h
index 36085fa..9d3f218 100644
--- a/xen/include/public/hvm/hvm_info_table.h
+++ b/xen/include/public/hvm/hvm_info_table.h
@@ -32,6 +32,17 @@
 /* Maximum we can support with current vLAPIC ID mapping. */
 #define HVM_MAX_VCPUS        128
 
+#define HVM_MAX_NODES         16
+#define HVM_MAX_LOCALITIES    (HVM_MAX_NODES * HVM_MAX_NODES)
+
+#define HVM_MAX_VMEMRANGES    64
+struct hvm_info_vmemrange {
+    uint64_t start;
+    uint64_t end;
+    uint32_t flags;
+    uint32_t nid;
+};
+
 struct hvm_info_table {
     char        signature[8]; /* "HVM INFO" */
     uint32_t    length;
@@ -67,6 +78,14 @@ struct hvm_info_table {
 
     /* Bitmap of which CPUs are online at boot time. */
     uint8_t     vcpu_online[(HVM_MAX_VCPUS + 7)/8];
+
+    /* Virtual NUMA information */
+    uint32_t    nr_nodes;
+    uint8_t     vcpu_to_vnode[HVM_MAX_VCPUS];
+    uint32_t    nr_vmemranges;
+    struct hvm_info_vmemrange vmemranges[HVM_MAX_VMEMRANGES];
+    uint64_t    nr_localities;
+    uint8_t     localities[HVM_MAX_LOCALITIES];
 };
 
 #endif /* __XEN_PUBLIC_HVM_HVM_INFO_TABLE_H__ */
-- 
1.7.10.4


* [PATCH 12/19] hvmloader: construct SRAT
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (10 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 11/19] hvmloader: add new fields for vNUMA information Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-24 10:08   ` Jan Beulich
  2014-11-21 15:06 ` [PATCH 13/19] hvmloader: construct SLIT Wei Liu
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Jan Beulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
 tools/firmware/hvmloader/acpi/acpi2_0.h |   53 ++++++++++++++++++++++++
 tools/firmware/hvmloader/acpi/build.c   |   68 +++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/tools/firmware/hvmloader/acpi/acpi2_0.h b/tools/firmware/hvmloader/acpi/acpi2_0.h
index 7b22d80..6169213 100644
--- a/tools/firmware/hvmloader/acpi/acpi2_0.h
+++ b/tools/firmware/hvmloader/acpi/acpi2_0.h
@@ -364,6 +364,57 @@ struct acpi_20_madt_intsrcovr {
 };
 
 /*
+ * System Resource Affinity Table header definition (SRAT)
+ */
+struct acpi_20_srat {
+    struct acpi_header header;
+    uint32_t table_revision;
+    uint32_t reserved2[2];
+};
+
+#define ACPI_SRAT_TABLE_REVISION 1
+
+/*
+ * System Resource Affinity Table structure types.
+ */
+#define ACPI_PROCESSOR_AFFINITY 0x0
+#define ACPI_MEMORY_AFFINITY    0x1
+struct acpi_20_srat_processor {
+    uint8_t type;
+    uint8_t length;
+    uint8_t domain;
+    uint8_t apic_id;
+    uint32_t flags;
+    uint8_t sapic_id;
+    uint8_t domain_hi[3];
+    uint32_t reserved;
+};
+
+/*
+ * Local APIC Affinity Flags.  All other bits are reserved and must be 0.
+ */
+#define ACPI_LOCAL_APIC_AFFIN_ENABLED (1 << 0)
+
+struct acpi_20_srat_memory {
+    uint8_t type;
+    uint8_t length;
+    uint32_t domain;
+    uint16_t reserved;
+    uint64_t base_address;
+    uint64_t mem_length;
+    uint32_t reserved2;
+    uint32_t flags;
+    uint64_t reserved3;
+};
+
+/*
+ * Memory Affinity Flags.  All other bits are reserved and must be 0.
+ */
+#define ACPI_MEM_AFFIN_ENABLED (1 << 0)
+#define ACPI_MEM_AFFIN_HOTPLUGGABLE (1 << 1)
+#define ACPI_MEM_AFFIN_NONVOLATILE (1 << 2)
+
+/*
  * Table Signatures.
  */
 #define ACPI_2_0_RSDP_SIGNATURE ASCII64('R','S','D',' ','P','T','R',' ')
@@ -375,6 +426,7 @@ struct acpi_20_madt_intsrcovr {
 #define ACPI_2_0_TCPA_SIGNATURE ASCII32('T','C','P','A')
 #define ACPI_2_0_HPET_SIGNATURE ASCII32('H','P','E','T')
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
+#define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
 
 /*
  * Table revision numbers.
@@ -388,6 +440,7 @@ struct acpi_20_madt_intsrcovr {
 #define ACPI_2_0_HPET_REVISION 0x01
 #define ACPI_2_0_WAET_REVISION 0x01
 #define ACPI_1_0_FADT_REVISION 0x01
+#define ACPI_2_0_SRAT_REVISION 0x01
 
 #pragma pack ()
 
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 1431296..b90344a 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -203,6 +203,66 @@ static struct acpi_20_waet *construct_waet(void)
     return waet;
 }
 
+static struct acpi_20_srat *construct_srat(void)
+{
+    struct acpi_20_srat *srat;
+    struct acpi_20_srat_processor *processor;
+    struct acpi_20_srat_memory *memory;
+    unsigned int size;
+    void *p;
+    int i;
+    uint64_t mem;
+
+    size = sizeof(*srat) + sizeof(*processor) * hvm_info->nr_vcpus +
+        sizeof(*memory) * hvm_info->nr_vmemranges;
+
+    p = mem_alloc(size, 16);
+    if (!p) return NULL;
+
+    srat = p;
+    memset(srat, 0, sizeof(*srat));
+    srat->header.signature    = ACPI_2_0_SRAT_SIGNATURE;
+    srat->header.revision     = ACPI_2_0_SRAT_REVISION;
+    fixed_strcpy(srat->header.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(srat->header.oem_table_id, ACPI_OEM_TABLE_ID);
+    srat->header.oem_revision = ACPI_OEM_REVISION;
+    srat->header.creator_id   = ACPI_CREATOR_ID;
+    srat->header.creator_revision = ACPI_CREATOR_REVISION;
+    srat->table_revision      = ACPI_SRAT_TABLE_REVISION;
+
+    processor = (struct acpi_20_srat_processor *)(srat + 1);
+    for ( i = 0; i < hvm_info->nr_vcpus; i++ )
+    {
+        memset(processor, 0, sizeof(*processor));
+        processor->type     = ACPI_PROCESSOR_AFFINITY;
+        processor->length   = sizeof(*processor);
+        processor->domain   = hvm_info->vcpu_to_vnode[i];
+        processor->apic_id  = LAPIC_ID(i);
+        processor->flags    = ACPI_LOCAL_APIC_AFFIN_ENABLED;
+        processor->sapic_id = 0;
+        processor++;
+    }
+
+    memory = (struct acpi_20_srat_memory *)processor;
+    for ( i = 0; i < hvm_info->nr_vmemranges; i++ )
+    {
+        mem = hvm_info->vmemranges[i].end - hvm_info->vmemranges[i].start;
+        memset(memory, 0, sizeof(*memory));
+        memory->type          = ACPI_MEMORY_AFFINITY;
+        memory->length        = sizeof(*memory);
+        memory->domain        = hvm_info->vmemranges[i].nid;
+        memory->flags         = ACPI_MEM_AFFIN_ENABLED;
+        memory->base_address  = hvm_info->vmemranges[i].start;
+        memory->mem_length    = mem;
+        memory++;
+    }
+
+    srat->header.length = size;
+    set_checksum(srat, offsetof(struct acpi_header, checksum), size);
+
+    return srat;
+}
+
 static int construct_passthrough_tables(unsigned long *table_ptrs,
                                         int nr_tables)
 {
@@ -257,6 +317,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     struct acpi_20_hpet *hpet;
     struct acpi_20_waet *waet;
     struct acpi_20_tcpa *tcpa;
+    struct acpi_20_srat *srat;
     unsigned char *ssdt;
     static const uint16_t tis_signature[] = {0x0001, 0x0001, 0x0001};
     uint16_t *tis_hdr;
@@ -270,6 +331,13 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
         table_ptrs[nr_tables++] = (unsigned long)madt;
     }
 
+    if ( hvm_info->nr_nodes > 0 )
+    {
+        srat = construct_srat();
+        if (!srat) return -1;
+        table_ptrs[nr_tables++] = (unsigned long)srat;
+    }
+
     /* HPET. */
     if ( hpet_exists(ACPI_HPET_ADDRESS) )
     {
-- 
1.7.10.4


* [PATCH 13/19] hvmloader: construct SLIT
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (11 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 12/19] hvmloader: construct SRAT Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-24 10:11   ` Jan Beulich
  2014-11-21 15:06 ` [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled Wei Liu
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Jan Beulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
 tools/firmware/hvmloader/acpi/acpi2_0.h |    8 +++++++
 tools/firmware/hvmloader/acpi/build.c   |   36 +++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/tools/firmware/hvmloader/acpi/acpi2_0.h b/tools/firmware/hvmloader/acpi/acpi2_0.h
index 6169213..d698095 100644
--- a/tools/firmware/hvmloader/acpi/acpi2_0.h
+++ b/tools/firmware/hvmloader/acpi/acpi2_0.h
@@ -414,6 +414,12 @@ struct acpi_20_srat_memory {
 #define ACPI_MEM_AFFIN_HOTPLUGGABLE (1 << 1)
 #define ACPI_MEM_AFFIN_NONVOLATILE (1 << 2)
 
+struct acpi_20_slit {
+    struct acpi_header header;
+    uint64_t localities;
+    uint8_t entry[0];
+};
+
 /*
  * Table Signatures.
  */
@@ -427,6 +433,7 @@ struct acpi_20_srat_memory {
 #define ACPI_2_0_HPET_SIGNATURE ASCII32('H','P','E','T')
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
 #define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
+#define ACPI_2_0_SLIT_SIGNATURE ASCII32('S','L','I','T')
 
 /*
  * Table revision numbers.
@@ -441,6 +448,7 @@ struct acpi_20_srat_memory {
 #define ACPI_2_0_WAET_REVISION 0x01
 #define ACPI_1_0_FADT_REVISION 0x01
 #define ACPI_2_0_SRAT_REVISION 0x01
+#define ACPI_2_0_SLIT_REVISION 0x01
 
 #pragma pack ()
 
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index b90344a..95fd603 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -263,6 +263,38 @@ static struct acpi_20_srat *construct_srat(void)
     return srat;
 }
 
+static struct acpi_20_slit *construct_slit(void)
+{
+    struct acpi_20_slit *slit;
+    unsigned int num, size;
+    int i;
+
+    num = hvm_info->nr_localities * hvm_info->nr_localities;
+    size = sizeof(*slit) + num * sizeof(uint8_t);
+
+    slit = mem_alloc(size, 16);
+    if (!slit) return NULL;
+
+    memset(slit, 0, size);
+    slit->header.signature    = ACPI_2_0_SLIT_SIGNATURE;
+    slit->header.revision     = ACPI_2_0_SLIT_REVISION;
+    fixed_strcpy(slit->header.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(slit->header.oem_table_id, ACPI_OEM_TABLE_ID);
+    slit->header.oem_revision = ACPI_OEM_REVISION;
+    slit->header.creator_id   = ACPI_CREATOR_ID;
+    slit->header.creator_revision = ACPI_CREATOR_REVISION;
+
+    for ( i = 0; i < num; i++ )
+        slit->entry[i] = hvm_info->localities[i];
+
+    slit->localities = hvm_info->nr_localities;
+
+    slit->header.length = size;
+    set_checksum(slit, offsetof(struct acpi_header, checksum), size);
+
+    return slit;
+}
+
 static int construct_passthrough_tables(unsigned long *table_ptrs,
                                         int nr_tables)
 {
@@ -318,6 +350,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     struct acpi_20_waet *waet;
     struct acpi_20_tcpa *tcpa;
     struct acpi_20_srat *srat;
+    struct acpi_20_slit *slit;
     unsigned char *ssdt;
     static const uint16_t tis_signature[] = {0x0001, 0x0001, 0x0001};
     uint16_t *tis_hdr;
@@ -336,6 +369,9 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
         srat = construct_srat();
         if (!srat) return -1;
         table_ptrs[nr_tables++] = (unsigned long)srat;
+        slit = construct_slit();
+        if (!slit) return -1;
+        table_ptrs[nr_tables++] = (unsigned long)slit;
     }
 
     /* HPET. */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (12 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 13/19] hvmloader: construct SLIT Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 19:56   ` Konrad Rzeszutek Wilk
  2014-11-24 10:15   ` Jan Beulich
  2014-11-21 15:06 ` [PATCH 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Wei Liu, Jan Beulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 tools/firmware/hvmloader/pci.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
index 4e8d803..d7ea740 100644
--- a/tools/firmware/hvmloader/pci.c
+++ b/tools/firmware/hvmloader/pci.c
@@ -88,6 +88,19 @@ void pci_setup(void)
     printf("Relocating guest memory for lowmem MMIO space %s\n",
            allow_memory_relocate?"enabled":"disabled");
 
+    /* Disallow low memory relocation when vNUMA is enabled, because
+     * relocated memory ends up off node. Furthermore, even if we
+     * dynamically expand node coverage in hvmloader, low memory and
+     * high memory may reside in different physical nodes, so blindly
+     * relocating low memory to high memory gives us a sub-optimal
+     * configuration.
+     */
+    if ( hvm_info->nr_nodes != 0 && allow_memory_relocate )
+    {
+        allow_memory_relocate = false;
+        printf("vNUMA enabled, relocating guest memory for lowmem MMIO space disabled\n");
+    }
+
     s = xenstore_read("platform/mmio_hole_size", NULL);
     if ( s )
         mmio_hole_size = strtoll(s, NULL, 0);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 15/19] libxc: allocate memory with vNUMA information for HVM guest
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (13 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 16/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

It then returns the low memory end, high memory end and MMIO start to
the caller.

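As an illustration only (not part of the patch), a caller could pick up the
new out parameters roughly like this; build_and_report() is a hypothetical
helper and assumes the usual <stdio.h>/<inttypes.h> includes:

    static int build_and_report(xc_interface *xch, uint32_t domid,
                                struct xc_hvm_build_args *args)
    {
        /* args is assumed to be already filled in (mem_size, image, ...). */
        int rc = xc_hvm_build(xch, domid, args);

        if ( rc )
            return rc;

        /* The new out parameters describe the resulting memory layout. */
        fprintf(stderr, "lowmem end %#"PRIx64", highmem end %#"PRIx64
                ", mmio start %#"PRIx64"\n",
                args->lowmem_end, args->highmem_end, args->mmio_start);
        return 0;
    }
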
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/include/xenguest.h |    7 ++
 tools/libxc/xc_hvm_build_x86.c |  224 ++++++++++++++++++++++++++--------------
 2 files changed, 151 insertions(+), 80 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 40bbac8..dca0375 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -230,6 +230,13 @@ struct xc_hvm_build_args {
     struct xc_hvm_firmware_module smbios_module;
     /* Whether to use claim hypercall (1 - enable, 0 - disable). */
     int claim_enabled;
+    unsigned int nr_vnodes;
+    unsigned int *vnode_to_pnode;
+    uint64_t *vnode_size;
+    /* Out parameters */
+    uint64_t lowmem_end;
+    uint64_t highmem_end;
+    uint64_t mmio_start;
 };
 
 /**
diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
index c81a25b..54d3dc8 100644
--- a/tools/libxc/xc_hvm_build_x86.c
+++ b/tools/libxc/xc_hvm_build_x86.c
@@ -89,7 +89,8 @@ static int modules_init(struct xc_hvm_build_args *args,
 }
 
 static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
-                           uint64_t mmio_start, uint64_t mmio_size)
+                           uint64_t mmio_start, uint64_t mmio_size,
+                           struct xc_hvm_build_args *args)
 {
     struct hvm_info_table *hvm_info = (struct hvm_info_table *)
         (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
@@ -119,6 +120,10 @@ static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
     hvm_info->high_mem_pgend = highmem_end >> PAGE_SHIFT;
     hvm_info->reserved_mem_pgstart = ioreq_server_pfn(0);
 
+    args->lowmem_end = lowmem_end;
+    args->highmem_end = highmem_end;
+    args->mmio_start = mmio_start;
+
     /* Finish with the checksum. */
     for ( i = 0, sum = 0; i < hvm_info->length; i++ )
         sum += ((uint8_t *)hvm_info)[i];
@@ -244,7 +249,7 @@ static int setup_guest(xc_interface *xch,
                        char *image, unsigned long image_size)
 {
     xen_pfn_t *page_array = NULL;
-    unsigned long i, nr_pages = args->mem_size >> PAGE_SHIFT;
+    unsigned long i, j, nr_pages = args->mem_size >> PAGE_SHIFT;
     unsigned long target_pages = args->mem_target >> PAGE_SHIFT;
     uint64_t mmio_start = (1ull << 32) - args->mmio_size;
     uint64_t mmio_size = args->mmio_size;
@@ -258,13 +263,13 @@ static int setup_guest(xc_interface *xch,
     xen_capabilities_info_t caps;
     unsigned long stat_normal_pages = 0, stat_2mb_pages = 0, 
         stat_1gb_pages = 0;
-    int pod_mode = 0;
+    unsigned int memflags = 0;
     int claim_enabled = args->claim_enabled;
     xen_pfn_t special_array[NR_SPECIAL_PAGES];
     xen_pfn_t ioreq_server_array[NR_IOREQ_SERVER_PAGES];
-
-    if ( nr_pages > target_pages )
-        pod_mode = XENMEMF_populate_on_demand;
+    uint64_t dummy_vnode_size;
+    unsigned int dummy_vnode_to_pnode;
+    uint64_t total;
 
     memset(&elf, 0, sizeof(elf));
     if ( elf_init(&elf, image, image_size) != 0 )
@@ -276,6 +281,37 @@ static int setup_guest(xc_interface *xch,
     v_start = 0;
     v_end = args->mem_size;
 
+    if ( nr_pages > target_pages )
+        memflags |= XENMEMF_populate_on_demand;
+
+    if ( args->nr_vnodes == 0 )
+    {
+        /* Build dummy vnode information */
+        args->nr_vnodes = 1;
+        dummy_vnode_to_pnode = XC_VNUMA_NO_NODE;
+        dummy_vnode_size = args->mem_size >> 20;
+        args->vnode_size = &dummy_vnode_size;
+        args->vnode_to_pnode = &dummy_vnode_to_pnode;
+    }
+    else
+    {
+        if ( nr_pages > target_pages )
+        {
+            PERROR("Cannot enable vNUMA and PoD at the same time");
+            goto error_out;
+        }
+    }
+
+    total = 0;
+    for ( i = 0; i < args->nr_vnodes; i++ )
+        total += (args->vnode_size[i] << 20);
+    if ( total != args->mem_size )
+    {
+        PERROR("Memory size requested by vNUMA (0x%"PRIx64") mismatches memory size configured for domain (0x%"PRIx64")",
+               total, args->mem_size);
+        goto error_out;
+    }
+
     if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
     {
         PERROR("Could not get Xen capabilities");
@@ -320,7 +356,7 @@ static int setup_guest(xc_interface *xch,
         }
     }
 
-    if ( pod_mode )
+    if ( memflags & XENMEMF_populate_on_demand )
     {
         /*
          * Subtract VGA_HOLE_SIZE from target_pages for the VGA
@@ -349,103 +385,128 @@ static int setup_guest(xc_interface *xch,
      * ensure that we can be preempted and hence dom0 remains responsive.
      */
     rc = xc_domain_populate_physmap_exact(
-        xch, dom, 0xa0, 0, pod_mode, &page_array[0x00]);
+        xch, dom, 0xa0, 0, memflags, &page_array[0x00]);
     cur_pages = 0xc0;
     stat_normal_pages = 0xc0;
 
-    while ( (rc == 0) && (nr_pages > cur_pages) )
+    for ( i = 0; i < args->nr_vnodes; i++ )
     {
-        /* Clip count to maximum 1GB extent. */
-        unsigned long count = nr_pages - cur_pages;
-        unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
-
-        if ( count > max_pages )
-            count = max_pages;
-
-        cur_pfn = page_array[cur_pages];
+        unsigned int new_memflags = memflags;
+        uint64_t pages, finished;
 
-        /* Take care the corner cases of super page tails */
-        if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
-             (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
-            count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
-        else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
-                  (count > SUPERPAGE_1GB_NR_PFNS) )
-            count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
-
-        /* Attemp to allocate 1GB super page. Because in each pass we only
-         * allocate at most 1GB, we don't have to clip super page boundaries.
-         */
-        if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
-             /* Check if there exists MMIO hole in the 1GB memory range */
-             !check_mmio_hole(cur_pfn << PAGE_SHIFT,
-                              SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
-                              mmio_start, mmio_size) )
+        if ( args->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
         {
-            long done;
-            unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
-            xen_pfn_t sp_extents[nr_extents];
-
-            for ( i = 0; i < nr_extents; i++ )
-                sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
-
-            done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
-                                              pod_mode, sp_extents);
-
-            if ( done > 0 )
-            {
-                stat_1gb_pages += done;
-                done <<= SUPERPAGE_1GB_SHIFT;
-                cur_pages += done;
-                count -= done;
-            }
+            new_memflags |= XENMEMF_exact_node(args->vnode_to_pnode[i]);
+            new_memflags |= XENMEMF_exact_node_request;
         }
 
-        if ( count != 0 )
+        pages = (args->vnode_size[i] << 20) >> PAGE_SHIFT;
+        /* Treat the VGA hole as belonging to node 0 */
+        if ( i == 0 )
+            finished = 0xc0;
+        else
+            finished = 0;
+
+        while ( (rc == 0) && (pages > finished) )
         {
-            /* Clip count to maximum 8MB extent. */
-            max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
+            /* Clip count to maximum 1GB extent. */
+            unsigned long count = pages - finished;
+            unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
+
             if ( count > max_pages )
                 count = max_pages;
-            
-            /* Clip partial superpage extents to superpage boundaries. */
-            if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
-                 (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
-                count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
-            else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
-                      (count > SUPERPAGE_2MB_NR_PFNS) )
-                count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
-
-            /* Attempt to allocate superpage extents. */
-            if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
+
+            cur_pfn = page_array[cur_pages];
+
+            /* Take care the corner cases of super page tails */
+            if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
+                 (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
+                count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
+            else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
+                      (count > SUPERPAGE_1GB_NR_PFNS) )
+                count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
+
+            /* Attemp to allocate 1GB super page. Because in each pass we only
+             * allocate at most 1GB, we don't have to clip super page boundaries.
+             */
+            if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
+                 /* Check if there exists MMIO hole in the 1GB memory range */
+                 !check_mmio_hole(cur_pfn << PAGE_SHIFT,
+                                  SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
+                                  mmio_start, mmio_size) )
             {
                 long done;
-                unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
+                unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
                 xen_pfn_t sp_extents[nr_extents];
 
-                for ( i = 0; i < nr_extents; i++ )
-                    sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_2MB_SHIFT)];
+                for ( j = 0; j < nr_extents; j++ )
+                    sp_extents[j] = page_array[cur_pages+(j<<SUPERPAGE_1GB_SHIFT)];
 
-                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
-                                                  pod_mode, sp_extents);
+                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
+                                                  new_memflags, sp_extents);
 
                 if ( done > 0 )
                 {
-                    stat_2mb_pages += done;
-                    done <<= SUPERPAGE_2MB_SHIFT;
+                    stat_1gb_pages += done;
+                    done <<= SUPERPAGE_1GB_SHIFT;
                     cur_pages += done;
+                    finished += done;
                     count -= done;
                 }
             }
-        }
 
-        /* Fall back to 4kB extents. */
-        if ( count != 0 )
-        {
-            rc = xc_domain_populate_physmap_exact(
-                xch, dom, count, 0, pod_mode, &page_array[cur_pages]);
-            cur_pages += count;
-            stat_normal_pages += count;
+            if ( count != 0 )
+            {
+                /* Clip count to maximum 8MB extent. */
+                max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
+                if ( count > max_pages )
+                    count = max_pages;
+
+                /* Clip partial superpage extents to superpage boundaries. */
+                if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
+                     (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
+                    count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
+                else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
+                          (count > SUPERPAGE_2MB_NR_PFNS) )
+                    count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
+
+                /* Attempt to allocate superpage extents. */
+                if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
+                {
+                    long done;
+                    unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
+                    xen_pfn_t sp_extents[nr_extents];
+
+                    for ( j = 0; j < nr_extents; j++ )
+                        sp_extents[j] = page_array[cur_pages+(j<<SUPERPAGE_2MB_SHIFT)];
+
+                    done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
+                                                      new_memflags, sp_extents);
+
+                    if ( done > 0 )
+                    {
+                        stat_2mb_pages += done;
+                        done <<= SUPERPAGE_2MB_SHIFT;
+                        cur_pages += done;
+                        finished += done;
+                        count -= done;
+                    }
+                }
+            }
+
+            /* Fall back to 4kB extents. */
+            if ( count != 0 )
+            {
+                rc = xc_domain_populate_physmap_exact(
+                    xch, dom, count, 0, new_memflags, &page_array[cur_pages]);
+                cur_pages += count;
+                finished += count;
+                stat_normal_pages += count;
+            }
         }
+
+        if ( rc != 0 )
+            break;
     }
 
     if ( rc != 0 )
@@ -469,7 +530,7 @@ static int setup_guest(xc_interface *xch,
               xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
               HVM_INFO_PFN)) == NULL )
         goto error_out;
-    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
+    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size, args);
     munmap(hvm_info_page, PAGE_SIZE);
 
     /* Allocate and clear special pages. */
@@ -608,6 +669,9 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
             args.acpi_module.guest_addr_out;
         hvm_args->smbios_module.guest_addr_out = 
             args.smbios_module.guest_addr_out;
+        hvm_args->lowmem_end = args.lowmem_end;
+        hvm_args->highmem_end = args.highmem_end;
+        hvm_args->mmio_start = args.mmio_start;
     }
 
     free(image);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 16/19] libxl: build, check and pass vNUMA info to Xen for HVM guest
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (14 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-21 15:06 ` [PATCH 17/19] libxl: refactor hvm_build_set_params Wei Liu
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Libxc has more involvement in building vmemranges in the HVM case. The
building of vmemranges is placed after xc_hvm_build returns, because it
relies on memory hole information provided by xc_hvm_build.

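For illustration only (hypothetical numbers, not taken from this series):
with two 2048 MB vnodes and an MMIO hole at [0xe0000000, 0x100000000), the
resulting vmemranges would be

    vnode 0: [0x00000000, 0x80000000)                                1 vmemrange
    vnode 1: [0x80000000, 0xe0000000) + [0x100000000, 0x120000000)   2 vmemranges

i.e. node 1's memory is described by two ranges, one on each side of the hole.
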
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_create.c   |    9 +++++++
 tools/libxl/libxl_dom.c      |   28 +++++++++++++++++++++
 tools/libxl/libxl_internal.h |    5 ++++
 tools/libxl/libxl_vnuma.c    |   56 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index b1ff5ae..1d96a5f 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -843,6 +843,15 @@ static void initiate_domain_create(libxl__egc *egc,
         goto error_out;
     }
 
+    /* Disallow PoD and vNUMA to be enabled at the same time because PoD
+     * pool is not vNUMA-aware yet.
+     */
+    if (pod_enabled && d_config->b_info.num_vnuma_nodes) {
+        ret = ERROR_INVAL;
+        LOG(ERROR, "Cannot enable PoD and vNUMA at the same time");
+        goto error_out;
+    }
+
     ret = libxl__domain_create_info_setdefault(gc, &d_config->c_info);
     if (ret) goto error_out;
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 7339bbc..3fe3092 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -884,12 +884,40 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         goto out;
     }
 
+    if (info->num_vnuma_nodes != 0) {
+        int i;
+
+        args.nr_vnodes = info->num_vnuma_nodes;
+        args.vnode_to_pnode = libxl__malloc(gc, sizeof(*args.vnode_to_pnode) *
+                                            args.nr_vnodes);
+        args.vnode_size = libxl__malloc(gc, sizeof(*args.vnode_size) *
+                                        args.nr_vnodes);
+        for (i = 0; i < args.nr_vnodes; i++) {
+            args.vnode_to_pnode[i] = info->vnuma_nodes[i].pnode;
+            args.vnode_size[i] = info->vnuma_nodes[i].mem;
+        }
+
+        /* Treat video RAM as belonging to node 0 */
+        args.vnode_size[0] -= (info->video_memkb >> 10);
+    }
+
     ret = xc_hvm_build(ctx->xch, domid, &args);
     if (ret) {
         LOGEV(ERROR, ret, "hvm building failed");
         goto out;
     }
 
+    if (info->num_vnuma_nodes != 0) {
+        ret = libxl__vnuma_build_vmemrange_hvm(gc, domid, info, state, &args);
+        if (ret) {
+            LOGEV(ERROR, ret, "hvm build vmemranges failed");
+            goto out;
+        }
+        ret = libxl__vnuma_config_check(gc, info, state);
+        if (ret) goto out;
+        ret = set_vnuma_info(gc, domid, info, state);
+        if (ret) goto out;
+    }
     ret = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
                                &state->store_mfn, state->console_port,
                                &state->console_mfn, state->store_domid,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index b1b60cb..02e2bce 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3401,6 +3401,11 @@ int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
                                     uint32_t domid,
                                     libxl_domain_build_info *b_info,
                                     libxl__domain_build_state *state);
+int libxl__vnuma_build_vmemrange_hvm(libxl__gc *gc,
+                                     uint32_t domid,
+                                     libxl_domain_build_info *b_info,
+                                     libxl__domain_build_state *state,
+                                     struct xc_hvm_build_args *args);
 
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
index 1d50606..5609dce 100644
--- a/tools/libxl/libxl_vnuma.c
+++ b/tools/libxl/libxl_vnuma.c
@@ -163,6 +163,62 @@ int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
     return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
 }
 
+/* Build vmemranges for HVM guest */
+int libxl__vnuma_build_vmemrange_hvm(libxl__gc *gc,
+                                     uint32_t domid,
+                                     libxl_domain_build_info *b_info,
+                                     libxl__domain_build_state *state,
+                                     struct xc_hvm_build_args *args)
+{
+    uint64_t hole_start, hole_end, next;
+    int i, x;
+    vmemrange_t *v;
+
+    /* Derive vmemranges from vnode size and memory hole.
+     *
+     * Guest physical address space layout:
+     * [0, hole_start) [hole_start, hole_end) [hole_end, highmem_end)
+     */
+    hole_start = args->lowmem_end < args->mmio_start ?
+        args->lowmem_end : args->mmio_start;
+    hole_end = (args->mmio_start + args->mmio_size) > (1ULL << 32) ?
+        (args->mmio_start + args->mmio_size) : (1ULL << 32);
+
+    assert(state->vmemranges == NULL);
+
+    next = 0;
+    x = 0;
+    v = NULL;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+        uint64_t remaining = (p->mem << 20);
+
+        while (remaining > 0) {
+            uint64_t count = remaining;
+
+            if (next >= hole_start && next < hole_end)
+                next = hole_end;
+            if ((next < hole_start) && (next + remaining >= hole_start))
+                count = hole_start - next;
+
+            v = libxl__realloc(gc, v, sizeof(vmemrange_t) * (x + 1));
+            v[x].start = next;
+            v[x].end = next + count;
+            v[x].flags = 0;
+            v[x].nid = i;
+
+            x++;
+            remaining -= count;
+            next += count;
+        }
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = x;
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 17/19] libxl: refactor hvm_build_set_params
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (15 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 16/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2014-11-21 15:06 ` Wei Liu
  2014-11-25 10:06   ` Wei Liu
  2014-11-21 15:07 ` [PATCH 18/19] libxl: fill vNUMA information in hvm info Wei Liu
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Changes:
1. Take care of function calls that can fail.
2. Use proper libxl error handling style.
3. Pass in build state instead of individual fields.

This is a mechanical change in preparation for a later patch.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_dom.c |   48 +++++++++++++++++++++++++++--------------------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 3fe3092..bace1cb 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -702,20 +702,20 @@ out:
 
 static int hvm_build_set_params(xc_interface *handle, uint32_t domid,
                                 libxl_domain_build_info *info,
-                                int store_evtchn, unsigned long *store_mfn,
-                                int console_evtchn, unsigned long *console_mfn,
-                                domid_t store_domid, domid_t console_domid)
+                                libxl__domain_build_state *state)
 {
     struct hvm_info_table *va_hvm;
     uint8_t *va_map, sum;
     uint64_t str_mfn, cons_mfn;
-    int i;
+    int i, rc;
 
     va_map = xc_map_foreign_range(handle, domid,
                                   XC_PAGE_SIZE, PROT_READ | PROT_WRITE,
                                   HVM_INFO_PFN);
-    if (va_map == NULL)
-        return -1;
+    if (va_map == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
 
     va_hvm = (struct hvm_info_table *)(va_map + HVM_INFO_OFFSET);
     va_hvm->apic_mode = libxl_defbool_val(info->u.hvm.apic);
@@ -727,16 +727,27 @@ static int hvm_build_set_params(xc_interface *handle, uint32_t domid,
     va_hvm->checksum -= sum;
     munmap(va_map, XC_PAGE_SIZE);
 
-    xc_hvm_param_get(handle, domid, HVM_PARAM_STORE_PFN, &str_mfn);
-    xc_hvm_param_get(handle, domid, HVM_PARAM_CONSOLE_PFN, &cons_mfn);
-    xc_hvm_param_set(handle, domid, HVM_PARAM_STORE_EVTCHN, store_evtchn);
-    xc_hvm_param_set(handle, domid, HVM_PARAM_CONSOLE_EVTCHN, console_evtchn);
-
-    *store_mfn = str_mfn;
-    *console_mfn = cons_mfn;
-
-    xc_dom_gnttab_hvm_seed(handle, domid, *console_mfn, *store_mfn, console_domid, store_domid);
-    return 0;
+    rc = xc_hvm_param_get(handle, domid, HVM_PARAM_STORE_PFN, &str_mfn);
+    if (rc) { rc = ERROR_FAIL; goto out; }
+    rc = xc_hvm_param_get(handle, domid, HVM_PARAM_CONSOLE_PFN, &cons_mfn);
+    if (rc) { rc = ERROR_FAIL; goto out; }
+    rc = xc_hvm_param_set(handle, domid, HVM_PARAM_STORE_EVTCHN,
+                          state->store_port);
+    if (rc) { rc = ERROR_FAIL; goto out; }
+    rc = xc_hvm_param_set(handle, domid, HVM_PARAM_CONSOLE_EVTCHN,
+                          state->console_port);
+    if (rc) { rc = ERROR_FAIL; goto out; }
+
+    state->store_mfn = str_mfn;
+    state->console_mfn = cons_mfn;
+
+    rc = xc_dom_gnttab_hvm_seed(handle, domid, state->console_mfn,
+                                state->store_mfn,
+                                state->console_domid,
+                                state->store_domid);
+    if (rc) { rc = ERROR_FAIL; goto out; }
+out:
+    return rc;
 }
 
 static int hvm_build_set_xs_values(libxl__gc *gc,
@@ -918,10 +929,7 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         ret = set_vnuma_info(gc, domid, info, state);
         if (ret) goto out;
     }
-    ret = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
-                               &state->store_mfn, state->console_port,
-                               &state->console_mfn, state->store_domid,
-                               state->console_domid);
+    ret = hvm_build_set_params(ctx->xch, domid, info, state);
     if (ret) {
         LOGEV(ERROR, ret, "hvm build set params failed");
         goto out;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 18/19] libxl: fill vNUMA information in hvm info
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (16 preceding siblings ...)
  2014-11-21 15:06 ` [PATCH 17/19] libxl: refactor hvm_build_set_params Wei Liu
@ 2014-11-21 15:07 ` Wei Liu
  2014-11-25 10:06   ` Wei Liu
  2014-11-21 15:07 ` [PATCH 19/19] xl: vNUMA support Wei Liu
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_dom.c |   27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index bace1cb..5980d87 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -707,7 +707,13 @@ static int hvm_build_set_params(xc_interface *handle, uint32_t domid,
     struct hvm_info_table *va_hvm;
     uint8_t *va_map, sum;
     uint64_t str_mfn, cons_mfn;
-    int i, rc;
+    int i, j, rc;
+
+    if (state->num_vmemranges > HVM_MAX_VMEMRANGES ||
+        info->num_vnuma_nodes * info->num_vnuma_nodes > HVM_MAX_LOCALITIES) {
+        rc = ERROR_INVAL;
+        goto out;
+    }
 
     va_map = xc_map_foreign_range(handle, domid,
                                   XC_PAGE_SIZE, PROT_READ | PROT_WRITE,
@@ -722,6 +728,25 @@ static int hvm_build_set_params(xc_interface *handle, uint32_t domid,
     va_hvm->nr_vcpus = info->max_vcpus;
     memset(va_hvm->vcpu_online, 0, sizeof(va_hvm->vcpu_online));
     memcpy(va_hvm->vcpu_online, info->avail_vcpus.map, info->avail_vcpus.size);
+
+    va_hvm->nr_nodes = info->num_vnuma_nodes;
+    va_hvm->nr_localities = info->num_vnuma_nodes;
+    for (i = 0; i < info->num_vnuma_nodes; i++) {
+        int bit;
+        libxl_for_each_set_bit(bit, info->vnuma_nodes[i].vcpus)
+            va_hvm->vcpu_to_vnode[bit] = i;
+        for (j = 0; j < info->vnuma_nodes[i].num_distances; j++)
+            va_hvm->localities[i*info->num_vnuma_nodes+j] =
+                info->vnuma_nodes[i].distances[j];
+    }
+    for (i = 0; i < state->num_vmemranges; i++) {
+        va_hvm->vmemranges[i].start = state->vmemranges[i].start;
+        va_hvm->vmemranges[i].end = state->vmemranges[i].end;
+        va_hvm->vmemranges[i].flags = state->vmemranges[i].flags;
+        va_hvm->vmemranges[i].nid = state->vmemranges[i].nid;
+    }
+    va_hvm->nr_vmemranges = state->num_vmemranges;
+
     for (i = 0, sum = 0; i < va_hvm->length; i++)
         sum += ((uint8_t *) va_hvm)[i];
     va_hvm->checksum -= sum;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 19/19] xl: vNUMA support
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (17 preceding siblings ...)
  2014-11-21 15:07 ` [PATCH 18/19] libxl: fill vNUMA information in hvm info Wei Liu
@ 2014-11-21 15:07 ` Wei Liu
  2014-11-21 16:25 ` [PATCH 00/19] Virtual NUMA for PV and HVM Jan Beulich
  2014-11-21 20:01 ` Konrad Rzeszutek Wilk
  20 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 15:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

This patch includes the configuration option parser and documentation.

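An illustrative configuration (values are made up) for a 4-vcpu guest with
two 2048 MB virtual nodes mapped to physical nodes 0 and 1 would look like:

    memory           = 4096
    vcpus            = 4
    vnuma_memory     = [2048, 2048]
    vnuma_vcpu_map   = [0, 0, 1, 1]
    vnuma_pnode_map  = [0, 1]
    vnuma_vdistances = [10, 20]   # optional
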
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 docs/man/xl.cfg.pod.5    |   32 ++++++++++
 tools/libxl/xl_cmdimpl.c |  151 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 183 insertions(+)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 622ea53..0394d53 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -266,6 +266,38 @@ it will crash.
 
 =back
 
+=head3 Virtual NUMA Memory Allocation
+
+=over 4
+
+=item B<vnuma_memory=[ NUMBER, NUMBER, ... ]>
+
+Specify the size in megabytes of the memory covered by each virtual NUMA
+node. The number of elements in the list also implicitly defines the number
+of virtual NUMA nodes.
+
+The sum of all elements in this list should be equal to the memory size
+specified by B<maxmem=> in the guest configuration file, or B<memory=> if
+B<maxmem=> is not specified.
+
+=item B<vnuma_vcpu_map=[ NUMBER, NUMBER, ... ]>
+
+Specify which virtual NUMA node a specific vcpu belongs to. The number of
+elements in this list should be equal to B<maxvcpus=> in the guest
+configuration file, or B<vcpus=> if B<maxvcpus=> is not specified.
+
+=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
+
+Specify which physical NUMA node a specific virtual NUMA node maps to. The
+number of elements in this list should be equal to the number of virtual
+NUMA nodes defined in B<vnuma_memory=>.
+
+=item B<vnuma_vdistances=[ NUMBER, NUMBER ]>
+
+Specify the distance from a node to itself (local) and from a node to a
+remote node, respectively. This is optional. If not specified, [10,20] will
+be used.
+
+=back
+
 =head3 Event Actions
 
 =over 4
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 9afef3f..d82e41a 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -904,6 +904,155 @@ static void replace_string(char **str, const char *val)
     *str = xstrdup(val);
 }
 
+static void parse_vnuma_config(const XLU_Config *config,
+                               libxl_domain_build_info *b_info)
+{
+    int i;
+    XLU_ConfigList *vnuma_memory, *vnuma_vcpu_map, *vnuma_pnode_map,
+        *vnuma_vdistances;
+    int num_vnuma_memory, num_vnuma_vcpu_map, num_vnuma_pnode_map,
+        num_vnuma_vdistances;
+    const char *buf;
+    libxl_physinfo physinfo;
+    uint32_t nr_nodes;
+    unsigned long local, remote; /* vdistance */
+    unsigned long val;
+    char *ep;
+
+    libxl_physinfo_init(&physinfo);
+    if (libxl_get_physinfo(ctx, &physinfo) != 0) {
+        libxl_physinfo_dispose(&physinfo);
+        fprintf(stderr, "libxl_physinfo failed.\n");
+        exit(1);
+    }
+    nr_nodes = physinfo.nr_nodes;
+    libxl_physinfo_dispose(&physinfo);
+
+    if (xlu_cfg_get_list(config, "vnuma_memory", &vnuma_memory,
+                         &num_vnuma_memory, 1))
+        return;              /* No vnuma config */
+
+    b_info->num_vnuma_nodes = num_vnuma_memory;
+    b_info->vnuma_nodes = xmalloc(num_vnuma_memory * sizeof(libxl_vnode_info));
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+
+        libxl_vnode_info_init(p);
+        libxl_cpu_bitmap_alloc(ctx, &p->vcpus, b_info->max_vcpus);
+        libxl_bitmap_set_none(&p->vcpus);
+        p->distances = xmalloc(b_info->num_vnuma_nodes * sizeof(uint32_t));
+        p->num_distances = b_info->num_vnuma_nodes;
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        buf = xlu_cfg_get_listitem(vnuma_memory, i);
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vnuma memory: %s\n", buf);
+            exit(1);
+        }
+        b_info->vnuma_nodes[i].mem = val;
+    }
+
+    if (xlu_cfg_get_list(config, "vnuma_vcpu_map", &vnuma_vcpu_map,
+                         &num_vnuma_vcpu_map, 1)) {
+        fprintf(stderr, "No vcpu to vnode map specified\n");
+        exit(1);
+    }
+
+    i = 0;
+    while (i < b_info->max_vcpus &&
+           (buf = xlu_cfg_get_listitem(vnuma_vcpu_map, i)) != NULL) {
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vcpu map: %s\n", buf);
+            exit(1);
+        }
+        if (val >= num_vnuma_memory) {
+            fprintf(stderr, "xl: Invalid vnode number specified: %lu\n", val);
+            exit(1);
+        }
+        libxl_bitmap_set(&b_info->vnuma_nodes[val].vcpus, i);
+        i++;
+    }
+
+    if (i != b_info->max_vcpus) {
+        fprintf(stderr, "xl: Not enough elements in vnuma_vcpu_map, provided %d, required %d\n",
+                i, b_info->max_vcpus);
+        exit(1);
+    }
+
+    if (xlu_cfg_get_list(config, "vnuma_pnode_map", &vnuma_pnode_map,
+                         &num_vnuma_pnode_map, 1)) {
+        fprintf(stderr, "No vnode to pnode map specified\n");
+        exit(1);
+    }
+
+    i = 0;
+    while (i < num_vnuma_pnode_map &&
+           (buf = xlu_cfg_get_listitem(vnuma_pnode_map, i)) != NULL) {
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vnode to pnode map: %s\n", buf);
+            exit(1);
+        }
+        if (val >= nr_nodes) {
+            fprintf(stderr, "xl: Invalid pnode number specified: %lu\n", val);
+            exit(1);
+        }
+
+        b_info->vnuma_nodes[i].pnode = val;
+
+        i++;
+    }
+
+    if (i != b_info->num_vnuma_nodes) {
+        fprintf(stderr, "xl: Not enough elements in vnuma_vnode_map, provided %d, required %d\n",
+                i + 1, b_info->num_vnuma_nodes);
+        exit(1);
+    }
+
+    /* Set default values for distances, then try to parse config */
+    local = 10;
+    remote = 20;
+    if (!xlu_cfg_get_list(config, "vnuma_vdistances", &vnuma_vdistances,
+                          &num_vnuma_vdistances, 1)) {
+        if (num_vnuma_vdistances != 2) {
+            fprintf(stderr, "xl: vnuma_vdistances array can only contain 2 elements\n");
+            exit(1);
+        }
+
+        buf = xlu_cfg_get_listitem(vnuma_vdistances, 0);
+        local = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vdistances map: %s\n", buf);
+            exit(1);
+        }
+
+        buf = xlu_cfg_get_listitem(vnuma_vdistances, 1);
+        remote = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vdistances map: %s\n", buf);
+            exit(1);
+        }
+
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        int j;
+        uint32_t *x = b_info->vnuma_nodes[i].distances;
+
+        for (j = 0; j < b_info->vnuma_nodes[i].num_distances; j++) {
+            if (i == j)
+                x[j] = local;
+            else
+                x[j] = remote;
+        }
+    }
+
+}
+
 static void parse_config_data(const char *config_source,
                               const char *config_data,
                               int config_len,
@@ -1093,6 +1242,8 @@ static void parse_config_data(const char *config_source,
         }
     }
 
+    parse_vnuma_config(config, b_info);
+
     if (!xlu_cfg_get_long(config, "rtc_timeoffset", &l, 0))
         b_info->rtc_timeoffset = l;
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (18 preceding siblings ...)
  2014-11-21 15:07 ` [PATCH 19/19] xl: vNUMA support Wei Liu
@ 2014-11-21 16:25 ` Jan Beulich
  2014-11-21 16:35   ` Wei Liu
  2014-11-21 20:01 ` Konrad Rzeszutek Wilk
  20 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2014-11-21 16:25 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> memory = 6000
> vnuma_memory = [3000, 3000]

So what would

memory = 6000
vnuma_memory = [3000, 2000]

or

memory = 6000
vnuma_memory = [3000, 4000]

mean? Redundant specification of values is always a problem...
Would it be possible to extend "memory" to allow for a vector as well
as a single value?

> vnuma_vcpu_map = [0, 1]
> vnuma_pnode_map = [0, 0]
> vnuma_vdistances = [10, 30] # optional

Being optional, would the real distances be used instead? And what
meaning does this apparently one dimensional array here have for
the actually two dimensional SLIT? (Read: An example with more
than two nodes would be useful.)

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 16:25 ` [PATCH 00/19] Virtual NUMA for PV and HVM Jan Beulich
@ 2014-11-21 16:35   ` Wei Liu
  2014-11-21 16:42     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 16:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Fri, Nov 21, 2014 at 04:25:34PM +0000, Jan Beulich wrote:
> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> > memory = 6000
> > vnuma_memory = [3000, 3000]
> 
> So what would
> 
> memory = 6000
> vnuma_memory = [3000, 2000]
> 
> or
> 
> memory = 6000
> vnuma_memory = [3000, 4000]
> 

Those are not valid configurations at the moment. I have checks in libxl
to mandate that the sum of vnuma_memory equals memory (in fact, maxmem).

> mean? Redundant specification of values is always a problem...
> Would be possible to extend "memory" to allow for a vector as well
> as a single value?
> 

Yes, I think it can be done in backward compatible way.

> > vnuma_vcpu_map = [0, 1]
> > vnuma_pnode_map = [0, 0]
> > vnuma_vdistances = [10, 30] # optional
> 
> Being optional, would the real distances be used instead? And what

Default value of [10, 20] will be used.

> meaning does this apparently one dimensional array here have for
> the actually two dimensional SLIT? (Read: An example with more
> than two nodes would be useful.)
> 

The first element of [X, Y] is the local distance, the second element is
the remote distance.

For a 4 node system:

     0    1    2    3
0    X    Y    Y    Y 
1    Y    X    Y    Y
2    Y    Y    X    Y
3    Y    Y    Y    X

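Roughly, xl expands that pair into the full matrix like this (sketch only):

    for ( i = 0; i < nr_nodes; i++ )
        for ( j = 0; j < nr_nodes; j++ )
            distance[i * nr_nodes + j] = (i == j) ? local : remote;
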
Wei.

> Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/19] xen: dump vNUMA information with debug key "u"
  2014-11-21 15:06 ` [PATCH 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
@ 2014-11-21 16:39   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2014-11-21 16:39 UTC (permalink / raw)
  To: Wei Liu; +Cc: Elena Ufimsteva, xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> Signed-off-by: Elena Ufimsteva <ufimtseva@gmail.com>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> ---
>  xen/arch/x86/numa.c |   46 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 628a40a..d27c30f 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -363,10 +363,12 @@ EXPORT_SYMBOL(node_data);
>  static void dump_numa(unsigned char key)
>  {
>      s_time_t now = NOW();
> -    int i;
> +    int i, j, err, n;

unsigned int please for all but err.

> @@ -408,6 +410,48 @@ static void dump_numa(unsigned char key)
>  
>          for_each_online_node ( i )
>              printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +        if ( !d->vnuma )
> +                continue;
> +
> +        vnuma = d->vnuma;
> +        printk("     %u vnodes, %u vcpus\n", vnuma->nr_vnodes, d->max_vcpus);
> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
> +        {
> +            err = snprintf(keyhandler_scratch, 12, "%u",
> +                    vnuma->vnode_to_pnode[i]);
> +            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
> +                snprintf(keyhandler_scratch, 3, "???");

strcpy() would be much cheaper here.
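
E.g. (a sketch of the suggested form):

    strcpy(keyhandler_scratch, "???");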

> +
> +            printk("       vnode %3u - pnode %s\n", i, keyhandler_scratch);
> +            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
> +            {
> +                if ( vnuma->vmemrange[j].nid == i )
> +                {
> +                    mem = vnuma->vmemrange[j].end - vnuma->vmemrange[j].start;
> +                    printk("        %"PRIu64" MB: ", mem >> 20);
> +                    printk(" 0x%"PRIx64" - 0x%"PRIx64"\n",
> +                           vnuma->vmemrange[j].start,
> +                           vnuma->vmemrange[j].end);

Where possible please don't split printk()s of a single output line. Also
%# instead of 0x% please (but maybe padding the values so they
align properly would be the better change; that would also eliminate
the need for explicit leading spaces).
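
E.g. a single combined printk might look like (sketch only):

    printk("        %10"PRIu64" MB: %#018"PRIx64" - %#018"PRIx64"\n",
           mem >> 20, vnuma->vmemrange[j].start, vnuma->vmemrange[j].end);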

> +                }
> +            }
> +
> +            printk("        vcpus: ");
> +            for ( j = 0, n = 0; j < d->max_vcpus; j++ )
> +            {
> +                if ( vnuma->vcpu_to_vnode[j] == i )
> +                {
> +                    if ( (n + 1) % 8 == 0 )
> +                        printk("%d\n", j);
> +                    else if ( !(n % 8) && n != 0 )
> +                        printk("                %d ", j);
> +                    else
> +                        printk("%d ", j);

Same here - left-padding them to make them aligned will likely make
the result quite a bit easier to grok, especially for large guests.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 16:35   ` Wei Liu
@ 2014-11-21 16:42     ` Jan Beulich
  2014-11-21 16:55       ` Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2014-11-21 16:42 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 17:35, <wei.liu2@citrix.com> wrote:
> On Fri, Nov 21, 2014 at 04:25:34PM +0000, Jan Beulich wrote:
>> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
>> > vnuma_vdistances = [10, 30] # optional
>> 
>> Being optional, would the real distances be used instead? And what
> 
> Default value of [10, 20] will be used.

That's bad. Would it be very difficult to use the host values?

>> meaning does this apparently one dimensional array here have for
>> the actually two dimensional SLIT? (Read: An example with more
>> than two nodes would be useful.)
>> 
> 
> The first element of [X, Y] is local distance, the second element is
> remote distance.
> 
> For a 4 node system:
> 
>      0    1    2    3
> 0    X    Y    Y    Y 
> 1    Y    X    Y    Y
> 2    Y    Y    X    Y
> 3    Y    Y    Y    X

That may match up with how most current NUMA systems look,
but do we really want to bake in an oversimplification like this?

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 16:42     ` Jan Beulich
@ 2014-11-21 16:55       ` Wei Liu
  2014-11-21 17:05         ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-21 16:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Fri, Nov 21, 2014 at 04:42:07PM +0000, Jan Beulich wrote:
> >>> On 21.11.14 at 17:35, <wei.liu2@citrix.com> wrote:
> > On Fri, Nov 21, 2014 at 04:25:34PM +0000, Jan Beulich wrote:
> >> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> >> > vnuma_vdistances = [10, 30] # optional
> >> 
> >> Being optional, would the real distances be used instead? And what
> > 
> > Default value of [10, 20] will be used.
> 
> That's bad. Would it be very difficult to use the host values?
> 

It's easy. I will do that in the next iteration.

> >> meaning does this apparently one dimensional array here have for
> >> the actually two dimensional SLIT? (Read: An example with more
> >> than two nodes would be useful.)
> >> 
> > 
> > The first element of [X, Y] is local distance, the second element is
> > remote distance.
> > 
> > For a 4 node system:
> > 
> >      0    1    2    3
> > 0    X    Y    Y    Y 
> > 1    Y    X    Y    Y
> > 2    Y    Y    X    Y
> > 3    Y    Y    Y    X
> 
> That may match up with how most current NUMA systems look like,
> but do we really want to bake in an oversimplification like this?
> 

I want to clarify that this is only an *xl* option; the libxl interface is
capable of specifying every single element in the SLIT.

Nonetheless I'm all for having a configuration option that would meet
both present and future need. Do you have anything in mind? Are you
suggesting we should allow specifying every element in SLIT in xl?

Wei.

> Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware
  2014-11-21 15:06 ` [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
@ 2014-11-21 17:03   ` Jan Beulich
  2014-11-21 17:30     ` Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2014-11-21 17:03 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> @@ -747,6 +786,17 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>              return start_extent;
>          args.domain = d;
>  
> +        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
> +        if ( reservation.mem_flags & XENMEMF_exact_node_request )
> +            args.memflags |= MEMF_exact_node;
> +
> +        rc = translate_vnode_to_pnode(d, &reservation, &args);
> +        if ( rc )
> +        {
> +            rcu_unlock_domain(d);
> +            return rc;

I'm afraid you got misguided here by the (buggy) adjacent XSM error
path: You shouldn't return a negative error code but "start_extent"
here. And I'll try to remember to fix the XSM path post-4.5.

> +/* Guset can use XENMEMF_vnode to specify virtual node for memory op. */
> +#define XENFEAT_memory_op_vnode_supported 13

You introduce this flag but then don't use it? Also there's a typo in
the comment.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 16:55       ` Wei Liu
@ 2014-11-21 17:05         ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2014-11-21 17:05 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 17:55, <wei.liu2@citrix.com> wrote:
> Nonetheless I'm all for having a configuration option that would meet
> both present and future need. Do you have anything in mind? Are you
> suggesting we should allow specifying every element in SLIT in xl?

I think that would be desirable.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware
  2014-11-21 17:03   ` Jan Beulich
@ 2014-11-21 17:30     ` Wei Liu
  0 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 17:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Fri, Nov 21, 2014 at 05:03:09PM +0000, Jan Beulich wrote:
> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> > @@ -747,6 +786,17 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >              return start_extent;
> >          args.domain = d;
> >  
> > +        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
> > +        if ( reservation.mem_flags & XENMEMF_exact_node_request )
> > +            args.memflags |= MEMF_exact_node;
> > +
> > +        rc = translate_vnode_to_pnode(d, &reservation, &args);
> > +        if ( rc )
> > +        {
> > +            rcu_unlock_domain(d);
> > +            return rc;
> 
> I'm afraid you got misguided here by the (buggy) adjacent XSM error
> path: You shouldn't return a negative error code but "start_extent"
> here. And I'll try to remember to fix the XSM path post-4.5.
> 

Fixed.

> > +/* Guset can use XENMEMF_vnode to specify virtual node for memory op. */
> > +#define XENFEAT_memory_op_vnode_supported 13
> 
> You introduce this flag but then don't use it? Also there's a typo in
> the comment.
> 

Oops, a hunk is missing. Fixed.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
  2014-11-21 15:06 ` [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled Wei Liu
@ 2014-11-21 19:56   ` Konrad Rzeszutek Wilk
  2014-11-24  9:21     ` Wei Liu
  2014-11-24  9:29     ` Jan Beulich
  2014-11-24 10:15   ` Jan Beulich
  1 sibling, 2 replies; 44+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-21 19:56 UTC (permalink / raw)
  To: Wei Liu; +Cc: George Dunlap, Jan Beulich, xen-devel

On Fri, Nov 21, 2014 at 03:06:56PM +0000, Wei Liu wrote:
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>  tools/firmware/hvmloader/pci.c |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
> index 4e8d803..d7ea740 100644
> --- a/tools/firmware/hvmloader/pci.c
> +++ b/tools/firmware/hvmloader/pci.c
> @@ -88,6 +88,19 @@ void pci_setup(void)
>      printf("Relocating guest memory for lowmem MMIO space %s\n",
>             allow_memory_relocate?"enabled":"disabled");
>  
> +    /* Disallow low memory relocation when vNUMA is enabled, because
> +     * relocated memory ends up off node. Further more, even if we
> +     * dynamically expand node coverage in hvmloader, low memory and
> +     * high memory may reside in different physical nodes, blindly
> +     * relocates low memory to high memory gives us a sub-optimal
> +     * configuration.

And this is done in hvmloader, so the toolstack has no inkling that
we need to relocate memory to make space for the PCI.

In that case I would not have this check here. Instead put it in
libxl and disallow vNUMA with PCI passthrough.

And then the fix is to take the logic that is in hvmloader for PCI
BAR size relocation and move it into libxl. Then it can construct the
proper vNUMA topology and also fix an outstanding QEMU-xen bug.

> +     */
> +    if ( hvm_info->nr_nodes != 0 && allow_memory_relocate )
> +    {
> +        allow_memory_relocate = false;
> +        printf("vNUMA enabled, relocating guest memory for lowmem MMIO space disabled\n");
> +    }
> +
>      s = xenstore_read("platform/mmio_hole_size", NULL);
>      if ( s )
>          mmio_hole_size = strtoll(s, NULL, 0);
> -- 
> 1.7.10.4
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (19 preceding siblings ...)
  2014-11-21 16:25 ` [PATCH 00/19] Virtual NUMA for PV and HVM Jan Beulich
@ 2014-11-21 20:01 ` Konrad Rzeszutek Wilk
  2014-11-21 20:44   ` Wei Liu
  20 siblings, 1 reply; 44+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-21 20:01 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

On Fri, Nov 21, 2014 at 03:06:42PM +0000, Wei Liu wrote:
> Hi all
> 
> This patch series implements virtual NUMA support for both PV and HVM guest.
> That is, admin can configure via libxl what virtual NUMA topology the guest
> sees.
> 
> This is the stage 1 (basic vNUMA support) and part of stage 2 (vNUMA-ware
> ballooning, hypervisor side) described in my previous email to xen-devel [0].
> 
> This series is broken into several parts:
> 
> 1. xen patches: vNUMA debug output and vNUMA-aware memory hypercall support.
> 2. libxc/libxl support for PV vNUMA.
> 3. libxc/libxl support for HVM vNUMA.
> 4. xl vNUMA configuration documentation and parser.
> 
> I think one significant difference from Elena's work is that this patch series
> makes use of multiple vmemranges should there be a memory hole, instead of
> shrinking ram. This matches the behaviour of real hardware.

Are some of the patches then borrowed from Elena? If so, she should be credited
in the patches?
> 
> The vNUMA auto placement algorithm is missing at the moment and Dario is
> working on it.
> 
> This series can be found at:
>  git://xenbits.xen.org/people/liuw/xen.git wip.vnuma-v1 
> 
> With this series, the following configuration can be used to enabled virtual
> NUMA support, and it works for both PV and HVM guests.
> 
> memory = 6000
> vnuma_memory = [3000, 3000]
> vnuma_vcpu_map = [0, 1]
> vnuma_pnode_map = [0, 0]
> vnuma_vdistances = [10, 30] # optional
> 
> dmesg output for HVM guest:
> 
> [    0.000000] ACPI: SRAT 00000000fc009ff0 000C8 (v01    Xen      HVM 00000000 HVML 00000000)
> [    0.000000] ACPI: SLIT 00000000fc00a0c0 00030 (v01    Xen      HVM 00000000 HVML 00000000)
> <...snip...>
> [    0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0
> [    0.000000] SRAT: PXM 1 -> APIC 0x02 -> Node 1
> [    0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0xbb7fffff]
> [    0.000000] SRAT: Node 1 PXM 1 [mem 0xbb800000-0xefffffff]
> [    0.000000] SRAT: Node 1 PXM 1 [mem 0x100000000-0x186ffffff]
> [    0.000000] NUMA: Initialized distance table, cnt=2
> [    0.000000] NUMA: Node 1 [mem 0xbb800000-0xefffffff] + [mem 0x100000000-0x1867fffff] -> [mem 0xbb800000-0x1867fffff]
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0xbb7fffff]
> [    0.000000]   NODE_DATA [mem 0xbb7fc000-0xbb7fffff]
> [    0.000000] Initmem setup node 1 [mem 0xbb800000-0x1867fffff]
> [    0.000000]   NODE_DATA [mem 0x1867f7000-0x1867fafff]
> [    0.000000]  [ffffea0000000000-ffffea00029fffff] PMD -> [ffff8800b8600000-ffff8800baffffff] on node 0
> [    0.000000]  [ffffea0002a00000-ffffea00055fffff] PMD -> [ffff880183000000-ffff8801859fffff] on node 1
> <...snip...>
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009efff]
> [    0.000000]   node   0: [mem 0x00100000-0xbb7fffff]
> [    0.000000]   node   1: [mem 0xbb800000-0xefffefff]
> [    0.000000]   node   1: [mem 0x100000000-0x1867fffff]
> 
> numactl output for HVM guest:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0
> node 0 size: 2999 MB
> node 0 free: 2546 MB
> node 1 cpus: 1
> node 1 size: 2991 MB
> node 1 free: 2144 MB
> node distances:
> node   0   1 
>   0:  10  30 
>   1:  30  10 
> 
> dmesg output for PV guest:
> 
> [    0.000000] NUMA: Initialized distance table, cnt=2
> [    0.000000] NUMA: Node 1 [mem 0xbb800000-0xce68efff] + [mem 0x100000000-0x1a8970fff] -> [mem 0xbb800000-0x1a8970fff]
> [    0.000000] NODE_DATA(0) allocated [mem 0xbb7fc000-0xbb7fffff]
> [    0.000000] NODE_DATA(1) allocated [mem 0x1a8969000-0x1a896cfff]
> 
> numactl output for PV guest:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0
> node 0 size: 2944 MB
> node 0 free: 2917 MB
> node 1 cpus: 1
> node 1 size: 2934 MB
> node 1 free: 2904 MB
> node distances:
> node   0   1 
>   0:  10  30
>   1:  30  10
> 
> And for a HVM guest on a real NUMA-capable machine:
> 
> (XEN) Memory location of each domain:
> (XEN) Domain 0 (total: 262144):
> (XEN)     Node 0: 245758
> (XEN)     Node 1: 16386
> (XEN) Domain 2 (total: 2097226):
> (XEN)     Node 0: 1046335
> (XEN)     Node 1: 1050891
> (XEN)      2 vnodes, 4 vcpus
> (XEN)        vnode   0 - pnode 0
> (XEN)         3840 MB:  0x0 - 0xf0000000
> (XEN)         256 MB:  0x100000000 - 0x110000000
> (XEN)         vcpus: 0 1 
> (XEN)        vnode   1 - pnode 1
> (XEN)         4096 MB:  0x110000000 - 0x210000000
> (XEN)         vcpus: 2 3 
> 
> Wei.
> 
> [0] <20141111173606.GC21312@zion.uk.xensource.com>
> 
> Wei Liu (19):
>   xen: dump vNUMA information with debug key "u"
>   xen: make two memory hypercalls vNUMA-aware
>   libxc: allocate memory with vNUMA information for PV guest
>   libxl: add emacs local variables in libxl_{x86,arm}.c
>   libxl: introduce vNUMA types
>   libxl: add vmemrange to libxl__domain_build_state
>   libxl: introduce libxl__vnuma_config_check
>   libxl: x86: factor out e820_host_sanitize
>   libxl: functions to build vmemranges for PV guest
>   libxl: build, check and pass vNUMA info to Xen for PV guest
>   hvmloader: add new fields for vNUMA information
>   hvmloader: construct SRAT
>   hvmloader: construct SLIT
>   hvmloader: disallow memory relocation when vNUMA is enabled
>   libxc: allocate memory with vNUMA information for HVM guest
>   libxl: build, check and pass vNUMA info to Xen for HVM guest
>   libxl: refactor hvm_build_set_params
>   libxl: fill vNUMA information in hvm info
>   xl: vNUMA support
> 
>  docs/man/xl.cfg.pod.5                   |   32 +++++
>  tools/firmware/hvmloader/acpi/acpi2_0.h |   61 +++++++++
>  tools/firmware/hvmloader/acpi/build.c   |  104 ++++++++++++++
>  tools/firmware/hvmloader/pci.c          |   13 ++
>  tools/libxc/include/xc_dom.h            |    5 +
>  tools/libxc/include/xenguest.h          |    7 +
>  tools/libxc/xc_dom_x86.c                |   72 ++++++++--
>  tools/libxc/xc_hvm_build_x86.c          |  224 +++++++++++++++++++-----------
>  tools/libxc/xc_private.h                |    2 +
>  tools/libxl/Makefile                    |    2 +-
>  tools/libxl/libxl_arch.h                |    6 +
>  tools/libxl/libxl_arm.c                 |   17 +++
>  tools/libxl/libxl_create.c              |    9 ++
>  tools/libxl/libxl_dom.c                 |  172 ++++++++++++++++++++---
>  tools/libxl/libxl_internal.h            |   18 +++
>  tools/libxl/libxl_types.idl             |    9 ++
>  tools/libxl/libxl_vnuma.c               |  228 +++++++++++++++++++++++++++++++
>  tools/libxl/libxl_x86.c                 |  113 +++++++++++++--
>  tools/libxl/xl_cmdimpl.c                |  151 ++++++++++++++++++++
>  xen/arch/x86/numa.c                     |   46 ++++++-
>  xen/common/memory.c                     |   58 +++++++-
>  xen/include/public/features.h           |    3 +
>  xen/include/public/hvm/hvm_info_table.h |   19 +++
>  xen/include/public/memory.h             |    2 +
>  24 files changed, 1247 insertions(+), 126 deletions(-)
>  create mode 100644 tools/libxl/libxl_vnuma.c
> 
> -- 
> 1.7.10.4
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/19] Virtual NUMA for PV and HVM
  2014-11-21 20:01 ` Konrad Rzeszutek Wilk
@ 2014-11-21 20:44   ` Wei Liu
  0 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-21 20:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Wei Liu, xen-devel

On Fri, Nov 21, 2014 at 03:01:58PM -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 21, 2014 at 03:06:42PM +0000, Wei Liu wrote:
> > Hi all
> > 
> > This patch series implements virtual NUMA support for both PV and HVM guest.
> > That is, admin can configure via libxl what virtual NUMA topology the guest
> > sees.
> > 
> > This is the stage 1 (basic vNUMA support) and part of stage 2 (vNUMA-ware
> > ballooning, hypervisor side) described in my previous email to xen-devel [0].
> > 
> > This series is broken into several parts:
> > 
> > 1. xen patches: vNUMA debug output and vNUMA-aware memory hypercall support.
> > 2. libxc/libxl support for PV vNUMA.
> > 3. libxc/libxl support for HVM vNUMA.
> > 4. xl vNUMA configuration documentation and parser.
> > 
> > I think one significant difference from Elena's work is that this patch series
> > makes use of multiple vmemranges should there be a memory hole, instead of
> > shrinking ram. This matches the behaviour of real hardware.
> 
> Are some of the patches then borrowed from Elena? If so, she should be credited
> in the patches?
> > 

Because the libxl interface changed, along with a lot of the underlying
assumptions, I only took one patch from her series (the first one) and
rewrote the PV toolstack implementation.

Elena, if you discover that I didn't credit you for later patches, don't
hesitate to tell me. ;-)

Wei.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
  2014-11-21 19:56   ` Konrad Rzeszutek Wilk
@ 2014-11-24  9:21     ` Wei Liu
  2014-11-24  9:29     ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-24  9:21 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: George Dunlap, Wei Liu, Jan Beulich, xen-devel

On Fri, Nov 21, 2014 at 02:56:31PM -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 21, 2014 at 03:06:56PM +0000, Wei Liu wrote:
> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > Cc: Jan Beulich <JBeulich@suse.com>
> > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> > ---
> >  tools/firmware/hvmloader/pci.c |   13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
> > index 4e8d803..d7ea740 100644
> > --- a/tools/firmware/hvmloader/pci.c
> > +++ b/tools/firmware/hvmloader/pci.c
> > @@ -88,6 +88,19 @@ void pci_setup(void)
> >      printf("Relocating guest memory for lowmem MMIO space %s\n",
> >             allow_memory_relocate?"enabled":"disabled");
> >  
> > +    /* Disallow low memory relocation when vNUMA is enabled, because
> > +     * relocated memory ends up off node. Further more, even if we
> > +     * dynamically expand node coverage in hvmloader, low memory and
> > +     * high memory may reside in different physical nodes, blindly
> > +     * relocates low memory to high memory gives us a sub-optimal
> > +     * configuration.
> 
> And this is done in hvmloader, so the toolstack has no inkling that
> we need to relocate memory to make space for the PCI.
> 
> In such case I would not have this check here. Instead put it in 
> libxl 

You're right, I think this should be placed in libxl.

> and disallow vNUMA with PCI passthrough.
> 
> And then the fix is to take the logic that is in hvmloader for PCI
> BAR size relocation and move it in libxl. Then it can construct the
> proper vNUMA topology and also fix an outstanding QEMU-xen bug.
> 

But FYI, PCI passthrough is not the only thing that requires a larger
memory hole. A user can use device_model_args_extra (don't remember the
exact name) to instruct QEMU to emulate arbitrary PCI devices.
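
Either way, the libxl-side check itself could be as simple as the
sketch below, say in libxl__vnuma_config_check() (the field names are
illustrative, not necessarily what this series ends up using):

    /* Sketch: refuse vNUMA together with PCI passthrough until the
     * PCI BAR sizing logic moves out of hvmloader. */
    if ( b_info->num_vnuma_nodes != 0 && d_config->num_pcidevs != 0 )
    {
        LOG(ERROR, "vNUMA cannot be combined with PCI passthrough yet");
        return ERROR_INVAL;
    }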

Wei.

> > +     */
> > +    if ( hvm_info->nr_nodes != 0 && allow_memory_relocate )
> > +    {
> > +        allow_memory_relocate = false;
> > +        printf("vNUMA enabled, relocating guest memory for lowmem MMIO space disabled\n");
> > +    }
> > +
> >      s = xenstore_read("platform/mmio_hole_size", NULL);
> >      if ( s )
> >          mmio_hole_size = strtoll(s, NULL, 0);
> > -- 
> > 1.7.10.4
> > 
> > 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
  2014-11-21 19:56   ` Konrad Rzeszutek Wilk
  2014-11-24  9:21     ` Wei Liu
@ 2014-11-24  9:29     ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2014-11-24  9:29 UTC (permalink / raw)
  To: Wei Liu, Konrad Rzeszutek Wilk; +Cc: George Dunlap, xen-devel

>>> On 21.11.14 at 20:56, <konrad.wilk@oracle.com> wrote:
> On Fri, Nov 21, 2014 at 03:06:56PM +0000, Wei Liu wrote:
>> --- a/tools/firmware/hvmloader/pci.c
>> +++ b/tools/firmware/hvmloader/pci.c
>> @@ -88,6 +88,19 @@ void pci_setup(void)
>>      printf("Relocating guest memory for lowmem MMIO space %s\n",
>>             allow_memory_relocate?"enabled":"disabled");
>>  
>> +    /* Disallow low memory relocation when vNUMA is enabled, because
>> +     * relocated memory ends up off node. Further more, even if we
>> +     * dynamically expand node coverage in hvmloader, low memory and
>> +     * high memory may reside in different physical nodes, blindly
>> +     * relocates low memory to high memory gives us a sub-optimal
>> +     * configuration.
> 
> And this is done in hvmloader, so the toolstack has no inkling that
> we need to relocate memory to make space for the PCI.
> 
> In such case I would not have this check here. Instead put it in 
> libxl and disallow vNUMA with PCI passthrough.
> 
> And then the fix is to take the logic that is in hvmloader for PCI
> BAR size relocation and move it in libxl. Then it can construct the
> proper vNUMA topology and also fix an outstanding QEMU-xen bug.

The problem then being that two code pieces pretty far apart from
one another need to be kept in perfect sync. Not really nice in
terms of maintainability. I'd really prefer hvmloader to re-write the
vNUMA info (on real hardware firmware also needs to take care of
the memory holes - there's no magic external entity there either).

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 11/19] hvmloader: add new fields for vNUMA information
  2014-11-21 15:06 ` [PATCH 11/19] hvmloader: add new fields for vNUMA information Wei Liu
@ 2014-11-24  9:58   ` Jan Beulich
  2014-11-24 10:07     ` Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2014-11-24  9:58 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> --- a/xen/include/public/hvm/hvm_info_table.h
> +++ b/xen/include/public/hvm/hvm_info_table.h
> @@ -32,6 +32,17 @@
>  /* Maximum we can support with current vLAPIC ID mapping. */
>  #define HVM_MAX_VCPUS        128
>  
> +#define HVM_MAX_NODES         16
> +#define HVM_MAX_LOCALITIES    (HVM_MAX_NODES * HVM_MAX_NODES)
> +
> +#define HVM_MAX_VMEMRANGES    64
> +struct hvm_info_vmemrange {
> +    uint64_t start;
> +    uint64_t end;
> +    uint32_t flags;
> +    uint32_t nid;
> +};
> +
>  struct hvm_info_table {
>      char        signature[8]; /* "HVM INFO" */
>      uint32_t    length;
> @@ -67,6 +78,14 @@ struct hvm_info_table {
>  
>      /* Bitmap of which CPUs are online at boot time. */
>      uint8_t     vcpu_online[(HVM_MAX_VCPUS + 7)/8];
> +
> +    /* Virtual NUMA information */
> +    uint32_t    nr_nodes;
> +    uint8_t     vcpu_to_vnode[HVM_MAX_VCPUS];
> +    uint32_t    nr_vmemranges;
> +    struct hvm_info_vmemrange vmemranges[HVM_MAX_VMEMRANGES];
> +    uint64_t    nr_localities;
> +    uint8_t     localities[HVM_MAX_LOCALITIES];
>  };
>  
>  #endif /* __XEN_PUBLIC_HVM_HVM_INFO_TABLE_H__ */

Is this really the right place? This is a public interface, which we
shouldn't modify in ways making future changes more cumbersome.
In particular, once we finally get the LAPIC ID brokenness fixed,
HVM_MAX_VCPUS won't need to be limited to 128 anymore. And
we likely would want to keep things simple and retain the bitmap
where it currently sits, just extending its size. With all of the data
above (supposedly, or we made a mistake somewhere) being
retrievable via hypercall, what is the rationale for doing things
this way in the first place (the lack of any kind of description is of
course not really helpful here)?

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 11/19] hvmloader: add new fields for vNUMA information
  2014-11-24  9:58   ` Jan Beulich
@ 2014-11-24 10:07     ` Wei Liu
  2014-11-24 10:22       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-24 10:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Mon, Nov 24, 2014 at 09:58:29AM +0000, Jan Beulich wrote:
> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> > --- a/xen/include/public/hvm/hvm_info_table.h
> > +++ b/xen/include/public/hvm/hvm_info_table.h
> > @@ -32,6 +32,17 @@
> >  /* Maximum we can support with current vLAPIC ID mapping. */
> >  #define HVM_MAX_VCPUS        128
> >  
> > +#define HVM_MAX_NODES         16
> > +#define HVM_MAX_LOCALITIES    (HVM_MAX_NODES * HVM_MAX_NODES)
> > +
> > +#define HVM_MAX_VMEMRANGES    64
> > +struct hvm_info_vmemrange {
> > +    uint64_t start;
> > +    uint64_t end;
> > +    uint32_t flags;
> > +    uint32_t nid;
> > +};
> > +
> >  struct hvm_info_table {
> >      char        signature[8]; /* "HVM INFO" */
> >      uint32_t    length;
> > @@ -67,6 +78,14 @@ struct hvm_info_table {
> >  
> >      /* Bitmap of which CPUs are online at boot time. */
> >      uint8_t     vcpu_online[(HVM_MAX_VCPUS + 7)/8];
> > +
> > +    /* Virtual NUMA information */
> > +    uint32_t    nr_nodes;
> > +    uint8_t     vcpu_to_vnode[HVM_MAX_VCPUS];
> > +    uint32_t    nr_vmemranges;
> > +    struct hvm_info_vmemrange vmemranges[HVM_MAX_VMEMRANGES];
> > +    uint64_t    nr_localities;
> > +    uint8_t     localities[HVM_MAX_LOCALITIES];
> >  };
> >  
> >  #endif /* __XEN_PUBLIC_HVM_HVM_INFO_TABLE_H__ */
> 
> Is this really the right place? This is a public interface, which we
> shouldn't modify in ways making future changes more cumbersome.
> In particular, once we finally get the LAPIC ID brokenness fixed,
> HVM_MAX_VCPUS won't need to be limited to 128 anymore. And
> we likely would want to keep things simple an retain the bitmap
> where it currently sits, just extending its size. With all of the data
> above (supposedly, or we made a mistake somewhere) being
> retrievable via hypercall, what is the rationale for doing things
> this way in the first place (the lack of any kind of description is of
> course not really helpful here)?
> 

My thought was that this is the interface between libxl and hvmloader,
and the way it's used suggests that this is the canonical way of doing
things. I might have got this wrong in the first place.

So I take it you're of the opinion this piece of information should be
retrieved via hypercall in hvmloader, right? That's OK (and even better)
for me.
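
Roughly, hvmloader would then do something like the sketch below early
on (buffer sizes and exact field names are illustrative and still to be
confirmed against the final interface):

    struct xen_vnuma_topology_info vnuma;
    static uint32_t vdistance[16 * 16];            /* illustrative limits, */
    static uint32_t vcpu_to_vnode[HVM_MAX_VCPUS];  /* local to hvmloader   */
    static xen_vmemrange_t vmemrange[64];

    memset(&vnuma, 0, sizeof(vnuma));
    vnuma.domid = DOMID_SELF;
    vnuma.nr_vnodes = 16;
    vnuma.nr_vcpus = HVM_MAX_VCPUS;
    vnuma.nr_vmemranges = 64;
    set_xen_guest_handle(vnuma.vdistance.h, vdistance);
    set_xen_guest_handle(vnuma.vcpu_to_vnode.h, vcpu_to_vnode);
    set_xen_guest_handle(vnuma.vmemrange.h, vmemrange);

    if ( hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma) != 0 )
        return;    /* no vNUMA info for this guest -- skip SRAT/SLIT */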

Wei.


> Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 12/19] hvmloader: construct SRAT
  2014-11-21 15:06 ` [PATCH 12/19] hvmloader: construct SRAT Wei Liu
@ 2014-11-24 10:08   ` Jan Beulich
  2014-11-24 10:13     ` Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2014-11-24 10:08 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> --- a/tools/firmware/hvmloader/acpi/build.c
> +++ b/tools/firmware/hvmloader/acpi/build.c
> @@ -203,6 +203,66 @@ static struct acpi_20_waet *construct_waet(void)
>      return waet;
>  }
>  
> +static struct acpi_20_srat *construct_srat(void)
> +{
> +    struct acpi_20_srat *srat;
> +    struct acpi_20_srat_processor *processor;
> +    struct acpi_20_srat_memory *memory;
> +    unsigned int size;
> +    void *p;
> +    int i;
> +    uint64_t mem;
> +
> +    size = sizeof(*srat) + sizeof(*processor) * hvm_info->nr_vcpus +
> +        sizeof(*memory) * hvm_info->nr_vmemranges;
> +
> +    p = mem_alloc(size, 16);
> +    if (!p) return NULL;
> +
> +    srat = p;
> +    memset(srat, 0, sizeof(*srat));
> +    srat->header.signature    = ACPI_2_0_SRAT_SIGNATURE;
> +    srat->header.revision     = ACPI_2_0_SRAT_REVISION;
> +    fixed_strcpy(srat->header.oem_id, ACPI_OEM_ID);
> +    fixed_strcpy(srat->header.oem_table_id, ACPI_OEM_TABLE_ID);
> +    srat->header.oem_revision = ACPI_OEM_REVISION;
> +    srat->header.creator_id   = ACPI_CREATOR_ID;
> +    srat->header.creator_revision = ACPI_CREATOR_REVISION;
> +    srat->table_revision      = ACPI_SRAT_TABLE_REVISION;
> +
> +    processor = (struct acpi_20_srat_processor *)(srat + 1);
> +    for ( i = 0; i < hvm_info->nr_vcpus; i++ )
> +    {
> +        memset(processor, 0, sizeof(*processor));
> +        processor->type     = ACPI_PROCESSOR_AFFINITY;
> +        processor->length   = sizeof(*processor);
> +        processor->domain   = hvm_info->vcpu_to_vnode[i];
> +        processor->apic_id  = LAPIC_ID(i);
> +        processor->flags    = ACPI_LOCAL_APIC_AFFIN_ENABLED;
> +        processor->sapic_id = 0;

Either you initialize all fields explicitly and drop the memset(), or
you consistently avoid explicit zero initializers (as being redundant).
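
E.g., keeping the memset() and simply dropping the redundant store
(sketch based on the quoted loop):

    memset(processor, 0, sizeof(*processor));
    processor->type     = ACPI_PROCESSOR_AFFINITY;
    processor->length   = sizeof(*processor);
    processor->domain   = hvm_info->vcpu_to_vnode[i];
    processor->apic_id  = LAPIC_ID(i);
    processor->flags    = ACPI_LOCAL_APIC_AFFIN_ENABLED;
    /* sapic_id and the reserved fields are already zero from the memset(). */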

> @@ -270,6 +331,13 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
>          table_ptrs[nr_tables++] = (unsigned long)madt;
>      }
>  
> +    if ( hvm_info->nr_nodes > 0 )
> +    {
> +        srat = construct_srat();
> +        if (!srat) return -1;

I don't think failure to set up secondary information (especially when
it requires a variable size table and hence has [slightly] higher
likelihood of table space allocation failing) should result in skipping
other table setup.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 13/19] hvmloader: construct SLIT
  2014-11-21 15:06 ` [PATCH 13/19] hvmloader: construct SLIT Wei Liu
@ 2014-11-24 10:11   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2014-11-24 10:11 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> --- a/tools/firmware/hvmloader/acpi/build.c
> +++ b/tools/firmware/hvmloader/acpi/build.c
> @@ -263,6 +263,38 @@ static struct acpi_20_srat *construct_srat(void)
>      return srat;
>  }
>  
> +static struct acpi_20_slit *construct_slit(void)
> +{
> +    struct acpi_20_slit *slit;
> +    unsigned int num, size;
> +    int i;
>[...]
> +    for ( i = 0; i < num; i++ )

How can i be signed here when num is unsigned? Even without the
common desire to have variables that can't be negative declared
unsigned, inconsistencies like this should be avoided.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 12/19] hvmloader: construct SRAT
  2014-11-24 10:08   ` Jan Beulich
@ 2014-11-24 10:13     ` Wei Liu
  2014-11-24 10:26       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Wei Liu @ 2014-11-24 10:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Mon, Nov 24, 2014 at 10:08:42AM +0000, Jan Beulich wrote:
> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> > --- a/tools/firmware/hvmloader/acpi/build.c
> > +++ b/tools/firmware/hvmloader/acpi/build.c
> > @@ -203,6 +203,66 @@ static struct acpi_20_waet *construct_waet(void)
> >      return waet;
> >  }
> >  
> > +static struct acpi_20_srat *construct_srat(void)
> > +{
> > +    struct acpi_20_srat *srat;
> > +    struct acpi_20_srat_processor *processor;
> > +    struct acpi_20_srat_memory *memory;
> > +    unsigned int size;
> > +    void *p;
> > +    int i;
> > +    uint64_t mem;
> > +
> > +    size = sizeof(*srat) + sizeof(*processor) * hvm_info->nr_vcpus +
> > +        sizeof(*memory) * hvm_info->nr_vmemranges;
> > +
> > +    p = mem_alloc(size, 16);
> > +    if (!p) return NULL;
> > +
> > +    srat = p;
> > +    memset(srat, 0, sizeof(*srat));
> > +    srat->header.signature    = ACPI_2_0_SRAT_SIGNATURE;
> > +    srat->header.revision     = ACPI_2_0_SRAT_REVISION;
> > +    fixed_strcpy(srat->header.oem_id, ACPI_OEM_ID);
> > +    fixed_strcpy(srat->header.oem_table_id, ACPI_OEM_TABLE_ID);
> > +    srat->header.oem_revision = ACPI_OEM_REVISION;
> > +    srat->header.creator_id   = ACPI_CREATOR_ID;
> > +    srat->header.creator_revision = ACPI_CREATOR_REVISION;
> > +    srat->table_revision      = ACPI_SRAT_TABLE_REVISION;
> > +
> > +    processor = (struct acpi_20_srat_processor *)(srat + 1);
> > +    for ( i = 0; i < hvm_info->nr_vcpus; i++ )
> > +    {
> > +        memset(processor, 0, sizeof(*processor));
> > +        processor->type     = ACPI_PROCESSOR_AFFINITY;
> > +        processor->length   = sizeof(*processor);
> > +        processor->domain   = hvm_info->vcpu_to_vnode[i];
> > +        processor->apic_id  = LAPIC_ID(i);
> > +        processor->flags    = ACPI_LOCAL_APIC_AFFIN_ENABLED;
> > +        processor->sapic_id = 0;
> 
> Either you initialize all fields explicitly and drop the memset(), or
> you consistently avoid explicit zero initializers (as being redundant).
> 

Ack.

> > @@ -270,6 +331,13 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
> >          table_ptrs[nr_tables++] = (unsigned long)madt;
> >      }
> >  
> > +    if ( hvm_info->nr_nodes > 0 )
> > +    {
> > +        srat = construct_srat();
> > +        if (!srat) return -1;
> 
> I don't think failure to set up secondary information (especially when
> it requires a variable size table and hence has [slightly] higher
> likelihood of table space allocation failing) should result in skipping
> other table setup.
> 

But MADT, HPET and WAET are treated like that. I want to be
consistent.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
  2014-11-21 15:06 ` [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled Wei Liu
  2014-11-21 19:56   ` Konrad Rzeszutek Wilk
@ 2014-11-24 10:15   ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2014-11-24 10:15 UTC (permalink / raw)
  To: Wei Liu; +Cc: GeorgeDunlap, xen-devel

>>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

So this is the fourth patch now without any description whatsoever.

> --- a/tools/firmware/hvmloader/pci.c
> +++ b/tools/firmware/hvmloader/pci.c
> @@ -88,6 +88,19 @@ void pci_setup(void)
>      printf("Relocating guest memory for lowmem MMIO space %s\n",
>             allow_memory_relocate?"enabled":"disabled");
>  
> +    /* Disallow low memory relocation when vNUMA is enabled, because
> +     * relocated memory ends up off node. Further more, even if we
> +     * dynamically expand node coverage in hvmloader, low memory and
> +     * high memory may reside in different physical nodes, blindly
> +     * relocates low memory to high memory gives us a sub-optimal
> +     * configuration.
> +     */
> +    if ( hvm_info->nr_nodes != 0 && allow_memory_relocate )
> +    {
> +        allow_memory_relocate = false;
> +        printf("vNUMA enabled, relocating guest memory for lowmem MMIO space disabled\n");
> +    }

Apart from the comment violating our coding style, as already
indicated in the reply to Konrad's comment I don't think this is
the right approach. If it is meant to be a temporary measure, the
comment should say so (and perhaps have a TBD or similar grep-
able mark in it).
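
E.g., something along these lines (sketch, following the usual comment
style):

    /*
     * TBD: temporary measure -- disallow low memory relocation when
     * vNUMA is enabled, until the toolstack can size the MMIO hole
     * itself and build a vNUMA layout that accounts for it.
     */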

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 11/19] hvmloader: add new fields for vNUMA information
  2014-11-24 10:07     ` Wei Liu
@ 2014-11-24 10:22       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2014-11-24 10:22 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 24.11.14 at 11:07, <wei.liu2@citrix.com> wrote:
> On Mon, Nov 24, 2014 at 09:58:29AM +0000, Jan Beulich wrote:
>> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
>> > --- a/xen/include/public/hvm/hvm_info_table.h
>> > +++ b/xen/include/public/hvm/hvm_info_table.h
>> > @@ -32,6 +32,17 @@
>> >  /* Maximum we can support with current vLAPIC ID mapping. */
>> >  #define HVM_MAX_VCPUS        128
>> >  
>> > +#define HVM_MAX_NODES         16
>> > +#define HVM_MAX_LOCALITIES    (HVM_MAX_NODES * HVM_MAX_NODES)
>> > +
>> > +#define HVM_MAX_VMEMRANGES    64
>> > +struct hvm_info_vmemrange {
>> > +    uint64_t start;
>> > +    uint64_t end;
>> > +    uint32_t flags;
>> > +    uint32_t nid;
>> > +};
>> > +
>> >  struct hvm_info_table {
>> >      char        signature[8]; /* "HVM INFO" */
>> >      uint32_t    length;
>> > @@ -67,6 +78,14 @@ struct hvm_info_table {
>> >  
>> >      /* Bitmap of which CPUs are online at boot time. */
>> >      uint8_t     vcpu_online[(HVM_MAX_VCPUS + 7)/8];
>> > +
>> > +    /* Virtual NUMA information */
>> > +    uint32_t    nr_nodes;
>> > +    uint8_t     vcpu_to_vnode[HVM_MAX_VCPUS];
>> > +    uint32_t    nr_vmemranges;
>> > +    struct hvm_info_vmemrange vmemranges[HVM_MAX_VMEMRANGES];
>> > +    uint64_t    nr_localities;
>> > +    uint8_t     localities[HVM_MAX_LOCALITIES];
>> >  };
>> >  
>> >  #endif /* __XEN_PUBLIC_HVM_HVM_INFO_TABLE_H__ */
>> 
>> Is this really the right place? This is a public interface, which we
>> shouldn't modify in ways making future changes more cumbersome.
>> In particular, once we finally get the LAPIC ID brokenness fixed,
>> HVM_MAX_VCPUS won't need to be limited to 128 anymore. And
>> we likely would want to keep things simple and retain the bitmap
>> where it currently sits, just extending its size. With all of the data
>> above (supposedly, or we made a mistake somewhere) being
>> retrievable via hypercall, what is the rationale for doing things
>> this way in the first place (the lack of any kind of description is of
>> course not really helpful here)?
>> 
> 
> My thought was that this is the interface between libxl and hvmloader,
> and the way it's used suggests that this is the canonical way of doing things. I
> might have got this wrong in the first place.
> 
> So I take it you're of the opinion this piece of information should be
> retrieved via hypercall in hvmloader, right? That's OK (and even better)
> for me.

Yes - that other interface should be used only for things that
can't be communicated to the guest in another (sane) way.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 12/19] hvmloader: construct SRAT
  2014-11-24 10:13     ` Wei Liu
@ 2014-11-24 10:26       ` Jan Beulich
  2014-11-24 10:46         ` Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2014-11-24 10:26 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

>>> On 24.11.14 at 11:13, <wei.liu2@citrix.com> wrote:
> On Mon, Nov 24, 2014 at 10:08:42AM +0000, Jan Beulich wrote:
>> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
>> > @@ -270,6 +331,13 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
>> >          table_ptrs[nr_tables++] = (unsigned long)madt;
>> >      }
>> >  
>> > +    if ( hvm_info->nr_nodes > 0 )
>> > +    {
>> > +        srat = construct_srat();
>> > +        if (!srat) return -1;
>> 
>> I don't think failure to set up secondary information (especially when
>> it requires a variable size table and hence has [slightly] higher
>> likelihood of table space allocation failing) should result in skipping
>> other table setup.
> 
> But MADT, HPET and WAET are treated like that. I want to be
> consistent.

I kind of expected you to say that, and specifically added the
reference to SRAT and SLIT being variable size (and perhaps
relatively big). While MADT is variable size too, it (other than the
tables you add here) is kind of essential for the guest to come up
in ACPI mode. Which btw also tells us that these two tables
should be added as late as possible, to avoid them exhausting
memory before other, essential allocations got done.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 12/19] hvmloader: construct SRAT
  2014-11-24 10:26       ` Jan Beulich
@ 2014-11-24 10:46         ` Wei Liu
  0 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-24 10:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Mon, Nov 24, 2014 at 10:26:44AM +0000, Jan Beulich wrote:
> >>> On 24.11.14 at 11:13, <wei.liu2@citrix.com> wrote:
> > On Mon, Nov 24, 2014 at 10:08:42AM +0000, Jan Beulich wrote:
> >> >>> On 21.11.14 at 16:06, <wei.liu2@citrix.com> wrote:
> >> > @@ -270,6 +331,13 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
> >> >          table_ptrs[nr_tables++] = (unsigned long)madt;
> >> >      }
> >> >  
> >> > +    if ( hvm_info->nr_nodes > 0 )
> >> > +    {
> >> > +        srat = construct_srat();
> >> > +        if (!srat) return -1;
> >> 
> >> I don't think failure to set up secondary information (especially when
> >> it requires a variable size table and hence has [slightly] higher
> >> likelihood of table space allocation failing) should result in skipping
> >> other table setup.
> > 
> > But MADT, HPET and WAET are treated like that. I want to be
> > consistent.
> 
> I kind of expected you to say that, and specifically added the
> reference to SRAT and SLIT being variable size (and perhaps
> relatively big). While MADT is variable size too, it (other than the
> tables you add here) is kind of essential for the guest to come up
> in ACPI mode. Which btw also tells us that these two tables
> should be added as late as possible, to avoid them exhausting
> memory before other, essential allocations got done.
> 

So the plan is:

1. Move SRAT and SLIT down.
2. Don't return -1 on failure (and print a warning instead); see the sketch below.
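
Sketch of what construct_secondary_tables() would then do for the SRAT
(illustrative):

    /*
     * Built after the essential tables; failure to allocate the SRAT
     * is not fatal.
     */
    if ( hvm_info->nr_nodes > 0 )
    {
        srat = construct_srat();
        if ( srat )
            table_ptrs[nr_tables++] = (unsigned long)srat;
        else
            printf("Failed to build SRAT, skipping\n");
    }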

Wei.

> Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 17/19] libxl: refactor hvm_build_set_params
  2014-11-21 15:06 ` [PATCH 17/19] libxl: refactor hvm_build_set_params Wei Liu
@ 2014-11-25 10:06   ` Wei Liu
  0 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-25 10:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

This patch can be ignored because it's going to be dropped in v2.

Wei

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 18/19] libxl: fill vNUMA information in hvm info
  2014-11-21 15:07 ` [PATCH 18/19] libxl: fill vNUMA information in hvm info Wei Liu
@ 2014-11-25 10:06   ` Wei Liu
  0 siblings, 0 replies; 44+ messages in thread
From: Wei Liu @ 2014-11-25 10:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Jackson, Dario Faggioli, Wei Liu, Ian Campbell, Elena Ufimtseva

This is going to be dropped in v2, please ignore this one.

Wei.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2014-11-25 10:06 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-21 15:06 [PATCH 00/19] Virtual NUMA for PV and HVM Wei Liu
2014-11-21 15:06 ` [PATCH 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
2014-11-21 16:39   ` Jan Beulich
2014-11-21 15:06 ` [PATCH 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
2014-11-21 17:03   ` Jan Beulich
2014-11-21 17:30     ` Wei Liu
2014-11-21 15:06 ` [PATCH 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
2014-11-21 15:06 ` [PATCH 04/19] libxl: add emacs local variables in libxl_{x86, arm}.c Wei Liu
2014-11-21 15:06 ` [PATCH 05/19] libxl: introduce vNUMA types Wei Liu
2014-11-21 15:06 ` [PATCH 06/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
2014-11-21 15:06 ` [PATCH 07/19] libxl: introduce libxl__vnuma_config_check Wei Liu
2014-11-21 15:06 ` [PATCH 08/19] libxl: x86: factor out e820_host_sanitize Wei Liu
2014-11-21 15:06 ` [PATCH 09/19] libxl: functions to build vmemranges for PV guest Wei Liu
2014-11-21 15:06 ` [PATCH 10/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
2014-11-21 15:06 ` [PATCH 11/19] hvmloader: add new fields for vNUMA information Wei Liu
2014-11-24  9:58   ` Jan Beulich
2014-11-24 10:07     ` Wei Liu
2014-11-24 10:22       ` Jan Beulich
2014-11-21 15:06 ` [PATCH 12/19] hvmloader: construct SRAT Wei Liu
2014-11-24 10:08   ` Jan Beulich
2014-11-24 10:13     ` Wei Liu
2014-11-24 10:26       ` Jan Beulich
2014-11-24 10:46         ` Wei Liu
2014-11-21 15:06 ` [PATCH 13/19] hvmloader: construct SLIT Wei Liu
2014-11-24 10:11   ` Jan Beulich
2014-11-21 15:06 ` [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled Wei Liu
2014-11-21 19:56   ` Konrad Rzeszutek Wilk
2014-11-24  9:21     ` Wei Liu
2014-11-24  9:29     ` Jan Beulich
2014-11-24 10:15   ` Jan Beulich
2014-11-21 15:06 ` [PATCH 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
2014-11-21 15:06 ` [PATCH 16/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
2014-11-21 15:06 ` [PATCH 17/19] libxl: refactor hvm_build_set_params Wei Liu
2014-11-25 10:06   ` Wei Liu
2014-11-21 15:07 ` [PATCH 18/19] libxl: fill vNUMA information in hvm info Wei Liu
2014-11-25 10:06   ` Wei Liu
2014-11-21 15:07 ` [PATCH 19/19] xl: vNUMA support Wei Liu
2014-11-21 16:25 ` [PATCH 00/19] Virtual NUMA for PV and HVM Jan Beulich
2014-11-21 16:35   ` Wei Liu
2014-11-21 16:42     ` Jan Beulich
2014-11-21 16:55       ` Wei Liu
2014-11-21 17:05         ` Jan Beulich
2014-11-21 20:01 ` Konrad Rzeszutek Wilk
2014-11-21 20:44   ` Wei Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.