* [PATCH v4 00/21] Virtual NUMA for PV and HVM
@ 2015-01-23 11:13 Wei Liu
  2015-01-23 11:13 ` [PATCH v4 01/21] xen: dump vNUMA information with debug key "u" Wei Liu
                   ` (20 more replies)
  0 siblings, 21 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Hi all

This is version 4 of this series rebased on top of master.

This patch series implements virtual NUMA support for both PV and HVM guests.
That is, the admin can configure via libxl what virtual NUMA topology the
guest sees.

This is stage 1 (basic vNUMA support) and part of stage 2 (vNUMA-aware
ballooning, hypervisor side) described in my previous email to xen-devel [0].

This series is broken into several parts:

1. xen patches: vNUMA debug output and vNUMA-aware memory hypercall support.
2. libxc/libxl support for PV vNUMA.
3. libxc/libxl/hypervisor support for HVM vNUMA.
4. xl vNUMA configuration documentation and parser.

One significant difference from Elena's work is that this patch series makes
use of multiple vmemranges should there be a memory hole, instead of shrinking
RAM. This matches the behaviour of real hardware.
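
As an illustration (the addresses below are made up for this example), a
vnode that straddles a host memory hole is described by two vmemranges
sharing the same nid, rather than by one shrunken range:

    /* Sketch with hypothetical addresses: vnode 0 has 4GB of RAM but
     * there is a hole at [0xe0000000, 0x100000000), so the vnode is
     * described by two vmemranges instead of having its RAM shrunk. */
    xen_vmemrange_t vmemranges[] = {
        { .start = 0x00000000ULL,  .end = 0xe0000000ULL,  .flags = 0, .nid = 0 },
        { .start = 0x100000000ULL, .end = 0x120000000ULL, .flags = 0, .nid = 0 },
    };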

The vNUMA auto placement algorithm is missing at the moment and Dario is
working on it.

This series can be found at:
 git://xenbits.xen.org/people/liuw/xen.git wip.vnuma-v4

With this series, the following configuration can be used to enable virtual
NUMA support, and it works for both PV and HVM guests.

memory = 6000
vnuma_memory = [3000, 3000]
vnuma_vcpu_map = [0, 1]
vnuma_pnode_map = [0, 0]
vnuma_vdistances = [ [10,30], [30,10] ]

For example output of guest NUMA information, please see [1].

In terms of libxl / libxc internals, things are broken into several
parts:

1. libxl interface

Users of libxl can only specify how many vnodes a guest can have, but
currently they have no control over the actual memory layout (vmemranges).
Note that it would be fairly easy to export an interface to control the
memory layout in the future.
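
For illustration, here is a minimal sketch of how a libxl caller (such as
xl) might fill the new vnuma_nodes array for the example configuration
above. It assumes the types added in patch 4, the IDL-generated
libxl_vnode_info_init helper and the usual libxl bitmap helpers; d_config,
ctx and error handling are hypothetical or omitted:

    libxl_domain_build_info *b = &d_config.b_info;
    uint32_t dist[2][2] = { { 10, 30 }, { 30, 10 } };
    int i, j;

    b->num_vnuma_nodes = 2;
    b->vnuma_nodes = calloc(2, sizeof(*b->vnuma_nodes));
    for (i = 0; i < 2; i++) {
        libxl_vnode_info *v = &b->vnuma_nodes[i];

        libxl_vnode_info_init(v);
        v->memkb = 3000 * 1024;            /* 3000MB per vnode */
        v->pnode = 0;                      /* both vnodes backed by pnode 0 */
        v->num_distances = 2;
        v->distances = calloc(2, sizeof(*v->distances));
        for (j = 0; j < 2; j++)
            v->distances[j] = dist[i][j];
        libxl_cpu_bitmap_alloc(ctx, &v->vcpus, b->max_vcpus);
        libxl_bitmap_set_none(&v->vcpus);
        libxl_bitmap_set(&v->vcpus, i);    /* vcpu i -> vnode i */
    }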

2. libxl internal

It generates some internal vNUMA configurations when building the domain,
then transforms them into libxc representations. It also validates the
vNUMA configuration along the way.

3. libxc internal

Libxc does what it's told to do. It doesn't do anything smart (in fact, I
deliberately didn't put any smart logic inside it). In the HVM case libxc
also reports some information back to libxl, but that's it.

Wei.

[0] <20141111173606.GC21312@zion.uk.xensource.com>
[1] <1416582421-10789-1-git-send-email-wei.liu2@citrix.com>

Changes in v4:
1. Address comments from many people.
2. Break down the libxlu patch into three.
3. Use dedicated patches for non-functional changes.

Changes in v3:
1. Address comments made by Jan.
2. More commit messages and comments.
3. Shorten some error messages.

Changes in v2:
1. Make vnuma_vdistances mandatory.
2. Use nested list to specify distances among nodes.
3. Hvmloader uses hypercall to retrieve vNUMA information.
4. Fix some problems spotted by Jan.


Wei Liu (21):
  xen: dump vNUMA information with debug key "u"
  xen: make two memory hypercalls vNUMA-aware
  libxc: allocate memory with vNUMA information for PV guest
  libxl: introduce vNUMA types
  libxl: add vmemrange to libxl__domain_build_state
  libxl: introduce libxl__vnuma_config_check
  libxl: x86: factor out e820_host_sanitize
  libxl: functions to build vmemranges for PV guest
  libxl: build, check and pass vNUMA info to Xen for PV guest
  xen: handle XENMEM_get_vnumainfo in compat_memory_op
  hvmloader: retrieve vNUMA information from hypervisor
  hvmloader: construct SRAT
  hvmloader: construct SLIT
  libxc: indentation change to xc_hvm_build_x86.c
  libxc: allocate memory with vNUMA information for HVM guest
  libxl: build, check and pass vNUMA info to Xen for HVM guest
  libxl: disallow memory relocation when vNUMA is enabled
  libxlu: rework internal representation of setting
  libxlu: nested list support
  libxlu: introduce new APIs
  xl: vNUMA support

 docs/man/xl.cfg.pod.5                   |   39 ++++++
 tools/firmware/hvmloader/Makefile       |    2 +-
 tools/firmware/hvmloader/acpi/acpi2_0.h |   61 ++++++++
 tools/firmware/hvmloader/acpi/build.c   |  110 +++++++++++++++
 tools/firmware/hvmloader/hvmloader.c    |    3 +
 tools/firmware/hvmloader/vnuma.c        |   96 +++++++++++++
 tools/firmware/hvmloader/vnuma.h        |   52 +++++++
 tools/libxc/include/xc_dom.h            |    6 +
 tools/libxc/include/xenguest.h          |   15 ++
 tools/libxc/xc_dom_x86.c                |   85 ++++++++++--
 tools/libxc/xc_hvm_build_x86.c          |  229 ++++++++++++++++++++-----------
 tools/libxc/xc_private.h                |    2 +
 tools/libxl/Makefile                    |    2 +-
 tools/libxl/libxl_arch.h                |    6 +
 tools/libxl/libxl_arm.c                 |    9 ++
 tools/libxl/libxl_create.c              |    9 ++
 tools/libxl/libxl_dm.c                  |    6 +-
 tools/libxl/libxl_dom.c                 |   96 +++++++++++++
 tools/libxl/libxl_internal.h            |   18 +++
 tools/libxl/libxl_types.idl             |    9 ++
 tools/libxl/libxl_vnuma.c               |  227 ++++++++++++++++++++++++++++++
 tools/libxl/libxl_x86.c                 |  108 +++++++++++++--
 tools/libxl/libxlu_cfg.c                |  175 ++++++++++++++++-------
 tools/libxl/libxlu_cfg_i.h              |   13 +-
 tools/libxl/libxlu_cfg_y.c              |   46 +++----
 tools/libxl/libxlu_cfg_y.h              |    2 +-
 tools/libxl/libxlu_cfg_y.y              |   14 +-
 tools/libxl/libxlu_internal.h           |   22 ++-
 tools/libxl/libxlutil.h                 |   11 ++
 tools/libxl/xl_cmdimpl.c                |  147 ++++++++++++++++++++
 xen/arch/x86/numa.c                     |   69 +++++++++-
 xen/common/compat/memory.c              |   48 +++++++
 xen/common/kernel.c                     |    2 +-
 xen/common/memory.c                     |   58 +++++++-
 xen/include/public/features.h           |    3 +
 xen/include/public/memory.h             |    2 +
 xen/include/xlat.lst                    |    2 +
 37 files changed, 1601 insertions(+), 203 deletions(-)
 create mode 100644 tools/firmware/hvmloader/vnuma.c
 create mode 100644 tools/firmware/hvmloader/vnuma.h
 create mode 100644 tools/libxl/libxl_vnuma.c

-- 
1.7.10.4

* [PATCH v4 01/21] xen: dump vNUMA information with debug key "u"
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-23 13:03   ` Jan Beulich
  2015-01-23 11:13 ` [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware Wei Liu
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
Changes in v4:
1. Acquire rwlock before accessing vnuma struct.
2. Improve output.

Changes in v3:
1. Constify struct vnuma_info.
2. Don't print amount of ram of a vmemrange.
3. Process softirqs when dumping information.
4. Fix format string.

Changes in v2:
1. Use unsigned int for loop vars.
2. Use strlcpy.
3. Properly align output.
---
 xen/arch/x86/numa.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 628a40a..f34677e 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -16,6 +16,7 @@
 #include <xen/pfn.h>
 #include <asm/acpi.h>
 #include <xen/sched.h>
+#include <xen/softirq.h>
 
 static int numa_setup(char *s);
 custom_param("numa", numa_setup);
@@ -363,10 +364,12 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
     s_time_t now = NOW();
-    int i;
+    unsigned int i, j;
+    int err;
     struct domain *d;
     struct page_info *page;
     unsigned int page_num_node[MAX_NUMNODES];
+    const struct vnuma_info *vnuma;
 
     printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
            (u32)(now>>32), (u32)now);
@@ -393,6 +396,8 @@ static void dump_numa(unsigned char key)
     printk("Memory location of each domain:\n");
     for_each_domain ( d )
     {
+        process_pending_softirqs();
+
         printk("Domain %u (total: %u):\n", d->domain_id, d->tot_pages);
 
         for_each_online_node ( i )
@@ -408,6 +413,68 @@ static void dump_numa(unsigned char key)
 
         for_each_online_node ( i )
             printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        read_lock(&d->vnuma_rwlock);
+        if ( !d->vnuma )
+        {
+            read_unlock(&d->vnuma_rwlock);
+            continue;
+        }
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
+               vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            unsigned int start_cpu = ~0U;
+
+            err = snprintf(keyhandler_scratch, 12, "%3u",
+                    vnuma->vnode_to_pnode[i]);
+            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                strlcpy(keyhandler_scratch, "???", 3);
+
+            printk("       %3u: pnode %s,", i, keyhandler_scratch);
+
+            printk(" vcpus ");
+
+            for ( j = 0; j < d->max_vcpus; j++ )
+            {
+                if ( !(j & 0x3f) )
+                    process_pending_softirqs();
+
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( start_cpu == ~0U )
+                    {
+                        printk("%d", j);
+                        start_cpu = j;
+                    }
+                } else if ( start_cpu != ~0U ) {
+                    if ( j - 1 != start_cpu )
+                        printk("-%d ", j - 1);
+                    else
+                        printk(" ");
+                    start_cpu = ~0U;
+                }
+            }
+
+            if ( start_cpu != ~0U  && start_cpu != j - 1 )
+                printk("-%d", j - 1);
+
+            printk("\n");
+
+            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
+            {
+                if ( vnuma->vmemrange[j].nid == i )
+                {
+                    printk("           %016"PRIx64" - %016"PRIx64"\n",
+                           vnuma->vmemrange[j].start,
+                           vnuma->vmemrange[j].end);
+                }
+            }
+        }
+
+        read_unlock(&d->vnuma_rwlock);
     }
 
     rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4

* [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
  2015-01-23 11:13 ` [PATCH v4 01/21] xen: dump vNUMA information with debug key "u" Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-23 13:16   ` Jan Beulich
  2015-01-23 11:13 ` [PATCH v4 03/21] libxc: allocate memory with vNUMA information for PV guest Wei Liu
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Make XENMEM_increase_reservation and XENMEM_populate_physmap
vNUMA-aware.

That is, if the guest requests that Xen allocate memory for a specific vnode,
Xen can translate the vnode to a pnode using that guest's vNUMA information.

XENMEMF_vnode is introduced for the guest to mark that the node number is in
fact a virtual node number which should be translated by Xen.

XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
able to translate a virtual node to a physical node.
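
For illustration, a rough sketch (not part of this series) of how a
vNUMA-aware guest might use the new flag, assuming the standard guest-side
hypercall wrapper HYPERVISOR_memory_op and that the guest has already seen
XENFEAT_memory_op_vnode_supported set; pfn, vnode and rc are hypothetical:

    /* Populate one page on virtual node `vnode`; Xen translates the
     * vnode to the corresponding pnode. */
    struct xen_memory_reservation res = {
        .nr_extents   = 1,
        .extent_order = 0,
        .mem_flags    = XENMEMF_vnode | XENMEMF_node(vnode),
        .domid        = DOMID_SELF,
    };

    set_xen_guest_handle(res.extent_start, &pfn);
    rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &res);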

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Coding style fix.
2. Remove redundant assignment.

Changes in v2:
1. Return start_extent when vnode translation fails.
2. Expose new feature bit to guest.
3. Fix typo in comment.
---
 xen/common/kernel.c           |    2 +-
 xen/common/memory.c           |   58 ++++++++++++++++++++++++++++++++++++++---
 xen/include/public/features.h |    3 +++
 xen/include/public/memory.h   |    2 ++
 4 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 0d9e519..e5e0050 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -301,7 +301,7 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         switch ( fi.submap_idx )
         {
         case 0:
-            fi.submap = 0;
+            fi.submap = (1U << XENFEAT_memory_op_vnode_supported);
             if ( VM_ASSIST(d, VMASST_TYPE_pae_extended_cr3) )
                 fi.submap |= (1U << XENFEAT_pae_pgdir_above_4gb);
             if ( paging_mode_translate(current->domain) )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index e84ace9..580ac61 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,50 @@ out:
     return rc;
 }
 
+static int translate_vnode_to_pnode(struct domain *d,
+                                    struct xen_memory_reservation *r,
+                                    struct memop_args *a)
+{
+    int rc = 0;
+    unsigned int vnode, pnode;
+
+    /*
+     * Note: we don't strictly require non-supported bits set to zero,
+     * so we may have exact_vnode bit set for old guests that don't
+     * support vNUMA.
+     *
+     * To distinguish spurious vnode request v.s. real one, check if
+     * d->vnuma exists.
+     */
+    if ( r->mem_flags & XENMEMF_vnode )
+    {
+        read_lock(&d->vnuma_rwlock);
+        if ( d->vnuma )
+        {
+            vnode = XENMEMF_get_node(r->mem_flags);
+
+            if ( vnode < d->vnuma->nr_vnodes )
+            {
+                pnode = d->vnuma->vnode_to_pnode[vnode];
+
+                a->memflags &=
+                    ~MEMF_node(XENMEMF_get_node(r->mem_flags));
+
+                if ( pnode != NUMA_NO_NODE )
+                {
+                    a->memflags |= MEMF_exact_node;
+                    a->memflags |= MEMF_node(pnode);
+                }
+            }
+            else
+                rc = -EINVAL;
+        }
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    return rc;
+}
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -734,10 +778,6 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             args.memflags = MEMF_bits(address_bits);
         }
 
-        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
-        if ( reservation.mem_flags & XENMEMF_exact_node_request )
-            args.memflags |= MEMF_exact_node;
-
         if ( op == XENMEM_populate_physmap
              && (reservation.mem_flags & XENMEMF_populate_on_demand) )
             args.memflags |= MEMF_populate_on_demand;
@@ -747,6 +787,16 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             return start_extent;
         args.domain = d;
 
+        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
+        if ( reservation.mem_flags & XENMEMF_exact_node_request )
+            args.memflags |= MEMF_exact_node;
+
+        if ( translate_vnode_to_pnode(d, &reservation, &args) )
+        {
+            rcu_unlock_domain(d);
+            return start_extent;
+        }
+
         if ( xsm_memory_adjust_reservation(XSM_TARGET, current->domain, d) )
         {
             rcu_unlock_domain(d);
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 16d92aa..2110b04 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -99,6 +99,9 @@
 #define XENFEAT_grant_map_identity        12
  */
 
+/* Guest can use XENMEMF_vnode to specify virtual node for memory op. */
+#define XENFEAT_memory_op_vnode_supported 13
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 595f953..2b5206b 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -55,6 +55,8 @@
 /* Flag to request allocation only from the node specified */
 #define XENMEMF_exact_node_request  (1<<17)
 #define XENMEMF_exact_node(n) (XENMEMF_node(n) | XENMEMF_exact_node_request)
+/* Flag to indicate the node specified is virtual node */
+#define XENMEMF_vnode  (1<<18)
 #endif
 
 struct xen_memory_reservation {
-- 
1.7.10.4

* [PATCH v4 03/21] libxc: allocate memory with vNUMA information for PV guest
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
  2015-01-23 11:13 ` [PATCH v4 01/21] xen: dump vNUMA information with debug key "u" Wei Liu
  2015-01-23 11:13 ` [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:02   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 04/21] libxl: introduce vNUMA types Wei Liu
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

From libxc's point of view, it only needs to know the vnode to pnode mapping
and the size of each vnode to allocate memory accordingly. Add these fields
to the xc_dom structure.

The caller might not pass in vNUMA information. In that case, a dummy
layout is generated for the convenience of libxc's allocation code. The
upper layer (libxl etc.) still sees the domain as having no vNUMA
configuration.
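
For context, the PV build code in libxl (patch 9 later in this series) ends
up filling these fields roughly as follows (condensed, error handling
omitted):

    /* One xc_vnuma_info entry per vnode; sizes are converted from KiB
     * to 4K pages (memkb >> 2). */
    dom->nr_vnuma_info = info->num_vnuma_nodes;
    dom->vnuma_info = xc_dom_malloc(dom, sizeof(*dom->vnuma_info) *
                                    dom->nr_vnuma_info);
    for (i = 0; i < dom->nr_vnuma_info; i++) {
        dom->vnuma_info[i].vnode = i;
        dom->vnuma_info[i].pnode = info->vnuma_nodes[i].pnode;
        dom->vnuma_info[i].pages = info->vnuma_nodes[i].memkb >> 2;
    }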

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Pack fields into a struct.
2. Use "page" as unit.
3. __FUNCTION__ -> __func__.
4. Don't print total_pages.
5. Improve comment.

Changes in v3:
1. Rewrite commit log.
2. Shorten some error messages.
---
 tools/libxc/include/xc_dom.h   |    6 +++
 tools/libxc/include/xenguest.h |    6 +++
 tools/libxc/xc_dom_x86.c       |   85 ++++++++++++++++++++++++++++++++++------
 tools/libxc/xc_private.h       |    2 +
 4 files changed, 88 insertions(+), 11 deletions(-)

diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index 07d7224..dba2e27 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -50,6 +50,8 @@ struct xc_dom_phys {
     xen_pfn_t count;
 };
 
+struct xc_vnuma_info;
+
 struct xc_dom_image {
     /* files */
     void *kernel_blob;
@@ -167,6 +169,10 @@ struct xc_dom_image {
     struct xc_dom_loader *kernel_loader;
     void *private_loader;
 
+    /* vNUMA information */
+    struct xc_vnuma_info *vnuma_info;
+    unsigned int nr_vnuma_info;
+
     /* kernel loader */
     struct xc_dom_arch *arch_hooks;
     /* allocate up to virt_alloc_end */
diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 40bbac8..9b7edff 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -217,6 +217,12 @@ struct xc_hvm_firmware_module {
     uint64_t  guest_addr_out;
 };
 
+struct xc_vnuma_info {
+    unsigned int vnode;
+    unsigned int pnode;
+    uint64_t     pages;
+};
+
 struct xc_hvm_build_args {
     uint64_t mem_size;           /* Memory size in bytes. */
     uint64_t mem_target;         /* Memory target in bytes. */
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index bf06fe4..6b42187 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -759,7 +759,8 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
 int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
-    xen_pfn_t pfn, allocsz, i, j, mfn;
+    xen_pfn_t pfn, allocsz, mfn, total, pfn_base;
+    int i, j;
 
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -811,18 +812,80 @@ int arch_setup_meminit(struct xc_dom_image *dom)
         /* setup initial p2m */
         for ( pfn = 0; pfn < dom->total_pages; pfn++ )
             dom->p2m_host[pfn] = pfn;
-        
+
+        /* Setup dummy vNUMA information if it's not provided. Note
+         * that this is a valid state if libxl doesn't provide any
+         * vNUMA information.
+         *
+         * The dummy values make libxc allocate all pages from
+         * arbitrary physical. This is the expected behaviour if no
+         * vNUMA configuration is provided to libxc.
+         *
+         * Note that the following hunk is just for the convenience of
+         * allocation code. No defaulting is happening in libxc.
+         */
+        if ( dom->nr_vnuma_info == 0 )
+        {
+            dom->nr_vnuma_info = 1;
+            dom->vnuma_info = xc_dom_malloc(dom, sizeof(*dom->vnuma_info));
+            dom->vnuma_info[0].vnode = 0;
+            dom->vnuma_info[0].pnode = XC_VNUMA_NO_NODE;
+            dom->vnuma_info[0].pages = dom->total_pages;
+        }
+
+        total = 0;
+        for ( i = 0; i < dom->nr_vnuma_info; i++ )
+            total += dom->vnuma_info[i].pages;
+        if ( total != dom->total_pages )
+        {
+            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                         "%s: vNUMA page count mismatch (0x%"PRIpfn" != 0x%"PRIpfn")\n",
+                         __func__, total, dom->total_pages);
+            return -EINVAL;
+        }
+
         /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+        pfn_base = 0;
+        for ( i = 0; i < dom->nr_vnuma_info; i++ )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            unsigned int memflags;
+            uint64_t pages;
+
+            memflags = 0;
+            if ( dom->vnuma_info[i].pnode != XC_VNUMA_NO_NODE )
+            {
+                memflags |= XENMEMF_exact_node(dom->vnuma_info[i].pnode);
+                memflags |= XENMEMF_exact_node_request;
+            }
+
+            pages = dom->vnuma_info[i].pages;
+
+            for ( j = 0; j < pages; j += allocsz )
+            {
+                allocsz = pages - j;
+                if ( allocsz > 1024*1024 )
+                    allocsz = 1024*1024;
+
+                rc = xc_domain_populate_physmap_exact(dom->xch,
+                         dom->guest_domid, allocsz, 0, memflags,
+                         &dom->p2m_host[pfn_base+j]);
+
+                if ( rc )
+                {
+                    if ( dom->vnuma_info[i].pnode != XC_VNUMA_NO_NODE )
+                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                                     "%s: failed to allocate 0x%"PRIx64" pages (v=%d, p=%d)\n",
+                                     __func__, pages, i,
+                                     dom->vnuma_info[i].pnode);
+                    else
+                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                                     "%s: failed to allocate 0x%"PRIx64" pages\n",
+                                     __func__, pages);
+                    return rc;
+                }
+            }
+
+            pfn_base += pages;
         }
 
         /* Ensure no unclaimed pages are left unused.
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 45b8644..1809674 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -35,6 +35,8 @@
 
 #include <xen/sys/privcmd.h>
 
+#define XC_VNUMA_NO_NODE (~0U)
+
 #if defined(HAVE_VALGRIND_MEMCHECK_H) && !defined(NDEBUG) && !defined(__MINIOS__)
 /* Compile in Valgrind client requests? */
 #include <valgrind/memcheck.h>
-- 
1.7.10.4

* [PATCH v4 04/21] libxl: introduce vNUMA types
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (2 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 03/21] libxc: allocate memory with vNUMA information for PV guest Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:04   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 05/21] libxl: add vmemrange to libxl__domain_build_state Wei Liu
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

A domain can contain several virtual NUMA nodes, hence we introduce an
array in libxl_domain_build_info.

libxl_vnode_info contains the size of memory in that node, the distances
from that node to every node, the underlying pnode and a bitmap of vcpus.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Use MemKB.

Changes in v3:
1. Add commit message.
---
 tools/libxl/libxl_types.idl |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 1214d2e..c44b74e 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -354,6 +354,13 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
     ])
 
+libxl_vnode_info = Struct("vnode_info", [
+    ("memkb", MemKB),
+    ("distances", Array(uint32, "num_distances")), # distances from this node to other nodes
+    ("pnode", uint32), # physical node of this node
+    ("vcpus", libxl_bitmap), # vcpus in this node
+    ])
+
 libxl_domain_build_info = Struct("domain_build_info",[
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
@@ -374,6 +381,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("disable_migrate", libxl_defbool),
     ("cpuid",           libxl_cpuid_policy_list),
     ("blkdev_start",    string),
+
+    ("vnuma_nodes", Array(libxl_vnode_info, "num_vnuma_nodes")),
     
     ("device_model_version", libxl_device_model_version),
     ("device_model_stubdomain", libxl_defbool),
-- 
1.7.10.4

* [PATCH v4 05/21] libxl: add vmemrange to libxl__domain_build_state
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (3 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 04/21] libxl: introduce vNUMA types Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:05   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check Wei Liu
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

A vnode consists of one or more vmemranges (virtual memory ranges). One
example of a vnode with multiple vmemranges is one that contains a memory hole.

Currently we haven't exported the vmemrange interface to libxl users.
Vmemranges are generated during domain build, so the relevant structures
live in the domain build state.

Later if we discover we need to export the interface, those structures
can be moved to libxl_domain_build_info as well.

These new fields (along with other fields in that struct) are set to 0 at
start of day so we don't need to initialise them explicitly. A following
patch which introduces an independent checking function will need to access
these fields. I don't feel comfortable squashing this change into that one,
so it gets its own commit.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Improve commit message.

Changes in v3:
1. Rewrite commit message.
---
 tools/libxl/libxl_internal.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 934465a..6d3ac58 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -973,6 +973,9 @@ typedef struct {
     libxl__file_reference pv_ramdisk;
     const char * pv_cmdline;
     bool pvh_enabled;
+
+    xen_vmemrange_t *vmemranges;
+    uint32_t num_vmemranges;
 } libxl__domain_build_state;
 
 _hidden int libxl__build_pre(libxl__gc *gc, uint32_t domid,
-- 
1.7.10.4

* [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (4 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 05/21] libxl: add vmemrange to libxl__domain_build_state Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:13   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 07/21] libxl: x86: factor out e820_host_sanitize Wei Liu
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

This function is used to check whether a vNUMA configuration (be it
auto-generated or supplied by the user) is valid.

The checks performed can be found in the comment of the function.

This vNUMA function (and future ones) is placed in a new file called
libxl_vnuma.c.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Adapt to new interface.

Changes in v3:
1. Rewrite commit log.
2. Shorten two error messages.
---
 tools/libxl/Makefile         |    2 +-
 tools/libxl/libxl_internal.h |    5 ++
 tools/libxl/libxl_vnuma.c    |  137 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxl/libxl_vnuma.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 08fe814..4808bea 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -93,7 +93,7 @@ LIBXL_LIBS += -lyajl
 LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
 			libxl_internal.o libxl_utils.o libxl_uuid.o \
-			libxl_json.o libxl_aoutils.o libxl_numa.o \
+			libxl_json.o libxl_aoutils.o libxl_numa.o libxl_vnuma.o \
 			libxl_save_callout.o _libxl_save_msgs_callout.o \
 			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
 LIBXL_OBJS += libxl_genid.o
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 6d3ac58..39356ba 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3394,6 +3394,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
     libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
 }
 
+/* Check if vNUMA config is valid. Returns 0 if valid. */
+int libxl__vnuma_config_check(libxl__gc *gc,
+                              const libxl_domain_build_info *b_info,
+                              const libxl__domain_build_state *state);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
new file mode 100644
index 0000000..4edfaa4
--- /dev/null
+++ b/tools/libxl/libxl_vnuma.c
@@ -0,0 +1,137 @@
+/*
+ * Copyright (C) 2014      Citrix Ltd.
+ * Author Wei Liu <wei.liu2@citrix.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+#include "libxl_osdeps.h" /* must come before any other headers */
+#include "libxl_internal.h"
+#include <stdlib.h>
+
+/* Sort vmemranges in ascending order with "start" */
+static int compare_vmemrange(const void *a, const void *b)
+{
+    const xen_vmemrange_t *x = a, *y = b;
+    if (x->start < y->start)
+        return -1;
+    if (x->start > y->start)
+        return 1;
+    return 0;
+}
+
+/* Check if vNUMA configuration is valid:
+ *  1. all pnodes inside vnode_to_pnode array are valid
+ *  2. one vcpu belongs to and only belongs to one vnode
+ *  3. each vmemrange is valid and doesn't overlap with each other
+ */
+int libxl__vnuma_config_check(libxl__gc *gc,
+			      const libxl_domain_build_info *b_info,
+                              const libxl__domain_build_state *state)
+{
+    int i, j, rc = ERROR_INVAL, nr_nodes;
+    libxl_numainfo *ninfo = NULL;
+    uint64_t total_memkb = 0;
+    libxl_bitmap cpumap;
+    libxl_vnode_info *p;
+
+    libxl_bitmap_init(&cpumap);
+
+    /* Check pnode specified is valid */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (!ninfo) {
+        LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "libxl_get_numainfo failed");
+        goto out;
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        uint32_t pnode;
+
+        p = &b_info->vnuma_nodes[i];
+        pnode = p->pnode;
+
+        /* The pnode specified is not valid? */
+        if (pnode >= nr_nodes) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Invalid pnode %d specified",
+                       pnode);
+            goto out;
+        }
+
+        total_memkb += p->memkb;
+    }
+
+    if (total_memkb != b_info->max_memkb) {
+        LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                   "Amount of memory mismatch (0x%"PRIx64" != 0x%"PRIx64")",
+                   total_memkb, b_info->max_memkb);
+        goto out;
+    }
+
+    /* Check vcpu mapping */
+    libxl_cpu_bitmap_alloc(CTX, &cpumap, b_info->max_vcpus);
+    libxl_bitmap_set_none(&cpumap);
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        p = &b_info->vnuma_nodes[i];
+        libxl_for_each_set_bit(j, p->vcpus) {
+            if (!libxl_bitmap_test(&cpumap, j))
+                libxl_bitmap_set(&cpumap, j);
+            else {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                           "Vcpu %d assigned more than once", j);
+                goto out;
+            }
+        }
+    }
+
+    for (i = 0; i < b_info->max_vcpus; i++) {
+        if (!libxl_bitmap_test(&cpumap, i)) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Vcpu %d is not assigned to any vnode", i);
+            goto out;
+        }
+    }
+
+    /* Check vmemranges */
+    qsort(state->vmemranges, state->num_vmemranges, sizeof(xen_vmemrange_t),
+          compare_vmemrange);
+
+    for (i = 0; i < state->num_vmemranges; i++) {
+        if (state->vmemranges[i].end < state->vmemranges[i].start) {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                           "Vmemrange end < start");
+                goto out;
+        }
+    }
+
+    for (i = 0; i < state->num_vmemranges - 1; i++) {
+        if (state->vmemranges[i].end > state->vmemranges[i+1].start) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Vmemranges overlapped, 0x%"PRIx64"-0x%"PRIx64", 0x%"PRIx64"-0x%"PRIx64,
+                       state->vmemranges[i].start, state->vmemranges[i].end,
+                       state->vmemranges[i+1].start, state->vmemranges[i+1].end);
+            goto out;
+        }
+    }
+
+    rc = 0;
+out:
+    if (ninfo) libxl_numainfo_dispose(ninfo);
+    libxl_bitmap_dispose(&cpumap);
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

* [PATCH v4 07/21] libxl: x86: factor out e820_host_sanitize
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (5 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:14   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest Wei Liu
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

This function gets the machine E820 map and sanitizes it according to the
PV guest configuration.

This will be used in a later patch. No functional change is introduced by
this patch.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Use actual size of the map instead of using E820MAX.
---
 tools/libxl/libxl_x86.c |   32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 9ceb373..d012b4d 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
     return 0;
 }
 
+static int e820_host_sanitize(libxl__gc *gc,
+                              libxl_domain_build_info *b_info,
+                              struct e820entry map[],
+                              uint32_t *nr)
+{
+    int rc;
+
+    rc = xc_get_machine_memory_map(CTX->xch, map, *nr);
+    if (rc < 0) {
+        errno = rc;
+        return ERROR_FAIL;
+    }
+
+    *nr = rc;
+
+    rc = e820_sanitize(CTX, map, nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    return rc;
+}
+
 static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
         libxl_domain_config *d_config)
 {
@@ -223,15 +244,8 @@ static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
     if (!libxl_defbool_val(b_info->u.pv.e820_host))
         return ERROR_INVAL;
 
-    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
-    if (rc < 0) {
-        errno = rc;
-        return ERROR_FAIL;
-    }
-    nr = rc;
-    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
-                       (b_info->max_memkb - b_info->target_memkb) +
-                       b_info->u.pv.slack_memkb);
+    nr = E820MAX;
+    rc = e820_host_sanitize(gc, b_info, map, &nr);
     if (rc)
         return ERROR_FAIL;
 
-- 
1.7.10.4

* [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (6 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 07/21] libxl: x86: factor out e820_host_sanitize Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:27   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 09/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Introduce an arch-independent routine to generate one vmemrange per
vnode. Also introduce arch-dependent routines for different
architectures, because part of the process is arch-specific -- ARM
doesn't have NUMA support yet and E820 is x86 only.

For those x86 guests which care about the machine E820 map (i.e. with
e820_host=1), each vnode is further split into several vmemranges to
accommodate memory holes. A few stubs for libxl_arm.c are created.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Adapt to new interface.
2. Address Ian Jackson's comments.

Changes in v3:
1. Rewrite commit log.
---
 tools/libxl/libxl_arch.h     |    6 ++++
 tools/libxl/libxl_arm.c      |    9 +++++
 tools/libxl/libxl_internal.h |    5 +++
 tools/libxl/libxl_vnuma.c    |   34 +++++++++++++++++++
 tools/libxl/libxl_x86.c      |   76 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 130 insertions(+)

diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
index d3bc136..e249048 100644
--- a/tools/libxl/libxl_arch.h
+++ b/tools/libxl/libxl_arch.h
@@ -27,4 +27,10 @@ int libxl__arch_domain_init_hw_description(libxl__gc *gc,
 int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
                                       libxl_domain_build_info *info,
                                       struct xc_dom_image *dom);
+
+/* build vNUMA vmemrange with arch specific information */
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *b_info,
+                                      libxl__domain_build_state *state);
 #endif
diff --git a/tools/libxl/libxl_arm.c b/tools/libxl/libxl_arm.c
index 65a762b..d3968a7 100644
--- a/tools/libxl/libxl_arm.c
+++ b/tools/libxl/libxl_arm.c
@@ -707,6 +707,15 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
     return 0;
 }
 
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *info,
+                                      libxl__domain_build_state *state)
+{
+    /* Don't touch anything. */
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 39356ba..73d533a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3399,6 +3399,11 @@ int libxl__vnuma_config_check(libxl__gc *gc,
                               const libxl_domain_build_info *b_info,
                               const libxl__domain_build_state *state);
 
+int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
+                                    uint32_t domid,
+                                    libxl_domain_build_info *b_info,
+                                    libxl__domain_build_state *state);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
index 4edfaa4..0189a4b 100644
--- a/tools/libxl/libxl_vnuma.c
+++ b/tools/libxl/libxl_vnuma.c
@@ -14,6 +14,7 @@
  */
 #include "libxl_osdeps.h" /* must come before any other headers */
 #include "libxl_internal.h"
+#include "libxl_arch.h"
 #include <stdlib.h>
 
 /* Sort vmemranges in ascending order with "start" */
@@ -128,6 +129,39 @@ out:
     return rc;
 }
 
+/* Build vmemranges for PV guest */
+int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
+                                    uint32_t domid,
+                                    libxl_domain_build_info *b_info,
+                                    libxl__domain_build_state *state)
+{
+    int i;
+    uint64_t next;
+    xen_vmemrange_t *v = NULL;
+
+    assert(state->vmemranges == NULL);
+
+    /* Generate one vmemrange for each virtual node. */
+    next = 0;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+
+        GCREALLOC_ARRAY(v, i+1);
+
+        v[i].start = next;
+        v[i].end = next + (p->memkb << 10);
+        v[i].flags = 0;
+        v[i].nid = i;
+
+        next = v[i].end;
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = i;
+
+    return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index d012b4d..4841fef 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -339,6 +339,82 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
     return 0;
 }
 
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *b_info,
+                                      libxl__domain_build_state *state)
+{
+    int nid, nr_vmemrange, rc;
+    uint32_t nr_e820, e820_count;
+    struct e820entry map[E820MAX];
+    xen_vmemrange_t *vmemranges;
+
+    /* Only touch vmemranges if it's PV guest and e820_host is true */
+    if (!(b_info->type == LIBXL_DOMAIN_TYPE_PV &&
+          libxl_defbool_val(b_info->u.pv.e820_host))) {
+        rc = 0;
+        goto out;
+    }
+
+    rc = e820_host_sanitize(gc, b_info, map, &nr_e820);
+    if (rc) goto out;
+
+    /* Ditch old vmemranges and start with host E820 map. Note, memory
+     * was gc allocated.
+     */
+    state->vmemranges = NULL;
+    state->num_vmemranges = 0;
+
+    e820_count = 0;
+    nr_vmemrange = 0;
+    vmemranges = NULL;
+    for (nid = 0; nid < b_info->num_vnuma_nodes; nid++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[nid];
+        uint64_t remaining_bytes = (p->memkb << 10), bytes;
+
+        while (remaining_bytes > 0) {
+            if (e820_count >= nr_e820) {
+                rc = ERROR_NOMEM;
+                goto out;
+            }
+
+            /* Skip non RAM region */
+            if (map[e820_count].type != E820_RAM) {
+                e820_count++;
+                continue;
+            }
+
+            GCREALLOC_ARRAY(vmemranges, nr_vmemrange+1);
+
+            bytes = map[e820_count].size >= remaining_bytes ?
+                remaining_bytes : map[e820_count].size;
+
+            vmemranges[nr_vmemrange].start = map[e820_count].addr;
+            vmemranges[nr_vmemrange].end = map[e820_count].addr + bytes;
+
+            if (map[e820_count].size >= remaining_bytes) {
+                map[e820_count].addr += bytes;
+                map[e820_count].size -= bytes;
+            } else {
+                e820_count++;
+            }
+
+            remaining_bytes -= bytes;
+
+            vmemranges[nr_vmemrange].flags = 0;
+            vmemranges[nr_vmemrange].nid = nid;
+            nr_vmemrange++;
+        }
+    }
+
+    state->vmemranges = vmemranges;
+    state->num_vmemranges = nr_vmemrange;
+
+    rc = 0;
+out:
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
1.7.10.4

* [PATCH v4 09/21] libxl: build, check and pass vNUMA info to Xen for PV guest
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (7 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:29   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 10/21] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Transform the user-supplied vNUMA configuration into libxl internal
representations, and finally libxc representations. Check the validity of
the configuration along the way.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Adapt to new interfaces.

Changes in v3:
1. Add more commit log.
---
 tools/libxl/libxl_dom.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 48d661a..b06fd65 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -515,6 +515,51 @@ retry_transaction:
     return 0;
 }
 
+static int set_vnuma_info(libxl__gc *gc, uint32_t domid,
+                          const libxl_domain_build_info *info,
+                          const libxl__domain_build_state *state)
+{
+    int rc = 0;
+    int i, nr_vdistance;
+    unsigned int *vcpu_to_vnode, *vnode_to_pnode, *vdistance = NULL;
+
+    vcpu_to_vnode = libxl__calloc(gc, info->max_vcpus,
+                                  sizeof(unsigned int));
+    vnode_to_pnode = libxl__calloc(gc, info->num_vnuma_nodes,
+                                   sizeof(unsigned int));
+
+    nr_vdistance = info->num_vnuma_nodes * info->num_vnuma_nodes;
+    vdistance = libxl__calloc(gc, nr_vdistance, sizeof(unsigned int));
+
+    for (i = 0; i < info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *v = &info->vnuma_nodes[i];
+        int bit;
+
+        /* vnode to pnode mapping */
+        vnode_to_pnode[i] = v->pnode;
+
+        /* vcpu to vnode mapping */
+        libxl_for_each_set_bit(bit, v->vcpus)
+            vcpu_to_vnode[bit] = i;
+
+        /* node distances */
+        assert(info->num_vnuma_nodes == v->num_distances);
+        memcpy(vdistance + (i * info->num_vnuma_nodes),
+               v->distances,
+               v->num_distances * sizeof(unsigned int));
+    }
+
+    if (xc_domain_setvnuma(CTX->xch, domid, info->num_vnuma_nodes,
+                           state->num_vmemranges, info->max_vcpus,
+                           state->vmemranges, vdistance,
+                           vcpu_to_vnode, vnode_to_pnode) < 0) {
+        LOGE(ERROR, "xc_domain_setvnuma failed");
+        rc = ERROR_FAIL;
+    }
+
+    return rc;
+}
+
 int libxl__build_pv(libxl__gc *gc, uint32_t domid,
              libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -572,6 +617,30 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     dom->xenstore_domid = state->store_domid;
     dom->claim_enabled = libxl_defbool_val(info->claim_mode);
 
+    if (info->num_vnuma_nodes != 0) {
+        int i;
+
+        ret = libxl__vnuma_build_vmemrange_pv(gc, domid, info, state);
+        if (ret) {
+            LOGE(ERROR, "cannot build vmemranges");
+            goto out;
+        }
+        ret = libxl__vnuma_config_check(gc, info, state);
+        if (ret) goto out;
+
+        ret = set_vnuma_info(gc, domid, info, state);
+        if (ret) goto out;
+
+        dom->nr_vnuma_info = info->num_vnuma_nodes;
+        dom->vnuma_info = xc_dom_malloc(dom, sizeof(*dom->vnuma_info) *
+                                        dom->nr_vnuma_info);
+        for (i = 0; i < dom->nr_vnuma_info; i++) {
+            dom->vnuma_info[i].vnode = i;
+            dom->vnuma_info[i].pnode = info->vnuma_nodes[i].pnode;
+            dom->vnuma_info[i].pages = info->vnuma_nodes[i].memkb >> 2;
+        }
+    }
+
     if ( (ret = xc_dom_boot_xen_init(dom, ctx->xch, domid)) != 0 ) {
         LOGE(ERROR, "xc_dom_boot_xen_init failed");
         goto out;
-- 
1.7.10.4

* [PATCH v4 10/21] xen: handle XENMEM_get_vnumainfo in compat_memory_op
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (8 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 09/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-23 11:13 ` [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in v4:
1. Rearrange if and add ack.

Changes in v3:
1. Fix hard tabs.
2. Fix subject line.
3. Add extra hunk to handle -ENOBUFS (hence drop Reviewed-by:)
---
 xen/common/compat/memory.c |   48 ++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xlat.lst       |    2 ++
 2 files changed, 50 insertions(+)

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index 06c90be..b258138 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -15,6 +15,7 @@ CHECK_TYPE(domid);
 #undef xen_domid_t
 
 CHECK_mem_access_op;
+CHECK_vmemrange;
 
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
@@ -32,12 +33,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             struct xen_add_to_physmap *atp;
             struct xen_add_to_physmap_batch *atpb;
             struct xen_remove_from_physmap *xrfp;
+            struct xen_vnuma_topology_info *vnuma;
         } nat;
         union {
             struct compat_memory_reservation rsrv;
             struct compat_memory_exchange xchg;
             struct compat_add_to_physmap atp;
             struct compat_add_to_physmap_batch atpb;
+            struct compat_vnuma_topology_info vnuma;
         } cmp;
 
         set_xen_guest_handle(nat.hnd, COMPAT_ARG_XLAT_VIRT_BASE);
@@ -273,13 +276,50 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             break;
         }
 
+        case XENMEM_get_vnumainfo:
+        {
+            enum XLAT_vnuma_topology_info_vdistance vdistance =
+                XLAT_vnuma_topology_info_vdistance_h;
+            enum XLAT_vnuma_topology_info_vcpu_to_vnode vcpu_to_vnode =
+                XLAT_vnuma_topology_info_vcpu_to_vnode_h;
+            enum XLAT_vnuma_topology_info_vmemrange vmemrange =
+                XLAT_vnuma_topology_info_vmemrange_h;
+
+            if ( copy_from_guest(&cmp.vnuma, compat, 1) )
+                return -EFAULT;
+
+#define XLAT_vnuma_topology_info_HNDL_vdistance_h(_d_, _s_)		\
+            guest_from_compat_handle((_d_)->vdistance.h, (_s_)->vdistance.h)
+#define XLAT_vnuma_topology_info_HNDL_vcpu_to_vnode_h(_d_, _s_)		\
+            guest_from_compat_handle((_d_)->vcpu_to_vnode.h, (_s_)->vcpu_to_vnode.h)
+#define XLAT_vnuma_topology_info_HNDL_vmemrange_h(_d_, _s_)		\
+            guest_from_compat_handle((_d_)->vmemrange.h, (_s_)->vmemrange.h)
+
+            XLAT_vnuma_topology_info(nat.vnuma, &cmp.vnuma);
+
+#undef XLAT_vnuma_topology_info_HNDL_vdistance_h
+#undef XLAT_vnuma_topology_info_HNDL_vcpu_to_vnode_h
+#undef XLAT_vnuma_topology_info_HNDL_vmemrange_h
+            break;
+        }
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
 
         rc = do_memory_op(cmd, nat.hnd);
         if ( rc < 0 )
+        {
+            if ( rc == -ENOBUFS && op == XENMEM_get_vnumainfo )
+            {
+                cmp.vnuma.nr_vnodes = nat.vnuma->nr_vnodes;
+                cmp.vnuma.nr_vcpus = nat.vnuma->nr_vcpus;
+                cmp.vnuma.nr_vmemranges = nat.vnuma->nr_vmemranges;
+                if ( __copy_to_guest(compat, &cmp.vnuma, 1) )
+                    rc = -EFAULT;
+            }
             break;
+        }
 
         cmd = 0;
         if ( hypercall_xlat_continuation(&cmd, 2, 0x02, nat.hnd, compat) )
@@ -398,6 +438,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
         case XENMEM_remove_from_physmap:
             break;
 
+        case XENMEM_get_vnumainfo:
+            cmp.vnuma.nr_vnodes = nat.vnuma->nr_vnodes;
+            cmp.vnuma.nr_vcpus = nat.vnuma->nr_vcpus;
+            cmp.vnuma.nr_vmemranges = nat.vnuma->nr_vmemranges;
+            if ( __copy_to_guest(compat, &cmp.vnuma, 1) )
+                rc = -EFAULT;
+            break;
+
         default:
             domain_crash(current->domain);
             split = 0;
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 41b3e35..9c9fd9a 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -64,6 +64,8 @@
 ?	mem_access_op		memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+?	vmemrange			memory.h
+!	vnuma_topology_info		memory.h
 ?	physdev_eoi			physdev.h
 ?	physdev_get_free_pirq		physdev.h
 ?	physdev_irq			physdev.h
-- 
1.7.10.4


* [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (9 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 10/21] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-23 13:27   ` Jan Beulich
  2015-01-23 11:13 ` [PATCH v4 12/21] hvmloader: construct SRAT Wei Liu
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Hvmloader issues the XENMEM_get_vnumainfo hypercall and stores the
retrieved information in scratch space for later use.
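
The calling convention is a simple two-phase size negotiation. The sketch
below condenses the code added in vnuma.c further down; scratch allocation
and error handling are omitted, so treat it as an outline rather than the
exact code:

    struct xen_vnuma_topology_info topo;
    int rc;

    topo.domid = DOMID_SELF;
    topo.pad = 0;
    topo.nr_vnodes = topo.nr_vcpus = topo.nr_vmemranges = 0;
    set_xen_guest_handle(topo.vdistance.h, NULL);
    set_xen_guest_handle(topo.vcpu_to_vnode.h, NULL);
    set_xen_guest_handle(topo.vmemrange.h, NULL);

    /* First call: Xen fails with -ENOBUFS but fills in nr_vnodes,
     * nr_vcpus and nr_vmemranges, telling us how much to allocate. */
    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &topo);

    /* Second call: point the three handles at buffers of the advertised
     * sizes and retry; the distances, vcpu-to-vnode map and vmemranges
     * are then copied out and cached in the globals declared in vnuma.h. */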

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
Changes in v4:
1. Use *vnode_to_pnode to calculate size.
2. Remove loop.
3. TBD: rebase onto Jan's errno ABI change.

Changes in v3:
1. Move init_vnuma_info before ACPI stuff.
2. Fix errno.h inclusion.
3. Remove upper limits and use loop.
---
 tools/firmware/hvmloader/Makefile    |    2 +-
 tools/firmware/hvmloader/hvmloader.c |    3 ++
 tools/firmware/hvmloader/vnuma.c     |   96 ++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/vnuma.h     |   52 ++++++++++++++++++
 4 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 tools/firmware/hvmloader/vnuma.c
 create mode 100644 tools/firmware/hvmloader/vnuma.h

diff --git a/tools/firmware/hvmloader/Makefile b/tools/firmware/hvmloader/Makefile
index 46a79c5..a5af34b 100644
--- a/tools/firmware/hvmloader/Makefile
+++ b/tools/firmware/hvmloader/Makefile
@@ -29,7 +29,7 @@ LOADADDR = 0x100000
 CFLAGS += $(CFLAGS_xeninclude)
 
 OBJS  = hvmloader.o mp_tables.o util.o smbios.o 
-OBJS += smp.o cacheattr.o xenbus.o
+OBJS += smp.o cacheattr.o xenbus.o vnuma.o
 OBJS += e820.o pci.o pir.o ctype.o
 OBJS += hvm_param.o
 ifeq ($(debug),y)
diff --git a/tools/firmware/hvmloader/hvmloader.c b/tools/firmware/hvmloader/hvmloader.c
index 7b0da38..25b7f08 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -26,6 +26,7 @@
 #include "pci_regs.h"
 #include "apic_regs.h"
 #include "acpi/acpi2_0.h"
+#include "vnuma.h"
 #include <xen/version.h>
 #include <xen/hvm/params.h>
 
@@ -310,6 +311,8 @@ int main(void)
 
     if ( acpi_enabled )
     {
+        init_vnuma_info();
+
         if ( bios->acpi_build_tables )
         {
             printf("Loading ACPI ...\n");
diff --git a/tools/firmware/hvmloader/vnuma.c b/tools/firmware/hvmloader/vnuma.c
new file mode 100644
index 0000000..9ad5f7a
--- /dev/null
+++ b/tools/firmware/hvmloader/vnuma.c
@@ -0,0 +1,96 @@
+/*
+ * vnuma.c: obtain vNUMA information from hypervisor
+ *
+ * Copyright (c) 2014 Wei Liu, Citrix Systems (R&D) Ltd.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "util.h"
+#include "hypercall.h"
+#include "vnuma.h"
+
+unsigned int nr_vnodes, nr_vmemranges;
+unsigned int *vcpu_to_vnode, *vdistance;
+xen_vmemrange_t *vmemrange;
+
+/* XXX: this will be removed once Jan's errno ABI patch is committed */
+#define XEN_ENOBUFS  105
+
+void init_vnuma_info(void)
+{
+    int rc;
+    struct xen_vnuma_topology_info vnuma_topo;
+
+    vcpu_to_vnode =
+        scratch_alloc(sizeof(*vcpu_to_vnode) * hvm_info->nr_vcpus, 0);
+
+    vnuma_topo.domid = DOMID_SELF;
+    vnuma_topo.pad = 0;
+    vnuma_topo.nr_vcpus = 0;
+    vnuma_topo.nr_vnodes = 0;
+    vnuma_topo.nr_vmemranges = 0;
+
+    set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
+    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
+    set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);
+
+    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
+
+    if ( rc != -XEN_ENOBUFS )
+        return;
+
+    ASSERT(vnuma_topo.nr_vcpus == hvm_info->nr_vcpus);
+
+    vdistance = scratch_alloc(sizeof(uint32_t) * vnuma_topo.nr_vnodes *
+                              vnuma_topo.nr_vnodes, 0);
+    vmemrange = scratch_alloc(sizeof(xen_vmemrange_t) *
+                              vnuma_topo.nr_vmemranges, 0);
+
+    set_xen_guest_handle(vnuma_topo.vdistance.h, vdistance);
+    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, vcpu_to_vnode);
+    set_xen_guest_handle(vnuma_topo.vmemrange.h, vmemrange);
+
+    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
+
+    if ( rc < 0 )
+    {
+        printf("Failed to retrieve vNUMA information, rc = %d\n", rc);
+        return;
+    }
+
+    nr_vnodes = vnuma_topo.nr_vnodes;
+    nr_vmemranges = vnuma_topo.nr_vmemranges;
+
+    return;
+}
+
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/firmware/hvmloader/vnuma.h b/tools/firmware/hvmloader/vnuma.h
new file mode 100644
index 0000000..63b648a
--- /dev/null
+++ b/tools/firmware/hvmloader/vnuma.h
@@ -0,0 +1,52 @@
+/******************************************************************************
+ * vnuma.h
+ *
+ * Copyright (c) 2014, Wei Liu
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __HVMLOADER_VNUMA_H__
+#define __HVMLOADER_VNUMA_H__
+
+#include <xen/memory.h>
+
+extern unsigned int nr_vnodes, nr_vmemranges;
+extern unsigned int *vcpu_to_vnode, *vdistance;
+extern xen_vmemrange_t *vmemrange;
+
+void init_vnuma_info(void);
+
+#endif /* __HVMLOADER_VNUMA_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v4 12/21] hvmloader: construct SRAT
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (10 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-23 11:13 ` [PATCH v4 13/21] hvmloader: construct SLIT Wei Liu
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Remove redundant variable.
2. Coding style fix.
3. Add assertion.

Changes in v2:
1. Remove explicit zero initializers.
2. Adapt to new vNUMA retrieval routine.
3. Move SRAT very late in secondary table build.
---
 tools/firmware/hvmloader/acpi/acpi2_0.h |   53 +++++++++++++++++++++++
 tools/firmware/hvmloader/acpi/build.c   |   72 +++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

diff --git a/tools/firmware/hvmloader/acpi/acpi2_0.h b/tools/firmware/hvmloader/acpi/acpi2_0.h
index 7b22d80..6169213 100644
--- a/tools/firmware/hvmloader/acpi/acpi2_0.h
+++ b/tools/firmware/hvmloader/acpi/acpi2_0.h
@@ -364,6 +364,57 @@ struct acpi_20_madt_intsrcovr {
 };
 
 /*
+ * System Resource Affinity Table header definition (SRAT)
+ */
+struct acpi_20_srat {
+    struct acpi_header header;
+    uint32_t table_revision;
+    uint32_t reserved2[2];
+};
+
+#define ACPI_SRAT_TABLE_REVISION 1
+
+/*
+ * System Resource Affinity Table structure types.
+ */
+#define ACPI_PROCESSOR_AFFINITY 0x0
+#define ACPI_MEMORY_AFFINITY    0x1
+struct acpi_20_srat_processor {
+    uint8_t type;
+    uint8_t length;
+    uint8_t domain;
+    uint8_t apic_id;
+    uint32_t flags;
+    uint8_t sapic_id;
+    uint8_t domain_hi[3];
+    uint32_t reserved;
+};
+
+/*
+ * Local APIC Affinity Flags.  All other bits are reserved and must be 0.
+ */
+#define ACPI_LOCAL_APIC_AFFIN_ENABLED (1 << 0)
+
+struct acpi_20_srat_memory {
+    uint8_t type;
+    uint8_t length;
+    uint32_t domain;
+    uint16_t reserved;
+    uint64_t base_address;
+    uint64_t mem_length;
+    uint32_t reserved2;
+    uint32_t flags;
+    uint64_t reserved3;
+};
+
+/*
+ * Memory Affinity Flags.  All other bits are reserved and must be 0.
+ */
+#define ACPI_MEM_AFFIN_ENABLED (1 << 0)
+#define ACPI_MEM_AFFIN_HOTPLUGGABLE (1 << 1)
+#define ACPI_MEM_AFFIN_NONVOLATILE (1 << 2)
+
+/*
  * Table Signatures.
  */
 #define ACPI_2_0_RSDP_SIGNATURE ASCII64('R','S','D',' ','P','T','R',' ')
@@ -375,6 +426,7 @@ struct acpi_20_madt_intsrcovr {
 #define ACPI_2_0_TCPA_SIGNATURE ASCII32('T','C','P','A')
 #define ACPI_2_0_HPET_SIGNATURE ASCII32('H','P','E','T')
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
+#define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
 
 /*
  * Table revision numbers.
@@ -388,6 +440,7 @@ struct acpi_20_madt_intsrcovr {
 #define ACPI_2_0_HPET_REVISION 0x01
 #define ACPI_2_0_WAET_REVISION 0x01
 #define ACPI_1_0_FADT_REVISION 0x01
+#define ACPI_2_0_SRAT_REVISION 0x01
 
 #pragma pack ()
 
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 1431296..3e96c23 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -23,6 +23,7 @@
 #include "ssdt_pm.h"
 #include "../config.h"
 #include "../util.h"
+#include "../vnuma.h"
 #include <xen/hvm/hvm_xs_strings.h>
 #include <xen/hvm/params.h>
 
@@ -203,6 +204,66 @@ static struct acpi_20_waet *construct_waet(void)
     return waet;
 }
 
+static struct acpi_20_srat *construct_srat(void)
+{
+    struct acpi_20_srat *srat;
+    struct acpi_20_srat_processor *processor;
+    struct acpi_20_srat_memory *memory;
+    unsigned int size;
+    void *p;
+    int i;
+
+    size = sizeof(*srat) + sizeof(*processor) * hvm_info->nr_vcpus +
+        sizeof(*memory) * nr_vmemranges;
+
+    p = mem_alloc(size, 16);
+    if ( !p )
+        return NULL;
+
+    srat = p;
+    memset(srat, 0, sizeof(*srat));
+    srat->header.signature    = ACPI_2_0_SRAT_SIGNATURE;
+    srat->header.revision     = ACPI_2_0_SRAT_REVISION;
+    fixed_strcpy(srat->header.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(srat->header.oem_table_id, ACPI_OEM_TABLE_ID);
+    srat->header.oem_revision = ACPI_OEM_REVISION;
+    srat->header.creator_id   = ACPI_CREATOR_ID;
+    srat->header.creator_revision = ACPI_CREATOR_REVISION;
+    srat->table_revision      = ACPI_SRAT_TABLE_REVISION;
+
+    processor = (struct acpi_20_srat_processor *)(srat + 1);
+    for ( i = 0; i < hvm_info->nr_vcpus; i++ )
+    {
+        memset(processor, 0, sizeof(*processor));
+        processor->type     = ACPI_PROCESSOR_AFFINITY;
+        processor->length   = sizeof(*processor);
+        processor->domain   = vcpu_to_vnode[i];
+        processor->apic_id  = LAPIC_ID(i);
+        processor->flags    = ACPI_LOCAL_APIC_AFFIN_ENABLED;
+        processor++;
+    }
+
+    memory = (struct acpi_20_srat_memory *)processor;
+    for ( i = 0; i < nr_vmemranges; i++ )
+    {
+        memset(memory, 0, sizeof(*memory));
+        memory->type          = ACPI_MEMORY_AFFINITY;
+        memory->length        = sizeof(*memory);
+        memory->domain        = vmemrange[i].nid;
+        memory->flags         = ACPI_MEM_AFFIN_ENABLED;
+        memory->base_address  = vmemrange[i].start;
+        memory->mem_length    = vmemrange[i].end - vmemrange[i].start;
+        memory++;
+    }
+
+    ASSERT(((unsigned long)memory) - ((unsigned long)p) == size);
+
+    srat->header.length = size;
+    set_checksum(srat, offsetof(struct acpi_header, checksum), size);
+
+    return srat;
+}
+
 static int construct_passthrough_tables(unsigned long *table_ptrs,
                                         int nr_tables)
 {
@@ -257,6 +318,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     struct acpi_20_hpet *hpet;
     struct acpi_20_waet *waet;
     struct acpi_20_tcpa *tcpa;
+    struct acpi_20_srat *srat;
     unsigned char *ssdt;
     static const uint16_t tis_signature[] = {0x0001, 0x0001, 0x0001};
     uint16_t *tis_hdr;
@@ -346,6 +408,16 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
         }
     }
 
+    /* SRAT */
+    if ( nr_vnodes > 0 )
+    {
+        srat = construct_srat();
+        if ( srat )
+            table_ptrs[nr_tables++] = (unsigned long)srat;
+        else
+            printf("Failed to build SRAT, skipping...\n");
+    }
+
     /* Load any additional tables passed through. */
     nr_tables += construct_passthrough_tables(table_ptrs, nr_tables);
 
-- 
1.7.10.4


* [PATCH v4 13/21] hvmloader: construct SLIT
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (11 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 12/21] hvmloader: construct SRAT Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-23 11:13 ` [PATCH v4 14/21] libxc: indentation change to xc_hvm_build_x86.c Wei Liu
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Coding style fix.
2. Fix an error code.
3. Use unsigned int for loop variable.

Changes in v2:
1. Adapt to new vNUMA retrieval routine.
2. Move SLIT very late in secondary table build.
---
 tools/firmware/hvmloader/acpi/acpi2_0.h |    8 +++++++
 tools/firmware/hvmloader/acpi/build.c   |   40 ++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/tools/firmware/hvmloader/acpi/acpi2_0.h b/tools/firmware/hvmloader/acpi/acpi2_0.h
index 6169213..d698095 100644
--- a/tools/firmware/hvmloader/acpi/acpi2_0.h
+++ b/tools/firmware/hvmloader/acpi/acpi2_0.h
@@ -414,6 +414,12 @@ struct acpi_20_srat_memory {
 #define ACPI_MEM_AFFIN_HOTPLUGGABLE (1 << 1)
 #define ACPI_MEM_AFFIN_NONVOLATILE (1 << 2)
 
+struct acpi_20_slit {
+    struct acpi_header header;
+    uint64_t localities;
+    uint8_t entry[0];
+};
+
 /*
  * Table Signatures.
  */
@@ -427,6 +433,7 @@ struct acpi_20_srat_memory {
 #define ACPI_2_0_HPET_SIGNATURE ASCII32('H','P','E','T')
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
 #define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
+#define ACPI_2_0_SLIT_SIGNATURE ASCII32('S','L','I','T')
 
 /*
  * Table revision numbers.
@@ -441,6 +448,7 @@ struct acpi_20_srat_memory {
 #define ACPI_2_0_WAET_REVISION 0x01
 #define ACPI_1_0_FADT_REVISION 0x01
 #define ACPI_2_0_SRAT_REVISION 0x01
+#define ACPI_2_0_SLIT_REVISION 0x01
 
 #pragma pack ()
 
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 3e96c23..7dac6a8 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -264,6 +264,38 @@ static struct acpi_20_srat *construct_srat(void)
     return srat;
 }
 
+static struct acpi_20_slit *construct_slit(void)
+{
+    struct acpi_20_slit *slit;
+    unsigned int i, num, size;
+
+    num = nr_vnodes * nr_vnodes;
+    size = sizeof(*slit) + num * sizeof(uint8_t);
+
+    slit = mem_alloc(size, 16);
+    if ( !slit )
+        return NULL;
+
+    memset(slit, 0, size);
+    slit->header.signature    = ACPI_2_0_SLIT_SIGNATURE;
+    slit->header.revision     = ACPI_2_0_SLIT_REVISION;
+    fixed_strcpy(slit->header.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(slit->header.oem_table_id, ACPI_OEM_TABLE_ID);
+    slit->header.oem_revision = ACPI_OEM_REVISION;
+    slit->header.creator_id   = ACPI_CREATOR_ID;
+    slit->header.creator_revision = ACPI_CREATOR_REVISION;
+
+    for ( i = 0; i < num; i++ )
+        slit->entry[i] = vdistance[i];
+
+    slit->localities = nr_vnodes;
+
+    slit->header.length = size;
+    set_checksum(slit, offsetof(struct acpi_header, checksum), size);
+
+    return slit;
+}
+
 static int construct_passthrough_tables(unsigned long *table_ptrs,
                                         int nr_tables)
 {
@@ -319,6 +351,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     struct acpi_20_waet *waet;
     struct acpi_20_tcpa *tcpa;
     struct acpi_20_srat *srat;
+    struct acpi_20_slit *slit;
     unsigned char *ssdt;
     static const uint16_t tis_signature[] = {0x0001, 0x0001, 0x0001};
     uint16_t *tis_hdr;
@@ -408,7 +441,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
         }
     }
 
-    /* SRAT */
+    /* SRAT and SLIT */
     if ( nr_vnodes > 0 )
     {
         srat = construct_srat();
@@ -416,6 +449,11 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
             table_ptrs[nr_tables++] = (unsigned long)srat;
         else
             printf("Failed to build SRAT, skipping...\n");
+        slit = construct_slit();
+        if ( slit )
+            table_ptrs[nr_tables++] = (unsigned long)slit;
+        else
+            printf("Failed to build SLIT, skipping...\n");
     }
 
     /* Load any additional tables passed through. */
-- 
1.7.10.4


* [PATCH v4 14/21] libxc: indentation change to xc_hvm_build_x86.c
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (12 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 13/21] hvmloader: construct SLIT Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:30   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Move a while loop in xc_hvm_build_x86.c one block to the right, i.e.
indent it by one level. No functional change is introduced.

Functional changes will be introduced in the next patch.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/xc_hvm_build_x86.c |  153 +++++++++++++++++++++-------------------
 1 file changed, 81 insertions(+), 72 deletions(-)

diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
index c81a25b..ecc3224 100644
--- a/tools/libxc/xc_hvm_build_x86.c
+++ b/tools/libxc/xc_hvm_build_x86.c
@@ -353,98 +353,107 @@ static int setup_guest(xc_interface *xch,
     cur_pages = 0xc0;
     stat_normal_pages = 0xc0;
 
-    while ( (rc == 0) && (nr_pages > cur_pages) )
     {
-        /* Clip count to maximum 1GB extent. */
-        unsigned long count = nr_pages - cur_pages;
-        unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
-
-        if ( count > max_pages )
-            count = max_pages;
-
-        cur_pfn = page_array[cur_pages];
-
-        /* Take care the corner cases of super page tails */
-        if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
-             (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
-            count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
-        else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
-                  (count > SUPERPAGE_1GB_NR_PFNS) )
-            count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
-
-        /* Attemp to allocate 1GB super page. Because in each pass we only
-         * allocate at most 1GB, we don't have to clip super page boundaries.
-         */
-        if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
-             /* Check if there exists MMIO hole in the 1GB memory range */
-             !check_mmio_hole(cur_pfn << PAGE_SHIFT,
-                              SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
-                              mmio_start, mmio_size) )
+        while ( (rc == 0) && (nr_pages > cur_pages) )
         {
-            long done;
-            unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
-            xen_pfn_t sp_extents[nr_extents];
-
-            for ( i = 0; i < nr_extents; i++ )
-                sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
-
-            done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
-                                              pod_mode, sp_extents);
-
-            if ( done > 0 )
-            {
-                stat_1gb_pages += done;
-                done <<= SUPERPAGE_1GB_SHIFT;
-                cur_pages += done;
-                count -= done;
-            }
-        }
+            /* Clip count to maximum 1GB extent. */
+            unsigned long count = nr_pages - cur_pages;
+            unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
 
-        if ( count != 0 )
-        {
-            /* Clip count to maximum 8MB extent. */
-            max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
             if ( count > max_pages )
                 count = max_pages;
-            
-            /* Clip partial superpage extents to superpage boundaries. */
-            if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
-                 (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
-                count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
-            else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
-                      (count > SUPERPAGE_2MB_NR_PFNS) )
-                count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
-
-            /* Attempt to allocate superpage extents. */
-            if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
+
+            cur_pfn = page_array[cur_pages];
+
+            /* Take care the corner cases of super page tails */
+            if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
+                 (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
+                count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
+            else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
+                      (count > SUPERPAGE_1GB_NR_PFNS) )
+                count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
+
+            /* Attemp to allocate 1GB super page. Because in each pass
+             * we only allocate at most 1GB, we don't have to clip
+             * super page boundaries.
+             */
+            if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
+                 /* Check if there exists MMIO hole in the 1GB memory
+                  * range */
+                 !check_mmio_hole(cur_pfn << PAGE_SHIFT,
+                                  SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
+                                  mmio_start, mmio_size) )
             {
                 long done;
-                unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
+                unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
                 xen_pfn_t sp_extents[nr_extents];
 
                 for ( i = 0; i < nr_extents; i++ )
-                    sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_2MB_SHIFT)];
+                    sp_extents[i] =
+                        page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
 
-                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
+                done = xc_domain_populate_physmap(xch, dom, nr_extents,
+                                                  SUPERPAGE_1GB_SHIFT,
                                                   pod_mode, sp_extents);
 
                 if ( done > 0 )
                 {
-                    stat_2mb_pages += done;
-                    done <<= SUPERPAGE_2MB_SHIFT;
+                    stat_1gb_pages += done;
+                    done <<= SUPERPAGE_1GB_SHIFT;
                     cur_pages += done;
                     count -= done;
                 }
             }
-        }
 
-        /* Fall back to 4kB extents. */
-        if ( count != 0 )
-        {
-            rc = xc_domain_populate_physmap_exact(
-                xch, dom, count, 0, pod_mode, &page_array[cur_pages]);
-            cur_pages += count;
-            stat_normal_pages += count;
+            if ( count != 0 )
+            {
+                /* Clip count to maximum 8MB extent. */
+                max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
+                if ( count > max_pages )
+                    count = max_pages;
+
+                /* Clip partial superpage extents to superpage
+                 * boundaries. */
+                if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
+                     (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
+                    count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
+                else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
+                          (count > SUPERPAGE_2MB_NR_PFNS) )
+                    count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
+
+                /* Attempt to allocate superpage extents. */
+                if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
+                {
+                    long done;
+                    unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
+                    xen_pfn_t sp_extents[nr_extents];
+
+                    for ( i = 0; i < nr_extents; i++ )
+                        sp_extents[i] =
+                            page_array[cur_pages+(i<<SUPERPAGE_2MB_SHIFT)];
+
+                    done = xc_domain_populate_physmap(xch, dom, nr_extents,
+                                                      SUPERPAGE_2MB_SHIFT,
+                                                      pod_mode, sp_extents);
+
+                    if ( done > 0 )
+                    {
+                        stat_2mb_pages += done;
+                        done <<= SUPERPAGE_2MB_SHIFT;
+                        cur_pages += done;
+                        count -= done;
+                    }
+                }
+            }
+
+            /* Fall back to 4kB extents. */
+            if ( count != 0 )
+            {
+                rc = xc_domain_populate_physmap_exact(
+                    xch, dom, count, 0, pod_mode, &page_array[cur_pages]);
+                cur_pages += count;
+                stat_normal_pages += count;
+            }
         }
     }
 
-- 
1.7.10.4


* [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (13 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 14/21] libxc: indentation change to xc_hvm_build_x86.c Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:36   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

The algorithm is more or less the same as the one used for PV guests.
Libxc gets hold of the vnode-to-pnode mapping and the size of each
vnode, then allocates memory accordingly.

The function then returns the low memory end, high memory end and MMIO
start to the caller. Libxl needs those values to construct vmemranges
for the guest.
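
Roughly, a caller is expected to drive the new interface as in the sketch
below. This is illustrative only: the sizes are made up, xch and domid are
assumed to exist, and fields unrelated to vNUMA (firmware image, ACPI and
SMBIOS modules, etc.) are left out.

    struct xc_vnuma_info vnuma[2] = {
        { .vnode = 0, .pnode = 0, .pages = 768000 }, /* 3000MB in 4k pages */
        { .vnode = 1, .pnode = 0, .pages = 768000 }, /* 3000MB in 4k pages */
    };
    struct xc_hvm_build_args args = { 0 };
    int rc;

    args.mem_size      = 6000ULL << 20;
    args.mem_target    = args.mem_size;  /* PoD plus vNUMA is rejected */
    args.vnuma_info    = vnuma;
    args.nr_vnuma_info = 2;

    rc = xc_hvm_build(xch, domid, &args);

    /* On success the out parameters args.lowmem_end, args.highmem_end
     * and args.mmio_start describe the guest memory layout; libxl uses
     * them to construct vmemranges (see the next patch). */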

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Adapt to new interface.
2. Shorten error message.
3. This patch includes only functional changes.

Changes in v3:
1. Rewrite commit log.
2. Add a few code comments.
---
 tools/libxc/include/xenguest.h |    9 ++++
 tools/libxc/xc_hvm_build_x86.c |  102 +++++++++++++++++++++++++++++++---------
 2 files changed, 90 insertions(+), 21 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 9b7edff..6a1ac52 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -236,6 +236,15 @@ struct xc_hvm_build_args {
     struct xc_hvm_firmware_module smbios_module;
     /* Whether to use claim hypercall (1 - enable, 0 - disable). */
     int claim_enabled;
+
+    /* vNUMA information*/
+    struct xc_vnuma_info *vnuma_info;
+    unsigned int nr_vnuma_info;
+
+    /* Out parameters  */
+    uint64_t lowmem_end;
+    uint64_t highmem_end;
+    uint64_t mmio_start;
 };
 
 /**
diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
index ecc3224..bd27ce5 100644
--- a/tools/libxc/xc_hvm_build_x86.c
+++ b/tools/libxc/xc_hvm_build_x86.c
@@ -89,7 +89,8 @@ static int modules_init(struct xc_hvm_build_args *args,
 }
 
 static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
-                           uint64_t mmio_start, uint64_t mmio_size)
+                           uint64_t mmio_start, uint64_t mmio_size,
+                           struct xc_hvm_build_args *args)
 {
     struct hvm_info_table *hvm_info = (struct hvm_info_table *)
         (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
@@ -119,6 +120,10 @@ static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
     hvm_info->high_mem_pgend = highmem_end >> PAGE_SHIFT;
     hvm_info->reserved_mem_pgstart = ioreq_server_pfn(0);
 
+    args->lowmem_end = lowmem_end;
+    args->highmem_end = highmem_end;
+    args->mmio_start = mmio_start;
+
     /* Finish with the checksum. */
     for ( i = 0, sum = 0; i < hvm_info->length; i++ )
         sum += ((uint8_t *)hvm_info)[i];
@@ -244,7 +249,7 @@ static int setup_guest(xc_interface *xch,
                        char *image, unsigned long image_size)
 {
     xen_pfn_t *page_array = NULL;
-    unsigned long i, nr_pages = args->mem_size >> PAGE_SHIFT;
+    unsigned long i, j, nr_pages = args->mem_size >> PAGE_SHIFT;
     unsigned long target_pages = args->mem_target >> PAGE_SHIFT;
     uint64_t mmio_start = (1ull << 32) - args->mmio_size;
     uint64_t mmio_size = args->mmio_size;
@@ -258,13 +263,12 @@ static int setup_guest(xc_interface *xch,
     xen_capabilities_info_t caps;
     unsigned long stat_normal_pages = 0, stat_2mb_pages = 0, 
         stat_1gb_pages = 0;
-    int pod_mode = 0;
+    unsigned int memflags = 0;
     int claim_enabled = args->claim_enabled;
     xen_pfn_t special_array[NR_SPECIAL_PAGES];
     xen_pfn_t ioreq_server_array[NR_IOREQ_SERVER_PAGES];
-
-    if ( nr_pages > target_pages )
-        pod_mode = XENMEMF_populate_on_demand;
+    struct xc_vnuma_info dummy_vnuma_info;
+    uint64_t total_pages;
 
     memset(&elf, 0, sizeof(elf));
     if ( elf_init(&elf, image, image_size) != 0 )
@@ -276,6 +280,37 @@ static int setup_guest(xc_interface *xch,
     v_start = 0;
     v_end = args->mem_size;
 
+    if ( nr_pages > target_pages )
+        memflags |= XENMEMF_populate_on_demand;
+
+    if ( args->nr_vnuma_info == 0 )
+    {
+        /* Build dummy vnode information */
+        dummy_vnuma_info.vnode = 0;
+        dummy_vnuma_info.pnode = XC_VNUMA_NO_NODE;
+        dummy_vnuma_info.pages = args->mem_size >> PAGE_SHIFT;
+        args->nr_vnuma_info = 1;
+        args->vnuma_info = &dummy_vnuma_info;
+    }
+    else
+    {
+        if ( nr_pages > target_pages )
+        {
+            PERROR("Cannot enable vNUMA and PoD at the same time");
+            goto error_out;
+        }
+    }
+
+    total_pages = 0;
+    for ( i = 0; i < args->nr_vnuma_info; i++ )
+        total_pages += args->vnuma_info[i].pages;
+    if ( total_pages != (args->mem_size >> PAGE_SHIFT) )
+    {
+        PERROR("vNUMA memory pages mismatch (0x%"PRIx64" != 0x%"PRIx64")",
+               total_pages, args->mem_size >> PAGE_SHIFT);
+        goto error_out;
+    }
+
     if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
     {
         PERROR("Could not get Xen capabilities");
@@ -320,7 +355,7 @@ static int setup_guest(xc_interface *xch,
         }
     }
 
-    if ( pod_mode )
+    if ( memflags & XENMEMF_populate_on_demand )
     {
         /*
          * Subtract VGA_HOLE_SIZE from target_pages for the VGA
@@ -349,15 +384,32 @@ static int setup_guest(xc_interface *xch,
      * ensure that we can be preempted and hence dom0 remains responsive.
      */
     rc = xc_domain_populate_physmap_exact(
-        xch, dom, 0xa0, 0, pod_mode, &page_array[0x00]);
+        xch, dom, 0xa0, 0, memflags, &page_array[0x00]);
     cur_pages = 0xc0;
     stat_normal_pages = 0xc0;
 
+    for ( i = 0; i < args->nr_vnuma_info; i++ )
     {
-        while ( (rc == 0) && (nr_pages > cur_pages) )
+        unsigned int new_memflags = memflags;
+        uint64_t pages, finished;
+
+        if ( args->vnuma_info[i].pnode != XC_VNUMA_NO_NODE )
+        {
+            new_memflags |= XENMEMF_exact_node(args->vnuma_info[i].pnode);
+            new_memflags |= XENMEMF_exact_node_request;
+        }
+
+        pages = args->vnuma_info[i].pages;
+        /* Consider vga hole belongs to node 0 */
+        if ( i == 0 )
+            finished = 0xc0;
+        else
+            finished = 0;
+
+        while ( (rc == 0) && (pages > finished) )
         {
             /* Clip count to maximum 1GB extent. */
-            unsigned long count = nr_pages - cur_pages;
+            unsigned long count = pages - finished;
             unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
 
             if ( count > max_pages )
@@ -388,19 +440,20 @@ static int setup_guest(xc_interface *xch,
                 unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
                 xen_pfn_t sp_extents[nr_extents];
 
-                for ( i = 0; i < nr_extents; i++ )
-                    sp_extents[i] =
-                        page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
+                for ( j = 0; j < nr_extents; j++ )
+                    sp_extents[j] =
+                        page_array[cur_pages+(j<<SUPERPAGE_1GB_SHIFT)];
 
                 done = xc_domain_populate_physmap(xch, dom, nr_extents,
                                                   SUPERPAGE_1GB_SHIFT,
-                                                  pod_mode, sp_extents);
+                                                  memflags, sp_extents);
 
                 if ( done > 0 )
                 {
                     stat_1gb_pages += done;
                     done <<= SUPERPAGE_1GB_SHIFT;
                     cur_pages += done;
+                    finished += done;
                     count -= done;
                 }
             }
@@ -428,19 +481,19 @@ static int setup_guest(xc_interface *xch,
                     unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
                     xen_pfn_t sp_extents[nr_extents];
 
-                    for ( i = 0; i < nr_extents; i++ )
-                        sp_extents[i] =
-                            page_array[cur_pages+(i<<SUPERPAGE_2MB_SHIFT)];
+                    for ( j = 0; j < nr_extents; j++ )
+                        sp_extents[j] =
+                            page_array[cur_pages+(j<<SUPERPAGE_2MB_SHIFT)];
 
                     done = xc_domain_populate_physmap(xch, dom, nr_extents,
                                                       SUPERPAGE_2MB_SHIFT,
-                                                      pod_mode, sp_extents);
-
+                                                      memflags, sp_extents);
                     if ( done > 0 )
                     {
                         stat_2mb_pages += done;
                         done <<= SUPERPAGE_2MB_SHIFT;
                         cur_pages += done;
+                        finished += done;
                         count -= done;
                     }
                 }
@@ -450,11 +503,15 @@ static int setup_guest(xc_interface *xch,
             if ( count != 0 )
             {
                 rc = xc_domain_populate_physmap_exact(
-                    xch, dom, count, 0, pod_mode, &page_array[cur_pages]);
+                    xch, dom, count, 0, new_memflags, &page_array[cur_pages]);
                 cur_pages += count;
+                finished += count;
                 stat_normal_pages += count;
             }
         }
+
+        if ( rc != 0 )
+            break;
     }
 
     if ( rc != 0 )
@@ -478,7 +535,7 @@ static int setup_guest(xc_interface *xch,
               xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
               HVM_INFO_PFN)) == NULL )
         goto error_out;
-    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
+    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size, args);
     munmap(hvm_info_page, PAGE_SIZE);
 
     /* Allocate and clear special pages. */
@@ -617,6 +674,9 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
             args.acpi_module.guest_addr_out;
         hvm_args->smbios_module.guest_addr_out = 
             args.smbios_module.guest_addr_out;
+        hvm_args->lowmem_end = args.lowmem_end;
+        hvm_args->highmem_end = args.highmem_end;
+        hvm_args->mmio_start = args.mmio_start;
     }
 
     free(image);
-- 
1.7.10.4


* [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen for HVM guest
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (14 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:41   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Transform the user-supplied vNUMA configuration into libxl internal
representations and then into libxc representations. Check validity
along the way.

Libxc is more involved in building vmemranges in the HVM case than in
the PV case. The building of vmemranges is placed after xc_hvm_build
returns, because it relies on the memory hole information provided by
xc_hvm_build.
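
As a worked example with illustrative numbers, assuming the default 256MB
MMIO hole immediately below 4GB: for a guest with two 3000MB vnodes,
lowmem_end and mmio_start are both 0xf0000000, so hole_start = 0xf0000000
and hole_end = 0x100000000, and the builder produces:

    vnode 0: [0x000000000, 0x0bb800000)   3000MB, entirely below the hole
    vnode 1: [0x0bb800000, 0x0f0000000)    840MB, remainder of low memory
             [0x100000000, 0x187000000)   2160MB, above 4GB

so a vnode whose memory straddles the hole simply ends up being described
by two vmemranges.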

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v4:
1. Adapt to new interface.
2. Rename some variables.
3. Use GCREALLOC_ARRAY.

Changes in v3:
1. Rewrite commit log.
---
 tools/libxl/libxl_create.c   |    9 +++++++
 tools/libxl/libxl_dom.c      |   27 ++++++++++++++++++++
 tools/libxl/libxl_internal.h |    5 ++++
 tools/libxl/libxl_vnuma.c    |   56 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 98687bd..af04248 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -853,6 +853,15 @@ static void initiate_domain_create(libxl__egc *egc,
         goto error_out;
     }
 
+    /* Disallow PoD and vNUMA to be enabled at the same time because PoD
+     * pool is not vNUMA-aware yet.
+     */
+    if (pod_enabled && d_config->b_info.num_vnuma_nodes) {
+        ret = ERROR_INVAL;
+        LOG(ERROR, "Cannot enable PoD and vNUMA at the same time");
+        goto error_out;
+    }
+
     ret = libxl__domain_create_info_setdefault(gc, &d_config->c_info);
     if (ret) goto error_out;
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index b06fd65..d229870 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -885,12 +885,39 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         goto out;
     }
 
+    if (info->num_vnuma_nodes != 0) {
+        int i;
+
+        args.nr_vnuma_info = info->num_vnuma_nodes;
+        args.vnuma_info = libxl__malloc(gc, sizeof(*args.vnuma_info) *
+                                        args.nr_vnuma_info);
+        for (i = 0; i < args.nr_vnuma_info; i++) {
+            args.vnuma_info[i].vnode = i;
+            args.vnuma_info[i].pnode = info->vnuma_nodes[i].pnode;
+            args.vnuma_info[i].pages = info->vnuma_nodes[i].memkb >> 2;
+        }
+
+        /* Consider video ram belongs to node 0 */
+        args.vnuma_info[0].pages -= (info->video_memkb >> 2);
+    }
+
     ret = xc_hvm_build(ctx->xch, domid, &args);
     if (ret) {
         LOGEV(ERROR, ret, "hvm building failed");
         goto out;
     }
 
+    if (info->num_vnuma_nodes != 0) {
+        ret = libxl__vnuma_build_vmemrange_hvm(gc, domid, info, state, &args);
+        if (ret) {
+            LOGEV(ERROR, ret, "hvm build vmemranges failed");
+            goto out;
+        }
+        ret = libxl__vnuma_config_check(gc, info, state);
+        if (ret) goto out;
+        ret = set_vnuma_info(gc, domid, info, state);
+        if (ret) goto out;
+    }
     ret = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
                                &state->store_mfn, state->console_port,
                                &state->console_mfn, state->store_domid,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 73d533a..a5513c9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3403,6 +3403,11 @@ int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
                                     uint32_t domid,
                                     libxl_domain_build_info *b_info,
                                     libxl__domain_build_state *state);
+int libxl__vnuma_build_vmemrange_hvm(libxl__gc *gc,
+                                     uint32_t domid,
+                                     libxl_domain_build_info *b_info,
+                                     libxl__domain_build_state *state,
+                                     struct xc_hvm_build_args *args);
 
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
index 0189a4b..62ee6da 100644
--- a/tools/libxl/libxl_vnuma.c
+++ b/tools/libxl/libxl_vnuma.c
@@ -162,6 +162,62 @@ int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
     return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
 }
 
+/* Build vmemranges for HVM guest */
+int libxl__vnuma_build_vmemrange_hvm(libxl__gc *gc,
+                                     uint32_t domid,
+                                     libxl_domain_build_info *b_info,
+                                     libxl__domain_build_state *state,
+                                     struct xc_hvm_build_args *args)
+{
+    uint64_t hole_start, hole_end, next;
+    int nid, nr_vmemrange;
+    xen_vmemrange_t *vmemranges;
+
+    /* Derive vmemranges from vnode size and memory hole.
+     *
+     * Guest physical address space layout:
+     * [0, hole_start) [hole_start, hole_end) [hole_end, highmem_end)
+     */
+    hole_start = args->lowmem_end < args->mmio_start ?
+        args->lowmem_end : args->mmio_start;
+    hole_end = (args->mmio_start + args->mmio_size) > (1ULL << 32) ?
+        (args->mmio_start + args->mmio_size) : (1ULL << 32);
+
+    assert(state->vmemranges == NULL);
+
+    next = 0;
+    nr_vmemrange = 0;
+    vmemranges = NULL;
+    for (nid = 0; nid < b_info->num_vnuma_nodes; nid++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[nid];
+        uint64_t remaining_bytes = p->memkb << 10;
+
+        while (remaining_bytes > 0) {
+            uint64_t count = remaining_bytes;
+
+            if (next >= hole_start && next < hole_end)
+                next = hole_end;
+            if ((next < hole_start) && (next + remaining_bytes >= hole_start))
+                count = hole_start - next;
+
+            GCREALLOC_ARRAY(vmemranges, nr_vmemrange+1);
+            vmemranges[nr_vmemrange].start = next;
+            vmemranges[nr_vmemrange].end = next + count;
+            vmemranges[nr_vmemrange].flags = 0;
+            vmemranges[nr_vmemrange].nid = nid;
+
+            nr_vmemrange++;
+            remaining_bytes -= count;
+            next += count;
+        }
+    }
+
+    state->vmemranges = vmemranges;
+    state->num_vmemranges = nr_vmemrange;
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
1.7.10.4


* [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (15 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:41   ` Ian Campbell
  2015-01-23 11:13 ` [PATCH v4 18/21] libxlu: rework internal representation of setting Wei Liu
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

Disallow memory relocation when vNUMA is enabled, because relocated
memory ends up off-node. Furthermore, even if we dynamically expanded
node coverage in hvmloader, low memory and high memory may reside on
different physical nodes, so blindly relocating low memory to high
memory would give us a sub-optimal configuration.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
---
 tools/libxl/libxl_dm.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index c2b0487..201c9e5 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -1356,13 +1356,15 @@ void libxl__spawn_local_dm(libxl__egc *egc, libxl__dm_spawn_state *dmss)
                         libxl__sprintf(gc, "%s/hvmloader/bios", path),
                         "%s", libxl_bios_type_to_string(b_info->u.hvm.bios));
         /* Disable relocating memory to make the MMIO hole larger
-         * unless we're running qemu-traditional */
+         * unless we're running qemu-traditional and vNUMA is not
+         * configured. */
         libxl__xs_write(gc, XBT_NULL,
                         libxl__sprintf(gc,
                                        "%s/hvmloader/allow-memory-relocate",
                                        path),
                         "%d",
-                        b_info->device_model_version==LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL);
+                        b_info->device_model_version==LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL &&
+                        !b_info->num_vnuma_nodes);
         free(path);
     }
 
-- 
1.7.10.4


* [PATCH v4 18/21] libxlu: rework internal representation of setting
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (16 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:41   ` Ian Campbell
  2015-02-11 16:12   ` Ian Jackson
  2015-01-23 11:13 ` [PATCH v4 19/21] libxlu: nested list support Wei Liu
                   ` (2 subsequent siblings)
  20 siblings, 2 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

This patch does the following things:

1. Properly define a XLU_ConfigList type. Originally it was defined to
   be XLU_ConfigSetting.
2. Define XLU_ConfigValue type, which can be either a string or a list
   of XLU_ConfigValue.
3. ConfigSetting now references XLU_ConfigValue. Originally it only
   worked with **string.
4. Properly construct list where necessary, see changes to .y file.

To achieve above changes:

1. xlu__cfg_set_mk and xlu__cfg_set_add are deleted, because they
   are no longer needed in the new code.
2. Introduce xlu__cfg_string_mk to make a XLU_ConfigSetting that points
   to a XLU_ConfigValue that wraps a string.
3. Introduce xlu__cfg_list_mk to make a XLU_ConfigSetting that points
   to XLU_ConfigValue that is a list.
4. The parser now generates XLU_ConfigValue instead of XLU_ConfigSetting
   when constructing values, which enables us to recursively generate
   lists of lists.
5. XLU_ConfigSetting is generated in xlu__cfg_set_store.
6. Adapt other functions to use new types.

No change to the public API. Xl compiles without problems and 'xl create -n
guest.cfg' is valgrind clean.

This patch is needed because we're going to implement nested list
support, which requires support for lists of lists.
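
For orientation, the reworked internal layout looks roughly like the sketch
below. This is reconstructed from the code changes that follow, not the
literal libxlu_internal.h hunk; field order, type names and auxiliary
fields may differ in the real header.

    typedef struct XLU_ConfigValue XLU_ConfigValue;

    typedef struct XLU_ConfigList {
        int nvalues;                   /* number of entries in values[] */
        XLU_ConfigValue **values;
    } XLU_ConfigList;

    struct XLU_ConfigValue {
        enum { XLU_STRING, XLU_LIST } type;
        union {
            char *string;              /* valid when type == XLU_STRING */
            XLU_ConfigList list;       /* valid when type == XLU_LIST; may nest */
        } u;
    };

    struct XLU_ConfigSetting {         /* one "name = value" assignment */
        char *name;
        XLU_ConfigValue *value;        /* was: char **values plus counts */
        int lineno;
        struct XLU_ConfigSetting *next;
    };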

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxl/libxlu_cfg.c      |  161 ++++++++++++++++++++++++++++-------------
 tools/libxl/libxlu_cfg_i.h    |   12 ++-
 tools/libxl/libxlu_cfg_y.c    |   24 +++---
 tools/libxl/libxlu_cfg_y.h    |    2 +-
 tools/libxl/libxlu_cfg_y.y    |   14 ++--
 tools/libxl/libxlu_internal.h |   29 ++++++--
 6 files changed, 161 insertions(+), 81 deletions(-)

diff --git a/tools/libxl/libxlu_cfg.c b/tools/libxl/libxlu_cfg.c
index 22adcb0..23b0cdb 100644
--- a/tools/libxl/libxlu_cfg.c
+++ b/tools/libxl/libxlu_cfg.c
@@ -131,14 +131,28 @@ int xlu_cfg_readdata(XLU_Config *cfg, const char *data, int length) {
     return ctx.err;
 }
 
-void xlu__cfg_set_free(XLU_ConfigSetting *set) {
+void xlu__cfg_value_free(XLU_ConfigValue *value)
+{
     int i;
 
+    if (!value) return;
+
+    switch (value->type) {
+    case XLU_STRING:
+        free(value->u.string);
+        break;
+    case XLU_LIST:
+        for (i = 0; i < value->u.list.nvalues; i++)
+            xlu__cfg_value_free(value->u.list.values[i]);
+        free(value->u.list.values);
+    }
+    free(value);
+}
+
+void xlu__cfg_set_free(XLU_ConfigSetting *set) {
     if (!set) return;
     free(set->name);
-    for (i=0; i<set->nvalues; i++)
-        free(set->values[i]);
-    free(set->values);
+    xlu__cfg_value_free(set->value);
     free(set);
 }
 
@@ -173,7 +187,7 @@ static int find_atom(const XLU_Config *cfg, const char *n,
     set= find(cfg,n);
     if (!set) return ESRCH;
 
-    if (set->avalues!=1) {
+    if (set->value->type!=XLU_STRING) {
         if (!dont_warn)
             fprintf(cfg->report,
                     "%s:%d: warning: parameter `%s' is"
@@ -191,7 +205,7 @@ int xlu_cfg_get_string(const XLU_Config *cfg, const char *n,
     int e;
 
     e= find_atom(cfg,n,&set,dont_warn);  if (e) return e;
-    *value_r= set->values[0];
+    *value_r= set->value->u.string;
     return 0;
 }
 
@@ -202,7 +216,7 @@ int xlu_cfg_replace_string(const XLU_Config *cfg, const char *n,
 
     e= find_atom(cfg,n,&set,dont_warn);  if (e) return e;
     free(*value_r);
-    *value_r= strdup(set->values[0]);
+    *value_r= strdup(set->value->u.string);
     return 0;
 }
 
@@ -214,7 +228,7 @@ int xlu_cfg_get_long(const XLU_Config *cfg, const char *n,
     char *ep;
 
     e= find_atom(cfg,n,&set,dont_warn);  if (e) return e;
-    errno= 0; l= strtol(set->values[0], &ep, 0);
+    errno= 0; l= strtol(set->value->u.string, &ep, 0);
     e= errno;
     if (errno) {
         e= errno;
@@ -226,7 +240,7 @@ int xlu_cfg_get_long(const XLU_Config *cfg, const char *n,
                     cfg->config_source, set->lineno, n, strerror(e));
         return e;
     }
-    if (*ep || ep==set->values[0]) {
+    if (*ep || ep==set->value->u.string) {
         if (!dont_warn)
             fprintf(cfg->report,
                     "%s:%d: warning: parameter `%s' is not a valid number\n",
@@ -253,7 +267,7 @@ int xlu_cfg_get_list(const XLU_Config *cfg, const char *n,
                      XLU_ConfigList **list_r, int *entries_r, int dont_warn) {
     XLU_ConfigSetting *set;
     set= find(cfg,n);  if (!set) return ESRCH;
-    if (set->avalues==1) {
+    if (set->value->type!=XLU_LIST) {
         if (!dont_warn) {
             fprintf(cfg->report,
                     "%s:%d: warning: parameter `%s' is a single value"
@@ -262,8 +276,8 @@ int xlu_cfg_get_list(const XLU_Config *cfg, const char *n,
         }
         return EINVAL;
     }
-    if (list_r) *list_r= set;
-    if (entries_r) *entries_r= set->nvalues;
+    if (list_r) *list_r= &set->value->u.list;
+    if (entries_r) *entries_r= set->value->u.list.nvalues;
     return 0;
 }
 
@@ -290,72 +304,117 @@ int xlu_cfg_get_list_as_string_list(const XLU_Config *cfg, const char *n,
     return 0;
 }
 
-const char *xlu_cfg_get_listitem(const XLU_ConfigList *set, int entry) {
-    if (entry < 0 || entry >= set->nvalues) return 0;
-    return set->values[entry];
+const char *xlu_cfg_get_listitem(const XLU_ConfigList *list, int entry) {
+    if (entry < 0 || entry >= list->nvalues) return 0;
+    if (list->values[entry]->type != XLU_STRING) return 0;
+    return list->values[entry]->u.string;
 }
 
 
-XLU_ConfigSetting *xlu__cfg_set_mk(CfgParseContext *ctx,
-                                   int alloc, char *atom) {
-    XLU_ConfigSetting *set= 0;
+XLU_ConfigValue *xlu__cfg_string_mk(CfgParseContext *ctx, char *atom)
+{
+    XLU_ConfigValue *value = NULL;
 
     if (ctx->err) goto x;
-    assert(!!alloc == !!atom);
 
-    set= malloc(sizeof(*set));
-    if (!set) goto xe;
+    value = malloc(sizeof(*value));
+    if (!value) goto xe;
+    value->type = XLU_STRING;
+    value->u.string = atom;
 
-    set->name= 0; /* tbd */
-    set->avalues= alloc;
+    return value;
 
-    if (!alloc) {
-        set->nvalues= 0;
-        set->values= 0;
-    } else {
-        set->values= malloc(sizeof(*set->values) * alloc);
-        if (!set->values) goto xe;
+ xe:
+    ctx->err= errno;
+ x:
+    free(value);
+    free(atom);
+    return NULL;
+}
 
-        set->nvalues= 1;
-        set->values[0]= atom;
-    }
-    return set;
+XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx, char *atom)
+{
+    XLU_ConfigValue *value = NULL;
+    XLU_ConfigValue **values = NULL;
+    XLU_ConfigValue *val = NULL;
+
+    if (ctx->err) goto x;
+
+    val = malloc(sizeof(*val));
+    if (!val) goto xe;
+    val->type = XLU_STRING;
+    val->u.string = atom;
+
+    values = malloc(sizeof(*values));
+    if (!values) goto xe;
+    values[0] = val;
+
+    value = malloc(sizeof(*value));
+    if (!value) goto xe;
+    value->type = XLU_LIST;
+    value->u.list.nvalues = 1;
+    value->u.list.values = values;
+
+    return value;
 
  xe:
     ctx->err= errno;
  x:
-    free(set);
+    free(value);
+    free(values);
+    free(val);
     free(atom);
-    return 0;
+    return NULL;
 }
 
-void xlu__cfg_set_add(CfgParseContext *ctx, XLU_ConfigSetting *set,
-                      char *atom) {
+void xlu__cfg_list_append(CfgParseContext *ctx,
+                          XLU_ConfigValue *list,
+                          char *atom)
+{
+    XLU_ConfigValue **new_val = NULL;
+    XLU_ConfigValue *val = NULL;
     if (ctx->err) return;
 
     assert(atom);
+    assert(list->type == XLU_LIST);
 
-    if (set->nvalues >= set->avalues) {
-        int new_avalues;
-        char **new_values;
-
-        if (set->avalues > INT_MAX / 100) { ctx->err= ERANGE; return; }
-        new_avalues= set->avalues * 4;
-        new_values= realloc(set->values,
-                            sizeof(*new_values) * new_avalues);
-        if (!new_values) { ctx->err= errno; free(atom); return; }
-        set->values= new_values;
-        set->avalues= new_avalues;
-    }
-    set->values[set->nvalues++]= atom;
+    new_val = realloc(list->u.list.values,
+                      sizeof(*new_val) * (list->u.list.nvalues+1));
+    if (!new_val) goto xe;
+
+    val = malloc(sizeof(*val));
+    if (!val) goto xe;
+    val->type = XLU_STRING;
+    val->u.string = atom;
+
+    list->u.list.values = new_val;
+    list->u.list.values[list->u.list.nvalues] = val;
+    list->u.list.nvalues++;
+
+    return;
+
+ xe:
+    ctx->err = errno;
+    free(new_val);
+    free(val);
+    free(atom);
+    return;
 }
 
 void xlu__cfg_set_store(CfgParseContext *ctx, char *name,
-                        XLU_ConfigSetting *set, int lineno) {
+                        XLU_ConfigValue *val, int lineno) {
+    XLU_ConfigSetting *set;
+
     if (ctx->err) return;
 
     assert(name);
+    set = malloc(sizeof(*set));
+    if (!set) {
+        ctx->err = errno;
+        return;
+    }
     set->name= name;
+    set->value = val;
     set->lineno= lineno;
     set->next= ctx->cfg->settings;
     ctx->cfg->settings= set;
diff --git a/tools/libxl/libxlu_cfg_i.h b/tools/libxl/libxlu_cfg_i.h
index 54d033c..b71e9fd 100644
--- a/tools/libxl/libxlu_cfg_i.h
+++ b/tools/libxl/libxlu_cfg_i.h
@@ -23,11 +23,15 @@
 #include "libxlu_cfg_y.h"
 
 void xlu__cfg_set_free(XLU_ConfigSetting *set);
-XLU_ConfigSetting *xlu__cfg_set_mk(CfgParseContext*, int alloc, char *atom);
-void xlu__cfg_set_add(CfgParseContext*, XLU_ConfigSetting *set, char *atom);
 void xlu__cfg_set_store(CfgParseContext*, char *name,
-                        XLU_ConfigSetting *set, int lineno);
-
+                        XLU_ConfigValue *val, int lineno);
+XLU_ConfigValue *xlu__cfg_string_mk(CfgParseContext *ctx,
+                                    char *atom);
+XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx, char *atom);
+void xlu__cfg_list_append(CfgParseContext *ctx,
+                          XLU_ConfigValue *list,
+                          char *atom);
+void xlu__cfg_value_free(XLU_ConfigValue *value);
 char *xlu__cfgl_strdup(CfgParseContext*, const char *src);
 char *xlu__cfgl_dequote(CfgParseContext*, const char *src);
 
diff --git a/tools/libxl/libxlu_cfg_y.c b/tools/libxl/libxlu_cfg_y.c
index 07b5a1d..eb3884f 100644
--- a/tools/libxl/libxlu_cfg_y.c
+++ b/tools/libxl/libxlu_cfg_y.c
@@ -126,7 +126,7 @@ typedef union YYSTYPE
 #line 25 "libxlu_cfg_y.y"
 
   char *string;
-  XLU_ConfigSetting *setting;
+  XLU_ConfigValue *value;
 
 
 
@@ -1148,7 +1148,7 @@ yydestruct (yymsg, yytype, yyvaluep, yylocationp, ctx)
 
 /* Line 1391 of yacc.c  */
 #line 43 "libxlu_cfg_y.y"
-	{ xlu__cfg_set_free((yyvaluep->setting)); };
+	{ xlu__cfg_value_free((yyvaluep->value)); };
 
 /* Line 1391 of yacc.c  */
 #line 1155 "libxlu_cfg_y.c"
@@ -1166,7 +1166,7 @@ yydestruct (yymsg, yytype, yyvaluep, yylocationp, ctx)
 
 /* Line 1391 of yacc.c  */
 #line 43 "libxlu_cfg_y.y"
-	{ xlu__cfg_set_free((yyvaluep->setting)); };
+	{ xlu__cfg_value_free((yyvaluep->value)); };
 
 /* Line 1391 of yacc.c  */
 #line 1173 "libxlu_cfg_y.c"
@@ -1175,7 +1175,7 @@ yydestruct (yymsg, yytype, yyvaluep, yylocationp, ctx)
 
 /* Line 1391 of yacc.c  */
 #line 43 "libxlu_cfg_y.y"
-	{ xlu__cfg_set_free((yyvaluep->setting)); };
+	{ xlu__cfg_value_free((yyvaluep->value)); };
 
 /* Line 1391 of yacc.c  */
 #line 1182 "libxlu_cfg_y.c"
@@ -1508,21 +1508,21 @@ yyreduce:
 
 /* Line 1806 of yacc.c  */
 #line 57 "libxlu_cfg_y.y"
-    { xlu__cfg_set_store(ctx,(yyvsp[(1) - (3)].string),(yyvsp[(3) - (3)].setting),(yylsp[(3) - (3)]).first_line); }
+    { xlu__cfg_set_store(ctx,(yyvsp[(1) - (3)].string),(yyvsp[(3) - (3)].value),(yylsp[(3) - (3)]).first_line); }
     break;
 
   case 12:
 
 /* Line 1806 of yacc.c  */
 #line 62 "libxlu_cfg_y.y"
-    { (yyval.setting)= xlu__cfg_set_mk(ctx,1,(yyvsp[(1) - (1)].string)); }
+    { (yyval.value)= xlu__cfg_string_mk(ctx,(yyvsp[(1) - (1)].string)); }
     break;
 
   case 13:
 
 /* Line 1806 of yacc.c  */
 #line 63 "libxlu_cfg_y.y"
-    { (yyval.setting)= (yyvsp[(3) - (4)].setting); }
+    { (yyval.value)= (yyvsp[(3) - (4)].value); }
     break;
 
   case 14:
@@ -1543,35 +1543,35 @@ yyreduce:
 
 /* Line 1806 of yacc.c  */
 #line 68 "libxlu_cfg_y.y"
-    { (yyval.setting)= xlu__cfg_set_mk(ctx,0,0); }
+    { (yyval.value)= xlu__cfg_list_mk(ctx,NULL); }
     break;
 
   case 17:
 
 /* Line 1806 of yacc.c  */
 #line 69 "libxlu_cfg_y.y"
-    { (yyval.setting)= (yyvsp[(1) - (1)].setting); }
+    { (yyval.value)= (yyvsp[(1) - (1)].value); }
     break;
 
   case 18:
 
 /* Line 1806 of yacc.c  */
 #line 70 "libxlu_cfg_y.y"
-    { (yyval.setting)= (yyvsp[(1) - (3)].setting); }
+    { (yyval.value)= (yyvsp[(1) - (3)].value); }
     break;
 
   case 19:
 
 /* Line 1806 of yacc.c  */
 #line 72 "libxlu_cfg_y.y"
-    { (yyval.setting)= xlu__cfg_set_mk(ctx,2,(yyvsp[(1) - (2)].string)); }
+    { (yyval.value)= xlu__cfg_list_mk(ctx,(yyvsp[(1) - (2)].string)); }
     break;
 
   case 20:
 
 /* Line 1806 of yacc.c  */
 #line 73 "libxlu_cfg_y.y"
-    { xlu__cfg_set_add(ctx,(yyvsp[(1) - (5)].setting),(yyvsp[(4) - (5)].string)); (yyval.setting)= (yyvsp[(1) - (5)].setting); }
+    { xlu__cfg_list_append(ctx,(yyvsp[(1) - (5)].value),(yyvsp[(4) - (5)].string)); (yyval.value)= (yyvsp[(1) - (5)].value); }
     break;
 
 
diff --git a/tools/libxl/libxlu_cfg_y.h b/tools/libxl/libxlu_cfg_y.h
index d7dfaf2..37e8213 100644
--- a/tools/libxl/libxlu_cfg_y.h
+++ b/tools/libxl/libxlu_cfg_y.h
@@ -54,7 +54,7 @@ typedef union YYSTYPE
 #line 25 "libxlu_cfg_y.y"
 
   char *string;
-  XLU_ConfigSetting *setting;
+  XLU_ConfigValue *value;
 
 
 
diff --git a/tools/libxl/libxlu_cfg_y.y b/tools/libxl/libxlu_cfg_y.y
index 5acd438..6848686 100644
--- a/tools/libxl/libxlu_cfg_y.y
+++ b/tools/libxl/libxlu_cfg_y.y
@@ -24,7 +24,7 @@
 
 %union {
   char *string;
-  XLU_ConfigSetting *setting;
+  XLU_ConfigValue *value;
 }
 
 %locations
@@ -39,8 +39,8 @@
 %type <string>            atom
 %destructor { free($$); } atom IDENT STRING NUMBER
 
-%type <setting>                         value valuelist values
-%destructor { xlu__cfg_set_free($$); }  value valuelist values
+%type <value>                             value valuelist values
+%destructor { xlu__cfg_value_free($$); }  value valuelist values
 
 %%
 
@@ -59,18 +59,18 @@ assignment: IDENT '=' value { xlu__cfg_set_store(ctx,$1,$3,@3.first_line); }
 endstmt: NEWLINE
  |      ';'
 
-value:  atom                         { $$= xlu__cfg_set_mk(ctx,1,$1); }
+value:  atom                         { $$= xlu__cfg_string_mk(ctx,$1); }
  |      '[' nlok valuelist ']'       { $$= $3; }
 
 atom:   STRING                   { $$= $1; }
  |      NUMBER                   { $$= $1; }
 
-valuelist: /* empty */           { $$= xlu__cfg_set_mk(ctx,0,0); }
+valuelist: /* empty */           { $$= xlu__cfg_list_mk(ctx,NULL); }
  |      values                  { $$= $1; }
  |      values ',' nlok         { $$= $1; }
 
-values: atom nlok                  { $$= xlu__cfg_set_mk(ctx,2,$1); }
- |      values ',' nlok atom nlok  { xlu__cfg_set_add(ctx,$1,$4); $$= $1; }
+values: atom nlok                  { $$= xlu__cfg_list_mk(ctx,$1); }
+ |      values ',' nlok atom nlok  { xlu__cfg_list_append(ctx,$1,$4); $$= $1; }
 
 nlok:
         /* nothing */
diff --git a/tools/libxl/libxlu_internal.h b/tools/libxl/libxlu_internal.h
index 7579158..3f4ef80 100644
--- a/tools/libxl/libxlu_internal.h
+++ b/tools/libxl/libxlu_internal.h
@@ -23,17 +23,34 @@
 #include <assert.h>
 #include <regex.h>
 
-#define XLU_ConfigList XLU_ConfigSetting
-
 #include "libxlutil.h"
 
-struct XLU_ConfigSetting { /* transparent */
+enum XLU_ConfigValueType {
+    XLU_STRING,
+    XLU_LIST,
+};
+
+typedef struct XLU_ConfigValue XLU_ConfigValue;
+
+typedef struct XLU_ConfigList {
+    int nvalues;
+    XLU_ConfigValue **values;
+} XLU_ConfigList;
+
+struct XLU_ConfigValue {
+    enum XLU_ConfigValueType type;
+    union {
+        char *string;
+        XLU_ConfigList list;
+    } u;
+};
+
+typedef struct XLU_ConfigSetting { /* transparent */
     struct XLU_ConfigSetting *next;
     char *name;
-    int nvalues, avalues; /* lists have avalues>1 */
-    char **values;
+    XLU_ConfigValue *value;
     int lineno;
-};
+} XLU_ConfigSetting;
 
 struct XLU_Config {
     XLU_ConfigSetting *settings;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 19/21] libxlu: nested list support
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (17 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 18/21] libxlu: rework internal representation of setting Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-02-11 16:13   ` Ian Jackson
  2015-01-23 11:13 ` [PATCH v4 20/21] libxlu: introduce new APIs Wei Liu
  2015-01-23 11:13 ` [PATCH v4 21/21] xl: vNUMA support Wei Liu
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

1. Extend the grammar of the parser.
2. Adjust internal functions to accept XLU_ConfigValue instead of
   char *.
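
As an illustration only (not part of this patch), the reworked
representation can be traversed recursively; a minimal sketch against the
internal types in libxlu_internal.h:

    /* Illustration: count the string atoms in a (possibly nested)
     * XLU_ConfigValue tree. */
    static int count_atoms(const XLU_ConfigValue *value)
    {
        int i, n;

        switch (value->type) {
        case XLU_STRING:
            return 1;
        case XLU_LIST:
            n = 0;
            for (i = 0; i < value->u.list.nvalues; i++)
                n += count_atoms(value->u.list.values[i]);
            return n;
        }
        return 0;
    }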

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxl/libxlu_cfg.c   |   25 ++++++-------------------
 tools/libxl/libxlu_cfg_i.h |    5 +++--
 tools/libxl/libxlu_cfg_y.c |   26 +++++++++++++-------------
 tools/libxl/libxlu_cfg_y.y |    4 ++--
 4 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/tools/libxl/libxlu_cfg.c b/tools/libxl/libxlu_cfg.c
index 23b0cdb..e072046 100644
--- a/tools/libxl/libxlu_cfg.c
+++ b/tools/libxl/libxlu_cfg.c
@@ -332,19 +332,14 @@ XLU_ConfigValue *xlu__cfg_string_mk(CfgParseContext *ctx, char *atom)
     return NULL;
 }
 
-XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx, char *atom)
+XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx,
+                                  XLU_ConfigValue *val)
 {
     XLU_ConfigValue *value = NULL;
     XLU_ConfigValue **values = NULL;
-    XLU_ConfigValue *val = NULL;
 
     if (ctx->err) goto x;
 
-    val = malloc(sizeof(*val));
-    if (!val) goto xe;
-    val->type = XLU_STRING;
-    val->u.string = atom;
-
     values = malloc(sizeof(*values));
     if (!values) goto xe;
     values[0] = val;
@@ -362,31 +357,24 @@ XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx, char *atom)
  x:
     free(value);
     free(values);
-    free(val);
-    free(atom);
+    xlu__cfg_value_free(val);
     return NULL;
 }
 
 void xlu__cfg_list_append(CfgParseContext *ctx,
                           XLU_ConfigValue *list,
-                          char *atom)
+                          XLU_ConfigValue *val)
 {
     XLU_ConfigValue **new_val = NULL;
-    XLU_ConfigValue *val = NULL;
     if (ctx->err) return;
 
-    assert(atom);
+    assert(val);
     assert(list->type == XLU_LIST);
 
     new_val = realloc(list->u.list.values,
                       sizeof(*new_val) * (list->u.list.nvalues+1));
     if (!new_val) goto xe;
 
-    val = malloc(sizeof(*val));
-    if (!val) goto xe;
-    val->type = XLU_STRING;
-    val->u.string = atom;
-
     list->u.list.values = new_val;
     list->u.list.values[list->u.list.nvalues] = val;
     list->u.list.nvalues++;
@@ -396,8 +384,7 @@ void xlu__cfg_list_append(CfgParseContext *ctx,
  xe:
     ctx->err = errno;
     free(new_val);
-    free(val);
-    free(atom);
+    xlu__cfg_value_free(val);
     return;
 }
 
diff --git a/tools/libxl/libxlu_cfg_i.h b/tools/libxl/libxlu_cfg_i.h
index b71e9fd..11dc33f 100644
--- a/tools/libxl/libxlu_cfg_i.h
+++ b/tools/libxl/libxlu_cfg_i.h
@@ -27,10 +27,11 @@ void xlu__cfg_set_store(CfgParseContext*, char *name,
                         XLU_ConfigValue *val, int lineno);
 XLU_ConfigValue *xlu__cfg_string_mk(CfgParseContext *ctx,
                                     char *atom);
-XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx, char *atom);
+XLU_ConfigValue *xlu__cfg_list_mk(CfgParseContext *ctx,
+                                  XLU_ConfigValue *val);
 void xlu__cfg_list_append(CfgParseContext *ctx,
                           XLU_ConfigValue *list,
-                          char *atom);
+                          XLU_ConfigValue *val);
 void xlu__cfg_value_free(XLU_ConfigValue *value);
 char *xlu__cfgl_strdup(CfgParseContext*, const char *src);
 char *xlu__cfgl_dequote(CfgParseContext*, const char *src);
diff --git a/tools/libxl/libxlu_cfg_y.c b/tools/libxl/libxlu_cfg_y.c
index eb3884f..b05e48b 100644
--- a/tools/libxl/libxlu_cfg_y.c
+++ b/tools/libxl/libxlu_cfg_y.c
@@ -377,7 +377,7 @@ union yyalloc
 /* YYFINAL -- State number of the termination state.  */
 #define YYFINAL  3
 /* YYLAST -- Last index in YYTABLE.  */
-#define YYLAST   24
+#define YYLAST   25
 
 /* YYNTOKENS -- Number of terminals.  */
 #define YYNTOKENS  12
@@ -444,8 +444,8 @@ static const yytype_int8 yyrhs[] =
       15,    -1,    16,    17,    -1,    17,    -1,     1,     6,    -1,
        3,     7,    18,    -1,     6,    -1,     8,    -1,    19,    -1,
        9,    22,    20,    10,    -1,     4,    -1,     5,    -1,    -1,
-      21,    -1,    21,    11,    22,    -1,    19,    22,    -1,    21,
-      11,    22,    19,    22,    -1,    -1,    22,     6,    -1
+      21,    -1,    21,    11,    22,    -1,    18,    22,    -1,    21,
+      11,    22,    18,    22,    -1,    -1,    22,     6,    -1
 };
 
 /* YYRLINE[YYN] -- source line where rule number YYN was defined.  */
@@ -517,14 +517,14 @@ static const yytype_int8 yydefgoto[] =
 static const yytype_int8 yypact[] =
 {
      -18,     4,     0,   -18,    -1,     6,   -18,   -18,   -18,     3,
-     -18,   -18,    11,   -18,   -18,   -18,   -18,   -18,   -18,    13,
-     -18,   -18,    12,    10,    17,   -18,   -18,    13,   -18,    17
+     -18,   -18,    14,   -18,   -18,   -18,   -18,   -18,   -18,    11,
+     -18,   -18,    12,    10,    18,   -18,   -18,    11,   -18,    18
 };
 
 /* YYPGOTO[NTERM-NUM].  */
 static const yytype_int8 yypgoto[] =
 {
-     -18,   -18,   -18,   -18,   -18,    15,   -18,   -17,   -18,   -18,
+     -18,   -18,   -18,   -18,   -18,    16,   -17,   -18,   -18,   -18,
      -14
 };
 
@@ -535,8 +535,8 @@ static const yytype_int8 yypgoto[] =
 static const yytype_int8 yytable[] =
 {
       -2,     4,    21,     5,     3,    11,     6,    24,     7,     6,
-      28,     7,    27,    12,    29,    14,    15,    14,    15,    20,
-      16,    26,    25,    20,    13
+      28,     7,    27,    12,    29,    14,    15,    20,    14,    15,
+      16,    26,    25,    16,    20,    13
 };
 
 #define yypact_value_is_default(yystate) \
@@ -548,8 +548,8 @@ static const yytype_int8 yytable[] =
 static const yytype_uint8 yycheck[] =
 {
        0,     1,    19,     3,     0,     6,     6,    21,     8,     6,
-      27,     8,    26,     7,    28,     4,     5,     4,     5,     6,
-       9,    11,    10,     6,     9
+      27,     8,    26,     7,    28,     4,     5,     6,     4,     5,
+       9,    11,    10,     9,     6,     9
 };
 
 /* YYSTOS[STATE-NUM] -- The (internal number of the) accessing
@@ -558,7 +558,7 @@ static const yytype_uint8 yystos[] =
 {
        0,    13,    14,     0,     1,     3,     6,     8,    15,    16,
       17,     6,     7,    17,     4,     5,     9,    18,    19,    22,
-       6,    19,    20,    21,    22,    10,    11,    22,    19,    22
+       6,    18,    20,    21,    22,    10,    11,    22,    18,    22
 };
 
 #define yyerrok		(yyerrstatus = 0)
@@ -1564,14 +1564,14 @@ yyreduce:
 
 /* Line 1806 of yacc.c  */
 #line 72 "libxlu_cfg_y.y"
-    { (yyval.value)= xlu__cfg_list_mk(ctx,(yyvsp[(1) - (2)].string)); }
+    { (yyval.value)= xlu__cfg_list_mk(ctx,(yyvsp[(1) - (2)].value)); }
     break;
 
   case 20:
 
 /* Line 1806 of yacc.c  */
 #line 73 "libxlu_cfg_y.y"
-    { xlu__cfg_list_append(ctx,(yyvsp[(1) - (5)].value),(yyvsp[(4) - (5)].string)); (yyval.value)= (yyvsp[(1) - (5)].value); }
+    { xlu__cfg_list_append(ctx,(yyvsp[(1) - (5)].value),(yyvsp[(4) - (5)].value)); (yyval.value)= (yyvsp[(1) - (5)].value); }
     break;
 
 
diff --git a/tools/libxl/libxlu_cfg_y.y b/tools/libxl/libxlu_cfg_y.y
index 6848686..4a5ca3a 100644
--- a/tools/libxl/libxlu_cfg_y.y
+++ b/tools/libxl/libxlu_cfg_y.y
@@ -69,8 +69,8 @@ valuelist: /* empty */           { $$= xlu__cfg_list_mk(ctx,NULL); }
  |      values                  { $$= $1; }
  |      values ',' nlok         { $$= $1; }
 
-values: atom nlok                  { $$= xlu__cfg_list_mk(ctx,$1); }
- |      values ',' nlok atom nlok  { xlu__cfg_list_append(ctx,$1,$4); $$= $1; }
+values: value nlok                  { $$= xlu__cfg_list_mk(ctx,$1); }
+ |      values ',' nlok value nlok  { xlu__cfg_list_append(ctx,$1,$4); $$= $1; }
 
 nlok:
         /* nothing */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 20/21] libxlu: introduce new APIs
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (18 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 19/21] libxlu: nested list support Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-02-11 16:17   ` Ian Jackson
  2015-01-23 11:13 ` [PATCH v4 21/21] xl: vNUMA support Wei Liu
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

These APIs can be used to manipulate XLU_ConfigValue and XLU_ConfigList.

APIs introduced:
1. xlu_cfg_value_type
2. xlu_cfg_value_get_string
3. xlu_cfg_value_get_list
4. xlu_cfg_get_listitem2

Move some definitions from private header to public header as needed.
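
For illustration only (not part of this patch), a caller could walk a
two-level list, such as a distance matrix, with these APIs.  A minimal
sketch, omitting error handling and assuming "config" has already been
parsed (note the parser stores numbers as strings):

    static void dump_matrix(const XLU_Config *config)
    {
        XLU_ConfigList *outer;
        const XLU_ConfigList *inner;
        const XLU_ConfigValue *row, *cell;
        int i, j, nr_rows;

        if (xlu_cfg_get_list(config, "vnuma_vdistances", &outer, &nr_rows, 1))
            return;

        for (i = 0; i < nr_rows; i++) {
            row = xlu_cfg_get_listitem2(outer, i);
            if (xlu_cfg_value_type(row) != XLU_LIST)
                continue;
            inner = xlu_cfg_value_get_list(row);
            for (j = 0; (cell = xlu_cfg_get_listitem2(inner, j)) != NULL; j++)
                printf("distance[%d][%d] = %s\n", i, j,
                       xlu_cfg_value_get_string(cell));
        }
    }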

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxl/libxlu_cfg.c      |   25 +++++++++++++++++++++++++
 tools/libxl/libxlu_internal.h |    7 -------
 tools/libxl/libxlutil.h       |   11 +++++++++++
 3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/tools/libxl/libxlu_cfg.c b/tools/libxl/libxlu_cfg.c
index e072046..edb1d67 100644
--- a/tools/libxl/libxlu_cfg.c
+++ b/tools/libxl/libxlu_cfg.c
@@ -199,6 +199,31 @@ static int find_atom(const XLU_Config *cfg, const char *n,
     return 0;
 }
 
+
+enum XLU_ConfigValueType xlu_cfg_value_type(const XLU_ConfigValue *value)
+{
+    return value->type;
+}
+
+const char *xlu_cfg_value_get_string(const XLU_ConfigValue *value)
+{
+    assert(value->type == XLU_STRING);
+    return value->u.string;
+}
+
+const XLU_ConfigList *xlu_cfg_value_get_list(const XLU_ConfigValue *value)
+{
+    assert(value->type == XLU_LIST);
+    return &value->u.list;
+}
+
+const XLU_ConfigValue *xlu_cfg_get_listitem2(const XLU_ConfigList *list,
+                                             int entry)
+{
+    if (entry < 0 || entry >= list->nvalues) return NULL;
+    return list->values[entry];
+}
+
 int xlu_cfg_get_string(const XLU_Config *cfg, const char *n,
                        const char **value_r, int dont_warn) {
     XLU_ConfigSetting *set;
diff --git a/tools/libxl/libxlu_internal.h b/tools/libxl/libxlu_internal.h
index 3f4ef80..db86369 100644
--- a/tools/libxl/libxlu_internal.h
+++ b/tools/libxl/libxlu_internal.h
@@ -25,13 +25,6 @@
 
 #include "libxlutil.h"
 
-enum XLU_ConfigValueType {
-    XLU_STRING,
-    XLU_LIST,
-};
-
-typedef struct XLU_ConfigValue XLU_ConfigValue;
-
 typedef struct XLU_ConfigList {
     int nvalues;
     XLU_ConfigValue **values;
diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
index 0333e55..893ec0e 100644
--- a/tools/libxl/libxlutil.h
+++ b/tools/libxl/libxlutil.h
@@ -20,9 +20,15 @@
 
 #include "libxl.h"
 
+enum XLU_ConfigValueType {
+    XLU_STRING,
+    XLU_LIST,
+};
+
 /* Unless otherwise stated, all functions return an errno value. */
 typedef struct XLU_Config XLU_Config;
 typedef struct XLU_ConfigList XLU_ConfigList;
+typedef struct XLU_ConfigValue XLU_ConfigValue;
 
 XLU_Config *xlu_cfg_init(FILE *report, const char *report_filename);
   /* 0 means we got ENOMEM. */
@@ -66,6 +72,11 @@ const char *xlu_cfg_get_listitem(const XLU_ConfigList*, int entry);
   /* xlu_cfg_get_listitem cannot fail, except that if entry is
    * out of range it returns 0 (not setting errno) */
 
+enum XLU_ConfigValueType xlu_cfg_value_type(const XLU_ConfigValue *value);
+const char *xlu_cfg_value_get_string(const XLU_ConfigValue *value);
+const XLU_ConfigList *xlu_cfg_value_get_list(const XLU_ConfigValue *value);
+const XLU_ConfigValue *xlu_cfg_get_listitem2(const XLU_ConfigList *list,
+                                             int entry);
 
 /*
  * Disk specification parsing.
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 21/21] xl: vNUMA support
  2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
                   ` (19 preceding siblings ...)
  2015-01-23 11:13 ` [PATCH v4 20/21] libxlu: introduce new APIs Wei Liu
@ 2015-01-23 11:13 ` Wei Liu
  2015-01-28 16:46   ` Ian Campbell
  20 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 11:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, JBeulich, ufimtseva

This patch includes the configuration option parser and documentation.

Please see the hunk to xl.cfg.pod.5 for more information.
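
For illustration only (not part of this patch), a guest with four vcpus
and two virtual nodes, both placed on physical node 0 (assuming the host
has such a node), could use a fragment like:

  memory = 4096
  vcpus = 4
  vnuma_memory = [ 2048, 2048 ]
  vnuma_vcpu_map = [ 0, 0, 1, 1 ]
  vnuma_pnode_map = [ 0, 0 ]
  vnuma_vdistances = [ [10, 20], [20, 10] ]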

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v2:
1. Make vnuma_vdistances mandatory.
2. Use nested list to specify vdistances.
3. Update documentation.
---
 docs/man/xl.cfg.pod.5    |   39 ++++++++++++
 tools/libxl/xl_cmdimpl.c |  147 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 186 insertions(+)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index e2f91fc..b23bd6f 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -266,6 +266,45 @@ it will crash.
 
 =back
 
+=head3 Virtual NUMA Memory Allocation
+
+=over 4
+
+=item B<vnuma_memory=[ NUMBER, NUMBER, ... ]>
+
+Specify the size of memory covered by each virtual NUMA node. The number of
+elements in the list also implicitly defines the number of virtual NUMA nodes.
+
+The sum of all elements in this list should be equal to the memory size
+specified by B<maxmem=> in the guest configuration file, or B<memory=> if
+B<maxmem=> is not specified.
+
+=item B<vnuma_vcpu_map=[ NUMBER, NUMBER, ... ]>
+
+Specify which virtual NUMA node a specific vcpu belongs to. The number of
+elements in this list should be equal to B<maxvcpus=> in the guest
+configuration file, or B<vcpus=> if B<maxvcpus=> is not specified.
+
+=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
+
+Specify which physical NUMA node a specific virtual NUMA node maps to. The
+number of elements in this list should be equal to the number of virtual
+NUMA nodes defined in B<vnuma_memory=>.
+
+=item B<vnuma_vdistances=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
+
+Two-dimensional list specifying the distances among virtual nodes.
+
+The number of elements in the outer list equals the number of virtual nodes.
+The element at position B<i> is a list that specifies the distances from
+node B<i> to the other nodes.
+
+For example, for a guest with 2 virtual nodes, the user can specify:
+
+  vnuma_vdistances = [ [10, 20], [20, 10] ]
+
+=back
+
 =head3 Event Actions
 
 =over 4
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 0b02a6c..107b552 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -977,6 +977,151 @@ static int parse_nic_config(libxl_device_nic *nic, XLU_Config **config, char *to
     return 0;
 }
 
+static void parse_vnuma_config(const XLU_Config *config,
+                               libxl_domain_build_info *b_info)
+{
+    int i, j;
+    XLU_ConfigList *vnuma_memory, *vnuma_vcpu_map, *vnuma_pnode_map,
+        *vnuma_vdistances;
+    int num_vnuma_memory, num_vnuma_vcpu_map, num_vnuma_pnode_map,
+        num_vnuma_vdistances;
+    const char *buf;
+    libxl_physinfo physinfo;
+    uint32_t nr_nodes;
+    unsigned long val;
+    char *ep;
+
+    libxl_physinfo_init(&physinfo);
+    if (libxl_get_physinfo(ctx, &physinfo) != 0) {
+        libxl_physinfo_dispose(&physinfo);
+        fprintf(stderr, "libxl_physinfo failed.\n");
+        exit(1);
+    }
+    nr_nodes = physinfo.nr_nodes;
+    libxl_physinfo_dispose(&physinfo);
+
+    if (xlu_cfg_get_list(config, "vnuma_memory", &vnuma_memory,
+                         &num_vnuma_memory, 1))
+        return;              /* No vnuma config */
+
+    b_info->num_vnuma_nodes = num_vnuma_memory;
+    b_info->vnuma_nodes = xmalloc(num_vnuma_memory * sizeof(libxl_vnode_info));
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+
+        libxl_vnode_info_init(p);
+        libxl_cpu_bitmap_alloc(ctx, &p->vcpus, b_info->max_vcpus);
+        libxl_bitmap_set_none(&p->vcpus);
+        p->distances = xmalloc(b_info->num_vnuma_nodes * sizeof(uint32_t));
+        p->num_distances = b_info->num_vnuma_nodes;
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        buf = xlu_cfg_get_listitem(vnuma_memory, i);
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vnuma memory: %s\n", buf);
+            exit(1);
+        }
+        b_info->vnuma_nodes[i].memkb = val << 10;
+    }
+
+    if (xlu_cfg_get_list(config, "vnuma_vcpu_map", &vnuma_vcpu_map,
+                         &num_vnuma_vcpu_map, 1)) {
+        fprintf(stderr, "No vcpu to vnode map specified\n");
+        exit(1);
+    }
+
+    i = 0;
+    while (i < b_info->max_vcpus &&
+           (buf = xlu_cfg_get_listitem(vnuma_vcpu_map, i)) != NULL) {
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vcpu map: %s\n", buf);
+            exit(1);
+        }
+        if (val >= num_vnuma_memory) {
+            fprintf(stderr, "xl: Invalid vnode number specified: %lu\n", val);
+            exit(1);
+        }
+        libxl_bitmap_set(&b_info->vnuma_nodes[val].vcpus, i);
+        i++;
+    }
+
+    if (i != b_info->max_vcpus) {
+        fprintf(stderr, "xl: Not enough elements in vnuma_vcpu_map, provided %d, required %d\n",
+                i + 1, b_info->max_vcpus);
+        exit(1);
+    }
+
+    if (xlu_cfg_get_list(config, "vnuma_pnode_map", &vnuma_pnode_map,
+                         &num_vnuma_pnode_map, 1)) {
+        fprintf(stderr, "No vnode to pnode map specified\n");
+        exit(1);
+    }
+
+    i = 0;
+    while (i < num_vnuma_pnode_map &&
+           (buf = xlu_cfg_get_listitem(vnuma_pnode_map, i)) != NULL) {
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vnode to pnode map: %s\n", buf);
+            exit(1);
+        }
+        if (val >= nr_nodes) {
+            fprintf(stderr, "xl: Invalid pnode number specified: %lu\n", val);
+            exit(1);
+        }
+
+        b_info->vnuma_nodes[i].pnode = val;
+
+        i++;
+    }
+
+    if (i != b_info->num_vnuma_nodes) {
+        fprintf(stderr, "xl: Not enough elements in vnuma_vnode_map, provided %d, required %d\n",
+                i + 1, b_info->num_vnuma_nodes);
+        exit(1);
+    }
+
+    if (!xlu_cfg_get_list(config, "vnuma_vdistances", &vnuma_vdistances,
+                          &num_vnuma_vdistances, 1)) {
+        if (num_vnuma_vdistances != num_vnuma_memory) {
+            fprintf(stderr, "xl: Required %d sub-lists in vnuma_vdistances but provided %d\n",
+                    num_vnuma_memory, num_vnuma_vdistances);
+            exit(1);
+        }
+
+        for (i = 0; i < num_vnuma_vdistances; i++) {
+            const XLU_ConfigValue *tmp;
+            const XLU_ConfigList *sublist;
+
+            tmp = xlu_cfg_get_listitem2(vnuma_vdistances, i);
+            if (xlu_cfg_value_type(tmp) != XLU_LIST) {
+                fprintf(stderr, "xl: Expecting list in vnuma_vdistance\n");
+                exit(1);
+            }
+            sublist = xlu_cfg_value_get_list(tmp);
+            for (j = 0;
+                 (buf = xlu_cfg_get_listitem(sublist, j)) != NULL;
+                 j++) {
+                val = strtoul(buf, &ep, 10);
+                if (ep == buf) {
+                    fprintf(stderr, "xl: Invalid argument parsing vdistances map: %s\n", buf);
+                    exit(1);
+                }
+
+                b_info->vnuma_nodes[i].distances[j] = val;
+            }
+        }
+
+    } else {
+        fprintf(stderr, "xl: No vnuma_vdistances specified.\n");
+        exit(1);
+    }
+}
+
 static void parse_config_data(const char *config_source,
                               const char *config_data,
                               int config_len,
@@ -1167,6 +1312,8 @@ static void parse_config_data(const char *config_source,
         }
     }
 
+    parse_vnuma_config(config, b_info);
+
     if (!xlu_cfg_get_long(config, "rtc_timeoffset", &l, 0))
         b_info->rtc_timeoffset = l;
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 01/21] xen: dump vNUMA information with debug key "u"
  2015-01-23 11:13 ` [PATCH v4 01/21] xen: dump vNUMA information with debug key "u" Wei Liu
@ 2015-01-23 13:03   ` Jan Beulich
  2015-01-23 13:23     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-01-23 13:03 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

>>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> @@ -408,6 +413,68 @@ static void dump_numa(unsigned char key)
>  
>          for_each_online_node ( i )
>              printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +        read_lock(&d->vnuma_rwlock);

I'm sorry, but no - on v3 I specifically said "don't you need to try to
acquire d->vnuma_rwlock for reading, and skip any printing if the
acquire attempt fails". I.e. this should read_trylock(). (Yes, there
are pre-existing bad examples [including the acquiring of
d->page_alloc_lock in this function], but I think we should stop this
practice.)
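
A minimal sketch of the requested pattern, for illustration only (assuming
read_trylock() returns nonzero on success):

    if ( !read_trylock(&d->vnuma_rwlock) )
        continue;

    if ( !d->vnuma )
    {
        read_unlock(&d->vnuma_rwlock);
        continue;
    }

    /* ... dump vNUMA information ... */

    read_unlock(&d->vnuma_rwlock);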

> +        if ( !d->vnuma )
> +        {
> +            read_unlock(&d->vnuma_rwlock);
> +            continue;
> +        }
> +
> +        vnuma = d->vnuma;
> +        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
> +               vnuma->nr_vnodes, d->max_vcpus);
> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
> +        {
> +            unsigned int start_cpu = ~0U;
> +
> +            err = snprintf(keyhandler_scratch, 12, "%3u",
> +                    vnuma->vnode_to_pnode[i]);
> +            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
> +                strlcpy(keyhandler_scratch, "???", 3);

If you really want 3 question marks to appear, the number needs to
be > 3. Is there any reason not to pass the correct one, i.e.
ARRAY_SIZE(keyhandler_scratch) (or sizeof(keyhandler_scratch))?
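
For illustration (assuming the usual strlcpy() semantics of copying at
most size - 1 characters and always NUL-terminating):

    strlcpy(keyhandler_scratch, "???", sizeof(keyhandler_scratch));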

> +
> +            printk("       %3u: pnode %s,", i, keyhandler_scratch);
> +
> +            printk(" vcpus ");
> +
> +            for ( j = 0; j < d->max_vcpus; j++ )
> +            {
> +                if ( !(j & 0x3f) )
> +                    process_pending_softirqs();
> +
> +                if ( vnuma->vcpu_to_vnode[j] == i )
> +                {
> +                    if ( start_cpu == ~0U )
> +                    {
> +                        printk("%d", j);
> +                        start_cpu = j;
> +                    }
> +                } else if ( start_cpu != ~0U ) {

Coding style.

> +                    if ( j - 1 != start_cpu )
> +                        printk("-%d ", j - 1);
> +                    else
> +                        printk(" ");
> +                    start_cpu = ~0U;
> +                }
> +            }
> +
> +            if ( start_cpu != ~0U  && start_cpu != j - 1 )
> +                printk("-%d", j - 1);
> +
> +            printk("\n");
> +
> +            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
> +            {
> +                if ( vnuma->vmemrange[j].nid == i )
> +                {
> +                    printk("           %016"PRIx64" - %016"PRIx64"\n",
> +                           vnuma->vmemrange[j].start,
> +                           vnuma->vmemrange[j].end);
> +                }
> +            }

At least the inner (belonging to the if()) braces should be dropped
as being unnecessary.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 11:13 ` [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware Wei Liu
@ 2015-01-23 13:16   ` Jan Beulich
  2015-01-23 14:46     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-01-23 13:16 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

>>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> Make XENMEM_increase_reservation and XENMEM_populate_physmap
> vNUMA-aware.
> 
> That is, if guest requests Xen to allocate memory for specific vnode,
> Xen can translate vnode to pnode using vNUMA information of that guest.
> 
> XENMEMF_vnode is introduced for the guest to mark the node number is in
> fact virtual node number and should be translated by Xen.
> 
> XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
> able to translate virtual node to physical node.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Acked-by: Jan Beulich <JBeulich@suse.com>

I'm afraid there's another change needed for this to hold:

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -692,6 +692,50 @@ out:
>      return rc;
>  }
>  
> +static int translate_vnode_to_pnode(struct domain *d,
> +                                    struct xen_memory_reservation *r,
> +                                    struct memop_args *a)
> +{
> +    int rc = 0;
> +    unsigned int vnode, pnode;
> +
> +    /*
> +     * Note: we don't strictly require non-supported bits set to zero,
> +     * so we may have exact_vnode bit set for old guests that don't
> +     * support vNUMA.
> +     *
> +     * To distinguish spurious vnode request v.s. real one, check if
> +     * d->vnuma exists.
> +     */
> +    if ( r->mem_flags & XENMEMF_vnode )
> +    {
> +        read_lock(&d->vnuma_rwlock);
> +        if ( d->vnuma )

if r->mem_flags has XENMEMF_vnode set but d->vnuma is NULL,
you need to clear the node from the flags.

> +        {
> +            vnode = XENMEMF_get_node(r->mem_flags);
> +
> +            if ( vnode < d->vnuma->nr_vnodes )
> +            {
> +                pnode = d->vnuma->vnode_to_pnode[vnode];
> +
> +                a->memflags &=
> +                    ~MEMF_node(XENMEMF_get_node(r->mem_flags));

I.e. this one needs to be pulled up, and probably include
MEMF_exact_node.

> @@ -747,6 +787,16 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>              return start_extent;
>          args.domain = d;
>  
> +        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
> +        if ( reservation.mem_flags & XENMEMF_exact_node_request )
> +            args.memflags |= MEMF_exact_node;

Since you force MEMF_exact_node when XENMEMF_vnode is set (and the
necessary data is available), I also wonder whether the combination
of both flags in what the caller requests should be forbidden (to
potentially obtain some specific meaning in the future). Or
alternatively don't enforce the former?

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 01/21] xen: dump vNUMA information with debug key "u"
  2015-01-23 13:03   ` Jan Beulich
@ 2015-01-23 13:23     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 13:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, xen-devel, ufimtseva

On Fri, Jan 23, 2015 at 01:03:16PM +0000, Jan Beulich wrote:
> >>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> > @@ -408,6 +413,68 @@ static void dump_numa(unsigned char key)
> >  
> >          for_each_online_node ( i )
> >              printk("    Node %u: %u\n", i, page_num_node[i]);
> > +
> > +        read_lock(&d->vnuma_rwlock);
> 
> I'm sorry, but no - on v3 I specifically said "don't you need to try to
> acquire d->vnuma_rwlock for reading, and skip any printing if the
> acquire attempt fails". I.e. this should read_trylock(). (Yes, there
> are pre-existing bad examples [including the acquiring of
> d->page_alloc_lock in this function], but I think we should stop this
> practice.)
> 

No problem. I will fix this.

> > +        if ( !d->vnuma )
> > +        {
> > +            read_unlock(&d->vnuma_rwlock);
> > +            continue;
> > +        }
> > +
> > +        vnuma = d->vnuma;
> > +        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
> > +               vnuma->nr_vnodes, d->max_vcpus);
> > +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
> > +        {
> > +            unsigned int start_cpu = ~0U;
> > +
> > +            err = snprintf(keyhandler_scratch, 12, "%3u",
> > +                    vnuma->vnode_to_pnode[i]);
> > +            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
> > +                strlcpy(keyhandler_scratch, "???", 3);
> 
> If you really want 3 question marks to appear, the number needs to
> be > 3. Is there any reason not to pass the correct one, i.e.
> ARRAY_SIZE(keyhandler_scratch) (or sizeof(keyhandler_scratch))?
> 

This is from Elena's original patch. I missed this one. Will fix.

> > +
> > +            printk("       %3u: pnode %s,", i, keyhandler_scratch);
> > +
> > +            printk(" vcpus ");
> > +
> > +            for ( j = 0; j < d->max_vcpus; j++ )
> > +            {
> > +                if ( !(j & 0x3f) )
> > +                    process_pending_softirqs();
> > +
> > +                if ( vnuma->vcpu_to_vnode[j] == i )
> > +                {
> > +                    if ( start_cpu == ~0U )
> > +                    {
> > +                        printk("%d", j);
> > +                        start_cpu = j;
> > +                    }
> > +                } else if ( start_cpu != ~0U ) {
> 
> Coding style.
> 

Fixed.

> > +                    if ( j - 1 != start_cpu )
> > +                        printk("-%d ", j - 1);
> > +                    else
> > +                        printk(" ");
> > +                    start_cpu = ~0U;
> > +                }
> > +            }
> > +
> > +            if ( start_cpu != ~0U  && start_cpu != j - 1 )
> > +                printk("-%d", j - 1);
> > +
> > +            printk("\n");
> > +
> > +            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
> > +            {
> > +                if ( vnuma->vmemrange[j].nid == i )
> > +                {
> > +                    printk("           %016"PRIx64" - %016"PRIx64"\n",
> > +                           vnuma->vmemrange[j].start,
> > +                           vnuma->vmemrange[j].end);
> > +                }
> > +            }
> 
> At least the inner (belonging to the if()) braces should be dropped
> as being unnecessary.
> 

Inner braces removed.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor
  2015-01-23 11:13 ` [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
@ 2015-01-23 13:27   ` Jan Beulich
  2015-01-23 14:17     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-01-23 13:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

>>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> Changes in v4:
> 1. Use *vnode_to_pnode to calculate size.
> 2. Remove loop.
> 3. TBD: rebase onto Jan's errno ABI change.

Which went in already.

> +void init_vnuma_info(void)
> +{
> +    int rc;
> +    struct xen_vnuma_topology_info vnuma_topo;
> +
> +    vcpu_to_vnode =
> +        scratch_alloc(sizeof(*vcpu_to_vnode) * hvm_info->nr_vcpus, 0);

Why is this not done together with the other two allocs below,
avoiding it when the probing hypercall fails?

> +    vnuma_topo.domid = DOMID_SELF;
> +    vnuma_topo.pad = 0;
> +    vnuma_topo.nr_vcpus = 0;
> +    vnuma_topo.nr_vnodes = 0;
> +    vnuma_topo.nr_vmemranges = 0;
> +
> +    set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
> +    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
> +    set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);

Perhaps easier to just memset() the whole thing to zero, and then
set .domid?
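
A sketch of the suggested simplification, for illustration only:

    memset(&vnuma_topo, 0, sizeof(vnuma_topo));
    vnuma_topo.domid = DOMID_SELF;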

> +
> +    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
> +
> +    if ( rc != -XEN_ENOBUFS )
> +        return;
> +
> +    ASSERT(vnuma_topo.nr_vcpus == hvm_info->nr_vcpus);
> +
> +    vdistance = scratch_alloc(sizeof(uint32_t) * vnuma_topo.nr_vnodes *
> +                              vnuma_topo.nr_vnodes, 0);
> +    vmemrange = scratch_alloc(sizeof(xen_vmemrange_t) *
> +                              vnuma_topo.nr_vmemranges, 0);
> +
> +    set_xen_guest_handle(vnuma_topo.vdistance.h, vdistance);
> +    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, vcpu_to_vnode);
> +    set_xen_guest_handle(vnuma_topo.vmemrange.h, vmemrange);
> +
> +    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
> +
> +    if ( rc < 0 )
> +    {
> +        printf("Failed to retrieve vNUMA information, rc = %d\n", rc);
> +        return;
> +    }
> +
> +    nr_vnodes = vnuma_topo.nr_vnodes;
> +    nr_vmemranges = vnuma_topo.nr_vmemranges;
> +
> +    return;
> +}

Please drop this pointless "return".

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor
  2015-01-23 13:27   ` Jan Beulich
@ 2015-01-23 14:17     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-23 14:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, xen-devel, ufimtseva

On Fri, Jan 23, 2015 at 01:27:27PM +0000, Jan Beulich wrote:
> >>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> > Changes in v4:
> > 1. Use *vnode_to_pnode to calculate size.
> > 2. Remove loop.
> > 3. TBD: rebase onto Jan's errno ABI change.
> 
> Which went in already.
> 

I will wait until it shows up in master and rebase.

> > +void init_vnuma_info(void)
> > +{
> > +    int rc;
> > +    struct xen_vnuma_topology_info vnuma_topo;
> > +
> > +    vcpu_to_vnode =
> > +        scratch_alloc(sizeof(*vcpu_to_vnode) * hvm_info->nr_vcpus, 0);
> 
> Why is this not done together with the other two allocs below,
> avoiding it when the probing hypercall fails?
> 

OK. I will group them together.

> > +    vnuma_topo.domid = DOMID_SELF;
> > +    vnuma_topo.pad = 0;
> > +    vnuma_topo.nr_vcpus = 0;
> > +    vnuma_topo.nr_vnodes = 0;
> > +    vnuma_topo.nr_vmemranges = 0;
> > +
> > +    set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
> > +    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
> > +    set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);
> 
> Perhaps easier to just memset() the whole thing to zero, and the
> set .domid?
> 

Done.

> > +
> > +    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
> > +
> > +    if ( rc != -XEN_ENOBUFS )
> > +        return;
> > +
> > +    ASSERT(vnuma_topo.nr_vcpus == hvm_info->nr_vcpus);
> > +
> > +    vdistance = scratch_alloc(sizeof(uint32_t) * vnuma_topo.nr_vnodes *
> > +                              vnuma_topo.nr_vnodes, 0);
> > +    vmemrange = scratch_alloc(sizeof(xen_vmemrange_t) *
> > +                              vnuma_topo.nr_vmemranges, 0);
> > +
> > +    set_xen_guest_handle(vnuma_topo.vdistance.h, vdistance);
> > +    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, vcpu_to_vnode);
> > +    set_xen_guest_handle(vnuma_topo.vmemrange.h, vmemrange);
> > +
> > +    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
> > +
> > +    if ( rc < 0 )
> > +    {
> > +        printf("Failed to retrieve vNUMA information, rc = %d\n", rc);
> > +        return;
> > +    }
> > +
> > +    nr_vnodes = vnuma_topo.nr_vnodes;
> > +    nr_vmemranges = vnuma_topo.nr_vmemranges;
> > +
> > +    return;
> > +}
> 
> Please drop this pointless "return".
> 

Done.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 13:16   ` Jan Beulich
@ 2015-01-23 14:46     ` Wei Liu
  2015-01-23 15:37       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 14:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, xen-devel, ufimtseva

On Fri, Jan 23, 2015 at 01:16:19PM +0000, Jan Beulich wrote:
> >>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> > Make XENMEM_increase_reservation and XENMEM_populate_physmap
> > vNUMA-aware.
> > 
> > That is, if guest requests Xen to allocate memory for specific vnode,
> > Xen can translate vnode to pnode using vNUMA information of that guest.
> > 
> > XENMEMF_vnode is introduced for the guest to mark the node number is in
> > fact virtual node number and should be translated by Xen.
> > 
> > XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
> > able to translate virtual node to physical node.
> > 
> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > Acked-by: Jan Beulich <JBeulich@suse.com>
> 
> I'm afraid there's another change needed for this to hold:
> 
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -692,6 +692,50 @@ out:
> >      return rc;
> >  }
> >  
> > +static int translate_vnode_to_pnode(struct domain *d,
> > +                                    struct xen_memory_reservation *r,
> > +                                    struct memop_args *a)
> > +{
> > +    int rc = 0;
> > +    unsigned int vnode, pnode;
> > +
> > +    /*
> > +     * Note: we don't strictly require non-supported bits set to zero,
> > +     * so we may have exact_vnode bit set for old guests that don't
> > +     * support vNUMA.
> > +     *
> > +     * To distinguish spurious vnode request v.s. real one, check if
> > +     * d->vnuma exists.
> > +     */
> > +    if ( r->mem_flags & XENMEMF_vnode )
> > +    {
> > +        read_lock(&d->vnuma_rwlock);
> > +        if ( d->vnuma )
> 
> if r->mem_flags has XENMEMF_vnode set but d->vnuma is NULL,
> you need to clear the node from the flags.
> 

As said in the comment, we don't seem to require that non-supported bits
be set to zero (IIRC you told me that). So an old guest that sets
XENMEMF_vnode by accident will get its other flags cleared if I follow
your suggestion.

> > +        {
> > +            vnode = XENMEMF_get_node(r->mem_flags);
> > +
> > +            if ( vnode < d->vnuma->nr_vnodes )
> > +            {
> > +                pnode = d->vnuma->vnode_to_pnode[vnode];
> > +
> > +                a->memflags &=
> > +                    ~MEMF_node(XENMEMF_get_node(r->mem_flags));
> 
> I.e. this one needs to be pulled up, and probably include
> MEMF_exact_node.
> 

> > @@ -747,6 +787,16 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >              return start_extent;
> >          args.domain = d;
> >  
> > +        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
> > +        if ( reservation.mem_flags & XENMEMF_exact_node_request )
> > +            args.memflags |= MEMF_exact_node;
> 
> Since you force MEMF_exact_node when XENMEMF_vnode (and the
> necessary data is available), I also wonder whether the combination
> of both flags in what the caller requests should be forbidden (to
> potentially obtain some specific meaning in the future). Or
> alternatively don't enforce the former?
> 

We can't forbid guests from setting both flags, for the same reason as
above. So not enforcing MEMF_exact_node when XENMEMF_vnode is set seems
the way to go.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 14:46     ` Wei Liu
@ 2015-01-23 15:37       ` Jan Beulich
  2015-01-23 15:43         ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-01-23 15:37 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

>>> On 23.01.15 at 15:46, <wei.liu2@citrix.com> wrote:
> On Fri, Jan 23, 2015 at 01:16:19PM +0000, Jan Beulich wrote:
>> >>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
>> > Make XENMEM_increase_reservation and XENMEM_populate_physmap
>> > vNUMA-aware.
>> > 
>> > That is, if guest requests Xen to allocate memory for specific vnode,
>> > Xen can translate vnode to pnode using vNUMA information of that guest.
>> > 
>> > XENMEMF_vnode is introduced for the guest to mark the node number is in
>> > fact virtual node number and should be translated by Xen.
>> > 
>> > XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
>> > able to translate virtual node to physical node.
>> > 
>> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
>> > Acked-by: Jan Beulich <JBeulich@suse.com>
>> 
>> I'm afraid there's another change needed for this to hold:
>> 
>> > --- a/xen/common/memory.c
>> > +++ b/xen/common/memory.c
>> > @@ -692,6 +692,50 @@ out:
>> >      return rc;
>> >  }
>> >  
>> > +static int translate_vnode_to_pnode(struct domain *d,
>> > +                                    struct xen_memory_reservation *r,
>> > +                                    struct memop_args *a)
>> > +{
>> > +    int rc = 0;
>> > +    unsigned int vnode, pnode;
>> > +
>> > +    /*
>> > +     * Note: we don't strictly require non-supported bits set to zero,
>> > +     * so we may have exact_vnode bit set for old guests that don't
>> > +     * support vNUMA.
>> > +     *
>> > +     * To distinguish spurious vnode request v.s. real one, check if
>> > +     * d->vnuma exists.
>> > +     */
>> > +    if ( r->mem_flags & XENMEMF_vnode )
>> > +    {
>> > +        read_lock(&d->vnuma_rwlock);
>> > +        if ( d->vnuma )
>> 
>> if r->mem_flags has XENMEMF_vnode set but d->vnuma is NULL,
>> you need to clear the node from the flags.
>> 
> 
> As said in the comment, we don't seem to enforce non-supported bits set
> to zero (IIRC you told me that). So an old guest that sets XENMEMF_vnode
> by accident will get its other flags cleared if I follow your suggestion.

Which is an acceptable thing to do I think - they called for
undefined behavior, and they now get unexpected behavior.
Mistaking the virtual node specified for a physical one is certainly
less desirable.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 15:37       ` Jan Beulich
@ 2015-01-23 15:43         ` Wei Liu
  2015-01-23 16:06           ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 15:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, xen-devel, ufimtseva

On Fri, Jan 23, 2015 at 03:37:51PM +0000, Jan Beulich wrote:
> >>> On 23.01.15 at 15:46, <wei.liu2@citrix.com> wrote:
> > On Fri, Jan 23, 2015 at 01:16:19PM +0000, Jan Beulich wrote:
> >> >>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> >> > Make XENMEM_increase_reservation and XENMEM_populate_physmap
> >> > vNUMA-aware.
> >> > 
> >> > That is, if guest requests Xen to allocate memory for specific vnode,
> >> > Xen can translate vnode to pnode using vNUMA information of that guest.
> >> > 
> >> > XENMEMF_vnode is introduced for the guest to mark the node number is in
> >> > fact virtual node number and should be translated by Xen.
> >> > 
> >> > XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
> >> > able to translate virtual node to physical node.
> >> > 
> >> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> >> > Acked-by: Jan Beulich <JBeulich@suse.com>
> >> 
> >> I'm afraid there's another change needed for this to hold:
> >> 
> >> > --- a/xen/common/memory.c
> >> > +++ b/xen/common/memory.c
> >> > @@ -692,6 +692,50 @@ out:
> >> >      return rc;
> >> >  }
> >> >  
> >> > +static int translate_vnode_to_pnode(struct domain *d,
> >> > +                                    struct xen_memory_reservation *r,
> >> > +                                    struct memop_args *a)
> >> > +{
> >> > +    int rc = 0;
> >> > +    unsigned int vnode, pnode;
> >> > +
> >> > +    /*
> >> > +     * Note: we don't strictly require non-supported bits set to zero,
> >> > +     * so we may have exact_vnode bit set for old guests that don't
> >> > +     * support vNUMA.
> >> > +     *
> >> > +     * To distinguish spurious vnode request v.s. real one, check if
> >> > +     * d->vnuma exists.
> >> > +     */
> >> > +    if ( r->mem_flags & XENMEMF_vnode )
> >> > +    {
> >> > +        read_lock(&d->vnuma_rwlock);
> >> > +        if ( d->vnuma )
> >> 
> >> if r->mem_flags has XENMEMF_vnode set but d->vnuma is NULL,
> >> you need to clear the node from the flags.
> >> 
> > 
> > As said in the comment, we don't seem to enforce non-supported bits set
> > to zero (IIRC you told me that). So an old guest that sets XENMEMF_vnode
> > by accident will get its other flags cleared if I follow your suggestion.
> 
> Which is an acceptable thing to do I think - they called for
> undefined behavior, and they now get unexpected behavior.
> Mistaking the virtual node specified for a physical one is certainly
> less desirable.
> 

OK, thanks for clarification.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 15:43         ` Wei Liu
@ 2015-01-23 16:06           ` Wei Liu
  2015-01-23 16:17             ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-23 16:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli,
	ian.jackson, xen-devel, ufimtseva

On Fri, Jan 23, 2015 at 03:43:07PM +0000, Wei Liu wrote:
> On Fri, Jan 23, 2015 at 03:37:51PM +0000, Jan Beulich wrote:
> > >>> On 23.01.15 at 15:46, <wei.liu2@citrix.com> wrote:
> > > On Fri, Jan 23, 2015 at 01:16:19PM +0000, Jan Beulich wrote:
> > >> >>> On 23.01.15 at 12:13, <wei.liu2@citrix.com> wrote:
> > >> > Make XENMEM_increase_reservation and XENMEM_populate_physmap
> > >> > vNUMA-aware.
> > >> > 
> > >> > That is, if guest requests Xen to allocate memory for specific vnode,
> > >> > Xen can translate vnode to pnode using vNUMA information of that guest.
> > >> > 
> > >> > XENMEMF_vnode is introduced for the guest to mark the node number is in
> > >> > fact virtual node number and should be translated by Xen.
> > >> > 
> > >> > XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
> > >> > able to translate virtual node to physical node.
> > >> > 
> > >> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > >> > Acked-by: Jan Beulich <JBeulich@suse.com>
> > >> 
> > >> I'm afraid there's another change needed for this to hold:
> > >> 
> > >> > --- a/xen/common/memory.c
> > >> > +++ b/xen/common/memory.c
> > >> > @@ -692,6 +692,50 @@ out:
> > >> >      return rc;
> > >> >  }
> > >> >  
> > >> > +static int translate_vnode_to_pnode(struct domain *d,
> > >> > +                                    struct xen_memory_reservation *r,
> > >> > +                                    struct memop_args *a)
> > >> > +{
> > >> > +    int rc = 0;
> > >> > +    unsigned int vnode, pnode;
> > >> > +
> > >> > +    /*
> > >> > +     * Note: we don't strictly require non-supported bits set to zero,
> > >> > +     * so we may have exact_vnode bit set for old guests that don't
> > >> > +     * support vNUMA.
> > >> > +     *
> > >> > +     * To distinguish spurious vnode request v.s. real one, check if
> > >> > +     * d->vnuma exists.
> > >> > +     */
> > >> > +    if ( r->mem_flags & XENMEMF_vnode )
> > >> > +    {
> > >> > +        read_lock(&d->vnuma_rwlock);
> > >> > +        if ( d->vnuma )
> > >> 
> > >> if r->mem_flags has XENMEMF_vnode set but d->vnuma is NULL,
> > >> you need to clear the node from the flags.
> > >> 
> > > 
> > > As said in the comment, we don't seem to enforce non-supported bits set
> > > to zero (IIRC you told me that). So an old guest that sets XENMEMF_vnode
> > > by accident will get its other flags cleared if I follow your suggestion.
> > 
> > Which is an acceptable thing to do I think - they called for
> > undefined behavior, and they now get unexpected behavior.
> > Mistaking the virtual node specified for a physical one is certainly
> > less desirable.
> > 
> 
> OK, thanks for clarification.
> 

So the logic of translation now is as follows (taking into consideration the
second point about how we should enforce the exact_node flag; I think that
flag should be preserved if it was requested at the beginning):

+static int translate_vnode_to_pnode(struct domain *d,
+                                    struct xen_memory_reservation *r,
+                                    struct memop_args *a)
+{
+    int rc = 0;
+    unsigned int vnode, pnode;
+
+    if ( r->mem_flags & XENMEMF_vnode )
+    {
+        a->memflags &= ~MEMF_node(XENMEMF_get_node(r->mem_flags));
+        a->memflags &= ~MEMF_exact_node;
+
+        read_lock(&d->vnuma_rwlock);
+        if ( d->vnuma )
+        {
+            vnode = XENMEMF_get_node(r->mem_flags);
+
+            if ( vnode < d->vnuma->nr_vnodes )
+            {
+                pnode = d->vnuma->vnode_to_pnode[vnode];
+
+                if ( pnode != NUMA_NO_NODE )
+                {
+                    a->memflags |= MEMF_node(pnode);
+                    if ( r->mem_flags & XENMEMF_exact_node_request )
+                        a->memflags |= MEMF_exact_node;
+                }
+            }
+            else
+                rc = -EINVAL;
+        }
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    return rc;
+}

> Wei.
> 
> > Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware
  2015-01-23 16:06           ` Wei Liu
@ 2015-01-23 16:17             ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-01-23 16:17 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

>>> On 23.01.15 at 17:06, <wei.liu2@citrix.com> wrote:
> So the logic of translation now is as follows (taking into consideration the
> second point about how we should enforce the exact_node flag; I think that
> flag should be preserved if it was requested at the beginning):

Yes, and the code looks right now.

Jan

> +static int translate_vnode_to_pnode(struct domain *d,
> +                                    struct xen_memory_reservation *r,
> +                                    struct memop_args *a)
> +{
> +    int rc = 0;
> +    unsigned int vnode, pnode;
> +
> +    if ( r->mem_flags & XENMEMF_vnode )
> +    {
> +        a->memflags &= ~MEMF_node(XENMEMF_get_node(r->mem_flags));
> +        a->memflags &= ~MEMF_exact_node;
> +
> +        read_lock(&d->vnuma_rwlock);
> +        if ( d->vnuma )
> +        {
> +            vnode = XENMEMF_get_node(r->mem_flags);
> +
> +            if ( vnode < d->vnuma->nr_vnodes )
> +            {
> +                pnode = d->vnuma->vnode_to_pnode[vnode];
> +
> +                if ( pnode != NUMA_NO_NODE )
> +                {
> +                    a->memflags |= MEMF_node(pnode);
> +                    if ( r->mem_flags & XENMEMF_exact_node_request )
> +                        a->memflags |= MEMF_exact_node;
> +                }
> +            }
> +            else
> +                rc = -EINVAL;
> +        }
> +        read_unlock(&d->vnuma_rwlock);
> +    }
> +
> +    return rc;
> +}
> 
>> Wei.
>> 
>> > Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 03/21] libxc: allocate memory with vNUMA information for PV guest
  2015-01-23 11:13 ` [PATCH v4 03/21] libxc: allocate memory with vNUMA information for PV guest Wei Liu
@ 2015-01-28 16:02   ` Ian Campbell
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:02 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> +        /* Setup dummy vNUMA information if it's not provided. Note
> +         * that this is a valid state if libxl doesn't provide any
> +         * vNUMA information.
> +         *
> +         * The dummy values make libxc allocate all pages from
> +         * arbitrary physical. This is the expected behaviour if no

                                ^ nodes ?

> +         * vNUMA configuration is provided to libxc.
> +         *
> +         * Note that the following hunk is just for the convenience of
> +         * allocation code. No defaulting is happening in libxc.
> +         */
> +        if ( dom->nr_vnuma_info == 0 )
> +        {

ASSERT(!dom->vnuma_info)?

Other than those minor issues/suggestions looks good:

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 04/21] libxl: introduce vNUMA types
  2015-01-23 11:13 ` [PATCH v4 04/21] libxl: introduce vNUMA types Wei Liu
@ 2015-01-28 16:04   ` Ian Campbell
  2015-01-28 21:51     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:04 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> A domain can contain several virtual NUMA nodes, hence we introduce an
> array in libxl_domain_build_info.
> 
> libxl_vnode_info contains the size of memory in that node, the distance
> from that node to every nodes, the underlying pnode and a bitmap of
> vcpus.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: Elena Ufimtseva <ufimtseva@gmail.com>

Assuming there is a LIBXL_HAVE define to come later in the series:

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 05/21] libxl: add vmemrange to libxl__domain_build_state
  2015-01-23 11:13 ` [PATCH v4 05/21] libxl: add vmemrange to libxl__domain_build_state Wei Liu
@ 2015-01-28 16:05   ` Ian Campbell
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:05 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> A vnode consists of one or more vmemranges (virtual memory range).  One
> example of multiple vmemranges is that there is a hole in one vnode.
> 
> Currently we haven't exported vmemrange interface to libxl user.
> Vmemranges are generated during domain build, so we have relevant
> structures in domain build state.
> 
> Later if we discover we need to export the interface, those structures
> can be moved to libxl_domain_build_info as well.
> 
> These new fields (along with other fields in that struct) are set to 0
> at start of day so we don't need to explicitly initialise them. A
> following patch which introduces an independent checking function will
> need to access these fields. I don't feel very comfortable squashing
> this change into that one so I use a single commit.

                                ^didn't?

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check
  2015-01-23 11:13 ` [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check Wei Liu
@ 2015-01-28 16:13   ` Ian Campbell
  2015-01-28 21:51     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:13 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 6d3ac58..39356ba 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -3394,6 +3394,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
>      libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
>  }
>  
> +/* Check if vNUMA config is valid. Returns 0 if valid. */

And on invalid? ERROR_FOO or just !0?

I see further down it is ERROR_INVAL, perhaps say so, and perhaps
introduce ERROR_INVAL_VNUMA_CONFIGURATION or something.

> +/* Check if vNUMA configuration is valid:
> + *  1. all pnodes inside vnode_to_pnode array are valid
> + *  2. one vcpu belongs to and only belongs to one vnode
> + *  3. each vmemrange is valid and doesn't overlap with each other
> + */
> +int libxl__vnuma_config_check(libxl__gc *gc,
> +			      const libxl_domain_build_info *b_info,

Hard tabs here.


> +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> +        uint32_t pnode;
> +
> +        p = &b_info->vnuma_nodes[i];
> +        pnode = p->pnode;
> +
> +        /* The pnode specified is not valid? */
> +        if (pnode >= nr_nodes) {

I take it that pnodes aren't (potentially) sparse?

> +            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
> +                       "Invalid pnode %d specified",
> +                       pnode);

Please use the short LOG macros (throughout).

> +            goto out;
> +        }
> +
> +        total_memkb += p->memkb;
> +    }
> +
> +    if (total_memkb != b_info->max_memkb) {
> +        LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
> +                   "Amount of memory mismatch (0x%"PRIx64" != 0x%"PRIx64")",
> +                   total_memkb, b_info->max_memkb);

Did arch_setup_meminit not also check this?

Meaning it could maybe abort() or ASSERT. Not 100% sure about that being
wise though.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 07/21] libxl: x86: factor out e820_host_sanitize
  2015-01-23 11:13 ` [PATCH v4 07/21] libxl: x86: factor out e820_host_sanitize Wei Liu
@ 2015-01-28 16:14   ` Ian Campbell
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:14 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> This function gets the machine E820 map and sanitize it according to PV
> guest configuration.
> 
> This will be used in later patch. No functional change introduced in
> this patch.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest
  2015-01-23 11:13 ` [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest Wei Liu
@ 2015-01-28 16:27   ` Ian Campbell
  2015-01-28 21:59     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> +    /* Generate one vmemrange for each virtual node. */
> +    next = 0;
> +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> +        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
> +
> +        GCREALLOC_ARRAY(v, i+1);

Can we not just allocate it outside the loop using
b_info->num_vnuma_nodes?

> +
> +        v[i].start = next;
> +        v[i].end = next + (p->memkb << 10);
> +        v[i].flags = 0;
> +        v[i].nid = i;
> +
> +        next = v[i].end;
> +    }
> +
> +    state->vmemranges = v;
> +    state->num_vmemranges = i;
> +
> +    return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> index d012b4d..4841fef 100644
> --- a/tools/libxl/libxl_x86.c
> +++ b/tools/libxl/libxl_x86.c
> @@ -339,6 +339,82 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
>      return 0;
>  }
>  
> +int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,

This isn't so much "build" as "adjust" or "make it fit arch policy" or
something?

> +                                      uint32_t domid,
> +                                      libxl_domain_build_info *b_info,
> +                                      libxl__domain_build_state *state)
> +{
> +    int nid, nr_vmemrange, rc;
> +    uint32_t nr_e820, e820_count;
> +    struct e820entry map[E820MAX];
> +    xen_vmemrange_t *vmemranges;
> +
> +    /* Only touch vmemranges if it's PV guest and e820_host is true */

I take it that obeying e820_host here is only WRT where the memory is
placed, not also trying to match the underlying host's numa layout?

> +    if (!(b_info->type == LIBXL_DOMAIN_TYPE_PV &&
> +          libxl_defbool_val(b_info->u.pv.e820_host))) {
> +        rc = 0;
> +        goto out;
> +    }
> +
> +    rc = e820_host_sanitize(gc, b_info, map, &nr_e820);
> +    if (rc) goto out;
> +
> +    /* Ditch old vmemranges and start with host E820 map. Note, memory
> +     * was gc allocated.

This makes me think something is backwards in the layering here.

Wouldn't it make more sense for libxl__arch_vnuma_build_vmemrange to be
the entry point which starts with:
    if (nothing unusual like e820_host going on)
        return libxl__vnuma_build_pv_generic(...)

    ...the special stuff.

> +    e820_count = 0;
> +    nr_vmemrange = 0;
> +    vmemranges = NULL;
> +    for (nid = 0; nid < b_info->num_vnuma_nodes; nid++) {

This fill-loop looked ok to me.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 09/21] libxl: build, check and pass vNUMA info to Xen for PV guest
  2015-01-23 11:13 ` [PATCH v4 09/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2015-01-28 16:29   ` Ian Campbell
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:29 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> Transform the user supplied vNUMA configuration into libxl internal
> representations, and finally libxc representations. Check validity of
> the configuration along the line.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 14/21] libxc: indentation change to xc_hvm_build_x86.c
  2015-01-23 11:13 ` [PATCH v4 14/21] libxc: indentation change to xc_hvm_build_x86.c Wei Liu
@ 2015-01-28 16:30   ` Ian Campbell
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:30 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> Move a while loop in xc_hvm_build_x86 one block to the right. No
> functional change introduced.
> 
> Functional changes will be introduced in next patch.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-23 11:13 ` [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
@ 2015-01-28 16:36   ` Ian Campbell
  2015-01-28 22:07     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:36 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> The algorithm is more or less the same as the one used for PV guest.

Any reason the code can't be shared then? :-D
>  
> +    if ( nr_pages > target_pages )
> +        memflags |= XENMEMF_populate_on_demand;

OOI how do vNUMA and PoD interact? Do you prefer to fill a node as
much as possible (leaving other nodes entirely PoD) or do you prefer to
balance the PoD pages between nodes?

I think the former?

I suspect this depends a lot on the guest behaviour, so there probably
isn't a right answer. Are there corner cases where this might go wrong
though?

> +
> +    if ( args->nr_vnuma_info == 0 )
> +    {
> +        /* Build dummy vnode information */
> +        dummy_vnuma_info.vnode = 0;
> +        dummy_vnuma_info.pnode = XC_VNUMA_NO_NODE;
> +        dummy_vnuma_info.pages = args->mem_size >> PAGE_SHIFT;
> +        args->nr_vnuma_info = 1;
> +        args->vnuma_info = &dummy_vnuma_info;

You could have done this in the PV case too, I nearly suggested it then
but realised I already had, so I figured there was a reason not to?

> +    }
> +    else
> +    {
> +        if ( nr_pages > target_pages )
> +        {
> +            PERROR("Cannot enable vNUMA and PoD at the same time");

And there's the answer to my question above ;-)

> +            goto error_out;
>  
> +    for ( i = 0; i < args->nr_vnuma_info; i++ )
>      {

Reindenting in a precursor patch made this diff a lot easier to read,
thanks.

>              if ( count > max_pages )
> @@ -388,19 +440,20 @@ static int setup_guest(xc_interface *xch,
>                  unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
>                  xen_pfn_t sp_extents[nr_extents];
>  
> -                for ( i = 0; i < nr_extents; i++ )
> -                    sp_extents[i] =
> -                        page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
> +                for ( j = 0; j < nr_extents; j++ )
> +                    sp_extents[j] =
> +                        page_array[cur_pages+(j<<SUPERPAGE_1GB_SHIFT)];

You might consider s/i/j/ for a precursor patch too. Or given the scope
of the outermost loop is quite large a more specific name than i (like
vnid) might be more appropriate.

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen for HVM guest
  2015-01-23 11:13 ` [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2015-01-28 16:41   ` Ian Campbell
  2015-01-28 22:14     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:41 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> +    if (info->num_vnuma_nodes != 0) {
> +        int i;
> +
> +        args.nr_vnuma_info = info->num_vnuma_nodes;
> +        args.vnuma_info = libxl__malloc(gc, sizeof(*args.vnuma_info) *
> +                                        args.nr_vnuma_info);
> +        for (i = 0; i < args.nr_vnuma_info; i++) {
> +            args.vnuma_info[i].vnode = i;
> +            args.vnuma_info[i].pnode = info->vnuma_nodes[i].pnode;
> +            args.vnuma_info[i].pages = info->vnuma_nodes[i].memkb >> 2;
> +        }
> +
> +        /* Consider video ram belongs to node 0 */
> +        args.vnuma_info[0].pages -= (info->video_memkb >> 2);

You need to check that the node is big enough I think?
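
Something along these lines, perhaps (just a sketch, the exact error handling
should match whatever the surrounding function does):

    if (args.vnuma_info[0].pages <= (info->video_memkb >> 2)) {
        LOG(ERROR, "vnode 0 is too small to cover the video ram");
        rc = ERROR_INVAL;
        goto out;
    }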

> +            if (next >= hole_start && next < hole_end)
> +                next = hole_end;
> +            if ((next < hole_start) && (next + remaining_bytes >= hole_start))
> +                count = hole_start - next;
> +
> +            GCREALLOC_ARRAY(vmemranges, nr_vmemrange+1);

Given there's only a single hole, can't you calculate the final
nr_vmemrange value up front? (i.e. it's 2, I think).

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled
  2015-01-23 11:13 ` [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
@ 2015-01-28 16:41   ` Ian Campbell
  2015-01-28 22:22     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:41 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> Disallow memory relocation when vNUMA is enabled, because relocated
> memory ends up off node. Further more, even if we dynamically expand
> node coverage in hvmloader, low memory and high memory may reside
> in different physical nodes, blindly relocating low memory to high
> memory gives us a sub-optimal configuration.

What are the downsides of doing this? Less flexible PCI passthrough/MMIO
hole sizing I think? Anything else?

Does some user doc somewhere need this adding?

(and now I think of it, the PoD vs. vnuma one too).

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 18/21] libxlu: rework internal representation of setting
  2015-01-23 11:13 ` [PATCH v4 18/21] libxlu: rework internal representation of setting Wei Liu
@ 2015-01-28 16:41   ` Ian Campbell
  2015-02-11 16:12   ` Ian Jackson
  1 sibling, 0 replies; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:41 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> This patches does following things:

I'm leaving this one to Ian J.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 21/21] xl: vNUMA support
  2015-01-23 11:13 ` [PATCH v4 21/21] xl: vNUMA support Wei Liu
@ 2015-01-28 16:46   ` Ian Campbell
  2015-01-28 22:52     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-28 16:46 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> This patch includes configuration options parser and documentation.
> 
> Please find the hunk to xl.cfg.pod.5 for more information.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
> Changes in v2:
> 1. Make vnuma_vdistances mandatory.
> 2. Use nested list to specify vdistances.
> 3. Update documentation.
> ---
>  docs/man/xl.cfg.pod.5    |   39 ++++++++++++
>  tools/libxl/xl_cmdimpl.c |  147 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 186 insertions(+)
> 
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index e2f91fc..b23bd6f 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -266,6 +266,45 @@ it will crash.
>  
>  =back
>  
> +=head3 Virtual NUMA Memory Allocation
> +
> +=over 4
> +
> +=item B<vnuma_memory=[ NUMBER, NUMBER, ... ]>
> +
> +Specify the size of memory covered by each virtual NUMA node. The number of
> +elements in the list also implicitly defines the number of virtual NUMA nodes.
> +
> +The sum of all elements in this list should be equal to memory size specified
> +by B<maxmem=> in guest configuration file, or B<memory=> if B<maxmem=> is not
> +specified.

Could we make it permissible to omit memory= and have the toolstack do
the maths for us?

I suppose == maxmem is due to no vnuma vs. PoD? What about for PV
guests, is preballooning allowed there too?

> +
> +=item B<vnuma_vcpu_map=[ NUMBER, NUMBER, ... ]>
> +
> +Specifiy which virutal NUMA node a specific vcpu belongs to. The number of

"Specify" and "virtual".

> +elements in this list should be equal to B<maxvcpus=> in guest configuration
> +file, or B<vcpus=> if B<maxvcpus=> is not specified.

Again, can we relieve the user of these tedious sums?
> +=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
> +
> +Specifiy which physical NUMA node a specific virtual NUMA node maps to. The

"Specify" again.

> +number of elements in this list should be equal to the number of virtual
> +NUMA nodes defined in B<vnuma_memory=>.

Would it make sense to instead have a single array or e.g. "NODE:SIZE"
or something?

> +
> +=item B<vnuma_vdistance=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
> +
> +Two dimensional list to specify distances among nodes.
> +
> +The number of elements in the first dimension list equals the number of virtual
> +nodes. Each element in position B<i> is a list that specifies the distances
> +from node B<i> to other nodes.
> +
> +For example, for a guest with 2 virtual nodes, user can specify:
> +
> +  vnuma_vdistance = [ [10, 20], [20, 10] ]

Any guidance on how a user should choose these numbers?

Do we support a mode where something figures this out based on the
underlying distances between the pnodes to which the vnodes are assigned?

Would a user ever want/need to override such a mapping?

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 04/21] libxl: introduce vNUMA types
  2015-01-28 16:04   ` Ian Campbell
@ 2015-01-28 21:51     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-28 21:51 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:04:12PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > A domain can contain several virtual NUMA nodes, hence we introduce an
> > array in libxl_domain_build_info.
> > 
> > libxl_vnode_info contains the size of memory in that node, the distance
> > from that node to every nodes, the underlying pnode and a bitmap of
> > vcpus.
> > 
> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > Cc: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Dario Faggioli <dario.faggioli@citrix.com>
> > Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> 
> Assuming there is a LIBXL_HAVE define to come later in the series:
> 

I will add / move that macro here in next version.

Wei.

> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check
  2015-01-28 16:13   ` Ian Campbell
@ 2015-01-28 21:51     ` Wei Liu
  2015-01-29 11:04       ` Ian Campbell
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-28 21:51 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:13:28PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> > index 6d3ac58..39356ba 100644
> > --- a/tools/libxl/libxl_internal.h
> > +++ b/tools/libxl/libxl_internal.h
> > @@ -3394,6 +3394,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> >      libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> >  }
> >  
> > +/* Check if vNUMA config is valid. Returns 0 if valid. */
> 
> And on invalid? ERROR_FOO or just !0?
> 
> I see further down it is ERROR_INVAL, perhaps say so, and perhaps
> introduce ERROR_INVAL_VNUMA_CONFIGUATION or something.

I think ERROR_INVALID is good enough. There are logs along the line to
indicate what goes wrong.

> 
> > +/* Check if vNUMA configuration is valid:
> > + *  1. all pnodes inside vnode_to_pnode array are valid
> > + *  2. one vcpu belongs to and only belongs to one vnode
> > + *  3. each vmemrange is valid and doesn't overlap with each other
> > + */
> > +int libxl__vnuma_config_check(libxl__gc *gc,
> > +			      const libxl_domain_build_info *b_info,
> 
> Hard tabs here.
> 

Fixed.

> 
> > +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> > +        uint32_t pnode;
> > +
> > +        p = &b_info->vnuma_nodes[i];
> > +        pnode = p->pnode;
> > +
> > +        /* The pnode specified is not valid? */
> > +        if (pnode >= nr_nodes) {
> 
> I take it that pnodes aren't (potentially) sparse?
> 

That is the assumption used in libxc and xen I think. I haven't seen
sparse pnodes personally.

> > +            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
> > +                       "Invalid pnode %d specified",
> > +                       pnode);
> 
> Please use the short LOG macros (throughout).
> 

Ack.

> > +            goto out;
> > +        }
> > +
> > +        total_memkb += p->memkb;
> > +    }
> > +
> > +    if (total_memkb != b_info->max_memkb) {
> > +        LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
> > +                   "Amount of memory mismatch (0x%"PRIx64" != 0x%"PRIx64")",
> > +                   total_memkb, b_info->max_memkb);
> 
> Did arch_setup_meminit not also check this?
> 

Yes. But I think libxl should perform due diligence as well.  Logging
here is also more user friendly.

> Meaning it could maybe abort() or ASSERT. Not 100% sure about that being
> wise though.
> 

No need to abort, because this function is used to validate user input.

Wei.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest
  2015-01-28 16:27   ` Ian Campbell
@ 2015-01-28 21:59     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-28 21:59 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:27:29PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > +    /* Generate one vmemrange for each virtual node. */
> > +    next = 0;
> > +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> > +        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
> > +
> > +        GCREALLOC_ARRAY(v, i+1);
> 
> Can we not just allocate it outside the loop using
> b_info->num_vnuma_nodes?
> 

Yes.

> > +
> > +        v[i].start = next;
> > +        v[i].end = next + (p->memkb << 10);
> > +        v[i].flags = 0;
> > +        v[i].nid = i;
> > +
> > +        next = v[i].end;
> > +    }
> > +
> > +    state->vmemranges = v;
> > +    state->num_vmemranges = i;
> > +
> > +    return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
> > +}
> > +
> >  /*
> >   * Local variables:
> >   * mode: C
> > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> > index d012b4d..4841fef 100644
> > --- a/tools/libxl/libxl_x86.c
> > +++ b/tools/libxl/libxl_x86.c
> > @@ -339,6 +339,82 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
> >      return 0;
> >  }
> >  
> > +int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
> 
> This isn't so much "build" as "adjust" or "make it fit arch policy" or
> something?
> 

"adjust" is better.

> > +                                      uint32_t domid,
> > +                                      libxl_domain_build_info *b_info,
> > +                                      libxl__domain_build_state *state)
> > +{
> > +    int nid, nr_vmemrange, rc;
> > +    uint32_t nr_e820, e820_count;
> > +    struct e820entry map[E820MAX];
> > +    xen_vmemrange_t *vmemranges;
> > +
> > +    /* Only touch vmemranges if it's PV guest and e820_host is true */
> 
> I take it that obeying e820_host here is only WRT where the memory is
> placed, not also trying to match the underlying host's numa layout?
> 

It's used to set aside memory holes according to host e820 map. It
doesn't take into account host numa layout.

> > +    if (!(b_info->type == LIBXL_DOMAIN_TYPE_PV &&
> > +          libxl_defbool_val(b_info->u.pv.e820_host))) {
> > +        rc = 0;
> > +        goto out;
> > +    }
> > +
> > +    rc = e820_host_sanitize(gc, b_info, map, &nr_e820);
> > +    if (rc) goto out;
> > +
> > +    /* Ditch old vmemranges and start with host E820 map. Note, memory
> > +     * was gc allocated.
> 
> This makes me think something is backwards in the layering here.
> 
> Wouldn't it make more sense for libxl__arch_vnuma_build_vmemrange to be
> the entry point which starts with:
>     if (nothing unusual like e820_host going on)
>         return libxl__vnuma_build_pv_generic(...)
> 
>     ...the special stuff.
> 

OK. That's fine by me.

Wei.

> > +    e820_count = 0;
> > +    nr_vmemrange = 0;
> > +    vmemranges = NULL;
> > +    for (nid = 0; nid < b_info->num_vnuma_nodes; nid++) {
> 
> This fill-loop looked ok to me.
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-28 16:36   ` Ian Campbell
@ 2015-01-28 22:07     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-28 22:07 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:36:56PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > The algorithm is more or less the same as the one used for PV guest.
> 
> Any reason the code can't be shared then? :-D

Because the details are quite different. HVM path is entangled with
superpage handling logic...

> >  
> > +    if ( nr_pages > target_pages )
> > +        memflags |= XENMEMF_populate_on_demand;
> 
> OOI how do vNUMA and PoD interact? Do you prefer to fill a node as
> much as possible (leaving other nodes entirely PoD) or do you prefer to
> balance the PoD pages between nodes?
> 
> I think the former?
> 
> I suspect this depends a lot on the guest behaviour, so there probably
> isn't a right answer. Are there corner cases where this might go wrong
> though?
> 

I don't think vnuma and PoD interact well at this point. We need to at
least implement numa-aware PoD to make them work well.

> > +
> > +    if ( args->nr_vnuma_info == 0 )
> > +    {
> > +        /* Build dummy vnode information */
> > +        dummy_vnuma_info.vnode = 0;
> > +        dummy_vnuma_info.pnode = XC_VNUMA_NO_NODE;
> > +        dummy_vnuma_info.pages = args->mem_size >> PAGE_SHIFT;
> > +        args->nr_vnuma_info = 1;
> > +        args->vnuma_info = &dummy_vnuma_info;
> 
> You could have done this in the PV case too, I nearly suggested it then
> but realised I already had, so I figured there was a reason not to?
> 

Yes I thought about that.

In the PV case memory is managed by libxc's simple memory pool, so using
stack variables like this won't actually make the code cleaner.

In the HVM case there is no libxc memory management involved, so using
stack variables is simpler IMHO.

> > +    }
> > +    else
> > +    {
> > +        if ( nr_pages > target_pages )
> > +        {
> > +            PERROR("Cannot enable vNUMA and PoD at the same time");
> 
> And there's the answer to my question above ;-)
> 
> > +            goto error_out;
> >  
> > +    for ( i = 0; i < args->nr_vnuma_info; i++ )
> >      {
> 
> Reindenting in a precursor patch made this diff a lot easier to read,
> thanks.
> 
> >              if ( count > max_pages )
> > @@ -388,19 +440,20 @@ static int setup_guest(xc_interface *xch,
> >                  unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
> >                  xen_pfn_t sp_extents[nr_extents];
> >  
> > -                for ( i = 0; i < nr_extents; i++ )
> > -                    sp_extents[i] =
> > -                        page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
> > +                for ( j = 0; j < nr_extents; j++ )
> > +                    sp_extents[j] =
> > +                        page_array[cur_pages+(j<<SUPERPAGE_1GB_SHIFT)];
> 
> > You might consider s/i/j/ for a precursor patch too. Or given the scope
> of the outermost loop is quite large a more specific name than i (like
> vnid) might be more appropriate.
> 

Ack.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen for HVM guest
  2015-01-28 16:41   ` Ian Campbell
@ 2015-01-28 22:14     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-28 22:14 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:41:05PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > +    if (info->num_vnuma_nodes != 0) {
> > +        int i;
> > +
> > +        args.nr_vnuma_info = info->num_vnuma_nodes;
> > +        args.vnuma_info = libxl__malloc(gc, sizeof(*args.vnuma_info) *
> > +                                        args.nr_vnuma_info);
> > +        for (i = 0; i < args.nr_vnuma_info; i++) {
> > +            args.vnuma_info[i].vnode = i;
> > +            args.vnuma_info[i].pnode = info->vnuma_nodes[i].pnode;
> > +            args.vnuma_info[i].pages = info->vnuma_nodes[i].memkb >> 2;
> > +        }
> > +
> > +        /* Consider video ram belongs to node 0 */
> > +        args.vnuma_info[0].pages -= (info->video_memkb >> 2);
> 
> You need to check that the node is big enough I think?
> 

Right.

> > +            if (next >= hole_start && next < hole_end)
> > +                next = hole_end;
> > +            if ((next < hole_start) && (next + remaining_bytes >= hole_start))
> > +                count = hole_start - next;
> > +
> > +            GCREALLOC_ARRAY(vmemranges, nr_vmemrange+1);
> 
> Given there's only a single hole, can't you calculate the final
> nr_vmemrange value up front? (i.e. it's 2, I think).
> 

I would like to keep the algorithm more flexible here so that we can
incorporate the e820_host option for HVM guests in the future.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled
  2015-01-28 16:41   ` Ian Campbell
@ 2015-01-28 22:22     ` Wei Liu
  2015-01-29 11:06       ` Ian Campbell
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-28 22:22 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:41:11PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > Disallow memory relocation when vNUMA is enabled, because relocated
> > memory ends up off node. Further more, even if we dynamically expand
> > node coverage in hvmloader, low memory and high memory may reside
> > in different physical nodes, blindly relocating low memory to high
> > memory gives us a sub-optimal configuration.
> 
> What are the downsides of doing this? Less flexible PCI passthrough/MMIO
> hole sizing I think? Anything else?

Yes, PCI passthrough and MMIO hole size are affected, but not to the
degree that they are unusable.  Going forward like this doesn't make
things worse than before I think.  Memory relocation is already disabled
if we're using QEMU upstream.

> 
> Does some user doc somewhere need this adding?
> 
> (and now I think of it, the PoD vs. vnuma one too).
> 

Maybe release note or xl manpage?

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 21/21] xl: vNUMA support
  2015-01-28 16:46   ` Ian Campbell
@ 2015-01-28 22:52     ` Wei Liu
  2015-01-29 11:10       ` Ian Campbell
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-28 22:52 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Wed, Jan 28, 2015 at 04:46:05PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > This patch includes configuration options parser and documentation.
> > 
> > Please find the hunk to xl.cfg.pod.5 for more information.
> > 
> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > Cc: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Dario Faggioli <dario.faggioli@citrix.com>
> > Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> > ---
> > Changes in v2:
> > 1. Make vnuma_vdistances mandatory.
> > 2. Use nested list to specify vdistances.
> > 3. Update documentation.
> > ---
> >  docs/man/xl.cfg.pod.5    |   39 ++++++++++++
> >  tools/libxl/xl_cmdimpl.c |  147 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 186 insertions(+)
> > 
> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> > index e2f91fc..b23bd6f 100644
> > --- a/docs/man/xl.cfg.pod.5
> > +++ b/docs/man/xl.cfg.pod.5
> > @@ -266,6 +266,45 @@ it will crash.
> >  
> >  =back
> >  
> > +=head3 Virtual NUMA Memory Allocation
> > +
> > +=over 4
> > +
> > +=item B<vnuma_memory=[ NUMBER, NUMBER, ... ]>
> > +
> > +Specify the size of memory covered by each virtual NUMA node. The number of
> > +elements in the list also implicitly defines the number of virtual NUMA nodes.
> > +
> > +The sum of all elements in this list should be equal to memory size specified
> > +by B<maxmem=> in guest configuration file, or B<memory=> if B<maxmem=> is not
> > +specified.
> 
> Could we make it permissible to omit memory= and have the toolstack do
> the maths for us?
> 

This is OK.

> I suppose == maxmem is due to no vnuma vs. PoD? What about for PV

Yes.

> guests, is preballooning allowed there too?

I need to check PV boot sequence to have a definite answer.

Currently memory allocation in libxc only deals with a chunk of
contiguous memory. I'm not sure whether changing that will break some
assumptions that the guest kernel makes.

> 
> > +
> > +=item B<vnuma_vcpu_map=[ NUMBER, NUMBER, ... ]>
> > +
> > +Specifiy which virutal NUMA node a specific vcpu belongs to. The number of
> 
> "Specify" and "virtual".
> 
> > +elements in this list should be equal to B<maxvcpus=> in guest configuration
> > +file, or B<vcpus=> if B<maxvcpus=> is not specified.
> 
> Again, can we relieve the user of these tedious sums?

This point becomes moot if we use a different syntax to specify the vcpu list.

> > +=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
> > +
> > +Specifiy which physical NUMA node a specific virtual NUMA node maps to. The
> 
> "Specify" again.
> 
> > +number of elements in this list should be equal to the number of virtual
> > +NUMA nodes defined in B<vnuma_memory=>.
> 
> Would it make sense to instead have a single array or e.g. "NODE:SIZE"
> or something?
> 

Or "PNODE:SIZE:VCPUS"?

> > +
> > +=item B<vnuma_vdistance=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
> > +
> > +Two dimensional list to specify distances among nodes.
> > +
> > +The number of elements in the first dimension list equals the number of virtual
> > +nodes. Each element in position B<i> is a list that specifies the distances
> > +from node B<i> to other nodes.
> > +
> > +For example, for a guest with 2 virtual nodes, user can specify:
> > +
> > +  vnuma_vdistance = [ [10, 20], [20, 10] ]
> 
> Any guidance on how a user should choose these numbers?
> 

I think using the number from numactl is good enough.

> Do we support a mode where something figures this out based on the
> underlying distances between the pnodes to which the vnodes are assigned?
> 

Dario is working on that.

> Would a user ever want/need to override such a mapping?
> 

Probably not. But we need to provide a way to override it should the user
want to do so.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check
  2015-01-28 21:51     ` Wei Liu
@ 2015-01-29 11:04       ` Ian Campbell
  2015-01-29 16:01         ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-29 11:04 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Wed, 2015-01-28 at 21:51 +0000, Wei Liu wrote:
> On Wed, Jan 28, 2015 at 04:13:28PM +0000, Ian Campbell wrote:
> > On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> > > index 6d3ac58..39356ba 100644
> > > --- a/tools/libxl/libxl_internal.h
> > > +++ b/tools/libxl/libxl_internal.h
> > > @@ -3394,6 +3394,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> > >      libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> > >  }
> > >  
> > > +/* Check if vNUMA config is valid. Returns 0 if valid. */
> > 
> > And on invalid? ERROR_FOO or just !0?
> > 
> > I see further down it is ERROR_INVAL, perhaps say so, and perhaps
> > introduce ERROR_INVAL_VNUMA_CONFIGURATION or something.
> 
> I think ERROR_INVALID is good enough. There are logs along the line to
> indicate what goes wrong.

The xapi guys have indicated that having to parse log lines is an issue
for them, and are planning to help clean this up.

We are supposed to have a policy of specific error codes which we've not
really done a good job of applying. I think we can start not adding new
uses of generic error codes. (ERROR_FAIL is the worst offender, but
ERROR_INVALID isn't much better IMHO).


> > > +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> > > +        uint32_t pnode;
> > > +
> > > +        p = &b_info->vnuma_nodes[i];
> > > +        pnode = p->pnode;
> > > +
> > > +        /* The pnode specified is not valid? */
> > > +        if (pnode >= nr_nodes) {
> > 
> > I take it that pnodes aren't (potentially) sparse?
> > 
> 
> That is the assumption used in libxc and xen I think. I haven't seen
> sparse pnodes personally.

OK. The question is really whether nodes can be sparse in h/w and if so
does Xen hide that from us. I assume in h/w == yes and the hope here is
that Xen will hide it...


> > > +            goto out;
> > > +        }
> > > +
> > > +        total_memkb += p->memkb;
> > > +    }
> > > +
> > > +    if (total_memkb != b_info->max_memkb) {
> > > +        LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
> > > +                   "Amount of memory mismatch (0x%"PRIx64" != 0x%"PRIx64")",
> > > +                   total_memkb, b_info->max_memkb);
> > 
> > Did arch_setup_meminit not also check this?
> > 
> 
> Yes. But I think libxl should perform due diligence as well.  Logging
> here is also more user friendly.

OK

> > Meaning it could maybe abort() or ASSERT. Not 100% sure about that being
> > wise though.
> > 
> 
> No need to abort, because this function is used to validate user input.

I think I meant the libxc version -- which in theory should never see
invalid input because libxl has arranged never to pass it any... But
anyway, not sure that's a good idea.

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled
  2015-01-28 22:22     ` Wei Liu
@ 2015-01-29 11:06       ` Ian Campbell
  2015-01-29 16:04         ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-29 11:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Wed, 2015-01-28 at 22:22 +0000, Wei Liu wrote:
> On Wed, Jan 28, 2015 at 04:41:11PM +0000, Ian Campbell wrote:
> > On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > > Disallow memory relocation when vNUMA is enabled, because relocated
> > > memory ends up off node. Further more, even if we dynamically expand
> > > node coverage in hvmloader, low memory and high memory may reside
> > > in different physical nodes, blindly relocating low memory to high
> > > memory gives us a sub-optimal configuration.
> > 
> > What are the downsides of doing this? Less flexible PCI passthrough/MMIO
> > hole sizing I think? Anything else?
> 
> Yes, PCI passthrough and MMIO hole size are affected, but not to the
> degree that they are unusable.  Going forward like this doesn't make
> things worse than before I think.

> Memory relocation is already disabled if we're using QEMU upstream.

Is that a shortcoming which might get fixed?

Is this vnuma restriction something which could be fixed in the future?
 
> > Does some user doc somewhere need this adding?
> > 
> > (and now I think of it, the PoD vs. vnuma one too).
> > 
> 
> Maybe release note or xl manpage?

Manpage would be good. Perhaps in some vnuma docs on the wiki too.

> 
> Wei.
> 
> > Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 21/21] xl: vNUMA support
  2015-01-28 22:52     ` Wei Liu
@ 2015-01-29 11:10       ` Ian Campbell
  2015-01-29 17:46         ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Campbell @ 2015-01-29 11:10 UTC (permalink / raw)
  To: Wei Liu
  Cc: ufimtseva, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich

On Wed, 2015-01-28 at 22:52 +0000, Wei Liu wrote:
> > guests, is preballooning allowed there too?
> 
> I need to check PV boot sequence to have a definite answer.
> 
> Currently memory allocation in libxc only deals with a chunk of
> contiguous memory. I'm not sure whether changing that will break some
> assumptions that the guest kernel makes.

Please do check, and if it doesn't work today we really ought to have
a plan on how to integrate it in the future, in case (as seems likely) it
requires cooperation between tools and kernel -- so we can think about
steps now to make it easier on ourselves then...

> > > +=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
> > > +
> > > +Specifiy which physical NUMA node a specific virtual NUMA node maps to. The
> > 
> > "Specify" again.
> > 
> > > +number of elements in this list should be equal to the number of virtual
> > > +NUMA nodes defined in B<vnuma_memory=>.
> > 
> > Would it make sense to instead have a single array or e.g. "NODE:SIZE"
> > or something?
> > 
> 
> Or "PNODE:SIZE:VCPUS"?

That seems plausible.

One concern would be future expansion, perhaps foo=bar,baz=wibble?

> > > +=item B<vnuma_vdistance=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
> > > +
> > > +Two dimensional list to specify distances among nodes.
> > > +
> > > +The number of elements in the first dimension list equals the number of virtual
> > > +nodes. Each element in position B<i> is a list that specifies the distances
> > > +from node B<i> to other nodes.
> > > +
> > > +For example, for a guest with 2 virtual nodes, user can specify:
> > > +
> > > +  vnuma_vdistance = [ [10, 20], [20, 10] ]
> > 
> > Any guidance on how a user should choose these numbers?
> > 
> 
> I think using the number from numactl is good enough.

Worth mentioning in the docs I think.

> > Do we support a mode where something figures this out based on the
> > underlying distances between the pnodes to which the vnodes are assigned?
> > 
> 
> Dario is working on that.

I thought he was working on automatic numa placement (i.e. figuring out
the best set of pnodes to map the vnodes to), whereas what I was
suggesting was that given the user has specified a vnode->pnode mapping
it should be possible to construct a distances table pretty trivially
from that. Is Dario working on that too?
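
E.g. something like this, I'd have thought (rough sketch; vdistance,
vnode_to_pnode and nr_vnodes stand for whatever names libxl ends up using
internally, and the host distances come from libxl_get_numainfo()):

    libxl_numainfo *ninfo;
    int nr_pnodes, i, j;

    ninfo = libxl_get_numainfo(CTX, &nr_pnodes);
    if (!ninfo) return ERROR_FAIL;

    /* A vnode inherits the distances of the pnode it is placed on. */
    for (i = 0; i < nr_vnodes; i++)
        for (j = 0; j < nr_vnodes; j++)
            vdistance[i][j] =
                ninfo[vnode_to_pnode[i]].dists[vnode_to_pnode[j]];

    libxl_numainfo_list_free(ninfo, nr_pnodes);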

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check
  2015-01-29 11:04       ` Ian Campbell
@ 2015-01-29 16:01         ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-29 16:01 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Thu, Jan 29, 2015 at 11:04:15AM +0000, Ian Campbell wrote:
> On Wed, 2015-01-28 at 21:51 +0000, Wei Liu wrote:
> > On Wed, Jan 28, 2015 at 04:13:28PM +0000, Ian Campbell wrote:
> > > On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> > > > index 6d3ac58..39356ba 100644
> > > > --- a/tools/libxl/libxl_internal.h
> > > > +++ b/tools/libxl/libxl_internal.h
> > > > @@ -3394,6 +3394,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> > > >      libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> > > >  }
> > > >  
> > > > +/* Check if vNUMA config is valid. Returns 0 if valid. */
> > > 
> > > And on invalid? ERROR_FOO or just !0?
> > > 
> > > I see further down it is ERROR_INVAL, perhaps say so, and perhaps
> > > introduce ERROR_INVAL_VNUMA_CONFIGUATION or something.
> > 
> > I think ERROR_INVALID is good enough. There are logs along the line to
> > indicate what goes wrong.
> 
> The xapi guys have indicated that having to parse log lines is an issue
> for them, and are planning to help clean this up.
> 
> We are supposed to have a policy of specific error codes which we've not
> really done a good job of applying. I think we can start not adding new
> uses of generic error codes. (ERROR_FAIL is the worst offender, but
> ERROR_INVALID isn't much better IMHO).
> 

OK. I will use ERROR_INVAL_VNUMA_CONFIG.

Wei.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled
  2015-01-29 11:06       ` Ian Campbell
@ 2015-01-29 16:04         ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-01-29 16:04 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Thu, Jan 29, 2015 at 11:06:07AM +0000, Ian Campbell wrote:
> On Wed, 2015-01-28 at 22:22 +0000, Wei Liu wrote:
> > On Wed, Jan 28, 2015 at 04:41:11PM +0000, Ian Campbell wrote:
> > > On Fri, 2015-01-23 at 11:13 +0000, Wei Liu wrote:
> > > > Disallow memory relocation when vNUMA is enabled, because relocated
> > > > memory ends up off node. Further more, even if we dynamically expand
> > > > node coverage in hvmloader, low memory and high memory may reside
> > > > in different physical nodes, blindly relocating low memory to high
> > > > memory gives us a sub-optimal configuration.
> > > 
> > > What are the downsides of doing this? Less flexible PCI passthrough/MMIO
> > > hole sizing I think? Anything else?
> > 
> > Yes, PCI passthrough and MMIO hole size are affected, but not to the
> > degree that they are unusable.  Going forward like this doesn't make
> > things worse than before I think.
> 
> > Memory relocation is already disabled if we're using QEMU upstream.
> 
> Is that a shortcoming which might get fixed?
> 

Not until proper infrastructure is in place to allow communication among
several parties (hypervisor, QEMU and hvmloader).

> Is this vnuma restriction something which could be fixed in the future?

Presumably if the aforementioned infrastructure is in place this can be
fixed at the same time we fix that QEMU problem.

The infrastructure is a common blocker to both issues.

>  
> > > Does some user doc somewhere need this adding?
> > > 
> > > (and now I think of it, the PoD vs. vnuma one too).
> > > 
> > 
> > Maybe release note or xl manpage?
> 
> Manpage would be good. Perhaps in some vnuma docs on the wiki too.
> 

No problem.

Wei.

> > 
> > Wei.
> > 
> > > Ian.
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 21/21] xl: vNUMA support
  2015-01-29 11:10       ` Ian Campbell
@ 2015-01-29 17:46         ` Wei Liu
  2015-02-24 16:15           ` Dario Faggioli
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Liu @ 2015-01-29 17:46 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, JBeulich, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, ufimtseva

On Thu, Jan 29, 2015 at 11:10:39AM +0000, Ian Campbell wrote:
> On Wed, 2015-01-28 at 22:52 +0000, Wei Liu wrote:
> > > guests, is preballooning allowed there too?
> > 
> > I need to check PV boot sequence to have a definite answer.
> > 
> > Currently memory allocation in libxc only deals with a chunk of
> > contiguous memory. Not sure whether changing that will break some
> > assumptions that the guest kernel makes.
> 
> Please do check, and if it doesn't work today we really ought to have a
> plan on how to integrate it in the future, in case (as seems likely) it
> requires cooperation between tools and kernel -- so we can think about
> steps now to make it easier on ourselves then...
> 

I have only looked at the Linux kernel, so this is very Linux-centric --
though I wonder if there are any other PV kernels in the wild.

Libxc allocates a contiguous chunk of memory and then the guest kernel
remaps memory inside a non-RAM region. (This leads me to think I need to
rewrite the patch that allocates memory to also take memory holes into
account, but that is another matter.)

Speaking of pre-ballooned PV guests, in theory it should work if we still
allocate memory as one contiguous chunk. But something more complex, like
partially populating multiple vnodes, might not, because code in the Linux
kernel assumes that pre-ballooned pages are appended to the end of
populated memory.

> > > > +=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
> > > > +
> > > > +Specifiy which physical NUMA node a specific virtual NUMA node maps to. The
> > > 
> > > "Specify" again.
> > > 
> > > > +number of elements in this list should be equal to the number of virtual
> > > > +NUMA nodes defined in B<vnuma_memory=>.
> > > 
> > > Would it make sense to instead have a single array or e.g. "NODE:SIZE"
> > > or something?
> > > 
> > 
> > Or "PNODE:SIZE:VCPUS"?
> 
> That seems plausible.
> 
> One concern would be future expansion, perhaps foo=bar,baz=wibble?
> 

I'm fine with that. We can use a nested list:

vnuma = [ [node=0,size=1024,vcpus=...] [ ...] ]

> > > > +=item B<vnuma_vdistance=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
> > > > +
> > > > +Two dimensional list to specify distances among nodes.
> > > > +
> > > > +The number of elements in the first dimension list equals the number of virtual
> > > > +nodes. Each element in position B<i> is a list that specifies the distances
> > > > +from node B<i> to other nodes.
> > > > +
> > > > +For example, for a guest with 2 virtual nodes, user can specify:
> > > > +
> > > > +  vnuma_vdistance = [ [10, 20], [20, 10] ]
> > > 
> > > Any guidance on how a user should choose these numbers?
> > > 
> > 
> > I think using the number from numactl is good enough.
> 
> Worth mentioning in the docs I think.
> 
> > > Do we support a mode where something figures this out based on the
> > > underlying distances between the pnode to which a vnode is assigned?
> > > 
> > 
> > Dario is working on that.
> 
> I thought he was working on automatic numa placement (i.e. figuring out
> the best set of pnodes to map the vnodes to), whereas what I was
> suggesting was that given the user has specified a vnode->pnode mapping
> it should be possible to construct a distances table pretty trivially
> from that. Is Dario working on that too?
> 

Right. That's trivial inside libxl.

What I meant was that Dario was about to touch all this automation work,
so it might be trivial for him to just do it all in one go. Of course, if
he has not done that, I can add the logic myself.
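
As a standalone sketch of that trivial construction (the names, sizes and
sample distance values below are illustrative, not the libxl code): the
default virtual distance between two vnodes is simply the host distance
between the pnodes they are mapped to.

  #include <stdio.h>

  #define NR_VNODES 2
  #define NR_PNODES 2

  int main(void)
  {
      /* Host distances, e.g. as reported by numactl --hardware (sample). */
      unsigned pdist[NR_PNODES][NR_PNODES] = { { 10, 21 }, { 21, 10 } };
      /* User-supplied vnode -> pnode mapping (cf. vnuma_pnode_map). */
      unsigned vnode_to_pnode[NR_VNODES] = { 0, 1 };
      unsigned vdist[NR_VNODES][NR_VNODES];

      for (int i = 0; i < NR_VNODES; i++)
          for (int j = 0; j < NR_VNODES; j++)
              vdist[i][j] = pdist[vnode_to_pnode[i]][vnode_to_pnode[j]];

      for (int i = 0; i < NR_VNODES; i++) {
          for (int j = 0; j < NR_VNODES; j++)
              printf("%u ", vdist[i][j]);
          printf("\n");
      }
      return 0;
  }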

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 18/21] libxlu: rework internal representation of setting
  2015-01-23 11:13 ` [PATCH v4 18/21] libxlu: rework internal representation of setting Wei Liu
  2015-01-28 16:41   ` Ian Campbell
@ 2015-02-11 16:12   ` Ian Jackson
  2015-02-12 10:58     ` Wei Liu
  1 sibling, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2015-02-11 16:12 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, ian.jackson,
	xen-devel, JBeulich, ufimtseva

Wei Liu writes ("[PATCH v4 18/21] libxlu: rework internal representation of setting"):
> This patch does the following things:
...
> +void xlu__cfg_list_append(CfgParseContext *ctx,
> +                          XLU_ConfigValue *list,
> +                          char *atom)
> +{
> +    XLU_ConfigValue **new_val = NULL;
> +    XLU_ConfigValue *val = NULL;
>      if (ctx->err) return;
...
> -    if (set->nvalues >= set->avalues) {
> -        int new_avalues;
> -        char **new_values;
> -
> -        if (set->avalues > INT_MAX / 100) { ctx->err= ERANGE; return; }
> -        new_avalues= set->avalues * 4;
> -        new_values= realloc(set->values,
> -                            sizeof(*new_values) * new_avalues);
> -        if (!new_values) { ctx->err= errno; free(atom); return; }
> -        set->values= new_values;
> -        set->avalues= new_avalues;

This is a standard expanding-array pattern which arranges not to
realloc the array each time.

> -    }
> -    set->values[set->nvalues++]= atom;
> +    new_val = realloc(list->u.list.values,
> +                      sizeof(*new_val) * (list->u.list.nvalues+1));
> +    if (!new_val) goto xe;

But you replace it here with one which has quadratic performance.

I don't know whether people are going to specify lists with hundreds
or thousands of elements, but it doesn't seem impossible.

I think you should retain the existing avalues stuff.
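
As a standalone sketch of the pattern being described (names and the
growth factor mirror the snippet quoted above, but this is illustrative,
not the libxlu code itself): keeping a separate allocated-size counter and
growing it geometrically makes a run of n appends cost O(n) reallocation
work overall, instead of the O(n^2) of growing by one each time.

  #include <errno.h>
  #include <limits.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct list {
      char **values;
      int nvalues;   /* elements in use */
      int avalues;   /* elements allocated */
  };

  /* Amortised O(1) append: only realloc when the allocation is full. */
  static int list_append(struct list *l, char *atom)
  {
      if (l->nvalues >= l->avalues) {
          int new_avalues;
          char **new_values;

          if (l->avalues > INT_MAX / 4) return ERANGE;
          new_avalues = l->avalues ? l->avalues * 4 : 1;
          new_values = realloc(l->values, sizeof(*new_values) * new_avalues);
          if (!new_values) return errno;
          l->values = new_values;
          l->avalues = new_avalues;
      }
      l->values[l->nvalues++] = atom;
      return 0;
  }

  int main(void)
  {
      struct list l = { NULL, 0, 0 };
      for (int i = 0; i < 1000; i++)
          if (list_append(&l, "atom")) return 1;
      printf("%d values, %d slots\n", l.nvalues, l.avalues);
      free(l.values);
      return 0;
  }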

Apart from that this all looks good to me.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 19/21] libxlu: nested list support
  2015-01-23 11:13 ` [PATCH v4 19/21] libxlu: nested list support Wei Liu
@ 2015-02-11 16:13   ` Ian Jackson
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Jackson @ 2015-02-11 16:13 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, xen-devel,
	JBeulich, ufimtseva

Wei Liu writes ("[PATCH v4 19/21] libxlu: nested list support"):
> 1. Extend grammar of parser.
> 2. Adjust internal functional to accept XLU_ConfigValue instead of

                     ^functions

Otherwise,

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 20/21] libxlu: introduce new APIs
  2015-01-23 11:13 ` [PATCH v4 20/21] libxlu: introduce new APIs Wei Liu
@ 2015-02-11 16:17   ` Ian Jackson
  2015-02-12 10:57     ` Wei Liu
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2015-02-11 16:17 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, andrew.cooper3, dario.faggioli, xen-devel,
	JBeulich, ufimtseva

Wei Liu writes ("[PATCH v4 20/21] libxlu: introduce new APIs"):
> These APIs can be used to manipulate XLU_ConfigValue and XLU_ConfigList.
...
> +const char *xlu_cfg_value_get_string(const XLU_ConfigValue *value)
> +{
> +    assert(value->type == XLU_STRING);
> +    return value->u.string;
> +}

Most of the existing xlu_cfg_... functions return null (or -1), setting
EINVAL, if the type of the supplied config item is not correct.

But these new functions are not really suitable for use directly
because they crash on incorrect configuration input.

Wouldn't it be better if these functions had calling conventions
similar to xlu_cfg_get_string et al (returning errno values, taking
dont_warn, etc.) ?
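
A rough, self-contained sketch of such a convention (the stand-in types
and names below exist only for this illustration; the real XLU types live
in libxlu's headers and the final prototype may well differ):

  #include <errno.h>
  #include <stdio.h>

  enum sketch_type { SKETCH_STRING, SKETCH_LIST };

  typedef struct {
      enum sketch_type type;
      const char *string;          /* valid when type == SKETCH_STRING */
  } SketchValue;

  /* Return 0 on success or an errno value on type mismatch, and warn
   * unless dont_warn is set -- rather than asserting on bad input. */
  static int sketch_value_get_string(const SketchValue *value,
                                     const char **value_r, int dont_warn)
  {
      if (value->type != SKETCH_STRING) {
          if (!dont_warn)
              fprintf(stderr, "warning: config value is not a string\n");
          return EINVAL;
      }
      *value_r = value->string;
      return 0;
  }

  int main(void)
  {
      SketchValue v = { SKETCH_STRING, "example" };
      const char *s = NULL;
      int rc = sketch_value_get_string(&v, &s, 0);
      printf("rc=%d s=%s\n", rc, rc ? "(unset)" : s);
      return 0;
  }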

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 20/21] libxlu: introduce new APIs
  2015-02-11 16:17   ` Ian Jackson
@ 2015-02-12 10:57     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-02-12 10:57 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli, xen-devel,
	JBeulich, ufimtseva

On Wed, Feb 11, 2015 at 04:17:19PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v4 20/21] libxlu: introduce new APIs"):
> > These APIs can be used to manipulate XLU_ConfigValue and XLU_ConfigList.
> ...
> > +const char *xlu_cfg_value_get_string(const XLU_ConfigValue *value)
> > +{
> > +    assert(value->type == XLU_STRING);
> > +    return value->u.string;
> > +}
> 
> Most of the existing xlu_cfg_... functions return null (or -1) setting
> EINVAL if the type of the supplied config item is not correct.
> 
> But these new functions are not really suitable for use directly
> because they crash on incorrect configuration input.
> 
> Wouldn't it be better if these functions had calling conventions
> similar to xlu_cfg_get_string et al (returning errno values, taking
> dont_warn, etc.) ?
> 

Right. I will use the same convention.

Wei.

> Thanks,
> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 18/21] libxlu: rework internal representation of setting
  2015-02-11 16:12   ` Ian Jackson
@ 2015-02-12 10:58     ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2015-02-12 10:58 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Wei Liu, ian.campbell, andrew.cooper3, dario.faggioli, xen-devel,
	JBeulich, ufimtseva

On Wed, Feb 11, 2015 at 04:12:24PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v4 18/21] libxlu: rework internal representation of setting"):
> > This patch does the following things:
> ...
> > +void xlu__cfg_list_append(CfgParseContext *ctx,
> > +                          XLU_ConfigValue *list,
> > +                          char *atom)
> > +{
> > +    XLU_ConfigValue **new_val = NULL;
> > +    XLU_ConfigValue *val = NULL;
> >      if (ctx->err) return;
> ...
> > -    if (set->nvalues >= set->avalues) {
> > -        int new_avalues;
> > -        char **new_values;
> > -
> > -        if (set->avalues > INT_MAX / 100) { ctx->err= ERANGE; return; }
> > -        new_avalues= set->avalues * 4;
> > -        new_values= realloc(set->values,
> > -                            sizeof(*new_values) * new_avalues);
> > -        if (!new_values) { ctx->err= errno; free(atom); return; }
> > -        set->values= new_values;
> > -        set->avalues= new_avalues;
> 
> This is a standard expanding-array pattern which arranges not to
> realloc the array each time.
> 
> > -    }
> > -    set->values[set->nvalues++]= atom;
> > +    new_val = realloc(list->u.list.values,
> > +                      sizeof(*new_val) * (list->u.list.nvalues+1));
> > +    if (!new_val) goto xe;
> 
> But you replace it here with one which has quadratic performance.
> 
> I don't know whether people are going to specify lists with hundreds
> or thousands of elements, but it doesn't seem impossible.
> 
> I think you should retain the existing avalues stuff.
> 

No problem.

> Apart from that this all looks good to me.

Thanks.

Wei.

> 
> Thanks,
> Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 21/21] xl: vNUMA support
  2015-01-29 17:46         ` Wei Liu
@ 2015-02-24 16:15           ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2015-02-24 16:15 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Andrew Cooper, xen-devel, JBeulich, ufimtseva, Ian Jackson


On Thu, 2015-01-29 at 17:46 +0000, Wei Liu wrote:
> On Thu, Jan 29, 2015 at 11:10:39AM +0000, Ian Campbell wrote:
> > On Wed, 2015-01-28 at 22:52 +0000, Wei Liu wrote:

Hey,

While reviewing v5, I realized I had always meant to reply to this email,
but never did! :-(

Very quickly...

> > > > > +For example, for a guest with 2 virtual nodes, user can specify:
> > > > > +
> > > > > +  vnuma_vdistance = [ [10, 20], [20, 10] ]
> > > > 
> > > > Any guidance on how a user should choose these numbers?
> > > > 
> > > 
> > > I think using the number from numactl is good enough.
> > 
> > Worth mentioning in the docs I think.
> > 
> > > > Do we support a mode where something figures this out based on the
> > > > underlying distances between the pnode to which a vnode is assigned?
> > > > 
> > > 
> > > Dario is working on that.
> > 
> > I thought he was working on automatic numa placement (i.e. figuring out
> > the best set of pnodes to map the vnodes to), whereas what I was
> > suggesting was that given the user has specified a vnode->pnode mapping
> > it should be possible to construct a distances table pretty trivially
> > from that. Is Dario working on that too?
> > 
> 
> Right. That's trivial inside libxl.
> 
> What I meant was that Dario was about to touch all this automation work,
> so it might be trivial for him to just do it all in one go. Of course, if
> he has not done that, I can add the logic myself.
>
Yes, Ian, that is a valid way of coming up with a default set of
distance values.

And as Wei says, our agreement is that he would make things work when
all the parameters are explicitly specified. Then, I'll take care of the
automation required to deal with situations where some parameters are
missing.

That includes triggering automatic placement in a way that properly
accounts for vNUMA, and other things, among which are default values for
the distances, if nothing is specified.

Regards,
Dario

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2015-02-24 16:15 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-23 11:13 [PATCH v4 00/21] Virtual NUMA for PV and HVM Wei Liu
2015-01-23 11:13 ` [PATCH v4 01/21] xen: dump vNUMA information with debug key "u" Wei Liu
2015-01-23 13:03   ` Jan Beulich
2015-01-23 13:23     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 02/21] xen: make two memory hypercalls vNUMA-aware Wei Liu
2015-01-23 13:16   ` Jan Beulich
2015-01-23 14:46     ` Wei Liu
2015-01-23 15:37       ` Jan Beulich
2015-01-23 15:43         ` Wei Liu
2015-01-23 16:06           ` Wei Liu
2015-01-23 16:17             ` Jan Beulich
2015-01-23 11:13 ` [PATCH v4 03/21] libxc: allocate memory with vNUMA information for PV guest Wei Liu
2015-01-28 16:02   ` Ian Campbell
2015-01-23 11:13 ` [PATCH v4 04/21] libxl: introduce vNUMA types Wei Liu
2015-01-28 16:04   ` Ian Campbell
2015-01-28 21:51     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 05/21] libxl: add vmemrange to libxl__domain_build_state Wei Liu
2015-01-28 16:05   ` Ian Campbell
2015-01-23 11:13 ` [PATCH v4 06/21] libxl: introduce libxl__vnuma_config_check Wei Liu
2015-01-28 16:13   ` Ian Campbell
2015-01-28 21:51     ` Wei Liu
2015-01-29 11:04       ` Ian Campbell
2015-01-29 16:01         ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 07/21] libxl: x86: factor out e820_host_sanitize Wei Liu
2015-01-28 16:14   ` Ian Campbell
2015-01-23 11:13 ` [PATCH v4 08/21] libxl: functions to build vmemranges for PV guest Wei Liu
2015-01-28 16:27   ` Ian Campbell
2015-01-28 21:59     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 09/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
2015-01-28 16:29   ` Ian Campbell
2015-01-23 11:13 ` [PATCH v4 10/21] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
2015-01-23 11:13 ` [PATCH v4 11/21] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
2015-01-23 13:27   ` Jan Beulich
2015-01-23 14:17     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 12/21] hvmloader: construct SRAT Wei Liu
2015-01-23 11:13 ` [PATCH v4 13/21] hvmloader: construct SLIT Wei Liu
2015-01-23 11:13 ` [PATCH v4 14/21] libxc: indentation change to xc_hvm_build_x86.c Wei Liu
2015-01-28 16:30   ` Ian Campbell
2015-01-23 11:13 ` [PATCH v4 15/21] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
2015-01-28 16:36   ` Ian Campbell
2015-01-28 22:07     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 16/21] libxl: build, check and pass vNUMA info to Xen " Wei Liu
2015-01-28 16:41   ` Ian Campbell
2015-01-28 22:14     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 17/21] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
2015-01-28 16:41   ` Ian Campbell
2015-01-28 22:22     ` Wei Liu
2015-01-29 11:06       ` Ian Campbell
2015-01-29 16:04         ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 18/21] libxlu: rework internal representation of setting Wei Liu
2015-01-28 16:41   ` Ian Campbell
2015-02-11 16:12   ` Ian Jackson
2015-02-12 10:58     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 19/21] libxlu: nested list support Wei Liu
2015-02-11 16:13   ` Ian Jackson
2015-01-23 11:13 ` [PATCH v4 20/21] libxlu: introduce new APIs Wei Liu
2015-02-11 16:17   ` Ian Jackson
2015-02-12 10:57     ` Wei Liu
2015-01-23 11:13 ` [PATCH v4 21/21] xl: vNUMA support Wei Liu
2015-01-28 16:46   ` Ian Campbell
2015-01-28 22:52     ` Wei Liu
2015-01-29 11:10       ` Ian Campbell
2015-01-29 17:46         ` Wei Liu
2015-02-24 16:15           ` Dario Faggioli
