* API to query NUMA node of mfn
@ 2017-07-10 10:10 Olaf Hering
  2017-07-10 10:41 ` Jan Beulich
  0 siblings, 1 reply; 5+ messages in thread
From: Olaf Hering @ 2017-07-10 10:10 UTC (permalink / raw)
  To: xen-devel



I would like to verify on which NUMA node the PFNs used by a HVM guest
are located. Is there an API for that? Something like:

  foreach (pfn, domid)
    mfns_per_node[pfn_to_node(pfn)]++
  foreach (node)
    printk("%x %x\n", node, mfns_per_node[node])

Olaf



* Re: API to query NUMA node of mfn
  2017-07-10 10:10 API to query NUMA node of mfn Olaf Hering
@ 2017-07-10 10:41 ` Jan Beulich
  2017-07-10 13:13   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2017-07-10 10:41 UTC (permalink / raw)
  To: Olaf Hering; +Cc: xen-devel

>>> On 10.07.17 at 12:10, <olaf@aepfle.de> wrote:
> I would like to verify on which NUMA node the PFNs used by a HVM guest
> are located. Is there an API for that? Something like:
> 
>   foreach (pfn, domid)
>     mfns_per_node[pfn_to_node(pfn)]++
>   foreach (node)
>     printk("%x %x\n", node, mfns_per_node[node])

phys_to_nid() ?
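
For reference, a minimal hypervisor-side sketch of that idea
(untested; it assumes Xen's page_list_for_each iterator and the
page_to_maddr helper):

  unsigned int node, mfns_per_node[MAX_NUMNODES] = { 0 };
  struct page_info *page;

  spin_lock(&d->page_alloc_lock);
  page_list_for_each ( page, &d->page_list )
      mfns_per_node[phys_to_nid(page_to_maddr(page))]++;
  spin_unlock(&d->page_alloc_lock);

  for ( node = 0; node < MAX_NUMNODES; node++ )
      printk("%x %x\n", node, mfns_per_node[node]);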

Jan



* Re: API to query NUMA node of mfn
  2017-07-10 10:41 ` Jan Beulich
@ 2017-07-10 13:13   ` Konrad Rzeszutek Wilk
  2017-07-10 13:35     ` Olaf Hering
  0 siblings, 1 reply; 5+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-10 13:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Olaf Hering, xen-devel


On Mon, Jul 10, 2017 at 04:41:35AM -0600, Jan Beulich wrote:
> >>> On 10.07.17 at 12:10, <olaf@aepfle.de> wrote:
> > I would like to verify on which NUMA node the PFNs used by a HVM guest
> > are located. Is there an API for that? Something like:
> > 
> >   foreach (pfn, domid)
> >     mfns_per_node[pfn_to_node(pfn)]++
> >   foreach (node)
> >     printk("%x %x\n", node, mfns_per_node[node])
> 
> phys_to_nid() ?

So I wrote some code for exactly this for Xen 4.4.4, along with
creation of a PGM map to visualize the NUMA node locality.

Attaching them here.

[-- Attachment #2: 0001-xen-x86-XENDOMCTL_get_memlist-Make-it-work.patch --]
[-- Type: text/plain, Size: 9407 bytes --]

From a5e039801c989df29b704a4a5256715321906535 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 6 Jun 2017 20:31:21 -0400
Subject: [PATCH 1/7] xen/x86: XENDOMCTL_get_memlist: Make it work

This hypercall has a bunch of problems, which this patch fixes.

Specifically, it is not preemption capable, takes a nested lock,
and the data is stale by the time you get it.

The nested lock (and lock order inversion) is due to the
copy_to_guest_offset call. The particular implementation
(see __hvm_copy) makes P2M calls (p2m_mem_paging_populate), which
take the p2m_lock.

We avoid this by taking the p2m lock early (before page_alloc_lock)
in:

if ( !guest_handle_okay(domctl->u.getmemlist.buffer, max_pfns) )

(this takes the p2m lock and then unlocks). And since the buffer
checks out, we can use the fast variant of copy_to_guest
(which still takes the p2m lock).

We extend this thinking to the copying of the values to the guest.
The loop that copies the mfns[] to the buffer (potentially) takes
a p2m lock on every iteration. So, to avoid holding the
page_alloc_lock across those copies, we fill a temporary array
(mfns) while holding page_alloc_lock, and hold no locks (well, we
hold the domctl lock) while copying to the guest.

Preemption is now supported, and we also honor 'start_pfn', which
is renamed to 'index' - as there is no enforced order in which
the pages correspond to PFNs.

All of these are fixed by this patch. It also means that callers
of xc_get_pfn_list have to take into account that max_pfns may
differ from num_pfns and loop around.

See the patches: "libxc: Use XENDOMCTL_get_memlist properly"
and "xen-mceinj: Loop around xc_get_pfn_list"

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/domctl.c       | 76 ++++++++++++++++++++++++++++++---------------
 xen/arch/x86/mm/hap/hap.c   |  1 +
 xen/arch/x86/mm/p2m-ept.c   |  2 ++
 xen/include/asm-x86/p2m.h   |  2 ++
 xen/include/public/domctl.h | 36 ++++++++++++++++-----
 5 files changed, 84 insertions(+), 33 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index bebe1fb..3af6b39 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -325,57 +325,83 @@ long arch_do_domctl(
 
     case XEN_DOMCTL_getmemlist:
     {
-        int i;
+#define XEN_DOMCTL_getmemlist_max_pfns (GB(1) / PAGE_SIZE)
+        unsigned int i = 0, idx = 0;
         unsigned long max_pfns = domctl->u.getmemlist.max_pfns;
+        unsigned long index = domctl->u.getmemlist.index;
         uint64_t mfn;
         struct page_info *page;
+        uint64_t *mfns;
 
         if ( unlikely(d->is_dying) ) {
             ret = -EINVAL;
             break;
         }
+        /* XSA-74: This sub-hypercall is fixed. */
 
-        /*
-         * XSA-74: This sub-hypercall is broken in several ways:
-         * - lock order inversion (p2m locks inside page_alloc_lock)
-         * - no preemption on huge max_pfns input
-         * - not (re-)checking d->is_dying with page_alloc_lock held
-         * - not honoring start_pfn input (which libxc also doesn't set)
-         * Additionally it is rather useless, as the result is stale by the
-         * time the caller gets to look at it.
-         * As it only has a single, non-production consumer (xen-mceinj),
-         * rather than trying to fix it we restrict it for the time being.
-         */
-        if ( /* No nested locks inside copy_to_guest_offset(). */
-             paging_mode_external(current->domain) ||
-             /* Arbitrary limit capping processing time. */
-             max_pfns > GB(4) / PAGE_SIZE )
+        ret = -E2BIG;
+        if ( max_pfns > XEN_DOMCTL_getmemlist_max_pfns )
+            max_pfns = XEN_DOMCTL_getmemlist_max_pfns;
+
+        /* Report the max number we are OK with. */
+        if ( !max_pfns && guest_handle_is_null(domctl->u.getmemlist.buffer) )
         {
-            ret = -EOPNOTSUPP;
+            domctl->u.getmemlist.max_pfns = XEN_DOMCTL_getmemlist_max_pfns;
+            copyback = 1;
             break;
         }
 
-        spin_lock(&d->page_alloc_lock);
+        ret = -EINVAL;
+        if ( !guest_handle_okay(domctl->u.getmemlist.buffer, max_pfns) )
+            break;
+
+        mfns = xmalloc_array(uint64_t, max_pfns);
+        if ( !mfns )
+        {
+            ret = -ENOMEM;
+            break;
+        }
 
-        ret = i = 0;
+        ret = -EINVAL;
+        spin_lock(&d->page_alloc_lock);
         page_list_for_each(page, &d->page_list)
         {
-            if ( i >= max_pfns )
+            if ( idx >= max_pfns )
                 break;
+
+            if ( index > i++ )
+                continue;
+
+            if ( idx && !(idx & 0xFF) && hypercall_preempt_check() )
+                break;
+
             mfn = page_to_mfn(page);
-            if ( copy_to_guest_offset(domctl->u.getmemlist.buffer,
-                                      i, &mfn, 1) )
+            mfns[idx++] = mfn;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        ret = 0;
+        for ( i = 0; i < idx; i++ )
+        {
+
+            if ( __copy_to_guest_offset(domctl->u.getmemlist.buffer,
+                                        i, &mfns[i], 1) )
             {
                 ret = -EFAULT;
                 break;
             }
-			++i;
 		}
 
-        spin_unlock(&d->page_alloc_lock);
-
         domctl->u.getmemlist.num_pfns = i;
+        /*
+         * A poor man's way of keeping track of P2M changes. If the
+         * P2M is changed the version will change as well and the
+         * caller can redo its list.
+         */
+        domctl->u.getmemlist.version = p2m_get_hostp2m(d)->version;
+
         copyback = 1;
+        xfree(mfns);
     }
     break;
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index ccc4174..0406c2a 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -709,6 +709,7 @@ hap_write_p2m_entry(struct vcpu *v, unsigned long gfn, l1_pgentry_t *p,
     if ( old_flags & _PAGE_PRESENT )
         flush_tlb_mask(d->domain_dirty_cpumask);
 
+    p2m_get_hostp2m(d)->version++;
     paging_unlock(d);
 
     if ( flush_nestedp2m )
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 72b3d0a..7da5b06 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -674,6 +674,8 @@ void ept_sync_domain(struct p2m_domain *p2m)
 {
     struct domain *d = p2m->domain;
     struct ept_data *ept = &p2m->ept;
+
+    p2m->version++;
     /* Only if using EPT and this domain has some VCPUs to dirty. */
     if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
         return;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index fcb50b1..b0549e8 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -293,6 +293,8 @@ struct p2m_domain {
         struct ept_data ept;
         /* NPT-equivalent structure could be added here. */
     };
+    /* OVM: Every update to P2M increases this version. */
+    unsigned long version;
 };
 
 /* get host p2m table */
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 27f5001..2a25079 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -118,16 +118,36 @@ typedef struct xen_domctl_getdomaininfo xen_domctl_getdomaininfo_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getdomaininfo_t);
 
 
-/* XEN_DOMCTL_getmemlist */
+/*
+ * XEN_DOMCTL_getmemlist
+ * Retrieve an array of MFNs of the guest.
+ *
+ * If the hypercall returns zero, it has copied 'num_pfns' (up to
+ * 'max_pfns') of the guest's MFNs into 'buffer' and updated
+ * 'version' (which may be the same across hypercalls; if it varies,
+ * the data is stale and it is recommended that the caller restart
+ * with 'index' set to zero).
+ *
+ * If 'max_pfns' is zero and 'buffer' is NULL, the hypercall returns
+ * -E2BIG and updates 'max_pfns' with the recommended value to use.
+ *
+ * Note that due to the asynchronous nature of hypercalls the domain
+ * might have added or removed MFNs, making this information stale.
+ * It is the responsibility of the toolstack to use the 'version'
+ * field to check between invocations: if the version differs it
+ * should discard the stale data and start from scratch. It is fine
+ * for the toolstack to keep using the new 'version' field.
+ */
 struct xen_domctl_getmemlist {
-    /* IN variables. */
-    /* Max entries to write to output buffer. */
+    /* IN/OUT: Max entries to write to output buffer. If max_pfns is zero and
+     * buffer is NULL, this returns the recommended max size of the buffer. */
     uint64_aligned_t max_pfns;
-    /* Start index in guest's page list. */
-    uint64_aligned_t start_pfn;
-    XEN_GUEST_HANDLE_64(uint64) buffer;
-    /* OUT variables. */
-    uint64_aligned_t num_pfns;
+    uint64_aligned_t index;     /* IN: Start index in guest's page list. */
+    XEN_GUEST_HANDLE_64(uint64) buffer; /* IN: If NULL with max_pfns == 0, then
+                                         * max_pfns gets the recommended value. */
+    uint64_aligned_t version;   /* IN/OUT: If value differs, prior calls may
+                                 * have stale data. */
+    uint64_aligned_t num_pfns;  /* OUT: Number (up to max_pfns) copied. */
 };
 typedef struct xen_domctl_getmemlist xen_domctl_getmemlist_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getmemlist_t);
-- 
2.9.4


[-- Attachment #3: 0002-libxc-libxc-Use-XENDOMCTL_get_memlist-properly.patch --]
[-- Type: text/plain, Size: 5999 bytes --]

From d2edab820ee1bf4c354836e33c427602963986ba Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 9 Jun 2017 12:00:24 -0400
Subject: [PATCH 2/7] libxc: Use XENDOMCTL_get_memlist properly

With the hypervisor side now working well, we take advantage of
that. The process has changed, as we first need to figure out the
upper limit of PFNs the hypervisor is willing to provide.

Fortunately for us, if we provide max_pfns with a zero value and a
NULL buffer, we get back in max_pfns the acceptable maximum.

With this information we can make the hypercall - however the
amount we get may be smaller than the maximum, hence we adjust the
number and also let the caller know. Furthermore, the 'version' is
provided as a way for the caller to restart from scratch if that
version differs from its previous call.

Also we modify the tools so that they compile, but not necessarily
work well (see "xen-mceinj: Loop around xc_get_pfn_list").
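
Seen from the caller's side, the loop then looks roughly like this
(a sketch modeled on the xen-mceinj change; buf, max and max_gpfn
are the caller's):

    unsigned long v = 0, old_v = 0, start = 0;
    int n;

    do {
        n = xc_get_pfn_list(xch, domid, buf, start, max, &v);
        if ( old_v != v ) {
            start = 0;          /* P2M changed - data is stale  */
            old_v = v;
            continue;
        }
        if ( n > 0 )
            start += n;         /* may be fewer than max - loop */
    } while ( n > 0 && start < max_gpfn );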

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/libxc/xc_private.c                | 50 +++++++++++++++++++++++++--------
 tools/libxc/xenctrl.h                   |  2 +-
 tools/ocaml/libs/xc/xenctrl_stubs.c     |  3 +-
 tools/tests/mce-test/tools/xen-mceinj.c |  3 +-
 4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 52f53d7..3ecee6a 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -581,33 +581,59 @@ int xc_machphys_mfn_list(xc_interface *xch,
 
 int xc_get_pfn_list(xc_interface *xch,
                     uint32_t domid,
-                    uint64_t *pfn_buf,
-                    unsigned long max_pfns)
+                    uint64_t *buf,
+                    unsigned long index,
+                    unsigned long max_pfns,
+                    unsigned long *version)
 {
     DECLARE_DOMCTL;
-    DECLARE_HYPERCALL_BOUNCE(pfn_buf, max_pfns * sizeof(*pfn_buf), XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+    DECLARE_HYPERCALL_BUFFER(uint64_t, pfn_buf);
     int ret;
+    unsigned long nr_pfns;
 
-#ifdef VALGRIND
-    memset(pfn_buf, 0, max_pfns * sizeof(*pfn_buf));
-#endif
+    domctl.cmd = XEN_DOMCTL_getmemlist;
+    domctl.domain = (domid_t)domid;
+    domctl.u.getmemlist.max_pfns = 0;
+    domctl.u.getmemlist.index = 0;
+
+    /* The buffer handle is NULL for this probe call. */
+    set_xen_guest_handle(domctl.u.getmemlist.buffer, pfn_buf);
+
+    ret = do_domctl(xch, &domctl);
+    if ( ret && errno != E2BIG )
+        return ret;
 
-    if ( xc_hypercall_bounce_pre(xch, pfn_buf) )
+    if ( !domctl.u.getmemlist.max_pfns )
     {
-        PERROR("xc_get_pfn_list: pfn_buf bounce failed");
+        errno = ENXIO;
         return -1;
     }
+    if ( max_pfns > domctl.u.getmemlist.max_pfns )
+        max_pfns = domctl.u.getmemlist.max_pfns;
 
-    domctl.cmd = XEN_DOMCTL_getmemlist;
-    domctl.domain   = (domid_t)domid;
+    domctl.u.getmemlist.index = index;
     domctl.u.getmemlist.max_pfns = max_pfns;
+
+    pfn_buf = xc_hypercall_buffer_alloc(xch, pfn_buf, max_pfns * sizeof(uint64_t));
+    if ( !pfn_buf ) {
+        errno = ENOMEM;
+        return -1;
+    }
+
     set_xen_guest_handle(domctl.u.getmemlist.buffer, pfn_buf);
 
     ret = do_domctl(xch, &domctl);
 
-    xc_hypercall_bounce_post(xch, pfn_buf);
+    nr_pfns = domctl.u.getmemlist.num_pfns;
+
+    if ( !ret ) {
+        memcpy(buf, pfn_buf, nr_pfns * sizeof(*buf));
+        *version = domctl.u.getmemlist.version;
+    }
+
+    xc_hypercall_buffer_free(xch, pfn_buf);
 
-    return (ret < 0) ? -1 : domctl.u.getmemlist.num_pfns;
+    return (ret < 0) ? -1 : nr_pfns;
 }
 
 long xc_get_tot_pages(xc_interface *xch, uint32_t domid)
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 5f015b2..5802d69 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -1443,7 +1443,7 @@ unsigned long xc_translate_foreign_address(xc_interface *xch, uint32_t dom,
  * without a backing MFN.
  */
 int xc_get_pfn_list(xc_interface *xch, uint32_t domid, uint64_t *pfn_buf,
-                    unsigned long max_pfns);
+                    unsigned long start_pfn, unsigned long max_pfns, unsigned long *version);
 
 int xc_copy_to_domain_page(xc_interface *xch, uint32_t domid,
                            unsigned long dst_pfn, const char *src_page);
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c
index 5ed0008..b1f1dff 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -1055,6 +1055,7 @@ CAMLprim value stub_xc_domain_get_pfn_list(value xch, value domid,
 	CAMLparam3(xch, domid, nr_pfns);
 	CAMLlocal2(array, v);
 	unsigned long c_nr_pfns;
+	unsigned long version;
 	long ret, i;
 	uint64_t *c_array;
 
@@ -1065,7 +1066,7 @@ CAMLprim value stub_xc_domain_get_pfn_list(value xch, value domid,
 		caml_raise_out_of_memory();
 
 	ret = xc_get_pfn_list(_H(xch), _D(domid),
-			      c_array, c_nr_pfns);
+			      c_array, 0, c_nr_pfns, &version);
 	if (ret < 0) {
 		free(c_array);
 		failwith_xc(_H(xch));
diff --git a/tools/tests/mce-test/tools/xen-mceinj.c b/tools/tests/mce-test/tools/xen-mceinj.c
index 21a488b..22b4401 100644
--- a/tools/tests/mce-test/tools/xen-mceinj.c
+++ b/tools/tests/mce-test/tools/xen-mceinj.c
@@ -263,6 +263,7 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
     unsigned long max_mfn = 0; /* max mfn of the whole machine */
     unsigned long m2p_mfn0;
     unsigned int guest_width;
+    unsigned long version;
     long max_gpfn,i;
     uint64_t mfn = MCE_INVALID_MFN;
 
@@ -289,7 +290,7 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
         err(xc_handle, "Failed to alloc pfn buf\n");
     memset(pfn_buf, 0, sizeof(uint64_t) * max_gpfn);
 
-    ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, max_gpfn);
+    ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, 0, max_gpfn, &version);
     if ( ret < 0 ) {
         free(pfn_buf);
         err(xc_handle, "Failed to get pfn list %x\n", ret);
-- 
2.9.4


[-- Attachment #4: 0003-xen-mceinj-Loop-around-xc_get_pfn_list.patch --]
[-- Type: text/plain, Size: 3799 bytes --]

From 483e55bc13c6ab246251fcd62c59adc7bf169c52 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 13 Jun 2017 10:56:52 -0400
Subject: [PATCH 3/7] xen-mceinj: Loop around xc_get_pfn_list

Now that xc_get_pfn_list can return fewer entries than requested
(max != num_pfns), we need to take that into account and loop
around.

While at it, fix the code:
 - Move the memset into the loop.
 - Change the loop conditions to exit once the mfn has been found.
 - Explain why 262144 is used (see the note below).
 - Add munmap if we fail to allocate the buffer.
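
(262144 is GB(1) / PAGE_SIZE, i.e. 2^30 / 2^12 = 2^18 pages - one
gigabyte worth of MFNs, matching the per-call cap the hypervisor
now enforces.)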

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/tests/mce-test/tools/xen-mceinj.c | 67 ++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/tools/tests/mce-test/tools/xen-mceinj.c b/tools/tests/mce-test/tools/xen-mceinj.c
index 22b4401..9c90235 100644
--- a/tools/tests/mce-test/tools/xen-mceinj.c
+++ b/tools/tests/mce-test/tools/xen-mceinj.c
@@ -264,7 +264,7 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
     unsigned long m2p_mfn0;
     unsigned int guest_width;
     unsigned long version;
-    long max_gpfn,i;
+    unsigned long max_gpfn, i, start_pfn, max, old_v;
     uint64_t mfn = MCE_INVALID_MFN;
 
     if ( domain > DOMID_FIRST_RESERVED )
@@ -284,40 +284,63 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
                             &pt_levels, &guest_width) )
         err(xc_handle, "Failed to get platform information\n");
 
-    /* Get guest's pfn list */
-    pfn_buf = malloc(sizeof(uint64_t) * max_gpfn);
-    if ( !pfn_buf )
-        err(xc_handle, "Failed to alloc pfn buf\n");
-    memset(pfn_buf, 0, sizeof(uint64_t) * max_gpfn);
-
-    ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, 0, max_gpfn, &version);
-    if ( ret < 0 ) {
-        free(pfn_buf);
-        err(xc_handle, "Failed to get pfn list %x\n", ret);
-    }
+    max = 262144; /* 1GB of MFNs: GB(1) / PAGE_SIZE. */
 
     /* Now get the m2p table */
     live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0);
     if ( !live_m2p )
         err(xc_handle, "Failed to map live M2P table\n");
 
-    /* match the mapping */
-    for ( i = 0; i < max_gpfn; i++ )
+
+    /* Get guest's pfn list */
+    pfn_buf = malloc(sizeof(uint64_t) * max);
+    if ( !pfn_buf )
     {
-        uint64_t tmp;
-        tmp = pfn_buf[i];
+        munmap(live_m2p, M2P_SIZE(max_mfn));
+        err(xc_handle, "Failed to alloc pfn buf\n");
+    }
 
-        if (mfn_valid(tmp) &&  (mfn_to_pfn(tmp) == gpfn))
-        {
-            mfn = tmp;
-            Lprintf("We get the mfn 0x%lx for this injection\n", mfn);
+    start_pfn = 0;
+    old_v = version = 0;
+    do {
+        memset(pfn_buf, 0, sizeof(uint64_t) * max);
+        ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, start_pfn, max, &version);
+        if ( old_v != version ) {
+            Lprintf("P2M changed, refetching.\n");
+            start_pfn = 0;
+            old_v = version;
+            continue;
+        }
+
+        if ( ret < 0 )
             break;
+
+        Lprintf("%ld/%ld .. \n", start_pfn, max_gpfn);
+
+        if ( max != ret )
+            max = ret; /* Update it for the next iteration. */
+
+        start_pfn += ret;
+
+        for ( i = 0; i < ret; i++ )
+        {
+            uint64_t tmp;
+            tmp = pfn_buf[i];
+
+            if (mfn_valid(tmp) && (mfn_to_pfn(tmp) == gpfn))
+            {
+                mfn = tmp;
+                Lprintf("We get the mfn 0x%lx for this injection\n", mfn);
+                break;
+            }
         }
-    }
+    } while ( start_pfn < max_gpfn && (mfn == MCE_INVALID_MFN) );
 
     munmap(live_m2p, M2P_SIZE(max_mfn));
-
     free(pfn_buf);
+    if ( ret < 0 )
+        err(xc_handle, "Failed to get pfn list %x\n", ret);
+
     return mfn;
 }
 
-- 
2.9.4


[-- Attachment #5: 0004-x86-domctl-Add-XEN_DOMCTL_get_numa_ranges.patch --]
[-- Type: text/plain, Size: 5711 bytes --]

From c6bf254298d68638ca8652825a28fa97cee51ff4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 6 Jun 2017 13:04:22 -0400
Subject: [PATCH 4/7] x86/domctl: Add XEN_DOMCTL_get_numa_ranges

This is a fairly simple hypercall - it allows us to return the
list (and ranges) of NUMA nodes. Like:

NODE0 0 -> 100000
NODE1 100000 -> 230000

This is different from XEN_SYSCTL_numainfo, which returns the size
of the NODEs and their distances - but not the ranges.

Alternatively this functionality could be folded into
XEN_SYSCTL_numainfo.

Also, if 'nodes' is set to zero, the hypercall returns the number
of nodes so that the caller can right away allocate a buffer of
the right size.
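
The intended use is a two-call pattern, sketched below (error
handling and the hypercall-buffer allocation are elided):

    domctl.cmd = XEN_DOMCTL_get_numa_ranges;
    domctl.u.get_numa_ranges.nodes = 0;   /* probe */
    do_domctl(xch, &domctl);              /* -E2BIG; 'nodes' now set */

    /* Allocate 'nodes' * sizeof(xen_vmemrange_t) and then: */
    set_xen_guest_handle(domctl.u.get_numa_ranges.ranges, ranges);
    do_domctl(xch, &domctl);              /* fills start/end/nid     */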

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/flask/policy/policy/modules/xen/xen.te |  1 +
 xen/arch/x86/domctl.c                        | 50 ++++++++++++++++++++++++++++
 xen/include/public/domctl.h                  | 17 ++++++++++
 xen/xsm/flask/hooks.c                        |  3 ++
 xen/xsm/flask/policy/access_vectors          |  2 ++
 5 files changed, 73 insertions(+)

diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
index d4974ae..5fe5381 100644
--- a/tools/flask/policy/policy/modules/xen/xen.te
+++ b/tools/flask/policy/policy/modules/xen/xen.te
@@ -68,6 +68,7 @@ allow dom0_t xen_t:xen {
 allow dom0_t xen_t:xen2 {
     livepatch_op
     module_op
+    get_numa_ranges
 };
 
 # Allow dom0 to use all XENVER_ subops that have checks.
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 3af6b39..4eaf510 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -323,6 +323,56 @@ long arch_do_domctl(
     }
     break;
 
+    case XEN_DOMCTL_get_numa_ranges:
+    {
+        unsigned int nr = domctl->u.get_numa_ranges.nodes;
+        unsigned int i = 0, idx = 0;
+
+        if ( !nr )
+        {
+            ret = -E2BIG;
+            domctl->u.get_numa_ranges.nodes = num_online_nodes();
+            copyback = 1;
+            break;
+        }
+
+        if ( nr && !guest_handle_okay(domctl->u.get_numa_ranges.ranges, nr) )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
+        ret = 0;
+        for_each_online_node ( i )
+        {
+            struct xen_vmemrange range;
+
+            range.start = node_start_pfn(i);
+            range.end = node_end_pfn(i);
+            range.nid = i;
+            range.flags = 0;
+            if ( __copy_to_guest_offset(domctl->u.get_numa_ranges.ranges, idx, &range, 1) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            idx++;
+            if ( idx >= nr )
+                break;
+        }
+        if ( idx )
+        {
+            copyback = 1;
+            /*
+             * idx is the number of entries actually copied out;
+             * num_online_nodes() is at least 1, so idx >= 1 here.
+             */
+            domctl->u.get_numa_ranges.nodes = idx;
+        }
+    }
+    break;
+
     case XEN_DOMCTL_getmemlist:
     {
 #define XEN_DOMCTL_getmemlist_max_pfns (GB(1) / PAGE_SIZE)
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 2a25079..a32b662 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -955,6 +955,21 @@ struct xen_domctl_smt {
 typedef struct xen_domctl_smt xen_domctl_smt;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_smt);
 
+/*
+ * Returns the NUMA node ranges in 'ranges'. If 'ranges' is NULL
+ * and 'nodes' is zero, then we provide the number of
+ * nodes in 'nodes'.
+ *
+ * Negative values on errors. Zero on success.
+ */
+struct xen_domctl_numa {
+    uint32_t nodes; /* IN: How many nodes are requested.
+                     * OUT: How many entries were copied. */
+    XEN_GUEST_HANDLE_64(xen_vmemrange_t) ranges;
+};
+typedef struct xen_domctl_numa xen_domctl_numa_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_numa_t);
+
 struct xen_domctl {
     uint32_t cmd;
 #define XEN_DOMCTL_createdomain                   1
@@ -1035,6 +1050,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_hide_device                 2001
 #define XEN_DOMCTL_unhide_device               2002
 #define XEN_DOMCTL_setsmt                      2003
+#define XEN_DOMCTL_get_numa_ranges             2004
     uint32_t interface_version; /* XEN_DOMCTL_INTERFACE_VERSION */
     domid_t  domain;
     union {
@@ -1094,6 +1110,7 @@ struct xen_domctl {
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
         struct xen_domctl_vnuma             vnuma;
         struct xen_domctl_smt               smt;
+        struct xen_domctl_numa              get_numa_ranges;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index be12169..d0cca5f 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -740,6 +740,9 @@ static int flask_domctl(struct domain *d, int cmd)
     case XEN_DOMCTL_setsmt:
         return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SETSMT);
 
+    case XEN_DOMCTL_get_numa_ranges:
+        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_NUMA_RANGES);
+
     default:
         printk("flask_domctl: Unknown op %d\n", cmd);
         return -EPERM;
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 10f86b8..7b30cda 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -214,6 +214,8 @@ class domain2
     soft_reset
 # XEN_DOMCTL_setsmt
     setsmt
+# XEN_DOMCTL_get_numa_ranges
+    get_numa_ranges
 }
 
 # Similar to class domain, but primarily contains domctls related to HVM domains
-- 
2.9.4


[-- Attachment #6: 0005-libxc-Add-xc_list_numa.patch --]
[-- Type: text/plain, Size: 2869 bytes --]

From 5b9c5881773f209d06235fba421704f0a0e44712 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 9 Jun 2017 12:21:52 -0400
Subject: [PATCH 5/7] libxc: Add xc_list_numa

Implement the libxc call to retrieve NUMA ranges.
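
A hypothetical caller (xc_list_numa returns the number of nodes and
hands back a malloc'd array the caller must free; error handling is
elided):

    struct xen_vmemrange *info;
    int i, nr = xc_list_numa(xch, &info);

    for ( i = 0; i < nr; i++ )
        printf("NODE%u %#lx -> %#lx\n", info[i].nid,
               (unsigned long)info[i].start,
               (unsigned long)info[i].end);
    free(info);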

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/libxc/xc_misc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h |  1 +
 2 files changed, 63 insertions(+)

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 851f341..8ce8fc4 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -1373,6 +1373,68 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout)
     return _xc_livepatch_action(xch, name, LIVEPATCH_ACTION_REPLACE, timeout);
 }
 
+int xc_list_numa(xc_interface *xch, struct xen_vmemrange** _info)
+{
+    int rc = 0;
+    unsigned int nodes;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(xen_vmemrange_t, ranges);
+
+    if ( !_info )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.cmd = XEN_DOMCTL_get_numa_ranges;
+    domctl.u.get_numa_ranges.nodes = 0;
+
+    rc = do_domctl(xch, &domctl);
+    /* E2BIG is expected here since we didn't supply a buffer. */
+    if ( rc && errno != E2BIG )
+        return rc;
+
+    nodes = domctl.u.get_numa_ranges.nodes;
+    if ( nodes == 0 )
+    {
+        /* No NUMA at all? It should have one entry at least! */
+        *_info = NULL;
+        errno = EINVAL;
+        return -1;
+    }
+    *_info = calloc(nodes, sizeof(**_info));
+    if ( !*_info )
+    {
+        errno = ENOMEM;
+        return -1;
+    }
+
+    ranges = xc_hypercall_buffer_alloc(xch, ranges, nodes * sizeof(xen_vmemrange_t));
+    if ( !ranges )
+    {
+        free(*_info);
+        errno = ENOMEM;
+        return -1;
+    }
+    set_xen_guest_handle(domctl.u.get_numa_ranges.ranges, ranges);
+    memset(*_info, 0, nodes * sizeof(**_info));
+
+    rc = do_domctl(xch, &domctl);
+
+    if ( !rc && (domctl.u.get_numa_ranges.nodes != (nodes - 1)) )
+        memcpy(*_info, ranges, nodes * sizeof(*ranges));
+
+    xc_hypercall_buffer_free(xch, ranges);
+    if ( rc < 0 )
+    {
+        free(*_info);
+        *_info = NULL;
+    }
+
+    return (rc < 0) ? -1 : nodes;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 5802d69..2ba2b47 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -2569,4 +2569,5 @@ int xc_livepatch_revert(xc_interface *xch, char *name, uint32_t timeout);
 int xc_livepatch_unload(xc_interface *xch, char *name, uint32_t timeout);
 int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout);
 
+int xc_list_numa(xc_interface *xch, struct xen_vmemrange** _info);
 #endif /* XENCTRL_H */
-- 
2.9.4


[-- Attachment #7: 0006-xen-numa-Diagnostic-tool-to-figure-out-NUMA-issues.patch --]
[-- Type: text/plain, Size: 16041 bytes --]

From 033baca36963923c467adcb3d0473ea1f1e9b440 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 9 Jun 2017 12:22:24 -0400
Subject: [PATCH 6/7] xen-numa: Diagnostic tool to figure out NUMA issues.

The tool can provide multiple views of a guest.

 - 'mfns' will dump all of the MFNs of a guest, useful
for sorting and double-checking.

 - 'pfns' is an upgraded version of the above. It includes
such details as what the PFN is within the guest. The list
is not sorted. The PFNs are decimal, while the MFNs are
hex, to ease sorting.

 - 'node' digs into the PFNs and MFNs and figures out where
they are - which PFNs belong to which NODE. This should match
the guest view, otherwise we have issues.

For example on Dom0 on a SuperMicro H8DG6:
-bash-4.1# /xen-numa node 0
NODE0 0 -> 0x1a8000 (6784 MB)
NODE1 0x1a8000 -> 0x2a8000 (4096 MB)
0.0%..10.0%..20.0%..30.0%..40.0%..50.0%..60.0%..70.0%..80.0%..90.0%..
Max gpfn is 0x40069 (1024 MB)
- NODE0 PFNs (33.173813%):
0x8352->0x8553 (514)
0x28554->0x2b995 (13378)
0x2b997->0x2d10c (6006)
0x2d10d->0x2d5a1 (1173)
0x2d5a3->0x38553 (44977)
0x3dc00->0x3e553 (2388)
0x3f554->0x3fd53 (2048)
0x3ff54->0x3ffd3 (128)
0x3fff4->0x3fffb (8)
0x39c00->0x3dbff (16384)
0x620a->0x620c (3)
0x61fe->0x6200 (3)
0x6215, 0x621b, 0x6221, 0x6249, 0x6231, 0x63a7, 0x635f,
- NODE1 PFNs (66.771660%):
0x0->0x97 (152)
0x40000->0x40068 (105)
0x100->0x61fd (24830)
0x6201->0x6209 (9)
0x620d->0x6214 (8)
0x6216->0x621a (5)
0x621c->0x6220 (5)
0x6222->0x6230 (15)
0x6232->0x6248 (23)
0x624a->0x635e (277)
0x6360->0x63a6 (71)
0x63a8->0x8351 (8106)
0x8353, 0x8554->0x28553 (131072)
0x38554->0x39bff (5804)
0x3e554->0x3f553 (4096)
0x3fd54->0x3ff53 (512)
0x3ffd4->0x3fff3 (32)
0x3fffc->0x3ffff (4)

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/misc/Makefile   |   5 +
 tools/misc/xen-numa.c | 556 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 561 insertions(+)
 create mode 100644 tools/misc/xen-numa.c

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 4cc7296..ea0bd9b 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -17,6 +17,7 @@ TARGETS-y += xen-insmod
 TARGETS-y += xen-rmmod
 TARGETS-y += xen-lsmod
 TARGETS-y += xen-attribute
+TARGETS-y += xen-numa
 TARGETS := $(TARGETS-y)
 
 SUBDIRS := $(SUBDIRS-y)
@@ -34,6 +35,7 @@ INSTALL_SBIN-y += xen-insmod
 INSTALL_SBIN-y += xen-rmmod
 INSTALL_SBIN-y += xen-lsmod
 INSTALL_SBIN-y += xen-attribute
+INSTALL_SBIN-y += xen-numa
 INSTALL_SBIN := $(INSTALL_SBIN-y)
 
 INSTALL_PRIVBIN-y := xenpvnetboot
@@ -100,6 +102,9 @@ xen-lsmod xen-rmmod xen-insmod: xen-%: xen-%.o
 xen-attribute: xen-%: xen-%.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-numa: xen-numa.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(APPEND_LDFLAGS)
+
 xen-lowmemd: xen-lowmemd.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
 
diff --git a/tools/misc/xen-numa.c b/tools/misc/xen-numa.c
new file mode 100644
index 0000000..a0af262
--- /dev/null
+++ b/tools/misc/xen-numa.c
@@ -0,0 +1,556 @@
+/*
+ * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
+ */
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <xenctrl.h>
+
+#include <xc_private.h>
+#include <xc_core.h>
+#include <xg_save_restore.h>
+
+#define LOGFILE stdout
+
+struct ops {
+    const char *name;
+    const char *help;
+    int (*setup)(struct ops *);
+    void (*free)(struct ops *);
+    void (*begin)(struct ops *);
+    int (*iterate)(struct ops *, unsigned long pfn, unsigned long mfn);
+    void (*end)(struct ops *);
+
+    unsigned int arg3;
+    unsigned int arg4;
+
+    unsigned long max_gpfn;
+    xen_pfn_t *live_m2p;
+
+    struct xen_vmemrange *nodes;
+    unsigned int nodes_nr;
+
+    void *priv;
+};
+
+static int iterate(xc_interface *xc_handle,
+                   uint32_t domain,
+                   struct ops *ops)
+{
+    int ret;
+    unsigned long hvirt_start;
+    unsigned int pt_levels;
+    uint64_t *buf = NULL;
+    unsigned long max_mfn = 0; /* max mfn of the whole machine */
+    unsigned long m2p_mfn0;
+    unsigned int guest_width;
+    unsigned long i, start_pfn, version, max, old_v, max_gpfn;
+
+    if ( domain > DOMID_FIRST_RESERVED )
+        return -1;
+
+    /* Get max gpfn */
+    max_gpfn = do_memory_op(xc_handle, XENMEM_maximum_gpfn, &domain,
+                               sizeof(domain)) + 1;
+    if ( max_gpfn <= 0 )
+    {
+        fprintf(stderr, "Failed to get max_gpfn 0x%lx\n", max_gpfn);
+        return -EINVAL;
+    }
+
+    ops->max_gpfn = max_gpfn;
+    if ( ops->begin )
+        (ops->begin)(ops);
+
+    /* Get max mfn */
+    if ( !get_platform_info(xc_handle, domain,
+                            &max_mfn, &hvirt_start,
+                            &pt_levels, &guest_width) )
+    {
+        fprintf(stderr, "Failed to get platform information\n");
+        return -EINVAL;
+    }
+
+    /* The max is GB(1) in pages. */
+    max = 262144;
+
+    ops->live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0);
+    if ( !ops->live_m2p )
+    {
+        fprintf(stderr, "Failed to map live M2P table\n");
+        return -EINVAL;
+    }
+
+    /* Get guest's pfn list */
+    buf = malloc(sizeof(uint64_t) * max);
+    if ( !buf )
+    {
+        fprintf(stderr, "Failed to alloc pfn buf\n");
+        munmap(ops->live_m2p, M2P_SIZE(max_mfn));
+        return -EINVAL;
+    }
+
+    start_pfn = 0;
+    old_v = version = 0;
+    do {
+        memset(buf, 0xFF, sizeof(uint64_t) * max);
+        ret = xc_get_pfn_list(xc_handle, domain, buf, start_pfn, max, &version);
+        if ( old_v != version )
+        {
+            fprintf(stderr, "P2M changed, refetching.\n");
+            start_pfn = 0;
+            old_v = version;
+            if ( ops->free )
+                (ops->free)(ops);
+            if ( ops->begin )
+                (ops->begin)(ops);
+            continue;
+        }
+
+        if ( ret < 0 )
+        {
+            fprintf(stderr, "Failed to call with start_pfn=0x%lx, max=0x%lx, ret %d\n", start_pfn, max, ret);
+            break;
+        }
+        if ( !ret )
+            break;
+
+        max = ret; /* Update it for the next iteration. */
+        for ( i = 0; i < max; i++ )
+        {
+            ret = (ops->iterate)(ops, i + start_pfn, buf[i]);
+            if ( ret )
+                break;
+        }
+
+        start_pfn += max;
+        if ( ret )
+            break;
+
+    } while ( start_pfn < max_gpfn );
+
+    free(buf);
+    if ( ops->end )
+        (ops->end)(ops);
+    munmap(ops->live_m2p, M2P_SIZE(max_mfn));
+
+    return ret;
+}
+
+/* ------------------------- */
+static int print_mfns(struct ops *ops, unsigned long pfn, unsigned long mfn)
+{
+    fprintf(stdout, "0x%lx\n", mfn);
+    return 0;
+}
+
+static struct ops print_mfn_op = {
+    .help = " mfns  - print all the MFNs of the guest",
+    .name = "mfns",
+    .iterate = print_mfns,
+};
+
+/* ------------------------- */
+static int print_pfn_and_mfns_header(struct ops *ops)
+{
+    fprintf(stdout,"PFN\tMFN\tNODE\n");
+    fprintf(stdout,"--------------------------\n");
+
+    return 0;
+}
+
+static int print_pfn_and_mfns(struct ops *ops, unsigned long pfn, unsigned long mfn)
+{
+    unsigned long m2p = ops->live_m2p[mfn];
+    unsigned int i;
+    int nid = -1;
+
+    for ( i = 0; i < ops->nodes_nr; i++ )
+    {
+        if ( mfn >= ops->nodes[i].start && mfn < ops->nodes[i].end )
+        {
+            nid = ops->nodes[i].nid;
+            break;
+        }
+    }
+
+    fprintf(stdout, "%ld\t0x%lx\tNODE%d\n", m2p, mfn, nid);
+    return 0;
+}
+
+static struct ops print_pfns_ops = {
+    .help = " pfns - print the MFNs and PFNs of the guest",
+    .name = "pfns",
+    .setup = print_pfn_and_mfns_header,
+    .iterate = print_pfn_and_mfns,
+};
+
+/* ------------------------- */
+
+struct groups {
+    unsigned long start;
+    unsigned int len;
+    struct groups *next;
+};
+
+struct node_data {
+    int nid;
+    unsigned long pfns;
+    struct groups *groups;
+};
+
+struct node_args {
+    unsigned int stride;
+    struct node_data empty;
+    struct node_data *nodes_data;
+};
+
+static struct node_args *create_node(struct ops *ops)
+{
+    struct node_args *args;
+    unsigned int i;
+    struct node_data *n;
+
+    args = malloc(sizeof(struct node_args));
+    if ( !args )
+        return NULL;
+
+    args->stride = 262144; /* Every 1GB. */
+    args->empty.nid = -1;
+    args->empty.groups = NULL;
+    args->empty.pfns = 0;
+
+    n = malloc(sizeof(struct node_data) * ops->nodes_nr);
+    if ( !n )
+    {
+        free(args);
+        fprintf(stderr, "Failed to initialize temp data.\n");
+        return NULL;
+    }
+    args->nodes_data = n;
+
+    for ( i = 0; i < ops->nodes_nr ; i++ )
+    {
+        n[i].nid = ops->nodes[i].nid;
+        n[i].groups = NULL;
+        n[i].pfns = 0;
+    }
+
+    return args;
+}
+
+static int setup_node(struct ops *ops)
+{
+    struct node_args *args = create_node(ops);
+
+    if ( !args )
+        return -1;
+
+    ops->priv = args;
+    return 0;
+}
+
+static void begin_node(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned int i;
+
+    args->stride =  ops->max_gpfn / 10;
+
+    for ( i = 0; i < ops->nodes_nr ; i++ )
+    {
+        fprintf(stdout, "NODE%d %#lx -> %#lx (%ld MB)\n", ops->nodes[i].nid,
+                ops->nodes[i].start, ops->nodes[i].end,
+                (ops->nodes[i].end - ops->nodes[i].start) >> 8);
+    }
+}
+
+static struct groups *create(unsigned long pfn)
+{
+    struct groups *g;
+
+    g = malloc(sizeof(*g));
+    if ( !g )
+        return NULL;
+
+    g->next = NULL;
+    g->start = pfn;
+    g->len = 1;
+
+    return g;
+}
+
+static int add_to(struct node_data *n, unsigned long pfn)
+{
+    struct groups *g, *prev;
+
+    if ( !n )
+        return -1;
+
+    if ( !n->groups )
+    {
+        g = create(pfn);
+        if ( !g )
+            return -ENOMEM;
+        n->groups = g;
+    }
+
+
+    for ( prev = NULL, g = n->groups; g; prev = g, g = g->next )
+    {
+#if DEBUG_NODE
+        fprintf(stderr, "%s[%d]: %ld -> %ld (%ld)\n",
+                        __func__, n->nid, g->start, g->len+g->start, pfn);
+#endif
+        if ( pfn >= g->start && pfn <= (g->start + g->len) )
+        {
+            g->len++;
+            n->pfns++;
+
+            return 0;
+        }
+    }
+    if ( !prev )
+        return -EINVAL;
+
+    if ( prev->next )
+        return -EINVAL;
+
+    prev->next = create(pfn);
+    if ( !prev->next )
+        return -ENOMEM;
+
+    return 0;
+}
+
+static int _node_iterate(struct node_args *args, struct ops *ops,
+                         unsigned long pfn, unsigned long mfn)
+{
+    unsigned int i;
+
+    if ( !args )
+        return -1;
+
+    if ( !args->nodes_data )
+        return -1;
+
+    if ( args->stride && (pfn % args->stride) == 0 )
+    {
+        fprintf(stdout, "%.1f%%..", ((float)pfn / ops->max_gpfn) * 100);
+        fflush(stdout);
+    }
+    if ( !mfn )
+        return add_to(&args->empty, pfn);
+#ifdef DEBUG_NODE
+    if ( pfn > 10 )
+        return -1;
+#endif
+
+    pfn = ops->live_m2p[mfn];
+    for ( i = 0; i < ops->nodes_nr; i++ )
+    {
+        if ( mfn >= ops->nodes[i].start && mfn < ops->nodes[i].end )
+            return add_to(&args->nodes_data[i], pfn);
+    }
+
+    fprintf(stderr, "PFN 0x%lx, MFN 0x%lx is not within any NODE?!\n", pfn, mfn);
+    return -1;
+}
+
+static int node_iterate(struct ops *ops,
+                        unsigned long pfn, unsigned long mfn)
+{
+    return _node_iterate(ops->priv, ops, pfn, mfn);
+}
+
+static void print_groups(struct node_data *n, unsigned long max_gpfn)
+{
+    struct groups *g;
+    float p = 0.0;
+
+    if ( !n->groups )
+    {
+        if ( n->nid >= 0 )
+            fprintf(stdout, "- NODE%d not used.\n", n->nid);
+        return;
+    }
+    if ( n->pfns )
+    {
+        p = (float)n->pfns / (float)max_gpfn;
+        p *= 100;
+    }
+    if ( n->nid >= 0 )
+        fprintf(stdout, "- NODE%d PFNs (%lf%%):\n", n->nid, p);
+    else
+        fprintf(stdout, "PFNs not in any node (%lf%%):\n", p);
+
+    for ( g = n->groups; g; g = g->next )
+    {
+        if ( g->len == 1 )
+            fprintf(stdout, "0x%lx, ", g->start);
+        else
+            fprintf(stdout, "0x%lx->0x%lx (%d)\n", g->start, g->start + g->len - 1, g->len);
+    }
+    fprintf(stdout, "\n");
+}
+
+static void free_groups(struct node_data *n)
+{
+    struct groups *g, *prev;
+
+    if ( !n->groups )
+        return;
+
+    for ( prev = NULL, g = n->groups; g; prev = g, g = g->next )
+    {
+        if ( prev )
+            free( prev );
+    }
+
+    n->groups = NULL;
+}
+
+static void node_free(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned int i;
+
+    if ( !args )
+        return;
+
+    for ( i = 0; i < ops->nodes_nr; i++ )
+        free_groups(&args->nodes_data[i]);
+}
+
+static void node_end(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned int i;
+
+    fprintf(stdout, "\nMax gpfn is 0x%lx (%ld MB)\n",
+           ops->max_gpfn, ops->max_gpfn >> 8);
+
+    if ( !args )
+    {
+        fprintf(stderr, "We lost our collected data!\n");
+        return;
+    }
+    for ( i = 0; i < ops->nodes_nr; i++ )
+        print_groups(&args->nodes_data[i], ops->max_gpfn);
+
+    print_groups(&args->empty, ops->max_gpfn);
+
+    node_free(ops);
+    free(args->nodes_data);
+    free(args);
+    ops->priv = NULL;
+}
+
+static struct ops node_ops = {
+    .help = " node - summary of which PFNs are in which NODE.",
+    .name = "node",
+    .begin = begin_node,
+    .setup = setup_node,
+    .iterate = node_iterate,
+    .end = node_end,
+    .free = node_free,
+};
+
+static struct ops *callback_ops[] = {
+    &print_pfns_ops,
+    &print_mfn_op,
+    &print_pgm_ops,
+    &node_ops,
+};
+
+#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))
+
+static int print_numa(xc_interface *xch, unsigned int mode, unsigned int domid,
+                      unsigned int arg3, unsigned int arg4)
+{
+    struct xen_vmemrange *info;
+    int rc = 0;
+    struct ops *ops;
+
+    rc = xc_list_numa(xch, &info);
+    if ( rc < 0 )
+    {
+        fprintf(stderr, "Could not get the list of NUMA nodes: %s\n",
+                strerror(errno));
+        return rc;
+    }
+
+    if ( !info )
+    {
+        printf("There is no NUMA?\n");
+        return rc;
+    }
+
+    ops = callback_ops[mode];
+    ops->nodes_nr = rc;
+    ops->nodes = info;
+    ops->arg3 = arg3;
+    ops->arg4 = arg4;
+
+    rc = 0;
+    if ( ops->setup )
+        rc = (ops->setup)(ops);
+
+    if ( !rc )
+        rc = iterate(xch, domid, ops);
+
+    if ( ops->free )
+        (ops->free)(ops);
+
+    free(info);
+
+    return rc;
+}
+
+static void show_usage(const char *const progname)
+{
+    unsigned int i;
+    fprintf(stderr, "%s <operation> <domid> [optional]\n", progname);
+    for ( i = 0; i < ARRAY_SIZE(callback_ops); i++ )
+        fprintf(stderr, "%s\n", callback_ops[i]->help);
+}
+
+int main(int argc, char **argv)
+{
+    xc_interface *xch = NULL;
+    unsigned int i;
+
+    if ( argc < 3 )
+    {
+        show_usage(argv[0]);
+        return -EINVAL;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(callback_ops); i++ )
+    {
+         if (!strncmp(callback_ops[i]->name, argv[1], strlen(argv[1])))
+            break;
+    }
+
+    if ( i != ARRAY_SIZE(callback_ops) )
+    {
+        xch = xc_interface_open(0, 0, 0);
+        if ( !xch )
+        {
+            fprintf(stderr, "Could not open Xen handler.\n");
+            return -ENXIO;
+        }
+
+        return print_numa(xch, i, atoi(argv[2]),
+                          argc > 3 ? atoi(argv[3]) : 0,
+                          argc > 4 ? atoi(argv[4]) : 0);
+    }
+
+    return -EINVAL;
+}
-- 
2.9.4


[-- Attachment #8: 0007-xen-numa-Add-a-heatmap.patch --]
[-- Type: text/plain, Size: 4376 bytes --]

From 901fe4364deb69a6a803f540f03c1d8cf418dbc0 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 9 Jun 2017 13:44:22 -0400
Subject: [PATCH 7/7] xen-numa: Add a heatmap.

 - 'heatmap' outputs a PGM file of where the MFNs of a guest
reside. Use ImageMagick to convert this file to PNG:
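
  convert /isos/heatmap-32gb-xl-vnuma.pgm /isos/heatmap-32gb-xl-vnuma.png

(using the filenames from the example below; 'convert' is
ImageMagick's command-line converter)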

There is also a third optional parameter to change the width
of the image, and a fourth optional one to invert the colors.
That can help, as ImageMagick seems to ignore the PGM spec
and prints 0 as white instead of as black.

The heatmap paints a picture of PFNs 0..max_gpfn and the colors
are the N NODEs. For two nodes we should see three colors - NODE0,
NODE1, and holes.
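
For reference, the plain-text PGM ("P2") layout the tool emits is:

    P2
    <width> <height>
    <maxval>
    <one ASCII gray value per pixel, whitespace-separated>

where each pixel is a PFN, the gray value is its node number, and
<maxval> is the number of nodes (holes get the top value).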

For example to use it:

-bash-4.1# /xen-numa heatmap 1 1600 1 > /isos/heatmap-32gb-xl-vnuma.pgm

See
http://char.us.oracle.com/isos/heatmap-32gb-xl-vnuma.png
http://char.us.oracle.com/isos/heatmap-32gb-xm-vnuma.png

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/misc/xen-numa.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)

diff --git a/tools/misc/xen-numa.c b/tools/misc/xen-numa.c
index a0af262..b52def8 100644
--- a/tools/misc/xen-numa.c
+++ b/tools/misc/xen-numa.c
@@ -462,6 +462,130 @@ static struct ops node_ops = {
     .free = node_free,
 };
 
+/* ------------------------- */
+
+static int setup_pgm(struct ops *ops)
+{
+    int rc = setup_node(ops);
+
+    if ( !rc )
+    {
+        struct node_args *n;
+
+        n = ops->priv;
+        /* We don't want the percentage counter to show up, so.. */
+        n->stride = 0;
+    }
+
+    return rc;
+}
+
+static int find_pfn(struct ops *ops, struct node_data *n, unsigned long pfn)
+{
+    struct groups *g;
+    unsigned int inverted = ops->arg4 ? : 0;
+    int rc;
+
+    if ( !n->groups )
+        return -ENOENT;
+
+    if ( n->nid >= 0 )
+        rc = inverted ? ops->nodes_nr - n->nid : n->nid;
+    else
+        rc = inverted ? 0 : ops->nodes_nr;
+
+    for ( g = n->groups; g; g = g->next )
+    {
+        if ( g->start == pfn )
+                return rc;
+
+        if ( pfn >= g->start && pfn <= g->start + g->len - 1 )
+                return rc;
+    }
+    return -ENOENT;
+}
+
+static void end_pgm(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned long pfn;
+    unsigned long w, h;
+    unsigned long count;
+    unsigned int inverted = ops->arg4 ? : 0;
+    int rc;
+
+    if ( !args )
+    {
+        fprintf(stderr, "We lost our collected data!\n");
+        return;
+    }
+    w = ops->arg3 ? : 1600;
+    h = (float)ops->max_gpfn / w;
+
+    while ( ops->max_gpfn > (w*h) )
+        h++;
+
+    count = w*h;
+    fprintf(stdout,"P2\n%ld %ld\n", w, h);
+    fprintf(stdout, "%d\n", ops->nodes_nr);
+
+    for ( pfn = 0; pfn < ops->max_gpfn; pfn++ )
+    {
+        int node;
+
+        rc = -ENOENT;
+        for ( node = 0; node < ops->nodes_nr; node++ )
+        {
+            rc = find_pfn(ops, &args->nodes_data[node], pfn);
+            if ( rc >= 0 ) /* Found! */
+                break;
+            if ( rc != -ENOENT ) /* Uh oh. Not good */
+                break;
+        }
+        if ( rc == -ENOENT )
+            rc = find_pfn(ops, &args->empty, pfn);
+
+        if ( rc < 0 )
+        {
+            if ( rc == -ENOENT )
+                rc = inverted ? 0 : ops->nodes_nr;
+            else
+                goto out;
+        }
+        fprintf(stdout, "%d ", rc);
+        if ( pfn && (pfn % w ) == 0 )
+            fprintf(stdout, "\n");
+
+    }
+    count -= pfn;
+
+    rc = inverted ? 0 : ops->nodes_nr;
+    for ( pfn = 0; pfn < count; pfn++ )
+    {
+        fprintf(stdout, "%d ", rc);
+        if ( (pfn % w ) == 0 )
+            fprintf(stdout, "\n");
+    }
+
+ out:
+    node_free(ops);
+    free(args->nodes_data);
+    free(args);
+    ops->priv = NULL;
+}
+
+static struct ops print_pgm_ops = {
+    .help = " heatmap - Output an PGM file of PFNs with NODE values.\n" \
+            "           First optional parameter to define width, and\n" \
+            "           second to invert colors.\n",
+    .name = "heatmap",
+    .setup = setup_pgm,
+    .iterate = node_iterate,
+    .end = end_pgm,
+    .free = node_free,
+};
+/* ------------------------- */
+
 static struct ops *callback_ops[] = {
     &print_pfns_ops,
     &print_mfn_op,
-- 
2.9.4



* Re: API to query NUMA node of mfn
  2017-07-10 13:13   ` Konrad Rzeszutek Wilk
@ 2017-07-10 13:35     ` Olaf Hering
  2017-07-10 20:29       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 5+ messages in thread
From: Olaf Hering @ 2017-07-10 13:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Jan Beulich, xen-devel



On Mon, Jul 10, Konrad Rzeszutek Wilk wrote:

> So I wrote some code for exactly this for Xen 4.4.4, along with
> creation of a PGM map to visualize the NUMA node locality.

Are you planning to prepare that for staging at some point? I have not
checked whether this series has already been merged.

Olaf


* Re: API to query NUMA node of mfn
  2017-07-10 13:35     ` Olaf Hering
@ 2017-07-10 20:29       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 5+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-10 20:29 UTC (permalink / raw)
  To: Olaf Hering; +Cc: Jan Beulich, xen-devel

On Mon, Jul 10, 2017 at 03:35:33PM +0200, Olaf Hering wrote:
> On Mon, Jul 10, Konrad Rzeszutek Wilk wrote:
> 
> > So I wrote some code for exactly this for Xen 4.4.4, along with
> > creation of a PGM map to visualize the NUMA node locality.
> 
> Are you planning to prepare that for staging at some point? I have not
> checked whether this series has already been merged.

At some point, when life is not as crazy. You are of course
welcome to see if these patches help you, and if they do and
you are itching to get them into the repo - to upstream them.

