From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: API to query NUMA node of mfn
Date: Mon, 10 Jul 2017 09:13:24 -0400
Message-ID: <20170710131323.GF2461@localhost.localdomain>
References: <20170710101034.GA19754@aepfle.de> <596375FF020000780016A352@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="vtzGhvizbBRQ85DL"
Content-Disposition: inline
In-Reply-To: <596375FF020000780016A352@prv-mh.provo.novell.com>
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel"
To: Jan Beulich
Cc: Olaf Hering, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Jul 10, 2017 at 04:41:35AM -0600, Jan Beulich wrote:
> >>> On 10.07.17 at 12:10, wrote:
> > I would like to verify on which NUMA node the PFNs used by a HVM guest
> > are located. Is there an API for that? Something like:
> >
> > foreach (pfn, domid)
> >     mfns_per_node[pfn_to_node(pfn)]++
> > foreach (node)
> >     printk("%x %x\n", node, mfns_per_node[node])
>
> phys_to_nid() ?

So I wrote some code for exactly this for Xen 4.4.4, along with creation
of a PGM map to see NUMA node locality. Attaching them here..

>
> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel

--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0001-xen-x86-XENDOMCTL_get_memlist-Make-it-work.patch"

>>From a5e039801c989df29b704a4a5256715321906535 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Tue, 6 Jun 2017 20:31:21 -0400
Subject: [PATCH 1/7] xen/x86: XENDOMCTL_get_memlist: Make it work

This hypercall has a bunch of problems, which this patch fixes.
Specifically, it is not preemption-capable, it takes a nested lock, and
the data is stale by the time the caller gets to look at it.

The nested lock (and order inversion) is due to the copy_to_guest_offset
call. The particular implementation (see __hvm_copy) makes P2M calls
(p2m_mem_paging_populate), which take the p2m_lock. We avoid this by
taking the p2m lock early (before the page_alloc_lock) in:

  if ( !guest_handle_okay(domctl->u.getmemlist.buffer, max_pfns) )

here (this takes the p2m lock and then unlocks). And since the buffer
checks out, we can use the fast variant of copy_to_guest (which still
takes the p2m lock).

We extend this thinking to the copying of the values to the guest. The
loop that copies the mfns[] to the buffer (potentially) takes a p2m lock
on every invocation. So, to avoid holding the page_alloc_lock across it,
we create a temporary array (mfns) which is filled while holding the
page_alloc_lock, and then copy to the guest while holding no locks
(well, we still hold the domctl lock).

Preemption is now used, and we also honor 'start_pfn', which is renamed
to 'index' - as there is no enforced order in which the pages correspond
to PFNs.

All of the above is fixed by this patch. It also means that the callers
of xc_get_pfn_list have to take into account that num_pfns may be
smaller than max_pfns, and loop around.
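To illustrate, a caller is now expected to loop along these lines (a
rough sketch only, not part of this patch; CHUNK is a placeholder for
whatever value the max_pfns probe recommends, and the new
xc_get_pfn_list signature is the one introduced in the libxc patch
below):

    #define CHUNK 262144           /* placeholder; use the probed max_pfns */
    uint64_t *buf = malloc(CHUNK * sizeof(*buf));
    unsigned long index = 0, version = 0, old_v = 0;
    int n;

    while ( (n = xc_get_pfn_list(xch, domid, buf, index, CHUNK,
                                 &version)) > 0 )
    {
        if ( index && version != old_v )
        {
            index = 0;             /* P2M changed under us - start over. */
            old_v = version;
            continue;
        }
        old_v = version;
        /* ... consume buf[0] .. buf[n-1] ... */
        index += n;
    }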
See the patches "libxc: Use XENDOMCTL_get_memlist properly" and
"xen-mceinj: Loop around xc_get_pfn_list".

Signed-off-by: Konrad Rzeszutek Wilk
---
 xen/arch/x86/domctl.c       | 76 ++++++++++++++++++++++++++++---------------
 xen/arch/x86/mm/hap/hap.c   |  1 +
 xen/arch/x86/mm/p2m-ept.c   |  2 ++
 xen/include/asm-x86/p2m.h   |  2 ++
 xen/include/public/domctl.h | 36 ++++++++++++++++-----
 5 files changed, 84 insertions(+), 33 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index bebe1fb..3af6b39 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -325,57 +325,83 @@ long arch_do_domctl(
 
     case XEN_DOMCTL_getmemlist:
     {
-        int i;
+#define XEN_DOMCTL_getmemlist_max_pfns (GB(1) / PAGE_SIZE)
+        unsigned int i = 0, idx = 0;
         unsigned long max_pfns = domctl->u.getmemlist.max_pfns;
+        unsigned long index = domctl->u.getmemlist.index;
         uint64_t mfn;
         struct page_info *page;
+        uint64_t *mfns;
 
         if ( unlikely(d->is_dying) )
         {
             ret = -EINVAL;
             break;
         }
 
+        /* XSA-74: This sub-hypercall is fixed. */
-        /*
-         * XSA-74: This sub-hypercall is broken in several ways:
-         * - lock order inversion (p2m locks inside page_alloc_lock)
-         * - no preemption on huge max_pfns input
-         * - not (re-)checking d->is_dying with page_alloc_lock held
-         * - not honoring start_pfn input (which libxc also doesn't set)
-         * Additionally it is rather useless, as the result is stale by the
-         * time the caller gets to look at it.
-         * As it only has a single, non-production consumer (xen-mceinj),
-         * rather than trying to fix it we restrict it for the time being.
-         */
-        if ( /* No nested locks inside copy_to_guest_offset(). */
-             paging_mode_external(current->domain) ||
-             /* Arbitrary limit capping processing time. */
-             max_pfns > GB(4) / PAGE_SIZE )
+        ret = -E2BIG;
+        if ( max_pfns > XEN_DOMCTL_getmemlist_max_pfns )
+            max_pfns = XEN_DOMCTL_getmemlist_max_pfns;
+
+        /* Report the max number we are OK with. */
+        if ( !max_pfns && guest_handle_is_null(domctl->u.getmemlist.buffer) )
         {
-            ret = -EOPNOTSUPP;
+            domctl->u.getmemlist.max_pfns = XEN_DOMCTL_getmemlist_max_pfns;
+            copyback = 1;
             break;
         }
 
-        spin_lock(&d->page_alloc_lock);
+        ret = -EINVAL;
+        if ( !guest_handle_okay(domctl->u.getmemlist.buffer, max_pfns) )
+            break;
+
+        mfns = xmalloc_array(uint64_t, max_pfns);
+        if ( !mfns )
+        {
+            ret = -ENOMEM;
+            break;
+        }
 
-        ret = i = 0;
+        ret = -EINVAL;
+        spin_lock(&d->page_alloc_lock);
         page_list_for_each(page, &d->page_list)
         {
-            if ( i >= max_pfns )
+            if ( idx >= max_pfns )
                 break;
+
+            if ( index > i++ )
+                continue;
+
+            if ( idx && !(idx & 0xFF) && hypercall_preempt_check() )
+                break;
+
             mfn = page_to_mfn(page);
-            if ( copy_to_guest_offset(domctl->u.getmemlist.buffer,
-                                      i, &mfn, 1) )
+            mfns[idx++] = mfn;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        ret = 0;
+        for ( i = 0; i < idx; i++ )
+        {
+            if ( __copy_to_guest_offset(domctl->u.getmemlist.buffer,
+                                        i, &mfns[i], 1) )
             {
                 ret = -EFAULT;
                 break;
             }
-            ++i;
         }
-        spin_unlock(&d->page_alloc_lock);
 
         domctl->u.getmemlist.num_pfns = i;
+        /*
+         * A poor man's way of keeping track of P2M changes. If the P2M
+         * is changed, the version will change as well and the caller
+         * can redo its list.
+         */
+        domctl->u.getmemlist.version = p2m_get_hostp2m(d)->version;
+        copyback = 1;
+        xfree(mfns);
     }
     break;
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index ccc4174..0406c2a 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -709,6 +709,7 @@ hap_write_p2m_entry(struct vcpu *v, unsigned long gfn, l1_pgentry_t *p,
     if ( old_flags & _PAGE_PRESENT )
         flush_tlb_mask(d->domain_dirty_cpumask);
 
+    p2m_get_hostp2m(d)->version++;
     paging_unlock(d);
 
     if ( flush_nestedp2m )
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 72b3d0a..7da5b06 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -674,6 +674,8 @@ void ept_sync_domain(struct p2m_domain *p2m)
 {
     struct domain *d = p2m->domain;
     struct ept_data *ept = &p2m->ept;
+
+    p2m->version++;
     /* Only if using EPT and this domain has some VCPUs to dirty. */
     if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
         return;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index fcb50b1..b0549e8 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -293,6 +293,8 @@ struct p2m_domain {
         struct ept_data ept;
         /* NPT-equivalent structure could be added here. */
     };
+    /* OVM: Every update to the P2M increases this version. */
+    unsigned long version;
 };
 
 /* get host p2m table */
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 27f5001..2a25079 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -118,16 +118,36 @@ typedef struct xen_domctl_getdomaininfo xen_domctl_getdomaininfo_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getdomaininfo_t);
 
-/* XEN_DOMCTL_getmemlist */
+/*
+ * XEN_DOMCTL_getmemlist
+ * Retrieve an array of MFNs of the guest.
+ *
+ * If the hypercall returns a zero value, then it has copied 'num_pfns'
+ * (up to 'max_pfns') of the MFNs into 'buffer', along with an updated
+ * 'version' (it may be the same across hypercalls; if it varies, the
+ * data is stale and it is recommended that the caller restart with
+ * 'index' being zero).
+ *
+ * If 'max_pfns' is zero and 'buffer' is NULL, the hypercall returns
+ * -E2BIG and updates 'max_pfns' with the recommended value to be used.
+ *
+ * Note that due to the asynchronous nature of hypercalls the domain might
+ * have added or removed MFNs, making this information stale. It is
+ * the responsibility of the toolstack to use the 'version' field to check
+ * between each invocation. If the version differs, it should discard the
+ * stale data and start from scratch. It is OK for the toolstack to use
+ * the new 'version' field.
+ */
 struct xen_domctl_getmemlist {
-    /* IN variables. */
-    /* Max entries to write to output buffer. */
+    /* IN/OUT: Max entries to write to output buffer. If max_pfns is zero
+     * and buffer is NULL, this holds the recommended max size of buffer. */
     uint64_aligned_t max_pfns;
-    /* Start index in guest's page list. */
-    uint64_aligned_t start_pfn;
-    XEN_GUEST_HANDLE_64(uint64) buffer;
-    /* OUT variables. */
-    uint64_aligned_t num_pfns;
+    uint64_aligned_t index;   /* IN: Start index in guest's page list. */
+    XEN_GUEST_HANDLE_64(uint64) buffer; /* IN: If NULL with max_pfns == 0,
+                                         * then max_pfns gets the
+                                         * recommended value. */
+    uint64_aligned_t version; /* IN/OUT: If the value differs, prior calls
+                               * may have stale data. */
+    uint64_aligned_t num_pfns; /* OUT: Number (up to max_pfns) copied. */
 };
 typedef struct xen_domctl_getmemlist xen_domctl_getmemlist_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getmemlist_t);
-- 
2.9.4


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0002-libxc-libxc-Use-XENDOMCTL_get_memlist-properly.patch"

>>From d2edab820ee1bf4c354836e33c427602963986ba Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Fri, 9 Jun 2017 12:00:24 -0400
Subject: [PATCH 2/7] libxc: Use XENDOMCTL_get_memlist properly

With the hypervisor side working well, we can take advantage of it. The
process has changed, as we first need to figure out the upper limit of
PFNs the hypervisor is willing to provide. Fortunately for us, if we
provide max_pfns with a zero value and a NULL buffer, we get back in
max_pfns the acceptable max number.

With this information we can make the real hypercall - however, the
amount we get may be smaller than the max, hence we adjust the number
and also let the caller know.

Furthermore, the 'version' is provided as a way for the caller to
restart from scratch if that version is different from its previous
call.

Also we modify the tools to compile, but not necessarily work well (see
"xen-mceinj: Loop around xc_get_pfn_list").

Signed-off-by: Konrad Rzeszutek Wilk
---
 tools/libxc/xc_private.c                | 50 +++++++++++++++++++++++--------
 tools/libxc/xenctrl.h                   |  2 +-
 tools/ocaml/libs/xc/xenctrl_stubs.c     |  3 +-
 tools/tests/mce-test/tools/xen-mceinj.c |  3 +-
 4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 52f53d7..3ecee6a 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -581,33 +581,59 @@ int xc_machphys_mfn_list(xc_interface *xch,
 
 int xc_get_pfn_list(xc_interface *xch,
                     uint32_t domid,
-                    uint64_t *pfn_buf,
-                    unsigned long max_pfns)
+                    uint64_t *buf,
+                    unsigned long index,
+                    unsigned long max_pfns,
+                    unsigned long *version)
 {
     DECLARE_DOMCTL;
-    DECLARE_HYPERCALL_BOUNCE(pfn_buf, max_pfns * sizeof(*pfn_buf), XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+    DECLARE_HYPERCALL_BUFFER(uint64_t, pfn_buf);
     int ret;
+    unsigned long nr_pfns;
 
-#ifdef VALGRIND
-    memset(pfn_buf, 0, max_pfns * sizeof(*pfn_buf));
-#endif
+    domctl.cmd = XEN_DOMCTL_getmemlist;
+    domctl.domain = (domid_t)domid;
+    domctl.u.getmemlist.max_pfns = 0;
+    domctl.u.getmemlist.index = 0;
+
+    /* The buffer is NULL here - we only probe for the recommended size. */
+    set_xen_guest_handle(domctl.u.getmemlist.buffer, pfn_buf);
+
+    ret = do_domctl(xch, &domctl);
+    if ( ret && errno != E2BIG )
+        return ret;
 
-    if ( xc_hypercall_bounce_pre(xch, pfn_buf) )
+    if ( !domctl.u.getmemlist.max_pfns )
     {
-        PERROR("xc_get_pfn_list: pfn_buf bounce failed");
+        errno = ENXIO;
         return -1;
     }
+    if ( max_pfns > domctl.u.getmemlist.max_pfns )
+        max_pfns = domctl.u.getmemlist.max_pfns;
 
-    domctl.cmd = XEN_DOMCTL_getmemlist;
-    domctl.domain = (domid_t)domid;
+    domctl.u.getmemlist.index = index;
     domctl.u.getmemlist.max_pfns = max_pfns;
+
+    pfn_buf = xc_hypercall_buffer_alloc(xch, pfn_buf, max_pfns * sizeof(uint64_t));
+    if ( !pfn_buf )
+    {
+        errno = ENOMEM;
+        return -1;
+    }
+    set_xen_guest_handle(domctl.u.getmemlist.buffer, pfn_buf);
 
     ret = do_domctl(xch, &domctl);
 
-    xc_hypercall_bounce_post(xch, pfn_buf);
+    nr_pfns = domctl.u.getmemlist.num_pfns;
+
+    if ( !ret )
+    {
+        memcpy(buf, pfn_buf, nr_pfns * sizeof(*buf));
+        *version = domctl.u.getmemlist.version;
+    }
+
+    xc_hypercall_buffer_free(xch, pfn_buf);
 
-    return (ret < 0) ? -1 : domctl.u.getmemlist.num_pfns;
+    return (ret < 0) ? -1 : nr_pfns;
 }
 
 long xc_get_tot_pages(xc_interface *xch, uint32_t domid)
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 5f015b2..5802d69 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -1443,7 +1443,7 @@ unsigned long xc_translate_foreign_address(xc_interface *xch, uint32_t dom,
  * without a backing MFN.
  */
 int xc_get_pfn_list(xc_interface *xch, uint32_t domid, uint64_t *pfn_buf,
-                    unsigned long max_pfns);
+                    unsigned long start_pfn, unsigned long max_pfns,
+                    unsigned long *version);
 
 int xc_copy_to_domain_page(xc_interface *xch, uint32_t domid,
                            unsigned long dst_pfn, const char *src_page);
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c
index 5ed0008..b1f1dff 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -1055,6 +1055,7 @@ CAMLprim value stub_xc_domain_get_pfn_list(value xch, value domid,
 	CAMLparam3(xch, domid, nr_pfns);
 	CAMLlocal2(array, v);
 	unsigned long c_nr_pfns;
+	unsigned long version;
 	long ret, i;
 	uint64_t *c_array;
 
@@ -1065,7 +1066,7 @@ CAMLprim value stub_xc_domain_get_pfn_list(value xch, value domid,
 		caml_raise_out_of_memory();
 
 	ret = xc_get_pfn_list(_H(xch), _D(domid),
-			      c_array, c_nr_pfns);
+			      c_array, 0, c_nr_pfns, &version);
 	if (ret < 0) {
 		free(c_array);
 		failwith_xc(_H(xch));
diff --git a/tools/tests/mce-test/tools/xen-mceinj.c b/tools/tests/mce-test/tools/xen-mceinj.c
index 21a488b..22b4401 100644
--- a/tools/tests/mce-test/tools/xen-mceinj.c
+++ b/tools/tests/mce-test/tools/xen-mceinj.c
@@ -263,6 +263,7 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
     unsigned long max_mfn = 0; /* max mfn of the whole machine */
     unsigned long m2p_mfn0;
     unsigned int guest_width;
+    unsigned long version;
     long max_gpfn,i;
     uint64_t mfn = MCE_INVALID_MFN;
 
@@ -289,7 +290,7 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
         err(xc_handle, "Failed to alloc pfn buf\n");
     memset(pfn_buf, 0, sizeof(uint64_t) * max_gpfn);
 
-    ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, max_gpfn);
+    ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, 0, max_gpfn, &version);
     if ( ret < 0 )
     {
         free(pfn_buf);
         err(xc_handle, "Failed to get pfn list %x\n", ret);
-- 
2.9.4


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0003-xen-mceinj-Loop-around-xc_get_pfn_list.patch"

>>From 483e55bc13c6ab246251fcd62c59adc7bf169c52 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Tue, 13 Jun 2017 10:56:52 -0400
Subject: [PATCH 3/7] xen-mceinj: Loop around xc_get_pfn_list

Now that xc_get_pfn_list can return fewer entries than max_pfns, we
need to take that into account and loop around.

While at it, fix the code:
 - Move the memset in the loop.
 - Change the loop conditions to exit if the mfn has been found.
 - Explain why 262144 is used.
 - Add munmap if we failed to allocate the buffer.
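(For reference, the 262144 constant is simply 1GB expressed in 4KB
pages - 2^30 / 2^12 = 2^18 = 262144 - i.e. the same GB(1) / PAGE_SIZE
cap the hypervisor enforces via XEN_DOMCTL_getmemlist_max_pfns.)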
Signed-off-by: Konrad Rzeszutek Wilk
---
 tools/tests/mce-test/tools/xen-mceinj.c | 67 ++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/tools/tests/mce-test/tools/xen-mceinj.c b/tools/tests/mce-test/tools/xen-mceinj.c
index 22b4401..9c90235 100644
--- a/tools/tests/mce-test/tools/xen-mceinj.c
+++ b/tools/tests/mce-test/tools/xen-mceinj.c
@@ -264,7 +264,7 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
     unsigned long m2p_mfn0;
     unsigned int guest_width;
     unsigned long version;
-    long max_gpfn,i;
+    unsigned long max_gpfn, i, start_pfn, max, old_v;
     uint64_t mfn = MCE_INVALID_MFN;
 
     if ( domain > DOMID_FIRST_RESERVED )
@@ -284,40 +284,63 @@ static uint64_t guest_mfn(xc_interface *xc_handle,
                             &pt_levels, &guest_width) )
         err(xc_handle, "Failed to get platform information\n");
 
-    /* Get guest's pfn list */
-    pfn_buf = malloc(sizeof(uint64_t) * max_gpfn);
-    if ( !pfn_buf )
-        err(xc_handle, "Failed to alloc pfn buf\n");
-    memset(pfn_buf, 0, sizeof(uint64_t) * max_gpfn);
-
-    ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, 0, max_gpfn, &version);
-    if ( ret < 0 )
-    {
-        free(pfn_buf);
-        err(xc_handle, "Failed to get pfn list %x\n", ret);
-    }
+    max = 262144 /* 1GB of MFNs. */;
 
     /* Now get the m2p table */
     live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0);
     if ( !live_m2p )
         err(xc_handle, "Failed to map live M2P table\n");
 
-    /* match the mapping */
-    for ( i = 0; i < max_gpfn; i++ )
-    {
-        uint64_t tmp;
-        tmp = pfn_buf[i];
-
-        if (mfn_valid(tmp) && (mfn_to_pfn(tmp) == gpfn))
-        {
-            mfn = tmp;
-            Lprintf("We get the mfn 0x%lx for this injection\n", mfn);
-            break;
-        }
-    }
+    /* Get guest's pfn list */
+    pfn_buf = malloc(sizeof(uint64_t) * max);
+    if ( !pfn_buf )
+    {
+        munmap(live_m2p, M2P_SIZE(max_mfn));
+        err(xc_handle, "Failed to alloc pfn buf\n");
+    }
+
+    start_pfn = 0;
+    old_v = version = 0;
+    do {
+        memset(pfn_buf, 0, sizeof(uint64_t) * max);
+        ret = xc_get_pfn_list(xc_handle, domain, pfn_buf, start_pfn, max,
+                              &version);
+        if ( old_v != version )
+        {
+            Lprintf("P2M changed, refetching.\n");
+            start_pfn = 0;
+            old_v = version;
+            continue;
+        }
+
+        if ( ret < 0 )
+            break;
+
+        Lprintf("%ld/%ld .. \n", start_pfn, max_gpfn);
+
+        if ( max != ret )
+            max = ret; /* Update it for the next iteration. */
+
+        start_pfn += ret;
+
+        for ( i = 0; i < ret; i++ )
+        {
+            uint64_t tmp;
+            tmp = pfn_buf[i];
+
+            if (mfn_valid(tmp) && (mfn_to_pfn(tmp) == gpfn))
+            {
+                mfn = tmp;
+                Lprintf("We get the mfn 0x%lx for this injection\n", mfn);
+                break;
+            }
+        }
+    } while ( start_pfn < max_gpfn && (mfn == MCE_INVALID_MFN) );
 
     munmap(live_m2p, M2P_SIZE(max_mfn));
     free(pfn_buf);
 
+    if ( ret < 0 )
+        err(xc_handle, "Failed to get pfn list %x\n", ret);
+
     return mfn;
 }
-- 
2.9.4


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0004-x86-domctl-Add-XEN_DOMCTL_get_numa_ranges.patch"

>>From c6bf254298d68638ca8652825a28fa97cee51ff4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Tue, 6 Jun 2017 13:04:22 -0400
Subject: [PATCH 4/7] x86/domctl: Add XEN_DOMCTL_get_numa_ranges

This is a fairly simple hypercall - it allows us to return the list
(and ranges) of NUMA nodes, like:

 NODE0 0      -> 100000
 NODE1 100000 -> 230000

This is different from XEN_SYSCTL_numainfo, which returns the size of
the nodes and their distances - but not the ranges. Alternatively, this
functionality could be stuffed in XEN_SYSCTL_numainfo.

Also, if 'nodes' is set to zero, the hypercall returns the number of
nodes so that the caller can right away allocate the right buffer for it.
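A sketch of the intended calling convention (illustrative only - the
actual libxc wrapper comes in "libxc: Add xc_list_numa"):

    DECLARE_DOMCTL;
    unsigned int nr;

    domctl.cmd = XEN_DOMCTL_get_numa_ranges;
    domctl.u.get_numa_ranges.nodes = 0;    /* Probe; 'ranges' left NULL. */
    if ( do_domctl(xch, &domctl) && errno != E2BIG )
        return -1;                         /* A real failure. */

    nr = domctl.u.get_numa_ranges.nodes;   /* Number of online nodes. */
    /* Now allocate nr xen_vmemrange_t entries, hand them over via
     * set_xen_guest_handle() on u.get_numa_ranges.ranges, keep
     * .nodes = nr, and issue do_domctl() again to fetch the ranges. */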
Signed-off-by: Konrad Rzeszutek Wilk
---
 tools/flask/policy/policy/modules/xen/xen.te |  1 +
 xen/arch/x86/domctl.c                        | 50 ++++++++++++++++++++++++++
 xen/include/public/domctl.h                  | 17 +++++++++
 xen/xsm/flask/hooks.c                        |  3 ++
 xen/xsm/flask/policy/access_vectors          |  2 ++
 5 files changed, 73 insertions(+)

diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
index d4974ae..5fe5381 100644
--- a/tools/flask/policy/policy/modules/xen/xen.te
+++ b/tools/flask/policy/policy/modules/xen/xen.te
@@ -68,6 +68,7 @@ allow dom0_t xen_t:xen {
 allow dom0_t xen_t:xen2 {
     livepatch_op
     module_op
+    get_numa_ranges
 };
 
 # Allow dom0 to use all XENVER_ subops that have checks.
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 3af6b39..4eaf510 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -323,6 +323,56 @@ long arch_do_domctl(
     }
     break;
 
+    case XEN_DOMCTL_get_numa_ranges:
+    {
+        unsigned int nr = domctl->u.get_numa_ranges.nodes;
+        unsigned int i = 0, idx = 0;
+
+        if ( !nr )
+        {
+            ret = -E2BIG;
+            domctl->u.get_numa_ranges.nodes = num_online_nodes();
+            copyback = 1;
+            break;
+        }
+
+        if ( nr && !guest_handle_okay(domctl->u.get_numa_ranges.ranges, nr) )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
+        ret = 0;
+        for_each_online_node ( i )
+        {
+            struct xen_vmemrange range;
+
+            range.start = node_start_pfn(i);
+            range.end = node_end_pfn(i);
+            range.nid = i;
+            range.flags = 0;
+            if ( __copy_to_guest_offset(domctl->u.get_numa_ranges.ranges,
+                                        idx, &range, 1) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            idx++;
+            if ( idx >= nr )
+                break;
+        }
+        if ( idx )
+        {
+            copyback = 1;
+            /* Report how many entries we actually copied. */
+            domctl->u.get_numa_ranges.nodes = idx;
+        }
+    }
+    break;
+
     case XEN_DOMCTL_getmemlist:
     {
 #define XEN_DOMCTL_getmemlist_max_pfns (GB(1) / PAGE_SIZE)
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 2a25079..a32b662 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -955,6 +955,21 @@ struct xen_domctl_smt {
 typedef struct xen_domctl_smt xen_domctl_smt;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_smt);
 
+/*
+ * Returns the NUMA node ranges. If 'ranges' is NULL and 'nodes' is zero,
+ * then we provide the number of nodes in 'nodes'.
+ *
+ * Negative values on errors. Zero on success.
+ */
+struct xen_domctl_numa {
+    uint32_t nodes; /* IN:  How many nodes are requested.
+                     * OUT: How many were copied. */
+    XEN_GUEST_HANDLE_64(xen_vmemrange_t) ranges;
+};
+typedef struct xen_domctl_numa xen_domctl_numa_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_numa_t);
+
 struct xen_domctl {
     uint32_t cmd;
 #define XEN_DOMCTL_createdomain 1
@@ -1035,6 +1050,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_hide_device   2001
 #define XEN_DOMCTL_unhide_device 2002
 #define XEN_DOMCTL_setsmt        2003
+#define XEN_DOMCTL_get_numa_ranges 2004
     uint32_t interface_version; /* XEN_DOMCTL_INTERFACE_VERSION */
     domid_t  domain;
     union {
@@ -1094,6 +1110,7 @@ struct xen_domctl {
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
         struct xen_domctl_vnuma             vnuma;
         struct xen_domctl_smt               smt;
+        struct xen_domctl_numa              get_numa_ranges;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index be12169..d0cca5f 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -740,6 +740,9 @@ static int flask_domctl(struct domain *d, int cmd)
     case XEN_DOMCTL_setsmt:
         return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SETSMT);
 
+    case XEN_DOMCTL_get_numa_ranges:
+        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_NUMA_RANGES);
+
     default:
         printk("flask_domctl: Unknown op %d\n", cmd);
         return -EPERM;
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 10f86b8..7b30cda 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -214,6 +214,8 @@ class domain2
     soft_reset
 # XEN_DOMCTL_setsmt
     setsmt
+# XEN_DOMCTL_get_numa_ranges
+    get_numa_ranges
 }
 
 # Similar to class domain, but primarily contains domctls related to HVM domains
-- 
2.9.4


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0005-libxc-Add-xc_list_numa.patch"

>>From 5b9c5881773f209d06235fba421704f0a0e44712 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Fri, 9 Jun 2017 12:21:52 -0400
Subject: [PATCH 5/7] libxc: Add xc_list_numa

Implement the libxc call to retrieve NUMA ranges.

Signed-off-by: Konrad Rzeszutek Wilk
---
 tools/libxc/xc_misc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h |  1 +
 2 files changed, 63 insertions(+)

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 851f341..8ce8fc4 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -1373,6 +1373,68 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout)
     return _xc_livepatch_action(xch, name, LIVEPATCH_ACTION_REPLACE, timeout);
 }
 
+int xc_list_numa(xc_interface *xch, struct xen_vmemrange** _info)
+{
+    int rc = 0;
+    unsigned int nodes;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(xen_vmemrange_t, ranges);
+
+    if ( !_info )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.cmd = XEN_DOMCTL_get_numa_ranges;
+    domctl.u.get_numa_ranges.nodes = 0;
+
+    rc = do_domctl(xch, &domctl);
+    /* E2BIG is expected here since we didn't allocate any buffer. */
+    if ( rc && errno != E2BIG )
+        return rc;
+
+    nodes = domctl.u.get_numa_ranges.nodes;
+    if ( nodes == 0 )
+    {
+        /* No NUMA at all? It should have one entry at least! */
+        *_info = NULL;
+        errno = EINVAL;
+        return -1;
+    }
+
+    *_info = calloc(nodes, sizeof(**_info));
+    if ( !*_info )
+    {
+        errno = ENOMEM;
+        return -1;
+    }
+
+    ranges = xc_hypercall_buffer_alloc(xch, ranges, nodes * sizeof(xen_vmemrange_t));
+    if ( !ranges )
+    {
+        free(*_info);
+        errno = ENOMEM;
+        return -1;
+    }
+    set_xen_guest_handle(domctl.u.get_numa_ranges.ranges, ranges);
+    memset(*_info, 0, nodes * sizeof(**_info));
+
+    rc = do_domctl(xch, &domctl);
+
+    if ( !rc )
+        memcpy(*_info, ranges, nodes * sizeof(*ranges));
+
+    xc_hypercall_buffer_free(xch, ranges);
+    if ( rc < 0 )
+    {
+        free(*_info);
+        *_info = NULL;
+    }
+
+    return (rc < 0) ? -1 : nodes;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 5802d69..2ba2b47 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -2569,4 +2569,5 @@ int xc_livepatch_revert(xc_interface *xch, char *name, uint32_t timeout);
 int xc_livepatch_unload(xc_interface *xch, char *name, uint32_t timeout);
 int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout);
 
+int xc_list_numa(xc_interface *xch, struct xen_vmemrange** _info);
 #endif /* XENCTRL_H */
-- 
2.9.4


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0006-xen-numa-Diagnostic-tool-to-figure-out-NUMA-issues.patch"

>>From 033baca36963923c467adcb3d0473ea1f1e9b440 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Fri, 9 Jun 2017 12:22:24 -0400
Subject: [PATCH 6/7] xen-numa: Diagnostic tool to figure out NUMA issues.

The tool can provide multiple views of a guest:

 - 'mfns' will dump all of the MFNs of a guest; useful for sorting
   and such, and for double-checking.

 - 'pfns' is an upgraded version of the above. It includes such details
   as what the PFN is within the guest. The list is not sorted. The PFNs
   are decimal, while the MFNs are hex, to ease sorting.

 - 'node' digs into the PFNs and MFNs and figures out where they are -
   which PFNs belong to which NODE. This should match the guest view;
   otherwise we have issues.

For example on Dom0 on a SuperMicro H8DG6:

-bash-4.1# /xen-numa node 0
NODE0 0 -> 0x1a8000 (6784 MB)
NODE1 0x1a8000 -> 0x2a8000 (4096 MB)
0.0%..10.0%..20.0%..30.0%..40.0%..50.0%..60.0%..70.0%..80.0%..90.0%..
Max gpfn is 0x40069 (1024 MB)
- NODE0 PFNs (33.173813%):
0x8352->0x8553 (514)
0x28554->0x2b995 (13378)
0x2b997->0x2d10c (6006)
0x2d10d->0x2d5a1 (1173)
0x2d5a3->0x38553 (44977)
0x3dc00->0x3e553 (2388)
0x3f554->0x3fd53 (2048)
0x3ff54->0x3ffd3 (128)
0x3fff4->0x3fffb (8)
0x39c00->0x3dbff (16384)
0x620a->0x620c (3)
0x61fe->0x6200 (3)
0x6215, 0x621b, 0x6221, 0x6249, 0x6231, 0x63a7, 0x635f,
- NODE1 PFNs (66.771660%):
0x0->0x97 (152)
0x40000->0x40068 (105)
0x100->0x61fd (24830)
0x6201->0x6209 (9)
0x620d->0x6214 (8)
0x6216->0x621a (5)
0x621c->0x6220 (5)
0x6222->0x6230 (15)
0x6232->0x6248 (23)
0x624a->0x635e (277)
0x6360->0x63a6 (71)
0x63a8->0x8351 (8106)
0x8353,
0x8554->0x28553 (131072)
0x38554->0x39bff (5804)
0x3e554->0x3f553 (4096)
0x3fd54->0x3ff53 (512)
0x3ffd4->0x3fff3 (32)
0x3fffc->0x3ffff (4)

Signed-off-by: Konrad Rzeszutek Wilk
---
 tools/misc/Makefile   |   5 +
 tools/misc/xen-numa.c | 556 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 561 insertions(+)
 create mode 100644 tools/misc/xen-numa.c

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 4cc7296..ea0bd9b 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -17,6 +17,7 @@ TARGETS-y += xen-insmod
 TARGETS-y += xen-rmmod
 TARGETS-y += xen-lsmod
 TARGETS-y += xen-attribute
+TARGETS-y += xen-numa
 TARGETS := $(TARGETS-y)
 
 SUBDIRS := $(SUBDIRS-y)
@@ -34,6 +35,7 @@ INSTALL_SBIN-y += xen-insmod
 INSTALL_SBIN-y += xen-rmmod
 INSTALL_SBIN-y += xen-lsmod
 INSTALL_SBIN-y += xen-attribute
+INSTALL_SBIN-y += xen-numa
 INSTALL_SBIN := $(INSTALL_SBIN-y)
 
 INSTALL_PRIVBIN-y := xenpvnetboot
@@ -100,6 +102,9 @@ xen-lsmod xen-rmmod xen-insmod: xen-%: xen-%.o
 xen-attribute: xen-%: xen-%.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-numa: xen-numa.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(APPEND_LDFLAGS)
+
 xen-lowmemd: xen-lowmemd.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
 
diff --git a/tools/misc/xen-numa.c b/tools/misc/xen-numa.c
new file mode 100644
index 0000000..a0af262
--- /dev/null
+++ b/tools/misc/xen-numa.c
@@ -0,0 +1,556 @@
+/*
+ * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
+ */
+
+#define _GNU_SOURCE
+
+/* NB: the original header names were mangled in transit; the includes
+ * below are a best-guess reconstruction from what the code uses. */
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/mman.h>
+
+#include <xenctrl.h>
+#include <xenguest.h>
+#include <xg_save_restore.h>
+
+#define LOGFILE stdout
+
+struct ops {
+    const char *name;
+    const char *help;
+    int (*setup)(struct ops *);
+    void (*free)(struct ops *);
+    void (*begin)(struct ops *);
+    int (*iterate)(struct ops *, unsigned long pfn, unsigned long mfn);
+    void (*end)(struct ops *);
+
+    unsigned int arg3;
+    unsigned int arg4;
+
+    unsigned long max_gpfn;
+    xen_pfn_t *live_m2p;
+
+    struct xen_vmemrange *nodes;
+    unsigned int nodes_nr;
+
+    void *priv;
+};
+
+static int iterate(xc_interface *xc_handle,
+                   uint32_t domain,
+                   struct ops *ops)
+{
+    int ret;
+    unsigned long hvirt_start;
+    unsigned int pt_levels;
+    uint64_t *buf = NULL;
+    unsigned long max_mfn = 0; /* max mfn of the whole machine */
+    unsigned long m2p_mfn0;
+    unsigned int guest_width;
+    unsigned long i, start_pfn, version, max, old_v, max_gpfn;
+
+    if ( domain > DOMID_FIRST_RESERVED )
+        return -1;
+
+    /* Get max gpfn */
+    max_gpfn = do_memory_op(xc_handle, XENMEM_maximum_gpfn, &domain,
+                            sizeof(domain)) + 1;
+    if ( max_gpfn <= 0 )
+    {
+        fprintf(stderr, "Failed to get max_gpfn 0x%lx\n", max_gpfn);
+        return -EINVAL;
+    }
+
+    ops->max_gpfn = max_gpfn;
+    if ( ops->begin )
+        (ops->begin)(ops);
+
+    /* Get max mfn */
+    if ( !get_platform_info(xc_handle, domain,
+                            &max_mfn, &hvirt_start,
+                            &pt_levels, &guest_width) )
+    {
+        fprintf(stderr, "Failed to get platform information\n");
+        return -EINVAL;
+    }
+
+    /* The max is GB(1) in pages. */
+    max = 262144;
+
+    ops->live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0);
+    if ( !ops->live_m2p )
+    {
+        fprintf(stderr, "Failed to map live M2P table\n");
+        return -EINVAL;
+    }
+
+    /* Get guest's pfn list */
+    buf = malloc(sizeof(uint64_t) * max);
+    if ( !buf )
+    {
+        fprintf(stderr, "Failed to alloc pfn buf\n");
+        munmap(ops->live_m2p, M2P_SIZE(max_mfn));
+        return -EINVAL;
+    }
+
+    start_pfn = 0;
+    old_v = version = 0;
+    do {
+        memset(buf, 0xFF, sizeof(uint64_t) * max);
+        ret = xc_get_pfn_list(xc_handle, domain, buf, start_pfn, max, &version);
+        if ( old_v != version )
+        {
+            fprintf(stderr, "P2M changed, refetching.\n");
+            start_pfn = 0;
+            old_v = version;
+            if ( ops->free )
+                (ops->free)(ops);
+            if ( ops->begin )
+                (ops->begin)(ops);
+            continue;
+        }
+
+        if ( ret < 0 )
+        {
+            fprintf(stderr, "Failed to call with start_pfn=0x%lx, max=0x%lx, ret %d\n",
+                    start_pfn, max, ret);
+            break;
+        }
+        if ( !ret )
+            break;
+
+        max = ret; /* Update it for the next iteration. */
+        for ( i = 0; i < max; i++ )
+        {
+            ret = (ops->iterate)(ops, i + start_pfn, buf[i]);
+            if ( ret )
+                break;
+        }
+
+        start_pfn += max;
+        if ( ret )
+            break;
+
+    } while ( start_pfn < max_gpfn );
+
+    free(buf);
+    if ( ops->end )
+        (ops->end)(ops);
+    munmap(ops->live_m2p, M2P_SIZE(max_mfn));
+
+    return ret;
+}
+
+/* ------------------------- */
+static int print_mfns(struct ops *ops, unsigned long pfn, unsigned long mfn)
+{
+    fprintf(stdout, "0x%lx\n", mfn);
+    return 0;
+}
+
+static struct ops print_mfn_op = {
+    .help = " mfns - print all the MFNs of the guest",
+    .name = "mfns",
+    .iterate = print_mfns,
+};
+
+/* ------------------------- */
+static int print_pfn_and_mfns_header(struct ops *ops)
+{
+    fprintf(stdout, "PFN\tMFN\tNODE\n");
+    fprintf(stdout, "--------------------------\n");
+
+    return 0;
+}
+
+static int print_pfn_and_mfns(struct ops *ops, unsigned long pfn, unsigned long mfn)
+{
+    unsigned long m2p = ops->live_m2p[mfn];
+    unsigned int i;
+    int nid = -1;
+
+    for ( i = 0; i < ops->nodes_nr; i++ )
+    {
+        if ( mfn >= ops->nodes[i].start && mfn < ops->nodes[i].end )
+        {
+            nid = ops->nodes[i].nid;
+            break;
+        }
+    }
+
+    fprintf(stdout, "%ld\t0x%lx\tNODE%d\n", m2p, mfn, nid);
+    return 0;
+}
+
+static struct ops print_pfns_ops = {
+    .help = " pfns - print the MFNs and PFNs of the guest",
+    .name = "pfns",
+    .setup = print_pfn_and_mfns_header,
+    .iterate = print_pfn_and_mfns,
+};
+
+/* ------------------------- */
+
+struct groups {
+    unsigned long start;
+    unsigned int len;
+    struct groups *next;
+};
+
+struct node_data {
+    int nid;
+    unsigned long pfns;
+    struct groups *groups;
+};
+
+struct node_args {
+    unsigned int stride;
+    struct node_data empty;
+    struct node_data *nodes_data;
+};
+
+static struct node_args *create_node(struct ops *ops)
+{
+    struct node_args *args;
+    unsigned int i;
+    struct node_data *n;
+
+    args = malloc(sizeof(struct node_args));
+    if ( !args )
+        return NULL;
+
+    args->stride = 262144; /* Every 1GB. */
+    args->empty.nid = -1;
+    args->empty.groups = NULL;
+    args->empty.pfns = 0;
+
+    n = malloc(sizeof(struct node_data) * ops->nodes_nr);
+    if ( !n )
+    {
+        free(args);
+        fprintf(stderr, "Failed to initialize temp data.\n");
+        return NULL;
+    }
+    args->nodes_data = n;
+
+    for ( i = 0; i < ops->nodes_nr ; i++ )
+    {
+        n[i].nid = ops->nodes[i].nid;
+        n[i].groups = NULL;
+        n[i].pfns = 0;
+    }
+
+    return args;
+}
+
+static int setup_node(struct ops *ops)
+{
+    struct node_args *args = create_node(ops);
+
+    if ( !args )
+        return -1;
+
+    ops->priv = args;
+    return 0;
+}
+
+static void begin_node(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned int i;
+
+    args->stride = ops->max_gpfn / 10;
+
+    for ( i = 0; i < ops->nodes_nr ; i++ )
+    {
+        fprintf(stdout, "NODE%d %#lx -> %#lx (%ld MB)\n", ops->nodes[i].nid,
+                ops->nodes[i].start, ops->nodes[i].end,
+                (ops->nodes[i].end - ops->nodes[i].start) >> 8);
+    }
+}
+
+static struct groups *create(unsigned long pfn)
+{
+    struct groups *g;
+
+    g = malloc(sizeof(*g));
+    if ( !g )
+        return NULL;
+
+    g->next = NULL;
+    g->start = pfn;
+    g->len = 1;
+
+    return g;
+}
+
+static int add_to(struct node_data *n, unsigned long pfn)
+{
+    struct groups *g, *prev;
+
+    if ( !n )
+        return -1;
+
+    if ( !n->groups )
+    {
+        g = create(pfn);
+        if ( !g )
+            return -ENOMEM;
+        n->groups = g;
+        n->pfns++;
+        /* The new group already accounts for this pfn. */
+        return 0;
+    }
+
+    for ( prev = NULL, g = n->groups; g; prev = g, g = g->next )
+    {
+#if DEBUG_NODE
+        fprintf(stderr, "%s[%d]: %ld -> %ld (%ld)\n",
+                __func__, n->nid, g->start, g->len + g->start, pfn);
+#endif
+        if ( pfn >= g->start && pfn <= (g->start + g->len) )
+        {
+            g->len++;
+            n->pfns++;
+
+            return 0;
+        }
+    }
+    if ( !prev )
+        return -EINVAL;
+
+    if ( prev->next )
+        return -EINVAL;
+
+    prev->next = create(pfn);
+    if ( !prev->next )
+        return -ENOMEM;
+    n->pfns++;
+
+    return 0;
+}
+
+static int _node_iterate(struct node_args *args, struct ops *ops,
+                         unsigned long pfn, unsigned long mfn)
+{
+    unsigned int i;
+
+    if ( !args )
+        return -1;
+
+    if ( !args->nodes_data )
+        return -1;
+
+    if ( args->stride && (pfn % args->stride) == 0 )
+    {
+        fprintf(stdout, "%.1f%%..", ((float)pfn / ops->max_gpfn) * 100);
+        fflush(stdout);
+    }
+    if ( !mfn )
+        return add_to(&args->empty, pfn);
+#ifdef DEBUG_NODE
+    if ( pfn > 10 )
+        return -1;
+#endif
+
+    pfn = ops->live_m2p[mfn];
+    for ( i = 0; i < ops->nodes_nr; i++ )
+    {
+        if ( mfn >= ops->nodes[i].start && mfn < ops->nodes[i].end )
+            return add_to(&args->nodes_data[i], pfn);
+    }
+
+    fprintf(stderr, "PFN 0x%lx, MFN 0x%lx is not within any NODE?!\n", pfn, mfn);
+    return -1;
+}
+
+static int node_iterate(struct ops *ops,
+                        unsigned long pfn, unsigned long mfn)
+{
+    return _node_iterate(ops->priv, ops, pfn, mfn);
+}
+
+static void print_groups(struct node_data *n, unsigned long max_gpfn)
+{
+    struct groups *g;
+    float p = 0.0;
+
+    if ( !n->groups )
+    {
+        if ( n->nid >= 0 )
+            fprintf(stdout, "- NODE%d not used.\n", n->nid);
+        return;
+    }
+    if ( n->pfns )
+    {
+        p = (float)n->pfns / (float)max_gpfn;
+        p *= 100;
+    }
+    if ( n->nid >= 0 )
+        fprintf(stdout, "- NODE%d PFNs (%lf%%):\n", n->nid, p);
+    else
+        fprintf(stdout, "PFNs not in any node (%lf%%):\n", p);
+
+    for ( g = n->groups; g; g = g->next )
+    {
+        if ( g->len == 1 )
+            fprintf(stdout, "0x%lx, ", g->start);
+        else
+            fprintf(stdout, "0x%lx->0x%lx (%d)\n", g->start,
+                    g->start + g->len - 1, g->len);
+    }
+    fprintf(stdout, "\n");
+}
+
+static void free_groups(struct node_data *n)
+{
+    struct groups *g, *prev;
+
+    if ( !n->groups )
+        return;
+
+    for ( prev = NULL, g = n->groups; g; prev = g, g = g->next )
+    {
+        if ( prev )
+            free(prev);
+    }
+    /* The loop above leaves the last element unfreed. */
+    free(prev);
+
+    n->groups = NULL;
+}
+
+static void node_free(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned int i;
+
+    if ( !args )
+        return;
+
+    for ( i = 0; i < ops->nodes_nr; i++ )
+        free_groups(&args->nodes_data[i]);
+}
+
+static void node_end(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned int i;
+
+    fprintf(stdout, "\nMax gpfn is 0x%lx (%ld MB)\n",
+            ops->max_gpfn, ops->max_gpfn >> 8);
+
+    if ( !args )
+    {
+        fprintf(stderr, "We lost our collected data!\n");
+        return;
+    }
+    for ( i = 0; i < ops->nodes_nr; i++ )
+        print_groups(&args->nodes_data[i], ops->max_gpfn);
+
+    print_groups(&args->empty, ops->max_gpfn);
+
+    node_free(ops);
+    free(args->nodes_data);
+    free(args);
+    ops->priv = NULL;
+}
+
+static struct ops node_ops = {
+    .help = " node - summary of which PFNs are in which NODE.",
+    .name = "node",
+    .begin = begin_node,
+    .setup = setup_node,
+    .iterate = node_iterate,
+    .end = node_end,
+    .free = node_free,
+};
+
+static struct ops *callback_ops[] = {
+    &print_pfns_ops,
+    &print_mfn_op,
+    &print_pgm_ops,
+    &node_ops,
+};
+
+#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))
+
+static int print_numa(xc_interface *xch, unsigned int mode, unsigned int domid,
+                      unsigned int arg3, unsigned int arg4)
+{
+    struct xen_vmemrange *info;
+    int rc = 0;
+    struct ops *ops;
+
+    rc = xc_list_numa(xch, &info);
+    if ( rc < 0 )
+    {
+        fprintf(stderr, "Could not get the list of NUMA nodes: %s\n",
+                strerror(errno));
+        return rc;
+    }
+
+    if ( !info )
+    {
+        printf("There is no NUMA?\n");
+        return rc;
+    }
+
+    ops = callback_ops[mode];
+    ops->nodes_nr = rc;
+    ops->nodes = info;
+    ops->arg3 = arg3;
+    ops->arg4 = arg4;
+
+    rc = 0;
+    if ( ops->setup )
+        rc = (ops->setup)(ops);
+
+    if ( !rc )
+        rc = iterate(xch, domid, ops);
+
+    if ( ops->free )
+        (ops->free)(ops);
+
+    free(info);
+
+    return rc;
+}
+
+static void show_usage(const char *const progname)
+{
+    unsigned int i;
+
+    fprintf(stderr, "%s <mode> <domid> [arg3] [arg4]\n", progname);
+    for ( i = 0; i < ARRAY_SIZE(callback_ops); i++ )
+        fprintf(stderr, "%s\n", callback_ops[i]->help);
+}
+
+int main(int argc, char **argv)
+{
+    xc_interface *xch = NULL;
+    unsigned int i;
+
+    if ( argc < 3 )
+    {
+        show_usage(argv[0]);
+        return -EINVAL;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(callback_ops); i++ )
+    {
+        if (!strncmp(callback_ops[i]->name, argv[1], strlen(argv[1])))
+            break;
+    }
+
+    if ( i != ARRAY_SIZE(callback_ops) )
+    {
+        xch = xc_interface_open(0, 0, 0);
+        if ( !xch )
+        {
+            fprintf(stderr, "Could not open Xen handler.\n");
+            return -ENXIO;
+        }
+
+        return print_numa(xch, i, atoi(argv[2]),
+                          argc > 3 ? atoi(argv[3]) : 0,
+                          argc > 4 ? atoi(argv[4]) : 0);
+    }
+
+    return -EINVAL;
+}
-- 
2.9.4


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0007-xen-numa-Add-a-heatmap.patch"

>>From 901fe4364deb69a6a803f540f03c1d8cf418dbc0 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk
Date: Fri, 9 Jun 2017 13:44:22 -0400
Subject: [PATCH 7/7] xen-numa: Add a heatmap.

- 'heatmap' outputs a PGM file of where the MFNs of a guest reside.
  Use ImageMagick to convert this file to PNG.

There is also a third optional parameter to change the width of the
file, and a fourth optional one to invert the colors. That can help,
as ImageMagick appears to ignore the PGM spec and prints 0 as white
instead of black.

The heatmap illustrates a picture of 0..max_gpfn, and the colors are
the N NODEs. For two nodes we should see three colors - NODE0, NODE1,
and holes.
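To make the format concrete, here is what the start of the output could
look like for a two-node guest (the pixel values are invented for
illustration): the maxval line equals the number of nodes, and each
pixel is a node id, with holes getting the extra color:

    P2
    1600 655
    2
    0 0 0 1 1 2 0 1 ...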
For example, to use it:

-bash-4.1# /xen-numa heatmap 1 1600 1 > /isos/heatmap-32gb-xl-vnuma.pgm

See
http://char.us.oracle.com/isos/heatmap-32gb-xl-vnuma.png
http://char.us.oracle.com/isos/heatmap-32gb-xm-vnuma.png

Signed-off-by: Konrad Rzeszutek Wilk
---
 tools/misc/xen-numa.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)

diff --git a/tools/misc/xen-numa.c b/tools/misc/xen-numa.c
index a0af262..b52def8 100644
--- a/tools/misc/xen-numa.c
+++ b/tools/misc/xen-numa.c
@@ -462,6 +462,130 @@ static struct ops node_ops = {
     .free = node_free,
 };
 
+/* ------------------------- */
+
+static int setup_pgm(struct ops *ops)
+{
+    int rc = setup_node(ops);
+
+    if ( !rc )
+    {
+        struct node_args *n;
+
+        n = ops->priv;
+        /* We don't want the percentage counter to show up, so.. */
+        n->stride = 0;
+    }
+
+    return rc;
+}
+
+static int find_pfn(struct ops *ops, struct node_data *n, unsigned long pfn)
+{
+    struct groups *g;
+    unsigned int inverted = ops->arg4 ? : 0;
+    int rc;
+
+    if ( !n->groups )
+        return -ENOENT;
+
+    if ( n->nid >= 0 )
+        rc = inverted ? ops->nodes_nr - n->nid : n->nid;
+    else
+        rc = inverted ? 0 : ops->nodes_nr;
+
+    for ( g = n->groups; g; g = g->next )
+    {
+        if ( g->start == pfn )
+            return rc;
+
+        if ( pfn >= g->start && pfn <= g->start + g->len - 1 )
+            return rc;
+    }
+    return -ENOENT;
+}
+
+static void end_pgm(struct ops *ops)
+{
+    struct node_args *args = ops->priv;
+    unsigned long pfn;
+    unsigned long w, h;
+    unsigned long count;
+    unsigned int inverted = ops->arg4 ? : 0;
+    int rc;
+
+    if ( !args )
+    {
+        fprintf(stderr, "We lost our collected data!\n");
+        return;
+    }
+    w = ops->arg3 ? : 1600;
+    h = (float)ops->max_gpfn / w;
+
+    while ( ops->max_gpfn > (w * h) )
+        h++;
+
+    count = w * h;
+    fprintf(stdout, "P2\n%ld %ld\n", w, h);
+    fprintf(stdout, "%d\n", ops->nodes_nr);
+
+    for ( pfn = 0; pfn < ops->max_gpfn; pfn++ )
+    {
+        int node;
+
+        rc = -ENOENT;
+        for ( node = 0; node < ops->nodes_nr; node++ )
+        {
+            rc = find_pfn(ops, &args->nodes_data[node], pfn);
+            if ( rc >= 0 ) /* Found! */
+                break;
+            if ( rc != -ENOENT ) /* Uh oh. Not good. */
+                break;
+        }
+        if ( rc == -ENOENT )
+            rc = find_pfn(ops, &args->empty, pfn);
+
+        if ( rc < 0 )
+        {
+            if ( rc == -ENOENT )
+                rc = inverted ? 0 : ops->nodes_nr;
+            else
+                goto out;
+        }
+        fprintf(stdout, "%d ", rc);
+        if ( pfn && (pfn % w) == 0 )
+            fprintf(stdout, "\n");
+    }
+    count -= pfn;
+
+    /* Pad the remainder of the last row with the hole color. */
+    rc = inverted ? 0 : ops->nodes_nr;
+    for ( pfn = 0; pfn < count; pfn++ )
+    {
+        fprintf(stdout, "%d ", rc);
+        if ( (pfn % w) == 0 )
+            fprintf(stdout, "\n");
+    }
+
+ out:
+    node_free(ops);
+    free(args->nodes_data);
+    free(args);
+    ops->priv = NULL;
+}
+
+static struct ops print_pgm_ops = {
+    .help = " heatmap - Output a PGM file of PFNs with NODE values.\n" \
+            "           First optional parameter defines the width, and\n" \
+            "           the second inverts the colors.\n",
+    .name = "heatmap",
+    .setup = setup_pgm,
+    .iterate = node_iterate,
+    .end = end_pgm,
+    .free = node_free,
+};
+/* ------------------------- */
+
 static struct ops *callback_ops[] = {
     &print_pfns_ops,
     &print_mfn_op,
-- 
2.9.4

--vtzGhvizbBRQ85DL--