All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup)
@ 2015-04-17 16:59 Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient Boris Ostrovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-17 16:59 UTC (permalink / raw)
  To: ian.jackson, ian.campbell, wei.liu2, stefano.stabellini, dgdegra,
	jbeulich, andrew.cooper3, keir
  Cc: elena.ufimtseva, dario.faggioli, boris.ostrovsky, xen-devel

Changes in v7:
* Allow one of arguments to NUMA info sysctls to be NULL, in which case only
  the non-NULL buffer will be filled in by hypervisor (patches 1 and 4)
* Properly handle -ENODEVS in PCI topology sysctl (patch 2)
* Error handling changes in patch 5

Changes in v6:
* PCI topology interface changes: no continuations, userspace will be dealing
  with "unfinished" sysctl (patches 2 and 5)
* Unknown device will cause ENODEV in sysctl
* No NULL tests in libxc
* Loop control initialization fix (similar to commit 26da081ac91a)
* Other minor changes (see per-patch notes)

Changes in v5:
* Make CPU topology and NUMA info sysctls behave more like XEN_DOMCTL_get_vcpu_msrs
  when passed NULL buffers. This required toolstack changes as well
* Don't use 8-bit data types in interfaces
* Fold interface version update into patch#3

Changes in v4:
* Split cputopology and NUMA info changes into separate patches
* Added patch#1 (partly because patch#4 needs to know when when distance is invalid,
  i.e. NUMA_NO_DISTANCE)
* Split sysctl version update into a separate patch
* Other changes are listed in each patch
* NOTE: I did not test python's xc changes since I don't think I know how.

Changes in v3:
* Added patch #1 to more consistently define nodes as a u8 and properly
  use NUMA_NO_NODE.
* Make changes to xen_sysctl_numainfo, similar to those made to
  xen_sysctl_topologyinfo. (Q: I kept both sets of changes in the same
  patch #3 to avoid bumping interface version twice. Perhaps it's better
  to split it into two?)
* Instead of copying data for each loop index allocate a buffer and copy
  once for all three queries in sysctl.c.
* Move hypercall buffer management from libxl to libxc (as requested by
  Dario, patches #5 and #6).
* Report topology info for offlined CPUs as well
* Added LIBXL_HAVE_PCITOPO macro

Changes in v2:
* Split topology sysctls into two --- one for CPU topology and the other
  for devices
* Avoid long loops in the hypervisor by using continuations. (I am not
  particularly happy about using first_dev in the interface, suggestions
  for a better interface would be appreciated)
* Use proper libxl conventions for interfaces
* Avoid hypervisor stack corruption when copying PXM data from guest


A few patches that add interface for querying hypervisor about device
topology and allow 'xl info -n' display this information if PXM object
is provided by ACPI.

This series also makes some optimizations and cleanup of current CPU
topology and NUMA sysctl queries.


Boris Ostrovsky (5):
  sysctl: Make XEN_SYSCTL_numainfo a little more efficient
  sysctl: Add sysctl interface for querying PCI topology
  libxl/libxc: Move libxl_get_cpu_topology()'s hypercall buffer
    management to libxc
  libxl/libxc: Move libxl_get_numainfo()'s hypercall buffer management
    to libxc
  libxl: Add interface for querying hypervisor about PCI topology

 docs/misc/xsm-flask.txt             |    1 +
 tools/libxc/include/xenctrl.h       |   12 ++-
 tools/libxc/xc_misc.c               |   97 ++++++++++++++++++---
 tools/libxl/libxl.c                 |  160 ++++++++++++++++++-----------------
 tools/libxl/libxl.h                 |   12 +++
 tools/libxl/libxl_freebsd.c         |   12 +++
 tools/libxl/libxl_internal.h        |    5 +
 tools/libxl/libxl_linux.c           |   70 +++++++++++++++
 tools/libxl/libxl_netbsd.c          |   12 +++
 tools/libxl/libxl_types.idl         |    7 ++
 tools/libxl/libxl_utils.c           |    8 ++
 tools/libxl/xl_cmdimpl.c            |   40 +++++++--
 tools/misc/xenpm.c                  |   51 +++++------
 tools/python/xen/lowlevel/xc/xc.c   |   74 ++++++----------
 xen/common/sysctl.c                 |  143 ++++++++++++++++++++++++-------
 xen/include/public/sysctl.h         |   84 +++++++++++++-----
 xen/xsm/flask/hooks.c               |    1 +
 xen/xsm/flask/policy/access_vectors |    1 +
 18 files changed, 560 insertions(+), 230 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient
  2015-04-17 16:59 [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup) Boris Ostrovsky
@ 2015-04-17 16:59 ` Boris Ostrovsky
  2015-04-17 17:14   ` Andrew Cooper
  2015-04-17 16:59 ` [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology Boris Ostrovsky
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-17 16:59 UTC (permalink / raw)
  To: ian.jackson, ian.campbell, wei.liu2, stefano.stabellini, dgdegra,
	jbeulich, andrew.cooper3, keir
  Cc: elena.ufimtseva, dario.faggioli, boris.ostrovsky, xen-devel

A number of changes to XEN_SYSCTL_numainfo interface:

* Make sysctl NUMA topology query use fewer copies by combining some
  fields into a single structure and copying distances for each node
  in a single copy.
* NULL meminfo and distance handles are a request for maximum number
  of nodes (num_nodes). If those handles are valid and num_nodes is
  is smaller than the number of nodes in the system then -ENOBUFS is
  returned (and correct num_nodes is provided)
* Instead of using max_node_index for passing number of nodes keep this
  value in num_nodes: almost all uses of max_node_index required adding
  or subtracting one to eventually get to number of nodes anyway.
* Replace INVALID_NUMAINFO_ID with XEN_INVALID_MEM_SZ and add
  XEN_INVALID_NODE_DIST.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---

Changes in v7:
* Allow one of arguments to  NUMA info sysctls to be NULL, in which case only
  the non-NULL buffer will be filled in by hypervisor (changes in sysctl.[ch])

 tools/libxl/libxl.c               |   66 +++++++++++++----------------
 tools/python/xen/lowlevel/xc/xc.c |   58 ++++++++++++--------------
 xen/common/sysctl.c               |   84 +++++++++++++++++++++++--------------
 xen/include/public/sysctl.h       |   54 ++++++++++++++----------
 4 files changed, 141 insertions(+), 121 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 511eef1..2ff46b4 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5156,65 +5156,59 @@ libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
 {
     GC_INIT(ctx);
     xc_numainfo_t ninfo;
-    DECLARE_HYPERCALL_BUFFER(xc_node_to_memsize_t, memsize);
-    DECLARE_HYPERCALL_BUFFER(xc_node_to_memfree_t, memfree);
-    DECLARE_HYPERCALL_BUFFER(uint32_t, node_dists);
+    DECLARE_HYPERCALL_BUFFER(xen_sysctl_meminfo_t, meminfo);
+    DECLARE_HYPERCALL_BUFFER(uint32_t, distance);
     libxl_numainfo *ret = NULL;
-    int i, j, max_nodes;
+    int i, j;
 
-    max_nodes = libxl_get_max_nodes(ctx);
-    if (max_nodes < 0)
-    {
+    set_xen_guest_handle(ninfo.meminfo, HYPERCALL_BUFFER_NULL);
+    set_xen_guest_handle(ninfo.distance, HYPERCALL_BUFFER_NULL);
+    if (xc_numainfo(ctx->xch, &ninfo) != 0) {
         LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of NODES");
         ret = NULL;
         goto out;
     }
 
-    memsize = xc_hypercall_buffer_alloc
-        (ctx->xch, memsize, sizeof(*memsize) * max_nodes);
-    memfree = xc_hypercall_buffer_alloc
-        (ctx->xch, memfree, sizeof(*memfree) * max_nodes);
-    node_dists = xc_hypercall_buffer_alloc
-        (ctx->xch, node_dists, sizeof(*node_dists) * max_nodes * max_nodes);
-    if ((memsize == NULL) || (memfree == NULL) || (node_dists == NULL)) {
+    meminfo = xc_hypercall_buffer_alloc(ctx->xch, meminfo,
+                                        sizeof(*meminfo) * ninfo.num_nodes);
+    distance = xc_hypercall_buffer_alloc(ctx->xch, distance,
+                                         sizeof(*distance) *
+                                         ninfo.num_nodes * ninfo.num_nodes);
+    if ((meminfo == NULL) || (distance == NULL)) {
         LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM,
                             "Unable to allocate hypercall arguments");
         goto fail;
     }
 
-    set_xen_guest_handle(ninfo.node_to_memsize, memsize);
-    set_xen_guest_handle(ninfo.node_to_memfree, memfree);
-    set_xen_guest_handle(ninfo.node_to_node_distance, node_dists);
-    ninfo.max_node_index = max_nodes - 1;
+    set_xen_guest_handle(ninfo.meminfo, meminfo);
+    set_xen_guest_handle(ninfo.distance, distance);
     if (xc_numainfo(ctx->xch, &ninfo) != 0) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting numainfo");
         goto fail;
     }
 
-    if (ninfo.max_node_index < max_nodes - 1)
-        max_nodes = ninfo.max_node_index + 1;
+    *nr = ninfo.num_nodes;
 
-    *nr = max_nodes;
+    ret = libxl__zalloc(NOGC, sizeof(libxl_numainfo) * ninfo.num_nodes);
+    for (i = 0; i < ninfo.num_nodes; i++)
+        ret[i].dists = libxl__calloc(NOGC, ninfo.num_nodes, sizeof(*distance));
 
-    ret = libxl__zalloc(NOGC, sizeof(libxl_numainfo) * max_nodes);
-    for (i = 0; i < max_nodes; i++)
-        ret[i].dists = libxl__calloc(NOGC, max_nodes, sizeof(*node_dists));
-
-    for (i = 0; i < max_nodes; i++) {
-#define V(mem, i) (mem[i] == INVALID_NUMAINFO_ID) ? \
-    LIBXL_NUMAINFO_INVALID_ENTRY : mem[i]
-        ret[i].size = V(memsize, i);
-        ret[i].free = V(memfree, i);
-        ret[i].num_dists = max_nodes;
-        for (j = 0; j < ret[i].num_dists; j++)
-            ret[i].dists[j] = V(node_dists, i * max_nodes + j);
+    for (i = 0; i < ninfo.num_nodes; i++) {
+#define V(val, invalid) (val == invalid) ? \
+       LIBXL_NUMAINFO_INVALID_ENTRY : val
+        ret[i].size = V(meminfo[i].memsize, XEN_INVALID_MEM_SZ);
+        ret[i].free = V(meminfo[i].memfree, XEN_INVALID_MEM_SZ);
+        ret[i].num_dists = ninfo.num_nodes;
+        for (j = 0; j < ret[i].num_dists; j++) {
+            unsigned idx = i * ninfo.num_nodes + j;
+            ret[i].dists[j] = V(distance[idx], XEN_INVALID_NODE_DIST);
+        }
 #undef V
     }
 
  fail:
-    xc_hypercall_buffer_free(ctx->xch, memsize);
-    xc_hypercall_buffer_free(ctx->xch, memfree);
-    xc_hypercall_buffer_free(ctx->xch, node_dists);
+    xc_hypercall_buffer_free(ctx->xch, meminfo);
+    xc_hypercall_buffer_free(ctx->xch, distance);
 
  out:
     GC_FREE;
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index 5e81c4a..ba66d55 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -1297,55 +1297,52 @@ out:
 
 static PyObject *pyxc_numainfo(XcObject *self)
 {
-#define MAX_NODE_INDEX 31
     xc_numainfo_t ninfo = { 0 };
-    int i, j, max_node_index;
+    unsigned i, j;
     uint64_t free_heap;
     PyObject *ret_obj = NULL, *node_to_node_dist_list_obj;
     PyObject *node_to_memsize_obj, *node_to_memfree_obj;
     PyObject *node_to_dma32_mem_obj, *node_to_node_dist_obj;
-    DECLARE_HYPERCALL_BUFFER(xc_node_to_memsize_t, node_memsize);
-    DECLARE_HYPERCALL_BUFFER(xc_node_to_memfree_t, node_memfree);
-    DECLARE_HYPERCALL_BUFFER(xc_node_to_node_dist_t, nodes_dist);
+    DECLARE_HYPERCALL_BUFFER(xen_sysctl_meminfo_t, meminfo);
+    DECLARE_HYPERCALL_BUFFER(uint32_t, distance);
 
-    node_memsize = xc_hypercall_buffer_alloc(self->xc_handle, node_memsize, sizeof(*node_memsize)*(MAX_NODE_INDEX+1));
-    if ( node_memsize == NULL )
+    set_xen_guest_handle(ninfo.meminfo, HYPERCALL_BUFFER_NULL);
+    set_xen_guest_handle(ninfo.distance, HYPERCALL_BUFFER_NULL);
+    if ( xc_numainfo(self->xc_handle, &ninfo) != 0 )
         goto out;
-    node_memfree = xc_hypercall_buffer_alloc(self->xc_handle, node_memfree, sizeof(*node_memfree)*(MAX_NODE_INDEX+1));
-    if ( node_memfree == NULL )
+
+    meminfo = xc_hypercall_buffer_alloc(self->xc_handle, meminfo,
+                                        sizeof(*meminfo) * ninfo.num_nodes);
+    if ( meminfo == NULL )
         goto out;
-    nodes_dist = xc_hypercall_buffer_alloc(self->xc_handle, nodes_dist, sizeof(*nodes_dist)*(MAX_NODE_INDEX+1)*(MAX_NODE_INDEX+1));
-    if ( nodes_dist == NULL )
+    distance = xc_hypercall_buffer_alloc(self->xc_handle, distance,
+                                         sizeof(*distance) *
+                                         ninfo.num_nodes * ninfo.num_nodes);
+    if ( distance == NULL )
         goto out;
 
-    set_xen_guest_handle(ninfo.node_to_memsize, node_memsize);
-    set_xen_guest_handle(ninfo.node_to_memfree, node_memfree);
-    set_xen_guest_handle(ninfo.node_to_node_distance, nodes_dist);
-    ninfo.max_node_index = MAX_NODE_INDEX;
-
+    set_xen_guest_handle(ninfo.meminfo, meminfo);
+    set_xen_guest_handle(ninfo.distance, distance);
     if ( xc_numainfo(self->xc_handle, &ninfo) != 0 )
         goto out;
 
-    max_node_index = ninfo.max_node_index;
-    if ( max_node_index > MAX_NODE_INDEX )
-        max_node_index = MAX_NODE_INDEX;
-
     /* Construct node-to-* lists. */
     node_to_memsize_obj = PyList_New(0);
     node_to_memfree_obj = PyList_New(0);
     node_to_dma32_mem_obj = PyList_New(0);
     node_to_node_dist_list_obj = PyList_New(0);
-    for ( i = 0; i <= max_node_index; i++ )
+    for ( i = 0; i < ninfo.num_nodes; i++ )
     {
         PyObject *pyint;
+        unsigned invalid_node;
 
         /* Total Memory */
-        pyint = PyInt_FromLong(node_memsize[i] >> 20); /* MB */
+        pyint = PyInt_FromLong(meminfo[i].memsize >> 20); /* MB */
         PyList_Append(node_to_memsize_obj, pyint);
         Py_DECREF(pyint);
 
         /* Free Memory */
-        pyint = PyInt_FromLong(node_memfree[i] >> 20); /* MB */
+        pyint = PyInt_FromLong(meminfo[i].memfree >> 20); /* MB */
         PyList_Append(node_to_memfree_obj, pyint);
         Py_DECREF(pyint);
 
@@ -1357,10 +1354,11 @@ static PyObject *pyxc_numainfo(XcObject *self)
 
         /* Node to Node Distance */
         node_to_node_dist_obj = PyList_New(0);
-        for ( j = 0; j <= max_node_index; j++ )
+        invalid_node = (meminfo[i].memsize == XEN_INVALID_MEM_SZ);
+        for ( j = 0; j < ninfo.num_nodes; j++ )
         {
-            uint32_t dist = nodes_dist[i*(max_node_index+1) + j];
-            if ( dist == ~0u )
+            uint32_t dist = distance[i * ninfo.num_nodes + j];
+            if ( invalid_node || (dist == XEN_INVALID_NODE_DIST) )
             {
                 PyList_Append(node_to_node_dist_obj, Py_None);
             }
@@ -1375,7 +1373,7 @@ static PyObject *pyxc_numainfo(XcObject *self)
         Py_DECREF(node_to_node_dist_obj);
     }
 
-    ret_obj = Py_BuildValue("{s:i}", "max_node_index", max_node_index);
+    ret_obj = Py_BuildValue("{s:i}", "max_node_index", ninfo.num_nodes + 1);
 
     PyDict_SetItemString(ret_obj, "node_memsize", node_to_memsize_obj);
     Py_DECREF(node_to_memsize_obj);
@@ -1391,11 +1389,9 @@ static PyObject *pyxc_numainfo(XcObject *self)
     Py_DECREF(node_to_node_dist_list_obj);
 
 out:
-    xc_hypercall_buffer_free(self->xc_handle, node_memsize);
-    xc_hypercall_buffer_free(self->xc_handle, node_memfree);
-    xc_hypercall_buffer_free(self->xc_handle, nodes_dist);
+    xc_hypercall_buffer_free(self->xc_handle, meminfo);
+    xc_hypercall_buffer_free(self->xc_handle, distance);
     return ret_obj ? ret_obj : pyxc_error_to_exception(self->xc_handle);
-#undef MAX_NODE_INDEX
 }
 
 static PyObject *pyxc_xeninfo(XcObject *self)
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index 70413cc..b025a90 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -274,54 +274,76 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
 
     case XEN_SYSCTL_numainfo:
     {
-        uint32_t i, j, max_node_index, last_online_node;
+        unsigned int i, j, num_nodes;
         xen_sysctl_numainfo_t *ni = &op->u.numainfo;
+        bool_t do_meminfo = !guest_handle_is_null(ni->meminfo);
+        bool_t do_distance = !guest_handle_is_null(ni->distance);
 
-        last_online_node = last_node(node_online_map);
-        max_node_index = min_t(uint32_t, ni->max_node_index, last_online_node);
-        ni->max_node_index = last_online_node;
+        num_nodes = last_node(node_online_map) + 1;
 
-        for ( i = 0; i <= max_node_index; i++ )
+        if ( do_meminfo || do_distance )
         {
-            if ( !guest_handle_is_null(ni->node_to_memsize) )
+            if ( ni->num_nodes < num_nodes )
             {
-                uint64_t memsize = node_online(i) ?
-                                   node_spanned_pages(i) << PAGE_SHIFT : 0ul;
-                if ( copy_to_guest_offset(ni->node_to_memsize, i, &memsize, 1) )
-                    break;
-            }
-            if ( !guest_handle_is_null(ni->node_to_memfree) )
-            {
-                uint64_t memfree = node_online(i) ?
-                                   avail_node_heap_pages(i) << PAGE_SHIFT : 0ul;
-                if ( copy_to_guest_offset(ni->node_to_memfree, i, &memfree, 1) )
-                    break;
+                ret = -ENOBUFS;
+                i = num_nodes;
             }
+            else
+                i = 0;
 
-            if ( !guest_handle_is_null(ni->node_to_node_distance) )
+            for ( ; i < num_nodes; i++ )
             {
-                for ( j = 0; j <= max_node_index; j++)
+                xen_sysctl_meminfo_t meminfo;
+                static uint32_t distance[MAX_NUMNODES];
+
+                if ( do_meminfo )
                 {
-                    uint32_t distance = ~0u;
-                    if ( node_online(i) && node_online(j) )
+                    if ( node_online(i) )
+                    {
+                        meminfo.memsize = node_spanned_pages(i) << PAGE_SHIFT;
+                        meminfo.memfree = avail_node_heap_pages(i) << PAGE_SHIFT;
+                    }
+                    else
+                        meminfo.memsize = meminfo.memfree = XEN_INVALID_MEM_SZ;
+
+                    if ( copy_to_guest_offset(ni->meminfo, i, &meminfo, 1) )
                     {
-                        u8 d = __node_distance(i, j);
+                        ret = -EFAULT;
+                        break;
+                    }
+                }
 
-                        if ( d != NUMA_NO_DISTANCE )
-                            distance = d;
+                if ( do_distance )
+                {
+                    for ( j = 0; j < num_nodes; j++ )
+                    {
+                        distance[j] = __node_distance(i, j);
+                        if ( distance[j] == NUMA_NO_DISTANCE )
+                            distance[j] = XEN_INVALID_NODE_DIST;
                     }
-                    if ( copy_to_guest_offset(
-                        ni->node_to_node_distance,
-                        i*(max_node_index+1) + j, &distance, 1) )
+
+                    if ( copy_to_guest_offset(ni->distance, i * num_nodes,
+                                              distance, num_nodes) )
+                    {
+                        ret = -EFAULT;
                         break;
+                    }
                 }
-                if ( j <= max_node_index )
-                    break;
             }
         }
+        else
+            i = num_nodes;
 
-        ret = ((i <= max_node_index) || copy_to_guest(u_sysctl, op, 1))
-            ? -EFAULT : 0;
+        if ( (!ret || (ret == -ENOBUFS)) && (ni->num_nodes != i) )
+        {
+            ni->num_nodes = i;
+            if ( __copy_field_to_guest(u_sysctl, op,
+                                       u.numainfo.num_nodes) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+        }
     }
     break;
 
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 711441f..021d505 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -494,34 +494,42 @@ typedef struct xen_sysctl_cputopoinfo xen_sysctl_cputopoinfo_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cputopoinfo_t);
 
 /* XEN_SYSCTL_numainfo */
-#define INVALID_NUMAINFO_ID (~0U)
+#define XEN_INVALID_MEM_SZ     (~0U)
+#define XEN_INVALID_NODE_DIST  (~0U)
+
+struct xen_sysctl_meminfo {
+    uint64_t memsize;
+    uint64_t memfree;
+};
+typedef struct xen_sysctl_meminfo xen_sysctl_meminfo_t;
+DEFINE_XEN_GUEST_HANDLE(xen_sysctl_meminfo_t);
+
+/*
+ * IN:
+ *  - Both 'meminfo' and 'distance' handles being null is a request
+ *    for maximum value of 'num_nodes'.
+ *  - Otherwise it's the number of entries in 'meminfo' and square root
+ *    of number of entries in 'distance' (when corresponding handle is
+ *    non-null)
+ *
+ * OUT:
+ *  - If 'num_nodes' is less than the number Xen needs to write, -ENOBUFS shall
+ *    be returned and 'num_nodes' updated to reflect the intended number.
+ *  - On success, 'num_nodes' shall indicate the number of entries written, which
+ *    may be less than the maximum.
+ */
+
 struct xen_sysctl_numainfo {
-    /*
-     * IN: maximum addressable entry in the caller-provided arrays.
-     * OUT: largest node identifier in the system.
-     * If OUT is greater than IN then the arrays are truncated!
-     */
-    uint32_t max_node_index;
+    uint32_t num_nodes;
 
-    /* NB. Entries are 0 if node is not present. */
-    XEN_GUEST_HANDLE_64(uint64) node_to_memsize;
-    XEN_GUEST_HANDLE_64(uint64) node_to_memfree;
+    XEN_GUEST_HANDLE_64(xen_sysctl_meminfo_t) meminfo;
 
     /*
-     * Array, of size (max_node_index+1)^2, listing memory access distances
-     * between nodes. If an entry has no node distance information (e.g., node 
-     * not present) then the value ~0u is written.
-     * 
-     * Note that the array rows must be indexed by multiplying by the minimum 
-     * of the caller-provided max_node_index and the returned value of
-     * max_node_index. That is, if the largest node index in the system is
-     * smaller than the caller can handle, a smaller 2-d array is constructed
-     * within the space provided by the caller. When this occurs, trailing
-     * space provided by the caller is not modified. If the largest node index
-     * in the system is larger than the caller can handle, then a 2-d array of
-     * the maximum size handleable by the caller is constructed.
+     * Distance between nodes 'i' and 'j' is stored in index 'i*N + j',
+     * where N is the number of nodes that will be returned in 'num_nodes'
+     * (i.e. not 'num_nodes' provided by the caller)
      */
-    XEN_GUEST_HANDLE_64(uint32) node_to_node_distance;
+    XEN_GUEST_HANDLE_64(uint32) distance;
 };
 typedef struct xen_sysctl_numainfo xen_sysctl_numainfo_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_numainfo_t);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-17 16:59 [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup) Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient Boris Ostrovsky
@ 2015-04-17 16:59 ` Boris Ostrovsky
  2015-04-17 17:16   ` Andrew Cooper
  2015-04-21  7:01   ` Jan Beulich
  2015-04-17 16:59 ` [PATCH v7 3/5] libxl/libxc: Move libxl_get_cpu_topology()'s hypercall buffer management to libxc Boris Ostrovsky
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-17 16:59 UTC (permalink / raw)
  To: ian.jackson, ian.campbell, wei.liu2, stefano.stabellini, dgdegra,
	jbeulich, andrew.cooper3, keir
  Cc: elena.ufimtseva, dario.faggioli, boris.ostrovsky, xen-devel

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
---

Changes in v7:
* Break from the loop when -ENODEV is encountered

 docs/misc/xsm-flask.txt             |    1 +
 xen/common/sysctl.c                 |   59 +++++++++++++++++++++++++++++++++++
 xen/include/public/sysctl.h         |   30 ++++++++++++++++++
 xen/xsm/flask/hooks.c               |    1 +
 xen/xsm/flask/policy/access_vectors |    1 +
 5 files changed, 92 insertions(+), 0 deletions(-)

diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt
index d63a8a7..e0a3dcc 100644
--- a/docs/misc/xsm-flask.txt
+++ b/docs/misc/xsm-flask.txt
@@ -121,6 +121,7 @@ __HYPERVISOR_sysctl (xen/include/public/sysctl.h)
  * XEN_SYSCTL_cpupool_op
  * XEN_SYSCTL_scheduler_op
  * XEN_SYSCTL_coverage_op
+ * XEN_SYSCTL_pcitopoinfo
 
 __HYPERVISOR_memory_op (xen/include/public/memory.h)
 
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index b025a90..8f3a58a 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -411,6 +411,65 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
         break;
 #endif
 
+#ifdef HAS_PCI
+    case XEN_SYSCTL_pcitopoinfo:
+    {
+        xen_sysctl_pcitopoinfo_t *ti = &op->u.pcitopoinfo;
+        unsigned dev_cnt = 0;
+
+        if ( guest_handle_is_null(ti->devs) ||
+             guest_handle_is_null(ti->nodes) ||
+             (ti->first_dev > ti->num_devs) )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
+        while ( ti->first_dev < ti->num_devs )
+        {
+            physdev_pci_device_t dev;
+            uint32_t node;
+            struct pci_dev *pdev;
+
+            if ( copy_from_guest_offset(&dev, ti->devs, ti->first_dev, 1) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            spin_lock(&pcidevs_lock);
+            pdev = pci_get_pdev(dev.seg, dev.bus, dev.devfn);
+            if ( !pdev )
+            {
+                ret = -ENODEV;
+                node = XEN_INVALID_NODE_ID;
+            }
+            else if ( pdev->node == NUMA_NO_NODE )
+                node = XEN_INVALID_NODE_ID;
+            else
+                node = pdev->node;
+            spin_unlock(&pcidevs_lock);
+
+            if ( copy_to_guest_offset(ti->nodes, ti->first_dev, &node, 1) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ti->first_dev++;
+
+            if ( (ret == -ENODEV) ||
+                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
+                break;
+        }
+
+        if ( (!ret || (ret == -ENODEV)) &&
+             __copy_field_to_guest(u_sysctl, op, u.pcitopoinfo.first_dev) )
+            ret = -EFAULT;
+    }
+    break;
+#endif
+
     default:
         ret = arch_do_sysctl(op, u_sysctl);
         copyback = 0;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 021d505..58e72c3 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -33,6 +33,7 @@
 
 #include "xen.h"
 #include "domctl.h"
+#include "physdev.h"
 
 #define XEN_SYSCTL_INTERFACE_VERSION 0x0000000C
 
@@ -669,6 +670,33 @@ struct xen_sysctl_psr_cmt_op {
 typedef struct xen_sysctl_psr_cmt_op xen_sysctl_psr_cmt_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_psr_cmt_op_t);
 
+/* XEN_SYSCTL_pcitopoinfo */
+struct xen_sysctl_pcitopoinfo {
+    /* IN: Number of elements in 'pcitopo' and 'nodes' arrays. */
+    uint32_t num_devs;
+
+    /*
+     * IN/OUT:
+     *   IN: First element of pcitopo array that needs to be processed by
+     *       the hypervisor.
+     *  OUT: Index of the first still unprocessed element of pcitopo array.
+     */
+    uint32_t first_dev;
+
+    /* IN: list of devices for which node IDs are requested. */
+    XEN_GUEST_HANDLE_64(physdev_pci_device_t) devs;
+
+    /*
+     * OUT: node identifier for each device.
+     * If information for a particular device is not avalable then set
+     * to XEN_INVALID_NODE_ID. In addition, if device is not known to the
+     * hypervisor, sysctl will stop further processing and return -ENODEV.
+     */
+    XEN_GUEST_HANDLE_64(uint32) nodes;
+};
+typedef struct xen_sysctl_pcitopoinfo xen_sysctl_pcitopoinfo_t;
+DEFINE_XEN_GUEST_HANDLE(xen_sysctl_pcitopoinfo_t);
+
 struct xen_sysctl {
     uint32_t cmd;
 #define XEN_SYSCTL_readconsole                    1
@@ -691,12 +719,14 @@ struct xen_sysctl {
 #define XEN_SYSCTL_scheduler_op                  19
 #define XEN_SYSCTL_coverage_op                   20
 #define XEN_SYSCTL_psr_cmt_op                    21
+#define XEN_SYSCTL_pcitopoinfo                   22
     uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
     union {
         struct xen_sysctl_readconsole       readconsole;
         struct xen_sysctl_tbuf_op           tbuf_op;
         struct xen_sysctl_physinfo          physinfo;
         struct xen_sysctl_cputopoinfo       cputopoinfo;
+        struct xen_sysctl_pcitopoinfo       pcitopoinfo;
         struct xen_sysctl_numainfo          numainfo;
         struct xen_sysctl_sched_id          sched_id;
         struct xen_sysctl_perfc_op          perfc_op;
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index ab5141d..db7a2a0 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -782,6 +782,7 @@ static int flask_sysctl(int cmd)
     case XEN_SYSCTL_physinfo:
     case XEN_SYSCTL_cputopoinfo:
     case XEN_SYSCTL_numainfo:
+    case XEN_SYSCTL_pcitopoinfo:
         return domain_has_xen(current->domain, XEN__PHYSINFO);
 
     case XEN_SYSCTL_psr_cmt_op:
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 128250e..d8e2d8d 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -28,6 +28,7 @@ class xen
 # XENPF_microcode_update
     microcode
 # XEN_SYSCTL_physinfo, XEN_SYSCTL_cputopoinfo, XEN_SYSCTL_numainfo
+# XEN_SYSCTL_pcitopoinfo
     physinfo
 # XENPF_platform_quirk
     quirk
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 3/5] libxl/libxc: Move libxl_get_cpu_topology()'s hypercall buffer management to libxc
  2015-04-17 16:59 [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup) Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology Boris Ostrovsky
@ 2015-04-17 16:59 ` Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 4/5] libxl/libxc: Move libxl_get_numainfo()'s " Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 5/5] libxl: Add interface for querying hypervisor about PCI topology Boris Ostrovsky
  4 siblings, 0 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-17 16:59 UTC (permalink / raw)
  To: ian.jackson, ian.campbell, wei.liu2, stefano.stabellini, dgdegra,
	jbeulich, andrew.cooper3, keir
  Cc: elena.ufimtseva, dario.faggioli, boris.ostrovsky, xen-devel

xc_cputopoinfo() is not expected to be used on a hot path and therefore
hypercall buffer management can be pushed into libxc. This will simplify
life for callers.

Also update error reporting macros.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxc/include/xenctrl.h     |    5 ++-
 tools/libxc/xc_misc.c             |   23 +++++++++++-----
 tools/libxl/libxl.c               |   37 ++++++++------------------
 tools/misc/xenpm.c                |   51 ++++++++++++++++---------------------
 tools/python/xen/lowlevel/xc/xc.c |   20 ++++++--------
 5 files changed, 61 insertions(+), 75 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 02d0db8..b476fda 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1228,7 +1228,7 @@ int xc_readconsolering(xc_interface *xch,
 int xc_send_debug_keys(xc_interface *xch, char *keys);
 
 typedef xen_sysctl_physinfo_t xc_physinfo_t;
-typedef xen_sysctl_cputopoinfo_t xc_cputopoinfo_t;
+typedef xen_sysctl_cputopo_t xc_cputopo_t;
 typedef xen_sysctl_numainfo_t xc_numainfo_t;
 
 typedef uint32_t xc_cpu_to_node_t;
@@ -1239,7 +1239,8 @@ typedef uint64_t xc_node_to_memfree_t;
 typedef uint32_t xc_node_to_node_dist_t;
 
 int xc_physinfo(xc_interface *xch, xc_physinfo_t *info);
-int xc_cputopoinfo(xc_interface *xch, xc_cputopoinfo_t *info);
+int xc_cputopoinfo(xc_interface *xch, unsigned *max_cpus,
+                   xc_cputopo_t *cputopo);
 int xc_numainfo(xc_interface *xch, xc_numainfo_t *info);
 
 int xc_sched_id(xc_interface *xch,
diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index be68291..630a86c 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -177,22 +177,31 @@ int xc_physinfo(xc_interface *xch,
     return 0;
 }
 
-int xc_cputopoinfo(xc_interface *xch,
-                   xc_cputopoinfo_t *put_info)
+int xc_cputopoinfo(xc_interface *xch, unsigned *max_cpus,
+                   xc_cputopo_t *cputopo)
 {
     int ret;
     DECLARE_SYSCTL;
+    DECLARE_HYPERCALL_BOUNCE(cputopo, *max_cpus * sizeof(*cputopo),
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
 
-    sysctl.cmd = XEN_SYSCTL_cputopoinfo;
+    if ( (ret = xc_hypercall_bounce_pre(xch, cputopo)) )
+        goto out;
 
-    memcpy(&sysctl.u.cputopoinfo, put_info, sizeof(*put_info));
+    sysctl.u.cputopoinfo.num_cpus = *max_cpus;
+    set_xen_guest_handle(sysctl.u.cputopoinfo.cputopo, cputopo);
+
+    sysctl.cmd = XEN_SYSCTL_cputopoinfo;
 
     if ( (ret = do_sysctl(xch, &sysctl)) != 0 )
-        return ret;
+        goto out;
 
-    memcpy(put_info, &sysctl.u.cputopoinfo, sizeof(*put_info));
+    *max_cpus = sysctl.u.cputopoinfo.num_cpus;
 
-    return 0;
+out:
+    xc_hypercall_bounce_post(xch, cputopo);
+
+    return ret;
 }
 
 int xc_numainfo(xc_interface *xch,
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 2ff46b4..f348afd 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5102,37 +5102,28 @@ int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo)
 libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nb_cpu_out)
 {
     GC_INIT(ctx);
-    xc_cputopoinfo_t tinfo;
-    DECLARE_HYPERCALL_BUFFER(xen_sysctl_cputopo_t, cputopo);
+    xc_cputopo_t *cputopo;
     libxl_cputopology *ret = NULL;
     int i;
+    unsigned num_cpus;
 
-    /* Setting buffer to NULL makes the hypercall return number of CPUs */
-    set_xen_guest_handle(tinfo.cputopo, HYPERCALL_BUFFER_NULL);
-    if (xc_cputopoinfo(ctx->xch, &tinfo) != 0)
+    /* Setting buffer to NULL makes the call return number of CPUs */
+    if (xc_cputopoinfo(ctx->xch, &num_cpus, NULL))
     {
-        LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of CPUS");
-        ret = NULL;
+        LOGEV(ERROR, errno, "Unable to determine number of CPUS");
         goto out;
     }
 
-    cputopo = xc_hypercall_buffer_alloc(ctx->xch, cputopo,
-                                        sizeof(*cputopo) * tinfo.num_cpus);
-    if (cputopo == NULL) {
-        LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM,
-                            "Unable to allocate hypercall arguments");
-        goto fail;
-    }
-    set_xen_guest_handle(tinfo.cputopo, cputopo);
+    cputopo = libxl__zalloc(gc, sizeof(*cputopo) * num_cpus);
 
-    if (xc_cputopoinfo(ctx->xch, &tinfo) != 0) {
-        LIBXL__LOG_ERRNO(ctx, XTL_ERROR, "CPU topology info hypercall failed");
-        goto fail;
+    if (xc_cputopoinfo(ctx->xch, &num_cpus, cputopo)) {
+        LOGEV(ERROR, errno, "CPU topology info hypercall failed");
+        goto out;
     }
 
-    ret = libxl__zalloc(NOGC, sizeof(libxl_cputopology) * tinfo.num_cpus);
+    ret = libxl__zalloc(NOGC, sizeof(libxl_cputopology) * num_cpus);
 
-    for (i = 0; i < tinfo.num_cpus; i++) {
+    for (i = 0; i < num_cpus; i++) {
 #define V(map, i, invalid) ( cputopo[i].map == invalid) ? \
    LIBXL_CPUTOPOLOGY_INVALID_ENTRY : cputopo[i].map
         ret[i].core = V(core, i, XEN_INVALID_CORE_ID);
@@ -5141,11 +5132,7 @@ libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nb_cpu_out)
 #undef V
     }
 
-fail:
-    xc_hypercall_buffer_free(ctx->xch, cputopo);
-
-    if (ret)
-        *nb_cpu_out = tinfo.num_cpus;
+    *nb_cpu_out = num_cpus;
 
  out:
     GC_FREE;
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index a5d07de..24ee6ef 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -355,20 +355,17 @@ static void signal_int_handler(int signo)
     int i, j, k;
     struct timeval tv;
     int cx_cap = 0, px_cap = 0;
-    DECLARE_HYPERCALL_BUFFER(xen_sysctl_cputopo_t, cputopo);
-    xc_cputopoinfo_t info = { 0 };
+    xc_cputopo_t *cputopo;
+    unsigned max_cpus;
 
-    set_xen_guest_handle(info.cputopo, HYPERCALL_BUFFER_NULL);
-    if ( xc_cputopoinfo(xc_handle, &info) != 0 )
+    if ( xc_cputopoinfo(xc_handle, &max_cpus, NULL) != 0 )
     {
         fprintf(stderr, "failed to discover number of CPUs: %s\n",
                 strerror(errno));
         goto out;
     }
 
-    cputopo = xc_hypercall_buffer_alloc(xc_handle, cputopo,
-                                        sizeof(*cputopo) * info.num_cpus);
-
+    cputopo = calloc(max_cpus, sizeof(*cputopo));
     if ( cputopo == NULL )
     {
 	fprintf(stderr, "failed to allocate hypercall buffers\n");
@@ -453,28 +450,26 @@ static void signal_int_handler(int signo)
             printf("  Avg freq\t%d\tKHz\n", avgfreq[i]);
     }
 
-    set_xen_guest_handle(info.cputopo, cputopo);
-
-    if ( cx_cap && !xc_cputopoinfo(xc_handle, &info) )
+    if ( cx_cap && !xc_cputopoinfo(xc_handle, &max_cpus, cputopo) )
     {
         uint32_t socket_ids[MAX_NR_CPU];
         uint32_t core_ids[MAX_NR_CPU];
         uint32_t socket_nr = 0;
         uint32_t core_nr = 0;
 
-        if ( info.num_cpus > MAX_NR_CPU )
-            info.num_cpus = MAX_NR_CPU;
+        if ( max_cpus > MAX_NR_CPU )
+            max_cpus = MAX_NR_CPU;
         /* check validity */
-        for ( i = 0; i < info.num_cpus; i++ )
+        for ( i = 0; i < max_cpus; i++ )
         {
             if ( cputopo[i].core == XEN_INVALID_CORE_ID ||
                  cputopo[i].socket == XEN_INVALID_SOCKET_ID )
                 break;
         }
-        if ( i >= info.num_cpus )
+        if ( i >= max_cpus )
         {
             /* find socket nr & core nr per socket */
-            for ( i = 0; i < info.num_cpus; i++ )
+            for ( i = 0; i < max_cpus; i++ )
             {
                 for ( j = 0; j < socket_nr; j++ )
                     if ( cputopo[i].socket == socket_ids[j] )
@@ -501,7 +496,7 @@ static void signal_int_handler(int signo)
                 unsigned int n;
                 uint64_t res;
 
-                for ( j = 0; j < info.num_cpus; j++ )
+                for ( j = 0; j < max_cpus; j++ )
                 {
                     if ( cputopo[j].socket == socket_ids[i] )
                         break;
@@ -520,7 +515,7 @@ static void signal_int_handler(int signo)
                 }
                 for ( k = 0; k < core_nr; k++ )
                 {
-                    for ( j = 0; j < info.num_cpus; j++ )
+                    for ( j = 0; j < max_cpus; j++ )
                     {
                         if ( cputopo[j].socket == socket_ids[i] &&
                              cputopo[j].core == core_ids[k] )
@@ -558,7 +553,7 @@ static void signal_int_handler(int signo)
     free(sum);
     free(avgfreq);
 out:
-    xc_hypercall_buffer_free(xc_handle, cputopo);
+    free(cputopo);
     xc_interface_close(xc_handle);
     exit(0);
 }
@@ -965,12 +960,11 @@ void scaling_governor_func(int argc, char *argv[])
 
 void cpu_topology_func(int argc, char *argv[])
 {
-    DECLARE_HYPERCALL_BUFFER(xen_sysctl_cputopo_t, cputopo);
-    xc_cputopoinfo_t info = { 0 };
-    int i, rc = ENOMEM;
+    xc_cputopo_t *cputopo;
+    unsigned max_cpus;
+    int i, rc;
 
-    set_xen_guest_handle(info.cputopo, HYPERCALL_BUFFER_NULL);
-    if ( xc_cputopoinfo(xc_handle, &info) )
+    if ( xc_cputopoinfo(xc_handle, &max_cpus, NULL) != 0 )
     {
         rc = errno;
         fprintf(stderr, "failed to discover number of CPUs (%d - %s)\n",
@@ -978,16 +972,15 @@ void cpu_topology_func(int argc, char *argv[])
         goto out;
     }
 
-    cputopo = xc_hypercall_buffer_alloc(xc_handle, cputopo,
-                                        sizeof(*cputopo) * info.num_cpus);
+    cputopo = calloc(max_cpus, sizeof(*cputopo));
     if ( cputopo == NULL )
     {
+	rc = ENOMEM;
 	fprintf(stderr, "failed to allocate hypercall buffers\n");
 	goto out;
     }
 
-    set_xen_guest_handle(info.cputopo, cputopo);
-    if ( xc_cputopoinfo(xc_handle, &info) )
+    if ( xc_cputopoinfo(xc_handle, &max_cpus, cputopo) )
     {
         rc = errno;
         fprintf(stderr, "Cannot get Xen CPU topology (%d - %s)\n",
@@ -996,7 +989,7 @@ void cpu_topology_func(int argc, char *argv[])
     }
 
     printf("CPU\tcore\tsocket\tnode\n");
-    for ( i = 0; i < info.num_cpus; i++ )
+    for ( i = 0; i < max_cpus; i++ )
     {
         if ( cputopo[i].core == XEN_INVALID_CORE_ID )
             continue;
@@ -1005,7 +998,7 @@ void cpu_topology_func(int argc, char *argv[])
     }
     rc = 0;
 out:
-    xc_hypercall_buffer_free(xc_handle, cputopo);
+    free(cputopo);
     if ( rc )
         exit(rc);
 }
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index ba66d55..21ba57d 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -1220,30 +1220,26 @@ static PyObject *pyxc_getcpuinfo(XcObject *self, PyObject *args, PyObject *kwds)
 
 static PyObject *pyxc_topologyinfo(XcObject *self)
 {
-    xc_cputopoinfo_t tinfo = { 0 };
-    unsigned i;
+    xc_cputopo_t *cputopo = NULL;
+    unsigned i, num_cpus;
     PyObject *ret_obj = NULL;
     PyObject *cpu_to_core_obj, *cpu_to_socket_obj, *cpu_to_node_obj;
-    DECLARE_HYPERCALL_BUFFER(xen_sysctl_cputopo_t, cputopo);
 
-    set_xen_guest_handle(tinfo.cputopo, HYPERCALL_BUFFER_NULL);
-    if ( xc_cputopoinfo(self->xc_handle, &tinfo) != 0 )
+    if ( xc_cputopoinfo(self->xc_handle, &num_cpus, NULL) != 0 )
         goto out;
 
-    cputopo = xc_hypercall_buffer_alloc(self->xc_handle, cputopo,
-                                        sizeof(*cputopo) * tinfo.num_cpus);
+    cputopo = calloc(num_cpus, sizeof(*cputopo));
     if ( cputopo == NULL )
     	goto out;
 
-    set_xen_guest_handle(tinfo.cputopo, cputopo);
-    if ( xc_cputopoinfo(self->xc_handle, &tinfo) != 0 )
+    if ( xc_cputopoinfo(self->xc_handle, &num_cpus, cputopo) != 0 )
         goto out;
 
     /* Construct cpu-to-* lists. */
     cpu_to_core_obj = PyList_New(0);
     cpu_to_socket_obj = PyList_New(0);
     cpu_to_node_obj = PyList_New(0);
-    for ( i = 0; i < tinfo.num_cpus; i++ )
+    for ( i = 0; i < num_cpus; i++ )
     {
         if ( cputopo[i].core == XEN_INVALID_CORE_ID )
         {
@@ -1279,7 +1275,7 @@ static PyObject *pyxc_topologyinfo(XcObject *self)
         }
     }
 
-    ret_obj = Py_BuildValue("{s:i}", "max_cpu_index", tinfo.num_cpus + 1);
+    ret_obj = Py_BuildValue("{s:i}", "max_cpu_index", num_cpus + 1);
 
     PyDict_SetItemString(ret_obj, "cpu_to_core", cpu_to_core_obj);
     Py_DECREF(cpu_to_core_obj);
@@ -1291,7 +1287,7 @@ static PyObject *pyxc_topologyinfo(XcObject *self)
     Py_DECREF(cpu_to_node_obj);
 
 out:
-    xc_hypercall_buffer_free(self->xc_handle, cputopo);
+    free(cputopo);
     return ret_obj ? ret_obj : pyxc_error_to_exception(self->xc_handle);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 4/5] libxl/libxc: Move libxl_get_numainfo()'s hypercall buffer management to libxc
  2015-04-17 16:59 [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup) Boris Ostrovsky
                   ` (2 preceding siblings ...)
  2015-04-17 16:59 ` [PATCH v7 3/5] libxl/libxc: Move libxl_get_cpu_topology()'s hypercall buffer management to libxc Boris Ostrovsky
@ 2015-04-17 16:59 ` Boris Ostrovsky
  2015-04-17 16:59 ` [PATCH v7 5/5] libxl: Add interface for querying hypervisor about PCI topology Boris Ostrovsky
  4 siblings, 0 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-17 16:59 UTC (permalink / raw)
  To: ian.jackson, ian.campbell, wei.liu2, stefano.stabellini, dgdegra,
	jbeulich, andrew.cooper3, keir
  Cc: elena.ufimtseva, dario.faggioli, boris.ostrovsky, xen-devel

xc_numainfo() is not expected to be used on a hot path and therefore
hypercall buffer management can be pushed into libxc. This will simplify
life for callers.

Also update error logging macros.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---

Changes in v7:
* Dropped '!!meminfo ^ !!distance' test in xc_numainfo() since one non-NULL 
  argument is now allowed

 tools/libxc/include/xenctrl.h     |    4 ++-
 tools/libxc/xc_misc.c             |   30 ++++++++++++++++-----
 tools/libxl/libxl.c               |   51 ++++++++++++------------------------
 tools/python/xen/lowlevel/xc/xc.c |   38 ++++++++++-----------------
 4 files changed, 57 insertions(+), 66 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index b476fda..520a284 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1230,6 +1230,7 @@ int xc_send_debug_keys(xc_interface *xch, char *keys);
 typedef xen_sysctl_physinfo_t xc_physinfo_t;
 typedef xen_sysctl_cputopo_t xc_cputopo_t;
 typedef xen_sysctl_numainfo_t xc_numainfo_t;
+typedef xen_sysctl_meminfo_t xc_meminfo_t;
 
 typedef uint32_t xc_cpu_to_node_t;
 typedef uint32_t xc_cpu_to_socket_t;
@@ -1241,7 +1242,8 @@ typedef uint32_t xc_node_to_node_dist_t;
 int xc_physinfo(xc_interface *xch, xc_physinfo_t *info);
 int xc_cputopoinfo(xc_interface *xch, unsigned *max_cpus,
                    xc_cputopo_t *cputopo);
-int xc_numainfo(xc_interface *xch, xc_numainfo_t *info);
+int xc_numainfo(xc_interface *xch, unsigned *max_nodes,
+                xc_meminfo_t *meminfo, uint32_t *distance);
 
 int xc_sched_id(xc_interface *xch,
                 int *sched_id);
diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 630a86c..94b70b4 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -204,22 +204,38 @@ out:
     return ret;
 }
 
-int xc_numainfo(xc_interface *xch,
-                xc_numainfo_t *put_info)
+int xc_numainfo(xc_interface *xch, unsigned *max_nodes,
+                xc_meminfo_t *meminfo, uint32_t *distance)
 {
     int ret;
     DECLARE_SYSCTL;
+    DECLARE_HYPERCALL_BOUNCE(meminfo, *max_nodes * sizeof(*meminfo),
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+    DECLARE_HYPERCALL_BOUNCE(distance,
+                             *max_nodes * *max_nodes * sizeof(*distance),
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( (ret = xc_hypercall_bounce_pre(xch, meminfo)) )
+        goto out;
+    if ((ret = xc_hypercall_bounce_pre(xch, distance)) )
+        goto out;
+
+    sysctl.u.numainfo.num_nodes = *max_nodes;
+    set_xen_guest_handle(sysctl.u.numainfo.meminfo, meminfo);
+    set_xen_guest_handle(sysctl.u.numainfo.distance, distance);
 
     sysctl.cmd = XEN_SYSCTL_numainfo;
 
-    memcpy(&sysctl.u.numainfo, put_info, sizeof(*put_info));
+    if ( (ret = do_sysctl(xch, &sysctl)) != 0 )
+        goto out;
 
-    if ((ret = do_sysctl(xch, &sysctl)) != 0)
-        return ret;
+    *max_nodes = sysctl.u.numainfo.num_nodes;
 
-    memcpy(put_info, &sysctl.u.numainfo, sizeof(*put_info));
+out:
+    xc_hypercall_bounce_post(xch, meminfo);
+    xc_hypercall_bounce_post(xch, distance);
 
-    return 0;
+    return ret;
 }
 
 
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index f348afd..1abcbf1 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5142,61 +5142,44 @@ libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nb_cpu_out)
 libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
 {
     GC_INIT(ctx);
-    xc_numainfo_t ninfo;
-    DECLARE_HYPERCALL_BUFFER(xen_sysctl_meminfo_t, meminfo);
-    DECLARE_HYPERCALL_BUFFER(uint32_t, distance);
+    xc_meminfo_t *meminfo;
+    uint32_t *distance;
     libxl_numainfo *ret = NULL;
     int i, j;
+    unsigned num_nodes;
 
-    set_xen_guest_handle(ninfo.meminfo, HYPERCALL_BUFFER_NULL);
-    set_xen_guest_handle(ninfo.distance, HYPERCALL_BUFFER_NULL);
-    if (xc_numainfo(ctx->xch, &ninfo) != 0) {
-        LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of NODES");
-        ret = NULL;
+    if (xc_numainfo(ctx->xch, &num_nodes, NULL, NULL)) {
+        LOGEV(ERROR, errno, "Unable to determine number of nodes");
         goto out;
     }
 
-    meminfo = xc_hypercall_buffer_alloc(ctx->xch, meminfo,
-                                        sizeof(*meminfo) * ninfo.num_nodes);
-    distance = xc_hypercall_buffer_alloc(ctx->xch, distance,
-                                         sizeof(*distance) *
-                                         ninfo.num_nodes * ninfo.num_nodes);
-    if ((meminfo == NULL) || (distance == NULL)) {
-        LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM,
-                            "Unable to allocate hypercall arguments");
-        goto fail;
-    }
+    meminfo = libxl__zalloc(gc, sizeof(*meminfo) * num_nodes);
+    distance = libxl__zalloc(gc, sizeof(*distance) * num_nodes * num_nodes);
 
-    set_xen_guest_handle(ninfo.meminfo, meminfo);
-    set_xen_guest_handle(ninfo.distance, distance);
-    if (xc_numainfo(ctx->xch, &ninfo) != 0) {
-        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting numainfo");
-        goto fail;
+    if (xc_numainfo(ctx->xch, &num_nodes, meminfo, distance)) {
+        LOGEV(ERROR, errno, "getting numainfo");
+        goto out;
     }
 
-    *nr = ninfo.num_nodes;
+    *nr = num_nodes;
 
-    ret = libxl__zalloc(NOGC, sizeof(libxl_numainfo) * ninfo.num_nodes);
-    for (i = 0; i < ninfo.num_nodes; i++)
-        ret[i].dists = libxl__calloc(NOGC, ninfo.num_nodes, sizeof(*distance));
+    ret = libxl__zalloc(NOGC, sizeof(libxl_numainfo) * num_nodes);
+    for (i = 0; i < num_nodes; i++)
+        ret[i].dists = libxl__calloc(NOGC, num_nodes, sizeof(*distance));
 
-    for (i = 0; i < ninfo.num_nodes; i++) {
+    for (i = 0; i < num_nodes; i++) {
 #define V(val, invalid) (val == invalid) ? \
        LIBXL_NUMAINFO_INVALID_ENTRY : val
         ret[i].size = V(meminfo[i].memsize, XEN_INVALID_MEM_SZ);
         ret[i].free = V(meminfo[i].memfree, XEN_INVALID_MEM_SZ);
-        ret[i].num_dists = ninfo.num_nodes;
+        ret[i].num_dists = num_nodes;
         for (j = 0; j < ret[i].num_dists; j++) {
-            unsigned idx = i * ninfo.num_nodes + j;
+            unsigned idx = i * num_nodes + j;
             ret[i].dists[j] = V(distance[idx], XEN_INVALID_NODE_DIST);
         }
 #undef V
     }
 
- fail:
-    xc_hypercall_buffer_free(ctx->xch, meminfo);
-    xc_hypercall_buffer_free(ctx->xch, distance);
-
  out:
     GC_FREE;
     return ret;
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index 21ba57d..fbd93db 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -1293,33 +1293,23 @@ out:
 
 static PyObject *pyxc_numainfo(XcObject *self)
 {
-    xc_numainfo_t ninfo = { 0 };
-    unsigned i, j;
+    unsigned i, j, num_nodes;
     uint64_t free_heap;
     PyObject *ret_obj = NULL, *node_to_node_dist_list_obj;
     PyObject *node_to_memsize_obj, *node_to_memfree_obj;
     PyObject *node_to_dma32_mem_obj, *node_to_node_dist_obj;
-    DECLARE_HYPERCALL_BUFFER(xen_sysctl_meminfo_t, meminfo);
-    DECLARE_HYPERCALL_BUFFER(uint32_t, distance);
+    xc_meminfo_t *meminfo = NULL;
+    uint32_t *distance = NULL;
 
-    set_xen_guest_handle(ninfo.meminfo, HYPERCALL_BUFFER_NULL);
-    set_xen_guest_handle(ninfo.distance, HYPERCALL_BUFFER_NULL);
-    if ( xc_numainfo(self->xc_handle, &ninfo) != 0 )
+    if ( xc_numainfo(self->xc_handle, &num_nodes, NULL, NULL) != 0 )
         goto out;
 
-    meminfo = xc_hypercall_buffer_alloc(self->xc_handle, meminfo,
-                                        sizeof(*meminfo) * ninfo.num_nodes);
-    if ( meminfo == NULL )
-        goto out;
-    distance = xc_hypercall_buffer_alloc(self->xc_handle, distance,
-                                         sizeof(*distance) *
-                                         ninfo.num_nodes * ninfo.num_nodes);
-    if ( distance == NULL )
+    meminfo = calloc(num_nodes, sizeof(*meminfo));
+    distance = calloc(num_nodes * num_nodes, sizeof(*distance));
+    if ( (meminfo == NULL) || (distance == NULL) )
         goto out;
 
-    set_xen_guest_handle(ninfo.meminfo, meminfo);
-    set_xen_guest_handle(ninfo.distance, distance);
-    if ( xc_numainfo(self->xc_handle, &ninfo) != 0 )
+    if ( xc_numainfo(self->xc_handle, &num_nodes, meminfo, distance) != 0 )
         goto out;
 
     /* Construct node-to-* lists. */
@@ -1327,7 +1317,7 @@ static PyObject *pyxc_numainfo(XcObject *self)
     node_to_memfree_obj = PyList_New(0);
     node_to_dma32_mem_obj = PyList_New(0);
     node_to_node_dist_list_obj = PyList_New(0);
-    for ( i = 0; i < ninfo.num_nodes; i++ )
+    for ( i = 0; i < num_nodes; i++ )
     {
         PyObject *pyint;
         unsigned invalid_node;
@@ -1351,9 +1341,9 @@ static PyObject *pyxc_numainfo(XcObject *self)
         /* Node to Node Distance */
         node_to_node_dist_obj = PyList_New(0);
         invalid_node = (meminfo[i].memsize == XEN_INVALID_MEM_SZ);
-        for ( j = 0; j < ninfo.num_nodes; j++ )
+        for ( j = 0; j < num_nodes; j++ )
         {
-            uint32_t dist = distance[i * ninfo.num_nodes + j];
+            uint32_t dist = distance[i * num_nodes + j];
             if ( invalid_node || (dist == XEN_INVALID_NODE_DIST) )
             {
                 PyList_Append(node_to_node_dist_obj, Py_None);
@@ -1369,7 +1359,7 @@ static PyObject *pyxc_numainfo(XcObject *self)
         Py_DECREF(node_to_node_dist_obj);
     }
 
-    ret_obj = Py_BuildValue("{s:i}", "max_node_index", ninfo.num_nodes + 1);
+    ret_obj = Py_BuildValue("{s:i}", "max_node_index", num_nodes + 1);
 
     PyDict_SetItemString(ret_obj, "node_memsize", node_to_memsize_obj);
     Py_DECREF(node_to_memsize_obj);
@@ -1385,8 +1375,8 @@ static PyObject *pyxc_numainfo(XcObject *self)
     Py_DECREF(node_to_node_dist_list_obj);
 
 out:
-    xc_hypercall_buffer_free(self->xc_handle, meminfo);
-    xc_hypercall_buffer_free(self->xc_handle, distance);
+    free(meminfo);
+    free(distance);
     return ret_obj ? ret_obj : pyxc_error_to_exception(self->xc_handle);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 5/5] libxl: Add interface for querying hypervisor about PCI topology
  2015-04-17 16:59 [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup) Boris Ostrovsky
                   ` (3 preceding siblings ...)
  2015-04-17 16:59 ` [PATCH v7 4/5] libxl/libxc: Move libxl_get_numainfo()'s " Boris Ostrovsky
@ 2015-04-17 16:59 ` Boris Ostrovsky
  2015-04-21 11:27   ` Ian Campbell
  4 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-17 16:59 UTC (permalink / raw)
  To: ian.jackson, ian.campbell, wei.liu2, stefano.stabellini, dgdegra,
	jbeulich, andrew.cooper3, keir
  Cc: elena.ufimtseva, dario.faggioli, boris.ostrovsky, xen-devel

.. and use this new interface to display it along with CPU topology
and NUMA information when 'xl info -n' command is issued

The output will look like
...
cpu_topology           :
cpu:    core    socket     node
  0:       0        0        0
...
device topology        :
device           node
0000:00:00.0      0
0000:00:01.0      0
...

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---

Changes in v7:
* Replaced LOG with LOGE in libxl_get_pci_topology()
* Replaced LOGE with LOG and set errno in libxl__pci_topology_init()
* Test correct pointer for non-NULL in output_topologyinfo() (pciinfo, not
  cpuinfo)


 tools/libxc/include/xenctrl.h |    3 ++
 tools/libxc/xc_misc.c         |   44 +++++++++++++++++++++++++
 tools/libxl/libxl.c           |   42 ++++++++++++++++++++++++
 tools/libxl/libxl.h           |   12 +++++++
 tools/libxl/libxl_freebsd.c   |   12 +++++++
 tools/libxl/libxl_internal.h  |    5 +++
 tools/libxl/libxl_linux.c     |   70 +++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_netbsd.c    |   12 +++++++
 tools/libxl/libxl_types.idl   |    7 ++++
 tools/libxl/libxl_utils.c     |    8 +++++
 tools/libxl/xl_cmdimpl.c      |   40 +++++++++++++++++++----
 11 files changed, 248 insertions(+), 7 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 520a284..2e7787a 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1231,6 +1231,7 @@ typedef xen_sysctl_physinfo_t xc_physinfo_t;
 typedef xen_sysctl_cputopo_t xc_cputopo_t;
 typedef xen_sysctl_numainfo_t xc_numainfo_t;
 typedef xen_sysctl_meminfo_t xc_meminfo_t;
+typedef xen_sysctl_pcitopoinfo_t xc_pcitopoinfo_t;
 
 typedef uint32_t xc_cpu_to_node_t;
 typedef uint32_t xc_cpu_to_socket_t;
@@ -1244,6 +1245,8 @@ int xc_cputopoinfo(xc_interface *xch, unsigned *max_cpus,
                    xc_cputopo_t *cputopo);
 int xc_numainfo(xc_interface *xch, unsigned *max_nodes,
                 xc_meminfo_t *meminfo, uint32_t *distance);
+int xc_pcitopoinfo(xc_interface *xch, unsigned num_devs,
+                   physdev_pci_device_t *devs, uint32_t *nodes);
 
 int xc_sched_id(xc_interface *xch,
                 int *sched_id);
diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 94b70b4..3a3e366 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -238,6 +238,50 @@ out:
     return ret;
 }
 
+int xc_pcitopoinfo(xc_interface *xch, unsigned num_devs,
+                   physdev_pci_device_t *devs,
+                   uint32_t *nodes)
+{
+    int ret = 0;
+    DECLARE_SYSCTL;
+    DECLARE_HYPERCALL_BOUNCE(devs, num_devs * sizeof(*devs),
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+    DECLARE_HYPERCALL_BOUNCE(nodes, num_devs* sizeof(*nodes),
+                             XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+
+    if ( (ret = xc_hypercall_bounce_pre(xch, devs)) )
+        goto out;
+    if ( (ret = xc_hypercall_bounce_pre(xch, nodes)) )
+        goto out;
+
+    sysctl.u.pcitopoinfo.first_dev = 0;
+    sysctl.u.pcitopoinfo.num_devs = num_devs;
+    set_xen_guest_handle(sysctl.u.pcitopoinfo.devs, devs);
+    set_xen_guest_handle(sysctl.u.pcitopoinfo.nodes, nodes);
+
+    sysctl.cmd = XEN_SYSCTL_pcitopoinfo;
+
+    while ( sysctl.u.pcitopoinfo.first_dev < num_devs )
+    {
+        if ( (ret = do_sysctl(xch, &sysctl)) != 0 )
+        {
+            /*
+             * node[] is set to XEN_INVALID_NODE_ID for invalid devices,
+             * we can just skip those entries.
+             */
+            if ( errno == ENODEV )
+                errno = ret = 0;
+            else
+                break;
+        }
+    }
+
+ out:
+    xc_hypercall_bounce_post(xch, devs);
+    xc_hypercall_bounce_post(xch, nodes);
+
+    return ret;
+}
 
 int xc_sched_id(xc_interface *xch,
                 int *sched_id)
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 1abcbf1..86aff8e 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5139,6 +5139,48 @@ libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nb_cpu_out)
     return ret;
 }
 
+libxl_pcitopology *libxl_get_pci_topology(libxl_ctx *ctx, int *num_devs)
+{
+    GC_INIT(ctx);
+    physdev_pci_device_t *devs;
+    uint32_t *nodes;
+    libxl_pcitopology *ret = NULL;
+    int i;
+
+    *num_devs = libxl__pci_numdevs(gc);
+    if (*num_devs < 0) {
+        LOGE(ERROR, "Unable to determine number of PCI devices");
+        goto out;
+    }
+
+    devs = libxl__zalloc(gc, sizeof(*devs) * *num_devs);
+    nodes = libxl__zalloc(gc, sizeof(*nodes) * *num_devs);
+
+    if (libxl__pci_topology_init(gc, devs, *num_devs)) {
+        LOGE(ERROR, "Cannot initialize PCI hypercall structure");
+        goto out;
+    }
+
+    if (xc_pcitopoinfo(ctx->xch, *num_devs, devs, nodes) != 0) {
+        LOGE(ERROR, "PCI topology info hypercall failed");
+        goto out;
+    }
+
+    ret = libxl__zalloc(NOGC, sizeof(libxl_pcitopology) * *num_devs);
+
+    for (i = 0; i < *num_devs; i++) {
+        ret[i].seg = devs[i].seg;
+        ret[i].bus = devs[i].bus;
+        ret[i].devfn = devs[i].devfn;
+        ret[i].node = (nodes[i] == XEN_INVALID_NODE_ID) ?
+            LIBXL_PCITOPOLOGY_INVALID_ENTRY : nodes[i];
+    }
+
+ out:
+    GC_FREE;
+    return ret;
+}
+
 libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 6bc75c5..f0db4d5 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -736,6 +736,14 @@ void libxl_mac_copy(libxl_ctx *ctx, libxl_mac *dst, libxl_mac *src);
 #define LIBXL_HAVE_PSR_MBM 1
 #endif
 
+/*
+ * LIBXL_HAVE_PCITOPOLOGY
+ *
+ * If this is defined, then interface to query hypervisor about PCI device
+ * topology is available.
+ */
+#define LIBXL_HAVE_PCITOPOLOGY 1
+
 typedef char **libxl_string_list;
 void libxl_string_list_dispose(libxl_string_list *sl);
 int libxl_string_list_length(const libxl_string_list *sl);
@@ -1126,6 +1134,10 @@ void libxl_vminfo_list_free(libxl_vminfo *list, int nb_vm);
 libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nb_cpu_out);
 void libxl_cputopology_list_free(libxl_cputopology *, int nb_cpu);
 
+#define LIBXL_PCITOPOLOGY_INVALID_ENTRY (~(uint32_t)0)
+libxl_pcitopology *libxl_get_pci_topology(libxl_ctx *ctx, int *num_devs);
+void libxl_pcitopology_list_free(libxl_pcitopology *, int num_devs);
+
 #define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0)
 libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr);
 void libxl_numainfo_list_free(libxl_numainfo *, int nr);
diff --git a/tools/libxl/libxl_freebsd.c b/tools/libxl/libxl_freebsd.c
index e8b88b3..47c3391 100644
--- a/tools/libxl/libxl_freebsd.c
+++ b/tools/libxl/libxl_freebsd.c
@@ -131,3 +131,15 @@ libxl_device_model_version libxl__default_device_model(libxl__gc *gc)
 {
     return LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN;
 }
+
+int libxl__pci_numdevs(libxl__gc *gc)
+{
+    return ERROR_NI;
+}
+
+int libxl__pci_topology_init(libxl__gc *gc,
+                             physdev_pci_device_t *devs,
+                             int num_devs)
+{
+    return ERROR_NI;
+}
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 9c22309..bd7b508 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1193,6 +1193,11 @@ _hidden int libxl__try_phy_backend(mode_t st_mode);
 
 _hidden char *libxl__devid_to_localdev(libxl__gc *gc, int devid);
 
+_hidden int libxl__pci_numdevs(libxl__gc *gc);
+_hidden int libxl__pci_topology_init(libxl__gc *gc,
+                                     physdev_pci_device_t *devs,
+                                     int num_devs);
+
 /* from libxl_pci */
 
 _hidden int libxl__device_pci_add(libxl__gc *gc, uint32_t domid, libxl_device_pci *pcidev, int starting);
diff --git a/tools/libxl/libxl_linux.c b/tools/libxl/libxl_linux.c
index b51930c..6882d3a 100644
--- a/tools/libxl/libxl_linux.c
+++ b/tools/libxl/libxl_linux.c
@@ -279,3 +279,73 @@ libxl_device_model_version libxl__default_device_model(libxl__gc *gc)
 {
     return LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN;
 }
+
+int libxl__pci_numdevs(libxl__gc *gc)
+{
+    DIR *dir;
+    struct dirent *entry;
+    int num_devs = 0;
+
+    dir = opendir("/sys/bus/pci/devices");
+    if (!dir) {
+        LOGEV(ERROR, errno, "Cannot open /sys/bus/pci/devices");
+        return ERROR_FAIL;
+    }
+
+    while ((entry = readdir(dir))) {
+        if (entry->d_name[0] == '.')
+            continue;
+        num_devs++;
+    }
+    closedir(dir);
+
+    return num_devs;
+}
+
+int libxl__pci_topology_init(libxl__gc *gc,
+                             physdev_pci_device_t *devs,
+                             int num_devs)
+{
+
+    DIR *dir;
+    struct dirent *entry;
+    int i, err = 0;
+
+    dir = opendir("/sys/bus/pci/devices");
+    if (!dir) {
+        LOGEV(ERROR, errno, "Cannot open /sys/bus/pci/devices");
+        return ERROR_FAIL;
+    }
+
+    i = 0;
+    while ((entry = readdir(dir))) {
+        unsigned int dom, bus, dev, func;
+
+        if (entry->d_name[0] == '.')
+            continue;
+
+        if (i == num_devs) {
+            LOG(ERROR, "Too many devices");
+            err = ERROR_FAIL;
+            errno = -ENOSPC;
+            goto out;
+        }
+
+        if (sscanf(entry->d_name, "%x:%x:%x.%d", &dom, &bus, &dev, &func) < 4) {
+            LOGEV(ERROR, errno, "Error processing /sys/bus/pci/devices");
+            err = ERROR_FAIL;
+            goto out;
+        }
+
+        devs[i].seg = dom;
+        devs[i].bus = bus;
+        devs[i].devfn = ((dev & 0x1f) << 3) | (func & 7);
+
+        i++;
+    }
+
+ out:
+    closedir(dir);
+
+    return err;
+}
diff --git a/tools/libxl/libxl_netbsd.c b/tools/libxl/libxl_netbsd.c
index 898e160..a2a962e 100644
--- a/tools/libxl/libxl_netbsd.c
+++ b/tools/libxl/libxl_netbsd.c
@@ -95,3 +95,15 @@ libxl_device_model_version libxl__default_device_model(libxl__gc *gc)
 {
     return LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL;
 }
+
+int libxl__pci_numdevs(libxl__gc *gc)
+{
+    return ERROR_NI;
+}
+
+int libxl__pci_topology_init(libxl__gc *gc,
+                             physdev_pci_device_t *devs,
+                             int num_devs)
+{
+    return ERROR_NI;
+}
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 0866433..cd6f230 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -655,6 +655,13 @@ libxl_cputopology = Struct("cputopology", [
     ("node", uint32),
     ], dir=DIR_OUT)
 
+libxl_pcitopology = Struct("pcitopology", [
+    ("seg", uint16),
+    ("bus", uint8),
+    ("devfn", uint8),
+    ("node", uint32),
+    ], dir=DIR_OUT)
+
 libxl_sched_credit_params = Struct("sched_credit_params", [
     ("tslice_ms", integer),
     ("ratelimit_us", integer),
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
index 9053b27..9108b56 100644
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -882,6 +882,14 @@ void libxl_cputopology_list_free(libxl_cputopology *list, int nr)
     free(list);
 }
 
+void libxl_pcitopology_list_free(libxl_pcitopology *list, int nr)
+{
+    int i;
+    for (i = 0; i < nr; i++)
+        libxl_pcitopology_dispose(&list[i]);
+    free(list);
+}
+
 void libxl_numainfo_list_free(libxl_numainfo *list, int nr)
 {
     int i;
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 394b55d..5eaa083 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -5405,12 +5405,15 @@ static void output_numainfo(void)
 
 static void output_topologyinfo(void)
 {
-    libxl_cputopology *info;
+    libxl_cputopology *cpuinfo;
     int i, nr;
+    libxl_pcitopology *pciinfo;
+    int valid_devs;
 
-    info = libxl_get_cpu_topology(ctx, &nr);
-    if (info == NULL) {
-        fprintf(stderr, "libxl_get_topologyinfo failed.\n");
+
+    cpuinfo = libxl_get_cpu_topology(ctx, &nr);
+    if (cpuinfo == NULL) {
+        fprintf(stderr, "libxl_get_cpu_topology failed.\n");
         return;
     }
 
@@ -5418,12 +5421,35 @@ static void output_topologyinfo(void)
     printf("cpu:    core    socket     node\n");
 
     for (i = 0; i < nr; i++) {
-        if (info[i].core != LIBXL_CPUTOPOLOGY_INVALID_ENTRY)
+        if (cpuinfo[i].core != LIBXL_CPUTOPOLOGY_INVALID_ENTRY)
             printf("%3d:    %4d     %4d     %4d\n", i,
-                   info[i].core, info[i].socket, info[i].node);
+                   cpuinfo[i].core, cpuinfo[i].socket, cpuinfo[i].node);
+    }
+
+    libxl_cputopology_list_free(cpuinfo, nr);
+
+    pciinfo = libxl_get_pci_topology(ctx, &nr);
+    if (pciinfo == NULL) {
+        fprintf(stderr, "libxl_get_pci_topology failed.\n");
+        return;
     }
 
-    libxl_cputopology_list_free(info, nr);
+    printf("device topology        :\n");
+    printf("device           node\n");
+    for (i = 0; i < nr; i++) {
+        if (pciinfo[i].node != LIBXL_PCITOPOLOGY_INVALID_ENTRY) {
+            printf("%04x:%02x:%02x.%01x      %d\n", pciinfo[i].seg,
+                   pciinfo[i].bus,
+                   ((pciinfo[i].devfn >> 3) & 0x1f), (pciinfo[i].devfn & 7),
+                   pciinfo[i].node);
+            valid_devs++;
+        }
+    }
+
+    if (valid_devs == 0)
+        printf("No device topology data available\n");
+
+    libxl_pcitopology_list_free(pciinfo, nr);
 
     return;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient
  2015-04-17 16:59 ` [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient Boris Ostrovsky
@ 2015-04-17 17:14   ` Andrew Cooper
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-17 17:14 UTC (permalink / raw)
  To: Boris Ostrovsky, ian.jackson, ian.campbell, wei.liu2,
	stefano.stabellini, dgdegra, jbeulich, keir
  Cc: elena.ufimtseva, dario.faggioli, xen-devel

On 17/04/15 17:59, Boris Ostrovsky wrote:
> A number of changes to XEN_SYSCTL_numainfo interface:
>
> * Make sysctl NUMA topology query use fewer copies by combining some
>   fields into a single structure and copying distances for each node
>   in a single copy.
> * NULL meminfo and distance handles are a request for maximum number
>   of nodes (num_nodes). If those handles are valid and num_nodes is
>   is smaller than the number of nodes in the system then -ENOBUFS is
>   returned (and correct num_nodes is provided)
> * Instead of using max_node_index for passing number of nodes keep this
>   value in num_nodes: almost all uses of max_node_index required adding
>   or subtracting one to eventually get to number of nodes anyway.
> * Replace INVALID_NUMAINFO_ID with XEN_INVALID_MEM_SZ and add
>   XEN_INVALID_NODE_DIST.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-17 16:59 ` [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology Boris Ostrovsky
@ 2015-04-17 17:16   ` Andrew Cooper
  2015-04-21  7:01   ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-17 17:16 UTC (permalink / raw)
  To: Boris Ostrovsky, ian.jackson, ian.campbell, wei.liu2,
	stefano.stabellini, dgdegra, jbeulich, keir
  Cc: elena.ufimtseva, dario.faggioli, xen-devel

On 17/04/15 17:59, Boris Ostrovsky wrote:
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-17 16:59 ` [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology Boris Ostrovsky
  2015-04-17 17:16   ` Andrew Cooper
@ 2015-04-21  7:01   ` Jan Beulich
  2015-04-21 12:56     ` Boris Ostrovsky
  2015-04-23 22:20     ` Boris Ostrovsky
  1 sibling, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2015-04-21  7:01 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	andrew.cooper3, dario.faggioli, ian.jackson, xen-devel, keir,
	dgdegra

>>> On 17.04.15 at 18:59, <boris.ostrovsky@oracle.com> wrote:
> Changes in v7:
> * Break from the loop when -ENODEV is encountered

This seems pretty inefficient for the caller. Returning a "bad"
identifier other than XEN_INVALID_NODE_ID would seem better
to me.

> --- a/docs/misc/xsm-flask.txt
> +++ b/docs/misc/xsm-flask.txt
> @@ -121,6 +121,7 @@ __HYPERVISOR_sysctl (xen/include/public/sysctl.h)
>   * XEN_SYSCTL_cpupool_op
>   * XEN_SYSCTL_scheduler_op
>   * XEN_SYSCTL_coverage_op
> + * XEN_SYSCTL_pcitopoinfo

No additions to this list are permitted. Either the new sub-op is
disaggregation safe (which it looks to be), or it can't be accepted.

> --- a/xen/common/sysctl.c
> +++ b/xen/common/sysctl.c
> @@ -411,6 +411,65 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>          break;
>  #endif
>  
> +#ifdef HAS_PCI
> +    case XEN_SYSCTL_pcitopoinfo:
> +    {
> +        xen_sysctl_pcitopoinfo_t *ti = &op->u.pcitopoinfo;
> +        unsigned dev_cnt = 0;
> +
> +        if ( guest_handle_is_null(ti->devs) ||
> +             guest_handle_is_null(ti->nodes) ||
> +             (ti->first_dev > ti->num_devs) )
> +        {
> +            ret = -EINVAL;
> +            break;
> +        }
> +
> +        while ( ti->first_dev < ti->num_devs )
> +        {
> +            physdev_pci_device_t dev;
> +            uint32_t node;
> +            struct pci_dev *pdev;
> +
> +            if ( copy_from_guest_offset(&dev, ti->devs, ti->first_dev, 1) )
> +            {
> +                ret = -EFAULT;
> +                break;
> +            }
> +
> +            spin_lock(&pcidevs_lock);
> +            pdev = pci_get_pdev(dev.seg, dev.bus, dev.devfn);
> +            if ( !pdev )
> +            {
> +                ret = -ENODEV;
> +                node = XEN_INVALID_NODE_ID;
> +            }
> +            else if ( pdev->node == NUMA_NO_NODE )
> +                node = XEN_INVALID_NODE_ID;
> +            else
> +                node = pdev->node;
> +            spin_unlock(&pcidevs_lock);
> +
> +            if ( copy_to_guest_offset(ti->nodes, ti->first_dev, &node, 1) )
> +            {
> +                ret = -EFAULT;
> +                break;
> +            }
> +
> +            ti->first_dev++;
> +
> +            if ( (ret == -ENODEV) ||

If despite the earlier comment you want to stay with this model, I think
you should keep first_dev pointing at the bad slot (but also see below),
leaving it to the caller to increment past it (consider this being function
0 and the next slots being the other functions - in that case it's clear
that the caller would want to skip more than just this one slot).

> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
> +                break;
> +        }
> +
> +        if ( (!ret || (ret == -ENODEV)) &&
> +             __copy_field_to_guest(u_sysctl, op, u.pcitopoinfo.first_dev) )
> +            ret = -EFAULT;
> +    }
> +    break;
> +#endif

With the continuation-less model now used I don't think it makes
sense to have first_dev and num_devs - for re-invocation all the
caller needs to do is increment the buffer pointer suitably. I.e.
you can get away with just a count of devices afaict.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 5/5] libxl: Add interface for querying hypervisor about PCI topology
  2015-04-17 16:59 ` [PATCH v7 5/5] libxl: Add interface for querying hypervisor about PCI topology Boris Ostrovsky
@ 2015-04-21 11:27   ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-04-21 11:27 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: elena.ufimtseva, wei.liu2, stefano.stabellini, andrew.cooper3,
	dario.faggioli, ian.jackson, xen-devel, jbeulich, keir, dgdegra

On Fri, 2015-04-17 at 12:59 -0400, Boris Ostrovsky wrote:
> +int libxl__pci_numdevs(libxl__gc *gc)
> +{
> +    DIR *dir;
> +    struct dirent *entry;
> +    int num_devs = 0;
> +
> +    dir = opendir("/sys/bus/pci/devices");
> +    if (!dir) {
> +        LOGEV(ERROR, errno, "Cannot open /sys/bus/pci/devices");

LOGEV is only useful if you want to log an errnoval which isn't actually
in errno, so here you could just use LOGE.

> +    dir = opendir("/sys/bus/pci/devices");
> +    if (!dir) {
> +        LOGEV(ERROR, errno, "Cannot open /sys/bus/pci/devices");

And again.
[...]
> +        if (sscanf(entry->d_name, "%x:%x:%x.%d", &dom, &bus, &dev, &func) < 4) {
> +            LOGEV(ERROR, errno, "Error processing /sys/bus/pci/devices");

Again.

With all those fixed: Acked-by: Ian Campbell <ian.campbell@citrix.com>

Ian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-21  7:01   ` Jan Beulich
@ 2015-04-21 12:56     ` Boris Ostrovsky
  2015-04-21 13:14       ` Andrew Cooper
  2015-04-21 13:18       ` Jan Beulich
  2015-04-23 22:20     ` Boris Ostrovsky
  1 sibling, 2 replies; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-21 12:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	andrew.cooper3, dario.faggioli, ian.jackson, xen-devel, keir,
	dgdegra


On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>> On 17.04.15 at 18:59, <boris.ostrovsky@oracle.com> wrote:
>> Changes in v7:
>> * Break from the loop when -ENODEV is encountered
> This seems pretty inefficient for the caller. Returning a "bad"
> identifier other than XEN_INVALID_NODE_ID would seem better
> to me.

That would mean that there is really no reason to return -ENODEV, which 
I thought you wanted to see. In that case (i.e. if no error needs to be 
returned) yes, a new token would make things simpler.

>
>> --- a/docs/misc/xsm-flask.txt
>> +++ b/docs/misc/xsm-flask.txt
>> @@ -121,6 +121,7 @@ __HYPERVISOR_sysctl (xen/include/public/sysctl.h)
>>    * XEN_SYSCTL_cpupool_op
>>    * XEN_SYSCTL_scheduler_op
>>    * XEN_SYSCTL_coverage_op
>> + * XEN_SYSCTL_pcitopoinfo
> No additions to this list are permitted. Either the new sub-op is
> disaggregation safe (which it looks to be), or it can't be accepted.

True, it *is* safe, but why then cputopoinfo and numainfo are in this 
list? They look to be safe as well.


-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-21 13:14       ` Andrew Cooper
@ 2015-04-21 13:13         ` Boris Ostrovsky
  2015-04-21 13:30           ` Andrew Cooper
  0 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-21 13:13 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	dario.faggioli, ian.jackson, xen-devel, keir, dgdegra


On 04/21/2015 09:14 AM, Andrew Cooper wrote:
> On 21/04/15 13:56, Boris Ostrovsky wrote:
>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>>>> --- a/docs/misc/xsm-flask.txt
>>>>>> +++ b/docs/misc/xsm-flask.txt
>>>>>> @@ -121,6 +121,7 @@ __HYPERVISOR_sysctl (xen/include/public/sysctl.h)
>>>>>>     * XEN_SYSCTL_cpupool_op
>>>>>>     * XEN_SYSCTL_scheduler_op
>>>>>>     * XEN_SYSCTL_coverage_op
>>>>>> + * XEN_SYSCTL_pcitopoinfo
>>> No additions to this list are permitted. Either the new sub-op is
>>> disaggregation safe (which it looks to be), or it can't be accepted.
>> True, it *is* safe, but why then cputopoinfo and numainfo are in this
>> list? They look to be safe as well.
> This list includes "not yet audited".  It is quite likely that some
> entries in the list are safe.
>
> fwiw, neither cputopo nor numainfo are currently safe for very large
> systems.

Why would safety of an operation depend on system size? And how are 
these two unsafe? (Because maybe then PCI topology query is unsafe in 
the same manner)

-boris


-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-21 12:56     ` Boris Ostrovsky
@ 2015-04-21 13:14       ` Andrew Cooper
  2015-04-21 13:13         ` Boris Ostrovsky
  2015-04-21 13:18       ` Jan Beulich
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2015-04-21 13:14 UTC (permalink / raw)
  To: Boris Ostrovsky, Jan Beulich
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	dario.faggioli, ian.jackson, xen-devel, keir, dgdegra

On 21/04/15 13:56, Boris Ostrovsky wrote:
>
> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>>> On 17.04.15 at 18:59, <boris.ostrovsky@oracle.com> wrote:
>>> Changes in v7:
>>> * Break from the loop when -ENODEV is encountered
>> This seems pretty inefficient for the caller. Returning a "bad"
>> identifier other than XEN_INVALID_NODE_ID would seem better
>> to me.
>
> That would mean that there is really no reason to return -ENODEV,
> which I thought you wanted to see. In that case (i.e. if no error
> needs to be returned) yes, a new token would make things simpler.
>
>>
>>> --- a/docs/misc/xsm-flask.txt
>>> +++ b/docs/misc/xsm-flask.txt
>>> @@ -121,6 +121,7 @@ __HYPERVISOR_sysctl (xen/include/public/sysctl.h)
>>>    * XEN_SYSCTL_cpupool_op
>>>    * XEN_SYSCTL_scheduler_op
>>>    * XEN_SYSCTL_coverage_op
>>> + * XEN_SYSCTL_pcitopoinfo
>> No additions to this list are permitted. Either the new sub-op is
>> disaggregation safe (which it looks to be), or it can't be accepted.
>
> True, it *is* safe, but why then cputopoinfo and numainfo are in this
> list? They look to be safe as well.

This list includes "not yet audited".  It is quite likely that some
entries in the list are safe.

fwiw, neither cputopo nor numainfo are currently safe for very large
systems.

~Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-21 12:56     ` Boris Ostrovsky
  2015-04-21 13:14       ` Andrew Cooper
@ 2015-04-21 13:18       ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2015-04-21 13:18 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	andrew.cooper3, dario.faggioli, ian.jackson, xen-devel, keir,
	dgdegra

>>> On 21.04.15 at 14:56, <boris.ostrovsky@oracle.com> wrote:

> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>>> On 17.04.15 at 18:59, <boris.ostrovsky@oracle.com> wrote:
>>> Changes in v7:
>>> * Break from the loop when -ENODEV is encountered
>> This seems pretty inefficient for the caller. Returning a "bad"
>> identifier other than XEN_INVALID_NODE_ID would seem better
>> to me.
> 
> That would mean that there is really no reason to return -ENODEV, which 
> I thought you wanted to see. In that case (i.e. if no error needs to be 
> returned) yes, a new token would make things simpler.

All I asked for was to make the two cases (no device and device
with no known node) distinguishable.

>>> --- a/docs/misc/xsm-flask.txt
>>> +++ b/docs/misc/xsm-flask.txt
>>> @@ -121,6 +121,7 @@ __HYPERVISOR_sysctl (xen/include/public/sysctl.h)
>>>    * XEN_SYSCTL_cpupool_op
>>>    * XEN_SYSCTL_scheduler_op
>>>    * XEN_SYSCTL_coverage_op
>>> + * XEN_SYSCTL_pcitopoinfo
>> No additions to this list are permitted. Either the new sub-op is
>> disaggregation safe (which it looks to be), or it can't be accepted.
> 
> True, it *is* safe, but why then cputopoinfo and numainfo are in this 
> list? They look to be safe as well.

Because at the time the list got composed we weren't able to
spend time looking at _all_ of the operations, and hence we
simply added them all (with a few exceptions on the domctl side
iirc). Quite likely the two you mention could be removed from
the list, but such needs to happen with proper reasoning in the
patch description. Feel free to submit a patch.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-21 13:13         ` Boris Ostrovsky
@ 2015-04-21 13:30           ` Andrew Cooper
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-21 13:30 UTC (permalink / raw)
  To: Boris Ostrovsky, Jan Beulich
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	dario.faggioli, ian.jackson, xen-devel, keir, dgdegra

On 21/04/15 14:13, Boris Ostrovsky wrote:
>
> On 04/21/2015 09:14 AM, Andrew Cooper wrote:
>> On 21/04/15 13:56, Boris Ostrovsky wrote:
>>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>>>>> --- a/docs/misc/xsm-flask.txt
>>>>>>> +++ b/docs/misc/xsm-flask.txt
>>>>>>> @@ -121,6 +121,7 @@ __HYPERVISOR_sysctl
>>>>>>> (xen/include/public/sysctl.h)
>>>>>>>     * XEN_SYSCTL_cpupool_op
>>>>>>>     * XEN_SYSCTL_scheduler_op
>>>>>>>     * XEN_SYSCTL_coverage_op
>>>>>>> + * XEN_SYSCTL_pcitopoinfo
>>>> No additions to this list are permitted. Either the new sub-op is
>>>> disaggregation safe (which it looks to be), or it can't be accepted.
>>> True, it *is* safe, but why then cputopoinfo and numainfo are in this
>>> list? They look to be safe as well.
>> This list includes "not yet audited".  It is quite likely that some
>> entries in the list are safe.
>>
>> fwiw, neither cputopo nor numainfo are currently safe for very large
>> systems.
>
> Why would safety of an operation depend on system size? And how are
> these two unsafe? (Because maybe then PCI topology query is unsafe in
> the same manner)

PCI topology has continuations in it.

cpu/numa info is bounded by host values, but does attempt to cover all
cpus/nodes in a single pass, which will get progressively longer as the
server gets larger.

~Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-21  7:01   ` Jan Beulich
  2015-04-21 12:56     ` Boris Ostrovsky
@ 2015-04-23 22:20     ` Boris Ostrovsky
  2015-04-24  7:19       ` Jan Beulich
  1 sibling, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-23 22:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	andrew.cooper3, dario.faggioli, ian.jackson, xen-devel, keir,
	dgdegra

On 04/21/2015 03:01 AM, Jan Beulich wrote:
>
>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
>> +                break;
>> +        }
>> +
>> +        if ( (!ret || (ret == -ENODEV)) &&
>> +             __copy_field_to_guest(u_sysctl, op, u.pcitopoinfo.first_dev) )
>> +            ret = -EFAULT;
>> +    }
>> +    break;
>> +#endif
> With the continuation-less model now used I don't think it makes
> sense to have first_dev and num_devs - for re-invocation all the
> caller needs to do is increment the buffer pointer suitably. I.e.
> you can get away with just a count of devices afaict.
>

This would require walking xc_hypercall_buffer_t->hbuf. Would something like

     set_xen_guest_handle_raw(sysctl..., (void 
*)HYPERCALL_BUFFER_AS_ARG(foo) + offset)

be acceptable? I don't think I see anything better.

I thought of adding set_xen_guest_handle_offset() that would look 
similar to set_xen_guest_handle() but then I felt that having this in 
API may not be a good idea since xc_hypercall_buffer_t->hbuf would end 
up pointing to memory that is not allocated for full 
xc_hypercall_buffer_t->sz.


-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-23 22:20     ` Boris Ostrovsky
@ 2015-04-24  7:19       ` Jan Beulich
  2015-04-24 14:09         ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2015-04-24  7:19 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	andrew.cooper3, dario.faggioli, ian.jackson, xen-devel, keir,
	dgdegra

>>> On 24.04.15 at 00:20, <boris.ostrovsky@oracle.com> wrote:
> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>
>>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
>>> +                break;
>>> +        }
>>> +
>>> +        if ( (!ret || (ret == -ENODEV)) &&
>>> +             __copy_field_to_guest(u_sysctl, op, u.pcitopoinfo.first_dev) )
>>> +            ret = -EFAULT;
>>> +    }
>>> +    break;
>>> +#endif
>> With the continuation-less model now used I don't think it makes
>> sense to have first_dev and num_devs - for re-invocation all the
>> caller needs to do is increment the buffer pointer suitably. I.e.
>> you can get away with just a count of devices afaict.
>>
> 
> This would require walking xc_hypercall_buffer_t->hbuf. Would something like
> 
>      set_xen_guest_handle_raw(sysctl..., (void *)HYPERCALL_BUFFER_AS_ARG(foo) + offset)
> 
> be acceptable? I don't think I see anything better.
> 
> I thought of adding set_xen_guest_handle_offset() that would look 
> similar to set_xen_guest_handle() but then I felt that having this in 
> API may not be a good idea since xc_hypercall_buffer_t->hbuf would end 
> up pointing to memory that is not allocated for full 
> xc_hypercall_buffer_t->sz.

There ought to be a way to create a guest handle from other than
the start of an allocated hypercall buffer, but that's really more a
question for the tool stack folks.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-24  7:19       ` Jan Beulich
@ 2015-04-24 14:09         ` Boris Ostrovsky
  2015-04-24 17:42           ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-24 14:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: elena.ufimtseva, wei.liu2, ian.campbell, stefano.stabellini,
	andrew.cooper3, dario.faggioli, ian.jackson, xen-devel, keir,
	dgdegra


On 04/24/2015 03:19 AM, Jan Beulich wrote:
>>>> On 24.04.15 at 00:20, <boris.ostrovsky@oracle.com> wrote:
>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
>>>> +                break;
>>>> +        }
>>>> +
>>>> +        if ( (!ret || (ret == -ENODEV)) &&
>>>> +             __copy_field_to_guest(u_sysctl, op, u.pcitopoinfo.first_dev) )
>>>> +            ret = -EFAULT;
>>>> +    }
>>>> +    break;
>>>> +#endif
>>> With the continuation-less model now used I don't think it makes
>>> sense to have first_dev and num_devs - for re-invocation all the
>>> caller needs to do is increment the buffer pointer suitably. I.e.
>>> you can get away with just a count of devices afaict.
>>>
>> This would require walking xc_hypercall_buffer_t->hbuf. Would something like
>>
>>       set_xen_guest_handle_raw(sysctl..., (void *)HYPERCALL_BUFFER_AS_ARG(foo) + offset)
>>
>> be acceptable? I don't think I see anything better.
>>
>> I thought of adding set_xen_guest_handle_offset() that would look
>> similar to set_xen_guest_handle() but then I felt that having this in
>> API may not be a good idea since xc_hypercall_buffer_t->hbuf would end
>> up pointing to memory that is not allocated for full
>> xc_hypercall_buffer_t->sz.
> There ought to be a way to create a guest handle from other than
> the start of an allocated hypercall buffer, but that's really more a
> question for the tool stack folks.

Yes, this was question for toolstack people.

(And my second paragraph was not stated correctly, now that I re-read 
it. I meant to say that my understanding is that API is expected to make 
all safety checks on buffers and with set_xen_guest_handle_offset() that 
I was picturing in my head we could pass in pretty much any pointer. I 
suppose we could check 'hbuf+offset < sz')


-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-24 14:09         ` Boris Ostrovsky
@ 2015-04-24 17:42           ` Boris Ostrovsky
  2015-05-01 13:56             ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-04-24 17:42 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, ian.jackson, stefano.stabellini
  Cc: elena.ufimtseva, keir, andrew.cooper3, dario.faggioli, xen-devel,
	Jan Beulich, dgdegra


On 04/24/2015 10:09 AM, Boris Ostrovsky wrote:
>
> On 04/24/2015 03:19 AM, Jan Beulich wrote:
>>>>> On 24.04.15 at 00:20, <boris.ostrovsky@oracle.com> wrote:
>>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
>>>>> +                break;
>>>>> +        }
>>>>> +
>>>>> +        if ( (!ret || (ret == -ENODEV)) &&
>>>>> +             __copy_field_to_guest(u_sysctl, op, 
>>>>> u.pcitopoinfo.first_dev) )
>>>>> +            ret = -EFAULT;
>>>>> +    }
>>>>> +    break;
>>>>> +#endif
>>>> With the continuation-less model now used I don't think it makes
>>>> sense to have first_dev and num_devs - for re-invocation all the
>>>> caller needs to do is increment the buffer pointer suitably. I.e.
>>>> you can get away with just a count of devices afaict.
>>>>
>>> This would require walking xc_hypercall_buffer_t->hbuf. Would 
>>> something like
>>>
>>>       set_xen_guest_handle_raw(sysctl..., (void 
>>> *)HYPERCALL_BUFFER_AS_ARG(foo) + offset)
>>>
>>> be acceptable? I don't think I see anything better.
>>>
>>> I thought of adding set_xen_guest_handle_offset() that would look
>>> similar to set_xen_guest_handle() but then I felt that having this in
>>> API may not be a good idea since xc_hypercall_buffer_t->hbuf would end
>>> up pointing to memory that is not allocated for full
>>> xc_hypercall_buffer_t->sz.
>> There ought to be a way to create a guest handle from other than
>> the start of an allocated hypercall buffer, but that's really more a
>> question for the tool stack folks.
>
> Yes, this was question for toolstack people.

(Adjusted TO/CC to reflect this).

>
> (And my second paragraph was not stated correctly, now that I re-read 
> it. I meant to say that my understanding is that API is expected to 
> make all safety checks on buffers and with 
> set_xen_guest_handle_offset() that I was picturing in my head we could 
> pass in pretty much any pointer. I suppose we could check 'hbuf+offset 
> < sz')
>
>
> -boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-04-24 17:42           ` Boris Ostrovsky
@ 2015-05-01 13:56             ` Ian Campbell
  2015-05-01 15:20               ` Boris Ostrovsky
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-05-01 13:56 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: elena.ufimtseva, wei.liu2, stefano.stabellini, andrew.cooper3,
	dario.faggioli, ian.jackson, xen-devel, Jan Beulich, keir,
	dgdegra

On Fri, 2015-04-24 at 13:42 -0400, Boris Ostrovsky wrote:
> On 04/24/2015 10:09 AM, Boris Ostrovsky wrote:
> >
> > On 04/24/2015 03:19 AM, Jan Beulich wrote:
> >>>>> On 24.04.15 at 00:20, <boris.ostrovsky@oracle.com> wrote:
> >>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
> >>>>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
> >>>>> +                break;
> >>>>> +        }
> >>>>> +
> >>>>> +        if ( (!ret || (ret == -ENODEV)) &&
> >>>>> +             __copy_field_to_guest(u_sysctl, op, 
> >>>>> u.pcitopoinfo.first_dev) )
> >>>>> +            ret = -EFAULT;
> >>>>> +    }
> >>>>> +    break;
> >>>>> +#endif
> >>>> With the continuation-less model now used I don't think it makes
> >>>> sense to have first_dev and num_devs - for re-invocation all the
> >>>> caller needs to do is increment the buffer pointer suitably. I.e.
> >>>> you can get away with just a count of devices afaict.
> >>>>
> >>> This would require walking xc_hypercall_buffer_t->hbuf. Would 
> >>> something like
> >>>
> >>>       set_xen_guest_handle_raw(sysctl..., (void 
> >>> *)HYPERCALL_BUFFER_AS_ARG(foo) + offset)
> >>>
> >>> be acceptable? I don't think I see anything better.
> >>>
> >>> I thought of adding set_xen_guest_handle_offset() that would look
> >>> similar to set_xen_guest_handle() but then I felt that having this in
> >>> API may not be a good idea since xc_hypercall_buffer_t->hbuf would end
> >>> up pointing to memory that is not allocated for full
> >>> xc_hypercall_buffer_t->sz.
> >> There ought to be a way to create a guest handle from other than
> >> the start of an allocated hypercall buffer, but that's really more a
> >> question for the tool stack folks.
> >
> > Yes, this was question for toolstack people.
> 
> (Adjusted TO/CC to reflect this).

Not sure if I've got the full context but would it be possible to
rearrange things such that libxc was bouncing/unbouncing only the slice
of the input which is needed for the invocation? Perhaps by pushing it
into whichever loop there is.

If not then I think set_xen_guest_handle_offset or perhaps
guest_handle_add_offset (in line with the xen side equivalent) would be
the way to go.

Under no circumstances should set_xen_guest_handle_raw be being used.

Ian.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-05-01 13:56             ` Ian Campbell
@ 2015-05-01 15:20               ` Boris Ostrovsky
  2015-05-01 15:31                 ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Boris Ostrovsky @ 2015-05-01 15:20 UTC (permalink / raw)
  To: Ian Campbell
  Cc: elena.ufimtseva, wei.liu2, stefano.stabellini, andrew.cooper3,
	dario.faggioli, ian.jackson, xen-devel, Jan Beulich, keir,
	dgdegra

On 05/01/2015 09:56 AM, Ian Campbell wrote:
> On Fri, 2015-04-24 at 13:42 -0400, Boris Ostrovsky wrote:
>> On 04/24/2015 10:09 AM, Boris Ostrovsky wrote:
>>> On 04/24/2015 03:19 AM, Jan Beulich wrote:
>>>>>>> On 24.04.15 at 00:20, <boris.ostrovsky@oracle.com> wrote:
>>>>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
>>>>>>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
>>>>>>> +                break;
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        if ( (!ret || (ret == -ENODEV)) &&
>>>>>>> +             __copy_field_to_guest(u_sysctl, op,
>>>>>>> u.pcitopoinfo.first_dev) )
>>>>>>> +            ret = -EFAULT;
>>>>>>> +    }
>>>>>>> +    break;
>>>>>>> +#endif
>>>>>> With the continuation-less model now used I don't think it makes
>>>>>> sense to have first_dev and num_devs - for re-invocation all the
>>>>>> caller needs to do is increment the buffer pointer suitably. I.e.
>>>>>> you can get away with just a count of devices afaict.
>>>>>>
>>>>> This would require walking xc_hypercall_buffer_t->hbuf. Would
>>>>> something like
>>>>>
>>>>>        set_xen_guest_handle_raw(sysctl..., (void
>>>>> *)HYPERCALL_BUFFER_AS_ARG(foo) + offset)
>>>>>
>>>>> be acceptable? I don't think I see anything better.
>>>>>
>>>>> I thought of adding set_xen_guest_handle_offset() that would look
>>>>> similar to set_xen_guest_handle() but then I felt that having this in
>>>>> API may not be a good idea since xc_hypercall_buffer_t->hbuf would end
>>>>> up pointing to memory that is not allocated for full
>>>>> xc_hypercall_buffer_t->sz.
>>>> There ought to be a way to create a guest handle from other than
>>>> the start of an allocated hypercall buffer, but that's really more a
>>>> question for the tool stack folks.
>>> Yes, this was question for toolstack people.
>> (Adjusted TO/CC to reflect this).
> Not sure if I've got the full context

The hypervisor only processes part of the buffer and returns back to the 
caller how many elements are done. The caller advances the buffer 
pointer and retries the hypercall.


>   but would it be possible to
> rearrange things such that libxc was bouncing/unbouncing only the slice
> of the input which is needed for the invocation? Perhaps by pushing it
> into whichever loop there is.

Oh yes, that is certainly possible. I am just trying to avoid doing this 
because of the overhead of copying buffers back and forth many times. 
The worst thing is that with current bounce APIs I'd need to bounce full 
buffer, even though I will only need to do so for a part of it.

In other words, imagine we have 1024 elements in the buffer originally 
and the hypervisor only completes 256 elements at a time. I will then 
need to bounce [0-1024], [256-1024], [512-1024] and then [768-1024] 
elements. Not 4 chunks of 256 element as one would want.

I looked at HYPERCALL_BOUNCE_SET_SIZE but I don't think it will help 
here since we only know "useful" (i.e. processed by the hypervisor) 
buffer size after the hypercall (i.e. for the unbounce path).

>
> If not then I think set_xen_guest_handle_offset or perhaps
> guest_handle_add_offset (in line with the xen side equivalent) would be
> the way to go.
>
> Under no circumstances should set_xen_guest_handle_raw be being used.

But it can be used inside the new set_xen_guest_handle_offset(), right?

-boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology
  2015-05-01 15:20               ` Boris Ostrovsky
@ 2015-05-01 15:31                 ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-05-01 15:31 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: elena.ufimtseva, wei.liu2, stefano.stabellini, andrew.cooper3,
	dario.faggioli, ian.jackson, xen-devel, Jan Beulich, keir,
	dgdegra

On Fri, 2015-05-01 at 11:20 -0400, Boris Ostrovsky wrote:
> On 05/01/2015 09:56 AM, Ian Campbell wrote:
> > On Fri, 2015-04-24 at 13:42 -0400, Boris Ostrovsky wrote:
> >> On 04/24/2015 10:09 AM, Boris Ostrovsky wrote:
> >>> On 04/24/2015 03:19 AM, Jan Beulich wrote:
> >>>>>>> On 24.04.15 at 00:20, <boris.ostrovsky@oracle.com> wrote:
> >>>>> On 04/21/2015 03:01 AM, Jan Beulich wrote:
> >>>>>>> +                 ((++dev_cnt > 0x3f) && hypercall_preempt_check()) )
> >>>>>>> +                break;
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        if ( (!ret || (ret == -ENODEV)) &&
> >>>>>>> +             __copy_field_to_guest(u_sysctl, op,
> >>>>>>> u.pcitopoinfo.first_dev) )
> >>>>>>> +            ret = -EFAULT;
> >>>>>>> +    }
> >>>>>>> +    break;
> >>>>>>> +#endif
> >>>>>> With the continuation-less model now used I don't think it makes
> >>>>>> sense to have first_dev and num_devs - for re-invocation all the
> >>>>>> caller needs to do is increment the buffer pointer suitably. I.e.
> >>>>>> you can get away with just a count of devices afaict.
> >>>>>>
> >>>>> This would require walking xc_hypercall_buffer_t->hbuf. Would
> >>>>> something like
> >>>>>
> >>>>>        set_xen_guest_handle_raw(sysctl..., (void
> >>>>> *)HYPERCALL_BUFFER_AS_ARG(foo) + offset)
> >>>>>
> >>>>> be acceptable? I don't think I see anything better.
> >>>>>
> >>>>> I thought of adding set_xen_guest_handle_offset() that would look
> >>>>> similar to set_xen_guest_handle() but then I felt that having this in
> >>>>> API may not be a good idea since xc_hypercall_buffer_t->hbuf would end
> >>>>> up pointing to memory that is not allocated for full
> >>>>> xc_hypercall_buffer_t->sz.
> >>>> There ought to be a way to create a guest handle from other than
> >>>> the start of an allocated hypercall buffer, but that's really more a
> >>>> question for the tool stack folks.
> >>> Yes, this was question for toolstack people.
> >> (Adjusted TO/CC to reflect this).
> > Not sure if I've got the full context
> 
> The hypervisor only processes part of the buffer and returns back to the 
> caller how many elements are done. The caller advances the buffer 
> pointer and retries the hypercall.
> 
> 
> >   but would it be possible to
> > rearrange things such that libxc was bouncing/unbouncing only the slice
> > of the input which is needed for the invocation? Perhaps by pushing it
> > into whichever loop there is.
> 
> Oh yes, that is certainly possible. I am just trying to avoid doing this 
> because of the overhead of copying buffers back and forth many times. 
> The worst thing is that with current bounce APIs I'd need to bounce full 
> buffer, even though I will only need to do so for a part of it.
> 
> In other words, imagine we have 1024 elements in the buffer originally 
> and the hypervisor only completes 256 elements at a time. I will then 
> need to bounce [0-1024], [256-1024], [512-1024] and then [768-1024] 
> elements. Not 4 chunks of 256 element as one would want.
> 
> I looked at HYPERCALL_BOUNCE_SET_SIZE but I don't think it will help 
> here since we only know "useful" (i.e. processed by the hypervisor) 
> buffer size after the hypercall (i.e. for the unbounce path).

Yes, doesn't sound like a good fit then.

> > If not then I think set_xen_guest_handle_offset or perhaps
> > guest_handle_add_offset (in line with the xen side equivalent) would be
> > the way to go.
> >
> > Under no circumstances should set_xen_guest_handle_raw be being used.
> 
> But it can be used inside the new set_xen_guest_handle_offset(), right?

Oh yes, of course, it's allowed inside the machinery itself, sorry I
didn't make that clear ;-)

Ian.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-05-01 15:31 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-17 16:59 [PATCH v7 0/5] Display IO topology when PXM data is available (plus some cleanup) Boris Ostrovsky
2015-04-17 16:59 ` [PATCH v7 1/5] sysctl: Make XEN_SYSCTL_numainfo a little more efficient Boris Ostrovsky
2015-04-17 17:14   ` Andrew Cooper
2015-04-17 16:59 ` [PATCH v7 2/5] sysctl: Add sysctl interface for querying PCI topology Boris Ostrovsky
2015-04-17 17:16   ` Andrew Cooper
2015-04-21  7:01   ` Jan Beulich
2015-04-21 12:56     ` Boris Ostrovsky
2015-04-21 13:14       ` Andrew Cooper
2015-04-21 13:13         ` Boris Ostrovsky
2015-04-21 13:30           ` Andrew Cooper
2015-04-21 13:18       ` Jan Beulich
2015-04-23 22:20     ` Boris Ostrovsky
2015-04-24  7:19       ` Jan Beulich
2015-04-24 14:09         ` Boris Ostrovsky
2015-04-24 17:42           ` Boris Ostrovsky
2015-05-01 13:56             ` Ian Campbell
2015-05-01 15:20               ` Boris Ostrovsky
2015-05-01 15:31                 ` Ian Campbell
2015-04-17 16:59 ` [PATCH v7 3/5] libxl/libxc: Move libxl_get_cpu_topology()'s hypercall buffer management to libxc Boris Ostrovsky
2015-04-17 16:59 ` [PATCH v7 4/5] libxl/libxc: Move libxl_get_numainfo()'s " Boris Ostrovsky
2015-04-17 16:59 ` [PATCH v7 5/5] libxl: Add interface for querying hypervisor about PCI topology Boris Ostrovsky
2015-04-21 11:27   ` Ian Campbell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.