xen-devel.lists.xenproject.org archive mirror
* [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
@ 2022-10-07 12:39 Matias Ezequiel Vara Larsen
  2022-10-07 12:39 ` [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type Matias Ezequiel Vara Larsen
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2022-10-07 12:39 UTC
  To: xen-devel
  Cc: Matias Ezequiel Vara Larsen, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD

Hello all,

The purpose of this RFC is to get feedback about a new acquire resource that
exposes vcpu statistics for a given domain. The current mechanism to get those
statistics is by querying the hypervisor. This mechanism relies on a hypercall
and holds the domctl spinlock during its execution. When a pv tool like xcp-rrdd
periodically samples these counters, it ends up affecting other paths that share
that spinlock. By using acquire resources, the pv tool only requires a few
hypercalls to set up the shared memory region, after which samples can be read
without issuing any further hypercalls. The original idea was suggested by
Andrew Cooper, with whom I have been discussing how to implement the current
PoC. You can find the RFC patch series at [1]. The series is rebased on top of
stable-4.15.

I am currently a bit blocked on 1) what to expose and 2) how to expose it. For
1), I decided to expose what xcp-rrdd is querying, i.e., what it currently
obtains via XEN_DOMCTL_getvcpuinfo. More precisely, xcp-rrdd reads
runstate.time[RUNSTATE_running], which is a uint64_t counter. However, the time
spent in other states may be interesting too.
Regarding 2), I am not sure if simply using an array of uint64_t is enough or if
a different interface should be exposed. The remaining question is when to get
new values. For the moment, I am updating this counter during
vcpu_runstate_change().

The current series includes a simple pv tool that shows how this new interface is
used. This tool maps the counter and periodically samples it.
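For reference, the version-based protocol between Xen (writer) and the pv tool
(reader) can be modeled in plain C as below. This is only an illustrative,
single-threaded sketch: the names are made up, and the real code issues
smp_wmb()/read barriers at the commented points.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Single-threaded model of the seqlock-like protocol used by the series.
 * The layout mirrors struct vcpu_stats; names are illustrative only.
 */
struct vcpu_stats_model {
    uint32_t version;              /* odd => update in progress */
    uint32_t pad0;
    uint64_t runstate_running_time;
};

/* Writer (hypervisor) side: make version odd, update, make it even again. */
static void stats_update(struct vcpu_stats_model *s, uint64_t value)
{
    s->version++;                  /* real code: smp_wmb() after this */
    s->runstate_running_time = value;
    s->version++;                  /* real code: smp_wmb() before this */
}

/* Reader (pv tool) side: retry if an update was in progress or raced. */
static uint64_t stats_read(const struct vcpu_stats_model *s)
{
    uint32_t version;
    uint64_t value;

    do {
        version = s->version;      /* real code: read barrier after this */
        value = s->runstate_running_time;
    } while ( (version & 1) || version != s->version );

    return value;
}
```

The reader retries whenever the version is odd or changed mid-read, so it
never observes a torn counter; the consumer loop in patch 2/2 follows the
same shape.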

Any feedback/help would be appreciated.

Thanks, Matias.

[1] https://github.com/MatiasVara/xen/tree/feature_stats

Changes in v2:
- rework to ensure that consumer fetches consistent data

Changes in v1:
- rework how the resource is allocated and released
- rework the allocation of the resource so that it only happens when the
  resource is requested
- rework the structure shared between the tool and Xen to make it extensible to
  new counters and declare it in a public header

There are still the following open questions:
   - Should the resource be released when there are no more readers?
     Otherwise, we keep updating it on a hot path.
   - One frame can host up to 512 vcpus. Should I check against this limit
     when updating? Should it be possible to allocate more than one frame for
     vcpu counters?

Matias Ezequiel Vara Larsen (2):
  xen/memory : Add a stats_table resource type
  tools/misc: Add xen-vcpus-stats tool

 tools/misc/Makefile          |  6 +++
 tools/misc/xen-vcpus-stats.c | 87 +++++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c       |  2 +
 xen/common/memory.c          | 94 ++++++++++++++++++++++++++++++++++++
 xen/common/sched/core.c      | 16 ++++++
 xen/include/public/memory.h  |  3 ++
 xen/include/public/vcpu.h    | 16 ++++++
 xen/include/xen/mm.h         |  2 +
 xen/include/xen/sched.h      |  5 ++
 9 files changed, 231 insertions(+)
 create mode 100644 tools/misc/xen-vcpus-stats.c

-- 
2.25.1




* [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-10-07 12:39 [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Matias Ezequiel Vara Larsen
@ 2022-10-07 12:39 ` Matias Ezequiel Vara Larsen
  2022-12-13 17:02   ` Jan Beulich
  2022-12-14  7:29   ` Jan Beulich
  2022-10-07 12:39 ` [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool Matias Ezequiel Vara Larsen
  2023-02-23 19:56 ` API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Andrew Cooper
  2 siblings, 2 replies; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2022-10-07 12:39 UTC
  To: xen-devel
  Cc: Matias Ezequiel Vara Larsen, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli

This commit proposes a new mechanism to query the RUNSTATE_running counter for
a given vcpu from a dom0 userspace application, by exposing that counter
through the acquire_resource interface. The current mechanism relies on
XEN_DOMCTL_getvcpuinfo, holds the single global domctl_lock for the entire
hypercall, and iterates over every vcpu in the system for every update, thus
impacting operations that share that lock.

This commit exposes vcpu RUNSTATE_running via the xenforeignmemory interface,
thus avoiding both the hypercall and the holding of the lock. For that
purpose, a new resource type named stats_table is added. The first frame of
this resource stores per-vcpu counters. The frame has one entry of type
struct vcpu_stats per vcpu. The allocation of this frame only happens if the
resource is requested. The frame is released when the domain is destroyed.

Note that this counter is updated on a hot path; thus, in this commit,
copying only happens when it is specifically required.

Note that the exposed structure is extensible in two ways. First, the structure
vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.
Second, new frames can be added in case new counters are required.

Signed-off-by: Matias Ezequiel Vara Larsen <matias.vara@vates.fr>
---
Changes in v2:
- rework to ensure that guest reads a coherent value by using a version
  number in the vcpu_stats structure
- add version to the vcpu_stats structure

Changes in v1:
- rework the allocation and releasing of the frames
- use the zero frame for per-vcpu counters that are listed as an array
- allocate vcpu stats frames only when the resource is requested
- rewrite commit message
- add the vcpu_stats structure to keep per-vcpu counters
- add the shared_vcpustatspage to keep an array of per-vcpu counters for a
  given domain
- declare the structures in a public header 
- define the vcpustats_page in the domain structure
---
 xen/arch/x86/hvm/hvm.c      |  2 +
 xen/common/memory.c         | 94 +++++++++++++++++++++++++++++++++++++
 xen/common/sched/core.c     | 16 +++++++
 xen/include/public/memory.h |  3 ++
 xen/include/public/vcpu.h   | 16 +++++++
 xen/include/xen/mm.h        |  2 +
 xen/include/xen/sched.h     |  5 ++
 7 files changed, 138 insertions(+)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index ddd001a6ad..1ef6cb5ff0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
 
     ioreq_server_destroy_all(d);
 
+    stats_free_vcpu_mfn(d);
+
     msixtbl_pt_cleanup(d);
 
     /* Stop all asynchronous timer actions. */
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 297b98a562..749486d5d4 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
     return nr;
 }
 
+unsigned int stats_table_max_frames(const struct domain *d)
+{
+    /* One frame per 512 vcpus. */
+    return 1;
+}
+
 /*
  * Return 0 on any kind of error.  Caller converts to -EINVAL.
  *
@@ -1099,6 +1105,9 @@ static unsigned int resource_max_frames(const struct domain *d,
     case XENMEM_resource_vmtrace_buf:
         return d->vmtrace_size >> PAGE_SHIFT;
 
+    case XENMEM_resource_stats_table:
+        return stats_table_max_frames(d);
+
     default:
         return -EOPNOTSUPP;
     }
@@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
     return nr_frames;
 }
 
+void stats_free_vcpu_mfn(struct domain * d)
+{
+    struct page_info *pg = d->vcpustats_page.pg;
+
+    if ( !pg )
+        return;
+
+    d->vcpustats_page.pg = NULL;
+
+    if ( d->vcpustats_page.va )
+        unmap_domain_page_global(d->vcpustats_page.va);
+
+    d->vcpustats_page.va = NULL;
+
+    put_page_alloc_ref(pg);
+    put_page_and_type(pg);
+}
+
+static int stats_vcpu_alloc_mfn(struct domain *d)
+{
+    struct page_info *pg;
+
+    pg = alloc_domheap_page(d, MEMF_no_refcount);
+
+    if ( !pg )
+        return -ENOMEM;
+
+    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
+        put_page_alloc_ref(pg);
+        return -ENODATA;
+    }
+
+    d->vcpustats_page.va = __map_domain_page_global(pg);
+    if ( !d->vcpustats_page.va )
+        goto fail;
+
+    d->vcpustats_page.pg = pg;
+    clear_page(d->vcpustats_page.va);
+    return 1;
+
+fail:
+    put_page_alloc_ref(pg);
+    put_page_and_type(pg);
+
+    return -ENOMEM;
+}
+
+static int acquire_stats_table(struct domain *d,
+                                unsigned int id,
+                                unsigned int frame,
+                                unsigned int nr_frames,
+                                xen_pfn_t mfn_list[])
+{
+    mfn_t mfn;
+    int rc;
+    unsigned int i;
+
+    if ( !d )
+        return -ENOENT;
+
+    for ( i = 0; i < nr_frames; i++ )
+    {
+        switch ( i )
+        {
+        case XENMEM_resource_stats_frame_vcpustats:
+            if ( !d->vcpustats_page.pg ) {
+                rc = stats_vcpu_alloc_mfn(d);
+                if ( rc < 1 )
+                    return rc;
+            }
+            mfn = page_to_mfn(d->vcpustats_page.pg);
+            mfn_list[i] = mfn_x(mfn);
+            break;
+
+        default:
+            return -EINVAL;
+        }
+    }
+
+    return nr_frames;
+}
+
 /*
  * Returns -errno on error, or positive in the range [1, nr_frames] on
  * success.  Returning less than nr_frames contitutes a request for a
@@ -1182,6 +1273,9 @@ static int _acquire_resource(
     case XENMEM_resource_vmtrace_buf:
         return acquire_vmtrace_buf(d, id, frame, nr_frames, mfn_list);
 
+    case XENMEM_resource_stats_table:
+        return acquire_stats_table(d, id, frame, nr_frames, mfn_list);
+
     default:
         return -EOPNOTSUPP;
     }
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 8f4b1ca10d..3543a531a1 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -264,6 +264,8 @@ static inline void vcpu_runstate_change(
 {
     s_time_t delta;
     struct sched_unit *unit = v->sched_unit;
+    shared_vcpustatspage_t * vcpustats_va;
+    struct domain *d = v->domain;
 
     ASSERT(spin_is_locked(get_sched_res(v->processor)->schedule_lock));
     if ( v->runstate.state == new_state )
@@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
     }
 
     v->runstate.state = new_state;
+
+    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
+    if ( vcpustats_va )
+    {
+	vcpustats_va->vcpu_info[v->vcpu_id].version =
+	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
+        smp_wmb();
+        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
+               &v->runstate.time[RUNSTATE_running],
+               sizeof(v->runstate.time[RUNSTATE_running]));
+        smp_wmb();
+        vcpustats_va->vcpu_info[v->vcpu_id].version =
+            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
+    }
 }
 
 void sched_guest_idle(void (*idle) (void), unsigned int cpu)
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 50e73eef98..e1a10b8b97 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -626,6 +626,7 @@ struct xen_mem_acquire_resource {
 #define XENMEM_resource_ioreq_server 0
 #define XENMEM_resource_grant_table 1
 #define XENMEM_resource_vmtrace_buf 2
+#define XENMEM_resource_stats_table 3
 
     /*
      * IN - a type-specific resource identifier, which must be zero
@@ -683,6 +684,8 @@ struct xen_mem_acquire_resource {
 typedef struct xen_mem_acquire_resource xen_mem_acquire_resource_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_acquire_resource_t);
 
+#define XENMEM_resource_stats_frame_vcpustats 0
+
 /*
  * XENMEM_get_vnumainfo used by guest to get
  * vNUMA topology from hypervisor.
diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
index 3623af932f..5c1812dfd2 100644
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+struct vcpu_stats{
+    /* If the least-significant bit of the version number is set then an update
+     * is in progress and the guest must wait to read a consistent set of values
+     * This mechanism is similar to Linux's seqlock.
+     */
+    uint32_t version;
+    uint32_t pad0;
+    uint64_t runstate_running_time;
+};
+
+struct shared_vcpustatspage {
+    struct vcpu_stats vcpu_info[1];
+};
+
+typedef struct shared_vcpustatspage shared_vcpustatspage_t;
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 667f9dac83..d1ca8b9aa8 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -134,6 +134,8 @@ int assign_pages(
 /* Dump info to serial console */
 void arch_dump_shared_mem_info(void);
 
+void stats_free_vcpu_mfn(struct domain * d);
+
 /*
  * Extra fault info types which are used to further describe
  * the source of an access violation.
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 5485d08afb..d9551ce35f 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -577,6 +577,11 @@ struct domain
         struct ioreq_server     *server[MAX_NR_IOREQ_SERVERS];
     } ioreq_server;
 #endif
+    /* Page that hosts vcpu stats */
+    struct {
+        struct page_info *pg;
+        void *va;
+    } vcpustats_page;
 };
 
 static inline struct page_list_head *page_to_list(
-- 
2.25.1




* [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool
  2022-10-07 12:39 [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Matias Ezequiel Vara Larsen
  2022-10-07 12:39 ` [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type Matias Ezequiel Vara Larsen
@ 2022-10-07 12:39 ` Matias Ezequiel Vara Larsen
  2023-02-23 16:01   ` Andrew Cooper
  2023-02-23 19:56 ` API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Andrew Cooper
  2 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2022-10-07 12:39 UTC
  To: xen-devel; +Cc: Matias Ezequiel Vara Larsen, Wei Liu, Anthony PERARD

Add a demonstration tool that uses the stats_table resource to
query vcpus' RUNSTATE_running counter for a DomU.

Signed-off-by: Matias Ezequiel Vara Larsen <matias.vara@vates.fr>
---
Changes in v2:
- use period instead of frec
- rely on version to ensure reading is coherent 

Changes in v1:
- change the name of the tool to xen-vcpus-stats
- set command line parameters in the same order that are passed
- remove header libs.h
- build by default
- remove errno, strerrno, "\n", and indentation
- use errx when errno is not needed
- address better the number of pages requested and error msgs
- use the shared_vcpustatspage_t structure
- use the correct frame id when requesting the resource
---
 tools/misc/Makefile          |  6 +++
 tools/misc/xen-vcpus-stats.c | 87 ++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 tools/misc/xen-vcpus-stats.c

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 2b683819d4..837e4b50da 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -49,6 +49,7 @@ TARGETS_COPY += xenpvnetboot
 
 # Everything which needs to be built
 TARGETS_BUILD := $(filter-out $(TARGETS_COPY),$(TARGETS_ALL))
+TARGETS_BUILD += xen-vcpus-stats
 
 # ... including build-only targets
 TARGETS_BUILD-$(CONFIG_X86)    += xen-vmtrace
@@ -135,4 +136,9 @@ xencov: xencov.o
 xen-ucode: xen-ucode.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-vcpus-stats.o: CFLAGS += $(CFLAGS_libxenforeginmemory)
+
+xen-vcpus-stats: xen-vcpus-stats.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenforeignmemory) $(APPEND_LDFLAGS)
+
 -include $(DEPS_INCLUDE)
diff --git a/tools/misc/xen-vcpus-stats.c b/tools/misc/xen-vcpus-stats.c
new file mode 100644
index 0000000000..29d0efb124
--- /dev/null
+++ b/tools/misc/xen-vcpus-stats.c
@@ -0,0 +1,87 @@
+#include <err.h>
+#include <errno.h>
+#include <error.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <signal.h>
+
+#include <xenctrl.h>
+#include <xenforeignmemory.h>
+#include <xen/vcpu.h>
+
+#define rmb()   asm volatile("lfence":::"memory")
+
+static sig_atomic_t interrupted;
+static void close_handler(int signum)
+{
+    interrupted = 1;
+}
+
+int main(int argc, char **argv)
+{
+    xenforeignmemory_handle *fh;
+    xenforeignmemory_resource_handle *res;
+    size_t size;
+    int rc, domid, period, vcpu;
+    shared_vcpustatspage_t * info;
+    struct sigaction act;
+    uint32_t version;
+    uint64_t value;
+
+    if (argc != 4 ) {
+        fprintf(stderr, "Usage: %s <domid> <vcpu> <period>\n", argv[0]);
+        return 1;
+    }
+
+    domid = atoi(argv[1]);
+    vcpu = atoi(argv[2]);
+    period = atoi(argv[3]);
+
+    act.sa_handler = close_handler;
+    act.sa_flags = 0;
+    sigemptyset(&act.sa_mask);
+    sigaction(SIGHUP,  &act, NULL);
+    sigaction(SIGTERM, &act, NULL);
+    sigaction(SIGINT,  &act, NULL);
+    sigaction(SIGALRM, &act, NULL);
+
+    fh = xenforeignmemory_open(NULL, 0);
+
+    if ( !fh )
+        err(1, "xenforeignmemory_open");
+
+    rc = xenforeignmemory_resource_size(
+        fh, domid, XENMEM_resource_stats_table,
+        0, &size);
+
+    if ( rc )
+        err(1, "Fail: Get size");
+
+    res = xenforeignmemory_map_resource(
+        fh, domid, XENMEM_resource_stats_table,
+        0, XENMEM_resource_stats_frame_vcpustats, size >> XC_PAGE_SHIFT,
+        (void **)&info, PROT_READ, 0);
+
+    if ( !res )
+        err(1, "Fail: Map");
+
+    while ( !interrupted ) {
+        sleep(period);
+        do {
+            version = info->vcpu_info[vcpu].version;
+            rmb();
+            value = info->vcpu_info[vcpu].runstate_running_time;
+            rmb();
+        } while ((info->vcpu_info[vcpu].version & 1) ||
+                (version != info->vcpu_info[vcpu].version));
+        printf("running_vcpu_time[%d]: %ld\n", vcpu, value);
+    }
+
+    rc = xenforeignmemory_unmap_resource(fh, res);
+    if ( rc )
+        err(1, "Fail: Unmap");
+
+    return 0;
+}
-- 
2.25.1




* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-10-07 12:39 ` [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type Matias Ezequiel Vara Larsen
@ 2022-12-13 17:02   ` Jan Beulich
  2023-02-16 14:48     ` Matias Ezequiel Vara Larsen
  2022-12-14  7:29   ` Jan Beulich
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2022-12-13 17:02 UTC
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> This commit proposes a new mechanism to query the RUNSTATE_running counter for
> a given vcpu from a dom0 userspace application. This commit proposes to expose
> that counter by using the acquire_resource interface. The current mechanism
> relies on the XEN_DOMCTL_getvcpuinfo and holds a single global domctl_lock for
> the entire hypercall; and iterate over every vcpu in the system for every
> update thus impacting operations that share that lock.
> 
> This commit proposes to expose vcpu RUNSTATE_running via the
> xenforeignmemory interface thus preventing to issue the hypercall and holding
> the lock. For that purpose, a new resource type named stats_table is added. The
> first frame of this resource stores per-vcpu counters. The frame has one entry
> of type struct vcpu_stats per vcpu. The allocation of this frame only happens
> if the resource is requested. The frame is released after the domain is
> destroyed.
> 
> Note that the updating of this counter is in a hot path, thus, in this commit,
> copying only happens if it is specifically required.
> 
> Note that the exposed structure is extensible in two ways. First, the structure
> vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.

I'm afraid I don't see how this is "extensible". I would recommend that
you outline for yourself what a change would look like to actually add
such a 2nd counter. While doing that keep in mind that whatever changes
you make may not break existing consumers.

It's also not clear what you mean with "fits in a frame": struct
shared_vcpustatspage is a container for an array with a single element.
I may guess (looking at just the public interface) that this really is
meant to be a flexible array (and hence should be marked as such - see
other uses of XEN_FLEX_ARRAY_DIM in the public headers). Yet if that's
the case, then a single page already won't suffice for a domain with
sufficiently many vCPU-s.
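As a hedged sketch of that suggestion (pinning XEN_FLEX_ARRAY_DIM to its
conservative ANSI fallback of 1 so the fragment compiles standalone; the
"xen_"-prefixed names are illustrative, not the series' actual declarations):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the public declaration reworked with a flexible array.  In
 * xen/include/public/xen.h, XEN_FLEX_ARRAY_DIM expands to nothing (C99),
 * 0 (GNU C) or 1 (ANSI); it is pinned to the ANSI fallback here so the
 * fragment compiles standalone.  All names are illustrative.
 */
#define XEN_FLEX_ARRAY_DIM 1

struct xen_vcpu_stats {
    uint32_t version;
    uint32_t pad0;
    uint64_t runstate_running_time;
};

struct xen_shared_vcpustats {
    struct xen_vcpu_stats vcpu_info[XEN_FLEX_ARRAY_DIM];
};

static uint64_t read_running_time(const struct xen_shared_vcpustats *p,
                                  unsigned int vcpu_id)
{
    return p->vcpu_info[vcpu_id].runstate_running_time;
}
```

With the C99/GNU expansions the array becomes a true flexible array, so the
consumer has to size its mapping from the resource-size query rather than
from sizeof of the container.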

> Second, new frames can be added in case new counters are required.

Are you talking of "new counters" here which aren't "new per-vcpu
counters"? Or else what's the difference from the 1st way?

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
>  
>      ioreq_server_destroy_all(d);
>  
> +    stats_free_vcpu_mfn(d);

How come this lives here? Surely this new feature should be not only
guest-type independent, but also arch-agnostic? Clearly you putting
the new data in struct domain (and not struct arch_domain or yet
deeper in the hierarchy) indicates you may have been meaning to make
it so.

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
>      return nr;
>  }
>  
> +unsigned int stats_table_max_frames(const struct domain *d)
> +{
> +    /* One frame per 512 vcpus. */
> +    return 1;
> +}

As alluded to earlier already - 1 isn't going to be suitable for
arbitrary size domains. (Yes, HVM domains are presently limited to
128 vCPU-s, but as per above this shouldn't be a HVM-only feature.)

> @@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
>      return nr_frames;
>  }
>  
> +void stats_free_vcpu_mfn(struct domain * d)
> +{
> +    struct page_info *pg = d->vcpustats_page.pg;
> +
> +    if ( !pg )
> +        return;
> +
> +    d->vcpustats_page.pg = NULL;
> +
> +    if ( d->vcpustats_page.va )
> +        unmap_domain_page_global(d->vcpustats_page.va);
> +
> +    d->vcpustats_page.va = NULL;

We ought to gain UNMAP_DOMAIN_PAGE_GLOBAL() for purposes like this one,
paralleling UNMAP_DOMAIN_PAGE().

> +    put_page_alloc_ref(pg);
> +    put_page_and_type(pg);
> +}
> +
> +static int stats_vcpu_alloc_mfn(struct domain *d)
> +{
> +    struct page_info *pg;
> +
> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> +
> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {

Style: Brace placement (more elsewhere).

> +        put_page_alloc_ref(pg);

This is not allowed when what you may put is the last reference.
See other examples we have in the tree.

> +        return -ENODATA;
> +    }
> +
> +    d->vcpustats_page.va = __map_domain_page_global(pg);
> +    if ( !d->vcpustats_page.va )
> +        goto fail;
> +
> +    d->vcpustats_page.pg = pg;
> +    clear_page(d->vcpustats_page.va);

I guess this should happen before you globally announce the
address.

> +    return 1;

Functions returning -errno on error want to return 0 on success,
unless e.g. multiple success indicators are needed.

> +fail:

Style: Label indentation.

> +    put_page_alloc_ref(pg);
> +    put_page_and_type(pg);
> +
> +    return -ENOMEM;
> +}
> +
> +static int acquire_stats_table(struct domain *d,
> +                                unsigned int id,
> +                                unsigned int frame,
> +                                unsigned int nr_frames,
> +                                xen_pfn_t mfn_list[])

Style: Indentation.

> +{
> +    mfn_t mfn;
> +    int rc;
> +    unsigned int i;
> +
> +    if ( !d )
> +        return -ENOENT;
> +
> +    for ( i = 0; i < nr_frames; i++ )
> +    {
> +        switch ( i )
> +        {
> +        case XENMEM_resource_stats_frame_vcpustats:

Isn't this supposed to be indexed by "id" (which presently you ignore
altogether, which can't be right)?

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -264,6 +264,8 @@ static inline void vcpu_runstate_change(
>  {
>      s_time_t delta;
>      struct sched_unit *unit = v->sched_unit;
> +    shared_vcpustatspage_t * vcpustats_va;

Style: Stray blank (more elsewhere).

> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>      }
>  
>      v->runstate.state = new_state;
> +
> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;

There should be no need for a cast here.

> +    if ( vcpustats_va )
> +    {
> +	vcpustats_va->vcpu_info[v->vcpu_id].version =

Style: Hard tab.

> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +        smp_wmb();
> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> +               &v->runstate.time[RUNSTATE_running],
> +               sizeof(v->runstate.time[RUNSTATE_running]));

Why memcpy() and not a plain assignment?

> +        smp_wmb();
> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +    }

Overall latching &vcpustats_va->vcpu_info[v->vcpu_id] into a helper
variable would likely help readability quite a bit.

> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>  
> +struct vcpu_stats{

Style: Missing blank.

> +    /* If the least-significant bit of the version number is set then an update

Style: Comment layout.

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-10-07 12:39 ` [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type Matias Ezequiel Vara Larsen
  2022-12-13 17:02   ` Jan Beulich
@ 2022-12-14  7:29   ` Jan Beulich
  2022-12-14  7:56     ` Jan Beulich
  2023-02-16 15:07     ` Matias Ezequiel Vara Larsen
  1 sibling, 2 replies; 37+ messages in thread
From: Jan Beulich @ 2022-12-14  7:29 UTC
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
>       return nr;
>  }
>  
> +unsigned int stats_table_max_frames(const struct domain *d)
> +{
> +    /* One frame per 512 vcpus. */
> +    return 1;
> +}

Beyond my earlier comment (and irrespective of this needing changing
anyway): I guess this "512" was not updated to match the now larger
size of struct vcpu_stats?
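To illustrate the arithmetic: with the v2 layout (uint32_t version +
uint32_t pad + uint64_t counter = 16 bytes), a 4 KiB frame holds 256
entries, not 512. A hypothetical multi-frame helper could then look like
this (names are illustrative, not part of the series):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Mirrors the v2 layout of struct vcpu_stats; 16 bytes per entry. */
struct xen_vcpu_stats {
    uint32_t version;
    uint32_t pad0;
    uint64_t runstate_running_time;
};

/* Entries that actually fit in one 4 KiB frame: 4096 / 16 = 256. */
#define VCPUSTATS_PER_FRAME (PAGE_SIZE / sizeof(struct xen_vcpu_stats))

/* Frames needed for a domain with max_vcpus vcpus (round up). */
static unsigned int stats_table_nr_frames(unsigned int max_vcpus)
{
    return (max_vcpus + VCPUSTATS_PER_FRAME - 1) / VCPUSTATS_PER_FRAME;
}
```

Either the "512" comment or the hard-coded frame count of 1 would need to
change along these lines, and the updater would have to bounds-check the
vcpu id against whatever limit is chosen.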

> +static int stats_vcpu_alloc_mfn(struct domain *d)
> +{
> +    struct page_info *pg;
> +
> +    pg = alloc_domheap_page(d, MEMF_no_refcount);

The ioreq and vmtrace resources are also allocated this way, but they're
HVM-specific. The one here being supposed to be VM-type independent, I'm
afraid such pages will be accessible by an "owning" PV domain (it'll
need to guess the MFN, but that's no excuse).

> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> +        put_page_alloc_ref(pg);
> +        return -ENODATA;
> +    }
> +
> +    d->vcpustats_page.va = __map_domain_page_global(pg);
> +    if ( !d->vcpustats_page.va )
> +        goto fail;
> +
> +    d->vcpustats_page.pg = pg;
> +    clear_page(d->vcpustats_page.va);

Beyond my earlier comment: I think that by the time the surrounding
hypercall returns the page ought to contain valid data. Otherwise I
see no way for the consumer to know from which point on the data is
going to be valid.

> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>      }
>  
>      v->runstate.state = new_state;
> +
> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> +    if ( vcpustats_va )
> +    {
> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +        smp_wmb();
> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> +               &v->runstate.time[RUNSTATE_running],
> +               sizeof(v->runstate.time[RUNSTATE_running]));
> +        smp_wmb();
> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +    }

A further aspect to consider here is cache line ping-pong. I think the
per-vCPU elements of the array want to be big enough to not share a
cache line. The interface being generic this presents some challenge
in determining what the supposed size is to be. However, taking into
account the extensibility question, maybe the route to take is to
simply settle on a power-of-2 value somewhere between x86'es and Arm's
cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
or 1k?
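One possible shape for that suggestion, assuming a purely illustrative
512-byte stride: wrap the per-vcpu structure in a padded union so no two
vcpus share a cache line and new counters can be appended without changing
the stride (all names are made up, not the series' declarations).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pad each per-vcpu element to a fixed power-of-two stride so no two
 * vcpus share a cache line.  The 512-byte stride is illustrative; the
 * chosen value would become part of the ABI.
 */
#define XEN_VCPU_STATS_STRIDE 512

struct xen_vcpu_stats {
    uint32_t version;
    uint32_t pad0;
    uint64_t runstate_running_time;
    /* New counters would be appended here, up to the stride. */
};

union xen_vcpu_stats_elem {
    struct xen_vcpu_stats s;
    uint8_t pad[XEN_VCPU_STATS_STRIDE];
};
```

A 4 KiB frame then holds 4096 / 512 = 8 such elements, trading capacity per
frame for the absence of cache-line ping-pong.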

> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>  
> +struct vcpu_stats{
> +    /* If the least-significant bit of the version number is set then an update
> +     * is in progress and the guest must wait to read a consistent set of values
> +     * This mechanism is similar to Linux's seqlock.
> +     */
> +    uint32_t version;
> +    uint32_t pad0;
> +    uint64_t runstate_running_time;
> +};
> +
> +struct shared_vcpustatspage {
> +    struct vcpu_stats vcpu_info[1];
> +};
> +
> +typedef struct shared_vcpustatspage shared_vcpustatspage_t;

For new additions please avoid further name space issues: All types
and alike want to be prefixed by "xen_".

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-12-14  7:29   ` Jan Beulich
@ 2022-12-14  7:56     ` Jan Beulich
  2023-02-17  8:50       ` Matias Ezequiel Vara Larsen
  2023-02-16 15:07     ` Matias Ezequiel Vara Larsen
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2022-12-14  7:56 UTC
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 14.12.2022 08:29, Jan Beulich wrote:
> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>> +{
>> +    struct page_info *pg;
>> +
>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> 
> The ioreq and vmtrace resources are also allocated this way, but they're
> HVM-specific. The one here being supposed to be VM-type independent, I'm
> afraid such pages will be accessible by an "owning" PV domain (it'll
> need to guess the MFN, but that's no excuse).

Which might be tolerable if it then can't write to the page. That would
require "locking" the page r/o (from guest pov), which ought to be
possible by leveraging a variant of what share_xen_page_with_guest()
does: It marks pages PGT_none with a single type ref. This would mean
...

>> +    if ( !pg )
>> +        return -ENOMEM;
>> +
>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {

... using PGT_none here. Afaict this _should_ work, but we have no
precedent of doing so in the tree, and I may be overlooking something
which prevents that from working.

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-12-13 17:02   ` Jan Beulich
@ 2023-02-16 14:48     ` Matias Ezequiel Vara Larsen
  2023-02-16 15:10       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-16 14:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

Hello Jan, and thanks for your comments. I addressed most of them, but I
still have some questions. Please find them below:

On Tue, Dec 13, 2022 at 06:02:55PM +0100, Jan Beulich wrote:
> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> > This commit proposes a new mechanism to query the RUNSTATE_running counter for
> > a given vcpu from a dom0 userspace application. This commit proposes to expose
> > that counter by using the acquire_resource interface. The current mechanism
> > relies on the XEN_DOMCTL_getvcpuinfo and holds a single global domctl_lock for
> > the entire hypercall; and iterate over every vcpu in the system for every
> > update thus impacting operations that share that lock.
> > 
> > This commit proposes to expose vcpu RUNSTATE_running via the
> > xenforeignmemory interface thus preventing to issue the hypercall and holding
> > the lock. For that purpose, a new resource type named stats_table is added. The
> > first frame of this resource stores per-vcpu counters. The frame has one entry
> > of type struct vcpu_stats per vcpu. The allocation of this frame only happens
> > if the resource is requested. The frame is released after the domain is
> > destroyed.
> > 
> > Note that the updating of this counter is in a hot path, thus, in this commit,
> > copying only happens if it is specifically required.
> > 
> > Note that the exposed structure is extensible in two ways. First, the structure
> > vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.
> 
> I'm afraid I don't see how this is "extensible". I would recommend that
> you outline for yourself how a change would look like to actually add
> such a 2nd counter. While doing that keep in mind that whatever changes
> you make may not break existing consumers.
> 
> It's also not clear what you mean with "fits in a frame": struct
> shared_vcpustatspage is a container for an array with a single element.
> I may guess (looking at just the public interface) that this really is
> meant to be a flexible array (and hence should be marked as such - see
> other uses of XEN_FLEX_ARRAY_DIM in the public headers). Yet if that's
> the case, then a single page already won't suffice for a domain with
> sufficiently many vCPU-s.
> 

I tackled this by using "d->max_vcpus" to calculate the number of frames
required for a given guest. Also, I added a new type-specific resource named
XENMEM_resource_stats_table_id_vcpustats to host per-vcpu counters. I
completely forgot the "id" in the previous series.

> > Second, new frames can be added in case new counters are required.
> 
> Are you talking of "new counters" here which aren't "new per-vcpu
> counters"? Or else what's the difference from the 1st way?

Yes, I was talking about that sort of counter. In the next series, such
counters could be added by introducing a new type-specific resource id.

> 
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
> >  
> >      ioreq_server_destroy_all(d);
> >  
> > +    stats_free_vcpu_mfn(d);
> 
> How come this lives here? Surely this new feature should be not only
> guest-type independent, but also arch-agnostic? Clearly you putting
> the new data in struct domain (and not struct arch_domain or yet
> deeper in the hierarchy) indicates you may have been meaning to make
> it so.
> 

The whole feature should be guest-type independent and arch-agnostic.
Would it be better to put it at xen/common/domain.c:domain_kill()?
 
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
> >      return nr;
> >  }
> >  
> > +unsigned int stats_table_max_frames(const struct domain *d)
> > +{
> > +    /* One frame per 512 vcpus. */
> > +    return 1;
> > +}
> 
> As alluded to earlier already - 1 isn't going to be suitable for
> arbitrary size domains. (Yes, HVM domains are presently limited to
> 128 vCPU-s, but as per above this shouldn't be a HVM-only feature.)
>

I am going to use "d->max_vcpus" to calculate the number of required frames for
per-vcpu counters for a given guest.
 
> > @@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
> >      return nr_frames;
> >  }
> >  
> > +void stats_free_vcpu_mfn(struct domain * d)
> > +{
> > +    struct page_info *pg = d->vcpustats_page.pg;
> > +
> > +    if ( !pg )
> > +        return;
> > +
> > +    d->vcpustats_page.pg = NULL;
> > +
> > +    if ( d->vcpustats_page.va )
> > +        unmap_domain_page_global(d->vcpustats_page.va);
> > +
> > +    d->vcpustats_page.va = NULL;
> 
> We ought to gain UNMAP_DOMAIN_PAGE_GLOBAL() for purposes like this one,
> paralleling UNMAP_DOMAIN_PAGE().
> 

I do not understand this comment. Could you elaborate?

> > +    put_page_alloc_ref(pg);
> > +    put_page_and_type(pg);
> > +}
> > +
> > +static int stats_vcpu_alloc_mfn(struct domain *d)
> > +{
> > +    struct page_info *pg;
> > +
> > +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> > +
> > +    if ( !pg )
> > +        return -ENOMEM;
> > +
> > +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> 
> Style: Brace placement (more elsewhere).
> 
> > +        put_page_alloc_ref(pg);
> 
> This is not allowed when what you may put is the last reference.
> See other examples we have in the tree.
> 

I do not understand this comment. Could you point me to an example? I used
ioreq_server_alloc_mfn() as an example, but it may not be a good one.

> > +        return -ENODATA;
> > +    }
> > +
> > +    d->vcpustats_page.va = __map_domain_page_global(pg);
> > +    if ( !d->vcpustats_page.va )
> > +        goto fail;
> > +
> > +    d->vcpustats_page.pg = pg;
> > +    clear_page(d->vcpustats_page.va);
> 
> I guess this should happen before you globally announce the
> address.
> 

If I understand correctly, I should invoke clear_page() before I assign the
address to "d->vcpustats_page.va". Am I right?
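If that is the intended ordering, it could be sketched generically as below. This is a hypothetical illustration of the clear-before-publish rule, not Xen code; the structure and helper names are invented for the example.

```c
#include <string.h>

/* Hypothetical container mirroring d->vcpustats_page. */
struct stats_page {
    void *va;
    void *pg;
};

/* Zero the freshly mapped page *before* the globally visible pointer
 * is published, so no other path can observe uninitialized contents. */
static void publish_stats_page(struct stats_page *sp, void *va, void *pg,
                               size_t page_size)
{
    memset(va, 0, page_size);   /* clear first... */
    sp->pg = pg;
    sp->va = va;                /* ...then announce */
}
```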

> > +    return 1;
> 
> Functions returning -errno on error want to return 0 on success,
> unless e.g. multiple success indicators are needed.
> 
> > +fail:
> 
> Style: Label indentation.
> 
> > +    put_page_alloc_ref(pg);
> > +    put_page_and_type(pg);
> > +
> > +    return -ENOMEM;
> > +}
> > +
> > +static int acquire_stats_table(struct domain *d,
> > +                                unsigned int id,
> > +                                unsigned int frame,
> > +                                unsigned int nr_frames,
> > +                                xen_pfn_t mfn_list[])
> 
> Style: Indentation.
> 
> > +{
> > +    mfn_t mfn;
> > +    int rc;
> > +    unsigned int i;
> > +
> > +    if ( !d )
> > +        return -ENOENT;
> > +
> > +    for ( i = 0; i < nr_frames; i++ )
> > +    {
> > +        switch ( i )
> > +        {
> > +        case XENMEM_resource_stats_frame_vcpustats:
> 
> Isn't this supposed to be indexed by "id" (which presently you ignore
> altogether, which can't be right)?

I forgot the "id". I added a new type-specific resource id in the next
series. 

> 
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -264,6 +264,8 @@ static inline void vcpu_runstate_change(
> >  {
> >      s_time_t delta;
> >      struct sched_unit *unit = v->sched_unit;
> > +    shared_vcpustatspage_t * vcpustats_va;
> 
> Style: Stray blank (more elsewhere).
> 
> > @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
> >      }
> >  
> >      v->runstate.state = new_state;
> > +
> > +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> 
> There should be no need for a cast here.
> 
> > +    if ( vcpustats_va )
> > +    {
> > +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> 
> Style: Hard tab.
> 
> > +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +        smp_wmb();
> > +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> > +               &v->runstate.time[RUNSTATE_running],
> > +               sizeof(v->runstate.time[RUNSTATE_running]));
> 
> Why memcpy() and not a plain assignment?
> 
> > +        smp_wmb();
> > +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> > +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +    }
> 
> Overall latching &vcpustats_va->vcpu_info[v->vcpu_id] into a helper
> variable would likely help readability quite a bit.
> 
> > --- a/xen/include/public/vcpu.h
> > +++ b/xen/include/public/vcpu.h
> > @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
> >  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
> >  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
> >  
> > +struct vcpu_stats{
> 
> Style: Missing blank.
> 
> > +    /* If the least-significant bit of the version number is set then an update
> 
> Style: Comment layout.
> 

Thanks for the comments regarding style.

Matias



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-12-14  7:29   ` Jan Beulich
  2022-12-14  7:56     ` Jan Beulich
@ 2023-02-16 15:07     ` Matias Ezequiel Vara Larsen
  2023-02-16 15:15       ` Jan Beulich
  1 sibling, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-16 15:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
> >       return nr;
> >  }
> >  
> > +unsigned int stats_table_max_frames(const struct domain *d)
> > +{
> > +    /* One frame per 512 vcpus. */
> > +    return 1;
> > +}
> 
> Beyond my earlier comment (and irrespective of this needing changing
> anyway): I guess this "512" was not updated to match the now larger
> size of struct vcpu_stats?

In the next series, I am calculating the number of frames by:

nr = DIV_ROUND_UP(d->max_vcpus * sizeof(struct vcpu_stats), PAGE_SIZE);
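The computation above can be illustrated standalone. This sketch mirrors Xen's DIV_ROUND_UP macro and the 16-byte struct from the patch; the function name is an assumption for the example.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
/* Same semantics as Xen's DIV_ROUND_UP. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* 16 bytes per vCPU, as in the quoted patch. */
struct vcpu_stats {
    uint32_t version;
    uint32_t pad0;
    uint64_t runstate_running_time;
};

/* Number of 4k frames needed to hold max_vcpus entries. */
static unsigned int stats_table_nr_frames(unsigned int max_vcpus)
{
    return DIV_ROUND_UP(max_vcpus * sizeof(struct vcpu_stats), PAGE_SIZE);
}
```

At 16 bytes per entry, one frame covers 256 vCPUs; a 257-vCPU domain needs two.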

> 
> > +static int stats_vcpu_alloc_mfn(struct domain *d)
> > +{
> > +    struct page_info *pg;
> > +
> > +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> 
> The ioreq and vmtrace resources are also allocated this way, but they're
> HVM-specific. The one here being supposed to be VM-type independent, I'm
> afraid such pages will be accessible by an "owning" PV domain (it'll
> need to guess the MFN, but that's no excuse).
> 
> > +    if ( !pg )
> > +        return -ENOMEM;
> > +
> > +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> > +        put_page_alloc_ref(pg);
> > +        return -ENODATA;
> > +    }
> > +
> > +    d->vcpustats_page.va = __map_domain_page_global(pg);
> > +    if ( !d->vcpustats_page.va )
> > +        goto fail;
> > +
> > +    d->vcpustats_page.pg = pg;
> > +    clear_page(d->vcpustats_page.va);
> 
> Beyond my earlier comment: I think that by the time the surrounding
> hypercall returns the page ought to contain valid data. Otherwise I
> see no way for the consumer to know from which point on the data is
> going to be valid.
> 
> > @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
> >      }
> >  
> >      v->runstate.state = new_state;
> > +
> > +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> > +    if ( vcpustats_va )
> > +    {
> > +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> > +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +        smp_wmb();
> > +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> > +               &v->runstate.time[RUNSTATE_running],
> > +               sizeof(v->runstate.time[RUNSTATE_running]));
> > +        smp_wmb();
> > +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> > +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +    }
> 
> A further aspect to consider here is cache line ping-pong. I think the
> per-vCPU elements of the array want to be big enough to not share a
> cache line. The interface being generic this presents some challenge
> in determining what the supposed size is to be. However, taking into
> account the extensibility question, maybe the route to take is to
> simply settle on a power-of-2 value somewhere between x86'es and Arm's
> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
> or 1k?
> 

I do not know how to address this. I was thinking of aligning each vcpu_stats
instance to a multiple of the cache line, picking the first multiple that is
larger than the size of the vcpu_stats structure. For example, the structure
is currently 16 bytes, so I would align each instance in a frame to 64 bytes.
Would that make sense?
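That rounding could be done with a helper like the following. This is an illustrative sketch, not Xen code; the 64-byte line size and the helper name are assumptions for the example.

```c
/* Hypothetical cache line size; Arm and RISC-V may differ, which is
 * why a larger fixed value was suggested upthread. */
#define CACHE_LINE 64u

/* Round x up to the next multiple of the (power-of-2) alignment a. */
#define ROUNDUP(x, a) (((x) + (a) - 1) & ~((a) - 1))

static unsigned int stats_elem_size(unsigned int raw)
{
    return ROUNDUP(raw, CACHE_LINE);
}
```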

> > --- a/xen/include/public/vcpu.h
> > +++ b/xen/include/public/vcpu.h
> > @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
> >  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
> >  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
> >  
> > +struct vcpu_stats{
> > +    /* If the least-significant bit of the version number is set then an update
> > +     * is in progress and the guest must wait to read a consistent set of values
> > +     * This mechanism is similar to Linux's seqlock.
> > +     */
> > +    uint32_t version;
> > +    uint32_t pad0;
> > +    uint64_t runstate_running_time;
> > +};
> > +
> > +struct shared_vcpustatspage {
> > +    struct vcpu_stats vcpu_info[1];
> > +};
> > +
> > +typedef struct shared_vcpustatspage shared_vcpustatspage_t;
> 
> For new additions please avoid further name space issues: All types
> and alike want to be prefixed by "xen_".
>

Should I name it "xen_shared_vcpustatspage_t" instead?

Matias 



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-16 14:48     ` Matias Ezequiel Vara Larsen
@ 2023-02-16 15:10       ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-02-16 15:10 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 16.02.2023 15:48, Matias Ezequiel Vara Larsen wrote:
> On Tue, Dec 13, 2022 at 06:02:55PM +0100, Jan Beulich wrote:
>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>> This commit proposes a new mechanism to query the RUNSTATE_running counter for
>>> a given vcpu from a dom0 userspace application. This commit proposes to expose
>>> that counter by using the acquire_resource interface. The current mechanism
>>> relies on the XEN_DOMCTL_getvcpuinfo and holds a single global domctl_lock for
>>> the entire hypercall; and iterate over every vcpu in the system for every
>>> update thus impacting operations that share that lock.
>>>
>>> This commit proposes to expose vcpu RUNSTATE_running via the
>>> xenforeignmemory interface thus preventing to issue the hypercall and holding
>>> the lock. For that purpose, a new resource type named stats_table is added. The
>>> first frame of this resource stores per-vcpu counters. The frame has one entry
>>> of type struct vcpu_stats per vcpu. The allocation of this frame only happens
>>> if the resource is requested. The frame is released after the domain is
>>> destroyed.
>>>
>>> Note that the updating of this counter is in a hot path, thus, in this commit,
>>> copying only happens if it is specifically required.
>>>
>>> Note that the exposed structure is extensible in two ways. First, the structure
>>> vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.
>>
>> I'm afraid I don't see how this is "extensible". I would recommend that
>> you outline for yourself how a change would look like to actually add
>> such a 2nd counter. While doing that keep in mind that whatever changes
>> you make may not break existing consumers.
>>
>> It's also not clear what you mean with "fits in a frame": struct
>> shared_vcpustatspage is a container for an array with a single element.
>> I may guess (looking at just the public interface) that this really is
>> meant to be a flexible array (and hence should be marked as such - see
>> other uses of XEN_FLEX_ARRAY_DIM in the public headers). Yet if that's
>> the case, then a single page already won't suffice for a domain with
>> sufficiently many vCPU-s.
>>
> 
> I taclked this by using "d->max_vcpus" to calculate the number of required frames
> to allocate for a given guest. Also, I added a new type-specific resource named
> XENMEM_resource_stats_table_id_vcpustats to host per-vcpu counters. I
> completely forgot the "id" in the previous series.

May I suggest that before you submit a new version of your patches, you
make yourself (and then perhaps submit for commenting) a layout of the
data structures you want to introduce, including how they interact and
what "granularity" (global, per-domain, per-vCPU, per-pCPU, or alike)
they are. While doing that, as previously suggested, put yourself in
the position of someone later wanting to add another counter. With the
initial logic there, such an extension should then end up being pretty
mechanical, or else the arrangement likely needs further adjustment.

>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
>>>  
>>>      ioreq_server_destroy_all(d);
>>>  
>>> +    stats_free_vcpu_mfn(d);
>>
>> How come this lives here? Surely this new feature should be not only
>> guest-type independent, but also arch-agnostic? Clearly you putting
>> the new data in struct domain (and not struct arch_domain or yet
>> deeper in the hierarchy) indicates you may have been meaning to make
>> it so.
>>
> 
> The whole feature shall to be guest-type independent and also arch-agnostic.
> Would it be better to put it at xen/common/domain.c:domain_kill()?

Likely, and the earlier this is (safely) possible, the better.

>>> @@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
>>>      return nr_frames;
>>>  }
>>>  
>>> +void stats_free_vcpu_mfn(struct domain * d)
>>> +{
>>> +    struct page_info *pg = d->vcpustats_page.pg;
>>> +
>>> +    if ( !pg )
>>> +        return;
>>> +
>>> +    d->vcpustats_page.pg = NULL;
>>> +
>>> +    if ( d->vcpustats_page.va )
>>> +        unmap_domain_page_global(d->vcpustats_page.va);
>>> +
>>> +    d->vcpustats_page.va = NULL;
>>
>> We ought to gain UNMAP_DOMAIN_PAGE_GLOBAL() for purposes like this one,
>> paralleling UNMAP_DOMAIN_PAGE().
>>
> 
> I do not understand this comment. Could you elaborate it?

The last four lines of code would better be collapsed to a single one,
using the mentioned yet-to-be-introduced construct. I assume you did
look up UNMAP_DOMAIN_PAGE() to spot its difference from
unmap_domain_page()?

>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>> +{
>>> +    struct page_info *pg;
>>> +
>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>> +
>>> +    if ( !pg )
>>> +        return -ENOMEM;
>>> +
>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>
>> Style: Brace placement (more elsewhere).
>>
>>> +        put_page_alloc_ref(pg);
>>
>> This is not allowed when what you may put is the last reference.
>> See other examples we have in the tree.
>>
> 
> I do not understand this comment. Could you point me to an example? I used
> ioreq_server_alloc_mfn() as example but it may not be a good example. 

That's an okay example; what's not okay is that you altered what is
done there. There is a reason that the other function doesn't use
put_page_alloc_ref() like you do. And I would assume you've looked
up put_page_alloc_ref() and found the comment there that explains
things.

>>> +        return -ENODATA;
>>> +    }
>>> +
>>> +    d->vcpustats_page.va = __map_domain_page_global(pg);
>>> +    if ( !d->vcpustats_page.va )
>>> +        goto fail;
>>> +
>>> +    d->vcpustats_page.pg = pg;
>>> +    clear_page(d->vcpustats_page.va);
>>
>> I guess this should happen before you globally announce the
>> address.
>>
> 
> If I understand correctly, I should invoke clear_page() before I assign the
> address to "d->vcpustats_page.va". Am I right?

Yes.

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-16 15:07     ` Matias Ezequiel Vara Larsen
@ 2023-02-16 15:15       ` Jan Beulich
  2023-02-20 16:51         ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-02-16 15:15 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 16.02.2023 16:07, Matias Ezequiel Vara Larsen wrote:
> On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>>>      }
>>>  
>>>      v->runstate.state = new_state;
>>> +
>>> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
>>> +    if ( vcpustats_va )
>>> +    {
>>> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
>>> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>> +        smp_wmb();
>>> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
>>> +               &v->runstate.time[RUNSTATE_running],
>>> +               sizeof(v->runstate.time[RUNSTATE_running]));
>>> +        smp_wmb();
>>> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
>>> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>> +    }
>>
>> A further aspect to consider here is cache line ping-pong. I think the
>> per-vCPU elements of the array want to be big enough to not share a
>> cache line. The interface being generic this presents some challenge
>> in determining what the supposed size is to be. However, taking into
>> account the extensibility question, maybe the route to take is to
>> simply settle on a power-of-2 value somewhere between x86'es and Arm's
>> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
>> or 1k?
>>
> 
> I do not now how to address this. I was thinking to align each vcpu_stats
> instance to a multiple of the cache-line. I would pick up the first multiple
> that is bigger to the size of the vcpu_stats structure. For example, currently
> the structure is 16 bytes so I would align each instance in a frame to 64
> bytes. Would it make sense? 

Well, 64 may be an option, but I gave higher numbers for a reason. One thing
I don't know is what common cache line sizes are on Arm or e.g. RISC-V.

>>> --- a/xen/include/public/vcpu.h
>>> +++ b/xen/include/public/vcpu.h
>>> @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
>>>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>>>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>>>  
>>> +struct vcpu_stats{
>>> +    /* If the least-significant bit of the version number is set then an update
>>> +     * is in progress and the guest must wait to read a consistent set of values
>>> +     * This mechanism is similar to Linux's seqlock.
>>> +     */
>>> +    uint32_t version;
>>> +    uint32_t pad0;
>>> +    uint64_t runstate_running_time;
>>> +};
>>> +
>>> +struct shared_vcpustatspage {
>>> +    struct vcpu_stats vcpu_info[1];
>>> +};
>>> +
>>> +typedef struct shared_vcpustatspage shared_vcpustatspage_t;
>>
>> For new additions please avoid further name space issues: All types
>> and alike want to be prefixed by "xen_".
> 
> Should I name it "xen_shared_vcpustatspage_t" instead?

Yes, that would fulfill the name space requirements. It's getting longish,
so you may want to think about abbreviating it some. For example, I'm not
sure the "page" in the name is really necessary.

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2022-12-14  7:56     ` Jan Beulich
@ 2023-02-17  8:50       ` Matias Ezequiel Vara Larsen
  2023-02-17  8:57         ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-17  8:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> On 14.12.2022 08:29, Jan Beulich wrote:
> > On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >> +{
> >> +    struct page_info *pg;
> >> +
> >> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> > 
> > The ioreq and vmtrace resources are also allocated this way, but they're
> > HVM-specific. The one here being supposed to be VM-type independent, I'm
> > afraid such pages will be accessible by an "owning" PV domain (it'll
> > need to guess the MFN, but that's no excuse).
> 
> Which might be tolerable if it then can't write to the page. That would
> require "locking" the page r/o (from guest pov), which ought to be
> possible by leveraging a variant of what share_xen_page_with_guest()
> does: It marks pages PGT_none with a single type ref. This would mean
> ...
> 
> >> +    if ( !pg )
> >> +        return -ENOMEM;
> >> +
> >> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> 
> ... using PGT_none here. Afaict this _should_ work, but we have no
> precedent of doing so in the tree, and I may be overlooking something
> which prevents that from working.
> 

I do not fully understand this. I checked share_xen_page_with_guest() and I
think you're talking about doing something like this for each allocated page
to make it read-only from a PV guest's point of view:

pg->u.inuse.type_info = PGT_none;
pg->u.inuse.type_info |= PGT_validated | 1;
page_set_owner(page, d); // not sure if this is needed

Then, I should use PGT_none instead of PGT_writable_page in
get_page_and_type(). Am I right?

Matias



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-17  8:50       ` Matias Ezequiel Vara Larsen
@ 2023-02-17  8:57         ` Jan Beulich
  2023-02-17  9:29           ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-02-17  8:57 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>> On 14.12.2022 08:29, Jan Beulich wrote:
>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>> +{
>>>> +    struct page_info *pg;
>>>> +
>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>
>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>> need to guess the MFN, but that's no excuse).
>>
>> Which might be tolerable if it then can't write to the page. That would
>> require "locking" the page r/o (from guest pov), which ought to be
>> possible by leveraging a variant of what share_xen_page_with_guest()
>> does: It marks pages PGT_none with a single type ref. This would mean
>> ...
>>
>>>> +    if ( !pg )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>
>> ... using PGT_none here. Afaict this _should_ work, but we have no
>> precedent of doing so in the tree, and I may be overlooking something
>> which prevents that from working.
>>
> 
> I do not fully understand this. I checked share_xen_page_with_guest() and I
> think you're talking about doing something like this for each allocated page to
> set them ro from a pv guest pov:
> 
> pg->u.inuse.type_info = PGT_none;
> pg->u.inuse.type_info |= PGT_validated | 1;
> page_set_owner(page, d); // not sure if this is needed
> 
> Then, I should use PGT_none instead of PGT_writable_page in
> get_page_and_type(). Am I right?

No, if at all possible you should avoid open-coding anything. As said,
simply passing PGT_none to get_page_and_type() ought to work (again, as
said, unless I'm overlooking something). share_xen_page_with_guest()
can do what it does because the page isn't owned yet. For a page with
owner you may not fiddle with type_info in such an open-coded manner.

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-17  8:57         ` Jan Beulich
@ 2023-02-17  9:29           ` Matias Ezequiel Vara Larsen
  2023-02-17 14:10             ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-17  9:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> > On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >> On 14.12.2022 08:29, Jan Beulich wrote:
> >>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>> +{
> >>>> +    struct page_info *pg;
> >>>> +
> >>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>
> >>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>> need to guess the MFN, but that's no excuse).
> >>
> >> Which might be tolerable if it then can't write to the page. That would
> >> require "locking" the page r/o (from guest pov), which ought to be
> >> possible by leveraging a variant of what share_xen_page_with_guest()
> >> does: It marks pages PGT_none with a single type ref. This would mean
> >> ...
> >>
> >>>> +    if ( !pg )
> >>>> +        return -ENOMEM;
> >>>> +
> >>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>
> >> ... using PGT_none here. Afaict this _should_ work, but we have no
> >> precedent of doing so in the tree, and I may be overlooking something
> >> which prevents that from working.
> >>
> > 
> > I do not fully understand this. I checked share_xen_page_with_guest() and I
> > think you're talking about doing something like this for each allocated page to
> > set them ro from a pv guest pov:
> > 
> > pg->u.inuse.type_info = PGT_none;
> > pg->u.inuse.type_info |= PGT_validated | 1;
> > page_set_owner(page, d); // not sure if this is needed
> > 
> > Then, I should use PGT_none instead of PGT_writable_page in
> > get_page_and_type(). Am I right?
> 
> No, if at all possible you should avoid open-coding anything. As said,
> simply passing PGT_none to get_page_and_type() ought to work (again, as
> said, unless I'm overlooking something). share_xen_page_with_guest()
> can do what it does because the page isn't owned yet. For a page with
> owner you may not fiddle with type_info in such an open-coded manner.
> 

Thanks. I got the following bug when passing PGT_none:

(XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
(XEN) Xen BUG at mm.c:2643

I have not yet investigated why this happens.

Matias



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-17  9:29           ` Matias Ezequiel Vara Larsen
@ 2023-02-17 14:10             ` Jan Beulich
  2023-02-23 12:16               ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-02-17 14:10 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>>>> On 14.12.2022 08:29, Jan Beulich wrote:
>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>>>> +{
>>>>>> +    struct page_info *pg;
>>>>>> +
>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>>>
>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>>>> need to guess the MFN, but that's no excuse).
>>>>
>>>> Which might be tolerable if it then can't write to the page. That would
>>>> require "locking" the page r/o (from guest pov), which ought to be
>>>> possible by leveraging a variant of what share_xen_page_with_guest()
>>>> does: It marks pages PGT_none with a single type ref. This would mean
>>>> ...
>>>>
>>>>>> +    if ( !pg )
>>>>>> +        return -ENOMEM;
>>>>>> +
>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>>>
>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
>>>> precedent of doing so in the tree, and I may be overlooking something
>>>> which prevents that from working.
>>>>
>>>
>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
>>> think you're talking about doing something like this for each allocated page to
>>> set them ro from a pv guest pov:
>>>
>>> pg->u.inuse.type_info = PGT_none;
>>> pg->u.inuse.type_info |= PGT_validated | 1;
>>> page_set_owner(page, d); // not sure if this is needed
>>>
>>> Then, I should use PGT_none instead of PGT_writable_page in
>>> get_page_and_type(). Am I right?
>>
>> No, if at all possible you should avoid open-coding anything. As said,
>> simply passing PGT_none to get_page_and_type() ought to work (again, as
>> said, unless I'm overlooking something). share_xen_page_with_guest()
>> can do what it does because the page isn't owned yet. For a page with
>> owner you may not fiddle with type_info in such an open-coded manner.
>>
> 
> Thanks. I got the following bug when passing PGT_none:
> 
> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> (XEN) Xen BUG at mm.c:2643

The caller of the function needs to avoid the call not only for writable
and shared pages, but also for this new case of PGT_none.

Jan




* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-16 15:15       ` Jan Beulich
@ 2023-02-20 16:51         ` Matias Ezequiel Vara Larsen
  2023-02-21  8:48           ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-20 16:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Thu, Feb 16, 2023 at 04:15:29PM +0100, Jan Beulich wrote:
> On 16.02.2023 16:07, Matias Ezequiel Vara Larsen wrote:
> > On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
> >> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
> >>>      }
> >>>  
> >>>      v->runstate.state = new_state;
> >>> +
> >>> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> >>> +    if ( vcpustats_va )
> >>> +    {
> >>> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> >>> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> >>> +        smp_wmb();
> >>> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> >>> +               &v->runstate.time[RUNSTATE_running],
> >>> +               sizeof(v->runstate.time[RUNSTATE_running]));
> >>> +        smp_wmb();
> >>> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> >>> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> >>> +    }
> >>
> >> A further aspect to consider here is cache line ping-pong. I think the
> >> per-vCPU elements of the array want to be big enough to not share a
> >> cache line. The interface being generic this presents some challenge
> >> in determining what the supposed size is to be. However, taking into
> >> account the extensibility question, maybe the route to take is to
> >> simply settle on a power-of-2 value somewhere between x86'es and Arm's
> >> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
> >> or 1k?
> >>
> > 
> > I do not know how to address this. I was thinking to align each vcpu_stats
> > instance to a multiple of the cache-line. I would pick up the first multiple
> > that is bigger than the size of the vcpu_stats structure. For example, currently
> > the structure is 16 bytes so I would align each instance in a frame to 64
> > bytes. Would it make sense? 
> 
> Well, 64 may be an option, but I gave higher numbers for a reason. One thing
> I don't know is what common cache line sizes are on Arm or e.g. RISC-V.

Thanks. I found that structures that require cache-alignment are defined with
"__cacheline_aligned", which uses L1_CACHE_BYTES. On x86, for example, this
aligns to 128 bytes. What is the reason to use a higher value like 512 bytes or
1k?

Thanks, Matias. 



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-20 16:51         ` Matias Ezequiel Vara Larsen
@ 2023-02-21  8:48           ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-02-21  8:48 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 20.02.2023 17:51, Matias Ezequiel Vara Larsen wrote:
> On Thu, Feb 16, 2023 at 04:15:29PM +0100, Jan Beulich wrote:
>> On 16.02.2023 16:07, Matias Ezequiel Vara Larsen wrote:
>>> On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>>>>>      }
>>>>>  
>>>>>      v->runstate.state = new_state;
>>>>> +
>>>>> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
>>>>> +    if ( vcpustats_va )
>>>>> +    {
>>>>> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
>>>>> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>>>> +        smp_wmb();
>>>>> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
>>>>> +               &v->runstate.time[RUNSTATE_running],
>>>>> +               sizeof(v->runstate.time[RUNSTATE_running]));
>>>>> +        smp_wmb();
>>>>> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
>>>>> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>>>> +    }
>>>>
>>>> A further aspect to consider here is cache line ping-pong. I think the
>>>> per-vCPU elements of the array want to be big enough to not share a
>>>> cache line. The interface being generic this presents some challenge
>>>> in determining what the supposed size is to be. However, taking into
>>>> account the extensibility question, maybe the route to take is to
>>>> simply settle on a power-of-2 value somewhere between x86'es and Arm's
>>>> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
>>>> or 1k?
>>>>
>>>
> >>> I do not know how to address this. I was thinking to align each vcpu_stats
> >>> instance to a multiple of the cache-line. I would pick up the first multiple
> >>> that is bigger than the size of the vcpu_stats structure. For example, currently
>>> the structure is 16 bytes so I would align each instance in a frame to 64
>>> bytes. Would it make sense? 
>>
>> Well, 64 may be an option, but I gave higher numbers for a reason. One thing
>> I don't know is what common cache line sizes are on Arm or e.g. RISC-V.
> 
> Thanks. I found that structures that require cache-alignment are defined with
> "__cacheline_aligned", which uses L1_CACHE_BYTES. On x86, for example, this
> aligns to 128 bytes. What is the reason to use a higher value like 512 bytes or
> 1k?

You cannot bake an x86 property (which may even change: at some point we may
choose to drop the 128-byte special for the very few CPUs actually using
such, when the majority uses 64-byte cache lines) into the public interface.
You also don't want to make an aspect of the public interface arch-dependent
when not really needed. My suggestion for a higher value was in the hope that
we may never see a port to an architecture with cache lines wider than, say,
512 bytes. What exactly the value should be is of course up for discussion,
but I think it wants to include some slack on top of what we currently
support (arch-wise).

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-17 14:10             ` Jan Beulich
@ 2023-02-23 12:16               ` Matias Ezequiel Vara Larsen
  2023-02-23 12:42                 ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-23 12:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> > On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> >> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> >>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >>>> On 14.12.2022 08:29, Jan Beulich wrote:
> >>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>>>> +{
> >>>>>> +    struct page_info *pg;
> >>>>>> +
> >>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>>>
> >>>>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>>>> need to guess the MFN, but that's no excuse).
> >>>>
> >>>> Which might be tolerable if it then can't write to the page. That would
> >>>> require "locking" the page r/o (from guest pov), which ought to be
> >>>> possible by leveraging a variant of what share_xen_page_with_guest()
> >>>> does: It marks pages PGT_none with a single type ref. This would mean
> >>>> ...
> >>>>
> >>>>>> +    if ( !pg )
> >>>>>> +        return -ENOMEM;
> >>>>>> +
> >>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>>>
> >>>> ... using PGT_none here. Afaict this _should_ work, but we have no
> >>>> precedent of doing so in the tree, and I may be overlooking something
> >>>> which prevents that from working.
> >>>>
> >>>
> >>> I do not fully understand this. I checked share_xen_page_with_guest() and I
> >>> think you're talking about doing something like this for each allocated page to
> >>> set them ro from a pv guest pov:
> >>>
> >>> pg->u.inuse.type_info = PGT_none;
> >>> pg->u.inuse.type_info |= PGT_validated | 1;
> >>> page_set_owner(page, d); // not sure if this is needed
> >>>
> >>> Then, I should use PGT_none instead of PGT_writable_page in
> >>> get_page_and_type(). Am I right?
> >>
> >> No, if at all possible you should avoid open-coding anything. As said,
> >> simply passing PGT_none to get_page_and_type() ought to work (again, as
> >> said, unless I'm overlooking something). share_xen_page_with_guest()
> >> can do what it does because the page isn't owned yet. For a page with
> >> owner you may not fiddle with type_info in such an open-coded manner.
> >>
> > 
> > Thanks. I got the following bug when passing PGT_none:
> > 
> > (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> > (XEN) Xen BUG at mm.c:2643
> 
> The caller of the function needs to avoid the call not only for writable
> and shared pages, but also for this new case of PGT_none.

Thanks. If I understand correctly, _get_page_type() needs to avoid calling
validate_page() when type == PGT_none. For writable and shared pages, this
is avoided by setting nx |= PGT_validated. Am I right?

Matias



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-23 12:16               ` Matias Ezequiel Vara Larsen
@ 2023-02-23 12:42                 ` Jan Beulich
  2023-03-07 14:44                   ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-02-23 12:42 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
> On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
>> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
>>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
>>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
>>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
>>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>>>>>> +{
>>>>>>>> +    struct page_info *pg;
>>>>>>>> +
>>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>>>>>
>>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>>>>>> need to guess the MFN, but that's no excuse).
>>>>>>
>>>>>> Which might be tolerable if it then can't write to the page. That would
>>>>>> require "locking" the page r/o (from guest pov), which ought to be
>>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
>>>>>> does: It marks pages PGT_none with a single type ref. This would mean
>>>>>> ...
>>>>>>
>>>>>>>> +    if ( !pg )
>>>>>>>> +        return -ENOMEM;
>>>>>>>> +
>>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>>>>>
>>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
>>>>>> precedent of doing so in the tree, and I may be overlooking something
>>>>>> which prevents that from working.
>>>>>>
>>>>>
>>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
>>>>> think you're talking about doing something like this for each allocated page to
>>>>> set them ro from a pv guest pov:
>>>>>
>>>>> pg->u.inuse.type_info = PGT_none;
>>>>> pg->u.inuse.type_info |= PGT_validated | 1;
>>>>> page_set_owner(page, d); // not sure if this is needed
>>>>>
>>>>> Then, I should use PGT_none instead of PGT_writable_page in
>>>>> get_page_and_type(). Am I right?
>>>>
>>>> No, if at all possible you should avoid open-coding anything. As said,
>>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
>>>> said, unless I'm overlooking something). share_xen_page_with_guest()
>>>> can do what it does because the page isn't owned yet. For a page with
>>>> owner you may not fiddle with type_info in such an open-coded manner.
>>>>
>>>
>>> Thanks. I got the following bug when passing PGT_none:
>>>
>>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
>>> (XEN) Xen BUG at mm.c:2643
>>
>> The caller of the function needs to avoid the call not only for writable
>> and shared pages, but also for this new case of PGT_none.
> 
> Thanks. If I understand correctly, _get_page_type() needs to avoid calling
> validate_page() when type == PGT_none.

Yes.

> For the writable and shared pages, this
> is avoided by setting nx |= PGT_validated. Am I right?

Well, no, I wouldn't describe it like that. The two (soon three) types not
requiring validation simply set the flag without calling validate_page().

Jan



* Re: [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool
  2022-10-07 12:39 ` [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool Matias Ezequiel Vara Larsen
@ 2023-02-23 16:01   ` Andrew Cooper
  2023-02-23 20:31     ` Julien Grall
  2023-02-24 15:31     ` Matias Ezequiel Vara Larsen
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Cooper @ 2023-02-23 16:01 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen, xen-devel
  Cc: Matias Ezequiel Vara Larsen, Wei Liu, Anthony PERARD

On 07/10/2022 1:39 pm, Matias Ezequiel Vara Larsen wrote:

A couple of observations, all unrelated to the stats themselves.

Although overall, I'm not entirely certain that a tool like this is
going to be very helpful after initial development.  Something to
consider would be to alter libxenstat to use this new interface?

> diff --git a/tools/misc/Makefile b/tools/misc/Makefile
> index 2b683819d4..837e4b50da 100644
> --- a/tools/misc/Makefile
> +++ b/tools/misc/Makefile
> @@ -49,6 +49,7 @@ TARGETS_COPY += xenpvnetboot
>
> # Everything which needs to be built
> TARGETS_BUILD := $(filter-out $(TARGETS_COPY),$(TARGETS_ALL))
> +TARGETS_BUILD += xen-vcpus-stats

This patch is whitespace corrupted.  If at all possible, you need to see
about getting `git send-email` working to send patches with, as it deals
with most of the whitespace problems for you.

I'm afraid you can't simply copy the patch text into an email and send that.

>
> # ... including build-only targets
> TARGETS_BUILD-$(CONFIG_X86)    += xen-vmtrace
> @@ -135,4 +136,9 @@ xencov: xencov.o
> xen-ucode: xen-ucode.o
>     $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
>
> +xen-vcpus-stats.o: CFLAGS += $(CFLAGS_libxenforeginmemory)
> +
> +xen-vcpus-stats: xen-vcpus-stats.o
> +    $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl)
> $(LDLIBS_libxenforeignmemory) $(APPEND_LDFLAGS)
> +
> -include $(DEPS_INCLUDE)
> diff --git a/tools/misc/xen-vcpus-stats.c b/tools/misc/xen-vcpus-stats.c
> new file mode 100644
> index 0000000000..29d0efb124
> --- /dev/null
> +++ b/tools/misc/xen-vcpus-stats.c
> @@ -0,0 +1,87 @@
> +#include <err.h>
> +#include <errno.h>
> +#include <error.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <signal.h>
> +
> +#include <xenctrl.h>
> +#include <xenforeignmemory.h>
> +#include <xen/vcpu.h>
> +
> +#define rmb()   asm volatile("lfence":::"memory")

This is rmb(), but rmb() isn't what you want.

You want smp_rmb(), which is

#define smp_rmb() asm volatile ("" ::: "memory")


I'm surprised we haven't got this in a common location, considering how
often it goes wrong.  (Doesn't help that there's plenty of buggy
examples to copy, even in xen.git)

> +
> +static sig_atomic_t interrupted;
> +static void close_handler(int signum)
> +{
> +    interrupted = 1;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    xenforeignmemory_handle *fh;
> +    xenforeignmemory_resource_handle *res;
> +    size_t size;
> +    int rc, domid, period, vcpu;
> +    shared_vcpustatspage_t * info;

shared_vcpustatspage_t *info;

no space after the *.

But you also cannot have a single structure describing that.  I'll reply
to the cover letter discussing ABIs.

> +    struct sigaction act;
> +    uint32_t version;
> +    uint64_t value;
> +
> +    if (argc != 4 ) {

{ on a new line.

> +        fprintf(stderr, "Usage: %s <domid> <vcpu> <period>\n", argv[0]);
> +        return 1;
> +    }
> +
> +    domid = atoi(argv[1]);
> +    vcpu = atoi(argv[2]);
> +    period = atoi(argv[3]);
> +
> +    act.sa_handler = close_handler;
> +    act.sa_flags = 0;
> +    sigemptyset(&act.sa_mask);
> +    sigaction(SIGHUP,  &act, NULL);
> +    sigaction(SIGTERM, &act, NULL);
> +    sigaction(SIGINT,  &act, NULL);
> +    sigaction(SIGALRM, &act, NULL);
> +
> +    fh = xenforeignmemory_open(NULL, 0);
> +
> +    if ( !fh )
> +        err(1, "xenforeignmemory_open");
> +
> +    rc = xenforeignmemory_resource_size(
> +        fh, domid, XENMEM_resource_stats_table,
> +        0, &size);
> +
> +    if ( rc )
> +        err(1, "Fail: Get size");
> +
> +    res = xenforeignmemory_map_resource(
> +        fh, domid, XENMEM_resource_stats_table,
> +        0, XENMEM_resource_stats_frame_vcpustats, size >> XC_PAGE_SHIFT,
> +        (void **)&info, PROT_READ, 0);
> +
> +    if ( !res )
> +        err(1, "Fail: Map");
> +
> +    while ( !interrupted ) {

{ on newline again.

> +        sleep(period);
> +        do {
> +            version = info->vcpu_info[vcpu].version;
> +            rmb();
> +            value = info->vcpu_info[vcpu].runstate_running_time;
> +            rmb();
> +        } while ((info->vcpu_info[vcpu].version & 1) ||
> +                (version != info->vcpu_info[vcpu].version));

So I think this will function correctly.

But I do recall seeing a rather nice way of wrapping a sequence lock in
C99.  I'll see if I can find it.

> +        printf("running_vcpu_time[%d]: %ld\n", vcpu, value);
> +    }
> +
> +    rc = xenforeignmemory_unmap_resource(fh, res);
> +    if ( rc )
> +        err(1, "Fail: Unmap");

Given that you unmap(), you ought to close the fh handle too.

~Andrew

> +
> +    return 0;
> +}




* API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2022-10-07 12:39 [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Matias Ezequiel Vara Larsen
  2022-10-07 12:39 ` [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type Matias Ezequiel Vara Larsen
  2022-10-07 12:39 ` [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool Matias Ezequiel Vara Larsen
@ 2023-02-23 19:56 ` Andrew Cooper
  2023-03-06 14:23   ` Matias Ezequiel Vara Larsen
  2 siblings, 1 reply; 37+ messages in thread
From: Andrew Cooper @ 2023-02-23 19:56 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen, xen-devel
  Cc: Matias Ezequiel Vara Larsen, Jan Beulich, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD

A discussion about forward extensible APIs and ABIs here.

First, its important to say that this should be "domain stats" and not
just "vcpu stats".  This is easy to account for now, but hard to
retrofit later.

For the shared memory, we have a minimum granularity of PAGE_SIZE (4k
for now, but this will change in due course on non-x86 architectures),
and a resource-agnostic way of determining the resource size (as a
multiple of the page size).

But there are other things we need to consider:

1) To be sensibly extendable, there needs to be some metadata, and the
domain stats is clearly going to be a different shape to the vcpu stats.

2) We also want to give Xen some flexibility to allocate memory in a
suitable manner for the system.

3) Xen and the userspace consuming this interface will likely be built
from different versions of the header.  This needs to inter-operate with
the common subset of functionality.


So what we want, at least to describe the shape, is something like this:

struct dom_shmem_stats {
    uint32_t dom_size; // sizeof(dom)
    uint32_t vcpu_offs;
    uint32_t vcpu_size; // sizeof(vcpu)
    uint32_t vcpu_stride;
    ...
};

struct vcpu_shmem_stats {
    ...
};

where the data layout shall be that there is one dom structure starting
at 0, and an array of vcpu_stride objects starting at vcpu_offset.

Obviously, some invariants apply.  vcpu_offs >= dom_size, and
vcpu_stride >= vcpu_size.  The total size of the mapping must be larger
than vcpu_offs + vcpus * vcpu_stride  (and no, I intentionally don't
care about the rounding at the end because it doesn't change things in
practice.)

A very simple layout, packing the data as closely as reasonable, might be:

vcpu_offs = roundup(sizeof(dom), cacheline_size)
vcpu_stride = roundup(sizeof(vcpu), cacheline_size);

but Xen might have reason to use some other rounding.  As the dom or
vcpu size approaches a page, then Xen will want to change allocation
scheme to use page size for both, and not vmap the lot as one
consecutive region.



For the stats data itself, there wants to be some indication of what
data Xen is producing, so userspace can know not to bother consuming. 
So something like:

struct $foo_stats {
    ...

#define STATS_A (1u << 0)
#define STATS_B (1u << 2)
    uint32_t stats_active;
   
    struct $foo_stats_a {
        uint32_t single_field;
        ... // maybe other singleton fields
    };

    struct $foo_stats_b {
        uint32_t seq;  // "seq" is more common than "version"
        uint32_t _pad;
        uint64_t field1;
        uint64_t field2;
        ...
    };
};


Forward compatibility rules say that you can only ever append new
fields.  But as hinted at with the stats_active field, its fine to leave
reserved fields around for future use, generally with a rule that
anything reserved shall be 0.

Importantly, this means that offsetof(dom, stats_b) is fixed, and will
inter-operate just fine if e.g. userspace knows about a stats_c that Xen
doesn't know about.

But this does highlight some more invariants.  Xen must not produce any
data outside of [0, vcpu_offs) for dom data, and [base, vcpu_stride) for
vcpu data.

Furthermore, Xen should not produce any data not indicated by the
stats_active field.  That said, if Xen is compiled knowing about
dom->stats_c, and userspace is not, then userspace will observe Xen
advertising a stat it doesn't know, and producing data beyond
userspace's sizeof(dom), but within dom->vcpu_offs.  This is explicitly
fine and expected, and why Xen writes its sizeof() information into the
dom header.  This allows both sides to agree on the layout even when
they're not compiled from identical copies of the header.



A few closing thoughts.

1) It is wise to have a magic number at the head of each dom and vcpu
struct.  This helps sanity check that both sides have correctly agreed
on the layout, but can also serve double duty as an ABI "version".  If
we screw up spectacularly, and decide that the best course of action is
to break backwards compatibility, then we can change the magic and edit
the structs in a non-forwards-compatible way.

2) We may get to a point of wanting arch specific stats.  This can be
accommodated in the model by having struct arch_{dom,vcpu}_stats at
positions described by the layout at the start of dom.  It would be wise
to leave space (reserved fields) there to use if necessary.  This is
cleaner than deciding that we need to put some new layout fields after
the latest $foo_stats_$N and before $foo_stats_$M.

3) It would be great if we could have something in tools/tests/ which
can attach to a running VM and assess the correctness of the invariants
given.  Better yet if it could compile for each change of the ABI and
assess the correctness for all.


I hope this all makes sense.  I know it's not trivial, but there's also
nothing in here which is rocket science, and with a bit of good design
work up front, it will be a flexible interface that we never have to
break backwards compatibility with.

~Andrew



* Re: [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool
  2023-02-23 16:01   ` Andrew Cooper
@ 2023-02-23 20:31     ` Julien Grall
  2023-03-17 11:01       ` Matias Ezequiel Vara Larsen
  2023-02-24 15:31     ` Matias Ezequiel Vara Larsen
  1 sibling, 1 reply; 37+ messages in thread
From: Julien Grall @ 2023-02-23 20:31 UTC (permalink / raw)
  To: Andrew Cooper, Matias Ezequiel Vara Larsen, xen-devel
  Cc: Matias Ezequiel Vara Larsen, Wei Liu, Anthony PERARD

Hi,

On 23/02/2023 16:01, Andrew Cooper wrote:
> On 07/10/2022 1:39 pm, Matias Ezequiel Vara Larsen wrote:
> 
> A couple of observations, all unrelated to the stats themselves.
> 
> Although overall, I'm not entirely certain that a tool like this is
> going to be very helpful after initial development.  Something to
> consider would be to alter libxenstat to use this new interface?
> 
>> diff --git a/tools/misc/Makefile b/tools/misc/Makefile
>> index 2b683819d4..837e4b50da 100644
>> --- a/tools/misc/Makefile
>> +++ b/tools/misc/Makefile
>> @@ -49,6 +49,7 @@ TARGETS_COPY += xenpvnetboot
>>
>> # Everything which needs to be built
>> TARGETS_BUILD := $(filter-out $(TARGETS_COPY),$(TARGETS_ALL))
>> +TARGETS_BUILD += xen-vcpus-stats
> 
> This patch is whitespace corrupted.  If at all possible, you need to see
> about getting `git send-email` working to send patches with, as it deals
> with most of the whitespace problems for you.
> 
> I'm afraid you can't simply copy the patch text into an email and send that.
> 
>>
>> # ... including build-only targets
>> TARGETS_BUILD-$(CONFIG_X86)    += xen-vmtrace
>> @@ -135,4 +136,9 @@ xencov: xencov.o
>> xen-ucode: xen-ucode.o
>>      $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
>>
>> +xen-vcpus-stats.o: CFLAGS += $(CFLAGS_libxenforeginmemory)
>> +
>> +xen-vcpus-stats: xen-vcpus-stats.o
>> +    $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl)
>> $(LDLIBS_libxenforeignmemory) $(APPEND_LDFLAGS)
>> +
>> -include $(DEPS_INCLUDE)
>> diff --git a/tools/misc/xen-vcpus-stats.c b/tools/misc/xen-vcpus-stats.c
>> new file mode 100644
>> index 0000000000..29d0efb124
>> --- /dev/null
>> +++ b/tools/misc/xen-vcpus-stats.c
>> @@ -0,0 +1,87 @@
>> +#include <err.h>
>> +#include <errno.h>
>> +#include <error.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/mman.h>
>> +#include <signal.h>
>> +
>> +#include <xenctrl.h>
>> +#include <xenforeignmemory.h>
>> +#include <xen/vcpu.h>
>> +
>> +#define rmb()   asm volatile("lfence":::"memory")
> 
> This is rmb(), but rmb() isn't what you want.
> 
> You want smp_rmb(), which is
> 
> #define smp_rmb() asm volatile ("" ::: "memory")

 From the generic PoV, I find smp_rmb() a bit misleading because it is 
not clear in this context whether we are referring to the SMP-ness of 
the hypervisor or the tools domain.

If the latter, then technically it could be a uniprocessor domain, and one 
could argue that for Arm the barrier could be downgraded to just a compiler 
barrier.

AFAICT, this would not be the case here because we are getting data from 
Xen. So we always need a "dmb ish".

So I would suggest naming it virt_*() (to match Linux's naming).

Also, is this tool meant to be arch-agnostic? If so, then we need to 
introduce the proper barrier for the other architectures.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool
  2023-02-23 16:01   ` Andrew Cooper
  2023-02-23 20:31     ` Julien Grall
@ 2023-02-24 15:31     ` Matias Ezequiel Vara Larsen
  1 sibling, 0 replies; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-02-24 15:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Wei Liu, Anthony PERARD

Hello Andrew and thanks for the comments,

On Thu, Feb 23, 2023 at 04:01:09PM +0000, Andrew Cooper wrote:
> On 07/10/2022 1:39 pm, Matias Ezequiel Vara Larsen wrote:
> 
> A couple of observations, all unrelated to the stats themselves.
> 
> Although overall, I'm not entirely certain that a tool like this is
> going to be very helpful after initial development.  Something to
> consider would be to alter libxenstat to use this new interface?
> 

Yes. We discussed this in a design session at the summit. I have not been able
to move forward in that direction yet, but it is the right way to go. I use
this tool only to exercise the interface, so I could simply drop it from the
RFC in future versions.

> > diff --git a/tools/misc/Makefile b/tools/misc/Makefile
> > index 2b683819d4..837e4b50da 100644
> > --- a/tools/misc/Makefile
> > +++ b/tools/misc/Makefile
> > @@ -49,6 +49,7 @@ TARGETS_COPY += xenpvnetboot
> >
> > # Everything which needs to be built
> > TARGETS_BUILD := $(filter-out $(TARGETS_COPY),$(TARGETS_ALL))
> > +TARGETS_BUILD += xen-vcpus-stats
> 
> This patch is whitespace corrupted.  If at all possible, you need to see
> about getting `git send-email` working to send patches with, as it deals
> with most of the whitespace problems for you.
> 
> I'm afraid you can't simply copy the patch text into an email and send that.
> 

I am using `git send-email` to send patches. I may have missed some flag.
I'll double-check. 

> >
> > # ... including build-only targets
> > TARGETS_BUILD-$(CONFIG_X86)    += xen-vmtrace
> > @@ -135,4 +136,9 @@ xencov: xencov.o
> > xen-ucode: xen-ucode.o
> >     $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
> >
> > +xen-vcpus-stats.o: CFLAGS += $(CFLAGS_libxenforeginmemory)
> > +
> > +xen-vcpus-stats: xen-vcpus-stats.o
> > +    $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl)
> > $(LDLIBS_libxenforeignmemory) $(APPEND_LDFLAGS)
> > +
> > -include $(DEPS_INCLUDE)
> > diff --git a/tools/misc/xen-vcpus-stats.c b/tools/misc/xen-vcpus-stats.c
> > new file mode 100644
> > index 0000000000..29d0efb124
> > --- /dev/null
> > +++ b/tools/misc/xen-vcpus-stats.c
> > @@ -0,0 +1,87 @@
> > +#include <err.h>
> > +#include <errno.h>
> > +#include <error.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <signal.h>
> > +
> > +#include <xenctrl.h>
> > +#include <xenforeignmemory.h>
> > +#include <xen/vcpu.h>
> > +
> > +#define rmb()   asm volatile("lfence":::"memory")
> 
> This is rmb(), but rmb() isn't what you want.
> 
> You want smp_rmb(), which is
> 
> #define smp_rmb() asm volatile ("" ::: "memory")
> 
> 
> I'm surprised we haven't got this in a common location, considering how
> often it goes wrong.  (Doesn't help that there's plenty of buggy
> examples to copy, even in xen.git)
> 

Got it. I'll rework it in the next version. For inspiration, I had used the
code at arch/x86/kernel/pvclock.c:pvclock_read_wallclock().

> > +
> > +static sig_atomic_t interrupted;
> > +static void close_handler(int signum)
> > +{
> > +    interrupted = 1;
> > +}
> > +
> > +int main(int argc, char **argv)
> > +{
> > +    xenforeignmemory_handle *fh;
> > +    xenforeignmemory_resource_handle *res;
> > +    size_t size;
> > +    int rc, domid, period, vcpu;
> > +    shared_vcpustatspage_t * info;
> 
> shared_vcpustatspage_t *info;
> 
> no space after the *.
> 
> But you also cannot have a single structure describing that.  I'll reply
> to the cover letter discussing ABIs.

I am reading it and will comment on it soon.

> 
> > +    struct sigaction act;
> > +    uint32_t version;
> > +    uint64_t value;
> > +
> > +    if (argc != 4 ) {
> 
> { on a new line.
> 
> > +        fprintf(stderr, "Usage: %s <domid> <vcpu> <period>\n", argv[0]);
> > +        return 1;
> > +    }
> > +
> > +    domid = atoi(argv[1]);
> > +    vcpu = atoi(argv[2]);
> > +    period = atoi(argv[3]);
> > +
> > +    act.sa_handler = close_handler;
> > +    act.sa_flags = 0;
> > +    sigemptyset(&act.sa_mask);
> > +    sigaction(SIGHUP,  &act, NULL);
> > +    sigaction(SIGTERM, &act, NULL);
> > +    sigaction(SIGINT,  &act, NULL);
> > +    sigaction(SIGALRM, &act, NULL);
> > +
> > +    fh = xenforeignmemory_open(NULL, 0);
> > +
> > +    if ( !fh )
> > +        err(1, "xenforeignmemory_open");
> > +
> > +    rc = xenforeignmemory_resource_size(
> > +        fh, domid, XENMEM_resource_stats_table,
> > +        0, &size);
> > +
> > +    if ( rc )
> > +        err(1, "Fail: Get size");
> > +
> > +    res = xenforeignmemory_map_resource(
> > +        fh, domid, XENMEM_resource_stats_table,
> > +        0, XENMEM_resource_stats_frame_vcpustats, size >> XC_PAGE_SHIFT,
> > +        (void **)&info, PROT_READ, 0);
> > +
> > +    if ( !res )
> > +        err(1, "Fail: Map");
> > +
> > +    while ( !interrupted ) {
> 
> { on newline again.
> 
> > +        sleep(period);
> > +        do {
> > +            version = info->vcpu_info[vcpu].version;
> > +            rmb();
> > +            value = info->vcpu_info[vcpu].runstate_running_time;
> > +            rmb();
> > +        } while ((info->vcpu_info[vcpu].version & 1) ||
> > +                (version != info->vcpu_info[vcpu].version));
> 
> So I think this will function correctly.
> 
> But I do recall seeing a rather nice way of wrapping a sequence lock in
> C99.  I'll see if I can find it.
> 
> > +        printf("running_vcpu_time[%d]: %ld\n", vcpu, value);
> > +    }
> > +
> > +    rc = xenforeignmemory_unmap_resource(fh, res);
> > +    if ( rc )
> > +        err(1, "Fail: Unmap");
> 
> Given that you unmap(), you ought to close the fh handle too.
> 

Thanks, I'll fix these issues in the next version. I think Jan's review has
already spotted some of them.

Matias



* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-02-23 19:56 ` API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Andrew Cooper
@ 2023-03-06 14:23   ` Matias Ezequiel Vara Larsen
  2023-03-07 10:12     ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-06 14:23 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Jan Beulich,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD

Hello Andrew and thanks for the comments, please find my comments below.

On Thu, Feb 23, 2023 at 07:56:28PM +0000, Andrew Cooper wrote:
> A discussion about forward extensible APIs and ABIs here.
> 
> First, its important to say that this should be "domain stats" and not
> just "vcpu stats".  This is easy to account for now, but hard to
> retrofit later.
> 
> For the shared memory, we have a minimum granularity of PAGE_SIZE (4k
> for now, but this will change in due course on non-x86 architectures),
> and a resource-agnostic way of determining the resource size (as a
> multiple of the page size).
> 
> But there are other things we need to consider:
> 
> 1) To be sensibly extendable, there needs to be some metadata, and the
> domain stats is clearly going to be a different shape to the vcpu stats.
> 
> 2) We also want to give Xen some flexibility to allocate memory in a
> suitable manner for the system.
> 
> 3) Xen and the userspace consuming this interface will likely be built
> from different versions of the header.  This needs to inter-operate with
> the common subset of functionality.
> 
> 
> So what we want, at least to describe the shape, is something like this:
> 
> struct dom_shmem_stats {
>     uint32_t dom_size; // sizeof(dom)
>     uint32_t vcpu_offs;
>     uint32_t vcpu_size; // sizeof(vcpu)
>     uint32_t vcpu_stride;
>     ...
> };
> 
> struct vcpu_shmem_stats {
>     ...
> };
> 
> where the data layout shall be that there is one dom structure starting
> at 0, and an array of vcpu_stride objects starting at vcpu_offset.
> 
> Obviously, some invariants apply.  vcpu_offs >= dom_size, and
> vcpu_stride >= vcpu_size.  The total size of the mapping must be larger
> than vcpu_offs + vcpus * vcpu_stride  (and no, I intentionally don't
> care about the rounding at the end because it doesn't change things in
> practice.)
> 

Would it make sense to use a different type-specific resource identifier for
each "stat"? E.g., XENMEM_resource_stats_table_id_vcpustats,
XENMEM_resource_stats_table_id_domstats, and so on. The size of each of these
type-specific resources would be queried with
`xenforeignmemory_resource_size()`, and the mapping would be done with
`xenforeignmemory_map_resource()`.

The metadata to represent the XENMEM_resource_stats_table_id_vcpustats
resource could be:

struct vcpu_shmem_stats {
#define STATS_A (1u << 0)
...
#define VCPU_STATS_MAGIC 0xaabbccdd
     uint32_t magic;
     uint32_t stats_active;
     uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
     uint32_t size;    // sizeof(vcpu_stats)
     uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
     uint32_t _pad;
     ...
};

struct vcpu_stats {
    /*
     * If the least-significant bit of the seq number is set then an update
     * is in progress and the consumer must wait to read a consistent set of
     * values. This mechanism is similar to Linux's seqlock.
     */
    uint32_t seq;
    uint32 _pad;
    uint64_t stats_a; // e.g., runstate_running_time
    ...
};

The data layout shall be that there is one vcpu_shmem_stats structure starting
at 0, and an array of stride objects starting at offset, i.e., vcpu_stats
structures. The invariants would be:
* 1. offset >= sizeof(struct vcpu_shmem_stats)
* 2. stride >= size
* 3. the total size of the mapping, i.e., the number of frames returned by
  resource_size() times the page size, must be larger than (offset + vcpus * stride).
* 4. Xen must not produce any data outside of [base, base + stride) for each vcpu's data.

The steps to add a new counter B would be:
1. append the new field at vcpu_stats structure.
2. define the bit in stats_active, i.e., #define STATS_B (1u << 1)
3. advertise STATS_B
I may be missing some steps here but that would be the overall process.

Regarding your email, I have the following comments:

- I still do not know how to choose the value of cacheline_size. I understand
this value shall be between the actual cacheline_size and PAGE_SIZE. A value
that could match most architectures could be 256 bytes.

- Xen shall use the "stats_active" field to indicate what it is producing. In
  this field, reserved bits shall be 0. This shall allow us to agree on the
layout even when producer and consumer are compiled with different headers.

- In the vcpu_stats structure, new fields can only ever be appended.

- The magic field shall act as a sanity check, but also as an ABI version in
case we decide to break backward compatibility.

> A very simple layout, packing the data as closely as reasonable, might be:
> 
> vcpu_offs = roundup(sizeof(dom), cacheline_size)
> vcpu_stride = roundup(sizeof(vcpu), cacheline_size);
> 
> but Xen might have reason to use some other rounding.  As the dom or
> vcpu size approaches a page, then Xen will want to change allocation
> scheme to use page size for both, and not vmap the lot as one
> consecutive region.
> 
> 
> 
> For the stats data itself, there wants to be some indication of what
> data Xen is producing, so userspace can know not to bother consuming. 
> So something like:
> 
> struct $foo_stats {
>     ...
> 
> #define STATS_A (1u << 0)
> #define STATS_B (1u << 2)
>     uint32_t stats_active;
>    
>     struct $foo_stats_a {
>         uint32_t single_field;
>         ... // maybe other singleton fields
>     };
> 
>     struct $foo_stats_b {
>         uint32_t seq;  // "seq" is more common than "version"
>         uint32_t _pad;
>         uint64_t field1;
>         uint64_t field2;
>         ...
>     };
> };
> 
> 
> Forward compatibility rules say that you can only ever append new
> fields.  But as hinted at with the stats_active field, its fine to leave
> reserved fields around for future use, generally with a rule that
> anything reserved shall be 0.
> 
> Importantly, this means that offsetof(dom, stats_b) is fixed, and will
> inter-operate just fine if e.g. userspace knows about a stats_c that Xen
> doesn't know about.
> 
> But this does highlight some more invariants.  Xen must not produce any
> data outside of [0, vcpu_offs) for dom data, and [base, vcpu_stride) for
> vcpu data.
> 
> Furthermore, Xen should not produce any data not indicated by the
> stats_active field.  That said, if Xen is compiled knowing about
> dom->stats_c, and userspace is not, then userspace will observe Xen
> advertising a stat it doesn't know, and producing data beyond
> userspace's sizeof(dom), but within dom->vcpu_offs.  This is explicitly
> fine and expected, and why Xen writes it's sizeof() information into the
> dom header.  This allows both sides to agree on the layout even when
> they're not compiled from identical copies of the header.
> 
> 
> 
> A few closing thoughts.
> 
> 1) It is wise to have a magic number at the head of each dom and vcpu
> struct.  This helps sanity check that both sides have correctly agreed
> on the layout, but can also serve double duty as an ABI "version".  If
> we screw up spectacularly, and decide that the best course of action is
> to break backwards compatibility, then we can change the magic and edit
> the structs in a non-forwards-compatible way.
> 
> 2) We may get to a point of wanting arch specific stats.  This can be
> accommodated in the model by having struct arch_{dom,vcpu}_stats at
> positions described by the layout at the start of dom.  It would be wise
> to leave space (reserved fields) there to use if necessary.  This is
> cleaner than deciding that we need to put some new layout fields after
> the latest $foo_stats_$N and before $foo_stats_$M.
> 

I have not addressed this yet. The vcpu_stats structure could have some fields
that are architecture-dependent. Those fields would be advertised via the
stats_active field.

> 3) It would be great if we could have something in tools/tests/ which
> can attach to a running VM and assess the correctness of the invariants
> given.  Better yet if it could compile for each change of the ABI and
> assess the correctness for all.
> 

I agree.

Feedback is welcome.

Thanks, Matias.



* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-06 14:23   ` Matias Ezequiel Vara Larsen
@ 2023-03-07 10:12     ` Jan Beulich
  2023-03-08 11:54       ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-03-07 10:12 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> Regarding your email, I have the following comments:
> 
> - I still do not know how to choose the value of cacheline_size. I understand
> this value shall be between the actual cacheline_size and PAGE_SIZE. A value
> that could match most architectures could be 256 bytes.

This isn't a concern anymore when offset and stride are stored in the
header. It was a concern when trying to come up with a layout where
these values were to be inferred (or known a priori).

> - Xen shall use the "stats_active" field to indicate what it is producing. In
>   this field, reserved bits shall be 0. This shall allow us to agree on the
> layout even when producer and consumer are compiled with different headers.

I wonder how well such a bitfield is going to scale. It provides for
only 32 (maybe 64) counters. Of course this may seem a lot right now,
but you never know how quickly something like this can grow. Plus
with ...

> - In the vcpu_stats structure, new fields can only ever be appended.

... this rule the only ambiguity that could arise to consumers when
no active flags existed would be that they can't tell "event never
occurred" from "hypervisor doesn't know about this anymore".

Jan



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-02-23 12:42                 ` Jan Beulich
@ 2023-03-07 14:44                   ` Matias Ezequiel Vara Larsen
  2023-03-07 16:55                     ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-07 14:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Thu, Feb 23, 2023 at 01:42:08PM +0100, Jan Beulich wrote:
> On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
> > On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
> >> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> >>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> >>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> >>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
> >>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>>>>>> +{
> >>>>>>>> +    struct page_info *pg;
> >>>>>>>> +
> >>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>>>>>
> >>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>>>>>> need to guess the MFN, but that's no excuse).
> >>>>>>
> >>>>>> Which might be tolerable if it then can't write to the page. That would
> >>>>>> require "locking" the page r/o (from guest pov), which ought to be
> >>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
> >>>>>> does: It marks pages PGT_none with a single type ref. This would mean
> >>>>>> ...
> >>>>>>
> >>>>>>>> +    if ( !pg )
> >>>>>>>> +        return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>>>>>
> >>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
> >>>>>> precedent of doing so in the tree, and I may be overlooking something
> >>>>>> which prevents that from working.
> >>>>>>
> >>>>>
> >>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
> >>>>> think you're talking about doing something like this for each allocated page to
> >>>>> set them ro from a pv guest pov:
> >>>>>
> >>>>> pg->u.inuse.type_info = PGT_none;
> >>>>> pg->u.inuse.type_info |= PGT_validated | 1;
> >>>>> page_set_owner(page, d); // not sure if this is needed
> >>>>>
> >>>>> Then, I should use PGT_none instead of PGT_writable_page in
> >>>>> get_page_and_type(). Am I right?
> >>>>
> >>>> No, if at all possible you should avoid open-coding anything. As said,
> >>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
> >>>> said, unless I'm overlooking something). share_xen_page_with_guest()
> >>>> can do what it does because the page isn't owned yet. For a page with
> >>>> owner you may not fiddle with type_info in such an open-coded manner.
> >>>>
> >>>
> >>> Thanks. I got the following bug when passing PGT_none:
> >>>
> >>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> >>> (XEN) Xen BUG at mm.c:2643
> >>
> >> The caller of the function needs to avoid the call not only for writable
> >> and shared pages, but also for this new case of PGT_none.
> > 
> > Thanks. If I understand correctly, _get_page_type() needs to avoid to call
> > validate_page() when type = PGT_none.
> 
> Yes.
> 
> > For the writable and shared pages, this
> > is avoided by setting nx |= PGT_validated. Am I right?
> 
> Well, no, I wouldn't describe it like that. The two (soon three) types not
> requiring validation simply set the flag without calling validate_page().
> 

I see, thanks. I added the corresponding check at _get_page_type() to set the
flag without calling validate_page() for the PGT_none type. I think I am
missing something when I am releasing the pages. I am triggering the following
BUG() when issuing put_page_and_type():
 
(XEN) Xen BUG at mm.c:2698

This is at devalidate_page(). I guess the call to devalidate_page() should
also be avoided. I was wondering whether put_page_and_type() is required in
this case.

Matias



* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-03-07 14:44                   ` Matias Ezequiel Vara Larsen
@ 2023-03-07 16:55                     ` Jan Beulich
  2023-03-09  9:22                       ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-03-07 16:55 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On 07.03.2023 15:44, Matias Ezequiel Vara Larsen wrote:
> On Thu, Feb 23, 2023 at 01:42:08PM +0100, Jan Beulich wrote:
>> On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
>>> On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
>>>> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
>>>>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
>>>>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
>>>>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>>>>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
>>>>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>>>>>>>> +{
>>>>>>>>>> +    struct page_info *pg;
>>>>>>>>>> +
>>>>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>>>>>>>
>>>>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>>>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>>>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>>>>>>>> need to guess the MFN, but that's no excuse).
>>>>>>>>
>>>>>>>> Which might be tolerable if it then can't write to the page. That would
>>>>>>>> require "locking" the page r/o (from guest pov), which ought to be
>>>>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
>>>>>>>> does: It marks pages PGT_none with a single type ref. This would mean
>>>>>>>> ...
>>>>>>>>
>>>>>>>>>> +    if ( !pg )
>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>>>>>>>
>>>>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
>>>>>>>> precedent of doing so in the tree, and I may be overlooking something
>>>>>>>> which prevents that from working.
>>>>>>>>
>>>>>>>
>>>>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
>>>>>>> think you're talking about doing something like this for each allocated page to
>>>>>>> set them ro from a pv guest pov:
>>>>>>>
>>>>>>> pg->u.inuse.type_info = PGT_none;
>>>>>>> pg->u.inuse.type_info |= PGT_validated | 1;
>>>>>>> page_set_owner(page, d); // not sure if this is needed
>>>>>>>
>>>>>>> Then, I should use PGT_none instead of PGT_writable_page in
>>>>>>> get_page_and_type(). Am I right?
>>>>>>
>>>>>> No, if at all possible you should avoid open-coding anything. As said,
>>>>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
>>>>>> said, unless I'm overlooking something). share_xen_page_with_guest()
>>>>>> can do what it does because the page isn't owned yet. For a page with
>>>>>> owner you may not fiddle with type_info in such an open-coded manner.
>>>>>>
>>>>>
>>>>> Thanks. I got the following bug when passing PGT_none:
>>>>>
>>>>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
>>>>> (XEN) Xen BUG at mm.c:2643
>>>>
>>>> The caller of the function needs to avoid the call not only for writable
>>>> and shared pages, but also for this new case of PGT_none.
>>>
>>> Thanks. If I understand correctly, _get_page_type() needs to avoid to call
>>> validate_page() when type = PGT_none.
>>
>> Yes.
>>
>>> For the writable and shared pages, this
>>> is avoided by setting nx |= PGT_validated. Am I right?
>>
>> Well, no, I wouldn't describe it like that. The two (soon three) types not
>> requiring validation simply set the flag without calling validate_page().
>>
> 
> I see, thanks. I added the corresponding check at _get_page_type() to set the
> flag without calling validate_page() for the PGT_none type. I think I am
> missing something when I am releasing the pages. I am triggering the following
> BUG() when issuing put_page_and_type():
>  
> (XEN) Xen BUG at mm.c:2698
> 
> This is at devalidate_page(). I guess the call to devalidate_page() shall be
> also avoided.

Well, yes, symmetry requires a change there as well. Here it's indirect:
You want to avoid the call to _put_final_page_type(). That's enclosed by
(nx & PGT_type_mask) <= PGT_l4_page_table, which happens to be true for
PGT_none as well. There may be more instances of such a comparison, so
it'll be necessary to find them and check whether they may now also be
reached with PGT_none (looks like a comparison against PGT_root_page_table
in _get_page_type() is also affected, albeit in a largely benign way).

> I was wondering if put_page_and_type() is required in this case.

It is, or some equivalent thereof. Again - see other examples where a
similar allocation pattern exists.

Jan



* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-07 10:12     ` Jan Beulich
@ 2023-03-08 11:54       ` Matias Ezequiel Vara Larsen
  2023-03-08 14:16         ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-08 11:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> > Regarding your email, I have the following comments:
> > 
> > - I still do not know how to choose the value of cacheline_size. I understand
> > this value shall be between the actual cacheline_size and PAGE_SIZE. A value
> > that could match most architectures could be 256 bytes.
> 
> This isn't a concern anymore when offset and stride are stored in the
> header. It was a concern when trying to come up with a layout where
> these value were to be inferred (or known a priori).
> 

Oh, I see. cacheline_size would then be decided at compile time for a given
arch, e.g., using SMP_CACHE_BYTES.

> > - Xen shall use the "stats_active" field to indicate what it is producing. In
> >   this field, reserved bits shall be 0. This shall allow us to agree on the
> > layout even when producer and consumer are compiled with different headers.
> 
> I wonder how well such a bitfield is going to scale. It provides for
> only 32 (maybe 64) counters. Of course this may seem a lot right now,
> but you never know how quickly something like this can grow. Plus
> with ...
> 

Would it make sense to define it like this?:

struct vcpu_shmem_stats {
#define STATS_A (1u << 0)
...
#define VCPU_STATS_MAGIC 0xaabbccdd
     uint32_t magic;
     uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
     uint32_t size;    // sizeof(vcpu_stats)
     uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
     uint32_t nr_stats; // size of stats_active in uint32_t
     uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
     ...
};
 
> > - In the vcpu_stats structure, new fields can only ever be appended.
> 
> ... this rule the only ambiguity that could arise to consumers when
> no active flags existed would be that they can't tell "event never
> occurred" from "hypervisor doesn't know about this anymore".

Do you mean how the consumer can figure out whether 1) Xen does not yet know
about some flag, or 2) the flag has been deprecated? I think 2) is the case
Andrew mentioned, in which the magic number could be used as an ABI version to
break backwards compatibility.

Thanks, Matias. 



* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-08 11:54       ` Matias Ezequiel Vara Larsen
@ 2023-03-08 14:16         ` Jan Beulich
  2023-03-09 10:38           ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-03-08 14:16 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
> On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
>> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
>>> - Xen shall use the "stats_active" field to indicate what it is producing. In
>>>   this field, reserved bits shall be 0. This shall allow us to agree on the
>>> layout even when producer and consumer are compiled with different headers.
>>
>> I wonder how well such a bitfield is going to scale. It provides for
>> only 32 (maybe 64) counters. Of course this may seem a lot right now,
>> but you never know how quickly something like this can grow. Plus
>> with ...
>>
> 
> Would it make sense to define it like this?:
> 
> struct vcpu_shmem_stats {
> #define STATS_A (1u << 0)
> ...
> #define VCPU_STATS_MAGIC 0xaabbccdd
>      uint32_t magic;
>      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
>      uint32_t size;    // sizeof(vcpu_stats)
>      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
>      uint32_t nr_stats; // size of stats_active in uint32_t
>      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
>      ...
> };

Possibly, but this would make it harder to use the interface. An alternative
might be to specify that an actual stats value set to ~0 marks an inactive
element. Since these are 64-bit counters, with today's and tomorrow's
computers we won't be at risk of a counter reaching a value of 2^^64-1, I
guess. And even if there was one where such a risk could not be excluded
(e.g. because of using pretty large increments), one would merely need to
make sure to saturate at 2^^64-2. Plus at such a point one would need to
consider anyway to switch to 128-bit fields, as neither saturated nor
wrapped values are of much use to consumers.
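
For illustration, a producer-side update following this scheme might look like
the sketch below (VCPU_STAT_INACTIVE and stat_update() are made-up names for
the example, not existing Xen code):

```c
#include <stdint.h>

/* ~0 (2^64 - 1) is never reached by a live counter, so it can serve as
 * the "inactive" marker; live counters saturate at 2^64 - 2. */
#define VCPU_STAT_INACTIVE UINT64_MAX

static void stat_update(uint64_t *counter, uint64_t inc)
{
    if ( *counter == VCPU_STAT_INACTIVE )
        return;                             /* inactive: keep the marker */
    if ( inc > (VCPU_STAT_INACTIVE - 1) - *counter )
        *counter = VCPU_STAT_INACTIVE - 1;  /* saturate at 2^64 - 2 */
    else
        *counter += inc;
}
```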

>>> - In the vcpu_stats structure, new fields can only ever be appended.
>>
>> ... this rule the only ambiguity that could arise to consumers when
>> no active flags existed would be that they can't tell "event never
>> occurred" from "hypervisor doesn't know about this anymore".
> 
> Do you mean how the consumer can figure out if either 1) Xen does not know yet
> about some flag or 2) the flag has been deprecated? I think 2) is the case that
> Andrew mentioned, in which the magic number could be used as an ABI version to
> break backwards compatibility.

No, an inactive field wouldn't break the ABI. An ABI change would be if
such an inactive field was actually removed from the array. Or if e.g.,
as per above, we needed to switch to 128-bit counters.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type
  2023-03-07 16:55                     ` Jan Beulich
@ 2023-03-09  9:22                       ` Matias Ezequiel Vara Larsen
  0 siblings, 0 replies; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-09  9:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Matias Ezequiel Vara Larsen, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, xen-devel

On Tue, Mar 07, 2023 at 05:55:26PM +0100, Jan Beulich wrote:
> On 07.03.2023 15:44, Matias Ezequiel Vara Larsen wrote:
> > On Thu, Feb 23, 2023 at 01:42:08PM +0100, Jan Beulich wrote:
> >> On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
> >>> On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
> >>>> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> >>>>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> >>>>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> >>>>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >>>>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
> >>>>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>>>>>>>> +{
> >>>>>>>>>> +    struct page_info *pg;
> >>>>>>>>>> +
> >>>>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>>>>>>>
> >>>>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>>>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>>>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>>>>>>>> need to guess the MFN, but that's no excuse).
> >>>>>>>>
> >>>>>>>> Which might be tolerable if it then can't write to the page. That would
> >>>>>>>> require "locking" the page r/o (from guest pov), which ought to be
> >>>>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
> >>>>>>>> does: It marks pages PGT_none with a single type ref. This would mean
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>>> +    if ( !pg )
> >>>>>>>>>> +        return -ENOMEM;
> >>>>>>>>>> +
> >>>>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>>>>>>>
> >>>>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
> >>>>>>>> precedent of doing so in the tree, and I may be overlooking something
> >>>>>>>> which prevents that from working.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
> >>>>>>> think you're talking about doing something like this for each allocated page to
> >>>>>>> set them ro from a pv guest pov:
> >>>>>>>
> >>>>>>> pg->u.inuse.type_info = PGT_none;
> >>>>>>> pg->u.inuse.type_info |= PGT_validated | 1;
> >>>>>>> page_set_owner(page, d); // not sure if this is needed
> >>>>>>>
> >>>>>>> Then, I should use PGT_none instead of PGT_writable_page in
> >>>>>>> get_page_and_type(). Am I right?
> >>>>>>
> >>>>>> No, if at all possible you should avoid open-coding anything. As said,
> >>>>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
> >>>>>> said, unless I'm overlooking something). share_xen_page_with_guest()
> >>>>>> can do what it does because the page isn't owned yet. For a page with
> >>>>>> owner you may not fiddle with type_info in such an open-coded manner.
> >>>>>>
> >>>>>
> >>>>> Thanks. I got the following bug when passing PGT_none:
> >>>>>
> >>>>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> >>>>> (XEN) Xen BUG at mm.c:2643
> >>>>
> >>>> The caller of the function needs to avoid the call not only for writable
> >>>> and shared pages, but also for this new case of PGT_none.
> >>>
> >>> Thanks. If I understand correctly, _get_page_type() needs to avoid calling
> >>> validate_page() when type = PGT_none.
> >>
> >> Yes.
> >>
> >>> For the writable and shared pages, this
> >>> is avoided by setting nx |= PGT_validated. Am I right?
> >>
> >> Well, no, I wouldn't describe it like that. The two (soon three) types not
> >> requiring validation simply set the flag without calling validate_page().
> >>
> > 
> > I see, thanks. I added the corresponding check at _get_page_type() to set the
> > flag without calling validate_page() for the PGT_none type. I think I am
> > missing something when I am releasing the pages. I am triggering the following
> > BUG() when issuing put_page_and_type():
> >  
> > (XEN) Xen BUG at mm.c:2698
> > 
> > This is at devalidate_page(). I guess the call to devalidate_page() shall
> > also be avoided.
> 
> Well, yes, symmetry requires a change there as well. Here it's indirect:
> You want to avoid the call to _put_final_page_type(). That's enclosed by
> (nx & PGT_type_mask) <= PGT_l4_page_table, which happens to be true for
> PGT_none as well. There may be more instances of such a comparison, so
> it'll be necessary to find them and check whether they may now also be
> reached with PGT_none (looks like a comparison against PGT_root_page_table
> in _get_page_type() is also affected, albeit in a largely benign way).
> 
Thanks. I could not find any other instances of that comparison except those
that you mention. I'll add a new patch to the series to deal with support
for PGT_none.

Matias 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-08 14:16         ` Jan Beulich
@ 2023-03-09 10:38           ` Matias Ezequiel Vara Larsen
  2023-03-09 11:50             ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-09 10:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On Wed, Mar 08, 2023 at 03:16:05PM +0100, Jan Beulich wrote:
> On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
> > On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
> >> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> >>> - Xen shall use the "stats_active" field to indicate what it is producing. In
> >>>   this field, reserved bits shall be 0. This shall allow us to agree on the
> >>> layout even when producer and consumer are compiled with different headers.
> >>
> >> I wonder how well such a bitfield is going to scale. It provides for
> >> only 32 (maybe 64) counters. Of course this may seem a lot right now,
> >> but you never know how quickly something like this can grow. Plus
> >> with ...
> >>
> > 
> > Would it make sense to define it like this?:
> > 
> > struct vcpu_shmem_stats {
> > #define STATS_A (1u << 0)
> > ...
> > #define VCPU_STATS_MAGIC 0xaabbccdd
> >      uint32_t magic;
> >      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
> >      uint32_t size;    // sizeof(vcpu_stats)
> >      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
> >      uint32_t nr_stats; // size of stats_active in uint32_t
> >      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
> >      ...
> > };
> 

The use of stats_active[] is meant to provide a bitmap that can scale, thus not
limiting the number of counters in the vcpu_stats structure to 32 or 64. I
can't see another way to have an unlimited number of counters, though.
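
For what it's worth, a consumer-side check against such a scalable bitmap
could be sketched as follows (stat_is_active() is a made-up helper; the
parameters follow the layout above, with nr_stats being the size of
stats_active[] in uint32_t):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check of whether counter n is marked active in the
 * flexible stats_active[] bitmap. */
static bool stat_is_active(const uint32_t *stats_active,
                           uint32_t nr_stats, unsigned int n)
{
    if ( n >= nr_stats * 32 )   /* counter unknown to this producer */
        return false;
    return stats_active[n / 32] & (1u << (n % 32));
}
```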

> Possibly, but this would make it harder to use the interface. An alternative
> might be to specify that an actual stats value set to ~0 marks an inactive
> element. Since these are 64-bit counters, with today's and tomorrow's
> computers we won't be at risk of a counter reaching a value of 2^^64-1, I
> guess. And even if there was one where such a risk could not be excluded
> (e.g. because of using pretty large increments), one would merely need to
> make sure to saturate at 2^^64-2. Plus at such a point one would need to
> consider anyway to switch to 128-bit fields, as neither saturated nor
> wrapped values are of much use to consumers.
> 

If I understand correctly, this use case is when an element in the stats_active
bitmap becomes inactive, i.e., its bit is set to "0" in stats_active[]. You are
proposing to set the actual stats value to ~0 to mark an inactive element. I
may be missing something here, but would it not be enough to set the
corresponding stats_active[] bit to "0"?

> >>> - In the vcpu_stats structure, new fields can only ever be appended.
> >>
> >> ... this rule the only ambiguity that could arise to consumers when
> >> no active flags existed would be that they can't tell "event never
> >> occurred" from "hypervisor doesn't know about this anymore".
> > 
> > Do you mean how the consumer can figure out if either 1) Xen does not know yet
> > about some flag or 2) the flag has been deprecated? I think 2) is the case that
> > Andrew mentioned, in which the magic number could be used as an ABI version to
> > break backwards compatibility.
> 
> No, an inactive field wouldn't break the ABI. An ABI change would be if
> such an inactive field was actually removed from the array. Or if e.g.,
> as per above, we needed to switch to 128-bit counters.

Got it, thanks.

Matias


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-09 10:38           ` Matias Ezequiel Vara Larsen
@ 2023-03-09 11:50             ` Jan Beulich
  2023-03-10 10:58               ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-03-09 11:50 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On 09.03.2023 11:38, Matias Ezequiel Vara Larsen wrote:
> On Wed, Mar 08, 2023 at 03:16:05PM +0100, Jan Beulich wrote:
>> On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
>>> On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
>>>> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
>>>>> - Xen shall use the "stats_active" field to indicate what it is producing. In
>>>>>   this field, reserved bits shall be 0. This shall allow us to agree on the
>>>>> layout even when producer and consumer are compiled with different headers.
>>>>
>>>> I wonder how well such a bitfield is going to scale. It provides for
>>>> only 32 (maybe 64) counters. Of course this may seem a lot right now,
>>>> but you never know how quickly something like this can grow. Plus
>>>> with ...
>>>>
>>>
>>> Would it make sense to define it like this?:
>>>
>>> struct vcpu_shmem_stats {
>>> #define STATS_A (1u << 0)
>>> ...
>>> #define VCPU_STATS_MAGIC 0xaabbccdd
>>>      uint32_t magic;
>>>      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
>>>      uint32_t size;    // sizeof(vcpu_stats)
>>>      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
>>>      uint32_t nr_stats; // size of stats_active in uint32_t
>>>      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
>>>      ...
>>> };
>>
> 
> The use of stats_active[] is meant to provide a bitmap that can scale, thus not
> limiting the number of counters in the vcpu_stats structure to 32 or 64. I
> can't see another way to have an unlimited number of counters, though.
> 
>> Possibly, but this would make it harder to use the interface. An alternative
>> might be to specify that an actual stats value set to ~0 marks an inactive
>> element. Since these are 64-bit counters, with today's and tomorrow's
>> computers we won't be at risk of a counter reaching a value of 2^^64-1, I
>> guess. And even if there was one where such a risk could not be excluded
>> (e.g. because of using pretty large increments), one would merely need to
>> make sure to saturate at 2^^64-2. Plus at such a point one would need to
>> consider anyway to switch to 128-bit fields, as neither saturated nor
>> wrapped values are of much use to consumers.
>>
> 
> If I understand correctly, this use case is when an element in the stats_active
> bitmap becomes inactive, i.e., its bit is set to "0" in stats_active[]. You are
> proposing to set the actual stats value to ~0 to mark an inactive element. I
> may be missing something here, but would it not be enough to set the
> corresponding stats_active[] bit to "0"?

The suggestion was to eliminate the need for stats_active[].

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-09 11:50             ` Jan Beulich
@ 2023-03-10 10:58               ` Matias Ezequiel Vara Larsen
  2023-03-10 11:34                 ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-10 10:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On Thu, Mar 09, 2023 at 12:50:18PM +0100, Jan Beulich wrote:
> On 09.03.2023 11:38, Matias Ezequiel Vara Larsen wrote:
> > On Wed, Mar 08, 2023 at 03:16:05PM +0100, Jan Beulich wrote:
> >> On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
> >>> On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
> >>>> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> >>>>> - Xen shall use the "stats_active" field to indicate what it is producing. In
> >>>>>   this field, reserved bits shall be 0. This shall allow us to agree on the
> >>>>> layout even when producer and consumer are compiled with different headers.
> >>>>
> >>>> I wonder how well such a bitfield is going to scale. It provides for
> >>>> only 32 (maybe 64) counters. Of course this may seem a lot right now,
> >>>> but you never know how quickly something like this can grow. Plus
> >>>> with ...
> >>>>
> >>>
> >>> Would it make sense to define it like this?:
> >>>
> >>> struct vcpu_shmem_stats {
> >>> #define STATS_A (1u << 0)
> >>> ...
> >>> #define VCPU_STATS_MAGIC 0xaabbccdd
> >>>      uint32_t magic;
> >>>      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
> >>>      uint32_t size;    // sizeof(vcpu_stats)
> >>>      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
> >>>      uint32_t nr_stats; // size of stats_active in uint32_t
> >>>      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
> >>>      ...
> >>> };
> >>
> > 
> > The use of stats_active[] is meant to provide a bitmap that can scale, thus not
> > limiting the number of counters in the vcpu_stats structure to 32 or 64. I
> > can't see another way to have an unlimited number of counters, though.
> > 
> >> Possibly, but this would make it harder to use the interface. An alternative
> >> might be to specify that an actual stats value set to ~0 marks an inactive
> >> element. Since these are 64-bit counters, with today's and tomorrow's
> >> computers we won't be at risk of a counter reaching a value of 2^^64-1, I
> >> guess. And even if there was one where such a risk could not be excluded
> >> (e.g. because of using pretty large increments), one would merely need to
> >> make sure to saturate at 2^^64-2. Plus at such a point one would need to
> >> consider anyway to switch to 128-bit fields, as neither saturated nor
> >> wrapped values are of much use to consumers.
> >>
> > 
> > If I understand correctly, this use case is when an element in the stats_active
> > bitmap becomes inactive, i.e., its bit is set to "0" in stats_active[]. You are
> > proposing to set the actual stats value to ~0 to mark an inactive element. I
> > may be missing something here, but would it not be enough to set the
> > corresponding stats_active[] bit to "0"?
> 
> The suggestion was to eliminate the need for stats_active[].
> 
Oh, I see, thanks for the clarification. To summarise, these are the current
options:
1. Use a "uint64_t" field, thus limiting the number of counters to 64. The
current vcpu_runstate_info structure is limited to 4 counters though, one for
each RUNSTATE_*.
2. Use a dynamic array, but this makes the interface harder to use.
3. Eliminate stats_active and set the actual stats value to ~0 to mark inactive
counters. This requires adding a "nr_stats" field to know how many counters
there are. Also, this requires making sure to saturate at 2^^64-2.

I might be missing some details here, but these are the options to evaluate.

I would go with a variation of 1), using two uint64_t, i.e., up to 128 vcpu
counters, which I think would be enough. I may be wrong.

Thoughts?
 
Matias 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-10 10:58               ` Matias Ezequiel Vara Larsen
@ 2023-03-10 11:34                 ` Jan Beulich
  2023-03-14 10:28                   ` Matias Ezequiel Vara Larsen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-03-10 11:34 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On 10.03.2023 11:58, Matias Ezequiel Vara Larsen wrote:
> Oh, I see, thanks for the clarification. To summarise, these are the current
> options:
> 1. Use a "uint64_t" field thus limiting the number of counters to 64. The
> current vcpu_runstate_info structure is limited to 4 counters though, one for
> each RUNSTATE_*. 
> 2. Use a dynamic array, but this makes the interface harder to use.
> 3. Eliminate stats_active and set the actual stats value to ~0 to mark inactive
> counters. This requires adding a "nr_stats" field to know how many counters there are.

While nr_stats can indeed be seen as a generalization of the earlier
stats_active, I think it is possible to get away without, as long as
padding fields also are filled with the "inactive" marker.

> Also, this requires making sure to saturate at 2^^64-2.

Thinking of it - considering overflowed counters inactive looks like a
reasonable model to me as well (which would mean saturating at 2^^64-1).

> I might be missing some details here, but these are the options to evaluate.
> 
> I would go with a variation of 1), using two uint64_t, i.e., up to 128 vcpu
> counters, which I think would be enough. I may be wrong.

Well, to me it doesn't matter whether it's 32, 64, or 128 - my concern
is with any kind of inherent upper bound. Using 128 right away might
look excessive, just like 32 might look too little. Hence my desire to
get away without any built-in upper bound. IOW I continue to favor 3,
irrespective of the presence or absence of nr_stats.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-10 11:34                 ` Jan Beulich
@ 2023-03-14 10:28                   ` Matias Ezequiel Vara Larsen
  2023-03-14 10:34                     ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-14 10:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On Fri, Mar 10, 2023 at 12:34:33PM +0100, Jan Beulich wrote:
> On 10.03.2023 11:58, Matias Ezequiel Vara Larsen wrote:
> > Oh, I see, thanks for the clarification. To summarise, these are the current
> > options:
> > 1. Use a "uint64_t" field thus limiting the number of counters to 64. The
> > current vcpu_runstate_info structure is limited to 4 counters though, one for
> > each RUNSTATE_*. 
> > 2. Use a dynamic array, but this makes the interface harder to use.
> > 3. Eliminate stats_active and set the actual stats value to ~0 to mark inactive
> > counters. This requires adding a "nr_stats" field to know how many counters there are.
> 
> While nr_stats can indeed be seen as a generalization of the earlier
> stats_active, I think it is possible to get away without, as long as
> padding fields also are filled with the "inactive" marker.
> 

Understood.

> > Also, this requires making sure to saturate at 2^^64-2.
> 
> Thinking of it - considering overflowed counters inactive looks like a
> reasonable model to me as well (which would mean saturating at 2^^64-1).
> 
> > I might be missing some details here, but these are the options to evaluate.
> > 
> > I would go with a variation of 1), using two uint64_t, i.e., up to 128 vcpu
> > counters, which I think would be enough. I may be wrong.
> 
> Well, to me it doesn't matter whether it's 32, 64, or 128 - my concern
> is with any kind of inherent upper bound. Using 128 right away might
> look excessive, just like 32 might look too little. Hence my desire to
> get away without any built-in upper bound. IOW I continue to favor 3,
> irrespective of the presence or absence of nr_stats.
> 
I see. The layout for 3) would look like:

struct vcpu_shmem_stats {
#define VCPU_STATS_MAGIC 0xaabbccdd
    uint32_t magic;
    uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
    uint32_t size;    // sizeof(vcpu_stats)
    uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
};

struct vcpu_stats {
    /*
     * If the least-significant bit of the seq number is set then an update
     * is in progress and the consumer must wait to read a consistent set of
     * values. This mechanism is similar to Linux's seqlock.
     */
    uint32_t seq;
    uint32_t _pad;
    /*
     * If the most-significant bit of a counter is set then the counter
     * is inactive and the consumer must ignore its value. Note that this
     * could also indicate that the counter has overflowed.
     */
    uint64_t stats_a; // e.g., runstate_running_time
    ...
};

All padding fields shall be marked as "inactive". The consumer can't
distinguish inactive from overflowed. Also, before reading, the consumer shall
always verify that:

offsetof(struct vcpu_stats, stats_y) < size

in case the consumer knows about a counter, e.g., stats_y, that Xen does not
know about.
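
For illustration, a consumer read of one counter under this layout could be
sketched as below (read_stats_a() and read_barrier() are made-up names; a real
consumer needs the appropriate architecture read barrier, which is a separate
discussion):

```c
#include <stdint.h>

/* Placeholder read barrier for the sketch only. */
#define read_barrier()  __asm__ __volatile__ ("" ::: "memory")

/* Minimal version of the layout proposed above. */
struct vcpu_stats {
    uint32_t seq;
    uint32_t _pad;
    uint64_t stats_a;
};

/* Retry until a consistent snapshot is read: an odd seq means an update
 * is in progress, and a seq change means the value may be torn. */
static uint64_t read_stats_a(const volatile struct vcpu_stats *s)
{
    uint32_t seq;
    uint64_t val;

    do {
        while ( (seq = s->seq) & 1 )
            ; /* producer is updating */
        read_barrier();
        val = s->stats_a;
        read_barrier();
    } while ( seq != s->seq );

    return val;
}
```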

Matias 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics
  2023-03-14 10:28                   ` Matias Ezequiel Vara Larsen
@ 2023-03-14 10:34                     ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-03-14 10:34 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: xen-devel, Matias Ezequiel Vara Larsen, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Dario Faggioli, Anthony PERARD, Andrew Cooper

On 14.03.2023 11:28, Matias Ezequiel Vara Larsen wrote:
> On Fri, Mar 10, 2023 at 12:34:33PM +0100, Jan Beulich wrote:
>> On 10.03.2023 11:58, Matias Ezequiel Vara Larsen wrote:
>>> Oh, I see, thanks for the clarification. To summarise, these are the current
>>> options:
>>> 1. Use a "uint64_t" field thus limiting the number of counters to 64. The
>>> current vcpu_runstate_info structure is limited to 4 counters though, one for
>>> each RUNSTATE_*. 
>>> 2. Use a dynamic array, but this makes the interface harder to use.
>>> 3. Eliminate stats_active and set the actual stats value to ~0 to mark inactive
>>> counters. This requires adding a "nr_stats" field to know how many counters there are.
>>
>> While nr_stats can indeed be seen as a generalization of the earlier
>> stats_active, I think it is possible to get away without, as long as
>> padding fields also are filled with the "inactive" marker.
>>
> 
> Understood.
> 
>>> Also, this requires making sure to saturate at 2^^64-2.
>>
>> Thinking of it - considering overflowed counters inactive looks like a
>> reasonable model to me as well (which would mean saturating at 2^^64-1).
>>
>>> I might be missing some details here, but these are the options to evaluate.
>>>
>>> I would go with a variation of 1), using two uint64_t, i.e., up to 128 vcpu
>>> counters, which I think would be enough. I may be wrong.
>>
>> Well, to me it doesn't matter whether it's 32, 64, or 128 - my concern
>> is with any kind of inherent upper bound. Using 128 right away might
>> look excessive, just like 32 might look too little. Hence my desire to
>> get away without any built-in upper bound. IOW I continue to favor 3,
>> irrespective of the presence or absence of nr_stats.
>>
> I see. The layout for 3) would look like:
> 
> struct vcpu_shmem_stats {
> #define VCPU_STATS_MAGIC 0xaabbccdd
>     uint32_t magic;
>     uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
>     uint32_t size;    // sizeof(vcpu_stats)
>     uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
> };
> 
> struct vcpu_stats {
>     /*
>      * If the least-significant bit of the seq number is set then an update
>      * is in progress and the consumer must wait to read a consistent set of
>      * values. This mechanism is similar to Linux's seqlock.
>      */
>     uint32_t seq;
>     uint32_t _pad;
>     /*
>      * If the most-significant bit of a counter is set then the counter
>      * is inactive and the consumer must ignore its value. Note that this
>      * could also indicate that the counter has overflowed.
>      */
>     uint64_t stats_a; // e.g., runstate_running_time
>     ...
> };
> 
> All padding fields shall be marked as "inactive". The consumer can't
> distinguish inactive from overflowed. Also, before reading, the consumer shall
> always verify that:
> 
> offsetof(struct vcpu_stats, stats_y) < size
> 
> in case the consumer knows about a counter, e.g., stats_y, that Xen does not
> know about.

Looks okay to me this way, but please have this verified by another party
(preferably Andrew, whose proposal was now changed).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool
  2023-02-23 20:31     ` Julien Grall
@ 2023-03-17 11:01       ` Matias Ezequiel Vara Larsen
  2023-03-29 21:29         ` Julien Grall
  0 siblings, 1 reply; 37+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2023-03-17 11:01 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, xen-devel, Matias Ezequiel Vara Larsen, Wei Liu,
	Anthony PERARD

On Thu, Feb 23, 2023 at 08:31:29PM +0000, Julien Grall wrote:
> Hi,
> 
> On 23/02/2023 16:01, Andrew Cooper wrote:
> > On 07/10/2022 1:39 pm, Matias Ezequiel Vara Larsen wrote:
> > 
> > A couple of observations, all unrelated to the stats themselves.
> > 
> > Although overall, I'm not entirely certain that a tool like this is
> > going to be very helpful after initial development.  Something to
> > consider would be to alter libxenstat to use this new interface?
> > 
> > > diff --git a/tools/misc/Makefile b/tools/misc/Makefile
> > > index 2b683819d4..837e4b50da 100644
> > > --- a/tools/misc/Makefile
> > > +++ b/tools/misc/Makefile
> > > @@ -49,6 +49,7 @@ TARGETS_COPY += xenpvnetboot
> > > 
> > > # Everything which needs to be built
> > > TARGETS_BUILD := $(filter-out $(TARGETS_COPY),$(TARGETS_ALL))
> > > +TARGETS_BUILD += xen-vcpus-stats
> > 
> > This patch is whitespace corrupted.  If at all possible, you need to see
> > about getting `git send-email` working to send patches with, as it deals
> > with most of the whitespace problems for you.
> > 
> > I'm afraid you can't simply copy the patch text into an email and send that.
> > 
> > > 
> > > # ... including build-only targets
> > > TARGETS_BUILD-$(CONFIG_X86)    += xen-vmtrace
> > > @@ -135,4 +136,9 @@ xencov: xencov.o
> > > xen-ucode: xen-ucode.o
> > >      $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
> > > 
> > > +xen-vcpus-stats.o: CFLAGS += $(CFLAGS_libxenforeignmemory)
> > > +
> > > +xen-vcpus-stats: xen-vcpus-stats.o
> > > +    $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl)
> > > $(LDLIBS_libxenforeignmemory) $(APPEND_LDFLAGS)
> > > +
> > > -include $(DEPS_INCLUDE)
> > > diff --git a/tools/misc/xen-vcpus-stats.c b/tools/misc/xen-vcpus-stats.c
> > > new file mode 100644
> > > index 0000000000..29d0efb124
> > > --- /dev/null
> > > +++ b/tools/misc/xen-vcpus-stats.c
> > > @@ -0,0 +1,87 @@
> > > +#include <err.h>
> > > +#include <errno.h>
> > > +#include <error.h>
> > > +#include <stdio.h>
> > > +#include <stdlib.h>
> > > +#include <string.h>
> > > +#include <sys/mman.h>
> > > +#include <signal.h>
> > > +
> > > +#include <xenctrl.h>
> > > +#include <xenforeignmemory.h>
> > > +#include <xen/vcpu.h>
> > > +
> > > +#define rmb()   asm volatile("lfence":::"memory")
> > 
> > This is rmb(), but rmb() isn't what you want.
> > 
> > You want smp_rmb(), which is
> > 
> > #define smp_rmb() asm volatile ("" ::: "memory")
> 
> From the generic PoV, I find smp_rmb() a bit misleading because it is not
> clear in this context whether we are referring to the SMP-ness of the
> hypervisor or the tools domain.
> 
> If the latter, then technically it could be uniprocessor domain and one
> could argue that for Arm it could be downgraded to just a compiler barrier.
> 
> AFAICT, this would not be the case here because we are getting data from
> Xen. So we always need a "dmb ish".
> 
> So, I would suggest to name it virt_*() (to match Linux's naming).
> 
> Also, is this tool meant to be arch-agnostic? If so, then we need to
> introduce the proper barrier for the other arch.
> 
Thanks, Julien, for the comment. Is `xen_rmb()` meant for that?

Matias
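
For illustration, virt_*() barriers along the lines Julien describes could be
sketched as below (illustrative definitions following Linux's naming, not
Xen's actual ones):

```c
/* On x86 a compiler barrier suffices to order ordinary loads, while on
 * Arm a real "dmb" is needed because the data is produced by Xen,
 * possibly on another physical CPU. */
#if defined(__x86_64__) || defined(__i386__)
# define virt_rmb()  __asm__ __volatile__ ("" ::: "memory")
#elif defined(__aarch64__)
# define virt_rmb()  __asm__ __volatile__ ("dmb ishld" ::: "memory")
#elif defined(__arm__)
# define virt_rmb()  __asm__ __volatile__ ("dmb ish" ::: "memory")
#else
# error "virt_rmb() not defined for this architecture"
#endif
```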


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool
  2023-03-17 11:01       ` Matias Ezequiel Vara Larsen
@ 2023-03-29 21:29         ` Julien Grall
  0 siblings, 0 replies; 37+ messages in thread
From: Julien Grall @ 2023-03-29 21:29 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Andrew Cooper, xen-devel, Matias Ezequiel Vara Larsen, Wei Liu,
	Anthony PERARD

Hi,

Sorry for the late reply.

On 17/03/2023 11:01, Matias Ezequiel Vara Larsen wrote:
> On Thu, Feb 23, 2023 at 08:31:29PM +0000, Julien Grall wrote:
>>>> +#define rmb()   asm volatile("lfence":::"memory")
>>>
>>> This is rmb(), but rmb() isn't what you want.
>>>
>>> You want smp_rmb(), which is
>>>
>>> #define smp_rmb() asm volatile ("" ::: "memory")
>>
>>  From the generic PoV, I find smp_rmb() a bit misleading because it is not
>> clear in this context whether we are referring to the SMP-ness of the
>> hypervisor or the tools domain.
>>
>> If the latter, then technically it could be a uniprocessor domain and one
>> could argue that for Arm it could be downgraded to just a compiler barrier.
>>
>> AFAICT, this would not be the case here because we are getting data from
>> Xen. So we always need a "dmb ish".
>>
>> So, I would suggest naming it virt_*() (to match Linux's naming).
>>
>> Also, is this tool meant to be arch-agnostic? If so, then we need to
>> introduce the proper barrier for the other arch.
>>
> Thanks Julien for the comment. Is `xen_rmb()` meant for that?

I believe so.

> 
> Matias

-- 
Julien Grall



end of thread

Thread overview: 37+ messages
2022-10-07 12:39 [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Matias Ezequiel Vara Larsen
2022-10-07 12:39 ` [RFC PATCH v2 1/2] xen/memory : Add a stats_table resource type Matias Ezequiel Vara Larsen
2022-12-13 17:02   ` Jan Beulich
2023-02-16 14:48     ` Matias Ezequiel Vara Larsen
2023-02-16 15:10       ` Jan Beulich
2022-12-14  7:29   ` Jan Beulich
2022-12-14  7:56     ` Jan Beulich
2023-02-17  8:50       ` Matias Ezequiel Vara Larsen
2023-02-17  8:57         ` Jan Beulich
2023-02-17  9:29           ` Matias Ezequiel Vara Larsen
2023-02-17 14:10             ` Jan Beulich
2023-02-23 12:16               ` Matias Ezequiel Vara Larsen
2023-02-23 12:42                 ` Jan Beulich
2023-03-07 14:44                   ` Matias Ezequiel Vara Larsen
2023-03-07 16:55                     ` Jan Beulich
2023-03-09  9:22                       ` Matias Ezequiel Vara Larsen
2023-02-16 15:07     ` Matias Ezequiel Vara Larsen
2023-02-16 15:15       ` Jan Beulich
2023-02-20 16:51         ` Matias Ezequiel Vara Larsen
2023-02-21  8:48           ` Jan Beulich
2022-10-07 12:39 ` [RFC PATCH v2 2/2] tools/misc: Add xen-vcpus-stats tool Matias Ezequiel Vara Larsen
2023-02-23 16:01   ` Andrew Cooper
2023-02-23 20:31     ` Julien Grall
2023-03-17 11:01       ` Matias Ezequiel Vara Larsen
2023-03-29 21:29         ` Julien Grall
2023-02-24 15:31     ` Matias Ezequiel Vara Larsen
2023-02-23 19:56 ` API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics Andrew Cooper
2023-03-06 14:23   ` Matias Ezequiel Vara Larsen
2023-03-07 10:12     ` Jan Beulich
2023-03-08 11:54       ` Matias Ezequiel Vara Larsen
2023-03-08 14:16         ` Jan Beulich
2023-03-09 10:38           ` Matias Ezequiel Vara Larsen
2023-03-09 11:50             ` Jan Beulich
2023-03-10 10:58               ` Matias Ezequiel Vara Larsen
2023-03-10 11:34                 ` Jan Beulich
2023-03-14 10:28                   ` Matias Ezequiel Vara Larsen
2023-03-14 10:34                     ` Jan Beulich
