* [PATCH v3 0/8] runstate/time area registration by (guest) physical address
@ 2023-05-03 15:53 Jan Beulich
  2023-05-03 15:54 ` [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

Since it was indicated that introducing specific new vCPU ops may be
beneficial independently of the introduction of a fully physical-
address-based ABI flavor, here we go. There continue to be a number of
open questions throughout the series, the resolution of which was one
of the main goals of the earlier v2 posting; v3 has very few changes
and is meant to serve as a reminder that this series still wants
looking at.

1: domain: GADDR based shared guest area registration alternative - teardown
2: domain: update GADDR based runstate guest area
3: x86: update GADDR based secondary time area
4: x86/mem-sharing: copy GADDR based shared guest areas
5: domain: map/unmap GADDR based shared guest areas
6: domain: introduce GADDR based runstate area registration alternative
7: x86: introduce GADDR based secondary time area registration alternative
8: common: convert vCPU info area registration

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
@ 2023-05-03 15:54 ` Jan Beulich
  2023-09-27  8:51   ` Roger Pau Monné
  2023-05-03 15:55 ` [PATCH v3 2/8] domain: update GADDR based runstate guest area Jan Beulich
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

In preparation for the introduction of new vCPU operations allowing the
respective areas (one of the two being x86-specific) to be registered by
guest-physical address, add the necessary domain cleanup hooks.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
---
RFC: Zapping the areas in pv_shim_shutdown() may not be strictly
     necessary: Aiui unmap_vcpu_info() is called only because the vCPU
     info area cannot be re-registered. Beyond that I guess the
     assumption is that the areas would only be re-registered as they
     were before. If that's not the case I wonder whether the guest
     handles for both areas shouldn't also be zapped.
---
v2: Add assertion in unmap_guest_area().

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1019,7 +1019,10 @@ int arch_domain_soft_reset(struct domain
     }
 
     for_each_vcpu ( d, v )
+    {
         set_xen_guest_handle(v->arch.time_info_guest, NULL);
+        unmap_guest_area(v, &v->arch.time_guest_area);
+    }
 
  exit_put_gfn:
     put_gfn(d, gfn_x(gfn));
@@ -2356,6 +2359,8 @@ int domain_relinquish_resources(struct d
             if ( ret )
                 return ret;
 
+            unmap_guest_area(v, &v->arch.time_guest_area);
+
             vpmu_destroy(v);
         }
 
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -669,6 +669,7 @@ struct arch_vcpu
 
     /* A secondary copy of the vcpu time info. */
     XEN_GUEST_HANDLE(vcpu_time_info_t) time_info_guest;
+    struct guest_area time_guest_area;
 
     struct arch_vm_event *vm_event;
 
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -382,8 +382,10 @@ int pv_shim_shutdown(uint8_t reason)
 
     for_each_vcpu ( d, v )
     {
-        /* Unmap guest vcpu_info pages. */
+        /* Unmap guest vcpu_info page and runstate/time areas. */
         unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->runstate_guest_area);
+        unmap_guest_area(v, &v->arch.time_guest_area);
 
         /* Reset the periodic timer to the default value. */
         vcpu_set_periodic_timer(v, MILLISECS(10));
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -963,7 +963,10 @@ int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
         for_each_vcpu ( d, v )
+        {
             unmap_vcpu_info(v);
+            unmap_guest_area(v, &v->runstate_guest_area);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings 
          * have to be put before we call put_domain. */
@@ -1417,6 +1420,7 @@ int domain_soft_reset(struct domain *d,
     {
         set_xen_guest_handle(runstate_guest(v), NULL);
         unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->runstate_guest_area);
     }
 
     rc = arch_domain_soft_reset(d);
@@ -1568,6 +1572,19 @@ void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+/*
+ * This is only intended to be used for domain cleanup (or more generally only
+ * with at least the respective vCPU, if it's not the current one, reliably
+ * paused).
+ */
+void unmap_guest_area(struct vcpu *v, struct guest_area *area)
+{
+    struct domain *d = v->domain;
+
+    if ( v != current )
+        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
+}
+
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct vcpu_guest_context *ctxt;
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -5,6 +5,12 @@
 #include <xen/types.h>
 
 #include <public/xen.h>
+
+struct guest_area {
+    struct page_info *pg;
+    void *map;
+};
+
 #include <asm/domain.h>
 #include <asm/numa.h>
 
@@ -76,6 +82,11 @@ void arch_vcpu_destroy(struct vcpu *v);
 int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset);
 void unmap_vcpu_info(struct vcpu *v);
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v));
+void unmap_guest_area(struct vcpu *v, struct guest_area *area);
+
 int arch_domain_create(struct domain *d,
                        struct xen_domctl_createdomain *config,
                        unsigned int flags);
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -202,6 +202,7 @@ struct vcpu
         XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
     } runstate_guest; /* guest address */
 #endif
+    struct guest_area runstate_guest_area;
     unsigned int     new_state;
 
     /* Has the FPU been initialised? */



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 2/8] domain: update GADDR based runstate guest area
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
  2023-05-03 15:54 ` [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
@ 2023-05-03 15:55 ` Jan Beulich
  2023-09-27  9:44   ` Roger Pau Monné
  2023-05-03 15:55 ` [PATCH v3 3/8] x86: update GADDR based secondary time area Jan Beulich
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

Before adding a new vCPU operation to register the runstate area by
guest-physical address, add code to actually keep such areas up-to-date.

Note that the area will be updated exclusively following the model that
VMASST_TYPE_runstate_update_flag enables for virtual-address based
registered areas.
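
For context, a guest consuming a physically registered runstate area under
that model would read it roughly as follows (illustrative guest-side sketch,
not part of this patch; 'rs' is assumed to be the guest's own mapping of the
area, READ_ONCE() stands for a suitable single-copy-atomic read, and the
commented barriers are required in real code):

    struct vcpu_runstate_info snap;
    uint64_t t1, t2;

    do {
        t1 = READ_ONCE(rs->state_entry_time);
        /* smp_rmb(); */
        snap = *rs;                       /* copy out all fields */
        /* smp_rmb(); */
        t2 = READ_ONCE(rs->state_entry_time);
    } while ( (t1 & XEN_RUNSTATE_UPDATE) || t1 != t2 );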

Note further that pages aren't marked dirty when written to (matching
the handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: HVM guests (on x86) can change bitness and hence layout (and size!
     and alignment) of the runstate area. I don't think it is an option
     to require 32-bit code to pass a range such that even the 64-bit
     layout wouldn't cross a page boundary (and be suitably aligned). I
     also don't see any other good solution, so for now a crude approach
     with an extra boolean is used (using has_32bit_shinfo() isn't race
     free and could hence lead to overrunning the mapped space).
---
v3: Use assignment instead of memcpy().
v2: Drop VM-assist conditionals.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1615,15 +1615,52 @@ bool update_runstate_area(struct vcpu *v
     bool rc;
     struct guest_memory_policy policy = { };
     void __user *guest_handle = NULL;
-    struct vcpu_runstate_info runstate;
+    struct vcpu_runstate_info runstate = v->runstate;
+    struct vcpu_runstate_info *map = v->runstate_guest_area.map;
+
+    if ( map )
+    {
+        uint64_t *pset;
+#ifdef CONFIG_COMPAT
+        struct compat_vcpu_runstate_info *cmap = NULL;
+
+        if ( v->runstate_guest_area_compat )
+            cmap = (void *)map;
+#endif
+
+        /*
+         * NB: No VM_ASSIST(v->domain, runstate_update_flag) check here.
+         *     Always using that updating model.
+         */
+#ifdef CONFIG_COMPAT
+        if ( cmap )
+            pset = &cmap->state_entry_time;
+        else
+#endif
+            pset = &map->state_entry_time;
+        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+        write_atomic(pset, runstate.state_entry_time);
+        smp_wmb();
+
+#ifdef CONFIG_COMPAT
+        if ( cmap )
+            XLAT_vcpu_runstate_info(cmap, &runstate);
+        else
+#endif
+            *map = runstate;
+
+        smp_wmb();
+        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        write_atomic(pset, runstate.state_entry_time);
+
+        return true;
+    }
 
     if ( guest_handle_is_null(runstate_guest(v)) )
         return true;
 
     update_guest_memory_policy(v, &policy);
 
-    memcpy(&runstate, &v->runstate, sizeof(runstate));
-
     if ( VM_ASSIST(v->domain, runstate_update_flag) )
     {
 #ifdef CONFIG_COMPAT
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -231,6 +231,8 @@ struct vcpu
 #ifdef CONFIG_COMPAT
     /* A hypercall is using the compat ABI? */
     bool             hcall_compat;
+    /* Physical runstate area registered via compat ABI? */
+    bool             runstate_guest_area_compat;
 #endif
 
 #ifdef CONFIG_IOREQ_SERVER



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 3/8] x86: update GADDR based secondary time area
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
  2023-05-03 15:54 ` [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
  2023-05-03 15:55 ` [PATCH v3 2/8] domain: update GADDR based runstate guest area Jan Beulich
@ 2023-05-03 15:55 ` Jan Beulich
  2023-09-27 10:14   ` Roger Pau Monné
  2023-05-03 15:56 ` [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Before adding a new vCPU operation to register the secondary time area
by guest-physical address, add code to actually keep such areas up-to-
date.

Note that pages aren't marked dirty when written to (matching the
handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).
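
The consumer side of the version protocol implemented by
write_time_guest_area() below is the usual vcpu_time_info one (guest-side
sketch only, not part of this patch; 'ti' is assumed to be the guest's
mapping of the registered area, READ_ONCE() stands for a suitable
single-copy-atomic read, and the commented barriers are required in real
code):

    struct vcpu_time_info snap;
    uint32_t ver;

    do {
        ver = READ_ONCE(ti->version);     /* odd => update in progress */
        /* smp_rmb(); */
        snap = *ti;
        /* smp_rmb(); */
    } while ( (ver & 1) || READ_ONCE(ti->version) != ver );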

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1571,12 +1571,34 @@ static void __update_vcpu_system_time(st
         v->arch.pv.pending_system_time = _u;
 }
 
+static void write_time_guest_area(struct vcpu_time_info *map,
+                                  const struct vcpu_time_info *src)
+{
+    /* 1. Update userspace version. */
+    write_atomic(&map->version, src->version);
+    smp_wmb();
+
+    /* 2. Update all other userspace fields. */
+    *map = *src;
+
+    /* 3. Update userspace version again. */
+    smp_wmb();
+    write_atomic(&map->version, version_update_end(src->version));
+}
+
 bool update_secondary_system_time(struct vcpu *v,
                                   struct vcpu_time_info *u)
 {
     XEN_GUEST_HANDLE(vcpu_time_info_t) user_u = v->arch.time_info_guest;
+    struct vcpu_time_info *map = v->arch.time_guest_area.map;
     struct guest_memory_policy policy = { .nested_guest_mode = false };
 
+    if ( map )
+    {
+        write_time_guest_area(map, u);
+        return true;
+    }
+
     if ( guest_handle_is_null(user_u) )
         return true;
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (2 preceding siblings ...)
  2023-05-03 15:55 ` [PATCH v3 3/8] x86: update GADDR based secondary time area Jan Beulich
@ 2023-05-03 15:56 ` Jan Beulich
  2023-05-03 17:14   ` Tamas K Lengyel
  2023-09-27 11:08   ` Roger Pau Monné
  2023-05-03 15:57 ` [PATCH v3 5/8] domain: map/unmap " Jan Beulich
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:56 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Tamas K Lengyel

In preparation for the introduction of new vCPU operations allowing the
respective areas (one of the two being x86-specific) to be registered by
guest-physical address, add the necessary fork handling (with the
backing function yet to be filled in).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Extend comment.

--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
     hvm_set_nonreg_state(cd_vcpu, &nrs);
 }
 
+static int copy_guest_area(struct guest_area *cd_area,
+                           const struct guest_area *d_area,
+                           struct vcpu *cd_vcpu,
+                           const struct domain *d)
+{
+    mfn_t d_mfn, cd_mfn;
+
+    if ( !d_area->pg )
+        return 0;
+
+    d_mfn = page_to_mfn(d_area->pg);
+
+    /* Allocate & map a page for the area if it hasn't been already. */
+    if ( !cd_area->pg )
+    {
+        gfn_t gfn = mfn_to_gfn(d, d_mfn);
+        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
+        p2m_type_t p2mt;
+        p2m_access_t p2ma;
+        unsigned int offset;
+        int ret;
+
+        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
+        if ( mfn_eq(cd_mfn, INVALID_MFN) )
+        {
+            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
+
+            if ( !pg )
+                return -ENOMEM;
+
+            cd_mfn = page_to_mfn(pg);
+            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
+
+            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
+                                 p2m->default_access, -1);
+            if ( ret )
+                return ret;
+        }
+        else if ( p2mt != p2m_ram_rw )
+            return -EBUSY;
+
+        /*
+         * Map the area into the guest. For simplicity specify the entire range up
+         * to the end of the page: All the function uses it for is to check
+         * that the range doesn't cross page boundaries. Having the area mapped
+         * in the original domain implies that it fits there and therefore will
+         * also fit in the clone.
+         */
+        offset = PAGE_OFFSET(d_area->map);
+        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
+                             PAGE_SIZE - offset, cd_area, NULL);
+        if ( ret )
+            return ret;
+    }
+    else
+        cd_mfn = page_to_mfn(cd_area->pg);
+
+    copy_domain_page(cd_mfn, d_mfn);
+
+    return 0;
+}
+
 static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
 {
     struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
@@ -1733,6 +1795,16 @@ static int copy_vcpu_settings(struct dom
             copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
         }
 
+        /* Same for the (physically registered) runstate and time info areas. */
+        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
+                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
+                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+
         ret = copy_vpmu(d_vcpu, cd_vcpu);
         if ( ret )
             return ret;
@@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
 
  state:
     if ( reset_state )
+    {
         rc = copy_settings(d, pd);
+        /* TBD: What to do here with -ERESTART? */
+    }
 
     domain_unpause(d);
 
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v))
+{
+    return -EOPNOTSUPP;
+}
+
 /*
  * This is only intended to be used for domain cleanup (or more generally only
  * with at least the respective vCPU, if it's not the current one, reliably



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 5/8] domain: map/unmap GADDR based shared guest areas
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (3 preceding siblings ...)
  2023-05-03 15:56 ` [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
@ 2023-05-03 15:57 ` Jan Beulich
  2023-09-27 14:53   ` Roger Pau Monné
  2023-05-03 15:57 ` [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative Jan Beulich
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the areas are inaccessible (and hence cannot be updated by
Xen) when in guest-user mode, and for HVM guests they may be
inaccessible when Meltdown mitigations are in place. (There are yet
more issues.)

In preparation for the introduction of new vCPU operations allowing the
respective areas (one of the two being x86-specific) to be registered by
guest-physical address, flesh out the map/unmap functions.

Noteworthy differences from map_vcpu_info():
- areas can be registered more than once (and de-registered),
- remote vCPU-s are paused rather than checked for being down (which in
  principle can change right after the check),
- the domain lock is taken for a much smaller region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: By using global domain page mappings the demand on the underlying
     VA range may increase significantly. I did consider to use per-
     domain mappings instead, but they exist for x86 only. Of course we
     could have arch_{,un}map_guest_area() aliasing global domain page
     mapping functions on Arm and using per-domain mappings on x86. Yet
     then again map_vcpu_info() doesn't (and can't) do so.

RFC: In map_guest_area() I'm not checking the P2M type, instead - just
     like map_vcpu_info() - solely relying on the type ref acquisition.
     Checking for p2m_ram_rw alone would be wrong, as at least
     p2m_ram_logdirty ought to also be okay to use here (and in similar
     cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
     used here (like altp2m_vcpu_enable_ve() does) as well as in
     map_vcpu_info(), yet then again the P2M type is stale by the time
     it is being looked at anyway without the P2M lock held.
---
v2: currd -> d, to cover mem-sharing's copy_guest_area(). Re-base over
    change(s) earlier in the series. Use ~0 as "unmap" request indicator.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1576,7 +1576,82 @@ int map_guest_area(struct vcpu *v, paddr
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
 {
-    return -EOPNOTSUPP;
+    struct domain *d = v->domain;
+    void *map = NULL;
+    struct page_info *pg = NULL;
+    int rc = 0;
+
+    if ( ~gaddr )
+    {
+        unsigned long gfn = PFN_DOWN(gaddr);
+        unsigned int align;
+        p2m_type_t p2mt;
+
+        if ( gfn != PFN_DOWN(gaddr + size - 1) )
+            return -ENXIO;
+
+#ifdef CONFIG_COMPAT
+        if ( has_32bit_shinfo(d) )
+            align = alignof(compat_ulong_t);
+        else
+#endif
+            align = alignof(xen_ulong_t);
+        if ( gaddr & (align - 1) )
+            return -ENXIO;
+
+        rc = check_get_page_from_gfn(d, _gfn(gfn), false, &p2mt, &pg);
+        if ( rc )
+            return rc;
+
+        if ( !get_page_type(pg, PGT_writable_page) )
+        {
+            put_page(pg);
+            return -EACCES;
+        }
+
+        map = __map_domain_page_global(pg);
+        if ( !map )
+        {
+            put_page_and_type(pg);
+            return -ENOMEM;
+        }
+        map += PAGE_OFFSET(gaddr);
+    }
+
+    if ( v != current )
+    {
+        if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
+        {
+            rc = -ERESTART;
+            goto unmap;
+        }
+
+        vcpu_pause(v);
+
+        spin_unlock(&d->hypercall_deadlock_mutex);
+    }
+
+    domain_lock(d);
+
+    if ( map )
+        populate(map, v);
+
+    SWAP(area->pg, pg);
+    SWAP(area->map, map);
+
+    domain_unlock(d);
+
+    if ( v != current )
+        vcpu_unpause(v);
+
+ unmap:
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
+
+    return rc;
 }
 
 /*
@@ -1587,9 +1662,24 @@ int map_guest_area(struct vcpu *v, paddr
 void unmap_guest_area(struct vcpu *v, struct guest_area *area)
 {
     struct domain *d = v->domain;
+    void *map;
+    struct page_info *pg;
 
     if ( v != current )
         ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
+
+    domain_lock(d);
+    map = area->map;
+    area->map = NULL;
+    pg = area->pg;
+    area->pg = NULL;
+    domain_unlock(d);
+
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
 }
 
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (4 preceding siblings ...)
  2023-05-03 15:57 ` [PATCH v3 5/8] domain: map/unmap " Jan Beulich
@ 2023-05-03 15:57 ` Jan Beulich
  2023-09-27 15:24   ` Roger Pau Monné
  2023-05-03 15:58 ` [PATCH v3 7/8] x86: introduce GADDR based secondary time " Jan Beulich
  2023-05-03 15:58 ` [PATCH v3 8/8] common: convert vCPU info area registration Jan Beulich
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the area is inaccessible (and hence cannot be updated by Xen)
when in guest-user mode.

Introduce a new vCPU operation allowing the runstate area to be
registered by guest-physical address.
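
From the guest's point of view the new operation takes the same struct as
VCPUOP_register_runstate_memory_area, with "addr.p" holding a guest-physical
address instead (minimal sketch, assuming a Linux-style HYPERVISOR_vcpu_op()
wrapper; 'runstate_gpa' is a made-up per-vCPU address that must not cross a
page boundary):

    struct vcpu_register_runstate_memory_area area = {
        .addr.p = runstate_gpa,
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_runstate_phys_area,
                                vcpu_id, &area);

    /* Passing ~0 as the address de-registers the area again. */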

An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Extend comment in public header.

--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -12,6 +12,22 @@
 CHECK_vcpu_get_physid;
 #undef xen_vcpu_get_physid
 
+static void cf_check
+runstate_area_populate(void *map, struct vcpu *v)
+{
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.need_update_runstate_area = false;
+
+    v->runstate_guest_area_compat = true;
+
+    if ( v == current )
+    {
+        struct compat_vcpu_runstate_info *info = map;
+
+        XLAT_vcpu_runstate_info(info, &v->runstate);
+    }
+}
+
 int
 compat_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
@@ -57,6 +73,25 @@ compat_vcpu_op(int cmd, unsigned int vcp
 
         break;
     }
+
+    case VCPUOP_register_runstate_phys_area:
+    {
+        struct compat_vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(struct compat_vcpu_runstate_info),
+                            &v->runstate_guest_area,
+                            runstate_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     case VCPUOP_register_vcpu_time_memory_area:
     {
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1801,6 +1801,26 @@ bool update_runstate_area(struct vcpu *v
     return rc;
 }
 
+static void cf_check
+runstate_area_populate(void *map, struct vcpu *v)
+{
+#ifdef CONFIG_PV
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.need_update_runstate_area = false;
+#endif
+
+#ifdef CONFIG_COMPAT
+    v->runstate_guest_area_compat = false;
+#endif
+
+    if ( v == current )
+    {
+        struct vcpu_runstate_info *info = map;
+
+        *info = v->runstate;
+    }
+}
+
 long common_vcpu_op(int cmd, struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
@@ -1982,6 +2002,25 @@ long common_vcpu_op(int cmd, struct vcpu
 
         break;
     }
+
+    case VCPUOP_register_runstate_phys_area:
+    {
+        struct vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(struct vcpu_runstate_info),
+                            &v->runstate_guest_area,
+                            runstate_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     default:
         rc = -ENOSYS;
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -221,6 +221,19 @@ struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+/*
+ * Like the respective VCPUOP_register_*_memory_area, just using the "addr.p"
+ * field of the supplied struct as a guest physical address (i.e. in GFN space).
+ * The respective area may not cross a page boundary.  Pass ~0 to unregister an
+ * area.  Note that as long as an area is registered by physical address, the
+ * linear address based area will not be serviced (updated) by the hypervisor.
+ *
+ * Note that the area registered via VCPUOP_register_runstate_memory_area will
+ * be updated in the same manner as the one registered via virtual address PLUS
+ * VMASST_TYPE_runstate_update_flag engaged by the domain.
+ */
+#define VCPUOP_register_runstate_phys_area      14
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 7/8] x86: introduce GADDR based secondary time area registration alternative
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (5 preceding siblings ...)
  2023-05-03 15:57 ` [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative Jan Beulich
@ 2023-05-03 15:58 ` Jan Beulich
  2023-09-27 15:50   ` Roger Pau Monné
  2023-05-03 15:58 ` [PATCH v3 8/8] common: convert vCPU info area registration Jan Beulich
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:58 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

The registration by virtual/linear address has downsides: The access is
expensive for HVM/PVH domains. Furthermore for 64-bit PV domains the area
is inaccessible (and hence cannot be updated by Xen) when in guest-user
mode.

Introduce a new vCPU operation allowing the secondary time area to be
registered by guest-physical address.
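
Guest-side usage mirrors the runstate variant (minimal sketch, assuming a
Linux-style HYPERVISOR_vcpu_op() wrapper; 'time_gpa' is a made-up
guest-physical address of a vcpu_time_info not crossing a page boundary):

    struct vcpu_register_time_memory_area area = {
        .addr.p = time_gpa,
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_phys_area,
                                vcpu_id, &area);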

An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base.
v2: Forge version in force_update_secondary_system_time().

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1504,6 +1504,15 @@ int arch_vcpu_reset(struct vcpu *v)
     return 0;
 }
 
+static void cf_check
+time_area_populate(void *map, struct vcpu *v)
+{
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.pending_system_time.version = 0;
+
+    force_update_secondary_system_time(v, map);
+}
+
 long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
@@ -1541,6 +1550,25 @@ long do_vcpu_op(int cmd, unsigned int vc
 
         break;
     }
+
+    case VCPUOP_register_vcpu_time_phys_area:
+    {
+        struct vcpu_register_time_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(vcpu_time_info_t),
+                            &v->arch.time_guest_area,
+                            time_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     case VCPUOP_get_physid:
     {
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -692,6 +692,8 @@ void domain_cpu_policy_changed(struct do
 
 bool update_secondary_system_time(struct vcpu *,
                                   struct vcpu_time_info *);
+void force_update_secondary_system_time(struct vcpu *,
+                                        struct vcpu_time_info *);
 
 void vcpu_show_registers(const struct vcpu *);
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1633,6 +1633,16 @@ void force_update_vcpu_system_time(struc
     __update_vcpu_system_time(v, 1);
 }
 
+void force_update_secondary_system_time(struct vcpu *v,
+                                        struct vcpu_time_info *map)
+{
+    struct vcpu_time_info u;
+
+    collect_time_info(v, &u);
+    u.version = -1; /* Compensate for version_update_end(). */
+    write_time_guest_area(map, &u);
+}
+
 static void update_domain_rtc(void)
 {
     struct domain *d;
--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -115,6 +115,7 @@ compat_vcpu_op(int cmd, unsigned int vcp
 
     case VCPUOP_send_nmi:
     case VCPUOP_get_physid:
+    case VCPUOP_register_vcpu_time_phys_area:
         rc = do_vcpu_op(cmd, vcpuid, arg);
         break;
 
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -233,6 +233,7 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_register_ti
  * VMASST_TYPE_runstate_update_flag engaged by the domain.
  */
 #define VCPUOP_register_runstate_phys_area      14
+#define VCPUOP_register_vcpu_time_phys_area     15
 
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 8/8] common: convert vCPU info area registration
  2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (6 preceding siblings ...)
  2023-05-03 15:58 ` [PATCH v3 7/8] x86: introduce GADDR based secondary time " Jan Beulich
@ 2023-05-03 15:58 ` Jan Beulich
  2023-09-28  8:24   ` Roger Pau Monné
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-03 15:58 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Tamas K Lengyel

Switch to using map_guest_area(). Noteworthy differences from
map_vcpu_info():
- remote vCPU-s are paused rather than checked for being down (which in
  principle can change right after the check),
- the domain lock is taken for a much smaller region,
- the error code for an attempt to re-register the area is now -EBUSY,
- we could in principle permit de-registration when no area was
  previously registered (which would permit "probing", if necessary for
  anything).

Note that this eliminates a bug in copy_vcpu_settings(): The function
did allocate a new page regardless of the GFN already having a mapping,
thus in particular breaking the case of two vCPU-s having their info
areas on the same page.
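
For reference, the guest-visible interface is unchanged by this conversion;
registration still passes a guest frame number plus offset, which the
hypervisor now folds into a guest-physical address for map_guest_area()
(guest-side sketch, assuming a Linux-style HYPERVISOR_vcpu_op() wrapper;
values are illustrative):

    struct vcpu_register_vcpu_info info = {
        .mfn    = gfn,             /* frame containing the vcpu_info */
        .offset = offset_in_page,  /* suitably aligned offset within it */
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info);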

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: I'm not really certain whether the preliminary check (ahead of
     calling map_guest_area()) is worthwhile to have.
---
v2: Re-base over changes earlier in the series. Properly enforce no re-
    registration. Avoid several casts by introducing local variables.

--- a/xen/arch/x86/include/asm/shared.h
+++ b/xen/arch/x86/include/asm/shared.h
@@ -26,17 +26,20 @@ static inline void arch_set_##field(stru
 #define GET_SET_VCPU(type, field)                               \
 static inline type arch_get_##field(const struct vcpu *v)       \
 {                                                               \
+    const vcpu_info_t *vi = v->vcpu_info_area.map;              \
+                                                                \
     return !has_32bit_shinfo(v->domain) ?                       \
-           v->vcpu_info->native.arch.field :                    \
-           v->vcpu_info->compat.arch.field;                     \
+           vi->native.arch.field : vi->compat.arch.field;       \
 }                                                               \
 static inline void arch_set_##field(struct vcpu *v,             \
                                     type val)                   \
 {                                                               \
+    vcpu_info_t *vi = v->vcpu_info_area.map;                    \
+                                                                \
     if ( !has_32bit_shinfo(v->domain) )                         \
-        v->vcpu_info->native.arch.field = val;                  \
+        vi->native.arch.field = val;                            \
     else                                                        \
-        v->vcpu_info->compat.arch.field = val;                  \
+        vi->compat.arch.field = val;                            \
 }
 
 #else
@@ -57,12 +60,16 @@ static inline void arch_set_##field(stru
 #define GET_SET_VCPU(type, field)                           \
 static inline type arch_get_##field(const struct vcpu *v)   \
 {                                                           \
-    return v->vcpu_info->arch.field;                        \
+    const vcpu_info_t *vi = v->vcpu_info_area.map;          \
+                                                            \
+    return vi->arch.field;                                  \
 }                                                           \
 static inline void arch_set_##field(struct vcpu *v,         \
                                     type val)               \
 {                                                           \
-    v->vcpu_info->arch.field = val;                         \
+    vcpu_info_t *vi = v->vcpu_info_area.map;                \
+                                                            \
+    vi->arch.field = val;                                   \
 }
 
 #endif
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1749,53 +1749,24 @@ static int copy_vpmu(struct vcpu *d_vcpu
 static int copy_vcpu_settings(struct domain *cd, const struct domain *d)
 {
     unsigned int i;
-    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
     int ret = -EINVAL;
 
     for ( i = 0; i < cd->max_vcpus; i++ )
     {
         struct vcpu *d_vcpu = d->vcpu[i];
         struct vcpu *cd_vcpu = cd->vcpu[i];
-        mfn_t vcpu_info_mfn;
 
         if ( !d_vcpu || !cd_vcpu )
             continue;
 
-        /* Copy & map in the vcpu_info page if the guest uses one */
-        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
-        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
-        {
-            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
-
-            /* Allocate & map the page for it if it hasn't been already */
-            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
-            {
-                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
-                unsigned long gfn_l = gfn_x(gfn);
-                struct page_info *page;
-
-                if ( !(page = alloc_domheap_page(cd, 0)) )
-                    return -ENOMEM;
-
-                new_vcpu_info_mfn = page_to_mfn(page);
-                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
-
-                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn,
-                                     PAGE_ORDER_4K, p2m_ram_rw,
-                                     p2m->default_access, -1);
-                if ( ret )
-                    return ret;
-
-                ret = map_vcpu_info(cd_vcpu, gfn_l,
-                                    PAGE_OFFSET(d_vcpu->vcpu_info));
-                if ( ret )
-                    return ret;
-            }
-
-            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
-        }
-
-        /* Same for the (physically registered) runstate and time info areas. */
+        /*
+         * Copy and map the vcpu_info page and the (physically registered)
+         * runstate and time info areas.
+         */
+        ret = copy_guest_area(&cd_vcpu->vcpu_info_area,
+                              &d_vcpu->vcpu_info_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
         ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
                               &d_vcpu->runstate_guest_area, cd_vcpu, d);
         if ( ret )
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -383,7 +383,7 @@ int pv_shim_shutdown(uint8_t reason)
     for_each_vcpu ( d, v )
     {
         /* Unmap guest vcpu_info page and runstate/time areas. */
-        unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->vcpu_info_area);
         unmap_guest_area(v, &v->runstate_guest_area);
         unmap_guest_area(v, &v->arch.time_guest_area);
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1547,7 +1547,7 @@ static void __update_vcpu_system_time(st
     struct vcpu_time_info *u = &vcpu_info(v, time), _u;
     const struct domain *d = v->domain;
 
-    if ( v->vcpu_info == NULL )
+    if ( !v->vcpu_info_area.map )
         return;
 
     collect_time_info(v, &_u);
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -53,7 +53,7 @@ void __dummy__(void)
 
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
-    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
+    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info_area.map);
     OFFSET(VCPU_trap_bounce, struct vcpu, arch.pv.trap_bounce);
     OFFSET(VCPU_thread_flags, struct vcpu, arch.flags);
     OFFSET(VCPU_event_addr, struct vcpu, arch.pv.event_callback_eip);
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -96,7 +96,7 @@ static void _show_registers(
     if ( context == CTXT_hypervisor )
         printk(" %pS", _p(regs->rip));
     printk("\nRFLAGS: %016lx   ", regs->rflags);
-    if ( (context == CTXT_pv_guest) && v && v->vcpu_info )
+    if ( (context == CTXT_pv_guest) && v && v->vcpu_info_area.map )
         printk("EM: %d   ", !!vcpu_info(v, evtchn_upcall_mask));
     printk("CONTEXT: %s", context_names[context]);
     if ( v && !is_idle_vcpu(v) )
--- a/xen/common/compat/domain.c
+++ b/xen/common/compat/domain.c
@@ -49,7 +49,7 @@ int compat_common_vcpu_op(int cmd, struc
     {
     case VCPUOP_initialise:
     {
-        if ( v->vcpu_info == &dummy_vcpu_info )
+        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
             return -EINVAL;
 
 #ifdef CONFIG_HVM
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -127,10 +127,10 @@ static void vcpu_info_reset(struct vcpu
 {
     struct domain *d = v->domain;
 
-    v->vcpu_info = ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
-                    ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
-                    : &dummy_vcpu_info);
-    v->vcpu_info_mfn = INVALID_MFN;
+    v->vcpu_info_area.map =
+        ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
+         ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
+         : &dummy_vcpu_info);
 }
 
 static void vmtrace_free_buffer(struct vcpu *v)
@@ -964,7 +964,7 @@ int domain_kill(struct domain *d)
             return -ERESTART;
         for_each_vcpu ( d, v )
         {
-            unmap_vcpu_info(v);
+            unmap_guest_area(v, &v->vcpu_info_area);
             unmap_guest_area(v, &v->runstate_guest_area);
         }
         d->is_dying = DOMDYING_dead;
@@ -1419,7 +1419,7 @@ int domain_soft_reset(struct domain *d,
     for_each_vcpu ( d, v )
     {
         set_xen_guest_handle(runstate_guest(v), NULL);
-        unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->vcpu_info_area);
         unmap_guest_area(v, &v->runstate_guest_area);
     }
 
@@ -1467,111 +1467,6 @@ int vcpu_reset(struct vcpu *v)
     return rc;
 }
 
-/*
- * Map a guest page in and point the vcpu_info pointer at it.  This
- * makes sure that the vcpu_info is always pointing at a valid piece
- * of memory, and it sets a pending event to make sure that a pending
- * event doesn't get missed.
- */
-int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset)
-{
-    struct domain *d = v->domain;
-    void *mapping;
-    vcpu_info_t *new_info;
-    struct page_info *page;
-    unsigned int align;
-
-    if ( offset > (PAGE_SIZE - sizeof(*new_info)) )
-        return -ENXIO;
-
-#ifdef CONFIG_COMPAT
-    BUILD_BUG_ON(sizeof(*new_info) != sizeof(new_info->compat));
-    if ( has_32bit_shinfo(d) )
-        align = alignof(new_info->compat);
-    else
-#endif
-        align = alignof(*new_info);
-    if ( offset & (align - 1) )
-        return -ENXIO;
-
-    if ( !mfn_eq(v->vcpu_info_mfn, INVALID_MFN) )
-        return -EINVAL;
-
-    /* Run this command on yourself or on other offline VCPUS. */
-    if ( (v != current) && !(v->pause_flags & VPF_down) )
-        return -EINVAL;
-
-    page = get_page_from_gfn(d, gfn, NULL, P2M_UNSHARE);
-    if ( !page )
-        return -EINVAL;
-
-    if ( !get_page_type(page, PGT_writable_page) )
-    {
-        put_page(page);
-        return -EINVAL;
-    }
-
-    mapping = __map_domain_page_global(page);
-    if ( mapping == NULL )
-    {
-        put_page_and_type(page);
-        return -ENOMEM;
-    }
-
-    new_info = (vcpu_info_t *)(mapping + offset);
-
-    if ( v->vcpu_info == &dummy_vcpu_info )
-    {
-        memset(new_info, 0, sizeof(*new_info));
-#ifdef XEN_HAVE_PV_UPCALL_MASK
-        __vcpu_info(v, new_info, evtchn_upcall_mask) = 1;
-#endif
-    }
-    else
-    {
-        memcpy(new_info, v->vcpu_info, sizeof(*new_info));
-    }
-
-    v->vcpu_info = new_info;
-    v->vcpu_info_mfn = page_to_mfn(page);
-
-    /* Set new vcpu_info pointer /before/ setting pending flags. */
-    smp_wmb();
-
-    /*
-     * Mark everything as being pending just to make sure nothing gets
-     * lost.  The domain will get a spurious event, but it can cope.
-     */
-#ifdef CONFIG_COMPAT
-    if ( !has_32bit_shinfo(d) )
-        write_atomic(&new_info->native.evtchn_pending_sel, ~0);
-    else
-#endif
-        write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
-    vcpu_mark_events_pending(v);
-
-    return 0;
-}
-
-/*
- * Unmap the vcpu info page if the guest decided to place it somewhere
- * else. This is used from domain_kill() and domain_soft_reset().
- */
-void unmap_vcpu_info(struct vcpu *v)
-{
-    mfn_t mfn = v->vcpu_info_mfn;
-
-    if ( mfn_eq(mfn, INVALID_MFN) )
-        return;
-
-    unmap_domain_page_global((void *)
-                             ((unsigned long)v->vcpu_info & PAGE_MASK));
-
-    vcpu_info_reset(v); /* NB: Clobbers v->vcpu_info_mfn */
-
-    put_page_and_type(mfn_to_page(mfn));
-}
-
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
@@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
 
     domain_lock(d);
 
-    if ( map )
-        populate(map, v);
+    /* No re-registration of the vCPU info area. */
+    if ( area != &v->vcpu_info_area || !area->pg )
+    {
+        if ( map )
+            populate(map, v);
 
-    SWAP(area->pg, pg);
-    SWAP(area->map, map);
+        SWAP(area->pg, pg);
+        SWAP(area->map, map);
+    }
+    else
+        rc = -EBUSY;
 
     domain_unlock(d);
 
+    /* Set pending flags /after/ new vcpu_info pointer was set. */
+    if ( area == &v->vcpu_info_area && !rc )
+    {
+        /*
+         * Mark everything as being pending just to make sure nothing gets
+         * lost.  The domain will get a spurious event, but it can cope.
+         */
+#ifdef CONFIG_COMPAT
+        if ( !has_32bit_shinfo(d) )
+        {
+            vcpu_info_t *info = area->map;
+
+            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
+            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
+            write_atomic(&info->native.evtchn_pending_sel, ~0);
+        }
+        else
+#endif
+            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
+        vcpu_mark_events_pending(v);
+
+        force_update_vcpu_system_time(v);
+    }
+
     if ( v != current )
         vcpu_unpause(v);
 
@@ -1670,7 +1595,10 @@ void unmap_guest_area(struct vcpu *v, st
 
     domain_lock(d);
     map = area->map;
-    area->map = NULL;
+    if ( area == &v->vcpu_info_area )
+        vcpu_info_reset(v);
+    else
+        area->map = NULL;
     pg = area->pg;
     area->pg = NULL;
     domain_unlock(d);
@@ -1801,6 +1729,27 @@ bool update_runstate_area(struct vcpu *v
     return rc;
 }
 
+/*
+ * This makes sure that the vcpu_info is always pointing at a valid piece of
+ * memory, and it sets a pending event to make sure that a pending event
+ * doesn't get missed.
+ */
+static void cf_check
+vcpu_info_populate(void *map, struct vcpu *v)
+{
+    vcpu_info_t *info = map;
+
+    if ( v->vcpu_info_area.map == &dummy_vcpu_info )
+    {
+        memset(info, 0, sizeof(*info));
+#ifdef XEN_HAVE_PV_UPCALL_MASK
+        __vcpu_info(v, info, evtchn_upcall_mask) = 1;
+#endif
+    }
+    else
+        memcpy(info, v->vcpu_info_area.map, sizeof(*info));
+}
+
 static void cf_check
 runstate_area_populate(void *map, struct vcpu *v)
 {
@@ -1830,7 +1779,7 @@ long common_vcpu_op(int cmd, struct vcpu
     switch ( cmd )
     {
     case VCPUOP_initialise:
-        if ( v->vcpu_info == &dummy_vcpu_info )
+        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
             return -EINVAL;
 
         rc = arch_initialise_vcpu(v, arg);
@@ -1961,16 +1910,29 @@ long common_vcpu_op(int cmd, struct vcpu
     case VCPUOP_register_vcpu_info:
     {
         struct vcpu_register_vcpu_info info;
+        paddr_t gaddr;
 
         rc = -EFAULT;
         if ( copy_from_guest(&info, arg, 1) )
             break;
 
-        domain_lock(d);
-        rc = map_vcpu_info(v, info.mfn, info.offset);
-        domain_unlock(d);
+        rc = -EINVAL;
+        gaddr = gfn_to_gaddr(_gfn(info.mfn)) + info.offset;
+        if ( !~gaddr ||
+             gfn_x(gaddr_to_gfn(gaddr)) != info.mfn )
+            break;
 
-        force_update_vcpu_system_time(v);
+        /* Preliminary check only; see map_guest_area(). */
+        rc = -EBUSY;
+        if ( v->vcpu_info_area.pg )
+            break;
+
+        /* See the BUILD_BUG_ON() in vcpu_info_populate(). */
+        rc = map_guest_area(v, gaddr, sizeof(vcpu_info_t),
+                            &v->vcpu_info_area, vcpu_info_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
 
         break;
     }
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -79,9 +79,6 @@ void cf_check free_pirq_struct(void *);
 int  arch_vcpu_create(struct vcpu *v);
 void arch_vcpu_destroy(struct vcpu *v);
 
-int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset);
-void unmap_vcpu_info(struct vcpu *v);
-
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v));
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -175,7 +175,7 @@ struct vcpu
 
     int              processor;
 
-    vcpu_info_t     *vcpu_info;
+    struct guest_area vcpu_info_area;
 
     struct domain   *domain;
 
@@ -288,9 +288,6 @@ struct vcpu
 
     struct waitqueue_vcpu *waitqueue_vcpu;
 
-    /* Guest-specified relocation of vcpu_info. */
-    mfn_t            vcpu_info_mfn;
-
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
     /* vPCI per-vCPU area, used to store data for long running operations. */
--- a/xen/include/xen/shared.h
+++ b/xen/include/xen/shared.h
@@ -44,6 +44,7 @@ typedef struct vcpu_info vcpu_info_t;
 extern vcpu_info_t dummy_vcpu_info;
 
 #define shared_info(d, field)      __shared_info(d, (d)->shared_info, field)
-#define vcpu_info(v, field)        __vcpu_info(v, (v)->vcpu_info, field)
+#define vcpu_info(v, field)        \
+        __vcpu_info(v, (vcpu_info_t *)(v)->vcpu_info_area.map, field)
 
 #endif /* __XEN_SHARED_H__ */



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-05-03 15:56 ` [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
@ 2023-05-03 17:14   ` Tamas K Lengyel
  2023-05-04  7:44     ` Jan Beulich
  2023-09-27 11:08   ` Roger Pau Monné
  1 sibling, 1 reply; 38+ messages in thread
From: Tamas K Lengyel @ 2023-05-03 17:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>
>   state:
>      if ( reset_state )
> +    {
>          rc = copy_settings(d, pd);
> +        /* TBD: What to do here with -ERESTART? */

Ideally we could avoid hitting code-paths that are restartable during fork
reset since it gets called from vm_event replies that have no concept of
handling errors. If we start having errors like this we would just have to
drop the vm_event reply optimization and issue a standalone fork reset
hypercall every time which isn't a big deal, it's just slower. My
preference would actually be that after the initial forking is performed a
local copy of the parent's state is maintained for the fork to reset to so
there would be no case of hitting an error like this. It would also allow
us in the future to unpause the parent for example..

Tamas


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-05-03 17:14   ` Tamas K Lengyel
@ 2023-05-04  7:44     ` Jan Beulich
  2023-05-04 12:50       ` Tamas K Lengyel
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-05-04  7:44 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 03.05.2023 19:14, Tamas K Lengyel wrote:
>> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>>
>>   state:
>>      if ( reset_state )
>> +    {
>>          rc = copy_settings(d, pd);
>> +        /* TBD: What to do here with -ERESTART? */
> 
> Ideally we could avoid hitting code-paths that are restartable during fork
> reset since it gets called from vm_event replies that have no concept of
> handling errors. If we start having errors like this we would just have to
> drop the vm_event reply optimization and issue a standalone fork reset
> hypercall every time which isn't a big deal, it's just slower.

I'm afraid I don't follow: We are in the process of fork-reset here. How
would issuing "a standalone fork reset hypercall every time" make this
any different? The possible need for a continuation here comes from a
failed spin_trylock() in map_guest_area(). That won't change the next
time round.

But perhaps I should say that till now I didn't even pay much attention
to the 2nd use of the function by vm_event_resume(); I was mainly
focused on the one from XENMEM_sharing_op_fork_reset, where no
continuation handling exists. Yet perhaps your comment is mainly
related to that use?

I actually notice that the comment ahead of the function already has a
continuation related TODO, just that there thought is only of larger
memory footprint.

> My
> preference would actually be that after the initial forking is performed a
> local copy of the parent's state is maintained for the fork to reset to so
> there would be no case of hitting an error like this. It would also allow
> us in the future to unpause the parent for example..

Oh, I guess I didn't realize the parent was kept paused. Such state
copying / caching may then indeed be a possibility, but that's nothing
I can see myself deal with, even less so in the context of this series.
I need a solution to the problem at hand within the scope of what is
there right now (or based on what could be provided e.g. by you within
the foreseeable future). Bubbling up the need for continuation from the
XENMEM_sharing_op_fork_reset path is the most I could see me handle
myself ... For vm_event_resume() bubbling state up the domctl path
_may_ also be doable, but mem_sharing_notification() and friends don't
even check the function's return value.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-05-04  7:44     ` Jan Beulich
@ 2023-05-04 12:50       ` Tamas K Lengyel
  2023-05-04 14:25         ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Tamas K Lengyel @ 2023-05-04 12:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


On Thu, May 4, 2023 at 3:44 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2023 19:14, Tamas K Lengyel wrote:
> >> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
> >>
> >>   state:
> >>      if ( reset_state )
> >> +    {
> >>          rc = copy_settings(d, pd);
> >> +        /* TBD: What to do here with -ERESTART? */
> >
> > Ideally we could avoid hitting code-paths that are restartable during
> > fork reset since it gets called from vm_event replies that have no
> > concept of handling errors. If we start having errors like this we
> > would just have to drop the vm_event reply optimization and issue a
> > standalone fork reset hypercall every time which isn't a big deal,
> > it's just slower.
>
> I'm afraid I don't follow: We are in the process of fork-reset here. How
> would issuing "a standalone fork reset hypercall every time" make this
> any different? The possible need for a continuation here comes from a
> failed spin_trylock() in map_guest_area(). That won't change the next
> time round.

Why not? Who is holding the lock and why wouldn't it ever relinquish it? If
that's really true then there is a larger issue than just not being able to
report the error back to the user on the vm_event_resume path and we need to
devise a way of being able to copy this from the parent bypassing this
lock. The parent is paused and its state should not be changing while forks
are active, so turning the lock into an rwlock of some sort and acquiring
the read-lock would perhaps be a possible way out of this.

>
> But perhaps I should say that till now I didn't even pay much attention
> to the 2nd use of the function by vm_event_resume(); I was mainly
> focused on the one from XENMEM_sharing_op_fork_reset, where no
> continuation handling exists. Yet perhaps your comment is mainly
> related to that use?
>
> I actually notice that the comment ahead of the function already has a
> continuation-related TODO, just that the thought there is only of the larger
> memory footprint.

With XENMEM_sharing_op_fork_reset the caller actually receives the error
code and can decide what to do next. With vm_event_resume there is no path
currently to notify the agent of an error. We could generate another
vm_event to send such an error, but the expectation with fork_reset is that
it will always work because the parent is paused, so not having that path
for an error to get back to the agent isn't a big deal.

Now, if it becomes the case that due to this locking we can get an error
even while the parent is paused, that will render the vm_event_resume path
unreliable, so we would just switch to using XENMEM_sharing_op_fork_reset so
that at least it can re-try in case of an issue. Of course, only if a
reissue of the hypercall has any reasonable chance of succeeding.

>
> > My
> > preference would actually be that after the initial forking is performed a
> > local copy of the parent's state is maintained for the fork to reset to so
> > there would be no case of hitting an error like this. It would also allow
> > us in the future to unpause the parent for example..
>
> Oh, I guess I didn't realize the parent was kept paused. Such state
> copying / caching may then indeed be a possibility, but that's nothing
> I can see myself deal with, even less so in the context of this series.
> I need a solution to the problem at hand within the scope of what is
> there right now (or based on what could be provided e.g. by you within
> the foreseeable future). Bubbling up the need for continuation from the
> XENMEM_sharing_op_fork_reset path is the most I could see me handle
> myself ... For vm_event_resume() bubbling state up the domctl path
> _may_ also be doable, but mem_sharing_notification() and friends don't
> even check the function's return value.

Sure, I wasn't expecting that work to be done as part of this series, just
as something I would like to get to in the future.

Tamas


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-05-04 12:50       ` Tamas K Lengyel
@ 2023-05-04 14:25         ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-05-04 14:25 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 04.05.2023 14:50, Tamas K Lengyel wrote:
> On Thu, May 4, 2023 at 3:44 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 03.05.2023 19:14, Tamas K Lengyel wrote:
>>>> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>>>>
>>>>   state:
>>>>      if ( reset_state )
>>>> +    {
>>>>          rc = copy_settings(d, pd);
>>>> +        /* TBD: What to do here with -ERESTART? */
>>>
>>> Ideally we could avoid hitting code-paths that are restartable during fork
>>> reset since it gets called from vm_event replies that have no concept of
>>> handling errors. If we start having errors like this we would just have to
>>> drop the vm_event reply optimization and issue a standalone fork reset
>>> hypercall every time which isn't a big deal, it's just slower.
>>
>> I'm afraid I don't follow: We are in the process of fork-reset here. How
>> would issuing "a standalone fork reset hypercall every time" make this
>> any different? The possible need for a continuation here comes from a
>> failed spin_trylock() in map_guest_area(). That won't change the next
>> time round.
> 
> Why not? Who is holding the lock and why wouldn't it ever relinquish it?

What state is the fork in at that point in time? We're talking about the
fork's hypercall deadlock mutex here, after all. Hence if we knew the
fork is paused (just like the parent is), then I don't think -ERESTART
can be coming back. (To be precise, both paths leading here are of
interest, yet the state the fork is in may be different in both cases.)

> If
> that's really true then there is a larger issue than just not being able to
> report the error back to the user on the vm_event_resume path and we need to
> devise a way of being able to copy this from the parent bypassing this
> lock.

The issue isn't that the lock will never become available. But we can't
predict how many attempts it'll take.

But my earlier question went in a different direction anyway: You did
suggest to replace a fork-reset by "a standalone fork reset hypercall
every time". I somehow don't get the difference (but it looks like some
of your further reply down from here addresses that).

> The parent is paused and its state should not be changing while forks
> are active, so turning the lock into an rwlock of some sort and acquiring
> the read-lock would perhaps be a possible way out of this.

Given the specific lock we're talking about here, an rwlock is out of
the question, I think.

>> But perhaps I should say that till now I didn't even pay much attention
>> to the 2nd use of the function by vm_event_resume(); I was mainly
>> focused on the one from XENMEM_sharing_op_fork_reset, where no
>> continuation handling exists. Yet perhaps your comment is mainly
>> related to that use?
>>
>> I actually notice that the comment ahead of the function already has a
>> continuation-related TODO, just that the thought there is only of the larger
>> memory footprint.
> 
> With XENMEM_sharing_op_fork_reset the caller actually receives the error
> code and can decide what to do next. With vm_event_resume there is no path
> currently to notify the agent of an error. We could generate another
> vm_event to send such an error, but the expectation with fork_reset is that
> it will always work because the parent is paused, so not having that path
> for an error to get back to the agent isn't a big deal.
> 
> Now, if it becomes the case that due to this locking we can get an error
> even while the parent is paused, that will render the vm_event_resume path
> unreliable, so we would just switch to using XENMEM_sharing_op_fork_reset so
> that at least it can re-try in case of an issue. Of course, only if a
> reissue of the hypercall has any reasonable chance of succeeding.

(I think this is the explanation for the "standalone reset hypercall".)

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-05-03 15:54 ` [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
@ 2023-09-27  8:51   ` Roger Pau Monné
  2023-09-27  9:55     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27  8:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, May 03, 2023 at 05:54:47PM +0200, Jan Beulich wrote:
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, add the necessary domain cleanup hooks.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Julien Grall <jgrall@amazon.com>
> ---
> RFC: Zapping the areas in pv_shim_shutdown() may not be strictly
>      necessary: Aiui unmap_vcpu_info() is called only because the vCPU
>      info area cannot be re-registered. Beyond that I guess the
>      assumption is that the areas would only be re-registered as they
>      were before. If that's not the case I wonder whether the guest
>      handles for both areas shouldn't also be zapped.

IIRC the reason was to unset v->vcpu_info_mfn so that it could be set
again on resume from suspension, or else the hypercall will return an
error.

I guess we should also zap the runstate handlers, or else we might
corrupt guest state.
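
E.g. (an illustrative sketch only, extending the pv_shim_shutdown() hunk
below with zapping of the guest handles for both areas):

    for_each_vcpu ( d, v )
    {
        /* Zap the VA-based handles too, so Xen can't write through stale
         * registrations after resume. */
        set_xen_guest_handle(runstate_guest(v), NULL);
        set_xen_guest_handle(v->arch.time_info_guest, NULL);

        unmap_vcpu_info(v);
        unmap_guest_area(v, &v->runstate_guest_area);
        unmap_guest_area(v, &v->arch.time_guest_area);
    }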

> ---
> v2: Add assertion in unmap_guest_area().
> 
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1019,7 +1019,10 @@ int arch_domain_soft_reset(struct domain
>      }
>  
>      for_each_vcpu ( d, v )
> +    {
>          set_xen_guest_handle(v->arch.time_info_guest, NULL);
> +        unmap_guest_area(v, &v->arch.time_guest_area);
> +    }
>  
>   exit_put_gfn:
>      put_gfn(d, gfn_x(gfn));
> @@ -2356,6 +2359,8 @@ int domain_relinquish_resources(struct d
>              if ( ret )
>                  return ret;
>  
> +            unmap_guest_area(v, &v->arch.time_guest_area);
> +
>              vpmu_destroy(v);
>          }
>  
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -669,6 +669,7 @@ struct arch_vcpu
>  
>      /* A secondary copy of the vcpu time info. */
>      XEN_GUEST_HANDLE(vcpu_time_info_t) time_info_guest;
> +    struct guest_area time_guest_area;
>  
>      struct arch_vm_event *vm_event;
>  
> --- a/xen/arch/x86/pv/shim.c
> +++ b/xen/arch/x86/pv/shim.c
> @@ -382,8 +382,10 @@ int pv_shim_shutdown(uint8_t reason)
>  
>      for_each_vcpu ( d, v )
>      {
> -        /* Unmap guest vcpu_info pages. */
> +        /* Unmap guest vcpu_info page and runstate/time areas. */
>          unmap_vcpu_info(v);
> +        unmap_guest_area(v, &v->runstate_guest_area);
> +        unmap_guest_area(v, &v->arch.time_guest_area);
>  
>          /* Reset the periodic timer to the default value. */
>          vcpu_set_periodic_timer(v, MILLISECS(10));
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -963,7 +963,10 @@ int domain_kill(struct domain *d)
>          if ( cpupool_move_domain(d, cpupool0) )
>              return -ERESTART;
>          for_each_vcpu ( d, v )
> +        {
>              unmap_vcpu_info(v);
> +            unmap_guest_area(v, &v->runstate_guest_area);
> +        }
>          d->is_dying = DOMDYING_dead;
>          /* Mem event cleanup has to go here because the rings 
>           * have to be put before we call put_domain. */
> @@ -1417,6 +1420,7 @@ int domain_soft_reset(struct domain *d,
>      {
>          set_xen_guest_handle(runstate_guest(v), NULL);
>          unmap_vcpu_info(v);
> +        unmap_guest_area(v, &v->runstate_guest_area);
>      }
>  
>      rc = arch_domain_soft_reset(d);
> @@ -1568,6 +1572,19 @@ void unmap_vcpu_info(struct vcpu *v)
>      put_page_and_type(mfn_to_page(mfn));
>  }
>  
> +/*
> + * This is only intended to be used for domain cleanup (or more generally only
> + * with at least the respective vCPU, if it's not the current one, reliably
> + * paused).
> + */
> +void unmap_guest_area(struct vcpu *v, struct guest_area *area)

vcpu param could be const given the current code, but I assume this
will be changed by future patches.  Same could be said about
guest_area but I'm confident that will change.

> +{
> +    struct domain *d = v->domain;
> +
> +    if ( v != current )
> +        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));

Isn't this racy?  What guarantees that the vcpu won't be kicked just
after the check has been performed?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/8] domain: update GADDR based runstate guest area
  2023-05-03 15:55 ` [PATCH v3 2/8] domain: update GADDR based runstate guest area Jan Beulich
@ 2023-09-27  9:44   ` Roger Pau Monné
  2023-09-27 10:19     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27  9:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, May 03, 2023 at 05:55:11PM +0200, Jan Beulich wrote:
> Before adding a new vCPU operation to register the runstate area by
> guest-physical address, add code to actually keep such areas up-to-date.
> 
> Note that updating of the area will be done exclusively following the
> model enabled by VMASST_TYPE_runstate_update_flag for virtual-address
> based registered areas.
> 
> Note further that pages aren't marked dirty when written to (matching
> the handling of space mapped by map_vcpu_info()), on the basis that the
> registrations are lost anyway across migration (or would need re-
> populating at the target for transparent migration). Plus the contents
> of the areas in question have to be deemed volatile in the first place
> (so saving a "most recent" value is pretty meaningless even for e.g.
> snapshotting).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> RFC: HVM guests (on x86) can change bitness and hence layout (and size!
>      and alignment) of the runstate area. I don't think it is an option
>      to require 32-bit code to pass a range such that even the 64-bit
>      layout wouldn't cross a page boundary (and be suitably aligned). I
>      also don't see any other good solution, so for now a crude approach
>      with an extra boolean is used (using has_32bit_shinfo() isn't race
>      free and could hence lead to overrunning the mapped space).

Shouldn't a well behaved guest attempt to unmap the runstate area
before changing bitness?  I would expect this to happen for example
when OVMF relinquishes control to the OS.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-09-27  8:51   ` Roger Pau Monné
@ 2023-09-27  9:55     ` Jan Beulich
  2023-09-27 10:42       ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-09-27  9:55 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 10:51, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:54:47PM +0200, Jan Beulich wrote:
>> In preparation of the introduction of new vCPU operations allowing to
>> register the respective areas (one of the two is x86-specific) by
>> guest-physical address, add the necessary domain cleanup hooks.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Julien Grall <jgrall@amazon.com>
>> ---
>> RFC: Zapping the areas in pv_shim_shutdown() may not be strictly
>>      necessary: Aiui unmap_vcpu_info() is called only because the vCPU
>>      info area cannot be re-registered. Beyond that I guess the
>>      assumption is that the areas would only be re-registered as they
>>      were before. If that's not the case I wonder whether the guest
>>      handles for both areas shouldn't also be zapped.
> 
> IIRC the reason was to unset v->vcpu_info_mfn so that it could be set
> again on resume from suspension, or else the hypercall will return an
> error.

Right, that's the re-registration aspect mentioned.

> I guess we should also zap the runstate handlers, or else we might
> corrupt guest state.

So you think the guest might re-register a different area post resume?
I can certainly add another patch to zap the handles; I don't think it
should be done right here, not the least because if we see room for
such behavior, that change may even want backporting.

>> @@ -1568,6 +1572,19 @@ void unmap_vcpu_info(struct vcpu *v)
>>      put_page_and_type(mfn_to_page(mfn));
>>  }
>>  
>> +/*
>> + * This is only intended to be used for domain cleanup (or more generally only
>> + * with at least the respective vCPU, if it's not the current one, reliably
>> + * paused).
>> + */
>> +void unmap_guest_area(struct vcpu *v, struct guest_area *area)
> 
> vcpu param could be const given the current code, but I assume this
> will be changed by future patches.  Same could be said about
> guest_area but I'm confident that will change.

True for both.

>> +{
>> +    struct domain *d = v->domain;
>> +
>> +    if ( v != current )
>> +        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
> 
> Isn't this racy?

It is, yes.

>  What guarantees that the vcpu won't be kicked just
> after the check has been performed?

Nothing. This check isn't any better than assertions towards an ordinary
spinlock being held. I assume you realize that we've got a number of such
assertions elsewhere already.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/8] x86: update GADDR based secondary time area
  2023-05-03 15:55 ` [PATCH v3 3/8] x86: update GADDR based secondary time area Jan Beulich
@ 2023-09-27 10:14   ` Roger Pau Monné
  0 siblings, 0 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 10:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, George Dunlap, Wei Liu

On Wed, May 03, 2023 at 05:55:52PM +0200, Jan Beulich wrote:
> Before adding a new vCPU operation to register the secondary time area
> by guest-physical address, add code to actually keep such areas up-to-
> date.
> 
> Note that pages aren't marked dirty when written to (matching the
> handling of space mapped by map_vcpu_info()), on the basis that the
> registrations are lost anyway across migration (or would need re-
> populating at the target for transparent migration). Plus the contents
> of the areas in question have to be deemed volatile in the first place
> (so saving a "most recent" value is pretty meaningless even for e.g.
> snapshotting).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/8] domain: update GADDR based runstate guest area
  2023-09-27  9:44   ` Roger Pau Monné
@ 2023-09-27 10:19     ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 10:19 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 11:44, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:55:11PM +0200, Jan Beulich wrote:
>> Before adding a new vCPU operation to register the runstate area by
>> guest-physical address, add code to actually keep such areas up-to-date.
>>
>> Note that updating of the area will be done exclusively following the
>> model enabled by VMASST_TYPE_runstate_update_flag for virtual-address
>> based registered areas.
>>
>> Note further that pages aren't marked dirty when written to (matching
>> the handling of space mapped by map_vcpu_info()), on the basis that the
>> registrations are lost anyway across migration (or would need re-
>> populating at the target for transparent migration). Plus the contents
>> of the areas in question have to be deemed volatile in the first place
>> (so saving a "most recent" value is pretty meaningless even for e.g.
>> snapshotting).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> ---
>> RFC: HVM guests (on x86) can change bitness and hence layout (and size!
>>      and alignment) of the runstate area. I don't think it is an option
>>      to require 32-bit code to pass a range such that even the 64-bit
>>      layout wouldn't cross a page boundary (and be suitably aligned). I
>>      also don't see any other good solution, so for now a crude approach
>>      with an extra boolean is used (using has_32bit_shinfo() isn't race
>>      free and could hence lead to overrunning the mapped space).
> 
> Shouldn't a well behaved guest attempt to unmap the runstate area
> before changing bitness?  I would expect this to happen for example
> when OVMF relinquishes control to the OS.

Well, the OVMF example falls in a different class: Before transferring
ownership of the system, it wants to unmap regardless of whether bitness
is going to change, or else memory corruption can occur to the following
entity.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-09-27  9:55     ` Jan Beulich
@ 2023-09-27 10:42       ` Roger Pau Monné
  2023-09-27 10:46         ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 10:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, Sep 27, 2023 at 11:55:19AM +0200, Jan Beulich wrote:
> On 27.09.2023 10:51, Roger Pau Monné wrote:
> > On Wed, May 03, 2023 at 05:54:47PM +0200, Jan Beulich wrote:
> > I guess we should also zap the runstate handlers, or else we might
> > corrupt guest state.
> 
> So you think the guest might re-register a different area post resume?
> I can certainly add another patch to zap the handles; I don't think it
> should be done right here, not the least because if we see room for
> such behavior, that change may even want backporting.

For correctness it would be good to zap them, but there's no rush, as
I do think guests will set to the same address on resume.

> >> @@ -1568,6 +1572,19 @@ void unmap_vcpu_info(struct vcpu *v)
> >>      put_page_and_type(mfn_to_page(mfn));
> >>  }
> >>  
> >> +/*
> >> + * This is only intended to be used for domain cleanup (or more generally only
> >> + * with at least the respective vCPU, if it's not the current one, reliably
> >> + * paused).
> >> + */
> >> +void unmap_guest_area(struct vcpu *v, struct guest_area *area)
> > 
> > vcpu param could be const given the current code, but I assume this
> > will be changed by future patches.  Same could be said about
> > guest_area but I'm confident that will change.
> 
> True for both.
> 
> >> +{
> >> +    struct domain *d = v->domain;
> >> +
> >> +    if ( v != current )
> >> +        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
> > 
> > Isn't this racy?
> 
> It is, yes.
> 
> >  What guarantees that the vcpu won't be kicked just
> > after the check has been performed?
> 
> Nothing. This check isn't any better than assertions towards an ordinary
> spinlock being held. I assume you realize that we've got a number of such
> assertions elsewhere already.

Right, but different from spinlock assertions, the code here could be
made safe just by pausing the vCPU?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-09-27 10:42       ` Roger Pau Monné
@ 2023-09-27 10:46         ` Jan Beulich
  2023-09-27 10:50           ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 10:46 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 12:42, Roger Pau Monné wrote:
> On Wed, Sep 27, 2023 at 11:55:19AM +0200, Jan Beulich wrote:
>> On 27.09.2023 10:51, Roger Pau Monné wrote:
>>> On Wed, May 03, 2023 at 05:54:47PM +0200, Jan Beulich wrote:
>>> I guess we should also zap the runstate handlers, or else we might
>>> corrupt guest state.
>>
>> So you think the guest might re-register a different area post resume?
>> I can certainly add another patch to zap the handles; I don't think it
>> should be done right here, not the least because if we see room for
>> such behavior, that change may even want backporting.
> 
> For correctness it would be good to zap them, but there's no rush, as
> I do think guests will set to the same address on resume.

Well, I've already added a new patch coming ahead of this one.

>>>> @@ -1568,6 +1572,19 @@ void unmap_vcpu_info(struct vcpu *v)
>>>>      put_page_and_type(mfn_to_page(mfn));
>>>>  }
>>>>  
>>>> +/*
>>>> + * This is only intended to be used for domain cleanup (or more generally only
>>>> + * with at least the respective vCPU, if it's not the current one, reliably
>>>> + * paused).
>>>> + */
>>>> +void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>>>
>>> vcpu param could be const given the current code, but I assume this
>>> will be changed by future patches.  Same could be said about
>>> guest_area but I'm confident that will change.
>>
>> True for both.
>>
>>>> +{
>>>> +    struct domain *d = v->domain;
>>>> +
>>>> +    if ( v != current )
>>>> +        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
>>>
>>> Isn't this racy?
>>
>> It is, yes.
>>
>>>  What guarantees that the vcpu won't be kicked just
>>> after the check has been performed?
>>
>> Nothing. This check isn't any better than assertions towards an ordinary
>> spinlock being held. I assume you realize that we've got a number of such
>> assertions elsewhere already.
> 
> Right, but different from spinlock assertions, the code here could be
> made safe just by pausing the vCPU?

That's what the assertion is checking (see also the comment ahead of the
function). It's just that the assertions cannot be made more strict, at
least from all I can tell.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-09-27 10:46         ` Jan Beulich
@ 2023-09-27 10:50           ` Roger Pau Monné
  2023-09-27 11:44             ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 10:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, Sep 27, 2023 at 12:46:07PM +0200, Jan Beulich wrote:
> On 27.09.2023 12:42, Roger Pau Monné wrote:
> > On Wed, Sep 27, 2023 at 11:55:19AM +0200, Jan Beulich wrote:
> >> On 27.09.2023 10:51, Roger Pau Monné wrote:
> >>> On Wed, May 03, 2023 at 05:54:47PM +0200, Jan Beulich wrote:
> >>>> +{
> >>>> +    struct domain *d = v->domain;
> >>>> +
> >>>> +    if ( v != current )
> >>>> +        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
> >>>
> >>> Isn't this racy?
> >>
> >> It is, yes.
> >>
> >>>  What guarantees that the vcpu won't be kicked just
> >>> after the check has been performed?
> >>
> >> Nothing. This check isn't any better than assertions towards an ordinary
> >> spinlock being held. I assume you realize that we've got a number of such
> >> assertions elsewhere already.
> > 
> > Right, but different from spinlock assertions, the code here could be
> > made safe just by pausing the vCPU?
> 
> That's what the assertion is checking (see also the comment ahead of the
> function). It's just that the assertions cannot be made more strict, at
> least from all I can tell.

But the assertion might no longer be true by the time the code
afterwards is executed.  Why not wrap the code in a pair of
vcpu_{,un}pause() calls?
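
Something along these lines is what I have in mind (a sketch only, not a
worked-out patch):

    void unmap_guest_area(struct vcpu *v, struct guest_area *area)
    {
        if ( v != current )
            vcpu_pause(v);

        /* ... existing teardown of area->map / area->pg, under the domain
         * lock ... */

        if ( v != current )
            vcpu_unpause(v);
    }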

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-05-03 15:56 ` [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
  2023-05-03 17:14   ` Tamas K Lengyel
@ 2023-09-27 11:08   ` Roger Pau Monné
  2023-09-27 12:06     ` Jan Beulich
       [not found]     ` <CABfawhkn1xXA+qEjB4-HtOVUZHONDE6ngMJZPe3fSPtoAtmg+Q@mail.gmail.com>
  1 sibling, 2 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 11:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, add the necessary fork handling (with the
> backing function yet to be filled in).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Given the very limited and specific usage of the current Xen forking
code, do we really need to bother about copying such areas?

IOW: I doubt that not updating the runstate/time areas will make any
difference to the forking code ATM.

> ---
> v3: Extend comment.
> 
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>  }
>  
> +static int copy_guest_area(struct guest_area *cd_area,
> +                           const struct guest_area *d_area,
> +                           struct vcpu *cd_vcpu,
> +                           const struct domain *d)
> +{
> +    mfn_t d_mfn, cd_mfn;
> +
> +    if ( !d_area->pg )
> +        return 0;
> +
> +    d_mfn = page_to_mfn(d_area->pg);
> +
> +    /* Allocate & map a page for the area if it hasn't been already. */
> +    if ( !cd_area->pg )
> +    {
> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> +        p2m_type_t p2mt;
> +        p2m_access_t p2ma;
> +        unsigned int offset;
> +        int ret;
> +
> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> +        {
> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> +
> +            if ( !pg )
> +                return -ENOMEM;
> +
> +            cd_mfn = page_to_mfn(pg);
> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> +
> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> +                                 p2m->default_access, -1);
> +            if ( ret )
> +                return ret;
> +        }
> +        else if ( p2mt != p2m_ram_rw )
> +            return -EBUSY;

Shouldn't the populate of the underlying gfn in the fork case
be done by map_guest_area itself?

What if a forked guest attempts to register a new runstate/time info
against an address not yet populated?

> +        /*
> +         * Map the into the guest. For simplicity specify the entire range up
> +         * to the end of the page: All the function uses it for is to check
> +         * that the range doesn't cross page boundaries. Having the area mapped
> +         * in the original domain implies that it fits there and therefore will
> +         * also fit in the clone.
> +         */
> +        offset = PAGE_OFFSET(d_area->map);
> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> +                             PAGE_SIZE - offset, cd_area, NULL);
> +        if ( ret )
> +            return ret;
> +    }
> +    else
> +        cd_mfn = page_to_mfn(cd_area->pg);
> +
> +    copy_domain_page(cd_mfn, d_mfn);
> +
> +    return 0;
> +}
> +
>  static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
>  {
>      struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
> @@ -1733,6 +1795,16 @@ static int copy_vcpu_settings(struct dom
>              copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
>          }
>  
> +        /* Same for the (physically registered) runstate and time info areas. */
> +        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
> +                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
> +        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
> +                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
> +
>          ret = copy_vpmu(d_vcpu, cd_vcpu);
>          if ( ret )
>              return ret;
> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>  
>   state:
>      if ( reset_state )
> +    {
>          rc = copy_settings(d, pd);
> +        /* TBD: What to do here with -ERESTART? */
> +    }
>  
>      domain_unpause(d);
>  
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
>      put_page_and_type(mfn_to_page(mfn));
>  }
>  
> +int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
> +                   struct guest_area *area,
> +                   void (*populate)(void *dst, struct vcpu *v))

Oh, the prototype for this is added in patch 1, almost missed it.

Why not also add this dummy implementation in patch 1 then?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown
  2023-09-27 10:50           ` Roger Pau Monné
@ 2023-09-27 11:44             ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 11:44 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 12:50, Roger Pau Monné wrote:
> On Wed, Sep 27, 2023 at 12:46:07PM +0200, Jan Beulich wrote:
>> On 27.09.2023 12:42, Roger Pau Monné wrote:
>>> On Wed, Sep 27, 2023 at 11:55:19AM +0200, Jan Beulich wrote:
>>>> On 27.09.2023 10:51, Roger Pau Monné wrote:
>>>>> On Wed, May 03, 2023 at 05:54:47PM +0200, Jan Beulich wrote:
>>>>>> +{
>>>>>> +    struct domain *d = v->domain;
>>>>>> +
>>>>>> +    if ( v != current )
>>>>>> +        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
>>>>>
>>>>> Isn't this racy?
>>>>
>>>> It is, yes.
>>>>
>>>>>  What guarantees that the vcpu won't be kicked just
>>>>> after the check has been performed?
>>>>
>>>> Nothing. This check isn't any better than assertions towards an ordinary
>>>> spinlock being held. I assume you realize that we've got a number of such
>>>> assertions elsewhere already.
>>>
>>> Right, but different from spinlock assertions, the code here could be
>>> made safe just by pausing the vCPU?
>>
>> That's what the assertion is checking (see also the comment ahead of the
>> function). It's just that the assertions cannot be made more strict, at
>> least from all I can tell.
> 
> But the assertion might no longer be true by the time the code
> afterwards is executed.  Why not wrap the code in a pair of
> vcpu_{,un}pause() calls?

Because it's not quite as simple (if I was to do so, I'd want to do it
correctly, and not just give the impression of universal usability). See
how map_guest_area() involves hypercall_deadlock_mutex. Hence I continue
to think it is okay the way I have it, with all callers satisfying the
requirement (afaict).

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-27 11:08   ` Roger Pau Monné
@ 2023-09-27 12:06     ` Jan Beulich
  2023-09-27 14:05       ` Roger Pau Monné
       [not found]     ` <CABfawhkn1xXA+qEjB4-HtOVUZHONDE6ngMJZPe3fSPtoAtmg+Q@mail.gmail.com>
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 12:06 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On 27.09.2023 13:08, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
>> In preparation of the introduction of new vCPU operations allowing to
>> register the respective areas (one of the two is x86-specific) by
>> guest-physical address, add the necessary fork handling (with the
>> backing function yet to be filled in).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Given the very limited and specific usage of the current Xen forking
> code, do we really need to bother about copying such areas?
> 
> IOW: I doubt that not updating the runstate/time areas will make any
> difference to the forking code ATM.

I expect Tamas'es reply has sufficiently addressed this question.

>> --- a/xen/arch/x86/mm/mem_sharing.c
>> +++ b/xen/arch/x86/mm/mem_sharing.c
>> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>  }
>>  
>> +static int copy_guest_area(struct guest_area *cd_area,
>> +                           const struct guest_area *d_area,
>> +                           struct vcpu *cd_vcpu,
>> +                           const struct domain *d)
>> +{
>> +    mfn_t d_mfn, cd_mfn;
>> +
>> +    if ( !d_area->pg )
>> +        return 0;
>> +
>> +    d_mfn = page_to_mfn(d_area->pg);
>> +
>> +    /* Allocate & map a page for the area if it hasn't been already. */
>> +    if ( !cd_area->pg )
>> +    {
>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>> +        p2m_type_t p2mt;
>> +        p2m_access_t p2ma;
>> +        unsigned int offset;
>> +        int ret;
>> +
>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>> +        {
>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>> +
>> +            if ( !pg )
>> +                return -ENOMEM;
>> +
>> +            cd_mfn = page_to_mfn(pg);
>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>> +
>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>> +                                 p2m->default_access, -1);
>> +            if ( ret )
>> +                return ret;
>> +        }
>> +        else if ( p2mt != p2m_ram_rw )
>> +            return -EBUSY;
> 
> Shouldn't the populate of the underlying gfn in the fork case
> be done by map_guest_area itself?

I've used the existing logic for the info area to base my code on. As
noted in the patch switching the info area logic to use the generalized
infrastructure, I've taken the liberty to address an issue in the
original logic. But it was never a goal to re-write things from scratch.
If, as Tamas appears to agree, there is a better way of layering things
here, then please as a follow-on patch.

> What if a forked guest attempts to register a new runstate/time info
> against an address not yet populated?

What if the same happened for the info area? Again, I'm not trying to
invent anything new here. But hopefully Tamas'es reply has helped here
as well.

>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
>>      put_page_and_type(mfn_to_page(mfn));
>>  }
>>  
>> +int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
>> +                   struct guest_area *area,
>> +                   void (*populate)(void *dst, struct vcpu *v))
> 
> Oh, the prototype for this is added in patch 1, almost missed it.
> 
> Why not also add this dummy implementation in patch 1 then?

The prototype isn't dead code, but the function would be until it gains
users here. If anything, I'd move the prototype introduction here; it
merely seemed desirable to me to touch xen/include/xen/domain.h no
more frequently than necessary.

Also, to be quite frank, I find this kind of nitpicking odd after the
series has been pending for almost a year.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
       [not found]     ` <CABfawhkn1xXA+qEjB4-HtOVUZHONDE6ngMJZPe3fSPtoAtmg+Q@mail.gmail.com>
@ 2023-09-27 13:54       ` Roger Pau Monné
  0 siblings, 0 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 13:54 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Jan Beulich, xen-devel, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu

On Wed, Sep 27, 2023 at 07:43:26AM -0400, Tamas K Lengyel wrote:
> On Wed, Sep 27, 2023 at 7:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
> > > In preparation of the introduction of new vCPU operations allowing to
> > > register the respective areas (one of the two is x86-specific) by
> > > guest-physical address, add the necessary fork handling (with the
> > > backing function yet to be filled in).
> > >
> > > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >
> > Given the very limited and specific usage of the current Xen forking
> > code, do we really need to bother about copying such areas?
> >
> > IOW: I doubt that not updating the runstate/time areas will make any
> > difference to the forking code ATM.
> 
> The current implementation of forking allows for fully functional VM
> forks to be set up, including launching the dm for them. The toolstack
> side for that hasn't been merged into Xen, but that doesn't mean it's
> not available or used already. So any internal changes to Xen that
> create guest-states that could potentially be interacted with and
> relied on by a guest should get properly wired in for forking. So I'm
> happy Jan took the initiative here for wiring this up, even if the
> use-case seems niche.

But there are still some bits missing in Xen, seeing the comment in
copy_vcpu_settings().  If we don't copy the timers already then I'm
unsure whether copying the runstate/time shared areas is very
relevant.

> >
> > > ---
> > > v3: Extend comment.
> > >
> > > --- a/xen/arch/x86/mm/mem_sharing.c
> > > +++ b/xen/arch/x86/mm/mem_sharing.c
> > > @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> > >      hvm_set_nonreg_state(cd_vcpu, &nrs);
> > >  }
> > >
> > > +static int copy_guest_area(struct guest_area *cd_area,
> > > +                           const struct guest_area *d_area,
> > > +                           struct vcpu *cd_vcpu,
> > > +                           const struct domain *d)
> > > +{
> > > +    mfn_t d_mfn, cd_mfn;
> > > +
> > > +    if ( !d_area->pg )
> > > +        return 0;
> > > +
> > > +    d_mfn = page_to_mfn(d_area->pg);
> > > +
> > > +    /* Allocate & map a page for the area if it hasn't been already. */
> > > +    if ( !cd_area->pg )
> > > +    {
> > > +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> > > +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> > > +        p2m_type_t p2mt;
> > > +        p2m_access_t p2ma;
> > > +        unsigned int offset;
> > > +        int ret;
> > > +
> > > +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> > > +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> > > +        {
> > > +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> > > +
> > > +            if ( !pg )
> > > +                return -ENOMEM;
> > > +
> > > +            cd_mfn = page_to_mfn(pg);
> > > +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> > > +
> > > +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> > > +                                 p2m->default_access, -1);
> > > +            if ( ret )
> > > +                return ret;
> > > +        }
> > > +        else if ( p2mt != p2m_ram_rw )
> > > +            return -EBUSY;
> >
> > Shouldn't the populate of the underlying gfn in the fork case
> > be done by map_guest_area itself?
> 
> Would seem more logical, I agree, but then the call should be
> map_guest_area first, which conditionally calls down into a
> mem_sharing_copy_guest_area if the domain is a fork.
> 
> >
> > What if a forked guest attempts to register a new runstate/time info
> > against an address not yet populated?
> 
> Unpopulated memory will get forked from the parent for all access
> paths currently in existence, including access to a forked VM's memory
> from dom0/dm and the various internal Xen access paths. If the memory
> range is not mapped in the parent, registering on that range ought to
> fail via (I assume) existing checks that validate that the memory is
> present during registration.

What I'm trying to say is that map_guest_area() already has to deal
with forked guests, and hence the populating of the gfn shouldn't be
needed as map_guest_area() must know how to deal with such guests
anyway.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-27 12:06     ` Jan Beulich
@ 2023-09-27 14:05       ` Roger Pau Monné
  2023-09-27 15:11         ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 14:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On Wed, Sep 27, 2023 at 02:06:47PM +0200, Jan Beulich wrote:
> On 27.09.2023 13:08, Roger Pau Monné wrote:
> > On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
> >> In preparation of the introduction of new vCPU operations allowing to
> >> register the respective areas (one of the two is x86-specific) by
> >> guest-physical address, add the necessary fork handling (with the
> >> backing function yet to be filled in).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Given the very limited and specific usage of the current Xen forking
> > code, do we really need to bother about copying such areas?
> > 
> > IOW: I doubt that not updating the runstate/time areas will make any
> > difference to the forking code ATM.
> 
> I expect Tamas'es reply has sufficiently addressed this question.

Seeing the TODO in copy_vcpu_settings() makes me wonder how well we
copy information for PV interfaces for forks.  Timers are not
migrated, nor are runstate areas, for example.

Which is all fine, but I expect VMs this is currently tested against
don't use (a lot of) PV interfaces, or else I don't know how they can
survive.

> >> --- a/xen/arch/x86/mm/mem_sharing.c
> >> +++ b/xen/arch/x86/mm/mem_sharing.c
> >> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> >>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>  }
> >>  
> >> +static int copy_guest_area(struct guest_area *cd_area,
> >> +                           const struct guest_area *d_area,
> >> +                           struct vcpu *cd_vcpu,
> >> +                           const struct domain *d)
> >> +{
> >> +    mfn_t d_mfn, cd_mfn;
> >> +
> >> +    if ( !d_area->pg )
> >> +        return 0;
> >> +
> >> +    d_mfn = page_to_mfn(d_area->pg);
> >> +
> >> +    /* Allocate & map a page for the area if it hasn't been already. */
> >> +    if ( !cd_area->pg )
> >> +    {
> >> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >> +        p2m_type_t p2mt;
> >> +        p2m_access_t p2ma;
> >> +        unsigned int offset;
> >> +        int ret;
> >> +
> >> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >> +        {
> >> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >> +
> >> +            if ( !pg )
> >> +                return -ENOMEM;
> >> +
> >> +            cd_mfn = page_to_mfn(pg);
> >> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >> +
> >> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >> +                                 p2m->default_access, -1);
> >> +            if ( ret )
> >> +                return ret;
> >> +        }
> >> +        else if ( p2mt != p2m_ram_rw )
> >> +            return -EBUSY;
> > 
> > Shouldn't the populate of the underlying gfn in the fork case
> > be done by map_guest_area itself?
> 
> I've used the existing logic for the info area to base my code on. As
> noted in the patch switching the info area logic to use the generalized
> infrastructure, I've taken the liberty to address an issue in the
> original logic. But it was never a goal to re-write things from scratch.
> If, as Tamas appears to agree, there is a better way of layering things
> here, then please as a follow-on patch.

Hm, I'm unsure whether the code that allocates the page and adds it to the
p2m for the vcpu_info area is required.  map_vcpu_info() should already be
able to handle this case and fork the page from the previous VM.

> > What if a forked guest attempts to register a new runstate/time info
> > against an address not yet populated?
> 
> What if the same happened for the info area? Again, I'm not trying to
> invent anything new here. But hopefully Tamas'es reply has helped here
> as well.

Yes, I don't think we should need to allocate and map a page for the
vcpu_info area before calling map_vcpu_info().

> >> --- a/xen/common/domain.c
> >> +++ b/xen/common/domain.c
> >> @@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
> >>      put_page_and_type(mfn_to_page(mfn));
> >>  }
> >>  
> >> +int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
> >> +                   struct guest_area *area,
> >> +                   void (*populate)(void *dst, struct vcpu *v))
> > 
> > Oh, the prototype for this is added in patch 1, almost missed it.
> > 
> > Why not also add this dummy implementation in patch 1 then?
> 
> The prototype isn't dead code, but the function would be until it gains
> users here. If anything, I'd move the prototype introduction here; it
> merely seemed desirable to me to touch xen/include/xen/domain.h no
> more frequently than necessary.
> 
> Also, to be quite frank, I find this kind of nitpicking odd after the
> series has been pending for almost a year.

TBH when I saw this I was about to comment that the patch won't build
because the prototype was missing, but then I remembered about patch
1.  I don't think it's obvious, but anyway.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 5/8] domain: map/unmap GADDR based shared guest areas
  2023-05-03 15:57 ` [PATCH v3 5/8] domain: map/unmap " Jan Beulich
@ 2023-09-27 14:53   ` Roger Pau Monné
  2023-09-27 15:29     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 14:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, May 03, 2023 at 05:57:09PM +0200, Jan Beulich wrote:
> The registration by virtual/linear address has downsides: At least on
> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
> PV domains the areas are inaccessible (and hence cannot be updated by
> Xen) when in guest-user mode, and for HVM guests they may be
> inaccessible when Meltdown mitigations are in place. (There are yet
> more issues.)
> 
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, flesh out the map/unmap functions.
> 
> Noteworthy differences from map_vcpu_info():
> - areas can be registered more than once (and de-registered),
> - remote vCPU-s are paused rather than checked for being down (which in
>   principle can change right after the check),
> - the domain lock is taken for a much smaller region.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: By using global domain page mappings the demand on the underlying
>      VA range may increase significantly. I did consider to use per-
>      domain mappings instead, but they exist for x86 only. Of course we
>      could have arch_{,un}map_guest_area() aliasing global domain page
>      mapping functions on Arm and using per-domain mappings on x86. Yet
>      then again map_vcpu_info() doesn't (and can't) do so.

I guess it's fine as you propose now; we might see about using
per-domain mappings.

> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>      like map_vcpu_info() - solely relying on the type ref acquisition.
>      Checking for p2m_ram_rw alone would be wrong, as at least
>      p2m_ram_logdirty ought to also be okay to use here (and in similar
>      cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>      used here (like altp2m_vcpu_enable_ve() does) as well as in
>      map_vcpu_info(), yet then again the P2M type is stale by the time
>      it is being looked at anyway without the P2M lock held.

check_get_page_from_gfn() already does some type checks on the page.

> ---
> v2: currd -> d, to cover mem-sharing's copy_guest_area(). Re-base over
>     change(s) earlier in the series. Use ~0 as "unmap" request indicator.
> 
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1576,7 +1576,82 @@ int map_guest_area(struct vcpu *v, paddr
>                     struct guest_area *area,
>                     void (*populate)(void *dst, struct vcpu *v))
>  {
> -    return -EOPNOTSUPP;
> +    struct domain *d = v->domain;
> +    void *map = NULL;
> +    struct page_info *pg = NULL;
> +    int rc = 0;

Should you check/assert that size != 0?

> +    if ( ~gaddr )

I guess I will find out in future patches, but why this special handling
for a ~0 address?

It might be worth adding a comment here, or in the patch description.
Otherwise I would expect that, when passed a ~0 address, the first check
for page boundary crossing would fail.
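
E.g. something like this, taking the wording from the "use ~0 as 'unmap'
request indicator" note in the v2 changelog above:

    /* gaddr == ~0 requests de-registration only; any other value is the
     * guest-physical address of the area to register. */
    if ( ~gaddr )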

> +    {
> +        unsigned long gfn = PFN_DOWN(gaddr);
> +        unsigned int align;
> +        p2m_type_t p2mt;
> +
> +        if ( gfn != PFN_DOWN(gaddr + size - 1) )
> +            return -ENXIO;
> +
> +#ifdef CONFIG_COMPAT
> +        if ( has_32bit_shinfo(d) )
> +            align = alignof(compat_ulong_t);
> +        else
> +#endif
> +            align = alignof(xen_ulong_t);
> +        if ( gaddr & (align - 1) )

IS_ALIGNED() might be clearer.
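
I.e. the equivalent form of the check:

    if ( !IS_ALIGNED(gaddr, align) )
        return -ENXIO;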

> +            return -ENXIO;
> +
> +        rc = check_get_page_from_gfn(d, _gfn(gfn), false, &p2mt, &pg);
> +        if ( rc )
> +            return rc;
> +
> +        if ( !get_page_type(pg, PGT_writable_page) )
> +        {
> +            put_page(pg);
> +            return -EACCES;
> +        }
> +
> +        map = __map_domain_page_global(pg);
> +        if ( !map )
> +        {
> +            put_page_and_type(pg);
> +            return -ENOMEM;
> +        }
> +        map += PAGE_OFFSET(gaddr);
> +    }
> +
> +    if ( v != current )
> +    {
> +        if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
> +        {
> +            rc = -ERESTART;
> +            goto unmap;
> +        }
> +
> +        vcpu_pause(v);
> +
> +        spin_unlock(&d->hypercall_deadlock_mutex);
> +    }
> +
> +    domain_lock(d);
> +
> +    if ( map )
> +        populate(map, v);

The call to map_guest_area() in copy_guest_area() does pass NULL as
the populate parameter, hence this will crash?

Should you either assert that populate != NULL or change the if
condition to be map && populate?

> +
> +    SWAP(area->pg, pg);
> +    SWAP(area->map, map);
> +
> +    domain_unlock(d);
> +
> +    if ( v != current )
> +        vcpu_unpause(v);
> +
> + unmap:
> +    if ( pg )
> +    {
> +        unmap_domain_page_global(map);
> +        put_page_and_type(pg);
> +    }
> +
> +    return rc;
>  }
>  
>  /*
> @@ -1587,9 +1662,24 @@ int map_guest_area(struct vcpu *v, paddr
>  void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>  {
>      struct domain *d = v->domain;
> +    void *map;
> +    struct page_info *pg;
>  
>      if ( v != current )
>          ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
> +
> +    domain_lock(d);
> +    map = area->map;
> +    area->map = NULL;
> +    pg = area->pg;
> +    area->pg = NULL;

FWIW you could also use SWAP() here by initializing both map and pg to
NULL (I know it uses one extra local variable).
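
Roughly (a sketch only; the existing unmap/put of map and pg would stay as
is after dropping the lock):

    void *map = NULL;
    struct page_info *pg = NULL;

    domain_lock(d);
    SWAP(area->map, map);
    SWAP(area->pg, pg);
    domain_unlock(d);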

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-27 14:05       ` Roger Pau Monné
@ 2023-09-27 15:11         ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 15:11 UTC (permalink / raw)
  To: Roger Pau Monné, Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 16:05, Roger Pau Monné wrote:
> On Wed, Sep 27, 2023 at 02:06:47PM +0200, Jan Beulich wrote:
>> On 27.09.2023 13:08, Roger Pau Monné wrote:
>>> On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/mm/mem_sharing.c
>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
>>>> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>>>  }
>>>>  
>>>> +static int copy_guest_area(struct guest_area *cd_area,
>>>> +                           const struct guest_area *d_area,
>>>> +                           struct vcpu *cd_vcpu,
>>>> +                           const struct domain *d)
>>>> +{
>>>> +    mfn_t d_mfn, cd_mfn;
>>>> +
>>>> +    if ( !d_area->pg )
>>>> +        return 0;
>>>> +
>>>> +    d_mfn = page_to_mfn(d_area->pg);
>>>> +
>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
>>>> +    if ( !cd_area->pg )
>>>> +    {
>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>>>> +        p2m_type_t p2mt;
>>>> +        p2m_access_t p2ma;
>>>> +        unsigned int offset;
>>>> +        int ret;
>>>> +
>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>>>> +        {
>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>>>> +
>>>> +            if ( !pg )
>>>> +                return -ENOMEM;
>>>> +
>>>> +            cd_mfn = page_to_mfn(pg);
>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>>>> +
>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>>>> +                                 p2m->default_access, -1);
>>>> +            if ( ret )
>>>> +                return ret;
>>>> +        }
>>>> +        else if ( p2mt != p2m_ram_rw )
>>>> +            return -EBUSY;
>>>
>>> Shouldn't the populate of the underlying gfn in the fork case
>>> be done by map_guest_area itself?
>>
>> I've used the existing logic for the info area to base my code on. As
>> noted in the patch switching the info area logic to use the generalized
>> infrastructure, I've taken the liberty to address an issue in the
>> original logic. But it was never a goal to re-write things from scratch.
>> If, as Tamas appears to agree, there is a better way of layering things
>> here, then please do that as a follow-on patch.
> 
> Hm, I'm unsure whether the code that allocates the page and adds it to
> the p2m for the vcpu_info area is required.  map_vcpu_info() should
> already be able to handle this case and fork the page from the previous VM.

I don't think that's the case. It would be able to unshare, but the fork
doesn't use shared pages aiui. I think it instead runs on an extremely
sparse p2m, where pages from the parent are brought in only as they're
touched. Tamas?

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative
  2023-05-03 15:57 ` [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative Jan Beulich
@ 2023-09-27 15:24   ` Roger Pau Monné
  2023-09-27 15:36     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 15:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, May 03, 2023 at 05:57:40PM +0200, Jan Beulich wrote:
> The registration by virtual/linear address has downsides: At least on
> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
> PV domains the area is inaccessible (and hence cannot be updated by Xen)
> when in guest-user mode.
> 
> Introduce a new vCPU operation allowing to register the runstate area by
> guest-physical address.
> 
> An at least theoretical downside to using physically registered areas is
> that PV then won't see dirty (and perhaps also accessed) bits set in its
> respective page table entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

One comment below.

> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -221,6 +221,19 @@ struct vcpu_register_time_memory_area {
>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>  
> +/*
> + * Like the respective VCPUOP_register_*_memory_area, just using the "addr.p"
> + * field of the supplied struct as a guest physical address (i.e. in GFN space).
> + * The respective area may not cross a page boundary.  Pass ~0 to unregister an
> + * area.  Note that as long as an area is registered by physical address, the
> + * linear address based area will not be serviced (updated) by the hypervisor.
> + *
> + * Note that the area registered via VCPUOP_register_runstate_memory_area will
> + * be updated in the same manner as the one registered via virtual address PLUS
> + * VMASST_TYPE_runstate_update_flag engaged by the domain.
> + */
> +#define VCPUOP_register_runstate_phys_area      14

Just to make it more obvious, it might be nice to add a note in the
comment on VCPUOP_register_runstate_memory_area that `p` can also be
used with the `VCPUOP_register_runstate_phys_area` hypercall.
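
For completeness, guest usage of the new sub-op would presumably look along
these lines (purely illustrative; virt_to_phys(), vcpu_id and the runstate
variable are whatever the guest happens to use, not anything this series
introduces):

    struct vcpu_register_runstate_memory_area area = {
        .addr.p = virt_to_phys(&runstate),   /* guest physical address */
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_runstate_phys_area,
                                vcpu_id, &area);

    /* -ENOSYS would indicate an older hypervisor; a guest would then fall
     * back to VCPUOP_register_runstate_memory_area. */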

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 5/8] domain: map/unmap GADDR based shared guest areas
  2023-09-27 14:53   ` Roger Pau Monné
@ 2023-09-27 15:29     ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 15:29 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 16:53, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:57:09PM +0200, Jan Beulich wrote:
>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>>      like map_vcpu_info() - solely relying on the type ref acquisition.
>>      Checking for p2m_ram_rw alone would be wrong, as at least
>>      p2m_ram_logdirty ought to also be okay to use here (and in similar
>>      cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>>      used here (like altp2m_vcpu_enable_ve() does) as well as in
>>      map_vcpu_info(), yet then again the P2M type is stale by the time
>>      it is being looked at anyway without the P2M lock held.
> 
> check_get_page_from_gfn() already does some type checks on the page.

But that's still of no help without holding the p2m lock for long enough.
Also, the checks there all exist to fail the call, whereas the remark above
is about distinguishing when to permit and when to fail the request here.

>> ---
>> v2: currd -> d, to cover mem-sharing's copy_guest_area(). Re-base over
>>     change(s) earlier in the series. Use ~0 as "unmap" request indicator.
>>
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1576,7 +1576,82 @@ int map_guest_area(struct vcpu *v, paddr
>>                     struct guest_area *area,
>>                     void (*populate)(void *dst, struct vcpu *v))
>>  {
>> -    return -EOPNOTSUPP;
>> +    struct domain *d = v->domain;
>> +    void *map = NULL;
>> +    struct page_info *pg = NULL;
>> +    int rc = 0;
> 
> Should you check/assert that size != 0?

Not really, no - the more so since this is an internal interface, and we
don't (and shouldn't) check similar aspects elsewhere. There's just
a single use of the value (see below).

>> +    if ( ~gaddr )
> 
> I guess I will find out in future patches, but why this special handling
> for a ~0 address?

Future patches aren't very likely to make this explicit. The most
explicit indicator for now was the revision log entry for v2.

> Might be worth adding a comment here, or in the patch description.
> Otherwise I would expect that, when passed a ~0 address, the first check
> for page boundary crossing would fail.

I've added "/* Map (i.e. not just unmap)? */" on the same line.
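
FWIW, at the call sites the guest-supplied address is then simply passed
through, roughly (a sketch; the populate hook name and the exact size
argument are illustrative):

    rc = map_guest_area(v, area.addr.p, sizeof(struct vcpu_runstate_info),
                        &v->runstate_guest_area, runstate_area_populate);

    /* area.addr.p == ~0 skips the whole mapping block below, so the
     * SWAP()s merely replace any previously registered area with NULL,
     * i.e. the call acts as an unregistration request. */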

>> +    {
>> +        unsigned long gfn = PFN_DOWN(gaddr);
>> +        unsigned int align;
>> +        p2m_type_t p2mt;
>> +
>> +        if ( gfn != PFN_DOWN(gaddr + size - 1) )
>> +            return -ENXIO;

This is the only use of size. If size is 0 and gaddr is on a page boundary,
the call will fail. For any other gaddr nothing bad will happen (afaict).
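
(Spelling out the arithmetic: with size == 0 and a page-aligned gaddr,
gaddr + size - 1 no longer falls in the same page as gaddr, so the -ENXIO
check triggers; for any non-aligned gaddr the two PFN_DOWN() values match
and the check passes.)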

>> +#ifdef CONFIG_COMPAT
>> +        if ( has_32bit_shinfo(d) )
>> +            align = alignof(compat_ulong_t);
>> +        else
>> +#endif
>> +            align = alignof(xen_ulong_t);
>> +        if ( gaddr & (align - 1) )
> 
> IS_ALIGNED() might be clearer.

Can switch, sure.
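
I.e. presumably simply

        if ( !IS_ALIGNED(gaddr, align) )
            return -ENXIO;

in place of the open-coded masking.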

>> +            return -ENXIO;
>> +
>> +        rc = check_get_page_from_gfn(d, _gfn(gfn), false, &p2mt, &pg);
>> +        if ( rc )
>> +            return rc;
>> +
>> +        if ( !get_page_type(pg, PGT_writable_page) )
>> +        {
>> +            put_page(pg);
>> +            return -EACCES;
>> +        }
>> +
>> +        map = __map_domain_page_global(pg);
>> +        if ( !map )
>> +        {
>> +            put_page_and_type(pg);
>> +            return -ENOMEM;
>> +        }
>> +        map += PAGE_OFFSET(gaddr);
>> +    }
>> +
>> +    if ( v != current )
>> +    {
>> +        if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>> +        {
>> +            rc = -ERESTART;
>> +            goto unmap;
>> +        }
>> +
>> +        vcpu_pause(v);
>> +
>> +        spin_unlock(&d->hypercall_deadlock_mutex);
>> +    }
>> +
>> +    domain_lock(d);
>> +
>> +    if ( map )
>> +        populate(map, v);
> 
> The call to map_guest_area() in copy_guest_area() does pass NULL as
> the populate parameter, hence this will crash?
> 
> Should you either assert that populate != NULL or change the if
> condition to be map && populate?

Oh, yes - the latter.
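
I.e. the guard then becomes

    if ( map && populate )
        populate(map, v);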

>> @@ -1587,9 +1662,24 @@ int map_guest_area(struct vcpu *v, paddr
>>  void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>>  {
>>      struct domain *d = v->domain;
>> +    void *map;
>> +    struct page_info *pg;
>>  
>>      if ( v != current )
>>          ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
>> +
>> +    domain_lock(d);
>> +    map = area->map;
>> +    area->map = NULL;
>> +    pg = area->pg;
>> +    area->pg = NULL;
> 
> FWIW you could also use SWAP() here by initializing both map and pg to
> NULL (I know it uses one extra local variable).

When truly exchanging two values (like further up), I'm fine with using
SWAP() (and as you saw I do use it), but here I think it would be heavier
than wanted/needed.
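
Just for reference, the suggested variant would presumably read

    void *map = NULL;
    struct page_info *pg = NULL;

    domain_lock(d);
    SWAP(map, area->map);
    SWAP(pg, area->pg);
    domain_unlock(d);

i.e. it's purely a stylistic alternative to the plain assignments.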

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative
  2023-09-27 15:24   ` Roger Pau Monné
@ 2023-09-27 15:36     ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 15:36 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 17:24, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:57:40PM +0200, Jan Beulich wrote:
>> The registration by virtual/linear address has downsides: At least on
>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>> PV domains the area is inaccessible (and hence cannot be updated by Xen)
>> when in guest-user mode.
>>
>> Introduce a new vCPU operation allowing to register the runstate area by
>> guest-physical address.
>>
>> An at least theoretical downside to using physically registered areas is
>> that PV then won't see dirty (and perhaps also accessed) bits set in its
>> respective page table entries.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> One comment below.
> 
>> --- a/xen/include/public/vcpu.h
>> +++ b/xen/include/public/vcpu.h
>> @@ -221,6 +221,19 @@ struct vcpu_register_time_memory_area {
>>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>>  
>> +/*
>> + * Like the respective VCPUOP_register_*_memory_area, just using the "addr.p"
>> + * field of the supplied struct as a guest physical address (i.e. in GFN space).
>> + * The respective area may not cross a page boundary.  Pass ~0 to unregister an
>> + * area.  Note that as long as an area is registered by physical address, the
>> + * linear address based area will not be serviced (updated) by the hypervisor.
>> + *
>> + * Note that the area registered via VCPUOP_register_runstate_memory_area will
>> + * be updated in the same manner as the one registered via virtual address PLUS
>> + * VMASST_TYPE_runstate_update_flag engaged by the domain.
>> + */
>> +#define VCPUOP_register_runstate_phys_area      14
> 
> Just to make it more obvious, it might be nice to add a note in the
> comment on VCPUOP_register_runstate_memory_area that `p` can also be
> used with the `VCPUOP_register_runstate_phys_area` hypercall.

I can add a reference there (and then in the later patch also for the
time area), but I don't think it wants or needs to emphasize p. It
merely needs to point to the alternative sub-function imo.
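
Something along the lines of

 * See also VCPUOP_register_runstate_phys_area, to register the area by
 * guest physical address instead.

next to the existing definition, I suppose.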

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 7/8] x86: introduce GADDR based secondary time area registration alternative
  2023-05-03 15:58 ` [PATCH v3 7/8] x86: introduce GADDR based secondary time " Jan Beulich
@ 2023-09-27 15:50   ` Roger Pau Monné
  2023-09-27 16:12     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-27 15:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On Wed, May 03, 2023 at 05:58:01PM +0200, Jan Beulich wrote:
> The registration by virtual/linear address has downsides: The access is
> expensive for HVM/PVH domains. Furthermore for 64-bit PV domains the area
> is inaccessible (and hence cannot be updated by Xen) when in guest-user
> mode.
> 
> Introduce a new vCPU operation allowing to register the secondary time
> area by guest-physical address.
> 
> An at least theoretical downside to using physically registered areas is
> that PV then won't see dirty (and perhaps also accessed) bits set in its
> respective page table entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> v3: Re-base.
> v2: Forge version in force_update_secondary_system_time().
> 
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1504,6 +1504,15 @@ int arch_vcpu_reset(struct vcpu *v)
>      return 0;
>  }
>  
> +static void cf_check
> +time_area_populate(void *map, struct vcpu *v)
> +{
> +    if ( is_pv_vcpu(v) )
> +        v->arch.pv.pending_system_time.version = 0;
> +
> +    force_update_secondary_system_time(v, map);
> +}
> +
>  long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>  {
>      long rc = 0;
> @@ -1541,6 +1550,25 @@ long do_vcpu_op(int cmd, unsigned int vc
>  
>          break;
>      }
> +
> +    case VCPUOP_register_vcpu_time_phys_area:
> +    {
> +        struct vcpu_register_time_memory_area area;
> +
> +        rc = -EFAULT;
> +        if ( copy_from_guest(&area.addr.p, arg, 1) )
> +            break;
> +
> +        rc = map_guest_area(v, area.addr.p,
> +                            sizeof(vcpu_time_info_t),
> +                            &v->arch.time_guest_area,
> +                            time_area_populate);
> +        if ( rc == -ERESTART )
> +            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
> +                                               cmd, vcpuid, arg);
> +
> +        break;
> +    }
>  
>      case VCPUOP_get_physid:
>      {
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -692,6 +692,8 @@ void domain_cpu_policy_changed(struct do
>  
>  bool update_secondary_system_time(struct vcpu *,
>                                    struct vcpu_time_info *);
> +void force_update_secondary_system_time(struct vcpu *,
> +                                        struct vcpu_time_info *);
>  
>  void vcpu_show_registers(const struct vcpu *);
>  
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1633,6 +1633,16 @@ void force_update_vcpu_system_time(struc
>      __update_vcpu_system_time(v, 1);
>  }
>  
> +void force_update_secondary_system_time(struct vcpu *v,
> +                                        struct vcpu_time_info *map)
> +{
> +    struct vcpu_time_info u;
> +
> +    collect_time_info(v, &u);
> +    u.version = -1; /* Compensate for version_update_end(). */

Hm, we do not seem to compensate in
VCPUOP_register_vcpu_time_memory_area; what's more, in that case the
initial version is picked from the contents of the guest address.
Hopefully the guest will have zeroed the memory.

FWIW, I would be fine with leaving this at 0, so the first version the
guest sees is 1.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 7/8] x86: introduce GADDR based secondary time area registration alternative
  2023-09-27 15:50   ` Roger Pau Monné
@ 2023-09-27 16:12     ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-09-27 16:12 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 27.09.2023 17:50, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:58:01PM +0200, Jan Beulich wrote:
>> The registration by virtual/linear address has downsides: The access is
>> expensive for HVM/PVH domains. Furthermore for 64-bit PV domains the area
>> is inaccessible (and hence cannot be updated by Xen) when in guest-user
>> mode.
>>
>> Introduce a new vCPU operation allowing to register the secondary time
>> area by guest-physical address.
>>
>> An at least theoretical downside to using physically registered areas is
>> that PV then won't see dirty (and perhaps also accessed) bits set in its
>> respective page table entries.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> ---
>> v3: Re-base.
>> v2: Forge version in force_update_secondary_system_time().

For your question below, note this revision log entry. I didn't have the
compensation originally, and my made-up XTF test thus failed.

>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -1633,6 +1633,16 @@ void force_update_vcpu_system_time(struc
>>      __update_vcpu_system_time(v, 1);
>>  }
>>  
>> +void force_update_secondary_system_time(struct vcpu *v,
>> +                                        struct vcpu_time_info *map)
>> +{
>> +    struct vcpu_time_info u;
>> +
>> +    collect_time_info(v, &u);
>> +    u.version = -1; /* Compensate for version_update_end(). */
> 
> Hm, we do not seem to compensate in
> VCPUOP_register_vcpu_time_memory_area; what's more, in that case the
> initial version is picked from the contents of the guest address.
> Hopefully the guest will have zeroed the memory.
> 
> FWIW, I would be fine with leaving this at 0, so the first version the
> guest sees is 1.

No, we can't. Odd numbers mean "update in progress". In
__update_vcpu_system_time() we properly use version_update_begin(),
so there's no need for any compensation. If the guest puts a non-zero
value there, that's also fine from the perspective of the protocol.
Just that if the initial value is odd, the guest will mislead itself
into thinking "oh, the hypervisor is updating this right now" until
the first real update has happened. Yet even that is hypothetical,
since upon registration the area is first populated, so by the time the
registration hypercall returns the field will already properly hold an
even value.
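
For completeness, the consumer side of this protocol - a guest-side read
loop - would look roughly like this (accessor/barrier names are purely
illustrative, with t pointing at the registered area):

    vcpu_time_info_t snap;
    uint32_t ver;

    do {
        ver = read_atomic(&t->version);
        smp_rmb();
        snap = *t;                      /* copy out the fields */
        smp_rmb();
    } while ( (ver & 1) ||              /* odd: update in progress */
              ver != read_atomic(&t->version) );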

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 8/8] common: convert vCPU info area registration
  2023-05-03 15:58 ` [PATCH v3 8/8] common: convert vCPU info area registration Jan Beulich
@ 2023-09-28  8:24   ` Roger Pau Monné
  2023-09-28  9:53     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-28  8:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On Wed, May 03, 2023 at 05:58:30PM +0200, Jan Beulich wrote:
> Switch to using map_guest_area(). Noteworthy differences from
> map_vcpu_info():
> - remote vCPU-s are paused rather than checked for being down (which in
>   principle can change right after the check),
> - the domain lock is taken for a much smaller region,
> - the error code for an attempt to re-register the area is now -EBUSY,
> - we could in principle permit de-registration when no area was
>   previously registered (which would permit "probing", if necessary for
>   anything).
> 
> Note that this eliminates a bug in copy_vcpu_settings(): The function
> did allocate a new page regardless of the GFN already having a mapping,
> thus in particular breaking the case of two vCPU-s having their info
> areas on the same page.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Some minor comments below:

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> RFC: I'm not really certain whether the preliminary check (ahead of
>      calling map_guest_area()) is worthwhile to have.
> ---
> v2: Re-base over changes earlier in the series. Properly enforce no re-
>     registration. Avoid several casts by introducing local variables.
> 
> --- a/xen/arch/x86/include/asm/shared.h
> +++ b/xen/arch/x86/include/asm/shared.h
> @@ -26,17 +26,20 @@ static inline void arch_set_##field(stru
>  #define GET_SET_VCPU(type, field)                               \
>  static inline type arch_get_##field(const struct vcpu *v)       \
>  {                                                               \
> +    const vcpu_info_t *vi = v->vcpu_info_area.map;              \
> +                                                                \
>      return !has_32bit_shinfo(v->domain) ?                       \
> -           v->vcpu_info->native.arch.field :                    \
> -           v->vcpu_info->compat.arch.field;                     \
> +           vi->native.arch.field : vi->compat.arch.field;       \
>  }                                                               \
>  static inline void arch_set_##field(struct vcpu *v,             \
>                                      type val)                   \
>  {                                                               \
> +    vcpu_info_t *vi = v->vcpu_info_area.map;                    \
> +                                                                \
>      if ( !has_32bit_shinfo(v->domain) )                         \
> -        v->vcpu_info->native.arch.field = val;                  \
> +        vi->native.arch.field = val;                            \
>      else                                                        \
> -        v->vcpu_info->compat.arch.field = val;                  \
> +        vi->compat.arch.field = val;                            \
>  }
>  
>  #else
> @@ -57,12 +60,16 @@ static inline void arch_set_##field(stru
>  #define GET_SET_VCPU(type, field)                           \
>  static inline type arch_get_##field(const struct vcpu *v)   \
>  {                                                           \
> -    return v->vcpu_info->arch.field;                        \
> +    const vcpu_info_t *vi = v->vcpu_info_area.map;          \
> +                                                            \
> +    return vi->arch.field;                                  \
>  }                                                           \
>  static inline void arch_set_##field(struct vcpu *v,         \
>                                      type val)               \
>  {                                                           \
> -    v->vcpu_info->arch.field = val;                         \
> +    vcpu_info_t *vi = v->vcpu_info_area.map;                \
> +                                                            \
> +    vi->arch.field = val;                                   \
>  }
>  
>  #endif
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -1749,53 +1749,24 @@ static int copy_vpmu(struct vcpu *d_vcpu
>  static int copy_vcpu_settings(struct domain *cd, const struct domain *d)
>  {
>      unsigned int i;
> -    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
>      int ret = -EINVAL;
>  
>      for ( i = 0; i < cd->max_vcpus; i++ )
>      {
>          struct vcpu *d_vcpu = d->vcpu[i];
>          struct vcpu *cd_vcpu = cd->vcpu[i];
> -        mfn_t vcpu_info_mfn;
>  
>          if ( !d_vcpu || !cd_vcpu )
>              continue;
>  
> -        /* Copy & map in the vcpu_info page if the guest uses one */
> -        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
> -        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
> -        {
> -            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
> -
> -            /* Allocate & map the page for it if it hasn't been already */
> -            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
> -            {
> -                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
> -                unsigned long gfn_l = gfn_x(gfn);
> -                struct page_info *page;
> -
> -                if ( !(page = alloc_domheap_page(cd, 0)) )
> -                    return -ENOMEM;
> -
> -                new_vcpu_info_mfn = page_to_mfn(page);
> -                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
> -
> -                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn,
> -                                     PAGE_ORDER_4K, p2m_ram_rw,
> -                                     p2m->default_access, -1);
> -                if ( ret )
> -                    return ret;
> -
> -                ret = map_vcpu_info(cd_vcpu, gfn_l,
> -                                    PAGE_OFFSET(d_vcpu->vcpu_info));
> -                if ( ret )
> -                    return ret;
> -            }
> -
> -            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
> -        }
> -
> -        /* Same for the (physically registered) runstate and time info areas. */
> +        /*
> +         * Copy and map the vcpu_info page and the (physically registered)
> +         * runstate and time info areas.
> +         */
> +        ret = copy_guest_area(&cd_vcpu->vcpu_info_area,
> +                              &d_vcpu->vcpu_info_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
>          ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
>                                &d_vcpu->runstate_guest_area, cd_vcpu, d);
>          if ( ret )
> --- a/xen/arch/x86/pv/shim.c
> +++ b/xen/arch/x86/pv/shim.c
> @@ -383,7 +383,7 @@ int pv_shim_shutdown(uint8_t reason)
>      for_each_vcpu ( d, v )
>      {
>          /* Unmap guest vcpu_info page and runstate/time areas. */
> -        unmap_vcpu_info(v);
> +        unmap_guest_area(v, &v->vcpu_info_area);
>          unmap_guest_area(v, &v->runstate_guest_area);
>          unmap_guest_area(v, &v->arch.time_guest_area);
>  
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1547,7 +1547,7 @@ static void __update_vcpu_system_time(st
>      struct vcpu_time_info *u = &vcpu_info(v, time), _u;
>      const struct domain *d = v->domain;
>  
> -    if ( v->vcpu_info == NULL )
> +    if ( !v->vcpu_info_area.map )
>          return;
>  
>      collect_time_info(v, &_u);
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -53,7 +53,7 @@ void __dummy__(void)
>  
>      OFFSET(VCPU_processor, struct vcpu, processor);
>      OFFSET(VCPU_domain, struct vcpu, domain);
> -    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
> +    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info_area.map);
>      OFFSET(VCPU_trap_bounce, struct vcpu, arch.pv.trap_bounce);
>      OFFSET(VCPU_thread_flags, struct vcpu, arch.flags);
>      OFFSET(VCPU_event_addr, struct vcpu, arch.pv.event_callback_eip);
> --- a/xen/arch/x86/x86_64/traps.c
> +++ b/xen/arch/x86/x86_64/traps.c
> @@ -96,7 +96,7 @@ static void _show_registers(
>      if ( context == CTXT_hypervisor )
>          printk(" %pS", _p(regs->rip));
>      printk("\nRFLAGS: %016lx   ", regs->rflags);
> -    if ( (context == CTXT_pv_guest) && v && v->vcpu_info )
> +    if ( (context == CTXT_pv_guest) && v && v->vcpu_info_area.map )
>          printk("EM: %d   ", !!vcpu_info(v, evtchn_upcall_mask));
>      printk("CONTEXT: %s", context_names[context]);
>      if ( v && !is_idle_vcpu(v) )
> --- a/xen/common/compat/domain.c
> +++ b/xen/common/compat/domain.c
> @@ -49,7 +49,7 @@ int compat_common_vcpu_op(int cmd, struc
>      {
>      case VCPUOP_initialise:
>      {
> -        if ( v->vcpu_info == &dummy_vcpu_info )
> +        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
>              return -EINVAL;
>  
>  #ifdef CONFIG_HVM
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -127,10 +127,10 @@ static void vcpu_info_reset(struct vcpu
>  {
>      struct domain *d = v->domain;

d could likely be made const?

>  
> -    v->vcpu_info = ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
> -                    ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
> -                    : &dummy_vcpu_info);
> -    v->vcpu_info_mfn = INVALID_MFN;
> +    v->vcpu_info_area.map =
> +        ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
> +         ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
> +         : &dummy_vcpu_info);
>  }
>  
>  static void vmtrace_free_buffer(struct vcpu *v)
> @@ -964,7 +964,7 @@ int domain_kill(struct domain *d)
>              return -ERESTART;
>          for_each_vcpu ( d, v )
>          {
> -            unmap_vcpu_info(v);
> +            unmap_guest_area(v, &v->vcpu_info_area);
>              unmap_guest_area(v, &v->runstate_guest_area);
>          }
>          d->is_dying = DOMDYING_dead;
> @@ -1419,7 +1419,7 @@ int domain_soft_reset(struct domain *d,
>      for_each_vcpu ( d, v )
>      {
>          set_xen_guest_handle(runstate_guest(v), NULL);
> -        unmap_vcpu_info(v);
> +        unmap_guest_area(v, &v->vcpu_info_area);
>          unmap_guest_area(v, &v->runstate_guest_area);
>      }
>  
> @@ -1467,111 +1467,6 @@ int vcpu_reset(struct vcpu *v)
>      return rc;
>  }
>  
> -/*
> - * Map a guest page in and point the vcpu_info pointer at it.  This
> - * makes sure that the vcpu_info is always pointing at a valid piece
> - * of memory, and it sets a pending event to make sure that a pending
> - * event doesn't get missed.
> - */
> -int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset)
> -{
> -    struct domain *d = v->domain;
> -    void *mapping;
> -    vcpu_info_t *new_info;
> -    struct page_info *page;
> -    unsigned int align;
> -
> -    if ( offset > (PAGE_SIZE - sizeof(*new_info)) )
> -        return -ENXIO;
> -
> -#ifdef CONFIG_COMPAT
> -    BUILD_BUG_ON(sizeof(*new_info) != sizeof(new_info->compat));
> -    if ( has_32bit_shinfo(d) )
> -        align = alignof(new_info->compat);
> -    else
> -#endif
> -        align = alignof(*new_info);
> -    if ( offset & (align - 1) )
> -        return -ENXIO;
> -
> -    if ( !mfn_eq(v->vcpu_info_mfn, INVALID_MFN) )
> -        return -EINVAL;
> -
> -    /* Run this command on yourself or on other offline VCPUS. */
> -    if ( (v != current) && !(v->pause_flags & VPF_down) )
> -        return -EINVAL;
> -
> -    page = get_page_from_gfn(d, gfn, NULL, P2M_UNSHARE);
> -    if ( !page )
> -        return -EINVAL;
> -
> -    if ( !get_page_type(page, PGT_writable_page) )
> -    {
> -        put_page(page);
> -        return -EINVAL;
> -    }
> -
> -    mapping = __map_domain_page_global(page);
> -    if ( mapping == NULL )
> -    {
> -        put_page_and_type(page);
> -        return -ENOMEM;
> -    }
> -
> -    new_info = (vcpu_info_t *)(mapping + offset);
> -
> -    if ( v->vcpu_info == &dummy_vcpu_info )
> -    {
> -        memset(new_info, 0, sizeof(*new_info));
> -#ifdef XEN_HAVE_PV_UPCALL_MASK
> -        __vcpu_info(v, new_info, evtchn_upcall_mask) = 1;
> -#endif
> -    }
> -    else
> -    {
> -        memcpy(new_info, v->vcpu_info, sizeof(*new_info));
> -    }
> -
> -    v->vcpu_info = new_info;
> -    v->vcpu_info_mfn = page_to_mfn(page);
> -
> -    /* Set new vcpu_info pointer /before/ setting pending flags. */
> -    smp_wmb();
> -
> -    /*
> -     * Mark everything as being pending just to make sure nothing gets
> -     * lost.  The domain will get a spurious event, but it can cope.
> -     */
> -#ifdef CONFIG_COMPAT
> -    if ( !has_32bit_shinfo(d) )
> -        write_atomic(&new_info->native.evtchn_pending_sel, ~0);
> -    else
> -#endif
> -        write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
> -    vcpu_mark_events_pending(v);
> -
> -    return 0;
> -}
> -
> -/*
> - * Unmap the vcpu info page if the guest decided to place it somewhere
> - * else. This is used from domain_kill() and domain_soft_reset().
> - */
> -void unmap_vcpu_info(struct vcpu *v)
> -{
> -    mfn_t mfn = v->vcpu_info_mfn;
> -
> -    if ( mfn_eq(mfn, INVALID_MFN) )
> -        return;
> -
> -    unmap_domain_page_global((void *)
> -                             ((unsigned long)v->vcpu_info & PAGE_MASK));
> -
> -    vcpu_info_reset(v); /* NB: Clobbers v->vcpu_info_mfn */
> -
> -    put_page_and_type(mfn_to_page(mfn));
> -}
> -
>  int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
>                     struct guest_area *area,
>                     void (*populate)(void *dst, struct vcpu *v))
> @@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
>  
>      domain_lock(d);
>  
> -    if ( map )
> -        populate(map, v);
> +    /* No re-registration of the vCPU info area. */
> +    if ( area != &v->vcpu_info_area || !area->pg )

It would be nice if this check could be done earlier, so as to avoid
having to fetch and map the page just to discard it.  That would
however require taking the domain lock earlier.

> +    {
> +        if ( map )
> +            populate(map, v);
>  
> -    SWAP(area->pg, pg);
> -    SWAP(area->map, map);
> +        SWAP(area->pg, pg);
> +        SWAP(area->map, map);
> +    }
> +    else
> +        rc = -EBUSY;
>  
>      domain_unlock(d);
>  
> +    /* Set pending flags /after/ new vcpu_info pointer was set. */
> +    if ( area == &v->vcpu_info_area && !rc )
> +    {
> +        /*
> +         * Mark everything as being pending just to make sure nothing gets
> +         * lost.  The domain will get a spurious event, but it can cope.
> +         */
> +#ifdef CONFIG_COMPAT
> +        if ( !has_32bit_shinfo(d) )
> +        {
> +            vcpu_info_t *info = area->map;
> +
> +            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
> +            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
> +            write_atomic(&info->native.evtchn_pending_sel, ~0);
> +        }
> +        else
> +#endif
> +            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);

Can't the setting of evtchn_pending_sel be done in
vcpu_info_populate()?

> +        vcpu_mark_events_pending(v);
> +
> +        force_update_vcpu_system_time(v);
> +    }
> +
>      if ( v != current )
>          vcpu_unpause(v);
>  
> @@ -1670,7 +1595,10 @@ void unmap_guest_area(struct vcpu *v, st
>  
>      domain_lock(d);
>      map = area->map;
> -    area->map = NULL;
> +    if ( area == &v->vcpu_info_area )
> +        vcpu_info_reset(v);
> +    else
> +        area->map = NULL;
>      pg = area->pg;
>      area->pg = NULL;
>      domain_unlock(d);
> @@ -1801,6 +1729,27 @@ bool update_runstate_area(struct vcpu *v
>      return rc;
>  }
>  
> +/*
> + * This makes sure that the vcpu_info is always pointing at a valid piece of
> + * memory, and it sets a pending event to make sure that a pending event
> + * doesn't get missed.
> + */
> +static void cf_check
> +vcpu_info_populate(void *map, struct vcpu *v)
> +{
> +    vcpu_info_t *info = map;
> +
> +    if ( v->vcpu_info_area.map == &dummy_vcpu_info )
> +    {
> +        memset(info, 0, sizeof(*info));
> +#ifdef XEN_HAVE_PV_UPCALL_MASK
> +        __vcpu_info(v, info, evtchn_upcall_mask) = 1;
> +#endif

I'm not sure about the point of those guards; this will always be 1,
as we always build the hypervisor with the headers in xen/public?

Is it to make backports easier?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 8/8] common: convert vCPU info area registration
  2023-09-28  8:24   ` Roger Pau Monné
@ 2023-09-28  9:53     ` Jan Beulich
  2023-09-28 10:15       ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-09-28  9:53 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On 28.09.2023 10:24, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:58:30PM +0200, Jan Beulich wrote:
>> Switch to using map_guest_area(). Noteworthy differences from
>> map_vcpu_info():
>> - remote vCPU-s are paused rather than checked for being down (which in
>>   principle can change right after the check),
>> - the domain lock is taken for a much smaller region,
>> - the error code for an attempt to re-register the area is now -EBUSY,
>> - we could in principle permit de-registration when no area was
>>   previously registered (which would permit "probing", if necessary for
>>   anything).
>>
>> Note that this eliminates a bug in copy_vcpu_settings(): The function
>> did allocate a new page regardless of the GFN already having a mapping,
>> thus in particular breaking the case of two vCPU-s having their info
>> areas on the same page.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Some minor comments below:
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -127,10 +127,10 @@ static void vcpu_info_reset(struct vcpu
>>  {
>>      struct domain *d = v->domain;
> 
> d could likely be made const?

Probably, but this is just context here.

>> @@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
>>  
>>      domain_lock(d);
>>  
>> -    if ( map )
>> -        populate(map, v);
>> +    /* No re-registration of the vCPU info area. */
>> +    if ( area != &v->vcpu_info_area || !area->pg )
> 
> It would be nice if this check could be done earlier, so as to avoid
> having to fetch and map the page just to discard it.  That would
> however require taking the domain lock earlier.

In order to kind of balance the conflicting goals, there is an unlocked
early check in the caller. See also the related RFC remark; I might
interpret your remark as "yes, keep the early check".

>> +    {
>> +        if ( map )
>> +            populate(map, v);
>>  
>> -    SWAP(area->pg, pg);
>> -    SWAP(area->map, map);
>> +        SWAP(area->pg, pg);
>> +        SWAP(area->map, map);
>> +    }
>> +    else
>> +        rc = -EBUSY;
>>  
>>      domain_unlock(d);
>>  
>> +    /* Set pending flags /after/ new vcpu_info pointer was set. */
>> +    if ( area == &v->vcpu_info_area && !rc )
>> +    {
>> +        /*
>> +         * Mark everything as being pending just to make sure nothing gets
>> +         * lost.  The domain will get a spurious event, but it can cope.
>> +         */
>> +#ifdef CONFIG_COMPAT
>> +        if ( !has_32bit_shinfo(d) )
>> +        {
>> +            vcpu_info_t *info = area->map;
>> +
>> +            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
>> +            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
>> +            write_atomic(&info->native.evtchn_pending_sel, ~0);
>> +        }
>> +        else
>> +#endif
>> +            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
> 
> Can't the setting of evtchn_pending_sel be done in
> vcpu_info_populate()?

No, see the comment ahead of the outer if(). populate() needs calling
ahead of updating the pointer.

>> @@ -1801,6 +1729,27 @@ bool update_runstate_area(struct vcpu *v
>>      return rc;
>>  }
>>  
>> +/*
>> + * This makes sure that the vcpu_info is always pointing at a valid piece of
>> + * memory, and it sets a pending event to make sure that a pending event
>> + * doesn't get missed.
>> + */
>> +static void cf_check
>> +vcpu_info_populate(void *map, struct vcpu *v)
>> +{
>> +    vcpu_info_t *info = map;
>> +
>> +    if ( v->vcpu_info_area.map == &dummy_vcpu_info )
>> +    {
>> +        memset(info, 0, sizeof(*info));
>> +#ifdef XEN_HAVE_PV_UPCALL_MASK
>> +        __vcpu_info(v, info, evtchn_upcall_mask) = 1;
>> +#endif
> 
> I'm not sure about the point of those guards; this will always be 1,
> as we always build the hypervisor with the headers in xen/public?

That's the case on x86, but this is common code, and hence the build would
break on other architectures if the guard wasn't there.

> Is it to make backports easier?

I'm afraid I don't see how backporting considerations could play into here.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 8/8] common: convert vCPU info area registration
  2023-09-28  9:53     ` Jan Beulich
@ 2023-09-28 10:15       ` Roger Pau Monné
  2023-09-28 10:35         ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-28 10:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On Thu, Sep 28, 2023 at 11:53:36AM +0200, Jan Beulich wrote:
> On 28.09.2023 10:24, Roger Pau Monné wrote:
> > On Wed, May 03, 2023 at 05:58:30PM +0200, Jan Beulich wrote:
> >> Switch to using map_guest_area(). Noteworthy differences from
> >> map_vcpu_info():
> >> - remote vCPU-s are paused rather than checked for being down (which in
> >>   principle can change right after the check),
> >> - the domain lock is taken for a much smaller region,
> >> - the error code for an attempt to re-register the area is now -EBUSY,
> >> - we could in principle permit de-registration when no area was
> >>   previously registered (which would permit "probing", if necessary for
> >>   anything).
> >>
> >> Note that this eliminates a bug in copy_vcpu_settings(): The function
> >> did allocate a new page regardless of the GFN already having a mapping,
> >> thus in particular breaking the case of two vCPU-s having their info
> >> areas on the same page.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Some minor comments below:
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> >> --- a/xen/common/domain.c
> >> +++ b/xen/common/domain.c
> >> @@ -127,10 +127,10 @@ static void vcpu_info_reset(struct vcpu
> >>  {
> >>      struct domain *d = v->domain;
> > 
> > d could likely be made const?
> 
> Probably, but this is just context here.
> 
> >> @@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
> >>  
> >>      domain_lock(d);
> >>  
> >> -    if ( map )
> >> -        populate(map, v);
> >> +    /* No re-registration of the vCPU info area. */
> >> +    if ( area != &v->vcpu_info_area || !area->pg )
> > 
> > It would be nice if this check could be done earlier, so as to avoid
> > having to fetch and map the page just to discard it.  That would
> > however require taking the domain lock earlier.
> 
> In order to kind of balance the conflicting goals, there is an unlocked
> early check in the caller. See also the related RFC remark; I might
> interpret your remark as "yes, keep the early check".

Oh, yes, do keep the check.

> >> +    {
> >> +        if ( map )
> >> +            populate(map, v);
> >>  
> >> -    SWAP(area->pg, pg);
> >> -    SWAP(area->map, map);
> >> +        SWAP(area->pg, pg);
> >> +        SWAP(area->map, map);
> >> +    }
> >> +    else
> >> +        rc = -EBUSY;
> >>  
> >>      domain_unlock(d);
> >>  
> >> +    /* Set pending flags /after/ new vcpu_info pointer was set. */
> >> +    if ( area == &v->vcpu_info_area && !rc )
> >> +    {
> >> +        /*
> >> +         * Mark everything as being pending just to make sure nothing gets
> >> +         * lost.  The domain will get a spurious event, but it can cope.
> >> +         */
> >> +#ifdef CONFIG_COMPAT
> >> +        if ( !has_32bit_shinfo(d) )
> >> +        {
> >> +            vcpu_info_t *info = area->map;
> >> +
> >> +            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
> >> +            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
> >> +            write_atomic(&info->native.evtchn_pending_sel, ~0);
> >> +        }
> >> +        else
> >> +#endif
> >> +            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
> > 
> > Can't the setting of evtchn_pending_sel be done in
> > vcpu_info_populate()?
> 
> No, see the comment ahead of the outer if(). populate() needs calling
> ahead of updating the pointer.

I'm afraid I don't obviously see why evtchn_pending_sel can't be set
before v->vcpu_info is updated.  It will end up being ~0 anyway,
regardless of the order of operations, and we do force an event
channel injection.  There's something I'm clearly missing.

vcpu_mark_events_pending() and force_update_vcpu_system_time()
obviously cannot be moved to the populate hook.

> >> @@ -1801,6 +1729,27 @@ bool update_runstate_area(struct vcpu *v
> >>      return rc;
> >>  }
> >>  
> >> +/*
> >> + * This makes sure that the vcpu_info is always pointing at a valid piece of
> >> + * memory, and it sets a pending event to make sure that a pending event
> >> + * doesn't get missed.
> >> + */
> >> +static void cf_check
> >> +vcpu_info_populate(void *map, struct vcpu *v)
> >> +{
> >> +    vcpu_info_t *info = map;
> >> +
> >> +    if ( v->vcpu_info_area.map == &dummy_vcpu_info )
> >> +    {
> >> +        memset(info, 0, sizeof(*info));
> >> +#ifdef XEN_HAVE_PV_UPCALL_MASK
> >> +        __vcpu_info(v, info, evtchn_upcall_mask) = 1;
> >> +#endif
> > 
> > I'm not sure about the point of those guards; this will always be 1,
> > as we always build the hypervisor with the headers in xen/public?
> 
> That's the case on x86, but this is common code, and hence the build would
> break on other architectures if the guard wasn't there.

Ah, I see, sorry for the noise.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 8/8] common: convert vCPU info area registration
  2023-09-28 10:15       ` Roger Pau Monné
@ 2023-09-28 10:35         ` Jan Beulich
  2023-09-28 11:24           ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-09-28 10:35 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On 28.09.2023 12:15, Roger Pau Monné wrote:
> On Thu, Sep 28, 2023 at 11:53:36AM +0200, Jan Beulich wrote:
>> On 28.09.2023 10:24, Roger Pau Monné wrote:
>>> On Wed, May 03, 2023 at 05:58:30PM +0200, Jan Beulich wrote:
>>>> @@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
>>>>  
>>>>      domain_lock(d);
>>>>  
>>>> -    if ( map )
>>>> -        populate(map, v);
>>>> +    /* No re-registration of the vCPU info area. */
>>>> +    if ( area != &v->vcpu_info_area || !area->pg )
>>>
>>> It would be nice if this check could be done earlier, so as to avoid
>>> having to fetch and map the page just to discard it.  That would
>>> however require taking the domain lock earlier.
>>
>> In order to kind of balance the conflicting goals, there is an unlocked
>> early check in the caller. See also the related RFC remark; I might
>> interpret your remark as "yes, keep the early check".
> 
> Oh, yes, do keep the check.
> 
>>>> +    {
>>>> +        if ( map )
>>>> +            populate(map, v);
>>>>  
>>>> -    SWAP(area->pg, pg);
>>>> -    SWAP(area->map, map);
>>>> +        SWAP(area->pg, pg);
>>>> +        SWAP(area->map, map);
>>>> +    }
>>>> +    else
>>>> +        rc = -EBUSY;
>>>>  
>>>>      domain_unlock(d);
>>>>  
>>>> +    /* Set pending flags /after/ new vcpu_info pointer was set. */
>>>> +    if ( area == &v->vcpu_info_area && !rc )
>>>> +    {
>>>> +        /*
>>>> +         * Mark everything as being pending just to make sure nothing gets
>>>> +         * lost.  The domain will get a spurious event, but it can cope.
>>>> +         */
>>>> +#ifdef CONFIG_COMPAT
>>>> +        if ( !has_32bit_shinfo(d) )
>>>> +        {
>>>> +            vcpu_info_t *info = area->map;
>>>> +
>>>> +            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
>>>> +            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
>>>> +            write_atomic(&info->native.evtchn_pending_sel, ~0);
>>>> +        }
>>>> +        else
>>>> +#endif
>>>> +            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
>>>
>>> Can't the setting of evtchn_pending_sel be done in
>>> vcpu_info_populate()?
>>
>> No, see the comment ahead of the outer if(). populate() needs calling
>> ahead of updating the pointer.
> 
> I'm afraid I don't obviously see why evtchn_pending_sel can't be set
> before v->vcpu_info is updated.  It will end up being ~0 anyway,
> regardless of the order of operations, and we do force an event
> channel injection.  There's something I'm clearly missing.

Considering the target vCPU is paused (or is current), doing so may be
possible (albeit I fear I'm overlooking something). But re-ordering of
operations shouldn't be done as a side effect of this change; if the
original ordering constraints were too strict, then imo relaxing should
either be a separate earlier change, or a separate follow-on one.
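
To spell out the ordering being preserved from map_vcpu_info(), roughly:

    populate(map, v);      /* initialize contents; old pointer still live */
    SWAP(area->map, map);  /* publish the new vcpu_info (under domain_lock()) */
    /* only afterwards: */
    write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
    vcpu_mark_events_pending(v);

Moving the write into the populate hook would perform it ahead of the new
pointer being published.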

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 8/8] common: convert vCPU info area registration
  2023-09-28 10:35         ` Jan Beulich
@ 2023-09-28 11:24           ` Roger Pau Monné
  0 siblings, 0 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-09-28 11:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Tamas K Lengyel

On Thu, Sep 28, 2023 at 12:35:23PM +0200, Jan Beulich wrote:
> On 28.09.2023 12:15, Roger Pau Monné wrote:
> > On Thu, Sep 28, 2023 at 11:53:36AM +0200, Jan Beulich wrote:
> >> On 28.09.2023 10:24, Roger Pau Monné wrote:
> >>> On Wed, May 03, 2023 at 05:58:30PM +0200, Jan Beulich wrote:
> >>>> @@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
> >>>> +    {
> >>>> +        if ( map )
> >>>> +            populate(map, v);
> >>>>  
> >>>> -    SWAP(area->pg, pg);
> >>>> -    SWAP(area->map, map);
> >>>> +        SWAP(area->pg, pg);
> >>>> +        SWAP(area->map, map);
> >>>> +    }
> >>>> +    else
> >>>> +        rc = -EBUSY;
> >>>>  
> >>>>      domain_unlock(d);
> >>>>  
> >>>> +    /* Set pending flags /after/ new vcpu_info pointer was set. */
> >>>> +    if ( area == &v->vcpu_info_area && !rc )
> >>>> +    {
> >>>> +        /*
> >>>> +         * Mark everything as being pending just to make sure nothing gets
> >>>> +         * lost.  The domain will get a spurious event, but it can cope.
> >>>> +         */
> >>>> +#ifdef CONFIG_COMPAT
> >>>> +        if ( !has_32bit_shinfo(d) )
> >>>> +        {
> >>>> +            vcpu_info_t *info = area->map;
> >>>> +
> >>>> +            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
> >>>> +            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
> >>>> +            write_atomic(&info->native.evtchn_pending_sel, ~0);
> >>>> +        }
> >>>> +        else
> >>>> +#endif
> >>>> +            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
> >>>
> >>> Can't the setting of evtchn_pending_sel be done in
> >>> vcpu_info_populate()?
> >>
> >> No, see the comment ahead of the outer if(). populate() needs calling
> >> ahead of updating the pointer.
> > 
> > I'm afraid I don't obviously see why evtchn_pending_sel can't be set
> > before v->vcpu_info is updated.  It will end up being ~0 anyway,
> > regardless of the order of operations, and we do force an event
> > channel injection.  There's something I'm clearly missing.
> 
> Considering the target vCPU is paused (or is current), doing so may be
> possible (albeit I fear I'm overlooking something). But re-ordering of
> operations shouldn't be done as a side effect of this change; if the
> original ordering constraints were too strict, then imo relaxing should
> either be a separate earlier change, or a separate follow-on one.

Let's do a followup then.  This will also need an RB instead of an
Ack:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread

Thread overview: 38+ messages
2023-05-03 15:53 [PATCH v3 0/8] runstate/time area registration by (guest) physical address Jan Beulich
2023-05-03 15:54 ` [PATCH v3 1/8] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
2023-09-27  8:51   ` Roger Pau Monné
2023-09-27  9:55     ` Jan Beulich
2023-09-27 10:42       ` Roger Pau Monné
2023-09-27 10:46         ` Jan Beulich
2023-09-27 10:50           ` Roger Pau Monné
2023-09-27 11:44             ` Jan Beulich
2023-05-03 15:55 ` [PATCH v3 2/8] domain: update GADDR based runstate guest area Jan Beulich
2023-09-27  9:44   ` Roger Pau Monné
2023-09-27 10:19     ` Jan Beulich
2023-05-03 15:55 ` [PATCH v3 3/8] x86: update GADDR based secondary time area Jan Beulich
2023-09-27 10:14   ` Roger Pau Monné
2023-05-03 15:56 ` [PATCH v3 4/8] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
2023-05-03 17:14   ` Tamas K Lengyel
2023-05-04  7:44     ` Jan Beulich
2023-05-04 12:50       ` Tamas K Lengyel
2023-05-04 14:25         ` Jan Beulich
2023-09-27 11:08   ` Roger Pau Monné
2023-09-27 12:06     ` Jan Beulich
2023-09-27 14:05       ` Roger Pau Monné
2023-09-27 15:11         ` Jan Beulich
     [not found]     ` <CABfawhkn1xXA+qEjB4-HtOVUZHONDE6ngMJZPe3fSPtoAtmg+Q@mail.gmail.com>
2023-09-27 13:54       ` Roger Pau Monné
2023-05-03 15:57 ` [PATCH v3 5/8] domain: map/unmap " Jan Beulich
2023-09-27 14:53   ` Roger Pau Monné
2023-09-27 15:29     ` Jan Beulich
2023-05-03 15:57 ` [PATCH v3 6/8] domain: introduce GADDR based runstate area registration alternative Jan Beulich
2023-09-27 15:24   ` Roger Pau Monné
2023-09-27 15:36     ` Jan Beulich
2023-05-03 15:58 ` [PATCH v3 7/8] x86: introduce GADDR based secondary time " Jan Beulich
2023-09-27 15:50   ` Roger Pau Monné
2023-09-27 16:12     ` Jan Beulich
2023-05-03 15:58 ` [PATCH v3 8/8] common: convert vCPU info area registration Jan Beulich
2023-09-28  8:24   ` Roger Pau Monné
2023-09-28  9:53     ` Jan Beulich
2023-09-28 10:15       ` Roger Pau Monné
2023-09-28 10:35         ` Jan Beulich
2023-09-28 11:24           ` Roger Pau Monné
