* [PATCH v4 0/9] runstate/time area registration by (guest) physical address
@ 2023-09-28  6:59 Jan Beulich
  2023-09-28  7:01 ` [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown Jan Beulich
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  6:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

Since it was indicated that introducing specific new vCPU ops may be
beneficial independent of the introduction of a fully physical-
address-based ABI flavor, here we go. There continue to be a few open
questions throughout the series, the resolution of which was one of the main
goals of the earlier postings. v4 adds one (small) new patch and
addresses review comments; see individual patches.

1: x86/shim: zap runstate and time area handles during shutdown
2: domain: GADDR based shared guest area registration alternative - teardown
3: domain: update GADDR based runstate guest area
4: x86: update GADDR based secondary time area
5: x86/mem-sharing: copy GADDR based shared guest areas
6: domain: map/unmap GADDR based shared guest areas
7: domain: introduce GADDR based runstate area registration alternative
8: x86: introduce GADDR based secondary time area registration alternative
9: common: convert vCPU info area registration

Jan



* [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
@ 2023-09-28  7:01 ` Jan Beulich
  2023-09-28  9:27   ` Roger Pau Monné
  2023-09-28  7:14 ` [PATCH v4 2/9] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

While the guest would likely just re-register the same areas after a
possible resume, let's not take this for granted and avoid the risk of
otherwise corrupting guest memory.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -385,6 +385,10 @@ int pv_shim_shutdown(uint8_t reason)
         /* Unmap guest vcpu_info pages. */
         unmap_vcpu_info(v);
 
+        /* Zap runstate and time area handles. */
+        set_xen_guest_handle(runstate_guest(v), NULL);
+        set_xen_guest_handle(v->arch.time_info_guest, NULL);
+
         /* Reset the periodic timer to the default value. */
         vcpu_set_periodic_timer(v, MILLISECS(10));
         /* Stop the singleshot timer. */




* [PATCH v4 2/9] domain: GADDR based shared guest area registration alternative - teardown
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
  2023-09-28  7:01 ` [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown Jan Beulich
@ 2023-09-28  7:14 ` Jan Beulich
  2023-09-28  7:15 ` [PATCH v4 3/9] domain: update GADDR based runstate guest area Jan Beulich
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:14 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

In preparation for the introduction of new vCPU operations that allow
registering the respective areas (one of the two is x86-specific) by
guest-physical address, add the necessary domain cleanup hooks.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
---
v4: Re-base.
v2: Add assertion in unmap_guest_area().

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1024,7 +1024,10 @@ int arch_domain_soft_reset(struct domain
     }
 
     for_each_vcpu ( d, v )
+    {
         set_xen_guest_handle(v->arch.time_info_guest, NULL);
+        unmap_guest_area(v, &v->arch.time_guest_area);
+    }
 
  exit_put_gfn:
     put_gfn(d, gfn_x(gfn));
@@ -2381,6 +2384,8 @@ int domain_relinquish_resources(struct d
             if ( ret )
                 return ret;
 
+            unmap_guest_area(v, &v->arch.time_guest_area);
+
             vpmu_destroy(v);
         }
 
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -669,6 +669,7 @@ struct arch_vcpu
 
     /* A secondary copy of the vcpu time info. */
     XEN_GUEST_HANDLE(vcpu_time_info_t) time_info_guest;
+    struct guest_area time_guest_area;
 
     struct arch_vm_event *vm_event;
 
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -382,8 +382,10 @@ int pv_shim_shutdown(uint8_t reason)
 
     for_each_vcpu ( d, v )
     {
-        /* Unmap guest vcpu_info pages. */
+        /* Unmap guest vcpu_info page and runstate/time areas. */
         unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->runstate_guest_area);
+        unmap_guest_area(v, &v->arch.time_guest_area);
 
         /* Zap runstate and time area handles. */
         set_xen_guest_handle(runstate_guest(v), NULL);
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -992,7 +992,10 @@ int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
         for_each_vcpu ( d, v )
+        {
             unmap_vcpu_info(v);
+            unmap_guest_area(v, &v->runstate_guest_area);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings 
          * have to be put before we call put_domain. */
@@ -1446,6 +1449,7 @@ int domain_soft_reset(struct domain *d,
     {
         set_xen_guest_handle(runstate_guest(v), NULL);
         unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->runstate_guest_area);
     }
 
     rc = arch_domain_soft_reset(d);
@@ -1597,6 +1601,19 @@ void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+/*
+ * This is only intended to be used for domain cleanup (or more generally only
+ * with at least the respective vCPU, if it's not the current one, reliably
+ * paused).
+ */
+void unmap_guest_area(struct vcpu *v, struct guest_area *area)
+{
+    struct domain *d = v->domain;
+
+    if ( v != current )
+        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
+}
+
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct vcpu_guest_context *ctxt;
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -5,6 +5,12 @@
 #include <xen/types.h>
 
 #include <public/xen.h>
+
+struct guest_area {
+    struct page_info *pg;
+    void *map;
+};
+
 #include <asm/domain.h>
 #include <asm/numa.h>
 
@@ -77,6 +83,11 @@ void arch_vcpu_destroy(struct vcpu *v);
 int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset);
 void unmap_vcpu_info(struct vcpu *v);
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v));
+void unmap_guest_area(struct vcpu *v, struct guest_area *area);
+
 struct xen_domctl_createdomain;
 int arch_domain_create(struct domain *d,
                        struct xen_domctl_createdomain *config,
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -203,6 +203,7 @@ struct vcpu
         XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
     } runstate_guest; /* guest address */
 #endif
+    struct guest_area runstate_guest_area;
     unsigned int     new_state;
 
     /* Has the FPU been initialised? */




* [PATCH v4 3/9] domain: update GADDR based runstate guest area
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
  2023-09-28  7:01 ` [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown Jan Beulich
  2023-09-28  7:14 ` [PATCH v4 2/9] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
@ 2023-09-28  7:15 ` Jan Beulich
  2023-09-28 12:15   ` Julien Grall
  2023-09-28  7:15 ` [PATCH v4 4/9] x86: update GADDR based secondary time area Jan Beulich
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

Before adding a new vCPU operation to register the runstate area by
guest-physical address, add code to actually keep such areas up-to-date.

Note that updating of the area will be done exclusively following the
model enabled by VMASST_TYPE_runstate_update_flag for virtual-address
based registered areas.

Note further that pages aren't marked dirty when written to (matching
the handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).
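
For illustration only (this is not part of the patch), a guest-side sketch of
taking a consistent snapshot of a physically registered runstate area under
the always-engaged update-flag model described above. The mapping pointer "ra"
and the read_atomic()/smp_rmb() names are merely stand-ins for whatever the
guest environment provides:

static void runstate_snapshot(const struct vcpu_runstate_info *ra,
                              struct vcpu_runstate_info *snap)
{
    uint64_t entry_time;

    do {
        entry_time = read_atomic(&ra->state_entry_time);
        smp_rmb();              /* sample the entry time before copying */
        *snap = *ra;
        smp_rmb();              /* ... and re-check it after the copy */
    } while ( (entry_time & XEN_RUNSTATE_UPDATE) ||
              read_atomic(&ra->state_entry_time) != entry_time );
}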

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
RFC: HVM guests (on x86) can change bitness and hence layout (and size!
     and alignment) of the runstate area. I don't think it is an option
     to require 32-bit code to pass a range such that even the 64-bit
     layout wouldn't cross a page boundary (and be suitably aligned). I
     also don't see any other good solution, so for now a crude approach
     with an extra boolean is used (using has_32bit_shinfo() isn't race
     free and could hence lead to overrunning the mapped space).
---
v3: Use assignment instead of memcpy().
v2: Drop VM-assist conditionals.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1644,15 +1644,52 @@ bool update_runstate_area(struct vcpu *v
     bool rc;
     struct guest_memory_policy policy = { };
     void __user *guest_handle = NULL;
-    struct vcpu_runstate_info runstate;
+    struct vcpu_runstate_info runstate = v->runstate;
+    struct vcpu_runstate_info *map = v->runstate_guest_area.map;
+
+    if ( map )
+    {
+        uint64_t *pset;
+#ifdef CONFIG_COMPAT
+        struct compat_vcpu_runstate_info *cmap = NULL;
+
+        if ( v->runstate_guest_area_compat )
+            cmap = (void *)map;
+#endif
+
+        /*
+         * NB: No VM_ASSIST(v->domain, runstate_update_flag) check here.
+         *     Always using that updating model.
+         */
+#ifdef CONFIG_COMPAT
+        if ( cmap )
+            pset = &cmap->state_entry_time;
+        else
+#endif
+            pset = &map->state_entry_time;
+        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+        write_atomic(pset, runstate.state_entry_time);
+        smp_wmb();
+
+#ifdef CONFIG_COMPAT
+        if ( cmap )
+            XLAT_vcpu_runstate_info(cmap, &runstate);
+        else
+#endif
+            *map = runstate;
+
+        smp_wmb();
+        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        write_atomic(pset, runstate.state_entry_time);
+
+        return true;
+    }
 
     if ( guest_handle_is_null(runstate_guest(v)) )
         return true;
 
     update_guest_memory_policy(v, &policy);
 
-    memcpy(&runstate, &v->runstate, sizeof(runstate));
-
     if ( VM_ASSIST(v->domain, runstate_update_flag) )
     {
 #ifdef CONFIG_COMPAT
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -232,6 +232,8 @@ struct vcpu
 #ifdef CONFIG_COMPAT
     /* A hypercall is using the compat ABI? */
     bool             hcall_compat;
+    /* Physical runstate area registered via compat ABI? */
+    bool             runstate_guest_area_compat;
 #endif
 
 #ifdef CONFIG_IOREQ_SERVER




* [PATCH v4 4/9] x86: update GADDR based secondary time area
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (2 preceding siblings ...)
  2023-09-28  7:15 ` [PATCH v4 3/9] domain: update GADDR based runstate guest area Jan Beulich
@ 2023-09-28  7:15 ` Jan Beulich
  2023-09-28  7:16 ` [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné, Henry Wang

Before adding a new vCPU operation to register the secondary time area
by guest-physical address, add code to actually keep such areas up-to-
date.

Note that pages aren't marked dirty when written to (matching the
handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).
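
For illustration only (not part of the patch), the consumer side of the
version protocol which write_time_guest_area() below implements: an odd
version means an update is in progress. "ti" stands for the guest's own
mapping of the registered vcpu_time_info; read_atomic()/smp_rmb() are again
just stand-ins for the guest environment's primitives:

static void time_info_snapshot(const struct vcpu_time_info *ti,
                               struct vcpu_time_info *snap)
{
    uint32_t version;

    do {
        version = read_atomic(&ti->version);
        smp_rmb();              /* read the payload only after the version */
        *snap = *ti;
        smp_rmb();              /* ... and verify the version afterwards */
    } while ( (version & 1) || read_atomic(&ti->version) != version );
}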

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1566,12 +1566,34 @@ static void __update_vcpu_system_time(st
         v->arch.pv.pending_system_time = _u;
 }
 
+static void write_time_guest_area(struct vcpu_time_info *map,
+                                  const struct vcpu_time_info *src)
+{
+    /* 1. Update userspace version. */
+    write_atomic(&map->version, src->version);
+    smp_wmb();
+
+    /* 2. Update all other userspace fields. */
+    *map = *src;
+
+    /* 3. Update userspace version again. */
+    smp_wmb();
+    write_atomic(&map->version, version_update_end(src->version));
+}
+
 bool update_secondary_system_time(struct vcpu *v,
                                   struct vcpu_time_info *u)
 {
     XEN_GUEST_HANDLE(vcpu_time_info_t) user_u = v->arch.time_info_guest;
+    struct vcpu_time_info *map = v->arch.time_guest_area.map;
     struct guest_memory_policy policy = { .nested_guest_mode = false };
 
+    if ( map )
+    {
+        write_time_guest_area(map, u);
+        return true;
+    }
+
     if ( guest_handle_is_null(user_u) )
         return true;
 




* [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (3 preceding siblings ...)
  2023-09-28  7:15 ` [PATCH v4 4/9] x86: update GADDR based secondary time area Jan Beulich
@ 2023-09-28  7:16 ` Jan Beulich
  2023-09-28  9:51   ` Roger Pau Monné
  2023-09-28  7:16 ` [PATCH v4 6/9] domain: map/unmap " Jan Beulich
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang, Tamas K Lengyel

In preparation for the introduction of new vCPU operations that allow
registering the respective areas (one of the two is x86-specific) by
guest-physical address, add the necessary fork handling (with the
backing function yet to be filled in).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Extend comment.

--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
     hvm_set_nonreg_state(cd_vcpu, &nrs);
 }
 
+static int copy_guest_area(struct guest_area *cd_area,
+                           const struct guest_area *d_area,
+                           struct vcpu *cd_vcpu,
+                           const struct domain *d)
+{
+    mfn_t d_mfn, cd_mfn;
+
+    if ( !d_area->pg )
+        return 0;
+
+    d_mfn = page_to_mfn(d_area->pg);
+
+    /* Allocate & map a page for the area if it hasn't been already. */
+    if ( !cd_area->pg )
+    {
+        gfn_t gfn = mfn_to_gfn(d, d_mfn);
+        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
+        p2m_type_t p2mt;
+        p2m_access_t p2ma;
+        unsigned int offset;
+        int ret;
+
+        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
+        if ( mfn_eq(cd_mfn, INVALID_MFN) )
+        {
+            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
+
+            if ( !pg )
+                return -ENOMEM;
+
+            cd_mfn = page_to_mfn(pg);
+            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
+
+            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
+                                 p2m->default_access, -1);
+            if ( ret )
+                return ret;
+        }
+        else if ( p2mt != p2m_ram_rw )
+            return -EBUSY;
+
+        /*
+         * Map the area into the guest. For simplicity specify the entire range
+         * up to the end of the page: All the function uses it for is to check
+         * that the range doesn't cross page boundaries. Having the area mapped
+         * in the original domain implies that it fits there and therefore will
+         * also fit in the clone.
+         */
+        offset = PAGE_OFFSET(d_area->map);
+        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
+                             PAGE_SIZE - offset, cd_area, NULL);
+        if ( ret )
+            return ret;
+    }
+    else
+        cd_mfn = page_to_mfn(cd_area->pg);
+
+    copy_domain_page(cd_mfn, d_mfn);
+
+    return 0;
+}
+
 static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
 {
     struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
@@ -1733,6 +1795,16 @@ static int copy_vcpu_settings(struct dom
             copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
         }
 
+        /* Same for the (physically registered) runstate and time info areas. */
+        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
+                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
+                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+
         ret = copy_vpmu(d_vcpu, cd_vcpu);
         if ( ret )
             return ret;
@@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
 
  state:
     if ( reset_state )
+    {
         rc = copy_settings(d, pd);
+        /* TBD: What to do here with -ERESTART? */
+    }
 
     domain_unpause(d);
 
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1601,6 +1601,13 @@ void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v))
+{
+    return -EOPNOTSUPP;
+}
+
 /*
  * This is only intended to be used for domain cleanup (or more generally only
  * with at least the respective vCPU, if it's not the current one, reliably




* [PATCH v4 6/9] domain: map/unmap GADDR based shared guest areas
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (4 preceding siblings ...)
  2023-09-28  7:16 ` [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
@ 2023-09-28  7:16 ` Jan Beulich
  2023-09-28 10:04   ` Roger Pau Monné
  2023-09-28  7:17 ` [PATCH v4 7/9] domain: introduce GADDR based runstate area registration alternative Jan Beulich
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the areas are inaccessible (and hence cannot be updated by
Xen) when in guest-user mode, and for HVM guests they may be
inaccessible when Meltdown mitigations are in place. (There are yet
more issues.)

In preparation for the introduction of new vCPU operations that allow
registering the respective areas (one of the two is x86-specific) by
guest-physical address, flesh out the map/unmap functions.

Noteworthy differences from map_vcpu_info():
- areas can be registered more than once (and de-registered),
- remote vCPU-s are paused rather than checked for being down (which in
  principle can change right after the check),
- the domain lock is taken for a much smaller region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: By using global domain page mappings the demand on the underlying
     VA range may increase significantly. I did consider using per-
     domain mappings instead, but they exist for x86 only. Of course we
     could have arch_{,un}map_guest_area() aliasing global domain page
     mapping functions on Arm and using per-domain mappings on x86. Yet
     then again map_vcpu_info() doesn't (and can't) do so.

RFC: In map_guest_area() I'm not checking the P2M type, instead - just
     like map_vcpu_info() - solely relying on the type ref acquisition.
     Checking for p2m_ram_rw alone would be wrong, as at least
     p2m_ram_logdirty ought to also be okay to use here (and in similar
     cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
     used here (like altp2m_vcpu_enable_ve() does) as well as in
     map_vcpu_info(), yet then again the P2M type is stale by the time
     it is being looked at anyway without the P2M lock held.
---
v4: Add another comment. Use IS_ALIGNED(). Add another NULL check.
v2: currd -> d, to cover mem-sharing's copy_guest_area(). Re-base over
    change(s) earlier in the series. Use ~0 as "unmap" request indicator.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1605,7 +1605,82 @@ int map_guest_area(struct vcpu *v, paddr
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
 {
-    return -EOPNOTSUPP;
+    struct domain *d = v->domain;
+    void *map = NULL;
+    struct page_info *pg = NULL;
+    int rc = 0;
+
+    if ( ~gaddr ) /* Map (i.e. not just unmap)? */
+    {
+        unsigned long gfn = PFN_DOWN(gaddr);
+        unsigned int align;
+        p2m_type_t p2mt;
+
+        if ( gfn != PFN_DOWN(gaddr + size - 1) )
+            return -ENXIO;
+
+#ifdef CONFIG_COMPAT
+        if ( has_32bit_shinfo(d) )
+            align = alignof(compat_ulong_t);
+        else
+#endif
+            align = alignof(xen_ulong_t);
+        if ( !IS_ALIGNED(gaddr, align) )
+            return -ENXIO;
+
+        rc = check_get_page_from_gfn(d, _gfn(gfn), false, &p2mt, &pg);
+        if ( rc )
+            return rc;
+
+        if ( !get_page_type(pg, PGT_writable_page) )
+        {
+            put_page(pg);
+            return -EACCES;
+        }
+
+        map = __map_domain_page_global(pg);
+        if ( !map )
+        {
+            put_page_and_type(pg);
+            return -ENOMEM;
+        }
+        map += PAGE_OFFSET(gaddr);
+    }
+
+    if ( v != current )
+    {
+        if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
+        {
+            rc = -ERESTART;
+            goto unmap;
+        }
+
+        vcpu_pause(v);
+
+        spin_unlock(&d->hypercall_deadlock_mutex);
+    }
+
+    domain_lock(d);
+
+    if ( map && populate )
+        populate(map, v);
+
+    SWAP(area->pg, pg);
+    SWAP(area->map, map);
+
+    domain_unlock(d);
+
+    if ( v != current )
+        vcpu_unpause(v);
+
+ unmap:
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
+
+    return rc;
 }
 
 /*
@@ -1616,9 +1691,24 @@ int map_guest_area(struct vcpu *v, paddr
 void unmap_guest_area(struct vcpu *v, struct guest_area *area)
 {
     struct domain *d = v->domain;
+    void *map;
+    struct page_info *pg;
 
     if ( v != current )
         ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
+
+    domain_lock(d);
+    map = area->map;
+    area->map = NULL;
+    pg = area->pg;
+    area->pg = NULL;
+    domain_unlock(d);
+
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
 }
 
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)




* [PATCH v4 7/9] domain: introduce GADDR based runstate area registration alternative
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (5 preceding siblings ...)
  2023-09-28  7:16 ` [PATCH v4 6/9] domain: map/unmap " Jan Beulich
@ 2023-09-28  7:17 ` Jan Beulich
  2023-09-28  7:17 ` [PATCH v4 8/9] x86: introduce GADDR based secondary time " Jan Beulich
  2023-09-28  7:18 ` [PATCH v4 9/9] common: convert vCPU info area registration Jan Beulich
  8 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:17 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the area is inaccessible (and hence cannot be updated by Xen)
when in guest-user mode.

Introduce a new vCPU operation that allows registering the runstate area
by guest-physical address.

An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.
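
For illustration only (not part of the patch), a sketch of how a guest might
invoke the new operation. virt_to_gphys() and HYPERVISOR_vcpu_op() are
stand-ins for the guest environment's address translation and hypercall
helpers; the runstate object has to be suitably aligned and must not cross a
page boundary:

static struct vcpu_runstate_info runstate; /* one vCPU only, for brevity */

static int register_runstate_phys(unsigned int vcpu)
{
    struct vcpu_register_runstate_memory_area area = {
        .addr.p = virt_to_gphys(&runstate),
    };

    return HYPERVISOR_vcpu_op(VCPUOP_register_runstate_phys_area, vcpu, &area);
}

Passing ~0 as the address de-registers the area again; see the comment added
to the public header below.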

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v4: Add xref in public header.
v2: Extend comment in public header.

--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -12,6 +12,22 @@
 CHECK_vcpu_get_physid;
 #undef xen_vcpu_get_physid
 
+static void cf_check
+runstate_area_populate(void *map, struct vcpu *v)
+{
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.need_update_runstate_area = false;
+
+    v->runstate_guest_area_compat = true;
+
+    if ( v == current )
+    {
+        struct compat_vcpu_runstate_info *info = map;
+
+        XLAT_vcpu_runstate_info(info, &v->runstate);
+    }
+}
+
 int
 compat_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
@@ -57,6 +73,25 @@ compat_vcpu_op(int cmd, unsigned int vcp
 
         break;
     }
+
+    case VCPUOP_register_runstate_phys_area:
+    {
+        struct compat_vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(struct compat_vcpu_runstate_info),
+                            &v->runstate_guest_area,
+                            runstate_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     case VCPUOP_register_vcpu_time_memory_area:
     {
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1830,6 +1830,26 @@ bool update_runstate_area(struct vcpu *v
     return rc;
 }
 
+static void cf_check
+runstate_area_populate(void *map, struct vcpu *v)
+{
+#ifdef CONFIG_PV
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.need_update_runstate_area = false;
+#endif
+
+#ifdef CONFIG_COMPAT
+    v->runstate_guest_area_compat = false;
+#endif
+
+    if ( v == current )
+    {
+        struct vcpu_runstate_info *info = map;
+
+        *info = v->runstate;
+    }
+}
+
 long common_vcpu_op(int cmd, struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
@@ -2011,6 +2031,25 @@ long common_vcpu_op(int cmd, struct vcpu
 
         break;
     }
+
+    case VCPUOP_register_runstate_phys_area:
+    {
+        struct vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(struct vcpu_runstate_info),
+                            &v->runstate_guest_area,
+                            runstate_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     default:
         rc = -ENOSYS;
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -110,6 +110,8 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_runstate_in
  *     runstate.state will always be RUNSTATE_running and
  *     runstate.state_entry_time will indicate the system time at which the
  *     VCPU was last scheduled to run.
+ *  3. New code wants to prefer VCPUOP_register_runstate_phys_area, and only
+ *     fall back to the operation here for backwards compatibility.
  * @extra_arg == pointer to vcpu_register_runstate_memory_area structure.
  */
 #define VCPUOP_register_runstate_memory_area 5
@@ -221,6 +223,19 @@ struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+/*
+ * Like the respective VCPUOP_register_*_memory_area, just using the "addr.p"
+ * field of the supplied struct as a guest physical address (i.e. in GFN space).
+ * The respective area may not cross a page boundary.  Pass ~0 to unregister an
+ * area.  Note that as long as an area is registered by physical address, the
+ * linear address based area will not be serviced (updated) by the hypervisor.
+ *
+ * Note that the area registered via VCPUOP_register_runstate_phys_area will
+ * be updated in the same manner as the one registered via virtual address PLUS
+ * VMASST_TYPE_runstate_update_flag engaged by the domain.
+ */
+#define VCPUOP_register_runstate_phys_area      14
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*




* [PATCH v4 8/9] x86: introduce GADDR based secondary time area registration alternative
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (6 preceding siblings ...)
  2023-09-28  7:17 ` [PATCH v4 7/9] domain: introduce GADDR based runstate area registration alternative Jan Beulich
@ 2023-09-28  7:17 ` Jan Beulich
  2023-09-28  7:18 ` [PATCH v4 9/9] common: convert vCPU info area registration Jan Beulich
  8 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:17 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

The registration by virtual/linear address has downsides: The access is
expensive for HVM/PVH domains. Furthermore for 64-bit PV domains the area
is inaccessible (and hence cannot be updated by Xen) when in guest-user
mode.

Introduce a new vCPU operation that allows registering the secondary time
area by guest-physical address.

An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.
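
For illustration only (not part of the patch), registering, and later dropping
again, a secondary time area by physical address. virt_to_gphys() and
HYPERVISOR_vcpu_op() are stand-ins for the guest environment's helpers; the
vcpu_time_info object must not cross a page boundary:

static vcpu_time_info_t time_info; /* one vCPU only, for brevity */

static int register_time_phys(unsigned int vcpu)
{
    struct vcpu_register_time_memory_area area = {
        .addr.p = virt_to_gphys(&time_info),
    };

    return HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_phys_area, vcpu, &area);
}

static int unregister_time_phys(unsigned int vcpu)
{
    struct vcpu_register_time_memory_area area = { .addr.p = ~0ULL };

    return HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_phys_area, vcpu, &area);
}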

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v4: Add xref in public header.
v3: Re-base.
v2: Forge version in force_update_secondary_system_time().

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1529,6 +1529,15 @@ int arch_vcpu_reset(struct vcpu *v)
     return 0;
 }
 
+static void cf_check
+time_area_populate(void *map, struct vcpu *v)
+{
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.pending_system_time.version = 0;
+
+    force_update_secondary_system_time(v, map);
+}
+
 long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
@@ -1566,6 +1575,25 @@ long do_vcpu_op(int cmd, unsigned int vc
 
         break;
     }
+
+    case VCPUOP_register_vcpu_time_phys_area:
+    {
+        struct vcpu_register_time_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(vcpu_time_info_t),
+                            &v->arch.time_guest_area,
+                            time_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     case VCPUOP_get_physid:
     {
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -692,6 +692,8 @@ void domain_cpu_policy_changed(struct do
 
 bool update_secondary_system_time(struct vcpu *,
                                   struct vcpu_time_info *);
+void force_update_secondary_system_time(struct vcpu *,
+                                        struct vcpu_time_info *);
 
 void vcpu_show_registers(const struct vcpu *);
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1628,6 +1628,16 @@ void force_update_vcpu_system_time(struc
     __update_vcpu_system_time(v, 1);
 }
 
+void force_update_secondary_system_time(struct vcpu *v,
+                                        struct vcpu_time_info *map)
+{
+    struct vcpu_time_info u;
+
+    collect_time_info(v, &u);
+    u.version = -1; /* Compensate for version_update_end(). */
+    write_time_guest_area(map, &u);
+}
+
 static void update_domain_rtc(void)
 {
     struct domain *d;
--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -115,6 +115,7 @@ compat_vcpu_op(int cmd, unsigned int vcp
 
     case VCPUOP_send_nmi:
     case VCPUOP_get_physid:
+    case VCPUOP_register_vcpu_time_phys_area:
         rc = do_vcpu_op(cmd, vcpuid, arg);
         break;
 
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -209,6 +209,9 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_get_physid_
  * segment limit).  It can then apply the normal algorithm to compute
  * system time from the tsc.
  *
+ * New code wants to prefer VCPUOP_register_vcpu_time_phys_area, and only
+ * fall back to the operation here for backwards compatibility.
+ *
  * @extra_arg == pointer to vcpu_register_time_info_memory_area structure.
  */
 #define VCPUOP_register_vcpu_time_memory_area   13
@@ -235,6 +238,7 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_register_ti
  * VMASST_TYPE_runstate_update_flag engaged by the domain.
  */
 #define VCPUOP_register_runstate_phys_area      14
+#define VCPUOP_register_vcpu_time_phys_area     15
 
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 




* [PATCH v4 9/9] common: convert vCPU info area registration
  2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
                   ` (7 preceding siblings ...)
  2023-09-28  7:17 ` [PATCH v4 8/9] x86: introduce GADDR based secondary time " Jan Beulich
@ 2023-09-28  7:18 ` Jan Beulich
  8 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2023-09-28  7:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Henry Wang

Switch to using map_guest_area(). Noteworthy differences from
map_vcpu_info():
- remote vCPU-s are paused rather than checked for being down (which in
  principle can change right after the check),
- the domain lock is taken for a much smaller region,
- the error code for an attempt to re-register the area is now -EBUSY,
- we could in principle permit de-registration when no area was
  previously registered (which would permit "probing", if necessary for
  anything).

Note that this eliminates a bug in copy_vcpu_settings(): The function
did allocate a new page regardless of the GFN already having a mapping,
thus in particular breaking the case of two vCPU-s having their info
areas on the same page.
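
For illustration only (not part of the patch), how the changed error handling
surfaces at the existing VCPUOP_register_vcpu_info interface. virt_to_gfn(),
PAGE_MASK, and HYPERVISOR_vcpu_op() are stand-ins for the guest environment's
helpers; the vcpu_info object is assumed to be suitably aligned and to live
within a single page:

static vcpu_info_t my_vcpu_info; /* one vCPU only, for brevity */

static int register_vcpu_info(unsigned int vcpu)
{
    struct vcpu_register_vcpu_info info = {
        .mfn    = virt_to_gfn(&my_vcpu_info), /* frame holding the area */
        .offset = (unsigned long)&my_vcpu_info & ~PAGE_MASK,
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, vcpu, &info);

    /* Re-registration now fails with -EBUSY (rather than -EINVAL). */
    return rc == -EBUSY ? 0 : rc;
}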

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: I'm not really certain whether the preliminary check (ahead of
     calling map_guest_area()) is worthwhile to have.
---
v4: Re-base over changes earlier in the series.
v2: Re-base over changes earlier in the series. Properly enforce no re-
    registration. Avoid several casts by introducing local variables.

--- a/xen/arch/x86/include/asm/shared.h
+++ b/xen/arch/x86/include/asm/shared.h
@@ -26,17 +26,20 @@ static inline void arch_set_##field(stru
 #define GET_SET_VCPU(type, field)                               \
 static inline type arch_get_##field(const struct vcpu *v)       \
 {                                                               \
+    const vcpu_info_t *vi = v->vcpu_info_area.map;              \
+                                                                \
     return !has_32bit_shinfo(v->domain) ?                       \
-           v->vcpu_info->native.arch.field :                    \
-           v->vcpu_info->compat.arch.field;                     \
+           vi->native.arch.field : vi->compat.arch.field;       \
 }                                                               \
 static inline void arch_set_##field(struct vcpu *v,             \
                                     type val)                   \
 {                                                               \
+    vcpu_info_t *vi = v->vcpu_info_area.map;                    \
+                                                                \
     if ( !has_32bit_shinfo(v->domain) )                         \
-        v->vcpu_info->native.arch.field = val;                  \
+        vi->native.arch.field = val;                            \
     else                                                        \
-        v->vcpu_info->compat.arch.field = val;                  \
+        vi->compat.arch.field = val;                            \
 }
 
 #else
@@ -57,12 +60,16 @@ static inline void arch_set_##field(stru
 #define GET_SET_VCPU(type, field)                           \
 static inline type arch_get_##field(const struct vcpu *v)   \
 {                                                           \
-    return v->vcpu_info->arch.field;                        \
+    const vcpu_info_t *vi = v->vcpu_info_area.map;          \
+                                                            \
+    return vi->arch.field;                                  \
 }                                                           \
 static inline void arch_set_##field(struct vcpu *v,         \
                                     type val)               \
 {                                                           \
-    v->vcpu_info->arch.field = val;                         \
+    vcpu_info_t *vi = v->vcpu_info_area.map;                \
+                                                            \
+    vi->arch.field = val;                                   \
 }
 
 #endif
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1749,53 +1749,24 @@ static int copy_vpmu(struct vcpu *d_vcpu
 static int copy_vcpu_settings(struct domain *cd, const struct domain *d)
 {
     unsigned int i;
-    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
     int ret = -EINVAL;
 
     for ( i = 0; i < cd->max_vcpus; i++ )
     {
         struct vcpu *d_vcpu = d->vcpu[i];
         struct vcpu *cd_vcpu = cd->vcpu[i];
-        mfn_t vcpu_info_mfn;
 
         if ( !d_vcpu || !cd_vcpu )
             continue;
 
-        /* Copy & map in the vcpu_info page if the guest uses one */
-        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
-        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
-        {
-            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
-
-            /* Allocate & map the page for it if it hasn't been already */
-            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
-            {
-                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
-                unsigned long gfn_l = gfn_x(gfn);
-                struct page_info *page;
-
-                if ( !(page = alloc_domheap_page(cd, 0)) )
-                    return -ENOMEM;
-
-                new_vcpu_info_mfn = page_to_mfn(page);
-                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
-
-                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn,
-                                     PAGE_ORDER_4K, p2m_ram_rw,
-                                     p2m->default_access, -1);
-                if ( ret )
-                    return ret;
-
-                ret = map_vcpu_info(cd_vcpu, gfn_l,
-                                    PAGE_OFFSET(d_vcpu->vcpu_info));
-                if ( ret )
-                    return ret;
-            }
-
-            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
-        }
-
-        /* Same for the (physically registered) runstate and time info areas. */
+        /*
+         * Copy and map the vcpu_info page and the (physically registered)
+         * runstate and time info areas.
+         */
+        ret = copy_guest_area(&cd_vcpu->vcpu_info_area,
+                              &d_vcpu->vcpu_info_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
         ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
                               &d_vcpu->runstate_guest_area, cd_vcpu, d);
         if ( ret )
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -383,7 +383,7 @@ int pv_shim_shutdown(uint8_t reason)
     for_each_vcpu ( d, v )
     {
         /* Unmap guest vcpu_info page and runstate/time areas. */
-        unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->vcpu_info_area);
         unmap_guest_area(v, &v->runstate_guest_area);
         unmap_guest_area(v, &v->arch.time_guest_area);
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1542,7 +1542,7 @@ static void __update_vcpu_system_time(st
     struct vcpu_time_info *u = &vcpu_info(v, time), _u;
     const struct domain *d = v->domain;
 
-    if ( v->vcpu_info == NULL )
+    if ( !v->vcpu_info_area.map )
         return;
 
     collect_time_info(v, &_u);
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -53,7 +53,7 @@ void __dummy__(void)
 
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
-    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
+    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info_area.map);
     OFFSET(VCPU_trap_bounce, struct vcpu, arch.pv.trap_bounce);
     OFFSET(VCPU_thread_flags, struct vcpu, arch.flags);
     OFFSET(VCPU_event_addr, struct vcpu, arch.pv.event_callback_eip);
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -96,7 +96,7 @@ static void _show_registers(
     if ( context == CTXT_hypervisor )
         printk(" %pS", _p(regs->rip));
     printk("\nRFLAGS: %016lx   ", regs->rflags);
-    if ( (context == CTXT_pv_guest) && v && v->vcpu_info )
+    if ( (context == CTXT_pv_guest) && v && v->vcpu_info_area.map )
         printk("EM: %d   ", !!vcpu_info(v, evtchn_upcall_mask));
     printk("CONTEXT: %s", context_names[context]);
     if ( v && !is_idle_vcpu(v) )
--- a/xen/common/compat/domain.c
+++ b/xen/common/compat/domain.c
@@ -49,7 +49,7 @@ int compat_common_vcpu_op(int cmd, struc
     {
     case VCPUOP_initialise:
     {
-        if ( v->vcpu_info == &dummy_vcpu_info )
+        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
             return -EINVAL;
 
 #ifdef CONFIG_HVM
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -127,10 +127,10 @@ static void vcpu_info_reset(struct vcpu
 {
     struct domain *d = v->domain;
 
-    v->vcpu_info = ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
-                    ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
-                    : &dummy_vcpu_info);
-    v->vcpu_info_mfn = INVALID_MFN;
+    v->vcpu_info_area.map =
+        ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
+         ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
+         : &dummy_vcpu_info);
 }
 
 static void vmtrace_free_buffer(struct vcpu *v)
@@ -993,7 +993,7 @@ int domain_kill(struct domain *d)
             return -ERESTART;
         for_each_vcpu ( d, v )
         {
-            unmap_vcpu_info(v);
+            unmap_guest_area(v, &v->vcpu_info_area);
             unmap_guest_area(v, &v->runstate_guest_area);
         }
         d->is_dying = DOMDYING_dead;
@@ -1448,7 +1448,7 @@ int domain_soft_reset(struct domain *d,
     for_each_vcpu ( d, v )
     {
         set_xen_guest_handle(runstate_guest(v), NULL);
-        unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->vcpu_info_area);
         unmap_guest_area(v, &v->runstate_guest_area);
     }
 
@@ -1496,111 +1496,6 @@ int vcpu_reset(struct vcpu *v)
     return rc;
 }
 
-/*
- * Map a guest page in and point the vcpu_info pointer at it.  This
- * makes sure that the vcpu_info is always pointing at a valid piece
- * of memory, and it sets a pending event to make sure that a pending
- * event doesn't get missed.
- */
-int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset)
-{
-    struct domain *d = v->domain;
-    void *mapping;
-    vcpu_info_t *new_info;
-    struct page_info *page;
-    unsigned int align;
-
-    if ( offset > (PAGE_SIZE - sizeof(*new_info)) )
-        return -ENXIO;
-
-#ifdef CONFIG_COMPAT
-    BUILD_BUG_ON(sizeof(*new_info) != sizeof(new_info->compat));
-    if ( has_32bit_shinfo(d) )
-        align = alignof(new_info->compat);
-    else
-#endif
-        align = alignof(*new_info);
-    if ( offset & (align - 1) )
-        return -ENXIO;
-
-    if ( !mfn_eq(v->vcpu_info_mfn, INVALID_MFN) )
-        return -EINVAL;
-
-    /* Run this command on yourself or on other offline VCPUS. */
-    if ( (v != current) && !(v->pause_flags & VPF_down) )
-        return -EINVAL;
-
-    page = get_page_from_gfn(d, gfn, NULL, P2M_UNSHARE);
-    if ( !page )
-        return -EINVAL;
-
-    if ( !get_page_type(page, PGT_writable_page) )
-    {
-        put_page(page);
-        return -EINVAL;
-    }
-
-    mapping = __map_domain_page_global(page);
-    if ( mapping == NULL )
-    {
-        put_page_and_type(page);
-        return -ENOMEM;
-    }
-
-    new_info = (vcpu_info_t *)(mapping + offset);
-
-    if ( v->vcpu_info == &dummy_vcpu_info )
-    {
-        memset(new_info, 0, sizeof(*new_info));
-#ifdef XEN_HAVE_PV_UPCALL_MASK
-        __vcpu_info(v, new_info, evtchn_upcall_mask) = 1;
-#endif
-    }
-    else
-    {
-        memcpy(new_info, v->vcpu_info, sizeof(*new_info));
-    }
-
-    v->vcpu_info = new_info;
-    v->vcpu_info_mfn = page_to_mfn(page);
-
-    /* Set new vcpu_info pointer /before/ setting pending flags. */
-    smp_wmb();
-
-    /*
-     * Mark everything as being pending just to make sure nothing gets
-     * lost.  The domain will get a spurious event, but it can cope.
-     */
-#ifdef CONFIG_COMPAT
-    if ( !has_32bit_shinfo(d) )
-        write_atomic(&new_info->native.evtchn_pending_sel, ~0);
-    else
-#endif
-        write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
-    vcpu_mark_events_pending(v);
-
-    return 0;
-}
-
-/*
- * Unmap the vcpu info page if the guest decided to place it somewhere
- * else. This is used from domain_kill() and domain_soft_reset().
- */
-void unmap_vcpu_info(struct vcpu *v)
-{
-    mfn_t mfn = v->vcpu_info_mfn;
-
-    if ( mfn_eq(mfn, INVALID_MFN) )
-        return;
-
-    unmap_domain_page_global((void *)
-                             ((unsigned long)v->vcpu_info & PAGE_MASK));
-
-    vcpu_info_reset(v); /* NB: Clobbers v->vcpu_info_mfn */
-
-    put_page_and_type(mfn_to_page(mfn));
-}
-
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
@@ -1662,14 +1557,44 @@ int map_guest_area(struct vcpu *v, paddr
 
     domain_lock(d);
 
-    if ( map && populate )
-        populate(map, v);
+    /* No re-registration of the vCPU info area. */
+    if ( area != &v->vcpu_info_area || !area->pg )
+    {
+        if ( map && populate )
+            populate(map, v);
 
-    SWAP(area->pg, pg);
-    SWAP(area->map, map);
+        SWAP(area->pg, pg);
+        SWAP(area->map, map);
+    }
+    else
+        rc = -EBUSY;
 
     domain_unlock(d);
 
+    /* Set pending flags /after/ new vcpu_info pointer was set. */
+    if ( area == &v->vcpu_info_area && !rc )
+    {
+        /*
+         * Mark everything as being pending just to make sure nothing gets
+         * lost.  The domain will get a spurious event, but it can cope.
+         */
+#ifdef CONFIG_COMPAT
+        if ( !has_32bit_shinfo(d) )
+        {
+            vcpu_info_t *info = area->map;
+
+            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
+            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
+            write_atomic(&info->native.evtchn_pending_sel, ~0);
+        }
+        else
+#endif
+            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
+        vcpu_mark_events_pending(v);
+
+        force_update_vcpu_system_time(v);
+    }
+
     if ( v != current )
         vcpu_unpause(v);
 
@@ -1699,7 +1624,10 @@ void unmap_guest_area(struct vcpu *v, st
 
     domain_lock(d);
     map = area->map;
-    area->map = NULL;
+    if ( area == &v->vcpu_info_area )
+        vcpu_info_reset(v);
+    else
+        area->map = NULL;
     pg = area->pg;
     area->pg = NULL;
     domain_unlock(d);
@@ -1830,6 +1758,27 @@ bool update_runstate_area(struct vcpu *v
     return rc;
 }
 
+/*
+ * This makes sure that the vcpu_info is always pointing at a valid piece of
+ * memory, and it sets a pending event to make sure that a pending event
+ * doesn't get missed.
+ */
+static void cf_check
+vcpu_info_populate(void *map, struct vcpu *v)
+{
+    vcpu_info_t *info = map;
+
+    if ( v->vcpu_info_area.map == &dummy_vcpu_info )
+    {
+        memset(info, 0, sizeof(*info));
+#ifdef XEN_HAVE_PV_UPCALL_MASK
+        __vcpu_info(v, info, evtchn_upcall_mask) = 1;
+#endif
+    }
+    else
+        memcpy(info, v->vcpu_info_area.map, sizeof(*info));
+}
+
 static void cf_check
 runstate_area_populate(void *map, struct vcpu *v)
 {
@@ -1859,7 +1808,7 @@ long common_vcpu_op(int cmd, struct vcpu
     switch ( cmd )
     {
     case VCPUOP_initialise:
-        if ( v->vcpu_info == &dummy_vcpu_info )
+        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
             return -EINVAL;
 
         rc = arch_initialise_vcpu(v, arg);
@@ -1990,16 +1939,29 @@ long common_vcpu_op(int cmd, struct vcpu
     case VCPUOP_register_vcpu_info:
     {
         struct vcpu_register_vcpu_info info;
+        paddr_t gaddr;
 
         rc = -EFAULT;
         if ( copy_from_guest(&info, arg, 1) )
             break;
 
-        domain_lock(d);
-        rc = map_vcpu_info(v, info.mfn, info.offset);
-        domain_unlock(d);
+        rc = -EINVAL;
+        gaddr = gfn_to_gaddr(_gfn(info.mfn)) + info.offset;
+        if ( !~gaddr ||
+             gfn_x(gaddr_to_gfn(gaddr)) != info.mfn )
+            break;
 
-        force_update_vcpu_system_time(v);
+        /* Preliminary check only; see map_guest_area(). */
+        rc = -EBUSY;
+        if ( v->vcpu_info_area.pg )
+            break;
+
+        /* See the BUILD_BUG_ON() in vcpu_info_populate(). */
+        rc = map_guest_area(v, gaddr, sizeof(vcpu_info_t),
+                            &v->vcpu_info_area, vcpu_info_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
 
         break;
     }
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -80,9 +80,6 @@ void cf_check free_pirq_struct(void *);
 int  arch_vcpu_create(struct vcpu *v);
 void arch_vcpu_destroy(struct vcpu *v);
 
-int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset);
-void unmap_vcpu_info(struct vcpu *v);
-
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v));
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -176,7 +176,7 @@ struct vcpu
 
     int              processor;
 
-    vcpu_info_t     *vcpu_info;
+    struct guest_area vcpu_info_area;
 
     struct domain   *domain;
 
@@ -289,9 +289,6 @@ struct vcpu
 
     struct waitqueue_vcpu *waitqueue_vcpu;
 
-    /* Guest-specified relocation of vcpu_info. */
-    mfn_t            vcpu_info_mfn;
-
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
     /* vPCI per-vCPU area, used to store data for long running operations. */
--- a/xen/include/xen/shared.h
+++ b/xen/include/xen/shared.h
@@ -44,6 +44,7 @@ typedef struct vcpu_info vcpu_info_t;
 extern vcpu_info_t dummy_vcpu_info;
 
 #define shared_info(d, field)      __shared_info(d, (d)->shared_info, field)
-#define vcpu_info(v, field)        __vcpu_info(v, (v)->vcpu_info, field)
+#define vcpu_info(v, field)        \
+        __vcpu_info(v, (vcpu_info_t *)(v)->vcpu_info_area.map, field)
 
 #endif /* __XEN_SHARED_H__ */




* Re: [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown
  2023-09-28  7:01 ` [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown Jan Beulich
@ 2023-09-28  9:27   ` Roger Pau Monné
  0 siblings, 0 replies; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-28  9:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang

On Thu, Sep 28, 2023 at 09:01:53AM +0200, Jan Beulich wrote:
> While likely the guest would just re-register the same areas after
> a possible resume, let's not take this for granted and avoid the risk of
> otherwise corrupting guest memory.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-28  7:16 ` [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
@ 2023-09-28  9:51   ` Roger Pau Monné
  2023-09-28 10:11     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-28  9:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang, Tamas K Lengyel

On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, add the necessary fork handling (with the
> backing function yet to be filled in).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: Extend comment.
> 
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>  }
>  
> +static int copy_guest_area(struct guest_area *cd_area,
> +                           const struct guest_area *d_area,
> +                           struct vcpu *cd_vcpu,
> +                           const struct domain *d)
> +{
> +    mfn_t d_mfn, cd_mfn;
> +
> +    if ( !d_area->pg )
> +        return 0;
> +
> +    d_mfn = page_to_mfn(d_area->pg);
> +
> +    /* Allocate & map a page for the area if it hasn't been already. */
> +    if ( !cd_area->pg )
> +    {
> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> +        p2m_type_t p2mt;
> +        p2m_access_t p2ma;
> +        unsigned int offset;
> +        int ret;
> +
> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> +        {
> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> +
> +            if ( !pg )
> +                return -ENOMEM;
> +
> +            cd_mfn = page_to_mfn(pg);
> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> +
> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> +                                 p2m->default_access, -1);
> +            if ( ret )
> +                return ret;
> +        }

I'm still unsure why map_guest_area() shouldn't be able to deal with a
forked child needing the page to be mapped.  What happens when a
forked child executes the hypercall to map such areas against not yet
populated gfns?

Shouldn't map_guest_area() be capable of handling those calls and
populating on demand?

> +        else if ( p2mt != p2m_ram_rw )
> +            return -EBUSY;
> +
> +        /*
> +         * Map the area into the guest. For simplicity specify the entire range
> +         * up to the end of the page: All the function uses it for is to check
> +         * that the range doesn't cross page boundaries. Having the area mapped
> +         * in the original domain implies that it fits there and therefore will
> +         * also fit in the clone.
> +         */
> +        offset = PAGE_OFFSET(d_area->map);
> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> +                             PAGE_SIZE - offset, cd_area, NULL);
> +        if ( ret )
> +            return ret;
> +    }
> +    else
> +        cd_mfn = page_to_mfn(cd_area->pg);
> +
> +    copy_domain_page(cd_mfn, d_mfn);

I think the page copy should be done only once, when the page is
populated on the child p2m.  Otherwise areas smaller than a page size
(like vcpu_time_info_t) that share the same page will get multiple
copies of the same data for no reason.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] domain: map/unmap GADDR based shared guest areas
  2023-09-28  7:16 ` [PATCH v4 6/9] domain: map/unmap " Jan Beulich
@ 2023-09-28 10:04   ` Roger Pau Monné
  2023-09-28 10:14     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-28 10:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang

On Thu, Sep 28, 2023 at 09:16:48AM +0200, Jan Beulich wrote:
> The registration by virtual/linear address has downsides: At least on
> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
> PV domains the areas are inaccessible (and hence cannot be updated by
> Xen) when in guest-user mode, and for HVM guests they may be
> inaccessible when Meltdown mitigations are in place. (There are yet
> more issues.)
> 
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, flesh out the map/unmap functions.
> 
> Noteworthy differences from map_vcpu_info():
> - areas can be registered more than once (and de-registered),
> - remote vCPU-s are paused rather than checked for being down (which in
>   principle can change right after the check),
> - the domain lock is taken for a much smaller region.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-28  9:51   ` Roger Pau Monné
@ 2023-09-28 10:11     ` Jan Beulich
  2023-09-28 11:07       ` Roger Pau Monné
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28 10:11 UTC (permalink / raw)
  To: Roger Pau Monné, Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang

On 28.09.2023 11:51, Roger Pau Monné wrote:
> On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
>> --- a/xen/arch/x86/mm/mem_sharing.c
>> +++ b/xen/arch/x86/mm/mem_sharing.c
>> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>  }
>>  
>> +static int copy_guest_area(struct guest_area *cd_area,
>> +                           const struct guest_area *d_area,
>> +                           struct vcpu *cd_vcpu,
>> +                           const struct domain *d)
>> +{
>> +    mfn_t d_mfn, cd_mfn;
>> +
>> +    if ( !d_area->pg )
>> +        return 0;
>> +
>> +    d_mfn = page_to_mfn(d_area->pg);
>> +
>> +    /* Allocate & map a page for the area if it hasn't been already. */
>> +    if ( !cd_area->pg )
>> +    {
>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>> +        p2m_type_t p2mt;
>> +        p2m_access_t p2ma;
>> +        unsigned int offset;
>> +        int ret;
>> +
>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>> +        {
>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>> +
>> +            if ( !pg )
>> +                return -ENOMEM;
>> +
>> +            cd_mfn = page_to_mfn(pg);
>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>> +
>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>> +                                 p2m->default_access, -1);
>> +            if ( ret )
>> +                return ret;
>> +        }
> 
> I'm still unsure why map_guest_area() shouldn't be able to deal with a
> forked child needing the page to be mapped.  What happens when a
> forked child executes the hypercall to map such areas against not yet
> populated gfns?
> 
> Shouldn't map_guest_area() be capable of handling those calls and
> populating on demand?

Funny you should use this wording: P2M_ALLOC will deal with populating
PoD slots, yes, but aiui forks don't fill their p2m with all PoD slots,
but rather leave them empty. Yet again I need to leave it to Tamas to
confirm or prove me wrong.

>> +        else if ( p2mt != p2m_ram_rw )
>> +            return -EBUSY;
>> +
>> +        /*
>> +         * Map the area into the guest. For simplicity specify the entire range
>> +         * up to the end of the page: All the function uses it for is to check
>> +         * that the range doesn't cross page boundaries. Having the area mapped
>> +         * in the original domain implies that it fits there and therefore will
>> +         * also fit in the clone.
>> +         */
>> +        offset = PAGE_OFFSET(d_area->map);
>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>> +                             PAGE_SIZE - offset, cd_area, NULL);
>> +        if ( ret )
>> +            return ret;
>> +    }
>> +    else
>> +        cd_mfn = page_to_mfn(cd_area->pg);
>> +
>> +    copy_domain_page(cd_mfn, d_mfn);
> 
> I think the page copy should be done only once, when the page is
> populated on the child p2m.  Otherwise areas smaller than a page size
> (like vcpu_time_info_t) that share the same page will get multiple
> copies of the same data for no reason.

I think you're right, but this would then be another issue in the original
code that I merely didn't spot (and it's not just "copy for no reason",
we'd actually corrupt what was put there before). IOW the copying needs to
move ahead of map_guest_area() (or yet more precisely after the error
checking for p2m->set_entry()), and in the original code it would have
needed to live ahead of map_vcpu_info(). Once again I'd like Tamas to
confirm (or otherwise) before making that change, though.
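
Roughly - untested, and still subject to that confirmation - I'd expect
the allocation path to end up along these lines, with the trailing
copy_domain_page() dropped:

            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
                                 p2m->default_access, -1);
            if ( ret )
                return ret;

            /* Copy exactly once, right after the fresh page was installed. */
            copy_domain_page(cd_mfn, d_mfn);
        }
        else if ( p2mt != p2m_ram_rw )
            return -EBUSY;

        /* map_guest_area() as before; no copying once the page exists. */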

Jan


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] domain: map/unmap GADDR based shared guest areas
  2023-09-28 10:04   ` Roger Pau Monné
@ 2023-09-28 10:14     ` Jan Beulich
  2023-09-28 11:10       ` Roger Pau Monné
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28 10:14 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang

On 28.09.2023 12:04, Roger Pau Monné wrote:
> On Thu, Sep 28, 2023 at 09:16:48AM +0200, Jan Beulich wrote:
>> The registration by virtual/linear address has downsides: At least on
>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>> PV domains the areas are inaccessible (and hence cannot be updated by
>> Xen) when in guest-user mode, and for HVM guests they may be
>> inaccessible when Meltdown mitigations are in place. (There are yet
>> more issues.)
>>
>> In preparation of the introduction of new vCPU operations allowing to
>> register the respective areas (one of the two is x86-specific) by
>> guest-physical address, flesh out the map/unmap functions.
>>
>> Noteworthy differences from map_vcpu_info():
>> - areas can be registered more than once (and de-registered),
>> - remote vCPU-s are paused rather than checked for being down (which in
>>   principle can change right after the check),
>> - the domain lock is taken for a much smaller region.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, but to be clear: Formally this doesn't help this patch make any
progress, aiui. I'll still need an A-b by a REST maintainer then. An R-b
from you would be different in this regard.

Jan


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
  2023-09-28 10:11     ` Jan Beulich
@ 2023-09-28 11:07       ` Roger Pau Monné
       [not found]         ` <CABfawhmAZGTaKZfmwY4t8599i6qKaTOJ-fngFmtvS4LhJMh7iA@mail.gmail.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-28 11:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tamas K Lengyel, xen-devel, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Henry Wang

On Thu, Sep 28, 2023 at 12:11:02PM +0200, Jan Beulich wrote:
> On 28.09.2023 11:51, Roger Pau Monné wrote:
> > On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
> >> --- a/xen/arch/x86/mm/mem_sharing.c
> >> +++ b/xen/arch/x86/mm/mem_sharing.c
> >> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> >>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>  }
> >>  
> >> +static int copy_guest_area(struct guest_area *cd_area,
> >> +                           const struct guest_area *d_area,
> >> +                           struct vcpu *cd_vcpu,
> >> +                           const struct domain *d)
> >> +{
> >> +    mfn_t d_mfn, cd_mfn;
> >> +
> >> +    if ( !d_area->pg )
> >> +        return 0;
> >> +
> >> +    d_mfn = page_to_mfn(d_area->pg);
> >> +
> >> +    /* Allocate & map a page for the area if it hasn't been already. */
> >> +    if ( !cd_area->pg )
> >> +    {
> >> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >> +        p2m_type_t p2mt;
> >> +        p2m_access_t p2ma;
> >> +        unsigned int offset;
> >> +        int ret;
> >> +
> >> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >> +        {
> >> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >> +
> >> +            if ( !pg )
> >> +                return -ENOMEM;
> >> +
> >> +            cd_mfn = page_to_mfn(pg);
> >> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >> +
> >> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >> +                                 p2m->default_access, -1);
> >> +            if ( ret )
> >> +                return ret;
> >> +        }
> > 
> > I'm still unsure why map_guest_area() shouldn't be able to deal with a
> > forked child needing the page to be mapped.  What happens when a
> > forked child executes the hypercall to map such areas against not yet
> > populated gfns?
> > 
> > Shouldn't map_guest_area() be capable of handling those calls and
> > populating on demand?
> 
> Funny you should use this wording: P2M_ALLOC will deal with populating
> PoD slots, yes, but aiui forks don't fill their p2m with all PoD slots,
> but rather leave them empty. Yet again I need to leave it to Tamas to
> confirm or prove me wrong.

If the child p2m populating is only triggered by guest accesses then a
lot of hypercalls are likely to not work correctly on children.

> 
> >> +        else if ( p2mt != p2m_ram_rw )
> >> +            return -EBUSY;
> >> +
> >> +        /*
> >> +         * Map the area into the guest. For simplicity specify the entire range
> >> +         * up to the end of the page: All the function uses it for is to check
> >> +         * that the range doesn't cross page boundaries. Having the area mapped
> >> +         * in the original domain implies that it fits there and therefore will
> >> +         * also fit in the clone.
> >> +         */
> >> +        offset = PAGE_OFFSET(d_area->map);
> >> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> >> +                             PAGE_SIZE - offset, cd_area, NULL);
> >> +        if ( ret )
> >> +            return ret;
> >> +    }
> >> +    else
> >> +        cd_mfn = page_to_mfn(cd_area->pg);
> >> +
> >> +    copy_domain_page(cd_mfn, d_mfn);
> > 
> > I think the page copy should be done only once, when the page is
> > populated on the child p2m.  Otherwise areas smaller than a page size
> > (like vpcu_time_info_t) that share the same page will get multiple
> > copies of the same data for no reason.
> 
> I think you're right, but this would then be another issue in the original
> code that I merely didn't spot (and it's not just "copy for no reason",
> we'd actually corrupt what was put there before). IOW the copying needs to
> move ahead of map_guest_area() (or yet more precisely after the error
> checking for p2m->set_entry()), and in the original code it would have
> needed to live ahead of map_vcpu_info(). Once again I'd like Tamas to
> confirm (or otherwise) before making that change, though.

Yes, it's already an issue in the current code.  I wonder whether
logic in the guest or Xen could malfunction due to the fact that
map_vcpu_info() unconditionally sets evtchn_upcall_pending and injects
an event channel upcall, but the later call to copy_domain_page()
might unset evtchn_upcall_pending while the vector is already injected.
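
IOW for a freshly forked vCPU the sequence is effectively (hypothetical
pseudo-flow, argument names only for illustration):

    map_vcpu_info(cd_vcpu, gfn, offset); /* sets evtchn_upcall_pending and
                                            injects the upcall vector */
    copy_domain_page(cd_mfn, d_mfn);     /* may copy a clear flag from the
                                            parent, while the vector is
                                            already pending in the child */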

Thanks, Roger.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] domain: map/unmap GADDR based shared guest areas
  2023-09-28 10:14     ` Jan Beulich
@ 2023-09-28 11:10       ` Roger Pau Monné
  0 siblings, 0 replies; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-28 11:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang

On Thu, Sep 28, 2023 at 12:14:17PM +0200, Jan Beulich wrote:
> On 28.09.2023 12:04, Roger Pau Monné wrote:
> > On Thu, Sep 28, 2023 at 09:16:48AM +0200, Jan Beulich wrote:
> >> The registration by virtual/linear address has downsides: At least on
> >> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
> >> PV domains the areas are inaccessible (and hence cannot be updated by
> >> Xen) when in guest-user mode, and for HVM guests they may be
> >> inaccessible when Meltdown mitigations are in place. (There are yet
> >> more issues.)
> >>
> >> In preparation of the introduction of new vCPU operations allowing to
> >> register the respective areas (one of the two is x86-specific) by
> >> guest-physical address, flesh out the map/unmap functions.
> >>
> >> Noteworthy differences from map_vcpu_info():
> >> - areas can be registered more than once (and de-registered),
> >> - remote vCPU-s are paused rather than checked for being down (which in
> >>   principle can change right after the check),
> >> - the domain lock is taken for a much smaller region.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks, but to be clear: Formally this doesn't help this patch make any
> progress, aiui. I'll still need an A-b by a REST maintainer then. An R-b
> from you would be different in this regard.

I see.

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 3/9] domain: update GADDR based runstate guest area
  2023-09-28  7:15 ` [PATCH v4 3/9] domain: update GADDR based runstate guest area Jan Beulich
@ 2023-09-28 12:15   ` Julien Grall
  0 siblings, 0 replies; 22+ messages in thread
From: Julien Grall @ 2023-09-28 12:15 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	Roger Pau Monné,
	Henry Wang

Hi Jan,

On 28/09/2023 08:15, Jan Beulich wrote:
> Before adding a new vCPU operation to register the runstate area by
> guest-physical address, add code to actually keep such areas up-to-date.
> 
> Note that updating of the area will be done exclusively following the
> model enabled by VMASST_TYPE_runstate_update_flag for virtual-address
> based registered areas.
> 
> Note further that pages aren't marked dirty when written to (matching
> the handling of space mapped by map_vcpu_info()), on the basis that the
> registrations are lost anyway across migration (or would need re-
> populating at the target for transparent migration). Plus the contents
> of the areas in question have to be deemed volatile in the first place
> (so saving a "most recent" value is pretty meaningless even for e.g.
> snapshotting).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
       [not found]         ` <CABfawhmAZGTaKZfmwY4t8599i6qKaTOJ-fngFmtvS4LhJMh7iA@mail.gmail.com>
@ 2023-09-28 13:19           ` Jan Beulich
       [not found]             ` <CABfawhkK7LVQKhTtCAMpoaGLH24SwpyEAJGpu-PohR5e6W=pMw@mail.gmail.com>
  2023-09-28 14:08           ` Roger Pau Monné
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2023-09-28 13:19 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang, Roger Pau Monné

On 28.09.2023 14:57, Tamas K Lengyel wrote:
> On Thu, Sep 28, 2023 at 7:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>> On Thu, Sep 28, 2023 at 12:11:02PM +0200, Jan Beulich wrote:
>>> On 28.09.2023 11:51, Roger Pau Monné wrote:
>>>> On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
>>>>> +        /*
>>>>> +         * Map the area into the guest. For simplicity specify the entire range
>>>>> +         * up to the end of the page: All the function uses it for is to check
>>>>> +         * that the range doesn't cross page boundaries. Having the area mapped
>>>>> +         * in the original domain implies that it fits there and therefore will
>>>>> +         * also fit in the clone.
>>>>> +         */
>>>>> +        offset = PAGE_OFFSET(d_area->map);
>>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>>>>> +                             PAGE_SIZE - offset, cd_area, NULL);
>>>>> +        if ( ret )
>>>>> +            return ret;
>>>>> +    }
>>>>> +    else
>>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
>>>>> +
>>>>> +    copy_domain_page(cd_mfn, d_mfn);
>>>>
>>>> I think the page copy should be done only once, when the page is
>>>> populated on the child p2m.  Otherwise areas smaller than a page size
>>>> (like vcpu_time_info_t) that share the same page will get multiple
>>>> copies of the same data for no reason.
>>>
>>> I think you're right, but this would then be another issue in the original
>>> code that I merely didn't spot (and it's not just "copy for no reason",
>>> we'd actually corrupt what was put there before). IOW the copying needs to
>>> move ahead of map_guest_area() (or yet more precisely after the error
>>> checking for p2m->set_entry()), and in the original code it would have
>>> needed to live ahead of map_vcpu_info(). Once again I'd like Tamas to
>>> confirm (or otherwise) before making that change, though.
>>
>> Yes, it's already an issue in the current code.  I wonder whether
>> logic in the guest or Xen could malfunction due to the fact that
>> map_vcpu_info() unconditionally sets evtchn_upcall_pending and injects
>> an event channel upcall, but the later call to copy_domain_page()
>> might unset evtchn_upcall_pending while the vector is already injected.
> 
> Sorry but I really don't follow the discussion here. My understanding
> was that map_vcpu_info, as its name suggests, maps the page. We use it
> to map a new page into that position in case the fork hasn't set it up
> yet but the parent has one. Then we follow with the copy from the
> parent so the page content is matching. If there is already a
> vcpu_info page in the fork, we just do the copy.
> 
> Now, if map_vcpu_info does more than mapping, then I don't know what
> it does, why it does it, and what happens if we skip it when the fork
> is reset, for example. Is the suggestion to call map_vcpu_info() every
> time the page content is reset (i.e. after the copy)?

The vCPU info area (already prior to this series) and the two other areas
can be updated by the hypervisor at any time. Once one such area is
registered within a certain page, if another such area happens to live in
the same page, copying the entire page again would overwrite all updates
that might already have been made for the first area. IOW copying ought
to - imo - happen exactly once, when the new page is allocated.

As to map_vcpu_info() - just look at the function: It writes to the newly
registered area. Even if the function name says just "map", that's an
integral part of the operation. We can't just map it, but leave the area
untouched.
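
(From memory, the relevant tail of that function is roughly

    write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
    vcpu_info(v, evtchn_upcall_pending) = 1;
    arch_evtchn_inject(v);

i.e. besides establishing the mapping the area is written to and an
upcall is raised right away.)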

Jan


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
       [not found]         ` <CABfawhmAZGTaKZfmwY4t8599i6qKaTOJ-fngFmtvS4LhJMh7iA@mail.gmail.com>
  2023-09-28 13:19           ` Jan Beulich
@ 2023-09-28 14:08           ` Roger Pau Monné
       [not found]             ` <CABfawhmfFuB14CiPdyaP4HNayms0AG5=4h1FNURu4MDgt2HzAg@mail.gmail.com>
  1 sibling, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-28 14:08 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Jan Beulich, xen-devel, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Henry Wang

On Thu, Sep 28, 2023 at 08:57:04AM -0400, Tamas K Lengyel wrote:
> On Thu, Sep 28, 2023 at 7:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > On Thu, Sep 28, 2023 at 12:11:02PM +0200, Jan Beulich wrote:
> > > On 28.09.2023 11:51, Roger Pau Monné wrote:
> > > > On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
> > > >> --- a/xen/arch/x86/mm/mem_sharing.c
> > > >> +++ b/xen/arch/x86/mm/mem_sharing.c
> > > >> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> > > >>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> > > >>  }
> > > >>
> > > >> +static int copy_guest_area(struct guest_area *cd_area,
> > > >> +                           const struct guest_area *d_area,
> > > >> +                           struct vcpu *cd_vcpu,
> > > >> +                           const struct domain *d)
> > > >> +{
> > > >> +    mfn_t d_mfn, cd_mfn;
> > > >> +
> > > >> +    if ( !d_area->pg )
> > > >> +        return 0;
> > > >> +
> > > >> +    d_mfn = page_to_mfn(d_area->pg);
> > > >> +
> > > >> +    /* Allocate & map a page for the area if it hasn't been already. */
> > > >> +    if ( !cd_area->pg )
> > > >> +    {
> > > >> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> > > >> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> > > >> +        p2m_type_t p2mt;
> > > >> +        p2m_access_t p2ma;
> > > >> +        unsigned int offset;
> > > >> +        int ret;
> > > >> +
> > > >> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> > > >> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> > > >> +        {
> > > >> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> > > >> +
> > > >> +            if ( !pg )
> > > >> +                return -ENOMEM;
> > > >> +
> > > >> +            cd_mfn = page_to_mfn(pg);
> > > >> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> > > >> +
> > > >> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> > > >> +                                 p2m->default_access, -1);
> > > >> +            if ( ret )
> > > >> +                return ret;
> > > >> +        }
> > > >
> > > > I'm still unsure why map_guest_area() shouldn't be able to deal with a
> > > > forked child needing the page to be mapped.  What happens when a
> > > > forked child executes the hypercall to map such areas against not yet
> > > > populated gfns?
> > > >
> > > > Shouldn't map_guest_area() be capable of handling those calls and
> > > > populating on demand?
> > >
> > > Funny you should use this wording: P2M_ALLOC will deal with populating
> > > PoD slots, yes, but aiui forks don't fill their p2m with all PoD slots,
> > > but rather leave them empty. Yet again I need to leave it to Tamas to
> > > confirm or prove me wrong.
> >
> > If the child p2m populating is only triggered by guest accesses then a
> > lot of hypercalls are likely to not work correctly on children.
> 
> That's not what's happening. As I said before, ALL access paths, be
> that from the guest, dom0 or Xen, trigger page population. If there is
> a hole and P2M_ALLOC is set we do the following in
> p2m_get_gfn_type_access:
> 
>     /* Check if we need to fork the page */
>     if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
>          !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
>         mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> 
> Depending on the type of access we either populate the hole with a
> shared memory entry or a copy.

Then the code here is redundant, as the call to get_page_from_gfn(...,
P2M_UNSHARE) in map_vcpu_info() will already take care of populating
the child p2m and copying the parent's page contents?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
       [not found]             ` <CABfawhmfFuB14CiPdyaP4HNayms0AG5=4h1FNURu4MDgt2HzAg@mail.gmail.com>
@ 2023-09-29 14:17               ` Roger Pau Monné
  0 siblings, 0 replies; 22+ messages in thread
From: Roger Pau Monné @ 2023-09-29 14:17 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Jan Beulich, xen-devel, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Henry Wang

On Fri, Sep 29, 2023 at 08:31:58AM -0400, Tamas K Lengyel wrote:
> On Thu, Sep 28, 2023 at 10:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > On Thu, Sep 28, 2023 at 08:57:04AM -0400, Tamas K Lengyel wrote:
> > > On Thu, Sep 28, 2023 at 7:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > > >
> > > > On Thu, Sep 28, 2023 at 12:11:02PM +0200, Jan Beulich wrote:
> > > > > On 28.09.2023 11:51, Roger Pau Monné wrote:
> > > > > > On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
> > > > > >> --- a/xen/arch/x86/mm/mem_sharing.c
> > > > > >> +++ b/xen/arch/x86/mm/mem_sharing.c
> > > > > >> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> > > > > >>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> > > > > >>  }
> > > > > >>
> > > > > >> +static int copy_guest_area(struct guest_area *cd_area,
> > > > > >> +                           const struct guest_area *d_area,
> > > > > >> +                           struct vcpu *cd_vcpu,
> > > > > >> +                           const struct domain *d)
> > > > > >> +{
> > > > > >> +    mfn_t d_mfn, cd_mfn;
> > > > > >> +
> > > > > >> +    if ( !d_area->pg )
> > > > > >> +        return 0;
> > > > > >> +
> > > > > >> +    d_mfn = page_to_mfn(d_area->pg);
> > > > > >> +
> > > > > >> +    /* Allocate & map a page for the area if it hasn't been already. */
> > > > > >> +    if ( !cd_area->pg )
> > > > > >> +    {
> > > > > >> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> > > > > >> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> > > > > >> +        p2m_type_t p2mt;
> > > > > >> +        p2m_access_t p2ma;
> > > > > >> +        unsigned int offset;
> > > > > >> +        int ret;
> > > > > >> +
> > > > > >> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> > > > > >> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> > > > > >> +        {
> > > > > >> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> > > > > >> +
> > > > > >> +            if ( !pg )
> > > > > >> +                return -ENOMEM;
> > > > > >> +
> > > > > >> +            cd_mfn = page_to_mfn(pg);
> > > > > >> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> > > > > >> +
> > > > > >> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> > > > > >> +                                 p2m->default_access, -1);
> > > > > >> +            if ( ret )
> > > > > >> +                return ret;
> > > > > >> +        }
> > > > > >
> > > > > > I'm still unsure why map_guest_area() shouldn't be able to deal with a
> > > > > > forked child needing the page to be mapped.  What happens when a
> > > > > > forked child executes the hypercall to map such areas against not yet
> > > > > > populated gfns?
> > > > > >
> > > > > > Shouldn't map_guest_area() be capable of handling those calls and
> > > > > > populating on demand?
> > > > >
> > > > > Funny you should use this wording: P2M_ALLOC will deal with populating
> > > > > PoD slots, yes, but aiui forks don't fill their p2m with all PoD slots,
> > > > > but rather leave them empty. Yet again I need to leave it to Tamas to
> > > > > confirm or prove me wrong.
> > > >
> > > > If the child p2m populating is only triggered by guest accesses then a
> > > > lot of hypercalls are likely to not work correctly on children.
> > >
> > > That's not what's happening. As I said before, ALL access paths, be
> > > that from the guest, dom0 or Xen, trigger page population. If there is
> > > a hole and P2M_ALLOC is set we do the following in
> > > p2m_get_gfn_type_access:
> > >
> > >     /* Check if we need to fork the page */
> > >     if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> > >          !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> > >         mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > >
> > > Depending on the type of access we either populate the hole with a
> > > shared memory entry or a copy.
> >
> > Then the code here is redundant, as the call to get_page_from_gfn(...,
> > P2M_UNSHARE) in map_vcpu_info() will already take care of populating
> > the child p2m and copying the parent's page contents?
> 
> Reading the code now, yes, calling map_vcpu_info() would take care of
> populating the page for us as well so we can remove that code from
> mem_sharing.

Thanks for confirming.  Will try to prepare a patch next week to get
rid of the explicit child p2m populate for the vcpu_info page, and
hopefully simplify the code here as a result.
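
The rough shape I have in mind - untested, and assuming map_guest_area()'s
page lookup populates the child p2m and copies the parent's content just
like the get_page_from_gfn(..., P2M_UNSHARE) path in map_vcpu_info() does -
would be:

    static int copy_guest_area(struct guest_area *cd_area,
                               const struct guest_area *d_area,
                               struct vcpu *cd_vcpu,
                               const struct domain *d)
    {
        gfn_t gfn;
        unsigned int offset;

        /* No parent area, or the child already has one: nothing to do. */
        if ( !d_area->pg || cd_area->pg )
            return 0;

        gfn = mfn_to_gfn(d, page_to_mfn(d_area->pg));
        offset = PAGE_OFFSET(d_area->map);

        return map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
                              PAGE_SIZE - offset, cd_area, NULL);
    }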

Roger.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas
       [not found]             ` <CABfawhkK7LVQKhTtCAMpoaGLH24SwpyEAJGpu-PohR5e6W=pMw@mail.gmail.com>
@ 2023-10-16  9:50               ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2023-10-16  9:50 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Henry Wang, Roger Pau Monné

On 29.09.2023 15:01, Tamas K Lengyel wrote:
> On Thu, Sep 28, 2023 at 9:19 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 28.09.2023 14:57, Tamas K Lengyel wrote:
>>> On Thu, Sep 28, 2023 at 7:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>>>> On Thu, Sep 28, 2023 at 12:11:02PM +0200, Jan Beulich wrote:
>>>>> On 28.09.2023 11:51, Roger Pau Monné wrote:
>>>>>> On Thu, Sep 28, 2023 at 09:16:20AM +0200, Jan Beulich wrote:
>>>>>>> +        /*
>>>>>>> +         * Map the area into the guest. For simplicity specify the entire range
>>>>>>> +         * up to the end of the page: All the function uses it for is to check
>>>>>>> +         * that the range doesn't cross page boundaries. Having the area mapped
>>>>>>> +         * in the original domain implies that it fits there and therefore will
>>>>>>> +         * also fit in the clone.
>>>>>>> +         */
>>>>>>> +        offset = PAGE_OFFSET(d_area->map);
>>>>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>>>>>>> +                             PAGE_SIZE - offset, cd_area, NULL);
>>>>>>> +        if ( ret )
>>>>>>> +            return ret;
>>>>>>> +    }
>>>>>>> +    else
>>>>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
>>>>>>> +
>>>>>>> +    copy_domain_page(cd_mfn, d_mfn);
>>>>>>
>>>>>> I think the page copy should be done only once, when the page is
>>>>>> populated on the child p2m.  Otherwise areas smaller than a page size
>>>>>> (like vcpu_time_info_t) that share the same page will get multiple
>>>>>> copies of the same data for no reason.
>>>>>
>>>>> I think you're right, but this would then be another issue in the original
>>>>> code that I merely didn't spot (and it's not just "copy for no reason",
>>>>> we'd actually corrupt what was put there before). IOW the copying needs to
>>>>> move ahead of map_guest_area() (or yet more precisely after the error
>>>>> checking for p2m->set_entry()), and in the original code it would have
>>>>> needed to live ahead of map_vcpu_info(). Once again I'd like Tamas to
>>>>> confirm (or otherwise) before making that change, though.
>>>>
>>>> Yes, it's already an issue in the current code.  I wonder whether
>>>> logic in the guest or Xen could malfunction due to the fact that
>>>> map_vcpu_info() unconditionally sets evtchn_upcall_pending and injects
>>>> an event channel upcall, but the later call to copy_domain_page()
>>>> might unset evtchn_upcall_pending while the vector is already injected.
>>>
>>> Sorry but I really don't follow the discussion here. My understanding
>>> was that map_vcpu_info, as its name suggests, maps the page. We use it
>>> to map a new page into that position in case the fork hasn't set it up
>>> yet but the parent has one. Then we follow with the copy from the
>>> parent so the page content is matching. If there is already a
>>> vcpu_info page in the fork, we just do the copy.
>>>
>>> Now, if map_vcpu_info does more than mapping, then I don't know what
>>> it does, why it does it, and what happens if we skip it when the fork
>>> is reset, for example. Is the suggestion to call map_vcpu_info() every
>>> time the page content is reset (i.e. after the copy)?
>>
>> The vCPU info area (already prior to this series) and the two other areas
>> can be updated by the hypervisor at any time. Once one such area is
>> registered within a certain page, if another such area happens to live in
>> the same page, copying the entire page again would overwrite all updates
>> that might already have been made for the first area. IOW copying ought
>> to - imo - happen exactly once, when the new page is allocated.
>>
>> As to map_vcpu_info() - just look at the function: It writes to the newly
>> registered area. Even if the function name says just "map", that's an
>> integral part of the operation. We can't just map it, but leave the area
>> untouched.
> 
> The domains are paused (both the parent and the child) when a new fork
> is being made and also during fork reset. If Xen really does write to
> these memory areas "any time", even if the domain is paused, we will
> need a lock to tell Xen not to touch it because it hasn't finished being
> constructed/reset yet.

It's not really "any" time, but "any time the vCPU is running (or about to run)".

> Moreover, not sure what Xen is writing to these
> areas, but if it's anything "stateful" it should be discarded or
> copied from the parent because the guest state must match the parent
> after the fork/reset op.

How so? Both are running independently once unpaused, and the data
written is derived from various bits of system and guest state. I see
no reason why the initial populating would need to clone the parent's
contents when immediately afterwards the data would be overwritten by
child-specific values.
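
(For a concrete example, the runstate area - abridged from
xen/include/public/vcpu.h - is just

    struct vcpu_runstate_info {
        int state;                 /* RUNSTATE_running / runnable / blocked / offline */
        uint64_t state_entry_time; /* system time at which this state was entered */
        uint64_t time[4];          /* ns spent in each of the four states */
    };

all of which Xen re-writes with child-local values as soon as the child
vCPU gets scheduled.)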

Jan


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-10-16  9:51 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-28  6:59 [PATCH v4 0/9] runstate/time area registration by (guest) physical address Jan Beulich
2023-09-28  7:01 ` [PATCH v4 1/9] x86/shim: zap runstate and time area handles during shutdown Jan Beulich
2023-09-28  9:27   ` Roger Pau Monné
2023-09-28  7:14 ` [PATCH v4 2/9] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
2023-09-28  7:15 ` [PATCH v4 3/9] domain: update GADDR based runstate guest area Jan Beulich
2023-09-28 12:15   ` Julien Grall
2023-09-28  7:15 ` [PATCH v4 4/9] x86: update GADDR based secondary time area Jan Beulich
2023-09-28  7:16 ` [PATCH v4 5/9] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
2023-09-28  9:51   ` Roger Pau Monné
2023-09-28 10:11     ` Jan Beulich
2023-09-28 11:07       ` Roger Pau Monné
     [not found]         ` <CABfawhmAZGTaKZfmwY4t8599i6qKaTOJ-fngFmtvS4LhJMh7iA@mail.gmail.com>
2023-09-28 13:19           ` Jan Beulich
     [not found]             ` <CABfawhkK7LVQKhTtCAMpoaGLH24SwpyEAJGpu-PohR5e6W=pMw@mail.gmail.com>
2023-10-16  9:50               ` Jan Beulich
2023-09-28 14:08           ` Roger Pau Monné
     [not found]             ` <CABfawhmfFuB14CiPdyaP4HNayms0AG5=4h1FNURu4MDgt2HzAg@mail.gmail.com>
2023-09-29 14:17               ` Roger Pau Monné
2023-09-28  7:16 ` [PATCH v4 6/9] domain: map/unmap " Jan Beulich
2023-09-28 10:04   ` Roger Pau Monné
2023-09-28 10:14     ` Jan Beulich
2023-09-28 11:10       ` Roger Pau Monné
2023-09-28  7:17 ` [PATCH v4 7/9] domain: introduce GADDR based runstate area registration alternative Jan Beulich
2023-09-28  7:17 ` [PATCH v4 8/9] x86: introduce GADDR based secondary time " Jan Beulich
2023-09-28  7:18 ` [PATCH v4 9/9] common: convert vCPU info area registration Jan Beulich
