* [PATCH v2 0/8] runstate/time area registration by (guest) physical address
From: Jan Beulich @ 2023-01-23 14:51 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

Since it was indicated that introducing specific new vCPU ops may be
beneficial independently of the introduction of a fully physical-
address-based ABI flavor, here we go. There continue to be a number of
open questions throughout the series; resolving them is one of the
main goals of this v2 posting.

1: domain: GADDR based shared guest area registration alternative - teardown
2: domain: update GADDR based runstate guest area
3: x86: update GADDR based secondary time area
4: x86/mem-sharing: copy GADDR based shared guest areas
5: domain: map/unmap GADDR based shared guest areas
6: domain: introduce GADDR based runstate area registration alternative
7: x86: introduce GADDR based secondary time area registration alternative
8: common: convert vCPU info area registration

Jan



* [PATCH RFC v2 1/8] domain: GADDR based shared guest area registration alternative - teardown
From: Jan Beulich @ 2023-01-23 14:53 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

In preparation for the introduction of new vCPU operations allowing the
respective areas (one of the two is x86-specific) to be registered by
guest-physical address, add the necessary domain cleanup hooks.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
---
RFC: Zapping the areas in pv_shim_shutdown() may not be strictly
     necessary: As I understand it, unmap_vcpu_info() is called only
     because the vCPU info area cannot be re-registered. Beyond that I
     guess the assumption is that the areas would only be re-registered
     as they were before. If that's not the case, I wonder whether the
     guest handles for both areas shouldn't also be zapped.
---
v2: Add assertion in unmap_guest_area().

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1014,7 +1014,10 @@ int arch_domain_soft_reset(struct domain
     }
 
     for_each_vcpu ( d, v )
+    {
         set_xen_guest_handle(v->arch.time_info_guest, NULL);
+        unmap_guest_area(v, &v->arch.time_guest_area);
+    }
 
  exit_put_gfn:
     put_gfn(d, gfn_x(gfn));
@@ -2329,6 +2332,8 @@ int domain_relinquish_resources(struct d
             if ( ret )
                 return ret;
 
+            unmap_guest_area(v, &v->arch.time_guest_area);
+
             vpmu_destroy(v);
         }
 
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -658,6 +658,7 @@ struct arch_vcpu
 
     /* A secondary copy of the vcpu time info. */
     XEN_GUEST_HANDLE(vcpu_time_info_t) time_info_guest;
+    struct guest_area time_guest_area;
 
     struct arch_vm_event *vm_event;
 
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -394,8 +394,10 @@ int pv_shim_shutdown(uint8_t reason)
 
     for_each_vcpu ( d, v )
     {
-        /* Unmap guest vcpu_info pages. */
+        /* Unmap guest vcpu_info page and runstate/time areas. */
         unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->runstate_guest_area);
+        unmap_guest_area(v, &v->arch.time_guest_area);
 
         /* Reset the periodic timer to the default value. */
         vcpu_set_periodic_timer(v, MILLISECS(10));
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -963,7 +963,10 @@ int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
         for_each_vcpu ( d, v )
+        {
             unmap_vcpu_info(v);
+            unmap_guest_area(v, &v->runstate_guest_area);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings 
          * have to be put before we call put_domain. */
@@ -1417,6 +1420,7 @@ int domain_soft_reset(struct domain *d,
     {
         set_xen_guest_handle(runstate_guest(v), NULL);
         unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->runstate_guest_area);
     }
 
     rc = arch_domain_soft_reset(d);
@@ -1568,6 +1572,19 @@ void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+/*
+ * This is only intended to be used for domain cleanup (or more generally only
+ * with at least the respective vCPU, if it's not the current one, reliably
+ * paused).
+ */
+void unmap_guest_area(struct vcpu *v, struct guest_area *area)
+{
+    struct domain *d = v->domain;
+
+    if ( v != current )
+        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
+}
+
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct vcpu_guest_context *ctxt;
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -5,6 +5,12 @@
 #include <xen/types.h>
 
 #include <public/xen.h>
+
+struct guest_area {
+    struct page_info *pg;
+    void *map;
+};
+
 #include <asm/domain.h>
 #include <asm/numa.h>
 
@@ -76,6 +82,11 @@ void arch_vcpu_destroy(struct vcpu *v);
 int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset);
 void unmap_vcpu_info(struct vcpu *v);
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v));
+void unmap_guest_area(struct vcpu *v, struct guest_area *area);
+
 int arch_domain_create(struct domain *d,
                        struct xen_domctl_createdomain *config,
                        unsigned int flags);
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -202,6 +202,7 @@ struct vcpu
         XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
     } runstate_guest; /* guest address */
 #endif
+    struct guest_area runstate_guest_area;
     unsigned int     new_state;
 
     /* Has the FPU been initialised? */




* [PATCH RFC v2 2/8] domain: update GADDR based runstate guest area
From: Jan Beulich @ 2023-01-23 14:53 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

Before adding a new vCPU operation to register the runstate area by
guest-physical address, add code to actually keep such areas up-to-date.

Note that the area will be updated exclusively following the model which
VMASST_TYPE_runstate_update_flag enables for virtual-address-based
registered areas.
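
For illustration, a guest-side consumer of such an area would pair with
this update model roughly as follows (a minimal sketch, not part of the
patch; read_barrier() stands in for whatever read fence the guest
environment provides, pairing with the smp_wmb()s on the Xen side):

    /* Take a consistent snapshot of a physically registered runstate area. */
    static void read_runstate_area(const volatile struct vcpu_runstate_info *area,
                                   struct vcpu_runstate_info *snap)
    {
        uint64_t t1, t2;

        do {
            t1 = area->state_entry_time;
            read_barrier();
            *snap = *(const struct vcpu_runstate_info *)area;
            read_barrier();
            t2 = area->state_entry_time;
            /* Retry while an update is in flight or happened in between. */
        } while ( (t1 & XEN_RUNSTATE_UPDATE) || t1 != t2 );
    }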

Note further that pages aren't marked dirty when written to (matching
the handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: HVM guests (on x86) can change bitness and hence layout (and size!
     and alignment) of the runstate area. I don't think it is an option
     to require 32-bit code to pass a range such that even the 64-bit
     layout wouldn't cross a page boundary (and be suitably aligned). I
     also don't see any other good solution, so for now a crude approach
     with an extra boolean is used (using has_32bit_shinfo() isn't race
     free and could hence lead to overrunning the mapped space).
---
v2: Drop VM-assist conditionals.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1616,14 +1616,53 @@ bool update_runstate_area(struct vcpu *v
     struct guest_memory_policy policy = { };
     void __user *guest_handle = NULL;
     struct vcpu_runstate_info runstate;
+    struct vcpu_runstate_info *map = v->runstate_guest_area.map;
+
+    memcpy(&runstate, &v->runstate, sizeof(runstate));
+
+    if ( map )
+    {
+        uint64_t *pset;
+#ifdef CONFIG_COMPAT
+        struct compat_vcpu_runstate_info *cmap = NULL;
+
+        if ( v->runstate_guest_area_compat )
+            cmap = (void *)map;
+#endif
+
+        /*
+         * NB: No VM_ASSIST(v->domain, runstate_update_flag) check here.
+         *     Always using that updating model.
+         */
+#ifdef CONFIG_COMPAT
+        if ( cmap )
+            pset = &cmap->state_entry_time;
+        else
+#endif
+            pset = &map->state_entry_time;
+        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+        write_atomic(pset, runstate.state_entry_time);
+        smp_wmb();
+
+#ifdef CONFIG_COMPAT
+        if ( cmap )
+            XLAT_vcpu_runstate_info(cmap, &runstate);
+        else
+#endif
+            *map = runstate;
+
+        smp_wmb();
+        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        write_atomic(pset, runstate.state_entry_time);
+
+        return true;
+    }
 
     if ( guest_handle_is_null(runstate_guest(v)) )
         return true;
 
     update_guest_memory_policy(v, &policy);
 
-    memcpy(&runstate, &v->runstate, sizeof(runstate));
-
     if ( VM_ASSIST(v->domain, runstate_update_flag) )
     {
 #ifdef CONFIG_COMPAT
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -231,6 +231,8 @@ struct vcpu
 #ifdef CONFIG_COMPAT
     /* A hypercall is using the compat ABI? */
     bool             hcall_compat;
+    /* Physical runstate area registered via compat ABI? */
+    bool             runstate_guest_area_compat;
 #endif
 
 #ifdef CONFIG_IOREQ_SERVER




* [PATCH v2 3/8] x86: update GADDR based secondary time area
From: Jan Beulich @ 2023-01-23 14:54 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

Before adding a new vCPU operation to register the secondary time area
by guest-physical address, add code to actually keep such areas up-to-
date.

Note that pages aren't marked dirty when written to (matching the
handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).
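
The version handling in write_time_guest_area() below follows the usual
even/odd publication protocol, so a guest-side reader would look roughly
like this (sketch only, not part of the patch; read_barrier() is a
placeholder for the guest's read fence):

    /* Snapshot a vcpu_time_info published under the even/odd version scheme. */
    static void read_time_area(const volatile struct vcpu_time_info *area,
                               struct vcpu_time_info *snap)
    {
        uint32_t ver;

        do {
            ver = area->version;
            read_barrier();
            *snap = *(const struct vcpu_time_info *)area;
            read_barrier();
            /* Odd version: update in progress; changed version: stale data. */
        } while ( (ver & 1) || ver != area->version );
    }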

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1462,12 +1462,34 @@ static void __update_vcpu_system_time(st
         v->arch.pv.pending_system_time = _u;
 }
 
+static void write_time_guest_area(struct vcpu_time_info *map,
+                                  const struct vcpu_time_info *src)
+{
+    /* 1. Update userspace version. */
+    write_atomic(&map->version, src->version);
+    smp_wmb();
+
+    /* 2. Update all other userspace fields. */
+    *map = *src;
+
+    /* 3. Update userspace version again. */
+    smp_wmb();
+    write_atomic(&map->version, version_update_end(src->version));
+}
+
 bool update_secondary_system_time(struct vcpu *v,
                                   struct vcpu_time_info *u)
 {
     XEN_GUEST_HANDLE(vcpu_time_info_t) user_u = v->arch.time_info_guest;
+    struct vcpu_time_info *map = v->arch.time_guest_area.map;
     struct guest_memory_policy policy = { .nested_guest_mode = false };
 
+    if ( map )
+    {
+        write_time_guest_area(map, u);
+        return true;
+    }
+
     if ( guest_handle_is_null(user_u) )
         return true;
 




* [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
From: Jan Beulich @ 2023-01-23 14:55 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Tamas K Lengyel

In preparation for the introduction of new vCPU operations allowing the
respective areas (one of the two is x86-specific) to be registered by
guest-physical address, add the necessary fork handling (with the
backing function yet to be filled in).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
     hvm_set_nonreg_state(cd_vcpu, &nrs);
 }
 
+static int copy_guest_area(struct guest_area *cd_area,
+                           const struct guest_area *d_area,
+                           struct vcpu *cd_vcpu,
+                           const struct domain *d)
+{
+    mfn_t d_mfn, cd_mfn;
+
+    if ( !d_area->pg )
+        return 0;
+
+    d_mfn = page_to_mfn(d_area->pg);
+
+    /* Allocate & map a page for the area if it hasn't been already. */
+    if ( !cd_area->pg )
+    {
+        gfn_t gfn = mfn_to_gfn(d, d_mfn);
+        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
+        p2m_type_t p2mt;
+        p2m_access_t p2ma;
+        unsigned int offset;
+        int ret;
+
+        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
+        if ( mfn_eq(cd_mfn, INVALID_MFN) )
+        {
+            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
+
+            if ( !pg )
+                return -ENOMEM;
+
+            cd_mfn = page_to_mfn(pg);
+            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
+
+            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
+                                 p2m->default_access, -1);
+            if ( ret )
+                return ret;
+        }
+        else if ( p2mt != p2m_ram_rw )
+            return -EBUSY;
+
+        /*
+         * Simply specify the entire range up to the end of the page. All the
+         * function uses it for is a check for not crossing page boundaries.
+         */
+        offset = PAGE_OFFSET(d_area->map);
+        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
+                             PAGE_SIZE - offset, cd_area, NULL);
+        if ( ret )
+            return ret;
+    }
+    else
+        cd_mfn = page_to_mfn(cd_area->pg);
+
+    copy_domain_page(cd_mfn, d_mfn);
+
+    return 0;
+}
+
 static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
 {
     struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
@@ -1745,6 +1804,16 @@ static int copy_vcpu_settings(struct dom
             copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
         }
 
+        /* Same for the (physically registered) runstate and time info areas. */
+        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
+                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
+                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+
         ret = copy_vpmu(d_vcpu, cd_vcpu);
         if ( ret )
             return ret;
@@ -1987,7 +2056,10 @@ int mem_sharing_fork_reset(struct domain
 
  state:
     if ( reset_state )
+    {
         rc = copy_settings(d, pd);
+        /* TBD: What to do here with -ERESTART? */
+    }
 
     domain_unpause(d);
 
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v))
+{
+    return -EOPNOTSUPP;
+}
+
 /*
  * This is only intended to be used for domain cleanup (or more generally only
  * with at least the respective vCPU, if it's not the current one, reliably




* [PATCH v2 5/8] domain: map/unmap GADDR based shared guest areas
From: Jan Beulich @ 2023-01-23 14:55 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the areas are inaccessible (and hence cannot be updated by
Xen) when in guest-user mode, and for HVM guests they may be
inaccessible when Meltdown mitigations are in place. (There are yet
more issues.)

In preparation for the introduction of new vCPU operations allowing the
respective areas (one of the two is x86-specific) to be registered by
guest-physical address, flesh out the map/unmap functions.

Noteworthy differences from map_vcpu_info():
- areas can be registered more than once (and de-registered),
- remote vCPU-s are paused rather than checked for being down (which in
  principle can change right after the check),
- the domain lock is taken for a much smaller region.
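
In terms of what a guest must supply, the checks fleshed out below amount
to the following (an illustrative sketch, not part of the patch; the
helper name is made up):

    /*
     * A registration address must not cross a page boundary and must be
     * aligned to the guest's "long" size; see map_guest_area() below.
     */
    static bool gaddr_is_registrable(paddr_t gaddr, unsigned int size,
                                     unsigned int align)
    {
        return PFN_DOWN(gaddr) == PFN_DOWN(gaddr + size - 1) &&
               !(gaddr & (align - 1));
    }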

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: By using global domain page mappings the demand on the underlying
     VA range may increase significantly. I did consider to use per-
     domain mappings instead, but they exist for x86 only. Of course we
     could have arch_{,un}map_guest_area() aliasing global domain page
     mapping functions on Arm and using per-domain mappings on x86. Yet
     then again map_vcpu_info() doesn't (and can't) do so.

RFC: In map_guest_area() I'm not checking the P2M type, instead - just
     like map_vcpu_info() - solely relying on the type ref acquisition.
     Checking for p2m_ram_rw alone would be wrong, as at least
     p2m_ram_logdirty ought to also be okay to use here (and in similar
     cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
     used here (like altp2m_vcpu_enable_ve() does) as well as in
     map_vcpu_info(), yet then again the P2M type is stale by the time
     it is being looked at anyway without the P2M lock held.
---
v2: currd -> d, to cover mem-sharing's copy_guest_area(). Re-base over
    change(s) earlier in the series. Use ~0 as "unmap" request indicator.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1576,7 +1576,82 @@ int map_guest_area(struct vcpu *v, paddr
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
 {
-    return -EOPNOTSUPP;
+    struct domain *d = v->domain;
+    void *map = NULL;
+    struct page_info *pg = NULL;
+    int rc = 0;
+
+    if ( ~gaddr )
+    {
+        unsigned long gfn = PFN_DOWN(gaddr);
+        unsigned int align;
+        p2m_type_t p2mt;
+
+        if ( gfn != PFN_DOWN(gaddr + size - 1) )
+            return -ENXIO;
+
+#ifdef CONFIG_COMPAT
+        if ( has_32bit_shinfo(d) )
+            align = alignof(compat_ulong_t);
+        else
+#endif
+            align = alignof(xen_ulong_t);
+        if ( gaddr & (align - 1) )
+            return -ENXIO;
+
+        rc = check_get_page_from_gfn(d, _gfn(gfn), false, &p2mt, &pg);
+        if ( rc )
+            return rc;
+
+        if ( !get_page_type(pg, PGT_writable_page) )
+        {
+            put_page(pg);
+            return -EACCES;
+        }
+
+        map = __map_domain_page_global(pg);
+        if ( !map )
+        {
+            put_page_and_type(pg);
+            return -ENOMEM;
+        }
+        map += PAGE_OFFSET(gaddr);
+    }
+
+    if ( v != current )
+    {
+        if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
+        {
+            rc = -ERESTART;
+            goto unmap;
+        }
+
+        vcpu_pause(v);
+
+        spin_unlock(&d->hypercall_deadlock_mutex);
+    }
+
+    domain_lock(d);
+
+    if ( map )
+        populate(map, v);
+
+    SWAP(area->pg, pg);
+    SWAP(area->map, map);
+
+    domain_unlock(d);
+
+    if ( v != current )
+        vcpu_unpause(v);
+
+ unmap:
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
+
+    return rc;
 }
 
 /*
@@ -1587,9 +1662,24 @@ int map_guest_area(struct vcpu *v, paddr
 void unmap_guest_area(struct vcpu *v, struct guest_area *area)
 {
     struct domain *d = v->domain;
+    void *map;
+    struct page_info *pg;
 
     if ( v != current )
         ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));
+
+    domain_lock(d);
+    map = area->map;
+    area->map = NULL;
+    pg = area->pg;
+    area->pg = NULL;
+    domain_unlock(d);
+
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
 }
 
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)




* [PATCH v2 6/8] domain: introduce GADDR based runstate area registration alternative
From: Jan Beulich @ 2023-01-23 14:56 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the area is inaccessible (and hence cannot be updated by Xen)
when in guest-user mode.

Introduce a new vCPU operation allowing the runstate area to be registered
by guest-physical address.

An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.
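
From the guest's perspective, usage would look about like this (a sketch
assuming a Linux-style HYPERVISOR_vcpu_op() hypercall wrapper; names other
than those from the public headers are illustrative):

    /* Register this vCPU's runstate area by guest-physical address. */
    struct vcpu_register_runstate_memory_area area = {
        .addr.p = runstate_gpa, /* suitably aligned, within one page */
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_runstate_phys_area,
                                vcpu_id, &area);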

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Extend comment in public header.

--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -12,6 +12,22 @@
 CHECK_vcpu_get_physid;
 #undef xen_vcpu_get_physid
 
+static void cf_check
+runstate_area_populate(void *map, struct vcpu *v)
+{
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.need_update_runstate_area = false;
+
+    v->runstate_guest_area_compat = true;
+
+    if ( v == current )
+    {
+        struct compat_vcpu_runstate_info *info = map;
+
+        XLAT_vcpu_runstate_info(info, &v->runstate);
+    }
+}
+
 int
 compat_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
@@ -57,6 +73,25 @@ compat_vcpu_op(int cmd, unsigned int vcp
 
         break;
     }
+
+    case VCPUOP_register_runstate_phys_area:
+    {
+        struct compat_vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(struct compat_vcpu_runstate_info),
+                            &v->runstate_guest_area,
+                            runstate_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     case VCPUOP_register_vcpu_time_memory_area:
     {
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1803,6 +1803,26 @@ bool update_runstate_area(struct vcpu *v
     return rc;
 }
 
+static void cf_check
+runstate_area_populate(void *map, struct vcpu *v)
+{
+#ifdef CONFIG_PV
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.need_update_runstate_area = false;
+#endif
+
+#ifdef CONFIG_COMPAT
+    v->runstate_guest_area_compat = false;
+#endif
+
+    if ( v == current )
+    {
+        struct vcpu_runstate_info *info = map;
+
+        *info = v->runstate;
+    }
+}
+
 long common_vcpu_op(int cmd, struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
@@ -1977,6 +1997,25 @@ long common_vcpu_op(int cmd, struct vcpu
 
         break;
     }
+
+    case VCPUOP_register_runstate_phys_area:
+    {
+        struct vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(struct vcpu_runstate_info),
+                            &v->runstate_guest_area,
+                            runstate_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     default:
         rc = -ENOSYS;
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -218,6 +218,19 @@ struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+/*
+ * Like the respective VCPUOP_register_*_memory_area, just using the "addr.p"
+ * field of the supplied struct as a guest physical address (i.e. in GFN space).
+ * The respective area may not cross a page boundary.  Pass ~0 to unregister an
+ * area.  Note that as long as an area is registered by physical address, the
+ * linear address based area will not be serviced (updated) by the hypervisor.
+ *
+ * Note that the area registered via VCPUOP_register_runstate_memory_area will
+ * be updated in the same manner as the one registered via virtual address PLUS
+ * VMASST_TYPE_runstate_update_flag engaged by the domain.
+ */
+#define VCPUOP_register_runstate_phys_area      14
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*




* [PATCH v2 7/8] x86: introduce GADDR based secondary time area registration alternative
From: Jan Beulich @ 2023-01-23 14:56 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

The registration by virtual/linear address has downsides: The access is
expensive for HVM/PVH domains. Furthermore for 64-bit PV domains the area
is inaccessible (and hence cannot be updated by Xen) when in guest-user
mode.

Introduce a new vCPU operation allowing the secondary time area to be
registered by guest-physical address.

An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.
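
Guest-side usage mirrors the runstate counterpart, with ~0 de-registering
the area again (sketch only, assuming a HYPERVISOR_vcpu_op() hypercall
wrapper; names other than those from the public headers are illustrative):

    struct vcpu_register_time_memory_area area = {
        .addr.p = time_gpa, /* page-boundary and alignment rules apply */
    };
    int rc = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_phys_area,
                                vcpu_id, &area);

    /* Passing ~0 unregisters the area: */
    area.addr.p = ~0ULL;
    rc = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_phys_area,
                            vcpu_id, &area);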

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Forge version in force_update_secondary_system_time().

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1499,6 +1499,15 @@ int arch_vcpu_reset(struct vcpu *v)
     return 0;
 }
 
+static void cf_check
+time_area_populate(void *map, struct vcpu *v)
+{
+    if ( is_pv_vcpu(v) )
+        v->arch.pv.pending_system_time.version = 0;
+
+    force_update_secondary_system_time(v, map);
+}
+
 long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
@@ -1536,6 +1545,25 @@ long do_vcpu_op(int cmd, unsigned int vc
 
         break;
     }
+
+    case VCPUOP_register_vcpu_time_phys_area:
+    {
+        struct vcpu_register_time_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area.addr.p, arg, 1) )
+            break;
+
+        rc = map_guest_area(v, area.addr.p,
+                            sizeof(vcpu_time_info_t),
+                            &v->arch.time_guest_area,
+                            time_area_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
+
+        break;
+    }
 
     case VCPUOP_get_physid:
     {
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -681,6 +681,8 @@ void domain_cpu_policy_changed(struct do
 
 bool update_secondary_system_time(struct vcpu *,
                                   struct vcpu_time_info *);
+void force_update_secondary_system_time(struct vcpu *,
+                                        struct vcpu_time_info *);
 
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1524,6 +1524,16 @@ void force_update_vcpu_system_time(struc
     __update_vcpu_system_time(v, 1);
 }
 
+void force_update_secondary_system_time(struct vcpu *v,
+                                        struct vcpu_time_info *map)
+{
+    struct vcpu_time_info u;
+
+    collect_time_info(v, &u);
+    u.version = -1; /* Compensate for version_update_end(). */
+    write_time_guest_area(map, &u);
+}
+
 static void update_domain_rtc(void)
 {
     struct domain *d;
--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -115,6 +115,7 @@ compat_vcpu_op(int cmd, unsigned int vcp
 
     case VCPUOP_send_nmi:
     case VCPUOP_get_physid:
+    case VCPUOP_register_vcpu_time_phys_area:
         rc = do_vcpu_op(cmd, vcpuid, arg);
         break;
 
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -230,6 +230,7 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_register_ti
  * VMASST_TYPE_runstate_update_flag engaged by the domain.
  */
 #define VCPUOP_register_runstate_phys_area      14
+#define VCPUOP_register_vcpu_time_phys_area     15
 
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 




* [PATCH v2 8/8] common: convert vCPU info area registration
From: Jan Beulich @ 2023-01-23 14:57 UTC
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Tamas K Lengyel

Switch to using map_guest_area(). Noteworthy differences from
map_vcpu_info():
- remote vCPU-s are paused rather than checked for being down (which in
  principle can change right after the check),
- the domain lock is taken for a much smaller region,
- the error code for an attempt to re-register the area is now -EBUSY,
- we could in principle permit de-registration when no area was
  previously registered (which would permit "probing", if necessary for
  anything).

Note that this eliminates a bug in copy_vcpu_settings(): The function
allocated a new page regardless of whether the GFN already had a mapping,
thus in particular breaking the case of two vCPU-s having their info
areas on the same page.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: I'm not really certain whether the preliminary check (ahead of
     calling map_guest_area()) is worthwhile to have.
---
v2: Re-base over changes earlier in the series. Properly enforce no re-
    registration. Avoid several casts by introducing local variables.

--- a/xen/arch/x86/include/asm/shared.h
+++ b/xen/arch/x86/include/asm/shared.h
@@ -26,17 +26,20 @@ static inline void arch_set_##field(stru
 #define GET_SET_VCPU(type, field)                               \
 static inline type arch_get_##field(const struct vcpu *v)       \
 {                                                               \
+    const vcpu_info_t *vi = v->vcpu_info_area.map;              \
+                                                                \
     return !has_32bit_shinfo(v->domain) ?                       \
-           v->vcpu_info->native.arch.field :                    \
-           v->vcpu_info->compat.arch.field;                     \
+           vi->native.arch.field : vi->compat.arch.field;       \
 }                                                               \
 static inline void arch_set_##field(struct vcpu *v,             \
                                     type val)                   \
 {                                                               \
+    vcpu_info_t *vi = v->vcpu_info_area.map;                    \
+                                                                \
     if ( !has_32bit_shinfo(v->domain) )                         \
-        v->vcpu_info->native.arch.field = val;                  \
+        vi->native.arch.field = val;                            \
     else                                                        \
-        v->vcpu_info->compat.arch.field = val;                  \
+        vi->compat.arch.field = val;                            \
 }
 
 #else
@@ -57,12 +60,16 @@ static inline void arch_set_##field(stru
 #define GET_SET_VCPU(type, field)                           \
 static inline type arch_get_##field(const struct vcpu *v)   \
 {                                                           \
-    return v->vcpu_info->arch.field;                        \
+    const vcpu_info_t *vi = v->vcpu_info_area.map;          \
+                                                            \
+    return vi->arch.field;                                  \
 }                                                           \
 static inline void arch_set_##field(struct vcpu *v,         \
                                     type val)               \
 {                                                           \
-    v->vcpu_info->arch.field = val;                         \
+    vcpu_info_t *vi = v->vcpu_info_area.map;                \
+                                                            \
+    vi->arch.field = val;                                   \
 }
 
 #endif
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1758,53 +1758,24 @@ static int copy_vpmu(struct vcpu *d_vcpu
 static int copy_vcpu_settings(struct domain *cd, const struct domain *d)
 {
     unsigned int i;
-    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
     int ret = -EINVAL;
 
     for ( i = 0; i < cd->max_vcpus; i++ )
     {
         struct vcpu *d_vcpu = d->vcpu[i];
         struct vcpu *cd_vcpu = cd->vcpu[i];
-        mfn_t vcpu_info_mfn;
 
         if ( !d_vcpu || !cd_vcpu )
             continue;
 
-        /* Copy & map in the vcpu_info page if the guest uses one */
-        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
-        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
-        {
-            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
-
-            /* Allocate & map the page for it if it hasn't been already */
-            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
-            {
-                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
-                unsigned long gfn_l = gfn_x(gfn);
-                struct page_info *page;
-
-                if ( !(page = alloc_domheap_page(cd, 0)) )
-                    return -ENOMEM;
-
-                new_vcpu_info_mfn = page_to_mfn(page);
-                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
-
-                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn,
-                                     PAGE_ORDER_4K, p2m_ram_rw,
-                                     p2m->default_access, -1);
-                if ( ret )
-                    return ret;
-
-                ret = map_vcpu_info(cd_vcpu, gfn_l,
-                                    PAGE_OFFSET(d_vcpu->vcpu_info));
-                if ( ret )
-                    return ret;
-            }
-
-            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
-        }
-
-        /* Same for the (physically registered) runstate and time info areas. */
+        /*
+         * Copy and map the vcpu_info page and the (physically registered)
+         * runstate and time info areas.
+         */
+        ret = copy_guest_area(&cd_vcpu->vcpu_info_area,
+                              &d_vcpu->vcpu_info_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
         ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
                               &d_vcpu->runstate_guest_area, cd_vcpu, d);
         if ( ret )
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -395,7 +395,7 @@ int pv_shim_shutdown(uint8_t reason)
     for_each_vcpu ( d, v )
     {
         /* Unmap guest vcpu_info page and runstate/time areas. */
-        unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->vcpu_info_area);
         unmap_guest_area(v, &v->runstate_guest_area);
         unmap_guest_area(v, &v->arch.time_guest_area);
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1438,7 +1438,7 @@ static void __update_vcpu_system_time(st
     struct vcpu_time_info *u = &vcpu_info(v, time), _u;
     const struct domain *d = v->domain;
 
-    if ( v->vcpu_info == NULL )
+    if ( !v->vcpu_info_area.map )
         return;
 
     collect_time_info(v, &_u);
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -53,7 +53,7 @@ void __dummy__(void)
 
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
-    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
+    OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info_area.map);
     OFFSET(VCPU_trap_bounce, struct vcpu, arch.pv.trap_bounce);
     OFFSET(VCPU_thread_flags, struct vcpu, arch.flags);
     OFFSET(VCPU_event_addr, struct vcpu, arch.pv.event_callback_eip);
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -97,7 +97,7 @@ static void _show_registers(
     if ( context == CTXT_hypervisor )
         printk(" %pS", _p(regs->rip));
     printk("\nRFLAGS: %016lx   ", regs->rflags);
-    if ( (context == CTXT_pv_guest) && v && v->vcpu_info )
+    if ( (context == CTXT_pv_guest) && v && v->vcpu_info_area.map )
         printk("EM: %d   ", !!vcpu_info(v, evtchn_upcall_mask));
     printk("CONTEXT: %s", context_names[context]);
     if ( v && !is_idle_vcpu(v) )
--- a/xen/common/compat/domain.c
+++ b/xen/common/compat/domain.c
@@ -49,7 +49,7 @@ int compat_common_vcpu_op(int cmd, struc
     {
     case VCPUOP_initialise:
     {
-        if ( v->vcpu_info == &dummy_vcpu_info )
+        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
             return -EINVAL;
 
 #ifdef CONFIG_HVM
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -127,10 +127,10 @@ static void vcpu_info_reset(struct vcpu
 {
     struct domain *d = v->domain;
 
-    v->vcpu_info = ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
-                    ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
-                    : &dummy_vcpu_info);
-    v->vcpu_info_mfn = INVALID_MFN;
+    v->vcpu_info_area.map =
+        ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
+         ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
+         : &dummy_vcpu_info);
 }
 
 static void vmtrace_free_buffer(struct vcpu *v)
@@ -964,7 +964,7 @@ int domain_kill(struct domain *d)
             return -ERESTART;
         for_each_vcpu ( d, v )
         {
-            unmap_vcpu_info(v);
+            unmap_guest_area(v, &v->vcpu_info_area);
             unmap_guest_area(v, &v->runstate_guest_area);
         }
         d->is_dying = DOMDYING_dead;
@@ -1419,7 +1419,7 @@ int domain_soft_reset(struct domain *d,
     for_each_vcpu ( d, v )
     {
         set_xen_guest_handle(runstate_guest(v), NULL);
-        unmap_vcpu_info(v);
+        unmap_guest_area(v, &v->vcpu_info_area);
         unmap_guest_area(v, &v->runstate_guest_area);
     }
 
@@ -1467,111 +1467,6 @@ int vcpu_reset(struct vcpu *v)
     return rc;
 }
 
-/*
- * Map a guest page in and point the vcpu_info pointer at it.  This
- * makes sure that the vcpu_info is always pointing at a valid piece
- * of memory, and it sets a pending event to make sure that a pending
- * event doesn't get missed.
- */
-int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset)
-{
-    struct domain *d = v->domain;
-    void *mapping;
-    vcpu_info_t *new_info;
-    struct page_info *page;
-    unsigned int align;
-
-    if ( offset > (PAGE_SIZE - sizeof(*new_info)) )
-        return -ENXIO;
-
-#ifdef CONFIG_COMPAT
-    BUILD_BUG_ON(sizeof(*new_info) != sizeof(new_info->compat));
-    if ( has_32bit_shinfo(d) )
-        align = alignof(new_info->compat);
-    else
-#endif
-        align = alignof(*new_info);
-    if ( offset & (align - 1) )
-        return -ENXIO;
-
-    if ( !mfn_eq(v->vcpu_info_mfn, INVALID_MFN) )
-        return -EINVAL;
-
-    /* Run this command on yourself or on other offline VCPUS. */
-    if ( (v != current) && !(v->pause_flags & VPF_down) )
-        return -EINVAL;
-
-    page = get_page_from_gfn(d, gfn, NULL, P2M_UNSHARE);
-    if ( !page )
-        return -EINVAL;
-
-    if ( !get_page_type(page, PGT_writable_page) )
-    {
-        put_page(page);
-        return -EINVAL;
-    }
-
-    mapping = __map_domain_page_global(page);
-    if ( mapping == NULL )
-    {
-        put_page_and_type(page);
-        return -ENOMEM;
-    }
-
-    new_info = (vcpu_info_t *)(mapping + offset);
-
-    if ( v->vcpu_info == &dummy_vcpu_info )
-    {
-        memset(new_info, 0, sizeof(*new_info));
-#ifdef XEN_HAVE_PV_UPCALL_MASK
-        __vcpu_info(v, new_info, evtchn_upcall_mask) = 1;
-#endif
-    }
-    else
-    {
-        memcpy(new_info, v->vcpu_info, sizeof(*new_info));
-    }
-
-    v->vcpu_info = new_info;
-    v->vcpu_info_mfn = page_to_mfn(page);
-
-    /* Set new vcpu_info pointer /before/ setting pending flags. */
-    smp_wmb();
-
-    /*
-     * Mark everything as being pending just to make sure nothing gets
-     * lost.  The domain will get a spurious event, but it can cope.
-     */
-#ifdef CONFIG_COMPAT
-    if ( !has_32bit_shinfo(d) )
-        write_atomic(&new_info->native.evtchn_pending_sel, ~0);
-    else
-#endif
-        write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
-    vcpu_mark_events_pending(v);
-
-    return 0;
-}
-
-/*
- * Unmap the vcpu info page if the guest decided to place it somewhere
- * else. This is used from domain_kill() and domain_soft_reset().
- */
-void unmap_vcpu_info(struct vcpu *v)
-{
-    mfn_t mfn = v->vcpu_info_mfn;
-
-    if ( mfn_eq(mfn, INVALID_MFN) )
-        return;
-
-    unmap_domain_page_global((void *)
-                             ((unsigned long)v->vcpu_info & PAGE_MASK));
-
-    vcpu_info_reset(v); /* NB: Clobbers v->vcpu_info_mfn */
-
-    put_page_and_type(mfn_to_page(mfn));
-}
-
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
@@ -1633,14 +1528,44 @@ int map_guest_area(struct vcpu *v, paddr
 
     domain_lock(d);
 
-    if ( map )
-        populate(map, v);
+    /* No re-registration of the vCPU info area. */
+    if ( area != &v->vcpu_info_area || !area->pg )
+    {
+        if ( map )
+            populate(map, v);
 
-    SWAP(area->pg, pg);
-    SWAP(area->map, map);
+        SWAP(area->pg, pg);
+        SWAP(area->map, map);
+    }
+    else
+        rc = -EBUSY;
 
     domain_unlock(d);
 
+    /* Set pending flags /after/ new vcpu_info pointer was set. */
+    if ( area == &v->vcpu_info_area && !rc )
+    {
+        /*
+         * Mark everything as being pending just to make sure nothing gets
+         * lost.  The domain will get a spurious event, but it can cope.
+         */
+#ifdef CONFIG_COMPAT
+        if ( !has_32bit_shinfo(d) )
+        {
+            vcpu_info_t *info = area->map;
+
+            /* For VCPUOP_register_vcpu_info handling in common_vcpu_op(). */
+            BUILD_BUG_ON(sizeof(*info) != sizeof(info->compat));
+            write_atomic(&info->native.evtchn_pending_sel, ~0);
+        }
+        else
+#endif
+            write_atomic(&vcpu_info(v, evtchn_pending_sel), ~0);
+        vcpu_mark_events_pending(v);
+
+        force_update_vcpu_system_time(v);
+    }
+
     if ( v != current )
         vcpu_unpause(v);
 
@@ -1670,7 +1595,10 @@ void unmap_guest_area(struct vcpu *v, st
 
     domain_lock(d);
     map = area->map;
-    area->map = NULL;
+    if ( area == &v->vcpu_info_area )
+        vcpu_info_reset(v);
+    else
+        area->map = NULL;
     pg = area->pg;
     area->pg = NULL;
     domain_unlock(d);
@@ -1803,6 +1731,27 @@ bool update_runstate_area(struct vcpu *v
     return rc;
 }
 
+/*
+ * This makes sure that the vcpu_info is always pointing at a valid piece of
+ * memory, and it sets a pending event to make sure that a pending event
+ * doesn't get missed.
+ */
+static void cf_check
+vcpu_info_populate(void *map, struct vcpu *v)
+{
+    vcpu_info_t *info = map;
+
+    if ( v->vcpu_info_area.map == &dummy_vcpu_info )
+    {
+        memset(info, 0, sizeof(*info));
+#ifdef XEN_HAVE_PV_UPCALL_MASK
+        __vcpu_info(v, info, evtchn_upcall_mask) = 1;
+#endif
+    }
+    else
+        memcpy(info, v->vcpu_info_area.map, sizeof(*info));
+}
+
 static void cf_check
 runstate_area_populate(void *map, struct vcpu *v)
 {
@@ -1832,7 +1781,7 @@ long common_vcpu_op(int cmd, struct vcpu
     switch ( cmd )
     {
     case VCPUOP_initialise:
-        if ( v->vcpu_info == &dummy_vcpu_info )
+        if ( v->vcpu_info_area.map == &dummy_vcpu_info )
             return -EINVAL;
 
         rc = arch_initialise_vcpu(v, arg);
@@ -1956,16 +1905,29 @@ long common_vcpu_op(int cmd, struct vcpu
     case VCPUOP_register_vcpu_info:
     {
         struct vcpu_register_vcpu_info info;
+        paddr_t gaddr;
 
         rc = -EFAULT;
         if ( copy_from_guest(&info, arg, 1) )
             break;
 
-        domain_lock(d);
-        rc = map_vcpu_info(v, info.mfn, info.offset);
-        domain_unlock(d);
+        rc = -EINVAL;
+        gaddr = gfn_to_gaddr(_gfn(info.mfn)) + info.offset;
+        if ( !~gaddr ||
+             gfn_x(gaddr_to_gfn(gaddr)) != info.mfn )
+            break;
 
-        force_update_vcpu_system_time(v);
+        /* Preliminary check only; see map_guest_area(). */
+        rc = -EBUSY;
+        if ( v->vcpu_info_area.pg )
+            break;
+
+        /* See the BUILD_BUG_ON() in vcpu_info_populate(). */
+        rc = map_guest_area(v, gaddr, sizeof(vcpu_info_t),
+                            &v->vcpu_info_area, vcpu_info_populate);
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_vcpu_op, "iih",
+                                               cmd, vcpuid, arg);
 
         break;
     }
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -79,9 +79,6 @@ void cf_check free_pirq_struct(void *);
 int  arch_vcpu_create(struct vcpu *v);
 void arch_vcpu_destroy(struct vcpu *v);
 
-int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned int offset);
-void unmap_vcpu_info(struct vcpu *v);
-
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v));
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -175,7 +175,7 @@ struct vcpu
 
     int              processor;
 
-    vcpu_info_t     *vcpu_info;
+    struct guest_area vcpu_info_area;
 
     struct domain   *domain;
 
@@ -288,9 +288,6 @@ struct vcpu
 
     struct waitqueue_vcpu *waitqueue_vcpu;
 
-    /* Guest-specified relocation of vcpu_info. */
-    mfn_t            vcpu_info_mfn;
-
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
     /* vPCI per-vCPU area, used to store data for long running operations. */
--- a/xen/include/xen/shared.h
+++ b/xen/include/xen/shared.h
@@ -44,6 +44,7 @@ typedef struct vcpu_info vcpu_info_t;
 extern vcpu_info_t dummy_vcpu_info;
 
 #define shared_info(d, field)      __shared_info(d, (d)->shared_info, field)
-#define vcpu_info(v, field)        __vcpu_info(v, (v)->vcpu_info, field)
+#define vcpu_info(v, field)        \
+        __vcpu_info(v, (vcpu_info_t *)(v)->vcpu_info_area.map, field)
 
 #endif /* __XEN_SHARED_H__ */




* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
From: Tamas K Lengyel @ 2023-01-23 16:09 UTC
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> In preparation for the introduction of new vCPU operations allowing the
> respective areas (one of the two is x86-specific) to be registered by
> guest-physical address, add the necessary fork handling (with the
> backing function yet to be filled in).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>  }
>
> +static int copy_guest_area(struct guest_area *cd_area,
> +                           const struct guest_area *d_area,
> +                           struct vcpu *cd_vcpu,
> +                           const struct domain *d)
> +{
> +    mfn_t d_mfn, cd_mfn;
> +
> +    if ( !d_area->pg )
> +        return 0;
> +
> +    d_mfn = page_to_mfn(d_area->pg);
> +
> +    /* Allocate & map a page for the area if it hasn't been already. */
> +    if ( !cd_area->pg )
> +    {
> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> +        p2m_type_t p2mt;
> +        p2m_access_t p2ma;
> +        unsigned int offset;
> +        int ret;
> +
> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> +        {
> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> +
> +            if ( !pg )
> +                return -ENOMEM;
> +
> +            cd_mfn = page_to_mfn(pg);
> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> +
> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> +                                 p2m->default_access, -1);
> +            if ( ret )
> +                return ret;
> +        }
> +        else if ( p2mt != p2m_ram_rw )
> +            return -EBUSY;
> +
> +        /*
> +         * Simply specify the entire range up to the end of the page. All the
> +         * function uses it for is a check for not crossing page boundaries.
> +         */
> +        offset = PAGE_OFFSET(d_area->map);
> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> +                             PAGE_SIZE - offset, cd_area, NULL);
> +        if ( ret )
> +            return ret;
> +    }
> +    else
> +        cd_mfn = page_to_mfn(cd_area->pg);

Everything to this point seems to be non mem-sharing/forking related. Could
these live somewhere else? There must be some other place where allocating
these areas happens already for non-fork VMs so it would make sense to just
refactor that code to be callable from here.

> +
> +    copy_domain_page(cd_mfn, d_mfn);
> +
> +    return 0;
> +}
> +
>  static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
>  {
>      struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
> @@ -1745,6 +1804,16 @@ static int copy_vcpu_settings(struct dom
>              copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
>          }
>
> +        /* Same for the (physically registered) runstate and time info areas. */
> +        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
> +                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
> +        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
> +                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
> +
>          ret = copy_vpmu(d_vcpu, cd_vcpu);
>          if ( ret )
>              return ret;
> @@ -1987,7 +2056,10 @@ int mem_sharing_fork_reset(struct domain
>
>   state:
>      if ( reset_state )
> +    {
>          rc = copy_settings(d, pd);
> +        /* TBD: What to do here with -ERESTART? */

Where does ERESTART come from?



* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
From: Jan Beulich @ 2023-01-23 16:24 UTC
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 23.01.2023 17:09, Tamas K Lengyel wrote:
> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
>> --- a/xen/arch/x86/mm/mem_sharing.c
>> +++ b/xen/arch/x86/mm/mem_sharing.c
>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>  }
>>
>> +static int copy_guest_area(struct guest_area *cd_area,
>> +                           const struct guest_area *d_area,
>> +                           struct vcpu *cd_vcpu,
>> +                           const struct domain *d)
>> +{
>> +    mfn_t d_mfn, cd_mfn;
>> +
>> +    if ( !d_area->pg )
>> +        return 0;
>> +
>> +    d_mfn = page_to_mfn(d_area->pg);
>> +
>> +    /* Allocate & map a page for the area if it hasn't been already. */
>> +    if ( !cd_area->pg )
>> +    {
>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>> +        p2m_type_t p2mt;
>> +        p2m_access_t p2ma;
>> +        unsigned int offset;
>> +        int ret;
>> +
>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>> +        {
>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>> +
>> +            if ( !pg )
>> +                return -ENOMEM;
>> +
>> +            cd_mfn = page_to_mfn(pg);
>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>> +
>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>> +                                 p2m->default_access, -1);
>> +            if ( ret )
>> +                return ret;
>> +        }
>> +        else if ( p2mt != p2m_ram_rw )
>> +            return -EBUSY;
>> +
>> +        /*
>> +         * Simply specify the entire range up to the end of the page. All the
>> +         * function uses it for is a check for not crossing page boundaries.
>> +         */
>> +        offset = PAGE_OFFSET(d_area->map);
>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>> +                             PAGE_SIZE - offset, cd_area, NULL);
>> +        if ( ret )
>> +            return ret;
>> +    }
>> +    else
>> +        cd_mfn = page_to_mfn(cd_area->pg);
> 
> Everything to this point seems to be non mem-sharing/forking related. Could
> these live somewhere else? There must be some other place where allocating
> these areas happens already for non-fork VMs so it would make sense to just
> refactor that code to be callable from here.

It is the "copy" aspect with makes this mem-sharing (or really fork)
specific. Plus in the end this is no different from what you have
there right now for copying the vCPU info area. In the final patch
that other code gets removed by re-using the code here.

I also haven't been able to spot anything that could be factored
out (and one might expect that if there was something, then the vCPU
info area copying should also already have used it). map_guest_area()
is all that is used for other purposes as well.

>> +
>> +    copy_domain_page(cd_mfn, d_mfn);
>> +
>> +    return 0;
>> +}
>> +
>>  static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
>>  {
>>      struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
>> @@ -1745,6 +1804,16 @@ static int copy_vcpu_settings(struct dom
>>              copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
>>          }
>>
>> +        /* Same for the (physically registered) runstate and time info areas. */
>> +        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
>> +                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
>> +        if ( ret )
>> +            return ret;
>> +        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
>> +                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
>> +        if ( ret )
>> +            return ret;
>> +
>>          ret = copy_vpmu(d_vcpu, cd_vcpu);
>>          if ( ret )
>>              return ret;
>> @@ -1987,7 +2056,10 @@ int mem_sharing_fork_reset(struct domain
>>
>>   state:
>>      if ( reset_state )
>> +    {
>>          rc = copy_settings(d, pd);
>> +        /* TBD: What to do here with -ERESTART? */
> 
> Where does ERESTART come from?

From map_guest_area()'s attempt to acquire the hypercall deadlock mutex,
in order to then pause the subject vCPU. I suppose that in the forking
case it may already be paused, but then there's no way map_guest_area()
could know. Looking at the pause count is fragile, as there's no
guarantee that the vCPU won't be unpaused while we're still doing work on
it. Hence I view such checks as only suitable for assertions.
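
To illustrate the pattern being described (a sketch only, not a verbatim
quote of map_guest_area(); take the exact names with a grain of salt):

    /* Pausing the subject vCPU has to be serialised against other pausers
     * to avoid deadlock, hence the hypercall deadlock mutex.  It is only
     * try-locked; on contention the hypercall is asked to restart. */
    if ( !spin_trylock(&current->domain->hypercall_deadlock_mutex) )
        return -ERESTART;

    vcpu_pause(v);          /* v is the subject (here: the fork's) vCPU */

    /* ... switch the mapping while v is guaranteed not to be running ... */

    vcpu_unpause(v);
    spin_unlock(&current->domain->hypercall_deadlock_mutex);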

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-23 16:24     ` Jan Beulich
@ 2023-01-23 18:32       ` Tamas K Lengyel
  2023-01-24 11:19         ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Tamas K Lengyel @ 2023-01-23 18:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 23.01.2023 17:09, Tamas K Lengyel wrote:
> > On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> --- a/xen/arch/x86/mm/mem_sharing.c
> >> +++ b/xen/arch/x86/mm/mem_sharing.c
> >> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
> >>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>  }
> >>
> >> +static int copy_guest_area(struct guest_area *cd_area,
> >> +                           const struct guest_area *d_area,
> >> +                           struct vcpu *cd_vcpu,
> >> +                           const struct domain *d)
> >> +{
> >> +    mfn_t d_mfn, cd_mfn;
> >> +
> >> +    if ( !d_area->pg )
> >> +        return 0;
> >> +
> >> +    d_mfn = page_to_mfn(d_area->pg);
> >> +
> >> +    /* Allocate & map a page for the area if it hasn't been already. */
> >> +    if ( !cd_area->pg )
> >> +    {
> >> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >> +        p2m_type_t p2mt;
> >> +        p2m_access_t p2ma;
> >> +        unsigned int offset;
> >> +        int ret;
> >> +
> >> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >> +        {
> >> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >> +
> >> +            if ( !pg )
> >> +                return -ENOMEM;
> >> +
> >> +            cd_mfn = page_to_mfn(pg);
> >> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >> +
> >> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >> +                                 p2m->default_access, -1);
> >> +            if ( ret )
> >> +                return ret;
> >> +        }
> >> +        else if ( p2mt != p2m_ram_rw )
> >> +            return -EBUSY;
> >> +
> >> +        /*
> >> +         * Simply specify the entire range up to the end of the page. All the
> >> +         * function uses it for is a check for not crossing page boundaries.
> >> +         */
> >> +        offset = PAGE_OFFSET(d_area->map);
> >> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> >> +                             PAGE_SIZE - offset, cd_area, NULL);
> >> +        if ( ret )
> >> +            return ret;
> >> +    }
> >> +    else
> >> +        cd_mfn = page_to_mfn(cd_area->pg);
> >
> > Everything to this point seems to be non mem-sharing/forking related. Could
> > these live somewhere else? There must be some other place where allocating
> > these areas happens already for non-fork VMs so it would make sense to just
> > refactor that code to be callable from here.
>
> It is the "copy" aspect with makes this mem-sharing (or really fork)
> specific. Plus in the end this is no different from what you have
> there right now for copying the vCPU info area. In the final patch
> that other code gets removed by re-using the code here.

Yes, the copy part is fork-specific. Arguably if there was a way to do the
allocation of the page for vcpu_info I would prefer that being elsewhere,
but while the only requirement is allocate-page and copy from parent I'm OK
with that logic being in here because it's really straightforward. But now
you also do extra sanity checks here which are harder to comprehend in this
context alone. What if extra sanity checks will be needed in the future? Or
the sanity checks in the future diverge from where this happens for normal
VMs because someone overlooks this needing to be synched here too?

> I also haven't been able to spot anything that could be factored
> out (and one might expect that if there was something, then the vCPU
> info area copying should also already have used it). map_guest_area()
> is all that is used for other purposes as well.

Well, there must be a location where all this happens for normal VMs as
well, no? Why not factor that code so that it can be called from here, so
that we don't have to track sanity check requirements in two different
locations? Or for normal VMs that sanity checking bit isn't required? If
so, why?

> >> +
> >> +    copy_domain_page(cd_mfn, d_mfn);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>  static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
> >>  {
> >>      struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
> >> @@ -1745,6 +1804,16 @@ static int copy_vcpu_settings(struct dom
> >>              copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
> >>          }
> >>
> >> +        /* Same for the (physically registered) runstate and time info areas. */
> >> +        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
> >> +                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
> >> +        if ( ret )
> >> +            return ret;
> >> +        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
> >> +                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
> >> +        if ( ret )
> >> +            return ret;
> >> +
> >>          ret = copy_vpmu(d_vcpu, cd_vcpu);
> >>          if ( ret )
> >>              return ret;
> >> @@ -1987,7 +2056,10 @@ int mem_sharing_fork_reset(struct domain
> >>
> >>   state:
> >>      if ( reset_state )
> >> +    {
> >>          rc = copy_settings(d, pd);
> >> +        /* TBD: What to do here with -ERESTART? */
> >
> > Where does ERESTART come from?
>
> From map_guest_area()'s attempt to acquire the hypercall deadlock mutex,
> in order to then pause the subject vCPU. I suppose that in the forking
> case it may already be paused, but then there's no way map_guest_area()
> could know. Looking at the pause count is fragile, as there's no
> guarantee that the vCPU won't be unpaused while we're still doing work on
> it. Hence I view such checks as only suitable for assertions.

Since map_guest_area is only used to sanity check, and it only happens when
the page is being set up for the fork, why can't the sanity check be done on
the parent? The parent is guaranteed to be paused when forks are active, so
there is no ERESTART concern, and from the looks of it, any concern the
sanity check is looking for would be visible on the parent just as well as
on the fork. But I would like to understand why that sanity checking is
required in the first place.

Thanks,
Tamas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-23 18:32       ` Tamas K Lengyel
@ 2023-01-24 11:19         ` Jan Beulich
  2023-01-25 15:34           ` Tamas K Lengyel
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-01-24 11:19 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 23.01.2023 19:32, Tamas K Lengyel wrote:
> On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 23.01.2023 17:09, Tamas K Lengyel wrote:
>>> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> --- a/xen/arch/x86/mm/mem_sharing.c
>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
>>>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>>>  }
>>>>
>>>> +static int copy_guest_area(struct guest_area *cd_area,
>>>> +                           const struct guest_area *d_area,
>>>> +                           struct vcpu *cd_vcpu,
>>>> +                           const struct domain *d)
>>>> +{
>>>> +    mfn_t d_mfn, cd_mfn;
>>>> +
>>>> +    if ( !d_area->pg )
>>>> +        return 0;
>>>> +
>>>> +    d_mfn = page_to_mfn(d_area->pg);
>>>> +
>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
>>>> +    if ( !cd_area->pg )
>>>> +    {
>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>>>> +        p2m_type_t p2mt;
>>>> +        p2m_access_t p2ma;
>>>> +        unsigned int offset;
>>>> +        int ret;
>>>> +
>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>>>> +        {
>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>>>> +
>>>> +            if ( !pg )
>>>> +                return -ENOMEM;
>>>> +
>>>> +            cd_mfn = page_to_mfn(pg);
>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>>>> +
>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>>>> +                                 p2m->default_access, -1);
>>>> +            if ( ret )
>>>> +                return ret;
>>>> +        }
>>>> +        else if ( p2mt != p2m_ram_rw )
>>>> +            return -EBUSY;
>>>> +
>>>> +        /*
>>>> +         * Simply specify the entire range up to the end of the page. All the
>>>> +         * function uses it for is a check for not crossing page boundaries.
>>>> +         */
>>>> +        offset = PAGE_OFFSET(d_area->map);
>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>>>> +                             PAGE_SIZE - offset, cd_area, NULL);
>>>> +        if ( ret )
>>>> +            return ret;
>>>> +    }
>>>> +    else
>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
>>>
>>> Everything to this point seems to be non mem-sharing/forking related. Could
>>> these live somewhere else? There must be some other place where allocating
>>> these areas happens already for non-fork VMs so it would make sense to just
>>> refactor that code to be callable from here.
>>
>> It is the "copy" aspect with makes this mem-sharing (or really fork)
>> specific. Plus in the end this is no different from what you have
>> there right now for copying the vCPU info area. In the final patch
>> that other code gets removed by re-using the code here.
> 
> Yes, the copy part is fork-specific. Arguably if there was a way to do the
> allocation of the page for vcpu_info I would prefer that being elsewhere,
> but while the only requirement is allocate-page and copy from parent I'm OK
> with that logic being in here because it's really straightforward. But now
> you also do extra sanity checks here which are harder to comprehend in this
> context alone.

What sanity checks are you talking about (also below, where you claim
map_guest_area() would be used only to sanity check)?

> What if extra sanity checks will be needed in the future? Or
> the sanity checks in the future diverge from where this happens for normal
> VMs because someone overlooks this needing to be synched here too?
> 
>> I also haven't been able to spot anything that could be factored
>> out (and one might expect that if there was something, then the vCPU
>> info area copying should also already have used it). map_guest_area()
>> is all that is used for other purposes as well.
> 
> Well, there must be a location where all this happens for normal VMs as
> well, no?

That's map_guest_area(). What is needed here but not elsewhere is the
populating of the GFN underlying the to-be-mapped area. That's the code
being added here, mirroring what you need to do for the vCPU info page.
Similar code isn't needed elsewhere because the guest invoked operation
is purely a "map" - the underlying pages are already expected to be
populated (which of course we check, or else we wouldn't know what page
to actually map).

> Why not factor that code so that it can be called from here, so
> that we don't have to track sanity check requirements in two different
> locations? Or for normal VMs that sanity checking bit isn't required? If
> so, why?

As per above, I'm afraid that I'm lost with these questions. I simply
don't know what you're talking about.

>>>> +
>>>> +    copy_domain_page(cd_mfn, d_mfn);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>  static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
>>>>  {
>>>>      struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
>>>> @@ -1745,6 +1804,16 @@ static int copy_vcpu_settings(struct dom
>>>>              copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
>>>>          }
>>>>
>>>> +        /* Same for the (physically registered) runstate and time info areas. */
>>>> +        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
>>>> +                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
>>>> +        if ( ret )
>>>> +            return ret;
>>>> +        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
>>>> +                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
>>>> +        if ( ret )
>>>> +            return ret;
>>>> +
>>>>          ret = copy_vpmu(d_vcpu, cd_vcpu);
>>>>          if ( ret )
>>>>              return ret;
>>>> @@ -1987,7 +2056,10 @@ int mem_sharing_fork_reset(struct domain
>>>>
>>>>   state:
>>>>      if ( reset_state )
>>>> +    {
>>>>          rc = copy_settings(d, pd);
>>>> +        /* TBD: What to do here with -ERESTART? */
>>>
>>> Where does ERESTART come from?
>>
>> From map_guest_area()'s attempt to acquire the hypercall deadlock mutex,
>> in order to then pause the subject vCPU. I suppose that in the forking
>> case it may already be paused, but then there's no way map_guest_area()
>> could know. Looking at the pause count is fragile, as there's no
>> guarantee that the vCPU won't be unpaused while we're still doing work on
>> it. Hence I view such checks as only suitable for assertions.
> 
> Since map_guest_area is only used to sanity check, and it only happens when
> the page is being set up for the fork, why can't the sanity check be done on
> the parent?

As above, I'm afraid I simply don't understand what you're asking.

> The parent is guaranteed to be paused when forks are active, so
> there is no ERESTART concern, and from the looks of it, any concern the
> sanity check is looking for would be visible on the parent just as well as
> on the fork.

The parent being paused isn't of interest to map_guest_area(). It's the
subject vcpu (i.e. in the forked instance) where we require this. Thinking
of it - the forked domain wasn't started yet, was it? We could then avoid
the pausing (and the acquiring of the hypercall deadlock mutex) based on
->creation_finished still being "false", or even simply based on
v->domain != current->domain. Then there wouldn't be any chance anymore of
-ERESTART making it here.
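
A minimal sketch of that idea (hypothetical; neither the condition nor the
placement are what the series currently does):

    bool paused = false;

    /* A remote subject vCPU here means a just-created fork whose domain
     * has not run yet, so no pausing is needed, and -ERESTART can then no
     * longer surface from this path. */
    if ( v->domain == current->domain )
    {
        if ( !spin_trylock(&current->domain->hypercall_deadlock_mutex) )
            return -ERESTART;
        vcpu_pause(v);
        paused = true;
    }

    /* ... perform the actual (un)mapping ... */

    if ( paused )
    {
        vcpu_unpause(v);
        spin_unlock(&current->domain->hypercall_deadlock_mutex);
    }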

> But I would like to understand why that sanity checking
> is required in the first place.

See further up.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-24 11:19         ` Jan Beulich
@ 2023-01-25 15:34           ` Tamas K Lengyel
  2023-01-26  8:13             ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Tamas K Lengyel @ 2023-01-25 15:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


On Tue, Jan 24, 2023 at 6:19 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 23.01.2023 19:32, Tamas K Lengyel wrote:
> > On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 23.01.2023 17:09, Tamas K Lengyel wrote:
> >>> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> --- a/xen/arch/x86/mm/mem_sharing.c
> >>>> +++ b/xen/arch/x86/mm/mem_sharing.c
> >>>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
> >>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>>>  }
> >>>>
> >>>> +static int copy_guest_area(struct guest_area *cd_area,
> >>>> +                           const struct guest_area *d_area,
> >>>> +                           struct vcpu *cd_vcpu,
> >>>> +                           const struct domain *d)
> >>>> +{
> >>>> +    mfn_t d_mfn, cd_mfn;
> >>>> +
> >>>> +    if ( !d_area->pg )
> >>>> +        return 0;
> >>>> +
> >>>> +    d_mfn = page_to_mfn(d_area->pg);
> >>>> +
> >>>> +    /* Allocate & map a page for the area if it hasn't been already. */
> >>>> +    if ( !cd_area->pg )
> >>>> +    {
> >>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >>>> +        p2m_type_t p2mt;
> >>>> +        p2m_access_t p2ma;
> >>>> +        unsigned int offset;
> >>>> +        int ret;
> >>>> +
> >>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >>>> +        {
> >>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >>>> +
> >>>> +            if ( !pg )
> >>>> +                return -ENOMEM;
> >>>> +
> >>>> +            cd_mfn = page_to_mfn(pg);
> >>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >>>> +
> >>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >>>> +                                 p2m->default_access, -1);
> >>>> +            if ( ret )
> >>>> +                return ret;
> >>>> +        }
> >>>> +        else if ( p2mt != p2m_ram_rw )
> >>>> +            return -EBUSY;
> >>>> +
> >>>> +        /*
> >>>> +         * Simply specify the entire range up to the end of the page. All the
> >>>> +         * function uses it for is a check for not crossing page boundaries.
> >>>> +         */
> >>>> +        offset = PAGE_OFFSET(d_area->map);
> >>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> >>>> +                             PAGE_SIZE - offset, cd_area, NULL);
> >>>> +        if ( ret )
> >>>> +            return ret;
> >>>> +    }
> >>>> +    else
> >>>> +        cd_mfn = page_to_mfn(cd_area->pg);
> >>>
> >>> Everything to this point seems to be non mem-sharing/forking related. Could
> >>> these live somewhere else? There must be some other place where allocating
> >>> these areas happens already for non-fork VMs so it would make sense to just
> >>> refactor that code to be callable from here.
> >>
> >> It is the "copy" aspect with makes this mem-sharing (or really fork)
> >> specific. Plus in the end this is no different from what you have
> >> there right now for copying the vCPU info area. In the final patch
> >> that other code gets removed by re-using the code here.
> >
> >>> Yes, the copy part is fork-specific. Arguably if there was a way to do the
> >>> allocation of the page for vcpu_info I would prefer that being elsewhere,
> >>> but while the only requirement is allocate-page and copy from parent I'm OK
> >>> with that logic being in here because it's really straightforward. But now
> >>> you also do extra sanity checks here which are harder to comprehend in this
> >>> context alone.
>
> What sanity checks are you talking about (also below, where you claim
> map_guest_area() would be used only to sanity check)?

Did I misread your comment above "All the function uses it for is a check
for not crossing page boundaries"? That sounds to me like a simple sanity
check, unclear why it matters though and why only for forks.

>
> > What if extra sanity checks will be needed in the future? Or
> > the sanity checks in the future diverge from where this happens for normal
> > VMs because someone overlooks this needing to be synched here too?
> >
> >> I also haven't been able to spot anything that could be factored
> >> out (and one might expect that if there was something, then the vCPU
> >> info area copying should also already have used it). map_guest_area()
> >> is all that is used for other purposes as well.
> >
> > Well, there must be a location where all this happens for normal VMs as
> > well, no?
>
> That's map_guest_area(). What is needed here but not elsewhere is the
> populating of the GFN underlying the to-be-mapped area. That's the code
> being added here, mirroring what you need to do for the vCPU info page.
> Similar code isn't needed elsewhere because the guest invoked operation
> is purely a "map" - the underlying pages are already expected to be
> populated (which of course we check, or else we wouldn't know what page
> to actually map).

Populated by what and when?

>
> > Why not factor that code so that it can be called from here, so
> > that we don't have to track sanity check requirements in two different
> > locations? Or for normal VMs that sanity checking bit isn't required? If
> > so, why?
>
> As per above, I'm afraid that I'm lost with these questions. I simply
> don't know what you're talking about.

You are adding code here that allocates memory and copies the content of
similarly allocated memory from the parent. You perform extra sanity checks
for unknown reasons that seem to be only needed here. It is unclear why you
are doing that, and why the same code-paths that allocate this memory for
the parent can't simply be reused, so that the only thing left to do here
would be to copy the content from the parent?

Tamas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-25 15:34           ` Tamas K Lengyel
@ 2023-01-26  8:13             ` Jan Beulich
  2023-01-26 15:41               ` Tamas K Lengyel
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-01-26  8:13 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 25.01.2023 16:34, Tamas K Lengyel wrote:
> On Tue, Jan 24, 2023 at 6:19 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 23.01.2023 19:32, Tamas K Lengyel wrote:
>>> On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 23.01.2023 17:09, Tamas K Lengyel wrote:
>>>>> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>> --- a/xen/arch/x86/mm/mem_sharing.c
>>>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
>>>>>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
>>>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>>>>>  }
>>>>>>
>>>>>> +static int copy_guest_area(struct guest_area *cd_area,
>>>>>> +                           const struct guest_area *d_area,
>>>>>> +                           struct vcpu *cd_vcpu,
>>>>>> +                           const struct domain *d)
>>>>>> +{
>>>>>> +    mfn_t d_mfn, cd_mfn;
>>>>>> +
>>>>>> +    if ( !d_area->pg )
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    d_mfn = page_to_mfn(d_area->pg);
>>>>>> +
>>>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
>>>>>> +    if ( !cd_area->pg )
>>>>>> +    {
>>>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>>>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>>>>>> +        p2m_type_t p2mt;
>>>>>> +        p2m_access_t p2ma;
>>>>>> +        unsigned int offset;
>>>>>> +        int ret;
>>>>>> +
>>>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>>>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>>>>>> +        {
>>>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>>>>>> +
>>>>>> +            if ( !pg )
>>>>>> +                return -ENOMEM;
>>>>>> +
>>>>>> +            cd_mfn = page_to_mfn(pg);
>>>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>>>>>> +
>>>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>>>>>> +                                 p2m->default_access, -1);
>>>>>> +            if ( ret )
>>>>>> +                return ret;
>>>>>> +        }
>>>>>> +        else if ( p2mt != p2m_ram_rw )
>>>>>> +            return -EBUSY;
>>>>>> +
>>>>>> +        /*
>>>>>> +         * Simply specify the entire range up to the end of the page. All the
>>>>>> +         * function uses it for is a check for not crossing page boundaries.
>>>>>> +         */
>>>>>> +        offset = PAGE_OFFSET(d_area->map);
>>>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>>>>>> +                             PAGE_SIZE - offset, cd_area, NULL);
>>>>>> +        if ( ret )
>>>>>> +            return ret;
>>>>>> +    }
>>>>>> +    else
>>>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
>>>>>
>>>>> Everything to this point seems to be non mem-sharing/forking related. Could
>>>>> these live somewhere else? There must be some other place where allocating
>>>>> these areas happens already for non-fork VMs so it would make sense to just
>>>>> refactor that code to be callable from here.
>>>>
>>>> It is the "copy" aspect with makes this mem-sharing (or really fork)
>>>> specific. Plus in the end this is no different from what you have
>>>> there right now for copying the vCPU info area. In the final patch
>>>> that other code gets removed by re-using the code here.
>>>
>>> Yes, the copy part is fork-specific. Arguably if there was a way to do the
>>> allocation of the page for vcpu_info I would prefer that being elsewhere,
>>> but while the only requirement is allocate-page and copy from parent I'm OK
>>> with that logic being in here because it's really straightforward. But now
>>> you also do extra sanity checks here which are harder to comprehend in this
>>> context alone.
>>
>> What sanity checks are you talking about (also below, where you claim
>> map_guest_area() would be used only to sanity check)?
> 
> Did I misread your comment above "All the function uses it for is a check
> for not crossing page boundaries"? That sounds to me like a simple sanity
> check, unclear why it matters though and why only for forks.

The comment is about the function's use of the range it is being passed.
It doesn't say in any way that the function is doing only sanity checking.
If the comment wording is ambiguous or unclear, I'm happy to take
improvement suggestions.

>>> What if extra sanity checks will be needed in the future? Or
>>> the sanity checks in the future diverge from where this happens for normal
>>> VMs because someone overlooks this needing to be synched here too?
>>>
>>>> I also haven't been able to spot anything that could be factored
>>>> out (and one might expect that if there was something, then the vCPU
>>>> info area copying should also already have used it). map_guest_area()
>>>> is all that is used for other purposes as well.
>>>
>>> Well, there must be a location where all this happens for normal VMs as
>>> well, no?
>>
>> That's map_guest_area(). What is needed here but not elsewhere is the
>> populating of the GFN underlying the to-be-mapped area. That's the code
>> being added here, mirroring what you need to do for the vCPU info page.
>> Similar code isn't needed elsewhere because the guest invoked operation
>> is purely a "map" - the underlying pages are already expected to be
>> populated (which of course we check, or else we wouldn't know what page
>> to actually map).
> 
> Populated by what and when?

Population happens either at domain creation or when the guest is moving
pages around (e.g. during ballooning). What is happening here (also in
the pre-existing code to deal with the vCPU info page) is the minimal
amount of "populate at creation" to meet the prereq for the mapping
operation(s).
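
Condensed from the quoted hunk to show that split (error handling elided;
a sketch, not the literal code):

    /* Minimal "populate at creation": make sure the GFN has a backing
     * page in the fork ... */
    cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
    if ( mfn_eq(cd_mfn, INVALID_MFN) )
    {
        struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);

        cd_mfn = page_to_mfn(pg);
        p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
                       p2m->default_access, -1);
    }

    /* ... and only then the purely-"map" part, which for guest-invoked
     * registrations is all there is, the GFN being populated already. */
    ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
                         PAGE_SIZE - offset, cd_area, NULL);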

>>> Why not factor that code so that it can be called from here, so
>>> that we don't have to track sanity check requirements in two different
>>> locations? Or for normal VMs that sanity checking bit isn't required? If
>>> so, why?
>>
>> As per above, I'm afraid that I'm lost with these questions. I simply
>> don't know what you're talking about.
> 
> You are adding code here that allocates memory and copies the content of
> similarly allocated memory from the parent. You perform extra sanity checks
> for unknown reasons that seem to be only needed here. It is unclear why you
> are doing that, and why the same code-paths that allocate this memory for
> the parent can't simply be reused, so that the only thing left to do here
> would be to copy the content from the parent?

No, I'm not "adding code" in the sense that I read your comments so far.
Such code was already there (and, as pointed out somewhere, in slightly
broken form). Yes, I'm introducing a 2nd instance of this, but just to
then (in the last patch) remove the original (slightly broken) instance.
So across the entire series this is merely code movement (with
adjustments).

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-26  8:13             ` Jan Beulich
@ 2023-01-26 15:41               ` Tamas K Lengyel
  2023-01-26 16:48                 ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Tamas K Lengyel @ 2023-01-26 15:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


On Thu, Jan 26, 2023 at 3:14 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 25.01.2023 16:34, Tamas K Lengyel wrote:
> > On Tue, Jan 24, 2023 at 6:19 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 23.01.2023 19:32, Tamas K Lengyel wrote:
> >>>>> On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> On 23.01.2023 17:09, Tamas K Lengyel wrote:
> >>>>>>> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>>> --- a/xen/arch/x86/mm/mem_sharing.c
> >>>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
> >>>>>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
> >>>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>>>>>  }
> >>>>>>
> >>>>>> +static int copy_guest_area(struct guest_area *cd_area,
> >>>>>> +                           const struct guest_area *d_area,
> >>>>>> +                           struct vcpu *cd_vcpu,
> >>>>>> +                           const struct domain *d)
> >>>>>> +{
> >>>>>> +    mfn_t d_mfn, cd_mfn;
> >>>>>> +
> >>>>>> +    if ( !d_area->pg )
> >>>>>> +        return 0;
> >>>>>> +
> >>>>>> +    d_mfn = page_to_mfn(d_area->pg);
> >>>>>> +
> >>>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
> >>>>>> +    if ( !cd_area->pg )
> >>>>>> +    {
> >>>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >>>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >>>>>> +        p2m_type_t p2mt;
> >>>>>> +        p2m_access_t p2ma;
> >>>>>> +        unsigned int offset;
> >>>>>> +        int ret;
> >>>>>> +
> >>>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >>>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >>>>>> +        {
> >>>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >>>>>> +
> >>>>>> +            if ( !pg )
> >>>>>> +                return -ENOMEM;
> >>>>>> +
> >>>>>> +            cd_mfn = page_to_mfn(pg);
> >>>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >>>>>> +
> >>>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >>>>>> +            if ( ret )
> >>>>>> +                return ret;
> >>>>>> +        }
> >>>>>> +        else if ( p2mt != p2m_ram_rw )
> >>>>>> +            return -EBUSY;
> >>>>>> +
> >>>>>> +        /*
> >>>>>> +         * Simply specify the entire range up to the end of the page. All the
> >>>>>> +         * function uses it for is a check for not crossing page boundaries.
> >>>>>> +         */
> >>>>>> +        offset = PAGE_OFFSET(d_area->map);
> >>>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> >>>>>> +        if ( ret )
> >>>>>> +            return ret;
> >>>>>> +    }
> >>>>>> +    else
> >>>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
> >>>>>
> >>>>> Everything to this point seems to be non mem-sharing/forking related. Could
> >>>>> these live somewhere else? There must be some other place where allocating
> >>>>> these areas happens already for non-fork VMs so it would make sense to just
> >>>>> refactor that code to be callable from here.
> >>>>
> >>>> It is the "copy" aspect with makes this mem-sharing (or really fork)
> >>>> specific. Plus in the end this is no different from what you have
> >>>> there right now for copying the vCPU info area. In the final patch
> >>>> that other code gets removed by re-using the code here.
> >>>
> >>> Yes, the copy part is fork-specific. Arguably if there was a way to do the
> >>> allocation of the page for vcpu_info I would prefer that being elsewhere,
> >>> but while the only requirement is allocate-page and copy from parent I'm OK
> >>> with that logic being in here because it's really straightforward. But now
> >>> you also do extra sanity checks here which are harder to comprehend in this
> >>> context alone.
> >>
> >> What sanity checks are you talking about (also below, where you claim
> >> map_guest_area() would be used only to sanity check)?
> >
> > Did I misread your comment above "All the function uses it for is a check
> > for not crossing page boundaries"? That sounds to me like a simple sanity
> > check, unclear why it matters though and why only for forks.
>
> The comment is about the function's use of the range it is being passed.
> It doesn't say in any way that the function is doing only sanity checking.
> If the comment wording is ambiguous or unclear, I'm happy to take
> improvement suggestions.

Yes, please do, it definitely was confusing while reviewing the patch.

Thanks,
Tamas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-26 15:41               ` Tamas K Lengyel
@ 2023-01-26 16:48                 ` Jan Beulich
  2023-01-26 17:24                   ` Tamas K Lengyel
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-01-26 16:48 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 26.01.2023 16:41, Tamas K Lengyel wrote:
> On Thu, Jan 26, 2023 at 3:14 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 25.01.2023 16:34, Tamas K Lengyel wrote:
>>> On Tue, Jan 24, 2023 at 6:19 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 23.01.2023 19:32, Tamas K Lengyel wrote:
>>>>> On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>> On 23.01.2023 17:09, Tamas K Lengyel wrote:
>>>>>>> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>> --- a/xen/arch/x86/mm/mem_sharing.c
>>>>>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
>>>>>>>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
>>>>>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static int copy_guest_area(struct guest_area *cd_area,
>>>>>>>> +                           const struct guest_area *d_area,
>>>>>>>> +                           struct vcpu *cd_vcpu,
>>>>>>>> +                           const struct domain *d)
>>>>>>>> +{
>>>>>>>> +    mfn_t d_mfn, cd_mfn;
>>>>>>>> +
>>>>>>>> +    if ( !d_area->pg )
>>>>>>>> +        return 0;
>>>>>>>> +
>>>>>>>> +    d_mfn = page_to_mfn(d_area->pg);
>>>>>>>> +
>>>>>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
>>>>>>>> +    if ( !cd_area->pg )
>>>>>>>> +    {
>>>>>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>>>>>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>>>>>>>> +        p2m_type_t p2mt;
>>>>>>>> +        p2m_access_t p2ma;
>>>>>>>> +        unsigned int offset;
>>>>>>>> +        int ret;
>>>>>>>> +
>>>>>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>>>>>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>>>>>>>> +        {
>>>>>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>>>>>>>> +
>>>>>>>> +            if ( !pg )
>>>>>>>> +                return -ENOMEM;
>>>>>>>> +
>>>>>>>> +            cd_mfn = page_to_mfn(pg);
>>>>>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>>>>>>>> +
>>>>>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>>>>>>>> +                                 p2m->default_access, -1);
>>>>>>>> +            if ( ret )
>>>>>>>> +                return ret;
>>>>>>>> +        }
>>>>>>>> +        else if ( p2mt != p2m_ram_rw )
>>>>>>>> +            return -EBUSY;
>>>>>>>> +
>>>>>>>> +        /*
>>>>>>>> +         * Simply specify the entire range up to the end of the page. All the
>>>>>>>> +         * function uses it for is a check for not crossing page boundaries.
>>>>>>>> +         */
>>>>>>>> +        offset = PAGE_OFFSET(d_area->map);
>>>>>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
>>>>>>>> +                             PAGE_SIZE - offset, cd_area, NULL);
>>>>>>>> +        if ( ret )
>>>>>>>> +            return ret;
>>>>>>>> +    }
>>>>>>>> +    else
>>>>>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
>>>>>>>
>>>>>>> Everything to this point seems to be non mem-sharing/forking related. Could
>>>>>>> these live somewhere else? There must be some other place where allocating
>>>>>>> these areas happens already for non-fork VMs so it would make sense to just
>>>>>>> refactor that code to be callable from here.
>>>>>>
>>>>>> It is the "copy" aspect with makes this mem-sharing (or really fork)
>>>>>> specific. Plus in the end this is no different from what you have
>>>>>> there right now for copying the vCPU info area. In the final patch
>>>>>> that other code gets removed by re-using the code here.
>>>>>
>>>>> Yes, the copy part is fork-specific. Arguably if there was a way to do the
>>>>> allocation of the page for vcpu_info I would prefer that being elsewhere,
>>>>> but while the only requirement is allocate-page and copy from parent I'm OK
>>>>> with that logic being in here because it's really straightforward. But now
>>>>> you also do extra sanity checks here which are harder to comprehend in this
>>>>> context alone.
>>>>
>>>> What sanity checks are you talking about (also below, where you claim
>>>> map_guest_area() would be used only to sanity check)?
>>>
>>> Did I misread your comment above "All the function uses it for is a check
>>> for not crossing page boundaries"? That sounds to me like a simple sanity
>>> check, unclear why it matters though and why only for forks.
>>
>> The comment is about the function's use of the range it is being passed.
>> It doesn't say in any way that the function is doing only sanity checking.
>> If the comment wording is ambiguous or unclear, I'm happy to take
>> improvement suggestions.
> 
> Yes, please do, it definitely was confusing while reviewing the patch.

I'm sorry, but what does "please do" mean when I asked for improvement
suggestions? I continue to think the comment is quite clear as is, so
if anything needs adjusting, I'd need to know pretty precisely what it
is that needs adding and/or re-writing. I can't, after all, guess what
your misunderstanding resulted from.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas
  2023-01-26 16:48                 ` Jan Beulich
@ 2023-01-26 17:24                   ` Tamas K Lengyel
  0 siblings, 0 replies; 18+ messages in thread
From: Tamas K Lengyel @ 2023-01-26 17:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné


On Thu, Jan 26, 2023 at 11:48 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 26.01.2023 16:41, Tamas K Lengyel wrote:
> > On Thu, Jan 26, 2023 at 3:14 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 25.01.2023 16:34, Tamas K Lengyel wrote:
> >>> On Tue, Jan 24, 2023 at 6:19 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>
> >>>> On 23.01.2023 19:32, Tamas K Lengyel wrote:
> >>>>> On Mon, Jan 23, 2023 at 11:24 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>>> On 23.01.2023 17:09, Tamas K Lengyel wrote:
> >>>>>>> On Mon, Jan 23, 2023 at 9:55 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>>>>> --- a/xen/arch/x86/mm/mem_sharing.c
> >>>>>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
> >>>>>>>> @@ -1653,6 +1653,65 @@ static void copy_vcpu_nonreg_state(struc
> >>>>>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static int copy_guest_area(struct guest_area *cd_area,
> >>>>>>>> +                           const struct guest_area *d_area,
> >>>>>>>> +                           struct vcpu *cd_vcpu,
> >>>>>>>> +                           const struct domain *d)
> >>>>>>>> +{
> >>>>>>>> +    mfn_t d_mfn, cd_mfn;
> >>>>>>>> +
> >>>>>>>> +    if ( !d_area->pg )
> >>>>>>>> +        return 0;
> >>>>>>>> +
> >>>>>>>> +    d_mfn = page_to_mfn(d_area->pg);
> >>>>>>>> +
> >>>>>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
> >>>>>>>> +    if ( !cd_area->pg )
> >>>>>>>> +    {
> >>>>>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >>>>>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >>>>>>>> +        p2m_type_t p2mt;
> >>>>>>>> +        p2m_access_t p2ma;
> >>>>>>>> +        unsigned int offset;
> >>>>>>>> +        int ret;
> >>>>>>>> +
> >>>>>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >>>>>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >>>>>>>> +        {
> >>>>>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >>>>>>>> +
> >>>>>>>> +            if ( !pg )
> >>>>>>>> +                return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +            cd_mfn = page_to_mfn(pg);
> >>>>>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >>>>>>>> +
> >>>>>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >>>>>>>> +                                 p2m->default_access, -1);
> >>>>>>>> +            if ( ret )
> >>>>>>>> +                return ret;
> >>>>>>>> +        }
> >>>>>>>> +        else if ( p2mt != p2m_ram_rw )
> >>>>>>>> +            return -EBUSY;
> >>>>>>>> +
> >>>>>>>> +        /*
> >>>>>>>> +         * Simply specify the entire range up to the end of the page. All the
> >>>>>>>> +         * function uses it for is a check for not crossing page boundaries.
> >>>>>>>> +         */
> >>>>>>>> +        offset = PAGE_OFFSET(d_area->map);
> >>>>>>>> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> >>>>>>>> +                             PAGE_SIZE - offset, cd_area, NULL);
> >>>>>>>> +        if ( ret )
> >>>>>>>> +            return ret;
> >>>>>>>> +    }
> >>>>>>>> +    else
> >>>>>>>> +        cd_mfn = page_to_mfn(cd_area->pg);
> >>>>>>>
> >>>>>>> Everything to this point seems to be non mem-sharing/forking related. Could
> >>>>>>> these live somewhere else? There must be some other place where allocating
> >>>>>>> these areas happens already for non-fork VMs so it would make sense to just
> >>>>>>> refactor that code to be callable from here.
> >>>>>>
> >>>>>> It is the "copy" aspect with makes this mem-sharing (or really
fork)
> >>>>>> specific. Plus in the end this is no different from what you have
> >>>>>> there right now for copying the vCPU info area. In the final patch
> >>>>>> that other code gets removed by re-using the code here.
> >>>>>
> >>>>> Yes, the copy part is fork-specific. Arguably if there was a way to do the
> >>>>> allocation of the page for vcpu_info I would prefer that being elsewhere,
> >>>>> but while the only requirement is allocate-page and copy from parent I'm OK
> >>>>> with that logic being in here because it's really straightforward. But now
> >>>>> you also do extra sanity checks here which are harder to comprehend in this
> >>>>> context alone.
> >>>>
> >>>> What sanity checks are you talking about (also below, where you claim
> >>>> map_guest_area() would be used only to sanity check)?
> >>>
> >>> Did I misread your comment above "All the function uses it for is a check
> >>> for not crossing page boundaries"? That sounds to me like a simple sanity
> >>> check, unclear why it matters though and why only for forks.
> >>
> >> The comment is about the function's use of the range it is being passed.
> >> It doesn't say in any way that the function is doing only sanity checking.
> >> If the comment wording is ambiguous or unclear, I'm happy to take
> >> improvement suggestions.
> >
> > Yes, please do, it definitely was confusing while reviewing the patch.
>
> I'm sorry, but what does "please do" mean when I asked for improvement
> suggestions? I continue to think the comment is quite clear as is, so
> if anything needs adjusting, I'd need to know pretty precisely what it
> is that needs adding and/or re-writing. I can't, after all, guess what
> your misunderstanding resulted from.

I meant please do add the extra information you just spelled out in your
previous email. "Map the page into the guest. We pass the entire range to
map_guest_area so that it can check that no cross-page mapping is taking
place (in which case it will fail). If no such issue is present the page is
being mapped and made accessible to the guest."

If that's not what the function is doing, please explain clearly what it
will do.
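
For concreteness, the suggested wording dropped into the quoted hunk would
read roughly like this (illustrative only):

    /*
     * Map the page into the guest.  The entire range up to the end of the
     * page is passed so that map_guest_area() can check that no cross-page
     * mapping would result (in which case it fails).  If no such issue is
     * present, the page gets mapped and thereby made accessible to the
     * guest.
     */
    offset = PAGE_OFFSET(d_area->map);
    ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
                         PAGE_SIZE - offset, cd_area, NULL);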

Tamas


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread

Thread overview: 18+ messages
2023-01-23 14:51 [PATCH v2 0/8] runstate/time area registration by (guest) physical address Jan Beulich
2023-01-23 14:53 ` [PATCH RFC v2 1/8] domain: GADDR based shared guest area registration alternative - teardown Jan Beulich
2023-01-23 14:53 ` [PATCH RFC v2 2/8] domain: update GADDR based runstate guest area Jan Beulich
2023-01-23 14:54 ` [PATCH v2 3/8] x86: update GADDR based secondary time area Jan Beulich
2023-01-23 14:55 ` [PATCH v2 4/8] x86/mem-sharing: copy GADDR based shared guest areas Jan Beulich
2023-01-23 16:09   ` Tamas K Lengyel
2023-01-23 16:24     ` Jan Beulich
2023-01-23 18:32       ` Tamas K Lengyel
2023-01-24 11:19         ` Jan Beulich
2023-01-25 15:34           ` Tamas K Lengyel
2023-01-26  8:13             ` Jan Beulich
2023-01-26 15:41               ` Tamas K Lengyel
2023-01-26 16:48                 ` Jan Beulich
2023-01-26 17:24                   ` Tamas K Lengyel
2023-01-23 14:55 ` [PATCH v2 5/8] domain: map/unmap " Jan Beulich
2023-01-23 14:56 ` [PATCH v2 6/8] domain: introduce GADDR based runstate area registration alternative Jan Beulich
2023-01-23 14:56 ` [PATCH v2 7/8] x86: introduce GADDR based secondary time " Jan Beulich
2023-01-23 14:57 ` [PATCH v2 8/8] common: convert vCPU info area registration Jan Beulich
