* [Xen-devel] [PATCH v12 0/3] VM forking
@ 2020-03-23 17:04 Tamas K Lengyel
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: " Tamas K Lengyel
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-23 17:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Tamas K Lengyel, Jan Beulich,
	Anthony PERARD, Julien Grall, Roger Pau Monné

The following series implements VM forking for Intel HVM guests to allow for
the fast creation of identical VMs without the associated high startup costs
of booting or restoring the VM from a savefile.

JIRA issue: https://xenproject.atlassian.net/browse/XEN-89

The fork operation is implemented as part of the "xl fork-vm" command:
    xl fork-vm -C <config> -Q <qemu-save-file> -m <max-vcpus> <parent_domid>
    
By default a fully functional fork is created. However, the user is responsible
for creating the appropriate config file for the fork and for generating the
QEMU save file before the fork-vm call is made. At a minimum the config file
needs to give the fork a new name, but other settings may also require changes.
Certain settings in the config file of both the parent and the fork have to be
left at their defaults. Details are documented; a hypothetical sketch of a fork
config follows.
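
The exact requirements are spelled out in the xl documentation added by patch 3;
the names and values below are purely illustrative:

    # fork.cfg -- hypothetical example; everything not shown must match the parent
    name = "parent-fork-1"      # a new name is mandatory
    type = "hvm"                # must match the parent
    memory = 2048               # must match the parent
    vcpus = 2                   # must match the parent
    disk = [ ... ]              # may differ, e.g. point at a copy of the disk
    vif = [ ... ]               # may differ
    # max_grant_frames, max_maptrack_frames and max_event_channels must be
    # left unset (default) in both the parent's and the fork's config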

The interface also allows splitting the forking into two steps:
    xl fork-vm --launch-dm no \
               -m <max-vcpus> \
               -p <parent_domid>
    xl fork-vm --launch-dm late \
               -C <config_file_for_fork> \
               -Q <qemu_save_file> \
               <fork_domid>

The split creation model is useful when the VM needs to be created as fast as
possible. The forked VM can be unpaused without the device model being launched
and can then be monitored and accessed via VMI. Note however that without its
device model running the fork (depending on what is executing in the VM) is
bound to misbehave or even crash when it tries to access devices that would be
emulated by QEMU. We anticipate that for certain use-cases this would be an
acceptable situation, for example when fuzzing code segments that don't access
such devices.

Launching the device model requires the QEMU Xen savefile to be generated
manually from the parent VM. This can be accomplished simply by connecting to
its QMP socket and issuing the "xen-save-devices-state" command. For example,
using the standard tool socat, the following commands generate the file:
    socat - UNIX-CONNECT:/var/run/xen/qmp-libxl-<parent_domid>
    { "execute": "qmp_capabilities" }
    { "execute": "xen-save-devices-state", \
        "arguments": { "filename": "/path/to/save/qemu_state", \
                        "live": false} }

At runtime the forked VM starts running with an empty p2m which gets lazily
populated when the VM generates EPT faults, similar to how altp2m views are
populated. If the access is a read, the p2m entry is populated with an entry
shared with the parent. For write accesses, or when memory sharing isn't
possible (for example because a third party holds a reference to the page), a
new page is allocated and its contents are copied over from the parent VM.
Forks can themselves be forked if needed, allowing for further memory savings.

A VM fork reset hypercall is also added that allows the fork to be reset to the
state it was in right after forking, also accessible via xl:
    xl fork-vm --fork-reset -p <fork_domid>

This is an optimization for cases where the forks are very short-lived and run
without a device model, so resetting saves some time compared to creating a
brand new fork, provided the fork has not acquired a lot of memory. If a lot of
the fork's memory has been deduplicated from the parent, it is likely faster to
create a new fork from scratch and asynchronously destroy the old one.
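
The same operations are reachable programmatically through the libxc wrappers
added in patch 3 (xc_memshr_fork / xc_memshr_fork_reset). A minimal,
hypothetical fuzzing-style driver, with error handling trimmed and assuming the
empty fork domain was already created with a matching max-vcpus count (e.g. via
"xl fork-vm --launch-dm no"), could look like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <xenctrl.h>

    int main(int argc, char **argv)
    {
        xc_interface *xch;
        uint32_t parent_domid, fork_domid;
        unsigned int i;

        if ( argc != 3 )
            return 1;

        parent_domid = atoi(argv[1]);
        fork_domid = atoi(argv[2]);

        xch = xc_interface_open(NULL, NULL, 0);
        if ( !xch )
            return 1;

        /* Populate the empty fork domain from the (paused) parent */
        if ( xc_memshr_fork(xch, parent_domid, fork_domid) )
            goto err;

        for ( i = 0; i < 100; i++ )
        {
            /* ... unpause the fork and run one short-lived workload ... */

            /* Drop pages the fork allocated and re-copy the parent's state */
            if ( xc_memshr_fork_reset(xch, fork_domid) )
                goto err;
        }

        xc_interface_close(xch);
        return 0;

     err:
        xc_interface_close(xch);
        return 1;
    }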

The series has been tested with Windows VMs and functions as expected. Linux
VMs forked from a running VM will have a frozen VNC screen. At this time Linux
VMs can only be forked with a working device model when the parent VM was
restored from a snapshot using "xl restore -p"; this is a known limitation.
Also note that PVHVM/PVH Linux guests have not been tested. Forking most likely
works, but PV devices and drivers would require additional wiring to set things
up properly, since the guests are unaware of the forking taking place, unlike
the save/restore routine where the guest is made aware of the procedure.

Forking time has been measured at 0.0007s and device model launch at around 1s,
depending largely on the number of devices being emulated. Fork resets have
been measured at 0.0001s under optimal circumstances.

New in v12:
    style cleanups & minor adjustments
    removed continuation for fork reset and added a TODO comment

Patch 1 implements the VM fork
Patch 2 implements the fork reset operation
Patch 3 adds the toolstack-side code implementing VM forking and reset

Tamas K Lengyel (3):
  xen/mem_sharing: VM forking
  x86/mem_sharing: reset a fork
  xen/tools: VM forking toolstack side

 docs/man/xl.1.pod.in              |  44 +++
 tools/libxc/include/xenctrl.h     |  13 +
 tools/libxc/xc_memshr.c           |  22 ++
 tools/libxl/libxl.h               |  11 +
 tools/libxl/libxl_create.c        | 361 +++++++++++++-----------
 tools/libxl/libxl_dm.c            |   2 +-
 tools/libxl/libxl_dom.c           |  43 ++-
 tools/libxl/libxl_internal.h      |   7 +
 tools/libxl/libxl_types.idl       |   1 +
 tools/libxl/libxl_x86.c           |  41 +++
 tools/xl/Makefile                 |   2 +-
 tools/xl/xl.h                     |   5 +
 tools/xl/xl_cmdtable.c            |  15 +
 tools/xl/xl_forkvm.c              | 147 ++++++++++
 tools/xl/xl_vmcontrol.c           |  14 +
 xen/arch/x86/domain.c             |  11 +
 xen/arch/x86/hvm/hvm.c            |   4 +-
 xen/arch/x86/mm/hap/hap.c         |   3 +-
 xen/arch/x86/mm/mem_sharing.c     | 445 ++++++++++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c             |   9 +-
 xen/common/domain.c               |   3 +
 xen/include/asm-x86/hap.h         |   1 +
 xen/include/asm-x86/hvm/hvm.h     |   2 +
 xen/include/asm-x86/mem_sharing.h |  18 ++
 xen/include/public/memory.h       |   6 +
 xen/include/xen/sched.h           |   5 +
 26 files changed, 1064 insertions(+), 171 deletions(-)
 create mode 100644 tools/xl/xl_forkvm.c

-- 
2.20.1




* [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-23 17:04 [Xen-devel] [PATCH v12 0/3] VM forking Tamas K Lengyel
@ 2020-03-23 17:04 ` Tamas K Lengyel
  2020-03-25 15:47   ` Roger Pau Monné
  2020-03-26 12:33   ` Jan Beulich
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork Tamas K Lengyel
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 3/3] xen/tools: VM forking toolstack side Tamas K Lengyel
  2 siblings, 2 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-23 17:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Tamas K Lengyel, Jan Beulich,
	Julien Grall, Roger Pau Monné

VM forking is the process of creating a domain with an empty memory space and a
specified parent domain from which to populate the memory when necessary. For
the new domain to be functional the VM state is copied over as part of the fork
operation (HVM params, HAP allocation, etc.).

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
---
v12: Minor style adjustments Jan pointed out
     Convert mem_sharing_is_fork to inline function
v11: Fully copy vcpu_info pages
     Setup vcpu_runstate for forks
     Added TODO note for PV timers
     Copy shared_info page
     Add copy_settings function, to be shared with fork_reset in the next patch
---
 xen/arch/x86/domain.c             |  11 +
 xen/arch/x86/hvm/hvm.c            |   4 +-
 xen/arch/x86/mm/hap/hap.c         |   3 +-
 xen/arch/x86/mm/mem_sharing.c     | 368 ++++++++++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c             |   9 +-
 xen/common/domain.c               |   3 +
 xen/include/asm-x86/hap.h         |   1 +
 xen/include/asm-x86/hvm/hvm.h     |   2 +
 xen/include/asm-x86/mem_sharing.h |  18 ++
 xen/include/public/memory.h       |   5 +
 xen/include/xen/sched.h           |   5 +
 11 files changed, 424 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index caf2ecad7e..11d3c2216e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2202,6 +2202,17 @@ int domain_relinquish_resources(struct domain *d)
             ret = relinquish_shared_pages(d);
             if ( ret )
                 return ret;
+
+            /*
+             * If the domain is forked, decrement the parent's pause count
+             * and release the domain.
+             */
+            if ( mem_sharing_is_fork(d) )
+            {
+                domain_unpause(d->parent);
+                put_domain(d->parent);
+                d->parent = NULL;
+            }
         }
 #endif
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a3d115b650..304b3d1562 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1917,7 +1917,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     }
 #endif
 
-    /* Spurious fault? PoD and log-dirty also take this path. */
+    /* Spurious fault? PoD, log-dirty and VM forking also take this path. */
     if ( p2m_is_ram(p2mt) )
     {
         rc = 1;
@@ -4377,7 +4377,7 @@ static int hvm_allow_get_param(struct domain *d,
     return rc;
 }
 
-static int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value)
+int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value)
 {
     int rc;
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index a6d5e39b02..814d0c3253 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -321,8 +321,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
 }
 
 /* Return the size of the pool, rounded up to the nearest MB */
-static unsigned int
-hap_get_allocation(struct domain *d)
+unsigned int hap_get_allocation(struct domain *d)
 {
     unsigned int pg = d->arch.paging.hap.total_pages
         + d->arch.paging.hap.p2m_pages;
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 3835bc928f..23deeddff2 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -22,6 +22,7 @@
 
 #include <xen/types.h>
 #include <xen/domain_page.h>
+#include <xen/event.h>
 #include <xen/spinlock.h>
 #include <xen/rwlock.h>
 #include <xen/mm.h>
@@ -36,6 +37,8 @@
 #include <asm/altp2m.h>
 #include <asm/atomic.h>
 #include <asm/event.h>
+#include <asm/hap.h>
+#include <asm/hvm/hvm.h>
 #include <xsm/xsm.h>
 
 #include "mm-locks.h"
@@ -1444,6 +1447,334 @@ static inline int mem_sharing_control(struct domain *d, bool enable)
     return 0;
 }
 
+/*
+ * Forking a page only gets called when the VM faults due to no entry being
+ * in the EPT for the access. Depending on the type of access we either
+ * populate the physmap with a shared entry for read-only access or
+ * fork the page if it's a write access.
+ *
+ * The client p2m is already locked so we only need to lock
+ * the parent's here.
+ */
+int mem_sharing_fork_page(struct domain *d, gfn_t gfn, bool unsharing)
+{
+    int rc = -ENOENT;
+    shr_handle_t handle;
+    struct domain *parent = d->parent;
+    struct p2m_domain *p2m;
+    unsigned long gfn_l = gfn_x(gfn);
+    mfn_t mfn, new_mfn;
+    p2m_type_t p2mt;
+    struct page_info *page;
+
+    if ( !mem_sharing_is_fork(d) )
+        return -ENOENT;
+
+    if ( !unsharing )
+    {
+        /* For read-only accesses we just add a shared entry to the physmap */
+        while ( parent )
+        {
+            if ( !(rc = nominate_page(parent, gfn, 0, &handle)) )
+                break;
+
+            parent = parent->parent;
+        }
+
+        if ( !rc )
+        {
+            /* The client's p2m is already locked */
+            struct p2m_domain *pp2m = p2m_get_hostp2m(parent);
+
+            p2m_lock(pp2m);
+            rc = add_to_physmap(parent, gfn_l, handle, d, gfn_l, false);
+            p2m_unlock(pp2m);
+
+            if ( !rc )
+                return 0;
+        }
+    }
+
+    /*
+     * If it's a write access (ie. unsharing) or if adding a shared entry to
+     * the physmap failed we'll fork the page directly.
+     */
+    p2m = p2m_get_hostp2m(d);
+    parent = d->parent;
+
+    while ( parent )
+    {
+        mfn = get_gfn_query(parent, gfn_l, &p2mt);
+
+        /*
+         * We can't fork grant memory from the parent, only regular ram.
+         */
+        if ( mfn_valid(mfn) && p2m_is_ram(p2mt) )
+            break;
+
+        put_gfn(parent, gfn_l);
+        parent = parent->parent;
+    }
+
+    if ( !parent )
+        return -ENOENT;
+
+    if ( !(page = alloc_domheap_page(d, 0)) )
+    {
+        put_gfn(parent, gfn_l);
+        return -ENOMEM;
+    }
+
+    new_mfn = page_to_mfn(page);
+    copy_domain_page(new_mfn, mfn);
+    set_gpfn_from_mfn(mfn_x(new_mfn), gfn_l);
+
+    put_gfn(parent, gfn_l);
+
+    return p2m->set_entry(p2m, gfn, new_mfn, PAGE_ORDER_4K, p2m_ram_rw,
+                          p2m->default_access, -1);
+}
+
+static int bring_up_vcpus(struct domain *cd, struct domain *d)
+{
+    unsigned int i;
+    int ret = -EINVAL;
+
+    if ( d->max_vcpus != cd->max_vcpus ||
+        (ret = cpupool_move_domain(cd, d->cpupool)) )
+        return ret;
+
+    for ( i = 0; i < cd->max_vcpus; i++ )
+    {
+        if ( !d->vcpu[i] || cd->vcpu[i] )
+            continue;
+
+        if ( !vcpu_create(cd, i) )
+            return -EINVAL;
+    }
+
+    domain_update_node_affinity(cd);
+    return 0;
+}
+
+static int copy_vcpu_settings(struct domain *cd, struct domain *d)
+{
+    unsigned int i;
+    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
+    int ret = -EINVAL;
+
+    for ( i = 0; i < cd->max_vcpus; i++ )
+    {
+        const struct vcpu *d_vcpu = d->vcpu[i];
+        struct vcpu *cd_vcpu = cd->vcpu[i];
+        struct vcpu_runstate_info runstate;
+        mfn_t vcpu_info_mfn;
+
+        if ( !d_vcpu || !cd_vcpu )
+            continue;
+
+        /*
+         * Copy & map in the vcpu_info page if the guest uses one
+         */
+        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
+        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
+        {
+            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
+
+            /*
+             * Allocate & map the page for it if it hasn't been already
+             */
+            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
+            {
+                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
+                unsigned long gfn_l = gfn_x(gfn);
+                struct page_info *page;
+
+                if ( !(page = alloc_domheap_page(cd, 0)) )
+                    return -ENOMEM;
+
+                new_vcpu_info_mfn = page_to_mfn(page);
+                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
+
+                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn, PAGE_ORDER_4K,
+                                     p2m_ram_rw, p2m->default_access, -1);
+                if ( ret )
+                    return ret;
+
+                ret = map_vcpu_info(cd_vcpu, gfn_l,
+                                    d_vcpu->vcpu_info_offset);
+                if ( ret )
+                    return ret;
+            }
+
+            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
+        }
+
+        /*
+         * Setup the vCPU runstate area
+         */
+        if ( guest_handle_is_null(runstate_guest(cd_vcpu)) )
+        {
+            runstate_guest(cd_vcpu) = runstate_guest(d_vcpu);
+            vcpu_runstate_get(cd_vcpu, &runstate);
+            __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1);
+        }
+
+        /*
+         * TODO: to support VMs with PV interfaces copy additional
+         * settings here, such as PV timers.
+         */
+    }
+
+    return 0;
+}
+
+static int fork_hap_allocation(struct domain *cd, struct domain *d)
+{
+    int rc;
+    bool preempted;
+    unsigned long mb = hap_get_allocation(d);
+
+    if ( mb == hap_get_allocation(cd) )
+        return 0;
+
+    paging_lock(cd);
+    rc = hap_set_allocation(cd, mb << (20 - PAGE_SHIFT), &preempted);
+    paging_unlock(cd);
+
+    return preempted ? -ERESTART : rc;
+}
+
+static void copy_tsc(struct domain *cd, struct domain *d)
+{
+    uint32_t tsc_mode;
+    uint32_t gtsc_khz;
+    uint32_t incarnation;
+    uint64_t elapsed_nsec;
+
+    tsc_get_info(d, &tsc_mode, &elapsed_nsec, &gtsc_khz, &incarnation);
+    /* Don't bump incarnation on set */
+    tsc_set_info(cd, tsc_mode, elapsed_nsec, gtsc_khz, incarnation - 1);
+}
+
+static int copy_special_pages(struct domain *cd, struct domain *d)
+{
+    mfn_t new_mfn, old_mfn;
+    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
+    static const unsigned int params[] =
+    {
+        HVM_PARAM_STORE_PFN,
+        HVM_PARAM_IOREQ_PFN,
+        HVM_PARAM_BUFIOREQ_PFN,
+        HVM_PARAM_CONSOLE_PFN
+    };
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < 4; i++ )
+    {
+        p2m_type_t t;
+        uint64_t value = 0;
+        struct page_info *page;
+
+        if ( hvm_get_param(cd, params[i], &value) || !value )
+            continue;
+
+        old_mfn = get_gfn_query_unlocked(d, value, &t);
+        new_mfn = get_gfn_query_unlocked(cd, value, &t);
+
+        /*
+         * Allocate the page and map it in if it's not present
+         */
+        if ( mfn_eq(new_mfn, INVALID_MFN) )
+        {
+            if ( !(page = alloc_domheap_page(cd, 0)) )
+                return -ENOMEM;
+
+            new_mfn = page_to_mfn(page);
+            set_gpfn_from_mfn(mfn_x(new_mfn), value);
+
+            rc = p2m->set_entry(p2m, _gfn(value), new_mfn, PAGE_ORDER_4K,
+                                p2m_ram_rw, p2m->default_access, -1);
+            if ( rc )
+                return rc;
+        }
+
+        copy_domain_page(new_mfn, old_mfn);
+    }
+
+    old_mfn = _mfn(virt_to_mfn(d->shared_info));
+    new_mfn = _mfn(virt_to_mfn(cd->shared_info));
+    copy_domain_page(new_mfn, old_mfn);
+
+    return 0;
+}
+
+static int copy_settings(struct domain *cd, struct domain *d)
+{
+    int rc;
+
+    if ( (rc = copy_vcpu_settings(cd, d)) )
+        return rc;
+
+    if ( (rc = hvm_copy_context_and_params(cd, d)) )
+        return rc;
+
+    if ( (rc = copy_special_pages(cd, d)) )
+        return rc;
+
+    copy_tsc(cd, d);
+
+    return rc;
+}
+
+static int fork(struct domain *cd, struct domain *d)
+{
+    int rc = -EBUSY;
+
+    if ( !cd->controller_pause_count )
+        return rc;
+
+    /*
+     * We only want to get and pause the parent once, not each time this
+     * operation is restarted due to preemption.
+     */
+    if ( !cd->parent_paused )
+    {
+        if ( !get_domain(d) )
+        {
+            ASSERT_UNREACHABLE();
+            return -EBUSY;
+        }
+
+        domain_pause(d);
+        cd->parent_paused = true;
+        cd->max_pages = d->max_pages;
+    }
+
+    /* this is preemptible so it's the first to get done */
+    if ( (rc = fork_hap_allocation(cd, d)) )
+        goto done;
+
+    if ( (rc = bring_up_vcpus(cd, d)) )
+        goto done;
+
+    if ( (rc = copy_settings(cd, d)) )
+        goto done;
+
+    cd->parent = d;
+
+ done:
+    if ( rc && rc != -ERESTART )
+    {
+        domain_unpause(d);
+        put_domain(d);
+        cd->parent_paused = false;
+    }
+
+    return rc;
+}
+
 int mem_sharing_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_sharing_op_t) arg)
 {
     int rc;
@@ -1698,6 +2029,43 @@ int mem_sharing_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_sharing_op_t) arg)
         rc = debug_gref(d, mso.u.debug.u.gref);
         break;
 
+    case XENMEM_sharing_op_fork:
+    {
+        struct domain *pd;
+
+        rc = -EINVAL;
+        if ( mso.u.fork.pad[0] || mso.u.fork.pad[1] ||
+             mso.u.fork.pad[2] )
+            goto out;
+
+        rc = rcu_lock_live_remote_domain_by_id(mso.u.fork.parent_domain,
+                                               &pd);
+        if ( rc )
+            goto out;
+
+        rc = -EINVAL;
+        if ( pd->max_vcpus != d->max_vcpus )
+        {
+            rcu_unlock_domain(pd);
+            goto out;
+        }
+
+        if ( !mem_sharing_enabled(pd) && (rc = mem_sharing_control(pd, true)) )
+        {
+            rcu_unlock_domain(pd);
+            goto out;
+        }
+
+        rc = fork(d, pd);
+
+        if ( rc == -ERESTART )
+            rc = hypercall_create_continuation(__HYPERVISOR_memory_op,
+                                               "lh", XENMEM_sharing_op,
+                                               arg);
+        rcu_unlock_domain(pd);
+        break;
+    }
+
     default:
         rc = -ENOSYS;
         break;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 9f51370327..1ed7d13084 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
 
     mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
 
+    /* Check if we need to fork the page */
+    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
+         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
+        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
+
+    /* Check if we need to unshare the page */
     if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
     {
         ASSERT(p2m_is_hostp2m(p2m));
@@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
             return page;
 
         /* Error path: not a suitable GFN at all */
-        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
+        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
+             !mem_sharing_is_fork(p2m->domain) )
             return NULL;
     }
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index b4eb476a9c..62aed53a16 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
 
     v->vcpu_info = new_info;
     v->vcpu_info_mfn = page_to_mfn(page);
+#ifdef CONFIG_MEM_SHARING
+    v->vcpu_info_offset = offset;
+#endif
 
     /* Set new vcpu_info pointer /before/ setting pending flags. */
     smp_wmb();
diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
index b94bfb4ed0..1bf07e49fe 100644
--- a/xen/include/asm-x86/hap.h
+++ b/xen/include/asm-x86/hap.h
@@ -45,6 +45,7 @@ int   hap_track_dirty_vram(struct domain *d,
 
 extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
 int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted);
+unsigned int hap_get_allocation(struct domain *d);
 
 #endif /* XEN_HAP_H */
 
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index b007b2e343..f283c7d187 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -336,6 +336,8 @@ unsigned long hvm_cr4_guest_valid_bits(const struct domain *d, bool restore);
 
 int hvm_copy_context_and_params(struct domain *src, struct domain *dst);
 
+int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value);
+
 #ifdef CONFIG_HVM
 
 #define hvm_get_guest_tsc(v) hvm_get_guest_tsc_fixed(v, 0)
diff --git a/xen/include/asm-x86/mem_sharing.h b/xen/include/asm-x86/mem_sharing.h
index 53b7929d0e..78c3a2c343 100644
--- a/xen/include/asm-x86/mem_sharing.h
+++ b/xen/include/asm-x86/mem_sharing.h
@@ -77,6 +77,14 @@ static inline int mem_sharing_unshare_page(struct domain *d,
     return rc;
 }
 
+static inline bool mem_sharing_is_fork(struct domain *d)
+{
+    return d->parent;
+}
+
+int mem_sharing_fork_page(struct domain *d, gfn_t gfn,
+                          bool unsharing);
+
 /*
  * If called by a foreign domain, possible errors are
  *   -EBUSY -> ring full
@@ -130,6 +138,16 @@ static inline int mem_sharing_notify_enomem(struct domain *d, unsigned long gfn,
     return -EOPNOTSUPP;
 }
 
+static inline bool mem_sharing_is_fork(struct domain *d)
+{
+    return false;
+}
+
+static inline int mem_sharing_fork_page(struct domain *d, gfn_t gfn, bool lock)
+{
+    return -EOPNOTSUPP;
+}
+
 #endif
 
 #endif /* __MEM_SHARING_H__ */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 126d0ff06e..5ee4e0da12 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -482,6 +482,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_access_op_t);
 #define XENMEM_sharing_op_add_physmap       6
 #define XENMEM_sharing_op_audit             7
 #define XENMEM_sharing_op_range_share       8
+#define XENMEM_sharing_op_fork              9
 
 #define XENMEM_SHARING_OP_S_HANDLE_INVALID  (-10)
 #define XENMEM_SHARING_OP_C_HANDLE_INVALID  (-9)
@@ -532,6 +533,10 @@ struct xen_mem_sharing_op {
                 uint32_t gref;     /* IN: gref to debug         */
             } u;
         } debug;
+        struct mem_sharing_op_fork {      /* OP_FORK */
+            domid_t parent_domain;        /* IN: parent's domain id */
+            uint16_t pad[3];              /* Must be set to 0 */
+        } fork;
     } u;
 };
 typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index e6813288ab..881f2bb0c2 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -247,6 +247,9 @@ struct vcpu
 
     /* Guest-specified relocation of vcpu_info. */
     mfn_t            vcpu_info_mfn;
+#ifdef CONFIG_MEM_SHARING
+    unsigned short   vcpu_info_offset;
+#endif
 
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
@@ -480,6 +483,8 @@ struct domain
     /* Memory sharing support */
 #ifdef CONFIG_MEM_SHARING
     struct vm_event_domain *vm_event_share;
+    struct domain *parent; /* VM fork parent */
+    bool parent_paused;
 #endif
     /* Memory paging support */
 #ifdef CONFIG_HAS_MEM_PAGING
-- 
2.20.1




* [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-23 17:04 [Xen-devel] [PATCH v12 0/3] VM forking Tamas K Lengyel
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: " Tamas K Lengyel
@ 2020-03-23 17:04 ` Tamas K Lengyel
  2020-03-25 15:52   ` Roger Pau Monné
  2020-03-26 10:16   ` Jan Beulich
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 3/3] xen/tools: VM forking toolstack side Tamas K Lengyel
  2 siblings, 2 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-23 17:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Tamas K Lengyel, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Stefano Stabellini, Jan Beulich,
	Julien Grall, Roger Pau Monné

Implement a hypercall that allows a fork to shed all memory that got allocated
for it during its execution and re-load its vCPU context from the parent VM.
This allows the forked VM to be reset into the same state the parent VM is in
faster than creating a new fork would be. Measurements show about a 2x speedup
during normal fuzzing operations. Performance may vary depending on how much
memory got allocated for the forked VM. If it has been completely deduplicated
from the parent VM then creating a new fork would likely be more performant.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
---
v12: remove continuation & add comment back
     address style issues pointed out by Jan
---
 xen/arch/x86/mm/mem_sharing.c | 77 +++++++++++++++++++++++++++++++++++
 xen/include/public/memory.h   |  1 +
 2 files changed, 78 insertions(+)

diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 23deeddff2..930a5f58ef 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1775,6 +1775,60 @@ static int fork(struct domain *cd, struct domain *d)
     return rc;
 }
 
+/*
+ * The fork reset operation is intended to be used on short-lived forks only.
+ * There is no hypercall continuation operation implemented for this reason.
+ * For forks that obtain a larger memory footprint it is likely going to be
+ * more performant to create a new fork instead of resetting an existing one.
+ *
+ * TODO: In case this hypercall would become useful on forks with larger memory
+ * footprints the hypercall continuation should be implemented (or if this
+ * feature needs to become "stable").
+ */
+static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
+{
+    int rc;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    struct page_info *page, *tmp;
+
+    spin_lock(&d->page_alloc_lock);
+    domain_pause(d);
+
+    page_list_for_each_safe(page, tmp, &d->page_list)
+    {
+        p2m_type_t p2mt;
+        p2m_access_t p2ma;
+        mfn_t mfn = page_to_mfn(page);
+        gfn_t gfn = mfn_to_gfn(d, mfn);
+
+        mfn = __get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma,
+                                    0, NULL, false);
+
+        /* only reset pages that are sharable */
+        if ( !p2m_is_sharable(p2mt) )
+            continue;
+
+        /* take an extra reference or just skip if can't for whatever reason */
+        if ( !get_page(page, d) )
+            continue;
+
+        /* forked memory is 4k, not splitting large pages so this must work */
+        rc = p2m->set_entry(p2m, gfn, INVALID_MFN, PAGE_ORDER_4K,
+                            p2m_invalid, p2m_access_rwx, -1);
+        ASSERT(!rc);
+
+        put_page_alloc_ref(page);
+        put_page(page);
+    }
+
+    rc = copy_settings(d, pd);
+
+    domain_unpause(d);
+    spin_unlock(&d->page_alloc_lock);
+
+    return rc;
+}
+
 int mem_sharing_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_sharing_op_t) arg)
 {
     int rc;
@@ -2066,6 +2120,29 @@ int mem_sharing_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_sharing_op_t) arg)
         break;
     }
 
+    case XENMEM_sharing_op_fork_reset:
+    {
+        struct domain *pd;
+
+        rc = -EINVAL;
+        if ( mso.u.fork.pad[0] || mso.u.fork.pad[1] ||
+             mso.u.fork.pad[2] )
+            goto out;
+
+        rc = -ENOSYS;
+        if ( !d->parent )
+            goto out;
+
+        rc = rcu_lock_live_remote_domain_by_id(d->parent->domain_id, &pd);
+        if ( rc )
+            goto out;
+
+        rc = mem_sharing_fork_reset(d, pd);
+
+        rcu_unlock_domain(pd);
+        break;
+    }
+
     default:
         rc = -ENOSYS;
         break;
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 5ee4e0da12..d36d64b8dc 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -483,6 +483,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_access_op_t);
 #define XENMEM_sharing_op_audit             7
 #define XENMEM_sharing_op_range_share       8
 #define XENMEM_sharing_op_fork              9
+#define XENMEM_sharing_op_fork_reset        10
 
 #define XENMEM_SHARING_OP_S_HANDLE_INVALID  (-10)
 #define XENMEM_SHARING_OP_C_HANDLE_INVALID  (-9)
-- 
2.20.1




* [Xen-devel] [PATCH v12 3/3] xen/tools: VM forking toolstack side
  2020-03-23 17:04 [Xen-devel] [PATCH v12 0/3] VM forking Tamas K Lengyel
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: " Tamas K Lengyel
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork Tamas K Lengyel
@ 2020-03-23 17:04 ` Tamas K Lengyel
  2 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-23 17:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony PERARD, Ian Jackson, Tamas K Lengyel, Wei Liu

Add the necessary bits to implement the "xl fork-vm" command. The command
allows the user to specify how to launch the device model, allowing for a
late-launch model in which the user can execute the fork without the device
model and decide to launch it only later.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
---
 docs/man/xl.1.pod.in          |  44 +++++
 tools/libxc/include/xenctrl.h |  13 ++
 tools/libxc/xc_memshr.c       |  22 +++
 tools/libxl/libxl.h           |  11 ++
 tools/libxl/libxl_create.c    | 361 +++++++++++++++++++---------------
 tools/libxl/libxl_dm.c        |   2 +-
 tools/libxl/libxl_dom.c       |  43 +++-
 tools/libxl/libxl_internal.h  |   7 +
 tools/libxl/libxl_types.idl   |   1 +
 tools/libxl/libxl_x86.c       |  41 ++++
 tools/xl/Makefile             |   2 +-
 tools/xl/xl.h                 |   5 +
 tools/xl/xl_cmdtable.c        |  15 ++
 tools/xl/xl_forkvm.c          | 147 ++++++++++++++
 tools/xl/xl_vmcontrol.c       |  14 ++
 15 files changed, 562 insertions(+), 166 deletions(-)
 create mode 100644 tools/xl/xl_forkvm.c

diff --git a/docs/man/xl.1.pod.in b/docs/man/xl.1.pod.in
index 09339282e6..59c03c6427 100644
--- a/docs/man/xl.1.pod.in
+++ b/docs/man/xl.1.pod.in
@@ -708,6 +708,50 @@ above).
 
 =back
 
+=item B<fork-vm> [I<OPTIONS>] I<domain-id>
+
+Create a fork of a running VM.  The domain will be paused after the operation
+and remains paused while forks of it exist.  Experimental and x86 only.
+Forks can only be made of domains with HAP enabled and on Intel hardware.  The
+parent domain must be created with the xl toolstack and its configuration must
+not manually define max_grant_frames, max_maptrack_frames or max_event_channels.
+
+B<OPTIONS>
+
+=over 4
+
+=item B<-p>
+
+Leave the fork paused after creating it.
+
+=item B<--launch-dm>
+
+Specify whether the device model (QEMU) should be launched for the fork. Late
+launch allows starting the device model for an already running fork.
+
+=item B<-C>
+
+The config file to use when launching the device model.  Currently required when
+launching the device model.  Most config settings MUST match the parent domain
+exactly; only the VM name, disk path and network configuration should be changed.
+
+=item B<-Q>
+
+The path to the qemu save file to use when launching the device model.  Currently
+required when launching the device model.
+
+=item B<--fork-reset>
+
+Perform a reset operation of an already running fork.  Note that resetting may
+be less performant than creating a new fork depending on how much memory the
+fork has deduplicated during its runtime.
+
+=item B<--max-vcpus>
+
+Specify the max-vcpus matching the parent domain when not launching the dm.
+
+=back
+
 =item B<sharing> [I<domain-id>]
 
 Display the number of shared pages for a specified domain. If no domain is
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index fc6e57a1a0..00cb4cf1f7 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2225,6 +2225,19 @@ int xc_memshr_range_share(xc_interface *xch,
                           uint64_t first_gfn,
                           uint64_t last_gfn);
 
+int xc_memshr_fork(xc_interface *xch,
+                   uint32_t source_domain,
+                   uint32_t client_domain);
+
+/*
+ * Note: this function is only intended to be used on short-lived forks that
+ * haven't yet acquired a lot of memory. In case the fork has a lot of memory
+ * it is likely more performant to create a new fork with xc_memshr_fork.
+ *
+ * With VMs that have a lot of memory this call may block for a long time.
+ */
+int xc_memshr_fork_reset(xc_interface *xch, uint32_t forked_domain);
+
 /* Debug calls: return the number of pages referencing the shared frame backing
  * the input argument. Should be one or greater.
  *
diff --git a/tools/libxc/xc_memshr.c b/tools/libxc/xc_memshr.c
index 97e2e6a8d9..d0e4ee225b 100644
--- a/tools/libxc/xc_memshr.c
+++ b/tools/libxc/xc_memshr.c
@@ -239,6 +239,28 @@ int xc_memshr_debug_gref(xc_interface *xch,
     return xc_memshr_memop(xch, domid, &mso);
 }
 
+int xc_memshr_fork(xc_interface *xch, uint32_t pdomid, uint32_t domid)
+{
+    xen_mem_sharing_op_t mso;
+
+    memset(&mso, 0, sizeof(mso));
+
+    mso.op = XENMEM_sharing_op_fork;
+    mso.u.fork.parent_domain = pdomid;
+
+    return xc_memshr_memop(xch, domid, &mso);
+}
+
+int xc_memshr_fork_reset(xc_interface *xch, uint32_t domid)
+{
+    xen_mem_sharing_op_t mso;
+
+    memset(&mso, 0, sizeof(mso));
+    mso.op = XENMEM_sharing_op_fork_reset;
+
+    return xc_memshr_memop(xch, domid, &mso);
+}
+
 int xc_memshr_audit(xc_interface *xch)
 {
     xen_mem_sharing_op_t mso;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 71709dc585..088e81c78b 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -2666,6 +2666,17 @@ int libxl_psr_get_hw_info(libxl_ctx *ctx, libxl_psr_feat_type type,
                           unsigned int lvl, unsigned int *nr,
                           libxl_psr_hw_info **info);
 void libxl_psr_hw_info_list_free(libxl_psr_hw_info *list, unsigned int nr);
+
+int libxl_domain_fork_vm(libxl_ctx *ctx, uint32_t pdomid, uint32_t max_vcpus, uint32_t *domid)
+                         LIBXL_EXTERNAL_CALLERS_ONLY;
+
+int libxl_domain_fork_launch_dm(libxl_ctx *ctx, libxl_domain_config *d_config,
+                                uint32_t domid,
+                                const libxl_asyncprogress_how *aop_console_how)
+                                LIBXL_EXTERNAL_CALLERS_ONLY;
+
+int libxl_domain_fork_reset(libxl_ctx *ctx, uint32_t domid)
+                            LIBXL_EXTERNAL_CALLERS_ONLY;
 #endif
 
 /* misc */
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index e7cb2dbc2b..5705b6e3a5 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -538,12 +538,12 @@ out:
     return ret;
 }
 
-int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
-                       libxl__domain_build_state *state,
-                       uint32_t *domid, bool soft_reset)
+static int libxl__domain_make_xs_entries(libxl__gc *gc, libxl_domain_config *d_config,
+                                         libxl__domain_build_state *state,
+                                         uint32_t domid)
 {
     libxl_ctx *ctx = libxl__gc_owner(gc);
-    int ret, rc, nb_vm;
+    int rc, nb_vm;
     const char *dom_type;
     char *uuid_string;
     char *dom_path, *vm_path, *libxl_path;
@@ -555,9 +555,6 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
 
     /* convenience aliases */
     libxl_domain_create_info *info = &d_config->c_info;
-    libxl_domain_build_info *b_info = &d_config->b_info;
-
-    assert(soft_reset || *domid == INVALID_DOMID);
 
     uuid_string = libxl__uuid2string(gc, info->uuid);
     if (!uuid_string) {
@@ -565,137 +562,7 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
         goto out;
     }
 
-    if (!soft_reset) {
-        struct xen_domctl_createdomain create = {
-            .ssidref = info->ssidref,
-            .max_vcpus = b_info->max_vcpus,
-            .max_evtchn_port = b_info->event_channels,
-            .max_grant_frames = b_info->max_grant_frames,
-            .max_maptrack_frames = b_info->max_maptrack_frames,
-        };
-
-        if (info->type != LIBXL_DOMAIN_TYPE_PV) {
-            create.flags |= XEN_DOMCTL_CDF_hvm;
-            create.flags |=
-                libxl_defbool_val(info->hap) ? XEN_DOMCTL_CDF_hap : 0;
-            create.flags |=
-                libxl_defbool_val(info->oos) ? 0 : XEN_DOMCTL_CDF_oos_off;
-        }
-
-        assert(info->passthrough != LIBXL_PASSTHROUGH_DEFAULT);
-        LOG(DETAIL, "passthrough: %s",
-            libxl_passthrough_to_string(info->passthrough));
-
-        if (info->passthrough != LIBXL_PASSTHROUGH_DISABLED)
-            create.flags |= XEN_DOMCTL_CDF_iommu;
-
-        if (info->passthrough == LIBXL_PASSTHROUGH_SYNC_PT)
-            create.iommu_opts |= XEN_DOMCTL_IOMMU_no_sharept;
-
-        /* Ultimately, handle is an array of 16 uint8_t, same as uuid */
-        libxl_uuid_copy(ctx, (libxl_uuid *)&create.handle, &info->uuid);
-
-        ret = libxl__arch_domain_prepare_config(gc, d_config, &create);
-        if (ret < 0) {
-            LOGED(ERROR, *domid, "fail to get domain config");
-            rc = ERROR_FAIL;
-            goto out;
-        }
-
-        for (;;) {
-            uint32_t local_domid;
-            bool recent;
-
-            if (info->domid == RANDOM_DOMID) {
-                uint16_t v;
-
-                ret = libxl__random_bytes(gc, (void *)&v, sizeof(v));
-                if (ret < 0)
-                    break;
-
-                v &= DOMID_MASK;
-                if (!libxl_domid_valid_guest(v))
-                    continue;
-
-                local_domid = v;
-            } else {
-                local_domid = info->domid; /* May not be valid */
-            }
-
-            ret = xc_domain_create(ctx->xch, &local_domid, &create);
-            if (ret < 0) {
-                /*
-                 * If we generated a random domid and creation failed
-                 * because that domid already exists then simply try
-                 * again.
-                 */
-                if (errno == EEXIST && info->domid == RANDOM_DOMID)
-                    continue;
-
-                LOGED(ERROR, local_domid, "domain creation fail");
-                rc = ERROR_FAIL;
-                goto out;
-            }
-
-            /* A new domain now exists */
-            *domid = local_domid;
-
-            rc = libxl__is_domid_recent(gc, local_domid, &recent);
-            if (rc)
-                goto out;
-
-            /* The domid is not recent, so we're done */
-            if (!recent)
-                break;
-
-            /*
-             * If the domid was specified then there's no point in
-             * trying again.
-             */
-            if (libxl_domid_valid_guest(info->domid)) {
-                LOGED(ERROR, local_domid, "domain id recently used");
-                rc = ERROR_FAIL;
-                goto out;
-            }
-
-            /*
-             * The domain is recent and so cannot be used. Clear domid
-             * here since, if xc_domain_destroy() fails below there is
-             * little point calling it again in the error path.
-             */
-            *domid = INVALID_DOMID;
-
-            ret = xc_domain_destroy(ctx->xch, local_domid);
-            if (ret < 0) {
-                LOGED(ERROR, local_domid, "domain destroy fail");
-                rc = ERROR_FAIL;
-                goto out;
-            }
-
-            /* The domain was successfully destroyed, so we can try again */
-        }
-
-        rc = libxl__arch_domain_save_config(gc, d_config, state, &create);
-        if (rc < 0)
-            goto out;
-    }
-
-    /*
-     * If soft_reset is set the the domid will have been valid on entry.
-     * If it was not set then xc_domain_create() should have assigned a
-     * valid value. Either way, if we reach this point, domid should be
-     * valid.
-     */
-    assert(libxl_domid_valid_guest(*domid));
-
-    ret = xc_cpupool_movedomain(ctx->xch, info->poolid, *domid);
-    if (ret < 0) {
-        LOGED(ERROR, *domid, "domain move fail");
-        rc = ERROR_FAIL;
-        goto out;
-    }
-
-    dom_path = libxl__xs_get_dompath(gc, *domid);
+    dom_path = libxl__xs_get_dompath(gc, domid);
     if (!dom_path) {
         rc = ERROR_FAIL;
         goto out;
@@ -703,12 +570,12 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
 
     vm_path = GCSPRINTF("/vm/%s", uuid_string);
     if (!vm_path) {
-        LOGD(ERROR, *domid, "cannot allocate create paths");
+        LOGD(ERROR, domid, "cannot allocate create paths");
         rc = ERROR_FAIL;
         goto out;
     }
 
-    libxl_path = libxl__xs_libxl_path(gc, *domid);
+    libxl_path = libxl__xs_libxl_path(gc, domid);
     if (!libxl_path) {
         rc = ERROR_FAIL;
         goto out;
@@ -719,10 +586,10 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
 
     roperm[0].id = 0;
     roperm[0].perms = XS_PERM_NONE;
-    roperm[1].id = *domid;
+    roperm[1].id = domid;
     roperm[1].perms = XS_PERM_READ;
 
-    rwperm[0].id = *domid;
+    rwperm[0].id = domid;
     rwperm[0].perms = XS_PERM_NONE;
 
 retry_transaction:
@@ -740,7 +607,7 @@ retry_transaction:
                     noperm, ARRAY_SIZE(noperm));
 
     xs_write(ctx->xsh, t, GCSPRINTF("%s/vm", dom_path), vm_path, strlen(vm_path));
-    rc = libxl__domain_rename(gc, *domid, 0, info->name, t);
+    rc = libxl__domain_rename(gc, domid, 0, info->name, t);
     if (rc)
         goto out;
 
@@ -830,7 +697,7 @@ retry_transaction:
 
     vm_list = libxl_list_vm(ctx, &nb_vm);
     if (!vm_list) {
-        LOGD(ERROR, *domid, "cannot get number of running guests");
+        LOGD(ERROR, domid, "cannot get number of running guests");
         rc = ERROR_FAIL;
         goto out;
     }
@@ -854,7 +721,7 @@ retry_transaction:
             t = 0;
             goto retry_transaction;
         }
-        LOGED(ERROR, *domid, "domain creation ""xenstore transaction commit failed");
+        LOGED(ERROR, domid, "domain creation ""xenstore transaction commit failed");
         rc = ERROR_FAIL;
         goto out;
     }
@@ -866,6 +733,155 @@ retry_transaction:
     return rc;
 }
 
+int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
+                       libxl__domain_build_state *state,
+                       uint32_t *domid, bool soft_reset)
+{
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    int ret, rc;
+
+    /* convenience aliases */
+    libxl_domain_create_info *info = &d_config->c_info;
+    libxl_domain_build_info *b_info = &d_config->b_info;
+
+    assert(soft_reset || *domid == INVALID_DOMID);
+
+    if (!soft_reset) {
+        struct xen_domctl_createdomain create = {
+            .ssidref = info->ssidref,
+            .max_vcpus = b_info->max_vcpus,
+            .max_evtchn_port = b_info->event_channels,
+            .max_grant_frames = b_info->max_grant_frames,
+            .max_maptrack_frames = b_info->max_maptrack_frames,
+        };
+
+        if (info->type != LIBXL_DOMAIN_TYPE_PV) {
+            create.flags |= XEN_DOMCTL_CDF_hvm;
+            create.flags |=
+                libxl_defbool_val(info->hap) ? XEN_DOMCTL_CDF_hap : 0;
+            create.flags |=
+                libxl_defbool_val(info->oos) ? 0 : XEN_DOMCTL_CDF_oos_off;
+        }
+
+        assert(info->passthrough != LIBXL_PASSTHROUGH_DEFAULT);
+        LOG(DETAIL, "passthrough: %s",
+            libxl_passthrough_to_string(info->passthrough));
+
+        if (info->passthrough != LIBXL_PASSTHROUGH_DISABLED)
+            create.flags |= XEN_DOMCTL_CDF_iommu;
+
+        if (info->passthrough == LIBXL_PASSTHROUGH_SYNC_PT)
+            create.iommu_opts |= XEN_DOMCTL_IOMMU_no_sharept;
+
+        /* Ultimately, handle is an array of 16 uint8_t, same as uuid */
+        libxl_uuid_copy(ctx, (libxl_uuid *)&create.handle, &info->uuid);
+
+        ret = libxl__arch_domain_prepare_config(gc, d_config, &create);
+        if (ret < 0) {
+            LOGED(ERROR, *domid, "fail to get domain config");
+            rc = ERROR_FAIL;
+            goto out;
+        }
+
+        for (;;) {
+            uint32_t local_domid;
+            bool recent;
+
+            if (info->domid == RANDOM_DOMID) {
+                uint16_t v;
+
+                ret = libxl__random_bytes(gc, (void *)&v, sizeof(v));
+                if (ret < 0)
+                    break;
+
+                v &= DOMID_MASK;
+                if (!libxl_domid_valid_guest(v))
+                    continue;
+
+                local_domid = v;
+            } else {
+                local_domid = info->domid; /* May not be valid */
+            }
+
+            ret = xc_domain_create(ctx->xch, &local_domid, &create);
+            if (ret < 0) {
+                /*
+                 * If we generated a random domid and creation failed
+                 * because that domid already exists then simply try
+                 * again.
+                 */
+                if (errno == EEXIST && info->domid == RANDOM_DOMID)
+                    continue;
+
+                LOGED(ERROR, local_domid, "domain creation fail");
+                rc = ERROR_FAIL;
+                goto out;
+            }
+
+            /* A new domain now exists */
+            *domid = local_domid;
+
+            rc = libxl__is_domid_recent(gc, local_domid, &recent);
+            if (rc)
+                goto out;
+
+            /* The domid is not recent, so we're done */
+            if (!recent)
+                break;
+
+            /*
+             * If the domid was specified then there's no point in
+             * trying again.
+             */
+            if (libxl_domid_valid_guest(info->domid)) {
+                LOGED(ERROR, local_domid, "domain id recently used");
+                rc = ERROR_FAIL;
+                goto out;
+            }
+
+            /*
+             * The domain is recent and so cannot be used. Clear domid
+             * here since, if xc_domain_destroy() fails below there is
+             * little point calling it again in the error path.
+             */
+            *domid = INVALID_DOMID;
+
+            ret = xc_domain_destroy(ctx->xch, local_domid);
+            if (ret < 0) {
+                LOGED(ERROR, local_domid, "domain destroy fail");
+                rc = ERROR_FAIL;
+                goto out;
+            }
+
+            /* The domain was successfully destroyed, so we can try again */
+        }
+
+        rc = libxl__arch_domain_save_config(gc, d_config, state, &create);
+        if (rc < 0)
+            goto out;
+    }
+
+    /*
+     * If soft_reset is set the the domid will have been valid on entry.
+     * If it was not set then xc_domain_create() should have assigned a
+     * valid value. Either way, if we reach this point, domid should be
+     * valid.
+     */
+    assert(libxl_domid_valid_guest(*domid));
+
+    ret = xc_cpupool_movedomain(ctx->xch, info->poolid, *domid);
+    if (ret < 0) {
+        LOGED(ERROR, *domid, "domain move fail");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = libxl__domain_make_xs_entries(gc, d_config, state, *domid);
+
+out:
+    return rc;
+}
+
 static int store_libxl_entry(libxl__gc *gc, uint32_t domid,
                              libxl_domain_build_info *b_info)
 {
@@ -1191,16 +1207,32 @@ static void initiate_domain_create(libxl__egc *egc,
     ret = libxl__domain_config_setdefault(gc,d_config,domid);
     if (ret) goto error_out;
 
-    ret = libxl__domain_make(gc, d_config, &dcs->build_state, &domid,
-                             dcs->soft_reset);
-    if (ret) {
-        LOGD(ERROR, domid, "cannot make domain: %d", ret);
+    if ( !d_config->dm_restore_file )
+    {
+        ret = libxl__domain_make(gc, d_config, &dcs->build_state, &domid,
+                                 dcs->soft_reset);
         dcs->guest_domid = domid;
+
+        if (ret) {
+            LOGD(ERROR, domid, "cannot make domain: %d", ret);
+            ret = ERROR_FAIL;
+            goto error_out;
+        }
+    } else if ( dcs->guest_domid != INVALID_DOMID ) {
+        domid = dcs->guest_domid;
+
+        ret = libxl__domain_make_xs_entries(gc, d_config, &dcs->build_state, domid);
+        if (ret) {
+            LOGD(ERROR, domid, "cannot make domain: %d", ret);
+            ret = ERROR_FAIL;
+            goto error_out;
+        }
+    } else {
+        LOGD(ERROR, domid, "cannot make domain");
         ret = ERROR_FAIL;
         goto error_out;
     }
 
-    dcs->guest_domid = domid;
     dcs->sdss.dm.guest_domid = 0; /* means we haven't spawned */
 
     /* post-4.13 todo: move these next bits of defaulting to
@@ -1236,7 +1268,7 @@ static void initiate_domain_create(libxl__egc *egc,
     if (ret)
         goto error_out;
 
-    if (restore_fd >= 0 || dcs->soft_reset) {
+    if (restore_fd >= 0 || dcs->soft_reset || d_config->dm_restore_file) {
         LOGD(DEBUG, domid, "restoring, not running bootloader");
         domcreate_bootloader_done(egc, &dcs->bl, 0);
     } else  {
@@ -1312,7 +1344,16 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     dcs->sdss.dm.callback = domcreate_devmodel_started;
     dcs->sdss.callback = domcreate_devmodel_started;
 
-    if (restore_fd < 0 && !dcs->soft_reset) {
+    if (restore_fd < 0 && !dcs->soft_reset && !d_config->dm_restore_file) {
+        rc = libxl__domain_build(gc, d_config, domid, state);
+        domcreate_rebuild_done(egc, dcs, rc);
+        return;
+    }
+
+    if ( d_config->dm_restore_file ) {
+        dcs->srs.dcs = dcs;
+        dcs->srs.ao = ao;
+        state->forked_vm = true;
         rc = libxl__domain_build(gc, d_config, domid, state);
         domcreate_rebuild_done(egc, dcs, rc);
         return;
@@ -1510,6 +1551,7 @@ static void domcreate_rebuild_done(libxl__egc *egc,
     /* convenience aliases */
     const uint32_t domid = dcs->guest_domid;
     libxl_domain_config *const d_config = dcs->guest_config;
+    libxl__domain_build_state *const state = &dcs->build_state;
 
     if (ret) {
         LOGD(ERROR, domid, "cannot (re-)build domain: %d", ret);
@@ -1517,6 +1559,9 @@ static void domcreate_rebuild_done(libxl__egc *egc,
         goto error_out;
     }
 
+    if ( d_config->dm_restore_file )
+        state->saved_state = GCSPRINTF("%s", d_config->dm_restore_file);
+
     store_libxl_entry(gc, domid, &d_config->b_info);
 
     libxl__multidev_begin(ao, &dcs->multidev);
@@ -1947,7 +1992,7 @@ static void domain_create_cb(libxl__egc *egc,
                              libxl__domain_create_state *dcs,
                              int rc, uint32_t domid);
 
-static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
+int libxl__do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid, int restore_fd, int send_back_fd,
                             const libxl_domain_restore_params *params,
                             const libxl_asyncop_how *ao_how,
@@ -1960,6 +2005,8 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
     GCNEW(cdcs);
     cdcs->dcs.ao = ao;
     cdcs->dcs.guest_config = d_config;
+    cdcs->dcs.guest_domid = *domid;
+
     libxl_domain_config_init(&cdcs->dcs.guest_config_saved);
     libxl_domain_config_copy(ctx, &cdcs->dcs.guest_config_saved, d_config);
     cdcs->dcs.restore_fd = cdcs->dcs.libxc_fd = restore_fd;
@@ -2204,8 +2251,8 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             const libxl_asyncprogress_how *aop_console_how)
 {
     unset_disk_colo_restore(d_config);
-    return do_domain_create(ctx, d_config, domid, -1, -1, NULL,
-                            ao_how, aop_console_how);
+    return libxl__do_domain_create(ctx, d_config, domid, -1, -1, NULL,
+                                  ao_how, aop_console_how);
 }
 
 int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
@@ -2221,8 +2268,8 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
         unset_disk_colo_restore(d_config);
     }
 
-    return do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
-                            params, ao_how, aop_console_how);
+    return libxl__do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
+                                   params, ao_how, aop_console_how);
 }
 
 int libxl_domain_soft_reset(libxl_ctx *ctx,
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index f4007bbe50..b615f1fc88 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -2803,7 +2803,7 @@ static void device_model_spawn_outcome(libxl__egc *egc,
 
     libxl__domain_build_state *state = dmss->build_state;
 
-    if (state->saved_state) {
+    if (state->saved_state && !state->forked_vm) {
         ret2 = unlink(state->saved_state);
         if (ret2) {
             LOGED(ERROR, dmss->guest_domid, "%s: failed to remove device-model state %s",
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 71cb578923..3bc7117b99 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -249,9 +249,12 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
     libxl_domain_build_info *const info = &d_config->b_info;
     libxl_ctx *ctx = libxl__gc_owner(gc);
     char *xs_domid, *con_domid;
-    int rc;
+    int rc = 0;
     uint64_t size;
 
+    if ( state->forked_vm )
+        goto skip_fork;
+
     if (xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus) != 0) {
         LOG(ERROR, "Couldn't set max vcpu count");
         return ERROR_FAIL;
@@ -362,7 +365,6 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         }
     }
 
-
     rc = libxl__arch_extra_memory(gc, info, &size);
     if (rc < 0) {
         LOGE(ERROR, "Couldn't get arch extra constant memory size");
@@ -374,6 +376,11 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         return ERROR_FAIL;
     }
 
+    rc = libxl__arch_domain_create(gc, d_config, domid);
+    if ( rc )
+        goto out;
+
+skip_fork:
     xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL);
     state->store_domid = xs_domid ? atoi(xs_domid) : 0;
     free(xs_domid);
@@ -385,8 +392,7 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
     state->store_port = xc_evtchn_alloc_unbound(ctx->xch, domid, state->store_domid);
     state->console_port = xc_evtchn_alloc_unbound(ctx->xch, domid, state->console_domid);
 
-    rc = libxl__arch_domain_create(gc, d_config, domid);
-
+out:
     return rc;
 }
 
@@ -444,6 +450,9 @@ int libxl__build_post(libxl__gc *gc, uint32_t domid,
     char **ents;
     int i, rc;
 
+    if ( state->forked_vm )
+        goto skip_fork;
+
     if (info->num_vnuma_nodes && !info->num_vcpu_soft_affinity) {
         rc = set_vnuma_affinity(gc, domid, info);
         if (rc)
@@ -466,6 +475,7 @@ int libxl__build_post(libxl__gc *gc, uint32_t domid,
         }
     }
 
+skip_fork:
     ents = libxl__calloc(gc, 12 + (info->max_vcpus * 2) + 2, sizeof(char *));
     ents[0] = "memory/static-max";
     ents[1] = GCSPRINTF("%"PRId64, info->max_memkb);
@@ -728,14 +738,16 @@ static int hvm_build_set_params(xc_interface *handle, uint32_t domid,
                                 libxl_domain_build_info *info,
                                 int store_evtchn, unsigned long *store_mfn,
                                 int console_evtchn, unsigned long *console_mfn,
-                                domid_t store_domid, domid_t console_domid)
+                                domid_t store_domid, domid_t console_domid,
+                                bool forked_vm)
 {
     struct hvm_info_table *va_hvm;
     uint8_t *va_map, sum;
     uint64_t str_mfn, cons_mfn;
     int i;
 
-    if (info->type == LIBXL_DOMAIN_TYPE_HVM) {
+    if ( info->type == LIBXL_DOMAIN_TYPE_HVM && !forked_vm )
+    {
         va_map = xc_map_foreign_range(handle, domid,
                                       XC_PAGE_SIZE, PROT_READ | PROT_WRITE,
                                       HVM_INFO_PFN);
@@ -1051,6 +1063,23 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
     struct xc_dom_image *dom = NULL;
     bool device_model = info->type == LIBXL_DOMAIN_TYPE_HVM ? true : false;
 
+    if ( state->forked_vm )
+    {
+        rc = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
+                                  &state->store_mfn, state->console_port,
+                                  &state->console_mfn, state->store_domid,
+                                  state->console_domid, state->forked_vm);
+
+        if ( rc )
+            return rc;
+
+        return xc_dom_gnttab_seed(ctx->xch, domid, true,
+                                  state->console_mfn,
+                                  state->store_mfn,
+                                  state->console_domid,
+                                  state->store_domid);
+    }
+
     xc_dom_loginit(ctx->xch);
 
     /*
@@ -1175,7 +1204,7 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
     rc = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
                                &state->store_mfn, state->console_port,
                                &state->console_mfn, state->store_domid,
-                               state->console_domid);
+                               state->console_domid, false);
     if (rc != 0) {
         LOG(ERROR, "hvm build set params failed");
         goto out;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 5f39e44cb9..d05ff31e83 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1374,6 +1374,7 @@ typedef struct {
 
     char *saved_state;
     int dm_monitor_fd;
+    bool forked_vm;
 
     libxl__file_reference pv_kernel;
     libxl__file_reference pv_ramdisk;
@@ -4818,6 +4819,12 @@ _hidden int libxl__domain_pvcontrol(libxl__egc *egc,
 /* Check whether a domid is recent */
 int libxl__is_domid_recent(libxl__gc *gc, uint32_t domid, bool *recent);
 
+_hidden int libxl__do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
+                                    uint32_t *domid, int restore_fd, int send_back_fd,
+                                    const libxl_domain_restore_params *params,
+                                    const libxl_asyncop_how *ao_how,
+                                    const libxl_asyncprogress_how *aop_console_how);
+
 #endif
 
 /*
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index f7c473be74..2bb5e6319e 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -958,6 +958,7 @@ libxl_domain_config = Struct("domain_config", [
     ("on_watchdog", libxl_action_on_shutdown),
     ("on_crash", libxl_action_on_shutdown),
     ("on_soft_reset", libxl_action_on_shutdown),
+    ("dm_restore_file", string, {'const': True}),
     ], dir=DIR_IN)
 
 libxl_diskinfo = Struct("diskinfo", [
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index f8bc828e62..f4312411fc 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -2,6 +2,7 @@
 #include "libxl_arch.h"
 
 #include <xc_dom.h>
+#include <xen-xsm/flask/flask.h>
 
 int libxl__arch_domain_prepare_config(libxl__gc *gc,
                                       libxl_domain_config *d_config,
@@ -842,6 +843,46 @@ int libxl__arch_passthrough_mode_setdefault(libxl__gc *gc,
     return rc;
 }
 
+/*
+ * The parent domain is expected to be created with default settings for
+ * - max_evtch_port
+ * - max_grant_frames
+ * - max_maptrack_frames
+ */
+int libxl_domain_fork_vm(libxl_ctx *ctx, uint32_t pdomid, uint32_t max_vcpus, uint32_t *domid)
+{
+    int rc;
+    struct xen_domctl_createdomain create = {0};
+    create.flags |= XEN_DOMCTL_CDF_hvm;
+    create.flags |= XEN_DOMCTL_CDF_hap;
+    create.flags |= XEN_DOMCTL_CDF_oos_off;
+    create.arch.emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
+    create.ssidref = SECINITSID_DOMU;
+    create.max_vcpus = max_vcpus;
+    create.max_evtchn_port = 1023;
+    create.max_grant_frames = LIBXL_MAX_GRANT_FRAMES_DEFAULT;
+    create.max_maptrack_frames = LIBXL_MAX_MAPTRACK_FRAMES_DEFAULT;
+
+    if ( (rc = xc_domain_create(ctx->xch, domid, &create)) )
+        return rc;
+
+    if ( (rc = xc_memshr_fork(ctx->xch, pdomid, *domid)) )
+        xc_domain_destroy(ctx->xch, *domid);
+
+    return rc;
+}
+
+int libxl_domain_fork_launch_dm(libxl_ctx *ctx, libxl_domain_config *d_config,
+                                uint32_t domid,
+                                const libxl_asyncprogress_how *aop_console_how)
+{
+    return libxl__do_domain_create(ctx, d_config, &domid, -1, -1, NULL, NULL, aop_console_how);
+}
+
+int libxl_domain_fork_reset(libxl_ctx *ctx, uint32_t domid)
+{
+    return xc_memshr_fork_reset(ctx->xch, domid);
+}
 
 /*
  * Local variables:
diff --git a/tools/xl/Makefile b/tools/xl/Makefile
index af4912e67a..073222233b 100644
--- a/tools/xl/Makefile
+++ b/tools/xl/Makefile
@@ -15,7 +15,7 @@ LDFLAGS += $(PTHREAD_LDFLAGS)
 CFLAGS_XL += $(CFLAGS_libxenlight)
 CFLAGS_XL += -Wshadow
 
-XL_OBJS-$(CONFIG_X86) = xl_psr.o
+XL_OBJS-$(CONFIG_X86) = xl_psr.o xl_forkvm.o
 XL_OBJS = xl.o xl_cmdtable.o xl_sxp.o xl_utils.o $(XL_OBJS-y)
 XL_OBJS += xl_parse.o xl_cpupool.o xl_flask.o
 XL_OBJS += xl_vtpm.o xl_block.o xl_nic.o xl_usb.o
diff --git a/tools/xl/xl.h b/tools/xl/xl.h
index 06569c6c4a..1105c34b15 100644
--- a/tools/xl/xl.h
+++ b/tools/xl/xl.h
@@ -31,6 +31,7 @@ struct cmd_spec {
 };
 
 struct domain_create {
+    uint32_t ddomid; /* fork launch dm for this domid */
     int debug;
     int daemonize;
     int monitor; /* handle guest reboots etc */
@@ -45,6 +46,7 @@ struct domain_create {
     const char *config_file;
     char *extra_config; /* extra config string */
     const char *restore_file;
+    const char *dm_restore_file;
     char *colo_proxy_script;
     bool userspace_colo_proxy;
     int migrate_fd; /* -1 means none */
@@ -128,6 +130,8 @@ int main_pciassignable_remove(int argc, char **argv);
 int main_pciassignable_list(int argc, char **argv);
 #ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
 int main_restore(int argc, char **argv);
+int main_fork_launch_dm(int argc, char **argv);
+int main_fork_reset(int argc, char **argv);
 int main_migrate_receive(int argc, char **argv);
 int main_save(int argc, char **argv);
 int main_migrate(int argc, char **argv);
@@ -212,6 +216,7 @@ int main_psr_cat_cbm_set(int argc, char **argv);
 int main_psr_cat_show(int argc, char **argv);
 int main_psr_mba_set(int argc, char **argv);
 int main_psr_mba_show(int argc, char **argv);
+int main_fork_vm(int argc, char **argv);
 #endif
 int main_qemu_monitor_command(int argc, char **argv);
 
diff --git a/tools/xl/xl_cmdtable.c b/tools/xl/xl_cmdtable.c
index 08335394e5..ef634abf32 100644
--- a/tools/xl/xl_cmdtable.c
+++ b/tools/xl/xl_cmdtable.c
@@ -187,6 +187,21 @@ struct cmd_spec cmd_table[] = {
       "Restore a domain from a saved state",
       "- for internal use only",
     },
+#if defined(__i386__) || defined(__x86_64__)
+    { "fork-vm",
+      &main_fork_vm, 0, 1,
+      "Fork a domain from the running parent domid. Experimental. Most config settings must match parent.",
+      "[options] <Domid>",
+      "-h                           Print this help.\n"
+      "-C <config>                  Use config file for VM fork.\n"
+      "-Q <qemu-save-file>          Use qemu save file for VM fork.\n"
+      "--launch-dm <yes|no|late>    Launch device model (QEMU) for VM fork.\n"
+      "--fork-reset                 Reset VM fork.\n"
+      "--max-vcpus                  Specify max-vcpus matching the parent domain when not launching dm\n"
+      "-p                           Do not unpause fork VM after operation.\n"
+      "-d                           Enable debug messages.\n"
+    },
+#endif
 #endif
     { "dump-core",
       &main_dump_core, 0, 1,
diff --git a/tools/xl/xl_forkvm.c b/tools/xl/xl_forkvm.c
new file mode 100644
index 0000000000..a7ee5b4771
--- /dev/null
+++ b/tools/xl/xl_forkvm.c
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2020 Intel Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include <fcntl.h>
+#include <inttypes.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/utsname.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <libxl.h>
+#include <libxl_utils.h>
+#include <libxlutil.h>
+
+#include "xl.h"
+#include "xl_utils.h"
+#include "xl_parse.h"
+
+int main_fork_vm(int argc, char **argv)
+{
+    int rc, debug = 0;
+    uint32_t domid_in = INVALID_DOMID, domid_out = INVALID_DOMID;
+    int launch_dm = 1;
+    bool reset = false;
+    bool pause = false;
+    const char *config_file = NULL;
+    const char *dm_restore_file = NULL;
+    uint32_t max_vcpus = 0;
+
+    int opt;
+    static struct option opts[] = {
+        {"launch-dm", 1, 0, 'l'},
+        {"fork-reset", 0, 0, 'r'},
+        {"max-vcpus", 1, 0, 'm'},
+        COMMON_LONG_OPTS
+    };
+
+    SWITCH_FOREACH_OPT(opt, "phdC:Q:l:rm:N:D:B:V:", opts, "fork-vm", 1) {
+    case 'd':
+        debug = 1;
+        break;
+    case 'p':
+        pause = 1;
+        break;
+    case 'm':
+        max_vcpus = atoi(optarg);
+        break;
+    case 'C':
+        config_file = optarg;
+        break;
+    case 'Q':
+        dm_restore_file = optarg;
+        break;
+    case 'l':
+        if ( !strcmp(optarg, "no") )
+            launch_dm = 0;
+        if ( !strcmp(optarg, "yes") )
+            launch_dm = 1;
+        if ( !strcmp(optarg, "late") )
+            launch_dm = 2;
+        break;
+    case 'r':
+        reset = 1;
+        break;
+    case 'N': /* fall-through */
+    case 'D': /* fall-through */
+    case 'B': /* fall-through */
+    case 'V':
+        fprintf(stderr, "Unimplemented option(s)\n");
+        return EXIT_FAILURE;
+    }
+
+    if (argc-optind == 1) {
+        domid_in = atoi(argv[optind]);
+    } else {
+        help("fork-vm");
+        return EXIT_FAILURE;
+    }
+
+    if (launch_dm && (!config_file || !dm_restore_file)) {
+        fprintf(stderr, "Currently you must provide both -C and -Q options\n");
+        return EXIT_FAILURE;
+    }
+
+    if (reset) {
+        domid_out = domid_in;
+        if (libxl_domain_fork_reset(ctx, domid_in))
+            return EXIT_FAILURE;
+    }
+
+    if (launch_dm == 2 || reset) {
+        domid_out = domid_in;
+        rc = EXIT_SUCCESS;
+    } else {
+        if ( !max_vcpus )
+        {
+            fprintf(stderr, "Currently you must parent's max_vcpu for this option\n");
+            return EXIT_FAILURE;
+        }
+
+        rc = libxl_domain_fork_vm(ctx, domid_in, max_vcpus, &domid_out);
+    }
+
+    if (rc == EXIT_SUCCESS) {
+        if ( launch_dm ) {
+            struct domain_create dom_info;
+            memset(&dom_info, 0, sizeof(dom_info));
+            dom_info.ddomid = domid_out;
+            dom_info.dm_restore_file = dm_restore_file;
+            dom_info.debug = debug;
+            dom_info.paused = pause;
+            dom_info.config_file = config_file;
+            dom_info.migrate_fd = -1;
+            dom_info.send_back_fd = -1;
+            rc = create_domain(&dom_info) < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
+        } else if ( !pause )
+            rc = libxl_domain_unpause(ctx, domid_out, NULL);
+    }
+
+    if (rc == EXIT_SUCCESS)
+        fprintf(stderr, "fork-vm command successfully returned domid: %u\n", domid_out);
+    else if ( domid_out != INVALID_DOMID )
+        libxl_domain_destroy(ctx, domid_out, 0);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 2e2d427492..782fbbc24b 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -676,6 +676,12 @@ int create_domain(struct domain_create *dom_info)
 
     int restoring = (restore_file || (migrate_fd >= 0));
 
+#if defined(__i386__) || defined(__x86_64__)
+    /* VM forking */
+    uint32_t ddomid = dom_info->ddomid; /* launch dm for this domain iff set */
+    const char *dm_restore_file = dom_info->dm_restore_file;
+#endif
+
     libxl_domain_config_init(&d_config);
 
     if (restoring) {
@@ -926,6 +932,14 @@ start:
          * restore/migrate-receive it again.
          */
         restoring = 0;
+#if defined(__i386__) || defined(__x86_64__)
+    } else if ( ddomid ) {
+        d_config.dm_restore_file = dm_restore_file;
+        ret = libxl_domain_fork_launch_dm(ctx, &d_config, ddomid,
+                                          autoconnect_console_how);
+        domid = ddomid;
+        ddomid = INVALID_DOMID;
+#endif
     } else if (domid_soft_reset != INVALID_DOMID) {
         /* Do soft reset. */
         ret = libxl_domain_soft_reset(ctx, &d_config, domid_soft_reset,
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: " Tamas K Lengyel
@ 2020-03-25 15:47   ` Roger Pau Monné
  2020-03-25 16:04     ` Tamas K Lengyel
                       ` (3 more replies)
  2020-03-26 12:33   ` Jan Beulich
  1 sibling, 4 replies; 29+ messages in thread
From: Roger Pau Monné @ 2020-03-25 15:47 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Tamas K Lengyel, Jan Beulich,
	xen-devel

Sorry it has taken me a while to get to this.

On Mon, Mar 23, 2020 at 10:04:35AM -0700, Tamas K Lengyel wrote:
> VM forking is the process of creating a domain with an empty memory space and a
> parent domain specified from which to populate the memory when necessary. For
> the new domain to be functional the VM state is copied over as part of the fork
> operation (HVM params, hap allocation, etc).
> 
> Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
> ---
> v12: Minor style adjustments Jan pointed out
>      Convert mem_sharing_is_fork to inline function
> v11: Fully copy vcpu_info pages
>      Setup vcpu_runstate for forks
>      Added TODO note for PV timers
>      Copy shared_info page
>      Add copy_settings function, to be shared with fork_reset in the next patch
> ---
>  xen/arch/x86/domain.c             |  11 +
>  xen/arch/x86/hvm/hvm.c            |   4 +-
>  xen/arch/x86/mm/hap/hap.c         |   3 +-
>  xen/arch/x86/mm/mem_sharing.c     | 368 ++++++++++++++++++++++++++++++
>  xen/arch/x86/mm/p2m.c             |   9 +-
>  xen/common/domain.c               |   3 +
>  xen/include/asm-x86/hap.h         |   1 +
>  xen/include/asm-x86/hvm/hvm.h     |   2 +
>  xen/include/asm-x86/mem_sharing.h |  18 ++
>  xen/include/public/memory.h       |   5 +
>  xen/include/xen/sched.h           |   5 +
>  11 files changed, 424 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index caf2ecad7e..11d3c2216e 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -2202,6 +2202,17 @@ int domain_relinquish_resources(struct domain *d)
>              ret = relinquish_shared_pages(d);
>              if ( ret )
>                  return ret;
> +
> +            /*
> +             * If the domain is forked, decrement the parent's pause count
> +             * and release the domain.
> +             */
> +            if ( mem_sharing_is_fork(d) )
> +            {
> +                domain_unpause(d->parent);
> +                put_domain(d->parent);
> +                d->parent = NULL;
> +            }
>          }
>  #endif
>  
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index a3d115b650..304b3d1562 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1917,7 +1917,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>      }
>  #endif
>  
> -    /* Spurious fault? PoD and log-dirty also take this path. */
> +    /* Spurious fault? PoD, log-dirty and VM forking also take this path. */
>      if ( p2m_is_ram(p2mt) )
>      {
>          rc = 1;
> @@ -4377,7 +4377,7 @@ static int hvm_allow_get_param(struct domain *d,
>      return rc;
>  }
>  
> -static int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value)
> +int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value)
>  {
>      int rc;
>  
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index a6d5e39b02..814d0c3253 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -321,8 +321,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
>  }
>  
>  /* Return the size of the pool, rounded up to the nearest MB */
> -static unsigned int
> -hap_get_allocation(struct domain *d)
> +unsigned int hap_get_allocation(struct domain *d)
>  {
>      unsigned int pg = d->arch.paging.hap.total_pages
>          + d->arch.paging.hap.p2m_pages;
> diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> index 3835bc928f..23deeddff2 100644
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -22,6 +22,7 @@
>  
>  #include <xen/types.h>
>  #include <xen/domain_page.h>
> +#include <xen/event.h>
>  #include <xen/spinlock.h>
>  #include <xen/rwlock.h>
>  #include <xen/mm.h>
> @@ -36,6 +37,8 @@
>  #include <asm/altp2m.h>
>  #include <asm/atomic.h>
>  #include <asm/event.h>
> +#include <asm/hap.h>
> +#include <asm/hvm/hvm.h>
>  #include <xsm/xsm.h>
>  
>  #include "mm-locks.h"
> @@ -1444,6 +1447,334 @@ static inline int mem_sharing_control(struct domain *d, bool enable)
>      return 0;
>  }
>  
> +/*
> + * Forking a page only gets called when the VM faults due to no entry being
> + * in the EPT for the access. Depending on the type of access we either
> + * populate the physmap with a shared entry for read-only access or
> + * fork the page if it's a write access.
> + *
> + * The client p2m is already locked so we only need to lock
> + * the parent's here.
> + */
> +int mem_sharing_fork_page(struct domain *d, gfn_t gfn, bool unsharing)
> +{
> +    int rc = -ENOENT;
> +    shr_handle_t handle;
> +    struct domain *parent = d->parent;
> +    struct p2m_domain *p2m;
> +    unsigned long gfn_l = gfn_x(gfn);
> +    mfn_t mfn, new_mfn;
> +    p2m_type_t p2mt;
> +    struct page_info *page;
> +
> +    if ( !mem_sharing_is_fork(d) )
> +        return -ENOENT;
> +
> +    if ( !unsharing )
> +    {
> +        /* For read-only accesses we just add a shared entry to the physmap */
> +        while ( parent )
> +        {
> +            if ( !(rc = nominate_page(parent, gfn, 0, &handle)) )
> +                break;
> +
> +            parent = parent->parent;
> +        }
> +
> +        if ( !rc )
> +        {
> +            /* The client's p2m is already locked */
> +            struct p2m_domain *pp2m = p2m_get_hostp2m(parent);

Nit: I think you could just use the existing p2m local variable.
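
I.e. something along these lines (compile-untested), since the local p2m
variable only gets assigned further down anyway:

    p2m = p2m_get_hostp2m(parent);
    p2m_lock(p2m);
    rc = add_to_physmap(parent, gfn_l, handle, d, gfn_l, false);
    p2m_unlock(p2m);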

> +            p2m_lock(pp2m);
> +            rc = add_to_physmap(parent, gfn_l, handle, d, gfn_l, false);
> +            p2m_unlock(pp2m);
> +
> +            if ( !rc )
> +                return 0;
> +        }
> +    }
> +
> +    /*
> +     * If it's a write access (ie. unsharing) or if adding a shared entry to
> +     * the physmap failed we'll fork the page directly.
> +     */
> +    p2m = p2m_get_hostp2m(d);
> +    parent = d->parent;
> +
> +    while ( parent )
> +    {
> +        mfn = get_gfn_query(parent, gfn_l, &p2mt);
> +
> +        /*
> +         * We can't fork grant memory from the parent, only regular ram.
> +         */

Nit: single-line comments should use /* ... */ on a single line (here and below).
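
I.e.:

    /* We can't fork grant memory from the parent, only regular ram. */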

> +        if ( mfn_valid(mfn) && p2m_is_ram(p2mt) )
> +            break;
> +
> +        put_gfn(parent, gfn_l);
> +        parent = parent->parent;
> +    }
> +
> +    if ( !parent )
> +        return -ENOENT;
> +
> +    if ( !(page = alloc_domheap_page(d, 0)) )
> +    {
> +        put_gfn(parent, gfn_l);
> +        return -ENOMEM;
> +    }
> +
> +    new_mfn = page_to_mfn(page);
> +    copy_domain_page(new_mfn, mfn);
> +    set_gpfn_from_mfn(mfn_x(new_mfn), gfn_l);
> +
> +    put_gfn(parent, gfn_l);
> +
> +    return p2m->set_entry(p2m, gfn, new_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> +                          p2m->default_access, -1);
> +}
> +
> +static int bring_up_vcpus(struct domain *cd, struct domain *d)
> +{
> +    unsigned int i;
> +    int ret = -EINVAL;

Nit: you can get rid of ret...

> +
> +    if ( d->max_vcpus != cd->max_vcpus ||
> +        (ret = cpupool_move_domain(cd, d->cpupool)) )
> +        return ret;

...and just return -EINVAL here, seeing as it's not used anywhere else.

> +
> +    for ( i = 0; i < cd->max_vcpus; i++ )
> +    {
> +        if ( !d->vcpu[i] || cd->vcpu[i] )
> +            continue;
> +
> +        if ( !vcpu_create(cd, i) )
> +            return -EINVAL;
> +    }
> +
> +    domain_update_node_affinity(cd);
> +    return 0;
> +}
> +
> +static int copy_vcpu_settings(struct domain *cd, struct domain *d)
> +{
> +    unsigned int i;
> +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
> +    int ret = -EINVAL;
> +
> +    for ( i = 0; i < cd->max_vcpus; i++ )
> +    {
> +        const struct vcpu *d_vcpu = d->vcpu[i];
> +        struct vcpu *cd_vcpu = cd->vcpu[i];
> +        struct vcpu_runstate_info runstate;
> +        mfn_t vcpu_info_mfn;
> +
> +        if ( !d_vcpu || !cd_vcpu )
> +            continue;
> +
> +        /*
> +         * Copy & map in the vcpu_info page if the guest uses one
> +         */
> +        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
> +        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
> +        {
> +            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
> +
> +            /*
> +             * Allocate & map the page for it if it hasn't been already
> +             */
> +            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
> +            {
> +                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
> +                unsigned long gfn_l = gfn_x(gfn);
> +                struct page_info *page;
> +
> +                if ( !(page = alloc_domheap_page(cd, 0)) )
> +                    return -ENOMEM;
> +
> +                new_vcpu_info_mfn = page_to_mfn(page);
> +                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
> +
> +                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn, PAGE_ORDER_4K,
> +                                     p2m_ram_rw, p2m->default_access, -1);
> +                if ( ret )
> +                    return ret;
> +
> +                ret = map_vcpu_info(cd_vcpu, gfn_l,
> +                                    d_vcpu->vcpu_info_offset);
> +                if ( ret )
> +                    return ret;
> +            }
> +
> +            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
> +        }
> +
> +        /*
> +         * Setup the vCPU runstate area
> +         */
> +        if ( guest_handle_is_null(runstate_guest(cd_vcpu)) )

Maybe I'm confused, but isn't this check the other way around? If the
parent runstate is not null, copy it to the fork, i.e.:

if ( !guest_handle_is_null(runstate_guest(d_vcpu)) )
{
    ...

> +        {
> +            runstate_guest(cd_vcpu) = runstate_guest(d_vcpu);
> +            vcpu_runstate_get(cd_vcpu, &runstate);
> +            __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1);

I think you should check the return code here.
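
Something like this (untested); __copy_to_guest() returns non-zero when
not everything could be copied:

    if ( __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1) )
        return -EFAULT;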

> +        }
> +
> +        /*
> +         * TODO: to support VMs with PV interfaces copy additional
> +         * settings here, such as PV timers.
> +         */
> +    }
> +
> +    return 0;
> +}
> +
> +static int fork_hap_allocation(struct domain *cd, struct domain *d)
> +{
> +    int rc;
> +    bool preempted;
> +    unsigned long mb = hap_get_allocation(d);
> +
> +    if ( mb == hap_get_allocation(cd) )
> +        return 0;
> +
> +    paging_lock(cd);
> +    rc = hap_set_allocation(cd, mb << (20 - PAGE_SHIFT), &preempted);
> +    paging_unlock(cd);
> +
> +    return preempted ? -ERESTART : rc;
> +}
> +
> +static void copy_tsc(struct domain *cd, struct domain *d)
> +{
> +    uint32_t tsc_mode;
> +    uint32_t gtsc_khz;
> +    uint32_t incarnation;
> +    uint64_t elapsed_nsec;
> +
> +    tsc_get_info(d, &tsc_mode, &elapsed_nsec, &gtsc_khz, &incarnation);
> +    /* Don't bump incarnation on set */
> +    tsc_set_info(cd, tsc_mode, elapsed_nsec, gtsc_khz, incarnation - 1);
> +}
> +
> +static int copy_special_pages(struct domain *cd, struct domain *d)
> +{
> +    mfn_t new_mfn, old_mfn;
> +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
> +    static const unsigned int params[] =
> +    {
> +        HVM_PARAM_STORE_PFN,
> +        HVM_PARAM_IOREQ_PFN,
> +        HVM_PARAM_BUFIOREQ_PFN,
> +        HVM_PARAM_CONSOLE_PFN
> +    };
> +    unsigned int i;
> +    int rc;
> +
> +    for ( i = 0; i < 4; i++ )

Please use ARRAY_SIZE instead of hard coding 4.
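
I.e.:

    for ( i = 0; i < ARRAY_SIZE(params); i++ )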

> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> index 9f51370327..1ed7d13084 100644
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
>  
>      mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
>  
> +    /* Check if we need to fork the page */
> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> +
> +    /* Check if we need to unshare the page */
>      if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
>      {
>          ASSERT(p2m_is_hostp2m(p2m));
> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
>              return page;
>  
>          /* Error path: not a suitable GFN at all */
> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> +             !mem_sharing_is_fork(p2m->domain) )
>              return NULL;
>      }
>  
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index b4eb476a9c..62aed53a16 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
>  
>      v->vcpu_info = new_info;
>      v->vcpu_info_mfn = page_to_mfn(page);
> +#ifdef CONFIG_MEM_SHARING
> +    v->vcpu_info_offset = offset;

There's no need to introduce this field, you can just use v->vcpu_info
& ~PAGE_MASK AFAICT.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork Tamas K Lengyel
@ 2020-03-25 15:52   ` Roger Pau Monné
  2020-03-25 15:54     ` Tamas K Lengyel
  2020-03-26 10:16   ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Roger Pau Monné @ 2020-03-25 15:52 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Tamas K Lengyel, Julien Grall, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Stefano Stabellini, Jan Beulich,
	xen-devel

On Mon, Mar 23, 2020 at 10:04:36AM -0700, Tamas K Lengyel wrote:
> Implement hypercall that allows a fork to shed all memory that got allocated
> for it during its execution and re-load its vCPU context from the parent VM.
> This allows the forked VM to reset into the same state the parent VM is in a
> faster way than creating a new fork would be. Measurements show about a 2x
> speedup during normal fuzzing operations. Performance may vary depending on how
> much memory got allocated for the forked VM. If it has been completely
> deduplicated from the parent VM then creating a new fork would likely be more
> performant.
> 
> Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>

LGTM:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

One minor nit below.

> ---
> v12: remove continuation & add comment back
>      address style issues pointed out by Jan
> ---
>  xen/arch/x86/mm/mem_sharing.c | 77 +++++++++++++++++++++++++++++++++++
>  xen/include/public/memory.h   |  1 +
>  2 files changed, 78 insertions(+)
> 
> diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> index 23deeddff2..930a5f58ef 100644
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -1775,6 +1775,60 @@ static int fork(struct domain *cd, struct domain *d)
>      return rc;
>  }
>  
> +/*
> + * The fork reset operation is intended to be used on short-lived forks only.
> + * There is no hypercall continuation operation implemented for this reason.
> + * For forks that obtain a larger memory footprint it is likely going to be
> + * more performant to create a new fork instead of resetting an existing one.
> + *
> + * TODO: In case this hypercall would become useful on forks with larger memory
> + * footprints the hypercall continuation should be implemented (or if this
> + * feature needs to become "stable").
> + */
> +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
> +{
> +    int rc;
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    struct page_info *page, *tmp;
> +
> +    spin_lock(&d->page_alloc_lock);
> +    domain_pause(d);
> +
> +    page_list_for_each_safe(page, tmp, &d->page_list)
> +    {
> +        p2m_type_t p2mt;
> +        p2m_access_t p2ma;
> +        mfn_t mfn = page_to_mfn(page);
> +        gfn_t gfn = mfn_to_gfn(d, mfn);
> +
> +        mfn = __get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma,
> +                                    0, NULL, false);
> +
> +        /* only reset pages that are sharable */
> +        if ( !p2m_is_sharable(p2mt) )
> +            continue;
> +
> +        /* take an extra reference or just skip if can't for whatever reason */
> +        if ( !get_page(page, d) )
> +            continue;

You can join both conditions above into a single one, if both just
need to perform a continue.
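
E.g. (untested), relying on short-circuit evaluation so the extra
reference is only taken for sharable pages:

    if ( !p2m_is_sharable(p2mt) || !get_page(page, d) )
        continue;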

Thanks, Roger.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-25 15:52   ` Roger Pau Monné
@ 2020-03-25 15:54     ` Tamas K Lengyel
  0 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 15:54 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich,
	Xen-devel

On Wed, Mar 25, 2020 at 9:52 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> On Mon, Mar 23, 2020 at 10:04:36AM -0700, Tamas K Lengyel wrote:
> > Implement hypercall that allows a fork to shed all memory that got allocated
> > for it during its execution and re-load its vCPU context from the parent VM.
> > This allows the forked VM to reset into the same state the parent VM is in a
> > faster way than creating a new fork would be. Measurements show about a 2x
> > speedup during normal fuzzing operations. Performance may vary depending on how
> > much memory got allocated for the forked VM. If it has been completely
> > deduplicated from the parent VM then creating a new fork would likely be more
> > performant.
> >
> > Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
>
> LGTM:
>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
>
> One minor nit below.
>
> > ---
> > v12: remove continuation & add comment back
> >      address style issues pointed out by Jan
> > ---
> >  xen/arch/x86/mm/mem_sharing.c | 77 +++++++++++++++++++++++++++++++++++
> >  xen/include/public/memory.h   |  1 +
> >  2 files changed, 78 insertions(+)
> >
> > diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> > index 23deeddff2..930a5f58ef 100644
> > --- a/xen/arch/x86/mm/mem_sharing.c
> > +++ b/xen/arch/x86/mm/mem_sharing.c
> > @@ -1775,6 +1775,60 @@ static int fork(struct domain *cd, struct domain *d)
> >      return rc;
> >  }
> >
> > +/*
> > + * The fork reset operation is intended to be used on short-lived forks only.
> > + * There is no hypercall continuation operation implemented for this reason.
> > + * For forks that obtain a larger memory footprint it is likely going to be
> > + * more performant to create a new fork instead of resetting an existing one.
> > + *
> > + * TODO: In case this hypercall would become useful on forks with larger memory
> > + * footprints the hypercall continuation should be implemented (or if this
> > + * feature needs to become "stable").
> > + */
> > +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
> > +{
> > +    int rc;
> > +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> > +    struct page_info *page, *tmp;
> > +
> > +    spin_lock(&d->page_alloc_lock);
> > +    domain_pause(d);
> > +
> > +    page_list_for_each_safe(page, tmp, &d->page_list)
> > +    {
> > +        p2m_type_t p2mt;
> > +        p2m_access_t p2ma;
> > +        mfn_t mfn = page_to_mfn(page);
> > +        gfn_t gfn = mfn_to_gfn(d, mfn);
> > +
> > +        mfn = __get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma,
> > +                                    0, NULL, false);
> > +
> > +        /* only reset pages that are sharable */
> > +        if ( !p2m_is_sharable(p2mt) )
> > +            continue;
> > +
> > +        /* take an extra reference or just skip if can't for whatever reason */
> > +        if ( !get_page(page, d) )
> > +            continue;
>
> You can join both conditions above into a single one, if both just
> need to perform a continue.

We could, but I think it's easier to read this way, so I prefer to keep
it separate.

Thanks for the review!
Tamas


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 15:47   ` Roger Pau Monné
@ 2020-03-25 16:04     ` Tamas K Lengyel
  2020-03-25 16:16     ` Tamas K Lengyel
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 16:04 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich,
	Xen-devel

On Wed, Mar 25, 2020 at 9:47 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> Sorry it has taken me a while to get to this.

Thanks for the review; I'm addressing all the items you noticed in the
next revision.

Tamas


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 15:47   ` Roger Pau Monné
  2020-03-25 16:04     ` Tamas K Lengyel
@ 2020-03-25 16:16     ` Tamas K Lengyel
  2020-03-25 16:34     ` Tamas K Lengyel
  2020-03-26  7:07     ` Jan Beulich
  3 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 16:16 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich,
	Xen-devel

> > +static int bring_up_vcpus(struct domain *cd, struct domain *d)
> > +{
> > +    unsigned int i;
> > +    int ret = -EINVAL;
>
> Nit: you can get rid of ret...
>
> > +
> > +    if ( d->max_vcpus != cd->max_vcpus ||
> > +        (ret = cpupool_move_domain(cd, d->cpupool)) )
> > +        return ret;
>
> ...and just return -EINVAL here. Seeing as it's not used anywhere
> else.
>

It is actually still needed; note that we store the return value of
cpupool_move_domain() in it.

Tamas


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 15:47   ` Roger Pau Monné
  2020-03-25 16:04     ` Tamas K Lengyel
  2020-03-25 16:16     ` Tamas K Lengyel
@ 2020-03-25 16:34     ` Tamas K Lengyel
  2020-03-25 16:42       ` Julien Grall
  2020-03-26  7:07     ` Jan Beulich
  3 siblings, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 16:34 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich,
	Xen-devel

On Wed, Mar 25, 2020 at 9:47 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> Sorry it has taken me a while to get to this.
>
> On Mon, Mar 23, 2020 at 10:04:35AM -0700, Tamas K Lengyel wrote:
> > VM forking is the process of creating a domain with an empty memory space and a
> > parent domain specified from which to populate the memory when necessary. For
> > the new domain to be functional the VM state is copied over as part of the fork
> > operation (HVM params, hap allocation, etc).
> >
> > Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
> > ---
> > v12: Minor style adjustments Jan pointed out
> >      Convert mem_sharing_is_fork to inline function
> > v11: Fully copy vcpu_info pages
> >      Setup vcpu_runstate for forks
> >      Added TODO note for PV timers
> >      Copy shared_info page
> >      Add copy_settings function, to be shared with fork_reset in the next patch
> > ---
> >  xen/arch/x86/domain.c             |  11 +
> >  xen/arch/x86/hvm/hvm.c            |   4 +-
> >  xen/arch/x86/mm/hap/hap.c         |   3 +-
> >  xen/arch/x86/mm/mem_sharing.c     | 368 ++++++++++++++++++++++++++++++
> >  xen/arch/x86/mm/p2m.c             |   9 +-
> >  xen/common/domain.c               |   3 +
> >  xen/include/asm-x86/hap.h         |   1 +
> >  xen/include/asm-x86/hvm/hvm.h     |   2 +
> >  xen/include/asm-x86/mem_sharing.h |  18 ++
> >  xen/include/public/memory.h       |   5 +
> >  xen/include/xen/sched.h           |   5 +
> >  11 files changed, 424 insertions(+), 5 deletions(-)
> >
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index caf2ecad7e..11d3c2216e 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -2202,6 +2202,17 @@ int domain_relinquish_resources(struct domain *d)
> >              ret = relinquish_shared_pages(d);
> >              if ( ret )
> >                  return ret;
> > +
> > +            /*
> > +             * If the domain is forked, decrement the parent's pause count
> > +             * and release the domain.
> > +             */
> > +            if ( mem_sharing_is_fork(d) )
> > +            {
> > +                domain_unpause(d->parent);
> > +                put_domain(d->parent);
> > +                d->parent = NULL;
> > +            }
> >          }
> >  #endif
> >
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index a3d115b650..304b3d1562 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -1917,7 +1917,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> >      }
> >  #endif
> >
> > -    /* Spurious fault? PoD and log-dirty also take this path. */
> > +    /* Spurious fault? PoD, log-dirty and VM forking also take this path. */
> >      if ( p2m_is_ram(p2mt) )
> >      {
> >          rc = 1;
> > @@ -4377,7 +4377,7 @@ static int hvm_allow_get_param(struct domain *d,
> >      return rc;
> >  }
> >
> > -static int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value)
> > +int hvm_get_param(struct domain *d, uint32_t index, uint64_t *value)
> >  {
> >      int rc;
> >
> > diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> > index a6d5e39b02..814d0c3253 100644
> > --- a/xen/arch/x86/mm/hap/hap.c
> > +++ b/xen/arch/x86/mm/hap/hap.c
> > @@ -321,8 +321,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
> >  }
> >
> >  /* Return the size of the pool, rounded up to the nearest MB */
> > -static unsigned int
> > -hap_get_allocation(struct domain *d)
> > +unsigned int hap_get_allocation(struct domain *d)
> >  {
> >      unsigned int pg = d->arch.paging.hap.total_pages
> >          + d->arch.paging.hap.p2m_pages;
> > diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> > index 3835bc928f..23deeddff2 100644
> > --- a/xen/arch/x86/mm/mem_sharing.c
> > +++ b/xen/arch/x86/mm/mem_sharing.c
> > @@ -22,6 +22,7 @@
> >
> >  #include <xen/types.h>
> >  #include <xen/domain_page.h>
> > +#include <xen/event.h>
> >  #include <xen/spinlock.h>
> >  #include <xen/rwlock.h>
> >  #include <xen/mm.h>
> > @@ -36,6 +37,8 @@
> >  #include <asm/altp2m.h>
> >  #include <asm/atomic.h>
> >  #include <asm/event.h>
> > +#include <asm/hap.h>
> > +#include <asm/hvm/hvm.h>
> >  #include <xsm/xsm.h>
> >
> >  #include "mm-locks.h"
> > @@ -1444,6 +1447,334 @@ static inline int mem_sharing_control(struct domain *d, bool enable)
> >      return 0;
> >  }
> >
> > +/*
> > + * Forking a page only gets called when the VM faults due to no entry being
> > + * in the EPT for the access. Depending on the type of access we either
> > + * populate the physmap with a shared entry for read-only access or
> > + * fork the page if it's a write access.
> > + *
> > + * The client p2m is already locked so we only need to lock
> > + * the parent's here.
> > + */
> > +int mem_sharing_fork_page(struct domain *d, gfn_t gfn, bool unsharing)
> > +{
> > +    int rc = -ENOENT;
> > +    shr_handle_t handle;
> > +    struct domain *parent = d->parent;
> > +    struct p2m_domain *p2m;
> > +    unsigned long gfn_l = gfn_x(gfn);
> > +    mfn_t mfn, new_mfn;
> > +    p2m_type_t p2mt;
> > +    struct page_info *page;
> > +
> > +    if ( !mem_sharing_is_fork(d) )
> > +        return -ENOENT;
> > +
> > +    if ( !unsharing )
> > +    {
> > +        /* For read-only accesses we just add a shared entry to the physmap */
> > +        while ( parent )
> > +        {
> > +            if ( !(rc = nominate_page(parent, gfn, 0, &handle)) )
> > +                break;
> > +
> > +            parent = parent->parent;
> > +        }
> > +
> > +        if ( !rc )
> > +        {
> > +            /* The client's p2m is already locked */
> > +            struct p2m_domain *pp2m = p2m_get_hostp2m(parent);
>
> Nit: I think you could just use the existing p2m local variable.
>
> > +            p2m_lock(pp2m);
> > +            rc = add_to_physmap(parent, gfn_l, handle, d, gfn_l, false);
> > +            p2m_unlock(pp2m);
> > +
> > +            if ( !rc )
> > +                return 0;
> > +        }
> > +    }
> > +
> > +    /*
> > +     * If it's a write access (ie. unsharing) or if adding a shared entry to
> > +     * the physmap failed we'll fork the page directly.
> > +     */
> > +    p2m = p2m_get_hostp2m(d);
> > +    parent = d->parent;
> > +
> > +    while ( parent )
> > +    {
> > +        mfn = get_gfn_query(parent, gfn_l, &p2mt);
> > +
> > +        /*
> > +         * We can't fork grant memory from the parent, only regular ram.
> > +         */
>
> Nit: single line comments should use /* ... */ (here and below).
>
> > +        if ( mfn_valid(mfn) && p2m_is_ram(p2mt) )
> > +            break;
> > +
> > +        put_gfn(parent, gfn_l);
> > +        parent = parent->parent;
> > +    }
> > +
> > +    if ( !parent )
> > +        return -ENOENT;
> > +
> > +    if ( !(page = alloc_domheap_page(d, 0)) )
> > +    {
> > +        put_gfn(parent, gfn_l);
> > +        return -ENOMEM;
> > +    }
> > +
> > +    new_mfn = page_to_mfn(page);
> > +    copy_domain_page(new_mfn, mfn);
> > +    set_gpfn_from_mfn(mfn_x(new_mfn), gfn_l);
> > +
> > +    put_gfn(parent, gfn_l);
> > +
> > +    return p2m->set_entry(p2m, gfn, new_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> > +                          p2m->default_access, -1);
> > +}
> > +
> > +static int bring_up_vcpus(struct domain *cd, struct domain *d)
> > +{
> > +    unsigned int i;
> > +    int ret = -EINVAL;
>
> Nit: you can get rid of ret...
>
> > +
> > +    if ( d->max_vcpus != cd->max_vcpus ||
> > +        (ret = cpupool_move_domain(cd, d->cpupool)) )
> > +        return ret;
>
> ...and just return -EINVAL here. Seeing as it's not used anywhere
> else.
>
> > +
> > +    for ( i = 0; i < cd->max_vcpus; i++ )
> > +    {
> > +        if ( !d->vcpu[i] || cd->vcpu[i] )
> > +            continue;
> > +
> > +        if ( !vcpu_create(cd, i) )
> > +            return -EINVAL;
> > +    }
> > +
> > +    domain_update_node_affinity(cd);
> > +    return 0;
> > +}
> > +
> > +static int copy_vcpu_settings(struct domain *cd, struct domain *d)
> > +{
> > +    unsigned int i;
> > +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
> > +    int ret = -EINVAL;
> > +
> > +    for ( i = 0; i < cd->max_vcpus; i++ )
> > +    {
> > +        const struct vcpu *d_vcpu = d->vcpu[i];
> > +        struct vcpu *cd_vcpu = cd->vcpu[i];
> > +        struct vcpu_runstate_info runstate;
> > +        mfn_t vcpu_info_mfn;
> > +
> > +        if ( !d_vcpu || !cd_vcpu )
> > +            continue;
> > +
> > +        /*
> > +         * Copy & map in the vcpu_info page if the guest uses one
> > +         */
> > +        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
> > +        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
> > +        {
> > +            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
> > +
> > +            /*
> > +             * Allocate & map the page for it if it hasn't been already
> > +             */
> > +            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
> > +            {
> > +                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
> > +                unsigned long gfn_l = gfn_x(gfn);
> > +                struct page_info *page;
> > +
> > +                if ( !(page = alloc_domheap_page(cd, 0)) )
> > +                    return -ENOMEM;
> > +
> > +                new_vcpu_info_mfn = page_to_mfn(page);
> > +                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
> > +
> > +                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn, PAGE_ORDER_4K,
> > +                                     p2m_ram_rw, p2m->default_access, -1);
> > +                if ( ret )
> > +                    return ret;
> > +
> > +                ret = map_vcpu_info(cd_vcpu, gfn_l,
> > +                                    d_vcpu->vcpu_info_offset);
> > +                if ( ret )
> > +                    return ret;
> > +            }
> > +
> > +            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
> > +        }
> > +
> > +        /*
> > +         * Setup the vCPU runstate area
> > +         */
> > +        if ( guest_handle_is_null(runstate_guest(cd_vcpu)) )
>
> Maybe I'm confused, but isn't this the other way around and you need
> to check? If the parent runstate is not null copy it to the fork,
> ie:
>
> if ( !guest_handle_is_null(runstate_guest(d_vcpu)) )
> {
>     ...
>
> > +        {
> > +            runstate_guest(cd_vcpu) = runstate_guest(d_vcpu);
> > +            vcpu_runstate_get(cd_vcpu, &runstate);
> > +            __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1);
>
> You should check the return code I think.
>
> > +        }
> > +
> > +        /*
> > +         * TODO: to support VMs with PV interfaces copy additional
> > +         * settings here, such as PV timers.
> > +         */
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int fork_hap_allocation(struct domain *cd, struct domain *d)
> > +{
> > +    int rc;
> > +    bool preempted;
> > +    unsigned long mb = hap_get_allocation(d);
> > +
> > +    if ( mb == hap_get_allocation(cd) )
> > +        return 0;
> > +
> > +    paging_lock(cd);
> > +    rc = hap_set_allocation(cd, mb << (20 - PAGE_SHIFT), &preempted);
> > +    paging_unlock(cd);
> > +
> > +    return preempted ? -ERESTART : rc;
> > +}
> > +
> > +static void copy_tsc(struct domain *cd, struct domain *d)
> > +{
> > +    uint32_t tsc_mode;
> > +    uint32_t gtsc_khz;
> > +    uint32_t incarnation;
> > +    uint64_t elapsed_nsec;
> > +
> > +    tsc_get_info(d, &tsc_mode, &elapsed_nsec, &gtsc_khz, &incarnation);
> > +    /* Don't bump incarnation on set */
> > +    tsc_set_info(cd, tsc_mode, elapsed_nsec, gtsc_khz, incarnation - 1);
> > +}
> > +
> > +static int copy_special_pages(struct domain *cd, struct domain *d)
> > +{
> > +    mfn_t new_mfn, old_mfn;
> > +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
> > +    static const unsigned int params[] =
> > +    {
> > +        HVM_PARAM_STORE_PFN,
> > +        HVM_PARAM_IOREQ_PFN,
> > +        HVM_PARAM_BUFIOREQ_PFN,
> > +        HVM_PARAM_CONSOLE_PFN
> > +    };
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    for ( i = 0; i < 4; i++ )
>
> Please use ARRAY_SIZE instead of hard coding 4.
>
> > diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> > index 9f51370327..1ed7d13084 100644
> > --- a/xen/arch/x86/mm/p2m.c
> > +++ b/xen/arch/x86/mm/p2m.c
> > @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
> >
> >      mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> >
> > +    /* Check if we need to fork the page */
> > +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> > +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> > +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > +
> > +    /* Check if we need to unshare the page */
> >      if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
> >      {
> >          ASSERT(p2m_is_hostp2m(p2m));
> > @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
> >              return page;
> >
> >          /* Error path: not a suitable GFN at all */
> > -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> > +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> > +             !mem_sharing_is_fork(p2m->domain) )
> >              return NULL;
> >      }
> >
> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > index b4eb476a9c..62aed53a16 100644
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> >
> >      v->vcpu_info = new_info;
> >      v->vcpu_info_mfn = page_to_mfn(page);
> > +#ifdef CONFIG_MEM_SHARING
> > +    v->vcpu_info_offset = offset;
>
> There's no need to introduce this field, you can just use v->vcpu_info
> & ~PAGE_MASK AFAICT.

Just doing what you suggest above results in:

mem_sharing.c:1603:55: error: invalid operands to binary & (have
‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
int’)
                                     d_vcpu->vcpu_info & ~PAGE_MASK);

I can of course cast the vcpu_info pointer to (long int), but it's just
a bit ugly. Thoughts?

Tamas


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 16:34     ` Tamas K Lengyel
@ 2020-03-25 16:42       ` Julien Grall
  2020-03-25 16:47         ` Tamas K Lengyel
  2020-03-25 16:54         ` Roger Pau Monné
  0 siblings, 2 replies; 29+ messages in thread
From: Julien Grall @ 2020-03-25 16:42 UTC (permalink / raw)
  To: Tamas K Lengyel, Roger Pau Monné
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel

Hi,

On 25/03/2020 16:34, Tamas K Lengyel wrote:
>>> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
>>> index 9f51370327..1ed7d13084 100644
>>> --- a/xen/arch/x86/mm/p2m.c
>>> +++ b/xen/arch/x86/mm/p2m.c
>>> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
>>>
>>>       mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
>>>
>>> +    /* Check if we need to fork the page */
>>> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
>>> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
>>> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
>>> +
>>> +    /* Check if we need to unshare the page */
>>>       if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
>>>       {
>>>           ASSERT(p2m_is_hostp2m(p2m));
>>> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
>>>               return page;
>>>
>>>           /* Error path: not a suitable GFN at all */
>>> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
>>> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
>>> +             !mem_sharing_is_fork(p2m->domain) )
>>>               return NULL;
>>>       }
>>>
>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>>> index b4eb476a9c..62aed53a16 100644
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
>>>
>>>       v->vcpu_info = new_info;
>>>       v->vcpu_info_mfn = page_to_mfn(page);
>>> +#ifdef CONFIG_MEM_SHARING
>>> +    v->vcpu_info_offset = offset;
>>
>> There's no need to introduce this field, you can just use v->vcpu_info
>> & ~PAGE_MASK AFAICT.
> 
> Just doing what you suggest above results in:
> 
> mem_sharing.c:1603:55: error: invalid operands to binary & (have
> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> int’)
>                                       d_vcpu->vcpu_info & ~PAGE_MASK);
> 
> I can of course cast the vcpu_info pointer to (long int), it's just a
> bit ugly. Thoughts?

FWIW, I will also need the offset for liveupdate. I have used
(unsigned long)v->vcpu_info & ~PAGE_MASK so far, but this is not really
pretty.

So I am all for either a new field or a macro hiding this ugliness.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 16:42       ` Julien Grall
@ 2020-03-25 16:47         ` Tamas K Lengyel
  2020-03-25 16:51           ` Julien Grall
  2020-03-25 16:54         ` Roger Pau Monné
  1 sibling, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 16:47 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel,
	Roger Pau Monné

On Wed, Mar 25, 2020 at 10:42 AM Julien Grall <julien@xen.org> wrote:
>
> Hi,
>
> On 25/03/2020 16:34, Tamas K Lengyel wrote:
> >>> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> >>> index 9f51370327..1ed7d13084 100644
> >>> --- a/xen/arch/x86/mm/p2m.c
> >>> +++ b/xen/arch/x86/mm/p2m.c
> >>> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
> >>>
> >>>       mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> >>>
> >>> +    /* Check if we need to fork the page */
> >>> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> >>> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> >>> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> >>> +
> >>> +    /* Check if we need to unshare the page */
> >>>       if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
> >>>       {
> >>>           ASSERT(p2m_is_hostp2m(p2m));
> >>> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
> >>>               return page;
> >>>
> >>>           /* Error path: not a suitable GFN at all */
> >>> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> >>> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> >>> +             !mem_sharing_is_fork(p2m->domain) )
> >>>               return NULL;
> >>>       }
> >>>
> >>> diff --git a/xen/common/domain.c b/xen/common/domain.c
> >>> index b4eb476a9c..62aed53a16 100644
> >>> --- a/xen/common/domain.c
> >>> +++ b/xen/common/domain.c
> >>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> >>>
> >>>       v->vcpu_info = new_info;
> >>>       v->vcpu_info_mfn = page_to_mfn(page);
> >>> +#ifdef CONFIG_MEM_SHARING
> >>> +    v->vcpu_info_offset = offset;
> >>
> >> There's no need to introduce this field, you can just use v->vcpu_info
> >> & ~PAGE_MASK AFAICT.
> >
> > Just doing what you suggest above results in:
> >
> > mem_sharing.c:1603:55: error: invalid operands to binary & (have
> > ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> > int’)
> >                                       d_vcpu->vcpu_info & ~PAGE_MASK);
> >
> > I can of course cast the vcpu_info pointer to (long int), it's just a
> > bit ugly. Thoughts?
>
> FWIW, I will also need the offset for liveupdate. I have used (unsigned
> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
>
> >> So I am all for either a new field or a macro hiding this ugliness.

A macro sounds like a good way to go, no need for an extra field if we
can calculate it based on the currently existing one. How about

#define VCPU_INFO_OFFSET(v) (((unsigned long)v->vcpu_info) & ~PAGE_MASK)

?
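
For illustration, the call site in copy_vcpu_settings() would then
become something like (just a sketch):

    ret = map_vcpu_info(cd_vcpu, gfn_l, VCPU_INFO_OFFSET(d_vcpu));

and the vcpu_info_offset field (with its CONFIG_MEM_SHARING ifdef)
could be dropped again.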

Tamas



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 16:47         ` Tamas K Lengyel
@ 2020-03-25 16:51           ` Julien Grall
  2020-03-25 17:00             ` Tamas K Lengyel
  0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2020-03-25 16:51 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel,
	Roger Pau Monné



On 25/03/2020 16:47, Tamas K Lengyel wrote:
> On Wed, Mar 25, 2020 at 10:42 AM Julien Grall <julien@xen.org> wrote:
>>
>> Hi,
>>
>> On 25/03/2020 16:34, Tamas K Lengyel wrote:
>>>>> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
>>>>> index 9f51370327..1ed7d13084 100644
>>>>> --- a/xen/arch/x86/mm/p2m.c
>>>>> +++ b/xen/arch/x86/mm/p2m.c
>>>>> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
>>>>>
>>>>>        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
>>>>>
>>>>> +    /* Check if we need to fork the page */
>>>>> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
>>>>> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
>>>>> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
>>>>> +
>>>>> +    /* Check if we need to unshare the page */
>>>>>        if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
>>>>>        {
>>>>>            ASSERT(p2m_is_hostp2m(p2m));
>>>>> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
>>>>>                return page;
>>>>>
>>>>>            /* Error path: not a suitable GFN at all */
>>>>> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
>>>>> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
>>>>> +             !mem_sharing_is_fork(p2m->domain) )
>>>>>                return NULL;
>>>>>        }
>>>>>
>>>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>>>>> index b4eb476a9c..62aed53a16 100644
>>>>> --- a/xen/common/domain.c
>>>>> +++ b/xen/common/domain.c
>>>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
>>>>>
>>>>>        v->vcpu_info = new_info;
>>>>>        v->vcpu_info_mfn = page_to_mfn(page);
>>>>> +#ifdef CONFIG_MEM_SHARING
>>>>> +    v->vcpu_info_offset = offset;
>>>>
>>>> There's no need to introduce this field, you can just use v->vcpu_info
>>>> & ~PAGE_MASK AFAICT.
>>>
>>> Just doing what you suggest above results in:
>>>
>>> mem_sharing.c:1603:55: error: invalid operands to binary & (have
>>> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
>>> int’)
>>>                                        d_vcpu->vcpu_info & ~PAGE_MASK);
>>>
>>> I can of course cast the vcpu_info pointer to (long int), it's just a
>>> bit ugly. Thoughts?
>>
>> FWIW, I will also need the offset for liveupdate. I have used (unsigned
>> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
>>
>> So I am all for either a new field or a macro hiding this ugliness.
> 
> A macro sounds like a good way to go, no need for an extra field if we
> can calculate it based on the currently existing one. How about
> 
> #define VCPU_INFO_OFFSET(v) (((unsigned long)v->vcpu_info) & ~PAGE_MASK)

I was more thinking a generic macro to find the offset in a page.

#define PAGE_OFFSET(ptr) ((unsigned long)(ptr) & ~PAGE_MASK)
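
Used at the call site in question this would then read (sketch):

    ret = map_vcpu_info(cd_vcpu, gfn_l, PAGE_OFFSET(d_vcpu->vcpu_info));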

Anyway, I am happy either way.

Cheers,

-- 
Julien Grall



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 16:42       ` Julien Grall
  2020-03-25 16:47         ` Tamas K Lengyel
@ 2020-03-25 16:54         ` Roger Pau Monné
  2020-03-26  7:02           ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Roger Pau Monné @ 2020-03-25 16:54 UTC (permalink / raw)
  To: Julien Grall, Tamas K Lengyel
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel

On Wed, Mar 25, 2020 at 04:42:07PM +0000, Julien Grall wrote:
> Hi,
> 
> On 25/03/2020 16:34, Tamas K Lengyel wrote:
> > > > diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> > > > index 9f51370327..1ed7d13084 100644
> > > > --- a/xen/arch/x86/mm/p2m.c
> > > > +++ b/xen/arch/x86/mm/p2m.c
> > > > @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
> > > > 
> > > >       mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > > > 
> > > > +    /* Check if we need to fork the page */
> > > > +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> > > > +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> > > > +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > > > +
> > > > +    /* Check if we need to unshare the page */
> > > >       if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
> > > >       {
> > > >           ASSERT(p2m_is_hostp2m(p2m));
> > > > @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
> > > >               return page;
> > > > 
> > > >           /* Error path: not a suitable GFN at all */
> > > > -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> > > > +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> > > > +             !mem_sharing_is_fork(p2m->domain) )
> > > >               return NULL;
> > > >       }
> > > > 
> > > > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > > > index b4eb476a9c..62aed53a16 100644
> > > > --- a/xen/common/domain.c
> > > > +++ b/xen/common/domain.c
> > > > @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> > > > 
> > > >       v->vcpu_info = new_info;
> > > >       v->vcpu_info_mfn = page_to_mfn(page);
> > > > +#ifdef CONFIG_MEM_SHARING
> > > > +    v->vcpu_info_offset = offset;
> > > 
> > > There's no need to introduce this field, you can just use v->vcpu_info
> > > & ~PAGE_MASK AFAICT.
> > 
> > Just doing what you suggest above results in:
> > 
> > mem_sharing.c:1603:55: error: invalid operands to binary & (have
> > ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> > int’)
> >                                       d_vcpu->vcpu_info & ~PAGE_MASK);
> > 
> > I can of course cast the vcpu_info pointer to (long int), it's just a
> > bit ugly. Thoughts?
> 
> FWIW, I will also need the offset for liveupdate. I have used (unsigned
> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.

I think using:

(vaddr_t)v->vcpu_info & ~PAGE_MASK

is acceptable, but that would require adding a vaddr_t typedef to x86.

Adding a macro to do so would be OK by me, maybe PAGE_OFFSET or some
such?

> So I am all for either a new field or a macro hiding this ugliness.

Macro would be better IMO, as it's redundant to store the offset when
we can obtain it from an existing field.

Thanks, Roger.



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 16:51           ` Julien Grall
@ 2020-03-25 17:00             ` Tamas K Lengyel
  2020-03-25 17:16               ` Roger Pau Monné
  0 siblings, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 17:00 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel,
	Roger Pau Monné

On Wed, Mar 25, 2020 at 10:52 AM Julien Grall <julien@xen.org> wrote:
>
>
>
> On 25/03/2020 16:47, Tamas K Lengyel wrote:
> > On Wed, Mar 25, 2020 at 10:42 AM Julien Grall <julien@xen.org> wrote:
> >>
> >> Hi,
> >>
> >> On 25/03/2020 16:34, Tamas K Lengyel wrote:
> >>>>> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> >>>>> index 9f51370327..1ed7d13084 100644
> >>>>> --- a/xen/arch/x86/mm/p2m.c
> >>>>> +++ b/xen/arch/x86/mm/p2m.c
> >>>>> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
> >>>>>
> >>>>>        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> >>>>>
> >>>>> +    /* Check if we need to fork the page */
> >>>>> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> >>>>> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> >>>>> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> >>>>> +
> >>>>> +    /* Check if we need to unshare the page */
> >>>>>        if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
> >>>>>        {
> >>>>>            ASSERT(p2m_is_hostp2m(p2m));
> >>>>> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
> >>>>>                return page;
> >>>>>
> >>>>>            /* Error path: not a suitable GFN at all */
> >>>>> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> >>>>> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> >>>>> +             !mem_sharing_is_fork(p2m->domain) )
> >>>>>                return NULL;
> >>>>>        }
> >>>>>
> >>>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
> >>>>> index b4eb476a9c..62aed53a16 100644
> >>>>> --- a/xen/common/domain.c
> >>>>> +++ b/xen/common/domain.c
> >>>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> >>>>>
> >>>>>        v->vcpu_info = new_info;
> >>>>>        v->vcpu_info_mfn = page_to_mfn(page);
> >>>>> +#ifdef CONFIG_MEM_SHARING
> >>>>> +    v->vcpu_info_offset = offset;
> >>>>
> >>>> There's no need to introduce this field, you can just use v->vcpu_info
> >>>> & ~PAGE_MASK AFAICT.
> >>>
> >>> Just doing what you suggest above results in:
> >>>
> >>> mem_sharing.c:1603:55: error: invalid operands to binary & (have
> >>> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> >>> int’)
> >>>                                        d_vcpu->vcpu_info & ~PAGE_MASK);
> >>>
> >>> I can of course cast the vcpu_info pointer to (long int), it's just a
> >>> bit ugly. Thoughts?
> >>
> >> FWIW, I will also need the offset for liveupdate. I have used (unsigned
> >> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
> >>
> >> So I am all for either a new field or a macro hiding this ugliness.
> >
> > A macro sounds like a good way to go, no need for an extra field if we
> > can calculate it based on the currently existing one. How about
> >
> > #define VCPU_INFO_OFFSET(v) (((unsigned long)v->vcpu_info) & ~PAGE_MASK)
>
> I was more thinking a generic macro to find the offset in a page.
>
> #define PAGE_OFFSET(ptr) ((unsigned long)(ptr) & ~PAGE_MASK)

LGTM. Should we stuff this into xen/sched.h or asm/page.h?

Tamas



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 17:00             ` Tamas K Lengyel
@ 2020-03-25 17:16               ` Roger Pau Monné
  2020-03-25 17:47                 ` Tamas K Lengyel
  0 siblings, 1 reply; 29+ messages in thread
From: Roger Pau Monné @ 2020-03-25 17:16 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel, Julien Grall

On Wed, Mar 25, 2020 at 11:00:05AM -0600, Tamas K Lengyel wrote:
> On Wed, Mar 25, 2020 at 10:52 AM Julien Grall <julien@xen.org> wrote:
> >
> >
> >
> > On 25/03/2020 16:47, Tamas K Lengyel wrote:
> > > On Wed, Mar 25, 2020 at 10:42 AM Julien Grall <julien@xen.org> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 25/03/2020 16:34, Tamas K Lengyel wrote:
> > >>>>> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> > >>>>> index 9f51370327..1ed7d13084 100644
> > >>>>> --- a/xen/arch/x86/mm/p2m.c
> > >>>>> +++ b/xen/arch/x86/mm/p2m.c
> > >>>>> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
> > >>>>>
> > >>>>>        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > >>>>>
> > >>>>> +    /* Check if we need to fork the page */
> > >>>>> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> > >>>>> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> > >>>>> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > >>>>> +
> > >>>>> +    /* Check if we need to unshare the page */
> > >>>>>        if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
> > >>>>>        {
> > >>>>>            ASSERT(p2m_is_hostp2m(p2m));
> > >>>>> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
> > >>>>>                return page;
> > >>>>>
> > >>>>>            /* Error path: not a suitable GFN at all */
> > >>>>> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> > >>>>> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> > >>>>> +             !mem_sharing_is_fork(p2m->domain) )
> > >>>>>                return NULL;
> > >>>>>        }
> > >>>>>
> > >>>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
> > >>>>> index b4eb476a9c..62aed53a16 100644
> > >>>>> --- a/xen/common/domain.c
> > >>>>> +++ b/xen/common/domain.c
> > >>>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> > >>>>>
> > >>>>>        v->vcpu_info = new_info;
> > >>>>>        v->vcpu_info_mfn = page_to_mfn(page);
> > >>>>> +#ifdef CONFIG_MEM_SHARING
> > >>>>> +    v->vcpu_info_offset = offset;
> > >>>>
> > >>>> There's no need to introduce this field, you can just use v->vcpu_info
> > >>>> & ~PAGE_MASK AFAICT.
> > >>>
> > >>> Just doing what you suggest above results in:
> > >>>
> > >>> mem_sharing.c:1603:55: error: invalid operands to binary & (have
> > >>> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> > >>> int’)
> > >>>                                        d_vcpu->vcpu_info & ~PAGE_MASK);
> > >>>
> > >>> I can of course cast the vcpu_info pointer to (long int), it's just a
> > >>> bit ugly. Thoughts?
> > >>
> > >> FWIW, I will also need the offset for liveupdate. I have used (unsigned
> > >> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
> > >>
> > >> So I am all for either a new field or a macro hiding this ugliness.
> > >
> > > A macro sounds like a good way to go, no need for an extra field if we
> > > can calculate it based on the currently existing one. How about
> > >
> > > #define VCPU_INFO_OFFSET(v) (((unsigned long)v->vcpu_info) & ~PAGE_MASK)
> >
> > I was more thinking a generic macro to find the offset in a page.
> >
> > #define PAGE_OFFSET(ptr) ((unsigned long)(ptr) & ~PAGE_MASK)
> 
> LGTM. Should we stuff this into xen/sched.h or asm/page.h?

page.h would be better, although you will have to duplicate it for x86
and Arm. There's xen/pfn.h, which is not arch-specific, but it feels a
bit weird to place it there.

Thanks, Roger.



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 17:16               ` Roger Pau Monné
@ 2020-03-25 17:47                 ` Tamas K Lengyel
  0 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-25 17:47 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Jan Beulich, Xen-devel, Julien Grall

On Wed, Mar 25, 2020 at 11:16 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> On Wed, Mar 25, 2020 at 11:00:05AM -0600, Tamas K Lengyel wrote:
> > On Wed, Mar 25, 2020 at 10:52 AM Julien Grall <julien@xen.org> wrote:
> > >
> > >
> > >
> > > On 25/03/2020 16:47, Tamas K Lengyel wrote:
> > > > On Wed, Mar 25, 2020 at 10:42 AM Julien Grall <julien@xen.org> wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> On 25/03/2020 16:34, Tamas K Lengyel wrote:
> > > >>>>> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> > > >>>>> index 9f51370327..1ed7d13084 100644
> > > >>>>> --- a/xen/arch/x86/mm/p2m.c
> > > >>>>> +++ b/xen/arch/x86/mm/p2m.c
> > > >>>>> @@ -509,6 +509,12 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn_l,
> > > >>>>>
> > > >>>>>        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > > >>>>>
> > > >>>>> +    /* Check if we need to fork the page */
> > > >>>>> +    if ( (q & P2M_ALLOC) && p2m_is_hole(*t) &&
> > > >>>>> +         !mem_sharing_fork_page(p2m->domain, gfn, q & P2M_UNSHARE) )
> > > >>>>> +        mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order, NULL);
> > > >>>>> +
> > > >>>>> +    /* Check if we need to unshare the page */
> > > >>>>>        if ( (q & P2M_UNSHARE) && p2m_is_shared(*t) )
> > > >>>>>        {
> > > >>>>>            ASSERT(p2m_is_hostp2m(p2m));
> > > >>>>> @@ -588,7 +594,8 @@ struct page_info *p2m_get_page_from_gfn(
> > > >>>>>                return page;
> > > >>>>>
> > > >>>>>            /* Error path: not a suitable GFN at all */
> > > >>>>> -        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) )
> > > >>>>> +        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
> > > >>>>> +             !mem_sharing_is_fork(p2m->domain) )
> > > >>>>>                return NULL;
> > > >>>>>        }
> > > >>>>>
> > > >>>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
> > > >>>>> index b4eb476a9c..62aed53a16 100644
> > > >>>>> --- a/xen/common/domain.c
> > > >>>>> +++ b/xen/common/domain.c
> > > >>>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> > > >>>>>
> > > >>>>>        v->vcpu_info = new_info;
> > > >>>>>        v->vcpu_info_mfn = page_to_mfn(page);
> > > >>>>> +#ifdef CONFIG_MEM_SHARING
> > > >>>>> +    v->vcpu_info_offset = offset;
> > > >>>>
> > > >>>> There's no need to introduce this field, you can just use v->vcpu_info
> > > >>>> & ~PAGE_MASK AFAICT.
> > > >>>
> > > >>> Just doing what you suggest above results in:
> > > >>>
> > > >>> mem_sharing.c:1603:55: error: invalid operands to binary & (have
> > > >>> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> > > >>> int’)
> > > >>>                                        d_vcpu->vcpu_info & ~PAGE_MASK);
> > > >>>
> > > >>> I can of course cast the vcpu_info pointer to (long int), it's just a
> > > >>> bit ugly. Thoughts?
> > > >>
> > > >> FWIW, I will also need the offset for liveupdate. I have used (unsigned
> > > >> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
> > > >>
> > > >> So I am all for either a new field or a macro hiding this ugliness.
> > > >
> > > > A macro sounds like a good way to go, no need for an extra field if we
> > > > can calculate it based on the currently existing one. How about
> > > >
> > > > #define VCPU_INFO_OFFSET(v) (((unsigned long)v->vcpu_info) & ~PAGE_MASK)
> > >
> > > I was more thinking a generic macro to find the offset in a page.
> > >
> > > #define PAGE_OFFSET(ptr) ((unsigned long)(ptr) & ~PAGE_MASK)
> >
> > LGTM. Should we stuff this into xen/sched.h or asm/page.h?
>
> page.h would be better, although you will have to duplicate it for x86
> and Arm. There's xen/pfn.h, which is not arch-specific, but it feels a
> bit weird to place it there.
>

I'll go with page.h; that allows us to use vaddr_t for the Arm macro.
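
Roughly what I have in mind (a sketch, one copy per arch):

    /* xen/include/asm-x86/page.h */
    #define PAGE_OFFSET(ptr) ((unsigned long)(ptr) & ~PAGE_MASK)

    /* xen/include/asm-arm/page.h */
    #define PAGE_OFFSET(ptr) ((vaddr_t)(ptr) & ~PAGE_MASK)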

Tamas



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 16:54         ` Roger Pau Monné
@ 2020-03-26  7:02           ` Jan Beulich
  2020-03-26  8:42             ` Roger Pau Monné
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2020-03-26  7:02 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Tamas K Lengyel, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Stefano Stabellini, Xen-devel,
	Julien Grall

On 25.03.2020 17:54, Roger Pau Monné wrote:
> On Wed, Mar 25, 2020 at 04:42:07PM +0000, Julien Grall wrote:
>> On 25/03/2020 16:34, Tamas K Lengyel wrote:
>>>>> --- a/xen/common/domain.c
>>>>> +++ b/xen/common/domain.c
>>>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
>>>>>
>>>>>       v->vcpu_info = new_info;
>>>>>       v->vcpu_info_mfn = page_to_mfn(page);
>>>>> +#ifdef CONFIG_MEM_SHARING
>>>>> +    v->vcpu_info_offset = offset;
>>>>
>>>> There's no need to introduce this field, you can just use v->vcpu_info
>>>> & ~PAGE_MASK AFAICT.
>>>
>>> Just doing what you suggest above results in:
>>>
>>> mem_sharing.c:1603:55: error: invalid operands to binary & (have
>>> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
>>> int’)
>>>                                       d_vcpu->vcpu_info & ~PAGE_MASK);
>>>
>>> I can of course cast the vcpu_info pointer to (long int), it's just a
>>> bit ugly. Thoughts?
>>
>> FWIW, I will also need the offset for liveupdate. I have used (unsigned
>> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
> 
> I think using:
> 
> (vaddr_t)v->vcpu_info & ~PAGE_MASK
> 
> is acceptable, but that would require adding a vaddr_t typedef to x86.

I don't think vaddr_t is necessary given that all over the place we
assume conversions between pointers and unsigned long to be fine.

Jan



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-25 15:47   ` Roger Pau Monné
                       ` (2 preceding siblings ...)
  2020-03-25 16:34     ` Tamas K Lengyel
@ 2020-03-26  7:07     ` Jan Beulich
  2020-03-26  9:10       ` Roger Pau Monné
  3 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2020-03-26  7:07 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Tamas K Lengyel,
	xen-devel

On 25.03.2020 16:47, Roger Pau Monné wrote:
> On Mon, Mar 23, 2020 at 10:04:35AM -0700, Tamas K Lengyel wrote:
>> +static int copy_vcpu_settings(struct domain *cd, struct domain *d)
>> +{
>> +    unsigned int i;
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
>> +    int ret = -EINVAL;
>> +
>> +    for ( i = 0; i < cd->max_vcpus; i++ )
>> +    {
>> +        const struct vcpu *d_vcpu = d->vcpu[i];
>> +        struct vcpu *cd_vcpu = cd->vcpu[i];
>> +        struct vcpu_runstate_info runstate;
>> +        mfn_t vcpu_info_mfn;
>> +
>> +        if ( !d_vcpu || !cd_vcpu )
>> +            continue;
>> +
>> +        /*
>> +         * Copy & map in the vcpu_info page if the guest uses one
>> +         */
>> +        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
>> +        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
>> +        {
>> +            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
>> +
>> +            /*
>> +             * Allocate & map the page for it if it hasn't been already
>> +             */
>> +            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
>> +            {
>> +                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
>> +                unsigned long gfn_l = gfn_x(gfn);
>> +                struct page_info *page;
>> +
>> +                if ( !(page = alloc_domheap_page(cd, 0)) )
>> +                    return -ENOMEM;
>> +
>> +                new_vcpu_info_mfn = page_to_mfn(page);
>> +                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
>> +
>> +                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn, PAGE_ORDER_4K,
>> +                                     p2m_ram_rw, p2m->default_access, -1);
>> +                if ( ret )
>> +                    return ret;
>> +
>> +                ret = map_vcpu_info(cd_vcpu, gfn_l,
>> +                                    d_vcpu->vcpu_info_offset);
>> +                if ( ret )
>> +                    return ret;
>> +            }
>> +
>> +            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
>> +        }
>> +
>> +        /*
>> +         * Setup the vCPU runstate area
>> +         */
>> +        if ( guest_handle_is_null(runstate_guest(cd_vcpu)) )
> 
> Maybe I'm confused, but isn't this the other way around and you need
> to check? If the parent runstate is not null copy it to the fork,
> ie:
> 
> if ( !guest_handle_is_null(runstate_guest(d_vcpu)) )
> {
>     ...
> 
>> +        {
>> +            runstate_guest(cd_vcpu) = runstate_guest(d_vcpu);
>> +            vcpu_runstate_get(cd_vcpu, &runstate);
>> +            __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1);
> 
> You should check the return code I think.

I don't think so - this is a best effort operation just like e.g.
in the handling of VCPUOP_register_runstate_memory_area.

Jan



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-26  7:02           ` Jan Beulich
@ 2020-03-26  8:42             ` Roger Pau Monné
  0 siblings, 0 replies; 29+ messages in thread
From: Roger Pau Monné @ 2020-03-26  8:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tamas K Lengyel, Tamas K Lengyel, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Stefano Stabellini, Xen-devel,
	Julien Grall

On Thu, Mar 26, 2020 at 08:02:22AM +0100, Jan Beulich wrote:
> On 25.03.2020 17:54, Roger Pau Monné wrote:
> > On Wed, Mar 25, 2020 at 04:42:07PM +0000, Julien Grall wrote:
> >> On 25/03/2020 16:34, Tamas K Lengyel wrote:
> >>>>> --- a/xen/common/domain.c
> >>>>> +++ b/xen/common/domain.c
> >>>>> @@ -1270,6 +1270,9 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset)
> >>>>>
> >>>>>       v->vcpu_info = new_info;
> >>>>>       v->vcpu_info_mfn = page_to_mfn(page);
> >>>>> +#ifdef CONFIG_MEM_SHARING
> >>>>> +    v->vcpu_info_offset = offset;
> >>>>
> >>>> There's no need to introduce this field, you can just use v->vcpu_info
> >>>> & ~PAGE_MASK AFAICT.
> >>>
> >>> Just doing what you suggest above results in:
> >>>
> >>> mem_sharing.c:1603:55: error: invalid operands to binary & (have
> >>> ‘vcpu_info_t * const’ {aka ‘union <anonymous> * const’} and ‘long
> >>> int’)
> >>>                                       d_vcpu->vcpu_info & ~PAGE_MASK);
> >>>
> >>> I can of course cast the vcpu_info pointer to (long int), it's just a
> >>> bit ugly. Thoughts?
> >>
> >> FWIW, I will also need the offset for liveupdate. I have used (unsigned
> >> long)v->vcpu_info & ~PAGE_MASK so far but this is not really pretty.
> > 
> > I think using:
> > 
> > (vaddr_t)v->vcpu_info & ~PAGE_MASK
> > 
> > is acceptable, but that would require adding a vaddr_t typedef to x86.
> 
> I don't think vaddr_t is necessary given that all over the place we
> assume conversions between pointers and unsigned long to be fine.

Right, using unsigned long is indeed fine, but I also agree with Tamas
that it's slightly ugly and hence wanted to provide a 'cleaner'
suggestion.

Roger.



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-26  7:07     ` Jan Beulich
@ 2020-03-26  9:10       ` Roger Pau Monné
  2020-03-26 17:01         ` Tamas K Lengyel
  0 siblings, 1 reply; 29+ messages in thread
From: Roger Pau Monné @ 2020-03-26  9:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Tamas K Lengyel,
	xen-devel

On Thu, Mar 26, 2020 at 08:07:09AM +0100, Jan Beulich wrote:
> On 25.03.2020 16:47, Roger Pau Monné wrote:
> > On Mon, Mar 23, 2020 at 10:04:35AM -0700, Tamas K Lengyel wrote:
> >> +static int copy_vcpu_settings(struct domain *cd, struct domain *d)
> >> +{
> >> +    unsigned int i;
> >> +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
> >> +    int ret = -EINVAL;
> >> +
> >> +    for ( i = 0; i < cd->max_vcpus; i++ )
> >> +    {
> >> +        const struct vcpu *d_vcpu = d->vcpu[i];
> >> +        struct vcpu *cd_vcpu = cd->vcpu[i];
> >> +        struct vcpu_runstate_info runstate;
> >> +        mfn_t vcpu_info_mfn;
> >> +
> >> +        if ( !d_vcpu || !cd_vcpu )
> >> +            continue;
> >> +
> >> +        /*
> >> +         * Copy & map in the vcpu_info page if the guest uses one
> >> +         */
> >> +        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
> >> +        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
> >> +        {
> >> +            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
> >> +
> >> +            /*
> >> +             * Allocate & map the page for it if it hasn't been already
> >> +             */
> >> +            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
> >> +            {
> >> +                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
> >> +                unsigned long gfn_l = gfn_x(gfn);
> >> +                struct page_info *page;
> >> +
> >> +                if ( !(page = alloc_domheap_page(cd, 0)) )
> >> +                    return -ENOMEM;
> >> +
> >> +                new_vcpu_info_mfn = page_to_mfn(page);
> >> +                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
> >> +
> >> +                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn, PAGE_ORDER_4K,
> >> +                                     p2m_ram_rw, p2m->default_access, -1);
> >> +                if ( ret )
> >> +                    return ret;
> >> +
> >> +                ret = map_vcpu_info(cd_vcpu, gfn_l,
> >> +                                    d_vcpu->vcpu_info_offset);
> >> +                if ( ret )
> >> +                    return ret;
> >> +            }
> >> +
> >> +            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
> >> +        }
> >> +
> >> +        /*
> >> +         * Setup the vCPU runstate area
> >> +         */
> >> +        if ( guest_handle_is_null(runstate_guest(cd_vcpu)) )
> > 
> > Maybe I'm confused, but isn't this the other way around and you need
> > to check? If the parent runstate is not null copy it to the fork,
> > ie:
> > 
> > if ( !guest_handle_is_null(runstate_guest(d_vcpu)) )
> > {
> >     ...
> > 
> >> +        {
> >> +            runstate_guest(cd_vcpu) = runstate_guest(d_vcpu);
> >> +            vcpu_runstate_get(cd_vcpu, &runstate);
> >> +            __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1);
> > 
> > You should check the return code I think.
> 
> I don't think so - this is a best effort operation just like e.g.
> in the handling of VCPUOP_register_runstate_memory_area.

I think printing a debug message might be helpful, not so much because
failing to copy the runstate area is all that important, but because
it could signal that something went wrong. Anyway, I don't have such a
strong opinion.

Just to confirm, __copy_to_guest will cause the forked domain memory
to be populated and the whole page to be copied over, right? (and will
also cause the page tables to be added to the fork physmap in write
mode to set the accessed/dirty bits)

Thanks, Roger.



* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork Tamas K Lengyel
  2020-03-25 15:52   ` Roger Pau Monné
@ 2020-03-26 10:16   ` Jan Beulich
  2020-03-26 14:48     ` Tamas K Lengyel
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2020-03-26 10:16 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Tamas K Lengyel, Julien Grall, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Stefano Stabellini, xen-devel,
	Roger Pau Monné

On 23.03.2020 18:04, Tamas K Lengyel wrote:
> +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
> +{
> +    int rc;
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    struct page_info *page, *tmp;
> +
> +    spin_lock(&d->page_alloc_lock);
> +    domain_pause(d);

Why do you take the lock first?

Jan



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-23 17:04 ` [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: " Tamas K Lengyel
  2020-03-25 15:47   ` Roger Pau Monné
@ 2020-03-26 12:33   ` Jan Beulich
  2020-03-26 14:52     ` Tamas K Lengyel
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2020-03-26 12:33 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, Tamas K Lengyel, xen-devel,
	Roger Pau Monné

On 23.03.2020 18:04, Tamas K Lengyel wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -2202,6 +2202,17 @@ int domain_relinquish_resources(struct domain *d)
>              ret = relinquish_shared_pages(d);
>              if ( ret )
>                  return ret;
> +
> +            /*
> +             * If the domain is forked, decrement the parent's pause count
> +             * and release the domain.
> +             */
> +            if ( mem_sharing_is_fork(d) )
> +            {
> +                domain_unpause(d->parent);
> +                put_domain(d->parent);
> +                d->parent = NULL;

I think you want to clear the field before putting the reference,
to make sure possible readers of it won't see it non-NULL when
the domain is already being cleaned up, or even gone.
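
I.e. something along the lines of (sketch):

    if ( mem_sharing_is_fork(d) )
    {
        struct domain *parent = d->parent;

        d->parent = NULL;
        domain_unpause(parent);
        put_domain(parent);
    }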

With this, applicable parts of the change
Acked-by: Jan Beulich <jbeulich@suse.com>

I'll try to keep an eye on when you and Roger have settled on the
remaining aspects, to determine when this (probably v13) can be
committed.

> --- a/xen/include/asm-x86/mem_sharing.h
> +++ b/xen/include/asm-x86/mem_sharing.h
> @@ -77,6 +77,14 @@ static inline int mem_sharing_unshare_page(struct domain *d,
>      return rc;
>  }
>  
> +static inline bool mem_sharing_is_fork(struct domain *d)

const? (then also in the stub further down)

Jan



* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-26 10:16   ` Jan Beulich
@ 2020-03-26 14:48     ` Tamas K Lengyel
  2020-03-26 14:52       ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-26 14:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Xen-devel,
	Roger Pau Monné

On Thu, Mar 26, 2020 at 4:17 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 23.03.2020 18:04, Tamas K Lengyel wrote:
> > +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
> > +{
> > +    int rc;
> > +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> > +    struct page_info *page, *tmp;
> > +
> > +    spin_lock(&d->page_alloc_lock);
> > +    domain_pause(d);
>
> Why do you take the lock first?

No particular reason - does the order matter?

Tamas



* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-26 14:48     ` Tamas K Lengyel
@ 2020-03-26 14:52       ` Jan Beulich
  2020-03-26 14:53         ` Tamas K Lengyel
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2020-03-26 14:52 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Xen-devel,
	Roger Pau Monné

On 26.03.2020 15:48, Tamas K Lengyel wrote:
> On Thu, Mar 26, 2020 at 4:17 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 23.03.2020 18:04, Tamas K Lengyel wrote:
>>> +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
>>> +{
>>> +    int rc;
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +    struct page_info *page, *tmp;
>>> +
>>> +    spin_lock(&d->page_alloc_lock);
>>> +    domain_pause(d);
>>
>> Why do you take the lock first?
> 
> No particular reason - does the order matter?

I think you'd better avoid holding a lock for extended periods
of time. And what's perhaps worse, what if a vCPU of the domain
sits in Xen trying to acquire this lock - you'd deadlock trying
to pause the domain then.
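
I.e. pause first and only then take the lock (sketch):

    domain_pause(d);
    spin_lock(&d->page_alloc_lock);
    /* ... reset work ... */
    spin_unlock(&d->page_alloc_lock);
    domain_unpause(d);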

Jan



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-26 12:33   ` Jan Beulich
@ 2020-03-26 14:52     ` Tamas K Lengyel
  0 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-26 14:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Xen-devel,
	Roger Pau Monné

On Thu, Mar 26, 2020 at 6:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 23.03.2020 18:04, Tamas K Lengyel wrote:
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -2202,6 +2202,17 @@ int domain_relinquish_resources(struct domain *d)
> >              ret = relinquish_shared_pages(d);
> >              if ( ret )
> >                  return ret;
> > +
> > +            /*
> > +             * If the domain is forked, decrement the parent's pause count
> > +             * and release the domain.
> > +             */
> > +            if ( mem_sharing_is_fork(d) )
> > +            {
> > +                domain_unpause(d->parent);
> > +                put_domain(d->parent);
> > +                d->parent = NULL;
>
> I think you want to clear the field before putting the reference,
> to make sure possible readers of it won't see it non-NULL when
> the domain is already being cleaned up, or even gone.

Sure.

>
> With this, applicable parts of the change
> Acked-by: Jan Beulich <jbeulich@suse.com>
>
> I'll try to keep an eye on when you and Roger have settled on the
> remaining aspects, to determine when this (probably v13) can be
> committed.

Thanks!

>
> > --- a/xen/include/asm-x86/mem_sharing.h
> > +++ b/xen/include/asm-x86/mem_sharing.h
> > @@ -77,6 +77,14 @@ static inline int mem_sharing_unshare_page(struct domain *d,
> >      return rc;
> >  }
> >
> > +static inline bool mem_sharing_is_fork(struct domain *d)
>
> const? (then also in the stub further down)

Sure.

Tamas



* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-26 14:52       ` Jan Beulich
@ 2020-03-26 14:53         ` Tamas K Lengyel
  2020-03-26 23:40           ` Tamas K Lengyel
  0 siblings, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-26 14:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Xen-devel,
	Roger Pau Monné

On Thu, Mar 26, 2020 at 8:52 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 26.03.2020 15:48, Tamas K Lengyel wrote:
> > On Thu, Mar 26, 2020 at 4:17 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 23.03.2020 18:04, Tamas K Lengyel wrote:
> >>> +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
> >>> +{
> >>> +    int rc;
> >>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> >>> +    struct page_info *page, *tmp;
> >>> +
> >>> +    spin_lock(&d->page_alloc_lock);
> >>> +    domain_pause(d);
> >>
> >> Why do you take the lock first?
> >
> > No particular reason - does the order matter?
>
> I think you'd better avoid holding a lock for extended periods
> of time. And what's perhaps worse, what if a vCPU of the domain
> sits in Xen trying to acquire this lock - you'd deadlock trying
> to pause the domain then.

OK, I'll invert the order then.

Thanks,
Tamas



* Re: [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: VM forking
  2020-03-26  9:10       ` Roger Pau Monné
@ 2020-03-26 17:01         ` Tamas K Lengyel
  0 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-26 17:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich,
	Xen-devel

On Thu, Mar 26, 2020 at 3:10 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> On Thu, Mar 26, 2020 at 08:07:09AM +0100, Jan Beulich wrote:
> > On 25.03.2020 16:47, Roger Pau Monné wrote:
> > > On Mon, Mar 23, 2020 at 10:04:35AM -0700, Tamas K Lengyel wrote:
> > >> +static int copy_vcpu_settings(struct domain *cd, struct domain *d)
> > >> +{
> > >> +    unsigned int i;
> > >> +    struct p2m_domain *p2m = p2m_get_hostp2m(cd);
> > >> +    int ret = -EINVAL;
> > >> +
> > >> +    for ( i = 0; i < cd->max_vcpus; i++ )
> > >> +    {
> > >> +        const struct vcpu *d_vcpu = d->vcpu[i];
> > >> +        struct vcpu *cd_vcpu = cd->vcpu[i];
> > >> +        struct vcpu_runstate_info runstate;
> > >> +        mfn_t vcpu_info_mfn;
> > >> +
> > >> +        if ( !d_vcpu || !cd_vcpu )
> > >> +            continue;
> > >> +
> > >> +        /*
> > >> +         * Copy & map in the vcpu_info page if the guest uses one
> > >> +         */
> > >> +        vcpu_info_mfn = d_vcpu->vcpu_info_mfn;
> > >> +        if ( !mfn_eq(vcpu_info_mfn, INVALID_MFN) )
> > >> +        {
> > >> +            mfn_t new_vcpu_info_mfn = cd_vcpu->vcpu_info_mfn;
> > >> +
> > >> +            /*
> > >> +             * Allocate & map the page for it if it hasn't been already
> > >> +             */
> > >> +            if ( mfn_eq(new_vcpu_info_mfn, INVALID_MFN) )
> > >> +            {
> > >> +                gfn_t gfn = mfn_to_gfn(d, vcpu_info_mfn);
> > >> +                unsigned long gfn_l = gfn_x(gfn);
> > >> +                struct page_info *page;
> > >> +
> > >> +                if ( !(page = alloc_domheap_page(cd, 0)) )
> > >> +                    return -ENOMEM;
> > >> +
> > >> +                new_vcpu_info_mfn = page_to_mfn(page);
> > >> +                set_gpfn_from_mfn(mfn_x(new_vcpu_info_mfn), gfn_l);
> > >> +
> > >> +                ret = p2m->set_entry(p2m, gfn, new_vcpu_info_mfn, PAGE_ORDER_4K,
> > >> +                                     p2m_ram_rw, p2m->default_access, -1);
> > >> +                if ( ret )
> > >> +                    return ret;
> > >> +
> > >> +                ret = map_vcpu_info(cd_vcpu, gfn_l,
> > >> +                                    d_vcpu->vcpu_info_offset);
> > >> +                if ( ret )
> > >> +                    return ret;
> > >> +            }
> > >> +
> > >> +            copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
> > >> +        }
> > >> +
> > >> +        /*
> > >> +         * Setup the vCPU runstate area
> > >> +         */
> > >> +        if ( guest_handle_is_null(runstate_guest(cd_vcpu)) )
> > >
> > > Maybe I'm confused, but isn't this the other way around and you need
> > > to check? If the parent runstate is not null copy it to the fork,
> > > ie:
> > >
> > > if ( !guest_handle_is_null(runstate_guest(d_vcpu)) )
> > > {
> > >     ...
> > >
> > >> +        {
> > >> +            runstate_guest(cd_vcpu) = runstate_guest(d_vcpu);
> > >> +            vcpu_runstate_get(cd_vcpu, &runstate);
> > >> +            __copy_to_guest(runstate_guest(cd_vcpu), &runstate, 1);
> > >
> > > You should check the return code I think.
> >
> > I don't think so - this is a best effort operation just like e.g.
> > in the handling of VCPUOP_register_runstate_memory_area.
>
> I think printing a debug message might be helpful, not so much as for
> the importance of failing to copy the runstate area, but it could
> signal that something went wrong, anyway I don't have such a strong
> opinion.
>
> Just to confirm, __copy_to_guest will cause the forked domain memory
> to be populated and the whole page to be copied over, right? (and will
> also cause the page tables to be added to the fork physmap in write
> mode to set the accessed/dirty bits)

I checked this and it ends up calling hvm_translate_get_page, which
already issues a call to get_page_from_gfn with P2M_UNSHARE. The
problem is that we are still in the process of forking the VM, so
mem_sharing_is_fork is not yet true since the fork hasn't completed
yet. So what I'll do is set the parent pointer early, which will allow
memory to be populated from the parent. If there is an error during
the fork operation the fork domain gets destroyed by the toolstack
anyway, so we don't have to worry about unwinding a half-way completed
fork.
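
Roughly (a sketch; the exact placement in the fork path is still to be
settled):

    /* Set the parent early so memory can be populated from it. */
    cd->parent = d;

    rc = copy_vcpu_settings(cd, d);
    if ( rc )
        return rc;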

Tamas



* Re: [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork
  2020-03-26 14:53         ` Tamas K Lengyel
@ 2020-03-26 23:40           ` Tamas K Lengyel
  0 siblings, 0 replies; 29+ messages in thread
From: Tamas K Lengyel @ 2020-03-26 23:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Stefano Stabellini, Tamas K Lengyel, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Xen-devel,
	Roger Pau Monné

On Thu, Mar 26, 2020 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>
> On Thu, Mar 26, 2020 at 8:52 AM Jan Beulich <jbeulich@suse.com> wrote:
> >
> > On 26.03.2020 15:48, Tamas K Lengyel wrote:
> > > On Thu, Mar 26, 2020 at 4:17 AM Jan Beulich <jbeulich@suse.com> wrote:
> > >>
> > >> On 23.03.2020 18:04, Tamas K Lengyel wrote:
> > >>> +static int mem_sharing_fork_reset(struct domain *d, struct domain *pd)
> > >>> +{
> > >>> +    int rc;
> > >>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> > >>> +    struct page_info *page, *tmp;
> > >>> +
> > >>> +    spin_lock(&d->page_alloc_lock);
> > >>> +    domain_pause(d);
> > >>
> > >> Why do you take the lock first?
> > >
> > > No particular reason - does the order matter?
> >
> > I think you'd better avoid holding a lock for extended periods
> > of time. And what's perhaps worse, what if a vCPU of the domain
> > sits in Xen trying to acquire this lock - you'd deadlock trying
> > to pause the domain then.
>
> OK, I'll invert the order then.

It turns out we also need to take the recursive lock here since we'll
free the pages. Fixed now and everything works as expected.
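
I.e. (sketch):

    domain_pause(d);
    spin_lock_recursive(&d->page_alloc_lock);
    /* ... free the fork's pages ... */
    spin_unlock_recursive(&d->page_alloc_lock);
    domain_unpause(d);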

Tamas



end of thread

Thread overview: 29+ messages
2020-03-23 17:04 [Xen-devel] [PATCH v12 0/3] VM forking Tamas K Lengyel
2020-03-23 17:04 ` [Xen-devel] [PATCH v12 1/3] xen/mem_sharing: " Tamas K Lengyel
2020-03-25 15:47   ` Roger Pau Monné
2020-03-25 16:04     ` Tamas K Lengyel
2020-03-25 16:16     ` Tamas K Lengyel
2020-03-25 16:34     ` Tamas K Lengyel
2020-03-25 16:42       ` Julien Grall
2020-03-25 16:47         ` Tamas K Lengyel
2020-03-25 16:51           ` Julien Grall
2020-03-25 17:00             ` Tamas K Lengyel
2020-03-25 17:16               ` Roger Pau Monné
2020-03-25 17:47                 ` Tamas K Lengyel
2020-03-25 16:54         ` Roger Pau Monné
2020-03-26  7:02           ` Jan Beulich
2020-03-26  8:42             ` Roger Pau Monné
2020-03-26  7:07     ` Jan Beulich
2020-03-26  9:10       ` Roger Pau Monné
2020-03-26 17:01         ` Tamas K Lengyel
2020-03-26 12:33   ` Jan Beulich
2020-03-26 14:52     ` Tamas K Lengyel
2020-03-23 17:04 ` [Xen-devel] [PATCH v12 2/3] x86/mem_sharing: reset a fork Tamas K Lengyel
2020-03-25 15:52   ` Roger Pau Monné
2020-03-25 15:54     ` Tamas K Lengyel
2020-03-26 10:16   ` Jan Beulich
2020-03-26 14:48     ` Tamas K Lengyel
2020-03-26 14:52       ` Jan Beulich
2020-03-26 14:53         ` Tamas K Lengyel
2020-03-26 23:40           ` Tamas K Lengyel
2020-03-23 17:04 ` [Xen-devel] [PATCH v12 3/3] xen/tools: VM forking toolstack side Tamas K Lengyel
