xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
@ 2015-06-08  3:45 Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 01/15] docs: add colo readme Yang Hongyang
                   ` (14 more replies)
  0 siblings, 15 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

This patchset implemented the COLO feature for Xen.
For detail/install/use of COLO feature, refer to:
http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping

This patchset is based on:
[PATCH v2 COLOPre 00/13] Prerequisite patches for COLO

We only support hvm guest now. The codes are also hosted on github:
https://github.com/macrosheep/xen/tree/colo-v6

Patch 1    : Add readme
Patch 2-7: COLO framework related codes
Patch 8-9: Implement disk replication
Patch 10-15: Implement nic replication

Changelog from v5 to v6:
1. based on migration v2(libxc)
2. split the patchset into prerequisite patchset and this main patchset.

Changelog from v4 to v5:
1. rebase to the latest xen upstream
2. disk replication: blktap2->qdisk
3. nic replication: colo-agent->colo-proxy

Changelog from v3 to v4:
1. rebase to newest xen
2. bug fix

Changlog from v2 to v3:
1. rebase to newest remus
2. add nic replication support

Changlog from v1 to v2:
1. rebase to newest remus
2. add disk replication support

Wen Congyang (6):
  secondary vm suspend/resume/checkpoint code
  primary vm suspend/get_dirty_pfn/resume/checkpoint code
  send store mfn and console mfn to xl before resuming secondary vm
  implement the cmdline for COLO
  Support colo mode for qemu disk
  COLO: use qemu block replication

Yang Hongyang (9):
  docs: add colo readme
  libxc/restore: support COLO restore
  libxc/save: support COLO save
  COLO proxy: implement setup/teardown of COLO proxy module
  COLO proxy: preresume, postresume and checkpoint
  COLO nic: implement COLO nic subkind
  setup and control colo proxy on primary side
  setup and control colo proxy on secondary side
  cmdline switches and config vars to control colo-proxy

 docs/README.colo                     |    9 +
 docs/man/xl.conf.pod.5               |    6 +
 docs/man/xl.pod.1                    |   11 +-
 tools/hotplug/Linux/Makefile         |    1 +
 tools/hotplug/Linux/colo-proxy-setup |  131 ++++
 tools/libxc/include/xenguest.h       |   40 ++
 tools/libxc/xc_sr_common.h           |   11 +-
 tools/libxc/xc_sr_restore.c          |   67 +-
 tools/libxc/xc_sr_restore_x86_hvm.c  |    1 +
 tools/libxc/xc_sr_save.c             |   49 +-
 tools/libxl/Makefile                 |    4 +
 tools/libxl/libxl.c                  |   70 +-
 tools/libxl/libxl_colo.h             |   53 ++
 tools/libxl/libxl_colo_nic.c         |  317 +++++++++
 tools/libxl/libxl_colo_proxy.c       |  267 ++++++++
 tools/libxl/libxl_colo_qdisk.c       |  209 ++++++
 tools/libxl/libxl_colo_restore.c     | 1192 ++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_colo_save.c        |  784 ++++++++++++++++++++++
 tools/libxl/libxl_create.c           |  156 ++++-
 tools/libxl/libxl_device.c           |   38 ++
 tools/libxl/libxl_dm.c               |  262 +++++++-
 tools/libxl/libxl_dom_save.c         |   17 +-
 tools/libxl/libxl_internal.h         |   94 ++-
 tools/libxl/libxl_qmp.c              |   31 +
 tools/libxl/libxl_save_callout.c     |    6 +-
 tools/libxl/libxl_save_msgs_gen.pl   |    9 +-
 tools/libxl/libxl_types.idl          |    8 +
 tools/libxl/libxlu_disk_l.l          |    5 +
 tools/libxl/xl.c                     |    3 +
 tools/libxl/xl.h                     |    1 +
 tools/libxl/xl_cmdimpl.c             |   92 ++-
 tools/libxl/xl_cmdtable.c            |    4 +-
 32 files changed, 3896 insertions(+), 52 deletions(-)
 create mode 100644 docs/README.colo
 create mode 100755 tools/hotplug/Linux/colo-proxy-setup
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_nic.c
 create mode 100644 tools/libxl/libxl_colo_proxy.c
 create mode 100644 tools/libxl/libxl_colo_qdisk.c
 create mode 100644 tools/libxl/libxl_colo_restore.c
 create mode 100644 tools/libxl/libxl_colo_save.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 01/15] docs: add colo readme
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-16 10:56   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code Yang Hongyang
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

add colo readme, refer to
http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 docs/README.colo | 9 +++++++++
 1 file changed, 9 insertions(+)
 create mode 100644 docs/README.colo

diff --git a/docs/README.colo b/docs/README.colo
new file mode 100644
index 0000000..466eb72
--- /dev/null
+++ b/docs/README.colo
@@ -0,0 +1,9 @@
+COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop Service)
+project is a high availability solution. Both primary VM (PVM) and secondary VM
+(SVM) run in parallel. They receive the same request from client, and generate
+response in parallel too. If the response packets from PVM and SVM are
+identical, they are released immediately. Otherwise, a VM checkpoint (on demand)
+is conducted.
+
+See the website at http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
+for details.
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 01/15] docs: add colo readme Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-12 14:23   ` Wei Liu
  2015-06-08  3:45 ` [PATCH v6 COLO 03/15] primary vm suspend/get_dirty_pfn/resume/checkpoint code Yang Hongyang
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

From: Wen Congyang <wency@cn.fujitsu.com>

Secondary vm is running in colo mode. So we will do
the following things again and again:
1. Resume secondary vm
   a. Send LIBXL_COLO_SVM_READY to master.
   b. If it is not the first resume, call libxl__checkpoint_devices_preresume().
   c. If it is the first resume(resume right after live migration),
      - call libxl__xc_domain_restore_done() to build the secondary vm.
      - enable secondary vm's logdirty.
      - call libxl__domain_resume() to resume secondary vm.
      - call libxl__checkpoint_devices_setup() to setup checkpoint devices.
   d. Send LIBXL_COLO_SVM_RESUMED to master.
2. Wait a new checkpoint
   a. Call libxl__checkpoint_devices_commit().
   b. Read LIBXL_COLO_NEW_CHECKPOINT from master.
3. Suspend secondary vm
   a. Suspend secondary vm.
   b. Call libxl__checkpoint_devices_postsuspend().
   c. Get secondary vm's dirty page information.
   d. Send LIBXL_COLO_SVM_SUSPENDED to master.
   e. Send secondary vm's dirty page information to master(count + pfn list).

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/libxc/include/xenguest.h     |   20 +
 tools/libxl/Makefile               |    1 +
 tools/libxl/libxl_colo.h           |   38 ++
 tools/libxl/libxl_colo_restore.c   | 1158 ++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_create.c         |  116 +++-
 tools/libxl/libxl_dom_save.c       |    2 +-
 tools/libxl/libxl_internal.h       |   24 +
 tools/libxl/libxl_save_callout.c   |    6 +-
 tools/libxl/libxl_save_msgs_gen.pl |    6 +-
 9 files changed, 1364 insertions(+), 7 deletions(-)
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_restore.c

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 7581263..86bcf9c 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -98,6 +98,26 @@ int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_ite
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
+    /* Called after a new checkpoint to suspend the guest.
+     */
+    int (*suspend)(void* data);
+
+    /* Called after the secondary vm is ready to resume.
+     * Callback function resumes the guest & the device model,
+     *  returns to xc_domain_restore.
+     */
+    int (*postcopy)(void* data);
+
+    /* callback to wait a new checkpoint
+     *
+     * returns:
+     * 0: terminate checkpointing gracefully
+     * 1: take another checkpoint */
+    int (*checkpoint)(void* data);
+
+    /* Enable qemu-dm logging dirty pages to xen */
+    int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+
     /* callback to restore toolstack specific data */
     int (*toolstack_restore)(uint32_t domid, const uint8_t *buf,
             uint32_t size, void* data);
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index cd63dac..82cc4c2 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -57,6 +57,7 @@ LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
+LIBXL_OBJS-y += libxl_colo_restore.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
new file mode 100644
index 0000000..91df275
--- /dev/null
+++ b/tools/libxl/libxl_colo.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#ifndef LIBXL_COLO_H
+#define LIBXL_COLO_H
+
+/*
+ * values to control suspend/resume primary vm and secondary vm
+ * at the same time
+ */
+enum {
+    LIBXL_COLO_NEW_CHECKPOINT = 1,
+    LIBXL_COLO_SVM_SUSPENDED,
+    LIBXL_COLO_SVM_READY,
+    LIBXL_COLO_SVM_RESUMED,
+};
+
+extern void libxl__colo_restore_done(libxl__egc *egc, void *dcs_void,
+                                     int ret, int retval, int errnoval);
+extern void libxl__colo_restore_setup(libxl__egc *egc,
+                                      libxl__colo_restore_state *crs);
+extern void libxl__colo_restore_teardown(libxl__egc *egc,
+                                         libxl__colo_restore_state *crs,
+                                         int rc);
+
+#endif
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
new file mode 100644
index 0000000..6c39758
--- /dev/null
+++ b/tools/libxl/libxl_colo_restore.c
@@ -0,0 +1,1158 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *         Yang Hongyang <yanghy@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+#include "xc_bitops.h"
+
+#define XC_PAGE_SHIFT           12
+#define PAGE_SHIFT              XC_PAGE_SHIFT
+#define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
+#define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)
+
+enum {
+    LIBXL_COLO_SETUPED,
+    LIBXL_COLO_SUSPENDED,
+    LIBXL_COLO_RESUMED,
+};
+
+typedef struct libxl__colo_restore_checkpoint_state libxl__colo_restore_checkpoint_state;
+struct libxl__colo_restore_checkpoint_state {
+    xc_hypercall_buffer_t _dirty_bitmap;
+    xc_hypercall_buffer_t *dirty_bitmap;
+    unsigned long p2m_size;
+    libxl__domain_suspend_state dsps;
+    libxl__datacopier_state dc;
+    uint8_t section;
+    libxl__logdirty_switch lds;
+    libxl__colo_restore_state *crs;
+    int status;
+    bool preresume;
+    /* used for teardown */
+    int teardown_devices;
+    int saved_rc;
+
+    void (*callback)(libxl__egc *,
+                     libxl__colo_restore_checkpoint_state *,
+                     int);
+
+    /*
+     * 0: secondary vm's dirty bitmap for domain @domid
+     * 1: secondary vm is ready(domain @domid)
+     * 2: secondary vm is resumed(domain @domid)
+     * 3. new checkpoint is triggered(domain @domid)
+     */
+    const char *copywhat[4];
+};
+
+
+static void libxl__colo_restore_domain_resume_callback(void *data);
+static void libxl__colo_restore_domain_checkpoint_callback(void *data);
+static void libxl__colo_restore_domain_suspend_callback(void *data);
+
+static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
+    NULL,
+};
+
+/* ===================== colo: common functions ===================== */
+static void colo_enable_logdirty(libxl__colo_restore_state *crs, libxl__egc *egc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const uint32_t domid = crs->domid;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    STATE_AO_GC(crs->ao);
+
+    /* we need to know which pages are dirty to restore the guest */
+    if (xc_shadow_control(CTX->xch, domid,
+                          XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                          NULL, 0, NULL, 0, NULL) < 0) {
+        LOG(ERROR, "cannot enable secondary vm's logdirty");
+        lds->callback(egc, lds, ERROR_FAIL);
+        return;
+    }
+
+    if (crs->hvm) {
+        libxl__domain_common_switch_qemu_logdirty(egc, domid, 1, lds);
+        return;
+    }
+
+    lds->callback(egc, lds, 0);
+}
+
+static void colo_disable_logdirty(libxl__colo_restore_state *crs,
+                                  libxl__egc *egc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const uint32_t domid = crs->domid;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    STATE_AO_GC(crs->ao);
+
+    /* we need to know which pages are dirty to restore the guest */
+    if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                          NULL, 0, NULL, 0, NULL) < 0)
+        LOG(WARN, "cannot disable secondary vm's logdirty");
+
+    if (crs->hvm) {
+        libxl__domain_common_switch_qemu_logdirty(egc, domid, 0, lds);
+        return;
+    }
+
+    lds->callback(egc, lds, 0);
+}
+
+static void colo_resume_vm(libxl__egc *egc,
+                           libxl__colo_restore_checkpoint_state *crcs,
+                           int restore_device_model)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (!crs->saved_cb) {
+        /* TODO: sync mmu for hvm? */
+        if (restore_device_model) {
+            rc = libxl__domain_restore(gc, crs->domid);
+            if (rc) {
+                LOG(ERROR, "cannot restore device model for secondary vm");
+                crcs->callback(egc, crcs, rc);
+                return;
+            }
+        }
+        rc = libxl__domain_resume(gc, crs->domid, 0);
+        if (rc)
+            LOG(ERROR, "cannot resume secondary vm");
+
+        crcs->callback(egc, crcs, rc);
+        return;
+    }
+
+    /*
+     * TODO: get store mfn and console mfn
+     *  We should call the callback restore_results in
+     *  xc_domain_restore() before resuming the guest.
+     */
+    libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
+
+    return;
+}
+
+static int init_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* init device subkind-specific state in the libxl ctx */
+    int rc;
+    STATE_AO_GC(cds->ao);
+
+    rc = 0;
+    return rc;
+}
+
+static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* cleanup device subkind-specific state in the libxl ctx */
+    STATE_AO_GC(cds->ao);
+}
+
+
+/* ================ colo: setup restore environment ================ */
+static void libxl__colo_domain_create_cb(libxl__egc *egc,
+                                         libxl__domain_create_state *dcs,
+                                         int rc, uint32_t domid);
+
+static int init_dsps(libxl__domain_suspend_state *dsps)
+{
+    int rc = ERROR_FAIL;
+    libxl_domain_type type;
+
+    STATE_AO_GC(dsps->ao);
+
+    type = libxl__domain_type(gc, dsps->domid);
+    if (type == LIBXL_DOMAIN_TYPE_INVALID)
+        goto out;
+
+    libxl__xswait_init(&dsps->pvcontrol);
+    libxl__ev_evtchn_init(&dsps->guest_evtchn);
+    libxl__ev_xswatch_init(&dsps->guest_watch);
+    libxl__ev_time_init(&dsps->guest_timeout);
+
+    if (type == LIBXL_DOMAIN_TYPE_HVM)
+        dsps->hvm = 1;
+    else
+        dsps->hvm = 0;
+
+    dsps->guest_evtchn.port = -1;
+    dsps->guest_evtchn_lockfd = -1;
+    dsps->guest_responded = 0;
+    dsps->dm_savefile = libxl__device_model_savefile(gc, dsps->domid);
+
+    /* Secondary vm is not created, so we cannot get evtchn port */
+
+    rc = 0;
+
+out:
+    return rc;
+}
+
+void libxl__colo_restore_setup(libxl__egc *egc,
+                               libxl__colo_restore_state *crs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs;
+    DECLARE_HYPERCALL_BUFFER(unsigned long, dirty_bitmap);
+    int rc = ERROR_FAIL;
+    int bsize;
+
+    /* Convenience aliases */
+    libxl__srm_restore_autogen_callbacks *const callbacks =
+        &dcs->shs.callbacks.restore.a;
+    const int domid = crs->domid;
+
+    STATE_AO_GC(crs->ao);
+
+    GCNEW(crcs);
+    crs->crcs = crcs;
+    crcs->crs = crs;
+
+    if (xc_domain_maximum_gpfn(CTX->xch, domid, &crcs->p2m_size) < 0) {
+        rc = ERROR_FAIL;
+        goto err;
+    }
+
+    crcs->copywhat[0] = GCSPRINTF("secondary vm's dirty bitmap for domain %"PRIu32,
+                                  domid);
+    crcs->copywhat[1] = GCSPRINTF("secondary vm is ready(domain %"PRIu32")",
+                                  domid);
+    crcs->copywhat[2] = GCSPRINTF("secondary vm is resumed(domain %"PRIu32")",
+                                  domid);
+    crcs->copywhat[3] = GCSPRINTF("new checkpoint is triggered(domain %"PRIu32")",
+                                  domid);
+
+    bsize = bitmap_size(crcs->p2m_size);
+    dirty_bitmap = xc_hypercall_buffer_alloc_pages(CTX->xch, dirty_bitmap,
+                                                   NRPAGES(bsize));
+    if (!dirty_bitmap) {
+        rc = ERROR_NOMEM;
+        goto err;
+    }
+    memset(dirty_bitmap, 0, bsize);
+    crcs->_dirty_bitmap = *HYPERCALL_BUFFER(dirty_bitmap);
+    crcs->dirty_bitmap = &crcs->_dirty_bitmap;
+
+    /* setup dsps */
+    crcs->dsps.ao = ao;
+    crcs->dsps.domid = domid;
+    if (init_dsps(&crcs->dsps))
+        goto err_init_dsps;
+
+    callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
+    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
+    callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
+
+    /*
+     * Secondary vm is running in colo mode, so we need to call
+     * libxl__xc_domain_restore_done() to create secondary vm.
+     * But we will exit in domain_create_cb(). So replace the
+     * callback here.
+     */
+    crs->saved_cb = dcs->callback;
+    dcs->callback = libxl__colo_domain_create_cb;
+    crcs->status = LIBXL_COLO_SETUPED;
+
+    logdirty_init(&crcs->lds);
+    crcs->lds.ao = ao;
+
+    rc = 0;
+
+out:
+    crs->callback(egc, crs, rc);
+    return;
+
+err_init_dsps:
+    xc_hypercall_buffer_free_pages(CTX->xch, dirty_bitmap, NRPAGES(bsize));
+    crcs->dirty_bitmap = NULL;
+err:
+    goto out;
+}
+
+static void libxl__colo_domain_create_cb(libxl__egc *egc,
+                                         libxl__domain_create_state *dcs,
+                                         int rc, uint32_t domid)
+{
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    crcs->callback(egc, crcs, rc);
+}
+
+
+/* ================ colo: teardown restore environment ================ */
+static void colo_restore_teardown_done(libxl__egc *egc,
+                                       libxl__checkpoint_devices_state *cds,
+                                       int rc);
+static void do_failover_done(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state* crcs,
+                             int rc);
+static void colo_disable_logdirty_done(libxl__egc *egc,
+                                       libxl__logdirty_switch *lds,
+                                       int rc);
+
+static void do_failover(libxl__egc *egc, libxl__colo_restore_state *crs)
+{
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const int status = crcs->status;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    STATE_AO_GC(crs->ao);
+
+    switch(status) {
+    case LIBXL_COLO_SETUPED:
+        /* We don't enable logdirty now */
+        colo_resume_vm(egc, crcs, 0);
+        return;
+    case LIBXL_COLO_SUSPENDED:
+    case LIBXL_COLO_RESUMED:
+        /* disable logdirty first */
+        lds->callback = colo_disable_logdirty_done;
+        colo_disable_logdirty(crs, egc);
+        return;
+    default:
+        LOG(ERROR, "invalid status: %d", status);
+        crcs->callback(egc, crcs, ERROR_FAIL);
+    }
+}
+
+void libxl__colo_restore_teardown(libxl__egc *egc,
+                                  libxl__colo_restore_state *crs,
+                                  int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, crcs->dirty_bitmap);
+    int bsize = bitmap_size(crcs->p2m_size);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    EGC_GC;
+
+    if (!dirty_bitmap)
+        goto teardown_devices;
+
+    xc_hypercall_buffer_free_pages(CTX->xch, dirty_bitmap, NRPAGES(bsize));
+
+teardown_devices:
+    crcs->saved_rc = rc;
+    if (!crcs->teardown_devices) {
+        colo_restore_teardown_done(egc, &crs->cds, 0);
+        return;
+    }
+
+    crs->cds.callback = colo_restore_teardown_done;
+    libxl__checkpoint_devices_teardown(egc, &crs->cds);
+}
+
+static void colo_restore_teardown_done(libxl__egc *egc,
+                                       libxl__checkpoint_devices_state *cds,
+                                       int rc)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    EGC_GC;
+
+    if (rc)
+        LOG(ERROR, "COLO: failed to teardown device for guest with domid %u,"
+            " rc %d", cds->domid, rc);
+
+    if (crcs->teardown_devices)
+        cleanup_device_subkind(cds);
+
+    rc = crcs->saved_rc;
+    if (!rc) {
+        crcs->callback = do_failover_done;
+        do_failover(egc, crs);
+        return;
+    }
+
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+    }
+    crs->callback(egc, crs, rc);
+}
+
+static void do_failover_done(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state* crcs,
+                             int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc)
+        LOG(ERROR, "cannot do failover");
+
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+    }
+
+    crs->callback(egc, crs, rc);
+}
+
+static void colo_disable_logdirty_done(libxl__egc *egc,
+                                       libxl__logdirty_switch *lds,
+                                       int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+
+    STATE_AO_GC(lds->ao);
+
+    if (rc)
+        LOG(WARN, "cannot disable logdirty");
+
+    if (crcs->status == LIBXL_COLO_SUSPENDED) {
+        /*
+         * failover when reading state from master, so no need to
+         * call libxl__domain_restore().
+         */
+        colo_resume_vm(egc, crcs, 0);
+        return;
+    }
+
+    /* If we cannot disable logdirty, we still can do failover */
+    crcs->callback(egc, crcs, 0);
+}
+
+/*
+ * checkpoint callbacks are called in the following order:
+ * 1. resume
+ * 2. checkpoint
+ * 3. suspend
+ */
+static void colo_common_read_send_data_done(libxl__egc *egc,
+                                            libxl__datacopier_state *dc,
+                                            int onwrite, int errnoval);
+/* ===================== colo: resume secondary vm ===================== */
+/*
+ * Do the following things when resuming secondary vm the first time:
+ *  1. resume secondary vm
+ *  2. enable log dirty
+ *  3. setup checkpoint devices
+ *  4. write LIBXL_COLO_SVM_READY
+ *  5. unpause secondary vm
+ *  6. write LIBXL_COLO_SVM_RESUMED
+ *
+ * Do the following things when resuming secondary vm:
+ *  1. write LIBXL_COLO_SVM_READY
+ *  2. resume secondary vm
+ *  3. write LIBXL_COLO_SVM_RESUMED
+ */
+static void colo_send_svm_ready(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs);
+static void colo_send_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_restore_checkpoint_state *crcs,
+                                     int rc);
+static void colo_restore_preresume_cb(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+static void colo_restore_resume_vm(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs);
+static void colo_resume_vm_done(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs,
+                                int rc);
+static void colo_write_svm_resumed(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs);
+static void colo_enable_logdirty_done(libxl__egc *egc,
+                                      libxl__logdirty_switch *lds,
+                                      int retval);
+static void colo_reenable_logdirty(libxl__egc *egc,
+                                   libxl__logdirty_switch *lds,
+                                   int rc);
+static void colo_reenable_logdirty_done(libxl__egc *egc,
+                                        libxl__logdirty_switch *lds,
+                                        int rc);
+static void colo_setup_checkpoint_devices(libxl__egc *egc,
+                                          libxl__colo_restore_state *crs);
+static void colo_restore_setup_cds_done(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc);
+static void colo_unpause_svm(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state *crcs);
+
+static void libxl__colo_restore_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    if (crcs->teardown_devices)
+        colo_send_svm_ready(shs->egc, crcs);
+    else
+        colo_restore_resume_vm(shs->egc, crcs);
+}
+
+static void colo_send_svm_ready(libxl__egc *egc,
+                               libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    uint8_t section = LIBXL_COLO_SVM_READY;
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    const int send_fd = crs->send_fd;
+    libxl__datacopier_state *const dc = &crcs->dc;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[1];
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    crcs->callback = colo_send_svm_ready_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell master that secondary vm is ready */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_send_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_restore_checkpoint_state *crcs,
+                                     int rc)
+{
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &crcs->crs->cds;
+
+    if (!crcs->preresume) {
+        crcs->preresume = true;
+        colo_unpause_svm(egc, crcs);
+        return;
+    }
+
+    cds->callback = colo_restore_preresume_cb;
+    libxl__checkpoint_devices_preresume(egc, cds);
+}
+
+static void colo_restore_preresume_cb(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "preresume fails");
+        goto out;
+    }
+
+    colo_restore_resume_vm(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_restore_resume_vm(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs)
+{
+
+    crcs->callback = colo_resume_vm_done;
+    colo_resume_vm(egc, crcs, 1);
+}
+
+static void colo_resume_vm_done(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs,
+                                int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "cannot resume secondary vm");
+        goto out;
+    }
+
+    crcs->status = LIBXL_COLO_RESUMED;
+
+    /* avoid calling libxl__xc_domain_restore_done() more than once */
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+
+        lds->callback = colo_enable_logdirty_done;
+        colo_enable_logdirty(crs, egc);
+        return;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_write_svm_resumed(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    uint8_t section = LIBXL_COLO_SVM_RESUMED;
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    const int send_fd = crs->send_fd;
+    libxl__datacopier_state *const dc = &crcs->dc;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[2];
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    crcs->callback = NULL;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell master that secondary vm is resumed */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_enable_logdirty_done(libxl__egc *egc,
+                                      libxl__logdirty_switch *lds,
+                                      int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        /*
+         * log-dirty already enabled? There's no test op,
+         * so attempt to disable then reenable it
+         */
+        lds->callback = colo_reenable_logdirty;
+        colo_disable_logdirty(crs, egc);
+        return;
+    }
+
+    colo_setup_checkpoint_devices(egc, crs);
+}
+
+static void colo_reenable_logdirty(libxl__egc *egc,
+                                   libxl__logdirty_switch *lds,
+                                   int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "cannot enable logdirty");
+        goto out;
+    }
+
+    lds->callback = colo_reenable_logdirty_done;
+    colo_enable_logdirty(crs, egc);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_reenable_logdirty_done(libxl__egc *egc,
+                                        libxl__logdirty_switch *lds,
+                                        int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crcs->crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "cannot enable logdirty");
+        goto out;
+    }
+
+    colo_setup_checkpoint_devices(egc, crcs->crs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+/*
+ * We cannot setup checkpoint devices in libxl__colo_restore_setup(),
+ * because the guest is not ready.
+ */
+static void colo_setup_checkpoint_devices(libxl__egc *egc,
+                                          libxl__colo_restore_state *crs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &crs->cds;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    /* TODO: disk/nic support */
+    cds->device_kind_flags = 0;
+    cds->callback = colo_restore_setup_cds_done;
+    cds->ao = ao;
+    cds->domid = crs->domid;
+    cds->ops = colo_restore_ops;
+
+    if (init_device_subkind(cds))
+        goto out;
+
+    crcs->teardown_devices = 1;
+
+    libxl__checkpoint_devices_setup(egc, cds);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_restore_setup_cds_done(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "COLO: failed to setup device for guest with domid %u",
+            cds->domid);
+        goto out;
+    }
+
+    colo_send_svm_ready(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_unpause_svm(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int rc;
+
+    /* Convenience aliases */
+    const uint32_t domid = crcs->crs->domid;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crcs->crs->ao);
+
+    /* We have enabled secondary vm's logdirty, so we can unpause it now */
+    rc = libxl__domain_unpause(gc, domid);
+    if (rc) {
+        LOG(ERROR, "cannot unpause secondary vm");
+        goto out;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+
+/* ===================== colo: wait new checkpoint ===================== */
+static void colo_restore_commit_cb(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc);
+static void colo_stream_read_done(libxl__egc *egc,
+                                  libxl__colo_restore_checkpoint_state *crcs,
+                                  int real_size);
+
+static void libxl__colo_restore_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &dcs->crs.cds;
+
+    cds->callback = colo_restore_commit_cb;
+    libxl__checkpoint_devices_commit(shs->egc, cds);
+}
+
+static void colo_restore_commit_cb(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const int recv_fd = dcs->crs.recv_fd;
+    libxl__save_helper_state *const shs = &dcs->shs;
+    libxl__datacopier_state *const dc = &crcs->dc;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "commit fails");
+        goto out;
+    }
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = recv_fd;
+    dc->writefd = -1;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[3];
+    dc->readwhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    dc->readbuf = &crcs->section;
+    dc->bytes_to_read = 1;
+    crcs->callback = colo_stream_read_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_stream_read_done(libxl__egc *egc,
+                                  libxl__colo_restore_checkpoint_state *crcs,
+                                  int real_size)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int ok = 0;
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(dcs->ao);
+
+    if (real_size != 1) {
+        LOG(ERROR, "reading data fails: %lld", (long long)real_size);
+        goto out;
+    }
+
+    if (crcs->section != LIBXL_COLO_NEW_CHECKPOINT) {
+        LOG(ERROR, "invalid section: %d", crcs->section);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, ok);
+}
+
+
+/* ===================== colo: suspend secondary vm ===================== */
+/*
+ * Do the following things when resuming secondary vm:
+ *  1. suspend secondary vm
+ *  2. get secondary vm's dirty page information
+ *  3. send LIBXL_COLO_SVM_SUSPENDED
+ *  4. send secondary vm's dirty page information(count + pfn list)
+ */
+static void colo_suspend_vm_done(libxl__egc *egc,
+                                 libxl__domain_suspend_state *dsps,
+                                 int ok);
+static void colo_restore_postsuspend_cb(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc);
+static void colo_append_pfn_type(libxl__egc *egc,
+                                 libxl__datacopier_state *dc,
+                                 unsigned long *dirty_bitmap,
+                                 unsigned long p2m_size);
+
+static void libxl__colo_restore_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    STATE_AO_GC(dcs->ao);
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state *const dsps = &crcs->dsps;
+
+    /* suspend secondary vm */
+    dsps->callback_common_done = colo_suspend_vm_done;
+
+    libxl__domain_suspend(shs->egc, dsps);
+}
+
+static void colo_suspend_vm_done(libxl__egc *egc,
+                                 libxl__domain_suspend_state *dsps,
+                                 int ok)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(dsps, *crcs, dsps);
+    libxl__colo_restore_state *crs = crcs->crs;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &crs->cds;
+
+    STATE_AO_GC(crs->ao);
+
+    if (!ok) {
+        LOG(ERROR, "cannot suspend secondary vm");
+        goto out;
+    }
+
+    crcs->status = LIBXL_COLO_SUSPENDED;
+
+    cds->callback = colo_restore_postsuspend_cb;
+    libxl__checkpoint_devices_postsuspend(egc, cds);
+
+    return;
+
+out:
+    ok = 0;
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->shs, ok);
+}
+
+static void colo_restore_postsuspend_cb(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, crcs->dirty_bitmap);
+    uint8_t section = LIBXL_COLO_SVM_SUSPENDED;
+    int i, ok = 0;
+    uint64_t count;
+
+    /* Convenience aliases */
+    const int send_fd = crs->send_fd;
+    const unsigned long p2m_size = crcs->p2m_size;
+    const uint32_t domid = crs->domid;
+    libxl__datacopier_state *const dc = &crcs->dc;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "postsuspend fails");
+        goto out;
+    }
+
+    /*
+     * Secondary vm is running, so there are some dirty pages
+     * that are non-dirty in master. Get dirty bitmap and
+     * send it to master.
+     */
+    if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                          HYPERCALL_BUFFER(dirty_bitmap), p2m_size,
+                          NULL, 0, NULL) != p2m_size) {
+        LOG(ERROR, "getting secondary vm's dirty bitmap fails");
+        goto out;
+    }
+
+    count = 0;
+    for (i = 0; i < p2m_size; i++) {
+        if (test_bit(i, dirty_bitmap))
+            count++;
+    }
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[0];
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    crcs->callback = NULL;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell master that secondary vm is suspended */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+
+    /* send dirty pages to master */
+    libxl__datacopier_prefixdata(egc, dc, &count, sizeof(count));
+    colo_append_pfn_type(egc, dc, dirty_bitmap, p2m_size);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->shs, ok);
+}
+
+static void colo_append_pfn_type(libxl__egc *egc,
+                                 libxl__datacopier_state *dc,
+                                 unsigned long *dirty_bitmap,
+                                 unsigned long p2m_size)
+{
+    int i, count;
+    /* Hack, buf->buf is private member... */
+    libxl__datacopier_buf *buf = NULL;
+    int max_batch = sizeof(buf->buf) / sizeof(uint64_t);
+    int buf_size = max_batch * sizeof(uint64_t);
+    uint64_t *pfn;
+
+    STATE_AO_GC(dc->ao);
+
+    pfn = libxl__zalloc(NOGC, buf_size);
+
+    count = 0;
+    for (i = 0; i < p2m_size; i++) {
+        if (!test_bit(i, dirty_bitmap))
+            continue;
+
+        pfn[count++] = i;
+        if (count == max_batch) {
+            libxl__datacopier_prefixdata(egc, dc, pfn, buf_size);
+            count = 0;
+        }
+    }
+
+    if (count)
+        libxl__datacopier_prefixdata(egc, dc, pfn, count * sizeof(uint64_t));
+
+    free(pfn);
+}
+
+
+/* ===================== colo: common callback ===================== */
+static void colo_common_read_send_data_done(libxl__egc *egc,
+                                            libxl__datacopier_state *dc,
+                                            int onwrite, int errnoval)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(dc, *crcs, dc);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int ok;
+    STATE_AO_GC(dc->ao);
+
+    if (onwrite == -1) {
+        LOG(ERROR, "reading/sending data fails");
+        ok = 0;
+        goto out;
+    }
+
+    if (errnoval < 0 || (onwrite == 1 && errnoval)) {
+        /* failure happens when reading/writing, do failover? */
+        ok = 2;
+        goto out;
+    }
+
+    if (!crcs->callback) {
+        /* Everythins is OK */
+        ok = 1;
+        goto out;
+    }
+
+    if (onwrite == 0)
+        crcs->callback(egc, crcs, dc->used);
+    else
+        crcs->callback(egc, crcs, 0);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->shs, ok);
+}
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index bd8149c..1548b70 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -19,6 +19,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_colo.h"
 
 #include <xc_dom.h>
 #include <xenguest.h>
@@ -992,6 +993,96 @@ static void domcreate_console_available(libxl__egc *egc,
                                         dcs->aop_console_how.for_event));
 }
 
+static void libxl__colo_restore_teardown_done(libxl__egc *egc,
+                                              libxl__colo_restore_state *crs,
+                                              int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    STATE_AO_GC(crs->ao);
+
+    /* convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+    const int domid = crs->domid;
+    const libxl_ctx *const ctx = libxl__gc_owner(gc);
+    xc_interface *const xch = ctx->xch;
+
+    if (!rc)
+        /* failover, no need to destroy the secondary vm */
+        goto out;
+
+    if (shs->retval)
+        /*
+         * shs->retval stores the return value of xc_domain_restore().
+         * If it is not 0, we have destroyed the secondary vm in
+         * xc_domain_restore();
+         */
+        goto out;
+
+    xc_domain_destroy(xch, domid);
+
+out:
+    dcs->callback(egc, dcs, rc, crs->domid);
+}
+
+void libxl__colo_restore_done(libxl__egc *egc, void *dcs_void,
+                              int ret, int retval, int errnoval)
+{
+    libxl__domain_create_state *dcs = dcs_void;
+    int rc = 1;
+
+    /* convenience aliases */
+    libxl__colo_restore_state *const crs = &dcs->crs;
+    STATE_AO_GC(crs->ao);
+
+    /* teardown and failover */
+    crs->callback = libxl__colo_restore_teardown_done;
+
+    if (ret == 0 && retval == 0)
+        rc = 0;
+
+    LOG(INFO, "%s", rc ? "colo fails" : "failover");
+    libxl__colo_restore_teardown(egc, crs, rc);
+}
+
+static void libxl__colo_restore_cp_done(libxl__egc *egc,
+                                        libxl__colo_restore_state *crs,
+                                        int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    int ok = 0;
+
+    /* convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    if (!rc)
+        ok = 1;
+
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, ok);
+}
+
+static void libxl__colo_restore_setup_done(libxl__egc *egc,
+                                           libxl__colo_restore_state *crs,
+                                           int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    /* convenience aliases */
+    const int hvm = crs->hvm;
+    const int superpages = crs->superpages;
+    const int pae = crs->pae;
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "colo restore setup fails: %d", rc);
+        libxl__xc_domain_restore_done(egc, dcs, rc, 0, 0);
+        return;
+    }
+
+    crs->callback = libxl__colo_restore_cp_done;
+    libxl__xc_domain_restore(egc, dcs,
+                             hvm, pae, superpages);
+}
+
 static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc)
@@ -1007,6 +1098,8 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     libxl__domain_build_state *const state = &dcs->build_state;
     libxl__srm_restore_autogen_callbacks *const callbacks =
         &dcs->shs.callbacks.restore.a;
+    const int checkpointed_stream = dcs->checkpointed_stream;
+    libxl__colo_restore_state *const crs = &dcs->crs;
 
     if (rc) {
         domcreate_rebuild_done(egc, dcs, rc);
@@ -1035,6 +1128,13 @@ static void domcreate_bootloader_done(libxl__egc *egc,
 
     /* Restore */
 
+    /* COLO only supports HVM now */
+    if (info->type != LIBXL_DOMAIN_TYPE_HVM &&
+        checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
     rc = libxl__build_pre(gc, domid, d_config, state);
     if (rc)
         goto out;
@@ -1057,8 +1157,20 @@ static void domcreate_bootloader_done(libxl__egc *egc,
         rc = ERROR_INVAL;
         goto out;
     }
-    libxl__xc_domain_restore(egc, dcs,
-                             hvm, pae, superpages);
+
+    if (checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        crs->ao = ao;
+        crs->domid = domid;
+        crs->send_fd = dcs->send_fd;
+        crs->recv_fd = restore_fd;
+        crs->hvm = hvm;
+        crs->superpages = superpages;
+        crs->pae = pae;
+        crs->callback = libxl__colo_restore_setup_done;
+        libxl__colo_restore_setup(egc, crs);
+    } else
+        libxl__xc_domain_restore(egc, dcs,
+                                 hvm, pae, superpages);
     return;
 
  out:
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index cb3d8db..eeb715a 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -46,7 +46,7 @@ static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch*,
 static void switch_logdirty_done(libxl__egc *egc,
                                  libxl__logdirty_switch *lds, int ok);
 
-static void logdirty_init(libxl__logdirty_switch *lds)
+void logdirty_init(libxl__logdirty_switch *lds)
 {
     lds->cmd_path = 0;
     libxl__ev_xswatch_init(&lds->watch);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 0ebb104..e9d890b 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2864,6 +2864,8 @@ typedef struct libxl__logdirty_switch {
     libxl__ev_time timeout;
 } libxl__logdirty_switch;
 
+_hidden void logdirty_init(libxl__logdirty_switch *lds);
+
 struct libxl__domain_suspend_state {
     /* set by caller of domain_suspend_callback_common */
     libxl__ao *ao;
@@ -3151,6 +3153,27 @@ typedef void libxl__domain_create_cb(libxl__egc *egc,
                                      libxl__domain_create_state*,
                                      int rc, uint32_t domid);
 
+/* colo related structure */
+typedef struct libxl__colo_restore_state libxl__colo_restore_state;
+typedef void libxl__colo_callback(libxl__egc *,
+                                  libxl__colo_restore_state *, int rc);
+struct libxl__colo_restore_state {
+    /* must set by caller of libxl__colo_(setup|teardown) */
+    libxl__ao *ao;
+    uint32_t domid;
+    int send_fd;
+    int recv_fd;
+    int hvm;
+    int pae;
+    int superpages;
+    libxl__colo_callback *callback;
+
+    /* private, colo restore checkpoint state */
+    libxl__domain_create_cb *saved_cb;
+    void *crcs;
+    libxl__checkpoint_devices_state cds;
+};
+
 struct libxl__domain_create_state {
     /* filled in by user */
     libxl__ao *ao;
@@ -3164,6 +3187,7 @@ struct libxl__domain_create_state {
     int guest_domid;
     int checkpointed_stream;
     libxl__domain_build_state build_state;
+    libxl__colo_restore_state crs;
     libxl__bootloader_state bl;
     libxl__stub_dm_spawn_state dmss;
         /* If we're not doing stubdom, we use only dmss.dm,
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 5c691eb..d5840e2 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -15,6 +15,7 @@
 #include "libxl_osdeps.h"
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 /* stream_fd is as from the caller (eventually, the application).
  * It may be 0, 1 or 2, in which case we need to dup it elsewhere.
@@ -65,7 +66,10 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs,
     dcs->shs.ao = ao;
     dcs->shs.domid = domid;
     dcs->shs.recv_callback = libxl__srm_callout_received_restore;
-    dcs->shs.completion_callback = libxl__xc_domain_restore_done;
+    if (dcs->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO)
+        dcs->shs.completion_callback = libxl__colo_restore_done;
+    else
+        dcs->shs.completion_callback = libxl__xc_domain_restore_done;
     dcs->shs.caller_state = dcs;
     dcs->shs.need_results = 1;
     dcs->shs.toolstack_data_file = 0;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 41ee000..0239cac 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -24,9 +24,9 @@ our @msgs = (
                                                  STRING doing_what),
                                                 'unsigned long', 'done',
                                                 'unsigned long', 'total'] ],
-    [  3, 'scxA',   "suspend", [] ],
-    [  4, 'scxA',   "postcopy", [] ],
-    [  5, 'scxA',   "checkpoint", [] ],
+    [  3, 'srcxA',   "suspend", [] ],
+    [  4, 'srcxA',   "postcopy", [] ],
+    [  5, 'srcxA',   "checkpoint", [] ],
     [  6, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
                                               unsigned enable)] ],
     #                toolstack_save          done entirely `by hand'
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 03/15] primary vm suspend/get_dirty_pfn/resume/checkpoint code
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 01/15] docs: add colo readme Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-16 11:05   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 04/15] libxc/restore: support COLO restore Yang Hongyang
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

From: Wen Congyang <wency@cn.fujitsu.com>

We will do the following things again and again:
1. Suspend primary vm
   a. Suspend primary vm
   b. do postsuspend
   c. Read LIBXL_COLO_SVM_SUSPENDED sent by secondary
   d. Read secondary vm's dirty page information to master(count + pfn list)
2. Get dirty pfn list callback, used by libxc
   a. Return secondary vm's dirty pfn list
3. Resume primary vm
   a. Read LIBXL_COLO_SVM_READY from slave
   b. Do presume
   c. Resume primary vm
   d. Read LIBXL_COLO_SVM_RESUMED from slave
4. Wait a new checkpoint
   a. Wait a new checkpoint(not implemented)
   b. Send LIBXL_COLO_NEW_CHECKPOINT to slave

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/libxc/include/xenguest.h     |  12 +
 tools/libxl/Makefile               |   2 +-
 tools/libxl/libxl.c                |   6 +-
 tools/libxl/libxl_colo.h           |  10 +
 tools/libxl/libxl_colo_save.c      | 643 +++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_dom_save.c       |  15 +-
 tools/libxl/libxl_internal.h       |  31 +-
 tools/libxl/libxl_save_msgs_gen.pl |   1 +
 tools/libxl/libxl_types.idl        |   1 +
 9 files changed, 712 insertions(+), 9 deletions(-)
 create mode 100644 tools/libxl/libxl_colo_save.c

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 86bcf9c..d5902a6 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -75,6 +75,18 @@ struct save_callbacks {
      */
     int (*toolstack_save)(uint32_t domid, uint8_t **buf, uint32_t *len, void *data);
 
+    /* Called after the guest is suspended.
+     *
+     * returns the list of dirty pfn:
+     *  struct {
+     *      uint64_t count;
+     *      uint64_t pfn[];
+     *  };
+     *
+     *  Note: the caller must free the return value.
+     */
+    uint8_t *(*get_dirty_pfn)(void *data);
+
     /* to be provided as the last argument to each callback function */
     void* data;
 };
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 82cc4c2..88c5426 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -57,7 +57,7 @@ LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
-LIBXL_OBJS-y += libxl_colo_restore.o
+LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 10d3d82..1145ae4 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -17,6 +17,7 @@
 #include "libxl_osdeps.h"
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 #define PAGE_TO_MEMKB(pages) ((pages) * 4)
 #define BACKEND_STRING_SIZE 5
@@ -841,7 +842,10 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     assert(info);
 
     /* Point of no return */
-    libxl__remus_setup(egc, &dss->rs);
+    if (libxl_defbool_val(info->colo))
+        libxl__colo_save_setup(egc, &dss->css);
+    else
+        libxl__remus_setup(egc, &dss->rs);
     return AO_INPROGRESS;
 
  out:
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 91df275..26a2563 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -35,4 +35,14 @@ extern void libxl__colo_restore_teardown(libxl__egc *egc,
                                          libxl__colo_restore_state *crs,
                                          int rc);
 
+extern void libxl__colo_save_domain_suspend_callback(void *data);
+extern void libxl__colo_save_domain_resume_callback(void *data);
+extern void libxl__colo_save_domain_checkpoint_callback(void *data);
+extern void libxl__colo_save_get_dirty_pfn_callback(void *data);
+extern void libxl__colo_save_setup(libxl__egc *egc,
+                                   libxl__colo_save_state *css);
+extern void libxl__colo_save_teardown(libxl__egc *egc,
+                                      libxl__colo_save_state *css,
+                                      int rc);
+
 #endif
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
new file mode 100644
index 0000000..153ec57
--- /dev/null
+++ b/tools/libxl/libxl_colo_save.c
@@ -0,0 +1,643 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *         Yang Hongyang <yanghy@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+
+static const libxl__checkpoint_device_instance_ops *colo_ops[] = {
+    NULL,
+};
+
+/* ================= helper functions ================= */
+static int init_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* init device subkind-specific state in the libxl ctx */
+    int rc;
+    STATE_AO_GC(cds->ao);
+
+    rc = 0;
+    return rc;
+}
+
+static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* cleanup device subkind-specific state in the libxl ctx */
+    STATE_AO_GC(cds->ao);
+}
+
+/* ================= colo: setup save environment ================= */
+static void colo_save_setup_done(libxl__egc *egc,
+                                 libxl__checkpoint_devices_state *cds,
+                                 int rc);
+static void colo_save_setup_failed(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc);
+
+void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &css->cds;
+
+    STATE_AO_GC(dss->ao);
+
+    if (dss->type != LIBXL_DOMAIN_TYPE_HVM) {
+        LOG(ERROR, "COLO only supports hvm now");
+        goto out;
+    }
+
+    css->send_fd = dss->fd;
+    css->recv_fd = dss->recv_fd;
+    css->svm_running = false;
+
+    /* TODO: disk/nic support */
+    cds->device_kind_flags = 0;
+    cds->ops = colo_ops;
+    cds->callback = colo_save_setup_done;
+    cds->ao = ao;
+    cds->domid = dss->domid;
+
+    if (init_device_subkind(cds))
+        goto out;
+
+    libxl__checkpoint_devices_setup(egc, &css->cds);
+
+    return;
+
+out:
+    libxl__ao_complete(egc, ao, ERROR_FAIL);
+}
+
+static void colo_save_setup_done(libxl__egc *egc,
+                                 libxl__checkpoint_devices_state *cds,
+                                 int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    STATE_AO_GC(cds->ao);
+
+    if (!rc) {
+        libxl__domain_save(egc, dss);
+        return;
+    }
+
+    LOG(ERROR, "COLO: failed to setup device for guest with domid %u",
+        dss->domid);
+    css->cds.callback = colo_save_setup_failed;
+    libxl__checkpoint_devices_teardown(egc, &css->cds);
+}
+
+static void colo_save_setup_failed(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc)
+{
+    STATE_AO_GC(cds->ao);
+
+    if (rc)
+        LOG(ERROR, "COLO: failed to teardown device after setup failed"
+            " for guest with domid %u, rc %d", cds->domid, rc);
+
+    cleanup_device_subkind(cds);
+    libxl__ao_complete(egc, ao, rc);
+}
+
+
+/* ================= colo: teardown save environment ================= */
+static void colo_teardown_done(libxl__egc *egc,
+                               libxl__checkpoint_devices_state *cds,
+                               int rc);
+
+void libxl__colo_save_teardown(libxl__egc *egc,
+                               libxl__colo_save_state *css,
+                               int rc)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(css->cds.ao);
+
+    LOG(WARN, "COLO: Domain suspend terminated with rc %d,"
+        " teardown COLO devices...", rc);
+    dss->css.cds.callback = colo_teardown_done;
+    libxl__checkpoint_devices_teardown(egc, &dss->css.cds);
+    return;
+}
+
+static void colo_teardown_done(libxl__egc *egc,
+                               libxl__checkpoint_devices_state *cds,
+                               int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    cleanup_device_subkind(cds);
+    dss->callback(egc, dss, rc);
+}
+
+/*
+ * checkpoint callbacks are called in the following order:
+ * 1. suspend
+ * 2. resume
+ * 3. checkpoint
+ */
+static void colo_common_read_send_data_done(libxl__egc *egc,
+                                            libxl__datacopier_state *dc,
+                                            int onwrite, int errnoval);
+/* ===================== colo: suspend primary vm ===================== */
+/*
+ * Do the following things when suspending primary vm:
+ * 1. suspend primary vm
+ * 2. do postsuspend
+ * 3. read LIBXL_COLO_SVM_SUSPENDED
+ * 4. read secondary vm's dirty pages
+ */
+static void colo_suspend_primary_vm_done(libxl__egc *egc,
+                                         libxl__domain_suspend_state *dsps,
+                                         int ok);
+static void colo_postsuspend_cb(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds,
+                                int rc);
+static void colo_read_pfn(libxl__egc *egc, libxl__colo_save_state *css);
+
+void libxl__colo_save_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_save_state *dss = CONTAINER_OF(shs, *dss, shs);
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state *dsps = &dss->dsps;
+
+    dsps->callback_common_done = colo_suspend_primary_vm_done;
+    libxl__domain_suspend(egc, dsps);
+}
+
+static void colo_suspend_primary_vm_done(libxl__egc *egc,
+                                         libxl__domain_suspend_state *dsps,
+                                         int ok)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(dsps, *dss, dsps);
+
+    STATE_AO_GC(dsps->ao);
+
+    if (!ok) {
+        LOG(ERROR, "cannot suspend primary vm");
+        goto out;
+    }
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->css.cds;
+
+    cds->callback = colo_postsuspend_cb;
+    libxl__checkpoint_devices_postsuspend(egc, cds);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+static void colo_postsuspend_cb(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds,
+                                int rc)
+{
+    int ok = 0;
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__datacopier_state *const dc = &css->dc;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "postsuspend fails");
+        goto out;
+    }
+
+    if (!css->svm_running) {
+        ok = 1;
+        goto out;
+    }
+
+    /*
+     * read LIBXL_COLO_SVM_SUSPENDED and the count of
+     * secondary vm's dirty pages.
+     */
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = css->recv_fd;
+    dc->writefd = -1;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = "secondary vm is suspended";
+    dc->readwhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    dc->readbuf = css->temp_buff;
+    dc->bytes_to_read = sizeof(css->temp_buff);
+    css->callback = colo_read_pfn;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+static void colo_read_pfn(libxl__egc *egc, libxl__colo_save_state *css)
+{
+    int ok = 0;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    int rc;
+
+    STATE_AO_GC(css->cds.ao);
+
+    /* Convenience aliases */
+    libxl__datacopier_state *const dc = &css->dc;
+
+    assert(!css->buff);
+    css->section = css->temp_buff[0];
+    css->count = *(uint64_t *)(&css->temp_buff[1]);
+
+    if (css->section != LIBXL_COLO_SVM_SUSPENDED) {
+        LOG(ERROR, "invalid section: %d, expected: %d",
+            css->section, LIBXL_COLO_SVM_SUSPENDED);
+        goto out;
+    }
+
+    css->buff = libxl__zalloc(NOGC, sizeof(uint64_t) * (css->count + 1));
+    css->buff[0] = css->count;
+
+    if (css->count == 0) {
+        /* no dirty pages */
+        ok = 1;
+        goto out;
+    }
+
+    /* read the pfn of secondary vm's dirty pages */
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = css->recv_fd;
+    dc->writefd = -1;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = "secondary vm's dirty bitmap";
+    dc->readwhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    dc->readbuf = css->buff + 1;
+    dc->bytes_to_read = css->count * sizeof(uint64_t);
+    css->callback = NULL;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/* ===================== colo: get dirty pfn ===================== */
+void libxl__colo_save_get_dirty_pfn_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_save_state *dss = CONTAINER_OF(shs, *dss, shs);
+    uint64_t size;
+
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+
+    assert(css->buff);
+    size = sizeof(uint64_t) * (css->count + 1);
+
+    libxl__xc_domain_saverestore_async_callback_done_with_data(egc, shs,
+                                                               (uint8_t *)css->buff,
+                                                               size);
+    free(css->buff);
+    css->buff = NULL;
+}
+
+
+/* ===================== colo: resume primary vm ===================== */
+/*
+ * Do the following things when resuming primary vm:
+ *  1. read LIBXL_COLO_SVM_READY
+ *  2. do preresume
+ *  3. resume primary vm
+ *  4. read LIBXL_COLO_SVM_RESUMED
+ */
+static void colo_preresume_dm_saved(libxl__egc *egc,
+                                    libxl__domain_save_state *dss, int rc);
+static void colo_read_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_save_state *css);
+static void colo_preresume_cb(libxl__egc *egc,
+                              libxl__checkpoint_devices_state *cds,
+                              int rc);
+static void colo_read_svm_resumed_done(libxl__egc *egc,
+                                       libxl__colo_save_state *css);
+
+void libxl__colo_save_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_save_state *dss = CONTAINER_OF(shs, *dss, shs);
+
+    /* This would go into tailbuf. */
+    if (dss->hvm) {
+        libxl__domain_save_device_model(egc, dss, colo_preresume_dm_saved);
+    } else {
+        colo_preresume_dm_saved(egc, dss, 0);
+    }
+
+    return;
+}
+
+static void colo_preresume_dm_saved(libxl__egc *egc,
+                                    libxl__domain_save_state *dss, int rc)
+{
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+    libxl__datacopier_state *const dc = &css->dc;
+
+    STATE_AO_GC(css->cds.ao);
+
+    if (rc) {
+        LOG(ERROR, "Failed to save device model. Terminating COLO..");
+        goto out;
+    }
+
+    /* read LIBXL_COLO_SVM_READY */
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = css->recv_fd;
+    dc->writefd = -1;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = "secondary vm is ready";
+    dc->readwhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    dc->readbuf = &css->section;
+    dc->bytes_to_read = sizeof(css->section);
+    css->callback = colo_read_svm_ready_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_read_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_save_state *css)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(css->cds.ao);
+
+    if (css->section != LIBXL_COLO_SVM_READY) {
+        LOG(ERROR, "invalid section: %d, expected: %d",
+            css->section, LIBXL_COLO_SVM_READY);
+        goto out;
+    }
+
+    css->svm_running = true;
+    css->cds.callback = colo_preresume_cb;
+    libxl__checkpoint_devices_preresume(egc, &css->cds);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_preresume_cb(libxl__egc *egc,
+                              libxl__checkpoint_devices_state *cds,
+                              int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__datacopier_state *const dc = &css->dc;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "preresume fails");
+        goto out;
+    }
+
+    /* Resumes the domain and the device model */
+    if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1)) {
+        LOG(ERROR, "cannot resume primary vm");
+        goto out;
+    }
+
+    /* read LIBXL_COLO_SVM_RESUMED */
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = css->recv_fd;
+    dc->writefd = -1;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = "secondary vm is resumed";
+    dc->readwhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    dc->readbuf = &css->section;
+    dc->bytes_to_read = sizeof(css->section);
+    css->callback = colo_read_svm_resumed_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_read_svm_resumed_done(libxl__egc *egc,
+                                       libxl__colo_save_state *css)
+{
+    int ok = 0;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(css->cds.ao);
+
+    if (css->section != LIBXL_COLO_SVM_RESUMED) {
+        LOG(ERROR, "invalid section: %d, expected: %d",
+            css->section, LIBXL_COLO_SVM_RESUMED);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/* ===================== colo: wait new checkpoint ===================== */
+/*
+ * Do the following things:
+ * 1. do commit
+ * 2. wait for a new checkpoint
+ * 3. write LIBXL_COLO_NEW_CHECKPOINT
+ */
+static void colo_device_commit_cb(libxl__egc *egc,
+                                  libxl__checkpoint_devices_state *cds,
+                                  int rc);
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+
+void libxl__colo_save_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_save_state *dss = CONTAINER_OF(shs, *dss, shs);
+    libxl__egc *egc = dss->shs.egc;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->css.cds;
+
+    cds->callback = colo_device_commit_cb;
+    libxl__checkpoint_devices_commit(egc, cds);
+}
+
+static void colo_device_commit_cb(libxl__egc *egc,
+                                  libxl__checkpoint_devices_state *cds,
+                                  int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "commit fails");
+        goto out;
+    }
+
+    /* TODO: wait a new checkpoint */
+    colo_start_new_checkpoint(egc, cds, 0);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    uint8_t section = LIBXL_COLO_NEW_CHECKPOINT;
+
+    /* Convenience aliases */
+    libxl__datacopier_state *const dc = &css->dc;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc)
+        goto out;
+
+    /* write LIBXL_COLO_NEW_CHECKPOINT */
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = css->send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = "new checkpoint is triggered";
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_read_send_data_done;
+    css->callback = NULL;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell slave that a new checkpoint is triggered */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+
+/* ===================== colo: common callback ===================== */
+static void colo_common_read_send_data_done(libxl__egc *egc,
+                                            libxl__datacopier_state *dc,
+                                            int onwrite, int errnoval)
+{
+    int ok = 0;
+    libxl__colo_save_state *css = CONTAINER_OF(dc, *css, dc);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(dc->ao);
+
+    if (onwrite == -1) {
+        LOG(ERROR, "reading/sending data fails");
+        ok = 0;
+        goto out;
+    }
+
+    if (errnoval < 0 || (onwrite == 1 && errnoval)) {
+        /* failure happens when reading/writing, do failover? */
+        ok = 2;
+        goto out;
+    }
+
+    if (dc->bytes_to_read != 0) {
+        /* EOF is read */
+        LOG(ERROR, "reading EOF unexpectedly");
+        ok = 0;
+        goto out;
+    }
+
+    if (!css->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    if (onwrite == 0)
+        css->callback(egc, css);
+    else
+        css->callback(egc, css);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index eeb715a..4b5a4d9 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -16,6 +16,7 @@
 #include "libxl_osdeps.h" /* must come before any other headers */
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 struct libxl__physmap_info {
     uint64_t phys_offset;
@@ -426,7 +427,12 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
     }
 
     memset(callbacks, 0, sizeof(*callbacks));
-    if (r_info != NULL) {
+    if (r_info != NULL && libxl_defbool_val(r_info->colo)) {
+        callbacks->suspend = libxl__colo_save_domain_suspend_callback;
+        callbacks->postcopy = libxl__colo_save_domain_resume_callback;
+        callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
+        callbacks->get_dirty_pfn = libxl__colo_save_get_dirty_pfn_callback;
+    } else if (r_info != NULL) {
         callbacks->suspend = libxl__remus_domain_suspend_callback;
         callbacks->postcopy = libxl__remus_domain_resume_callback;
         callbacks->checkpoint = libxl__remus_domain_checkpoint_callback;
@@ -595,12 +601,15 @@ static void domain_save_done(libxl__egc *egc,
     }
 
     /*
-     * With Remus, if we reach this point, it means either
+     * With Remus/COLO, if we reach this point, it means either
      * backup died or some network error occurred preventing us
      * from sending checkpoints. Teardown the network buffers and
      * release netlink resources.  This is an async op.
      */
-    libxl__remus_teardown(egc, &dss->rs, rc);
+    if (libxl_defbool_val(dss->remus->colo))
+        libxl__colo_save_teardown(egc, &dss->css, rc);
+    else
+        libxl__remus_teardown(egc, &dss->rs, rc);
 }
 
 /*==================== Domain restore ====================*/
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index e9d890b..1acea97 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2653,7 +2653,7 @@ typedef struct libxl__save_helper_state {
 /*
  * The abstract checkpoint device layer exposes a common
  * set of API to [external] libxl for manipulating devices attached to
- * a guest protected by Remus. The device layer also exposes a set of
+ * a guest protected by Remus/COLO. The device layer also exposes a set of
  * [internal] interfaces that every device type must implement.
  *
  * The following API are exposed to libxl:
@@ -2671,7 +2671,7 @@ typedef struct libxl__save_helper_state {
  *  +libxl__checkpoint_devices_commit
  *
  * Each device type needs to implement the interfaces specified in
- * the libxl__checkpoint_device_instance_ops if it wishes to support Remus.
+ * the libxl__checkpoint_device_instance_ops if it wishes to support Remus/COLO.
  *
  * The high-level control flow through the checkpoint device layer is shown
  * below:
@@ -2691,7 +2691,7 @@ typedef struct libxl__checkpoint_device_instance_ops libxl__checkpoint_device_in
 
 /*
  * Interfaces to be implemented by every device subkind that wishes to
- * support Remus. Functions must be implemented unless otherwise
+ * support Remus/COLO. Functions must be implemented unless otherwise
  * stated. Many of these functions are asynchronous. They call
  * dev->aodev.callback when done.  The actual implementations may be
  * synchronous and call dev->aodev.callback directly (as the last
@@ -2841,6 +2841,24 @@ struct libxl__remus_state {
 };
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
+/*----- colo related state structure -----*/
+typedef struct libxl__colo_save_state libxl__colo_save_state;
+struct libxl__colo_save_state {
+    libxl__checkpoint_devices_state cds;
+    int send_fd;
+    int recv_fd;
+
+    /* private */
+    libxl__datacopier_state dc;
+    uint8_t section;
+    uint64_t count;
+    uint64_t *buff;
+    /* read section and count, and then store it in temp_buff */
+    uint8_t temp_buff[9];
+    void (*callback)(libxl__egc *, libxl__colo_save_state *);
+    bool svm_running;
+};
+
 /*----- Domain suspend (save) state structure -----*/
 
 typedef struct libxl__domain_suspend_state libxl__domain_suspend_state;
@@ -2900,7 +2918,12 @@ struct libxl__domain_save_state {
     libxl__domain_suspend_state dsps;
     int hvm;
     int xcflags;
-    libxl__remus_state rs;
+    union {
+        /* for Remus */
+        libxl__remus_state rs;
+        /* for COLO */
+        libxl__colo_save_state css;
+    };
     libxl__save_helper_state shs;
     libxl__logdirty_switch logdirty;
     /* private for libxl__domain_save_device_model */
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 0239cac..fbb2d67 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -36,6 +36,7 @@ our @msgs = (
                                               'unsigned long', 'console_mfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
+    [ 10, 'scxAB',  "get_dirty_pfn", [] ],
 );
 
 #----------------------------------------
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 375c546..7f07f8b 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -695,6 +695,7 @@ libxl_domain_remus_info = Struct("domain_remus_info",[
     ("netbuf",       libxl_defbool),
     ("netbufscript", string),
     ("diskbuf",      libxl_defbool),
+    ("colo",         libxl_defbool)
     ])
 
 libxl_event_type = Enumeration("event_type", [
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 04/15] libxc/restore: support COLO restore
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (2 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 03/15] primary vm suspend/get_dirty_pfn/resume/checkpoint code Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-08 10:39   ` Andrew Cooper
  2015-06-08  3:45 ` [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm Yang Hongyang
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

call the callbacks resume/checkpoint/suspend while secondary vm
status is consistent with primary.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_common.h          | 11 +++++--
 tools/libxc/xc_sr_restore.c         | 63 ++++++++++++++++++++++++++++++++++++-
 tools/libxc/xc_sr_restore_x86_hvm.c |  1 +
 3 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 565c5da..382bf76 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -132,8 +132,11 @@ struct xc_sr_restore_ops
      *
      * @return 0 for success, -1 for failure, or the sentinel value
      * RECORD_NOT_PROCESSED.
+     * BROKEN_CHANNEL: if we are under Remus/COLO, this means master may dead,
+     *                 we will failover.
      */
 #define RECORD_NOT_PROCESSED 1
+#define BROKEN_CHANNEL 2
     int (*process_record)(struct xc_sr_context *ctx, struct xc_sr_record *rec);
 
     /**
@@ -205,8 +208,12 @@ struct xc_sr_context
             uint32_t guest_type;
             uint32_t guest_page_size;
 
-            /* Plain VM, or checkpoints over time. */
-            bool checkpointed;
+            /*
+             * 0: Plain VM
+             * 1: Remus
+             * 2: COLO
+             */
+            int checkpointed;
 
             /* Currently buffering records between a checkpoint */
             bool buffer_all_records;
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 2d2edd3..982a70e 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -1,4 +1,5 @@
 #include <arpa/inet.h>
+#include <assert.h>
 
 #include "xc_sr_common.h"
 
@@ -472,7 +473,7 @@ static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec);
 static int handle_checkpoint(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
-    int rc = 0;
+    int rc = 0, ret;
     unsigned i;
 
     if ( !ctx->restore.checkpointed )
@@ -498,6 +499,46 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
     else
         ctx->restore.buffer_all_records = true;
 
+    if ( ctx->restore.checkpointed == 2 )
+    {
+#define HANDLE_CALLBACK_RETURN_VALUE(ret)                   \
+    do {                                                    \
+        if ( ret == 0 )                                     \
+        {                                                   \
+            /* Some internal error happens */               \
+            rc = -1;                                        \
+            goto err;                                       \
+        }                                                   \
+        else if ( ret == 2 )                                \
+        {                                                   \
+            /* Reading/writing error, do failover */        \
+            rc = BROKEN_CHANNEL;                            \
+            goto err;                                       \
+        }                                                   \
+    } while (0)
+
+        /* COLO */
+
+        /* We need to resume guest */
+        rc = ctx->restore.ops.stream_complete(ctx);
+        if ( rc )
+            goto err;
+
+        /* TODO: call restore_results */
+
+        /* Resume secondary vm */
+        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(ret);
+
+        /* wait for new checkpoint */
+        ret = ctx->restore.callbacks->checkpoint(ctx->restore.callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(ret);
+
+        /* suspend secondary vm */
+        ret = ctx->restore.callbacks->suspend(ctx->restore.callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(ret);
+    }
+
  err:
     return rc;
 }
@@ -678,6 +719,8 @@ static int restore(struct xc_sr_context *ctx)
                     goto err;
                 }
             }
+            else if ( rc == BROKEN_CHANNEL )
+                goto remus_failover;
             else if ( rc )
                 goto err;
         }
@@ -685,6 +728,15 @@ static int restore(struct xc_sr_context *ctx)
     } while ( rec.type != REC_TYPE_END );
 
  remus_failover:
+
+    if ( ctx->restore.checkpointed == 2 )
+    {
+        /* With COLO, we have already called stream_complete */
+        rc = 0;
+        IPRINTF("COLO Failover");
+        goto done;
+    }
+
     /*
      * With Remus, if we reach here, there must be some error on primary,
      * failover from the last checkpoint state.
@@ -735,6 +787,15 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
     ctx.restore.checkpointed = checkpointed_stream;
     ctx.restore.callbacks = callbacks;
 
+    /* Sanity checks for callbacks. */
+    if ( ctx.restore.checkpointed == 2 )
+    {
+        /* this is COLO restore */
+        assert(callbacks->suspend &&
+               callbacks->checkpoint &&
+               callbacks->postcopy);
+    }
+
     IPRINTF("In experimental %s", __func__);
     DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
             ", checkpointed_stream %d", io_fd, dom, hvm, pae,
diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
index 06177e0..8e54c68 100644
--- a/tools/libxc/xc_sr_restore_x86_hvm.c
+++ b/tools/libxc/xc_sr_restore_x86_hvm.c
@@ -181,6 +181,7 @@ static int handle_qemu(struct xc_sr_context *ctx)
     if ( fp )
         fclose(fp);
     free(qbuf);
+    ctx->x86_hvm.restore.qbuf = NULL;
 
     return rc;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (3 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 04/15] libxc/restore: support COLO restore Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-08 12:16   ` Andrew Cooper
  2015-06-16 11:13   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 06/15] libxc/save: support COLO save Yang Hongyang
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

From: Wen Congyang <wency@cn.fujitsu.com>

We will call libxl__xc_domain_restore_done() to rebuild secondary vm. But
we need store mfn and console mfn when rebuilding secondary vm. So make
restore_results is a function pointers in callbacks struct and struct
{save,restore}_callbacks, and use this callback to send store mfn and
console mfn to xl.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/include/xenguest.h     | 8 ++++++++
 tools/libxc/xc_sr_restore.c        | 8 ++++++--
 tools/libxl/libxl_colo_restore.c   | 5 -----
 tools/libxl/libxl_create.c         | 1 +
 tools/libxl/libxl_save_msgs_gen.pl | 2 +-
 5 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index d5902a6..50096b9 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -130,6 +130,14 @@ struct restore_callbacks {
     /* Enable qemu-dm logging dirty pages to xen */
     int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
 
+    /*
+     * callback to send store mfn and console mfn to xl
+     * if we want to resume vm before xc_domain_save()
+     * exits.
+     */
+    void (*restore_results)(unsigned long store_mfn, unsigned long console_mfn,
+                            void *data);
+
     /* callback to restore toolstack specific data */
     int (*toolstack_restore)(uint32_t domid, const uint8_t *buf,
             uint32_t size, void* data);
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 982a70e..5e2efd8 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -524,7 +524,10 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
         if ( rc )
             goto err;
 
-        /* TODO: call restore_results */
+        /* call restore_results */
+        ctx->restore.callbacks->restore_results(ctx->restore.xenstore_gfn,
+                                                ctx->restore.console_gfn,
+                                                ctx->restore.callbacks->data);
 
         /* Resume secondary vm */
         ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
@@ -793,7 +796,8 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
         /* this is COLO restore */
         assert(callbacks->suspend &&
                callbacks->checkpoint &&
-               callbacks->postcopy);
+               callbacks->postcopy &&
+               callbacks->restore_results);
     }
 
     IPRINTF("In experimental %s", __func__);
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index 6c39758..c613c15 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -153,11 +153,6 @@ static void colo_resume_vm(libxl__egc *egc,
         return;
     }
 
-    /*
-     * TODO: get store mfn and console mfn
-     *  We should call the callback restore_results in
-     *  xc_domain_restore() before resuming the guest.
-     */
     libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
 
     return;
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 1548b70..6e307f3 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1157,6 +1157,7 @@ static void domcreate_bootloader_done(libxl__egc *egc,
         rc = ERROR_INVAL;
         goto out;
     }
+    callbacks->restore_results = libxl__srm_callout_callback_restore_results;
 
     if (checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
         crs->ao = ao;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index fbb2d67..2ecd25d 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -32,7 +32,7 @@ our @msgs = (
     #                toolstack_save          done entirely `by hand'
     [  7, 'rcxW',   "toolstack_restore",     [qw(uint32_t domid
                                                 BLOCK tsdata)] ],
-    [  8, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
+    [  8, 'rcx',    "restore_results",       ['unsigned long', 'store_mfn',
                                               'unsigned long', 'console_mfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (4 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-08 13:04   ` Andrew Cooper
  2015-06-08  3:45 ` [PATCH v6 COLO 07/15] implement the cmdline for COLO Yang Hongyang
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

call callbacks->get_dirty_pfn() after suspend primary vm to
get dirty pages on secondary vm, and send pages both dirty on
primary/secondary to secondary.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_save.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index d63b783..cda61ed 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -515,6 +515,31 @@ static int send_memory_live(struct xc_sr_context *ctx)
     return rc;
 }
 
+static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *), void *data,
+                               unsigned long p2m_size, unsigned long *bitmap)
+{
+    uint64_t *pfn_list;
+    uint64_t count, i;
+    uint64_t pfn;
+
+    pfn_list = (uint64_t *)get_dirty_pfn(data);
+    assert(pfn_list);
+
+    count = pfn_list[0];
+    for (i = 0; i < count; i++) {
+        pfn = pfn_list[i + 1];
+        if (pfn > p2m_size) {
+            errno = EINVAL;
+            return -1;
+        }
+
+        set_bit(pfn, bitmap);
+    }
+
+    free(pfn_list);
+    return 0;
+}
+
 /*
  * Suspend the domain and send dirty memory.
  * This is the last iteration of the live migration and the
@@ -555,6 +580,19 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
 
     bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
 
+    if ( !ctx->save.live && ctx->save.callbacks->get_dirty_pfn )
+    {
+        rc = update_dirty_bitmap(ctx->save.callbacks->get_dirty_pfn,
+                                 ctx->save.callbacks->data,
+                                 ctx->save.p2m_size,
+                                 dirty_bitmap);
+        if ( rc )
+        {
+            PERROR("Failed to get secondary vm's dirty pages");
+            goto out;
+        }
+    }
+
     rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
     if ( rc )
         goto out;
@@ -784,7 +822,16 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
             if ( rc )
                 goto err;
 
-            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            if ( !rc ) {
+                if ( !errno )
+                {
+                    /* Postcopy request failed (without errno, using EINVAL) */
+                    errno = EINVAL;
+                }
+                rc = -1;
+                goto err;
+            }
 
             rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
             if ( rc <= 0 )
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 07/15] implement the cmdline for COLO
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (5 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 06/15] libxc/save: support COLO save Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-16 11:19   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 08/15] Support colo mode for qemu disk Yang Hongyang
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

From: Wen Congyang <wency@cn.fujitsu.com>

Add a new option -c to the command 'xl remus'. If you want
to use COLO HA instead of Remus HA, please use -c option.

Update man pages to reflect the addition of a new option to
'xl remus' command.

Also add a new option -c to the internal command 'xl migrate-receive'.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/man/xl.pod.1         | 12 +++++++++--
 tools/libxl/libxl.c       | 16 ++++++++++++++
 tools/libxl/xl_cmdimpl.c  | 53 +++++++++++++++++++++++++++++++++++++++--------
 tools/libxl/xl_cmdtable.c |  4 +++-
 4 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 4eb929d..f5e97d7 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -447,12 +447,15 @@ Print huge (!) amount of debug during the migration process.
 
 =item B<remus> [I<OPTIONS>] I<domain-id> I<host>
 
-Enable Remus HA for domain. By default B<xl> relies on ssh as a transport
-mechanism between the two hosts.
+Enable Remus HA or COLO HA for domain. By default B<xl> relies on ssh as a
+transport mechanism between the two hosts.
 
 N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
+     COLO support in xl is still in experimental (proof-of-concept) phase.
+     There is no support for network or disk at the moment.
+
 B<OPTIONS>
 
 =over 4
@@ -498,6 +501,11 @@ Disable network output buffering. Requires enabling unsafe mode.
 
 Disable disk replication. Requires enabling unsafe mode.
 
+=item B<-c>
+
+Enable COLO HA. It is conflict with B<-i> and B<-b>, and memory
+checkpoint compression must be disabled.
+
 =back
 
 =item B<pause> I<domain-id>
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 1145ae4..7df2466 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -811,6 +811,22 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
         goto out;
     }
 
+    /* The caller must set this defbool */
+    if (libxl_defbool_is_default(info->colo)) {
+        LOG(ERROR, "colo mode must be enabled/disabled");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    if (libxl_defbool_val(info->colo)) {
+        libxl_defbool_setdefault(&info->compression, false);
+        if (libxl_defbool_val(info->compression)) {
+            LOG(ERROR, "cannot use memory checkpoint compression in COLO mode");
+            rc = ERROR_FAIL;
+            goto out;
+        }
+    }
+
     libxl_defbool_setdefault(&info->allow_unsafe, false);
     libxl_defbool_setdefault(&info->blackhole, false);
     libxl_defbool_setdefault(&info->compression, true);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index adfadd1..4bbadd3 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4273,6 +4273,9 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.send_fd = send_fd;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = remus;
+    if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
+        /* COLO uses stdout to send control message to master */
+        dom_info.quiet = 1;
 
     rc = create_domain(&dom_info);
     if (rc < 0) {
@@ -4287,7 +4290,8 @@ static void migrate_receive(int debug, int daemonize, int monitor,
         /* If we are here, it means that the sender (primary) has crashed.
          * TODO: Split-Brain Check.
          */
-        fprintf(stderr, "migration target: Remus Failover for domain %u\n",
+        fprintf(stderr, "migration target: %s Failover for domain %u\n",
+                remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
                 domid);
 
         /*
@@ -4304,15 +4308,21 @@ static void migrate_receive(int debug, int daemonize, int monitor,
             rc = libxl_domain_rename(ctx, domid, migration_domname,
                                      common_domname);
             if (rc)
-                fprintf(stderr, "migration target (Remus): "
+                fprintf(stderr, "migration target (%s): "
                         "Failed to rename domain from %s to %s:%d\n",
+                        remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
                         migration_domname, common_domname, rc);
         }
 
+        if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
+            /* The guest is running after failover in COLO mode */
+            exit(rc ? -ERROR_FAIL: 0);
+
         rc = libxl_domain_unpause(ctx, domid);
         if (rc)
-            fprintf(stderr, "migration target (Remus): "
+            fprintf(stderr, "migration target (%s): "
                     "Failed to unpause domain %s (id: %u):%d\n",
+                    remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
                     common_domname, domid, rc);
 
         exit(rc ? -ERROR_FAIL: 0);
@@ -4458,7 +4468,7 @@ int main_migrate_receive(int argc, char **argv)
     int debug = 0, daemonize = 1, monitor = 1, remus = 0;
     int opt;
 
-    SWITCH_FOREACH_OPT(opt, "Fedr", NULL, "migrate-receive", 0) {
+    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
     case 'F':
         daemonize = 0;
         break;
@@ -4470,8 +4480,10 @@ int main_migrate_receive(int argc, char **argv)
         debug = 1;
         break;
     case 'r':
-        remus = 1;
+        remus = LIBXL_CHECKPOINTED_STREAM_REMUS;
         break;
+    case 'c':
+        remus = LIBXL_CHECKPOINTED_STREAM_COLO;
     }
 
     if (argc-optind != 0) {
@@ -7958,15 +7970,18 @@ int main_remus(int argc, char **argv)
     pid_t child = -1;
     uint8_t *config_data;
     int config_len;
+    int interval = 0;
 
     memset(&r_info, 0, sizeof(libxl_domain_remus_info));
     /* Defaults */
     r_info.interval = 200;
     libxl_defbool_setdefault(&r_info.blackhole, false);
+    libxl_defbool_setdefault(&r_info.colo, false);
 
-    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:e", NULL, "remus", 2) {
+    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:ec", NULL, "remus", 2) {
     case 'i':
         r_info.interval = atoi(optarg);
+        interval = 1;
         break;
     case 'F':
         libxl_defbool_set(&r_info.allow_unsafe, true);
@@ -7992,11 +8007,28 @@ int main_remus(int argc, char **argv)
     case 'e':
         daemonize = 0;
         break;
+    case 'c':
+        libxl_defbool_set(&r_info.colo, true);
     }
 
     domid = find_domain(argv[optind]);
     host = argv[optind + 1];
 
+    if (libxl_defbool_val(r_info.colo)) {
+        if (!interval)
+            r_info.interval = 0;
+
+        if (r_info.interval || libxl_defbool_val(r_info.blackhole)) {
+            perror("option -c is conflict with -i or -b");
+            exit(-1);
+        }
+
+        if (libxl_defbool_is_default(r_info.compression)) {
+            perror("option -u must be specified when using COLO");
+            exit(-1);
+        }
+    }
+
     if (!r_info.netbufscript)
         r_info.netbufscript = default_remus_netbufscript;
 
@@ -8011,8 +8043,9 @@ int main_remus(int argc, char **argv)
         if (!ssh_command[0]) {
             rune = host;
         } else {
-            if (asprintf(&rune, "exec %s %s xl migrate-receive -r %s",
+            if (asprintf(&rune, "exec %s %s xl migrate-receive %s %s",
                          ssh_command, host,
+                         libxl_defbool_val(r_info.colo) ? "-c" : "-r",
                          daemonize ? "" : " -e") < 0)
                 return 1;
         }
@@ -8041,7 +8074,8 @@ int main_remus(int argc, char **argv)
      * domain to force failover
      */
     if (libxl_domain_info(ctx, 0, domid)) {
-        fprintf(stderr, "Remus: Primary domain has been destroyed.\n");
+        fprintf(stderr, "%s: Primary domain has been destroyed.\n",
+                libxl_defbool_val(r_info.colo) ? "COLO" : "Remus");
         close(send_fd);
         return 0;
     }
@@ -8053,7 +8087,8 @@ int main_remus(int argc, char **argv)
     if (rc == ERROR_GUEST_TIMEDOUT)
         fprintf(stderr, "Failed to suspend domain at primary.\n");
     else {
-        fprintf(stderr, "Remus: Backup failed? resuming domain at primary.\n");
+        fprintf(stderr, "%s: Backup failed? resuming domain at primary.\n",
+                libxl_defbool_val(r_info.colo) ? "COLO" : "Remus");
         libxl_domain_resume(ctx, domid, 1, 0);
     }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 7f4759b..611accf 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -515,7 +515,9 @@ struct cmd_spec cmd_table[] = {
       "-b                      Replicate memory checkpoints to /dev/null (blackhole).\n"
       "                        Works only in unsafe mode.\n"
       "-n                      Disable network output buffering. Works only in unsafe mode.\n"
-      "-d                      Disable disk replication. Works only in unsafe mode."
+      "-d                      Disable disk replication. Works only in unsafe mode.\n"
+      "-c                      Enable COLO HA. It is conflict with -i and -b, and memory\n"
+      "                        checkpoint must be disabled"
     },
 #endif
     { "devd",
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 08/15] Support colo mode for qemu disk
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (6 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 07/15] implement the cmdline for COLO Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-16 11:21   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 09/15] COLO: use qemu block replication Yang Hongyang
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

From: Wen Congyang <wency@cn.fujitsu.com>

Usage: disk = ['...,colo,colo-params=xxx,active-disk=xxx,hidden-disk=xxx...']
The format of colo-params: host:port:exportname=xx

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 docs/man/xl.pod.1           |   2 +-
 tools/libxl/libxl.c         |  42 ++++++-
 tools/libxl/libxl_create.c  |  25 ++++-
 tools/libxl/libxl_device.c  |  38 +++++++
 tools/libxl/libxl_dm.c      | 262 ++++++++++++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_types.idl |   5 +
 tools/libxl/libxlu_disk_l.l |   5 +
 7 files changed, 367 insertions(+), 12 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index f5e97d7..1c2ee24 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -454,7 +454,7 @@ N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
      COLO support in xl is still in experimental (proof-of-concept) phase.
-     There is no support for network or disk at the moment.
+     There is no support for network at the moment.
 
 B<OPTIONS>
 
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 7df2466..4a5957c 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2273,6 +2273,8 @@ int libxl__device_disk_setdefault(libxl__gc *gc, libxl_device_disk *disk)
     int rc;
 
     libxl_defbool_setdefault(&disk->discard_enable, !!disk->readwrite);
+    libxl_defbool_setdefault(&disk->colo_enable, false);
+    libxl_defbool_setdefault(&disk->colo_restore_enable, false);
 
     rc = libxl__resolve_domid(gc, disk->backend_domname, &disk->backend_domid);
     if (rc < 0) return rc;
@@ -2473,6 +2475,14 @@ static void device_disk_add(libxl__egc *egc, uint32_t domid,
                 flexarray_append(back, "params");
                 flexarray_append(back, libxl__sprintf(gc, "%s:%s",
                               libxl__device_disk_string_of_format(disk->format), disk->pdev_path));
+                if (libxl_defbool_val(disk->colo_enable)) {
+                    flexarray_append(back, "colo-params");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_params));
+                    flexarray_append(back, "active-disk");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->active_disk));
+                    flexarray_append(back, "hidden-disk");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->hidden_disk));
+                }
                 assert(device->backend_kind == LIBXL__DEVICE_KIND_QDISK);
                 break;
             default:
@@ -2587,7 +2597,10 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         goto cleanup;
     }
 
-    /* "params" may not be present; but everything else must be. */
+    /*
+     * "params" and "colo-params" may not be present; but everything
+     * else must be.
+     */
     tmp = xs_read(ctx->xsh, XBT_NULL,
                   libxl__sprintf(gc, "%s/params", be_path), &len);
     if (tmp && strchr(tmp, ':')) {
@@ -2597,6 +2610,33 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         disk->pdev_path = tmp;
     }
 
+    tmp = xs_read(ctx->xsh, XBT_NULL,
+                  libxl__sprintf(gc, "%s/colo-params", be_path), &len);
+    if (tmp) {
+        libxl_defbool_set(&disk->colo_enable, true);
+        disk->colo_params = tmp;
+    } else {
+        libxl_defbool_set(&disk->colo_enable, false);
+    }
+
+    if (libxl_defbool_val(disk->colo_enable)) {
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      libxl__sprintf(gc, "%s/active-disk", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/active-disk", be_path);
+            goto cleanup;
+        }
+        disk->active_disk = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      libxl__sprintf(gc, "%s/hidden-disk", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/hidden-disk", be_path);
+            goto cleanup;
+        }
+        disk->hidden_disk = tmp;
+    }
+
 
     tmp = libxl__xs_read(gc, XBT_NULL,
                          libxl__sprintf(gc, "%s/type", be_path));
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 6e307f3..17d0d18 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1727,12 +1727,29 @@ static void domain_create_cb(libxl__egc *egc,
 
     libxl__ao_complete(egc, ao, rc);
 }
-    
+
+static void set_disk_colo_restore(libxl_domain_config *d_config)
+{
+    int i;
+
+    for (i = 0; i < d_config->num_disks; i++)
+        libxl_defbool_set(&d_config->disks[i].colo_restore_enable, true);
+}
+
+static void unset_disk_colo_restore(libxl_domain_config *d_config)
+{
+    int i;
+
+    for (i = 0; i < d_config->num_disks; i++)
+        libxl_defbool_set(&d_config->disks[i].colo_restore_enable, false);
+}
+
 int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
+    unset_disk_colo_restore(d_config);
     return do_domain_create(ctx, d_config, domid, -1, -1, 0,
                             ao_how, aop_console_how);
 }
@@ -1745,8 +1762,12 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
 {
     int send_fd = -1;
 
-    if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO)
+    if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
         send_fd = params->send_fd;
+        set_disk_colo_restore(d_config);
+    } else {
+        unset_disk_colo_restore(d_config);
+    }
 
     return do_domain_create(ctx, d_config, domid, restore_fd, send_fd,
                             params->checkpointed_stream, ao_how, aop_console_how);
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 93bb41e..df29bc3 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -196,6 +196,10 @@ static int disk_try_backend(disk_try_backend_args *a,
             goto bad_format;
         }
 
+        if (libxl_defbool_val(a->disk->colo_enable) ||
+            a->disk->active_disk || a->disk->hidden_disk)
+            goto bad_colo;
+
         if (a->disk->backend_domid != LIBXL_TOOLSTACK_DOMID) {
             LOG(DEBUG, "Disk vdev=%s, is using a storage driver domain, "
                        "skipping physical device check", a->disk->vdev);
@@ -218,6 +222,10 @@ static int disk_try_backend(disk_try_backend_args *a,
     case LIBXL_DISK_BACKEND_TAP:
         if (a->disk->script) goto bad_script;
 
+        if (libxl_defbool_val(a->disk->colo_enable) ||
+            a->disk->active_disk || a->disk->hidden_disk)
+            goto bad_colo;
+
         if (a->disk->is_cdrom) {
             LOG(DEBUG, "Disk vdev=%s, backend tap unsuitable for cdroms",
                        a->disk->vdev);
@@ -236,6 +244,16 @@ static int disk_try_backend(disk_try_backend_args *a,
 
     case LIBXL_DISK_BACKEND_QDISK:
         if (a->disk->script) goto bad_script;
+        if (libxl_defbool_val(a->disk->colo_enable)) {
+            if (!a->disk->colo_params)
+                goto bad_colo_params;
+
+            if (!a->disk->active_disk)
+                goto bad_active_disk;
+
+            if (!a->disk->hidden_disk)
+                goto bad_hidden_disk;
+        }
         return backend;
 
     default:
@@ -256,6 +274,26 @@ static int disk_try_backend(disk_try_backend_args *a,
     LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
         a->disk->vdev, libxl_disk_backend_to_string(backend));
     return 0;
+
+ bad_colo:
+    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_params:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-params=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_active_disk:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs active-disk=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_hidden_disk:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs hidden-disk=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
 }
 
 int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 33f9ce6..ac97baa 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -427,6 +427,211 @@ static char *dm_spice_options(libxl__gc *gc,
     return opt;
 }
 
+/* colo mode */
+enum {
+    LIBXL__COLO_NONE = 0,
+    LIBXL__COLO_PRIMARY,
+    LIBXL__COLO_SECONDARY,
+};
+
+/* The format of colo-params: host:port:exportname=xx */
+static int parse_colo_params(libxl__gc *gc, const char *colo_params,
+                             const char **host, const char **port,
+                             const char **exportname)
+{
+    const char *delim;
+
+    delim = strstr(colo_params, ":");
+    if (!delim)
+        return 1;
+    if (delim == colo_params)
+        return 1;
+    *host = libxl__strndup(gc, colo_params, delim - colo_params);
+    colo_params = delim + 1;
+
+    delim = strstr(colo_params, ":");
+    if (!delim)
+        return 1;
+    if (delim == colo_params)
+        return 1;
+    *port = libxl__strndup(gc, colo_params, delim - colo_params);
+    colo_params = delim + 1;
+
+    if (strncmp(colo_params, "exportname=", strlen("exportname=")))
+        return 1;
+    *exportname = colo_params + strlen("exportname=");
+    if ((*exportname)[0] == 0)
+        return 1;
+
+    return 0;
+}
+
+static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
+                                         int unit, const char *format,
+                                         const libxl_device_disk *disk,
+                                         const char *nbd_target,
+                                         int colo_mode)
+{
+    char *drive = NULL;
+    const char *host = NULL, *port = NULL, *exportname = NULL;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    const char *colo_params = disk->colo_params;
+    const char *active_disk = disk->active_disk;
+    const char *hidden_disk = disk->hidden_disk;
+
+    switch (colo_mode) {
+    case LIBXL__COLO_NONE:
+        drive = libxl__sprintf
+            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
+             pdev_path, unit, format);
+        break;
+    case LIBXL__COLO_PRIMARY:
+        /*
+         * primary:
+         *  -dirve if=scsi,bus=0,unit=x,cache=writeback,driver=quorum,\
+         *  children.0.file.filename=pdev_path,\
+         *  children.0.driver=format,\
+         *  children.1.file.host=host,\
+         *  children.1.file.port=port,\
+         *  children.1.file.export=exportname,\
+         *  children.1.file.driver=nbd+colo,\
+         *  children.1.driver=raw,\
+         *  children.1.ignore-errors=on,\
+         *  read-pattern=fifo
+         */
+
+        if (parse_colo_params(gc, colo_params, &host, &port, &exportname))
+            break;
+
+        drive = libxl__sprintf
+            (gc, "if=scsi,bus=0,unit=%d,cache=writeback,driver=quorum,"
+                 "children.0.file.filename=%s,"
+                 "children.0.driver=%s,"
+                 "children.1.file.host=%s,"
+                 "children.1.file.port=%s,"
+                 "children.1.file.export=%s,"
+                 "children.1.file.driver=nbd+colo,"
+                 "children.1.driver=raw,"
+                 "children.1.ignore-errors=on,"
+                 "read-pattern=fifo",
+             unit, pdev_path, format, host, port, exportname);
+        break;
+    case LIBXL__COLO_SECONDARY:
+        /*
+         * secondary:
+         *  -drive if=scsi,bus=0,unit=x,cache=writeback,driver=qcow2+colo,\
+         *  file=active_disk,\
+         *  backing_reference.drive_id=nbd_target,\
+         *  backing_reference.hidden-disk.file.filename=hidden_disk,\
+         *  backing_reference.hidden-disk.allow-write-backing-file=on,\
+         *  export=exportname,
+         */
+
+        if (parse_colo_params(gc, colo_params, &host, &port, &exportname))
+            break;
+
+        drive = libxl__sprintf
+            (gc, "if=scsi,bus=0,unit=%d,cache=writeback,driver=qcow2+colo,"
+                 "file=%s,"
+                 "backing_reference.drive_id=%s,"
+                 "backing_reference.hidden-disk.file.filename=%s,"
+                 "backing_reference.hidden-disk.allow-write-backing-file=on,"
+                 "export=%s",
+             unit, active_disk, nbd_target, hidden_disk, exportname);
+        break;
+    default:
+        abort();
+    }
+
+    if (!drive)
+        LIBXL__LOG(ctx, LIBXL__LOG_WARNING,
+                   "colo-params is invalid for %s", pdev_path);
+    return drive;
+}
+
+static char *qemu_disk_ide_drive_string(libxl__gc *gc, const char *pdev_path,
+                                        int unit, const char *format,
+                                        const libxl_device_disk *disk,
+                                        const char *nbd_target,
+                                        int colo_mode)
+{
+    char *drive = NULL;
+    const char *host = NULL, *port = NULL, *exportname = NULL;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    const char *colo_params = disk->colo_params;
+    const char *active_disk = disk->active_disk;
+    const char *hidden_disk = disk->hidden_disk;
+
+    switch (colo_mode) {
+    case LIBXL__COLO_NONE:
+        drive = libxl__sprintf
+            (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
+             pdev_path, unit, format);
+        break;
+    case LIBXL__COLO_PRIMARY:
+        /*
+         * primary:
+         *  -dirve if=ide,index=x,media=disk,cache=writeback,driver=quorum,\
+         *  children.0.file.filename=pdev_path,\
+         *  children.0.driver=format,\
+         *  children.1.file.host=host,\
+         *  children.1.file.port=port,\
+         *  children.1.file.export=exportname,\
+         *  children.1.file.driver=nbd+colo,\
+         *  children.1.driver=raw,\
+         *  children.1.ignored-errors=on,\
+         *  read-pattern=fifo
+         */
+
+        if (parse_colo_params(gc, colo_params, &host, &port, &exportname))
+            break;
+
+        drive = libxl__sprintf
+            (gc, "if=ide,index=%d,media=disk,cache=writeback,driver=quorum,"
+                 "children.0.file.filename=%s,"
+                 "children.0.driver=%s,"
+                 "children.1.file.host=%s,"
+                 "children.1.file.port=%s,"
+                 "children.1.file.export=%s,"
+                 "children.1.file.driver=nbd+colo,"
+                 "children.1.driver=raw,"
+                 "children.1.ignore-errors=on,"
+                 "read-pattern=fifo",
+             unit, pdev_path, format, host, port, exportname);
+        break;
+    case LIBXL__COLO_SECONDARY:
+        /*
+         * secondary:
+         *  -drive if=ide,index=x,media=disk,cache=writeback,driver=qcow2+colo,\
+         *  file=active_disk,\
+         *  backing_reference.drive_id=nbd_target,\
+         *  backing_reference.hidden-disk.file.filename=hidden_disk,\
+         *  backing_reference.hidden-disk.allow-write-backing-file=on,\
+         *  export=exportname,
+         */
+
+        if (parse_colo_params(gc, colo_params, &host, &port, &exportname))
+            break;
+
+        drive = libxl__sprintf
+            (gc, "if=ide,index=%d,media=disk,cache=writeback,driver=qcow2+colo,"
+                 "file=%s,"
+                 "backing_reference.drive_id=%s,"
+                 "backing_reference.hidden-disk.file.filename=%s,"
+                 "backing_reference.hidden-disk.allow-write-backing-file=on,"
+                 "export=%s",
+             unit, active_disk, nbd_target, hidden_disk, exportname);
+        break;
+    default:
+        abort();
+    }
+
+    if (!drive)
+        LIBXL__LOG(ctx, LIBXL__LOG_WARNING,
+                   "colo-params is invalid for %s", pdev_path);
+    return drive;
+}
+
 static int libxl__build_device_model_args_new(libxl__gc *gc,
                                         const char *dm, int guest_domid,
                                         const libxl_domain_config *guest_config,
@@ -825,6 +1030,8 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
             const char *format = qemu_disk_format_string(disks[i].format);
             char *drive;
             const char *pdev_path;
+            int colo_mode;
+            char *drive_id;
 
             if (dev_number == -1) {
                 LIBXL__LOG(ctx, LIBXL__LOG_WARNING, "unable to determine"
@@ -868,16 +1075,55 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
                  * For other disks we translate devices 0..3 into
                  * hd[a-d] and ignore the rest.
                  */
-                if (strncmp(disks[i].vdev, "sd", 2) == 0)
-                    drive = libxl__sprintf
-                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
-                         pdev_path, disk, format);
-                else if (disk < 4)
+                if (libxl_defbool_val(disks[i].colo_enable)) {
+                    if (libxl_defbool_val(disks[i].colo_restore_enable))
+                        colo_mode = LIBXL__COLO_SECONDARY;
+                    else
+                        colo_mode = LIBXL__COLO_PRIMARY;
+                } else {
+                    colo_mode = LIBXL__COLO_NONE;
+                }
+
+                if (colo_mode == LIBXL__COLO_SECONDARY) {
+                    /*
+                     * -drive if=none,driver=format,file=pdev_path,\
+                     * id=nbd_targetx
+                     */
+                    if (strncmp(disks[i].vdev, "sd", 2) == 0) {
+                        drive_id = libxl__sprintf(gc, "nbd_target%d", disk + 4);
+                    } else if (disk < 4) {
+                        drive_id = libxl__sprintf(gc, "nbd_target%d", disk);
+                    } else {
+                        continue; /* Do not emulate this disk */
+                    }
                     drive = libxl__sprintf
-                        (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
-                         pdev_path, disk, format);
-                else
+                        (gc, "if=none,driver=%s,file=%s,id=%s",
+                         format, pdev_path, drive_id);
+
+                    flexarray_append(dm_args, "-drive");
+                    flexarray_append(dm_args, drive);
+                } else {
+                    drive_id = NULL;
+                }
+
+                if (strncmp(disks[i].vdev, "sd", 2) == 0) {
+                    drive = qemu_disk_scsi_drive_string(gc, pdev_path, disk,
+                                                        format,
+                                                        &disks[i],
+                                                        drive_id,
+                                                        colo_mode);
+                } else if (disk < 4) {
+                    drive = qemu_disk_ide_drive_string(gc, pdev_path, disk,
+                                                       format,
+                                                       &disks[i],
+                                                       drive_id,
+                                                       colo_mode);
+                } else {
                     continue; /* Do not emulate this disk */
+                }
+
+                if (!drive)
+                    continue;
             }
 
             flexarray_append(dm_args, "-drive");
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 7f07f8b..1e6b5ae 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -516,6 +516,11 @@ libxl_device_disk = Struct("device_disk", [
     ("is_cdrom", integer),
     ("direct_io_safe", bool),
     ("discard_enable", libxl_defbool),
+    ("colo_enable", libxl_defbool),
+    ("colo_restore_enable", libxl_defbool),
+    ("colo_params", string),
+    ("active_disk", string),
+    ("hidden_disk", string)
     ])
 
 libxl_device_nic = Struct("device_nic", [
diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index 1a5deb5..566aa1e 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -176,6 +176,11 @@ script=[^,]*,?	{ STRIP(','); SAVESTRING("script", script, FROMEQUALS); }
 direct-io-safe,? { DPC->disk->direct_io_safe = 1; }
 discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, true); }
 no-discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, false); }
+colo,?		{ libxl_defbool_set(&DPC->disk->colo_enable, true); }
+no-colo,?	{ libxl_defbool_set(&DPC->disk->colo_enable, false); }
+colo-params=[^,]*,?	{ STRIP(','); SAVESTRING("colo-params", colo_params, FROMEQUALS); }
+active-disk=[^,]*,?	{ STRIP(','); SAVESTRING("active-disk", active_disk, FROMEQUALS); }
+hidden-disk=[^,]*,?	{ STRIP(','); SAVESTRING("hidden-disk", hidden_disk, FROMEQUALS); }
 
  /* the target magic parameter, eats the rest of the string */
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 09/15] COLO: use qemu block replication
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (7 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 08/15] Support colo mode for qemu disk Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-16 11:22   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module Yang Hongyang
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

From: Wen Congyang <wency@cn.fujitsu.com>

The guest should be paused before doing COLO!!!

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/Makefile             |   1 +
 tools/libxl/libxl_colo_qdisk.c   | 209 +++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_colo_restore.c |  21 +++-
 tools/libxl/libxl_colo_save.c    |  36 ++++++-
 tools/libxl/libxl_internal.h     |  18 ++++
 tools/libxl/libxl_qmp.c          |  31 ++++++
 6 files changed, 312 insertions(+), 4 deletions(-)
 create mode 100644 tools/libxl/libxl_colo_qdisk.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 88c5426..d93b271 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -58,6 +58,7 @@ endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
+LIBXL_OBJS-y += libxl_colo_qdisk.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo_qdisk.c b/tools/libxl/libxl_colo_qdisk.c
new file mode 100644
index 0000000..d73572e
--- /dev/null
+++ b/tools/libxl/libxl_colo_qdisk.c
@@ -0,0 +1,209 @@
+/*
+ * Copyright (C) 2015 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+typedef struct libxl__colo_qdisk {
+    libxl__checkpoint_device *dev;
+} libxl__colo_qdisk;
+
+/* ========== init() and cleanup() ========== */
+int init_subkind_qdisk(libxl__checkpoint_devices_state *cds)
+{
+    /*
+     * We don't know if we use qemu block replication, so
+     * we cannot start block replication here.
+     */
+    return 0;
+}
+
+void cleanup_subkind_qdisk(libxl__checkpoint_devices_state *cds)
+{
+}
+
+/* ========== setup() and teardown() ========== */
+static void colo_qdisk_setup(libxl__egc *egc, libxl__checkpoint_device *dev,
+                             bool primary)
+{
+    const libxl_device_disk *disk = dev->backend_dev;
+    const char *addr = NULL;
+    const char *export_name;
+    int ret, rc = 0;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = dev->cds;
+    const char *colo_params = disk->colo_params;
+    const int domid = cds->domid;
+
+    EGC_GC;
+
+    if (disk->backend != LIBXL_DISK_BACKEND_QDISK ||
+        !libxl_defbool_val(disk->colo_enable)) {
+        rc = ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH;
+        goto out;
+    }
+
+    export_name = strstr(colo_params, ":exportname=");
+    if (!export_name) {
+        rc = ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH;
+        goto out;
+    }
+    export_name += strlen(":exportname=");
+    if (export_name[0] == 0) {
+        rc = ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH;
+        goto out;
+    }
+
+    dev->matched = 1;
+
+    if (primary) {
+        /* NBD server is not ready, so we cannot start block replication now */
+        goto out;
+    } else {
+        libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+        int len;
+
+        if (crs->qdisk_setuped)
+            goto out;
+
+        crs->qdisk_setuped = true;
+
+        len = export_name - strlen(":exportname=") - colo_params;
+        addr = libxl__strndup(gc, colo_params, len);
+    }
+
+    ret = libxl__qmp_block_start_replication(gc, domid, primary, addr);
+    if (ret)
+        rc = ERROR_FAIL;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void colo_qdisk_teardown(libxl__egc *egc, libxl__checkpoint_device *dev,
+                                bool primary)
+{
+    int ret, rc = 0;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = dev->cds;
+    const int domid = cds->domid;
+
+    EGC_GC;
+
+    if (primary) {
+        libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+
+        if (!css->qdisk_setuped)
+            goto out;
+
+        css->qdisk_setuped = false;
+    } else {
+        libxl__colo_restore_state *crs = CONTAINER_OF(cds, *crs, cds);
+
+        if (!crs->qdisk_setuped)
+            goto out;
+
+        crs->qdisk_setuped = false;
+    }
+
+    ret = libxl__qmp_block_stop_replication(gc, domid, primary);
+    if (ret)
+        rc = ERROR_FAIL;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+/* ========== checkpointing APIs ========== */
+/* should be called after libxl__checkpoint_device_instance_ops.preresume */
+int colo_qdisk_preresume(libxl_ctx *ctx, domid_t domid)
+{
+    GC_INIT(ctx);
+    int ret;
+
+    ret = libxl__qmp_block_do_checkpoint(gc, domid);
+
+    GC_FREE;
+    return ret;
+}
+
+static void colo_qdisk_save_preresume(libxl__egc *egc,
+                                      libxl__checkpoint_device *dev)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(dev->cds, *css, cds);
+    int ret, rc = 0;
+
+    /* Convenience aliases */
+    const int domid = dev->cds->domid;
+
+    EGC_GC;
+
+    if (css->qdisk_setuped)
+        goto out;
+
+    css->qdisk_setuped = true;
+
+    ret = libxl__qmp_block_start_replication(gc, domid, true, NULL);
+    if (ret)
+        rc = ERROR_FAIL;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+/* ======== primary ======== */
+static void colo_qdisk_save_setup(libxl__egc *egc,
+                                  libxl__checkpoint_device *dev)
+{
+    colo_qdisk_setup(egc, dev, true);
+}
+
+static void colo_qdisk_save_teardown(libxl__egc *egc,
+                                   libxl__checkpoint_device *dev)
+{
+    colo_qdisk_teardown(egc, dev, true);
+}
+
+const libxl__checkpoint_device_instance_ops colo_save_device_qdisk = {
+    .kind = LIBXL__DEVICE_KIND_VBD,
+    .setup = colo_qdisk_save_setup,
+    .teardown = colo_qdisk_save_teardown,
+    .preresume = colo_qdisk_save_preresume,
+};
+
+/* ======== secondary ======== */
+static void colo_qdisk_restore_setup(libxl__egc *egc,
+                                     libxl__checkpoint_device *dev)
+{
+    colo_qdisk_setup(egc, dev, false);
+}
+
+static void colo_qdisk_restore_teardown(libxl__egc *egc,
+                                      libxl__checkpoint_device *dev)
+{
+    colo_qdisk_teardown(egc, dev, false);
+}
+
+const libxl__checkpoint_device_instance_ops colo_restore_device_qdisk = {
+    .kind = LIBXL__DEVICE_KIND_VBD,
+    .setup = colo_qdisk_restore_setup,
+    .teardown = colo_qdisk_restore_teardown,
+};
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index c613c15..6731bd0 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -65,7 +65,10 @@ static void libxl__colo_restore_domain_resume_callback(void *data);
 static void libxl__colo_restore_domain_checkpoint_callback(void *data);
 static void libxl__colo_restore_domain_suspend_callback(void *data);
 
+extern const libxl__checkpoint_device_instance_ops colo_restore_device_qdisk;
+
 static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
+    &colo_restore_device_qdisk,
     NULL,
 };
 
@@ -164,7 +167,11 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
+    rc = init_subkind_qdisk(cds);
+    if (rc)  goto out;
+
     rc = 0;
+out:
     return rc;
 }
 
@@ -172,6 +179,8 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
 {
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
+
+    cleanup_subkind_qdisk(cds);
 }
 
 
@@ -282,6 +291,8 @@ void libxl__colo_restore_setup(libxl__egc *egc,
     logdirty_init(&crcs->lds);
     crcs->lds.ao = ao;
 
+    crs->qdisk_setuped = false;
+
     rc = 0;
 
 out:
@@ -590,6 +601,12 @@ static void colo_restore_preresume_cb(libxl__egc *egc,
         goto out;
     }
 
+    rc = colo_qdisk_preresume(CTX, crs->domid);
+    if (rc) {
+        LOG(ERROR, "colo_qdisk_preresume() fails");
+        goto out;
+    }
+
     colo_restore_resume_vm(egc, crcs);
 
     return;
@@ -775,8 +792,8 @@ static void colo_setup_checkpoint_devices(libxl__egc *egc,
 
     STATE_AO_GC(crs->ao);
 
-    /* TODO: disk/nic support */
-    cds->device_kind_flags = 0;
+    /* TODO: nic support */
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
     cds->callback = colo_restore_setup_cds_done;
     cds->ao = ao;
     cds->domid = crs->domid;
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index 153ec57..80fd605 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -19,7 +19,10 @@
 #include "libxl_internal.h"
 #include "libxl_colo.h"
 
+extern const libxl__checkpoint_device_instance_ops colo_save_device_qdisk;
+
 static const libxl__checkpoint_device_instance_ops *colo_ops[] = {
+    &colo_save_device_qdisk,
     NULL,
 };
 
@@ -30,7 +33,11 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
+    rc = init_subkind_qdisk(cds);
+    if (rc) goto out;
+
     rc = 0;
+out:
     return rc;
 }
 
@@ -38,6 +45,8 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
 {
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
+
+    cleanup_subkind_qdisk(cds);
 }
 
 /* ================= colo: setup save environment ================= */
@@ -65,9 +74,11 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     css->send_fd = dss->fd;
     css->recv_fd = dss->recv_fd;
     css->svm_running = false;
+    css->paused = true;
+    css->qdisk_setuped = false;
 
-    /* TODO: disk/nic support */
-    cds->device_kind_flags = 0;
+    /* TODO: nic support */
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
     cds->ops = colo_ops;
     cds->callback = colo_save_setup_done;
     cds->ao = ao;
@@ -453,12 +464,33 @@ static void colo_preresume_cb(libxl__egc *egc,
         goto out;
     }
 
+    if (!css->paused) {
+        rc = colo_qdisk_preresume(CTX, dss->domid);
+        if (rc) {
+            LOG(ERROR, "colo_qdisk_preresume() fails");
+            goto out;
+        }
+    }
+
     /* Resumes the domain and the device model */
     if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1)) {
         LOG(ERROR, "cannot resume primary vm");
         goto out;
     }
 
+    /*
+     * The guest should be paused before doing colo because there is
+     * no disk migration.
+     */
+    if (css->paused) {
+        rc = libxl__domain_unpause(gc, dss->domid);
+        if (rc) {
+            LOG(ERROR, "cannot unpause primary vm");
+            goto out;
+        }
+        css->paused = false;
+    }
+
     /* read LIBXL_COLO_SVM_RESUMED */
     memset(dc, 0, sizeof(*dc));
     dc->ao = ao;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 1acea97..f07d8d9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1661,6 +1661,14 @@ _hidden int libxl__qmp_set_global_dirty_log(libxl__gc *gc, int domid, bool enabl
 _hidden int libxl__qmp_insert_cdrom(libxl__gc *gc, int domid, const libxl_device_disk *disk);
 /* Add a virtual CPU */
 _hidden int libxl__qmp_cpu_add(libxl__gc *gc, int domid, int index);
+/* Start block replication */
+_hidden int libxl__qmp_block_start_replication(libxl__gc *gc, int domid,
+                                               bool primary, const char *addr);
+/* Do block checkpoint */
+_hidden int libxl__qmp_block_do_checkpoint(libxl__gc *gc, int domid);
+/* Stop block replication */
+_hidden int libxl__qmp_block_stop_replication(libxl__gc *gc, int domid,
+                                              bool primary);
 /* close and free the QMP handler */
 _hidden void libxl__qmp_close(libxl__qmp_handler *qmp);
 /* remove the socket file, if the file has already been removed,
@@ -2733,6 +2741,9 @@ int init_subkind_nic(libxl__checkpoint_devices_state *cds);
 void cleanup_subkind_nic(libxl__checkpoint_devices_state *cds);
 int init_subkind_drbd_disk(libxl__checkpoint_devices_state *cds);
 void cleanup_subkind_drbd_disk(libxl__checkpoint_devices_state *cds);
+int init_subkind_qdisk(libxl__checkpoint_devices_state *cds);
+void cleanup_subkind_qdisk(libxl__checkpoint_devices_state *cds);
+int colo_qdisk_preresume(libxl_ctx *ctx, domid_t domid);
 
 typedef void libxl__checkpoint_callback(libxl__egc *,
                                         libxl__checkpoint_devices_state *,
@@ -2857,6 +2868,10 @@ struct libxl__colo_save_state {
     uint8_t temp_buff[9];
     void (*callback)(libxl__egc *, libxl__colo_save_state *);
     bool svm_running;
+    bool paused;
+
+    /* private, used by qdisk block replication */
+    bool qdisk_setuped;
 };
 
 /*----- Domain suspend (save) state structure -----*/
@@ -3195,6 +3210,9 @@ struct libxl__colo_restore_state {
     libxl__domain_create_cb *saved_cb;
     void *crcs;
     libxl__checkpoint_devices_state cds;
+
+    /* private, used by qdisk block replication */
+    bool qdisk_setuped;
 };
 
 struct libxl__domain_create_state {
diff --git a/tools/libxl/libxl_qmp.c b/tools/libxl/libxl_qmp.c
index a6f1a21..9714bdf 100644
--- a/tools/libxl/libxl_qmp.c
+++ b/tools/libxl/libxl_qmp.c
@@ -965,6 +965,37 @@ int libxl__qmp_cpu_add(libxl__gc *gc, int domid, int idx)
     return qmp_run_command(gc, domid, "cpu-add", args, NULL, NULL);
 }
 
+int libxl__qmp_block_start_replication(libxl__gc *gc, int domid,
+                                       bool primary, const char *addr)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_bool(gc, &args, "enable", true);
+    qmp_parameters_add_bool(gc, &args, "primary", primary);
+    if (!primary)
+        qmp_parameters_add_string(gc, &args, "addr", addr);
+
+    return qmp_run_command(gc, domid, "xen-set-block-replication", args,
+                           NULL, NULL);
+}
+
+int libxl__qmp_block_do_checkpoint(libxl__gc *gc, int domid)
+{
+    return qmp_run_command(gc, domid, "xen-do-block-checkpoint", NULL,
+                           NULL, NULL);
+}
+
+int libxl__qmp_block_stop_replication(libxl__gc *gc, int domid, bool primary)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_bool(gc, &args, "enable", false);
+    qmp_parameters_add_bool(gc, &args, "primary", primary);
+
+    return qmp_run_command(gc, domid, "xen-set-block-replication", args,
+                           NULL, NULL);
+}
+
 int libxl__qmp_initializations(libxl__gc *gc, uint32_t domid,
                                const libxl_domain_config *guest_config)
 {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (8 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 09/15] COLO: use qemu block replication Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-16 11:24   ` Ian Campbell
  2015-06-08  3:45 ` [PATCH v6 COLO 11/15] COLO proxy: preresume, postresume and checkpoint Yang Hongyang
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

setup/teardown of COLO proxy module.
we use netlink to communicate with proxy module.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/libxl/Makefile           |   1 +
 tools/libxl/libxl_colo.h       |   2 +
 tools/libxl/libxl_colo_proxy.c | 210 +++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_internal.h   |  12 +++
 4 files changed, 225 insertions(+)
 create mode 100644 tools/libxl/libxl_colo_proxy.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index d93b271..b45fe62 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -59,6 +59,7 @@ endif
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 LIBXL_OBJS-y += libxl_colo_qdisk.o
+LIBXL_OBJS-y += libxl_colo_proxy.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 26a2563..5983aa0 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -45,4 +45,6 @@ extern void libxl__colo_save_teardown(libxl__egc *egc,
                                       libxl__colo_save_state *css,
                                       int rc);
 
+extern int colo_proxy_setup(libxl__colo_proxy_state *cps);
+extern void colo_proxy_teardown(libxl__colo_proxy_state *cps);
 #endif
diff --git a/tools/libxl/libxl_colo_proxy.c b/tools/libxl/libxl_colo_proxy.c
new file mode 100644
index 0000000..9f1243e
--- /dev/null
+++ b/tools/libxl/libxl_colo_proxy.c
@@ -0,0 +1,210 @@
+/*
+ * Copyright (C) 2015 FUJITSU LIMITED
+ * Author: Yang Hongyang <yanghy@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+#include <linux/netlink.h>
+
+#define NETLINK_COLO 28
+
+enum colo_netlink_op {
+    COLO_QUERY_CHECKPOINT = (NLMSG_MIN_TYPE + 1),
+    COLO_CHECKPOINT,
+    COLO_FAILOVER,
+    COLO_PROXY_INIT,
+    COLO_PROXY_RESET, /* UNUSED, will be used for continuous FT */
+};
+
+/* ========= colo-proxy: helper functions ========== */
+
+static int colo_proxy_send(libxl__colo_proxy_state *cps, uint8_t *buff, uint64_t size, int type)
+{
+    struct sockaddr_nl sa;
+    struct nlmsghdr msg;
+    struct iovec iov;
+    struct msghdr mh;
+    int ret;
+
+    STATE_AO_GC(cps->ao);
+
+    memset(&sa, 0, sizeof(sa));
+    sa.nl_family = AF_NETLINK;
+    sa.nl_pid = 0;
+    sa.nl_groups = 0;
+
+    msg.nlmsg_len = NLMSG_SPACE(0);
+    msg.nlmsg_flags = NLM_F_REQUEST;
+    if (type == COLO_PROXY_INIT) {
+        msg.nlmsg_flags |= NLM_F_ACK;
+    }
+    msg.nlmsg_seq = 0;
+    /* This is untrusty */
+    msg.nlmsg_pid = cps->index;
+    msg.nlmsg_type = type;
+
+    iov.iov_base = &msg;
+    iov.iov_len = msg.nlmsg_len;
+
+    mh.msg_name = &sa;
+    mh.msg_namelen = sizeof(sa);
+    mh.msg_iov = &iov;
+    mh.msg_iovlen = 1;
+    mh.msg_control = NULL;
+    mh.msg_controllen = 0;
+    mh.msg_flags = 0;
+
+    ret = sendmsg(cps->sock_fd, &mh, 0);
+    if (ret <= 0) {
+        LOG(ERROR, "can't send msg to kernel by netlink: %s",
+            strerror(errno));
+    }
+
+    return ret;
+}
+
+/* error: return -1, otherwise return 0 */
+static int64_t colo_proxy_recv(libxl__colo_proxy_state *cps, uint8_t **buff, int flags)
+{
+    struct sockaddr_nl sa;
+    struct iovec iov;
+    struct msghdr mh = {
+        .msg_name = &sa,
+        .msg_namelen = sizeof(sa),
+        .msg_iov = &iov,
+        .msg_iovlen = 1,
+    };
+    uint32_t size = 16384;
+    int64_t len = 0;
+    int ret;
+
+    STATE_AO_GC(cps->ao);
+    uint8_t *tmp = libxl__malloc(NOGC, size);
+
+    iov.iov_base = tmp;
+    iov.iov_len = size;
+next:
+   ret = recvmsg(cps->sock_fd, &mh, flags);
+    if (ret <= 0) {
+        goto out;
+    }
+
+    len += ret;
+    if (mh.msg_flags & MSG_TRUNC) {
+        size += 16384;
+        tmp = libxl__realloc(NOGC, tmp, size);
+        iov.iov_base = tmp + len;
+        iov.iov_len = size - len;
+        goto next;
+    }
+
+    *buff = tmp;
+    return len;
+
+out:
+    free(tmp);
+    *buff = NULL;
+    return ret;
+}
+
+/* ========= colo-proxy: setup and teardown ========== */
+
+int colo_proxy_setup(libxl__colo_proxy_state *cps)
+{
+    int skfd = 0;
+    struct sockaddr_nl sa;
+    struct nlmsghdr *h;
+    struct timeval tv = {0, 500000}; /* timeout for recvmsg from kernel */
+    int i = 1;
+    int ret = ERROR_FAIL;
+    uint8_t *buff = NULL;
+    int64_t size;
+
+    STATE_AO_GC(cps->ao);
+
+    skfd = socket(PF_NETLINK, SOCK_RAW, NETLINK_COLO);
+    if (skfd < 0) {
+        LOG(ERROR, "can not create a netlink socket: %s", strerror(errno));
+        goto out;
+    }
+    cps->sock_fd = skfd;
+    memset(&sa, 0, sizeof(sa));
+    sa.nl_family = AF_NETLINK;
+    sa.nl_groups = 0;
+retry:
+    sa.nl_pid = i++;
+
+    if (i > 10) {
+        LOG(ERROR, "netlink bind error");
+        goto out;
+    }
+
+    ret = bind(skfd, (struct sockaddr *)&sa, sizeof(sa));
+    if (ret < 0 && errno == EADDRINUSE) {
+        LOG(ERROR, "colo index %d has already in used", sa.nl_pid);
+        goto retry;
+    }
+
+    cps->index = sa.nl_pid;
+    ret = colo_proxy_send(cps, NULL, 0, COLO_PROXY_INIT);
+    if (ret < 0) {
+        goto out;
+    }
+    setsockopt(cps->sock_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+    ret = -1;
+    size = colo_proxy_recv(cps, &buff, 0);
+    /* disable SO_RCVTIMEO */
+    tv.tv_usec = 0;
+    setsockopt(cps->sock_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+    if (size < 0) {
+        LOG(ERROR, "Can't recv msg from kernel by netlink: %s",
+            strerror(errno));
+        goto out;
+    }
+
+    if (size) {
+        h = (struct nlmsghdr *)buff;
+
+        if (h->nlmsg_type == NLMSG_ERROR) {
+            struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(h);
+            if (size - sizeof(*h) < sizeof(*err)) {
+                goto out;
+            }
+            ret = -err->error;
+            if (ret) {
+                goto out;
+            }
+        }
+    }
+
+    ret = 0;
+
+out:
+    free(buff);
+    if (ret) {
+        close(cps->sock_fd);
+        cps->sock_fd = -1;
+    }
+    return ret;
+}
+
+void colo_proxy_teardown(libxl__colo_proxy_state *cps)
+{
+    if (cps->sock_fd >= 0) {
+        close(cps->sock_fd);
+        cps->sock_fd = -1;
+    }
+}
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index f07d8d9..68d1db9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2853,6 +2853,15 @@ struct libxl__remus_state {
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- colo related state structure -----*/
+typedef struct libxl__colo_proxy_state libxl__colo_proxy_state;
+struct libxl__colo_proxy_state {
+    /* set by caller of colo_proxy_setup */
+    libxl__ao *ao;
+
+    int sock_fd;
+    int index;
+};
+
 typedef struct libxl__colo_save_state libxl__colo_save_state;
 struct libxl__colo_save_state {
     libxl__checkpoint_devices_state cds;
@@ -2872,6 +2881,9 @@ struct libxl__colo_save_state {
 
     /* private, used by qdisk block replication */
     bool qdisk_setuped;
+
+    /* private, used by colo-proxy */
+    libxl__colo_proxy_state cps;
 };
 
 /*----- Domain suspend (save) state structure -----*/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 11/15] COLO proxy: preresume, postresume and checkpoint
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (9 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind Yang Hongyang
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

preresume, postresume and checkpoint

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/libxl/libxl_colo.h       |  3 +++
 tools/libxl/libxl_colo_proxy.c | 57 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 5983aa0..872c652 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -47,4 +47,7 @@ extern void libxl__colo_save_teardown(libxl__egc *egc,
 
 extern int colo_proxy_setup(libxl__colo_proxy_state *cps);
 extern void colo_proxy_teardown(libxl__colo_proxy_state *cps);
+extern void colo_proxy_preresume(libxl__colo_proxy_state *cps);
+extern void colo_proxy_postresume(libxl__colo_proxy_state *cps);
+extern int colo_proxy_checkpoint(libxl__colo_proxy_state *cps);
 #endif
diff --git a/tools/libxl/libxl_colo_proxy.c b/tools/libxl/libxl_colo_proxy.c
index 9f1243e..c8ff722 100644
--- a/tools/libxl/libxl_colo_proxy.c
+++ b/tools/libxl/libxl_colo_proxy.c
@@ -208,3 +208,60 @@ void colo_proxy_teardown(libxl__colo_proxy_state *cps)
         cps->sock_fd = -1;
     }
 }
+
+/* ========= colo-proxy: preresume, postresume and checkpoint ========== */
+
+void colo_proxy_preresume(libxl__colo_proxy_state *cps)
+{
+    colo_proxy_send(cps, NULL, 0, COLO_CHECKPOINT);
+    /* TODO: need to handle if the call fails... */
+}
+
+void colo_proxy_postresume(libxl__colo_proxy_state *cps)
+{
+    /* nothing to do... */
+}
+
+
+typedef struct colo_msg {
+    bool is_checkpoint;
+} colo_msg;
+
+/*
+do checkpoint: return 1
+error: return -1
+do not checkpoint: return 0
+*/
+int colo_proxy_checkpoint(libxl__colo_proxy_state *cps)
+{
+    uint8_t *buff;
+    int64_t size;
+    struct nlmsghdr *h;
+    struct colo_msg *m;
+    int ret = -1;
+
+    size = colo_proxy_recv(cps, &buff, MSG_DONTWAIT);
+
+    /* timeout, return no checkpoint message. */
+    if (size <= 0) {
+        return 0;
+    }
+
+    h = (struct nlmsghdr *) buff;
+
+    if (h->nlmsg_type == NLMSG_ERROR) {
+        goto out;
+    }
+
+    if (h->nlmsg_len < NLMSG_LENGTH(sizeof(*m))) {
+        goto out;
+    }
+
+    m = NLMSG_DATA(h);
+
+    ret = m->is_checkpoint ? 1 : 0;
+
+out:
+    free(buff);
+    return ret;
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (10 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 11/15] COLO proxy: preresume, postresume and checkpoint Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-12 14:35   ` Wei Liu
  2015-06-08  3:45 ` [PATCH v6 COLO 13/15] setup and control colo proxy on primary side Yang Hongyang
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

implement COLO nic subkind.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/hotplug/Linux/Makefile         |   1 +
 tools/hotplug/Linux/colo-proxy-setup | 131 +++++++++++++++
 tools/libxl/Makefile                 |   1 +
 tools/libxl/libxl_colo_nic.c         | 317 +++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_internal.h         |   5 +
 tools/libxl/libxl_types.idl          |   1 +
 6 files changed, 456 insertions(+)
 create mode 100755 tools/hotplug/Linux/colo-proxy-setup
 create mode 100644 tools/libxl/libxl_colo_nic.c

diff --git a/tools/hotplug/Linux/Makefile b/tools/hotplug/Linux/Makefile
index d94a9cb..1c28bea 100644
--- a/tools/hotplug/Linux/Makefile
+++ b/tools/hotplug/Linux/Makefile
@@ -25,6 +25,7 @@ XEN_SCRIPTS += vscsi
 XEN_SCRIPTS += block-iscsi
 XEN_SCRIPTS += block-drbd-probe
 XEN_SCRIPTS += $(XEN_SCRIPTS-y)
+XEN_SCRIPTS += colo-proxy-setup
 
 SUBDIRS-$(CONFIG_SYSTEMD) += systemd
 
diff --git a/tools/hotplug/Linux/colo-proxy-setup b/tools/hotplug/Linux/colo-proxy-setup
new file mode 100755
index 0000000..08a93de
--- /dev/null
+++ b/tools/hotplug/Linux/colo-proxy-setup
@@ -0,0 +1,131 @@
+#! /bin/bash
+
+dir=$(dirname "$0")
+. "$dir/xen-hotplug-common.sh"
+. "$dir/hotplugpath.sh"
+. "$dir/xen-network-ft.sh"
+
+findCommand "$@"
+
+if [ "$command" != "setup" -a  "$command" != "teardown" ]
+then
+    echo "Invalid command: $command"
+    log err "Invalid command: $command"
+    exit 1
+fi
+
+evalVariables "$@"
+
+: ${vifname:?}
+: ${forwarddev:?}
+: ${mode:?}
+: ${index:?}
+: ${bridge:?}
+
+forwardbr="colobr0"
+
+if [ "$mode" != "primary" -a "$mode" != "secondary" ]
+then
+    echo "Invalid mode: $mode"
+    log err "Invalid mode: $mode"
+    exit 1
+fi
+
+if [ $index -lt 0 ] || [ $index -gt 100 ]; then
+    echo "index overflow"
+    exit 1
+fi
+
+function setup_primary()
+{
+    do_without_error tc qdisc add dev $vifname root handle 1: prio
+    do_without_error tc filter add dev $vifname parent 1: protocol ip prio 10 \
+        u32 match u32 0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter add dev $vifname parent 1: protocol arp prio 11 \
+        u32 match u32 0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter add dev $vifname parent 1: protocol ipv6 prio \
+        12 u32 match u32 0 0 flowid 1:2 action mirred egress mirror \
+        dev $forwarddev
+
+    do_without_error modprobe nf_conntrack_ipv4
+    do_without_error modprobe xt_PMYCOLO sec_dev=$forwarddev
+
+    do_without_error /usr/local/sbin/iptables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j PMYCOLO --index $index
+    do_without_error /usr/local/sbin/ip6tables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j PMYCOLO --index $index
+    do_without_error /usr/local/sbin/arptables -I INPUT -i $forwarddev -j MARK --set-mark $index
+}
+
+function teardown_primary()
+{
+    do_without_error tc filter del dev $vifname parent 1: protocol ip prio 10 u32 match u32 \
+        0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter del dev $vifname parent 1: protocol arp prio 11 u32 match u32 \
+        0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter del dev $vifname parent 1: protocol ipv6 prio 12 u32 match u32 \
+        0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc qdisc del dev $vifname root handle 1: prio
+
+    do_without_error /usr/local/sbin/iptables -t mangle -F
+    do_without_error /usr/local/sbin/ip6tables -t mangle -F
+    do_without_error /usr/local/sbin/arptables -F
+    do_without_error rmmod xt_PMYCOLO
+}
+
+function setup_secondary()
+{
+    do_without_error brctl delif $bridge $vifname
+    do_without_error brctl addbr $forwardbr
+    do_without_error brctl addif $forwardbr $vifname
+    do_without_error brctl addif $forwardbr $forwarddev
+    do_without_error modprobe xt_SECCOLO
+
+    do_without_error /usr/local/sbin/iptables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j SECCOLO --index $index
+    do_without_error /usr/local/sbin/ip6tables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j SECCOLO --index $index
+}
+
+function teardown_secondary()
+{
+    do_without_error brctl delif $forwardbr $forwarddev
+    do_without_error brctl delif $forwardbr $vifname
+    do_without_error brctl delbr $forwardbr
+    do_without_error brctl addif $bridge $vifname
+
+    do_without_error /usr/local/sbin/iptables -t mangle -F
+    do_without_error /usr/local/sbin/ip6tables -t mangle -F
+    do_without_error rmmod xt_SECCOLO
+}
+
+case "$command" in
+    setup)
+        if [ "$mode" = "primary" ]
+        then
+            setup_primary
+        else
+            setup_secondary
+        fi
+
+        success
+        ;;
+    teardown)
+        if [ "$mode" = "primary" ]
+        then
+            teardown_primary
+        else
+            teardown_secondary
+        fi
+        ;;
+esac
+
+if [ "$mode" = "primary" ]
+then
+    log debug "Successful colo-proxy-setup $command for $vifname." \
+              " vifname: $vifname, index: $index, forwarddev: $forwarddev."
+else
+    log debug "Successful colo-proxy-setup $command for $vifname." \
+              " vifname: $vifname, index: $index, forwarddev: $forwarddev,"\
+              " forwardbr: $forwardbr."
+fi
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index b45fe62..c92bb59 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -60,6 +60,7 @@ LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 LIBXL_OBJS-y += libxl_colo_qdisk.o
 LIBXL_OBJS-y += libxl_colo_proxy.o
+LIBXL_OBJS-y += libxl_colo_nic.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo_nic.c b/tools/libxl/libxl_colo_nic.c
new file mode 100644
index 0000000..6bbbded
--- /dev/null
+++ b/tools/libxl/libxl_colo_nic.c
@@ -0,0 +1,317 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+typedef struct libxl__colo_device_nic {
+    int devid;
+    const char *vif;
+} libxl__colo_device_nic;
+
+enum {
+    primary,
+    secondary,
+};
+
+
+/* ========== init() and cleanup() ========== */
+int init_subkind_colo_nic(libxl__checkpoint_devices_state *cds)
+{
+    return 0;
+}
+
+void cleanup_subkind_colo_nic(libxl__checkpoint_devices_state *cds)
+{
+}
+
+/* ========== helper functions ========== */
+static void colo_save_setup_script_cb(libxl__egc *egc,
+                                     libxl__async_exec_state *aes,
+                                     int status);
+static void colo_save_teardown_script_cb(libxl__egc *egc,
+                                         libxl__async_exec_state *aes,
+                                         int status);
+
+/*
+ * If the device has a vifname, then use that instead of
+ * the vifX.Y format.
+ * it must ONLY be used for remus because if driver domains
+ * were in use it would constitute a security vulnerability.
+ */
+static const char *get_vifname(libxl__checkpoint_device *dev,
+                               const libxl_device_nic *nic)
+{
+    const char *vifname = NULL;
+    const char *path;
+    int rc;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    /* Convenience aliases */
+    const uint32_t domid = dev->cds->domid;
+
+    path = GCSPRINTF("%s/backend/vif/%d/%d/vifname",
+                     libxl__xs_get_dompath(gc, 0), domid, nic->devid);
+    rc = libxl__xs_read_checked(gc, XBT_NULL, path, &vifname);
+    if (!rc && !vifname) {
+        vifname = libxl__device_nic_devname(gc, domid,
+                                            nic->devid,
+                                            nic->nictype);
+    }
+
+    return vifname;
+}
+
+/*
+ * the script needs the following env & args
+ * $vifname
+ * $forwarddev
+ * $mode(primary/secondary)
+ * $index
+ * $bridge
+ * setup/teardown as command line arg.
+ */
+static void setup_async_exec(libxl__checkpoint_device *dev, char *op, int side,
+                             char *colo_proxy_script)
+{
+    int arraysize, nr = 0;
+    char **env = NULL, **args = NULL;
+    libxl__colo_device_nic *colo_nic = dev->concrete_data;
+    libxl__checkpoint_devices_state *cds = dev->cds;
+    libxl__async_exec_state *aes = &dev->aodev.aes;
+    const libxl_device_nic *nic = dev->backend_dev;
+    libxl__colo_save_state *css = CONTAINER_OF(dev->cds, *css, cds);
+
+    STATE_AO_GC(cds->ao);
+
+    /* Convenience aliases */
+    const char *const vif = colo_nic->vif;
+
+    arraysize = 11;
+    GCNEW_ARRAY(env, arraysize);
+    env[nr++] = "vifname";
+    env[nr++] = libxl__strdup(gc, vif);
+    env[nr++] = "forwarddev";
+    env[nr++] = libxl__strdup(gc, nic->forwarddev);
+    env[nr++] = "mode";
+    if (side == primary)
+        env[nr++] = "primary";
+    else
+        env[nr++] = "secondary";
+    env[nr++] = "index";
+    env[nr++] = GCSPRINTF("%d", css->cps.index);
+    env[nr++] = "bridge";
+    env[nr++] = libxl__strdup(gc, nic->bridge);
+    env[nr++] = NULL;
+    assert(nr == arraysize);
+
+    arraysize = 3; nr = 0;
+    GCNEW_ARRAY(args, arraysize);
+    args[nr++] = colo_proxy_script;
+    args[nr++] = op;
+    args[nr++] = NULL;
+    assert(nr == arraysize);
+
+    aes->ao = dev->cds->ao;
+    aes->what = GCSPRINTF("%s %s", args[0], args[1]);
+    aes->env = env;
+    aes->args = args;
+    aes->timeout_ms = LIBXL_HOTPLUG_TIMEOUT * 1000;
+    aes->stdfds[0] = -1;
+    aes->stdfds[1] = -1;
+    aes->stdfds[2] = -1;
+
+    if (!strcmp(op, "teardown"))
+        aes->callback = colo_save_teardown_script_cb;
+    else
+        aes->callback = colo_save_setup_script_cb;
+}
+
+/* ========== setup() and teardown() ========== */
+static void colo_nic_setup(libxl__egc *egc, libxl__checkpoint_device *dev,
+                           int side, char *colo_proxy_script)
+{
+    int rc;
+    libxl__colo_device_nic *colo_nic;
+    const libxl_device_nic *nic = dev->backend_dev;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    /*
+     * thers's no subkind of nic devices, so nic ops is always matched
+     * with nic devices, we begin to setup the nic device
+     */
+    dev->matched = 1;
+
+    if (!nic->forwarddev) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    GCNEW(colo_nic);
+    dev->concrete_data = colo_nic;
+    colo_nic->devid = nic->devid;
+    colo_nic->vif = get_vifname(dev, nic);
+    if (!colo_nic->vif) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    setup_async_exec(dev, "setup", side, colo_proxy_script);
+    rc = libxl__async_exec_start(gc, &dev->aodev.aes);
+    if (rc)
+        goto out;
+
+    return;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void colo_save_setup_script_cb(libxl__egc *egc,
+                                      libxl__async_exec_state *aes,
+                                      int status)
+{
+    libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
+    libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
+    libxl__colo_device_nic *colo_nic = dev->concrete_data;
+    libxl__checkpoint_devices_state *cds = dev->cds;
+    const char *out_path_base, *hotplug_error = NULL;
+    int rc;
+
+    STATE_AO_GC(cds->ao);
+
+    /* Convenience aliases */
+    const uint32_t domid = cds->domid;
+    const int devid = colo_nic->devid;
+    const char *const vif = colo_nic->vif;
+
+    out_path_base = GCSPRINTF("%s/colo_proxy/%d",
+                              libxl__xs_libxl_path(gc, domid), devid);
+
+    rc = libxl__xs_read_checked(gc, XBT_NULL,
+                                GCSPRINTF("%s/hotplug-error", out_path_base),
+                                &hotplug_error);
+    if (rc)
+        goto out;
+
+    if (hotplug_error) {
+        LOG(ERROR, "colo_proxy script %s setup failed for vif %s: %s",
+            aes->args[0], vif, hotplug_error);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    if (status) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = 0;
+
+out:
+    aodev->rc = rc;
+    aodev->callback(egc, aodev);
+}
+
+static void colo_nic_teardown(libxl__egc *egc, libxl__checkpoint_device *dev,
+                              int side, char *colo_proxy_script)
+{
+    int rc;
+    libxl__colo_device_nic *colo_nic = dev->concrete_data;
+    STATE_AO_GC(dev->cds->ao);
+
+    if (!colo_nic || !colo_nic->vif) {
+        /* colo nic has not yet been set up, just return */
+        rc = 0;
+        goto out;
+    }
+
+    setup_async_exec(dev, "teardown", side, colo_proxy_script);
+
+    rc = libxl__async_exec_start(gc, &dev->aodev.aes);
+    if (rc)
+        goto out;
+
+    return;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void colo_save_teardown_script_cb(libxl__egc *egc,
+                                         libxl__async_exec_state *aes,
+                                         int status)
+{
+    int rc;
+    libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
+
+    if (status)
+        rc = ERROR_FAIL;
+    else
+        rc = 0;
+
+    aodev->rc = rc;
+    aodev->callback(egc, aodev);
+}
+
+/* ======== primary ======== */
+static void colo_nic_save_setup(libxl__egc *egc, libxl__checkpoint_device *dev)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(dev->cds, *css, cds);
+
+    colo_nic_setup(egc, dev, primary, css->colo_proxy_script);
+}
+
+static void colo_nic_save_teardown(libxl__egc *egc,
+                                   libxl__checkpoint_device *dev)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(dev->cds, *css, cds);
+
+    colo_nic_teardown(egc, dev, primary, css->colo_proxy_script);
+}
+
+const libxl__checkpoint_device_instance_ops colo_save_device_nic = {
+    .kind = LIBXL__DEVICE_KIND_VIF,
+    .setup = colo_nic_save_setup,
+    .teardown = colo_nic_save_teardown,
+};
+
+/* ======== secondary ======== */
+static void colo_nic_restore_setup(libxl__egc *egc,
+                                   libxl__checkpoint_device *dev)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(dev->cds, *crs, cds);
+
+    colo_nic_setup(egc, dev, secondary, crs->colo_proxy_script);
+}
+
+static void colo_nic_restore_teardown(libxl__egc *egc,
+                                      libxl__checkpoint_device *dev)
+{
+    libxl__colo_restore_state *crs = CONTAINER_OF(dev->cds, *crs, cds);
+
+    colo_nic_teardown(egc, dev, secondary, crs->colo_proxy_script);
+}
+
+const libxl__checkpoint_device_instance_ops colo_restore_device_nic = {
+    .kind = LIBXL__DEVICE_KIND_VIF,
+    .setup = colo_nic_restore_setup,
+    .teardown = colo_nic_restore_teardown,
+};
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 68d1db9..fbd4781 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2744,6 +2744,8 @@ void cleanup_subkind_drbd_disk(libxl__checkpoint_devices_state *cds);
 int init_subkind_qdisk(libxl__checkpoint_devices_state *cds);
 void cleanup_subkind_qdisk(libxl__checkpoint_devices_state *cds);
 int colo_qdisk_preresume(libxl_ctx *ctx, domid_t domid);
+int init_subkind_colo_nic(libxl__checkpoint_devices_state *cds);
+void cleanup_subkind_colo_nic(libxl__checkpoint_devices_state *cds);
 
 typedef void libxl__checkpoint_callback(libxl__egc *,
                                         libxl__checkpoint_devices_state *,
@@ -2867,6 +2869,7 @@ struct libxl__colo_save_state {
     libxl__checkpoint_devices_state cds;
     int send_fd;
     int recv_fd;
+    char *colo_proxy_script;
 
     /* private */
     libxl__datacopier_state dc;
@@ -3217,6 +3220,7 @@ struct libxl__colo_restore_state {
     int pae;
     int superpages;
     libxl__colo_callback *callback;
+    char *colo_proxy_script;
 
     /* private, colo restore checkpoint state */
     libxl__domain_create_cb *saved_cb;
@@ -3239,6 +3243,7 @@ struct libxl__domain_create_state {
     /* private to domain_create */
     int guest_domid;
     int checkpointed_stream;
+    const char *colo_proxy_script;
     libxl__domain_build_state build_state;
     libxl__colo_restore_state crs;
     libxl__bootloader_state bl;
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 1e6b5ae..4a14ce1 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -538,6 +538,7 @@ libxl_device_nic = Struct("device_nic", [
     ("rate_bytes_per_interval", uint64),
     ("rate_interval_usecs", uint32),
     ("gatewaydev", string),
+    ("forwarddev", string)
     ])
 
 libxl_device_pci = Struct("device_pci", [
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 13/15] setup and control colo proxy on primary side
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (11 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 14/15] setup and control colo proxy on secondary side Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 15/15] cmdline switches and config vars to control colo-proxy Yang Hongyang
  14 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

setup and control colo proxy on primary side

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/libxl/libxl_colo_save.c | 125 +++++++++++++++++++++++++++++++++++++++---
 tools/libxl/libxl_internal.h  |   1 +
 2 files changed, 118 insertions(+), 8 deletions(-)

diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index 80fd605..9a4f501 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -19,9 +19,11 @@
 #include "libxl_internal.h"
 #include "libxl_colo.h"
 
+extern const libxl__checkpoint_device_instance_ops colo_save_device_nic;
 extern const libxl__checkpoint_device_instance_ops colo_save_device_qdisk;
 
 static const libxl__checkpoint_device_instance_ops *colo_ops[] = {
+    &colo_save_device_nic,
     &colo_save_device_qdisk,
     NULL,
 };
@@ -33,9 +35,15 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
-    rc = init_subkind_qdisk(cds);
+    rc = init_subkind_colo_nic(cds);
     if (rc) goto out;
 
+    rc = init_subkind_qdisk(cds);
+    if (rc) {
+        cleanup_subkind_colo_nic(cds);
+        goto out;
+    }
+
     rc = 0;
 out:
     return rc;
@@ -46,6 +54,7 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
 
+    cleanup_subkind_colo_nic(cds);
     cleanup_subkind_qdisk(cds);
 }
 
@@ -76,14 +85,28 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     css->svm_running = false;
     css->paused = true;
     css->qdisk_setuped = false;
+    libxl__ev_child_init(&css->child);
 
-    /* TODO: nic support */
-    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
+    if (dss->remus->netbufscript)
+        css->colo_proxy_script = libxl__strdup(gc, dss->remus->netbufscript);
+    else
+        css->colo_proxy_script = GCSPRINTF("%s/colo-proxy-setup",
+                                           libxl__xen_script_dir_path());
+
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VIF) |
+                             (1 << LIBXL__DEVICE_KIND_VBD);
     cds->ops = colo_ops;
     cds->callback = colo_save_setup_done;
     cds->ao = ao;
     cds->domid = dss->domid;
 
+    css->cps.ao = ao;
+    if (colo_proxy_setup(&css->cps)) {
+        LOG(ERROR, "COLO: failed to setup colo proxy for guest with domid %u",
+            cds->domid);
+        goto out;
+    }
+
     if (init_device_subkind(cds))
         goto out;
 
@@ -157,6 +180,7 @@ static void colo_teardown_done(libxl__egc *egc,
     libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
 
     cleanup_device_subkind(cds);
+    colo_proxy_teardown(&css->cps);
     dss->callback(egc, dss, rc);
 }
 
@@ -437,6 +461,8 @@ static void colo_read_svm_ready_done(libxl__egc *egc,
         goto out;
     }
 
+    colo_proxy_preresume(&css->cps);
+
     css->svm_running = true;
     css->cds.callback = colo_preresume_cb;
     libxl__checkpoint_devices_preresume(egc, &css->cds);
@@ -530,6 +556,8 @@ static void colo_read_svm_resumed_done(libxl__egc *egc,
         goto out;
     }
 
+    colo_proxy_postresume(&css->cps);
+
     ok = 1;
 
 out:
@@ -538,6 +566,91 @@ out:
 
 
 /* ===================== colo: wait new checkpoint ===================== */
+
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+static void colo_proxy_async_wait_for_checkpoint(libxl__colo_save_state *css);
+static void colo_proxy_async_call_done(libxl__egc *egc,
+                                       libxl__ev_child *child,
+                                       int pid,
+                                       int status);
+
+static void colo_proxy_async_call(libxl__egc *egc,
+                                  libxl__colo_save_state *css,
+                                  void func(libxl__colo_save_state *),
+                                  libxl__ev_child_callback callback)
+{
+    int pid = -1, rc;
+
+    STATE_AO_GC(css->cds.ao);
+
+    /* Fork and call */
+    pid = libxl__ev_child_fork(gc, &css->child, callback);
+    if (pid == -1) {
+        LOG(ERROR, "unable to fork");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    if (!pid) {
+        /* child */
+        func(css);
+        /* notreached */
+        abort();
+    }
+
+    return;
+
+out:
+    callback(egc, &css->child, -1, 1);
+}
+
+static void colo_proxy_wait_for_checkpoint(libxl__egc *egc,
+                                           libxl__colo_save_state *css)
+{
+    colo_proxy_async_call(egc, css,
+                          colo_proxy_async_wait_for_checkpoint,
+                          colo_proxy_async_call_done);
+}
+
+static void colo_proxy_async_wait_for_checkpoint(libxl__colo_save_state *css)
+{
+    int req;
+
+again:
+    req = colo_proxy_checkpoint(&css->cps);
+    if (req < 0) {
+        /* some error happens */
+        _exit(1);
+    } else if (!req) {
+        /* no checkpoint is needed, wait for 1ms and the check again */
+        usleep(1000);
+        goto again;
+    } else {
+        /* net packets is not consistent, we need to start a checkpoint */
+        _exit(0);
+    }
+}
+
+static void colo_proxy_async_call_done(libxl__egc *egc,
+                                       libxl__ev_child *child,
+                                       int pid,
+                                       int status)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(child, *css, child);
+
+    EGC_GC;
+
+    if (status) {
+        LOG(ERROR, "failed to wait for new checkpoint");
+        colo_start_new_checkpoint(egc, &css->cds, ERROR_FAIL);
+        return;
+    }
+
+    colo_start_new_checkpoint(egc, &css->cds, 0);
+}
+
 /*
  * Do the following things:
  * 1. do commit
@@ -547,9 +660,6 @@ out:
 static void colo_device_commit_cb(libxl__egc *egc,
                                   libxl__checkpoint_devices_state *cds,
                                   int rc);
-static void colo_start_new_checkpoint(libxl__egc *egc,
-                                      libxl__checkpoint_devices_state *cds,
-                                      int rc);
 
 void libxl__colo_save_domain_checkpoint_callback(void *data)
 {
@@ -578,8 +688,7 @@ static void colo_device_commit_cb(libxl__egc *egc,
         goto out;
     }
 
-    /* TODO: wait a new checkpoint */
-    colo_start_new_checkpoint(egc, cds, 0);
+    colo_proxy_wait_for_checkpoint(egc, css);
     return;
 
 out:
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index fbd4781..0e54865 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2887,6 +2887,7 @@ struct libxl__colo_save_state {
 
     /* private, used by colo-proxy */
     libxl__colo_proxy_state cps;
+    libxl__ev_child child;
 };
 
 /*----- Domain suspend (save) state structure -----*/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 14/15] setup and control colo proxy on secondary side
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (12 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 13/15] setup and control colo proxy on primary side Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  2015-06-08  3:45 ` [PATCH v6 COLO 15/15] cmdline switches and config vars to control colo-proxy Yang Hongyang
  14 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

setup and control colo proxy on secondary side

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/libxl/libxl_colo_restore.c | 28 +++++++++++++++++++++++++---
 tools/libxl/libxl_internal.h     |  3 +++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index 6731bd0..9c659c0 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -65,9 +65,11 @@ static void libxl__colo_restore_domain_resume_callback(void *data);
 static void libxl__colo_restore_domain_checkpoint_callback(void *data);
 static void libxl__colo_restore_domain_suspend_callback(void *data);
 
+extern const libxl__checkpoint_device_instance_ops colo_restore_device_nic;
 extern const libxl__checkpoint_device_instance_ops colo_restore_device_qdisk;
 
 static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
+    &colo_restore_device_nic,
     &colo_restore_device_qdisk,
     NULL,
 };
@@ -167,8 +169,14 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
+    rc = init_subkind_colo_nic(cds);
+    if (rc) goto out;
+
     rc = init_subkind_qdisk(cds);
-    if (rc)  goto out;
+    if (rc) {
+        cleanup_subkind_colo_nic(cds);
+        goto out;
+    }
 
     rc = 0;
 out:
@@ -180,6 +188,7 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
 
+    cleanup_subkind_colo_nic(cds);
     cleanup_subkind_qdisk(cds);
 }
 
@@ -398,6 +407,8 @@ static void colo_restore_teardown_done(libxl__egc *egc,
     if (crcs->teardown_devices)
         cleanup_device_subkind(cds);
 
+    colo_proxy_teardown(&crs->cps);
+
     rc = crcs->saved_rc;
     if (!rc) {
         crcs->callback = do_failover_done;
@@ -607,6 +618,8 @@ static void colo_restore_preresume_cb(libxl__egc *egc,
         goto out;
     }
 
+    colo_proxy_preresume(&crs->cps);
+
     colo_restore_resume_vm(egc, crcs);
 
     return;
@@ -643,6 +656,8 @@ static void colo_resume_vm_done(libxl__egc *egc,
 
     crcs->status = LIBXL_COLO_RESUMED;
 
+    colo_proxy_postresume(&crs->cps);
+
     /* avoid calling libxl__xc_domain_restore_done() more than once */
     if (crs->saved_cb) {
         dcs->callback = crs->saved_cb;
@@ -792,13 +807,20 @@ static void colo_setup_checkpoint_devices(libxl__egc *egc,
 
     STATE_AO_GC(crs->ao);
 
-    /* TODO: nic support */
-    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VIF) |
+                             (1 << LIBXL__DEVICE_KIND_VBD);
     cds->callback = colo_restore_setup_cds_done;
     cds->ao = ao;
     cds->domid = crs->domid;
     cds->ops = colo_restore_ops;
 
+    crs->cps.ao = ao;
+    if (colo_proxy_setup(&crs->cps)) {
+        LOG(ERROR, "COLO: failed to setup colo proxy for guest with domid %u",
+            cds->domid);
+        goto out;
+    }
+
     if (init_device_subkind(cds))
         goto out;
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 0e54865..33bf47b 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3230,6 +3230,9 @@ struct libxl__colo_restore_state {
 
     /* private, used by qdisk block replication */
     bool qdisk_setuped;
+
+    /* private, used by colo proxy */
+    libxl__colo_proxy_state cps;
 };
 
 struct libxl__domain_create_state {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 COLO 15/15] cmdline switches and config vars to control colo-proxy
  2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
                   ` (13 preceding siblings ...)
  2015-06-08  3:45 ` [PATCH v6 COLO 14/15] setup and control colo proxy on secondary side Yang Hongyang
@ 2015-06-08  3:45 ` Yang Hongyang
  14 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08  3:45 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, guijianfeng, rshriram, ian.jackson

Add cmdline switches to 'xl migrate-receive' command to specify
a domain-specific hotplug script to setup COLO proxy.

Add a new config var 'colo.default.agentscript' to xl.conf, that
allows the user to override the default global script used to
setup COLO proxy.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/man/xl.conf.pod.5      |  6 ++++++
 docs/man/xl.pod.1           |  1 -
 tools/libxl/libxl.c         |  6 ++++++
 tools/libxl/libxl_create.c  | 14 +++++++++++--
 tools/libxl/libxl_types.idl |  1 +
 tools/libxl/xl.c            |  3 +++
 tools/libxl/xl.h            |  1 +
 tools/libxl/xl_cmdimpl.c    | 49 ++++++++++++++++++++++++++++++++++-----------
 8 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/docs/man/xl.conf.pod.5 b/docs/man/xl.conf.pod.5
index 8ae19bb..8f7fd28 100644
--- a/docs/man/xl.conf.pod.5
+++ b/docs/man/xl.conf.pod.5
@@ -111,6 +111,12 @@ Configures the default script used by Remus to setup network buffering.
 
 Default: C</etc/xen/scripts/remus-netbuf-setup>
 
+=item B<colo.default.proxyscript="PATH">
+
+Configures the default script used by COLO to setup colo-proxy.
+
+Default: C</etc/xen/scripts/colo-proxy-setup>
+
 =item B<output_format="json|sxp">
 
 Configures the default output format used by xl when printing "machine
diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 1c2ee24..8b425b5 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -454,7 +454,6 @@ N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
      COLO support in xl is still in experimental (proof-of-concept) phase.
-     There is no support for network at the moment.
 
 B<OPTIONS>
 
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 4a5957c..224b54d 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -3375,6 +3375,11 @@ void libxl__device_nic_add(libxl__egc *egc, uint32_t domid,
         flexarray_append(back, nic->ifname);
     }
 
+    if (nic->forwarddev) {
+        flexarray_append(back, "forwarddev");
+        flexarray_append(back, nic->forwarddev);
+    }
+
     flexarray_append(back, "mac");
     flexarray_append(back,libxl__sprintf(gc,
                                     LIBXL_MAC_FMT, LIBXL_MAC_BYTES(nic->mac)));
@@ -3498,6 +3503,7 @@ static int libxl__device_nic_from_xs_be(libxl__gc *gc,
     nic->ip = READ_BACKEND(NOGC, "ip");
     nic->bridge = READ_BACKEND(NOGC, "bridge");
     nic->script = READ_BACKEND(NOGC, "script");
+    nic->forwarddev = READ_BACKEND(NOGC, "forwarddev");
 
     /* vif_ioemu nics use the same xenstore entries as vif interfaces */
     tmp = READ_BACKEND(gc, "type");
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 17d0d18..597a64c 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1168,6 +1168,11 @@ static void domcreate_bootloader_done(libxl__egc *egc,
         crs->superpages = superpages;
         crs->pae = pae;
         crs->callback = libxl__colo_restore_setup_done;
+        if (dcs->colo_proxy_script)
+            crs->colo_proxy_script = libxl__strdup(gc, dcs->colo_proxy_script);
+        else
+            crs->colo_proxy_script = GCSPRINTF("%s/colo-proxy-setup",
+                                               libxl__xen_script_dir_path());
         libxl__colo_restore_setup(egc, crs);
     } else
         libxl__xc_domain_restore(egc, dcs,
@@ -1692,6 +1697,7 @@ static void domain_create_cb(libxl__egc *egc,
 static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid, int restore_fd,
                             int send_fd, int checkpointed_stream,
+                            const char *colo_proxy_script,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
@@ -1707,6 +1713,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
     cdcs->dcs.send_fd = send_fd;
     cdcs->dcs.callback = domain_create_cb;
     cdcs->dcs.checkpointed_stream = checkpointed_stream;
+    cdcs->dcs.colo_proxy_script = colo_proxy_script;
     libxl__ao_progress_gethow(&cdcs->dcs.aop_console_how, aop_console_how);
     cdcs->domid_out = domid;
 
@@ -1750,7 +1757,7 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             const libxl_asyncprogress_how *aop_console_how)
 {
     unset_disk_colo_restore(d_config);
-    return do_domain_create(ctx, d_config, domid, -1, -1, 0,
+    return do_domain_create(ctx, d_config, domid, -1, -1, 0, NULL,
                             ao_how, aop_console_how);
 }
 
@@ -1761,16 +1768,19 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 const libxl_asyncprogress_how *aop_console_how)
 {
     int send_fd = -1;
+    char *colo_proxy_script = NULL;
 
     if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
         send_fd = params->send_fd;
+        colo_proxy_script = params->colo_proxy_script;
         set_disk_colo_restore(d_config);
     } else {
         unset_disk_colo_restore(d_config);
     }
 
     return do_domain_create(ctx, d_config, domid, restore_fd, send_fd,
-                            params->checkpointed_stream, ao_how, aop_console_how);
+                            params->checkpointed_stream, colo_proxy_script,
+                            ao_how, aop_console_how);
 }
 
 /*
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 4a14ce1..fe15123 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -353,6 +353,7 @@ libxl_domain_create_info = Struct("domain_create_info",[
 libxl_domain_restore_params = Struct("domain_restore_params", [
     ("checkpointed_stream", integer),
     ("send_fd", integer),
+    ("colo_proxy_script", string),
     ])
 
 libxl_domain_sched_params = Struct("domain_sched_params",[
diff --git a/tools/libxl/xl.c b/tools/libxl/xl.c
index f014306..f44f04f 100644
--- a/tools/libxl/xl.c
+++ b/tools/libxl/xl.c
@@ -45,6 +45,7 @@ char *default_bridge = NULL;
 char *default_gatewaydev = NULL;
 char *default_vifbackend = NULL;
 char *default_remus_netbufscript = NULL;
+char *default_colo_proxy_script = NULL;
 enum output_format default_output_format = OUTPUT_FORMAT_JSON;
 int claim_mode = 1;
 bool progress_use_cr = 0;
@@ -179,6 +180,8 @@ static void parse_global_config(const char *configfile,
 
     xlu_cfg_replace_string (config, "remus.default.netbufscript",
         &default_remus_netbufscript, 0);
+    xlu_cfg_replace_string (config, "colo.default.proxyscript",
+        &default_colo_proxy_script, 0);
 
     xlu_cfg_destroy(config);
 }
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
index 5bc138c..33f25d1 100644
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -178,6 +178,7 @@ extern char *default_bridge;
 extern char *default_gatewaydev;
 extern char *default_vifbackend;
 extern char *default_remus_netbufscript;
+extern char *default_colo_proxy_script;
 extern char *blkdev_start;
 
 enum output_format {
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 4bbadd3..41fa1e9 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -153,6 +153,7 @@ struct domain_create {
     const char *config_file;
     const char *extra_config; /* extra config string */
     const char *restore_file;
+    char *colo_proxy_script;
     int migrate_fd; /* -1 means none */
     int send_fd; /* -1 means none */
     char **migration_domname_r; /* from malloc */
@@ -982,6 +983,8 @@ static int parse_nic_config(libxl_device_nic *nic, XLU_Config **config, char *to
         replace_string(&nic->model, oparg);
     } else if (MATCH_OPTION("rate", token, oparg)) {
         parse_vif_rate(config, oparg, nic);
+    } else if (MATCH_OPTION("forwarddev", token, oparg)) {
+        replace_string(&nic->forwarddev, oparg);
     } else if (MATCH_OPTION("accel", token, oparg)) {
         fprintf(stderr, "the accel parameter for vifs is currently not supported\n");
     } else {
@@ -2727,6 +2730,7 @@ start:
 
         params.checkpointed_stream = dom_info->checkpointed_stream;
         params.send_fd = send_fd;
+        params.colo_proxy_script = dom_info->colo_proxy_script;
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
                                           &params,
@@ -4246,7 +4250,8 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
 }
 
 static void migrate_receive(int debug, int daemonize, int monitor,
-                            int send_fd, int recv_fd, int remus)
+                            int send_fd, int recv_fd, int remus,
+                            char *colo_proxy_script)
 {
     uint32_t domid;
     int rc, rc2;
@@ -4273,6 +4278,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.send_fd = send_fd;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = remus;
+    dom_info.colo_proxy_script = colo_proxy_script;
     if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
         /* COLO uses stdout to send control message to master */
         dom_info.quiet = 1;
@@ -4467,8 +4473,9 @@ int main_migrate_receive(int argc, char **argv)
 {
     int debug = 0, daemonize = 1, monitor = 1, remus = 0;
     int opt;
+    char *script = NULL;
 
-    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
+    SWITCH_FOREACH_OPT(opt, "Fedrcn:", NULL, "migrate-receive", 0) {
     case 'F':
         daemonize = 0;
         break;
@@ -4484,6 +4491,8 @@ int main_migrate_receive(int argc, char **argv)
         break;
     case 'c':
         remus = LIBXL_CHECKPOINTED_STREAM_COLO;
+    case 'n':
+        script = optarg;
     }
 
     if (argc-optind != 0) {
@@ -4492,7 +4501,7 @@ int main_migrate_receive(int argc, char **argv)
     }
     migrate_receive(debug, daemonize, monitor,
                     STDOUT_FILENO, STDIN_FILENO,
-                    remus);
+                    remus, script);
 
     return 0;
 }
@@ -8018,8 +8027,10 @@ int main_remus(int argc, char **argv)
         if (!interval)
             r_info.interval = 0;
 
-        if (r_info.interval || libxl_defbool_val(r_info.blackhole)) {
-            perror("option -c is conflict with -i or -b");
+        if (r_info.interval || libxl_defbool_val(r_info.blackhole) ||
+            !libxl_defbool_is_default(r_info.netbuf) ||
+            !libxl_defbool_is_default(r_info.diskbuf)) {
+            perror("option -c is conflict with -i, -d, -n or -b");
             exit(-1);
         }
 
@@ -8029,8 +8040,12 @@ int main_remus(int argc, char **argv)
         }
     }
 
-    if (!r_info.netbufscript)
-        r_info.netbufscript = default_remus_netbufscript;
+    if (!r_info.netbufscript) {
+        if (libxl_defbool_val(r_info.colo))
+            r_info.netbufscript = default_colo_proxy_script;
+        else
+            r_info.netbufscript = default_remus_netbufscript;
+    }
 
     if (libxl_defbool_val(r_info.blackhole)) {
         send_fd = open("/dev/null", O_RDWR, 0644);
@@ -8043,11 +8058,21 @@ int main_remus(int argc, char **argv)
         if (!ssh_command[0]) {
             rune = host;
         } else {
-            if (asprintf(&rune, "exec %s %s xl migrate-receive %s %s",
-                         ssh_command, host,
-                         libxl_defbool_val(r_info.colo) ? "-c" : "-r",
-                         daemonize ? "" : " -e") < 0)
-                return 1;
+            if (!libxl_defbool_val(r_info.colo)) {
+                if (asprintf(&rune, "exec %s %s xl migrate-receive %s %s",
+                             ssh_command, host,
+                             "-r",
+                             daemonize ? "" : " -e") < 0)
+                    return 1;
+            } else {
+                if (asprintf(&rune, "exec %s %s xl migrate-receive %s %s %s %s",
+                             ssh_command, host,
+                             "-c",
+                             r_info.netbufscript ? "-n" : "",
+                             r_info.netbufscript ? r_info.netbufscript : "",
+                             daemonize ? "" : " -e") < 0)
+                    return 1;
+            }
         }
 
         save_domain_core_begin(domid, NULL, &config_data, &config_len);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 04/15] libxc/restore: support COLO restore
  2015-06-08  3:45 ` [PATCH v6 COLO 04/15] libxc/restore: support COLO restore Yang Hongyang
@ 2015-06-08 10:39   ` Andrew Cooper
  2015-06-08 14:06     ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2015-06-08 10:39 UTC (permalink / raw)
  To: Yang Hongyang, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson

On 08/06/15 04:45, Yang Hongyang wrote:
> call the callbacks resume/checkpoint/suspend while secondary vm
> status is consistent with primary.
>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  tools/libxc/xc_sr_common.h          | 11 +++++--
>  tools/libxc/xc_sr_restore.c         | 63 ++++++++++++++++++++++++++++++++++++-
>  tools/libxc/xc_sr_restore_x86_hvm.c |  1 +
>  3 files changed, 72 insertions(+), 3 deletions(-)
>
> diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> index 565c5da..382bf76 100644
> --- a/tools/libxc/xc_sr_common.h
> +++ b/tools/libxc/xc_sr_common.h
> @@ -132,8 +132,11 @@ struct xc_sr_restore_ops
>       *
>       * @return 0 for success, -1 for failure, or the sentinel value
>       * RECORD_NOT_PROCESSED.
> +     * BROKEN_CHANNEL: if we are under Remus/COLO, this means master may dead,
> +     *                 we will failover.

"this means that the master"

>       */
>  #define RECORD_NOT_PROCESSED 1
> +#define BROKEN_CHANNEL 2
>      int (*process_record)(struct xc_sr_context *ctx, struct xc_sr_record *rec);
>  
>      /**
> @@ -205,8 +208,12 @@ struct xc_sr_context
>              uint32_t guest_type;
>              uint32_t guest_page_size;
>  
> -            /* Plain VM, or checkpoints over time. */
> -            bool checkpointed;
> +            /*
> +             * 0: Plain VM
> +             * 1: Remus
> +             * 2: COLO
> +             */
> +            int checkpointed;

I think this would be nicer as

enum {
STREAM_PLAIN,
STREAM_REMUS,
STREAM_COLO,
} stream;

perhaps?  It would reduce the use of a magic 2 in the code.

>  
>              /* Currently buffering records between a checkpoint */
>              bool buffer_all_records;
> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> index 2d2edd3..982a70e 100644
> --- a/tools/libxc/xc_sr_restore.c
> +++ b/tools/libxc/xc_sr_restore.c
> @@ -1,4 +1,5 @@
>  #include <arpa/inet.h>
> +#include <assert.h>
>  
>  #include "xc_sr_common.h"
>  
> @@ -472,7 +473,7 @@ static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec);
>  static int handle_checkpoint(struct xc_sr_context *ctx)
>  {
>      xc_interface *xch = ctx->xch;
> -    int rc = 0;
> +    int rc = 0, ret;
>      unsigned i;
>  
>      if ( !ctx->restore.checkpointed )
> @@ -498,6 +499,46 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
>      else
>          ctx->restore.buffer_all_records = true;
>  
> +    if ( ctx->restore.checkpointed == 2 )
> +    {
> +#define HANDLE_CALLBACK_RETURN_VALUE(ret)                   \

I would ideally like to avoid macros like this in an effort to avoid the
code slipping back into the state that the legacy code was in, but at
least it is local to the area used.

> +    do {                                                    \
> +        if ( ret == 0 )                                     \
> +        {                                                   \
> +            /* Some internal error happens */               \
> +            rc = -1;                                        \
> +            goto err;                                       \
> +        }                                                   \
> +        else if ( ret == 2 )                                \
> +        {                                                   \
> +            /* Reading/writing error, do failover */        \
> +            rc = BROKEN_CHANNEL;                            \
> +            goto err;                                       \
> +        }                                                   \
> +    } while (0)

This should have the logic inverted somewhat, to cover all possible
values of ret, including the negative half.

e.g.

if ( ret == 1 )
    rc = 0; /* Success */
else
{
    if ( ret == 2 )
        rc = BROKEN_CHANNEL;
    else
        rc = -1; /* Some unspecified error */
    goto err;
}

> +
> +        /* COLO */
> +
> +        /* We need to resume guest */
> +        rc = ctx->restore.ops.stream_complete(ctx);
> +        if ( rc )
> +            goto err;
> +
> +        /* TODO: call restore_results */
> +
> +        /* Resume secondary vm */
> +        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
> +        HANDLE_CALLBACK_RETURN_VALUE(ret);
> +
> +        /* wait for new checkpoint */
> +        ret = ctx->restore.callbacks->checkpoint(ctx->restore.callbacks->data);
> +        HANDLE_CALLBACK_RETURN_VALUE(ret);
> +
> +        /* suspend secondary vm */
> +        ret = ctx->restore.callbacks->suspend(ctx->restore.callbacks->data);
> +        HANDLE_CALLBACK_RETURN_VALUE(ret);

Please #undef HANDLE_CALLBACK_RETURN_VALUE here.

> +    }
> +
>   err:
>      return rc;
>  }
> @@ -678,6 +719,8 @@ static int restore(struct xc_sr_context *ctx)
>                      goto err;
>                  }
>              }
> +            else if ( rc == BROKEN_CHANNEL )
> +                goto remus_failover;
>              else if ( rc )
>                  goto err;
>          }
> @@ -685,6 +728,15 @@ static int restore(struct xc_sr_context *ctx)
>      } while ( rec.type != REC_TYPE_END );
>  
>   remus_failover:
> +
> +    if ( ctx->restore.checkpointed == 2 )
> +    {
> +        /* With COLO, we have already called stream_complete */
> +        rc = 0;
> +        IPRINTF("COLO Failover");
> +        goto done;
> +    }
> +
>      /*
>       * With Remus, if we reach here, there must be some error on primary,
>       * failover from the last checkpoint state.
> @@ -735,6 +787,15 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>      ctx.restore.checkpointed = checkpointed_stream;
>      ctx.restore.callbacks = callbacks;
>  
> +    /* Sanity checks for callbacks. */
> +    if ( ctx.restore.checkpointed == 2 )
> +    {
> +        /* this is COLO restore */
> +        assert(callbacks->suspend &&
> +               callbacks->checkpoint &&
> +               callbacks->postcopy);

FWIW, I need to make the ->checkpoint() callback used even in the remus
case for qemu handling in libxl migration v2.

> +    }
> +
>      IPRINTF("In experimental %s", __func__);
>      DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
>              ", checkpointed_stream %d", io_fd, dom, hvm, pae,
> diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
> index 06177e0..8e54c68 100644
> --- a/tools/libxc/xc_sr_restore_x86_hvm.c
> +++ b/tools/libxc/xc_sr_restore_x86_hvm.c
> @@ -181,6 +181,7 @@ static int handle_qemu(struct xc_sr_context *ctx)
>      if ( fp )
>          fclose(fp);
>      free(qbuf);
> +    ctx->x86_hvm.restore.qbuf = NULL;

This looks like an unrelated bugfix.

~Andrew

>  
>      return rc;
>  }

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm
  2015-06-08  3:45 ` [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm Yang Hongyang
@ 2015-06-08 12:16   ` Andrew Cooper
  2015-06-08 14:08     ` Yang Hongyang
  2015-06-16 11:13   ` Ian Campbell
  1 sibling, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2015-06-08 12:16 UTC (permalink / raw)
  To: Yang Hongyang, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson

On 08/06/15 04:45, Yang Hongyang wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
>
> We will call libxl__xc_domain_restore_done() to rebuild secondary vm. But
> we need store mfn and console mfn when rebuilding secondary vm. So make
> restore_results is a function pointers in callbacks struct and struct
> {save,restore}_callbacks, and use this callback to send store mfn and
> console mfn to xl.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  tools/libxc/include/xenguest.h     | 8 ++++++++
>  tools/libxc/xc_sr_restore.c        | 8 ++++++--
>  tools/libxl/libxl_colo_restore.c   | 5 -----
>  tools/libxl/libxl_create.c         | 1 +
>  tools/libxl/libxl_save_msgs_gen.pl | 2 +-
>  5 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index d5902a6..50096b9 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -130,6 +130,14 @@ struct restore_callbacks {
>      /* Enable qemu-dm logging dirty pages to xen */
>      int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
>  
> +    /*
> +     * callback to send store mfn and console mfn to xl
> +     * if we want to resume vm before xc_domain_save()
> +     * exits.
> +     */
> +    void (*restore_results)(unsigned long store_mfn, unsigned long console_mfn,
> +                            void *data);
> +
>      /* callback to restore toolstack specific data */
>      int (*toolstack_restore)(uint32_t domid, const uint8_t *buf,
>              uint32_t size, void* data);
> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> index 982a70e..5e2efd8 100644
> --- a/tools/libxc/xc_sr_restore.c
> +++ b/tools/libxc/xc_sr_restore.c
> @@ -524,7 +524,10 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
>          if ( rc )
>              goto err;
>  
> -        /* TODO: call restore_results */
> +        /* call restore_results */

I would drop this comment.  It is entirely redundant now.

Otherwise, looks good.

~Andrew

> +        ctx->restore.callbacks->restore_results(ctx->restore.xenstore_gfn,
> +                                                ctx->restore.console_gfn,
> +                                                ctx->restore.callbacks->data);
>  
>          /* Resume secondary vm */
>          ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
> @@ -793,7 +796,8 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>          /* this is COLO restore */
>          assert(callbacks->suspend &&
>                 callbacks->checkpoint &&
> -               callbacks->postcopy);
> +               callbacks->postcopy &&
> +               callbacks->restore_results);
>      }
>  
>      IPRINTF("In experimental %s", __func__);
> diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
> index 6c39758..c613c15 100644
> --- a/tools/libxl/libxl_colo_restore.c
> +++ b/tools/libxl/libxl_colo_restore.c
> @@ -153,11 +153,6 @@ static void colo_resume_vm(libxl__egc *egc,
>          return;
>      }
>  
> -    /*
> -     * TODO: get store mfn and console mfn
> -     *  We should call the callback restore_results in
> -     *  xc_domain_restore() before resuming the guest.
> -     */
>      libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
>  
>      return;
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 1548b70..6e307f3 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -1157,6 +1157,7 @@ static void domcreate_bootloader_done(libxl__egc *egc,
>          rc = ERROR_INVAL;
>          goto out;
>      }
> +    callbacks->restore_results = libxl__srm_callout_callback_restore_results;
>  
>      if (checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
>          crs->ao = ao;
> diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
> index fbb2d67..2ecd25d 100755
> --- a/tools/libxl/libxl_save_msgs_gen.pl
> +++ b/tools/libxl/libxl_save_msgs_gen.pl
> @@ -32,7 +32,7 @@ our @msgs = (
>      #                toolstack_save          done entirely `by hand'
>      [  7, 'rcxW',   "toolstack_restore",     [qw(uint32_t domid
>                                                  BLOCK tsdata)] ],
> -    [  8, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
> +    [  8, 'rcx',    "restore_results",       ['unsigned long', 'store_mfn',
>                                                'unsigned long', 'console_mfn'] ],
>      [  9, 'srW',    "complete",              [qw(int retval
>                                                   int errnoval)] ],

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-08  3:45 ` [PATCH v6 COLO 06/15] libxc/save: support COLO save Yang Hongyang
@ 2015-06-08 13:04   ` Andrew Cooper
  2015-06-09  3:15     ` Yang Hongyang
  2015-06-09  3:18     ` Yang Hongyang
  0 siblings, 2 replies; 50+ messages in thread
From: Andrew Cooper @ 2015-06-08 13:04 UTC (permalink / raw)
  To: Yang Hongyang, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson

On 08/06/15 04:45, Yang Hongyang wrote:
> call callbacks->get_dirty_pfn() after suspend primary vm to
> get dirty pages on secondary vm, and send pages both dirty on
> primary/secondary to secondary.
>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  tools/libxc/xc_sr_save.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> index d63b783..cda61ed 100644
> --- a/tools/libxc/xc_sr_save.c
> +++ b/tools/libxc/xc_sr_save.c
> @@ -515,6 +515,31 @@ static int send_memory_live(struct xc_sr_context *ctx)
>      return rc;
>  }
>  
> +static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *), void *data,
> +                               unsigned long p2m_size, unsigned long *bitmap)

This function should take a ctx rather than having the caller expand 3
parameters.  Also, "update_dirty_bitmap" is a little misleading, as it
isn't querying the hypervisor for the dirty bitmap.

> +{
> +    uint64_t *pfn_list;
> +    uint64_t count, i;
> +    uint64_t pfn;
> +
> +    pfn_list = (uint64_t *)get_dirty_pfn(data);

This looks like a recipe for width-errors.  The get_dirty_pfn() call
should take a pointer to a struct for it to fill.

> +    assert(pfn_list);

This should turn into an error rather than an abort().

> +
> +    count = pfn_list[0];
> +    for (i = 0; i < count; i++) {

style

> +        pfn = pfn_list[i + 1];
> +        if (pfn > p2m_size) {
> +            errno = EINVAL;
> +            return -1;
> +        }
> +
> +        set_bit(pfn, bitmap);
> +    }
> +
> +    free(pfn_list);
> +    return 0;
> +}
> +
>  /*
>   * Suspend the domain and send dirty memory.
>   * This is the last iteration of the live migration and the
> @@ -555,6 +580,19 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
>  
>      bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
>  
> +    if ( !ctx->save.live && ctx->save.callbacks->get_dirty_pfn )
> +    {

Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
noop to start with) ?

~Andrew

> +        rc = update_dirty_bitmap(ctx->save.callbacks->get_dirty_pfn,
> +                                 ctx->save.callbacks->data,
> +                                 ctx->save.p2m_size,
> +                                 dirty_bitmap);
> +        if ( rc )
> +        {
> +            PERROR("Failed to get secondary vm's dirty pages");
> +            goto out;
> +        }
> +    }
> +
>      rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
>      if ( rc )
>          goto out;
> @@ -784,7 +822,16 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
>              if ( rc )
>                  goto err;
>  
> -            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
> +            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
> +            if ( !rc ) {
> +                if ( !errno )
> +                {
> +                    /* Postcopy request failed (without errno, using EINVAL) */
> +                    errno = EINVAL;
> +                }
> +                rc = -1;
> +                goto err;
> +            }
>  
>              rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
>              if ( rc <= 0 )

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 04/15] libxc/restore: support COLO restore
  2015-06-08 10:39   ` Andrew Cooper
@ 2015-06-08 14:06     ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08 14:06 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/08/2015 06:39 PM, Andrew Cooper wrote:
> On 08/06/15 04:45, Yang Hongyang wrote:
>> call the callbacks resume/checkpoint/suspend while secondary vm
>> status is consistent with primary.
>>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>   tools/libxc/xc_sr_common.h          | 11 +++++--
>>   tools/libxc/xc_sr_restore.c         | 63 ++++++++++++++++++++++++++++++++++++-
>>   tools/libxc/xc_sr_restore_x86_hvm.c |  1 +
>>   3 files changed, 72 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
>> index 565c5da..382bf76 100644
>> --- a/tools/libxc/xc_sr_common.h
>> +++ b/tools/libxc/xc_sr_common.h
>> @@ -132,8 +132,11 @@ struct xc_sr_restore_ops
>>        *
>>        * @return 0 for success, -1 for failure, or the sentinel value
>>        * RECORD_NOT_PROCESSED.
>> +     * BROKEN_CHANNEL: if we are under Remus/COLO, this means master may dead,
>> +     *                 we will failover.
>
> "this means that the master"

Thanks.

>
>>        */
>>   #define RECORD_NOT_PROCESSED 1
>> +#define BROKEN_CHANNEL 2
>>       int (*process_record)(struct xc_sr_context *ctx, struct xc_sr_record *rec);
>>
>>       /**
>> @@ -205,8 +208,12 @@ struct xc_sr_context
>>               uint32_t guest_type;
>>               uint32_t guest_page_size;
>>
>> -            /* Plain VM, or checkpoints over time. */
>> -            bool checkpointed;
>> +            /*
>> +             * 0: Plain VM
>> +             * 1: Remus
>> +             * 2: COLO
>> +             */
>> +            int checkpointed;
>
> I think this would be nicer as
>
> enum {
> STREAM_PLAIN,
> STREAM_REMUS,
> STREAM_COLO,
> } stream;
>
> perhaps?  It would reduce the use of a magic 2 in the code.

This is another place that I missed, good catch, and it's better, thank you.

>
>>
>>               /* Currently buffering records between a checkpoint */
>>               bool buffer_all_records;
>> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
>> index 2d2edd3..982a70e 100644
>> --- a/tools/libxc/xc_sr_restore.c
>> +++ b/tools/libxc/xc_sr_restore.c
>> @@ -1,4 +1,5 @@
>>   #include <arpa/inet.h>
>> +#include <assert.h>
>>
>>   #include "xc_sr_common.h"
>>
>> @@ -472,7 +473,7 @@ static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec);
>>   static int handle_checkpoint(struct xc_sr_context *ctx)
>>   {
>>       xc_interface *xch = ctx->xch;
>> -    int rc = 0;
>> +    int rc = 0, ret;
>>       unsigned i;
>>
>>       if ( !ctx->restore.checkpointed )
>> @@ -498,6 +499,46 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
>>       else
>>           ctx->restore.buffer_all_records = true;
>>
>> +    if ( ctx->restore.checkpointed == 2 )
>> +    {
>> +#define HANDLE_CALLBACK_RETURN_VALUE(ret)                   \
>
> I would ideally like to avoid macros like this in an effort to avoid the
> code slipping back into the state that the legacy code was in, but at
> least it is local to the area used.
>
>> +    do {                                                    \
>> +        if ( ret == 0 )                                     \
>> +        {                                                   \
>> +            /* Some internal error happens */               \
>> +            rc = -1;                                        \
>> +            goto err;                                       \
>> +        }                                                   \
>> +        else if ( ret == 2 )                                \
>> +        {                                                   \
>> +            /* Reading/writing error, do failover */        \
>> +            rc = BROKEN_CHANNEL;                            \
>> +            goto err;                                       \
>> +        }                                                   \
>> +    } while (0)
>
> This should have the logic inverted somewhat, to cover all possible
> values of ret, including the negative half.

yes, will fix in the next version.

>
> e.g.
>
> if ( ret == 1 )
>      rc = 0; /* Success */
> else
> {
>      if ( ret == 2 )
>          rc = BROKEN_CHANNEL;
>      else
>          rc = -1; /* Some unspecified error */
>      goto err;
> }
>
>> +
>> +        /* COLO */
>> +
>> +        /* We need to resume guest */
>> +        rc = ctx->restore.ops.stream_complete(ctx);
>> +        if ( rc )
>> +            goto err;
>> +
>> +        /* TODO: call restore_results */
>> +
>> +        /* Resume secondary vm */
>> +        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
>> +        HANDLE_CALLBACK_RETURN_VALUE(ret);
>> +
>> +        /* wait for new checkpoint */
>> +        ret = ctx->restore.callbacks->checkpoint(ctx->restore.callbacks->data);
>> +        HANDLE_CALLBACK_RETURN_VALUE(ret);
>> +
>> +        /* suspend secondary vm */
>> +        ret = ctx->restore.callbacks->suspend(ctx->restore.callbacks->data);
>> +        HANDLE_CALLBACK_RETURN_VALUE(ret);
>
> Please #undef HANDLE_CALLBACK_RETURN_VALUE here.

OK.

>
>> +    }
>> +
>>    err:
>>       return rc;
>>   }
>> @@ -678,6 +719,8 @@ static int restore(struct xc_sr_context *ctx)
>>                       goto err;
>>                   }
>>               }
>> +            else if ( rc == BROKEN_CHANNEL )
>> +                goto remus_failover;
>>               else if ( rc )
>>                   goto err;
>>           }
>> @@ -685,6 +728,15 @@ static int restore(struct xc_sr_context *ctx)
>>       } while ( rec.type != REC_TYPE_END );
>>
>>    remus_failover:
>> +
>> +    if ( ctx->restore.checkpointed == 2 )
>> +    {
>> +        /* With COLO, we have already called stream_complete */
>> +        rc = 0;
>> +        IPRINTF("COLO Failover");
>> +        goto done;
>> +    }
>> +
>>       /*
>>        * With Remus, if we reach here, there must be some error on primary,
>>        * failover from the last checkpoint state.
>> @@ -735,6 +787,15 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>>       ctx.restore.checkpointed = checkpointed_stream;
>>       ctx.restore.callbacks = callbacks;
>>
>> +    /* Sanity checks for callbacks. */
>> +    if ( ctx.restore.checkpointed == 2 )
>> +    {
>> +        /* this is COLO restore */
>> +        assert(callbacks->suspend &&
>> +               callbacks->checkpoint &&
>> +               callbacks->postcopy);
>
> FWIW, I need to make the ->checkpoint() callback used even in the remus
> case for qemu handling in libxl migration v2.

So this should be move out in libxl migration v2 support.

>
>> +    }
>> +
>>       IPRINTF("In experimental %s", __func__);
>>       DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
>>               ", checkpointed_stream %d", io_fd, dom, hvm, pae,
>> diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
>> index 06177e0..8e54c68 100644
>> --- a/tools/libxc/xc_sr_restore_x86_hvm.c
>> +++ b/tools/libxc/xc_sr_restore_x86_hvm.c
>> @@ -181,6 +181,7 @@ static int handle_qemu(struct xc_sr_context *ctx)
>>       if ( fp )
>>           fclose(fp);
>>       free(qbuf);
>> +    ctx->x86_hvm.restore.qbuf = NULL;
>
> This looks like an unrelated bugfix.

Yes, this is a bugfix. It won't trigger when in normal migration or Remus, but
in colo case, this will cause error because handle_qemu will be called multiple
times.

>
> ~Andrew
>
>>
>>       return rc;
>>   }
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm
  2015-06-08 12:16   ` Andrew Cooper
@ 2015-06-08 14:08     ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-08 14:08 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/08/2015 08:16 PM, Andrew Cooper wrote:
> On 08/06/15 04:45, Yang Hongyang wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> We will call libxl__xc_domain_restore_done() to rebuild secondary vm. But
>> we need store mfn and console mfn when rebuilding secondary vm. So make
>> restore_results is a function pointers in callbacks struct and struct
>> {save,restore}_callbacks, and use this callback to send store mfn and
>> console mfn to xl.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>   tools/libxc/include/xenguest.h     | 8 ++++++++
>>   tools/libxc/xc_sr_restore.c        | 8 ++++++--
>>   tools/libxl/libxl_colo_restore.c   | 5 -----
>>   tools/libxl/libxl_create.c         | 1 +
>>   tools/libxl/libxl_save_msgs_gen.pl | 2 +-
>>   5 files changed, 16 insertions(+), 8 deletions(-)
>>
>> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
>> index d5902a6..50096b9 100644
>> --- a/tools/libxc/include/xenguest.h
>> +++ b/tools/libxc/include/xenguest.h
>> @@ -130,6 +130,14 @@ struct restore_callbacks {
>>       /* Enable qemu-dm logging dirty pages to xen */
>>       int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
>>
>> +    /*
>> +     * callback to send store mfn and console mfn to xl
>> +     * if we want to resume vm before xc_domain_save()
>> +     * exits.
>> +     */
>> +    void (*restore_results)(unsigned long store_mfn, unsigned long console_mfn,
>> +                            void *data);
>> +
>>       /* callback to restore toolstack specific data */
>>       int (*toolstack_restore)(uint32_t domid, const uint8_t *buf,
>>               uint32_t size, void* data);
>> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
>> index 982a70e..5e2efd8 100644
>> --- a/tools/libxc/xc_sr_restore.c
>> +++ b/tools/libxc/xc_sr_restore.c
>> @@ -524,7 +524,10 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
>>           if ( rc )
>>               goto err;
>>
>> -        /* TODO: call restore_results */
>> +        /* call restore_results */
>
> I would drop this comment.  It is entirely redundant now.

will do, thanks.

>
> Otherwise, looks good.
>
> ~Andrew
>
>> +        ctx->restore.callbacks->restore_results(ctx->restore.xenstore_gfn,
>> +                                                ctx->restore.console_gfn,
>> +                                                ctx->restore.callbacks->data);
>>
>>           /* Resume secondary vm */
>>           ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
>> @@ -793,7 +796,8 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>>           /* this is COLO restore */
>>           assert(callbacks->suspend &&
>>                  callbacks->checkpoint &&
>> -               callbacks->postcopy);
>> +               callbacks->postcopy &&
>> +               callbacks->restore_results);
>>       }
>>
>>       IPRINTF("In experimental %s", __func__);
>> diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
>> index 6c39758..c613c15 100644
>> --- a/tools/libxl/libxl_colo_restore.c
>> +++ b/tools/libxl/libxl_colo_restore.c
>> @@ -153,11 +153,6 @@ static void colo_resume_vm(libxl__egc *egc,
>>           return;
>>       }
>>
>> -    /*
>> -     * TODO: get store mfn and console mfn
>> -     *  We should call the callback restore_results in
>> -     *  xc_domain_restore() before resuming the guest.
>> -     */
>>       libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
>>
>>       return;
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 1548b70..6e307f3 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -1157,6 +1157,7 @@ static void domcreate_bootloader_done(libxl__egc *egc,
>>           rc = ERROR_INVAL;
>>           goto out;
>>       }
>> +    callbacks->restore_results = libxl__srm_callout_callback_restore_results;
>>
>>       if (checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
>>           crs->ao = ao;
>> diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
>> index fbb2d67..2ecd25d 100755
>> --- a/tools/libxl/libxl_save_msgs_gen.pl
>> +++ b/tools/libxl/libxl_save_msgs_gen.pl
>> @@ -32,7 +32,7 @@ our @msgs = (
>>       #                toolstack_save          done entirely `by hand'
>>       [  7, 'rcxW',   "toolstack_restore",     [qw(uint32_t domid
>>                                                   BLOCK tsdata)] ],
>> -    [  8, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
>> +    [  8, 'rcx',    "restore_results",       ['unsigned long', 'store_mfn',
>>                                                 'unsigned long', 'console_mfn'] ],
>>       [  9, 'srW',    "complete",              [qw(int retval
>>                                                    int errnoval)] ],
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-08 13:04   ` Andrew Cooper
@ 2015-06-09  3:15     ` Yang Hongyang
  2015-06-09  7:20       ` Andrew Cooper
  2015-06-09  3:18     ` Yang Hongyang
  1 sibling, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-09  3:15 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/08/2015 09:04 PM, Andrew Cooper wrote:
> On 08/06/15 04:45, Yang Hongyang wrote:
>> call callbacks->get_dirty_pfn() after suspend primary vm to
>> get dirty pages on secondary vm, and send pages both dirty on
>> primary/secondary to secondary.
>>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>   tools/libxc/xc_sr_save.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
>> index d63b783..cda61ed 100644
>> --- a/tools/libxc/xc_sr_save.c
>> +++ b/tools/libxc/xc_sr_save.c
>> @@ -515,6 +515,31 @@ static int send_memory_live(struct xc_sr_context *ctx)
>>       return rc;
>>   }
>>
>> +static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *), void *data,
>> +                               unsigned long p2m_size, unsigned long *bitmap)
>
> This function should take a ctx rather than having the caller expand 3
> parameters.  Also, "update_dirty_bitmap" is a little misleading, as it
> isn't querying the hypervisor for the dirty bitmap.

ok.

>
>> +{
>> +    uint64_t *pfn_list;
>> +    uint64_t count, i;
>> +    uint64_t pfn;
>> +
>> +    pfn_list = (uint64_t *)get_dirty_pfn(data);
>
> This looks like a recipe for width-errors.  The get_dirty_pfn() call
> should take a pointer to a struct for it to fill.

but the size is unknown for the caller.pfn_list[0] is the count of
pfn.

>
>> +    assert(pfn_list);
>
> This should turn into an error rather than an abort().

Even if there are no dirty pages on secondary, pfn_list shouldn't be
NULL, it's just that pfn_list[0] will be 0. if pfn_list is NULL,
there might be unexpected error happened.

>
>> +
>> +    count = pfn_list[0];
>> +    for (i = 0; i < count; i++) {
>
> style
>
>> +        pfn = pfn_list[i + 1];
>> +        if (pfn > p2m_size) {
>> +            errno = EINVAL;
>> +            return -1;
>> +        }
>> +
>> +        set_bit(pfn, bitmap);
>> +    }
>> +
>> +    free(pfn_list);
>> +    return 0;
>> +}
>> +
>>   /*
>>    * Suspend the domain and send dirty memory.
>>    * This is the last iteration of the live migration and the
>> @@ -555,6 +580,19 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
>>
>>       bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
>>
>> +    if ( !ctx->save.live && ctx->save.callbacks->get_dirty_pfn )
>> +    {
>
> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
> noop to start with) ?

It should be mandatory, it shouldn't be noop under COLO. perhaps we should
add sanity check at the beginning. But problem is save side do not have a param
passed from libxl to indicate the stream type(like checkpointed_stream in
restore side). So we may need to add another XCFLAGS? Currently there is
XCFLAGS_CHECKPOINTED which represents Remus, we might need to change this to
XCFLAGS_STREAM_REMUS
XCFLAGS_STREAM_COLO
so that we can know what kind of stream we are handling?

>
> ~Andrew
>
>> +        rc = update_dirty_bitmap(ctx->save.callbacks->get_dirty_pfn,
>> +                                 ctx->save.callbacks->data,
>> +                                 ctx->save.p2m_size,
>> +                                 dirty_bitmap);
>> +        if ( rc )
>> +        {
>> +            PERROR("Failed to get secondary vm's dirty pages");
>> +            goto out;
>> +        }
>> +    }
>> +
>>       rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
>>       if ( rc )
>>           goto out;
>> @@ -784,7 +822,16 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
>>               if ( rc )
>>                   goto err;
>>
>> -            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>> +            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>> +            if ( !rc ) {
>> +                if ( !errno )
>> +                {
>> +                    /* Postcopy request failed (without errno, using EINVAL) */
>> +                    errno = EINVAL;
>> +                }
>> +                rc = -1;
>> +                goto err;
>> +            }
>>
>>               rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
>>               if ( rc <= 0 )
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-08 13:04   ` Andrew Cooper
  2015-06-09  3:15     ` Yang Hongyang
@ 2015-06-09  3:18     ` Yang Hongyang
  1 sibling, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-09  3:18 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/08/2015 09:04 PM, Andrew Cooper wrote:
> On 08/06/15 04:45, Yang Hongyang wrote:
>> call callbacks->get_dirty_pfn() after suspend primary vm to
>> get dirty pages on secondary vm, and send pages both dirty on
>> primary/secondary to secondary.
>>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>   tools/libxc/xc_sr_save.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
>> index d63b783..cda61ed 100644
>> --- a/tools/libxc/xc_sr_save.c
>> +++ b/tools/libxc/xc_sr_save.c
>> @@ -515,6 +515,31 @@ static int send_memory_live(struct xc_sr_context *ctx)
>>       return rc;
>>   }
>>
>> +static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *), void *data,
>> +                               unsigned long p2m_size, unsigned long *bitmap)
>
> This function should take a ctx rather than having the caller expand 3
> parameters.  Also, "update_dirty_bitmap" is a little misleading, as it
> isn't querying the hypervisor for the dirty bitmap.

how about merge_secondary_dirty_bitmap()?

>
>> +{
>> +    uint64_t *pfn_list;
>> +    uint64_t count, i;
>> +    uint64_t pfn;
>> +
>> +    pfn_list = (uint64_t *)get_dirty_pfn(data);
>
> This looks like a recipe for width-errors.  The get_dirty_pfn() call
> should take a pointer to a struct for it to fill.
>
>> +    assert(pfn_list);
>
> This should turn into an error rather than an abort().
>
>> +
>> +    count = pfn_list[0];
>> +    for (i = 0; i < count; i++) {
>
> style
>
>> +        pfn = pfn_list[i + 1];
>> +        if (pfn > p2m_size) {
>> +            errno = EINVAL;
>> +            return -1;
>> +        }
>> +
>> +        set_bit(pfn, bitmap);
>> +    }
>> +
>> +    free(pfn_list);
>> +    return 0;
>> +}
>> +
>>   /*
>>    * Suspend the domain and send dirty memory.
>>    * This is the last iteration of the live migration and the
>> @@ -555,6 +580,19 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
>>
>>       bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
>>
>> +    if ( !ctx->save.live && ctx->save.callbacks->get_dirty_pfn )
>> +    {
>
> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
> noop to start with) ?
>
> ~Andrew
>
>> +        rc = update_dirty_bitmap(ctx->save.callbacks->get_dirty_pfn,
>> +                                 ctx->save.callbacks->data,
>> +                                 ctx->save.p2m_size,
>> +                                 dirty_bitmap);
>> +        if ( rc )
>> +        {
>> +            PERROR("Failed to get secondary vm's dirty pages");
>> +            goto out;
>> +        }
>> +    }
>> +
>>       rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
>>       if ( rc )
>>           goto out;
>> @@ -784,7 +822,16 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
>>               if ( rc )
>>                   goto err;
>>
>> -            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>> +            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>> +            if ( !rc ) {
>> +                if ( !errno )
>> +                {
>> +                    /* Postcopy request failed (without errno, using EINVAL) */
>> +                    errno = EINVAL;
>> +                }
>> +                rc = -1;
>> +                goto err;
>> +            }
>>
>>               rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
>>               if ( rc <= 0 )
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-09  3:15     ` Yang Hongyang
@ 2015-06-09  7:20       ` Andrew Cooper
  2015-06-09  8:45         ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2015-06-09  7:20 UTC (permalink / raw)
  To: Yang Hongyang, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson

On 09/06/2015 04:15, Yang Hongyang wrote:
>
>
> On 06/08/2015 09:04 PM, Andrew Cooper wrote:
>> On 08/06/15 04:45, Yang Hongyang wrote:
>>> call callbacks->get_dirty_pfn() after suspend primary vm to
>>> get dirty pages on secondary vm, and send pages both dirty on
>>> primary/secondary to secondary.
>>>
>>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>>   tools/libxc/xc_sr_save.c | 49
>>> +++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 48 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
>>> index d63b783..cda61ed 100644
>>> --- a/tools/libxc/xc_sr_save.c
>>> +++ b/tools/libxc/xc_sr_save.c
>>> @@ -515,6 +515,31 @@ static int send_memory_live(struct
>>> xc_sr_context *ctx)
>>>       return rc;
>>>   }
>>>
>>> +static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *),
>>> void *data,
>>> +                               unsigned long p2m_size, unsigned
>>> long *bitmap)
>>
>> This function should take a ctx rather than having the caller expand 3
>> parameters.  Also, "update_dirty_bitmap" is a little misleading, as it
>> isn't querying the hypervisor for the dirty bitmap.
>
> ok.

(Merging the other thread)

> how about merge_secondary_dirty_bitmap()? 

Much better!

>
>>
>>> +{
>>> +    uint64_t *pfn_list;
>>> +    uint64_t count, i;
>>> +    uint64_t pfn;
>>> +
>>> +    pfn_list = (uint64_t *)get_dirty_pfn(data);
>>
>> This looks like a recipe for width-errors.  The get_dirty_pfn() call
>> should take a pointer to a struct for it to fill.
>
> but the size is unknown for the caller.pfn_list[0] is the count of
> pfn.
>
>>
>>> +    assert(pfn_list);
>>
>> This should turn into an error rather than an abort().
>
> Even if there are no dirty pages on secondary, pfn_list shouldn't be
> NULL, it's just that pfn_list[0] will be 0. if pfn_list is NULL,
> there might be unexpected error happened.

get_dirty_pfn() should be declared alongside a

struct pfn_data
{
    uint64_t count;
    uint64_t *pfns;
};

and this function here should create one of these on the stack and pass
it by pointer to get_dirty_pfn().  I might also be tempted to rename
this to get_remote_logdirty() or similar, to indicate that it is a
source of logdirty data from something other than the current hypervisor.

>
>>
>>> +
>>> +    count = pfn_list[0];
>>> +    for (i = 0; i < count; i++) {
>>
>> style
>>
>>> +        pfn = pfn_list[i + 1];
>>> +        if (pfn > p2m_size) {
>>> +            errno = EINVAL;
>>> +            return -1;
>>> +        }
>>> +
>>> +        set_bit(pfn, bitmap);
>>> +    }
>>> +
>>> +    free(pfn_list);
>>> +    return 0;
>>> +}
>>> +
>>>   /*
>>>    * Suspend the domain and send dirty memory.
>>>    * This is the last iteration of the live migration and the
>>> @@ -555,6 +580,19 @@ static int suspend_and_send_dirty(struct
>>> xc_sr_context *ctx)
>>>
>>>       bitmap_or(dirty_bitmap, ctx->save.deferred_pages,
>>> ctx->save.p2m_size);
>>>
>>> +    if ( !ctx->save.live && ctx->save.callbacks->get_dirty_pfn )
>>> +    {
>>
>> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
>> noop to start with) ?
>
> It should be mandatory, it shouldn't be noop under COLO. perhaps we
> should
> add sanity check at the beginning. But problem is save side do not
> have a param
> passed from libxl to indicate the stream type(like checkpointed_stream in
> restore side). So we may need to add another XCFLAGS? Currently there is
> XCFLAGS_CHECKPOINTED which represents Remus, we might need to change
> this to
> XCFLAGS_STREAM_REMUS
> XCFLAGS_STREAM_COLO
> so that we can know what kind of stream we are handling?

checkpointed_stream started out as a bugfix for a legacy stream
migration breakage.  Really, this information should have been passed
right from the start.

It would probably be best to take the enum{} suggested elsewhere and
make it a top level ctx item, and have it present for both save and
restore, with sutable parameters passed in from the top.  (When I am
finally able to take out the legacy code, there is going to be a severe
pruning/consolidation of the parameters.)

~Andrew

>
>>
>> ~Andrew
>>
>>> +        rc = update_dirty_bitmap(ctx->save.callbacks->get_dirty_pfn,
>>> +                                 ctx->save.callbacks->data,
>>> +                                 ctx->save.p2m_size,
>>> +                                 dirty_bitmap);
>>> +        if ( rc )
>>> +        {
>>> +            PERROR("Failed to get secondary vm's dirty pages");
>>> +            goto out;
>>> +        }
>>> +    }
>>> +
>>>       rc = send_dirty_pages(ctx, stats.dirty_count +
>>> ctx->save.nr_deferred_pages);
>>>       if ( rc )
>>>           goto out;
>>> @@ -784,7 +822,16 @@ static int save(struct xc_sr_context *ctx,
>>> uint16_t guest_type)
>>>               if ( rc )
>>>                   goto err;
>>>
>>> -            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>>> +            rc =
>>> ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>>> +            if ( !rc ) {
>>> +                if ( !errno )
>>> +                {
>>> +                    /* Postcopy request failed (without errno,
>>> using EINVAL) */
>>> +                    errno = EINVAL;
>>> +                }
>>> +                rc = -1;
>>> +                goto err;
>>> +            }
>>>
>>>               rc =
>>> ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
>>>               if ( rc <= 0 )
>>
>> .
>>
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-09  7:20       ` Andrew Cooper
@ 2015-06-09  8:45         ` Yang Hongyang
  2015-06-09  8:51           ` Andrew Cooper
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-09  8:45 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/09/2015 03:20 PM, Andrew Cooper wrote:
> On 09/06/2015 04:15, Yang Hongyang wrote:
>>
>>
>> On 06/08/2015 09:04 PM, Andrew Cooper wrote:
>>> On 08/06/15 04:45, Yang Hongyang wrote:
>>>> call callbacks->get_dirty_pfn() after suspend primary vm to
>>>> get dirty pages on secondary vm, and send pages both dirty on
>>>> primary/secondary to secondary.
>>>>
>>>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> ---
>>>>    tools/libxc/xc_sr_save.c | 49
>>>> +++++++++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 48 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
>>>> index d63b783..cda61ed 100644
>>>> --- a/tools/libxc/xc_sr_save.c
>>>> +++ b/tools/libxc/xc_sr_save.c
>>>> @@ -515,6 +515,31 @@ static int send_memory_live(struct
>>>> xc_sr_context *ctx)
>>>>        return rc;
>>>>    }
>>>>
>>>> +static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *),
>>>> void *data,
>>>> +                               unsigned long p2m_size, unsigned
>>>> long *bitmap)
>>>
>>> This function should take a ctx rather than having the caller expand 3
>>> parameters.  Also, "update_dirty_bitmap" is a little misleading, as it
>>> isn't querying the hypervisor for the dirty bitmap.
>>
>> ok.
>
> (Merging the other thread)
>
>> how about merge_secondary_dirty_bitmap()?
>
> Much better!
>
>>
>>>
>>>> +{
>>>> +    uint64_t *pfn_list;
>>>> +    uint64_t count, i;
>>>> +    uint64_t pfn;
>>>> +
>>>> +    pfn_list = (uint64_t *)get_dirty_pfn(data);
>>>
>>> This looks like a recipe for width-errors.  The get_dirty_pfn() call
>>> should take a pointer to a struct for it to fill.
>>
>> but the size is unknown for the caller.pfn_list[0] is the count of
>> pfn.
>>
>>>
>>>> +    assert(pfn_list);
>>>
>>> This should turn into an error rather than an abort().
>>
>> Even if there are no dirty pages on secondary, pfn_list shouldn't be
>> NULL, it's just that pfn_list[0] will be 0. if pfn_list is NULL,
>> there might be unexpected error happened.
>
> get_dirty_pfn() should be declared alongside a
>
> struct pfn_data
> {
>      uint64_t count;
>      uint64_t *pfns;
> };
>
> and this function here should create one of these on the stack and pass
> it by pointer to get_dirty_pfn().  I might also be tempted to rename
> this to get_remote_logdirty() or similar, to indicate that it is a
> source of logdirty data from something other than the current hypervisor.

This is a callback, I can't find a way to pass pointer from libxc to libxl,
libxl can not access the pointer data...The struct can be used for represent
the data however.

I like with the rename part, sounds much better.

>
>>
>>>
>>>> +
>>>> +    count = pfn_list[0];
>>>> +    for (i = 0; i < count; i++) {
>>>
>>> style
>>>
>>>> +        pfn = pfn_list[i + 1];
>>>> +        if (pfn > p2m_size) {
>>>> +            errno = EINVAL;
>>>> +            return -1;
>>>> +        }
>>>> +
>>>> +        set_bit(pfn, bitmap);
>>>> +    }
>>>> +
>>>> +    free(pfn_list);
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Suspend the domain and send dirty memory.
>>>>     * This is the last iteration of the live migration and the
>>>> @@ -555,6 +580,19 @@ static int suspend_and_send_dirty(struct
>>>> xc_sr_context *ctx)
>>>>
>>>>        bitmap_or(dirty_bitmap, ctx->save.deferred_pages,
>>>> ctx->save.p2m_size);
>>>>
>>>> +    if ( !ctx->save.live && ctx->save.callbacks->get_dirty_pfn )
>>>> +    {
>>>
>>> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
>>> noop to start with) ?
>>
>> It should be mandatory, it shouldn't be noop under COLO. perhaps we
>> should
>> add sanity check at the beginning. But problem is save side do not
>> have a param
>> passed from libxl to indicate the stream type(like checkpointed_stream in
>> restore side). So we may need to add another XCFLAGS? Currently there is
>> XCFLAGS_CHECKPOINTED which represents Remus, we might need to change
>> this to
>> XCFLAGS_STREAM_REMUS
>> XCFLAGS_STREAM_COLO
>> so that we can know what kind of stream we are handling?
>
> checkpointed_stream started out as a bugfix for a legacy stream
> migration breakage.  Really, this information should have been passed
> right from the start.

Did I miss the bugfix? is it not in upstream?

>
> It would probably be best to take the enum{} suggested elsewhere and
> make it a top level ctx item, and have it present for both save and
> restore, with sutable parameters passed in from the top.  (When I am
> finally able to take out the legacy code, there is going to be a severe
> pruning/consolidation of the parameters.)

This is what I thought when I saw the enum{} suggested.

>
> ~Andrew
>
>>
>>>
>>> ~Andrew
>>>
>>>> +        rc = update_dirty_bitmap(ctx->save.callbacks->get_dirty_pfn,
>>>> +                                 ctx->save.callbacks->data,
>>>> +                                 ctx->save.p2m_size,
>>>> +                                 dirty_bitmap);
>>>> +        if ( rc )
>>>> +        {
>>>> +            PERROR("Failed to get secondary vm's dirty pages");
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>>        rc = send_dirty_pages(ctx, stats.dirty_count +
>>>> ctx->save.nr_deferred_pages);
>>>>        if ( rc )
>>>>            goto out;
>>>> @@ -784,7 +822,16 @@ static int save(struct xc_sr_context *ctx,
>>>> uint16_t guest_type)
>>>>                if ( rc )
>>>>                    goto err;
>>>>
>>>> -            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>>>> +            rc =
>>>> ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>>>> +            if ( !rc ) {
>>>> +                if ( !errno )
>>>> +                {
>>>> +                    /* Postcopy request failed (without errno,
>>>> using EINVAL) */
>>>> +                    errno = EINVAL;
>>>> +                }
>>>> +                rc = -1;
>>>> +                goto err;
>>>> +            }
>>>>
>>>>                rc =
>>>> ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
>>>>                if ( rc <= 0 )
>>>
>>> .
>>>
>>
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-09  8:45         ` Yang Hongyang
@ 2015-06-09  8:51           ` Andrew Cooper
  2015-06-09  9:09             ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2015-06-09  8:51 UTC (permalink / raw)
  To: Yang Hongyang, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson

On 09/06/15 09:45, Yang Hongyang wrote:
>
>>> Even if there are no dirty pages on secondary, pfn_list shouldn't be
>>> NULL, it's just that pfn_list[0] will be 0. if pfn_list is NULL,
>>> there might be unexpected error happened.
>>
>> get_dirty_pfn() should be declared alongside a
>>
>> struct pfn_data
>> {
>>      uint64_t count;
>>      uint64_t *pfns;
>> };
>>
>> and this function here should create one of these on the stack and pass
>> it by pointer to get_dirty_pfn().  I might also be tempted to rename
>> this to get_remote_logdirty() or similar, to indicate that it is a
>> source of logdirty data from something other than the current
>> hypervisor.
>
> This is a callback, I can't find a way to pass pointer from libxc to
> libxl,
> libxl can not access the pointer data...The struct can be used for
> represent
> the data however.

Right - my point is that it should be the implementation of
get_remote_logdirty() (i.e. in libxl_save_helper) which is responsible
for unpackaging the data from whatever RPC method is used, rather than
the caller.

>
>>>> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
>>>> noop to start with) ?
>>>
>>> It should be mandatory, it shouldn't be noop under COLO. perhaps we
>>> should
>>> add sanity check at the beginning. But problem is save side do not
>>> have a param
>>> passed from libxl to indicate the stream type(like
>>> checkpointed_stream in
>>> restore side). So we may need to add another XCFLAGS? Currently
>>> there is
>>> XCFLAGS_CHECKPOINTED which represents Remus, we might need to change
>>> this to
>>> XCFLAGS_STREAM_REMUS
>>> XCFLAGS_STREAM_COLO
>>> so that we can know what kind of stream we are handling?
>>
>> checkpointed_stream started out as a bugfix for a legacy stream
>> migration breakage.  Really, this information should have been passed
>> right from the start.
>
> Did I miss the bugfix? is it not in upstream?

c/s 7051d5c

~Andrew

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-09  8:51           ` Andrew Cooper
@ 2015-06-09  9:09             ` Yang Hongyang
  2015-06-09  9:10               ` Andrew Cooper
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-09  9:09 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/09/2015 04:51 PM, Andrew Cooper wrote:
> On 09/06/15 09:45, Yang Hongyang wrote:
>>
>>>> Even if there are no dirty pages on secondary, pfn_list shouldn't be
>>>> NULL, it's just that pfn_list[0] will be 0. if pfn_list is NULL,
>>>> there might be unexpected error happened.
>>>
>>> get_dirty_pfn() should be declared alongside a
>>>
>>> struct pfn_data
>>> {
>>>       uint64_t count;
>>>       uint64_t *pfns;
>>> };
>>>
>>> and this function here should create one of these on the stack and pass
>>> it by pointer to get_dirty_pfn().  I might also be tempted to rename
>>> this to get_remote_logdirty() or similar, to indicate that it is a
>>> source of logdirty data from something other than the current
>>> hypervisor.
>>
>> This is a callback, I can't find a way to pass pointer from libxc to
>> libxl,
>> libxl can not access the pointer data...The struct can be used for
>> represent
>> the data however.
>
> Right - my point is that it should be the implementation of
> get_remote_logdirty() (i.e. in libxl_save_helper) which is responsible
> for unpackaging the data from whatever RPC method is used, rather than
> the caller.

Now I know what you mean, I will fix it in the next version, thanks!

>
>>
>>>>> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it is a
>>>>> noop to start with) ?
>>>>
>>>> It should be mandatory, it shouldn't be noop under COLO. perhaps we
>>>> should
>>>> add sanity check at the beginning. But problem is save side do not
>>>> have a param
>>>> passed from libxl to indicate the stream type(like
>>>> checkpointed_stream in
>>>> restore side). So we may need to add another XCFLAGS? Currently
>>>> there is
>>>> XCFLAGS_CHECKPOINTED which represents Remus, we might need to change
>>>> this to
>>>> XCFLAGS_STREAM_REMUS
>>>> XCFLAGS_STREAM_COLO
>>>> so that we can know what kind of stream we are handling?
>>>
>>> checkpointed_stream started out as a bugfix for a legacy stream
>>> migration breakage.  Really, this information should have been passed
>>> right from the start.
>>
>> Did I miss the bugfix? is it not in upstream?
>
> c/s 7051d5c

Ah, you are talking about the restore side, I'm talking about the save
side checkpointed_stream, so I should also post a prereq patch to
add checkpointed_stream to the save side? or there's already the
fix out there?

>
> ~Andrew
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-09  9:09             ` Yang Hongyang
@ 2015-06-09  9:10               ` Andrew Cooper
  2015-06-09  9:16                 ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2015-06-09  9:10 UTC (permalink / raw)
  To: Yang Hongyang, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson

On 09/06/15 10:09, Yang Hongyang wrote:
>
>
>>
>>>
>>>>>> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it
>>>>>> is a
>>>>>> noop to start with) ?
>>>>>
>>>>> It should be mandatory, it shouldn't be noop under COLO. perhaps we
>>>>> should
>>>>> add sanity check at the beginning. But problem is save side do not
>>>>> have a param
>>>>> passed from libxl to indicate the stream type(like
>>>>> checkpointed_stream in
>>>>> restore side). So we may need to add another XCFLAGS? Currently
>>>>> there is
>>>>> XCFLAGS_CHECKPOINTED which represents Remus, we might need to change
>>>>> this to
>>>>> XCFLAGS_STREAM_REMUS
>>>>> XCFLAGS_STREAM_COLO
>>>>> so that we can know what kind of stream we are handling?
>>>>
>>>> checkpointed_stream started out as a bugfix for a legacy stream
>>>> migration breakage.  Really, this information should have been passed
>>>> right from the start.
>>>
>>> Did I miss the bugfix? is it not in upstream?
>>
>> c/s 7051d5c
>
> Ah, you are talking about the restore side, I'm talking about the save
> side checkpointed_stream, so I should also post a prereq patch to
> add checkpointed_stream to the save side? or there's already the
> fix out there?

Sorry for being unclear.  You will have to add one to the save side. 
The restore side only has one as a bugfix.

~Andrew

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 06/15] libxc/save: support COLO save
  2015-06-09  9:10               ` Andrew Cooper
@ 2015-06-09  9:16                 ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-09  9:16 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: wei.liu2, ian.campbell, wency, guijianfeng, yunhong.jiang,
	eddie.dong, rshriram, ian.jackson



On 06/09/2015 05:10 PM, Andrew Cooper wrote:
> On 09/06/15 10:09, Yang Hongyang wrote:
>>
>>
>>>
>>>>
>>>>>>> Shouldn't get_dirty_pfn be mandatory for COLO streams (even if it
>>>>>>> is a
>>>>>>> noop to start with) ?
>>>>>>
>>>>>> It should be mandatory, it shouldn't be noop under COLO. perhaps we
>>>>>> should
>>>>>> add sanity check at the beginning. But problem is save side do not
>>>>>> have a param
>>>>>> passed from libxl to indicate the stream type(like
>>>>>> checkpointed_stream in
>>>>>> restore side). So we may need to add another XCFLAGS? Currently
>>>>>> there is
>>>>>> XCFLAGS_CHECKPOINTED which represents Remus, we might need to change
>>>>>> this to
>>>>>> XCFLAGS_STREAM_REMUS
>>>>>> XCFLAGS_STREAM_COLO
>>>>>> so that we can know what kind of stream we are handling?
>>>>>
>>>>> checkpointed_stream started out as a bugfix for a legacy stream
>>>>> migration breakage.  Really, this information should have been passed
>>>>> right from the start.
>>>>
>>>> Did I miss the bugfix? is it not in upstream?
>>>
>>> c/s 7051d5c
>>
>> Ah, you are talking about the restore side, I'm talking about the save
>> side checkpointed_stream, so I should also post a prereq patch to
>> add checkpointed_stream to the save side? or there's already the
>> fix out there?
>
> Sorry for being unclear.  You will have to add one to the save side.
> The restore side only has one as a bugfix.

Got it~ thanks!

>
> ~Andrew
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code
  2015-06-08  3:45 ` [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code Yang Hongyang
@ 2015-06-12 14:23   ` Wei Liu
  2015-06-12 14:51     ` Ian Jackson
  2015-06-15  1:55     ` Yang Hongyang
  0 siblings, 2 replies; 50+ messages in thread
From: Wei Liu @ 2015-06-12 14:23 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, Jun 08, 2015 at 11:45:46AM +0800, Yang Hongyang wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Secondary vm is running in colo mode. So we will do
> the following things again and again:
> 1. Resume secondary vm
>    a. Send LIBXL_COLO_SVM_READY to master.
>    b. If it is not the first resume, call libxl__checkpoint_devices_preresume().
>    c. If it is the first resume(resume right after live migration),
>       - call libxl__xc_domain_restore_done() to build the secondary vm.
>       - enable secondary vm's logdirty.
>       - call libxl__domain_resume() to resume secondary vm.
>       - call libxl__checkpoint_devices_setup() to setup checkpoint devices.
>    d. Send LIBXL_COLO_SVM_RESUMED to master.
> 2. Wait a new checkpoint
>    a. Call libxl__checkpoint_devices_commit().
>    b. Read LIBXL_COLO_NEW_CHECKPOINT from master.
> 3. Suspend secondary vm
>    a. Suspend secondary vm.
>    b. Call libxl__checkpoint_devices_postsuspend().
>    c. Get secondary vm's dirty page information.
>    d. Send LIBXL_COLO_SVM_SUSPENDED to master.
>    e. Send secondary vm's dirty page information to master(count + pfn list).
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> ---
[...]
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + */
> +
> +#ifndef LIBXL_COLO_H
> +#define LIBXL_COLO_H
> +
> +/*
> + * values to control suspend/resume primary vm and secondary vm
> + * at the same time
> + */
> +enum {
> +    LIBXL_COLO_NEW_CHECKPOINT = 1,
> +    LIBXL_COLO_SVM_SUSPENDED,
> +    LIBXL_COLO_SVM_READY,
> +    LIBXL_COLO_SVM_RESUMED,
> +};
> +

Any reason to not have this in IDL?

> +extern void libxl__colo_restore_done(libxl__egc *egc, void *dcs_void,
> +                                     int ret, int retval, int errnoval);
> +extern void libxl__colo_restore_setup(libxl__egc *egc,
> +                                      libxl__colo_restore_state *crs);
> +extern void libxl__colo_restore_teardown(libxl__egc *egc,
> +                                         libxl__colo_restore_state *crs,
> +                                         int rc);
> +
> +#endif
> diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
> new file mode 100644
> index 0000000..6c39758
> --- /dev/null
> +++ b/tools/libxl/libxl_colo_restore.c
> @@ -0,0 +1,1158 @@
> +/*
> + * Copyright (C) 2014 FUJITSU LIMITED
> + * Author: Wen Congyang <wency@cn.fujitsu.com>
> + *         Yang Hongyang <yanghy@cn.fujitsu.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as published
> + * by the Free Software Foundation; version 2.1 only. with the special
> + * exception on linking described in file LICENSE.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + */
> +
> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#include "libxl_internal.h"
> +#include "libxl_colo.h"
> +#include "xc_bitops.h"
> +
> +#define XC_PAGE_SHIFT           12
> +#define PAGE_SHIFT              XC_PAGE_SHIFT

I don't think you need these.

> +#define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
> +#define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)

And you can use XC_PAGE_SHIFT directly in above macro.

> +
> +enum {
> +    LIBXL_COLO_SETUPED,
> +    LIBXL_COLO_SUSPENDED,
> +    LIBXL_COLO_RESUMED,
> +};
> +

Move it to IDL as well?

> +typedef struct libxl__colo_restore_checkpoint_state libxl__colo_restore_checkpoint_state;
> +struct libxl__colo_restore_checkpoint_state {
> +    xc_hypercall_buffer_t _dirty_bitmap;
> +    xc_hypercall_buffer_t *dirty_bitmap;

This one looks like layer violation to me. I don't have other good
suggestion on how to do this though. Maybe Ian and Ian have better idea.

> +    unsigned long p2m_size;
> +    libxl__domain_suspend_state dsps;
> +    libxl__datacopier_state dc;
> +    uint8_t section;

This could use a better name like "stage" / "state"?

> +    libxl__logdirty_switch lds;
> +    libxl__colo_restore_state *crs;
> +    int status;
> +    bool preresume;
> +    /* used for teardown */
> +    int teardown_devices;
> +    int saved_rc;
> +
> +    void (*callback)(libxl__egc *,
> +                     libxl__colo_restore_checkpoint_state *,
> +                     int);
> +
> +    /*
> +     * 0: secondary vm's dirty bitmap for domain @domid
> +     * 1: secondary vm is ready(domain @domid)
> +     * 2: secondary vm is resumed(domain @domid)
> +     * 3. new checkpoint is triggered(domain @domid)
> +     */
> +    const char *copywhat[4];
> +};
> +
> +
> +static void libxl__colo_restore_domain_resume_callback(void *data);
> +static void libxl__colo_restore_domain_checkpoint_callback(void *data);
> +static void libxl__colo_restore_domain_suspend_callback(void *data);
> +
> +static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
> +    NULL,
> +};
> +
[...]
> +    crcs->status = LIBXL_COLO_RESUMED;
> +
> +    /* avoid calling libxl__xc_domain_restore_done() more than once */
> +    if (crs->saved_cb) {
> +        dcs->callback = crs->saved_cb;
> +        crs->saved_cb = NULL;
> +

I have a feeling that this trick should be avoided. But I'm not an
expert on this so I will defer judgement to Ian J.

> +        lds->callback = colo_enable_logdirty_done;
> +        colo_enable_logdirty(crs, egc);
> +        return;
> +    }
> +
> +    colo_write_svm_resumed(egc, crcs);
> +    return;
> +
> +out:
> +    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
> +}
> +
[...]
>  
> +_hidden void logdirty_init(libxl__logdirty_switch *lds);
> +

This function should be in libxl__ namespace.

Other than these cosmetic issues I don't really have the expertise to
comment further.

Wei.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind
  2015-06-08  3:45 ` [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind Yang Hongyang
@ 2015-06-12 14:35   ` Wei Liu
  2015-06-15  2:13     ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Liu @ 2015-06-12 14:35 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, Jun 08, 2015 at 11:45:56AM +0800, Yang Hongyang wrote:
> implement COLO nic subkind.
> 
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/hotplug/Linux/Makefile         |   1 +
>  tools/hotplug/Linux/colo-proxy-setup | 131 +++++++++++++++

There are hardcoded paths in this script. Please avoid that.

For one Debian has iptables under /sbin, not /usr/local/sbin.

Wei.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code
  2015-06-12 14:23   ` Wei Liu
@ 2015-06-12 14:51     ` Ian Jackson
  2015-06-15  2:10       ` Yang Hongyang
  2015-06-15  1:55     ` Yang Hongyang
  1 sibling, 1 reply; 50+ messages in thread
From: Ian Jackson @ 2015-06-12 14:51 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, Yang Hongyang

Wei Liu writes ("Re: [Xen-devel] [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code"):
> On Mon, Jun 08, 2015 at 11:45:46AM +0800, Yang Hongyang wrote:
> > From: Wen Congyang <wency@cn.fujitsu.com>
> > +    crcs->status = LIBXL_COLO_RESUMED;
> > +
> > +    /* avoid calling libxl__xc_domain_restore_done() more than once */
> > +    if (crs->saved_cb) {
> > +        dcs->callback = crs->saved_cb;
> > +        crs->saved_cb = NULL;
> 
> I have a feeling that this trick should be avoided. But I'm not an
> expert on this so I will defer judgement to Ian J.

Yes, this trick should be avoided.  It will make the resulting
control flow very confusing.

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code
  2015-06-12 14:23   ` Wei Liu
  2015-06-12 14:51     ` Ian Jackson
@ 2015-06-15  1:55     ` Yang Hongyang
  2015-06-16 11:42       ` Ian Jackson
  1 sibling, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-15  1:55 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson



On 06/12/2015 10:23 PM, Wei Liu wrote:
> On Mon, Jun 08, 2015 at 11:45:46AM +0800, Yang Hongyang wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Secondary vm is running in colo mode. So we will do
>> the following things again and again:
>> 1. Resume secondary vm
>>     a. Send LIBXL_COLO_SVM_READY to master.
>>     b. If it is not the first resume, call libxl__checkpoint_devices_preresume().
>>     c. If it is the first resume(resume right after live migration),
>>        - call libxl__xc_domain_restore_done() to build the secondary vm.
>>        - enable secondary vm's logdirty.
>>        - call libxl__domain_resume() to resume secondary vm.
>>        - call libxl__checkpoint_devices_setup() to setup checkpoint devices.
>>     d. Send LIBXL_COLO_SVM_RESUMED to master.
>> 2. Wait a new checkpoint
>>     a. Call libxl__checkpoint_devices_commit().
>>     b. Read LIBXL_COLO_NEW_CHECKPOINT from master.
>> 3. Suspend secondary vm
>>     a. Suspend secondary vm.
>>     b. Call libxl__checkpoint_devices_postsuspend().
>>     c. Get secondary vm's dirty page information.
>>     d. Send LIBXL_COLO_SVM_SUSPENDED to master.
>>     e. Send secondary vm's dirty page information to master(count + pfn list).
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> ---
> [...]
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU Lesser General Public License for more details.
>> + */
>> +
>> +#ifndef LIBXL_COLO_H
>> +#define LIBXL_COLO_H
>> +
>> +/*
>> + * values to control suspend/resume primary vm and secondary vm
>> + * at the same time
>> + */
>> +enum {
>> +    LIBXL_COLO_NEW_CHECKPOINT = 1,
>> +    LIBXL_COLO_SVM_SUSPENDED,
>> +    LIBXL_COLO_SVM_READY,
>> +    LIBXL_COLO_SVM_RESUMED,
>> +};
>> +
>
> Any reason to not have this in IDL?

No, will move it to IDL in the next version.

>
>> +extern void libxl__colo_restore_done(libxl__egc *egc, void *dcs_void,
>> +                                     int ret, int retval, int errnoval);
>> +extern void libxl__colo_restore_setup(libxl__egc *egc,
>> +                                      libxl__colo_restore_state *crs);
>> +extern void libxl__colo_restore_teardown(libxl__egc *egc,
>> +                                         libxl__colo_restore_state *crs,
>> +                                         int rc);
>> +
>> +#endif
>> diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
>> new file mode 100644
>> index 0000000..6c39758
>> --- /dev/null
>> +++ b/tools/libxl/libxl_colo_restore.c
>> @@ -0,0 +1,1158 @@
>> +/*
>> + * Copyright (C) 2014 FUJITSU LIMITED
>> + * Author: Wen Congyang <wency@cn.fujitsu.com>
>> + *         Yang Hongyang <yanghy@cn.fujitsu.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU Lesser General Public License as published
>> + * by the Free Software Foundation; version 2.1 only. with the special
>> + * exception on linking described in file LICENSE.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU Lesser General Public License for more details.
>> + */
>> +
>> +#include "libxl_osdeps.h" /* must come before any other headers */
>> +
>> +#include "libxl_internal.h"
>> +#include "libxl_colo.h"
>> +#include "xc_bitops.h"
>> +
>> +#define XC_PAGE_SHIFT           12
>> +#define PAGE_SHIFT              XC_PAGE_SHIFT
>
> I don't think you need these.
>
>> +#define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
>> +#define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)
>
> And you can use XC_PAGE_SHIFT directly in above macro.

Okay, thanks.

>
>> +
>> +enum {
>> +    LIBXL_COLO_SETUPED,
>> +    LIBXL_COLO_SUSPENDED,
>> +    LIBXL_COLO_RESUMED,
>> +};
>> +
>
> Move it to IDL as well?

Ok.

>
>> +typedef struct libxl__colo_restore_checkpoint_state libxl__colo_restore_checkpoint_state;
>> +struct libxl__colo_restore_checkpoint_state {
>> +    xc_hypercall_buffer_t _dirty_bitmap;
>> +    xc_hypercall_buffer_t *dirty_bitmap;
>
> This one looks like layer violation to me. I don't have other good
> suggestion on how to do this though. Maybe Ian and Ian have better idea.

We are talking about moving this operation to libxc layer, what's your opinion?
Please refer to the 4th COLOPre patch.

>
>> +    unsigned long p2m_size;
>> +    libxl__domain_suspend_state dsps;
>> +    libxl__datacopier_state dc;
>> +    uint8_t section;
>
> This could use a better name like "stage" / "state"?

stage should be better, thank you.

>
>> +    libxl__logdirty_switch lds;
>> +    libxl__colo_restore_state *crs;
>> +    int status;
>> +    bool preresume;
>> +    /* used for teardown */
>> +    int teardown_devices;
>> +    int saved_rc;
>> +
>> +    void (*callback)(libxl__egc *,
>> +                     libxl__colo_restore_checkpoint_state *,
>> +                     int);
>> +
>> +    /*
>> +     * 0: secondary vm's dirty bitmap for domain @domid
>> +     * 1: secondary vm is ready(domain @domid)
>> +     * 2: secondary vm is resumed(domain @domid)
>> +     * 3. new checkpoint is triggered(domain @domid)
>> +     */
>> +    const char *copywhat[4];
>> +};
>> +
>> +
>> +static void libxl__colo_restore_domain_resume_callback(void *data);
>> +static void libxl__colo_restore_domain_checkpoint_callback(void *data);
>> +static void libxl__colo_restore_domain_suspend_callback(void *data);
>> +
>> +static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
>> +    NULL,
>> +};
>> +
> [...]
>> +    crcs->status = LIBXL_COLO_RESUMED;
>> +
>> +    /* avoid calling libxl__xc_domain_restore_done() more than once */
>> +    if (crs->saved_cb) {
>> +        dcs->callback = crs->saved_cb;
>> +        crs->saved_cb = NULL;
>> +
>
> I have a feeling that this trick should be avoided. But I'm not an
> expert on this so I will defer judgement to Ian J.
>
>> +        lds->callback = colo_enable_logdirty_done;
>> +        colo_enable_logdirty(crs, egc);
>> +        return;
>> +    }
>> +
>> +    colo_write_svm_resumed(egc, crcs);
>> +    return;
>> +
>> +out:
>> +    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
>> +}
>> +
> [...]
>>
>> +_hidden void logdirty_init(libxl__logdirty_switch *lds);
>> +
>
> This function should be in libxl__ namespace.

OK, thanks.

>
> Other than these cosmetic issues I don't really have the expertise to
> comment further.
>
> Wei.
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code
  2015-06-12 14:51     ` Ian Jackson
@ 2015-06-15  2:10       ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-15  2:10 UTC (permalink / raw)
  To: Ian Jackson, Wei Liu
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram

Hi Ian J, Wei,

On 06/12/2015 10:51 PM, Ian Jackson wrote:
> Wei Liu writes ("Re: [Xen-devel] [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code"):
>> On Mon, Jun 08, 2015 at 11:45:46AM +0800, Yang Hongyang wrote:
>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>> +    crcs->status = LIBXL_COLO_RESUMED;
>>> +
>>> +    /* avoid calling libxl__xc_domain_restore_done() more than once */
>>> +    if (crs->saved_cb) {
>>> +        dcs->callback = crs->saved_cb;
>>> +        crs->saved_cb = NULL;
>>
>> I have a feeling that this trick should be avoided. But I'm not an
>> expert on this so I will defer judgement to Ian J.
>
> Yes, this trick should be avoided.  It will make the resulting
> control flow very confusing.

I agree that this part is a bit of tricky. I will try to find another
way to do this. Maybe add another state variable to indicate what stage
we are in, the first boot or under checkpoint.

>
> Ian.
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind
  2015-06-12 14:35   ` Wei Liu
@ 2015-06-15  2:13     ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-15  2:13 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson



On 06/12/2015 10:35 PM, Wei Liu wrote:
> On Mon, Jun 08, 2015 at 11:45:56AM +0800, Yang Hongyang wrote:
>> implement COLO nic subkind.
>>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>>   tools/hotplug/Linux/Makefile         |   1 +
>>   tools/hotplug/Linux/colo-proxy-setup | 131 +++++++++++++++
>
> There are hardcoded paths in this script. Please avoid that.
>
> For one Debian has iptables under /sbin, not /usr/local/sbin.

We are using a modified iptables here. But hardcode is not a good thing,
will avoid this in the next version.

>
> Wei.
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 01/15] docs: add colo readme
  2015-06-08  3:45 ` [PATCH v6 COLO 01/15] docs: add colo readme Yang Hongyang
@ 2015-06-16 10:56   ` Ian Campbell
  2015-06-24  9:13     ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 10:56 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> add colo readme, refer to
> http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
> 
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>

This is fine as far as it goes but I wonder if perhaps
docs/README.{remus,colo} ought to be moved into docs/misc, perhaps
converted to markdown (which should be trivial) and perhaps merged into
a single document about checkpointing?

The reason for the move is twofold, first it is a bit a typical for docs
to live in the top-level docs dir and secondly moving it into misc will
cause it to appear automatically at
http://xenbits.xen.org/docs/unstable/ etc.

Ian.
> ---
>  docs/README.colo | 9 +++++++++
>  1 file changed, 9 insertions(+)
>  create mode 100644 docs/README.colo
> 
> diff --git a/docs/README.colo b/docs/README.colo
> new file mode 100644
> index 0000000..466eb72
> --- /dev/null
> +++ b/docs/README.colo
> @@ -0,0 +1,9 @@
> +COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop Service)
> +project is a high availability solution. Both primary VM (PVM) and secondary VM
> +(SVM) run in parallel. They receive the same request from client, and generate
> +response in parallel too. If the response packets from PVM and SVM are
> +identical, they are released immediately. Otherwise, a VM checkpoint (on demand)
> +is conducted.
> +
> +See the website at http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
> +for details.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 03/15] primary vm suspend/get_dirty_pfn/resume/checkpoint code
  2015-06-08  3:45 ` [PATCH v6 COLO 03/15] primary vm suspend/get_dirty_pfn/resume/checkpoint code Yang Hongyang
@ 2015-06-16 11:05   ` Ian Campbell
  0 siblings, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:05 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index 86bcf9c..d5902a6 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -75,6 +75,18 @@ struct save_callbacks {
>       */
>      int (*toolstack_save)(uint32_t domid, uint8_t **buf, uint32_t *len, void *data);
>  
> +    /* Called after the guest is suspended.
> +     *
> +     * returns the list of dirty pfn:
> +     *  struct {
> +     *      uint64_t count;
> +     *      uint64_t pfn[];
> +     *  };

Seeing this comment and then a callback which returns a uint8_t* makes
me suspicious. Can we not do something a bit more typesafe here, like
returning a pointer to a suitable struct?

> +     *
> +     *  Note: the caller must free the return value.
> +     */
> +    uint8_t *(*get_dirty_pfn)(void *data);
> +
>      /* to be provided as the last argument to each callback function */
>      void* data;
>  };
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 10d3d82..1145ae4 100644
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -17,6 +17,7 @@
>  #include "libxl_osdeps.h"
>  
>  #include "libxl_internal.h"
> +#include "libxl_colo.h"
>  
>  #define PAGE_TO_MEMKB(pages) ((pages) * 4)
>  #define BACKEND_STRING_SIZE 5
> @@ -841,7 +842,10 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
>      assert(info);
>  
>      /* Point of no return */
> -    libxl__remus_setup(egc, &dss->rs);
> +    if (libxl_defbool_val(info->colo))

libxl code must arrange to have called libxl_defbool_setdefault before
using libxl_defbool_val, which I don't see here. There is a big block of
such settings near the top of this function which you should add to.

On the other hand -- is it possible for a caller to say they don't care
what kind of check pointing they want and have libxl decide? If not then
it doesn't make sense to use a defbool, a regular bool would be
appropriate.

I'm also wondering to what extent COLO could be considered an extension
to Remus, as opposed to an alternative -- iow I'm unsure if reusing
libxl_domain_remus_start as the API makes sense (the implementation
could still be shared where appropriate).

> diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
> index 91df275..26a2563 100644
> --- a/tools/libxl/libxl_colo.h
> +++ b/tools/libxl/libxl_colo.h
> @@ -35,4 +35,14 @@ extern void libxl__colo_restore_teardown(libxl__egc *egc,
>                                           libxl__colo_restore_state *crs,
>                                           int rc);
>  
> +extern void libxl__colo_save_domain_suspend_callback(void *data);
> +extern void libxl__colo_save_domain_resume_callback(void *data);
> +extern void libxl__colo_save_domain_checkpoint_callback(void *data);
> +extern void libxl__colo_save_get_dirty_pfn_callback(void *data);
> +extern void libxl__colo_save_setup(libxl__egc *egc,
> +                                   libxl__colo_save_state *css);
> +extern void libxl__colo_save_teardown(libxl__egc *egc,
> +                                      libxl__colo_save_state *css,
> +                                      int rc);

Should all be marked _hidden I think?

[...]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm
  2015-06-08  3:45 ` [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm Yang Hongyang
  2015-06-08 12:16   ` Andrew Cooper
@ 2015-06-16 11:13   ` Ian Campbell
  1 sibling, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:13 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> We will call libxl__xc_domain_restore_done() to rebuild secondary vm. But
> we need store mfn and console mfn when rebuilding secondary vm. So make
> restore_results is a function pointers in callbacks struct and struct

"...make restore_results a function pointer in callback struct...".

> {save,restore}_callbacks, and use this callback to send store mfn and
> console mfn to xl.

Since those are currently returned by some other means should we
deprecate/remove that path too, especially since this new call back is
mandatory.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 07/15] implement the cmdline for COLO
  2015-06-08  3:45 ` [PATCH v6 COLO 07/15] implement the cmdline for COLO Yang Hongyang
@ 2015-06-16 11:19   ` Ian Campbell
  2015-06-25  4:06     ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:19 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Add a new option -c to the command 'xl remus'. If you want
> to use COLO HA instead of Remus HA, please use -c option.
> 
> Update man pages to reflect the addition of a new option to
> 'xl remus' command.
> 
> Also add a new option -c to the internal command 'xl migrate-receive'.

I asked about whether COLO was an extension or a peer to Remus in an
earlier patch. the answer may have an impact here too.

> @@ -498,6 +501,11 @@ Disable network output buffering. Requires enabling unsafe mode.
>  
>  Disable disk replication. Requires enabling unsafe mode.
>  
> +=item B<-c>
> +
> +Enable COLO HA. It is conflict with B<-i> and B<-b>, and memory

"It conflicts with" or "This conflicts with".

> +checkpoint compression must be disabled.
> +
>  =back
>  
>  =item B<pause> I<domain-id>
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 1145ae4..7df2466 100644
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -811,6 +811,22 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
>          goto out;
>      }
>  
> +    /* The caller must set this defbool */
> +    if (libxl_defbool_is_default(info->colo)) {
> +        LOG(ERROR, "colo mode must be enabled/disabled");

As I wondered earlier -- this suggests it should not be a defbool, or
that the interfaces should split.

> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    if (libxl_defbool_val(info->colo)) {
> +        libxl_defbool_setdefault(&info->compression, false);

Assuming this isn't invalidated by the above comments, you should make
the existing:
 libxl_defbool_setdefault(&info->compression, true);
into 
 libxl_defbool_setdefault(&info->compression, libxl_defbool_val(colo));

and then do an error check later.

> +        if (libxl_defbool_val(info->compression)) {
> +            LOG(ERROR, "cannot use memory checkpoint compression in COLO mode");
> +            rc = ERROR_FAIL;
> +            goto out;
> +        }
> +    }
> +
>      libxl_defbool_setdefault(&info->allow_unsafe, false);
>      libxl_defbool_setdefault(&info->blackhole, false);
>      libxl_defbool_setdefault(&info->compression, true);
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index adfadd1..4bbadd3 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -4273,6 +4273,9 @@ static void migrate_receive(int debug, int daemonize, int monitor,
>      dom_info.send_fd = send_fd;
>      dom_info.migration_domname_r = &migration_domname;
>      dom_info.checkpointed_stream = remus;
> +    if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
> +        /* COLO uses stdout to send control message to master */
> +        dom_info.quiet = 1;

Please set a const char * to either "COLO" or "Remus" here and use it
everywhere you've currently got an open coded decision on that.

>  
>      rc = create_domain(&dom_info);
>      if (rc < 0) {
> @@ -4287,7 +4290,8 @@ static void migrate_receive(int debug, int daemonize, int monitor,
>          /* If we are here, it means that the sender (primary) has crashed.
>           * TODO: Split-Brain Check.
>           */
> -        fprintf(stderr, "migration target: Remus Failover for domain %u\n",
> +        fprintf(stderr, "migration target: %s Failover for domain %u\n",
> +                remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
>                  domid);
>  
>          /*
> @@ -4304,15 +4308,21 @@ static void migrate_receive(int debug, int daemonize, int monitor,
>              rc = libxl_domain_rename(ctx, domid, migration_domname,
>                                       common_domname);
>              if (rc)
> -                fprintf(stderr, "migration target (Remus): "
> +                fprintf(stderr, "migration target (%s): "
>                          "Failed to rename domain from %s to %s:%d\n",
> +                        remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
>                          migration_domname, common_domname, rc);
>          }
>  
> +        if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
> +            /* The guest is running after failover in COLO mode */
> +            exit(rc ? -ERROR_FAIL: 0);
> +
>          rc = libxl_domain_unpause(ctx, domid);
>          if (rc)
> -            fprintf(stderr, "migration target (Remus): "
> +            fprintf(stderr, "migration target (%s): "
>                      "Failed to unpause domain %s (id: %u):%d\n",
> +                    remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
>                      common_domname, domid, rc);
>  
>          exit(rc ? -ERROR_FAIL: 0);
> @@ -4458,7 +4468,7 @@ int main_migrate_receive(int argc, char **argv)
>      int debug = 0, daemonize = 1, monitor = 1, remus = 0;
>      int opt;
>  
> -    SWITCH_FOREACH_OPT(opt, "Fedr", NULL, "migrate-receive", 0) {
> +    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
>      case 'F':
>          daemonize = 0;
>          break;
> @@ -4470,8 +4480,10 @@ int main_migrate_receive(int argc, char **argv)
>          debug = 1;
>          break;
>      case 'r':
> -        remus = 1;
> +        remus = LIBXL_CHECKPOINTED_STREAM_REMUS;
>          break;
> +    case 'c':
> +        remus = LIBXL_CHECKPOINTED_STREAM_COLO;
>      }
>  
>      if (argc-optind != 0) {
> @@ -7958,15 +7970,18 @@ int main_remus(int argc, char **argv)
>      pid_t child = -1;
>      uint8_t *config_data;
>      int config_len;
> +    int interval = 0;
>  
>      memset(&r_info, 0, sizeof(libxl_domain_remus_info));
>      /* Defaults */
>      r_info.interval = 200;
>      libxl_defbool_setdefault(&r_info.blackhole, false);
> +    libxl_defbool_setdefault(&r_info.colo, false);
>  
> -    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:e", NULL, "remus", 2) {
> +    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:ec", NULL, "remus", 2) {
>      case 'i':
>          r_info.interval = atoi(optarg);
> +        interval = 1;

This duplication of r_info.interval and interval seems odd. Perhaps
interval is really "interval_was_set" but even so I'm not sure this
would be better achieved with some more refactoring.

>          break;
>      case 'F':
>          libxl_defbool_set(&r_info.allow_unsafe, true);
> @@ -7992,11 +8007,28 @@ int main_remus(int argc, char **argv)
>      case 'e':
>          daemonize = 0;
>          break;
> +    case 'c':
> +        libxl_defbool_set(&r_info.colo, true);
>      }
>  
>      domid = find_domain(argv[optind]);
>      host = argv[optind + 1];
>  
> +    if (libxl_defbool_val(r_info.colo)) {
> +        if (!interval)
> +            r_info.interval = 0;
> +
> +        if (r_info.interval || libxl_defbool_val(r_info.blackhole)) {
> +            perror("option -c is conflict with -i or -b");

"...-c conflicts with..."

> +            exit(-1);
> +        }
> +
> +        if (libxl_defbool_is_default(r_info.compression)) {
> +            perror("option -u must be specified when using COLO");

Then enforce it using setdefault I think instead of making the user jump
through a pointless hoop.

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 08/15] Support colo mode for qemu disk
  2015-06-08  3:45 ` [PATCH v6 COLO 08/15] Support colo mode for qemu disk Yang Hongyang
@ 2015-06-16 11:21   ` Ian Campbell
  0 siblings, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:21 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Usage: disk = ['...,colo,colo-params=xxx,active-disk=xxx,hidden-disk=xxx...']
> The format of colo-params: host:port:exportname=xx

Please expand docs/misc/xl-disk-configuration.txt with what these mean.

I would hope that at least some of these could be removed from the user
facing API (i.e. libxl's API) and either inferred from the use of the
others or from colo being enabled for the domain.

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 09/15] COLO: use qemu block replication
  2015-06-08  3:45 ` [PATCH v6 COLO 09/15] COLO: use qemu block replication Yang Hongyang
@ 2015-06-16 11:22   ` Ian Campbell
  0 siblings, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:22 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> The guest should be paused before doing COLO!!!

I'm not sure what to make of this comment. Please write a commit message
which explains what the commit does and the implication etc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module
  2015-06-08  3:45 ` [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module Yang Hongyang
@ 2015-06-16 11:24   ` Ian Campbell
  2015-06-16 11:26     ` Ian Campbell
  0 siblings, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:24 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> setup/teardown of COLO proxy module.
> we use netlink to communicate with proxy module.

What is a COLO proxy module and where would one get hold of such a
thing?

Is this a new kernel feature with a patch? If so then please link to its
posting to the appropriate upstream and indicate what you understand of
its progress upstream.

(I seem to remember discussing a COLO networking component at the
hackathon which seemed like it could be done using existing components,
is that this?)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module
  2015-06-16 11:24   ` Ian Campbell
@ 2015-06-16 11:26     ` Ian Campbell
  2015-06-25  5:22       ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2015-06-16 11:26 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Tue, 2015-06-16 at 12:24 +0100, Ian Campbell wrote:
> On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> > setup/teardown of COLO proxy module.
> > we use netlink to communicate with proxy module.
> 
> What is a COLO proxy module and where would one get hold of such a
> thing?
> 
> Is this a new kernel feature with a patch? If so then please link to its
> posting to the appropriate upstream and indicate what you understand of
> its progress upstream.
> 
> (I seem to remember discussing a COLO networking component at the
> hackathon which seemed like it could be done using existing components,
> is that this?)

IIRC the existing component I was thinking of was
http://www.netfilter.org/projects/libnetfilter_queue/ which allows
userspace to do pretty advanced filtering, queueing, gating, delaying
etc of packets.

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code
  2015-06-15  1:55     ` Yang Hongyang
@ 2015-06-16 11:42       ` Ian Jackson
  0 siblings, 0 replies; 50+ messages in thread
From: Ian Jackson @ 2015-06-16 11:42 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: Wei Liu, ian.campbell, wency, andrew.cooper3, yunhong.jiang,
	eddie.dong, xen-devel, guijianfeng, rshriram

Yang Hongyang writes ("Re: [Xen-devel] [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code"):
> > On Mon, Jun 08, 2015 at 11:45:46AM +0800, Yang Hongyang wrote:
...
> >> 3. Suspend secondary vm
> >>     a. Suspend secondary vm.
> >>     b. Call libxl__checkpoint_devices_postsuspend().
> >>     c. Get secondary vm's dirty page information.
> >>     d. Send LIBXL_COLO_SVM_SUSPENDED to master.
> >>     e. Send secondary vm's dirty page information to master(count + pfn list).

In the pdf
   http://www.socc2013.org/home/program/a3-dong.pdf?attredirects=0
linked from the wiki page
   http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
it says that the secondary keeps a copy of the original contents of
its dirty pages.  So I don't understand why you need to send the dirty
bitmap to the primary.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 01/15] docs: add colo readme
  2015-06-16 10:56   ` Ian Campbell
@ 2015-06-24  9:13     ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-24  9:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson



On 06/16/2015 06:56 PM, Ian Campbell wrote:
> On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
>> add colo readme, refer to
>> http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
>>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>
> This is fine as far as it goes but I wonder if perhaps
> docs/README.{remus,colo} ought to be moved into docs/misc, perhaps
> converted to markdown (which should be trivial) and perhaps merged into
> a single document about checkpointing?

Agreeed that we can add a checkpointing.txt to docs/misc, and describe
remus/COLO in that file. but can we do this later when COLO feature is
merged? at that time we can do this within one patch.

>
> The reason for the move is twofold, first it is a bit a typical for docs
> to live in the top-level docs dir and secondly moving it into misc will
> cause it to appear automatically at
> http://xenbits.xen.org/docs/unstable/ etc.
>
> Ian.
>> ---
>>   docs/README.colo | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>   create mode 100644 docs/README.colo
>>
>> diff --git a/docs/README.colo b/docs/README.colo
>> new file mode 100644
>> index 0000000..466eb72
>> --- /dev/null
>> +++ b/docs/README.colo
>> @@ -0,0 +1,9 @@
>> +COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop Service)
>> +project is a high availability solution. Both primary VM (PVM) and secondary VM
>> +(SVM) run in parallel. They receive the same request from client, and generate
>> +response in parallel too. If the response packets from PVM and SVM are
>> +identical, they are released immediately. Otherwise, a VM checkpoint (on demand)
>> +is conducted.
>> +
>> +See the website at http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
>> +for details.
>
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 07/15] implement the cmdline for COLO
  2015-06-16 11:19   ` Ian Campbell
@ 2015-06-25  4:06     ` Yang Hongyang
  2015-07-14 15:14       ` Ian Campbell
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-25  4:06 UTC (permalink / raw)
  To: Ian Campbell
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson



On 06/16/2015 07:19 PM, Ian Campbell wrote:
> On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Add a new option -c to the command 'xl remus'. If you want
>> to use COLO HA instead of Remus HA, please use -c option.
>>
>> Update man pages to reflect the addition of a new option to
>> 'xl remus' command.
>>
>> Also add a new option -c to the internal command 'xl migrate-receive'.
>
> I asked about whether COLO was an extension or a peer to Remus in an
> earlier patch. the answer may have an impact here too.

We implemented COLO based on Remus, so we assume it is an extension to Remus.

>
>> @@ -498,6 +501,11 @@ Disable network output buffering. Requires enabling unsafe mode.
>>
>>   Disable disk replication. Requires enabling unsafe mode.
>>
>> +=item B<-c>
>> +
>> +Enable COLO HA. It is conflict with B<-i> and B<-b>, and memory
>
> "It conflicts with" or "This conflicts with".
>
>> +checkpoint compression must be disabled.
>> +
>>   =back
>>
>>   =item B<pause> I<domain-id>
>> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
>> index 1145ae4..7df2466 100644
>> --- a/tools/libxl/libxl.c
>> +++ b/tools/libxl/libxl.c
>> @@ -811,6 +811,22 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
>>           goto out;
>>       }
>>
>> +    /* The caller must set this defbool */
>> +    if (libxl_defbool_is_default(info->colo)) {
>> +        LOG(ERROR, "colo mode must be enabled/disabled");
>
> As I wondered earlier -- this suggests it should not be a defbool, or
> that the interfaces should split.
>
>> +        rc = ERROR_FAIL;
>> +        goto out;
>> +    }
>> +
>> +    if (libxl_defbool_val(info->colo)) {
>> +        libxl_defbool_setdefault(&info->compression, false);
>
> Assuming this isn't invalidated by the above comments, you should make
> the existing:
>   libxl_defbool_setdefault(&info->compression, true);
> into
>   libxl_defbool_setdefault(&info->compression, libxl_defbool_val(colo));
>
> and then do an error check later.
>
>> +        if (libxl_defbool_val(info->compression)) {
>> +            LOG(ERROR, "cannot use memory checkpoint compression in COLO mode");
>> +            rc = ERROR_FAIL;
>> +            goto out;
>> +        }
>> +    }
>> +
>>       libxl_defbool_setdefault(&info->allow_unsafe, false);
>>       libxl_defbool_setdefault(&info->blackhole, false);
>>       libxl_defbool_setdefault(&info->compression, true);
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index adfadd1..4bbadd3 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -4273,6 +4273,9 @@ static void migrate_receive(int debug, int daemonize, int monitor,
>>       dom_info.send_fd = send_fd;
>>       dom_info.migration_domname_r = &migration_domname;
>>       dom_info.checkpointed_stream = remus;
>> +    if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
>> +        /* COLO uses stdout to send control message to master */
>> +        dom_info.quiet = 1;
>
> Please set a const char * to either "COLO" or "Remus" here and use it
> everywhere you've currently got an open coded decision on that.
>
>>
>>       rc = create_domain(&dom_info);
>>       if (rc < 0) {
>> @@ -4287,7 +4290,8 @@ static void migrate_receive(int debug, int daemonize, int monitor,
>>           /* If we are here, it means that the sender (primary) has crashed.
>>            * TODO: Split-Brain Check.
>>            */
>> -        fprintf(stderr, "migration target: Remus Failover for domain %u\n",
>> +        fprintf(stderr, "migration target: %s Failover for domain %u\n",
>> +                remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
>>                   domid);
>>
>>           /*
>> @@ -4304,15 +4308,21 @@ static void migrate_receive(int debug, int daemonize, int monitor,
>>               rc = libxl_domain_rename(ctx, domid, migration_domname,
>>                                        common_domname);
>>               if (rc)
>> -                fprintf(stderr, "migration target (Remus): "
>> +                fprintf(stderr, "migration target (%s): "
>>                           "Failed to rename domain from %s to %s:%d\n",
>> +                        remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
>>                           migration_domname, common_domname, rc);
>>           }
>>
>> +        if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
>> +            /* The guest is running after failover in COLO mode */
>> +            exit(rc ? -ERROR_FAIL: 0);
>> +
>>           rc = libxl_domain_unpause(ctx, domid);
>>           if (rc)
>> -            fprintf(stderr, "migration target (Remus): "
>> +            fprintf(stderr, "migration target (%s): "
>>                       "Failed to unpause domain %s (id: %u):%d\n",
>> +                    remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
>>                       common_domname, domid, rc);
>>
>>           exit(rc ? -ERROR_FAIL: 0);
>> @@ -4458,7 +4468,7 @@ int main_migrate_receive(int argc, char **argv)
>>       int debug = 0, daemonize = 1, monitor = 1, remus = 0;
>>       int opt;
>>
>> -    SWITCH_FOREACH_OPT(opt, "Fedr", NULL, "migrate-receive", 0) {
>> +    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
>>       case 'F':
>>           daemonize = 0;
>>           break;
>> @@ -4470,8 +4480,10 @@ int main_migrate_receive(int argc, char **argv)
>>           debug = 1;
>>           break;
>>       case 'r':
>> -        remus = 1;
>> +        remus = LIBXL_CHECKPOINTED_STREAM_REMUS;
>>           break;
>> +    case 'c':
>> +        remus = LIBXL_CHECKPOINTED_STREAM_COLO;
>>       }
>>
>>       if (argc-optind != 0) {
>> @@ -7958,15 +7970,18 @@ int main_remus(int argc, char **argv)
>>       pid_t child = -1;
>>       uint8_t *config_data;
>>       int config_len;
>> +    int interval = 0;
>>
>>       memset(&r_info, 0, sizeof(libxl_domain_remus_info));
>>       /* Defaults */
>>       r_info.interval = 200;
>>       libxl_defbool_setdefault(&r_info.blackhole, false);
>> +    libxl_defbool_setdefault(&r_info.colo, false);
>>
>> -    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:e", NULL, "remus", 2) {
>> +    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:ec", NULL, "remus", 2) {
>>       case 'i':
>>           r_info.interval = atoi(optarg);
>> +        interval = 1;
>
> This duplication of r_info.interval and interval seems odd. Perhaps
> interval is really "interval_was_set" but even so I'm not sure this
> would be better achieved with some more refactoring.
>
>>           break;
>>       case 'F':
>>           libxl_defbool_set(&r_info.allow_unsafe, true);
>> @@ -7992,11 +8007,28 @@ int main_remus(int argc, char **argv)
>>       case 'e':
>>           daemonize = 0;
>>           break;
>> +    case 'c':
>> +        libxl_defbool_set(&r_info.colo, true);
>>       }
>>
>>       domid = find_domain(argv[optind]);
>>       host = argv[optind + 1];
>>
>> +    if (libxl_defbool_val(r_info.colo)) {
>> +        if (!interval)
>> +            r_info.interval = 0;
>> +
>> +        if (r_info.interval || libxl_defbool_val(r_info.blackhole)) {
>> +            perror("option -c is conflict with -i or -b");
>
> "...-c conflicts with..."
>
>> +            exit(-1);
>> +        }
>> +
>> +        if (libxl_defbool_is_default(r_info.compression)) {
>> +            perror("option -u must be specified when using COLO");
>
> Then enforce it using setdefault I think instead of making the user jump
> through a pointless hoop.
>
> Ian.
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module
  2015-06-16 11:26     ` Ian Campbell
@ 2015-06-25  5:22       ` Yang Hongyang
  2015-06-25  8:39         ` Ian Campbell
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Hongyang @ 2015-06-25  5:22 UTC (permalink / raw)
  To: Ian Campbell
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson



On 06/16/2015 07:26 PM, Ian Campbell wrote:
> On Tue, 2015-06-16 at 12:24 +0100, Ian Campbell wrote:
>> On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
>>> setup/teardown of COLO proxy module.
>>> we use netlink to communicate with proxy module.
>>
>> What is a COLO proxy module and where would one get hold of such a
>> thing?
>>
>> Is this a new kernel feature with a patch? If so then please link to its
>> posting to the appropriate upstream and indicate what you understand of
>> its progress upstream.
>>
>> (I seem to remember discussing a COLO networking component at the
>> hackathon which seemed like it could be done using existing components,
>> is that this?)
>
> IIRC the existing component I was thinking of was
> http://www.netfilter.org/projects/libnetfilter_queue/ which allows
> userspace to do pretty advanced filtering, queueing, gating, delaying
> etc of packets.

The reason we are not using userspace solution is that we worried about
the performance. There will be huge amount of packets pass through, the
context switch cost will be an overhead. The colo-proxy module:
https://lkml.org/lkml/2015/6/18/32

>
> Ian.
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module
  2015-06-25  5:22       ` Yang Hongyang
@ 2015-06-25  8:39         ` Ian Campbell
  2015-06-25  8:48           ` Yang Hongyang
  0 siblings, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2015-06-25  8:39 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Thu, 2015-06-25 at 13:22 +0800, Yang Hongyang wrote:
> 
> On 06/16/2015 07:26 PM, Ian Campbell wrote:
> > On Tue, 2015-06-16 at 12:24 +0100, Ian Campbell wrote:
> >> On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> >>> setup/teardown of COLO proxy module.
> >>> we use netlink to communicate with proxy module.
> >>
> >> What is a COLO proxy module and where would one get hold of such a
> >> thing?
> >>
> >> Is this a new kernel feature with a patch? If so then please link to its
> >> posting to the appropriate upstream and indicate what you understand of
> >> its progress upstream.
> >>
> >> (I seem to remember discussing a COLO networking component at the
> >> hackathon which seemed like it could be done using existing components,
> >> is that this?)
> >
> > IIRC the existing component I was thinking of was
> > http://www.netfilter.org/projects/libnetfilter_queue/ which allows
> > userspace to do pretty advanced filtering, queueing, gating, delaying
> > etc of packets.
> 
> The reason we are not using userspace solution is that we worried about
> the performance.

Is this a theoretical concern or something which has actually been
observed to be a problem in practice?

>  There will be huge amount of packets pass through, the
> context switch cost will be an overhead. The colo-proxy module:
> https://lkml.org/lkml/2015/6/18/32
> 
> >
> > Ian.
> >
> > .
> >
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module
  2015-06-25  8:39         ` Ian Campbell
@ 2015-06-25  8:48           ` Yang Hongyang
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Hongyang @ 2015-06-25  8:48 UTC (permalink / raw)
  To: Ian Campbell
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson



On 06/25/2015 04:39 PM, Ian Campbell wrote:
> On Thu, 2015-06-25 at 13:22 +0800, Yang Hongyang wrote:
>>
>> On 06/16/2015 07:26 PM, Ian Campbell wrote:
>>> On Tue, 2015-06-16 at 12:24 +0100, Ian Campbell wrote:
>>>> On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
>>>>> setup/teardown of COLO proxy module.
>>>>> we use netlink to communicate with proxy module.
>>>>
>>>> What is a COLO proxy module and where would one get hold of such a
>>>> thing?
>>>>
>>>> Is this a new kernel feature with a patch? If so then please link to its
>>>> posting to the appropriate upstream and indicate what you understand of
>>>> its progress upstream.
>>>>
>>>> (I seem to remember discussing a COLO networking component at the
>>>> hackathon which seemed like it could be done using existing components,
>>>> is that this?)
>>>
>>> IIRC the existing component I was thinking of was
>>> http://www.netfilter.org/projects/libnetfilter_queue/ which allows
>>> userspace to do pretty advanced filtering, queueing, gating, delaying
>>> etc of packets.
>>
>> The reason we are not using userspace solution is that we worried about
>> the performance.
>
> Is this a theoretical concern or something which has actually been
> observed to be a problem in practice?

It is a theoretical concern, we haven't had time try to implement the
userspace solution yet.

>
>>   There will be huge amount of packets pass through, the
>> context switch cost will be an overhead. The colo-proxy module:
>> https://lkml.org/lkml/2015/6/18/32
>>
>>>
>>> Ian.
>>>
>>> .
>>>
>>
>
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 COLO 07/15] implement the cmdline for COLO
  2015-06-25  4:06     ` Yang Hongyang
@ 2015-07-14 15:14       ` Ian Campbell
  0 siblings, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2015-07-14 15:14 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: wei.liu2, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, guijianfeng, rshriram, ian.jackson

On Thu, 2015-06-25 at 12:06 +0800, Yang Hongyang wrote:
> 
> On 06/16/2015 07:19 PM, Ian Campbell wrote:
> > On Mon, 2015-06-08 at 11:45 +0800, Yang Hongyang wrote:
> >> From: Wen Congyang <wency@cn.fujitsu.com>
> >>
> >> Add a new option -c to the command 'xl remus'. If you want
> >> to use COLO HA instead of Remus HA, please use -c option.
> >>
> >> Update man pages to reflect the addition of a new option to
> >> 'xl remus' command.
> >>
> >> Also add a new option -c to the internal command 'xl migrate-receive'.
> >
> > I asked about whether COLO was an extension or a peer to Remus in an
> > earlier patch. the answer may have an impact here too.
> 
> We implemented COLO based on Remus, so we assume it is an extension to Remus.

It's not so much a question of implementation (since any two unrelated
features might share some code, or be inspired by one another) but of
how the features relate logically from the users point of view, is COLO
an extension to Remus or is it an independent feature?

i.e. would a user expect the interface to be:
        xl remus <...> # run remus on a domain
        xl colo <...> # run colo on a domain
or
        xl remus <...> # run remus on a domain
        xl remus -c <...> # run remus with colo extensions on a domain

>From this end it seems that although colo builds upon the Remus code and
concepts in many ways to the end user it is actually a logically
separate feature which fills a similar niche to Remus. 

It might be that I've not fully appreciated how the two relate and colo
really is "just" an extension to Remus, I'm not sure.

A similar argument applies further down the stack at the libxl API layer
too (my comment on patch #3).

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2015-07-14 15:14 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-08  3:45 [PATCH v6 COLO 00/15] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 01/15] docs: add colo readme Yang Hongyang
2015-06-16 10:56   ` Ian Campbell
2015-06-24  9:13     ` Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 02/15] secondary vm suspend/resume/checkpoint code Yang Hongyang
2015-06-12 14:23   ` Wei Liu
2015-06-12 14:51     ` Ian Jackson
2015-06-15  2:10       ` Yang Hongyang
2015-06-15  1:55     ` Yang Hongyang
2015-06-16 11:42       ` Ian Jackson
2015-06-08  3:45 ` [PATCH v6 COLO 03/15] primary vm suspend/get_dirty_pfn/resume/checkpoint code Yang Hongyang
2015-06-16 11:05   ` Ian Campbell
2015-06-08  3:45 ` [PATCH v6 COLO 04/15] libxc/restore: support COLO restore Yang Hongyang
2015-06-08 10:39   ` Andrew Cooper
2015-06-08 14:06     ` Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 05/15] send store mfn and console mfn to xl before resuming secondary vm Yang Hongyang
2015-06-08 12:16   ` Andrew Cooper
2015-06-08 14:08     ` Yang Hongyang
2015-06-16 11:13   ` Ian Campbell
2015-06-08  3:45 ` [PATCH v6 COLO 06/15] libxc/save: support COLO save Yang Hongyang
2015-06-08 13:04   ` Andrew Cooper
2015-06-09  3:15     ` Yang Hongyang
2015-06-09  7:20       ` Andrew Cooper
2015-06-09  8:45         ` Yang Hongyang
2015-06-09  8:51           ` Andrew Cooper
2015-06-09  9:09             ` Yang Hongyang
2015-06-09  9:10               ` Andrew Cooper
2015-06-09  9:16                 ` Yang Hongyang
2015-06-09  3:18     ` Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 07/15] implement the cmdline for COLO Yang Hongyang
2015-06-16 11:19   ` Ian Campbell
2015-06-25  4:06     ` Yang Hongyang
2015-07-14 15:14       ` Ian Campbell
2015-06-08  3:45 ` [PATCH v6 COLO 08/15] Support colo mode for qemu disk Yang Hongyang
2015-06-16 11:21   ` Ian Campbell
2015-06-08  3:45 ` [PATCH v6 COLO 09/15] COLO: use qemu block replication Yang Hongyang
2015-06-16 11:22   ` Ian Campbell
2015-06-08  3:45 ` [PATCH v6 COLO 10/15] COLO proxy: implement setup/teardown of COLO proxy module Yang Hongyang
2015-06-16 11:24   ` Ian Campbell
2015-06-16 11:26     ` Ian Campbell
2015-06-25  5:22       ` Yang Hongyang
2015-06-25  8:39         ` Ian Campbell
2015-06-25  8:48           ` Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 11/15] COLO proxy: preresume, postresume and checkpoint Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 12/15] COLO nic: implement COLO nic subkind Yang Hongyang
2015-06-12 14:35   ` Wei Liu
2015-06-15  2:13     ` Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 13/15] setup and control colo proxy on primary side Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 14/15] setup and control colo proxy on secondary side Yang Hongyang
2015-06-08  3:45 ` [PATCH v6 COLO 15/15] cmdline switches and config vars to control colo-proxy Yang Hongyang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).